---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- codellama
---
## πŸ“– Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
➑️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**
---
## Model Card: CodeLlama-7B-Instruct - MNTP Pre-trained Model
### πŸ“œ Model Description
This is a PEFT adapter for the **`meta-llama/CodeLlama-7b-Instruct-hf`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.
**Important Note on its Role**:
This model is **not intended for direct downstream task evaluation**. Instead, it is the **foundational prerequisite** for our supervised contrastive (SupCon) fine-tuned models: MNTP pre-training enables the decoder-only model to learn bidirectional representations, an essential step before supervised contrastive learning is applied.
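To make this role concrete, here is a minimal sketch of how an MNTP adapter like this one is typically combined with a SupCon adapter in the llm2vec workflow: the MNTP adapter is merged into the base weights first, then the SupCon adapter is loaded on top. The `supcon_adapter_id` below is a placeholder, not a real repository; the actual SupCon checkpoints and loading scripts are available in our GitHub repository.
```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"
supcon_adapter_id = "path/to/your-supcon-adapter"  # placeholder, not a real repo ID

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModel.from_pretrained(base_model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 1) Apply the MNTP adapter and merge it into the base weights.
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# 2) Load the SupCon adapter on top of the MNTP-merged model.
model = PeftModel.from_pretrained(model, supcon_adapter_id)
```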
### πŸš€ How to Use
#### Standalone Use (for Base Embeddings)
Although this adapter primarily serves as the starting point for the SupCon models, it can also be used on its own to generate code or text embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"

# Load the tokenizer and the base decoder-only model.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")

# Apply the MNTP PEFT adapter on top of the base model.
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap with LLM2Vec to obtain mean-pooled sequence embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```
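As a quick sanity check in a code-search setting, you can rank candidate snippets against a natural-language query by cosine similarity. This is a minimal sketch that reuses the `l2v` object from the snippet above; the query and candidate strings are illustrative only.
```python
import torch.nn.functional as F

query = "print a greeting message"
candidates = [
    "def hello_world():\n    print('Hello, World!')",
    "def add(a, b):\n    return a + b",
]

# Encode the query and the candidate snippets, then rank by cosine similarity.
q_emb = l2v.encode([query])
c_emb = l2v.encode(candidates)
scores = F.cosine_similarity(q_emb, c_emb)
print("Similarity scores:", scores.tolist())
```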
### βš™οΈ Training Methodology
This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.
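For intuition: MNTP masks a fraction of the input tokens and, because a decoder-only model predicts the *next* token, the prediction for a token masked at position `i` is taken from the model's output at position `i-1` (with bidirectional attention enabled). The snippet below is a simplified, conceptual sketch of that label construction, not the actual training code; `mntp_labels` and its arguments are illustrative.
```python
import torch

def mntp_labels(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.2):
    """Conceptual sketch: mask tokens and align labels so the output at position i-1
    is scored against the token masked at position i (no further shifting in the loss)."""
    labels = torch.full_like(input_ids, -100)      # -100 positions are ignored by the loss
    is_masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    is_masked[:, 0] = False                        # position 0 has no preceding position
    # Shift targets left by one: logits at position i-1 predict the token masked at i.
    labels[:, :-1] = torch.where(is_masked[:, 1:], input_ids[:, 1:], labels[:, :-1])
    masked_inputs = input_ids.clone()
    masked_inputs[is_masked] = mask_token_id       # the model sees the mask token at masked spots
    return masked_inputs, labels
```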
### πŸ“„ Citation
If you use this model, please cite both our paper and the foundational work of `llm2vec`.
```bibtex
@article{chen2024decoder,
title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
journal={arXiv preprint arXiv:2410.22240},
year={2024}
}
@article{behnamghader2024llm2vec,
title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
journal={arXiv preprint arXiv:2404.05961},
year={2024}
}
```