---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- codellama
---
## πŸ“– Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
➑️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**
---
## Model Card: CodeLlama-7B-Instruct - MNTP Pre-trained Model
### πŸ“œ Model Description
This is a PEFT adapter for the **`meta-llama/CodeLlama-7b-Instruct-hf`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.
**Important Note on its Role**:
This model is **not intended for direct downstream task evaluation**. Instead, it is the **foundational prerequisite** for our supervised contrastive (SupCon) fine-tuned models: MNTP pre-training enables the decoder-only model to learn bidirectional representations, an essential step before supervised contrastive learning is applied.
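To make this role concrete, here is a minimal sketch of how an MNTP adapter like this one is typically combined with a SupCon adapter in the llm2vec workflow: the MNTP adapter is merged into the base weights first, then the SupCon adapter is loaded on top. The `supcon_adapter_id` below is a placeholder, not a real repository; the actual SupCon checkpoints and loading scripts are available in our GitHub repository.
```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"
supcon_adapter_id = "path/to/your-supcon-adapter"  # placeholder, not a real repo ID

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModel.from_pretrained(base_model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 1) Apply the MNTP adapter and merge it into the base weights.
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# 2) Load the SupCon adapter on top of the MNTP-merged model.
model = PeftModel.from_pretrained(model, supcon_adapter_id)
```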
### πŸš€ How to Use
#### Standalone Use (for Base Embeddings)
Although this adapter primarily serves as the starting point for the SupCon models, it can also be used on its own to generate code or text embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"

# Load the tokenizer and the base decoder-only model.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")

# Apply the MNTP PEFT adapter on top of the base model.
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap with LLM2Vec to obtain mean-pooled sequence embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```
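As a quick sanity check in a code-search setting, you can rank candidate snippets against a natural-language query by cosine similarity. This is a minimal sketch that reuses the `l2v` object from the snippet above; the query and candidate strings are illustrative only.
```python
import torch.nn.functional as F

query = "print a greeting message"
candidates = [
    "def hello_world():\n    print('Hello, World!')",
    "def add(a, b):\n    return a + b",
]

# Encode the query and the candidate snippets, then rank by cosine similarity.
q_emb = l2v.encode([query])
c_emb = l2v.encode(candidates)
scores = F.cosine_similarity(q_emb, c_emb)
print("Similarity scores:", scores.tolist())
```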
### βš™οΈ Training Methodology
This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.
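For intuition: MNTP masks a fraction of the input tokens and, because a decoder-only model predicts the *next* token, the prediction for a token masked at position `i` is taken from the model's output at position `i-1` (with bidirectional attention enabled). The snippet below is a simplified, conceptual sketch of that label construction, not the actual training code; `mntp_labels` and its arguments are illustrative.
```python
import torch

def mntp_labels(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.2):
    """Conceptual sketch: mask tokens and align labels so the output at position i-1
    is scored against the token masked at position i (no further shifting in the loss)."""
    labels = torch.full_like(input_ids, -100)      # -100 positions are ignored by the loss
    is_masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    is_masked[:, 0] = False                        # position 0 has no preceding position
    # Shift targets left by one: logits at position i-1 predict the token masked at i.
    labels[:, :-1] = torch.where(is_masked[:, 1:], input_ids[:, 1:], labels[:, :-1])
    masked_inputs = input_ids.clone()
    masked_inputs[is_masked] = mask_token_id       # the model sees the mask token at masked spots
    return masked_inputs, labels
```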
### πŸ“„ Citation
If you use this model, please cite both our paper and the foundational work of `llm2vec`.
```bibtex
@article{chen2024decoder,
title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
journal={arXiv preprint arXiv:2410.22240},
year={2024}
}
@article{behnamghader2024llm2vec,
title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
journal={arXiv preprint arXiv:2404.05961},
year={2024}
}
```