---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- codellama
---

## Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

**[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

## Model Card: CodeLlama-7B-Instruct - MNTP Pre-trained Model

### Model Description

This is a PEFT adapter for the **`meta-llama/CodeLlama-7b-Instruct-hf`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.

**Important Note on its Role**:
This model is **not intended for direct downstream task evaluation**. Instead, it serves as the **foundational prerequisite** for our supervised fine-tuned (SupCon) models. The MNTP pre-training enables the decoder-only model to learn bidirectional representations, an essential step before supervised contrastive learning is applied.
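
To make the note above concrete, the sketch below shows how this adapter typically slots into that pipeline, following the usual `llm2vec` recipe of merging the MNTP weights into the base model before attaching a supervised adapter. This is a minimal sketch, not our exact training or evaluation code; in particular, `supcon_model_id` is a placeholder for the corresponding SupCon adapter released with the paper (see the GitHub repository).

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"
supcon_model_id = "<SupCon-adapter-id>"  # placeholder: see the GitHub repository

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModel.from_pretrained(base_model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Step 1: apply the MNTP adapter (this repository) and merge it into the base weights.
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# Step 2: load the supervised-contrastive (SupCon) adapter on top of the merged model.
model = PeftModel.from_pretrained(model, supcon_model_id)
```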

### How to Use

#### Standalone Use (for Base Embeddings)

You can also use this MNTP model by itself to generate text or code embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"

# Load the tokenizer and the base model, then attach the MNTP adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap the model with llm2vec to obtain mean-pooled embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```
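
Since these embeddings are ultimately intended for code search, a typical follow-up is to embed a natural-language query and several code candidates, then rank the candidates by cosine similarity. The snippet below is an illustrative continuation of the example above (it reuses the `l2v` object) and assumes `encode` returns a 2-D tensor of shape `(batch_size, hidden_dim)`, as in the `llm2vec` examples.

```python
import torch

query = "print a greeting message"
candidates = [
    "def hello_world():\n    print('Hello, World!')",
    "def add(a, b):\n    return a + b",
]

# Embed the query and the candidate code snippets with the l2v object defined above.
q_rep = l2v.encode([query])       # (1, hidden_dim)
c_reps = l2v.encode(candidates)   # (len(candidates), hidden_dim)

# Rank candidates by cosine similarity to the query.
q_norm = torch.nn.functional.normalize(q_rep, p=2, dim=1)
c_norm = torch.nn.functional.normalize(c_reps, p=2, dim=1)
scores = (q_norm @ c_norm.T).squeeze(0)

best = int(scores.argmax())
print(f"Best match (score={scores[best].item():.3f}):\n{candidates[best]}")
```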

### Training Methodology

This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.
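
For intuition, here is a minimal sketch of the MNTP labeling scheme as we read it in the `llm2vec` paper: a fraction of tokens is masked, and the loss for each masked token is computed at the position immediately before it, so that with bidirectional attention enabled the model can use context from both sides of the mask. The `mntp_mask_and_labels` helper is hypothetical and illustrative only, not the actual training code; the real configuration (mask rate, mask token choice, etc.) lives in the GitHub directory above.

```python
import torch

def mntp_mask_and_labels(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.2):
    """Hypothetical helper: mask a fraction of tokens (MNTP) and build labels so that
    position i-1 is supervised to predict the original token masked at position i."""
    labels = torch.full_like(input_ids, -100)        # -100 is ignored by CrossEntropyLoss
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    mask[:, 0] = False                               # the first position has no predecessor
    # Supervise position i-1 with the original token that was masked at position i.
    labels[:, :-1][mask[:, 1:]] = input_ids[:, 1:][mask[:, 1:]]
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id              # decoder tokenizers lack [MASK]; an ordinary token id is reused
    return masked_inputs, labels

# Toy usage: a batch of one sequence of token ids.
ids = torch.tensor([[5, 17, 42, 99, 7, 23]])
masked_inputs, labels = mntp_mask_and_labels(ids, mask_token_id=0, mask_prob=0.5)
```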

### Citation

If you use this model, please cite both our paper and the foundational work of `llm2vec`.

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}

@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```