---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- codegemma
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact of our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**. In this work, we conduct a large-scale, systematic evaluation of decoder-only Large Language Models for code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

## Model Card: CodeGemma-7B - MNTP Pre-trained Model

### 📜 Model Description

This is a PEFT adapter for the **`google/codegemma-7b-it`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.

**Important Note on its Role**: This model is **not intended for direct downstream task evaluation**. Instead, it serves as a crucial **foundational prerequisite** for our supervised contrastive (SupCon) fine-tuned models. MNTP pre-training enables the decoder-only model to learn bidirectional representations, an essential step before applying supervised contrastive learning. A sketch of how to load a SupCon adapter on top of this one is given at the end of this card.

### 🚀 How to Use

#### Standalone Use (for Base Embeddings)

You can also use this MNTP model on its own to generate text or code embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7b-It-MNTP"

# Load the base model and attach the MNTP adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id, trust_remote_code=True, config=config,
    torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap the model with llm2vec to obtain mean-pooled embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```

A follow-up example that scores query-code similarity with these embeddings is given at the end of this card.

### ⚙️ Training Methodology

This model was pre-trained with the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please follow the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.

### 📄 Citation

If you use this model, please cite both our paper and the foundational `llm2vec` work.

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}

@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```
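
### 🔎 Example: Scoring Query-Code Similarity

The snippet below continues from the loading code in the "How to Use" section (`l2v` is the `LLM2Vec` wrapper defined there) and illustrates the basic code-search step: ranking a code snippet against a natural-language query by cosine similarity of their embeddings. The query and code strings are illustrative placeholders, not examples from our benchmark.

```python
import torch

# `l2v` is the LLM2Vec wrapper created in the "How to Use" snippet above.
query_emb = l2v.encode(["how do I print hello world in python"])
code_emb = l2v.encode(["def hello_world():\n    print('Hello, World!')"])

# Cosine similarity between the (1, dim) query and code embeddings.
score = torch.nn.functional.cosine_similarity(query_emb, code_emb)
print("Query-code similarity:", score.item())
```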
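
### 🧩 Sketch: Loading a SupCon Adapter on Top of This MNTP Adapter

As noted in the model description, this adapter is a prerequisite for the supervised contrastive (SupCon) models rather than an end product. The usual two-stage pattern from the `llm2vec` recipe is to merge the MNTP adapter into the base weights and then load a second adapter on top. The sketch below assumes a PEFT version that provides `merge_and_unload()`; the SupCon adapter ID is a placeholder, see our GitHub repository for the adapters actually released.

```python
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from peft import PeftModel

base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7b-It-MNTP"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id, trust_remote_code=True, config=config,
    torch_dtype=torch.bfloat16, device_map="auto",
)

# Stage 1: attach the MNTP adapter and fold it into the base weights.
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# Stage 2: load a supervised contrastive (SupCon) adapter on top of the
# merged weights. "your-supcon-adapter-id" is a placeholder; check the
# GitHub repository for the released SupCon adapters.
model = PeftModel.from_pretrained(model, "your-supcon-adapter-id")
```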