Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: "Are Decoder-Only Large Language Models the Silver Bullet for Code Search?".
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
GitHub: Georgepitt/DecoderLLMs-CodeSearch
Model Card: CodeGemma-7B - MNTP Pre-trained Model
Model Description
This is a PEFT adapter for the google/codegemma-7b-it model, pre-trained with the Masked Next Token Prediction (MNTP) objective from the llm2vec framework.
Important Note on its Role: This model is not intended for direct downstream task evaluation. Instead, it is the foundational prerequisite for our supervised contrastive (SupCon) fine-tuned models: MNTP pre-training teaches the decoder-only model to learn bidirectional representations, an essential step before supervised contrastive learning is applied.
How to Use
Standalone Use (for Base Embeddings)
Although this adapter mainly serves as a prerequisite for the SupCon models, you can also use it on its own to generate text or code embeddings (the llm2vec package is required):
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7b-It-MNTP"

# Load the base model and attach the MNTP PEFT adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap the model with LLM2Vec to obtain mean-pooled embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
Training Methodology
This model was pre-trained using the MNTP objective as described in the llm2vec paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the Fine-tuning/Fine-tuning_method/MNTP/ directory of our GitHub repository.
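
For intuition, MNTP masks a fraction of the input tokens and trains the model to predict each masked token from the position immediately before it, so that the decoder's existing next-token head can be reused. The snippet below is only a conceptual sketch of that label construction; it is not the training code from our repository or from llm2vec, and the function name and masking ratio are illustrative assumptions.

import torch

def build_mntp_labels(input_ids, mask_token_id, mask_prob=0.2):
    # Illustrative only: mask random tokens and write their original ids as
    # labels at the preceding position, matching next-token prediction.
    labels = torch.full_like(input_ids, -100)  # -100 is ignored by the loss
    candidates = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    candidates[:, 0] = False                   # position 0 has no predecessor
    masked_input = input_ids.clone()
    masked_input[candidates] = mask_token_id
    # The label for a token masked at position j goes to position j - 1.
    prev_positions = candidates.roll(-1, dims=1)
    prev_positions[:, -1] = False
    labels[prev_positions] = input_ids[candidates]
    return masked_input, labels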
Citation
If you use this model, please cite both our paper and the foundational work of llm2vec.
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}