🧑⚖️ "Replacing Judges with Juries" using distilabel
TL;DR
distilabel is a framework to build pipelines for synthetic data generation and AI Feedback (AIF) as a Directed Acyclic Graph (DAG) using LLMs, and it comes with a growing collection of pre-defined tasks. "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" is a recent publication from Cohere that explores the problems of using a single large LLM as a judge for generations, and proposes using a Panel of LLM evaluators (PoLL), the so-called juries, composed of more and smaller LLMs, leading to a judgement of generations that is more diverse, less prone to intra-model bias, and less expensive.
Introduction
"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" is a paper published by Cohere (Path Verga et al.) that explores the problematic around using a single large model like GPT-4 from OpenAI to judge / score either a single LLM generation or a comparison between multiple LLM generations, since they claim it introduces intra-model bias and most of the times using models that large is often unnecessary. So on, they propose what they call a Panel of LLm evaluators (PoLL), the so called "juries", which is a pool of more and smaller LLMs to judge / score the LLM outputs and then use an aggregation or average pooling of those scores instead of the single score provided by the larger LLM, the so called "judge".
Using the proposed PoLL is not only cheaper, but it also mitigates intra-model bias thanks to its composition of disjoint model families; in the paper these are Claude 3 Haiku from Anthropic, GPT-3.5 from OpenAI, and Command R from Cohere; the latter being openly released, while the rest are commercial / proprietary models.
The idea of this post is to reproduce a similar pipeline in which some LLMs (Gemma 1.1 7B Instruct, Llama 3 8B Instruct, Phi 3 Mini (4K) Instruct, and Mistral 7B Instruct v0.2; all open and available on the Hugging Face Hub) are used to generate completions for a given collection of instructions / prompts, and then other LLMs (Claude 3 Haiku, GPT-3.5, and Command R Plus) judge those using the UltraFeedback prompt, so as to finally aggregate the scores and calculate the average score for each generation, using that score to binarize the dataset into a preference dataset based on the PoLL scores instead of solely on the GPT-4 score, as formerly done in UltraFeedback.
What's distilabel?
distilabel is a framework to build pipelines for synthetic data generation and AI Feedback (AIF) by defining a series of steps and connecting them as a Directed Acyclic Graph (DAG), so as to easily combine data processing steps with steps running LLMs for diverse tasks such as text generation, preference rating, etc.
This post covers the implementation assuming `distilabel` v1.0.0 is used, since the previous versions were still experimental.
The basic concepts of distilabel are the following:
- `Step`: a process that receives data in batches as input and produces or alters the received data as output; it is the most basic building block.
- `GeneratorStep`: a `Step` that only generates data, i.e. that doesn't receive any input.
- `GlobalStep`: a `Step` that receives inputs and produces outputs as the default `Step` does, but it's global, meaning that it's blocking and won't be executed until all the batches from the previous steps have been processed.
- `Task`: a special type of `Step` that has a mandatory `llm` arg and handles the processing so that, when called, the input data is prepared and streamed to the `LLM` as inputs, and then the outputs generated by the `LLM` are handled and formatted according to the task.
- `Pipeline`: the main class that orchestrates the execution of all the steps defined as part of it, handling the batching of the data as well as the validation, logging, and any other related logic.

A minimal sketch combining these concepts is shown right after this list.
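The custom `UppercaseInstruction` step and the toy data below are made up purely for illustration, and the sketch assumes that `LoadDataFromDicts` (the `GeneratorStep` mentioned later in the code comments) accepts the rows via a `data` argument:

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, StepInput, step
from distilabel.steps.typing import StepOutput

# A custom `Step` created via the `step` decorator: it receives batches of rows,
# adds a new column, and yields the batch back
@step(inputs=["instruction"], outputs=["instruction_upper"])
def UppercaseInstruction(*inputs: StepInput) -> StepOutput:
    for input in inputs:
        for item in input:
            item["instruction_upper"] = item["instruction"].upper()
        yield input

with Pipeline(name="toy-pipeline") as pipeline:
    # A `GeneratorStep` producing the initial batches from an in-memory list of dicts
    load_data = LoadDataFromDicts(
        name="load_data",
        data=[{"instruction": "What is distilabel?"}],
    )
    uppercase = UppercaseInstruction(name="uppercase")
    # Steps are connected with the `rshift` operator to build the DAG
    load_data >> uppercase

# pipeline.run() would then execute the steps in topological order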
For more details about distilabel, I'd recommend checking distilabel - Documentation, specifically the section dedicated to "Learn".
Installation
To install it you can use pip as follows, which will also install the anthropic, hf-inference-endpoints, and openai extras, required for the Anthropic, Inference Endpoints, and OpenAI integrations, respectively.
`distilabel` will be installed from the `develop` branch since it has some features used within this post, but feel free to pin it to v1.1.0 once it's released. See the GitHub Milestone at https://github.com/argilla-io/distilabel/milestone/8
pip install "distilabel[anthropic,hf-inference-endpoints,openai] @ git+https://github.com/argilla-io/distilabel.git@develop"
Additionally, you will need to set the following environment variables to run the Pipeline below:
- `ANTHROPIC_API_KEY`: the Anthropic API key, required to send requests to the Anthropic models via their API.
- `HF_TOKEN`: the Hugging Face authentication token, required to use the Inference Endpoints and to finally push the generated `distilabel.Distiset` to the Hugging Face Hub.
- `OPENAI_API_KEY`: the OpenAI API key, required to send requests to the OpenAI models via their API.
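Not part of the original pipeline, but as a small sketch, you can fail early from Python if any of them is missing before triggering the (potentially costly) generations:

import os

# Illustrative check: make sure the required keys / tokens are set before running the pipeline
required_env_vars = ("ANTHROPIC_API_KEY", "HF_TOKEN", "OPENAI_API_KEY")
missing = [name for name in required_env_vars if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")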
Code
Building blocks
- `LoadHubDataset`: a `GeneratorStep` that loads a dataset from the Hugging Face Hub and streams it in batches to the follow-up steps. In this case, since the dataset we're using is `HuggingFaceH4/instruction-dataset`, we need to rename the column `prompt` to `instruction`, as that's what the `TextGeneration` task expects as input.
- `TextGeneration`: a `Task` that generates an assistant response for a given `instruction` provided as input, producing the `generation` column in the output. The `TextGeneration` task expects an `LLM` as an arg, and in this case we'll be using:
  - `InferenceEndpointsLLM`: an `LLM` implementation for Hugging Face Inference Endpoints that supports serverless, dedicated, and TGI endpoints. In this case we'll be using the following models for the generations: `google/gemma-1.1-7b-it`, `meta-llama/Meta-Llama-3-8B-Instruct`, `mistralai/Mistral-7B-Instruct-v0.2`, and `microsoft/Phi-3-mini-4k-instruct`.
- `CombineColumns`: since the `TextGeneration` tasks connected to this step run in parallel, their outputs are not combined, so this step receives all the outputs from all the previous steps as input and merges them into a list, so that for each input we'll have a list named `generations` containing each `generation` value from the parallel tasks. This is also useful in order to prepare the data for the next step, `UltraFeedback`, as it expects an `instruction` and a list of `generations` as input.
- `UltraFeedback`: a `Task` that implements the UltraFeedback prompts and post-processing, so as to judge a list of generations for a given instruction. Originally this is done with GPT-4, but in this case we'll use it with smaller LLMs, as mentioned in the introduction, since that's what the paper is about. In this case we'll use the following `LLM` implementations:
  - `InferenceEndpointsLLM`: already mentioned above; in this case it will run `CohereForAI/c4ai-command-r-plus` as opposed to `CohereForAI/c4ai-command-r` as mentioned in the paper, but only because Command R+ has a serverless endpoint available in the Hugging Face Hub.
  - `AnthropicLLM`: an `LLM` implementation for Anthropic's models; in this case we'll be using Claude 3 Haiku, their fastest and most compact model, designed for near-instant responsiveness and seamless AI experiences that mimic human interactions. It hasn't proven to be particularly robust with the current `UltraFeedback` prompts, though, as detailed further on.
  - `OpenAILLM`: an `LLM` implementation for OpenAI's models via their client, which can also be used with other APIs compliant with OpenAI's client; in this case we'll use it for GPT-3.5 (`gpt-3.5-turbo-0125`).
- `CombineColumns`: as mentioned above, a step that expects inputs from more than one step and combines the provided columns into a list under another column name; in this case we'll group `ratings`, `rationales`, and `model_name` so as to calculate the average of the ratings, while keeping the rest for reference.
- `AveragePooling`: a custom `Step` defined via the `step` decorator that expects `poll_ratings` as input and calculates the average of those ratings, putting the average for each generation in a list that matches the length of the generations, in this case four. It also showcases how easy it is to create custom `Step` implementations with `distilabel` via the `step` decorator.
Implementation
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    CombineColumns,
    KeepColumns,
    LoadHubDataset,
    StepInput,
    step,
)
from distilabel.steps.formatting import FormatTextGenerationDPO
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.steps.typing import StepOutput


@step(inputs=["poll_ratings"], outputs=["avg_poll_ratings"])
def AveragePooling(*inputs: StepInput) -> StepOutput:
    """Custom `Step` that calculates the average of the ratings for each generation."""
    for input in inputs:
        for item in input:
            item["avg_poll_ratings"] = [
                sum(col) / len(col) for col in zip(*item["poll_ratings"])
            ]
        yield input
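# As an illustration of the transposition above (not part of the pipeline): with made-up
# ratings from two juries over four generations, poll_ratings = [[5, 3, 4, 2], [4, 4, 4, 1]],
# zip(*poll_ratings) yields (5, 4), (3, 4), (4, 4), (2, 1), so the resulting
# avg_poll_ratings would be [4.5, 3.5, 4.0, 1.5], i.e. one average per generation.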
if __name__ == "__main__":
    # We use the `Pipeline` context manager to ensure all the steps defined inside
    # are included as part of the `pipeline`
    with Pipeline(name="replacing-judges-with-juries") as pipeline:
        # First we load the dataset from the Hugging Face Hub, but for local testing
        # one could just define a dataset as a list of dicts and provide that to `LoadDataFromDicts`
        load_dataset = LoadHubDataset(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            num_examples=10,
            output_mappings={"prompt": "instruction"},
        )

        # We create a `TextGeneration` task running Llama 3 on serverless endpoints
        text_generation_llama3 = TextGeneration(
            name="text_generation_llama3",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )
        # We create a `TextGeneration` task running Gemma 1.1 on serverless endpoints
        text_generation_gemma = TextGeneration(
            name="text_generation_gemma",
            llm=InferenceEndpointsLLM(
                model_id="google/gemma-1.1-7b-it",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )
        # We create a `TextGeneration` task running Phi 3 on serverless endpoints
        text_generation_phi3 = TextGeneration(
            name="text_generation_phi3",
            llm=InferenceEndpointsLLM(
                model_id="microsoft/Phi-3-mini-4k-instruct",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )
        # We create a `TextGeneration` task running Mistral v0.2 on serverless endpoints
        text_generation_mistral = TextGeneration(
            name="text_generation_mistral",
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )

        # Combine the `generation` and `generation_model` columns from the previous steps
        # under a single column name as a list
        combine_generation_columns = CombineColumns(
            name="combine_generation_columns",
            columns=["generation", "generation_model"],
            output_columns=["generations", "generation_models"],
        )

        # We create the UltraFeedback task with the `instruction-following` aspect to evaluate
        # the LLM capabilities on following instructions, running Command R+ on serverless
        # endpoints and GPT-3.5 from OpenAI
        ultrafeedback_cmdr_plus = UltraFeedback(
            name="ultrafeedback_cmdr_plus",
            llm=InferenceEndpointsLLM(
                model_id="CohereForAI/c4ai-command-r-plus",
            ),
            input_batch_size=5,
            aspect="instruction-following",
        )
        ultrafeedback_gpt35 = UltraFeedback(
            name="ultrafeedback_gpt35",
            llm=OpenAILLM(
                model="gpt-3.5-turbo-0125",
            ),
            input_batch_size=5,
            aspect="instruction-following",
        )

        # Then we combine again the generated `ratings` and `rationales` into a single column
        combine_ultrafeedback_columns = CombineColumns(
            name="combine_ultrafeedback_columns",
            columns=["ratings", "rationales", "model_name"],
            output_columns=["poll_ratings", "poll_rationales", "poll_models"],
        )

        # Finally, we call our custom task to calculate the average of the ratings for each generation
        avg_pooling = AveragePooling(name="avg_pooling", input_batch_size=1)

        # Here we define the orchestration of the steps using the `rshift` operator, showing how the
        # different steps are connected to each other in the `Pipeline`
        (
            load_dataset
            >> [text_generation_llama3, text_generation_gemma, text_generation_phi3, text_generation_mistral]
            >> combine_generation_columns
            >> [ultrafeedback_cmdr_plus, ultrafeedback_gpt35]
            >> combine_ultrafeedback_columns
            >> avg_pooling
        )
Finally, once the `Pipeline` has been defined, you can run it as follows, defining some runtime parameters that mainly control the generation behaviour of the LLMs.
distiset = pipeline.run(
    parameters={
        "text_generation_llama3": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"],
                },
            },
        },
        "text_generation_gemma": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["<eos>", "<end_of_turn>"],
                },
            },
        },
        "text_generation_phi3": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["</s>", "<|endoftext|>"],
                },
            },
        },
        "text_generation_mistral": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["</s>"],
                },
            },
        },
        # "ultrafeedback_haiku": {
        #     "llm": {"generation_kwargs": {"temperature": 0.7, "max_tokens": 4096}},
        # },
        "ultrafeedback_cmdr_plus": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 4096,
                    "stop_sequences": ["<EOS_TOKEN>", "<|END_OF_TURN_TOKEN|>"],
                },
            },
        },
        "ultrafeedback_gpt35": {
            "llm": {
                "generation_kwargs": {"temperature": 0.7, "max_new_tokens": 4096}
            },
        },
    }
)
Finally, we can optionally push the generated dataset, i.e. the `distilabel.Distiset`, to the Hugging Face Hub via the `push_to_hub` method, so that each subset generated in the leaf steps is pushed to the Hub. In this case, since there's only one leaf step, only that one will be pushed; but if there were many, then each leaf step would be pushed under a different configuration in the Hub.
distiset.push_to_hub("replacing-judges-with-juries-distilabel")
🤗 Dataset available at alvarobartt/replacing-judges-with-juries-distilabel
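As mentioned in the introduction, the averaged PoLL scores can then be used to binarize the dataset into a preference dataset. The pipeline above stops at the average pooling step, so the snippet below is just a minimal sketch of that binarization over plain row dicts with the columns generated above (`instruction`, `generations`, `generation_models`, and `avg_poll_ratings`); the `rows` variable is an illustrative stand-in for however you load the pushed dataset, and skipping ties is an arbitrary choice.

def binarize(rows):
    """Pick the highest and lowest average-rated generations as chosen / rejected."""
    binarized = []
    for row in rows:
        ratings = row["avg_poll_ratings"]
        best, worst = ratings.index(max(ratings)), ratings.index(min(ratings))
        if best == worst:  # all the generations got the same average rating
            continue
        binarized.append(
            {
                "prompt": row["instruction"],
                "chosen": row["generations"][best],
                "chosen_model": row["generation_models"][best],
                "rejected": row["generations"][worst],
                "rejected_model": row["generation_models"][worst],
            }
        )
    return binarized

# e.g. for avg_poll_ratings = [4.5, 3.5, 4.0, 1.5] the first generation
# would be picked as chosen and the last one as rejected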
Notes (as of May 3rd, 2024)
Note that you can replace the LLMs used above with the ones of your choice; the idea behind this selection was that the ones used for the `TextGeneration` task are available as serverless endpoints within the Hugging Face Hub, and the ones used for `UltraFeedback` are the same as those used in the official paper.
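For instance, bringing back the third jury from the paper would roughly amount to defining a task like the one below inside the `Pipeline` context manager and adding it to the list alongside `ultrafeedback_cmdr_plus` and `ultrafeedback_gpt35`; this is an untested sketch that mirrors the commented-out `ultrafeedback_haiku` runtime parameters above, and it's currently left out for the reasons described in the next note.

from distilabel.llms import AnthropicLLM
from distilabel.steps.tasks import UltraFeedback

# Untested sketch of a third UltraFeedback jury running Claude 3 Haiku
ultrafeedback_haiku = UltraFeedback(
    name="ultrafeedback_haiku",
    llm=AnthropicLLM(model="claude-3-haiku-20240307"),
    input_batch_size=5,
    aspect="instruction-following",
)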
In order to make extensive use of the serverless Inference Endpoints in the Hugging Face Hub, subscribing to PRO is recommended (see pricing), since Inference for PROs will be enabled and you will get improved rate limits for the usage of the free Inference API.
I've encountered issues when using Claude 3 Haiku with the UltraFeedback prompts, since apparently it's not able to generate output compliant with the expected formatting; I'll investigate that, and in the meantime the code for running Claude 3 Haiku with UltraFeedback has been commented out. This could be worked around by simply ignoring the values with `rating=None`, but until further investigation is done, I feel it's better to leave that aside for the moment.
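If you do want to apply that workaround, a `None`-tolerant variant of the `AveragePooling` step could look like the sketch below, which simply drops missing ratings from each column before averaging; it's only an illustration of the idea above and was not used to build the published dataset.

from distilabel.steps import StepInput, step
from distilabel.steps.typing import StepOutput

@step(inputs=["poll_ratings"], outputs=["avg_poll_ratings"])
def NoneTolerantAveragePooling(*inputs: StepInput) -> StepOutput:
    """Variant of `AveragePooling` that ignores `None` ratings (e.g. failed parses)."""
    for input in inputs:
        for item in input:
            avg_ratings = []
            for col in zip(*item["poll_ratings"]):
                valid = [rating for rating in col if rating is not None]
                # If no jury produced a valid rating for this generation, keep `None`
                avg_ratings.append(sum(valid) / len(valid) if valid else None)
            item["avg_poll_ratings"] = avg_ratings
        yield input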
What's next?
We're currently working actively on distilabel v1.1.0, trying to make it as developer-friendly as possible and encouraging everyone in the community to build with distilabel, as well as aiming to bridge the gap on synthetic data generation with open models and on consumer hardware.
