Annotate text data using Active Learning with Cleanlab
Authored by: Aravind Putrevu
In this notebook, I showcase how to use active learning to improve a fine-tuned Hugging Face Transformer for text classification, while keeping the total number of labels collected from human annotators low. When resource constraints make it infeasible to obtain labels for an entire dataset, active learning aims to save both time and money by selecting which examples data annotators should spend their effort labeling.
What is Active Learning?
Active learning helps prioritize which data to label in order to maximize the performance of a supervised model trained on the labeled data. This process usually happens iteratively: in each round, active learning tells us which examples we should collect additional annotations for to maximally improve our current model under a limited labeling budget. ActiveLab is an active learning algorithm that is particularly useful when the labels coming from human annotators are noisy, and when it is better to collect one more annotation for a previously annotated example whose label looks suspect than to annotate an entirely new example. After collecting these new annotations to expand our training dataset, we retrain the model and evaluate its test accuracy.
In this notebook, I consider a binary text classification task: predicting whether a particular phrase is polite or impolite.
Active learning with ActiveLab works much better than random selection when collecting additional annotations for a Transformer model. It consistently produces much better models, with roughly a 50% reduction in error rate, regardless of the total labeling budget.
The rest of this notebook walks through the open-source code you can use to achieve these results.
Setting up the environment
!pip install datasets==2.20.0 transformers==4.25.1 scikit-learn==1.1.2 matplotlib==3.5.3 cleanlab
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np
import random
import transformers
import datasets
import matplotlib.pyplot as plt
from cleanlab.multiannotator import get_majority_vote_label, get_active_learning_scores, get_label_quality_multiannotator
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from scipy.special import softmax
from datetime import datetime
Collecting and organizing data
Here we download the data needed for this notebook.
labeled_data_file = {"labeled": "X_labeled_full.csv"}
unlabeled_data_file = {"unlabeled": "X_unlabeled.csv"}
test_data_file = {"test": "test.csv"}
X_labeled_full = load_dataset("Cleanlab/stanford-politeness", split="labeled", data_files=labeled_data_file)
X_unlabeled = load_dataset("Cleanlab/stanford-politeness", split="unlabeled", data_files=unlabeled_data_file)
test = load_dataset("Cleanlab/stanford-politeness", split="test", data_files=test_data_file)
!wget -nc -O 'extra_annotations.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/extra_annotations.npy?download=true'
extra_annotations = np.load("extra_annotations.npy", allow_pickle=True).item()
X_labeled_full = X_labeled_full.to_pandas()
X_labeled_full.set_index('id', inplace=True)
X_unlabeled = X_unlabeled.to_pandas()
X_unlabeled.set_index('id', inplace=True)
test = test.to_pandas()
Text politeness classification
We use the Stanford Politeness Corpus as our dataset.
The dataset is structured as a binary text classification task, where the goal is to classify each phrase as polite or impolite. Human annotators are shown a selected text phrase and asked to annotate its politeness (0 for impolite, 1 for polite).
We train a Transformer classifier on the annotated data and measure model accuracy on a held-out set of test examples. Since each test label is derived from the consensus of five annotators, I am highly confident in the ground-truth labels of this set.
As for the training data, we have the following:
- X_labeled_full: our initial training set, with just 100 text examples and 2 annotations per example.
- X_unlabeled: a large pool of 1,900 unlabeled text examples that we can consider having annotators label.
- extra_annotations: a pool of additional annotations we draw from whenever an annotation is requested for an example (an illustrative sketch of its format follows this list).
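Since the helper methods later in this notebook index this pool as extra_annotations[example_id][annotator_id], a rough, hand-written illustration of its format (with purely hypothetical ids and labels) looks like this:
# Purely hypothetical ids and values, for illustration only; the real dictionary is loaded
# from extra_annotations.npy below. Each key is an example id, and each value maps the
# annotators who can still label that example to the label (0 or 1) they would give.
illustrative_extra_annotations = {
    1234: {"A0": 1, "A7": 0},  # two annotations still available for example 1234
    5678: {"A3": 1},           # one annotation still available for example 5678
}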
Visualize the data
# Multi-annotated Data
X_labeled_full.head()
# Unlabeled Data
X_unlabeled.head()
# extra_annotations contains the annotations that we will use when an additional annotation is requested.
extra_annotations
# Random sample of extra_annotations to see format.
{k: extra_annotations[k] for k in random.sample(list(extra_annotations.keys()), 5)}
View some examples from the test set
>>> num_to_label = {0:'Impolite', 1:"Polite"}
>>> for i in range(2):
...     print(f"{num_to_label[i]} examples:")
...     subset=test[test.label==i][['text']].sample(n=3, random_state=2)
...     print(subset)
Impolite examples:
| | text |
|---|---|
| 120 | And the same for wasting our time. I can only repeat: why don't you do something constructive by adding content about your beloved Macedonia? |
| 150 | Instead of telling me how wrong I was to close some afds, maybe you should spend your time helping with the current afd backlog <url>. If my decisions were so wrong, why haven't you reopened them? |
| 326 | Per the CFD, this should have been moved to <url>. Why wasn't it moved? |

Polite examples:

| | text |
|---|---|
| 498 | Hello, I have raised the possibility of unprotecting the tamazepam page <url>. What are your thoughts? |
| 132 | The alignment of the page has changed due to some edits. Could you help? |
| 131 | I'm glad you're happy with the overall look. Before I go and label all the streets, is the text size, font style, etc. okay? |
Helper methods
The following section contains all the helper methods needed for this notebook.
get_idx_to_label is designed for active learning scenarios, particularly when working with a mix of labeled and unlabeled data. Its main goal is to determine which examples (from both the labeled and the unlabeled datasets) should be selected for additional annotation, based on their active learning scores.
# Helper method to get indices of examples with the lowest active learning score to collect more labels for.
def get_idx_to_label(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label,
    active_learning_scores,
    active_learning_scores_unlabeled=None,
):
    if active_learning_scores_unlabeled is None:
        active_learning_scores_unlabeled = np.array([])
    to_label_idx = []
    to_label_idx_unlabeled = []
    num_labeled = len(active_learning_scores)
    active_learning_scores_combined = np.concatenate((active_learning_scores, active_learning_scores_unlabeled))
    to_label_idx_combined = np.argsort(active_learning_scores_combined)
    # We want to collect the n=batch_size best examples to collect another annotation for.
    i = 0
    while (len(to_label_idx)+len(to_label_idx_unlabeled)) < batch_size_to_label:
        idx = to_label_idx_combined[i]
        # We know this is an already annotated example.
        if idx < num_labeled:
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
        # We know this is an example that is currently not annotated.
        else:
            # Subtract off offset to get back original index.
            idx -= num_labeled
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
        i+=1
    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled
get_idx_to_label_random is intended for an active learning context in which data points are selected at random rather than based on model uncertainty or learning scores. This approach can serve as a baseline against which more sophisticated active learning strategies are compared, or be used in scenarios where it is unclear how to score examples.
# Helper method to get indices of random examples to collect more labels for.
def get_idx_to_label_random(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label
):
    to_label_idx = []
    to_label_idx_unlabeled = []
    # Generate list of indices for both sets of examples.
    labeled_idx = [(x, 'labeled') for x in range(len(X_labeled_full))]
    unlabeled_idx = []
    if X_unlabeled is not None:
        unlabeled_idx = [(x, 'unlabeled') for x in range(len(X_unlabeled))]
    combined_idx = labeled_idx + unlabeled_idx
    # We want to collect the n=batch_size random examples to collect another annotation for.
    while (len(to_label_idx)+len(to_label_idx_unlabeled)) < batch_size_to_label:
        # Random choice from indices.
        # We time-seed to ensure randomness.
        random.seed(datetime.now().timestamp())
        choice = random.choice(combined_idx)
        idx, which_subset = choice
        # We know this is an already annotated example.
        if which_subset == 'labeled':
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
            combined_idx.remove(choice)
        # We know this is an example that is currently not annotated.
        else:
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
            combined_idx.remove(choice)
    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled
Below are some utility methods that help us compute the standard deviation across runs, select a specific annotator who has previously annotated a given example, and tokenize the text examples.
# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])
    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev
# Helper method to select which annotator we should collect another annotation from.
def choose_existing(annotators, existing_annotators):
    for annotator in annotators:
        # If we find one that has already given an annotation, we return it.
        if annotator in existing_annotators:
            return annotator
    # If we don't find an existing, just return a random one.
    choice = random.choice(list(annotators.keys()))
    return choice
# Helper method for Trainer.
def compute_metrics(p):
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"logits":logits, "pred_probs":pred_probs, "accuracy": accuracy}
# Helper method to tokenize text.
def tokenize_function(examples):
    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# Helper method to tokenize given dataset.
def tokenize_data(data):
    dataset = Dataset.from_dict({"label":data['label'] , "text": data['text'].values})
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.cast_column("label", ClassLabel(names = ["0","1"]))
    return tokenized_dataset
The get_trainer function sets up the training environment for a text classification task using DistilBERT, a lighter and faster distilled version of BERT.
# Helper method to initiate a new Trainer with given train and test sets.
def get_trainer(train_set, test_set):
    # Model params.
    model_name = "distilbert-base-uncased"
    model_folder = "model_training"
    max_training_steps = 300
    num_classes = 2
    # Set training args.
    # We time-seed to ensure randomness between different benchmarking runs.
    training_args = TrainingArguments(
        max_steps=max_training_steps,
        output_dir=model_folder,
        seed = int(datetime.now().timestamp())
    )
    # Tokenize train/test set.
    train_tokenized_dataset = tokenize_data(train_set)
    test_tokenized_dataset = tokenize_data(test_set)
    # Initiate a pre-trained model.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics = compute_metrics,
        train_dataset = train_tokenized_dataset,
        eval_dataset = test_tokenized_dataset,
    )
    return trainer
The get_pred_probs function computes out-of-sample predicted probabilities for a given dataset via cross-validation, with extra handling for the unlabeled data.
# Helper method to manually compute cross-validated predicted probabilities needed for ActiveLab.
def get_pred_probs(X, X_unlabeled):
    """Uses cross-validation to obtain out-of-sample predicted probabilities
    for given dataset"""
    # Generate cross-val splits.
    n_splits = 3
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    skf_splits = [
        [train_index, test_index]
        for train_index, test_index in skf.split(X=X['text'], y=X['label'])
    ]
    # Initiate empty array to store pred_probs.
    num_examples, num_classes = len(X), len(X.label.value_counts())
    pred_probs = np.full((num_examples, num_classes), np.NaN)
    pred_probs_unlabeled = None
    # If we use up all examples from the initial unlabeled pool, X_unlabeled will be None.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.full((n_splits, len(X_unlabeled), num_classes), np.NaN)
    # Iterate through cross-validation folds.
    for split_num, split in enumerate(skf_splits):
        train_index, test_index = split
        train_set = X.iloc[train_index]
        test_set = X.iloc[test_index]
        # Get trainer with train/test subsets.
        trainer = get_trainer(train_set, test_set)
        trainer.train()
        eval_metrics = trainer.evaluate()
        # Get pred_probs and insert into dataframe.
        pred_probs_fold = eval_metrics['eval_pred_probs']
        pred_probs[test_index] = pred_probs_fold
        # Since we don't have labels for the unlabeled pool, we compute pred_probs at each round of CV
        # and then average the results at the end.
        if X_unlabeled is not None:
            dataset_unlabeled = Dataset.from_dict({"text": X_unlabeled['text'].values})
            unlabeled_tokenized_dataset = dataset_unlabeled.map(tokenize_function, batched=True)
            logits = trainer.predict(unlabeled_tokenized_dataset).predictions
            curr_pred_probs_unlabeled = softmax(logits, axis=1)
            pred_probs_unlabeled[split_num] = curr_pred_probs_unlabeled
    # Here we average the pred_probs from each round of CV to get pred_probs for the unlabeled pool.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.mean(np.array(pred_probs_unlabeled), axis=0)
    return pred_probs, pred_probs_unlabeled
The get_annotator function determines, based on a set of criteria, which annotator is best suited to collect a new annotation from for a given example. The get_annotation function then collects the actual annotation for that example from the chosen annotator and removes the collected annotation from the pool so it cannot be chosen again.
# Helper method to determine which annotator to collect annotation from for given example.
def get_annotator(example_id):
    # Update who has already annotated at least one example.
    existing_annotators = set(X_labeled_full.drop('text', axis=1).columns)
    # Returns the annotator we want to collect annotation from.
    # Chooses existing annotators first.
    annotators = extra_annotations[example_id]
    chosen_annotator = choose_existing(annotators, existing_annotators)
    return chosen_annotator
# Helper method to collect an annotation for given text example.
def get_annotation(example_id, chosen_annotator):
    # Collect new annotation.
    new_annotation = extra_annotations[example_id][chosen_annotator]
    # Remove annotation.
    del extra_annotations[example_id][chosen_annotator]
    return new_annotation
Run the following cell to hide the HTML output of the next model-training block.
%%html
<style>
    div.output_stderr {
    display: none;
    }
</style>
Methodology used
For each round of active learning, we:
- Compute ActiveLab consensus labels for every training example, based on all annotations collected so far.
- Train our Transformer classification model on the current training set using these consensus labels.
- Evaluate test accuracy on the test set (which has high-quality ground-truth labels).
- Run cross-validation to obtain out-of-sample predicted class probabilities from our model for the entire training set and the unlabeled set.
- Compute ActiveLab active learning scores for every example in the training set and the unlabeled set. These scores estimate how informative it would be to collect another annotation for each example.
- Select a subset of n = batch_size examples with the lowest active learning scores.
- Collect one additional annotation for each of the n selected examples.
- Add the new annotations (and the selected examples themselves, if they were previously unlabeled) to the training set for the next iteration.
Next, I compare models trained on data labeled via active learning against models trained on data labeled via random selection. For each round of random selection, I use majority-vote consensus instead of ActiveLab consensus (step 1), and I simply select n examples at random to collect an extra label for, rather than using ActiveLab scores (step 6).
More intuition about ActiveLab consensus labels and active learning scores is shared later in this notebook.
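As a condensed sketch, a single ActiveLab round boils down to the following calls to the helper methods defined above (bookkeeping details, such as moving newly chosen unlabeled examples into the training set, are omitted here and appear in the full loop later in this notebook):
# Condensed sketch of a single ActiveLab round; the full loop with all bookkeeping appears below.
multiannotator_labels = X_labeled_full.drop(['text'], axis=1)
# Step 1: consensus labels from all annotations collected so far
# (pred_probs_labeled comes from the previous round; the very first round falls back to majority vote).
results = get_label_quality_multiannotator(multiannotator_labels, pred_probs_labeled, calibrate_probs=True)
consensus_labels = results["label_quality"]["consensus_label"].values
# Steps 2-3: train the Transformer on the consensus-labeled training set and evaluate on the test set.
train_set = X_labeled_full[['text']].copy()
train_set['label'] = consensus_labels
trainer = get_trainer(train_set, test[['text', 'label']])
trainer.train()
round_accuracy = trainer.evaluate()['eval_accuracy']
# Step 4: out-of-sample predicted class probabilities via cross-validation.
pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)
# Step 5: ActiveLab scores for every labeled and unlabeled example.
scores, scores_unlabeled = get_active_learning_scores(multiannotator_labels, pred_probs, pred_probs_unlabeled)
# Step 6: pick the batch of lowest-scoring examples to annotate next.
chosen_labeled, chosen_unlabeled = get_idx_to_label(
    X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label, scores, scores_unlabeled
)
# Steps 7-8: collect one extra annotation per chosen example and merge everything into X_labeled_full.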
Model training and evaluation
I first tokenize the test and training sets and then initialize a pretrained DistilBert Transformer model. Fine-tuning DistilBert for 300 training steps strikes a good balance between accuracy and training time on my data. The classifier outputs predicted class probabilities, which I convert into class predictions before evaluating their accuracy.
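As a minimal sketch of that evaluation step (here train_set and test_set are assumed to be DataFrames with 'text' and 'label' columns, as constructed in the training loop below), converting the predicted probabilities into class predictions looks like this:
# Minimal sketch of evaluating one fine-tuned model (train_set / test_set are assumed to be
# DataFrames with 'text' and 'label' columns, as built in the training loop below).
trainer = get_trainer(train_set, test_set)
trainer.train()
eval_metrics = trainer.evaluate()
pred_probs = np.array(eval_metrics["eval_pred_probs"])   # predicted class probabilities on the test set
class_preds = np.argmax(pred_probs, axis=1)              # convert probabilities into hard class predictions
test_accuracy = accuracy_score(y_true=test_set["label"], y_pred=class_preds)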
Use active learning scores to decide what to annotate next
In each round of active learning, we fit our Transformer model on the current training set via 3-fold cross-validation. This lets us obtain out-of-sample predicted class probabilities for every example in the training set, and we can also use the trained Transformer to obtain predicted class probabilities for every example in the unlabeled pool. All of this is implemented in the get_pred_probs helper method. Using out-of-sample predictions helps us avoid bias due to potential overfitting.
Once we have these probabilistic predictions, I pass them into the get_active_learning_scores method from the open-source cleanlab package, which implements the ActiveLab algorithm. This method gives us scores for all of our labeled and unlabeled data. Lower scores indicate examples for which collecting one additional label should be most informative for our current model (the scores are directly comparable between labeled and unlabeled data).
I form a batch of the lowest-scoring examples as the examples to collect an annotation for (via the get_idx_to_label method). In every round, I always collect the same total number of annotations (under both the active learning and random selection approaches). For this application, I also cap the number of annotations per example at 5 (to avoid spending annotation effort repeatedly on the same example).
Adding new annotations
combined_example_ids holds the ids of the text examples we want to collect an annotation for. For each of these, we use the get_annotation helper method to collect a new annotation from an annotator. When doing so, we prioritize annotators who have already annotated other examples. If none of the given example's annotators exist in the training set, we randomly select one; in that case, we add a new column to the training set representing the new annotator. Finally, we add the newly collected annotation to the training set. If the corresponding example was previously unlabeled, we also add it to the training set and remove it from the unlabeled pool.
We have now completed one round of collecting new annotations, and we retrain the Transformer model on the updated training set. We repeat this process over multiple rounds, steadily growing the training dataset and improving the model.
# For this Active Learning demo, we add 25 additional annotations to the training set
# each iteration, for 25 rounds.
num_rounds = 25
batch_size_to_label = 25
model_accuracy_arr = np.full(num_rounds, np.nan)
# The 'selection_method' variable determines if we use ActiveLab or random selection
# to choose the new annotations each round.
selection_method = 'random'
# selection_method = 'active_learning'
# Each round we:
# - train our model
# - evaluate on unchanging test set
# - collect and add new annotations to training set
for i in range(num_rounds):
    # X_labeled_full is updated each iteration. We drop the text column which leaves us with just the annotations.
    multiannotator_labels = X_labeled_full.drop(['text'], axis=1)
    # Use majority vote when using random selection to select the consensus label for each example.
    if i == 0 or selection_method == 'random':
        consensus_labels = get_majority_vote_label(multiannotator_labels)
    # When using ActiveLab, use cleanlab's CrowdLab to select the consensus label for each example.
    else:
        results = get_label_quality_multiannotator(
            multiannotator_labels,
            pred_probs_labeled,
            calibrate_probs=True,
        )
        consensus_labels = results["label_quality"]["consensus_label"].values
    # We only need the text and label columns.
    train_set = X_labeled_full[['text']]
    train_set['label'] = consensus_labels
    test_set = test[['text', 'label']]
    # Train our Transformer model on the full set of labeled data to evaluate model accuracy for the current round.
    # This is an optional step for demonstration purposes, in practical applications
    # you may not have ground truth labels.
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    eval_metrics = trainer.evaluate()
    # set statistics
    model_accuracy_arr[i] = eval_metrics['eval_accuracy']
    # For ActiveLab, we need to run cross-validation to get out-of-sample predicted probabilities.
    if selection_method == 'active_learning':
        pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)
        # Compute active learning scores.
        active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
            multiannotator_labels, pred_probs, pred_probs_unlabeled
        )
        # Get the indices of examples to collect more labels for.
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label,
            active_learning_scores,
            active_learning_scores_unlabeled,
        )
    # We don't need to run cross-validation, just get random examples to collect annotations for.
    if selection_method == 'random':
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label_random(
        X_labeled_full,
        X_unlabeled,
        extra_annotations,
        batch_size_to_label
        )
    unlabeled_example_ids = np.array([])
    # Check to see if we still have unlabeled examples left.
    if X_unlabeled is not None:
        # Get unlabeled text examples we want to collect annotations for.
        new_text = X_unlabeled.iloc[chosen_examples_unlabeled]
        unlabeled_example_ids = new_text.index.values
        num_ex, num_annot = len(new_text), multiannotator_labels.shape[1]
        empty_annot = pd.DataFrame(data = np.full((num_ex, num_annot), np.NaN), columns = multiannotator_labels.columns, index=unlabeled_example_ids)
        new_unlabeled_df = pd.concat([new_text, empty_annot], axis=1)
        # Combine unlabeled text examples with existing, labeled examples.
        X_labeled_full = pd.concat([X_labeled_full, new_unlabeled_df], axis=0)
        # Remove examples from X_unlabeled and check if empty.
        # Once it is empty we set it to None to handle appropriately elsewhere.
        X_unlabeled = X_unlabeled.drop(new_text.index)
        if X_unlabeled.empty:
            X_unlabeled = None
    if selection_method == 'active_learning':
        # Update pred_prob arrays with newly added examples if necessary.
        if pred_probs_unlabeled is not None and len(chosen_examples_unlabeled) != 0:
            pred_probs_new = pred_probs_unlabeled[chosen_examples_unlabeled, :]
            pred_probs_labeled = np.concatenate((pred_probs, pred_probs_new))
            pred_probs_unlabeled = np.delete(
                pred_probs_unlabeled, chosen_examples_unlabeled, axis=0
            )
        # Otherwise we have nothing to modify.
        else:
            pred_probs_labeled = pred_probs
    # Get combined list of text ID's to relabel.
    labeled_example_ids = X_labeled_full.iloc[chosen_examples_labeled].index.values
    combined_example_ids = np.concatenate([labeled_example_ids, unlabeled_example_ids])
    # Now we collect annotations for the selected examples.
    for example_id in combined_example_ids:
        # Choose which annotator to collect annotation from.
        chosen_annotator = get_annotator(example_id)
        # Collect new annotation.
        new_annotation = get_annotation(example_id, chosen_annotator)
        # New annotator has been selected.
        if chosen_annotator not in X_labeled_full.columns.values:
            empty_col = np.full((len(X_labeled_full),), np.nan)
            X_labeled_full[chosen_annotator] = empty_col
        # Add selected annotation to the training set.
        X_labeled_full.at[example_id, chosen_annotator] = new_annotation
Results
I ran 25 rounds of active learning (labeling a batch of data and retraining the Transformer model in each round), collecting 25 annotations per round. I then repeated the whole process, this time using random selection to decide which examples to annotate in each round, as a baseline comparison. Before any additional data is annotated, both approaches start from the same initial training set of 100 examples (hence Transformer accuracy is roughly the same in the first round). Because of the inherent stochasticity in training Transformer models, I ran the whole process five times for each data labeling strategy and report the standard deviation (shaded region) and mean (solid line) of test accuracy across the five replicate runs.
# Get numpy array of results.
!wget -nc -O 'activelearn_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/activelearn_acc.npy'
!wget -nc -O 'random_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/random_acc.npy'
# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])
    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev
>>> al_acc = np.load('activelearn_acc.npy')
>>> rand_acc = np.load('random_acc.npy')
>>> rand_acc_std = compute_std_dev(rand_acc)
>>> al_acc_std = compute_std_dev(al_acc)
>>> plt.plot(range(1, al_acc.shape[1]+1), np.mean(al_acc, axis=0), label="active learning", color='green')
>>> plt.fill_between(range(1, al_acc.shape[1]+1), al_acc_std[0], al_acc_std[1], alpha=0.3, color='green')
>>> plt.plot(range(1, rand_acc.shape[1]+1), np.mean(rand_acc, axis=0), label="random", color='red')
>>> plt.fill_between(range(1, rand_acc.shape[1]+1), rand_acc_std[0], rand_acc_std[1], alpha=0.1, color='red')
>>> plt.hlines(y=0.9, xmin=1.0, xmax=25.0, color='black', linestyle='dotted')
>>> plt.legend()
>>> plt.xlabel("Round Number")
>>> plt.ylabel("Test Accuracy")
>>> plt.title("ActiveLab vs Random Annotation Selection --- 5 Runs")
>>> plt.savefig("al-results.png")
>>> plt.show()
We can see that deciding what data to annotate next has a drastic impact on model performance. Active learning with ActiveLab consistently outperforms random selection by a significant margin in every round. For example, in round 4, with a training set of 275 total annotations, we obtain 91% accuracy via active learning versus only 76% accuracy without a clever choice of which examples to annotate. Overall, Transformer models trained on the datasets constructed via active learning have roughly 50% of the error rate, no matter the total labeling budget!
When annotating data for text classification, you should consider active learning with a re-labeling option to better account for imperfect annotators.