APP_TITLE = "📐 NER Metrics Comparison ⚖️"
APP_INTRO = "The NER task is performed over a piece of text and involves recognizing entities that belong to a desired entity set and classifying them. The various metrics are explained in the explanation tab. Once you have gone through them, head to the comparison tab to test out some examples."
### EXPLANATION TAB ###
EVAL_FUNCTION_INTRO = "An evaluation function tells us how well a model is performing. The basic working of any evaluation function involves comparing the model's output with the ground truth to give a score of correctness."
EVAL_FUNCTION_PROPERTIES = """
Some basic properties of an evaluation function are -
1. Give an output score equivalent to the upper bound when the prediction is completely correct (in some tasks, multiple variations of a prediction can be considered correct).
2. Give an output score equivalent to the lower bound when the prediction is completely wrong.
3. Give an output score between the upper and lower bound in other cases, corresponding to the degree of correctness.
"""
NER_TASK_EXPLAINER = """
The output of the NER task can be represented in either token format or span format.
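For example, the sentence "My name is John." with a single NAME entity could be written in both formats roughly as below (a sketch assuming character offsets with an exclusive end; both formats are detailed in the sections that follow).
```
# Illustrative only: the same NAME annotation in both representations
text = "My name is John."

# Span format: character offsets plus a label (end offset assumed exclusive)
span_format = {"start_offset": 11, "end_offset": 15, "label": "NAME"}

# Token format: one label per token, with 'O' as the null label
token_format = [("My", "O"), ("name", "O"), ("is", "O"), ("John", "NAME"), (".", "O")]
```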
| """ | |
SPAN_BASED_METRICS_EXPLANATION = """
Span-based metrics use the offsets & labels of the NER spans to compare the ground truths and predictions. These are present in the NER span representation object, which looks like this
```
span_ner_object = {"start_offset": 3, "end_offset": 5, "label": "label_name"}
```
Comparing the ground truth and predicted span objects, we get the following broad categories of cases (a detailed explanation can be found [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/))
##### Comparison Cases
| Category        | Explanation                                                                              |
| --------------- | ---------------------------------------------------------------------------------------- |
| Correct (COR)   | both are the same                                                                        |
| Incorrect (INC) | the output of a system and the golden annotation don’t match                             |
| Partial (PAR)   | system and the golden annotation are somewhat “similar” but not the same                 |
| Missing (MIS)   | a golden annotation is not captured by a system                                          |
| Spurious (SPU)  | system produces a response which doesn’t exist in the golden annotation (Hallucination)  |
The specifics of this categorization are defined based on our metric of choice. For example, in some cases we might want to consider a partial overlap of offsets correct and in other cases incorrect.
Based on this, we have the Partial and Exact span-based criteria. The categorizations under these two schemes are shown in the table below, followed by a sketch of the categorization logic.
| Ground Truth Entity | Ground Truth String | Pred Entity | Pred String         | Partial | Exact |
| ------------------- | ------------------- | ----------- | ------------------- | ------- | ----- |
| BRAND               | tikosyn             | -           | -                   | MIS     | MIS   |
| -                   | -                   | BRAND       | healthy             | SPU     | SPU   |
| DRUG                | warfarin            | DRUG        | of warfarin         | COR     | INC   |
| DRUG                | propranolol         | BRAND       | propranolol         | INC     | INC   |
| DRUG                | phenytoin           | DRUG        | phenytoin           | COR     | COR   |
| GROUP               | contraceptives      | DRUG        | oral contraceptives | INC     | INC   |
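The table's Partial and Exact categorizations can be sketched roughly as below (an illustrative sketch, not the app's exact implementation; end offsets are assumed exclusive, and MIS/SPU are assigned to spans that have no counterpart at all).
```
def overlaps(gt, pred):
    # True when the two spans share at least one character offset (end offsets assumed exclusive)
    return gt["start_offset"] < pred["end_offset"] and pred["start_offset"] < gt["end_offset"]

def categorize(gt, pred, scheme="exact"):
    # Returns COR or INC for a paired ground-truth/prediction span;
    # MIS and SPU arise when a span has no counterpart at all.
    if scheme == "exact":
        same_offsets = (gt["start_offset"] == pred["start_offset"]
                        and gt["end_offset"] == pred["end_offset"])
        return "COR" if same_offsets and gt["label"] == pred["label"] else "INC"
    # "partial" scheme: any offset overlap with the correct label counts as correct
    return "COR" if overlaps(gt, pred) and gt["label"] == pred["label"] else "INC"

# Hypothetical offsets for the "warfarin" vs "of warfarin" row above
gt = {"start_offset": 3, "end_offset": 11, "label": "DRUG"}
pred = {"start_offset": 0, "end_offset": 11, "label": "DRUG"}
print(categorize(gt, pred, scheme="partial"), categorize(gt, pred, scheme="exact"))  # COR INC
```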
To compute precision, recall and f1-score from these cases,
$$ Precision = TP / (TP + FP) = COR / (COR + INC + PAR + SPU) $$
$$ Recall = TP / (TP + FN) = COR / (COR + INC + PAR + MIS) $$
The f1-score is then computed as the harmonic mean of precision and recall.
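As a rough worked example, counting the cases in the table above gives Partial: COR=2, INC=2, PAR=0, MIS=1, SPU=1 and Exact: COR=1, INC=3, PAR=0, MIS=1, SPU=1. Plugging these counts into the formulas (a sketch, not the app's implementation):
```
# Sketch: precision, recall and f1 from the comparison-case counts defined above
def span_scores(cor, inc, par, mis, spu):
    precision = cor / (cor + inc + par + spu)
    recall = cor / (cor + inc + par + mis)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(span_scores(cor=2, inc=2, par=0, mis=1, spu=1))  # Partial scheme -> roughly (0.4, 0.4, 0.4)
print(span_scores(cor=1, inc=3, par=0, mis=1, spu=1))  # Exact scheme   -> roughly (0.2, 0.2, 0.2)
```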
| """ | |
| TOKEN_BASED_METRICS_EXPLANATION = """ | |
| Token based metrics use the NER token based representation object, which tokenized the input text and assigns a label to each of the token. This essentially transforms the evaluation/modelling task to a classification task. | |
| The token based representation object is shown below | |
| ``` | |
| # Here, O represents the null label | |
| token_ner_object = [('My', O), ('name', O), ('is', O), ('John', NAME), ('.', O)] | |
| ``` | |
| Once we have the token_objects for ground truth and predictions, we compute a classification report comparing the labels. | |
| The final evaluation score is calculated based on exact token metric of choice. | |
###### Macro Average
Calculates the metrics for each label, and finds their unweighted mean. This does not take label imbalance into account.
###### Micro Average
Calculates the metrics globally by counting the total true positives, false negatives and false positives.
###### Weighted Average
Calculates the metrics for each label, and finds their average weighted by support (the number of true instances for each label).
This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
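A minimal sketch of these averages computed over token labels, assuming scikit-learn as the metrics backend (the backend and the label sequences below are illustrative assumptions, not the app's actual implementation):
```
from sklearn.metrics import precision_recall_fscore_support

y_true = ['O', 'O', 'O', 'NAME', 'O']     # ground-truth token labels
y_pred = ['O', 'NAME', 'O', 'NAME', 'O']  # hypothetical predicted token labels

# Compute precision / recall / f1 under each averaging scheme
for avg in ('macro', 'micro', 'weighted'):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg, zero_division=0)
    print(avg, round(p, 3), round(r, 3), round(f1, 3))
```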
| """ | |
| ### COMPARISION TAB ### | |
| PREDICTION_ADDITION_INSTRUCTION = """ | |
| Add predictions to the list of predictions on which the evaluation metric will be caculated. | |
| - Select the entity type/label name and then highlight the span in the text below. | |
| - To remove a span, double click on the higlighted text. | |
| - Once you have your desired prediction, click on the 'Add' button.(The prediction created is shown in a json below) | |
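The prediction object follows the span format from the explanation tab; a hypothetical example of what gets added is shown below (the exact fields rendered by the app may differ).
```
# Hypothetical example of an added prediction, reusing the span format from the explanation tab
added_prediction = {"start_offset": 3, "end_offset": 5, "label": "label_name"}
```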
| """ | |