Space: MohamedRashad/arabic-tokenizers-leaderboard
MohamedRashad committed on May 17, 2024

Commit e4cac44 (verified) · 1 parent: a9aadc2

Update app.py

Files changed (1): app.py

@@ -161,8 +161,9 @@ def tokenize_text(text, chosen_model, better_tokenization=False):
     return gr.HighlightedText(output, color_map)
-leaderboard_description = """The numbers in this leaderboard are based on the total number of tokens in the Arabic
-dataset [rasaif-translations](https://huggingface.co/datasets/MohamedRashad/rasaif-translations).
+leaderboard_description = """The `Total Number of Tokens` in this leaderboard is based on the total number of tokens summed over the Arabic section of the [rasaif-translations](https://huggingface.co/datasets/MohamedRashad/rasaif-translations) dataset.
+This dataset was chosen because it represents Arabic Fusha text in a small and concentrated manner.
+A tokenizer that scores high in this leaderboard will be efficient in parsing Arabic in its different dialects and forms.
 """
 with gr.Blocks() as demo:
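
The new description says the `Total Number of Tokens` is a sum of token counts over the dataset's Arabic text. A minimal sketch of that summation follows. It is not taken from app.py: the function name `total_number_of_tokens`, the whitespace stand-in tokenizer, and the two sample sentences are all hypothetical and only illustrate the idea.

```python
from typing import Callable, Iterable, List


def total_number_of_tokens(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> int:
    """Sum the number of tokens that `tokenize` produces over every text."""
    return sum(len(tokenize(text)) for text in texts)


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


# Illustrative sentences, not rows from the rasaif-translations dataset.
sample_texts = ["السلام عليكم", "مرحبا بكم في المكتبة"]

print(total_number_of_tokens(sample_texts, whitespace_tokenize))  # prints 6
```

To compare real tokenizers, one could pass the `tokenize` method of a Hugging Face tokenizer in place of `whitespace_tokenize` and iterate over the dataset's Arabic column instead of `sample_texts`.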