Afrikaans Next-Word Prediction GPT

Author: Louis Wilkinson (25948873)
Project: BDatSci Research Project 2025

Model Description

This model is a GPT-based next-word prediction system designed specifically for Afrikaans text. It uses a custom ByteLevel BPE tokeniser with a 12,000-token vocabulary and implements a token-aware prediction strategy that returns word-level suggestions for Afrikaans input.

Model Architecture

  • Type: GPT (Generative Pre-trained Transformer)
  • Layers: 6 transformer blocks
  • Attention Heads: 8
  • Embedding Dimension: 256
  • Context Window: 32 tokens
  • Vocabulary Size: 12,000 tokens
  • Tokeniser: ByteLevel BPE with special tokens
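
For reference, these hyperparameters correspond to a configuration object along the lines of the sketch below. The field names are illustrative and assume a minGPT-style implementation; they are not necessarily the names used in train_model.py.

from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Hyperparameters from the list above; field names are illustrative.
    n_layer: int = 6          # transformer blocks
    n_head: int = 8           # attention heads
    n_embd: int = 256         # embedding dimension
    block_size: int = 32      # context window in tokens
    vocab_size: int = 12_000  # ByteLevel BPE vocabulary
    dropout: float = 0.1      # 0.0 at inference (see Training Details)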

Special Tokens

The tokeniser handles:

  • <NAME>: Personal names
  • <URL>: Web addresses
  • <EMAIL>: Email addresses
  • <PHONE>: Phone numbers
  • <NUM>: Numeric values
  • <EMOJI>: Emoji characters
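
A sketch of how build_tokeniser.py might reserve these tokens when training the tokeniser with the Hugging Face tokenizers library (the input path is a placeholder; the actual script may configure further options):

from tokenizers import ByteLevelBPETokenizer

SPECIAL_TOKENS = ["<NAME>", "<URL>", "<EMAIL>", "<PHONE>", "<NUM>", "<EMOJI>"]

tokeniser = ByteLevelBPETokenizer()
tokeniser.train(
    files=["cleaned_text.txt"],     # placeholder: the cleaned corpus
    vocab_size=12_000,
    special_tokens=SPECIAL_TOKENS,  # reserved IDs; never split by BPE
)
tokeniser.save("tokeniser.json")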

Afrikaans Contractions

The model is trained to recognise common Afrikaans contractions:

  • ek's, jy's, hy's, sy's, dit's, ons's, julle's, hulle's
  • daar's, hier's, wat's, dis, 'n
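
For these forms to survive cleaning and tokenise consistently, apostrophe variants must first be normalised. A minimal illustration (the exact rules in text_cleaner.py may differ):

import re

def normalise_apostrophes(text: str) -> str:
    # Map curly and backtick apostrophes to the straight form so
    # contractions like "ek's" and "'n" are always spelled one way.
    return re.sub(r"[’‘`´]", "'", text)

print(normalise_apostrophes("Ek’s bly dit’s ’n mooi dag"))  # Ek's bly dit's 'n mooi dag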

Repository Structure

Afrikaans-NWP-GPT/
├── model/
│   ├── Afrikaans_NWP_GPT.pt        # Trained model checkpoint
│   └── tokeniser.json              # ByteLevel BPE tokeniser
├── scripts/
│   ├── text_cleaner.py             # Comprehensive Afrikaans text cleaning
│   ├── build_tokeniser.py          # Tokeniser construction script
│   └── train_model.py              # Model training script
├── demo/
│   ├── demo.py                     # Interactive GUI demo application
│   ├── run_demo.sh                 # Shell script to launch demo
│   └── sentence_starters_top100.txt # Common Afrikaans sentence starters
└── tests/
    └── evaluate_model.py           # Model evaluation script

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • tokenizers (Hugging Face)
  • tkinter (for demo GUI)
  • Standard libraries: json, re, unicodedata, dataclasses

Install dependencies:

pip install torch tokenizers

Usage

Running the Demo

The easiest way to try the model is through the interactive demo:

cd demo
bash run_demo.sh

Or directly:

cd demo
python demo.py

The demo provides a tkinter GUI where you can type Afrikaans text and see real-time next-word predictions.
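
The model can also be used programmatically. A minimal loading sketch, assuming the checkpoint holds a state_dict for a model class defined in train_model.py (adjust to how the script actually saves it):

import torch
from tokenizers import Tokenizer

tokeniser = Tokenizer.from_file("model/tokeniser.json")
state = torch.load("model/Afrikaans_NWP_GPT.pt", map_location="cpu")
# model = GPT(GPTConfig()); model.load_state_dict(state)  # class from train_model.py

print(tokeniser.encode("Ek wil graag").tokens)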

Training the Model

To train the model from scratch:

  1. Clean the text data:

    cd scripts
    python text_cleaner.py --input raw_text.txt --output cleaned_text.txt
    
  2. Build the tokeniser:

    python build_tokeniser.py --input cleaned_text.txt --output ../model/tokeniser.json
    
  3. Train the model:

    python train_model.py --data cleaned_text.txt --tokeniser ../model/tokeniser.json --output ../model/Afrikaans_NWP_GPT.pt
    

Evaluating the Model

To evaluate the model on test data:

cd tests
python evaluate_model.py --tokeniser ../model/tokeniser.json --checkpoint ../model/Afrikaans_NWP_GPT.pt

The evaluation script computes three metrics:

  • KSS (Keystroke Savings): Percentage of keystrokes saved by accepting predictions
  • MRR@3 (Mean Reciprocal Rank): Average of 1/rank of the correct word within the top-3 predictions (0 when it is absent)
  • RWKS@3 (Rank-Weighted Keystroke Savings): Keystroke savings weighted by the rank at which the correct prediction appears
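
As an illustration, MRR@3 can be computed as in the sketch below; the evaluation script's actual implementation may differ in details such as how targets are tokenised.

def mrr_at_3(targets, top3_lists):
    # Mean of 1/rank over examples; contributes 0 when the target
    # is not among the top 3 predictions.
    total = 0.0
    for target, preds in zip(targets, top3_lists):
        for rank, pred in enumerate(preds[:3], start=1):
            if pred == target:
                total += 1.0 / rank
                break
    return total / len(targets)

# Ranks 1, 3, and a miss -> (1 + 1/3 + 0) / 3 ≈ 0.444
print(mrr_at_3(["die", "kat", "loop"],
               [["die", "'n", "hy"],
                ["hond", "huis", "kat"],
                ["slaap", "eet", "is"]]))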

Training Details

  • Optimiser: AdamW with weight decay 0.1
  • Learning Rate: Cosine schedule with warmup
  • Mixed Precision: Automatic Mixed Precision (AMP) enabled
  • Gradient Accumulation: Used to achieve larger effective batch sizes
  • Regularisation: Dropout 0.1 during training, 0.0 during inference
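
These pieces combine roughly as follows. This is a self-contained sketch using a stand-in linear model and random data so it runs anywhere, not the actual train_model.py loop (the real script trains the GPT above and would move model and batches to the GPU):

import math
import torch

# Stand-ins to keep the sketch runnable; the real script trains the GPT.
model = torch.nn.Linear(256, 12_000)
batches = [(torch.randn(8, 256), torch.randint(0, 12_000, (8,))) for _ in range(20)]

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 2, 5  # optimiser steps, not batches
def lr_lambda(step):
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda)

use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # AMP is a no-op on CPU
accum_steps = 4  # gradient accumulation

for step, (x, y) in enumerate(batches):
    with torch.autocast("cuda", enabled=use_cuda):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimiser)
        scaler.update()
        optimiser.zero_grad(set_to_none=True)
        scheduler.step()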

Text Cleaning Pipeline

The comprehensive text cleaner (scripts/text_cleaner.py) performs:

  1. Unicode normalisation (NFC)
  2. Whitespace normalisation
  3. Special token replacement (names, URLs, emails, phone numbers)
  4. Afrikaans contraction handling
  5. Punctuation normalisation
  6. Character encoding fixes
  7. Removal of control characters and invalid sequences
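
Condensed, the pipeline looks something like the sketch below. The regexes are simplified illustrations and the real text_cleaner.py is more thorough; note that ordering matters (e.g. phone numbers must be replaced before bare numbers):

import re
import unicodedata

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)               # 1. Unicode NFC
    text = re.sub(r"https?://\S+", "<URL>", text)           # 3. special tokens
    text = re.sub(r"\S+@\S+\.\S+", "<EMAIL>", text)
    text = re.sub(r"\+?\d[\d\s\-]{7,}\d", "<PHONE>", text)
    text = re.sub(r"\d+(?:[.,]\d+)?", "<NUM>", text)
    text = re.sub(r"[’‘`´]", "'", text)                     # 4. contractions
    text = "".join(ch for ch in text                        # 7. control chars
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    text = re.sub(r"[ \t]+", " ", text)                     # 2. whitespace
    return text.strip()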

Prediction Strategy

The model uses a token-aware prediction strategy:

  1. Tokenise input text into BPE tokens
  2. Use m-1 tokens as context (where m is the number of tokens in the current text)
  3. Generate top-k predictions from the model
  4. Filter and clean predictions
  5. Return top-3 word-level predictions

This approach ensures predictions align with natural word boundaries while still benefiting from subword tokenisation.
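
In code, the strategy looks roughly like the sketch below. Names are illustrative, and it assumes a model that maps a batch of token IDs to logits of shape (batch, sequence, vocab):

import torch
from tokenizers import Tokenizer

def predict_top3(model, tokeniser: Tokenizer, text: str,
                 block_size: int = 32, k: int = 50):
    ids = tokeniser.encode(text).ids
    context = (ids[:-1] or ids)[-block_size:]  # m-1 tokens, clipped to window
    x = torch.tensor([context])
    with torch.no_grad():
        logits = model(x)[0, -1]               # distribution over next token
    candidates = torch.topk(logits, k).indices.tolist()
    # Filter and clean: keep alphabetic word candidates only.
    words = [tokeniser.decode([t]).strip() for t in candidates]
    words = [w for w in words if w and w.isalpha()]
    return words[:3]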

Intended Uses & Limitations

Intended Uses

This model is designed for:

  • Text completion: Assisting users in typing Afrikaans text more efficiently
  • Academic research: Studying next-word prediction for low-resource languages
  • Educational applications: Learning tools for Afrikaans language learners
  • Accessibility: Helping users with typing difficulties in Afrikaans contexts

Limitations

  • Domain specificity: The model's performance depends on the training data distribution. It may not perform well on domains or registers significantly different from the training corpus.
  • Context window: Limited to 32 tokens, which may be insufficient for very long-range dependencies.
  • Low-resource language: As Afrikaans is a relatively low-resource language, the model may not match the performance of similar models trained on high-resource languages like English.
  • Special tokens: While the model handles common special tokens (names, URLs, emails), it may not generalise well to all types of special content.
  • Contractions: Although trained on common Afrikaans contractions, rare or informal contractions may not be predicted accurately.

Out-of-Scope Uses

This model should not be used for:

  • Commercial applications without explicit permission from the author (see License section)
  • Generating harmful, biased, or offensive content
  • Making critical decisions without human oversight
  • Languages other than Afrikaans (the model is specifically trained for Afrikaans)

Ethical Considerations & Biases

Potential Biases

As with any language model trained on real-world text data:

  • The model may reflect biases present in the training data, including but not limited to demographic, cultural, or socioeconomic biases.
  • Predictions may favour more common language patterns and may not adequately represent minority dialects or informal Afrikaans variants.
  • The model's training data sources and their representativeness should be considered when interpreting results.

Recommendations

  • Users should be aware of potential biases and should not rely solely on the model's predictions for sensitive applications.
  • The model should be regularly evaluated for fairness across different demographic groups and text domains.
  • User feedback mechanisms should be implemented in production systems to identify and mitigate problematic predictions.

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

For Academic and Research Use Only

This software and associated model are provided for academic research and educational purposes only. Commercial use is strictly prohibited without prior written permission from the author.

Plagiarism Notice

This is original research conducted as part of a BDatSci honours project. If you use this work in your research, you must provide proper attribution:

Citation:

Wilkinson, L.A. (2025). Next-word prediction in Afrikaans: A comparative study of Transformer models and traditional methods.
Bachelor of Data Science (Analytics and Optimisation) Research Project, Stellenbosch University.

Failure to cite this work constitutes academic plagiarism. Any use, modification, or distribution of this work must include this citation and maintain the original authorship attribution.

Permitted Uses:

  • Academic research and study
  • Educational purposes
  • Non-commercial applications
  • Modified versions (with proper attribution)

Prohibited Uses:

  • Commercial applications without permission
  • Claiming authorship or removing attribution
  • Use without proper citation in academic work

Contact

Louis Anthony Wilkinson (25948873)
BDatSci Research Project 2025

For permissions beyond the scope of this license, please contact the author.
