Afrikaans Next-Word Prediction GPT

Author: Louis Wilkinson (25948873)
Project: BDatSci Research Project 2025

Model Description

This model is a GPT-based next-word prediction system designed specifically for Afrikaans text. It uses a custom ByteLevel BPE tokeniser with a 12,000-token vocabulary and implements a token-aware prediction strategy that returns word-level suggestions for Afrikaans input.

Model Architecture

  • Type: GPT (Generative Pre-trained Transformer)
  • Layers: 6 transformer blocks
  • Attention Heads: 8
  • Embedding Dimension: 256
  • Context Window: 32 tokens
  • Vocabulary Size: 12,000 tokens
  • Tokeniser: ByteLevel BPE with special tokens
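
For reference, these hyperparameters correspond to a configuration object along the lines of the sketch below. The field names are illustrative and assume a minGPT-style implementation; they are not necessarily the names used in train_model.py.

from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Hyperparameters from the list above; field names are illustrative.
    n_layer: int = 6          # transformer blocks
    n_head: int = 8           # attention heads
    n_embd: int = 256         # embedding dimension
    block_size: int = 32      # context window in tokens
    vocab_size: int = 12_000  # ByteLevel BPE vocabulary
    dropout: float = 0.1      # 0.0 at inference (see Training Details)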

Special Tokens

The tokeniser handles:

  • <NAME>: Personal names
  • <URL>: Web addresses
  • <EMAIL>: Email addresses
  • <PHONE>: Phone numbers
  • <NUM>: Numeric values
  • <EMOJI>: Emoji characters
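
A sketch of how build_tokeniser.py might reserve these tokens when training the tokeniser with the Hugging Face tokenizers library (the input path is a placeholder; the actual script may configure further options):

from tokenizers import ByteLevelBPETokenizer

SPECIAL_TOKENS = ["<NAME>", "<URL>", "<EMAIL>", "<PHONE>", "<NUM>", "<EMOJI>"]

tokeniser = ByteLevelBPETokenizer()
tokeniser.train(
    files=["cleaned_text.txt"],     # placeholder: the cleaned corpus
    vocab_size=12_000,
    special_tokens=SPECIAL_TOKENS,  # reserved IDs; never split by BPE
)
tokeniser.save("tokeniser.json")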

Afrikaans Contractions

The model is trained to recognise common Afrikaans contractions:

  • ek's, jy's, hy's, sy's, dit's, ons's, julle's, hulle's
  • daar's, hier's, wat's, dis, 'n
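
For these forms to survive cleaning and tokenise consistently, apostrophe variants must first be normalised. A minimal illustration (the exact rules in text_cleaner.py may differ):

import re

def normalise_apostrophes(text: str) -> str:
    # Map curly and backtick apostrophes to the straight form so
    # contractions like "ek's" and "'n" are always spelled one way.
    return re.sub(r"[’‘`´]", "'", text)

print(normalise_apostrophes("Ek’s bly dit’s ’n mooi dag"))  # Ek's bly dit's 'n mooi dag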

Repository Structure

Afrikaans-NWP-GPT/
├── model/
│   ├── Afrikaans_NWP_GPT.pt        # Trained model checkpoint
│   └── tokeniser.json              # ByteLevel BPE tokeniser
├── scripts/
│   ├── text_cleaner.py             # Comprehensive Afrikaans text cleaning
│   ├── build_tokeniser.py          # Tokeniser construction script
│   └── train_model.py              # Model training script
├── demo/
│   ├── demo.py                     # Interactive GUI demo application
│   ├── run_demo.sh                 # Shell script to launch demo
│   └── sentence_starters_top100.txt # Common Afrikaans sentence starters
└── tests/
    └── evaluate_model.py           # Model evaluation script

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • tokenizers (Hugging Face)
  • tkinter (for demo GUI)
  • Standard libraries: json, re, unicodedata, dataclasses

Install dependencies:

pip install torch tokenizers

Usage

Running the Demo

The easiest way to try the model is through the interactive demo:

cd demo
bash run_demo.sh

Or directly:

cd demo
python demo.py

The demo provides a tkinter GUI where you can type Afrikaans text and see real-time next-word predictions.
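
The model can also be used programmatically. A minimal loading sketch, assuming the checkpoint holds a state_dict for a model class defined in train_model.py (adjust to how the script actually saves it):

import torch
from tokenizers import Tokenizer

tokeniser = Tokenizer.from_file("model/tokeniser.json")
state = torch.load("model/Afrikaans_NWP_GPT.pt", map_location="cpu")
# model = GPT(GPTConfig()); model.load_state_dict(state)  # class from train_model.py

print(tokeniser.encode("Ek wil graag").tokens)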

Training the Model

To train the model from scratch:

  1. Clean the text data:

    cd scripts
    python text_cleaner.py --input raw_text.txt --output cleaned_text.txt
    
  2. Build the tokeniser:

    python build_tokeniser.py --input cleaned_text.txt --output ../model/tokeniser.json
    
  3. Train the model:

    python train_model.py --data cleaned_text.txt --tokeniser ../model/tokeniser.json --output ../model/Afrikaans_NWP_GPT.pt
    

Evaluating the Model

To evaluate the model on test data:

cd tests
python evaluate_model.py --tokeniser ../model/tokeniser.json --checkpoint ../model/Afrikaans_NWP_GPT.pt

The evaluation script computes three metrics:

  • KSS (Keystroke Savings): Percentage of keystrokes saved by accepting predictions
  • MRR@3 (Mean Reciprocal Rank): Average of 1/rank of the correct word within the top-3 predictions (0 when it is absent)
  • RWKS@3 (Rank-Weighted Keystroke Savings): Keystroke savings weighted by the rank at which the correct prediction appears
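
As an illustration, MRR@3 can be computed as in the sketch below; the evaluation script's actual implementation may differ in details such as how targets are tokenised.

def mrr_at_3(targets, top3_lists):
    # Mean of 1/rank over examples; contributes 0 when the target
    # is not among the top 3 predictions.
    total = 0.0
    for target, preds in zip(targets, top3_lists):
        for rank, pred in enumerate(preds[:3], start=1):
            if pred == target:
                total += 1.0 / rank
                break
    return total / len(targets)

# Ranks 1, 3, and a miss -> (1 + 1/3 + 0) / 3 ≈ 0.444
print(mrr_at_3(["die", "kat", "loop"],
               [["die", "'n", "hy"],
                ["hond", "huis", "kat"],
                ["slaap", "eet", "is"]]))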

Training Details

  • Optimiser: AdamW with weight decay 0.1
  • Learning Rate: Cosine schedule with warmup
  • Mixed Precision: Automatic Mixed Precision (AMP) enabled
  • Gradient Accumulation: Used to achieve larger effective batch sizes
  • Regularisation: Dropout 0.1 during training, 0.0 during inference
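
These pieces combine roughly as follows. This is a self-contained sketch using a stand-in linear model and random data so it runs anywhere, not the actual train_model.py loop (the real script trains the GPT above and would move model and batches to the GPU):

import math
import torch

# Stand-ins to keep the sketch runnable; the real script trains the GPT.
model = torch.nn.Linear(256, 12_000)
batches = [(torch.randn(8, 256), torch.randint(0, 12_000, (8,))) for _ in range(20)]

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 2, 5  # optimiser steps, not batches
def lr_lambda(step):
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda)

use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # AMP is a no-op on CPU
accum_steps = 4  # gradient accumulation

for step, (x, y) in enumerate(batches):
    with torch.autocast("cuda", enabled=use_cuda):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimiser)
        scaler.update()
        optimiser.zero_grad(set_to_none=True)
        scheduler.step()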

Text Cleaning Pipeline

The comprehensive text cleaner (scripts/text_cleaner.py) performs:

  1. Unicode normalisation (NFC)
  2. Whitespace normalisation
  3. Special token replacement (names, URLs, emails, phone numbers)
  4. Afrikaans contraction handling
  5. Punctuation normalisation
  6. Character encoding fixes
  7. Removal of control characters and invalid sequences
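
Condensed, the pipeline looks something like the sketch below. The regexes are simplified illustrations and the real text_cleaner.py is more thorough; note that ordering matters (e.g. phone numbers must be replaced before bare numbers):

import re
import unicodedata

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)               # 1. Unicode NFC
    text = re.sub(r"https?://\S+", "<URL>", text)           # 3. special tokens
    text = re.sub(r"\S+@\S+\.\S+", "<EMAIL>", text)
    text = re.sub(r"\+?\d[\d\s\-]{7,}\d", "<PHONE>", text)
    text = re.sub(r"\d+(?:[.,]\d+)?", "<NUM>", text)
    text = re.sub(r"[’‘`´]", "'", text)                     # 4. contractions
    text = "".join(ch for ch in text                        # 7. control chars
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    text = re.sub(r"[ \t]+", " ", text)                     # 2. whitespace
    return text.strip()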

Prediction Strategy

The model uses a token-aware prediction strategy:

  1. Tokenise input text into BPE tokens
  2. Use m-1 tokens as context (where m is the number of tokens in the current text)
  3. Generate top-k predictions from the model
  4. Filter and clean predictions
  5. Return top-3 word-level predictions

This approach ensures predictions align with natural word boundaries while still benefiting from subword tokenisation.
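
In code, the strategy looks roughly like the sketch below. Names are illustrative, and it assumes a model that maps a batch of token IDs to logits of shape (batch, sequence, vocab):

import torch
from tokenizers import Tokenizer

def predict_top3(model, tokeniser: Tokenizer, text: str,
                 block_size: int = 32, k: int = 50):
    ids = tokeniser.encode(text).ids
    context = (ids[:-1] or ids)[-block_size:]  # m-1 tokens, clipped to window
    x = torch.tensor([context])
    with torch.no_grad():
        logits = model(x)[0, -1]               # distribution over next token
    candidates = torch.topk(logits, k).indices.tolist()
    # Filter and clean: keep alphabetic word candidates only.
    words = [tokeniser.decode([t]).strip() for t in candidates]
    words = [w for w in words if w and w.isalpha()]
    return words[:3]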

Intended Uses & Limitations

Intended Uses

This model is designed for:

  • Text completion: Assisting users in typing Afrikaans text more efficiently
  • Academic research: Studying next-word prediction for low-resource languages
  • Educational applications: Learning tools for Afrikaans language learners
  • Accessibility: Helping users with typing difficulties in Afrikaans contexts

Limitations

  • Domain specificity: The model's performance depends on the training data distribution. It may not perform well on domains or registers significantly different from the training corpus.
  • Context window: Limited to 32 tokens, which may be insufficient for very long-range dependencies.
  • Low-resource language: As Afrikaans is a relatively low-resource language, the model may not match the performance of similar models trained on high-resource languages like English.
  • Special tokens: While the model handles common special tokens (names, URLs, emails), it may not generalise well to all types of special content.
  • Contractions: Although trained on common Afrikaans contractions, rare or informal contractions may not be predicted accurately.

Out-of-Scope Uses

This model should not be used for:

  • Commercial applications without explicit permission from the author (see License section)
  • Generating harmful, biased, or offensive content
  • Making critical decisions without human oversight
  • Languages other than Afrikaans (the model is specifically trained for Afrikaans)

Ethical Considerations & Biases

Potential Biases

As with any language model trained on real-world text data:

  • The model may reflect biases present in the training data, including but not limited to demographic, cultural, or socioeconomic biases.
  • Predictions may favour more common language patterns and may not adequately represent minority dialects or informal Afrikaans variants.
  • The model's training data sources and their representativeness should be considered when interpreting results.

Recommendations

  • Users should be aware of potential biases and should not rely solely on the model's predictions for sensitive applications.
  • The model should be regularly evaluated for fairness across different demographic groups and text domains.
  • User feedback mechanisms should be implemented in production systems to identify and mitigate problematic predictions.

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

For Academic and Research Use Only

This software and associated model are provided for academic research and educational purposes only. Commercial use is strictly prohibited without prior written permission from the author.

Plagiarism Notice

This is original research conducted as part of a BDatSci honours project. If you use this work in your research, you must provide proper attribution:

Citation:

Wilkinson, L.A. (2025). Next-word prediction in Afrikaans: A comparative study of Transformer models and traditional methods.
Bachelor of Data Science (Analytics and Optimisation) Research Project, Stellenbosch University.

Failure to cite this work constitutes academic plagiarism. Any use, modification, or distribution of this work must include this citation and maintain the original authorship attribution.

Permitted Uses:

  • Academic research and study
  • Educational purposes
  • Non-commercial applications
  • Modified versions (with proper attribution)

Prohibited Uses:

  • Commercial applications without permission
  • Claiming authorship or removing attribution
  • Use without proper citation in academic work

Contact

Louis Anthony Wilkinson (25948873)
BDatSci Research Project 2025

For permissions beyond the scope of this license, please contact the author.
