hf-jp-gpt-wiki (Custom Japanese GPT, hyper-small)

This repository contains a hyper-small Japanese GPT model exported in a Hugging Face-compatible layout, with a vendored backbone and SentencePiece tokenizer.

  • Architecture: custom GPT (vendored), a tiny GPT-2-style decoder
  • Parameters (training config; a rough size estimate follows this list):
    • context_length: 256
    • emb_dim: 128
    • n_layers: 4
    • n_heads: 4
    • drop_rate: 0.1
    • qkv_bias: False
    • vocab_size: 32000 (SentencePiece)
  • Tokenizer: SentencePiece (jp_tok_wiki.model, jp_tok_wiki.vocab)
  • Load requirement: trust_remote_code=True
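
As a rough sanity check on the size implied by this configuration, here is a back-of-the-envelope estimate. It assumes a standard GPT-2-style block layout (learned positional embeddings, 4x MLP expansion, untied LM head); the vendored backbone may differ slightly, so treat the numbers as approximate.

vocab_size, emb_dim, n_layers, context_length = 32000, 128, 4, 256

tok_emb = vocab_size * emb_dim                       # token embedding
pos_emb = context_length * emb_dim                   # learned positional embedding (assumption)
attn    = 3 * emb_dim * emb_dim + emb_dim * emb_dim  # QKV (qkv_bias=False) + output projection
mlp     = 2 * (emb_dim * 4 * emb_dim)                # up/down projections, biases ignored
per_block = attn + mlp
lm_head = emb_dim * vocab_size                       # untied output head (assumption)

total = tok_emb + pos_emb + n_layers * per_block + lm_head
print(f"~{total / 1e6:.1f}M parameters")             # roughly 9M; about 5M if the head is tied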

Quick Start

from transformers import AutoModelForCausalLM
import torch
import sentencepiece as spm

# Load model (trust_remote_code is required)
model = AutoModelForCausalLM.from_pretrained(
    "oga5/hf-jp-gpt-wiki",  # or local folder path
    trust_remote_code=True
)
model.eval()

# Load SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file="jp_tok_wiki.model")  # if local
# If running from the Hub, download the files and reference their path, or use hf_hub_download
# from huggingface_hub import hf_hub_download
# tok_path = hf_hub_download("oga5/hf-jp-gpt-wiki", filename="jp_tok_wiki.model")
# sp = spm.SentencePieceProcessor(model_file=tok_path)

eos_id = sp.eos_id()
prompt = "γ“γ‚“γ«γ‘γ―γ€‚ζœ€θΏ‘γ‚γ£γŸι’η™½γ„γ“γ¨γ―γ€"
input_ids = sp.encode(prompt, out_type=int)
input_ids = torch.tensor([input_ids], dtype=torch.long)

max_new_tokens = 50
ctx = model.config.context_length

# Simple greedy decoding: pick the most probable token at each step
with torch.no_grad():
    for _ in range(max_new_tokens):
        idx_cond = input_ids[:, -ctx:]
        out = model(input_ids=idx_cond)
        logits = out["logits"] if isinstance(out, dict) else out.logits
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        if next_id.item() == eos_id:
            break
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(sp.decode(input_ids[0].tolist()))

Notes

  • This model uses a vendored minimal backbone (modeling_custom_gpt.py) so it can be loaded from the Hub without external project files.
  • The tokenizer is SentencePiece; AutoTokenizer is not provided. You can load SentencePiece directly as shown above.
  • For sampling with temperature/top-k, you can implement a simple sampler using the logits from model(...); see the sketch below.
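
A minimal sketch of such a sampler (the function name and the temperature/top_k values are illustrative, not part of this repository):

import torch

def sample_next(logits, temperature=0.8, top_k=40):
    # logits: [batch, vocab_size] scores for the last position
    logits = logits / temperature
    if top_k is not None:
        top_vals, _ = torch.topk(logits, top_k)
        # mask everything below the k-th largest logit
        logits = logits.masked_fill(logits < top_vals[:, [-1]], float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# In the Quick Start generation loop, replace the argmax line with:
# next_id = sample_next(logits[:, -1, :], temperature=0.8, top_k=40)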

Tokenizer loading (local / Hugging Face Hub)

If you encounter OSError: Not found: "jp_tok_wiki.model" when running the sample, make sure the path you pass to SentencePiece points to a file that actually exists. Two reliable patterns:

  • Local folder (e.g., when running sample/sample.py under .../llmtest01/sample/):
import os
import torch
import sentencepiece as spm
from transformers import AutoModelForCausalLM

# Resolve the repo dir relative to this script file
BASE_DIR = os.path.dirname(os.path.abspath(__file__))  # points to sample/
repo_dir = os.path.normpath(os.path.join(BASE_DIR, "..", "hf_jp_gpt_wiki"))
spm_path = os.path.join(repo_dir, "jp_tok_wiki.model")

print("SPM path:", spm_path, "exists?", os.path.exists(spm_path))

model = AutoModelForCausalLM.from_pretrained(repo_dir, trust_remote_code=True)
model.eval()

sp = spm.SentencePieceProcessor(model_file=spm_path)
  • From the Hugging Face Hub using hf_hub_download:
import torch
import sentencepiece as spm
from transformers import AutoModelForCausalLM
from huggingface_hub import hf_hub_download

repo_id = "oga5/hf-jp-gpt-wiki"

model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Download the tokenizer model and pass the absolute path to SentencePiece
spm_path = hf_hub_download(repo_id=repo_id, filename="jp_tok_wiki.model")
print("Downloaded SPM path:", spm_path)
sp = spm.SentencePieceProcessor(model_file=spm_path)

Tip: print the current working directory and directory listings to verify paths:

import os
print("CWD:", os.getcwd())
print("Here:", os.listdir("."))

License

  • Model code: Derived from "LLMs from Scratch" examples (Apache 2.0). Source: https://github.com/rasbt/LLMs-from-scratch
  • Training dataset: fujiki/wiki40b_ja, a reformatted version of the Japanese portion of the wiki40b dataset. If you use this dataset, please cite the original paper:
@inproceedings{guo-etal-2020-wiki,
    title = "{W}iki-40{B}: Multilingual Language Model Dataset",
    author = "Guo, Mandy  and
      Dai, Zihang  and
      Vrande{\v{c}}i{\'c}, Denny  and
      Al-Rfou, Rami",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.297",
    pages = "2440--2452",
    abstract = "We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. With around 40 billion characters, we hope this new resource will accelerate the research of multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL) establishing baselines for many languages. We also introduce the task of multilingual causal language modeling where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Citation

If you use this model, please consider citing the original book/code and this repository.
