---
license: apache-2.0
language:
- tr
---
# Bert2DModel
Bert2DModel is a new take on the classic BERT architecture, built specifically for morphologically rich languages like Turkish, where words carry a lot of internal structure.

Think of it this way: regular BERT sees a sentence as a flat line of tokens. But in some languages, words themselves are built from many sub-pieces (roots, suffixes, etc.). Bert2D uses a "2D embedding" system: it tracks not only a word's position in the sentence (the first dimension) but also the position of each sub-piece inside that word (the second dimension). This gives the model a much deeper grasp of grammar and meaning, especially when words can change form in many different ways. This first release is trained for Turkish!
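To make the idea concrete, here is a hypothetical sketch (not the model's actual implementation) of how the two position indices could be derived from a tokenizer's word IDs, where each token gets a word-level index and a subword-level index within its word:

```python
def positions_2d(word_ids):
    """word_ids[i] is the index of the word that token i belongs to.

    Returns (word positions, subword-within-word positions) per token.
    """
    word_pos, subword_pos = [], []
    seen = {}  # how many pieces of each word we have emitted so far
    for wid in word_ids:
        seen[wid] = seen.get(wid, -1) + 1
        word_pos.append(wid)        # first dimension: which word
        subword_pos.append(seen[wid])  # second dimension: which piece of it
    return word_pos, subword_pos

# e.g. "evlerimizden" -> ["ev", "##ler", "##imiz", "##den"] (word 0), "geldik" (word 1)
word_ids = [0, 0, 0, 0, 1]
print(positions_2d(word_ids))
# → ([0, 0, 0, 0, 1], [0, 1, 2, 3, 0])
```

A flat 1D position scheme would only see five tokens in a row; the second index here is what lets the model know that `##den` is the fourth piece of the same word as `ev`.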
You can find all the original [Bert2DModel] checkpoints under the yigitbekir collection.
Click on the [Bert2DModel] models in the right sidebar for more examples of how to apply [Bert2DModel] to different text and token classification tasks.
The examples below demonstrate how to use the fill-mask pipeline with Bert2DModel, or load the model directly with the [AutoModelForMaskedLM] or [Bert2DModel] class.
```py
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

repo_id = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# Load the tokenizer and model explicitly; trust_remote_code is required
# because Bert2D ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)

# Pass the fully initialized objects directly to the pipeline.
fill_masker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

masked_sentence = "Adamın mesleği [MASK] midir acaba?"
predictions = fill_masker(masked_sentence)

for prediction in predictions:
    print(f"Sequence: {prediction['sequence']}")
    print(f"Token: {prediction['token_str']}")
    print(f"Score: {prediction['score']:.4f}")
    print("-" * 20)

# Expected output:
# Sequence: Adamın mesleği mühendis midir acaba?
# Score: 0.2393
# --------------------
# Sequence: Adamın mesleği doktor midir acaba?
# Score: 0.1698
# --------------------
```
```py
from transformers import AutoTokenizer, Bert2DModel

# Load the tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2")
model = Bert2DModel.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2")

# Encode an example sentence
text = "Türkiye'nin başkenti Ankara'dır."
inputs = tokenizer(text, return_tensors="pt")

# Get contextual token embeddings
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
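If you want a single sentence vector rather than per-token states, a common model-agnostic approach (not specific to Bert2D) is masked mean pooling over `last_hidden_state`. The toy tensors below stand in for real model outputs:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Toy tensors standing in for outputs.last_hidden_state and inputs["attention_mask"]
hidden = torch.randn(1, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 0]])  # last position is padding
sentence_vector = mean_pool(hidden, mask)
print(sentence_vector.shape)  # torch.Size([1, 768])
```

In the example above you would call `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.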
```bash
echo -e "Adamın mesleği [MASK] midir acaba?" | transformers run --task fill-mask --model yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2 --device 0
```
## Notes
- Configuration is key: Bert2D introduces new configuration parameters that are not present in a standard BERT model. You must use `Bert2DConfig` and be mindful of these settings when training or fine-tuning; failing to do so will lead to unexpected behavior. The two key new parameters are `max_word_position_embeddings` and `max_intermediate_subword_position_embeddings`.

  ```py
  from transformers import AutoConfig

  # Load the custom config from a pretrained model
  config = AutoConfig.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2")

  # Access the new parameters
  print(f"Max Word Positions: {config.max_word_position_embeddings}")
  # Expected output: Max Word Positions: 512
  print(f"Intermediate Subword Positions: {config.max_intermediate_subword_position_embeddings}")
  # Expected output: Intermediate Subword Positions: 2
  ```
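The description of the 2D scheme suggests two position embedding tables on top of the token embeddings. The sketch below is a hypothetical illustration of that idea (it assumes the two position embeddings are simply summed, the way BERT sums its absolute position embeddings; `TwoDEmbeddings` is not a class from the model, and the real combination rule may differ). The table sizes 512 and 2 mirror the config values shown above:

```python
import torch
import torch.nn as nn

class TwoDEmbeddings(nn.Module):
    """Hypothetical sketch: token embeddings plus a word-position table
    and an intermediate-subword-position table."""

    def __init__(self, vocab_size=1000, hidden=64,
                 max_word_positions=512, max_subword_positions=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.word_pos = nn.Embedding(max_word_positions, hidden)
        self.subword_pos = nn.Embedding(max_subword_positions, hidden)

    def forward(self, input_ids, word_position_ids, subword_position_ids):
        # Each token's vector is the sum of its token, word-position,
        # and subword-position embeddings.
        return (self.tok(input_ids)
                + self.word_pos(word_position_ids)
                + self.subword_pos(subword_position_ids))

emb = TwoDEmbeddings()
input_ids = torch.tensor([[5, 6, 7]])     # three tokens
word_ids = torch.tensor([[0, 0, 1]])      # first two tokens form one word
subword_ids = torch.tensor([[0, 1, 0]])   # position of each piece in its word
out = emb(input_ids, word_ids, subword_ids)
print(out.shape)  # torch.Size([1, 3, 64])
```

The two extra tables are why `max_word_position_embeddings` and `max_intermediate_subword_position_embeddings` must be set correctly: they determine the largest word and subword indices the model can embed.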