MarianMT¶
Bugs: If you see something strange, file a GitHub Issue and assign @patrickvonplaten.
Translations should be similar to, but not identical to, the output in the test set linked in each model card.
Implementation Notes¶
- Each model is about 298 MB on disk; there are more than 1,000 models. 
- The list of supported language pairs can be found here. 
- Models were originally trained by Jörg Tiedemann using the Marian C++ library, which supports fast training and translation. 
- All models are transformer encoder-decoders with 6 layers in each component. Each model’s performance is documented in a model card. 
- The 80 opus models that require BPE preprocessing are not supported. 
- The modeling code is the same as BartForConditionalGeneration with a few minor modifications:
  - static (sinusoid) positional embeddings (MarianConfig.static_position_embeddings=True)
  - a new final_logits_bias (MarianConfig.add_bias_logits=True)
  - no layernorm_embedding (MarianConfig.normalize_embedding=False)
  - the model starts generating with pad_token_id (which has 0 as a token_embedding) as the prefix (Bart uses <s/>)
- Code to bulk convert models can be found in convert_marian_to_pytorch.py.
Naming¶
- All model names use the following format: Helsinki-NLP/opus-mt-{src}-{tgt} (see the short sketch after this list). 
- The language codes used to name models are inconsistent. Two-digit codes can usually be found here; three-digit codes require googling “language code {code}”. 
- Codes formatted like es_AR are usually code_{region}. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages; the second group uses a combination of ISO-639-5 and ISO-639-2 codes. 
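To make the naming convention concrete, here is a tiny hypothetical helper (not part of transformers); the two model ids it prints are real Hub checkpoints used elsewhere on this page.

# Hypothetical helper, only to illustrate the Helsinki-NLP/opus-mt-{src}-{tgt} convention.
def marian_model_name(src: str, tgt: str) -> str:
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

print(marian_model_name("en", "de"))   # Helsinki-NLP/opus-mt-en-de  (two-letter codes)
print(marian_model_name("en", "roa"))  # Helsinki-NLP/opus-mt-en-roa (three-letter group code)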
Examples¶
- Since Marian models are smaller than many other translation models available in the library, they can be useful for fine-tuning experiments and integration tests. 
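As a rough illustration of such a fine-tuning experiment, here is a minimal sketch (not from the official examples): it assumes a toy one-sentence parallel corpus and a single optimizer step, using prepare_seq2seq_batch to build labels so the model returns a cross-entropy loss.

# Minimal fine-tuning sketch (toy data, single step); a real setup would use a DataLoader,
# multiple epochs, and an evaluation loop.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-de'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["I am a small frog."],
    tgt_texts=["Ich bin ein kleiner Frosch."],
    return_tensors="pt",
)  # keys: input_ids, attention_mask, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**batch).loss  # loss is computed because labels are present
loss.backward()
optimizer.step()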
Multilingual Models¶
- All model names use the following format: Helsinki-NLP/opus-mt-{src}-{tgt}.
- If a model can output multiple languages, you should specify a language code by prepending the desired output language to the src_text.
- You can see a model’s supported language codes in its model card, under target constituents, like in opus-mt-en-roa. 
- Note that if a model is only multilingual on the source side, like Helsinki-NLP/opus-mt-roa-en, no language codes are required.
New multi-lingual models from the Tatoeba-Challenge repo require 3-character language codes:
from transformers import MarianMTModel, MarianTokenizer
src_text = [
    '>>fra<< this is a sentence in english that we want to translate to french',
    '>>por<< This should go to portuguese',
    '>>esp<< And this to Spanish'
]
model_name = 'Helsinki-NLP/opus-mt-en-roa'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
# ["c'est une phrase en anglais que nous voulons traduire en français",
# 'Isto deve ir para o português.',
# 'Y esto al español']
Code to see available pretrained models:
from transformers.hf_api import HfApi
model_list = HfApi().model_list()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
Old Style Multi-Lingual Models¶
These are the old style multi-lingual models ported from the OPUS-MT-Train repo, and the members of each language group:
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
 'Helsinki-NLP/opus-mt-de-ZH',
 'Helsinki-NLP/opus-mt-en-CELTIC',
 'Helsinki-NLP/opus-mt-en-ROMANCE',
 'Helsinki-NLP/opus-mt-es-NORWAY',
 'Helsinki-NLP/opus-mt-fi-NORWAY',
 'Helsinki-NLP/opus-mt-fi-ZH',
 'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
 'Helsinki-NLP/opus-mt-sv-NORWAY',
 'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
 'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
Example of translating English to many Romance languages, using old-style 2-character language codes:
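The code for this example appears to be missing from this page, so here is a sketch that mirrors the en-roa example above, using the Helsinki-NLP/opus-mt-en-ROMANCE checkpoint listed above and 2-character target codes that are all members of GROUP_MEMBERS['ROMANCE'].

# Sketch mirroring the en-roa example above, but with the old-style multilingual checkpoint
# and 2-character target codes (fr, pt, es).
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    '>>fr<< this is a sentence in english that we want to translate to french',
    '>>pt<< This should go to portuguese',
    '>>es<< And this to Spanish',
]

model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)

model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]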
MarianConfig¶
class transformers.MarianConfig(activation_dropout=0.0, extra_pos_embeddings=2, activation_function='gelu', vocab_size=50265, d_model=1024, encoder_ffn_dim=4096, encoder_layers=12, encoder_attention_heads=16, decoder_ffn_dim=4096, decoder_layers=12, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, attention_dropout=0.0, dropout=0.1, max_position_embeddings=1024, init_std=0.02, classifier_dropout=0.0, num_labels=3, is_encoder_decoder=True, normalize_before=False, add_final_layer_norm=False, do_blenderbot_90_layernorm=False, scale_embedding=False, normalize_embedding=True, static_position_embeddings=False, add_bias_logits=False, force_bos_token_to_be_generated=False, use_cache=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **common_kwargs)[source]¶
This is the configuration class to store the configuration of a MarianMTModel. It is used to instantiate a Marian model according to the specified arguments, defining the model architecture. A short instantiation sketch follows the parameter list below.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Parameters
- vocab_size (int, optional, defaults to 58101) – Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MarianMTModel.
- d_model (int, optional, defaults to 512) – Dimensionality of the layers and the pooler layer.
- encoder_layers (int, optional, defaults to 6) – Number of encoder layers.
- decoder_layers (int, optional, defaults to 6) – Number of decoder layers.
- encoder_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder.
- decoder_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer decoder.
- decoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the decoder.
- encoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the encoder.
- activation_function (str or function, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
- dropout (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.0) – The dropout ratio for activations inside the fully connected layer.
- classifier_dropout (float, optional, defaults to 0.0) – The dropout ratio for the classifier.
- max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- init_std (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- add_bias_logits (bool, optional, defaults to False) – Whether to add a bias to the final logits (the final_logits_bias mentioned in the Implementation Notes); specific to Marian.
- normalize_before (bool, optional, defaults to False) – Call layernorm before attention ops.
- normalize_embedding (bool, optional, defaults to False) – Call layernorm after embeddings.
- static_position_embeddings (bool, optional, defaults to True) – Don’t learn positional embeddings, use sinusoidal.
- add_final_layer_norm (bool, optional, defaults to False) – Whether to apply a layernorm after the last encoder and decoder layers.
- scale_embedding (bool, optional, defaults to False) – Scale embeddings by dividing by sqrt(d_model).
- eos_token_id (int, optional, defaults to 2) – End of stream token id.
- pad_token_id (int, optional, defaults to 1) – Padding token id.
- bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.
- encoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the encoder. See the LayerDrop paper for more details.
- decoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the decoder. See the LayerDrop paper for more details.
- extra_pos_embeddings (int, optional, defaults to 2) – How many extra learned positional embeddings to use.
- is_encoder_decoder (bool, optional, defaults to True) – Whether this is an encoder/decoder model.
- force_bos_token_to_be_generated (bool, optional, defaults to False) – Whether or not to force the BOS token to be generated at step 1 (after decoder_start_token_id).
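To make the Marian-specific flags concrete, here is a minimal instantiation sketch (not from the original docs). The values are taken from the defaults listed above and from the Implementation Notes; pretrained checkpoints already carry the right values, so a normal workflow would simply call from_pretrained().

# Minimal sketch: building a randomly initialized Marian-style model from a config.
from transformers import MarianConfig, MarianMTModel

config = MarianConfig(
    vocab_size=58101,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    static_position_embeddings=True,   # sinusoidal positions, not learned
    add_bias_logits=True,              # the final_logits_bias
    normalize_embedding=False,         # no layernorm_embedding
)
model = MarianMTModel(config)          # untrained weights; use from_pretrained() for real checkpoints
print(model.config.d_model)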
 
 
MarianTokenizer¶
class transformers.MarianTokenizer(vocab, source_spm, target_spm, source_lang=None, target_lang=None, unk_token='<unk>', eos_token='</s>', pad_token='<pad>', model_max_length=512, **kwargs)[source]¶
Construct a Marian tokenizer. Based on SentencePiece.
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
Parameters
- source_spm (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary for the source language.
- target_spm (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary for the target language.
- source_lang (str, optional) – A string representing the source language.
- target_lang (str, optional) – A string representing the target language.
- unk_token (str, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- eos_token (str, optional, defaults to "</s>") – The end of sequence token.
- pad_token (str, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.
- model_max_length (int, optional, defaults to 512) – The maximum sentence length the model accepts.
- additional_special_tokens (List[str], optional, defaults to ["<eop>", "<eod>"]) – Additional special tokens used by the tokenizer.
 
Examples:

>>> from transformers import MarianTokenizer
>>> tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
>>> src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
>>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]  # optional
>>> batch_enc: BatchEncoding = tok.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt")
>>> # keys: [input_ids, attention_mask, labels]
>>> # model(**batch_enc) should work
prepare_seq2seq_batch(src_texts: List[str], tgt_texts: Optional[List[str]] = None, max_length: Optional[int] = None, max_target_length: Optional[int] = None, return_tensors: Optional[str] = None, truncation=True, padding='longest', **unused) → transformers.tokenization_utils_base.BatchEncoding[source]¶
Prepare model inputs for translation. For best performance, translate one sentence at a time.
Parameters
- src_texts (List[str]) – List of documents to summarize or source language texts.
- tgt_texts (list, optional) – List of summaries or target language texts.
- max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
- max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set to None, this will use the max_length value.
- padding (bool, str or PaddingStrategy, optional, defaults to 'longest') – Activates and controls padding. Accepts the following values:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
 
- return_tensors (str or TensorType, optional) – If set, will return tensors instead of lists of python integers. Acceptable values are:
  - 'tf': Return TensorFlow tf.constant objects.
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
 
- truncation (bool, str or TruncationStrategy, optional, defaults to True) – Activates and controls truncation. Accepts the following values:
  - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - False or 'do_not_truncate': No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
 
- **kwargs – Additional keyword arguments passed along to self.__call__.
 
Returns
A BatchEncoding with the following fields:
  - input_ids – List of token ids to be fed to the encoder.
  - attention_mask – List of indices specifying which tokens should be attended to by the model.
  - labels – List of token ids for tgt_texts.
The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed. Otherwise, input_ids and attention_mask will be the only keys.
Return type
BatchEncoding
MarianMTModel¶
class transformers.MarianMTModel(config: transformers.models.bart.configuration_bart.BartConfig)[source]¶
PyTorch version of marian-nmt’s transformer.h (c++). Designed for the OPUS-NMT translation checkpoints. Available models are listed here.
This class overrides BartForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.
Examples:

>>> from transformers import MarianTokenizer, MarianMTModel
>>> from typing import List
>>> src = 'fr'  # source language
>>> trg = 'en'  # target language
>>> sample_text = "où est l'arrêt de bus ?"
>>> mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
>>> model = MarianMTModel.from_pretrained(mname)
>>> tok = MarianTokenizer.from_pretrained(mname)
>>> batch = tok.prepare_seq2seq_batch(src_texts=[sample_text], return_tensors="pt")  # don't need tgt_text for inference
>>> gen = model.generate(**batch)  # for forward pass: model(**batch)
>>> words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the bus stop ?"
TFMarianMTModel¶
class transformers.TFMarianMTModel(*args, **kwargs)[source]¶
Marian model for machine translation. This model inherits from TFBartForConditionalGeneration. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
Note
TF 2.0 models accept two formats as inputs:
  - having all inputs as keyword arguments (like PyTorch models), or
  - having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
  - a single Tensor with input_ids only and nothing else: model(input_ids)
  - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
  - a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Parameters
- config (MarianConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
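For completeness, here is a minimal TF 2.0 inference sketch mirroring the PyTorch MarianMTModel example above. The checkpoint name follows the fr-en example earlier on this page; from_pt=True is passed on the assumption that only PyTorch weights are available for this checkpoint, and can be dropped if native TF weights exist.

# Minimal TF 2.0 inference sketch (assumes the fr-en checkpoint; converts PyTorch weights on the fly).
from transformers import MarianTokenizer, TFMarianMTModel

mname = 'Helsinki-NLP/opus-mt-fr-en'
tok = MarianTokenizer.from_pretrained(mname)
model = TFMarianMTModel.from_pretrained(mname, from_pt=True)

batch = tok.prepare_seq2seq_batch(src_texts=["où est l'arrêt de bus ?"], return_tensors="tf")
gen = model.generate(**batch)
print(tok.batch_decode(gen, skip_special_tokens=True))  # expected: something like "Where is the bus stop ?"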