Dia
This model was released on 2025-04-21 and added to Hugging Face Transformers on 2025-06-26.
Overview
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including non-verbal sounds such as laughter and coughing. Emotion and tone can also be controlled via audio conditioning (voice cloning).
Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture, with some more modern features such as rotary positional embeddings (RoPE). The text side (encoder) uses a byte tokenizer, while the audio side (decoder) uses a pretrained codec model, DAC, which encodes speech into discrete codebook tokens and decodes them back into audio.
Usage Tips
Generation with Text
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
text = ["[S1] Dia is an open weights text to dialogue model."]
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)
model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to around ~2s
# save audio to a file
outputs = processor.batch_decode(outputs)
processor.save_audio(outputs, "example.wav")
Generation with Text and Audio (Voice Cloning)
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio + additional text you want as new audio
text = ["[S1] I know. It's going to save me a lot of money, I hope. [S2] I sure hope so for you."]
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, audio=audio, padding=True, return_tensors="pt").to(torch_device)
prompt_len = processor.get_audio_prompt_len(inputs["decoder_attention_mask"])
model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to around ~2s
# retrieve actually generated audio and save to a file
outputs = processor.batch_decode(outputs, audio_prompt_len=prompt_len)
processor.save_audio(outputs, "example_with_audio.wav")
Training
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio
text = ["[S1] I know. It's going to save me a lot of money, I hope."]
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(
    text=text,
    audio=audio,
    generation=False,
    output_labels=True,
    padding=True,
    return_tensors="pt"
).to(torch_device)
model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
out = model(**inputs)
out.loss.backward()
This model was contributed by Jaeyong Sung, Arthur Zucker, and Anton Vlasjuk. The original code can be found here.
DiaConfig
class transformers.DiaConfig
< source >( encoder_config: typing.Optional[transformers.models.dia.configuration_dia.DiaEncoderConfig] = None decoder_config: typing.Optional[transformers.models.dia.configuration_dia.DiaDecoderConfig] = None norm_eps: float = 1e-05 is_encoder_decoder: bool = True pad_token_id: int = 1025 eos_token_id: int = 1024 bos_token_id: int = 1026 delay_pattern: typing.Optional[list[int]] = None initializer_range: float = 0.02 use_cache: bool = True **kwargs )
Parameters
-  encoder_config (DiaEncoderConfig, optional) — Configuration for the encoder part of the model. If not provided, a default DiaEncoderConfig will be used.
-  decoder_config (DiaDecoderConfig, optional) — Configuration for the decoder part of the model. If not provided, a default DiaDecoderConfig will be used.
-  norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the normalization layers.
-  is_encoder_decoder (bool, optional, defaults to True) — Indicating that this model uses an encoder-decoder architecture.
-  pad_token_id (int, optional, defaults to 1025) — Padding token id.
-  eos_token_id (int, optional, defaults to 1024) — End of stream token id.
-  bos_token_id (int, optional, defaults to 1026) — Beginning of stream token id.
-  delay_pattern (list[int], optional, defaults to [0, 8, 9, 10, 11, 12, 13, 14, 15]) — The delay pattern for the decoder. The length of this list must match decoder_config.num_channels.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-  use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
This is the configuration class to store the configuration of a DiaModel. It is used to instantiate a Dia model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the nari-labs/Dia-1.6B architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
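The delay_pattern parameter assigns one delay per decoder channel. The following is a minimal sketch of what such a pattern does to a (time, channels) grid of codebook tokens. It is illustrative only and is not the Transformers implementation: the helper names apply_delay and revert_delay are hypothetical, and filling not-yet-started positions with bos_token_id and finished positions with pad_token_id is an assumption.

```python
# Illustrative sketch only; helper names and fill values are assumptions.
DELAY_PATTERN = [0, 8, 9, 10, 11, 12, 13, 14, 15]  # DiaConfig default
BOS_ID, PAD_ID = 1026, 1025                        # DiaConfig defaults


def apply_delay(frames, delays, bos_id=BOS_ID, pad_id=PAD_ID):
    """Shift channel c of a (time, channels) grid right by delays[c] steps."""
    num_frames = len(frames)
    delayed = []
    for t in range(num_frames + max(delays)):
        row = []
        for c, d in enumerate(delays):
            src = t - d
            if src < 0:
                row.append(bos_id)   # this channel has not started yet
            elif src >= num_frames:
                row.append(pad_id)   # this channel has already finished
            else:
                row.append(frames[src][c])
        delayed.append(row)
    return delayed


def revert_delay(delayed, delays):
    """Undo apply_delay and recover the aligned (time, channels) grid."""
    num_frames = len(delayed) - max(delays)
    return [
        [delayed[t + d][c] for c, d in enumerate(delays)]
        for t in range(num_frames)
    ]


# Four dummy frames with nine channels each (values encode time and channel).
frames = [[t * 10 + c for c in range(9)] for t in range(4)]
delayed = apply_delay(frames, DELAY_PATTERN)
restored = revert_delay(delayed, DELAY_PATTERN)
```

Under these assumptions, the delayed grid is max(delay_pattern) steps longer than the input, which is why the list length has to match decoder_config.num_channels: each channel needs exactly one delay.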
Example:
>>> from transformers import DiaConfig, DiaModel
>>> # Initializing a DiaConfig with default values
>>> configuration = DiaConfig()
>>> # Initializing a DiaModel (with random weights) from the configuration
>>> model = DiaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Defaulting to audio config as it’s the decoder in this case which is usually the text backbone
DiaDecoderConfig
class transformers.DiaDecoderConfig
< source >( max_position_embeddings: int = 3072 num_hidden_layers: int = 18 hidden_size: int = 2048 intermediate_size: int = 8192 num_attention_heads: int = 16 num_key_value_heads: int = 4 head_dim: int = 128 cross_num_attention_heads: int = 16 cross_head_dim: int = 128 cross_num_key_value_heads: int = 16 cross_hidden_size: int = 1024 norm_eps: float = 1e-05 vocab_size: int = 1028 hidden_act: str = 'silu' num_channels: int = 9 rope_theta: float = 10000.0 rope_scaling: typing.Optional[dict] = None initializer_range: float = 0.02 use_cache: bool = True is_encoder_decoder: bool = True **kwargs )
Parameters
-  max_position_embeddings (int, optional, defaults to 3072) — The maximum sequence length that this model might ever be used with.
-  num_hidden_layers (int, optional, defaults to 18) — Number of hidden layers in the Transformer decoder.
-  hidden_size (int, optional, defaults to 2048) — Dimensionality of the decoder layers and the pooler layer.
-  intermediate_size (int, optional, defaults to 8192) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer decoder.
-  num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
-  num_key_value_heads (int, optional, defaults to 4) — Number of key and value heads for each attention layer in the Transformer decoder.
-  head_dim (int, optional, defaults to 128) — Dimensionality of the attention head.
-  cross_num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each cross-attention layer in the Transformer decoder.
-  cross_head_dim (int, optional, defaults to 128) — Dimensionality of the cross-attention head.
-  cross_num_key_value_heads (int, optional, defaults to 16) — Number of key and value heads for each cross-attention layer in the Transformer decoder.
-  cross_hidden_size (int, optional, defaults to 1024) — Dimensionality of the cross-attention layers.
-  norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the normalization layers.
-  vocab_size (int, optional, defaults to 1028) — Vocabulary size of the Dia model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling DiaModel.
-  hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder. If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-  num_channels (int, optional, defaults to 9) — Number of channels for the Dia decoder.
-  rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
-  rope_scaling (dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and you expect the model to work on a longer max_position_embeddings, we recommend you update this value accordingly. Expected contents:
    - rope_type (str): The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.
    - factor (float, optional): Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x will enable the model to handle sequences of length x * original maximum pre-trained length.
    - original_max_position_embeddings (int, optional): Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.
    - attention_factor (float, optional): Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to the value recommended by the implementation, using the factor field to infer the suggested value.
    - beta_fast (float, optional): Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.
    - beta_slow (float, optional): Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.
    - short_factor (List[float], optional): Only used with ‘longrope’. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    - long_factor (List[float], optional): Only used with ‘longrope’. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    - low_freq_factor (float, optional): Only used with ‘llama3’. Scaling factor applied to low frequency components of the RoPE.
    - high_freq_factor (float, optional): Only used with ‘llama3’. Scaling factor applied to high frequency components of the RoPE.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-  use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
-  is_encoder_decoder (bool, optional, defaults to True) — Indicating that this model is part of an encoder-decoder architecture.
This is the configuration class to store the configuration of a DiaDecoder. It is used to instantiate a Dia decoder according to the specified arguments, defining the decoder architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
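The decoder defaults use fewer key/value heads than attention heads for self-attention (16 versus 4), while cross-attention keeps them equal (16 versus 16). The arithmetic below is a small sketch of what these numbers imply, assuming a conventional grouped-query attention layout in which each projection width is the head count times head_dim. The function attention_shapes is a hypothetical helper for illustration, not part of the library.

```python
def attention_shapes(num_attention_heads, num_key_value_heads, head_dim):
    """Derive projection widths for a grouped-query attention layer.

    Assumes the usual layout: the query projection has
    num_attention_heads * head_dim outputs, and the key and value
    projections each have num_key_value_heads * head_dim outputs.
    """
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("attention heads must be a multiple of key/value heads")
    return {
        "queries_per_kv_head": num_attention_heads // num_key_value_heads,
        "q_proj_out": num_attention_heads * head_dim,
        "kv_proj_out": num_key_value_heads * head_dim,
    }


# DiaDecoderConfig defaults for self-attention and cross-attention.
self_attn = attention_shapes(num_attention_heads=16, num_key_value_heads=4, head_dim=128)
cross_attn = attention_shapes(num_attention_heads=16, num_key_value_heads=16, head_dim=128)
```

With these defaults, four query heads share each key/value head in self-attention, so the cached keys and values are a quarter of the width they would have with one key/value head per query head.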
DiaEncoderConfig
class transformers.DiaEncoderConfig
< source >( max_position_embeddings: int = 1024 num_hidden_layers: int = 12 hidden_size: int = 1024 num_attention_heads: int = 16 num_key_value_heads: int = 16 head_dim: int = 128 intermediate_size: int = 4096 norm_eps: float = 1e-05 vocab_size: int = 256 hidden_act: str = 'silu' rope_theta: float = 10000.0 rope_scaling: typing.Optional[dict] = None initializer_range: float = 0.02 **kwargs )
Parameters
-  max_position_embeddings (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with.
-  num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
-  hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers and the pooler layer.
-  num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
-  num_key_value_heads (int, optional, defaults to 16) — Number of key and value heads for each attention layer in the Transformer encoder.
-  head_dim (int, optional, defaults to 128) — Dimensionality of the attention head.
-  intermediate_size (int, optional, defaults to 4096) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
-  norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the normalization layers.
-  vocab_size (int, optional, defaults to 256) — Vocabulary size of the Dia model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling DiaModel.
-  hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-  rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
-  rope_scaling (dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and you expect the model to work on a longer max_position_embeddings, we recommend you update this value accordingly. Expected contents:
    - rope_type (str): The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.
    - factor (float, optional): Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x will enable the model to handle sequences of length x * original maximum pre-trained length.
    - original_max_position_embeddings (int, optional): Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.
    - attention_factor (float, optional): Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to the value recommended by the implementation, using the factor field to infer the suggested value.
    - beta_fast (float, optional): Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.
    - beta_slow (float, optional): Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.
    - short_factor (List[float], optional): Only used with ‘longrope’. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    - long_factor (List[float], optional): Only used with ‘longrope’. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    - low_freq_factor (float, optional): Only used with ‘llama3’. Scaling factor applied to low frequency components of the RoPE.
    - high_freq_factor (float, optional): Only used with ‘llama3’. Scaling factor applied to high frequency components of the RoPE.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a DiaEncoder. It is used to instantiate a Dia encoder according to the specified arguments, defining the encoder architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
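To make rope_theta and the ‘linear’ scaling factor more concrete, here is a short sketch of the standard RoPE frequency formula using the encoder defaults (head_dim=128, rope_theta=10000.0). It follows the textbook definition of RoPE; the function names are hypothetical, and the library's own code may organize this computation differently.

```python
def rope_inverse_frequencies(head_dim, rope_theta=10000.0):
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_angles(position, head_dim, rope_theta=10000.0, linear_factor=1.0):
    """Rotation angle per frequency pair at a given position.

    With 'linear' scaling, positions are divided by the factor, so a
    factor of 2 maps position 8 onto the angles of unscaled position 4.
    """
    scaled_position = position / linear_factor
    return [scaled_position * f for f in rope_inverse_frequencies(head_dim, rope_theta)]


inv_freq = rope_inverse_frequencies(head_dim=128)
unscaled = rope_angles(position=4, head_dim=128)
scaled = rope_angles(position=8, head_dim=128, linear_factor=2.0)
```

This is why, for the ‘linear’ type, a factor of x lets positions up to x times the original maximum fall inside the angle range seen during pretraining.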
DiaTokenizer
class transformers.DiaTokenizer
< source >( pad_token: typing.Optional[str] = '<pad>' unk_token: typing.Optional[str] = '<pad>' max_length: typing.Optional[int] = 1024 offset: int = 0 **kwargs )
Parameters
-  pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
-  unk_token (str, optional, defaults to "<pad>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
-  max_length (int, optional, defaults to 1024) — The maximum length of the sequences when encoding. Sequences longer than this will be truncated.
-  offset (int, optional, defaults to 0) — The offset of the tokenizer.
Construct a Dia tokenizer. Dia simply uses raw UTF-8 byte encoding, except for the special tokens [S1] and [S2].
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
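As a rough illustration of byte-level tokenization with speaker tags, the sketch below encodes ordinary text as UTF-8 byte values and keeps [S1] and [S2] as separate symbolic tokens. It is not the DiaTokenizer implementation: sketch_tokenize is a hypothetical helper, and the ids the real tokenizer assigns to the speaker tags are deliberately not modeled here.

```python
import re

SPEAKER_TAGS = ("[S1]", "[S2]")
# Capturing group so that re.split keeps the tags in its output.
_TAG_PATTERN = re.compile("(" + "|".join(re.escape(tag) for tag in SPEAKER_TAGS) + ")")


def sketch_tokenize(text):
    """Return speaker tags as strings and everything else as UTF-8 byte values."""
    tokens = []
    for piece in _TAG_PATTERN.split(text):
        if not piece:
            continue  # re.split yields empty strings around leading tags
        if piece in SPEAKER_TAGS:
            tokens.append(piece)                  # kept symbolic on purpose
        else:
            tokens.extend(piece.encode("utf-8"))  # one int in 0..255 per byte
    return tokens


tokens = sketch_tokenize("[S1] Hi é")
byte_ids = [tok for tok in tokens if isinstance(tok, int)]
```

Note that the non-ASCII character takes two byte tokens, so sequence length is measured in bytes rather than characters, and every byte id fits in the encoder's default vocab_size of 256.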
__call__
< source >( text: typing.Union[str, list[str], list[list[str]], NoneType] = None text_pair: typing.Union[str, list[str], list[list[str]], NoneType] = None text_target: typing.Union[str, list[str], list[list[str]], NoneType] = None text_pair_target: typing.Union[str, list[str], list[list[str]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding
Parameters
-  text (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
-  text_pair (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
-  text_target (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
-  text_pair_target (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
-  add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
-  padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
    - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
-  truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
    - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    - 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    - 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
-  max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
-  stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
-  is_split_into_words (bool, optional, defaults to False) — Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
-  pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
-  padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
-  return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
    - 'tf': Return TensorFlow tf.constant objects.
    - 'pt': Return PyTorch torch.Tensor objects.
    - 'np': Return Numpy np.ndarray objects.
-  return_token_type_ids (bool, optional) — Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.
-  return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
-  return_overflowing_tokens (bool, optional, defaults to False) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
-  return_special_tokens_mask (bool, optional, defaults to False) — Whether or not to return special tokens mask information.
-  return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token. This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.
-  return_length (bool, optional, defaults to False) — Whether or not to return the lengths of the encoded inputs.
-  verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
-  **kwargs — passed to the self.tokenize() method
Returns
A BatchEncoding with the following fields:
- input_ids — List of token ids to be fed to a model.
- token_type_ids — List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).
- attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).
- overflowing_tokens — List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
- num_truncated_tokens — Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
- special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
- length — The length of the inputs (when return_length=True).
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
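The 'longest' padding strategy and the attention_mask field described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the tokenizer's code: pad_to_longest is a hypothetical helper, and pad_id=0 is a placeholder value chosen for the example, not a documented Dia id.

```python
def pad_to_longest(batch, pad_id=0, padding_side="right"):
    """Pad every sequence to the longest one and build matching attention masks."""
    if padding_side not in ("right", "left"):
        raise ValueError("padding_side must be 'right' or 'left'")
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        padding = [pad_id] * (longest - len(seq))
        real = [1] * len(seq)        # 1 marks tokens the model should attend to
        ignored = [0] * len(padding)  # 0 marks padding positions
        if padding_side == "right":
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + ignored)
        else:
            input_ids.append(padding + list(seq))
            attention_mask.append(ignored + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[72, 105], [72, 101, 108, 108, 111]]
right_padded = pad_to_longest(batch)
left_padded = pad_to_longest(batch, padding_side="left")
```

The mask is what lets a model ignore the padded positions, which is why batches built with padding=True are usually passed to the model together with their attention_mask.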
DiaFeatureExtractor
class transformers.DiaFeatureExtractor
< source >( feature_size: int = 1 sampling_rate: int = 16000 padding_value: float = 0.0 hop_length: int = 512 **kwargs )
Parameters
-  feature_size (int, optional, defaults to 1) — The feature dimension of the extracted features. Use 1 for mono, 2 for stereo.
-  sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio waveform should be digitalized, expressed in hertz (Hz).
-  padding_value (float, optional, defaults to 0.0) — The value that is used for padding.
-  hop_length (int, optional, defaults to 512) — Overlap length between successive windows.
Constructs a Dia feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
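The relationship between waveform length and hop_length can be illustrated with some simple arithmetic. The sketch below assumes that hop_length is the number of waveform samples covered by one codec frame, which is how hop sizes are commonly used in neural audio codecs. That interpretation, and the helper names, are assumptions rather than documented behavior of DiaFeatureExtractor.

```python
import math


def num_codec_frames(num_samples, hop_length=512):
    """Frames needed to cover a waveform, assuming one frame per hop."""
    return math.ceil(num_samples / hop_length)


def padded_num_samples(num_samples, hop_length=512):
    """Waveform length after padding up to a whole number of hops."""
    return num_codec_frames(num_samples, hop_length) * hop_length


# One second of audio at 44.1 kHz (the rate used in the voice cloning example)
# and at the 16 kHz class default.
frames_44k = num_codec_frames(44100)
frames_16k = num_codec_frames(16000)
padded_44k = padded_num_samples(44100)
```

Under this assumption, the sampling rate directly changes how many frames one second of audio produces, which is one reason to pass sampling_rate explicitly when calling the feature extractor.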
__call__
< source >( raw_audio: typing.Union[numpy.ndarray, list[float], list[numpy.ndarray], list[list[float]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy, NoneType] = None truncation: typing.Optional[bool] = False max_length: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None sampling_rate: typing.Optional[int] = None )
Parameters
-  raw_audio (np.ndarray, list[float], list[np.ndarray], list[list[float]]) — The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of list of float values. The numpy array must be of shape (num_samples,) for mono audio (feature_size = 1), or (2, num_samples) for stereo audio (feature_size = 2).
-  padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
    - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
-  truncation (bool, optional, defaults to False) — Activates truncation to cut input sequences longer than max_length to max_length.
-  max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
-  return_tensors (str or TensorType, optional, default to ‘pt’) — If set, will return tensors instead of list of python integers. Acceptable values are:
    - 'tf': Return TensorFlow tf.constant objects.
    - 'pt': Return PyTorch torch.Tensor objects.
    - 'np': Return Numpy np.ndarray objects.
-  sampling_rate (int, optional) — The sampling rate at which the audio input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
Main method to featurize and prepare for the model one or several sequence(s).
DiaProcessor
class transformers.DiaProcessor
< source >( feature_extractor tokenizer audio_tokenizer )
Parameters
-  feature_extractor (DiaFeatureExtractor) — An instance of DiaFeatureExtractor. The feature extractor is a required input.
-  tokenizer (DiaTokenizer) — An instance of DiaTokenizer. The tokenizer is a required input.
-  audio_tokenizer (DacModel) — An instance of DacModel used to encode/decode audio into/from codebooks. It is a required input.
Constructs a Dia processor which wraps a DiaFeatureExtractor, a DiaTokenizer, and a DacModel into a single processor. It inherits the audio feature extraction, tokenizer, and audio encode/decode functionalities. See __call__(), DiaProcessor.encode(), and decode() for more information.
__call__
< source >( text: typing.Union[str, list[str]] audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor'], NoneType] = None output_labels: typing.Optional[bool] = False **kwargs: typing_extensions.Unpack[transformers.models.dia.processing_dia.DiaProcessorKwargs] )
Main method to prepare text(s) and audio to be fed as input to the model. The audio argument is forwarded to the DiaFeatureExtractor’s __call__() and subsequently to the DacModel’s encode(). The text argument is forwarded to the DiaTokenizer’s __call__(). Please refer to the docstrings of the above methods for more information.
batch_decode
< source >( decoder_input_ids: torch.Tensor audio_prompt_len: typing.Optional[int] = None **kwargs: typing_extensions.Unpack[transformers.models.dia.processing_dia.DiaProcessorKwargs] )
Decodes a batch of audio codebook sequences into their respective audio waveforms via the
audio_tokenizer. See decode() for more information.
decode
< source >( decoder_input_ids: torch.Tensor audio_prompt_len: typing.Optional[int] = None **kwargs: typing_extensions.Unpack[transformers.models.dia.processing_dia.DiaProcessorKwargs] )
Decodes a single sequence of audio codebooks into the respective audio waveform via the
audio_tokenizer. See decode() and batch_decode() for more information.
DiaModel
class transformers.DiaModel
< source >( config: DiaConfig )
Parameters
- config (DiaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Dia model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_position_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Union[transformers.modeling_outputs.BaseModelOutput, tuple, NoneType] = None past_key_values: typing.Optional[transformers.cache_utils.EncoderDecoderCache] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **kwargs  ) → transformers.modeling_outputs.Seq2SeqModelOutput or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
-  attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
-  decoder_input_ids (torch.LongTensor of shape (batch_size * num_codebooks, target_sequence_length) or (batch_size, target_sequence_length, num_codebooks), optional) —
  - (batch_size * num_codebooks, target_sequence_length): corresponds to the general use case where the audio input codebooks are flattened into the batch dimension. This also aligns with the flattened audio logits which are used to calculate the loss.
  - (batch_size, sequence_length, num_codebooks): corresponds to the internally used shape of Dia to calculate embeddings and subsequent steps more efficiently.
  If no decoder_input_ids are provided, it will create a tensor of bos_token_id with shape (batch_size, 1, num_codebooks). Indices can be obtained using the DiaProcessor. See DiaProcessor.__call__() for more details.
-  decoder_position_ids (torch.LongTensor of shape (batch_size, target_sequence_length)) — Indices of positions of each input sequence token in the position embeddings. Used to calculate the position embeddings up to config.decoder_config.max_position_embeddings.
-  decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
-  encoder_outputs (Union[modeling_outputs.BaseModelOutput, tuple, NoneType]) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
-  past_key_values (cache_utils.EncoderDecoderCache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
-  use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
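The two decoder_input_ids layouts listed above carry the same values in a different arrangement. The sketch below converts between them, with nested Python lists standing in for tensors. It assumes that the flattened layout keeps the codebooks of one batch element next to each other (row index = batch_index * num_codebooks + codebook_index); that ordering and the helper names `to_flat` and `to_3d` are assumptions for illustration, so check the model code before relying on them.

```python
def to_flat(ids_3d):
    """(batch, seq_len, num_codebooks) -> (batch * num_codebooks, seq_len)."""
    flat = []
    for sample in ids_3d:                      # sample: seq_len x num_codebooks
        num_codebooks = len(sample[0])
        for c in range(num_codebooks):
            # Collect codebook c across all time steps of this sample.
            flat.append([frame[c] for frame in sample])
    return flat


def to_3d(ids_flat, num_codebooks):
    """(batch * num_codebooks, seq_len) -> (batch, seq_len, num_codebooks)."""
    batch = []
    for start in range(0, len(ids_flat), num_codebooks):
        rows = ids_flat[start:start + num_codebooks]   # num_codebooks x seq_len
        # Transpose so that each time step holds one value per codebook.
        batch.append([list(frame) for frame in zip(*rows)])
    return batch


# One sample, three time steps, two codebooks.
ids_3d = [[[10, 20], [11, 21], [12, 22]]]
ids_flat = to_flat(ids_3d)
print(ids_flat)  # [[10, 11, 12], [20, 21, 22]]
assert to_3d(ids_flat, num_codebooks=2) == ids_3d
```

With real tensors the same round trip would typically be a reshape plus a transpose of the last two dimensions.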
Returns
transformers.modeling_outputs.Seq2SeqModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (DiaConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is an EncoderDecoderCache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The DiaModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
DiaForConditionalGeneration
class transformers.DiaForConditionalGeneration
< source >( config: DiaConfig )
Parameters
- config (DiaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Dia model consisting of a (byte) text encoder and audio decoder with a prediction head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_position_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Union[transformers.modeling_outputs.BaseModelOutput, tuple, NoneType] = None past_key_values: typing.Optional[transformers.cache_utils.EncoderDecoderCache] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.LongTensor] = None cache_position: typing.Optional[torch.LongTensor] = None **kwargs  ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
-  attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
-  decoder_input_ids (torch.LongTensor of shape (batch_size * num_codebooks, target_sequence_length) or (batch_size, target_sequence_length, num_codebooks), optional) —
  - (batch_size * num_codebooks, target_sequence_length): corresponds to the general use case where the audio input codebooks are flattened into the batch dimension. This also aligns with the flattened audio logits which are used to calculate the loss.
  - (batch_size, sequence_length, num_codebooks): corresponds to the internally used shape of Dia to calculate embeddings and subsequent steps more efficiently.
  If no decoder_input_ids are provided, it will create a tensor of bos_token_id with shape (batch_size, 1, num_codebooks). Indices can be obtained using the DiaProcessor. See DiaProcessor.__call__() for more details.
-  decoder_position_ids (torch.LongTensor of shape (batch_size, target_sequence_length)) — Indices of positions of each input sequence token in the position embeddings. Used to calculate the position embeddings up to config.decoder_config.max_position_embeddings.
-  decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
-  encoder_outputs (Union[modeling_outputs.BaseModelOutput, tuple, NoneType]) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
-  past_key_values (cache_utils.EncoderDecoderCache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
-  use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  labels (torch.LongTensor of shape (batch_size * num_codebooks,), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.decoder_config.vocab_size - 1] or -100. Tokens with indices set to -100 are ignored (masked).
-  cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
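To make the role of -100 in labels concrete, here is a small pure-Python sketch of a cross-entropy loss that skips ignored positions. It mirrors the common ignore-index convention rather than Dia's actual loss code; the function name `masked_cross_entropy` and the toy numbers are made up for the example.

```python
import math

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not -100.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target indices, or -100 to ignore a position
    Returns 0.0 when every position is ignored (a choice made for this sketch).
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
labels = [0, -100, 1]  # the middle position is ignored
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Changing the scores at the ignored position leaves the loss untouched, which is the behavior the -100 convention is meant to provide.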
Returns
transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (DiaConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is an EncoderDecoderCache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The DiaForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
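Because the returned logits are scores before softmax, turning them into probabilities or a greedy token choice takes one more step. The sketch below does this for a single position in plain Python. The scores are toy values and the helper name `softmax` is chosen for the example; real generation would normally go through generate() and its sampling options rather than a hand-written argmax.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


position_logits = [1.0, 3.0, 0.5]  # toy scores for a 3-entry vocabulary
probs = softmax(position_logits)
# Greedy choice: index of the highest probability.
greedy_token = max(range(len(probs)), key=probs.__getitem__)
print(greedy_token)  # 1
```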
generate
< source >( inputs: typing.Optional[torch.Tensor] = None generation_config: typing.Optional[transformers.generation.configuration_utils.GenerationConfig] = None logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None prefix_allowed_tokens_fn: typing.Optional[typing.Callable[[int, torch.Tensor], list[int]]] = None synced_gpus: typing.Optional[bool] = None assistant_model: typing.Optional[ForwardRef('PreTrainedModel')] = None streamer: typing.Optional[ForwardRef('BaseStreamer')] = None negative_prompt_ids: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None use_model_defaults: typing.Optional[bool] = None custom_generate: typing.Optional[str] = None **kwargs )