Informer
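Overview
The Informer model was proposed in Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
This method introduces a probabilistic attention mechanism that selects the “active” queries rather than the “lazy” queries, and provides a sparse Transformer that mitigates the quadratic compute and memory requirements of vanilla attention.
The abstract from the paper is the following: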
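*Many real-world applications require long sequence time-series forecasting (LSTF). LSTF demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues that prevent Transformer from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer.
Informer has three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) The self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extremely long input sequences. (iii) The generative style decoder, while conceptually simple, predicts the long time-series sequences in one forward operation rather than in a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.*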
This model was contributed by elisim and kashif. The original code can be found here.
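To make the “active” versus “lazy” query idea from the overview more concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the library's implementation: the function name `probsparse_attention` is hypothetical, the sparsity measure (maximum score minus mean score) is computed against every key, and the number of active queries follows `u = sampling_factor * ceil(ln L)`. The actual method estimates the measure from a sample of keys to reach O(L log L) cost, so this version only shows the selection logic, not the efficiency gain.

```python
import math
import random


def probsparse_attention(queries, keys, values, sampling_factor=5):
    """Simplified single-head ProbSparse attention on lists of vectors."""
    n_queries, dim = len(queries), len(queries[0])
    # Scaled dot-product score of every query against every key.
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim) for k in keys]
        for q in queries
    ]
    # Sparsity measure per query: max score minus mean score.
    sparsity = [max(row) - sum(row) / len(row) for row in scores]
    # Keep the top-u queries, with u = c * ceil(ln L), capped at L (assumed rounding).
    u = min(n_queries, max(1, sampling_factor * math.ceil(math.log(n_queries))))
    active = sorted(range(n_queries), key=lambda i: -sparsity[i])[:u]

    # Lazy queries fall back to the mean of the values.
    mean_value = [sum(col) / len(values) for col in zip(*values)]
    output = [list(mean_value) for _ in range(n_queries)]
    # Active queries get ordinary softmax attention.
    for i in active:
        top = max(scores[i])
        weights = [math.exp(s - top) for s in scores[i]]
        total = sum(weights)
        output[i] = [
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ]
    return output, active


random.seed(0)
L, d = 64, 4
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
out, active = probsparse_attention(x, x, x)
print(len(active), "of", L, "queries use full attention")  # 25 of 64
```

With 64 time steps and the default sampling factor of 5, only 25 queries receive a full softmax; the remaining rows of the output are all equal to the mean value vector.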
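Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please open a Pull Request and we'll review it! The resource should demonstrate something new instead of duplicating an existing resource.
- Check out the Informer post on the Hugging Face blog: Multivariate Probabilistic Time Series Forecasting with Informer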
InformerConfig
class transformers.InformerConfig
< source >( prediction_length: typing.Optional[int] = None context_length: typing.Optional[int] = None distribution_output: str = 'student_t' loss: str = 'nll' input_size: int = 1 lags_sequence: typing.Optional[list[int]] = None scaling: typing.Union[bool, str, NoneType] = 'mean' num_dynamic_real_features: int = 0 num_static_real_features: int = 0 num_static_categorical_features: int = 0 num_time_features: int = 0 cardinality: typing.Optional[list[int]] = None embedding_dimension: typing.Optional[list[int]] = None d_model: int = 64 encoder_ffn_dim: int = 32 decoder_ffn_dim: int = 32 encoder_attention_heads: int = 2 decoder_attention_heads: int = 2 encoder_layers: int = 2 decoder_layers: int = 2 is_encoder_decoder: bool = True activation_function: str = 'gelu' dropout: float = 0.05 encoder_layerdrop: float = 0.1 decoder_layerdrop: float = 0.1 attention_dropout: float = 0.1 activation_dropout: float = 0.1 num_parallel_samples: int = 100 init_std: float = 0.02 use_cache = True attention_type: str = 'prob' sampling_factor: int = 5 distil: bool = True **kwargs )
Parameters
-  prediction_length (int) — The prediction length for the decoder. In other words, the prediction horizon of the model. This value is typically dictated by the dataset and we recommend setting it appropriately.
-  context_length (int, optional, defaults to prediction_length) — The context length for the encoder. If None, the context length will be the same as the prediction_length.
-  distribution_output (string, optional, defaults to "student_t") — The distribution emission head for the model. Could be either “student_t”, “normal” or “negative_binomial”.
-  loss (string, optional, defaults to "nll") — The loss function for the model corresponding to the distribution_output head. For parametric distributions it is the negative log likelihood (nll), which is currently the only supported one.
-  input_size (int, optional, defaults to 1) — The size of the target variable which by default is 1 for univariate targets. Would be > 1 in case of multivariate targets.
-  scaling (string or bool, optional, defaults to "mean") — Whether to scale the input targets via “mean” scaler, “std” scaler or no scaler if None. If True, the scaler is set to “mean”.
-  lags_sequence (list[int], optional, defaults to [1, 2, 3, 4, 5, 6, 7]) — The lags of the input time series as covariates, often dictated by the frequency of the data. Default is [1, 2, 3, 4, 5, 6, 7] but we recommend changing it based on the dataset.
-  num_time_features (int, optional, defaults to 0) — The number of time features in the input time series.
-  num_dynamic_real_features (int, optional, defaults to 0) — The number of dynamic real valued features.
-  num_static_categorical_features (int, optional, defaults to 0) — The number of static categorical features.
-  num_static_real_features (int, optional, defaults to 0) — The number of static real valued features.
-  cardinality (list[int], optional) — The cardinality (number of different values) for each of the static categorical features. Should be a list of integers, having the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
-  embedding_dimension (list[int], optional) — The dimension of the embedding for each of the static categorical features. Should be a list of integers, having the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
-  d_model (int, optional, defaults to 64) — Dimensionality of the transformer layers.
-  encoder_layers (int, optional, defaults to 2) — Number of encoder layers.
-  decoder_layers (int, optional, defaults to 2) — Number of decoder layers.
-  encoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.
-  decoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer decoder.
-  encoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in encoder.
-  decoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in decoder.
-  activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and decoder. If string, "gelu" and "relu" are supported.
-  dropout (float, optional, defaults to 0.05) — The dropout probability for all fully connected layers in the encoder and decoder.
-  encoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each encoder layer.
-  decoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each decoder layer.
-  attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention probabilities.
-  activation_dropout (float, optional, defaults to 0.1) — The dropout probability used between the two layers of the feed-forward networks.
-  num_parallel_samples (int, optional, defaults to 100) — The number of samples to generate in parallel for each time step of inference.
-  init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.
-  use_cache (bool, optional, defaults to True) — Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.
-  attention_type (str, optional, defaults to “prob”) — Attention used in encoder. This can be set to “prob” (Informer’s ProbAttention) or “full” (vanilla transformer’s canonical self-attention).
-  sampling_factor (int, optional, defaults to 5) — ProbSparse sampling factor (only has an effect when attention_type=“prob”). It is used to control the reduced query matrix (Q_reduce) input length.
-  distil (bool, optional, defaults to True) — Whether to use distilling in encoder.
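The parameter list above ties cardinality and embedding_dimension to num_static_categorical_features. As a rough illustration, these documented constraints can be checked on plain Python values before a configuration is built. The helper `check_static_categorical` is hypothetical and not part of transformers; it only encodes the rules as stated in the parameter descriptions.

```python
def check_static_categorical(num_static_categorical_features,
                             cardinality=None,
                             embedding_dimension=None):
    """Hypothetical helper (not part of transformers): list violations of the
    documented constraints on cardinality and embedding_dimension."""
    problems = []
    if num_static_categorical_features == 0:
        return problems
    for name, value in (("cardinality", cardinality),
                        ("embedding_dimension", embedding_dimension)):
        if value is None:
            problems.append(
                f"{name} cannot be None when static categorical features are used"
            )
        elif len(value) != num_static_categorical_features:
            problems.append(
                f"{name} has {len(value)} entries, "
                f"expected {num_static_categorical_features}"
            )
    return problems


# One categorical feature, for example a series ID with 366 distinct values.
ok = check_static_categorical(1, cardinality=[366], embedding_dimension=[2])
# Two features declared, but only one cardinality and no embedding sizes.
bad = check_static_categorical(2, cardinality=[366])
print(ok)
print(bad)
```

The first call reports no problems; the second reports one problem per argument, since the cardinality list is too short and the embedding sizes are missing.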
This is the configuration class to store the configuration of an InformerModel. It is used to instantiate an Informer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Informer huggingface/informer-tourism-monthly architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import InformerConfig, InformerModel
>>> # Initializing an Informer configuration with 12 time steps for prediction
>>> configuration = InformerConfig(prediction_length=12)
>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = InformerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
InformerModel
class transformers.InformerModel
< source >( config: InformerConfig )
Parameters
- config (InformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Informer Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: typing.Optional[torch.Tensor] = None static_real_features: typing.Optional[torch.Tensor] = None future_values: typing.Optional[torch.Tensor] = None future_time_features: typing.Optional[torch.Tensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[list[torch.FloatTensor]] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None use_cache: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None  ) → transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)
Parameters
-  past_values (torch.FloatTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size)) — Past values of the time series, that serve as context in order to predict the future. The sequence size of this tensor must be larger than the context_length of the model, since the model will use the larger size to construct lag features, i.e. additional values from the past which are added in order to serve as “extra context”. The sequence_length here is equal to config.context_length + max(config.lags_sequence), which if no lags_sequence is configured, is equal to config.context_length + 7 (as by default, the largest look-back index in config.lags_sequence is 7). The property _past_length returns the actual length of the past. The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features and lags). Missing values, if any, need to be replaced with zeros and indicated via the past_observed_mask. For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.
-  past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features)) — Required time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features. These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires you to provide additional time features. The Time Series Transformer only learns additional embeddings for static_categorical_features. Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time. The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
-  past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
   - 1 for values that are observed,
   - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
-  static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series. Static categorical features are features which have the same value for all time steps (static over time). A typical example of a static categorical feature is a time series ID.
-  static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series. Static real features are features which have the same value for all time steps (static over time). A typical example of a static real feature is promotion information.
-  future_values (torch.FloatTensor of shape (batch_size, prediction_length) or (batch_size, prediction_length, input_size), optional) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs during training to learn to output, given the past_values. The sequence length here is equal to prediction_length. See the demo notebook and code snippets for details. During training, any missing values need to be replaced with zeros and indicated via the future_observed_mask. For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.
-  future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features)) — Required time features for the prediction window, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features. These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires you to provide additional time features. The Time Series Transformer only learns additional embeddings for static_categorical_features. Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time. The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
-  decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
-  head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
   - 1 indicates the head is not masked,
   - 0 indicates the head is masked.
-  decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
   - 1 indicates the head is not masked,
   - 0 indicates the head is masked.
-  cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
   - 1 indicates the head is not masked,
   - 0 indicates the head is masked.
-  encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
-  past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
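The length and masking rules for past_values described in the parameters above can be sketched with plain Python lists. This is an illustration only: the helper names `required_past_length` and `fill_missing` are hypothetical, and a real pipeline would build torch tensors with a batch dimension rather than lists.

```python
import math

DEFAULT_LAGS = [1, 2, 3, 4, 5, 6, 7]  # default lags_sequence from the config docs


def required_past_length(context_length, lags_sequence=None):
    """Length of past_values: context_length + max(lags_sequence)."""
    lags = lags_sequence if lags_sequence is not None else DEFAULT_LAGS
    return context_length + max(lags)


def fill_missing(series):
    """Replace NaNs with zeros and return (values, observed_mask)."""
    observed_mask = [0 if math.isnan(x) else 1 for x in series]
    values = [0.0 if math.isnan(x) else x for x in series]
    return values, observed_mask


past_length = required_past_length(context_length=24)  # 24 + 7 = 31
# Toy series in which every tenth observation is missing.
raw = [float("nan") if t % 10 == 0 else float(t) for t in range(past_length)]
past_values, past_observed_mask = fill_missing(raw)
print(past_length, past_values[:3], past_observed_mask[:3])
```

The mask marks zero-filled positions with 0, so the model can tell them apart from observed values that happen to be zero.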
Returns
transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqTSModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InformerConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — An EncoderDecoderCache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.
- scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.
- static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch which are copied to the covariates at inference time.
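The roles of loc and scale in the returned values can be illustrated with a small standardization sketch in plain Python. It is a sketch under stated assumptions, not the library's scaler: the function names `std_scale` and `rescale` and the `minimum_scale` safeguard are hypothetical, and only observed values (mask equal to 1) contribute to the statistics.

```python
import math


def std_scale(context, observed_mask, minimum_scale=1e-5):
    """Standardize a context window using only its observed values.

    Returns (scaled_values, loc, scale). minimum_scale is an assumed
    safeguard that keeps the scale positive for constant series.
    """
    n_observed = max(sum(observed_mask), 1)
    loc = sum(x * m for x, m in zip(context, observed_mask)) / n_observed
    variance = sum(
        m * (x - loc) ** 2 for x, m in zip(context, observed_mask)
    ) / n_observed
    scale = math.sqrt(variance + minimum_scale)
    scaled = [(x - loc) / scale for x in context]
    return scaled, loc, scale


def rescale(values, loc, scale):
    """Map model-space values back to the original magnitude."""
    return [v * scale + loc for v in values]


context = [100.0, 120.0, 0.0, 140.0]  # third value was missing and zero-filled
mask = [1, 1, 0, 1]
scaled, loc, scale = std_scale(context, mask)
restored = rescale(scaled, loc, scale)
print(loc, scale)
```

The model works on the scaled values, and the same loc and scale are applied in reverse to bring predictions back to the magnitude of the original series.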
The InformerModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import InformerModel
>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)
>>> model = InformerModel.from_pretrained("huggingface/informer-tourism-monthly")
>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )
>>> last_hidden_state = outputs.last_hidden_state
InformerForPrediction
class transformers.InformerForPrediction
< source >( config: InformerConfig )
Parameters
- config (InformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Informer Model with a distribution head on top for time-series forecasting.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: typing.Optional[torch.Tensor] = None static_real_features: typing.Optional[torch.Tensor] = None future_values: typing.Optional[torch.Tensor] = None future_time_features: typing.Optional[torch.Tensor] = None future_observed_mask: typing.Optional[torch.Tensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[list[torch.FloatTensor]] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None use_cache: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None  ) → transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)
Parameters
-  past_values (torch.FloatTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size)) — Past values of the time series, that serve as context in order to predict the future. The sequence size of this tensor must be larger than the context_length of the model, since the model will use the larger size to construct lag features, i.e. additional values from the past which are added in order to serve as “extra context”. The sequence_length here is equal to config.context_length + max(config.lags_sequence), which if no lags_sequence is configured, is equal to config.context_length + 7 (as by default, the largest look-back index in config.lags_sequence is 7). The property _past_length returns the actual length of the past. The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features and lags). Missing values, if any, need to be replaced with zeros and indicated via the past_observed_mask. For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.
-  past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features)) — Required time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features. These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires you to provide additional time features. The Time Series Transformer only learns additional embeddings for static_categorical_features. Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time. The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
-  past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
   - 1 for values that are observed,
   - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
-  static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series. Static categorical features are features which have the same value for all time steps (static over time). A typical example of a static categorical feature is a time series ID.
-  static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series. Static real features are features which have the same value for all time steps (static over time). A typical example of a static real feature is promotion information.
-  future_values (torch.FloatTensor of shape (batch_size, prediction_length) or (batch_size, prediction_length, input_size), optional) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs during training to learn to output, given the past_values. The sequence length here is equal to prediction_length. See the demo notebook and code snippets for details. During training, any missing values need to be replaced with zeros and indicated via the future_observed_mask. For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.
-  future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features)) — Required time features for the prediction window, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features. These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires you to provide additional time features. The Time Series Transformer only learns additional embeddings for static_categorical_features. Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time. The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
-  future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:
   - 1 for values that are observed,
   - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
   This mask is used to filter out missing values for the final loss calculation.
-  decoder_attention_mask (torch.LongTensorof shape(batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
-  head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
-  decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
-  cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
-  encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
-  past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
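As a rough illustration of the future_values, future_observed_mask and future_time_features arguments described above, the sketch below builds all three tensors for a toy prediction window. The raw data, the month-of-year input, the context_length value and the log-scaled age feature are assumptions made for this example; they are not taken from the library or from the demo notebook.

```python
import math

import torch

batch_size, prediction_length = 2, 6

# Hypothetical raw targets for the prediction window; NaN marks a missing observation.
raw_future = torch.tensor(
    [
        [1.0, 2.0, float("nan"), 4.0, 5.0, 6.0],
        [float("nan"), 1.5, 2.5, 3.5, float("nan"), 5.5],
    ]
)

# True (1) where a value was observed, False (0) where it was missing.
future_observed_mask = ~torch.isnan(raw_future)

# Missing values are replaced with zeros, as the parameter description asks.
future_values = torch.nan_to_num(raw_future, nan=0.0)

# Hypothetical calendar input: month of year (1..12) for every forecast step,
# encoded as one sine/cosine (Fourier) pair so that December sits next to January.
months = torch.tensor([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]])
angle = 2 * math.pi * (months.float() - 1) / 12
month_features = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)

# "Age" feature: increases monotonically with the time index.
# The log10 scaling and the context_length of 12 are assumptions of this sketch.
context_length = 12
steps = torch.arange(context_length, context_length + prediction_length).float()
age = torch.log10(2.0 + steps).expand(batch_size, -1).unsqueeze(-1)

# Shape (batch_size, prediction_length, num_features) with num_features = 3 here.
future_time_features = torch.cat([month_features, age], dim=-1)
```

With these toy inputs, config.num_time_features would have to be 3 (two Fourier components plus the age feature) for the shapes to line up with the model.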
Returns
transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqTSModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (InformerConfig) and inputs.
-  last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
-  past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — An EncoderDecoderCache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
-  decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
-  decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
-  cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
-  encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
-  encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
-  encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
-  loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window, which are used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.
-  scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window, which are used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.
-  static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.
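The role of loc and scale can be sketched as a normalize-then-invert round trip. The example below assumes a plain standardization (mean and standard deviation over the context window) and made-up data. It is only an illustration of the idea, not the scaler the library applies for a given config.scaling value.

```python
import torch

# Hypothetical context windows for two univariate series of very different magnitude.
past_values = torch.tensor(
    [
        [10.0, 12.0, 14.0, 16.0],
        [1000.0, 1100.0, 900.0, 1000.0],
    ]
)

# Per-series statistics over the context window (time is dim=1).
loc = past_values.mean(dim=1, keepdim=True)
scale = ((past_values - loc) ** 2).mean(dim=1, keepdim=True).sqrt().clamp_min(1e-5)

# Both series now have comparable magnitude, which is what the network sees.
normalized = (past_values - loc) / scale

# Hypothetical model output in the normalized space,
# mapped back to each series' original magnitude.
normalized_forecast = torch.tensor([[0.5, 1.0], [-0.5, 0.0]])
forecast = normalized_forecast * scale + loc
```

Keeping loc and scale next to the model output is what makes the inverse step possible: a forecast made in the normalized space can always be mapped back to the units of the original series.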
The InformerForPrediction forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
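The difference between calling the module and calling forward directly can be seen with a generic PyTorch module. The TinyModel class below is a hypothetical stand-in rather than part of Transformers, and a forward hook plays the role of the post-processing step.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a real model."""

    def forward(self, x):
        return x * 2


model = TinyModel()
calls = []

# A forward hook is one of the steps that only Module.__call__ runs.
handle = model.register_forward_hook(
    lambda module, args, output: calls.append("hook")
)

x = torch.ones(3)

out_call = model(x)  # goes through __call__, so the hook fires
n_after_call = len(calls)

out_forward = model.forward(x)  # bypasses __call__, so the hook is skipped
n_after_forward = len(calls)

handle.remove()
```

Both calls return the same tensor here, which is why skipping the hooks goes unnoticed unless something depends on them.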
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import InformerForPrediction
>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)
>>> model = InformerForPrediction.from_pretrained(
...     "huggingface/informer-tourism-monthly"
... )
>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )
>>> loss = outputs.loss
>>> loss.backward()
>>> # during inference, one only provides past values
>>> # as well as possible additional features
>>> # the model autoregressively generates future values
>>> outputs = model.generate(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_time_features=batch["future_time_features"],
... )
>>> mean_prediction = outputs.sequences.mean(dim=1)
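Besides the mean, the sampled trajectories can be summarized by other statistics. The sketch below uses a random tensor as a stand-in for outputs.sequences and assumes the univariate layout (batch_size, num_parallel_samples, prediction_length), which is consistent with taking the mean over dim=1 above. The sizes and the 10%/90% levels are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)

# Random stand-in for outputs.sequences; the sizes are assumptions of this sketch.
batch_size, num_parallel_samples, prediction_length = 4, 100, 24
sequences = torch.randn(batch_size, num_parallel_samples, prediction_length)

# Point forecasts: reduce over the sample dimension (dim=1).
mean_prediction = sequences.mean(dim=1)
median_prediction = sequences.median(dim=1).values

# An 80% prediction interval from the empirical 10% and 90% quantiles.
# With a 1D tensor of levels, the quantile level becomes the first output dimension.
quantiles = torch.quantile(sequences, torch.tensor([0.1, 0.9]), dim=1)
lower, upper = quantiles[0], quantiles[1]
```

Each summary has shape (batch_size, prediction_length), so lower and upper can be plotted as a band around the median forecast.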