Transformers documentation
Vision Transformer (ViT)
Vision Transformer (ViT)
개요
Vision Transformer (ViT) 모델은 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby가 제안한 논문 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale에서 소개되었습니다. 이는 Transformer 인코더를 ImageNet에서 성공적으로 훈련시킨 첫 번째 논문으로, 기존의 잘 알려진 합성곱 신경망(CNN) 구조와 비교해 매우 우수한 결과를 달성했습니다.
논문의 초록은 다음과 같습니다:
Transformer 아키텍처는 자연어 처리 작업에서 사실상 표준으로 자리 잡았으나, 컴퓨터 비전 분야에서의 적용은 여전히 제한적입니다. 비전에서 어텐션 메커니즘은 종종 합성곱 신경망(CNN)과 결합하여 사용되거나, 전체 구조를 유지하면서 합성곱 신경망의 특정 구성 요소를 대체하는 데 사용됩니다. 우리는 이러한 CNN 의존성이 필요하지 않으며, 이미지 패치를 순차적으로 입력받는 순수한 Transformer가 이미지 분류 작업에서 매우 우수한 성능을 발휘할 수 있음을 보여줍니다. 대규모 데이터로 사전 학습된 후, ImageNet, CIFAR-100, VTAB 등 다양한 중소형 이미지 인식 벤치마크에 적용하면 Vision Transformer(ViT)는 최신 합성곱 신경망과 비교해 매우 우수한 성능을 발휘하면서도 훈련에 필요한 계산 자원을 상당히 줄일 수 있습니다.
 ViT 아키텍처. 원본 논문에서 발췌.
 ViT 아키텍처. 원본 논문에서 발췌. 원래의 Vision Transformer에 이어, 여러 후속 연구들이 진행되었습니다:
- DeiT (Data-efficient Image Transformers) (Facebook AI 개발). DeiT 모델은 distilled vision transformers입니다. DeiT의 저자들은 더 효율적으로 훈련된 ViT 모델도 공개했으며, 이는 ViTModel 또는 ViTForImageClassification에 바로 사용할 수 있습니다. 여기에는 3가지 크기로 4개의 변형이 제공됩니다: facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. 그리고 모델에 이미지를 준비하려면 - DeiTImageProcessor를 사용해야 한다는 점에 유의하십시오.
- BEiT (BERT pre-training of Image Transformers) (Microsoft Research 개발). BEiT 모델은 BERT (masked image modeling)에 영감을 받고 VQ-VAE에 기반한 self-supervised 방법을 이용하여 supervised pre-trained vision transformers보다 더 우수한 성능을 보입니다. 
- DINO (Vision Transformers의 self-supervised 훈련을 위한 방법) (Facebook AI 개발). DINO 방법으로 훈련된 Vision Transformer는 학습되지 않은 상태에서도 객체를 분할할 수 있는 합성곱 신경망에서는 볼 수 없는 매우 흥미로운 능력을 보여줍니다. DINO 체크포인트는 hub에서 찾을 수 있습니다. 
- MAE (Masked Autoencoders) (Facebook AI 개발). Vision Transformer를 비대칭 인코더-디코더 아키텍처를 사용하여 마스크된 패치의 높은 비율(75%)에서 픽셀 값을 재구성하도록 사전 학습함으로써, 저자들은 이 간단한 방법이 미세 조정 후 supervised 방식의 사전 학습을 능가한다는 것을 보여주었습니다. 
이 모델은 nielsr에 의해 기여되었습니다. 원본 코드(JAX로 작성됨)은 여기에서 확인할 수 있습니다.
참고로, 우리는 Ross Wightman의 timm 라이브러리에서 JAX에서 PyTorch로 변환된 가중치를 다시 변환했습니다. 모든 공로는 그에게 돌립니다!
사용 팁
- Transformer 인코더에 이미지를 입력하기 위해, 각 이미지는 고정 크기의 겹치지 않는 패치들로 분할된 후 선형 임베딩됩니다. 전체 이미지를 대표하는 [CLS] 토큰이 추가되어, 분류에 사용할 수 있습니다. 저자들은 또한 절대 위치 임베딩을 추가하여, 결과적으로 생성된 벡터 시퀀스를 표준 Transformer 인코더에 입력합니다.
- Vision Transformer는 모든 이미지가 동일한 크기(해상도)여야 하므로, [ViTImageProcessor]를 사용하여 이미지를 모델에 맞게 리사이즈(또는 리스케일)하고 정규화할 수 있습니다.
- 사전 학습이나 미세 조정 시 사용된 패치 해상도와 이미지 해상도는 각 체크포인트의 이름에 반영됩니다. 예를 들어, google/vit-base-patch16-224는 패치 해상도가 16x16이고 미세 조정 해상도가 224x224인 기본 크기 아키텍처를 나타냅니다. 모든 체크포인트는 hub에서 확인할 수 있습니다.
- 사용할 수 있는 체크포인트는 (1) ImageNet-21k (1,400만 개의 이미지와 21,000개의 클래스)에서만 사전 학습되었거나, 또는 (2) ImageNet (ILSVRC 2012, 130만 개의 이미지와 1,000개의 클래스)에서 추가로 미세 조정된 경우입니다.
- Vision Transformer는 224x224 해상도로 사전 학습되었습니다. 미세 조정 시, 사전 학습보다 더 높은 해상도를 사용하는 것이 유리한 경우가 많습니다 ((Touvron et al., 2019), (Kolesnikovet al., 2020). 더 높은 해상도로 미세 조정하기 위해, 저자들은 원본 이미지에서의 위치에 따라 사전 학습된 위치 임베딩의 2D 보간(interpolation)을 수행합니다.
- 최고의 결과는 supervised 방식의 사전 학습에서 얻어졌으며, 이는 NLP에서는 해당되지 않는 경우가 많습니다. 저자들은 마스크된 패치 예측(마스크된 언어 모델링에서 영감을 받은 self-supervised 사전 학습 목표)을 사용한 실험도 수행했습니다. 이 접근 방식으로 더 작은 ViT-B/16 모델은 ImageNet에서 79.9%의 정확도를 달성하였으며, 이는 처음부터 학습한 것보다 2% 개선된 결과이지만, 여전히 supervised 사전 학습보다 4% 낮습니다.
Scaled Dot Product Attention (SDPA) 사용하기
PyTorch는 torch.nn.functional의 일부로서 native scaled dot-product attention (SDPA) 연산자를 포함하고 있습니다. 이 함수는 입력 및 사용 중인 하드웨어에 따라 여러 구현 방식을 적용할 수 있습니다.자세한 내용은 공식 문서나 GPU 추론 페이지를 참조하십시오.
SDPA는 torch>=2.1.1에서 구현이 가능한 경우 기본적으로 사용되지만, from_pretrained()에서 attn_implementation="sdpa"로 설정하여 SDPA를 명시적으로 요청할 수도 있습니다.
from transformers import ViTForImageClassification
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
...최적의 속도 향상을 위해 모델을 반정밀도(예: torch.float16 또는 torch.bfloat16)로 로드하는 것을 권장합니다.
로컬 벤치마크(A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04)에서 float32와 google/vit-base-patch16-224 모델을 사용한 추론 시, 다음과 같은 속도 향상을 확인했습니다.
| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) | 
|---|---|---|---|
| 1 | 7 | 6 | 1.17 | 
| 2 | 8 | 6 | 1.33 | 
| 4 | 8 | 6 | 1.33 | 
| 8 | 8 | 6 | 1.33 | 
리소스
ViT의 추론 및 커스텀 데이터에 대한 미세 조정과 관련된 데모 노트북은 여기에서 확인할 수 있습니다. Hugging Face에서 공식적으로 제공하는 자료와 커뮤니티(🌎로 표시된) 자료 목록은 ViT를 시작하는 데 도움이 될 것입니다. 이 목록에 포함될 자료를 제출하고 싶다면 Pull Request를 열어 주시면 검토하겠습니다. 새로운 내용을 설명하는 자료가 가장 이상적이며, 기존 자료를 중복하지 않도록 해주십시오.
ViTForImageClassification 은 다음에서 지원됩니다:
- Hugging Face Transformers로 ViT를 이미지 분류에 맞게 미세 조정하는 방법에 대한 블로그 포스트
- Hugging Face Transformers와 Keras를 사용한 이미지 분류에 대한 블로그 포스트
- Hugging Face Transformers를 사용한 이미지 분류 미세 조정에 대한 노트북
- Hugging Face Trainer로 CIFAR-10에서 Vision Transformer 미세 조정에 대한 노트북
- PyTorch Lightning으로 CIFAR-10에서 Vision Transformer 미세 조정에 대한 노트북
⚗️ 최적화
- Optimum을 사용한 양자화를 통해 Vision Transformer(ViT) 가속에 대한 블로그 포스트
⚡️ 추론
🚀 배포
- TF Serving으로 Hugging Face에서 Tensorflow Vision 모델 배포에 대한 블로그 포스트
- Vertex AI에서 Hugging Face ViT 배포에 대한 블로그 포스트
- TF Serving을 사용하여 Kubernetes에서 Hugging Face ViT 배포에 대한 블로그 포스트
ViTConfig
class transformers.ViTConfig
< source >( hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 image_size = 224 patch_size = 16 num_channels = 3 qkv_bias = True encoder_stride = 16 pooler_output_size = None pooler_act = 'tanh' **kwargs )
Parameters
-  hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
-  num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
-  num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
-  intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
-  hidden_act (strorfunction, optional, defaults to"gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string,"gelu","relu","selu"and"gelu_new"are supported.
-  hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-  attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-  layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
-  image_size (int, optional, defaults to 224) — The size (resolution) of each image.
-  patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
-  num_channels (int, optional, defaults to 3) — The number of input channels.
-  qkv_bias (bool, optional, defaults toTrue) — Whether to add a bias to the queries, keys and values.
-  encoder_stride (int, optional, defaults to 16) — Factor to increase the spatial resolution by in the decoder head for masked image modeling.
-  pooler_output_size (int, optional) — Dimensionality of the pooler layer. If None, defaults tohidden_size.
-  pooler_act (str, optional, defaults to"tanh") — The activation function to be used by the pooler. Keys of ACT2FN are supported for Flax and Pytorch, and elements of https://www.tensorflow.org/api_docs/python/tf/keras/activations are supported for Tensorflow.
This is the configuration class to store the configuration of a ViTModel. It is used to instantiate an ViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ViT google/vit-base-patch16-224 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import ViTConfig, ViTModel
>>> # Initializing a ViT vit-base-patch16-224 style configuration
>>> configuration = ViTConfig()
>>> # Initializing a model (with random weights) from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.configViTFeatureExtractor
Preprocess an image or a batch of images.
ViTImageProcessor
class transformers.ViTImageProcessor
< source >( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: typing.Optional[bool] = None **kwargs )
Parameters
-  do_resize (bool, optional, defaults toTrue) — Whether to resize the image’s (height, width) dimensions to the specified(size["height"], size["width"]). Can be overridden by thedo_resizeparameter in thepreprocessmethod.
-  size (dict, optional, defaults to{"height" -- 224, "width": 224}): Size of the output image after resizing. Can be overridden by thesizeparameter in thepreprocessmethod.
-  resample (PILImageResampling, optional, defaults toResampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by theresampleparameter in thepreprocessmethod.
-  do_rescale (bool, optional, defaults toTrue) — Whether to rescale the image by the specified scalerescale_factor. Can be overridden by thedo_rescaleparameter in thepreprocessmethod.
-  rescale_factor (intorfloat, optional, defaults to1/255) — Scale factor to use if rescaling the image. Can be overridden by therescale_factorparameter in thepreprocessmethod.
-  do_normalize (bool, optional, defaults toTrue) — Whether to normalize the image. Can be overridden by thedo_normalizeparameter in thepreprocessmethod.
-  image_mean (floatorlist[float], optional, defaults toIMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by theimage_meanparameter in thepreprocessmethod.
-  image_std (floatorlist[float], optional, defaults toIMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by theimage_stdparameter in thepreprocessmethod.
-  do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
Constructs a ViT image processor.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None do_convert_rgb: typing.Optional[bool] = None )
Parameters
-  images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, setdo_rescale=False.
-  do_resize (bool, optional, defaults toself.do_resize) — Whether to resize the image.
-  size (dict[str, int], optional, defaults toself.size) — Dictionary in the format{"height": h, "width": w}specifying the size of the output image after resizing.
-  resample (PILImageResamplingfilter, optional, defaults toself.resample) —PILImageResamplingfilter to use if resizing the image e.g.PILImageResampling.BILINEAR. Only has an effect ifdo_resizeis set toTrue.
-  do_rescale (bool, optional, defaults toself.do_rescale) — Whether to rescale the image values between [0 - 1].
-  rescale_factor (float, optional, defaults toself.rescale_factor) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional, defaults toself.do_normalize) — Whether to normalize the image.
-  image_mean (floatorlist[float], optional, defaults toself.image_mean) — Image mean to use ifdo_normalizeis set toTrue.
-  image_std (floatorlist[float], optional, defaults toself.image_std) — Image standard deviation to use ifdo_normalizeis set toTrue.
-  return_tensors (strorTensorType, optional) — The type of tensors to return. Can be one of:- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOWor- 'tf': Return a batch of type- tf.Tensor.
- TensorType.PYTORCHor- 'pt': Return a batch of type- torch.Tensor.
- TensorType.NUMPYor- 'np': Return a batch of type- np.ndarray.
- TensorType.JAXor- 'jax': Return a batch of type- jax.numpy.ndarray.
 
- Unset: Return a list of 
-  data_format (ChannelDimensionorstr, optional, defaults toChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
 
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
-  do_convert_rgb (bool, optional, defaults toself.do_convert_rgb) — Whether to convert the image to RGB.
Preprocess an image or batch of images.
ViTImageProcessorFast
class transformers.ViTImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] )
Constructs a fast Vit image processor.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs]  ) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
-  images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, setdo_rescale=False.
-  do_resize (bool, optional) — Whether to resize the image.
-  size (dict[str, int], optional) — Describes the maximum input dimensions to the model.
-  default_to_square (bool, optional) — Whether to default to a square image when resizing, if size is an int.
-  resample (Union[PILImageResampling, F.InterpolationMode, NoneType]) — Resampling filter to use if resizing the image. This can be one of the enumPILImageResampling. Only has an effect ifdo_resizeis set toTrue.
-  do_center_crop (bool, optional) — Whether to center crop the image.
-  crop_size (dict[str, int], optional) — Size of the output image after applyingcenter_crop.
-  do_rescale (bool, optional) — Whether to rescale the image.
-  rescale_factor (Union[int, float, NoneType]) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional) — Whether to normalize the image.
-  image_mean (Union[float, list[float], NoneType]) — Image mean to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  image_std (Union[float, list[float], NoneType]) — Image standard deviation to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
-  return_tensors (Union[str, ~utils.generic.TensorType, NoneType]) — Returns stacked tensors if set to `pt, otherwise returns a list of tensors.
-  data_format (~image_utils.ChannelDimension, optional) — OnlyChannelDimension.FIRSTis supported. Added for compatibility with slow processors.
-  input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
-  device (torch.device, optional) — The device to process the images on. If unset, the device is inferred from the input images.
-  disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
Returns
<class 'transformers.image_processing_base.BatchFeature'>
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/TensorFlow/Numpy Tensors at initialization.
ViTModel
class transformers.ViTModel
< source >( config: ViTConfig add_pooling_layer: bool = True use_mask_token: bool = False )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
-  add_pooling_layer (bool, optional, defaults toTrue) — Whether to add a pooling layer
-  use_mask_token (bool, optional, defaults toFalse) — Whether to use a mask token for masked image modeling.
The bare Vit Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: typing.Optional[torch.Tensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None interpolate_pos_encoding: typing.Optional[bool] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]  ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.Tensorof shape(batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ViTImageProcessor. See ViTImageProcessor.call() for details (processor_classuses ViTImageProcessor for processing images).
-  bool_masked_pos (torch.BoolTensorof shape(batch_size, num_patches), optional) — Boolean masked positions. Indicates which patches are masked (1) and which aren’t (0).
-  head_mask (torch.Tensorof shape(num_heads,)or(num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (ViTConfig) and inputs.
- 
last_hidden_state ( torch.FloatTensorof shape(batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
pooler_output ( torch.FloatTensorof shape(batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The ViTModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
ViTForMaskedImageModeling
class transformers.ViTForMaskedImageModeling
< source >( config: ViTConfig )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ViT Model with a decoder on top for masked image modeling, as proposed in SimMIM.
Note that we provide a script to pre-train this model on custom data in our examples directory.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: typing.Optional[torch.Tensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None interpolate_pos_encoding: typing.Optional[bool] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]  ) → transformers.modeling_outputs.MaskedImageModelingOutput or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.Tensorof shape(batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ViTImageProcessor. See ViTImageProcessor.call() for details (processor_classuses ViTImageProcessor for processing images).
-  bool_masked_pos (torch.BoolTensorof shape(batch_size, num_patches)) — Boolean masked positions. Indicates which patches are masked (1) and which aren’t (0).
-  head_mask (torch.Tensorof shape(num_heads,)or(num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
Returns
transformers.modeling_outputs.MaskedImageModelingOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.MaskedImageModelingOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (ViTConfig) and inputs.
- loss (torch.FloatTensorof shape(1,), optional, returned whenbool_masked_posis provided) — Reconstruction loss.
- reconstruction (torch.FloatTensorof shape(batch_size, num_channels, height, width)) — Reconstructed / completed images.
- hidden_states (tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or
- when config.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape(batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- attentions (tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or when
- config.output_attentions=True): Tuple of- torch.FloatTensor(one for each layer) of shape- (batch_size, num_heads, patch_size, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViTForMaskedImageModeling forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")
>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]ViTForImageClassification
class transformers.ViTForImageClassification
< source >( config: ViTConfig )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.
Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
setting interpolate_pos_encoding to True in the forward of the model. This will interpolate the pre-trained
position embeddings to the higher resolution.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None interpolate_pos_encoding: typing.Optional[bool] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]  ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.Tensorof shape(batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ViTImageProcessor. See ViTImageProcessor.call() for details (processor_classuses ViTImageProcessor for processing images).
-  head_mask (torch.Tensorof shape(num_heads,)or(num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  labels (torch.LongTensorof shape(batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]. Ifconfig.num_labels == 1a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1a classification loss is computed (Cross-Entropy).
-  interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
Returns
transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.ImageClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (ViTConfig) and inputs.
- 
loss ( torch.FloatTensorof shape(1,), optional, returned whenlabelsis provided) — Classification (or regression if config.num_labels==1) loss.
- 
logits ( torch.FloatTensorof shape(batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape(batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, patch_size, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The ViTForImageClassification forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, ViTForImageClassification
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
...TFViTModel
class transformers.TFViTModel
< source >( config: ViTConfig *inputs add_pooling_layer = True **kwargs )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare ViT Model transformer outputting raw hidden-states without any specific head on top.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models
and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just
pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second
format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with
the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first
positional argument:
- a single Tensor with pixel_valuesonly and nothing else:model(pixel_values)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
model([pixel_values, attention_mask])ormodel([pixel_values, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring:
model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call
< source >( pixel_values: TFModelInputType | None = None head_mask: np.ndarray | tf.Tensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None interpolate_pos_encoding: bool | None = None return_dict: bool | None = None training: bool = False  ) → transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
Parameters
-  pixel_values (np.ndarray,tf.Tensor,list[tf.Tensor]`dict[str, tf.Tensor]ordict[str, np.ndarray]and each example must have the shape(batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.call() for details.
-  head_mask (np.ndarrayortf.Tensorof shape(num_heads,)or(num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
-  interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode, in graph mode the value will always be set to True.
-  training (bool, optional, defaults to `False“) — Whether or not to use the model in training mode (some modules like dropout modules have different behaviors between training and evaluation).
Returns
transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (ViTConfig) and inputs.
- 
last_hidden_state ( tf.Tensorof shape(batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
pooler_output ( tf.Tensorof shape(batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence. 
- 
hidden_states ( tuple(tf.Tensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftf.Tensor(one for the output of the embeddings + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
- 
attentions ( tuple(tf.Tensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftf.Tensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The TFViTModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, TFViTModel
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]TFViTForImageClassification
class transformers.TFViTForImageClassification
< source >( config: ViTConfig *inputs **kwargs )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.
Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
setting interpolate_pos_encoding to True in the forward of the model. This will interpolate the pre-trained
position embeddings to the higher resolution.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models
and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just
pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second
format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with
the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first
positional argument:
- a single Tensor with pixel_valuesonly and nothing else:model(pixel_values)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
model([pixel_values, attention_mask])ormodel([pixel_values, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring:
model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call
< source >( pixel_values: TFModelInputType | None = None head_mask: np.ndarray | tf.Tensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None interpolate_pos_encoding: bool | None = None return_dict: bool | None = None labels: np.ndarray | tf.Tensor | None = None training: bool | None = False  ) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
Parameters
-  pixel_values (np.ndarray,tf.Tensor,list[tf.Tensor]`dict[str, tf.Tensor]ordict[str, np.ndarray]and each example must have the shape(batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.call() for details.
-  head_mask (np.ndarrayortf.Tensorof shape(num_heads,)or(num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
-  interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode, in graph mode the value will always be set to True.
-  training (bool, optional, defaults to `False“) — Whether or not to use the model in training mode (some modules like dropout modules have different behaviors between training and evaluation).
-  labels (tf.Tensorornp.ndarrayof shape(batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]. Ifconfig.num_labels == 1a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1a classification loss is computed (Cross-Entropy).
Returns
transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (ViTConfig) and inputs.
- 
loss ( tf.Tensorof shape(batch_size, ), optional, returned whenlabelsis provided) — Classification (or regression if config.num_labels==1) loss.
- 
logits ( tf.Tensorof shape(batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
hidden_states ( tuple(tf.Tensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftf.Tensor(one for the output of the embeddings + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
- 
attentions ( tuple(tf.Tensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftf.Tensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The TFViTForImageClassification forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, TFViTForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image"))
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
Egyptian catFlaxVitModel
class transformers.FlaxViTModel
< source >( config: ViTConfig input_shape = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
-  dtype (jax.numpy.dtype, optional, defaults tojax.numpy.float32) — The data type of the computation. Can be one ofjax.numpy.float32,jax.numpy.float16(on GPUs) andjax.numpy.bfloat16(on TPUs).This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16(). 
The bare ViT Model transformer outputting raw hidden-states without any specific head on top.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models)
This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
__call__
< source >( pixel_values params: typing.Optional[dict] = None dropout_rng: <function PRNGKey at 0x7f36ed8705e0> = None train: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None  ) → transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.vit.configuration_vit.ViTConfig'>) and inputs.
- 
last_hidden_state ( jnp.ndarrayof shape(batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
pooler_output ( jnp.ndarrayof shape(batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- 
hidden_states ( tuple(jnp.ndarray), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple ofjnp.ndarray(one for the output of the embeddings + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
- 
attentions ( tuple(jnp.ndarray), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple ofjnp.ndarray(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The FlaxViTPreTrainedModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_stateFlaxViTForImageClassification
class transformers.FlaxViTForImageClassification
< source >( config: ViTConfig input_shape = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )
Parameters
- config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
-  dtype (jax.numpy.dtype, optional, defaults tojax.numpy.float32) — The data type of the computation. Can be one ofjax.numpy.float32,jax.numpy.float16(on GPUs) andjax.numpy.bfloat16(on TPUs).This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16(). 
ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models)
This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
__call__
< source >( pixel_values params: typing.Optional[dict] = None dropout_rng: <function PRNGKey at 0x7f36ed8705e0> = None train: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None  ) → transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)
Returns
transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.vit.configuration_vit.ViTConfig'>) and inputs.
- 
logits ( jnp.ndarrayof shape(batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
hidden_states ( tuple(jnp.ndarray), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple ofjnp.ndarray(one for the output of the embeddings + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
- 
attentions ( tuple(jnp.ndarray), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple ofjnp.ndarray(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The FlaxViTPreTrainedModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import jax
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = jax.numpy.argmax(logits, axis=-1)
>>> print("Predicted class:", model.config.id2label[predicted_class_idx.item()])