HunyuanImage2.1

HunyuanImage-2.1 is a 17B-parameter text-to-image model capable of generating 2K (2048 × 2048) resolution images.

HunyuanImage-2.1 comes in the following variants:

model type                  model id
HunyuanImage-2.1            hunyuanvideo-community/HunyuanImage-2.1-Diffusers
HunyuanImage-2.1-Distilled  hunyuanvideo-community/HunyuanImage-2.1-Distilled-Diffusers
HunyuanImage-2.1-Refiner    hunyuanvideo-community/HunyuanImage-2.1-Refiner-Diffusers

> [!TIP]
> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.
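
For example, first-block caching can be enabled directly on the transformer. The sketch below is illustrative only; it assumes your diffusers version exposes FirstBlockCacheConfig and that HunyuanImageTransformer2DModel supports the enable_cache API:

import torch
from diffusers import HunyuanImagePipeline, FirstBlockCacheConfig

pipe = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Reuse transformer block outputs across denoising steps when their inputs
# barely change; a higher threshold skips more work at some cost in fidelity.
pipe.transformer.enable_cache(FirstBlockCacheConfig(threshold=0.2))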

HunyuanImage-2.1

HunyuanImage-2.1 applies Adaptive Projected Guidance (APG) combined with Classifier-Free Guidance (CFG) in the denoising loop. HunyuanImagePipeline has a guider component (read more about Guider) and does not take a guidance_scale parameter at runtime. To change guider-related parameters, e.g., guidance_scale, you can update the guider configuration instead.

import torch
from diffusers import HunyuanImagePipeline

pipe = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", 
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

You can inspect the guider object:

>>> pipe.guider
AdaptiveProjectedMixGuidance {
  "_class_name": "AdaptiveProjectedMixGuidance",
  "_diffusers_version": "0.36.0.dev0",
  "adaptive_projected_guidance_momentum": -0.5,
  "adaptive_projected_guidance_rescale": 10.0,
  "adaptive_projected_guidance_scale": 10.0,
  "adaptive_projected_guidance_start_step": 5,
  "enabled": true,
  "eta": 0.0,
  "guidance_rescale": 0.0,
  "guidance_scale": 3.5,
  "start": 0.0,
  "stop": 1.0,
  "use_original_formulation": false
}

State:
  step: None
  num_inference_steps: None
  timestep: None
  count_prepared: 0
  enabled: True
  num_conditions: 2
  momentum_buffer: None
  is_apg_enabled: False
  is_cfg_enabled: True

To update the guider with a different configuration, use the new() method. For example, to generate an image with guidance_scale=5.0 while keeping all other default guidance parameters:

import torch
from diffusers import HunyuanImagePipeline

pipe = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", 
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

# Update the guider configuration
pipe.guider = pipe.guider.new(guidance_scale=5.0)

prompt = (
    "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, "
    "wearing a red knitted scarf and a red beret with the word 'Tencent' on it, holding a paintbrush with a "
    "focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style."
)

image = pipe(
    prompt=prompt, 
    num_inference_steps=50, 
    height=2048, 
    width=2048,
).images[0]
image.save("image.png")
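
At 2K resolution, the 17B transformer plus two text encoders can exceed a single GPU's memory. Model CPU offloading is one standard diffusers option; this is a minimal sketch, assuming enough CPU RAM to hold the offloaded weights:

import torch
from diffusers import HunyuanImagePipeline

pipe = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Diffusers",
    torch_dtype=torch.bfloat16,
)
# Instead of pipe.to("cuda"): each sub-model is moved to the GPU only while it runs
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A cat holding a sign that says hello world",
    num_inference_steps=50,
    height=2048,
    width=2048,
).images[0]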

HunyuanImage-2.1-Distilled

Use distilled_guidance_scale with the guidance-distilled checkpoint:

import torch
from diffusers import HunyuanImagePipeline

pipe = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Distilled-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

prompt = (
    "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, "
    "wearing a red knitted scarf and a red beret with the word 'Tencent' on it, holding a paintbrush with a "
    "focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style."
)

# Define the generator used below for reproducible results
generator = torch.Generator(device="cuda").manual_seed(0)

image = pipe(
    prompt,
    num_inference_steps=8,
    distilled_guidance_scale=3.25,
    height=2048,
    width=2048,
    generator=generator,
).images[0]
image.save("image.png")

HunyuanImagePipeline

class diffusers.HunyuanImagePipeline

< >

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLHunyuanImage text_encoder: Qwen2_5_VLForConditionalGeneration tokenizer: Qwen2Tokenizer text_encoder_2: T5EncoderModel tokenizer_2: ByT5Tokenizer transformer: HunyuanImageTransformer2DModel guider: typing.Optional[diffusers.guiders.adaptive_projected_guidance_mix.AdaptiveProjectedMixGuidance] = None ocr_guider: typing.Optional[diffusers.guiders.adaptive_projected_guidance_mix.AdaptiveProjectedMixGuidance] = None )

Parameters

  • transformer (HunyuanImageTransformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLHunyuanImage) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
  • text_encoder (Qwen2_5_VLForConditionalGeneration) — The text encoder, specifically the Qwen2.5-VL-7B-Instruct variant.
  • tokenizer (Qwen2Tokenizer) — Tokenizer of class [Qwen2Tokenizer].
  • text_encoder_2 (T5EncoderModel) — A ByT5-based T5EncoderModel used to encode glyph (text-rendering) prompts.
  • tokenizer_2 (ByT5Tokenizer) — Tokenizer of class [ByT5Tokenizer].
  • guider (AdaptiveProjectedMixGuidance, optional) — [AdaptiveProjectedMixGuidance] to be used to guide the image generation.
  • ocr_guider (AdaptiveProjectedMixGuidance, optional) — [AdaptiveProjectedMixGuidance] to be used to guide the image generation when text rendering is needed.

The HunyuanImage pipeline for text-to-image generation.

__call__

< >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 distilled_guidance_scale: typing.Optional[float] = 3.25 sigmas: typing.Optional[typing.List[float]] = None num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) ~pipelines.hunyuan_image.HunyuanImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined and negative_prompt_embeds is not provided, an empty negative prompt will be used. Ignored when not using guidance.
  • height (int, optional) — The height in pixels of the generated image. HunyuanImage-2.1 is trained for 2K generation, so 2048 gives the best results.
  • width (int, optional) — The width in pixels of the generated image. HunyuanImage-2.1 is trained for 2K generation, so 2048 gives the best results.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • distilled_guidance_scale (float, optional, defaults to 3.25) — A guidance scale value for guidance-distilled models. Unlike traditional classifier-free guidance, where the guidance scale is applied during inference through noise prediction rescaling, guidance-distilled models take the guidance scale directly as an input parameter during the forward pass. Guidance is enabled by setting distilled_guidance_scale > 1. Higher guidance scale values encourage the model to generate images closely linked to the text prompt, usually at the expense of lower image quality. For guidance-distilled models, this parameter is required; for non-distilled models, it is ignored.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_embeds_mask (torch.Tensor, optional) — Pre-generated text embeddings mask. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings mask will be generated from prompt input argument.
  • prompt_embeds_2 (torch.Tensor, optional) — Pre-generated text embeddings for ocr. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings for ocr will be generated from prompt input argument.
  • prompt_embeds_mask_2 (torch.Tensor, optional) — Pre-generated text embeddings mask for ocr. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings mask for ocr will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • negative_prompt_embeds_mask (torch.Tensor, optional) — Pre-generated negative text embeddings mask. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings mask will be generated from negative_prompt input argument.
  • negative_prompt_embeds_2 (torch.Tensor, optional) — Pre-generated negative text embeddings for ocr. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings for ocr will be generated from negative_prompt input argument.
  • negative_prompt_embeds_mask_2 (torch.Tensor, optional) — Pre-generated negative text embeddings mask for ocr. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings mask for ocr will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.hunyuan_image.HunyuanImagePipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include all tensors specified by callback_on_step_end_tensor_inputs (see the sketch after this list).
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
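
As a concrete illustration of the two callback parameters above, here is a minimal sketch; the log_step helper is hypothetical, and pipe is assumed to be a loaded HunyuanImagePipeline:

def log_step(pipeline, step, timestep, callback_kwargs):
    # "latents" is available here because it is listed in callback_on_step_end_tensor_inputs
    latents = callback_kwargs["latents"]
    print(f"step {step}, timestep {timestep}, latents norm {latents.norm():.2f}")
    return callback_kwargs  # the (possibly modified) dict is passed back to the pipeline

image = pipe(
    prompt="A cat holding a sign that says hello world",
    callback_on_step_end=log_step,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]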

Returns

~pipelines.hunyuan_image.HunyuanImagePipelineOutput or tuple

~pipelines.hunyuan_image.HunyuanImagePipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import HunyuanImagePipeline

>>> pipe = HunyuanImagePipeline.from_pretrained(
...     "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> image = pipe(prompt, negative_prompt="", num_inference_steps=50).images[0]
>>> image.save("hunyuanimage.png")

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None batch_size: int = 1 num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None )

Parameters

  • prompt (str or List[str]) — The prompt to be encoded.
  • device (torch.device, optional) — The torch device on which to perform the encoding.
  • batch_size (int, optional, defaults to 1) — Batch size of the prompts.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_embeds_mask (torch.Tensor, optional) — Pre-generated text mask. If not provided, text mask will be generated from prompt input argument.
  • prompt_embeds_2 (torch.Tensor, optional) — Pre-generated glyph text embeddings from ByT5. If not provided, will be generated from prompt input argument using self.tokenizer_2 and self.text_encoder_2.
  • prompt_embeds_mask_2 (torch.Tensor, optional) — Pre-generated glyph text mask from ByT5. If not provided, will be generated from prompt input argument using self.tokenizer_2 and self.text_encoder_2.
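
Precomputing embeddings lets you reuse them across multiple pipeline calls. The sketch below assumes encode_prompt returns the Qwen embeddings/mask and the ByT5 glyph embeddings/mask in the order listed above; verify against the source before relying on it:

prompt = "A cat holding a sign that says hello world"
prompt_embeds, prompt_embeds_mask, prompt_embeds_2, prompt_embeds_mask_2 = pipe.encode_prompt(
    prompt=prompt, device="cuda"
)

# Reuse the precomputed embeddings instead of passing prompt
image = pipe(
    prompt_embeds=prompt_embeds,
    prompt_embeds_mask=prompt_embeds_mask,
    prompt_embeds_2=prompt_embeds_2,
    prompt_embeds_mask_2=prompt_embeds_mask_2,
    num_inference_steps=50,
).images[0]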

HunyuanImageRefinerPipeline

class diffusers.HunyuanImageRefinerPipeline

< >

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLHunyuanImageRefiner text_encoder: Qwen2_5_VLForConditionalGeneration tokenizer: Qwen2Tokenizer transformer: HunyuanImageTransformer2DModel guider: typing.Optional[diffusers.guiders.adaptive_projected_guidance_mix.AdaptiveProjectedMixGuidance] = None )

Parameters

  • transformer (HunyuanImageTransformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLHunyuanImageRefiner) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
  • text_encoder (Qwen2_5_VLForConditionalGeneration) — The text encoder, specifically the Qwen2.5-VL-7B-Instruct variant.
  • tokenizer (Qwen2Tokenizer) — Tokenizer of class [Qwen2Tokenizer].
  • guider (AdaptiveProjectedMixGuidance, optional) — [AdaptiveProjectedMixGuidance] to be used to guide the image generation.

The HunyuanImage refiner pipeline for refining images generated by HunyuanImage-2.1.

__call__

< >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None distilled_guidance_scale: typing.Optional[float] = 3.25 image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 4 sigmas: typing.Optional[typing.List[float]] = None num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) ~pipelines.hunyuan_image.HunyuanImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, an empty negative prompt will be used. Ignored when not using guidance.
  • distilled_guidance_scale (float, optional, defaults to 3.25) — A guidance scale value for guidance-distilled models. Unlike traditional classifier-free guidance, where the guidance scale is applied during inference through noise prediction rescaling, guidance-distilled models take the guidance scale directly as an input parameter during the forward pass. Guidance is enabled by setting distilled_guidance_scale > 1. Higher guidance scale values encourage the model to generate images closely linked to the text prompt, usually at the expense of lower image quality. For guidance-distilled models, this parameter is required; for non-distilled models, it is ignored.
  • image (PIL.Image.Image, np.ndarray, torch.Tensor, or a list of these, optional) — The image or images to refine.
  • height (int, optional) — The height in pixels of the generated image.
  • width (int, optional) — The width in pixels of the generated image.
  • num_inference_steps (int, optional, defaults to 4) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.hunyuan_image.HunyuanImagePipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include all tensors specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

~pipelines.hunyuan_image.HunyuanImagePipelineOutput or tuple

~pipelines.hunyuan_image.HunyuanImagePipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import HunyuanImageRefinerPipeline
>>> from diffusers.utils import load_image

>>> pipe = HunyuanImageRefinerPipeline.from_pretrained(
...     "hunyuanvideo-community/HunyuanImage-2.1-Refiner-Diffusers", torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> image = load_image("path/to/image.png")
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> image = pipe(prompt, image=image, num_inference_steps=4).images[0]
>>> image.save("hunyuanimage.png")
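
Since the refiner is intended to polish base-model outputs, an end-to-end run might look like the sketch below. The chaining pattern is an assumption based on the image parameter above, and keeping both pipelines on one GPU requires substantial memory:

import torch
from diffusers import HunyuanImagePipeline, HunyuanImageRefinerPipeline

base = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
refiner = HunyuanImageRefinerPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Refiner-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A cat holding a sign that says hello world"
# Generate at 2K with the base model, then refine with a few extra steps
image = base(prompt=prompt, num_inference_steps=50, height=2048, width=2048).images[0]
refined = refiner(prompt=prompt, image=image, num_inference_steps=4).images[0]
refined.save("hunyuanimage_refined.png")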

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str], NoneType] = None device: typing.Optional[torch.device] = None batch_size: int = 1 num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None )

Parameters

  • prompt (str or List[str], optional) — The prompt to be encoded.
  • device (torch.device, optional) — The torch device on which to perform the encoding.
  • batch_size (int, optional, defaults to 1) — Batch size of the prompts.
  • num_images_per_prompt (int) — number of images that should be generated per prompt
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_embeds_mask (torch.Tensor, optional) — Pre-generated text mask. If not provided, text mask will be generated from prompt input argument.

HunyuanImagePipelineOutput

class diffusers.pipelines.hunyuan_image.pipeline_output.HunyuanImagePipelineOutput

< >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

  • images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size, or a numpy array of shape (batch_size, height, width, num_channels), representing the denoised images produced by the diffusion pipeline.

Output class for HunyuanImage pipelines.
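
For reference, the two return forms look like this in practice (assuming pipe is a loaded HunyuanImage pipeline):

out = pipe(prompt="A cat holding a sign that says hello world")  # HunyuanImagePipelineOutput
image = out.images[0]

images = pipe(prompt="A cat holding a sign that says hello world", return_dict=False)[0]  # first element of the plain tuple
image = images[0]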
