Prior Transformer
Translator: 片刻小哥哥
Project address: https://huggingface.apachecn.org/docs/diffusers/api/models/prior_transformer
Original address: https://huggingface.co/docs/diffusers/api/models/prior_transformer
The Prior Transformer was originally introduced in Hierarchical Text-Conditional Image Generation with CLIP Latents by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.
The abstract from the paper is:
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
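To make the prior stage concrete, the sketch below runs the denoising loop by hand: starting from a random latent, the PriorTransformer repeatedly predicts the CLIP image embedding and an UnCLIPScheduler steps the latent toward it. This is a simplified sketch, not the full prior pipeline; the repository id and subfolder names are assumptions based on the Kandinsky 2.1 prior weights, and random tensors stand in for the CLIP text features a real pipeline would compute.

```python
import torch
from diffusers import PriorTransformer, UnCLIPScheduler

# Assumed checkpoint layout (Kandinsky 2.1 prior weights): a "prior" subfolder holding the
# PriorTransformer and a "scheduler" subfolder holding the UnCLIPScheduler.
repo = "kandinsky-community/kandinsky-2-1-prior"
prior = PriorTransformer.from_pretrained(repo, subfolder="prior")
scheduler = UnCLIPScheduler.from_pretrained(repo, subfolder="scheduler")

batch = 1
dim = prior.config.embedding_dim
seq_len = prior.config.num_embeddings

# Random stand-ins for real CLIP text features.
proj_embedding = torch.randn(batch, dim)                   # pooled text embedding
encoder_hidden_states = torch.randn(batch, seq_len, dim)   # per-token text hidden states
attention_mask = torch.ones(batch, seq_len, dtype=torch.bool)

scheduler.set_timesteps(25)
latents = torch.randn(batch, dim)  # noisy CLIP image embedding to be denoised

timesteps = scheduler.timesteps
for i, t in enumerate(timesteps):
    # Predict the clean image embedding from the current noisy estimate.
    pred = prior(
        latents,
        timestep=t,
        proj_embedding=proj_embedding,
        encoder_hidden_states=encoder_hidden_states,
        attention_mask=attention_mask,
    ).predicted_image_embedding
    # Step the latent toward the prediction, as the prior pipeline does.
    prev_t = None if i + 1 == len(timesteps) else timesteps[i + 1]
    latents = scheduler.step(pred, timestep=t, sample=latents, prev_timestep=prev_t).prev_sample

# `latents` now approximates a CLIP image embedding for the decoder stage to turn into an image.
```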
PriorTransformer
class diffusers.PriorTransformer
[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/models/prior_transformer.py#L36)

( num_attention_heads: int = 32, attention_head_dim: int = 64, num_layers: int = 20, embedding_dim: int = 768, num_embeddings = 77, additional_embeddings = 4, dropout: float = 0.0, time_embed_act_fn: str = 'silu', norm_in_type: typing.Optional[str] = None, embedding_proj_norm_type: typing.Optional[str] = None, encoder_hid_proj_type: typing.Optional[str] = 'linear', added_emb_type: typing.Optional[str] = 'prd', time_embed_dim: typing.Optional[int] = None, embedding_proj_dim: typing.Optional[int] = None, clip_embed_dim: typing.Optional[int] = None )
Parameters

- num_attention_heads (`int`, *optional*, defaults to 32) — The number of heads to use for multi-head attention.
- attention_head_dim (`int`, *optional*, defaults to 64) — The number of channels in each head.
- num_layers (`int`, *optional*, defaults to 20) — The number of layers of Transformer blocks to use.
- embedding_dim (`int`, *optional*, defaults to 768) — The dimension of the model input `hidden_states`.
- num_embeddings (`int`, *optional*, defaults to 77) — The number of embeddings of the model input `hidden_states`.
- additional_embeddings (`int`, *optional*, defaults to 4) — The number of additional tokens appended to the projected `hidden_states`. The actual length of the used `hidden_states` is `num_embeddings + additional_embeddings`.
- dropout (`float`, *optional*, defaults to 0.0) — The dropout probability to use.
- time_embed_act_fn (`str`, *optional*, defaults to `'silu'`) — The activation function used to create timestep embeddings.
- norm_in_type (`str`, *optional*, defaults to `None`) — The normalization layer to apply on the hidden states before passing them to Transformer blocks. Set it to `None` if normalization is not needed.
- embedding_proj_norm_type (`str`, *optional*, defaults to `None`) — The normalization layer to apply on the input `proj_embedding`. Set it to `None` if normalization is not needed.
- encoder_hid_proj_type (`str`, *optional*, defaults to `'linear'`) — The projection layer to apply on the input `encoder_hidden_states`. Set it to `None` if `encoder_hidden_states` is `None`.
- added_emb_type (`str`, *optional*, defaults to `'prd'`) — Additional embeddings to condition the model. Choose from `prd` or `None`. If `prd` is chosen, a token indicating the (quantized) dot product between the text embedding and the image embedding is prepended, as proposed in the unCLIP paper (https://arxiv.org/abs/2204.06125). If it is `None`, no additional embeddings are prepended.
- time_embed_dim (`int`, *optional*, defaults to `None`) — The dimension of the timestep embeddings. If `None`, it is set to `num_attention_heads * attention_head_dim`.
- embedding_proj_dim (`int`, *optional*, defaults to `None`) — The dimension of `proj_embedding`. If `None`, it is set to `embedding_dim`.
- clip_embed_dim (`int`, *optional*, defaults to `None`) — The dimension of the output. If `None`, it is set to `embedding_dim`.
A Prior Transformer model.
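As a quick illustration of how these constructor arguments fit together, the snippet below builds a small, randomly initialized PriorTransformer (the values are scaled down and purely illustrative) and inspects the learned positional embedding, which spans `num_embeddings + additional_embeddings` tokens of width `num_attention_heads * attention_head_dim` (attribute name taken from the v0.23.0 source linked above).

```python
import torch
from diffusers import PriorTransformer

# Illustrative, scaled-down configuration; real checkpoints use the documented defaults.
prior = PriorTransformer(
    num_attention_heads=8,
    attention_head_dim=32,
    num_layers=4,
    embedding_dim=768,
    num_embeddings=77,
    additional_embeddings=4,
)

# Inner width is num_attention_heads * attention_head_dim = 256, and the positional
# embedding covers num_embeddings + additional_embeddings = 81 tokens.
print(prior.positional_embedding.shape)  # torch.Size([1, 81, 256])
print(prior.config.embedding_dim)        # 768
```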
forward
[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/models/prior_transformer.py#L247)

( hidden_states, timestep: typing.Union[torch.Tensor, float, int], proj_embedding: FloatTensor, encoder_hidden_states: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.BoolTensor] = None, return_dict: bool = True ) → PriorTransformerOutput or `tuple`
Parameters

- hidden_states (`torch.FloatTensor` of shape `(batch_size, embedding_dim)`) — The currently predicted image embeddings.
- timestep (`torch.LongTensor`) — Current denoising step.
- proj_embedding (`torch.FloatTensor` of shape `(batch_size, embedding_dim)`) — Projected embedding vector the denoising process is conditioned on.
- encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, num_embeddings, embedding_dim)`) — Hidden states of the text embeddings the denoising process is conditioned on.
- attention_mask (`torch.BoolTensor` of shape `(batch_size, num_embeddings)`) — Text mask for the text embeddings.
- return_dict (`bool`, *optional*, defaults to `True`) — Whether or not to return a PriorTransformerOutput instead of a plain tuple.
Returns

PriorTransformerOutput or `tuple`

If `return_dict` is `True`, a PriorTransformerOutput is returned, otherwise a `tuple` is returned where the first element is the sample tensor.
The PriorTransformer forward method.
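A minimal forward pass might look like the following; the model is randomly initialized with a deliberately tiny configuration, and the random tensors simply follow the shapes documented above.

```python
import torch
from diffusers import PriorTransformer

# Tiny, randomly initialized model; input shapes follow the parameter documentation above.
prior = PriorTransformer(
    num_attention_heads=4, attention_head_dim=16, num_layers=2,
    embedding_dim=64, num_embeddings=8,
)

batch_size = 2
hidden_states = torch.randn(batch_size, 64)              # current noisy image-embedding estimate
proj_embedding = torch.randn(batch_size, 64)             # pooled (projected) text embedding
encoder_hidden_states = torch.randn(batch_size, 8, 64)   # per-token text hidden states
attention_mask = torch.ones(batch_size, 8, dtype=torch.bool)

output = prior(
    hidden_states,
    timestep=10,
    proj_embedding=proj_embedding,
    encoder_hidden_states=encoder_hidden_states,
    attention_mask=attention_mask,
)
print(output.predicted_image_embedding.shape)  # torch.Size([2, 64])
```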
set_attn_processor
[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/models/prior_transformer.py#L195)

( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor]]], _remove_lora = False )
Parameters

- processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`) — The instantiated processor class or a dictionary of processor classes that will be set as the processor for all `Attention` layers.

If `processor` is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.
Sets the attention processor to use to compute attention.
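For example, all attention layers can share a single processor instance, or each layer can be addressed individually through a dict keyed by its path. A brief sketch, reusing the tiny configuration from the forward example above and assuming the `attn_processors` property exposed by the v0.23.0 source:

```python
from diffusers import PriorTransformer
from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0

prior = PriorTransformer(
    num_attention_heads=4, attention_head_dim=16, num_layers=2,
    embedding_dim=64, num_embeddings=8,
)

# One processor instance shared by every Attention layer
# (AttnProcessor2_0 requires PyTorch 2.x; AttnProcessor works everywhere).
prior.set_attn_processor(AttnProcessor())

# Or a dict keyed by processor path, which lets individual layers be targeted.
prior.set_attn_processor({name: AttnProcessor2_0() for name in prior.attn_processors})
```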
set_default_attn_processor
[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/models/prior_transformer.py#L232)

( )
Disables custom attention processors and sets the default attention implementation.
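Continuing the sketch above, any custom processors can be dropped again with a single call:

```python
# Revert every Attention layer to the library's default attention implementation.
prior.set_default_attn_processor()
```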
PriorTransformerOutput

class diffusers.models.prior_transformer.PriorTransformerOutput
[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/models/prior_transformer.py#L24)

( predicted_image_embedding: FloatTensor )

Parameters

- predicted_image_embedding (`torch.FloatTensor` of shape `(batch_size, embedding_dim)`) — The predicted CLIP image embedding conditioned on the CLIP text embedding input.

The output of PriorTransformer.
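Continuing the forward example above, the output class is a thin dataclass wrapper; with `return_dict=False` the same tensor comes back as the first element of a plain tuple.

```python
out = prior(
    hidden_states, timestep=10, proj_embedding=proj_embedding,
    encoder_hidden_states=encoder_hidden_states, attention_mask=attention_mask,
)
print(type(out).__name__)                   # PriorTransformerOutput
print(out.predicted_image_embedding.shape)  # torch.Size([2, 64])

# With return_dict=False, a plain tuple is returned instead.
(predicted,) = prior(
    hidden_states, timestep=10, proj_embedding=proj_embedding,
    encoder_hidden_states=encoder_hidden_states, attention_mask=attention_mask,
    return_dict=False,
)
```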