parent: stable_diffusion
SEED autoregressive (discrete image tokens for LLM-based generation)
DeepFloyd is a Stable Diffusion-style image model that more or less replaces CLIP with a full LLM, much like Google's Imagen model
it's a cascaded diffusion model conditioned on the T5 text encoder
Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
iterative restoration from low-quality and high-quality paired examples
Muse diffusion alternative: Masked Generative Transformers, T5 text encoder, discrete image tokens
super-resolution
transformers instead of UNet: DiT
StraIT Non-autoregressive Generation with Stratified Image Transformer
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
VAR, a new visual generation method that elevates GPT-style models beyond diffusion
outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability
GenTron Delving Deep into Diffusion Transformers for Image and Video Generation
Lucas Beyer: represent videos and images as collections of units of data called patches, akin to GPT tokens
now you can train diffusion transformers on data with different durations, resolutions, and aspect ratios (sketch below)
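A minimal sketch of the patch idea in plain NumPy (shapes are illustrative):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches,
    analogous to tokens for a diffusion transformer."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (h/p, w/p, p, p, c)
    return patches.reshape(-1, patch_size * patch_size * c)

tokens = patchify(np.zeros((256, 256, 3)), patch_size=16)
print(tokens.shape)  # (256, 768): a 16x16 grid of patch tokens
```

Since the sequence length is just however many patches the input yields, variable resolutions and aspect ratios fall out naturally.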
ZigMa Zigzag Mamba Diffusion Model
Mamba (state space model) instead of a transformer
FiT Flexible Vision Transformer for Diffusion Model
architecture designed for generating images with unrestricted resolutions and aspect ratios
promoting resolution generalization, eliminating biases induced by image cropping
PixArt-α Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (model) ==best==
training takes only 10.8% of Stable Diffusion's training time; inference in less than 8GB VRAM
ControlNet and LCM
PIXART-δ Fast and Controllable Image Generation with Latent Consistency Models (the LCM + ControlNet one)
PIXART-Σ Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
smaller size (0.6B parameters) than SDXL (2.6B parameters) and SD Cascade (5.1B parameters)
SiT Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
Scalable Interpolant Transformers (SiT)
design axes: discrete vs continuous time learning, the objective for the model to learn, the interpolant connecting the distributions (sketch below)
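A minimal sketch of one such choice, the linear interpolant connecting noise to data with its velocity target (PyTorch, notation assumed):

```python
import torch

def linear_interpolant(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1 connects noise (x0) to data (x1);
    a velocity model can be trained to predict dx_t/dt = x1 - x0."""
    t = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch dims
    xt = (1 - t) * x0 + t * x1
    velocity = x1 - x0
    return xt, velocity

x0 = torch.randn(8, 3, 32, 32)   # noise sample
x1 = torch.randn(8, 3, 32, 32)   # stand-in for data
t = torch.rand(8)                # continuous time in [0, 1]
xt, v = linear_interpolant(x0, x1, t)
```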
Diffusion-RWKV Scaling RWKV-Like Architectures for Diffusion Models
RWKV (an RNN-style architecture) instead of transformers
Composer better inpainting, trains semantic components independently
High Fidelity Image Synthesis With Deep VAEs In Latent Space
hierarchical variational autoencoders (VAEs)
Binary Latent Diffusion; binary latent space, binary latent diffusion model; 1/3 of LDM parameters
they tie the "probability" of discrete representation to the probability of the dataset: Variational Inference itself
Self-conditioned Image Generation via Generating Representations ==best==
RCG: Representation-Conditioned Image Generation
does not condition on any human annotations, instead conditioning on representations from a pre-trained encoder
the representation distribution is modeled by a representation diffusion model (RDM)
CLIP-VQDiffusion Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model
uses the CLIP image encoder at train time, then the CLIP text encoder at test time
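A hedged sketch of the language-free conditioning trick (the encoder callables are stand-ins, not the paper's code): train on CLIP image embeddings, then rely on CLIP's shared embedding space to swap in text embeddings at inference.

```python
def condition_embedding(clip_image_enc, clip_text_enc, image=None, prompt=None):
    # training: condition the diffusion model on image embeddings (no captions needed)
    if image is not None:
        return clip_image_enc(image)
    # inference: CLIP's shared space lets text embeddings stand in for image ones
    return clip_text_enc(prompt)
```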
HunyuanDiT SD3-like architecture text-to-image model (Diffusion Transformer) by Tencent (and 5 times smaller)
StableCascade by Stability, a new text-to-image model building upon the Würstchen architecture
works in a much smaller latent space: 42x compression vs 8x
the smaller the latent space, the faster inference runs and the cheaper training becomes (arithmetic below)
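Rough arithmetic on what that compression buys (illustrative numbers from the note):

```python
# spatial side of the latent for a 1024x1024 image at each compression factor
image_side = 1024
for name, factor in [("SD VAE, 8x", 8), ("Wuerstchen / StableCascade, ~42x", 42)]:
    side = image_side // factor
    print(f"{name}: ~{side}x{side} latent = {side * side} spatial positions")
# 8x  -> 128x128 = 16384 positions; ~42x -> 24x24 = 576 positions,
# i.e. the denoiser works over roughly 28x fewer positions
```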
Diffusion Model with Perceptual Loss ==best==
the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
the diffusion model itself is a perceptual network (training objective)
models capable of generating more realistic samples (at lower steps)
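For reference, the standard classifier-free guidance combination the paper reinterprets as implicit perceptual guidance (PyTorch sketch):

```python
import torch

def cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    # extrapolate from the unconditional prediction toward the conditional one
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = cfg(eps_u, eps_c)  # the noise estimate actually used for the denoising step
```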
Kandinsky 2 image fusion, inpainting, open source (Apache)
(uses XLM-RoBERTa-Large, an LLM: BERT-like, but with a byte-level BPE tokenizer)
maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
ELLA Equip Diffusion Models with LLM for Enhanced Semantic Alignment
without training either the U-Net or the LLM: two pre-trained models bridged with a Timestep-Aware Semantic Connector module, which adapts semantic features at different stages of the denoising
interprets lengthy and intricate prompts across the sampling timesteps
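A hedged sketch of the bridging idea (layer sizes and structure are illustrative, not ELLA's actual connector): project the frozen LLM's token features and shift them by a timestep embedding, so conditioning can differ per denoising stage.

```python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Trainable bridge between a frozen LLM and a frozen U-Net (illustrative)."""
    def __init__(self, llm_dim=2048, cond_dim=768, t_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, cond_dim)
        self.t_mlp = nn.Sequential(nn.Linear(t_dim, cond_dim), nn.SiLU())

    def forward(self, llm_tokens, t_emb):
        # llm_tokens: (B, L, llm_dim), t_emb: (B, t_dim)
        return self.proj(llm_tokens) + self.t_mlp(t_emb).unsqueeze(1)

cond = TimestepAwareConnector()(torch.randn(2, 77, 2048), torch.randn(2, 256))
# (2, 77, 768): fed to the U-Net's cross-attention in place of CLIP features
```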
SnapFusion Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
runs on mobile devices in ~2 seconds; reduces the computation of the image decoder via data distillation
Beyond U: Making Diffusion Models Faster & Lighter
continuous dynamical systems to design a novel denoising network
1/4 of the parameters and 30% of the FLOPs of SD, 70% faster inference
Consistency Models: consistency distillation vs progressive distillation
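A hedged sketch of the consistency-distillation objective (schedules, loss weightings, and the exact parameterization omitted): the student maps any point on an ODE trajectory to its endpoint, so its output at (x_t, t) is regressed onto the EMA student's output at the adjacent point reached by one teacher solver step.

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, ema_student, teacher_ode_step,
                                  x_t, t, t_prev):
    with torch.no_grad():
        x_prev = teacher_ode_step(x_t, t, t_prev)  # one solver step toward data
        target = ema_student(x_prev, t_prev)       # self-consistency target
    return F.mse_loss(student(x_t, t), target)
```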
Diffusion World Model (DWM) ==best==
long-horizon predictions in a single forward pass, eliminating the need for recursive queries
enables offline Q-learning with synthetic data
distribution matching distillation (DMD)
distills the multi-step process of traditional diffusion models into a single step, through a teacher-student setup
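A heavily simplified sketch of the DMD generator update: the gradient direction is the difference between a "fake" score model (trained on the generator's own outputs) and the frozen teacher's score; time weighting and noise schedules are omitted here.

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(x_gen, score_real, score_fake, t, noise):
    # noise the one-step generator output as in the diffusion forward process
    x_t = x_gen + t.view(-1, 1, 1, 1) * noise
    with torch.no_grad():
        grad = score_fake(x_t, t) - score_real(x_t, t)
    # surrogate whose gradient w.r.t. x_gen is exactly `grad`
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach(), reduction="sum")
```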
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
unified solution to generative modeling and domain transfer
simple approach to learning models to transport between two observed distributions
shortest paths between two points, increasingly straight paths
uses: image generation, image-to-image translation, and domain adaptation
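A minimal sketch of why straightness matters at sampling time: integrating dx/dt = v(x, t) with a plain Euler loop, where straighter learned paths tolerate fewer steps (InstaFlow below pushes this to one).

```python
import torch

@torch.no_grad()
def sample_rectified_flow(velocity_net, x0, steps=4):
    # Euler integration from noise (t=0) to data (t=1)
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t)
    return x

# stand-in velocity field just to show the call shape
net = lambda x, t: -x
sample = sample_rectified_flow(net, torch.randn(2, 3, 32, 32), steps=4)
```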
⚡InstaFlow One-Step Stable Diffusion with Rectified Flow
leverages pre-trained Stable Diffusion; one step = faster, 0.12s per image
can quickly generate low-resolution images to choose from: a fast previewer
can have ControlNet and LoRA
Boosting Latent Diffusion with Flow Matching
flow matching between the diffusion model and the convolutional decoder = high resolution at reduced computational cost
diffusion provides generation diversity, FM maps the small latent space to a high-dimensional one
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
biasing rectified flow models towards perceptually relevant scales
bidirectional flow of information between image and text tokens
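A hedged sketch of that bidirectional mixing: self-attention over the concatenation of image and text tokens (the real MM-DiT keeps separate per-modality projection weights; this collapses them for brevity).

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # one attention pass over both streams lets information flow both ways
        x = torch.cat([img_tokens, txt_tokens], dim=1)
        x, _ = self.attn(x, x, x)
        n = img_tokens.size(1)
        return x[:, :n], x[:, n:]

img, txt = torch.randn(2, 256, 512), torch.randn(2, 77, 512)
img_out, txt_out = JointAttention()(img, txt)
```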
Any-to-Any Generation via Composable Diffusion (audio, image, text)
SyncDiffusion Coherent Montage via Synchronized Joint Diffusions (synchronizes them) ==best==
RAPHAEL Text-to-Image Generation via Large Mixture of Diffusion Paths
mixture-of-experts (MoEs) layers, encompassing multiple nouns, adjectives, and verbs
trained on 1000 gpus for 2 months
DistriFusion Distributed Parallel Inference for High-Resolution Diffusion Models
uses multiple GPUs to accelerate diffusion models while keeping the output coherent
Training Data Protection with Compositional Diffusion Models (CDM); parallel training ==best==
method to train different diffusion models on distinct data and compose them at inference time
PanGu-Draw Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
novel latent diffusion model designed for resource-efficient training and multiple control signals
split structure and texture generators
cutting data preparation by 48% and reducing training resources by 51%
cooperatively use different latent spaces within a unified denoising process
multi-control image synthesis
Versatile Diffusion Text, Images and Variations All in One Diffusion Model
disentanglement of style and semantics, dual- and multi-context blending
generate similar expressions from reference text
UniDiffuser marginal, conditional, and joint diffusion
extra diffusion conditions; perturbs data in all modalities
image, text, text-to-image, image-to-text, and image-text pair generation
GigaGAN, Adobe implementation
StyleGAN-T, NVIDIA (fast text-to-image GAN)
diffusion as an alternative to GANs: DiffMorpher