parent: stable_diffusion
SEED autoregressive (discrete image tokens for LLM-based generation)
DeepFloyd is a Stable Diffusion-style image model that more or less replaces CLIP with a full LLM, much like Google's Imagen model
it's a cascaded diffusion model conditioned on the T5 text encoder
Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
iterative restoration from low-quality and high-quality paired examples
Muse diffusion alternative: Masked Generative Transformers, T5 text encoder, discrete image tokens
super-resolution
transformers instead of UNet: DiT
StraIT Non-autoregressive Generation with Stratified Image Transformer
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
VAR, a new visual generation method that elevates GPT-style models beyond diffusion
outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability
GenTron Delving Deep into Diffusion Transformers for Image and Video Generation
Lucas Beyer: represent videos and images as collections of units of data called patches, akin to GPT tokens
now you can train diffusion transformers on data with different durations, resolutions, and aspect ratios (sketch below)
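A minimal sketch of the patch idea in plain NumPy (shapes are illustrative):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches,
    analogous to tokens for a diffusion transformer."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (h/p, w/p, p, p, c)
    return patches.reshape(-1, patch_size * patch_size * c)

tokens = patchify(np.zeros((256, 256, 3)), patch_size=16)
print(tokens.shape)  # (256, 768): a 16x16 grid of patch tokens
```

Since the sequence length is just however many patches the input yields, variable resolutions and aspect ratios fall out naturally.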
ZigMa Zigzag Mamba Diffusion Model
Mamba (state space model) instead of a transformer
FiT Flexible Vision Transformer for Diffusion Model
architecture designed for generating images with unrestricted resolutions and aspect ratios
promoting resolution generalization, eliminating biases induced by image cropping
PixArt-α Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (model) ==best==
training takes only 10.8% of Stable Diffusion's training time; inference in less than 8GB VRAM
ControlNet and LCM
PIXART-δ Fast and Controllable Image Generation with Latent Consistency Models (the LCM + ControlNet one)
PIXART-Σ Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
smaller size (0.6B parameters) than SDXL (2.6B parameters) and SD Cascade (5.1B parameters)
SiT Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
Scalable Interpolant Transformers (SiT)
design axes: discrete vs continuous time learning, the objective for the model to learn, the interpolant connecting the distributions (sketch below)
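A minimal sketch of one such choice, the linear interpolant connecting noise to data with its velocity target (PyTorch, notation assumed):

```python
import torch

def linear_interpolant(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1 connects noise (x0) to data (x1);
    a velocity model can be trained to predict dx_t/dt = x1 - x0."""
    t = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch dims
    xt = (1 - t) * x0 + t * x1
    velocity = x1 - x0
    return xt, velocity

x0 = torch.randn(8, 3, 32, 32)   # noise sample
x1 = torch.randn(8, 3, 32, 32)   # stand-in for data
t = torch.rand(8)                # continuous time in [0, 1]
xt, v = linear_interpolant(x0, x1, t)
```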
Diffusion-RWKV Scaling RWKV-Like Architectures for Diffusion Models
RWKV (an RNN-style architecture) instead of transformers
Composer better inpainting, trains semantic components independently
High Fidelity Image Synthesis With Deep VAEs In Latent Space
hierarchical variational autoencoders (VAEs)
Binary Latent Diffusion; binary latent space, binary latent diffusion model; 1/3 of LDM parameters
they tie the "probability" of discrete representation to the probability of the dataset: Variational Inference itself
Self-conditioned Image Generation via Generating Representations ==best==
RCG: Representation-Conditioned Image Generation
does not condition on any human annotations, instead conditioning on representations from a pre-trained encoder
the representation distribution is modeled by a representation diffusion model (RDM)
CLIP-VQDiffusion Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model
uses the CLIP image encoder at train time, then the CLIP text encoder at test time
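A hedged sketch of the language-free conditioning trick (the encoder callables are stand-ins, not the paper's code): train on CLIP image embeddings, then rely on CLIP's shared embedding space to swap in text embeddings at inference.

```python
def condition_embedding(clip_image_enc, clip_text_enc, image=None, prompt=None):
    # training: condition the diffusion model on image embeddings (no captions needed)
    if image is not None:
        return clip_image_enc(image)
    # inference: CLIP's shared space lets text embeddings stand in for image ones
    return clip_text_enc(prompt)
```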
HunyuanDiT SD3-like architecture text-to-image model (Diffusion Transformer) by Tencent (and 5 times smaller)
StableCascade by Stability, a new text-to-image model building upon the Würstchen architecture
works in a much smaller latent space: 42x compression vs 8x
the smaller the latent space, the faster inference runs and the cheaper training becomes (arithmetic below)
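Rough arithmetic on what that compression buys (illustrative numbers from the note):

```python
# spatial side of the latent for a 1024x1024 image at each compression factor
image_side = 1024
for name, factor in [("SD VAE, 8x", 8), ("Wuerstchen / StableCascade, ~42x", 42)]:
    side = image_side // factor
    print(f"{name}: ~{side}x{side} latent = {side * side} spatial positions")
# 8x  -> 128x128 = 16384 positions; ~42x -> 24x24 = 576 positions,
# i.e. the denoiser works over roughly 28x fewer positions
```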
Diffusion Model with Perceptual Loss ==best==
the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
the diffusion model itself is a perceptual network (training objective)
models capable of generating more realistic samples (at lower steps)
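For reference, the standard classifier-free guidance combination the paper reinterprets as implicit perceptual guidance (PyTorch sketch):

```python
import torch

def cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    # extrapolate from the unconditional prediction toward the conditional one
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = cfg(eps_u, eps_c)  # the noise estimate actually used for the denoising step
```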
Kandinsky 2 image fusion, inpainting, open source (Apache)
(uses XLM-RoBERTa-Large, an LLM: BERT-like, but with a byte-level BPE tokenizer)
maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
ELLA Equip Diffusion Models with LLM for Enhanced Semantic Alignment
without training either the U-Net or the LLM: two pre-trained models bridged with a Timestep-Aware Semantic Connector module, which adapts semantic features at different stages of the denoising
interprets lengthy and intricate prompts across the sampling timesteps
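A hedged sketch of the bridging idea (layer sizes and structure are illustrative, not ELLA's actual connector): project the frozen LLM's token features and shift them by a timestep embedding, so conditioning can differ per denoising stage.

```python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Trainable bridge between a frozen LLM and a frozen U-Net (illustrative)."""
    def __init__(self, llm_dim=2048, cond_dim=768, t_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, cond_dim)
        self.t_mlp = nn.Sequential(nn.Linear(t_dim, cond_dim), nn.SiLU())

    def forward(self, llm_tokens, t_emb):
        # llm_tokens: (B, L, llm_dim), t_emb: (B, t_dim)
        return self.proj(llm_tokens) + self.t_mlp(t_emb).unsqueeze(1)

cond = TimestepAwareConnector()(torch.randn(2, 77, 2048), torch.randn(2, 256))
# (2, 77, 768): fed to the U-Net's cross-attention in place of CLIP features
```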
SnapFusion Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
runs on mobile devices in ~2 seconds; reduces the computation of the image decoder via data distillation
Beyond U: Making Diffusion Models Faster & Lighter
continuous dynamical systems to design a novel denoising network
1/4 of the parameters and 30% of the FLOPs of SD, 70% faster inference
Consistency Models: consistency distillation vs progressive distillation
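A hedged sketch of the consistency-distillation objective (schedules, loss weightings, and the exact parameterization omitted): the student maps any point on an ODE trajectory to its endpoint, so its output at (x_t, t) is regressed onto the EMA student's output at the adjacent point reached by one teacher solver step.

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, ema_student, teacher_ode_step,
                                  x_t, t, t_prev):
    with torch.no_grad():
        x_prev = teacher_ode_step(x_t, t, t_prev)  # one solver step toward data
        target = ema_student(x_prev, t_prev)       # self-consistency target
    return F.mse_loss(student(x_t, t), target)
```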
Diffusion World Model (DWM) ==best==
long-horizon predictions in a single forward pass, eliminating the need for recursive queries
enables offline Q-learning with synthetic data
distribution matching distillation (DMD)
distills the multi-step process of traditional diffusion models into a single step, through a teacher-student setup
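A heavily simplified sketch of the DMD generator update: the gradient direction is the difference between a "fake" score model (trained on the generator's own outputs) and the frozen teacher's score; time weighting and noise schedules are omitted here.

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(x_gen, score_real, score_fake, t, noise):
    # noise the one-step generator output as in the diffusion forward process
    x_t = x_gen + t.view(-1, 1, 1, 1) * noise
    with torch.no_grad():
        grad = score_fake(x_t, t) - score_real(x_t, t)
    # surrogate whose gradient w.r.t. x_gen is exactly `grad`
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach(), reduction="sum")
```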
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
unified solution to generative modeling and domain transfer
simple approach to learning models to transport between two observed distributions
shortest paths between two points, increasingly straight paths
uses: image generation, image-to-image translation, and domain adaptation
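A minimal sketch of why straightness matters at sampling time: integrating dx/dt = v(x, t) with a plain Euler loop, where straighter learned paths tolerate fewer steps (InstaFlow below pushes this to one).

```python
import torch

@torch.no_grad()
def sample_rectified_flow(velocity_net, x0, steps=4):
    # Euler integration from noise (t=0) to data (t=1)
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t)
    return x

# stand-in velocity field just to show the call shape
net = lambda x, t: -x
sample = sample_rectified_flow(net, torch.randn(2, 3, 32, 32), steps=4)
```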
⚡InstaFlow One-Step Stable Diffusion with Rectified Flow
leverages pre-trained Stable Diffusion; one step = faster, 0.12s per image
can quickly generate low-resolution images to choose from: a fast previewer
can have ControlNet and LoRA
Boosting Latent Diffusion with Flow Matching
flow matching between the diffusion model and the convolutional decoder = high resolution at reduced computational cost
diffusion provides generation diversity, FM maps the small latent space to a high-dimensional one
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
biasing rectified flow models towards perceptually relevant scales
bidirectional flow of information between image and text tokens
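A hedged sketch of that bidirectional mixing: self-attention over the concatenation of image and text tokens (the real MM-DiT keeps separate per-modality projection weights; this collapses them for brevity).

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # one attention pass over both streams lets information flow both ways
        x = torch.cat([img_tokens, txt_tokens], dim=1)
        x, _ = self.attn(x, x, x)
        n = img_tokens.size(1)
        return x[:, :n], x[:, n:]

img, txt = torch.randn(2, 256, 512), torch.randn(2, 77, 512)
img_out, txt_out = JointAttention()(img, txt)
```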
Any-to-Any Generation via Composable Diffusion (audio, image, text)
SyncDiffusion Coherent Montage via Synchronized Joint Diffusions (synchronizes them) ==best==
RAPHAEL Text-to-Image Generation via Large Mixture of Diffusion Paths
mixture-of-experts (MoEs) layers, encompassing multiple nouns, adjectives, and verbs
trained on 1000 gpus for 2 months
DistriFusion Distributed Parallel Inference for High-Resolution Diffusion Models
uses multiple GPUs to accelerate diffusion models while keeping the output coherent
Training Data Protection with Compositional Diffusion Models (CDM); parallel training ==best==
method to train different diffusion models on distinct data and compose them at inference time
PanGu-Draw Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
novel latent diffusion model designed for resource-efficient training and multiple control signals
split structure and texture generators
cutting data preparation by 48% and reducing training resources by 51%
cooperatively use different latent spaces within a unified denoising process
multi-control image synthesis
Versatile Diffusion Text, Images and Variations All in One Diffusion Model
disentanglement of style and semantics, dual- and multi-context blending
generate similar expressions from reference text
UniDiffuser marginal, conditional, and joint diffusion
extra diffusion conditions; perturbs data in all modalities
image, text, text-to-image, image-to-text, and image-text pair generation
GigaGAN, Adobe implementation
StyleGAN-T, NVIDIA (fast text-to-image GAN)
diffusion as an alternative to GANs: DiffMorpher