:PROPERTIES:
:ID: f9437b93-c5a5-4cbb-8a66-51556df3d313
:END:
#+title: diffusion alternative
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:c7fe7e79-73d3-4cc7-a673-2c2e259ab5b5][stable_diffusion]]
- [[id:c0fc01e0-6db2-42c1-8053-55cddfcd496c][SEED]]: autoregressive
- [[https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6585][karlo]], [[https://github.com/Stability-AI/stablediffusion/blob/main/doc/UNCLIP.MD][stable]] karlo (image generation based on unCLIP)
- DeepFloyd is a Stable Diffusion-style image model that more or less replaced CLIP with a full LLM, closer to Google's Imagen model
  - it is a cascaded diffusion model conditioned on the T5 text encoder
- [[https://twitter.com/_akhaliq/status/1674280829382541312][Inversion]] by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
  - iterative restoration learned from paired low-quality and high-quality examples

* TRANSFORMERS
- [[https://arxiv.org/pdf/2301.00704.pdf][Muse]]: diffusion alternative; Masked Generative Transformers over discrete tokens, conditioned on T5 text embeddings
  - separate super-resolution stage
- [[https://arxiv.org/abs/2212.09748][transformers instead]] of a U-Net: [[https://github.com/facebookresearch/DiT][DiT]]
- [[https://arxiv.org/pdf/2303.00750.pdf][StraIT]]: Non-autoregressive Generation with Stratified Image Transformer

** GPT
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
  - [[https://github.com/FoundationVision/VAR][VAR]]: a visual generation method that elevates GPT-style models beyond diffusion
  - outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability

** DIFFUSION TRANSFORMER
:PROPERTIES:
:ID: c06a7c92-0abc-44c7-8a89-9ff92bc5d4e9
:END:
- [[https://twitter.com/_akhaliq/status/1732971872575271282][GenTron]]: Delving Deep into Diffusion Transformers for Image and Video Generation
- [[https://twitter.com/giffmana/status/1758228203540226350][Lucas Beyer]]: represent videos and images as collections of units of data called patches, akin to GPT tokens
  - this lets you train diffusion transformers on data of different durations, resolutions, and aspect ratios
- [[https://twitter.com/_akhaliq/status/1770668624392421512][ZigMa]]: Zigzag Mamba Diffusion Model
  - Mamba (state-space model) instead of a transformer

*** FIT TRANSFORMER
:PROPERTIES:
:ID: bbc5a347-bc62-4b5e-b659-1c6a57d6a2a5
:END:
- [[https://twitter.com/_akhaliq/status/1759826568539496554][FiT]]: Flexible Vision Transformer for Diffusion Model
  - architecture designed for generating images with unrestricted resolutions and aspect ratios
  - promotes resolution generalization by eliminating biases induced by image cropping

*** PIXART
- [[https://twitter.com/_akhaliq/status/1709055269165060345][PixArt-α]]: [[https://twitter.com/_akhaliq/status/1715038043495686545][Fast]] [[https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS][Training]] of Diffusion Transformer for Photorealistic Text-to-Image Synthesis ([[https://rentry.org/wgq4n][model]]) ==best==
  - training cost is only 10.8% of Stable Diffusion's; inference fits in [[https://github.com/huggingface/diffusers/pull/5814][less than]] 8 GB of VRAM (see the sketch below)
  - [[https://github.com/PixArt-alpha/PixArt-alpha][ControlNet]] and LCM support
- [[https://twitter.com/_akhaliq/status/1745284887068688822][PIXART-δ]]: Fast and Controllable Image Generation with Latent Consistency Models (another LCM + ControlNet variant)
- [[https://lemmy.dbzer0.com/post/16046400][PIXART-Σ]]: [[https://github.com/PixArt-alpha/PixArt-sigma][Weak-to-Strong]] Training of Diffusion Transformer for 4K Text-to-Image Generation
  - smaller (0.6B parameters) than SDXL (2.6B parameters) and Stable Cascade (5.1B parameters)
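A minimal PixArt-α inference sketch via Hugging Face ~diffusers~, assuming a recent release that ships ~PixArtAlphaPipeline~; the prompt and the offload choice are illustrative, not from the paper:

#+begin_src python
import torch
from diffusers import PixArtAlphaPipeline  # shipped in recent diffusers releases

# Load the 1024px PixArt-α checkpoint in fp16 to keep memory low
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)
# Offload submodules to CPU between uses; helps stay under ~8 GB of VRAM
pipe.enable_model_cpu_offload()

image = pipe("a photorealistic red fox in a snowy forest").images[0]
image.save("fox.png")
#+end_src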
*** SiT
:PROPERTIES:
:ID: 43b7488b-1536-41eb-baf2-a21dbb7defcf
:END:
- [[https://twitter.com/_akhaliq/status/1747846066848903619][SiT]]: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
  - Scalable Interpolant Transformers (SiT)
  - design axes: discrete vs. continuous-time learning, the objective the model learns, and the interpolant connecting the two distributions

** RWKV
- [[https://twitter.com/_akhaliq/status/1777539303100285313][Diffusion-RWKV]]: Scaling RWKV-Like Architectures for Diffusion Models
  - RWKV (an RNN-like architecture) instead of transformers

* STILL DIFFUSION
- [[https://arxiv.org/abs/2302.09778][Composer]]: better inpainting; trains semantic components independently
- [[https://arxiv.org/abs/2303.13714][High Fidelity]] Image Synthesis With Deep VAEs In Latent Space
  - hierarchical variational autoencoders (VAEs)
- [[https://arxiv.org/pdf/2304.04820.pdf][Binary Latent]] Diffusion: binary latent space, binary latent diffusion model; 1/3 of LDM parameters
  - ties the "probability" of the discrete representation to the probability of the dataset: variational inference itself
- [[https://arxiv.org/abs/2312.03701][Self-conditioned]] [[https://github.com/LTH14/rcg][Image Generation]] via Generating Representations ==best==
  - RCG: Representation-Conditioned image Generation
  - does not condition on any human annotations; instead conditions on a pre-trained encoder's representations, produced by a representation diffusion model (RDM)
- [[https://arxiv.org/pdf/2403.14944.pdf][CLIP-VQDiffusion]]: Language-Free Training of Text-to-Image Generation using CLIP and a vector-quantized diffusion model
  - trains against the CLIP image encoder, then swaps in the CLIP text encoder at test time
- [[https://huggingface.co/Tencent-Hunyuan/HunyuanDiT][HunyuanDiT]]: [[https://www.reddit.com/r/StableDiffusion/comments/1crorvv/hunyuandit_is_just_out_open_source_sd3like/][SD3-like]] text-to-image architecture (Diffusion Transformer) by Tencent (and about 5 times smaller)

** STABLE CASCADE
:PROPERTIES:
:ID: 28491008-6287-47c6-ac2e-ed22f862c997
:END:
- [[https://twitter.com/_akhaliq/status/1757430985732313167][Stable]] [[https://github.com/Stability-AI/StableCascade][Cascade]]: by Stability AI, a text-to-image model building upon the Würstchen architecture (inference sketch below)
  - works in a much smaller latent space: 42x compression vs. the usual 8x
  - the smaller the latent space, the faster inference runs and the cheaper training becomes
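A minimal sketch of Stable Cascade inference via ~diffusers~, assuming a release that ships the prior/decoder pipelines; prompt and step counts are illustrative. The prior denoises in the highly compressed latent space, the decoder pipeline expands its embeddings back to pixels:

#+begin_src python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "an astronaut riding a horse, studio lighting"

# Prior (Stage C): works in the tiny, 42x-compressed latent space
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
prior_out = prior(prompt=prompt, num_inference_steps=20)

# Decoder (Stages B/A): turns the compressed image embeddings back into pixels
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeddings=prior_out.image_embeddings.to(torch.float16),
    prompt=prompt,
    num_inference_steps=10,
).images[0]
image.save("cascade.png")
#+end_src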
** PERCEPTUAL LOSS
:PROPERTIES:
:ID: 6de49835-beb2-4ae3-8e2a-15f930724667
:END:
- [[https://twitter.com/_akhaliq/status/1742255547741544602][Diffusion]] Model with Perceptual Loss ==best==
  - the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
  - the diffusion model itself is a perceptual network (used in the training objective)
  - yields models capable of generating more realistic samples (at fewer steps)

** WITH LLM
- [[https://github.com/ai-forever/Kandinsky-2][Kandinsky 2]]: [[https://twitter.com/_akhaliq/status/1710106706569478573][image]] fusion, inpainting, open source (Apache)
  - uses XLM-RoBERTa-Large (an LLM): BERT-like, but with a byte-level BPE tokenizer
  - maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
- [[https://twitter.com/_akhaliq/status/1767030017978949914][ELLA]]: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
  - without training either the U-Net or the LLM, the two pre-trained models are bridged with a Timestep-Aware Semantic Connector module, which adapts semantic features at different stages of denoising
  - interprets lengthy and intricate prompts across the sampling timesteps

** FASTER
- [[https://twitter.com/_akhaliq/status/1664505785076908032][SnapFusion]]: [[https://huggingface.co/papers/2306.00980][Text-to-Image]] Diffusion Model on Mobile Devices within Two Seconds
  - two seconds on mobile devices, reducing the computation of the image decoder via data distillation
- [[https://twitter.com/_akhaliq/status/1719561227536355590][Beyond]] U: Making Diffusion Models Faster & Lighter
  - uses continuous dynamical systems to design a novel denoising network
  - 1/4 of the parameters and 30% fewer FLOPs than SD; 70% faster inference

*** ONE STEP DIFFUSION
:PROPERTIES:
:ID: 3c3b352c-c73e-49e2-8ddc-81a8569229a2
:END:
- Consistency Models: [[https://arxiv.org/pdf/2303.01469.pdf][consistency distillation]] vs [[https://github.com/openai/consistency_models][progressive]] [[https://github.com/cloneofsimo/consistency_models][distillation]]
- [[https://twitter.com/_akhaliq/status/1755085353155785110][Diffusion]] World Model (DWM) ==best==
  - long-horizon predictions in a single forward pass, eliminating the need for recursive queries
  - enables offline Q-learning with synthetic data
- distribution matching distillation ([[https://news.mit.edu/2024/ai-generates-high-quality-images-30-times-faster-single-step-0321][DMD]])
  - distills the multi-step process of traditional diffusion models into a single step via a teacher-student model

**** RECTIFIED FLOW
- [[https://github.com/gnobitab/RectifiedFlow][Flow Straight]] [[https://arxiv.org/abs/2209.03003][and Fast]]: Learning to Generate and Transfer Data with Rectified Flow (training sketch below)
  - unified solution to generative modeling and domain transfer
  - simple approach to learning models that transport between two observed distributions
  - shortest paths between two points: increasingly straight paths
  - uses: image generation, image-to-image translation, and domain adaptation
- [[https://github.com/gnobitab/InstaFlow][⚡InstaFlow]]! [[https://twitter.com/XingchaoL/status/1727355780901544398][One-Step]] [[https://twitter.com/XingchaoL/status/1731712300359553206][Stable]] Diffusion with Rectified Flow
  - leverages pre-trained Stable Diffusion; one step = faster, 0.12s per image
  - can quickly generate low-resolution candidates to choose from: a fast previewer
  - supports ControlNet and LoRA
- [[id:44943c87-ca5b-4604-840e-ff52993c1bf1][PERFLOW]]
- [[https://arxiv.org/abs/2312.07360][Boosting Latent]] Diffusion with Flow Matching
  - flow matching between the diffusion model and the convolutional decoder = high resolution with reduced computation
  - diffusion provides generation diversity; FM maps the small latent space to a high-dimensional one
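A toy sketch of the rectified-flow objective in PyTorch (names are illustrative, not the paper's code): pair noise samples x0 with data samples x1, interpolate linearly, and regress the model onto the constant velocity x1 - x0, which encourages straight transport paths:

#+begin_src python
import torch

def rectified_flow_loss(model, x1):
    """One training step of the rectified-flow objective (illustrative)."""
    x0 = torch.randn_like(x1)                      # source distribution (noise)
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    tb = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over data dims
    xt = (1 - tb) * x0 + tb * x1                   # straight-line interpolation
    v_target = x1 - x0                             # constant velocity along the line
    v_pred = model(xt, t)                          # network predicts the velocity
    return torch.nn.functional.mse_loss(v_pred, v_target)
#+end_src

Sampling then integrates dx/dt = v(x, t) from t = 0 to 1; the straighter the learned paths, the fewer Euler steps are needed, which is what one-step methods like InstaFlow exploit.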
***** STABLE DIFFUSION 3
:PROPERTIES:
:ID: fdbe1937-ac2f-4eb6-b617-8e48fca083e4
:END:
- [[https://twitter.com/_akhaliq/status/1764893921602068515][Stable Diffusion]] 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
  - biases rectified flow models towards perceptually relevant scales
  - bidirectional flow of information between image and text tokens

** MULTIPLE DIFFUSION :composable:
:PROPERTIES:
:ID: 6a66690f-b76f-441a-a093-3c83ca73af2d
:END:
- [[https://arxiv.org/pdf/2305.11846.pdf][Any-to-Any]] Generation via Composable Diffusion (audio, image, text)
- [[https://twitter.com/_akhaliq/status/1667033318590672896][SyncDiffusion]]: Coherent Montage via Synchronized Joint Diffusions (synchronizes the joint diffusions) ==best==
- [[https://huggingface.co/papers/2305.18295][RAPHAEL]]: [[https://raphael-painter.github.io/][Text-to-Image]] Generation via Large Mixture of Diffusion Paths
  - mixture-of-experts (MoE) layers, encompassing multiple nouns, adjectives, and verbs
  - trained on 1000 GPUs for 2 months
- [[https://github.com/mit-han-lab/distrifuser][DistriFusion]]: Distributed Parallel Inference for High-Resolution Diffusion Models
  - multiple GPUs to accelerate a diffusion model, with coherent output

*** COMPOSITIONAL DIFFUSION
:PROPERTIES:
:ID: d26589e5-f84a-4df0-9fcc-0524daeb7b1e
:END:
- [[https://twitter.com/_akhaliq/status/1688398350133940224][Training Data]] Protection with Compositional Diffusion Models (CDM); parallel training ==best==
  - a method to train different diffusion models on distinct data shards and compose them at inference time (toy sketch below)
- [[id:7c77fcdf-8b60-48dc-bb7a-11c9d6aad309][SEGMOE]]
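A toy sketch of the compose-at-inference idea (my illustration, not the CDM authors' code): one denoiser per data shard, with their noise/score predictions mixed at each sampling step; the uniform weighting is an assumption:

#+begin_src python
import torch

@torch.no_grad()
def composed_eps(models, xt, t, weights=None):
    """Mix per-shard noise predictions at a sampling step (illustrative)."""
    preds = [m(xt, t) for m in models]             # one denoiser per data shard
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)  # assumed: uniform mixture
    return sum(w * p for w, p in zip(weights, preds))
#+end_src

Dropping one shard's model from the mixture approximately removes that shard's influence on generations, which is the data-protection angle.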
**** PANGU
- [[https://twitter.com/_akhaliq/status/1740575242798465309][PanGu-Draw]]: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
  - novel latent diffusion model designed for resource efficiency and multiple control signals
  - splits structure and texture generators
  - cuts data preparation by 48% and reduces training resources by 51%
  - cooperatively uses different latent spaces within a unified denoising process
  - multi-control image synthesis

** MULTIMODAL DIFFUSION
- Versatile [[https://github.com/SHI-Labs/Versatile-Diffusion][Diffusion]]: Text, Images and Variations All in One Diffusion Model
  - disentanglement of style and semantics, dual- and multi-context blending
  - generates similar expressions from reference text
- [[https://github.com/thu-ml/unidiffuser][unidiffuser]]: marginal, conditional, and joint diffusion ([[https://ml.cs.tsinghua.edu.cn/diffusion/unidiffuser.pdf][paper]], [[https://arxiv.org/abs/2303.06555][arxiv]])
  - extra diffusion conditions; perturbs data in all modalities
  - image, text, text-to-image, image-to-text, and image-text pair generation

* GAN
:PROPERTIES:
:ID: a9581c97-2976-4b91-a9f2-567fe0149698
:END:
- [[https://mingukkang.github.io/GigaGAN/][GigaGAN]]: Adobe; community [[https://github.com/lucidrains/gigagan-pytorch][implementation]]
- [[https://www.youtube.com/watch?v=qnHbGXmGJCM][StyleGAN-T]]: NVIDIA's fast text-to-image GAN
- diffusion as an alternative to GANs: [[id:60a63fe6-8088-4b2b-af55-f1d5e23e804b][DIFFMORPHER]]

* BETTER DECODER
:PROPERTIES:
:ID: 1f239b1d-dca2-468b-87a2-878e44688e73
:END:
- [[https://twitter.com/_akhaliq/status/1666633498558361600][Designing]] a Better Asymmetric VQGAN for StableDiffusion **better vqgan**
  - only a new asymmetric decoder needs retraining for vanilla SD; better text rendering
- k-diffusion: [[https://twitter.com/Birchlabs/status/1728925730678161602][OpenAI's]] consistency decoder (HF model) as a k-diffusion v-prediction denoiser
  - supports n>2 step sampling
- sdxl-diffusion-decoder
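One way to try a better decoder as a drop-in swap, assuming a ~diffusers~ release that ships ~ConsistencyDecoderVAE~ (the prompt is illustrative); this replaces the SD 1.5 VAE decoder with OpenAI's consistency decoder:

#+begin_src python
import torch
from diffusers import ConsistencyDecoderVAE, StableDiffusionPipeline

# OpenAI's consistency decoder, packaged as a drop-in VAE replacement
vae = ConsistencyDecoderVAE.from_pretrained(
    "openai/consistency-decoder", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

image = pipe("a close-up portrait next to a sign with legible text").images[0]
image.save("consistency_decoded.png")
#+end_src

The rest of the pipeline is untouched: only latent-to-pixel decoding changes, which is why decoder swaps like this (or the asymmetric VQGAN above) need no U-Net retraining.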