:PROPERTIES:
:ID: f9437b93-c5a5-4cbb-8a66-51556df3d313
:END:
#+title: diffusion alternative
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:c7fe7e79-73d3-4cc7-a673-2c2e259ab5b5][stable_diffusion]]
- [[id:c0fc01e0-6db2-42c1-8053-55cddfcd496c][SEED]]: autoregressive
- [[https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6585][karlo]], [[https://github.com/Stability-AI/stablediffusion/blob/main/doc/UNCLIP.MD][stable]] karlo (image generation based on unCLIP)
- DeepFloyd is a Stable Diffusion-style image model that more or less replaced CLIP with a full LLM, closer to Google's Imagen model
  - it is a cascaded diffusion model conditioned on the T5 text encoder
- [[https://twitter.com/_akhaliq/status/1674280829382541312][Inversion]] by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
  - iterative restoration learned from paired low-quality and high-quality examples

* TRANSFORMERS
- [[https://arxiv.org/pdf/2301.00704.pdf][Muse]]: diffusion alternative; Masked Generative Transformers over discrete tokens, conditioned on T5 text embeddings
  - separate super-resolution stage
- [[https://arxiv.org/abs/2212.09748][transformers instead]] of a U-Net: [[https://github.com/facebookresearch/DiT][DiT]]
- [[https://arxiv.org/pdf/2303.00750.pdf][StraIT]]: Non-autoregressive Generation with Stratified Image Transformer

** GPT
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
  - [[https://github.com/FoundationVision/VAR][VAR]]: a visual generation method that elevates GPT-style models beyond diffusion
  - outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability

** DIFFUSION TRANSFORMER
:PROPERTIES:
:ID: c06a7c92-0abc-44c7-8a89-9ff92bc5d4e9
:END:
- [[https://twitter.com/_akhaliq/status/1732971872575271282][GenTron]]: Delving Deep into Diffusion Transformers for Image and Video Generation
- [[https://twitter.com/giffmana/status/1758228203540226350][Lucas Beyer]]: represent videos and images as collections of units of data called patches, akin to GPT tokens
  - this lets you train diffusion transformers on data of different durations, resolutions, and aspect ratios
- [[https://twitter.com/_akhaliq/status/1770668624392421512][ZigMa]]: Zigzag Mamba Diffusion Model
  - Mamba (state-space model) instead of a transformer

*** FIT TRANSFORMER
:PROPERTIES:
:ID: bbc5a347-bc62-4b5e-b659-1c6a57d6a2a5
:END:
- [[https://twitter.com/_akhaliq/status/1759826568539496554][FiT]]: Flexible Vision Transformer for Diffusion Model
  - architecture designed for generating images with unrestricted resolutions and aspect ratios
  - promotes resolution generalization by eliminating biases induced by image cropping

*** PIXART
- [[https://twitter.com/_akhaliq/status/1709055269165060345][PixArt-α]]: [[https://twitter.com/_akhaliq/status/1715038043495686545][Fast]] [[https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS][Training]] of Diffusion Transformer for Photorealistic Text-to-Image Synthesis ([[https://rentry.org/wgq4n][model]]) ==best==
  - training cost is only 10.8% of Stable Diffusion's; inference fits in [[https://github.com/huggingface/diffusers/pull/5814][less than]] 8 GB of VRAM (see the sketch below)
  - [[https://github.com/PixArt-alpha/PixArt-alpha][ControlNet]] and LCM support
- [[https://twitter.com/_akhaliq/status/1745284887068688822][PIXART-δ]]: Fast and Controllable Image Generation with Latent Consistency Models (another LCM + ControlNet variant)
- [[https://lemmy.dbzer0.com/post/16046400][PIXART-Σ]]: [[https://github.com/PixArt-alpha/PixArt-sigma][Weak-to-Strong]] Training of Diffusion Transformer for 4K Text-to-Image Generation
  - smaller (0.6B parameters) than SDXL (2.6B parameters) and Stable Cascade (5.1B parameters)
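A minimal PixArt-α inference sketch via Hugging Face ~diffusers~, assuming a recent release that ships ~PixArtAlphaPipeline~; the prompt and the offload choice are illustrative, not from the paper:

#+begin_src python
import torch
from diffusers import PixArtAlphaPipeline  # shipped in recent diffusers releases

# Load the 1024px PixArt-α checkpoint in fp16 to keep memory low
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)
# Offload submodules to CPU between uses; helps stay under ~8 GB of VRAM
pipe.enable_model_cpu_offload()

image = pipe("a photorealistic red fox in a snowy forest").images[0]
image.save("fox.png")
#+end_src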
*** SiT
:PROPERTIES:
:ID: 43b7488b-1536-41eb-baf2-a21dbb7defcf
:END:
- [[https://twitter.com/_akhaliq/status/1747846066848903619][SiT]]: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
  - Scalable Interpolant Transformers (SiT)
  - design axes: discrete vs. continuous-time learning, the objective the model learns, and the interpolant connecting the two distributions

** RWKV
- [[https://twitter.com/_akhaliq/status/1777539303100285313][Diffusion-RWKV]]: Scaling RWKV-Like Architectures for Diffusion Models
  - RWKV (an RNN-like architecture) instead of transformers

* STILL DIFFUSION
- [[https://arxiv.org/abs/2302.09778][Composer]]: better inpainting; trains semantic components independently
- [[https://arxiv.org/abs/2303.13714][High Fidelity]] Image Synthesis With Deep VAEs In Latent Space
  - hierarchical variational autoencoders (VAEs)
- [[https://arxiv.org/pdf/2304.04820.pdf][Binary Latent]] Diffusion: binary latent space, binary latent diffusion model; 1/3 of LDM parameters
  - ties the "probability" of the discrete representation to the probability of the dataset: variational inference itself
- [[https://arxiv.org/abs/2312.03701][Self-conditioned]] [[https://github.com/LTH14/rcg][Image Generation]] via Generating Representations ==best==
  - RCG: Representation-Conditioned image Generation
  - does not condition on any human annotations; instead conditions on a pre-trained encoder's representations, produced by a representation diffusion model (RDM)
- [[https://arxiv.org/pdf/2403.14944.pdf][CLIP-VQDiffusion]]: Language-Free Training of Text-to-Image Generation using CLIP and a vector-quantized diffusion model
  - trains against the CLIP image encoder, then swaps in the CLIP text encoder at test time
- [[https://huggingface.co/Tencent-Hunyuan/HunyuanDiT][HunyuanDiT]]: [[https://www.reddit.com/r/StableDiffusion/comments/1crorvv/hunyuandit_is_just_out_open_source_sd3like/][SD3-like]] text-to-image architecture (Diffusion Transformer) by Tencent (and about 5 times smaller)

** STABLE CASCADE
:PROPERTIES:
:ID: 28491008-6287-47c6-ac2e-ed22f862c997
:END:
- [[https://twitter.com/_akhaliq/status/1757430985732313167][Stable]] [[https://github.com/Stability-AI/StableCascade][Cascade]]: by Stability AI, a text-to-image model building upon the Würstchen architecture (inference sketch below)
  - works in a much smaller latent space: 42x compression vs. the usual 8x
  - the smaller the latent space, the faster inference runs and the cheaper training becomes
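A minimal sketch of Stable Cascade inference via ~diffusers~, assuming a release that ships the prior/decoder pipelines; prompt and step counts are illustrative. The prior denoises in the highly compressed latent space, the decoder pipeline expands its embeddings back to pixels:

#+begin_src python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "an astronaut riding a horse, studio lighting"

# Prior (Stage C): works in the tiny, 42x-compressed latent space
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
prior_out = prior(prompt=prompt, num_inference_steps=20)

# Decoder (Stages B/A): turns the compressed image embeddings back into pixels
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeddings=prior_out.image_embeddings.to(torch.float16),
    prompt=prompt,
    num_inference_steps=10,
).images[0]
image.save("cascade.png")
#+end_src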
** PERCEPTUAL LOSS
:PROPERTIES:
:ID: 6de49835-beb2-4ae3-8e2a-15f930724667
:END:
- [[https://twitter.com/_akhaliq/status/1742255547741544602][Diffusion]] Model with Perceptual Loss ==best==
  - the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
  - the diffusion model itself is a perceptual network (used in the training objective)
  - yields models capable of generating more realistic samples (at fewer steps)

** WITH LLM
- [[https://github.com/ai-forever/Kandinsky-2][Kandinsky 2]]: [[https://twitter.com/_akhaliq/status/1710106706569478573][image]] fusion, inpainting, open source (Apache)
  - uses XLM-RoBERTa-Large (an LLM): BERT-like, but with a byte-level BPE tokenizer
  - maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
- [[https://twitter.com/_akhaliq/status/1767030017978949914][ELLA]]: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
  - without training either the U-Net or the LLM, the two pre-trained models are bridged with a Timestep-Aware Semantic Connector module, which adapts semantic features at different stages of denoising
  - interprets lengthy and intricate prompts across the sampling timesteps

** FASTER
- [[https://twitter.com/_akhaliq/status/1664505785076908032][SnapFusion]]: [[https://huggingface.co/papers/2306.00980][Text-to-Image]] Diffusion Model on Mobile Devices within Two Seconds
  - two seconds on mobile devices, reducing the computation of the image decoder via data distillation
- [[https://twitter.com/_akhaliq/status/1719561227536355590][Beyond]] U: Making Diffusion Models Faster & Lighter
  - uses continuous dynamical systems to design a novel denoising network
  - 1/4 of the parameters and 30% fewer FLOPs than SD; 70% faster inference

*** ONE STEP DIFFUSION
:PROPERTIES:
:ID: 3c3b352c-c73e-49e2-8ddc-81a8569229a2
:END:
- Consistency Models: [[https://arxiv.org/pdf/2303.01469.pdf][consistency distillation]] vs [[https://github.com/openai/consistency_models][progressive]] [[https://github.com/cloneofsimo/consistency_models][distillation]]
- [[https://twitter.com/_akhaliq/status/1755085353155785110][Diffusion]] World Model (DWM) ==best==
  - long-horizon predictions in a single forward pass, eliminating the need for recursive queries
  - enables offline Q-learning with synthetic data
- distribution matching distillation ([[https://news.mit.edu/2024/ai-generates-high-quality-images-30-times-faster-single-step-0321][DMD]])
  - distills the multi-step process of traditional diffusion models into a single step via a teacher-student model

**** RECTIFIED FLOW
- [[https://github.com/gnobitab/RectifiedFlow][Flow Straight]] [[https://arxiv.org/abs/2209.03003][and Fast]]: Learning to Generate and Transfer Data with Rectified Flow (training sketch below)
  - unified solution to generative modeling and domain transfer
  - simple approach to learning models that transport between two observed distributions
  - shortest paths between two points: increasingly straight paths
  - uses: image generation, image-to-image translation, and domain adaptation
- [[https://github.com/gnobitab/InstaFlow][⚡InstaFlow]]! [[https://twitter.com/XingchaoL/status/1727355780901544398][One-Step]] [[https://twitter.com/XingchaoL/status/1731712300359553206][Stable]] Diffusion with Rectified Flow
  - leverages pre-trained Stable Diffusion; one step = faster, 0.12s per image
  - can quickly generate low-resolution candidates to choose from: a fast previewer
  - supports ControlNet and LoRA
- [[id:44943c87-ca5b-4604-840e-ff52993c1bf1][PERFLOW]]
- [[https://arxiv.org/abs/2312.07360][Boosting Latent]] Diffusion with Flow Matching
  - flow matching between the diffusion model and the convolutional decoder = high resolution with reduced computation
  - diffusion provides generation diversity; FM maps the small latent space to a high-dimensional one
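A toy sketch of the rectified-flow objective in PyTorch (names are illustrative, not the paper's code): pair noise samples x0 with data samples x1, interpolate linearly, and regress the model onto the constant velocity x1 - x0, which encourages straight transport paths:

#+begin_src python
import torch

def rectified_flow_loss(model, x1):
    """One training step of the rectified-flow objective (illustrative)."""
    x0 = torch.randn_like(x1)                      # source distribution (noise)
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    tb = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over data dims
    xt = (1 - tb) * x0 + tb * x1                   # straight-line interpolation
    v_target = x1 - x0                             # constant velocity along the line
    v_pred = model(xt, t)                          # network predicts the velocity
    return torch.nn.functional.mse_loss(v_pred, v_target)
#+end_src

Sampling then integrates dx/dt = v(x, t) from t = 0 to 1; the straighter the learned paths, the fewer Euler steps are needed, which is what one-step methods like InstaFlow exploit.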
***** STABLE DIFFUSION 3
:PROPERTIES:
:ID: fdbe1937-ac2f-4eb6-b617-8e48fca083e4
:END:
- [[https://twitter.com/_akhaliq/status/1764893921602068515][Stable Diffusion]] 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
  - biases rectified flow models towards perceptually relevant scales
  - bidirectional flow of information between image and text tokens

** MULTIPLE DIFFUSION :composable:
:PROPERTIES:
:ID: 6a66690f-b76f-441a-a093-3c83ca73af2d
:END:
- [[https://arxiv.org/pdf/2305.11846.pdf][Any-to-Any]] Generation via Composable Diffusion (audio, image, text)
- [[https://twitter.com/_akhaliq/status/1667033318590672896][SyncDiffusion]]: Coherent Montage via Synchronized Joint Diffusions (synchronizes the joint diffusions) ==best==
- [[https://huggingface.co/papers/2305.18295][RAPHAEL]]: [[https://raphael-painter.github.io/][Text-to-Image]] Generation via Large Mixture of Diffusion Paths
  - mixture-of-experts (MoE) layers, encompassing multiple nouns, adjectives, and verbs
  - trained on 1000 GPUs for 2 months
- [[https://github.com/mit-han-lab/distrifuser][DistriFusion]]: Distributed Parallel Inference for High-Resolution Diffusion Models
  - multiple GPUs to accelerate a diffusion model, with coherent output

*** COMPOSITIONAL DIFFUSION
:PROPERTIES:
:ID: d26589e5-f84a-4df0-9fcc-0524daeb7b1e
:END:
- [[https://twitter.com/_akhaliq/status/1688398350133940224][Training Data]] Protection with Compositional Diffusion Models (CDM); parallel training ==best==
  - a method to train different diffusion models on distinct data shards and compose them at inference time (toy sketch below)
- [[id:7c77fcdf-8b60-48dc-bb7a-11c9d6aad309][SEGMOE]]
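A toy sketch of the compose-at-inference idea (my illustration, not the CDM authors' code): one denoiser per data shard, with their noise/score predictions mixed at each sampling step; the uniform weighting is an assumption:

#+begin_src python
import torch

@torch.no_grad()
def composed_eps(models, xt, t, weights=None):
    """Mix per-shard noise predictions at a sampling step (illustrative)."""
    preds = [m(xt, t) for m in models]             # one denoiser per data shard
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)  # assumed: uniform mixture
    return sum(w * p for w, p in zip(weights, preds))
#+end_src

Dropping one shard's model from the mixture approximately removes that shard's influence on generations, which is the data-protection angle.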
**** PANGU
- [[https://twitter.com/_akhaliq/status/1740575242798465309][PanGu-Draw]]: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
  - novel latent diffusion model designed for resource efficiency and multiple control signals
  - splits structure and texture generators
  - cuts data preparation by 48% and reduces training resources by 51%
  - cooperatively uses different latent spaces within a unified denoising process
  - multi-control image synthesis

** MULTIMODAL DIFFUSION
- Versatile [[https://github.com/SHI-Labs/Versatile-Diffusion][Diffusion]]: Text, Images and Variations All in One Diffusion Model
  - disentanglement of style and semantics, dual- and multi-context blending
  - generates similar expressions from reference text
- [[https://github.com/thu-ml/unidiffuser][unidiffuser]]: marginal, conditional, and joint diffusion ([[https://ml.cs.tsinghua.edu.cn/diffusion/unidiffuser.pdf][paper]], [[https://arxiv.org/abs/2303.06555][arxiv]])
  - extra diffusion conditions; perturbs data in all modalities
  - image, text, text-to-image, image-to-text, and image-text pair generation

* GAN
:PROPERTIES:
:ID: a9581c97-2976-4b91-a9f2-567fe0149698
:END:
- [[https://mingukkang.github.io/GigaGAN/][GigaGAN]]: Adobe; community [[https://github.com/lucidrains/gigagan-pytorch][implementation]]
- [[https://www.youtube.com/watch?v=qnHbGXmGJCM][StyleGAN-T]]: NVIDIA's fast text-to-image GAN
- diffusion as an alternative to GANs: [[id:60a63fe6-8088-4b2b-af55-f1d5e23e804b][DIFFMORPHER]]

* BETTER DECODER
:PROPERTIES:
:ID: 1f239b1d-dca2-468b-87a2-878e44688e73
:END:
- [[https://twitter.com/_akhaliq/status/1666633498558361600][Designing]] a Better Asymmetric VQGAN for StableDiffusion **better vqgan**
  - only a new asymmetric decoder needs retraining for vanilla SD; better text rendering
- k-diffusion: [[https://twitter.com/Birchlabs/status/1728925730678161602][OpenAI's]] consistency decoder (HF model) as a k-diffusion v-prediction denoiser
  - supports n>2 step sampling
- sdxl-diffusion-decoder
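One way to try a better decoder as a drop-in swap, assuming a ~diffusers~ release that ships ~ConsistencyDecoderVAE~ (the prompt is illustrative); this replaces the SD 1.5 VAE decoder with OpenAI's consistency decoder:

#+begin_src python
import torch
from diffusers import ConsistencyDecoderVAE, StableDiffusionPipeline

# OpenAI's consistency decoder, packaged as a drop-in VAE replacement
vae = ConsistencyDecoderVAE.from_pretrained(
    "openai/consistency-decoder", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

image = pipe("a close-up portrait next to a sign with legible text").images[0]
image.save("consistency_decoded.png")
#+end_src

The rest of the pipeline is untouched: only latent-to-pixel decoding changes, which is why decoder swaps like this (or the asymmetric VQGAN above) need no U-Net retraining.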