parent: stable_diffusiontrain
BETTER DECODER; blue noise: NOISE CONTROL
400x (and use VAE tiling to make big images)
Diffusers-Compatible SDXL UNet Rewrite (520 lines)
ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
scales the coefficients of long skip connections (LSCs, which connect distant blocks) in the UNet to improve training stability (sketch below)
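A minimal sketch of the idea, assuming a toy two-level UNet: damp each long skip connection with a constant coefficient before merging it. The architecture and kappa=0.7 are illustrative, not the paper's tuned setup.

```python
import torch
import torch.nn as nn

class ScaledSkipUNet(nn.Module):
    """Toy UNet where each long skip connection is multiplied by a constant
    coefficient kappa before concatenation, damping the LSC contribution
    for more stable training. kappa=0.7 is illustrative, not a tuned value."""
    def __init__(self, ch=64, kappa=0.7):
        super().__init__()
        self.kappa = kappa
        self.down1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)
        self.up2 = nn.ConvTranspose2d(4 * ch, ch, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(2 * ch, 3, 4, stride=2, padding=1)

    def forward(self, x):
        h1 = torch.relu(self.down1(x))
        h2 = torch.relu(self.down2(h1))
        m = torch.relu(self.mid(h2))
        # scale the long skip connections instead of passing them unchanged
        u2 = torch.relu(self.up2(torch.cat([m, self.kappa * h2], dim=1)))
        return self.up1(torch.cat([u2, self.kappa * h1], dim=1))
```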
Cas-DM: Bringing Metric Functions into Diffusion Models (incorporates additional metric functions as training objectives)
Quantum Denoising Diffusion Models
explores integrating variational quantum circuits to improve the efficacy of diffusion models
MPI: Masked Pre-trained Model Enables Universal Zero-shot Denoiser
masked pre-training spontaneously yields strong image denoising ability
Simplified Diffusion Schrödinger Bridge
simplification of the Diffusion Schrödinger Bridge (DSB) that facilitates its unification with Score-based Generative Models (SGMs)
RL for Consistency Models: Faster Reward-Guided Text-to-Image Generation
fine-tunes via RL to optimize task-specific rewards while keeping training and inference fast
Reinforcement Learning for Consistency Models (RLCM)
handles objectives that are hard to specify via prompting, such as image compressibility and human feedback (sketch below)
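A heavily hedged sketch of the reward-guided idea using plain reward-weighted regression; this is NOT RLCM's policy-gradient formulation, and `consistency_model` (noise -> image) and `reward_fn` (images -> scores) are hypothetical stand-ins.

```python
import torch

def reward_weighted_step(consistency_model, reward_fn, optimizer, batch=8):
    """One simplified fine-tuning step: regress the model toward its own
    high-reward samples (generic reward-weighted regression, not RLCM)."""
    z = torch.randn(batch, 3, 64, 64)                       # initial noise
    with torch.no_grad():
        samples = consistency_model(z)                      # one-step generation
        weights = torch.softmax(reward_fn(samples), dim=0)  # (batch,) weights
    pred = consistency_model(z)                             # differentiable pass
    per_sample = ((pred - samples) ** 2).mean(dim=(1, 2, 3))
    loss = (weights * per_sample).sum()                     # high reward -> high weight
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```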
LP-DiF: Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning
continuously learn new classes without forgetting old ones
S2-DMs: Skip-Step Diffusion Models
a new training objective, L_skip, designed to reintegrate information omitted during the selective sampling phase
Switch EMA: A Free Lunch for Better Flatness and Sharpness
switches the EMA parameters back into the original model after each epoch, dubbed Switch EMA (SEMA)
a free lunch that boosts convergence speed (sketch below)
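A minimal sketch of the SEMA loop as described above: keep an EMA copy of the weights, then load it back into the live model at each epoch boundary. The decay value is illustrative.

```python
import copy
import torch

def train_with_switch_ema(model, loader, optimizer, loss_fn, epochs, decay=0.999):
    """Standard training with an EMA shadow copy; after every epoch the EMA
    weights are switched back into the live model (Switch EMA / SEMA)."""
    ema = copy.deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    for epoch in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # usual EMA update after each optimizer step
            with torch.no_grad():
                for p_ema, p in zip(ema.parameters(), model.parameters()):
                    p_ema.mul_(decay).add_(p, alpha=1 - decay)
        # the "switch": copy EMA weights into the trained model each epoch
        model.load_state_dict(ema.state_dict())
```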
Rolling Diffusion Model (VIDEO)
a sliding-window denoising process
assigns more noise to frames that appear later in the sequence (sketch below)
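A sketch of the frame-dependent corruption, assuming a simple linear local-time schedule (the paper's exact parameterization may differ): frames later in the window get higher noise levels.

```python
import torch

def rolling_noise_levels(num_frames, global_t):
    """Per-frame noise levels for one sliding window: later frames get more
    noise. Linear local-time sketch; global_t is a scalar in [0, 1)."""
    k = torch.arange(num_frames)
    return (global_t + k / num_frames) % 1.0        # per-frame t in [0, 1)

def noise_window(frames, global_t):
    """Corrupt a (num_frames, C, H, W) window with frame-dependent noise."""
    t = rolling_noise_levels(frames.shape[0], global_t).view(-1, 1, 1, 1)
    noise = torch.randn_like(frames)
    # simple linear interpolation between clean frames and pure noise
    return (1 - t) * frames + t * noise, t
```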
Fixed Point Diffusion Models
reallocates computation across timesteps and reuses fixed-point solutions between timesteps (sketch below)
87% fewer parameters, consumes 60% less training memory
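A sketch of the reuse idea: warm-start each timestep's fixed-point solve from the previous solution so later solves converge in fewer iterations. `f` (the implicit layer), `step_fn` (the denoising update), and `z_init` are hypothetical stand-ins, not the paper's API.

```python
import torch

def fixed_point_solve(g, z0, max_iters=20, tol=1e-4):
    """Naive fixed-point iteration z <- g(z), starting from z0."""
    z = z0
    for _ in range(max_iters):
        z_next = g(z)
        if (z_next - z).norm() <= tol * z.norm().clamp(min=1e-8):
            return z_next
        z = z_next
    return z

def sample_with_reuse(f, x_T, timesteps, step_fn, z_init):
    """Reuse the previous timestep's fixed-point solution as the next
    timestep's initial guess (warm start)."""
    x, z = x_T, z_init
    for t in timesteps:
        z = fixed_point_solve(lambda zz: f(zz, x, t), z)  # warm start from old z
        x = step_fn(x, z, t)                              # one denoising update
    return x
```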
Analyzing and Improving the Training Dynamics of Diffusion Models
redesigns the network, yielding better networks at equal computational complexity
precise tuning of EMA length without the cost of performing several training runs
ConPreDiff: Improving Diffusion-Based Image Synthesis with Context Prediction (better zero-shot)
Any-Shift Prompting for Generalization over Distributions
encodes distribution information and the relationships between distributions
to guide the generalization of the CLIP image-language model from training to any test distribution
Structure-Preserving Diffusion Models
result: if you rotate the input, the output rotates accordingly; the model learns equivariant structure
Deconstructing Denoising Diffusion Models for Self-Supervised Learning
gradually transforms a Denoising Diffusion Model (DDM) into a classical Denoising Autoencoder (DAE)
FLAWED: the VAE used for Stable Diffusion 1.x/2.x and other models (KL-F8) has a critical flaw, probably due to bad training; it needs a new one trained from scratch, as with SDXL ==best==
the encoder has to do a lot of extra work to get around the bad latent space
distributed-diffusion: using Hivemind (distributed training) vs DeepSpeed
SiT: discrete transformers
4/8-bit models: Q-Diffusion, quantization insight (reddit)
Memory-Efficient Personalization using Quantized Diffusion Model (enhancing it)
Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
aligns outputs of the quantized model and the full-precision model at different network granularities (sketch below)
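A sketch of the alignment idea at block granularity, assuming matched lists of full-precision and quantized blocks (`fp_blocks`/`q_blocks` are hypothetical; only the quantization parameters of the quantized blocks should be trainable).

```python
import torch
import torch.nn.functional as F

def align_blockwise(fp_blocks, q_blocks, calib_batches, optimizer):
    """Post-training alignment sketch: feed calibration data through matched
    full-precision and quantized blocks and minimize the output gap."""
    for x in calib_batches:
        h_fp, h_q, loss = x, x, 0.0
        for fp_blk, q_blk in zip(fp_blocks, q_blocks):
            with torch.no_grad():
                h_fp = fp_blk(h_fp)                  # reference activations
            h_q = q_blk(h_q)
            loss = loss + F.mse_loss(h_q, h_fp)      # align at block granularity
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```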
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
finetuning the quantized model to better adapt to the activation distribution (mitigation)
Task-Oriented Diffusion Model Compression
satisfactory output quality with 39.2% and 56.4% reductions in model footprint, and 81.4% and 68.7%
when applied to InstructPix2Pix and StableSR, respectively
GIT RE-BASIN: MERGING MODELS MODULO PERMUTATION SYMMETRIES
transfers knowledge from a teacher to a student model
Idempotent Generative Network
f(f(z)) = f(z); can generate an output in one step
a step towards a "global projector": projecting any input into a target data distribution (sketch below)
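A sketch of two of the objectives implied by f(f(z)) = f(z); the paper's full objective also includes a "tightness" term with careful gradient stopping, omitted here for brevity.

```python
import torch

def ign_loss_sketch(f, x_real, z, lam=1.0):
    """Idempotent-generator sketch: real data should be fixed points
    (f(x) ~ x) and the map should be idempotent on noise (f(f(z)) ~ f(z))."""
    rec = ((f(x_real) - x_real) ** 2).mean()        # data are fixed points
    fz = f(z)
    idem = ((f(fz) - fz.detach()) ** 2).mean()      # pull f(f(z)) toward f(z)
    return rec + lam * idem
```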
uform: CLIP not required, trained in a day
cloneofsimo: learning from the CLIP
wanna perform affordable kernel regression on l2-normalized data?
get yourself Spherical Random Features for Polynomial Kernels
relevant if you are aiming for large scale non-parametric regression on CLIP projected feature spaces
Efficient Diffusion Training via Min-SNR Weighting Strategy
addresses slow convergence caused by conflicting optimization directions between timesteps; 3.4 times faster (sketch below)
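A minimal sketch of the Min-SNR weight for epsilon-prediction, w(t) = min(SNR(t), gamma) / SNR(t) with SNR(t) = abar_t / (1 - abar_t); gamma = 5 follows the paper, and `alphas_cumprod` is the usual DDPM schedule.

```python
import torch

def min_snr_weights(alphas_cumprod, t, gamma=5.0):
    """Min-SNR loss weights for epsilon-prediction:
    w(t) = min(SNR(t), gamma) / SNR(t)."""
    snr = alphas_cumprod[t] / (1.0 - alphas_cumprod[t])
    return torch.clamp(snr, max=gamma) / snr

# usage in a DDPM-style step (eps_pred, eps: (B, C, H, W); t: (B,) long):
# w = min_snr_weights(alphas_cumprod, t)
# loss = (w * ((eps_pred - eps) ** 2).mean(dim=(1, 2, 3))).mean()
```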
Imagen suggests that scaling the text encoder is much more impactful than scaling the UNet
at least for diffusion models
MosaicML: custom $50k Stable Diffusion training, reddit post
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
compressed-stable-diffusion: 36% reduction in parameters and latency
Wuerstchen: Efficient Pretraining of Text-to-Image Models
16 times faster to train, 2 times faster inference, only 9,200 GPU hours (42x compression rate vs 8x for SD)
DREAM: Diffusion Rectification and Estimation-Adaptive Models (requires minimal code changes)
2 to 3 times faster training convergence
PERCEPTUAL LOSS ==best==
LCM ==best==
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
integrates diffusion with a GAN objective for one-step generation
faster by modeling data as electric charges (Poisson Flow Generative Models) https://www.assemblyai.com/blog/an-introduction-to-poisson-flow-generative-models/
better than inference: https://twitter.com/_akhaliq/status/1620958983639924736 https://arxiv.org/pdf/2302.00482.pdf
Spectral Diffusion: a slim Standard Diffusion, 20 times smaller in size
Wavelet Diffusion Models are Fast and Scalable Image Generators
Score-Based Diffusion Models as Principled Priors for Inverse Imaging (more complex priors)
Shifted Diffusion (==Corgi==) for Text-to-Image Generation: from CLIP straight to diffusion, ==only 1.7% of the images required captions==
Object Detection: CutLER
D3S: Invariant Learning via Diffusion Dreamed Distribution Shifts; separates foreground from background
disentangles foreground from background by cutting and pasting them into the synthetic training dataset (sketch below)
like SVDiff
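A minimal sketch of the cut-and-paste compositing, assuming a segmentation mask is already available: paste a segmented foreground onto an unrelated background so the two vary independently in the synthetic dataset.

```python
import numpy as np

def cut_and_paste(fg_img, fg_mask, bg_img):
    """Composite a segmented foreground onto an unrelated background.
    fg_img/bg_img: (H, W, 3) uint8 arrays; fg_mask: (H, W) boolean array."""
    out = bg_img.copy()
    out[fg_mask] = fg_img[fg_mask]   # paste foreground pixels over background
    return out
```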
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
automatic captioning is better than crawled low-quality captions
CapsFusion: Rethinking Image-Text Data at Scale
progress is hindered by simplistic captioners; consolidates and refines caption information
Structure-Guided Adversarial Training of Diffusion Models
compel the model to learn manifold structures between samples in each training batch
Neural Congealing: Aligning Images to a Joint Semantic Atlas
zero-shot learning of concept shapes
ASIC: Aligning Sparse in-the-wild Image Collections
Ablating Concepts in Text-to-Image Diffusion Models (Adobe)
masking to accelerate learning: VQ-Diffusion https://arxiv.org/pdf/2111.14822.pdf
DeepMIM: Deep Supervision for Masked Image Modeling
pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme.
Predicting masked tokens in stochastic locations improves masked image modeling
learns features that are more robust to location uncertainty; Masked Image Modeling (MIM) (sketch below)
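A minimal mask-and-predict setup for MIM, masking random patches at the pixel level; patch size and mask ratio are illustrative, and the prediction head/tokenizer choices from the papers above are omitted.

```python
import torch

def mask_patches(images, patch=16, mask_ratio=0.6):
    """Randomly mask a fraction of non-overlapping patches and return the
    corrupted images plus the boolean patch mask (True = masked)."""
    b, c, h, w = images.shape
    mask = torch.rand(b, h // patch, w // patch) < mask_ratio
    pixel_mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    corrupted = images * (~pixel_mask).unsqueeze(1)   # zero out masked pixels
    return corrupted, mask

# training: the model reconstructs the masked content, with the loss
# computed only on masked patches.
```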
I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra. https://arxiv.org/abs/2302.01834
The learning mechanism of transformer models was poorly understood; it turns out that a transformer is like a circuit with feedback.
I argue that autodiff can be replaced with what I call Hopf coherence in the paper, which happens within a single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I'm working on a next gen Hopf algebra based machine learning framework.
Join my discord if you want to discuss this further https://discord.gg/mr9TAhpyBW