:PROPERTIES:
:ID: c7fe7e79-73d3-4cc7-a673-2c2e259ab5b5
:END:
#+title: stable diffusion
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup
- parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]]
- related: [[id:58c585b9-a03e-4320-a313-e00e68c4ce7e][diffusion video]] [[id:75929071-e62b-4c0a-8374-8ca322d0a020][software]]
- combining [[https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega][pipelines]], creating [[https://huggingface.co/docs/diffusers/main/en/using-diffusers/contribute_pipeline][pipelines]]
- generate: [[id:3a9a6a52-b3a6-4a69-b402-531b3b1e2d91][NOVEL VIEW]]
- how to apply [[https://wandb.ai/johnowhitaker/midu-guidance/reports/-Mid-U-Guidance-Fast-Classifier-Guidance-for-Latent-Diffusion-Models—VmlldzozMjg0NzA1][classifier guidance]] to the diffusion
* SD MODELS
- [[https://twitter.com/iScienceLuvr/status/1717359916422496596 ][CommonCanvas]]: [[https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md#coming-soon][An Open]] [[https://github.com/mosaicml/diffusion][Diffusion]] Model Trained with Creative-Commons Images
- CC-licensed images with BLIP-2 captions, similar performance to Stable Diffusion 2 (apache license)
- [[https://huggingface.co/ptx0/terminus-xl-gamma-v1][Terminus]] XL Gamma: simpler SDXL, for inpainting tasks, super-resolution, style transfer
- [[SDXL-DPO]]
- [[https://huggingface.co/wangfuyun/AnimateLCM-SVD-xt][AnimateLCM-SVD-xt]]: image to video
- stable-cascade: würstchen architecture = even smaller latent space
- [[https://huggingface.co/KBlueLeaf/Stable-Cascade-FP16-fixed][Stable-Cascade-FP16]]
- SD: x8 compression (1024x1024 → 128x128) vs Cascade: x42 compression (1024x1024 → 24x24)
- faster inference, cheaper training
- [[id:fdbe1937-ac2f-4eb6-b617-8e48fca083e4][STABLE DIFFUSION 3]]
- [[https://huggingface.co/fal/AuraFlow][AuraFlow]]: actually open source (apache 2) model, by simo
** DISTILLATION
- [[https://twitter.com/camenduru/status/1716817970255831414 ][SSD1B]] ([[https://blog.segmind.com/introducing-segmind-ssd-1b/][distilled]] [[https://huggingface.co/segmind/SSD-1B/blob/main/SSD-1B.safetensors][SDXL]]) 60% faster, -40% VRAM
- [[https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic][Playground v2]]
- [[https://twitter.com/_akhaliq/status/1760157720173224301 ][SDXL-Lightning]]: [[https://twitter.com/_akhaliq/status/1760157720173224301 ][a lightning]] fast 1024px text-to-image generation model (few-steps generation)
- progressive adversarial diffusion distillation
- [[https://snap-research.github.io/BitsFusion/][BitsFusion]]: 1.99 bits Weight Quantization of Diffusion Model
- SD 1.5 quantized to 1.99 bits per weight (instead of 8-bit)
*** ONE STEP DIFFUSION
:PROPERTIES:
:ID: 9e94f7d8-752f-48e9-9ef1-9c79eba258e3
:END:
- [[https://tianweiy.github.io/dmd/][One-step Diffusion]] with Distribution Matching Distillation
- comparable with v1.5 while being 30x faster
- critic similar to GANs in that it is jointly trained with the generator
- differs in that it does not play an adversarial game, and can fully leverage a pretrained model
*** SDXS
:PROPERTIES:
:ID: 98791065-c5dc-4f12-8c0c-fffad5715a2e
:END:
- [[https://lemmy.dbzer0.com/post/17299457][SDXS]]: Real-Time One-Step Latent Diffusion Models with Image Conditions
- knowledge distillation to streamline the U-Net and image decoder architectures
- one-step DM training technique that utilizes feature matching and score distillation
- speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a GPU
- image-conditioned control, facilitating efficient image-to-image translation.
** IRIS LUX
https://civitai.com/models/201287
Model created through consensus via statistical filtering (novel consensus merge)
https://gist.github.com/Extraltodeus/0700821a3df907914994eb48036fc23e
** EMOJIS
- [[https://twitter.com/_akhaliq/status/1726817847525978514 ][Text-to-Sticker]]: Style Tailoring Latent Diffusion Models for Human Expression
- emojis, stickers
** MERGING MODELS
- merging models where the text encoder differs for each, by training on the difference (weighted-merge sketch after this list)
- https://www.reddit.com/r/StableDiffusion/comments/1g6500o/ive_managed_to_merge_two_models_with_very/
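A minimal sketch of the simplest form of this, a plain weighted average of two checkpoints' state dicts; the consensus merge and the text-encoder-difference trick linked above are more involved. File names are placeholders and ~.safetensors~ checkpoints are assumed.

#+begin_src python
import torch
from safetensors.torch import load_file, save_file  # assumes .safetensors checkpoints

def weighted_merge(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return (1-alpha)*A + alpha*B for every tensor the two checkpoints share."""
    merged = {}
    for key, tensor_a in sd_a.items():
        tensor_b = sd_b.get(key)
        if tensor_b is not None and tensor_b.shape == tensor_a.shape:
            merged[key] = (1 - alpha) * tensor_a + alpha * tensor_b
        else:
            merged[key] = tensor_a  # keep A where the models disagree structurally (e.g. text encoder)
    return merged

sd_a = load_file("model_a.safetensors")  # placeholder paths
sd_b = load_file("model_b.safetensors")
save_file(weighted_merge(sd_a, sd_b, alpha=0.3), "merged.safetensors")
#+end_src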
*** SEGMOE
:PROPERTIES:
:ID: 7c77fcdf-8b60-48dc-bb7a-11c9d6aad309
:END:
- [[https://huggingface.co/segmind/SegMoE-SD-4x2-v0][SegMoE]] - [[https://youtu.be/6Q4BJOcvwGE?si=zBNrQrKIgmwmPPvI][The Stable]] [[https://huggingface.co/segmind][Diffusion]] [[https://lemmy.dbzer0.com/post/13761591][Mixture]] of Experts for Image Generation, Mixture of Diffusion Experts
- training-free creation of larger models on the fly, with broader knowledge
* GENERATION CONTROL
- [[EXTRA PRETRAINED]] [[id:208c064d-f700-4e8f-a4ab-2c73c557f9e3][DRAG]] [[MAPPED INPAINTING]]
- [[https://www.storminthecastle.com/posts/01_head_poser/][hyperparameters with]] extra network [[https://wandb.ai/johnowhitaker/midu-guidance/reports/Mid-U-Guidance-Fast-Classifier-Guidance-for-Latent-Diffusion-Models—VmlldzozMjg0NzA1][Mid-U Guidance]]
- block [[https://github.com/hako-mikan/sd-webui-lora-block-weight#%E6%A6%82%E8%A6%81][weights lora]]
- [[https://twitter.com/_akhaliq/status/1759789799685202011 ][DiLightNet]]: [[https://arxiv.org/abs/2402.11929][Fine-grained]] Lighting Control for Diffusion-based Image Generation
- using light hints to resynthesize a prompt with user-defined consistent lighting
- [[https://arxiv.org/abs/2403.06452][Text2QR]]: [[https://github.com/mulns/Text2QR][Harmonizing]] Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation
- refines the output iteratively in the latent space
- [[https://twitter.com/_akhaliq/status/1778606395014676821 ][ControlNet++]]: Improving Conditional Controls with Efficient Consistency Feedback
- explicitly optimizing pixel-level cycle consistency between generated images
** MATERIAL EXTRACTION
- [[https://arxiv.org/pdf/2403.20231.pdf][U-VAP]]: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- generates images with the material or color extracted from the input image
- sentence describing the desired attribute
- learn user-specified visual attributes
- [[https://ttchengab.github.io/zest/][ZeST]]: Zero-Shot Material Transfer from a Single Image
- leverages adapters to extract implicit material representation from exemplar image
** LIGHT CONTROL
- [[https://github.com/DiffusionLight/DiffusionLight][DiffusionLight]]: Light Probes for Free by Painting a Chrome Ball
- render a chrome ball into the input image
- produces convincing light estimates
** BACKGROUND
- [[https://twitter.com/bria_ai_/status/1754846894675673097 ][BriaAI]]: [[https://twitter.com/camenduru/status/1755038599500718083 ][Open-Source]] Background Removal (RMBG v1.4)
- [[https://github.com/layerdiffusion/sd-forge-layerdiffusion][LayerDiffusion]]: Transparent Image Layer Diffusion using Latent Transparency
- layers with alpha, generate pngs, remove backgrounds (more like generate with removable background)
- method learns a “latent transparency”
- [[https://huggingface.co/LayerDiffusion/layerdiffusion-v1/tree/main][models]]
** EMOTIONS
- [[https://arxiv.org/abs/2401.01207][Towards]] a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
- face swapping and reenactment, interpolate between emotions
- [[https://arxiv.org/abs/2401.04608][EmoGen]]: Emotional Image Content Generation with Text-to-Image Diffusion Models
- clip, abstract emotions
- [[https://arxiv.org/abs/2403.08255][Make Me]] Happier: Evoking Emotions Through Image Diffusion Models
- understanding and editing source images emotions cues
** NOISE CONTROL
:PROPERTIES:
:ID: b68ef215-2e3e-4cd5-abbd-dffcc30acdae
:END:
- offset noise (darkness-capable LoRAs), pyramid noise; see the sketch after this list
- [[https://arxiv.org/pdf/2305.08891.pdf][Common Diffusion]] Noise Schedules and Sample Steps are Flawed (and several proposed fixes)
- native offset noise
- [[https://github.com/Extraltodeus/noise_latent_perlinpinpin][noisy perlin]] latent
- you can reinject the same noise pattern after an upscale for more coherent results and better upscaling
- [[https://arxiv.org/abs/2402.04930][Blue noise]] for diffusion models
- allows introducing correlation across images within a single mini-batch to improve gradient flow
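A rough sketch of the offset-noise and pyramid-noise tricks above, for a training loop where ~latents~ are the VAE latents of a batch; the ~strength~ and ~discount~ values are assumptions, not canonical settings.

#+begin_src python
import torch
import torch.nn.functional as F

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    """Add a per-channel constant shift to the noise so the model can learn
    very dark / very bright images."""
    noise = torch.randn_like(latents)
    offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1, device=latents.device)
    return noise + strength * offset

def pyramid_noise(latents: torch.Tensor, discount: float = 0.8) -> torch.Tensor:
    """Sum noise drawn at progressively coarser scales, upsampled back to the
    latent resolution (multi-resolution / pyramid noise)."""
    b, c, h, w = latents.shape
    noise = torch.randn_like(latents)
    scale, i = 2, 1
    while min(h, w) // scale >= 1:
        coarse = torch.randn(b, c, max(h // scale, 1), max(w // scale, 1), device=latents.device)
        noise = noise + (discount ** i) * F.interpolate(coarse, size=(h, w), mode="bilinear")
        scale *= 2
        i += 1
    return noise / noise.std()  # renormalize back to roughly unit variance
#+end_src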
** GUIDING FUNCTION
:PROPERTIES:
:ID: ddd3588a-dc3c-426d-a94e-9aa373fabff9
:END:
- [[https://github.com/arpitbansal297/Universal-Guided-Diffusion][Universal Guided Diffusion]] (face and style transfer)
- [[https://arxiv.org/abs/2303.09833][FreeDoM]]: [[https://github.com/vvictoryuki/FreeDoM][Training-Free]] Energy-Guided Conditional Diffusion Model <<FreeDoM>>
- extra: repo has list of deblurring, super-resolution and restoration methods
- masks as energy function
- Diffusion Self-Guidance [[https://dave.ml/selfguidance/][for Controllable]] Image Generation
- steer sampling, similarly to classifier guidance, but using signals in the pretrained model itself
- instructional transformations
- [[https://mcm-diffusion.github.io/][MCM]] [[https://arxiv.org/pdf/2302.12764.pdf][Modulating Pretrained]] Diffusion Models for Multimodal Image Synthesis (module after denoiser) mmc
- mask like control to tilt the noise, maybe useful for text <<MCM>>
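A hedged sketch of the training-free guidance idea these methods share: an arbitrary energy function (a mask loss, CLIP score, face-ID loss, ...) is evaluated on the predicted clean latent and its gradient is folded back into the noise prediction. The names and exact scaling below are assumptions; FreeDoM and Universal Guidance implement this more carefully.

#+begin_src python
import torch

def guided_step(scheduler, unet, latents, t, cond_emb, energy_fn, guidance_strength=1.0):
    """One denoising step nudged by the gradient of an external energy function."""
    latents = latents.detach().requires_grad_(True)
    noise_pred = unet(latents, t, encoder_hidden_states=cond_emb).sample
    # predicted clean latent x0 under the usual epsilon parameterization
    alpha_bar = scheduler.alphas_cumprod[t]
    x0_pred = (latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
    energy = energy_fn(x0_pred)                 # e.g. a mask / CLIP / face-ID loss to minimize
    grad = torch.autograd.grad(energy, latents)[0]
    noise_pred = noise_pred + guidance_strength * (1 - alpha_bar).sqrt() * grad
    return scheduler.step(noise_pred, t, latents.detach()).prev_sample
#+end_src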
*** ADAPTIVE GUIDANCE
- [[https://twitter.com/_akhaliq/status/1737695636814712844 ][Adaptive Guidance]]: Training-free Acceleration of Conditional Diffusion Models
- AG, an efficient variant of CFG (Classifier-Free Guidance); reduces computation by 25%
- omits network evaluations when the denoising process displays convergence
- the second half of the denoising process is largely redundant; plug-and-play alternative to Guidance Distillation
- LinearAG: entire network evaluations can be replaced by affine transformations of past estimates
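For reference, a minimal sketch of what gets skipped: standard CFG runs the UNet twice per step, and the adaptive idea is to fall back to a single conditional pass once the trajectory has converged. The ~skip_cfg~ flag and the convergence test are left to the caller; they are assumptions, not the paper's criterion.

#+begin_src python
def guided_noise_pred(unet, latents, t, cond_emb, uncond_emb,
                      guidance_scale=7.5, skip_cfg=False):
    """One noise prediction with optional CFG skipping."""
    if skip_cfg:
        # late in denoising, a single conditional pass is often good enough
        return unet(latents, t, encoder_hidden_states=cond_emb).sample
    # standard CFG: two forward passes, then extrapolate away from the unconditional estimate
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    noise_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
#+end_src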
** CONTROL NETWORKS, CONTROLNET
- [[id:33903015-49dd-4a1a-81b5-78350c074fff][REFERENCENET]] [[id:d1d1a9ff-670e-4bed-9087-ad0b8b71ee7a][CONTROLNET FOR 3D]] [[CCM]] [[id:fd3d677f-1b5e-46a3-8ee9-6524baa07339][CONTROLNET VIDEO]]
- why controlnet, alternatives https://github.com/lllyasviel/ControlNet/discussions/188
- [[https://github.com/Sierkinhane/VisorGPT][VisorGPT]]: Learning Visual Prior via Generative Pre-Training
- [[https://huggingface.co/papers/2305.13777][gpt]] that learns to transform normal prompts into controlnet primitives
- [[https://twitter.com/_akhaliq/status/1735515389692424461 ][FineControlNet]]: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
- geometric control via human pose images and appearance control via instance-level text prompts
- [[https://twitter.com/_akhaliq/status/1734808238753788179 ][FreeControl]]: [[https://github.com/kijai/ComfyUI-Diffusers-freecontrol?tab=readme-ov-file][Training-Free]] Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- alignment with guidance image: lidar, face mesh, wireframe mesh, rag doll
- [[https://github.com/SamsungLabs/FineControlNet][FineControlNet]]: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
- instance-specific text description, better prompt following
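Basic ControlNet usage with diffusers, for context; the canny model and base checkpoint IDs are the commonly published ones, and the blank control image is just a placeholder for a real edge map.

#+begin_src python
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# placeholder control image: in practice, pass a real canny edge map of your reference
control = Image.new("RGB", (512, 512))
image = pipe("a futuristic city at dusk", image=control,
             num_inference_steps=30).images[0]
image.save("controlnet_out.png")
#+end_src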
*** SKETCH
- [[https://arxiv.org/abs/2401.00739][diffmorph]]: text-less image morphing with diffusion models
- sketch-to-image module
- [[https://lemmy.dbzer0.com/post/15434577][Block]] and Detail: Scaffolding Sketch-to-Image Generation
- sketch-to-image, can generate coherent elements from partial sketches, generate beyond the sketch following the prompt
- [[https://arxiv.org/abs/2402.17624][CustomSketching]]: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing
- one for contour, the other flow lines representing texture
*** ALTERNATIVES
- controlNet (total control of image generation, from doodles to masks)
- T2I-Adapter (lighter, composable), [[https://www.reddit.com/r/StableDiffusion/comments/11v3dgj/comment/jcrag7x/?utm_source=share&utm_medium=web2x&context=3][color palette how-to]]
- lora like (old) https://github.com/HighCWu/ControlLoRA
- [[https://vislearn.github.io/ControlNet-XS/][ControlNet-XS]]: 1% of the parameters
- [[https://twitter.com/_akhaliq/status/1732585051039088837 ][LooseControl]]: [[https://github.com/shariqfarooq123/LooseControl][Lifting]] ControlNet for Generalized Depth Conditioning
- loosely specifying scenes with boxes
- ControlNet-LLLite by [[https://github.com/kohya-ss/sd-scripts/blob/sdxl/docs/train_lllite_README.md][kohya]]
- [[https://twitter.com/_akhaliq/status/1736991952283783568 ][SCEdit]]: [[https://github.com/mkshing/scedit-pytorch][Efficient]] [[https://scedit.github.io/][and Controllable]] Image Diffusion Generation via Skip Connection Editing
- lightweight tuning module named SC-Tuner, synthesis by injecting different conditions
- reduces training parameters and memory requirements
- Integrated Into SCEPTER and SWIFT
- [[https://lemmy.dbzer0.com/post/12591345][Compose and]] [[https://twitter.com/_akhaliq/status/1747857732818854040 ][Conquer]]: Diffusion-Based 3D Depth Aware Composable Image Synthesis
- imposing global semantics onto targeted regions without the use of any additional localization cues
- alternative to controlnet and t2i-adapter
**** CTRLORA
- https://github.com/xyfJASON/ctrlora
*** TIP: text restoration
- [[https://twitter.com/_akhaliq/status/1737318799634755765 ][TIP]]: Text-Driven Image Processing with Semantic and Restoration Instructions ==best==
- controlnet architecture, leverages natural language as interface to control image restoration
- instruction driven, can imprint text into an image
*** HANDS
- [[id:3f752b46-cae4-49d9-948d-50e3c500727e][HANDS DATASET]]
- [[https://arxiv.org/abs/2312.04867][HandDiffuse]]: Generative Controllers for Two-Hand Interactions via Diffusion Models
- two-hand interactions, motion in-betweening and trajectory control
**** RESTORING HANDS
- [[https://arxiv.org/abs/2312.04236][Detecting]] and Restoring Non-Standard Hands in Stable Diffusion Generated Images
- body pose estimation to understand hand orientation for accurate anomaly correction
- integration of ControlNet and InstructPix2Pix
- [[https://github.com/wenquanlu/HandRefiner][HandRefiner]]: [[https://github.com/wenquanlu/HandRefiner][Refining]] Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
- incorrect number of fingers, irregular shapes, effectively rectified
- utilize ControlNet modules to re-inject corrected information (SD 1.5)
*** USING ATTENTION MAP
- [[CONES]] [[The Chosen One]] [[id:65812d6a-a81d-47f2-a7ad-25c94e2ff70a][STORYTELLER DIFFUSION]]
- [[https://rival-diff.github.io/][RIVAL]]: Real-World Image Variation by Aligning Diffusion Inversion Chain ==best==
**** MASA
- [[https://ljzycmd.github.io/projects/MasaCtrl/][MasaCtrl]]: [[https://github.com/TencentARC/MasaCtrl][Tuning-free]] Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- same subject in different views or poses
- by querying the attention map from another image
- better than ddim inversion, consistent SD animations; mixable with T2I-Adapter
***** TI-GUIDED-EDIT
- [[https://arxiv.org/abs/2401.02126][Unified]] [[https://github.com/Kihensarn/TI-Guided-Edit][Diffusion-Based]] Rigid and Non-Rigid Editing with Text and Image Guidance
- rigid=conserve the structure
**** LLLYASVIEL
- the reference-only preprocessor doesn't require any control models; it guides the diffusion directly using images as references to generate variations
- [[https://github.com/lllyasviel/ControlNet#guess-mode—non-prompt-mode][Guess Mode]] / [[https://github.com/lllyasviel/ControlNet/discussions/188][Non-Prompt]] Mode, now named: Control Modes, how much prompt vs controlnet; [[https://github.com/comfyanonymous/ComfyUI_experiments][comfy node]]
*** SEVERAL CONTROLS IN ONE
- [[https://huggingface.co/papers/2305.11147][UniControl]]: [[https://www.reddit.com/r/StableDiffusion/comments/15851w6/code_for_unicontrol_has_been_released/][A Unified]] [[https://twitter.com/CaimingXiong/status/1662250281315315713 ][Diffusion]] Model for Controllable Visual Generation In the Wild
- several controlnets in one, contextual understanding
- image deblurring, image colorization
- [[https://twitter.com/abhi1thakur/status/1684926197870870529 ][using UniControl]] with Stable Diffusion XL 1.0 Refiner; sketch to image tool
- In-[[https://github.com/Zhendong-Wang/Prompt-Diffusion][Context]] [[https://zhendong-wang.github.io/prompt-diffusion.github.io/][Learning]] Unlocked for Diffusion Models
- learns translation of an image to HED, depth, segmentation, outline
** HUMAN PAINT
- [[https://arxiv.org/pdf/2108.01073.pdf][SDEdit]]: guided image synthesis and editing with stochastic differential equation
- stroke based inpainting-editing
- [[https://arxiv.org/pdf/2402.03705.pdf][FOOLSDEDIT]]: Deceptively Steering Your Edits Towards Targeted Attribute-aware Distribution
- forcing SDEdit to generate a data distribution aligned with a specified attribute (e.g. female)
- [[https://zhexinliang.github.io/Control_Color/][Control]] [[https://github.com/ZhexinLiang/Control-Color][Color]]: Multimodal Diffusion-Based Interactive Image Colorization
- paint over grayscale to recolor it
** LAYOUT DIFFUSION
:PROPERTIES:
:ID: dafb1713-5d08-40de-b445-76d25f2cf070
:END:
- 3d: [[id:5e1ee0b4-8493-44e4-b0cf-89b429a78532][ROOM LAYOUT]]
- [[ATTENTION LAYOUT]] [[id:65812d6a-a81d-47f2-a7ad-25c94e2ff70a][STORYTELLER DIFFUSION]]
- ZestGuide: [[https://twitter.com/_akhaliq/status/1673539960664911874 ][Zero-shot]] [[https://twitter.com/gcouairon/status/1721529637690327062 ][spatial]] layout conditioning for text-to-image diffusion models
- implicit segmentation maps can be extracted from cross-attention layers
- adds spatial conditioning to SD without fine-tuning
- [[https://arxiv.org/abs/2402.04754][Towards Aligned]] Layout Generation via Diffusion Model with Aesthetic Constraints
- constraints representing design intentions
- continuous state-space design can incorporate differentiable aesthetic constraint functions in training
- by introducing conditions via masked input
- [[https://arxiv.org/abs/2402.12908][RealCompo]]: [[https://github.com/YangLing0818/RealCompo][Dynamic Equilibrium]] between Realism and Compositionality Improves Text-to-Image Diffusion Models
- dynamically balance the strengths of the two models in denoising process
- [[https://spright-t2i.github.io/][Getting]] it Right: Improving Spatial Consistency in Text-to-Image Models
- better representing spatial relationships
- faithfully follow the spatial relationships specified in the text prompt
*** SCENES
- [[https://twitter.com/_akhaliq/status/1674623306551508993 ][Generate Anything]] Anywhere in Any Scene <<layout aware>>
- training guides to focus on object identity, personalized concept with localization controllability
- [[ANYDOOR]] [[id:4b8a772d-e3ad-4183-863b-eeddb47bab9e][ALDM]]
*** WITH BOXES
- [[https://gligen.github.io/][GLIGEN]]: Open-Set Grounded Text-to-Image Generation (boxes)
- [[https://twitter.com/_akhaliq/status/1645253639575830530 ][Training-Free]] Layout Control with Cross-Attention Guidance
- [[https://arxiv.org/pdf/2304.14573.pdf][SceneGenie]]: Scene Graph Guided Diffusion Models for Image Synthesis
- [[https://twitter.com/_akhaliq/status/1683340606217781248 ][BoxDiff]]: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
- [[https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/][InstanceDiffusion]]: [[https://lemmy.dbzer0.com/post/13827955][Instance-level]] Control for Image Generation
- conditional generation, hierarchical bounding-box structure, feature (prompt) at a point
- single points, scribbles, bounding boxes or segmentation masks
- [[https://arxiv.org/abs/2402.17910][Box It]] to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
- bounding boxes with attribute(prompt) binding
*** ALDM
:PROPERTIES:
:ID: 4b8a772d-e3ad-4183-863b-eeddb47bab9e
:END:
- [[https://lemmy.dbzer0.com/post/12605682][ALDM]]: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
- layout faithfulness
*** OPEN-VOCABULARY
:PROPERTIES:
:ID: 19d99453-7d66-41b1-80e8-fbe91d035084
:END:
- [[https://arxiv.org/abs/2401.16157][Spatial-Aware Latent]] Initialization for Controllable Image Generation
- inverted reference image contains spatial awareness regarding positions, resulting in similar layouts
- open-vocabulary framework to customize a spatial-aware initialization
*** CARTOON
:PROPERTIES:
:ID: cc058fea-c2dd-4f7f-aa59-156825bed0ef
:END:
- [[https://whaohan.github.io/desigen/][Desigen]]: A Pipeline for Controllable Design Template Generation
- generating images with proper layout space for text; generating the template itself
**** COGCARTOON
- [[https://arxiv.org/pdf/2312.10718.pdf][CogCartoon]]: Towards Practical Story Visualization
- plugin-guided and layout-guided inference; specific character = 316 KB plugin
** IMAGE PROMPT - ONE IMAGE
- [[suti]] [[custom-edit diffusion]]
*** UNET LESS
- [[https://github.com/drboog/ProFusion][ProFusion]]: Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach
- and can interpolate between two
- promptnet (embedding), encoder based, for style transform
- one image, no regularization needed
- [[https://twitter.com/kelvinckchan/status/1680288217378197504 ][Taming]] Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
- using CLIP features extracted from the subject
*** IMAGE-SUGGESTION
- [[SEMANTIC CORRESPONDENCE]]
- UMM-Diffusion, TIUE: [[https://arxiv.org/abs/2303.09319][Unified Multi-Modal]] Latent Diffusion for Joint Subject and Text Conditional Image Generation
- takes joint texts and images
- only the image-mapping to a pseudo word embedding is learned
**** ZERO SHOT
- [[https://twitter.com/_akhaliq/status/1732592245105185195 ][Context Diffusion]]: In-Context Aware Image Generation
- separates the encoding of the visual context; prompt not needed
- ReVision - Unclip https://comfyanonymous.github.io/ComfyUI_examples/sdxl/
- Revision gives the model the pooled output from CLIPVision G instead of the CLIP G text encoder
- [[https://github.com/Xiaojiu-z/SSR_Encoder][SSR-Encoder]]: Encoding Selective Subject Representation for Subject-Driven Generation
- architecture designed for selectively capturing any subject from single or multiple reference images
***** IP-ADAPTER
- [[https://twitter.com/_akhaliq/status/1691341380348682240 ][IP-Adapter]]: [[https://github.com/tencent-ailab/IP-Adapter][Text Compatible]] Image Prompt Adapter for Text-to-Image Diffusion Models ==stock SD==
- works with other controlnets
- [[https://huggingface.co/h94/IP-Adapter-FaceID][IP-Adapter-FaceID]] (face recognition model)
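Hedged usage sketch via the diffusers ~load_ip_adapter~ integration; the repo/weight names are the ones published by the authors, and the reference image path is a placeholder.

#+begin_src python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers generation

ref = load_image("reference.png")  # placeholder path for the image prompt
out = pipe("a portrait, studio lighting", ip_adapter_image=ref,
           num_inference_steps=30).images[0]
#+end_src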
****** LCM-LOOKAHEAD
:PROPERTIES:
:ID: 28ff20ec-5501-47da-9ac1-8adc65303376
:END:
- [[https://lcm-lookahead.github.io/][LCM-Lookahead]] for Encoder-based Text-to-Image Personalization
- LCM-based approach for propagating image-space losses to personalization model training and classifier guidance
***** SEECODERS
:PROPERTIES:
:ID: 1c014bca-d8db-4d28-9c49-5297626d4484
:END:
- [[https://arxiv.org/abs/2305.16223][Seecoders]]: [[https://github.com/SHI-Labs/Prompt-Free-Diffusion][Prompt-Free]] Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
- Semantic Context Encoder, replaces clip with seecoder; works with ==stock SD==
- input image and controlnet
- unlike unCLIP, SeeCoder uses an extra model
- one image into several perspectives ([[id:505848e8-02a5-4699-be28-6e7b2e91837c][MULTIVIEW DIFFUSION]])
- the embeddings can be textures, effects, objects, semantics (contexts), etc.
**** PERSONALIZATION
- [[https://twitter.com/_akhaliq/status/1645254918121422859 ][InstantBooth]]: Personalized Text-to-Image Generation without Test-Time Finetuning
- personalized images with only a single forward pass
- [[https://twitter.com/AbermanKfir/status/1679689404573679616 ][HyperDreamBooth]]: HyperNetworks for Fast Personalization of Text-to-Image Models; just one image
*** IDENTITY
- [[https://github.com/cloneofsimo/lora/discussions/96][masked score estimation]]
- HiPer: [[https://arxiv.org/abs/2303.08767][Highly Personalized]] Text Embedding for Image Manipulation by Stable Diffusion
- one image of a single subject; learns a personalized CLIP text embedding from it
- [[IP-ADAPTER]]
**** STORYTELLER DIFFUSION
:PROPERTIES:
:ID: 65812d6a-a81d-47f2-a7ad-25c94e2ff70a
:END:
- [[https://consistory-paper.github.io/][ConsiStory]]: Training-Free Consistent Text-to-Image Generation
- training-free approach for consistent subject (object) generation, 20x faster, handles multi-subject scenarios
- by sharing the internal activations of the pretrained model
**** ANYDOOR
- [[https://github.com/damo-vilab/AnyDoor][AnyDoor]]: [[https://damo-vilab.github.io/AnyDoor-Page/][Zero-shot]] [[https://twitter.com/_akhaliq/status/1738772616142303728 ][Object-level]] [[https://twitter.com/_akhaliq/status/1738775751887860120 ][Image]] Customization
- teleport target objects to new scenes at user-specified locations
- identity feature with detail feature
- moving objects, swapping them, multi-subject composition, clothing try-on
**** SUBJECT
- [[https://huggingface.co/papers/2306.00926][Inserting Anybody]] in Diffusion Models via Celeb Basis
- one facial photograph, 1024 learnable parameters, 3 minutes; several at once
- [[https://twitter.com/_akhaliq/status/1683294368940318720 ][Subject-Diffusion]]: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
- multi subject, single reference image
- [[https://twitter.com/_akhaliq/status/1701777751286366283 ][PhotoVerse]]: Tuning-Free Image Customization with Text-to-Image Diffusion Models
- incorporates facial identity loss, single facial photo, single training phase
- [[https://twitter.com/_akhaliq/status/1725365231050793081 ][The Chosen]] [[https://omriavrahami.com/the-chosen-one/][One]]: Consistent Characters in Text-to-Image Diffusion Models
- <<The Chosen One>> sole input being text
- generate gallery of images, use pre-trained feature extractor to choose the most cohesive cluster
- [[https://twitter.com/_akhaliq/status/1732222107583500453 ][FaceStudio]]: Put Your Face Everywhere in Seconds ==best==
- direct feed-forward mechanism, circumventing the need for intensive fine-tuning
- stylized images, facial images, and textual prompts to guide the image generation process
- [[https://arxiv.org/abs/2402.00631][SeFi-IDE]]: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation
- face-wise attention loss to fit the face region
***** IDENTITY IN VIDEO
:PROPERTIES:
:ID: 8872aa4e-0394-4066-822b-9145f14caf6f
:END:
- [[https://magic-me-webpage.github.io/][Magic-Me]]: Identity-Specific Video Customized Diffusion
****** STABLEIDENTITY
:PROPERTIES:
:ID: 55829fe3-d777-4723-8b48-5c9454822b5e
:END:
- [[https://arxiv.org/abs/2401.15975][StableIdentity]]: Inserting Anybody into Anywhere at First Sight
- identity recontextualization with just one face image without finetuning
- also extends to video/3D generation
***** IDENTITY ZERO-SHOT
- [[https://github.com/InstantID/InstantID][InstantID]]: [[https://instantid.github.io/][Zero-shot]] Identity-Preserving Generation in Seconds (using face encoder)
- [[https://github.com/TencentARC/PhotoMaker][PhotoMaker]]: Customizing Realistic Human Photos via Stacked ID Embedding Paper page
- [[https://twitter.com/_akhaliq/status/1769930922525159883 ][Infinite-ID]]: Identity-preserved Personalization via ID-semantics Decoupling Paradigm ==best==
- identity provided by the reference image while mitigating interference from textual input
- [[https://caphuman.github.io/][CapHuman]]: Capture Your Moments in Parallel Universes
- encode then learn to align, identity preservation for new individuals without tuning
- [[https://twitter.com/_akhaliq/status/1740616525478781168 ][SSR-Encoder]]: Encoding Selective Subject Representation for Subject-Driven Generation ==best==
- Token-to-Patch Aligner = preserving fine features of the subjects; multiple subjects
- combinable with controlnet, and across styles
- [[https://twitter.com/_akhaliq/status/1764514136849846667 ][RealCustom]]: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
- gradually narrowing to the specific subject, iteratively update the influence scope
***** PHOTOMAKER
- [[https://huggingface.co/papers/2312.04461][PhotoMaker]]: [[https://twitter.com/_akhaliq/status/1732965700405281099 ][Customizing]] [[https://arxiv.org/pdf/2312.04461.pdf][Realistic]] Human Photos via Stacked ID Embedding
- encodes images (via an MLP) into an embedding which preserves ID
**** ANIME
- [[https://github.com/7eu7d7/DreamArtist-sd-webui-extension][DreamArtist]]: a single image and target text (mainly works with anime)
- [[https://twitter.com/_akhaliq/status/1738018255720030343 ][DreamTuner]]: Single Image is Enough for Subject-Driven Generation
- subject-encoder for coarse subject identity preservation, training-free
- [[https://github.com/laksjdjf/pfg][pfg]]: prompt-free generation; learns to interpret (anime) input images
- old one: [[https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6585][PaintByExample]]
*** VARIATIONS
- others: [[USING ATTENTION MAP]] [[VARIATIONS]] [[Elite]] [[ZERO SHOT]]
- image variations model (mix images): https://twitter.com/Buntworthy/status/1615302310854381571
- by versatile diffusion model guy, [[https://www.reddit.com/r/StableDiffusion/comments/10ent88/guy_who_made_the_image_variations_model_is_making/][reddit]]
- improved: https://github.com/SHI-Labs/Versatile-Diffusion
- stable diffusion reimagine: conditioning the unet with the image clip embeddings, then training
* BETTER DIFFUSION
- editing the [[https://time-diffusion.github.io/TIME_paper.pdf][default]] assumptions of a prompt: https://github.com/bahjat-kawar/time-diffusion
- [[https://github.com/SusungHong/Self-Attention-Guidance][Self-Attention Guidance]] (SAG): [[https://arxiv.org/pdf/2210.00939.pdf][SAG leverages]] [[https://github.com/ashen-sensored/sd_webui_SAG][intermediate attention]] maps of diffusion models at each iteration to capture essential information for the generative process and guide it accordingly
- pretty much just reimplemented the attention function without changing much else
- [[https://github.com/ChenyangSi/FreeU#freeu-code][FreeU]]: [[https://twitter.com/_akhaliq/status/1704721496122266035 ][Free]] Lunch in Diffusion U-Net (unet) ==best==
- improves diffusion model sample quality at no cost (diffusers snippet after this list)
- more color variance
- [[https://twitter.com/_akhaliq/status/1683293200574988289 ][Diffusion Sampling]] with Momentum for Mitigating Divergence Artifacts
- incorporation of: Heavy Ball (HB) momentum = expands stability regions; Generalized HB (GHVB) = suppression
- better low step sampling
- DG: [[https://github.com/luping-liu/Detector-Guidance][Detector Guidance]] for Multi-Object Text-to-Image Generation
- mid-diffusion, performs latent object detection, then enhances the following CAMs (cross-attention maps)
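FreeU is exposed directly on diffusers pipelines; a minimal sketch, with the SD 1.5 scaling factors suggested by the FreeU repo treated as starting points rather than fixed settings.

#+begin_src python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
# backbone (b) and skip-connection (s) scaling factors, values commonly suggested for SD 1.5
pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.5, b2=1.6)
image = pipe("a macro photo of a beetle on moss", num_inference_steps=30).images[0]
# pipe.disable_freeu()  # revert to stock UNet behaviour
#+end_src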
** SCHEDULER
- [[https://arxiv.org/abs/2301.11093v1][simple diffusion]]: End-to-end diffusion for high resolution images
- shifted noise schedule
- [[https://github.com/Extraltodeus/sigmas_tools_and_the_golden_scheduler][Sigmas Tools]] and The Golden Scheduler
** QUALITY
- [[RESOLUTION]]
- [[https://twitter.com/_akhaliq/status/1707253415061938424 ][Emu]]: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (dataset method)
- guide pre-trained model to exclusively generate good images
- [[https://twitter.com/_akhaliq/status/1712830952441819382 ][HyperHuman]]: Hyper-Realistic Human Generation with Latent Structural Diffusion
- Latent Structural Diffusion Model that simultaneously denoises depth and surface normal with RGB image
- [[https://github.com/openai/consistencydecoder][Consistency]] Distilled Diff VAE
- Improved decoding for stable diffusion vaes
** HUMAN FEEDBACK
:PROPERTIES:
:ID: 59d1d337-eff3-42bb-9398-1e51b0739074
:END:
- [[id:37688f5e-9dc2-48ed-a3f9-eeb318c64f02][RLCM]]
- Aligning Text-to-Image Models using Human Feedback https://arxiv.org/abs/2302.12192
- [[https://tgxs002.github.io/align_sd_web/][Better Aligning]] Text-to-Image Models with Human Preference
- [[https://github.com/GanjinZero/RRHF][RRHF]]: Rank Responses to Align Language Models with Human Feedback without tears
- [[https://github.com/THUDM/ImageReward][ImageReward]]: [[https://arxiv.org/abs/2304.05977][Learning]] and Evaluating Human Preferences for Text-to-Image Generation
- [[https://twitter.com/_akhaliq/status/1681870383408984064 ][FABRIC]]: [[https://twitter.com/dvruette/status/1681942402582425600 ][Personalizing]] Diffusion Models with Iterative Feedback
- training-free approach, exploits the self-attention layer
- improve the results of any Stable Diffusion model
- [[https://twitter.com/_akhaliq/status/1727575485717021062 ][Using]] Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- Direct Preference for Denoising Diffusion Policy Optimization (D3PO)
- omits training a reward model
- [[https://twitter.com/_akhaliq/status/1727565261375418555 ][Diffusion-DPO]]: [[https://github.com/SalesforceAIResearch/DiffusionDPO][Diffusion]] Model Alignment Using Direct Preference Optimization ([[https://github.com/huggingface/diffusers/tree/main/examples/research_projects/diffusion_dpo][training script]])
- improving visual appeal and prompt alignment, using direct preference optimization
- [[https://twitter.com/_akhaliq/status/1737132429385576704 ][SDXL]]: [[https://huggingface.co/mhdang/dpo-sdxl-text2image-v1][Direct]] Preference Optimization (better images) <<SDXL-DPO>> (and [[https://huggingface.co/mhdang/dpo-sd1.5-text2image-v1][SD 1.5]])
- [[id:4b8a772d-e3ad-4183-863b-eeddb47bab9e][ALDM]] layout
- [[https://twitter.com/_akhaliq/status/1749978885893063029 ][RL Diffusion]]: Large-scale Reinforcement Learning for Diffusion Models (improves pretrained)
- [[https://twitter.com/_akhaliq/status/1758000776801137055 ][PRDP]]: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models ==best==
- better training stability for unseen prompts
- reward difference of generated image pairs from their denoising trajectories
- [[id:411493fe-a082-477f-923c-9a048dab036e][MESH HUMAN FEEDBACK]]
*** ACTUALLY SELF-FEEDBACK
- SPIN-Diffusion: [[https://arxiv.org/abs/2402.10210][Self-Play]] Fine-Tuning of Diffusion Models for Text-to-Image Generation ==best==
- diffusion model engages in competition with its earlier versions, iterative self-improvement
- [[https://arxiv.org/abs/2403.13352][AGFSync]]: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
- uses Vision-Language Models (VLMs) to assess quality across style, coherence, and aesthetics, generating feedback
** SD GENERATION OPTIMIZATION
- [[id:3c3b352c-c73e-49e2-8ddc-81a8569229a2][ONE STEP DIFFUSION]] [[SAMPLERS]] [[id:28491008-6287-47c6-ac2e-ed22f862c997][STABLE CASCADE]]
- [[https://twitter.com/Birchlabs/status/1640033271512702977 ][turning off]] [[https://github.com/Birch-san/diffusers-play/commit/77fa7f965edf7ab7280a47d2f8fc0362d4b135a9][CFG when]] denoising sigmas below 1.1
- Tomesd: [[https://github.com/dbolya/tomesd][Token Merging]] for [[https://arxiv.org/abs/2303.17604][Stable Diffusion]] [[https://git.mmaker.moe/mmaker/sd-webui-tome][code]] (patch snippet after this list)
- [[https://lemmy.dbzer0.com/post/14962261][ToDo]]: Token Downsampling for Efficient Generation of High-Resolution Images
- token downsampling of key and value tokens to accelerate inference 2x-4x
- [[https://twitter.com/bahjat_kawar/status/1684827989408673793 ][Nested Diffusion]] Processes for Anytime Image Generation
- can generate viable images when stopped arbitrarily before completion
- [[https://twitter.com/_akhaliq/status/1668076625924177921 ][BOOT]]: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
- uses SD as the teacher model and trains a faster one via bootstrapping; 30 fps
- Divide & Bind Your Attention for Improved Generative Semantic Nursing
- [[https://twitter.com/YumengLi_007/status/1682404804583104512 ][novel objective]] [[https://sites.google.com/view/divide-and-bind][functions]]: can handle complex prompts with proper attribute binding
- [[https://twitter.com/_akhaliq/status/1709059088636612739 ][Conditional]] Diffusion Distillation
- added parameters, supplementing image conditions to the diffusion priors
- super-resolution, image editing, and depth-to-image generation
- [[SAMPLERS]] [[ADAPTIVE GUIDANCE]]
- [[https://github.com/Oneflow-Inc/onediff/tree/main][OneDiff]]: [[https://lemmy.dbzer0.com/post/15883033][acceleration]] library for diffusion models, [[https://github.com/Oneflow-Inc/onediff/tree/main][ComfyUI Nodes]]
- [[https://twitter.com/_akhaliq/status/1760859243018703040 ][T-Stitch]]: Accelerating Sampling in Pre-trained Diffusion Models with Trajectory Stitching
- improve sampling efficiency with no generation degradation
- smaller DPM in the initial steps, larger DPM at a later stage, 40% of the early timesteps
- [[https://lemmy.dbzer0.com/post/18177662][The Missing]] U for Efficient Diffusion Models
- operates with approximately a quarter of the parameters, making diffusion models ~80% faster
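Token Merging is applied by patching the pipeline in place; a short sketch following the tomesd README, where the merge ratio is a typical value rather than a recommendation.

#+begin_src python
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
tomesd.apply_patch(pipe, ratio=0.5)  # merge ~50% of tokens inside attention for speed
image = pipe("a watercolor landscape", num_inference_steps=30).images[0]
#+end_src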
*** ULTRA SPEED
- [[https://twitter.com/StabilityAI/status/1729589510155948074 ][SDXL Turbo]]: A real-time text-to-image generation model (distillation; diffusers usage sketched after this list)
- [[https://github.com/aifartist/ArtSpew/][ArtSpew]]: SD at 149 images per second (high volume random image generation)
- [[https://twitter.com/cumulo_autumn/status/1732309219041571163 ][StreamDiffusion]]: A Pipeline-level Solution for Real-time Interactive Generation (10ms)
- transforms sequential denoising into batched denoising
- [[https://lemmy.dbzer0.com/post/13491532][MobileDiffusion]]: Subsecond Text-to-Image Generation on Mobile Devices
- diffusion-GAN finetuning techniques to achieve 8-step and 1-step inference
- [[https://arxiv.org/abs/2402.17376][Accelerating]] Diffusion Sampling with Optimized Time Steps
- better image quality compared to using uniform time steps
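Single-step sampling with SDXL Turbo through diffusers, as a reference point for the speed claims above; the prompt and dtype settings are arbitrary.

#+begin_src python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16").to("cuda")
# distilled models skip CFG, so guidance_scale is 0 and a single step is enough
image = pipe("a cinematic photo of a lighthouse in a storm",
             num_inference_steps=1, guidance_scale=0.0).images[0]
#+end_src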
*** CACHE
- [[https://twitter.com/_akhaliq/status/1731888038626615703 ][DeepCache]]: Accelerating Diffusion Models for Free ==best==
- exploits temporal redundancy observed in the sequential denoising steps
- superiority over existing pruning and distillation
- [[https://twitter.com/_akhaliq/status/1732587729970479354 ][Cache Me]] if You Can: Accelerating Diffusion Models through Block Caching
- reuse outputs from layer blocks of previous steps, automatically determine caching schedules
- [[https://twitter.com/_akhaliq/status/1736615005913591865 ][Faster Diffusion]]: Rethinking the Role of UNet Encoder in Diffusion Models ==best==
- cyclically reuses encoder features from previous time-steps for the decoder
- [[https://arxiv.org/abs/2401.01008][Fast]] Inference Through The Reuse Of Attention Maps In Diffusion Models
- structured reuse of attention maps during sampling
- [[https://github.com/HaozheLiu-ST/T-GATE][T-GATE]]: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
- two stages: semantics-planning phase, and subsequent fidelity-improving phase
- so the cross-attention output is cached once it converges and kept fixed during the remaining inference steps
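A toy illustration of the shared idea (not any specific paper's code): deep block outputs change slowly across denoising steps, so recompute them only every few steps and reuse the cached value in between. Module names and the refresh interval are assumptions.

#+begin_src python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Wrap a block so its output is recomputed only every `refresh_every` calls."""
    def __init__(self, block: nn.Module, refresh_every: int = 3):
        super().__init__()
        self.block = block
        self.refresh_every = refresh_every
        self.step = 0
        self.cache = None

    def forward(self, x):
        if self.cache is None or self.step % self.refresh_every == 0:
            self.cache = self.block(x)  # full computation on refresh steps
        self.step += 1
        return self.cache               # cached (or freshly computed) output

# toy usage: a stand-in for a deep UNet block called once per denoising step
block = CachedBlock(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
x = torch.randn(1, 64)
for _ in range(10):
    y = block(x)  # recomputed on steps 0, 3, 6, 9; cached otherwise
#+end_src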
**** EXPLOITING FEATURES
- [[https://arxiv.org/abs/2312.03517][FRDiff]]: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models
- Reusing feature maps with high temporal similarity
- [[https://arxiv.org/abs/2312.08128][Clockwork Diffusion]]: Efficient Generation With Model-Step Distillation
- high-res features sensitive to small perturbations; low-res feature only sets semantic layout
- so reuses computation from preceding steps for low-res
*** LCM
:PROPERTIES:
:ID: 7396b121-d509-461a-b5ed-8c75d4718519
:END:
- LCMs: [[https://latent-consistency-models.github.io/][Latent Consistency]] Models: Synthesizing High-Resolution Images with Few-step Inference
- inference with minimal steps (2-4)
- training LCM model: only 32 A100 GPU hours
- Latent Consistency Fine-tuning (LCF) custom datasets
- [[https://github.com/0xbitches/ComfyUI-LCM][comfyui]] [[https://github.com/0xbitches/sd-webui-lcm][auto1111]] [[https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7][the model]]
- [[https://twitter.com/SimianLuo/status/1722845777868075455 ][LCM-LoRA]]: A Universal Stable-Diffusion Acceleration Module
- universally applicable accelerator for diffusion models, plug-in neural PF-ODE solver (usage sketch after this list)
- [[https://twitter.com/_akhaliq/status/1735514410049794502 ][VideoLCM]]: Video Latent Consistency Model
- smooth video synthesis with only four sampling steps
- [[id:ccc8f98c-34eb-448b-b2d8-6ef662627fa4][ANIMATELCM]]
- [[https://twitter.com/fffiloni/status/1756719446578585709 ][Quick]] Image Variations with LCM and Image Caption
- [[https://github.com/jabir-zheng/TCD][TCD]]: [[https://twitter.com/_akhaliq/status/1763436246565572891 ][Trajectory]] Consistency Distillation ([[https://huggingface.co/h1t/TCD-SDXL-LoRA][lora]])
- accurately trace the entire trajectory of the Probability Flow ODE
- https://github.com/dfl/comfyui-tcd-scheduler
- [[id:28ff20ec-5501-47da-9ac1-8adc65303376][LCM-LOOKAHEAD]]
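Usage sketch for LCM-LoRA with diffusers: swap in the LCM scheduler, load the published LoRA, then sample in ~4 steps with low guidance. Step count and scale follow the LCM-LoRA write-ups; treat them as starting points.

#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe("an oil painting of a fox in a snowy forest",
             num_inference_steps=4, guidance_scale=1.0).images[0]
#+end_src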
**** CCM
- [[https://twitter.com/_akhaliq/status/1734804912809148750 ][CCM]]: Adding Conditional Controls to Text-to-Image Consistency Models
- ControlNet-like; a lightweight adapter can be jointly optimized during consistency training
**** PERFLOW
:PROPERTIES:
:ID: 44943c87-ca5b-4604-840e-ff52993c1bf1
:END:
- [[https://github.com/magic-research/piecewise-rectified-flow][PeRFlow]] (Piecewise Rectified Flow)
- fast generation, 4 steps, 4,000 training iterations
- multiview normal maps and textures from text prompts instantly
** PROMPT CORRECTNESS
- [[https://arxiv.org/abs/2211.15518][ReCo]]: region control, counting donuts
- [[https://github.com/hnmr293/sd-webui-cutoff][sd-webui-cutoff]], hide tokens for each separated group, limits the token influence scope (color control)
- hard-prompts-made-easy
- [[https://huggingface.co/spaces/Gustavosta/MagicPrompt-Stable-Diffusion][magic prompt]]: amplifies-improves the prompt
- [[https://github.com/sen-mao/SuppressEOT][Get What]] You Want, Not What You Don’t: Image Content Suppression for Text-to-Image Diffusion Models
- suppress unwanted content generation of the prompt, and encourages the generation of desired content
- better than negative prompts
- [[https://dpt-t2i.github.io/][Discriminative]] Probing and Tuning for Text-to-Image Generation
- discriminative adapter to improve their text-image alignment
- global matching and local grounding
- [[https://twitter.com/_akhaliq/status/1776074505351282720 ][CoMat]]: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
- fine-tuning strategy with an image-to-text(captioning model) concept matching mechanism
- [[https://youtu.be/_Pr7aFkkAvY?si=Xr5e_RL-rwcdL10q][ELLA]] - [[https://github.com/DataCTE/ELLA_Training][A Powerful]] Adapter for Complex Stable Diffusion Prompts
- using an adapter for an LLM instead of CLIP
*** ATTENTION LAYOUT
- Attend-and-Excite ([[https://attendandexcite.github.io/Attend-and-Excite/][excite]] the ignored prompt [[https://github.com/AttendAndExcite/Attend-and-Excite][tokens]]) (no retrain)
- [[https://arxiv.org/abs/2304.03869][Harnessing]] the [[https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn][Spatial-Temporal]] Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis
- [[https://arxiv.org/pdf/2302.13153.pdf][Directed Diffusion]]: [[https://github.com/hohonu-vicml/DirectedDiffusion][Direct Control]] of Object Placement through Attention Guidance (no retrain) [[https://github.com/giga-bytes-dev/stable-diffusion-webui-two-shot/tree/ashen-sensored_directed-diffusion][repo]]
- [[https://twitter.com/_akhaliq/status/1696155079458406758 ][DenseDiffusion]]: Dense Text-to-Image Generation with Attention Modulation
- training free, layout guidance
*** LANGUAGE ENHANCEMENT
- [[IMAGE RELATIONSHIPS]]
- [[https://twitter.com/_akhaliq/status/1670190734543134720 ][Linguistic]] Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
- using prompt sentence structure during inference to improve the faithfulness
- [[https://weixi-feng.github.io/structure-diffusion-guidance/][Training-Free Structured]] [[https://arxiv.org/abs/2212.05032][Diffusion]] Guidance for Compositional [[https://arxiv.org/pdf/2212.05032.pdf][Text-to-Image Synthesis]]
- exploiting the semantic hierarchies of language sentences (lojban)
- [[https://github.com/weixi-feng/Structured-Diffusion-Guidance][Structured Diffusion Guidance]], language enhanced clip enforces on unet
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
- prompt learning, improves the match between the input text and the generated image
**** PROMPT EXPANSION, PROMPT AUGMENTATION
- [[https://huggingface.co/KBlueLeaf/DanTagGen?not-for-all-audiences=true][DanTagGen]]: LLaMA arch
- [[https://github.com/sammcj/superprompter][superprompter]]: Supercharge your AI/LLM prompts
- [[https://arxiv.org/pdf/2403.19716.pdf][Capability-aware]] Prompt Reformulation Learning for Text-to-Image Generation
- effectively learn diverse reformulation strategies across various user capacities to simulate high-capability user reformulation
**** TOKENCOMPOSE
:PROPERTIES:
:ID: bb79e50e-ed85-4f37-bd0c-6cad6acd0a6e
:END:
- [[https://mlpc-ucsd.github.io/TokenCompose/][TokenCompose]]: Grounding Diffusion with Token-level Supervision ==best==
- finetuned with token-wise grounding objectives for multi-category instance composition
- exploiting binary segmentation maps from SAM
- compositions that are unlikely to appear simultaneously in a natural scene
** BIGGER COHERENCE
:PROPERTIES:
:ID: b211cec9-6cf2-4f6d-9e1e-10186f513da1
:END:
- [[INTERPOLATION]] [[id:18c951a2-6883-4010-ad9d-9dee396b9839][VIDEO COHERENCE]]
- [[https://arxiv.org/pdf/2404.03109.pdf][Many-to-many]] Image Generation with Auto-regressive Diffusion Models
*** PANORAMAS
- [[https://research.nvidia.com/labs/dir/diffcollage/][DiffCollage]]: Parallel Generation of Large Content with Diffusion Models (panoramas)
- [[https://twitter.com/_akhaliq/status/1678943514917326848 ][Collaborative]] Score Distillation for Consistent Visual Synthesis
- consistent visual synthesis across multiple samples ==best one==
- distill generative priors over a set of images synchronously
- zoom, video, panoramas
- [[https://syncdiffusion.github.io/][SyncDiffusion]]: Coherent Montage via Synchronized Joint Diffusions
- plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss
- [[https://chengzhag.github.io/publication/panfusion/][Taming Stable]] Diffusion for Text to 360° Panorama Image Generation
- minimize distortion during the collaborative denoising process
**** OUTPAINTING
- [[BETTER INPAINTING]]
- [[https://arxiv.org/abs/2401.15652][Continuous-Multiple Image]] Outpainting in One-Step via Positional Query and A Diffusion-based Approach
- generate content beyond boundaries using relative positional information
- [[https://tencentarc.github.io/BrushNet/][BrushNet]]: [[https://github.com/TencentARC/BrushNet][A Plug-and-Play]] [[https://github.com/nullquant/ComfyUI-BrushNet][Image]] Inpainting Model with Decomposed Dual-Branch Diffusion
- pre-trained SD model, useful in product exhibitions, virtual try-on, or background replacement
*** RESOLUTION
- [[https://twitter.com/_akhaliq/status/1697522827992150206 ][Any-Size-Diffusion]]: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
- training on images of unlimited sizes is unfeasible
- Fast Seamless Tiled Diffusion (FSTD)
- [[https://yingqinghe.github.io/scalecrafter/][ScaleCrafter]]: Tuning-free Higher-Resolution Visual Generation with Diffusion Models (video too)
- generating images at much higher resolutions than the training image sizes
- does not require any training or optimization
- [[https://twitter.com/iScienceLuvr/status/1716789813750493468 ][Matryoshka]] [[https://twitter.com/_akhaliq/status/1716831652545208407 ][Diffusion]] Models
- diffusion process that denoises inputs at multiple resolutions jointly
- [[id:bbc5a347-bc62-4b5e-b659-1c6a57d6a2a5][FIT TRANSFORMER]]
- [[https://lemmy.dbzer0.com/post/17799119][Upsample Guidance]]: Scale Up Diffusion Models without Training
- technique that adapts pretrained model to generate higher-resolution images by adding a single term in the sampling process, without any additional training or relying on external models
- can be applied to various models, such as pixel-space, latent space, and video diffusion models
**** ARBITRARY
- [[https://github.com/MoayedHajiAli/ElasticDiffusion-official][ElasticDiffusion]]: Training-free Arbitrary Size Image Generation
- decoding method better than MultiDiffusion
- [[https://lemmy.dbzer0.com/post/15814254][ResAdapter]]: Domain Consistent Resolution Adapter for Diffusion Models
- unlike post-processing, directly generates images with dynamic resolution
- compatible with ControlNet, IP-Adapter and LCM-LoRA; can be integrated with ElasticDiffusion
* SAMPLERS
- [[https://arxiv.org/pdf/2210.05475.pdf][GENIE]]: Higher-Order Denoising Diffusion Solvers
- faster diffusion equation?
- DDIM vs GENIE
- 4 times less expensive upsampling
- fastest solver https://arxiv.org/abs/2301.12935
- another accelerator: https://arxiv.org/abs/2301.11558
- unipc sampler (sampling in 5 steps; scheduler-swap snippet after this list)
- [[https://blog.novelai.net/introducing-nai-smea-higher-image-generation-resolutions-9b0034ffdc4b][smea]]: (nai) global attention sampling
- quality improvement to DPM++ 2M Karras sampling (less blurry) [[https://www.reddit.com/r/StableDiffusion/comments/11mulj6/quality_improvements_to_dpm_2m_karras_sampling/][reddit]]
- [[https://twitter.com/_akhaliq/status/1716332535142117852 ][DPM-Solver-v3]]: Improved Diffusion ODE Solver with Empirical Model Statistics
- several coefficients efficiently computed on the pretrained model, faster
- [[id:bc0dd47c-4f46-4cd0-9606-555990c06626][STABLESR]] novel approach
- [[DIRECT CONSISTENCY OPTIMIZATION]]: controls intensity of style
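Scheduler swap sketch for few-step sampling with UniPC via diffusers; 8 steps is an assumption here, the notes above claim usable results around 5.

#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("an isometric diorama of a library", num_inference_steps=8).images[0]
#+end_src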
* IMAGE EDITING
- [[id:b4052ea2-df86-4c37-91b0-e2c2448ab08c][3D-AWARE IMAGE EDITING]]
- [[https://arxiv.org/pdf/2211.09794.pdf][null-text]] [[https://github.com/cccntu/efficient-prompt-to-prompt][inversion]]: prompt-to-prompt but better
- [[https://github.com/ShivamShrirao/diffusers/tree/main/examples/imagic][imagic]]: editing photo with prompt
** IMAGE SCULPTING ==best==
:PROPERTIES:
:ID: 303a8796-8fc8-4c2f-92f6-62516c8a6ea1
:END:
- [[https://github.com/vision-x-nyu/image-sculpting][Image]] Sculpting: Precise Object Editing with 3D Geometry Control
- enables direct interaction with their 3D geometry
- pose editing, translation, rotation, carving, serial addition, space deformation
- turned into nerf using Zero-1-to-3, then returned to image including features
** STYLE
- [[https://huggingface.co/papers/2306.00983][StyleDrop]]: [[https://styledrop.github.io/][Text-to-Image]] [[https://github.com/zideliu/StyleDrop-PyTorch][Generation]] in Any Style (muse architecture)
- 1% of parameters (painting style)
- [[https://twitter.com/_akhaliq/status/1685898061221076992 ][PromptStyler]]: Prompt-driven Style Generation for Source-free Domain Generalization
- learnable style word vectors, style-content features to be located nearby
- [[https://arxiv.org/abs/2304.03119][Zero-shot]] [[https://arxiv.org/pdf/2304.03119.pdf][Generative]] [[https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation][Model]] Adaptation via Image-specific Prompt Learning
- adapt style to concept
- [[https://twitter.com/_akhaliq/status/1699267731332182491 ][StyleAdapter]]: A Single-Pass LoRA-Free Model for Stylized Image Generation
- process the prompt and style features separately
- [[https://twitter.com/_akhaliq/status/1702193640687235295 ][DreamStyler]]: Paint by Style Inversion with Text-to-Image Diffusion Models
- textual embedding with style guidance
- [[https://garibida.github.io/cross-image-attention/][Cross-Image]] Attention for Zero-Shot Appearance Transfer
- zero-shot appearance transfer by building on the self-attention layers of image diffusion models
- architectural transfer
- [[id:4c93f57d-43b7-4fbe-9415-e007a06efd46][STYLECRAFTER]] transfer to video
- [[https://github.com/google/style-aligned/][Style Aligned]] Image Generation via Shared Attention ==best== ([[https://github.com/Mikubill/sd-webui-controlnet/commit/47dfefa54fb128035cc6e84c2fca0b4bc28be62f][as controlnet extension]])
- color palette too
- [[https://freestylefreelunch.github.io/][FreeStyle]]: Free Lunch for Text-guided Style Transfer using Diffusion Models
- style transfer built upon sd, dual-stream encoder and single-stream decoder architecture
- content into pixelart, origami, anime
- [[https://cszy98.github.io/PLACE/][PLACE]]: [[https://lemmy.dbzer0.com/post/15768553][Adaptive]] Layout-Semantic Fusion for Semantic Image Synthesis
- image from segmentation map and also using semantic features
- [[https://curryjung.github.io/VisualStylePrompt/][Visual Style]] Prompting with Swapping Self-Attention
- consistent style across generations
- unlike others (IP-Adapter), it disentangles other semantics (like pose) away
- [[https://tianhao-qi.github.io/DEADiff/][DEADiff]]: An Efficient Stylization Diffusion Model with Disentangled Representations ==best==
- decouple the style and semantics of reference images
- optimal balance between the text controllability and style similarity
- [[https://twitter.com/_akhaliq/status/1775718553448051022 ][InstantStyle]]: Free Lunch towards Style-Preserving in Text-to-Image Generation
- decouples style and content from reference images within the feature space
- [[https://mshu1.github.io/dreamwalk.github.io/][DreamWalk]]: Style Space Exploration using Diffusion Guidance
- decompose the text prompt into conceptual elements, apply a separate guidance for each element
- [[id:28ff20ec-5501-47da-9ac1-8adc65303376][LCM-LOOKAHEAD]]
*** B-LoRA
- [[https://arxiv.org/abs/2403.14572][Implicit Style-Content]] [[https://twitter.com/yarden343/status/1772894805313405151 #m][Separation]] using B-LoRA
- preserving its underlying objects, structures, and concepts
- LoRA of two specific blocks
- image style transfer, text-based stylization, consistent style generation, and style-content mixing
*** STYLE TOOLS
- [[https://github.com/learn2phoenix/CSD][Measuring]] Style Similarity in Diffusion Models
- compute similarity score
*** DIRECT CONSISTENCY OPTIMIZATION
- DCO: [[https://lemmy.dbzer0.com/post/14778281][Direct Consistency]] Optimization for Compositional Text-to-Image Personalization
- minimally fine-tuning the pretrained model to achieve consistency
- new sampling method that controls the tradeoff between image fidelity and prompt fidelity
** REGIONS
- different inpainting ways with diffusers: https://github.com/huggingface/diffusers/pull/1585
- [[https://zengyu.me/scenec/][SceneComposer]]: paint with words but cooler
- bounding boxes instead: [[https://github.com/gligen/GLIGEN][GLIGEN]]: image grounding
- better VAE and better masks: https://lipurple.github.io/Grounded_Diffusion/
- [[https://arxiv.org/abs/2403.05018][InstructGIE]]: Towards Generalizable Image Editing
- leveraging the VMamba Block, aligns language embeddings with editing semantics
- editing instructions dataset
*** REGIONS MERGE
- [[id:6a66690f-b76f-441a-a093-3c83ca73af2d][MULTIPLE DIFFUSION]] [[id:b211cec9-6cf2-4f6d-9e1e-10186f513da1][BIGGER COHERENCE]] [[HARMONIZATION]] [[id:3f126569-6deb-45e1-9535-77883fc7ad8b][MULTIPLE LORA]]
- [[https://arxiv.org/pdf/2303.13126.pdf][MagicFusion]]: [[https://magicfusion.github.io/][Boosting]] Text-to-Image Generation Performance by Fusing Diffusion Models
- blending the predicted noises of two diffusion models in a saliency-aware manner (composite)
- [[https://twitter.com/_akhaliq/status/1681865088838270978 ][Text2Layer]]: [[https://huggingface.co/papers/2307.09781][Layered]] Image Generation using Latent Diffusion Model
- train an autoencoder to reconstruct layered images and train models on the latent representation
- generate background, foreground, layer mask, and the composed image simultaneously
- [[https://lemmy.dbzer0.com/post/17448456][Isolated Diffusion]]: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
- bind each attachment to corresponding subjects separately with split text prompts
- object segmentation to obtain the layouts of subjects, then isolate and resynthesize individually
- [[https://lemmy.dbzer0.com/post/17243698][Be Yourself]]: Bounded Attention for Multi-Subject Text-to-Image Generation
- bounded attention, training-free method; bounding information flow in the sampling process
- prevents leakage, promotes each subject’s individuality, even with complex multi-subject conditioning
**** INTERPOLATION
- [[https://github.com/lunarring/latentblending][Latent]] Blending (interpolate latents; slerp sketch after this list)
- latent couple, multidiffusion, [[https://note.com/gcem156/n/nb3d516e376d7][attention couple]]
  - ComfyUI-style region control, but via [[https://github.com/omerbt/MultiDiffusion][masks]] (MultiDiffusion)
- [[https://twitter.com/_akhaliq/status/1683753746315239425 ][Interpolating]] between Images with Diffusion Models
- convincing interpolations across diverse subject poses, image styles, and image content
- [[https://twitter.com/_akhaliq/status/1732973286206636454 ][Smooth Diffusion]]: [[https://github.com/SHI-Labs/Smooth-Diffusion][Crafting]] [[https://arxiv.org/abs/2312.04410][Smooth]] [[https://github.com/SHI-Labs/Smooth-Diffusion][Latent]] Spaces in Diffusion Models ==best==
- steady change in the output image, plug-and-play Smooth-LoRA; best interpolation
- perhaps for video or drag diffusion
- [[https://kongzhecn.github.io/omg-project/][OMG]]: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models
- integrate multiple concepts within a single image
- combined with LoRA and InstantID
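A minimal sketch of the latent-interpolation idea these entries refine: spherically interpolate between two initial noise latents and decode each point with the same prompt (model id and the slerp helper are illustrative assumptions, not any linked project's code):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline

def slerp(t, a, b):
    """Spherical interpolation between two latent tensors (treated as flat vectors)."""
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1, 1))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

shape = (1, pipe.unet.config.in_channels, 64, 64)          # latents for a 512x512 output
lat_a = torch.randn(shape, dtype=torch.float16, device="cuda")
lat_b = torch.randn(shape, dtype=torch.float16, device="cuda")

frames = [
    pipe("a watercolor landscape", latents=slerp(t, lat_a, lat_b)).images[0]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
]
#+end_src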
***** DIFFMORPHER
:PROPERTIES:
:ID: 60a63fe6-8088-4b2b-af55-f1d5e23e804b
:END:
- [[https://twitter.com/_akhaliq/status/1734778250574840146 ][DiffMorpher]]: [[https://twitter.com/sze68zkw/status/1738407559009366025 ][Unleashing]] the Capability of Diffusion Models for Image Morphing ==best==
- alternative to gan; interpolate between their loras (not just their latents)
*** MINIMAL CHANGES
- [[id:db81202f-abf0-410e-98c2-c202fa2ca350][SEMANTICALLY DEFORMED]]
- [[https://delta-denoising-score.github.io/][Delta]] [[https://arxiv.org/abs/2304.07090][Denoising]] Score: minimal modifications, keeping the image
**** HARMONIZATION
- [[REGIONS MERGE]]
- SEELE: [[https://yikai-wang.github.io/seele/][Repositioning]] The Subject Within Image
- minimal changes like moving people, subject removal, subject completion and harmonization
- [[https://arxiv.org/pdf/2303.00262.pdf][Collage]] [[https://twitter.com/VSarukkai/status/1701293909647958490 ][Diffusion]] (harmonize collaged images)
- [[https://twitter.com/_akhaliq/status/1770645980767011268 ][Magic Fixup]]: [[https://twitter.com/HadiZayer/status/1773457936309682661 ][Streamlining]] Photo Editing by Watching Dynamic Videos
- given a coarsely edited image (cut and move blob), synthesizes a photorealistic output
***** SWAPANYTHING
- [[https://twitter.com/_akhaliq/status/1777551248775901647 ][SwapAnything]]: Enabling Arbitrary Object Swapping in Personalized Visual Editing
  - keeps the surrounding context unchanged (e.g. texture and clothing swaps)
**** REGION EXCHANGE
- [[id:992f12e2-c595-4aca-8129-6dace7d2f3ba][VIDEO EXCHANGE]] [[SWAPANYTHING]]
- [[https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model][RDM-Region-Aware-Diffusion-Model]] edits only the region of interest
- [[https://github.com/cloneofsimo/magicmix][magicmix]] merge their noise shapes
- [[https://omriavrahami.com/blended-latent-diffusion-page/][Blended]] Latent Diffusion
- input image and a mask, modifies the masked area according to a guiding text prompt
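A sketch of Blended Latent Diffusion's per-step trick as I read it (not the authors' code): inside the mask the latent follows the prompt's denoising, outside it is reset each step to a re-noised copy of the source latent.
#+begin_src python
import torch
from diffusers import DDIMScheduler

def blend_step(edited_latents, source_latents, latent_mask, scheduler, t):
    """latent_mask: 1 where the prompt may edit, 0 where the source image is kept."""
    noise = torch.randn_like(source_latents)
    noised_source = scheduler.add_noise(source_latents, noise, t)
    return latent_mask * edited_latents + (1 - latent_mask) * noised_source

# toy call just to show the shapes; in practice this runs inside the denoising loop
scheduler = DDIMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
latents = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0                      # editable square in latent space
blended = blend_step(latents, latents.clone(), mask, scheduler, torch.tensor([500]))
#+end_src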
***** SUBJECT SWAPPING
- [[https://huggingface.co/papers/2305.18286][Photoswap]]: Personalized Subject Swapping in Images
- [[https://arxiv.org/abs/2402.18351][LatentSwap]]: An Efficient Latent Code Mapping Framework for Face Swapping
***** BETTER INPAINTING
- [[OUTPAINTING]]
- [[https://powerpaint.github.io/][A Task]] [[https://github.com/open-mmlab/mmagic/tree/main/projects/powerpaint][is Worth]] One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
- inpainting model: context-aware image and shape-guided object inpainting, object removal, controlnet
- [[https://huggingface.co/spaces/modelscope/ReplaceAnything][ReplaceAnything]] [[https://github.com/AIGCDesignGroup/ReplaceAnything][as you want]]: Ultra-high quality content replacement
- masked region is strictly retained
- [[https://arxiv.org/pdf/2404.03642.pdf][DiffBody]]: Human Body Restoration by Imagining with Generative Diffusion Prior
- good proportions, (clothes) texture quality, no limb distortions
- [[https://github.com/htyjers/StrDiffusion][StrDiffusion]]: Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
- semantically sparse structure in early stage, dense texture in late stage
****** MAPPED INPAINTING
- [[https://dangeng.github.io/motion_guidance/][Motion Guidance]]: Diffusion-Based Image Editing with Differentiable Motion Estimators
******* DIFFERENTIAL DIFFUSION
- [[https://lemmy.dbzer0.com/post/13246157][Differential]] [[https://github.com/exx8/differential-diffusion][Diffusion]]: [[https://differential-diffusion.github.io/][Giving]] Each Pixel Its Strength ==best==
- control of the extent to which individual objects are modified, or the ability to introduce gradual spatial changes
  - using change maps: a grayscale map of how much each region is allowed to change
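My reading of the change-map mechanism, sketched as a helper (an assumption-laden sketch, not the repo's code): the binary mask from the blending step above becomes a grayscale map, and each pixel is only released for editing once the sampling progress falls below its map value.
#+begin_src python
import torch

def differential_blend(denoised, source_latents, change_map, scheduler, t):
    """change_map in [0, 1]: 0 = never change, 1 = free to change from the first step.

    t is the current scheduler timestep (int or one-element tensor).
    """
    progress = float(t) / scheduler.config.num_train_timesteps   # ~1.0 early, ~0.0 late
    frozen = (change_map < progress).to(denoised.dtype)          # pixels still locked to the source
    noise = torch.randn_like(source_latents)
    renoised = scheduler.add_noise(source_latents, noise, t)
    return frozen * renoised + (1 - frozen) * denoised
#+end_src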
****** CLOTHES OUTFITS
- [[https://twitter.com/_akhaliq/status/1750737690553692570 ][Diffuse to]] Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
- virtually place any e-commerce item in any setting
***** PIX2PIX REGION
- pix2pix-zero (prompt2prompt without a prompt)
- [[https://github.com/pix2pixzero/pix2pix-zero][no fine]] tuning, using BLIP captions <<pix2pix>>; [[https://huggingface.co/docs/diffusers/api/pipelines/pix2pix_zero][docs]]
- plug-and-[[https://github.com/MichalGeyer/plug-and-play][play]]: like pix2pix-zero but guided by features extracted from the source image
***** FORCE IT WHERE IT FITS
- [[https://arxiv.org/abs/2303.16765][MDP]]: [[https://github.com/QianWangX/MDP-Diffusion][A Generalized]] Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
- no training or finetuning; instead force the prompt (exchange the noise)
- [[https://twitter.com/_akhaliq/status/1644557225103335425 ][PAIR-Diffusion]]: [[https://twitter.com/ViditGoel7/status/1713031352709435736 ][Object-Level]] Image Editing with Structure-and-Appearance
- forces input image into edited image, object-level
***** PROMPT IS TARGET
- [[https://arxiv.org/abs/2211.07825][Direct Inversion]]: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
- only changes where the prompt fits
- [[https://twitter.com/aysegl_dndr/status/1691011667394527232 ][Inst-Inpaint]]: Instructing to Remove Objects with Diffusion Models
- erasing unwanted pixels; estimates which object to be removed
- [[https://arxiv.org/pdf/2303.09618.pdf][HIVE]]: Harnessing Human Feedback for Instructional Visual Editing (reward model)
  - RLHF with editing instructions, so outputs adhere to the given instruction
- [[https://twitter.com/_akhaliq/status/1735516803625893936 ][LIME]]: Localized Image Editing via Attention Regularization in Diffusion Models
- do not require specified regions or additional text input
- clustering technique = segmentation maps; without re-training and fine-tuning
****** DDIM
- [[https://github.com/MirrorDiffusion/MirrorDiffusion][MirrorDiffusion]]: [[https://mirrordiffusion.github.io/][Stabilizing]] Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond ==best==
- prompt redescription strategy, revised DDIM inversion
- [[https://arxiv.org/abs/2403.09468][Eta Inversion]]: [[https://github.com/furiosa-ai/eta-inversion][Designing]] an Optimal Eta Function for Diffusion-based Real Image Editing
- better DDIM
- [[https://twitter.com/_akhaliq/status/1771039688280723724 ][ReNoise]]: Real Image Inversion Through Iterative Noising
- building on reversing the diffusion sampling process to manipulate an image
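Most of these entries refine plain DDIM inversion; a rough baseline sketch with diffusers (prompt_embeds would come from the pipeline's text encoder; classifier-free guidance is left out):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, DDIMInverseScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
inverse_scheduler.set_timesteps(50, device="cuda")

@torch.no_grad()
def ddim_invert(latents, prompt_embeds):
    # run the deterministic DDIM update backwards: image latent -> "noise"
    for t in inverse_scheduler.timesteps:
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = inverse_scheduler.step(noise_pred, t, latents).prev_sample
    return latents   # reconstructs the image when sampled forward with the same prompt
#+end_src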
**** SEMANTIC CHANGE - DETECTION
- [[https://github.com/ml-research/semantic-image-editing][sega]] semantic guidance (apply concept arithmetic on top of an existing generation; usage sketch after this list)
- [[https://twitter.com/SFResearch/status/1612886999152857088 ][EDICT]]: [[https://github.com/salesforce/EDICT][repo]] Exact Diffusion Inversion via Coupled Transformations
  - edits/changes object types (e.g., dog breeds)
- adds noise, complex transformations but still getting perfect invertibility
- [[https://twitter.com/_akhaliq/status/1664485230151884800 ][The Hidden]] [[https://huggingface.co/papers/2306.00966][Language]] of Diffusion Models
- learning interpretable pseudotokens from interpolating unet concepts
- useful for: single-image decomposition to tokens, bias detection, and semantic image manipulation
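SEGA ships in diffusers as SemanticStableDiffusionPipeline; a usage sketch with illustrative edit prompts and scales:
#+begin_src python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photo of a castle next to a river",
    editing_prompt=["oil painting", "crowded with people"],
    reverse_editing_direction=[False, True],   # add the first concept, remove the second
    edit_guidance_scale=[6.0, 6.0],
).images[0]
#+end_src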
***** SWAP PROMPT
- [[USING ATTENTION MAP]] [[TI-GUIDED-EDIT]]
- [[https://twitter.com/_akhaliq/status/1676071757994680321 ][LEDITS]]: Real Image Editing with DDPM Inversion and Semantic Guidance
- prompt changing, minimal variations <<ledits>>
- [[https://twitter.com/kerstingAIML/status/1729778594790907914 ][LEDITS++]], [[https://twitter.com/MBrack_AIML/status/1729919347542356187 ][an efficient]], versatile & precise textual image manipulator ==best==
- no tuning, no optimization, few diffusion steps, multiple simultaneous edits
- architecture-agnostic, masking for local changes; building on SEGA
- [[https://arxiv.org/abs/2303.15649][StyleDiffusion]]: [[https://github.com/sen-mao/StyleDiffusion][Prompt-Embedding]] Inversion for Text-Based Editing
- preserve the object-like attention maps after editing
**** INSTRUCTIONS
- other: [[PIX2PIX REGION]] [[id:ddd3588a-dc3c-426d-a94e-9aa373fabff9][GUIDING FUNCTION]] [[TIP: text restoration]]
- [[https://twitter.com/_akhaliq/status/1670677370276028416 ][MagicBrush]]: A Manually Annotated Dataset for Instruction-Guided Image Editing
- [[https://github.com/timothybrooks/instruct-pix2pix][InstructPix2Pix]] [[https://arxiv.org/abs/2211.09800][paper]] (usage sketch after this list)
- [[https://github.com/ethansmith2000/MegaEdit][MegaEdit]]: like instructPix2Pix but for any model
- based on EDICT and plug-and-play but using DDIM
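The InstructPix2Pix baseline these entries extend, via the diffusers pipeline (image URL is a placeholder):
#+begin_src python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/street.png")       # placeholder input image
edited = pipe(
    "make it look like it just rained",
    image=image,
    image_guidance_scale=1.5,   # how strongly to stay close to the input image
).images[0]
#+end_src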
***** IMAGE INSTRUCTIONS
- [[https://twitter.com/_akhaliq/status/1743108118630818039 ][Instruct-Imagen]]: Image Generation with Multi-modal Instruction
- example images as style, boundary, edges, sketch
- [[https://twitter.com/_akhaliq/status/1686919394415329281 ][ImageBrush]]: [[https://arxiv.org/abs/2403.18660][Learning Visual]] In-Context Instructions for Exemplar-Based Image Manipulation
- a pair of images as visual instructions
- instruction learning as inpainting problem, useful for pose transfer, image translation and video inpainting
***** IMAGE TRANSLATION
- [[SEVERAL CONTROLS IN ONE]] [[CCM]] [[id:9307c803-21ff-47bf-bdc1-15ea79d2444f][MESH TO MESH]] [[id:98791065-c5dc-4f12-8c0c-fffad5715a2e][SDXS]]
- [[id:d3c6d9ef-9dff-4c60-8f92-5a523c24c139][DRAG DIFFUSION]] dragging two points on the image
- [[https://twitter.com/_akhaliq/status/1691345566243201024 ][Jurassic World]] Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation
  - zero-shot <<image-to-image translation>> (I2I) across large domain gaps, like skeleton to dinosaur
- prompting provides target domain
- [[https://github.com/ader47/jittor-jieke-semantic_images_synthesis][IIDM]]: Image-to-Image Diffusion Model for Semantic Image Synthesis
- [[https://twitter.com/_akhaliq/status/1770089964744618320 ][One-Step]] Image Translation with Text-to-Image Models
- adapting a single-step diffusion model; preserve the input image structure
****** INTO MANGA
:PROPERTIES:
:ID: 56a81747-2a44-410e-9ca0-26f366829f3e
:END:
- [[https://arxiv.org/abs/2403.08266][Sketch2Manga]]: Shaded Manga Screening from Sketch with Diffusion Models
  - converts a normal generation into manga style while fixing lighting anomalies (so it actually looks like manga)
- fixes the tones
****** ARTIST EDITING
:PROPERTIES:
:ID: 20a546c6-135e-45f3-88a6-d3e5869bd28f
:END:
- [[https://lemmy.dbzer0.com/post/12260609][Re:Draw]] — Context Aware Translation as a Controllable Method for Artistic Production
  - context-aware inpainting (style and emotion), e.g. the color of an eye
- [[https://arxiv.org/pdf/2402.02733.pdf][ToonAging]]: Face Re-Aging upon Artistic Portrait Style Transfer (including anime)
- and portrait style transfer, single generation step
****** SLIME
:PROPERTIES:
:ID: 7cd466fd-1feb-47ce-bf9a-033ba4838579
:END:
- [[https://twitter.com/_akhaliq/status/1699607375785705717 ][SLiMe]]: Segment Like Me
- extract attention maps, learn about segmented region, then inference
***** EXPLICIT REGION
- [[https://huggingface.co/spaces/xdecoder/Instruct-X-Decoder][X-Decoder]]: instructPix2Pix [[https://github.com/microsoft/X-Decoder][per]] region(objects)
  - comparable to [[vpd]] <<x-decoder>>
- [[https://arxiv.org/pdf/2303.17546.pdf][PAIR-Diffusion]]: [[https://github.com/Picsart-AI-Research/PAIR-Diffusion][Object-Level]] Image Editing with Structure-and-Appearance Paired Diffusion Models (region editing)
** SPECIFIC CONCEPTS
- [[layout aware]]
- [[https://twitter.com/_akhaliq/status/1688747476382019584 ][ConceptLab]]: [[https://github.com/kfirgoldberg/ConceptLab][Creative]] Generation using Diffusion Prior Constraints
- generate a new, imaginary concept; adaptively constraints-optimization process
- [[https://github.com/dvirsamuel/SeedSelect][SeedSelect]]: rare concept images, generation of uncommon and ill-formed concepts
- selecting suitable generation seeds from few samples
- [[https://arxiv.org/abs/2403.10133][E4C]]: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance ==best==
- preserving the semantical structure
*** CONTEXT LEARNING
- [[https://twitter.com/_akhaliq/status/1673544034193924103 ][DomainStudio]]: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data
- keep the relative distances between adapted samples to achieve generation diversity
- [[https://twitter.com/WenhuChen/status/1643079958388940803 ][SuTi]]: [[https://open-vision-language.github.io/suti/][Subject-driven]] Text-to-Image Generation via Apprenticeship Learning (using examples)
- replaces subject-specific fine tuning with in-context learning, <<suti>>
**** SEMANTIC CORRESPONDENCE
- [[https://arxiv.org/pdf/2305.15581.pdf][Unsupervised Semantic]] Correspondence Using Stable Diffusion ==no training== ==from other image==
- find locations in multiple images that have the same semantic meaning
- optimize prompt embeddings for maximum attention on the regions of interest
- capture semantic information about location, which can then be transferred to another image
**** IMAGE RELATIONSHIPS
- [[https://twitter.com/_akhaliq/status/1668450247385796609 ][Controlling]] [[https://github.com/Zeju1997/oft][Text-to-Image]] Diffusion by Orthogonal Finetuning
- preserves the hyperspherical energy of the pairwise neuron relationship
  - preserves semantic coherence (relationships)
- [[id:bb79e50e-ed85-4f37-bd0c-6cad6acd0a6e][TOKENCOMPOSE]]
***** VERBS
- [[https://ziqihuangg.github.io/projects/reversion.html][ReVersion]]: [[https://github.com/ziqihuangg/ReVersion][Diffusion-Based]] [[https://github.com/ziqihuangg/ReVersion][Relation]] Inversion from Images
- like putting images on materials
- unlike inverting object appearance, inverting object relations
- ADI: [[https://lemmy.dbzer0.com/post/15105096][Learning]] Disentangled Identifiers for Action-Customized Text-to-Image Generation
- learn action-specific identifiers from the exemplar images ignoring appearances
- [[https://arxiv.org/abs/2402.11487][Visual Concept-driven]] Image Generation with Text-to-Image Diffusion Model
- concepts that can interact with other concepts, using masks to teach
*** EXTRA PRETRAINED
- [[id:ddd3588a-dc3c-426d-a94e-9aa373fabff9][GUIDING FUNCTION]] [[IDENTITY ZERO-SHOT]]
- [[https://github.com/mkshing/e4t-diffusion][E4T-diffusion]]: [[https://tuning-encoder.github.io/][Tuning]] [[https://arxiv.org/abs/2302.12228][encoder]]: the text embedding + offset weights <<e4t>> (Needs a >40GB GPU ) (faces)
- [[https://dxli94.github.io/BLIP-Diffusion-website/][BLIP-Diffusion]]: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
- learned in 40 steps vs Textual Inversion 3000
- Subject-driven Style Transfer, Subject Interpolation
- concept replacement
- [[https://arxiv.org/pdf/2305.15779.pdf][Custom-Edit]]: Text-Guided Image Editing with Customized Diffusion Models <<custom-edit diffusion>>
**** UNDERSTANDING NETWORK
- [[https://arxiv.org/pdf/2302.13848.pdf][Elite]]: [[https://github.com/csyxwei/ELITE][Encoding]] Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- extra neural network to get text embedding, fastest text embeddings
- <<Elite>>
- [[https://arxiv.org/abs/2306.00971][ViCo]]: [[https://github.com/haoosz/ViCo][Detail-Preserving]] Visual Condition for Personalized Text-to-Image Generation
  - extra module on top; does not finetune the original diffusion model; awesome quality, <<ViCo>>
- unlike elite: automatic mechanism to generate object mask: cross-attentions
- [[PHOTOMAKER]] faces
*** SEVERAL CONCEPTS
- [[id:6a66690f-b76f-441a-a093-3c83ca73af2d][MULTIPLE DIFFUSION]]
- [[https://rich-text-to-image.github.io/][Expressive Text-to-Image]] [[https://github.com/SongweiGe/rich-text-to-image][Generation with]] Rich Text (learn concept-map from maxed averages)
- [[https://arxiv.org/abs/2304.06027][Continual]] [[https://jamessealesmith.github.io/continual-diffusion/][Diffusion]]: Continual Customization of Text-to-Image Diffusion with C-LoRA
- sequentially learned concepts
- [[https://huggingface.co/papers/2305.16311][Break-A-Scene]]: [[https://twitter.com/Gradio/status/1696585736454349106 ][Extracting]] Multiple Concepts from a Single Image
- [[https://twitter.com/_akhaliq/status/1653620239735595010 ][Key-Locked]] Rank One Editing for Text-to-Image Personalization
- combine individually learned concepts into a single generated image
- [[https://huggingface.co/papers/2305.18292][Mix-of-Show]]: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
- solving concept conflicts
*** CONES
- [[https://arxiv.org/abs/2303.05125][Cones]]: [[https://github.com/Johanan528/Cones][Concept Neurons]] [[https://github.com/damo-vilab/Cones][in Diffusion]] [[https://github.com/ali-vilab/Cones-V2][Models]] for Customized Generation (better than Custom Diffusion)
- index only the locations in the layers that give rise to a subject, add them together to include multiple subjects in a new context
- [[https://twitter.com/__Johanan/status/1664495182379884549 ][Cones]] 2: [[https://arxiv.org/pdf/2305.19327.pdf][Customizable]] Image Synthesis with Multiple Subjects
- flexible composition of various subjects without any model tuning
  - learning an extra embedding on top of a regular text embedding, and using layout to compose
*** SVDIFF
- SVDiff: [[https://arxiv.org/pdf/2303.11305.pdf][Compact Parameter]] [[https://arxiv.org/abs/2303.11305][Space]] for Diffusion Fine-Tuning, [[https://twitter.com/mk1stats/status/1643992102853038080 ][code]]([[https://twitter.com/mk1stats/status/1644830152118120448 ][soon]])
- multisubject learning, like D3S
- personalized concepts, combinable; training gan out of its conv
- Singular Value Decomposition (SVD) = gene coefficient vs expression level
  - CoSINE: Compact parameter space for SINgle image Editing (remove from the prompt after finetuning it)
- [[https://arxiv.org/abs/2304.06648][DiffFit]]: [[https://github.com/mkshing/DiffFit-pytorch][Unlocking]] Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
  - it's PEFT for diffusion
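A sketch of the SVDiff idea as described above (my paraphrase, not the official code): decompose a pretrained weight with SVD and train only a small additive shift on its singular values.
#+begin_src python
import torch
import torch.nn as nn

class SVDiffLinear(nn.Module):
    """Fine-tunes only a spectral shift on a frozen pretrained weight (bias omitted)."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.delta = nn.Parameter(torch.zeros_like(S))   # the only trainable parameters

    def forward(self, x):
        W = self.U @ torch.diag(torch.relu(self.S + self.delta)) @ self.Vh
        return x @ W.T
#+end_src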
*** LIKE ORIGINAL ONES
- 2 passes to make bigger: Standard High-Res fix or Deep Shrink High-Res Fix ([[https://twitter.com/ai_characters/status/1726369195296960994 ][kohya]])
- [[https://twitter.com/_akhaliq/status/1714490671233454134 ][VeRA]]: Vector-based Random Matrix Adaptation
- single pair of low-rank matrices shared across all layers and learning small scaling vectors instead
- 10x less parameters
- [[https://twitter.com/_akhaliq/status/1715240693403185496 ][An Image]] is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
- Multi-Concept Prompt Learning (MCPL)
- disentangled concepts with enhanced word-concept correlation
- [[https://twitter.com/_akhaliq/status/1732236367982162080 ][X-Adapter]]: [[https://showlab.github.io/X-Adapter/][Adding Universal]] [[https://github.com/showlab/X-Adapter][Compatibility]] of Plugins for Upgraded Diffusion Model
- feature remapping from SD 1.5 to SDXL for all loras and controlnets
- so you can train at lower resources and map to higher
- [[COGCARTOON]]
- [[P+]]: learning text embeddings for each layer of the unet
- [[https://lemmy.dbzer0.com/post/12196023?scrollToComments=true][PALP]]: Prompt Aligned Personalization of Text-to-Image Models
- input: image and prompt
  - the output reflects ALL the prompt tokens, not just some
- [[https://eclipse-t2i.github.io/Lambda-ECLIPSE/][λ-ECLIPSE]]: [[https://huggingface.co/spaces/ECLIPSE-Community/lambda-eclipse-personalized-t2i][Multi-Concept]] Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
- [[https://twitter.com/_akhaliq/status/1758354431588938213 ][DreamMatcher]]: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image
- (Personalization for Kandinsky) trained using projection loss and clip contrastive loss
- plug-in method that does semantic matching instead of replacement-disruption
- [[https://twitter.com/wuyang_ly/status/1769213318877598110 ][UniHDA]]: A Unified and Versatile framework for generative Hybrid Domain Adaptation
- blends all characteristics at once, maintains robust cross-domain consistency
**** TARGETING CONTEXTUAL CONSISTENCY
- [[https://lemmy.dbzer0.com/post/13450998][Pick-and-Draw]]: Training-free Semantic Guidance for Text-to-Image Personalization
- approach to boost identity consistency and generative diversity for personalization methods
- [[https://lemmy.dbzer0.com/post/13386940][Object-Driven]] One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
- class-characterizing regularization to preserve prior knowledge of object classes, so it integrates seamlessly with existing concepts
**** LORA
:PROPERTIES:
:ID: e261c214-31a2-4d93-a62b-61d7d53b702c
:END:
- lora, lycoris, loha, lokr
- loha handles multiple-concepts better
- https://www.canva.com/design/DAFeAteHW18/view#5
- use regularization images with lora https://rentry.org/59xed3#regularization-images
- [[https://twitter.com/_akhaliq/status/1668828166499041281 ][GLORA]]: One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
- individual adapter of each layer
  - superior accuracy with fewer parameters and less computation
- [[https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference][PEFT]] x Diffusers Integration
- [[https://twitter.com/_akhaliq/status/1725357173155271120 ][Tied-Lora]]: Enhancing parameter efficiency of LoRA with weight tying
  - uses ~13% of LoRA's parameter count (better parameter efficiency)
- [[https://twitter.com/_akhaliq/status/1727177584759161126 ][Concept]] [[https://twitter.com/davidbau/status/1730788830876229776 ][Sliders]]: LoRA Adaptors for Precise Control in Diffusion Models, plug and play ==best==
- concept sliders that enable precise control over attributes
- intuitive editing of visual concepts for which textual description is difficult
- repair of object deformations and fixing distorted hands
- [[https://twitter.com/_akhaliq/status/1727574713751249102 ][ZipLoRA]]: [[https://twitter.com/_akhaliq/status/1728086020267078100 ][Any]] Subject in Any Style by Effectively Merging LoRAs
- cheaply and effectively merge independently trained style and subject LoRAs
- [[https://arxiv.org/abs/2402.09353][DoRA]]: Weight-Decomposed Low-Rank Adaptation
- decomposes the pre-trained weight into two components, magnitude and direction; directional updates
- [[https://lemmy.dbzer0.com/post/15399911][DiffuseKronA]]: [[https://github.com/IBM/DiffuseKronA][A Parameter]] Efficient Fine-tuning Method for Personalized Diffusion Model
  - Kronecker product-based adaptation; reduces the parameter count by up to 35% compared to LoRA
- [[B-LoRA]]
- [[https://lemmy.dbzer0.com/post/18313588][CAT]]: Contrastive Adapter Training for Personalized Image Generation
- no loss of diversity in object generation, no token = no effect
- [[CTRLORA]]
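For reference, the low-rank update all of these adapters build on, as a minimal torch sketch (plain LoRA; LoHa/LoKr/DoRA change how the delta is factorized or scaled):
#+begin_src python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                              # frozen pretrained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B
        nn.init.zeros_(self.up.weight)                               # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
#+end_src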
***** MULTIPLE LORA
:PROPERTIES:
:ID: 3f126569-6deb-45e1-9535-77883fc7ad8b
:END:
- [[https://twitter.com/_akhaliq/status/1721759353437311461 ][S-LoRA]]: Serving Thousands of Concurrent LoRA Adapters
- scalable serving of many LoRA adapters, all adapters in the main memory, fetches for the current queries
- [[https://twitter.com/_akhaliq/status/1726793541253280249 ][MultiLoRA]]: Democratizing LoRA for Better Multi-Task Learning
- changes parameter initialization of adaptation matrices to reduce parameter dependency
- [[https://twitter.com/_akhaliq/status/1732237243610210536 ][Orthogonal]] Adaptation for Modular Customization of Diffusion Models
- customized models can be summed with minimal interference, and jointly synthesize
- scalable customization of diffusion models by encouraging orthogonal weights
- [[https://twitter.com/_akhaliq/status/1762334024561787339 ][Multi-LoRA]] Composition for Image Generation
- [[https://arxiv.org/abs/2403.19776][CLoRA]]: A Contrastive Approach to Compose Multiple LoRA Models
- enables the creation of composite images that truly reflect the characteristics of each LoRA
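For contrast, the naive weighted composition these papers improve on, using the diffusers/PEFT integration (checkpoint and adapter names are placeholders; needs a recent diffusers with the PEFT backend):
#+begin_src python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/to/style_lora", adapter_name="style")      # placeholder paths
pipe.load_lora_weights("path/to/subject_lora", adapter_name="subject")
pipe.set_adapters(["style", "subject"], adapter_weights=[0.7, 1.0])     # naive weighted merge

image = pipe("a portrait of sks person, papercut style").images[0]
#+end_src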
**** TEXTUAL INVERSION
- [[https://t.co/DbEPmPZB1l][Multiresolution Textual]] [[https://github.com/giannisdaras/multires_textual_inversion][Inversion]]: better textual inversion (embedding)
- Extended Textual Inversion (XTI)
- [[https://prompt-plus.github.io/][P+]]: [[https://prompt-plus.github.io/files/PromptPlus.pdf][Extended Textual]] Conditioning in Text-to-Image Generation <<P+>>
- different text embedding per unet layer
- [[https://github.com/cloneofsimo/promptplusplus][code]]
- [[https://arxiv.org/abs/2305.05189][SUR-adapter]]: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models (llm)
- adapter to transfer the semantic understanding of llm to align complex vs simple prompts
- [[id:5762b4c1-e574-4ca5-9e38-032071698637][DREAMDISTRIBUTION]] is like Textual Inversion
- [[https://github.com/RoyZhao926/CatVersion][CatVersion]]: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
- learns the gap between the personalized concept and its base class
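Plain single-vector textual inversion at inference time with diffusers (XTI/P+ extend this to one embedding per U-Net layer); the concept repo is one of the public sd-concepts-library examples:
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# load a learned token embedding from the Hub; it registers the <cat-toy> token
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipe("a <cat-toy> sitting on a bookshelf").images[0]
#+end_src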
* USE CASES
- [[image-to-image translation]] [[id:7f6f5bc1-ca59-4557-b908-0345e8127cde][ERASING CONCEPTS]]
** IMAGE COMPRESSION FILE
- [[https://arxiv.org/abs/2401.17789][Robustly overfitting]] latents for flexible neural image compression
- refine the latents of pre-trained neural image compression models
- [[https://arxiv.org/abs/2402.08643][Learned]] Image Compression with Text Quality Enhancement
- text logit loss function
** DIFFUSION AS ENCODER - RETRIEVE PROMPT
:PROPERTIES:
:ID: 40792f03-5726-453b-af13-ba0667592497
:END:
- [[https://twitter.com/_akhaliq/status/1719899183430169056 ][De-Diffusion]] [[https://dediffusion.github.io/][Makes]] Text a Strong Cross-Modal Interface
- text as a cross-modal interface
- autoencoder uses a pre-trained text-to-image diffusion model for decoding
- encoder is trained to transform an input image into text
- PH2P: [[https://arxiv.org/abs/2312.12416][Prompting]] Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- projection scheme to optimize for prompts representative of the space in the model (meaningful prompts)
** DIFFUSING TEXT
- [[RESTORING HANDS]]
- [[https://ds-fusion.github.io/static/pdf/dsfusion.pdf][DS-Fusion]]: [[https://ds-fusion.github.io/][Artistic]] Typography via Discriminated and Stylized Diffusion (fonts)
- [[https://1073521013.github.io/glyph-draw.github.io/][GlyphDraw]]: [[https://arxiv.org/pdf/2303.17870.pdf][Learning]] [[https://twitter.com/_akhaliq/status/1642696550529867779 ][to Draw]] Chinese Characters in Image Synthesis Models Coherently
- [[https://arxiv.org/pdf/2402.14314.pdf][Typographic]] Text Generation with Off-the-Shelf Diffusion Model
- complex effects while preserving its overall coherence
- [[https://huggingface.co/papers/2305.18259][GlyphControl]]: [[https://github.com/AIGText/GlyphControl-release][Glyph Conditional]] Control for Visual Text Generation ==this==
- [[https://github.com/microsoft/unilm/tree/master/textdiffuser][TextDiffuser]]: [[https://arxiv.org/pdf/2305.10855.pdf][Diffusion]] [[https://huggingface.co/spaces/microsoft/TextDiffuser][Models]] as Text Painters
- [[https://github.com/microsoft/unilm/tree/master/textdiffuser-2][TextDiffuser-2]]: two language models: for layout planning and layout encoding; before the unet
- [[TIP: text restoration]]
- [[https://arxiv.org/abs/2403.16422][Refining Text-to-Image]] Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation
- training-free framework to enhance layout generator and image generator conditioned on it
- generating images with long and rare text sequences
*** GENERATE VECTORS
- [[https://twitter.com/_akhaliq/status/1736998105969459522 ][VecFusion]]: Vector Font Generation with Diffusion
  - first generates rasterized fonts, then a vector model synthesizes the vector fonts
- [[https://twitter.com/_akhaliq/status/1737304904400613558 ][StarVector]]: Generating Scalable Vector Graphics Code from Images
- CLIP image encoder, learning to align the visual and code tokens, generate SVGs
- [[https://arxiv.org/abs/2401.17093][StrokeNUWA]]: Tokenizing Strokes for Vector Graphic Synthesis
- encoding into stroke tokens, naturally compatible with LLMs
- [[https://arxiv.org/abs/2404.00412][SVGCraft]]: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout
- creation of vector graphics depicting entire scenes from textual descriptions
- optimized using a pre-trained encoder
*** INPAINTING TEXT
- DiffSTE: Inpainting to edit text in images with a prompt ([[https://drive.google.com/file/d/1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f/view][model]])
- [[https://github.com/UCSB-NLP-Chang/DiffSTE][Improving]] Diffusion Models for Scene Text Editing with Dual Encoders
**** DERIVED FROM SD
- [[https://github.com/ZYM-PKU/UDiffText][UDiffText]]: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models (with training code)
- [[https://arxiv.org/abs/2312.12232][Brush Your]] Text: Synthesize Any Scene Text on Images via Diffusion Model (Diff-Text)
- attention constraint to address unreasonable positioning, more accurate scene text, any language
  - it's just a prompt and canny: “sign”, “billboard”, “label”, “promotions”, “notice”, “marquee”, “board”, “blackboard”, “slogan”, “whiteboard”, “logo”
- [[https://github.com/tyxsspa/AnyText][AnyText]]: [[https://twitter.com/_akhaliq/status/1741239193215344810 ][Multilingual]] [[https://youtu.be/hrk_b_CQ36M?si=6cwXFAd1106D3aHK][Visual]] Text Generation And Editing ==best==
- inputs: glyph, position, and masked image to generate latent features for text generation-editing
  - handles curved text and text shaped into textures
** IMAGE RESTORATION, SUPER-RESOLUTION
- [[https://twitter.com/_akhaliq/status/1678804195229433861 ][NILUT]]: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
  - image signal processing pipeline; blends multiple styles into a single network
- [[https://arxiv.org/abs/2303.09833][FreeDoM]]: Training-Free Energy-Guided Conditional Diffusion Model
- [[https://arxiv.org/abs/2304.08291][refusion]]: Image Restoration with Mean-Reverting Stochastic Differential Equations
- [[https://arxiv.org/pdf/2212.00490.pdf][image]] restoration IR, [[https://github.com/wyhuai/DDNM][DDNM]] using NULL-SPACE
- unlimited [[https://arxiv.org/pdf/2303.00354.pdf][superresolution]]
- [[https://twitter.com/_akhaliq/status/1674249594421608448 ][SVNR]]: Spatially-variant Noise Removal with Denoising Diffusion
  - fixes real-life noise
- [[https://github.com/WindVChen/INR-Harmonization][Dense]] [[https://github.com/WindVChen/INR-Harmonization][Pixel-to-Pixel]] Harmonization via Continuous Image Representation
  - fixes images stretched by resolution changes
- [[https://github.com/WindVChen/Diff-Harmonization][Zero-Shot Image]] Harmonization with Generative Model Prior
- [[https://github.com/xpixelgroup/diffbir][DiffBIR]]: Towards Blind Image Restoration with Generative Diffusion Prior
  - uses SwinIR, then refines with SD
*** SUPERRESOLUTION
- [[https://github.com/csslc/CCSR][CCSR]]: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
- Swintormer: [[https://github.com/bnm6900030/swintormer][Image]] Deblurring based on Diffusion Models (limited memory)
- [[https://twitter.com/_akhaliq/status/1749254341507039644 ][Inflation]] with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
- for videos, temporal adapter to ensure temporal coherence
- [[https://lemmy.dbzer0.com/post/13460000][YONOS-SR]]: You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- start by training a teacher model on a smaller magnification scale
  - one step instead of 200, with a finetuned decoder on top
- SUPIR: [[https://github.com/Fanghua-Yu/SUPIR][Scaling Up]] to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
- based on large-scale diffusion generative prior
- [[https://arxiv.org/pdf/2401.15366.pdf][Face to Cartoon]] Incremental Super-Resolution using Knowledge Distillation
- faces and anime restoration at various levels of detail
- [[https://twitter.com/_akhaliq/status/1769784456112374077 ][APISR]]: [[https://twitter.com/_akhaliq/status/1769784456112374077 ][Anime]] Production Inspired Real-World Anime Super-Resolution
- [[https://arxiv.org/abs/2403.12915][Ultra-High-Resolution]] Image Synthesis with Pyramid Diffusion Model
- pyramid latent representation
- [[https://github.com/mit-han-lab/efficientvit][EfficientViT]]: Multi-Scale Linear Attention for High-Resolution Dense Prediction ==best==
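For orientation next to the specialized methods above, the stock diffusion super-resolution baseline in diffusers (image URL is a placeholder; input should be small, e.g. 128x128, for the x4 upscaler):
#+begin_src python
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = load_image("https://example.com/low_res.png")    # placeholder low-res input
upscaled = pipe(prompt="a sharp photo of a cat", image=low_res).images[0]
#+end_src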
**** STABLESR
:PROPERTIES:
:ID: bc0dd47c-4f46-4cd0-9606-555990c06626
:END:
- [[https://github.com/IceClear/StableSR][StableSR]]: [[https://huggingface.co/Iceclear/StableSR][Exploiting]] Diffusion Prior for Real-World Image Super-Resolution
  - develops a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models
**** DEMOFUSION
- [[https://arxiv.org/abs/2311.16973][DemoFusion]]: Democratising High-Resolution Image Generation With No $$$
- achieve higher-resolution image generation
- [[https://twitter.com/radamar/status/1732978064026706425 ][Enhance]] This: DemoFusion SDXL
- [[https://github.com/ttulttul/ComfyUI-Iterative-Mixer][ComfyUI]] Iterative Mixing Nodes ==best==
- iterative mixing of samples to help with upscaling quality
- SD 1.5 generating at higher resolutions
- evolution from [[https://github.com/Ttl/ComfyUi_NNLatentUpscale][NNLatentUpscale]]
***** PASD MAGNIFY
- [[https://twitter.com/fffiloni/status/1743306262379475304 ][PASD]] Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
- image slider custom component
** DEPTH GENERATION
- [[https://twitter.com/_akhaliq/status/1630747135909015552 ][depth map]] from diffusion, build a 3d environment with it
- [[https://github.com/wl-zhao/VPD][VPD]]: using diffusion for depth estimation, image segmentation (better) <<vpd>> comparable [[x-decoder]]
- [[https://github.com/isl-org/ZoeDepth][ZoeDepth]]: [[https://arxiv.org/abs/2302.12288][Combining]] relative and metric depth
- [[https://github.com/BillFSmith/TilingZoeDepth][tiling ZoeDepth]]
- [[https://twitter.com/zhenyu_li9955/status/1732669069717909672 ][PatchFusion]]: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- [[https://twitter.com/AntonObukhov1/status/1732946419663667464 ][Marigold]]: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (70s inference)
- [[https://twitter.com/radamar/status/1691137538734583808 ][LDM3D]] by intel, generates image & depth from text prompts
- [[https://twitter.com/_akhaliq/status/1721793148391760196 ][LDM3D-VR]]: Latent Diffusion Model for 3D VR
- generating depth together, panoramic RGBD
- DMD (Diffusion for Metric Depth)
- [[https://twitter.com/_akhaliq/status/1737699544542973965 ][Zero-Shot]] Metric Depth with a Field-of-View Conditioned Diffusion Model (depth from image)
- [[https://twitter.com/pythontrending/status/1750141129314468051 ][Depth Anything]]: [[https://twitter.com/mervenoyann/status/1750531698008498431 ][Unleashing]] the Power of Large-Scale Unlabeled Data (temporal coherence, no flickering)
- [[id:fa7469f5-948a-42b2-8787-14109bc9ed5a][GIBR]]
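Quick monocular depth baseline via the transformers pipeline (model id is the small Depth Anything checkpoint on the Hub; other depth checkpoints such as DPT work the same way):
#+begin_src python
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")
result = depth_estimator(Image.open("photo.jpg"))           # local image file as input
result["depth"].save("photo_depth.png")                     # PIL image of the predicted depth map
#+end_src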
*** DEPTH DIFFUSION
:PROPERTIES:
:ID: 277f7cda-963c-48ff-8e43-169986d8cff6
:END:
- [[https://twitter.com/_akhaliq/status/1734051086175027595 ][MVDD]]: Multi-View Depth Diffusion Models
- 3D shape generation, depth completion, and its potential as a 3D prior
- enforce 3D consistency in multi-view depth
- [[https://twitter.com/_akhaliq/status/1770673356821442847 ][DepthFM]]: Fast Monocular Depth Estimation with Flow Matching
- pre-trained image diffusion model can become flow matching depth model
*** NORMAL MAPS
- [[https://github.com/baegwangbin/DSINE][DSine]]: Rethinking Inductive Biases for Surface Normal Estimation
  - better than Bae et al. and MiDaS
- [[https://github.com/Mikubill/sd-webui-controlnet/discussions/2703][preprocessor]]