:PROPERTIES: :ID: c7fe7e79-73d3-4cc7-a673-2c2e259ab5b5 :END: #+title: stable diffusion #+filetags: :neuralnomicon: #+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup - parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]] - related: [[id:58c585b9-a03e-4320-a313-e00e68c4ce7e][diffusion video]] [[id:75929071-e62b-4c0a-8374-8ca322d0a020][software]] - combining [[https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega][pipelines]], creating [[https://huggingface.co/docs/diffusers/main/en/using-diffusers/contribute_pipeline][pipelines]] - generate: [[id:3a9a6a52-b3a6-4a69-b402-531b3b1e2d91][NOVEL VIEW]] - how to [[https://wandb.ai/johnowhitaker/midu-guidance/reports/-Mid-U-Guidance-Fast-Classifier-Guidance-for-Latent-Diffusion-Models--VmlldzozMjg0NzA1][guidance-classifier]] the diffusion * SD MODELS - [[https://twitter.com/iScienceLuvr/status/1717359916422496596][CommonCanvas]]: [[https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md#coming-soon][An Open]] [[https://github.com/mosaicml/diffusion][Diffusion]] Model Trained with Creative-Commons Images - CC-licensed images with BLIP-2 captions, similar performance to Stable Diffusion 2 (apache license) - [[https://huggingface.co/ptx0/terminus-xl-gamma-v1][Terminus]] XL Gamma: simpler SDXL, for inpainting tasks, super-resolution, style transfer - [[SDXL-DPO]] - [[https://huggingface.co/wangfuyun/AnimateLCM-SVD-xt][AnimateLCM-SVD-xt]]: image to video - stable-cascade: würstchen architecture = even smaller latent space - [[https://huggingface.co/KBlueLeaf/Stable-Cascade-FP16-fixed][Stable-Cascade-FP16]] - sd x8 compression (1024x1024 > 128x128) vs cascade x42 compression, (1024x1024 > 24x24) - faster inference, cheaper training - [[id:fdbe1937-ac2f-4eb6-b617-8e48fca083e4][STABLE DIFFUSION 3]] - [[https://huggingface.co/fal/AuraFlow][AuraFlow]]: actually open source (apache 2) model, by simo ** DISTILLATION - [[https://twitter.com/camenduru/status/1716817970255831414][SSD1B]] ([[https://blog.segmind.com/introducing-segmind-ssd-1b/][distilled]] [[https://huggingface.co/segmind/SSD-1B/blob/main/SSD-1B.safetensors][SDXL]]) 60% Fast -40% VRAM - [[https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic][Playground v2]] - [[https://twitter.com/_akhaliq/status/1760157720173224301][SDXL-Lightning]]: [[https://twitter.com/_akhaliq/status/1760157720173224301][a lightning]] fast 1024px text-to-image generation model (few-steps generation) - progressive adversarial diffusion distillation - [[https://snap-research.github.io/BitsFusion/][BitsFusion]]: 1.99 bits Weight Quantization of Diffusion Model - SD 1.5 quantized to 1.99 bits (instead of 8B) *** ONE STEP DIFFUSION :PROPERTIES: :ID: 9e94f7d8-752f-48e9-9ef1-9c79eba258e3 :END: - [[https://tianweiy.github.io/dmd/][One-step Diffusion]] with Distribution Matching Distillation - comparable with v1.5 while being 30x faster - critic similar to GANs in that is jointly trained with the generator - differs in that it does not play adversarial game, and can fully leverage a pretrained model *** SDXS :PROPERTIES: :ID: 98791065-c5dc-4f12-8c0c-fffad5715a2e :END: - [[https://lemmy.dbzer0.com/post/17299457][SDXS]]: Real-Time One-Step Latent Diffusion Models with Image Conditions - knowledge distillation to streamline the U-Net and image decoder architectures - one-step DM training technique that utilizes feature matching and score distillation - speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x 
faster than SDXL) on a GPU - image-conditioned control, facilitating efficient image-to-image translation. ** IRIS LUX https://civitai.com/models/201287 Model created through consensus via statistical filtering (novel consensus merge) https://gist.github.com/Extraltodeus/0700821a3df907914994eb48036fc23e ** EMOJIS - [[https://twitter.com/_akhaliq/status/1726817847525978514][Text-to-Sticker]]: Style Tailoring Latent Diffusion Models for Human Expression - emojis, stickers ** MERGING MODELS - where the text encoder is different for each, by training a difference - https://www.reddit.com/r/StableDiffusion/comments/1g6500o/ive_managed_to_merge_two_models_with_very/ *** SEGMOE :PROPERTIES: :ID: 7c77fcdf-8b60-48dc-bb7a-11c9d6aad309 :END: - [[https://huggingface.co/segmind/SegMoE-SD-4x2-v0][SegMoE]] - [[https://youtu.be/6Q4BJOcvwGE?si=zBNrQrKIgmwmPPvI][The Stable]] [[https://huggingface.co/segmind][Diffusion]] [[https://lemmy.dbzer0.com/post/13761591][Mixture]] of Experts for Image Generation, Mixture of Diffusion Experts - training free, creation of larger models on the fly, larger knowledge * GENERATION CONTROL - [[EXTRA PRETRAINED]] [[id:208c064d-f700-4e8f-a4ab-2c73c557f9e3][DRAG]] [[MAPPED INPAINTING]] - [[https://www.storminthecastle.com/posts/01_head_poser/][hyperparameters with]] extra network [[https://wandb.ai/johnowhitaker/midu-guidance/reports/Mid-U-Guidance-Fast-Classifier-Guidance-for-Latent-Diffusion-Models--VmlldzozMjg0NzA1][Mid-U Guidance]] - block [[https://github.com/hako-mikan/sd-webui-lora-block-weight#%E6%A6%82%E8%A6%81][weights lora]] - [[https://twitter.com/_akhaliq/status/1759789799685202011][DiLightNet]]: [[https://arxiv.org/abs/2402.11929][Fine-grained]] Lighting Control for Diffusion-based Image Generation - using light hints to resynthetize a prompt with user-defined consistent lighting - [[https://arxiv.org/abs/2403.06452][Text2QR]]: [[https://github.com/mulns/Text2QR][Harmonizing]] Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation - refines the output iteratively in the latent space - [[https://twitter.com/_akhaliq/status/1778606395014676821][ControlNet++]]: Improving Conditional Controls with Efficient Consistency Feedback - explicitly optimizing pixel-level cycle consistency between generated images ** MATERIAL EXTRACTION - [[https://arxiv.org/pdf/2403.20231.pdf][U-VAP]]: User-specified Visual Appearance Personalization via Decoupled Self Augmentation - generates images with the material or color extracted from the input image - sentence describing the desired attribute - learn user-specified visual attributes - [[https://ttchengab.github.io/zest/][ZeST]]: Zero-Shot Material Transfer from a Single Image - leverages adapters to extract implicit material representation from exemplar image ** LIGHT CONTROL - [[https://github.com/DiffusionLight/DiffusionLight][DiffusionLight]]: Light Probes for Free by Painting a Chrome Ball - render a chrome ball into the input image - produces convincing light estimates ** BACKGROUND - [[https://twitter.com/bria_ai_/status/1754846894675673097][BriaAI]]: [[https://twitter.com/camenduru/status/1755038599500718083][Open-Source]] Background Removal (RMBG v1.4) - [[https://github.com/layerdiffusion/sd-forge-layerdiffusion][LayerDiffusion]]: Transparent Image Layer Diffusion using Latent Transparency - layers with alpha, generate pngs, remove backgrounds (more like generate with removable background) - method learns a “latent transparency” - 
[[https://huggingface.co/LayerDiffusion/layerdiffusion-v1/tree/main][models]] ** EMOTIONS - [[https://arxiv.org/abs/2401.01207][Towards]] a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation - face swapping and reenactment, interpolate between emotions - [[https://arxiv.org/abs/2401.04608][EmoGen]]: Emotional Image Content Generation with Text-to-Image Diffusion Models - clip, abstract emotions - [[https://arxiv.org/abs/2403.08255][Make Me]] Happier: Evoking Emotions Through Image Diffusion Models - understanding and editing source images emotions cues ** NOISE CONTROL :PROPERTIES: :ID: b68ef215-2e3e-4cd5-abbd-dffcc30acdae :END: - offset noise(darkness capable loras), pyramid noise - [[https://arxiv.org/pdf/2305.08891.pdf][Common Diffusion]] Noise Schedules and Sample Steps are Flawed (and several proposed fixes) - native offset noise - [[https://github.com/Extraltodeus/noise_latent_perlinpinpin][noisy perlin]] latent - you can reinject the same noise pattern after an upscale, more coherent results and better upscaling - [[https://arxiv.org/abs/2402.04930][Blue noise]] for diffusion models - allows introducing correlation across images within a single mini-batch to improve gradient flow ** GUIDING FUNCTION :PROPERTIES: :ID: ddd3588a-dc3c-426d-a94e-9aa373fabff9 :END: - [[https://github.com/arpitbansal297/Universal-Guided-Diffusion][Universal Guided Diffusion]] (face and style transfer) - [[https://arxiv.org/abs/2303.09833][FreeDoM]]: [[https://github.com/vvictoryuki/FreeDoM][Training-Free]] Energy-Guided Conditional Diffusion Model <> - extra: repo has list of deblurring, super-resolution and restoration methods - masks as energy function - Diffusion Self-Guidance [[https://dave.ml/selfguidance/][for Controllable]] Image Generation - steer sampling, similarly to classifier guidance, but using signals in the pretrained model itself - instructional transfomations - [[https://mcm-diffusion.github.io/][MCM]] [[https://arxiv.org/pdf/2302.12764.pdf][Modulating Pretrained]] Diffusion Models for Multimodal Image Synthesis (module after denoiser) mmc - mask like control to tilt the noise, maybe useful for text <> *** ADAPTIVE GUIDANCE - [[https://twitter.com/_akhaliq/status/1737695636814712844][Adaptive Guidance]]: Training-free Acceleration of Conditional Diffusion Models - AG, efficient variant of CFG(Classifier-Free Guidance); reducing computation by 25% - omits network evaluations when the denoising process displays convergence - second half of the denoising process redundant; plug-and-play alternative to Guidance Distillation - LinearAG: entire neural-evaluations can be replaced by affine transformations of past estimates ** CONTROL NETWORKS, CONTROLNET - [[id:33903015-49dd-4a1a-81b5-78350c074fff][REFERENENET]] [[id:d1d1a9ff-670e-4bed-9087-ad0b8b71ee7a][CONTROLNET FOR 3D]] [[CCM]] [[id:fd3d677f-1b5e-46a3-8ee9-6524baa07339][CONTROLNET VIDEO]] - why controlnet, alternatives https://github.com/lllyasviel/ControlNet/discussions/188 - [[https://github.com/Sierkinhane/VisorGPT][VisorGPT]]: Learning Visual Prior via Generative Pre-Training - [[https://huggingface.co/papers/2305.13777][gpt]] that learns to tranform normal prompts into controlnet primitives - [[https://twitter.com/_akhaliq/status/1735515389692424461][FineControlNet]]: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection - geometric control via human pose images and appearance control via instance-level text prompts - 
[[https://twitter.com/_akhaliq/status/1734808238753788179][FreeControl]]: [[https://github.com/kijai/ComfyUI-Diffusers-freecontrol?tab=readme-ov-file][Training-Free]] Spatial Control of Any Text-to-Image Diffusion Model with Any Condition - alignment with guidance image: lidar, face mesh, wireframe mesh, rag doll - [[https://github.com/SamsungLabs/FineControlNet][FineControlNet]]: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection - instance-specific text description, better prompt following *** SKETCH - [[https://arxiv.org/abs/2401.00739][diffmorph]]: text-less image morphing with diffusion models - sketch-to-image module - [[https://lemmy.dbzer0.com/post/15434577][Block]] and Detail: Scaffolding Sketch-to-Image Generation - sketch-to-image, can generate coherent elements from partial sketches, generate beyond the sketch following the prompt - [[https://arxiv.org/abs/2402.17624][CustomSketching]]: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing - one for contour, the other flow lines representing texture *** ALTERNATIVES - controlNet (total control of image generation, from doodles to masks) - T2I-Adapter (lighter, composable), [[https://www.reddit.com/r/StableDiffusion/comments/11v3dgj/comment/jcrag7x/?utm_source=share&utm_medium=web2x&context=3][how color pallete]] - lora like (old) https://github.com/HighCWu/ControlLoRA - [[https://vislearn.github.io/ControlNet-XS/][ControlNet-XS]]: 1% of the parameters - [[https://twitter.com/_akhaliq/status/1732585051039088837][LooseControl]]: [[https://github.com/shariqfarooq123/LooseControl][Lifting]] ControlNet for Generalized Depth Conditioning - loosely specifying scenes with boxes - controlnet-lltite by [[https://github.com/kohya-ss/sd-scripts/blob/sdxl/docs/train_lllite_README.md][kohya]] - [[https://twitter.com/_akhaliq/status/1736991952283783568][SCEdit]]: [[https://github.com/mkshing/scedit-pytorch][Efficient]] [[https://scedit.github.io/][and Controllable]] Image Diffusion Generation via Skip Connection Editing - lightweight tuning module named SC-Tuner, synthesis by injecting different conditions - reduces training parameters and memory requirements - Integrated Into SCEPTER and SWIFT - [[https://lemmy.dbzer0.com/post/12591345][Compose and]] [[https://twitter.com/_akhaliq/status/1747857732818854040][Conquer]]: Diffusion-Based 3D Depth Aware Composable Image Synthesis - imposing global semantics onto targeted regions without the use of any additional localization cues - alternative to controlnet and t2i-adapter **** CTRLORA - https://github.com/xyfJASON/ctrlora *** TIP: text restoration - [[https://twitter.com/_akhaliq/status/1737318799634755765][TIP]]: Text-Driven Image Processing with Semantic and Restoration Instructions ==best== - controlnet architecture, leverages natural language as interface to control image restoration - instruction driven, can inprint text into image *** HANDS - [[id:3f752b46-cae4-49d9-948d-50e3c500727e][HANDS DATASET]] - [[https://arxiv.org/abs/2312.04867][HandDiffuse]]: Generative Controllers for Two-Hand Interactions via Diffusion Models - two-hand interactions, motion in-betweening and trajectory control **** RESTORING HANDS - [[https://arxiv.org/abs/2312.04236][Detecting]] and Restoring Non-Standard Hands in Stable Diffusion Generated Images - body pose estimation to understand hand orientation for accurate anomaly correction - integration of ControlNet and InstructPix2Pix - [[https://github.com/wenquanlu/HandRefiner][HandRefiner]]: 
[[https://github.com/wenquanlu/HandRefiner][Refining]] Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting - incorrect number of fingers, irregular shapes, effectively rectified - utilize ControlNet modules to re-inject corrected information, 1.5 *** USING ATTENTION MAP - [[CONES]] [[The Chosen One]] [[id:65812d6a-a81d-47f2-a7ad-25c94e2ff70a][STORYTELLER DIFFUSION]] - [[https://rival-diff.github.io/][RIVAL]]: Real-World Image Variation by Aligning Diffusion Inversion Chain ==best== **** MASA - [[https://ljzycmd.github.io/projects/MasaCtrl/][MasaCtrl]]: [[https://github.com/TencentARC/MasaCtrl][Tuning-free]] Mutual Self-Attention Control for Consistent Image Synthesis and Editing - same thing different views or poses - by querying the attention map from another image - better than ddim inversion, consistent SD animations; mixable with T2I-Adapter ***** TI-GUIDED-EDIT - [[https://arxiv.org/abs/2401.02126][Unified]] [[https://github.com/Kihensarn/TI-Guided-Edit][Diffusion-Based]] Rigid and Non-Rigid Editing with Text and Image Guidance - rigid=conserve the structure **** LLLYASVIEL - reference-only preprocessor doesnt require any control models, generate variations - can guide the diffusion directly using images as references, and generate variations - [[https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode][Guess Mode]] / [[https://github.com/lllyasviel/ControlNet/discussions/188][Non-Prompt]] Mode, now named: Control Modes, how much prompt vs controlnet; [[https://github.com/comfyanonymous/ComfyUI_experiments][comfy node]] *** SEVERAL CONTROLS IN ONE - [[https://huggingface.co/papers/2305.11147][UniControl]]: [[https://www.reddit.com/r/StableDiffusion/comments/15851w6/code_for_unicontrol_has_been_released/][A Unified]] [[https://twitter.com/CaimingXiong/status/1662250281315315713][Diffusion]] Model for Controllable Visual Generation In the Wild - several controlnets in one, contextual understanding - image deblurring, image colorization - [[https://twitter.com/abhi1thakur/status/1684926197870870529][using UniControl]] with Stable Diffusion XL 1.0 Refiner; sketch to image tool - In-[[https://github.com/Zhendong-Wang/Prompt-Diffusion][Context]] [[https://zhendong-wang.github.io/prompt-diffusion.github.io/][Learning]] Unlocked for Diffusion Models - learn translation of image to hed, depth, segmentation, outline ** HUMAN PAINT - [[https://arxiv.org/pdf/2108.01073.pdf][SDEdit]]: guided image synthesis and editing with stochastic differential equation - stroke based inpainting-editing - [[https://arxiv.org/pdf/2402.03705.pdf][FOOLSDEDIT]]: Deceptively Steering Your Edits Towards Targeted Attribute-aware Distribution - forcing SDEdit to generate a data distribution aligned a specified attribute (e.g. 
female) - [[https://zhexinliang.github.io/Control_Color/][Control]] [[https://github.com/ZhexinLiang/Control-Color][Color]]: Multimodal Diffusion-Based Interactive Image Colorization - paint over grayscale to recolor it ** LAYOUT DIFFUSION :PROPERTIES: :ID: dafb1713-5d08-40de-b445-76d25f2cf070 :END: - 3d: [[id:5e1ee0b4-8493-44e4-b0cf-89b429a78532][ROOM LAYOUT]] - [[ATTENTION LAYOUT]] [[id:65812d6a-a81d-47f2-a7ad-25c94e2ff70a][STORYTELLER DIFFUSION]] - ZestGuide: [[https://twitter.com/_akhaliq/status/1673539960664911874][Zero-shot]] [[https://twitter.com/gcouairon/status/1721529637690327062][spatial]] layout conditioning for text-to-image diffusion models - implicit segmentation maps can be extracted from cross-attention layers - spatial conditioning to sd without finetunning - [[https://arxiv.org/abs/2402.04754][Towards Aligned]] Layout Generation via Diffusion Model with Aesthetic Constraints - constraints representing design intentions - continuous state-space design can incorporate differentiable aesthetic constraint functions in training - by introducing conditions via masked input - [[https://arxiv.org/abs/2402.12908][RealCompo]]: [[https://github.com/YangLing0818/RealCompo][Dynamic Equilibrium]] between Realism and Compositionality Improves Text-to-Image Diffusion Models - dynamically balance the strengths of the two models in denoising process - [[https://spright-t2i.github.io/][Getting]] it Right: Improving Spatial Consistency in Text-to-Image Models - better representing spatial relationships - faithfully follow the spatial relationships specified in the text prompt *** SCENES - [[https://twitter.com/_akhaliq/status/1674623306551508993][Generate Anything]] Anywhere in Any Scene <> - training guides to focus on object identity, personalized concept with localization controllability - [[ANYDOOR]] [[id:4b8a772d-e3ad-4183-863b-eeddb47bab9e][ALDM]] *** WITH BOXES - [[https://gligen.github.io/][GLIGEN]]: Open-Set Grounded Text-to-Image Generation (boxes) - [[https://twitter.com/_akhaliq/status/1645253639575830530][Training-Free]] Layout Control with Cross-Attention Guidance - [[https://arxiv.org/pdf/2304.14573.pdf][SceneGenie]]: Scene Graph Guided Diffusion Models for Image Synthesis - [[https://twitter.com/_akhaliq/status/1683340606217781248][BoxDiff]]: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion - [[https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/][InstanceDiffusion]]: [[https://lemmy.dbzer0.com/post/13827955][Instance-level]] Control for Image Generation - conditional generation, hierarchical bounding-boxes structure, featur(prompt) at point - single points, scribbles, bounding boxes or segmentation masks - [[https://arxiv.org/abs/2402.17910][Box It]] to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models - bounding boxes with attribute(prompt) binding *** ALDM :PROPERTIES: :ID: 4b8a772d-e3ad-4183-863b-eeddb47bab9e :END: - [[https://lemmy.dbzer0.com/post/12605682][ALDM]]: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive - layout faithfulness *** OPEN-VOCABULARY :PROPERTIES: :ID: 19d99453-7d66-41b1-80e8-fbe91d035084 :END: - [[https://arxiv.org/abs/2401.16157][Spatial-Aware Latent]] Initialization for Controllable Image Generation - inverted reference image contains spatial awareness regarding positions, resulting in similar layouts - open-vocabulary framework to customize a spatial-aware initialization *** CARTOON :PROPERTIES: :ID: cc058fea-c2dd-4f7f-aa59-156825bed0ef :END: - 
[[https://whaohan.github.io/desigen/][Desigen]]: A Pipeline for Controllable Design Template Generation - generating images with proper layout space for text; generating the template itself **** COGCARTOON - [[https://arxiv.org/pdf/2312.10718.pdf][CogCartoon]]: Towards Practical Story Visualization - plugin-guided and layout-guided inference; specific character = 316 KB plugin ** IMAGE PROMPT - ONE IMAGE - [[suti]] [[custom-edit diffusion]] *** UNET LESS - [[https://github.com/drboog/ProFusion][ProFusion]]: Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach - and can interpolate between two - promptnet (embedding), encoder based, for style transform - one image, no regularization needed - [[https://twitter.com/kelvinckchan/status/1680288217378197504][Taming]] Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models - using CLIP features extracted from the subject *** IMAGE-SUGGESTION - [[SEMANTIC CORRESPONDENCE]] - UMM-Diffusion, TIUE: [[https://arxiv.org/abs/2303.09319][Unified Multi-Modal]] Latent Diffusion for Joint Subject and Text Conditional Image Generation - takes joint texts and images - only the image-mapping to a pseudo word embedding is learned **** ZERO SHOT - [[https://twitter.com/_akhaliq/status/1732592245105185195][Context Diffusion]]: In-Context Aware Image Generation - separates the encoding of the visual context; prompt not needed - ReVision - Unclip https://comfyanonymous.github.io/ComfyUI_examples/sdxl/ - Revision gives the model the pooled output from CLIPVision G instead of the CLIP G text encoder - [[https://github.com/Xiaojiu-z/SSR_Encoder][SSR-Encoder]]: Encoding Selective Subject Representation for Subject-Driven Generation - architecture designed for selectively capturing any subject from single or multiple reference images ***** IP-ADAPTER - [[https://twitter.com/_akhaliq/status/1691341380348682240][IP-Adapter]]: [[https://github.com/tencent-ailab/IP-Adapter][Text Compatible]] Image Prompt Adapter for Text-to-Image Diffusion Models ==stock SD== - works with other controlnets - [[https://huggingface.co/h94/IP-Adapter-FaceID][IP-Adapter-FaceID]] (face recognition model) ****** LCM-LOOKAHEAD :PROPERTIES: :ID: 28ff20ec-5501-47da-9ac1-8adc65303376 :END: - [[https://lcm-lookahead.github.io/][LCM-Lookahead]] for Encoder-based Text-to-Image Personalization - LCM-based approach for propagating image-space losses to personalization model training and classifier guidance ***** SEECODERS :PROPERTIES: :ID: 1c014bca-d8db-4d28-9c49-5297626d4484 :END: - [[https://arxiv.org/abs/2305.16223][Seecoders]]: [[https://github.com/SHI-Labs/Prompt-Free-Diffusion][Prompt-Free]] Diffusion: Taking "Text" out of Text-to-Image Diffusion Models - Semantic Context Encoder, replaces clip with seecoder; works with ==stock SD== - input image and controlnet - unlike unclip, seecoders uses extra model - one image into several perspectives ([[id:505848e8-02a5-4699-be28-6e7b2e91837c][MULTIVIEW DIFFUSION]]) - the embeddings can be textures, effects, objects, semantics(contexts) tics, etc. 
**** PERSONALIZATION - [[https://twitter.com/_akhaliq/status/1645254918121422859][InstantBooth]]: Personalized Text-to-Image Generation without Test-Time Finetuning - personalized images with only a single forward pass - [[https://twitter.com/AbermanKfir/status/1679689404573679616][HyperDreamBooth]]: HyperNetworks for Fast Personalization of Text-to-Image Models; just one image *** IDENTITY - [[https://github.com/cloneofsimo/lora/discussions/96][masked score estimation]] - HiPer: [[https://arxiv.org/abs/2303.08767][Highly Personalized]] Text Embedding for Image Manipulation by Stable Diffusion - one image single thing, gets the clip - [[IP-ADAPTER]] **** STORYTELLER DIFFUSION :PROPERTIES: :ID: 65812d6a-a81d-47f2-a7ad-25c94e2ff70a :END: - [[https://consistory-paper.github.io/][ConsiStory]]: Training-Free Consistent Text-to-Image Generation - training-free approach for consistent subject(object) generation x20 faster, multi-subject scenarios - by sharing the internal activations of the pretrained model **** ANYDOOR - [[https://github.com/damo-vilab/AnyDoor][AnyDoor]]: [[https://damo-vilab.github.io/AnyDoor-Page/][Zero-shot]] [[https://twitter.com/_akhaliq/status/1738772616142303728][Object-level]] [[https://twitter.com/_akhaliq/status/1738775751887860120][Image]] Customization - teleport target objects to new scenes at user-specified locations - identity feature with detail feature - moving objects, swapping them, multi-subject composition, try-on a cloth **** SUBJECT - [[https://huggingface.co/papers/2306.00926][Inserting Anybody]] in Diffusion Models via Celeb Basis - one facial photograph, 1024 learnable parameters, 3 minutes; several at once - [[https://twitter.com/_akhaliq/status/1683294368940318720][Subject-Diffusion]]:Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning - multi subject, single reference image - [[https://twitter.com/_akhaliq/status/1701777751286366283][PhotoVerse]]: Tuning-Free Image Customization with Text-to-Image Diffusion Models - incorporates facial identity loss, single facial photo, single training phase - [[https://twitter.com/_akhaliq/status/1725365231050793081][The Chosen]] [[https://omriavrahami.com/the-chosen-one/][One]]: Consistent Characters in Text-to-Image Diffusion Models - <> sole input being text - generate gallery of images, use pre-trained feature extractor to choose the most cohesive cluster - [[https://twitter.com/_akhaliq/status/1732222107583500453][FaceStudio]]: Put Your Face Everywhere in Seconds ==best== - direct feed-forward mechanism, circumventing the need for intensive fine-tuning - stylized images, facial images, and textual prompts to guide the image generation process - [[https://arxiv.org/abs/2402.00631][SeFi-IDE]]: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation - face-wise attention loss to fit the face region ***** IDENTITY IN VIDEO :PROPERTIES: :ID: 8872aa4e-0394-4066-822b-9145f14caf6f :END: - [[https://magic-me-webpage.github.io/][Magic-Me]]: Identity-Specific Video Customized Diffusion ****** STABLEIDENTITY :PROPERTIES: :ID: 55829fe3-d777-4723-8b48-5c9454822b5e :END: - [[https://arxiv.org/abs/2401.15975][StableIdentity]]: Inserting Anybody into Anywhere at First Sight - identity recontextualization with just one face image without finetuning - also for into video/3D generation ***** IDENTITY ZERO-SHOT - [[https://github.com/InstantID/InstantID][InstantID]]: [[https://instantid.github.io/][Zero-shot]] Identity-Preserving Generation in Seconds (using face encoder) - 
[[https://github.com/TencentARC/PhotoMaker][PhotoMaker]]: Customizing Realistic Human Photos via Stacked ID Embedding Paper page - [[https://twitter.com/_akhaliq/status/1769930922525159883][Infinite-ID]]: Identity-preserved Personalization via ID-semantics Decoupling Paradigm ==best== - identity provided by the reference image while mitigating interference from textual input - [[https://caphuman.github.io/][CapHuman]]: Capture Your Moments in Parallel Universes - encode then learn to align, identity preservation for new individuals without tuning - [[https://twitter.com/_akhaliq/status/1740616525478781168][SSR-Encoder]]: Encoding Selective Subject Representation for Subject-Driven Generation ==best== - Token-to-Patch Aligner = preserving fine features of the subjects; multiple subjects - combinable with controlnet, and across styles - [[https://twitter.com/_akhaliq/status/1764514136849846667][RealCustom]]: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization - gradually narrowing to the specific subject, iteratively update the influence scope ***** PHOTOMAKER - [[https://huggingface.co/papers/2312.04461][PhotoMaker]]: [[https://twitter.com/_akhaliq/status/1732965700405281099][Customizing]] [[https://arxiv.org/pdf/2312.04461.pdf][Realistic]] Human Photos via Stacked ID Embedding - encodes (into mlp) images into embedding wich preserves id **** ANIME - [[https://github.com/7eu7d7/DreamArtist-sd-webui-extension][DreamArtist]]: a single one image and target text (mainly works with anime) - [[https://twitter.com/_akhaliq/status/1738018255720030343][DreamTuner]]: Single Image is Enough for Subject-Driven Generation - subject-encoder for coarse subject identity preservation, training-free - [[https://github.com/laksjdjf/pfg][pfg]] Prompt free generation; learns to interpret (anime) input-images - old one: [[https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6585][PaintByExample]] *** VARIATIONS - others: [[USING ATTENTION MAP]] [[VARIATIONS]] [[Elite]] [[ZERO SHOT]] - image variations model (mix images): https://twitter.com/Buntworthy/status/1615302310854381571 - by versatile diffusion model guy, [[https://www.reddit.com/r/StableDiffusion/comments/10ent88/guy_who_made_the_image_variations_model_is_making/][reddit]] - improved: https://github.com/SHI-Labs/Versatile-Diffusion - stable diffusion reimagine: conditioning the unet with the image clip embeddings, then training * BETTER DIFFUSION - editing [[https://time-diffusion.github.io/TIME_paper.pdf][default]] of a prompt: https://github.com/bahjat-kawar/time-diffusion - [[https://github.com/SusungHong/Self-Attention-Guidance][Self-Attention Guidance]] (SAG): [[https://arxiv.org/pdf/2210.00939.pdf][SAG leverages]] [[https://github.com/ashen-sensored/sd_webui_SAG][intermediate attention]] maps of diffusion models at each iteration to capture essential information for the generative process and guide it accordingly - pretty much just reimplemented the attention function without changing much else - [[https://github.com/ChenyangSi/FreeU#freeu-code][FreeU]]: [[https://twitter.com/_akhaliq/status/1704721496122266035][Free]] Lunch in Diffusion U-Net (unet) ==best== - improves diffusion model sample quality at no costs - more color variance - [[https://twitter.com/_akhaliq/status/1683293200574988289][Diffusion Sampling]] with Momentum for Mitigating Divergence Artifacts - incorporation of: Heavy Ball (HB) momentum = expand stability regions; Generalized HB (GHVB) = supression - better low step sampling - DG: 
[[https://github.com/luping-liu/Detector-Guidance][Detector Guidance]] for Multi-Object Text-to-Image Generation - mid-diffusion, performs latent object detection then enhances following CAMs(cross-attention maps) ** SCHEDULER - [[https://arxiv.org/abs/2301.11093v1][simple diffusion]]: End-to-end diffusion for high resolution images - shifted scheduled noise - [[https://github.com/Extraltodeus/sigmas_tools_and_the_golden_scheduler][Sigmas Tools]] and The Golden Scheduler ** QUALITY - [[RESOLUTION]] - [[https://twitter.com/_akhaliq/status/1707253415061938424][Emu]]: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (dataset method) - guide pre-trained model to exclusively generate good images - [[https://twitter.com/_akhaliq/status/1712830952441819382][HyperHuman]]: Hyper-Realistic Human Generation with Latent Structural Diffusion - Latent Structural Diffusion Model that simultaneously denoises depth and surface normal with RGB image - [[https://github.com/openai/consistencydecoder][Consistency]] Distilled Diff VAE - Improved decoding for stable diffusion vaes ** HUMAN FEEDBACK :PROPERTIES: :ID: 59d1d337-eff3-42bb-9398-1e51b0739074 :END: - [[id:37688f5e-9dc2-48ed-a3f9-eeb318c64f02][RLCM]] - Aligning Text-to-Image Models using Human Feedback https://arxiv.org/abs/2302.12192 - [[https://tgxs002.github.io/align_sd_web/][Better Aligning]] Text-to-Image Models with Human Preference - [[https://github.com/GanjinZero/RRHF][RRHF]]: Rank Responses to Align Language Models with Human Feedback without tears - [[https://github.com/THUDM/ImageReward][ImageReward]]: [[https://arxiv.org/abs/2304.05977][Learning]] and Evaluating Human Preferences for Text-to-Image Generation - [[https://twitter.com/_akhaliq/status/1681870383408984064][FABRIC]]: [[https://twitter.com/dvruette/status/1681942402582425600][Personalizing]] Diffusion Models with Iterative Feedback - training-free approach, exploits the self-attention layer - improve the results of any Stable Diffusion model - [[https://twitter.com/_akhaliq/status/1727575485717021062][Using]] Human Feedback to Fine-tune Diffusion Models without Any Reward Model - Direct Preference for Denoising Diffusion Policy Optimization (D3PO) - omits training a reward model - [[https://twitter.com/_akhaliq/status/1727565261375418555][Diffusion-DPO]]: [[https://github.com/SalesforceAIResearch/DiffusionDPO][Diffusion]] Model Alignment Using Direct Preference Optimization ([[https://github.com/huggingface/diffusers/tree/main/examples/research_projects/diffusion_dpo][training script]]) - improving visual appeal and prompt alignment, using direct preference optimization - [[https://twitter.com/_akhaliq/status/1737132429385576704][SDXL]]: [[https://huggingface.co/mhdang/dpo-sdxl-text2image-v1][Direct]] Preference Optimization (better images) <> (and [[https://huggingface.co/mhdang/dpo-sd1.5-text2image-v1][SD 1.5]]) - [[id:4b8a772d-e3ad-4183-863b-eeddb47bab9e][ALDM]] layout - [[https://twitter.com/_akhaliq/status/1749978885893063029][RL Diffusion]]: Large-scale Reinforcement Learning for Diffusion Models (improves pretrained) - [[https://twitter.com/_akhaliq/status/1758000776801137055][PRDP]]: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models ==best== - better training stability for unseen prompts - reward difference of generated image pairs from their denoising trajectories - [[id:411493fe-a082-477f-923c-9a048dab036e][MESH HUMAN FEEDBACK]] *** ACTUALLY SELF-FEEDBACK - SPIN-Diffusion: 
[[https://arxiv.org/abs/2402.10210][Self-Play]] Fine-Tuning of Diffusion Models for Text-to-Image Generation ==best== - diffusion model engages in competition with its earlier versions, iterative self-improvement - [[https://arxiv.org/abs/2403.13352][AGFSync]]: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation - use Vision Models (VLM) to assess quality across style, coherence, and aesthetics, generating feedback ** SD GENERATION OPTIMIZATION - [[id:3c3b352c-c73e-49e2-8ddc-81a8569229a2][ONE STEP DIFFUSION]] [[SAMPLERS]] [[id:28491008-6287-47c6-ac2e-ed22f862c997][STABLE CASCADE]] - [[https://twitter.com/Birchlabs/status/1640033271512702977][turning off]] [[https://github.com/Birch-san/diffusers-play/commit/77fa7f965edf7ab7280a47d2f8fc0362d4b135a9][CFG when]] denoising sigmas below 1.1 - Tomesd: [[https://github.com/dbolya/tomesd][Token Merging]] for [[https://arxiv.org/abs/2303.17604][Stable Diffusion]] [[https://git.mmaker.moe/mmaker/sd-webui-tome][code]] - [[https://lemmy.dbzer0.com/post/14962261][ToDo]]: Token Downsampling for Efficient Generation of High-Resolution Images - token downsampling of key and value tokens to accelerate inference 2x-4x - [[https://twitter.com/bahjat_kawar/status/1684827989408673793][Nested Diffusion]] Processes for Anytime Image Generation - can generate viable when stopped arbitrarily before completion - [[https://twitter.com/_akhaliq/status/1668076625924177921][BOOT]]: Data-free Distillation of Denoising Diffusion Models with Bootstrapping - use sd as teacher model and train faster one using it as bootstrap; 30 fps - Divide & Bind Your Attention for Improved Generative Semantic Nursing - [[https://twitter.com/YumengLi_007/status/1682404804583104512][novel objective]] [[https://sites.google.com/view/divide-and-bind][functions]]: can handle complex prompts with proper attribute binding - [[https://twitter.com/_akhaliq/status/1709059088636612739][Conditional]] Diffusion Distillation - added parameters, suplementing image conditions to the diffusion priors - super-resolution, image editing, and depth-to-image generation - [[SAMPLERS]] [[ADAPTIVE GUIDANCE]] - [[https://github.com/Oneflow-Inc/onediff/tree/main][OneDiff]]: [[https://lemmy.dbzer0.com/post/15883033][acceleration]] library for diffusion models, [[https://github.com/Oneflow-Inc/onediff/tree/main][ComfyUI Nodes]] - [[https://twitter.com/_akhaliq/status/1760859243018703040][T-Stitch]]: Accelerating Sampling in Pre-trained Diffusion Models with Trajectory Stitching - improve sampling efficiency with no generation degradation - smaller DPM in the initial steps, larger DPM at a later stage, 40% of the early timesteps - [[https://lemmy.dbzer0.com/post/18177662][The Missing]] U for Efficient Diffusion Models - operates with approximately a quarter of the parameters, diffusion models 80% faster *** ULTRA SPEED - [[https://twitter.com/StabilityAI/status/1729589510155948074][SDXL Turbo]]: A real-time text-to-image generation model (distillation) - [[https://github.com/aifartist/ArtSpew/][ArtSpew]]: SD at 149 images per second (high volume random image generation) - [[https://twitter.com/cumulo_autumn/status/1732309219041571163][StreamDiffusion]]: A Pipeline-level Solution for Real-time Interactive Generation (10ms) - transforms sequential denoising into the batching denoising - [[https://lemmy.dbzer0.com/post/13491532][MobileDiffusion]]: Subsecond Text-to-Image Generation on Mobile Devices - diffusion-GAN finetuning techniques to achieve 8-step and 1-step inference - 
[[https://arxiv.org/abs/2402.17376][Accelerating]] Diffusion Sampling with Optimized Time Steps - image performance compared to using uniform time steps *** CACHE - [[https://twitter.com/_akhaliq/status/1731888038626615703][DeepCache]]: Accelerating Diffusion Models for Free ==best== - exploits temporal redundancy observed in the sequential denoising steps - superiority over existing pruning and distillation - [[https://twitter.com/_akhaliq/status/1732587729970479354][Cache Me]] if You Can: Accelerating Diffusion Models through Block Caching - reuse outputs from layer blocks of previous steps, automatically determine caching schedules - [[https://twitter.com/_akhaliq/status/1736615005913591865][Faster Diffusion]]: Rethinking the Role of UNet Encoder in Diffusion Models ==best== - reuse cyclically the encoder features in the previous time-steps for the decoder - [[https://arxiv.org/abs/2401.01008][Fast]] Inference Through The Reuse Of Attention Maps In Diffusion Models - structured reuse of attention maps during sampling - [[https://github.com/HaozheLiu-ST/T-GATE][T-GATE]]: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models - two stages: semantics-planning phase, and subsequent fidelity-improving phase - so caching cross-attention output once converges and fixing it during the remaining inference **** EXPLOITING FEATURES - [[https://arxiv.org/abs/2312.03517][FRDiff]]: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models - Reusing feature maps with high temporal similarity - [[https://arxiv.org/abs/2312.08128][Clockwork Diffusion]]: Efficient Generation With Model-Step Distillation - high-res features sensitive to small perturbations; low-res feature only sets semantic layout - so reuses computation from preceding steps for low-res *** LCM :PROPERTIES: :ID: 7396b121-d509-461a-b5ed-8c75d4718519 :END: - LCMs: [[https://latent-consistency-models.github.io/][Latent Consistency]] Models: Synthesizing High-Resolution Images with Few-step Inference - inference with minimal steps (2-4) - training LCM model: only 32 A100 GPU hours - Latent Consistency Fine-tuning (LCF) custom datasets - [[https://github.com/0xbitches/ComfyUI-LCM][comfyui]] [[https://github.com/0xbitches/sd-webui-lcm][auto1111]] [[https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7][the model]] - [[https://twitter.com/SimianLuo/status/1722845777868075455][LCM-LoRA]]: A Universal Stable-Diffusion Acceleration Module - universally applicable accelerator for diffusion models, plug-in neural PF-ODE solver - [[https://twitter.com/_akhaliq/status/1735514410049794502][VideoLCM]]: Video Latent Consistency Model - smooth video synthesis with only four sampling steps - [[id:ccc8f98c-34eb-448b-b2d8-6ef662627fa4][ANIMATELCM]] - [[https://twitter.com/fffiloni/status/1756719446578585709][Quick]] Image Variations with LCM and Image Caption - [[https://github.com/jabir-zheng/TCD][TCD]]: [[https://twitter.com/_akhaliq/status/1763436246565572891][Trajectory]] Consistency Distillation ([[https://huggingface.co/h1t/TCD-SDXL-LoRA][lora]]) - accurately trace the entire trajectory of the Probability Flow ODE - https://github.com/dfl/comfyui-tcd-scheduler - [[id:28ff20ec-5501-47da-9ac1-8adc65303376][LCM-LOOKAHEAD]] **** CCM - [[https://twitter.com/_akhaliq/status/1734804912809148750][CCM]]: Adding Conditional Controls to Text-to-Image Consistency Models - ControlNet-like, lightweight adapter can be jointly optimized while consistency training **** PERFLOW :PROPERTIES: :ID: 44943c87-ca5b-4604-840e-ff52993c1bf1 :END: 
- [[https://github.com/magic-research/piecewise-rectified-flow][PeRFlow]] (Piecewise Rectified Flow) - fast generation, 4 steps, 4,000 training iterations - multiview normal maps and textures from text prompts instantly ** PROMPT CORRECTNESS - [[https://arxiv.org/abs/2211.15518][ReCo]]: region control, counting donuts - [[https://github.com/hnmr293/sd-webui-cutoff][sd-webui-cutoff]], hide tokens for each separated group, limits the token influence scope (color control) - hard-prompts-made-easy - [[https://huggingface.co/spaces/Gustavosta/MagicPrompt-Stable-Diffusion][magic prompt]]: amplifies-improves the prompt - [[https://github.com/sen-mao/SuppressEOT][Get What]] You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models - suppress unwanted content generation of the prompt, and encourages the generation of desired content - better than negative prompts - [[https://dpt-t2i.github.io/][Discriminative]] Probing and Tuning for Text-to-Image Generation - discriminative adapter to improve their text-image alignment - global matching and local grounding - [[https://twitter.com/_akhaliq/status/1776074505351282720][CoMat]]: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching - fine-tuning strategy with an image-to-text(captioning model) concept matching mechanism - [[https://youtu.be/_Pr7aFkkAvY?si=Xr5e_RL-rwcdL10q ][ELLA]] - [[https://github.com/DataCTE/ELLA_Training][A Powerful]] Adapter for Complex Stable Diffusion Prompts - using an adaptor for an llm instead of clip *** ATTENTION LAYOUT - Attend-and-Excite ([[https://attendandexcite.github.io/Attend-and-Excite/][excite]] the ignored prompt [[https://github.com/AttendAndExcite/Attend-and-Excite][tokens]]) (no retrain) - [[https://arxiv.org/abs/2304.03869][Harnessing]] the [[https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn][Spatial-Temporal]] Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis - [[https://arxiv.org/pdf/2302.13153.pdf][Directed Diffusion]]: [[https://github.com/hohonu-vicml/DirectedDiffusion][Direct Control]] of Object Placement through Attention Guidance (no retrain) [[https://github.com/giga-bytes-dev/stable-diffusion-webui-two-shot/tree/ashen-sensored_directed-diffusion][repo]] - [[https://twitter.com/_akhaliq/status/1696155079458406758][DenseDiffusion]]: Dense Text-to-Image Generation with Attention Modulation - training free, layout guidance *** LANGUAGE ENHANCEMENT - [[IMAGE RELATIONSHIPS]] - [[https://twitter.com/_akhaliq/status/1670190734543134720][Linguistic]] Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment - using prompt sentence structure during inference to improve the faithfulness - [[https://weixi-feng.github.io/structure-diffusion-guidance/][Training-Free Structured]] [[https://arxiv.org/abs/2212.05032][Diffusion]] Guidance for Compositional [[https://arxiv.org/pdf/2212.05032.pdf][Text-to-Image Synthesis]] - exploiting language sentences semantical hierarchies (lojban) - [[https://github.com/weixi-feng/Structured-Diffusion-Guidance][Structured Diffusion Guidance]], language enhanced clip enforces on unet - Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering - prompt learning, improve the matches between the input text and the generated **** PROMPT EXPANSION, PROMPT AUGMENTATION - [[https://huggingface.co/KBlueLeaf/DanTagGen?not-for-all-audiences=true][DanTagGen]]: LLaMA arch - 
[[https://github.com/sammcj/superprompter][superprompter]]: Supercharge your AI/LLM prompts - [[https://arxiv.org/pdf/2403.19716.pdf][Capability-aware]] Prompt Reformulation Learning for Text-to-Image Generation - effectively learn diverse reformulation strategies across various user capacities to simulate high-capability user reformulation **** TOKENCOMPOSE :PROPERTIES: :ID: bb79e50e-ed85-4f37-bd0c-6cad6acd0a6e :END: - [[https://mlpc-ucsd.github.io/TokenCompose/][TokenCompose]]: Grounding Diffusion with Token-level Supervision ==best== - finetuned with token-wise grounding objectives for multi-category instance composition - exploiting binary segmentation maps from SAM - compositions that are unlikely to appear simultaneously in a natural scene ** BIGGER COHERENCE :PROPERTIES: :ID: b211cec9-6cf2-4f6d-9e1e-10186f513da1 :END: - [[INTERPOLATION]] [[id:18c951a2-6883-4010-ad9d-9dee396b9839][VIDEO COHERENCE]] - [[https://arxiv.org/pdf/2404.03109.pdf][Many-to-many]] Image Generation with Auto-regressive Diffusion Models *** PANORAMAS - [[https://research.nvidia.com/labs/dir/diffcollage/][DiffCollage]]: Parallel Generation of Large Content with Diffusion Models (panoramas) - [[https://twitter.com/_akhaliq/status/1678943514917326848][Collaborative]] Score Distillation for Consistent Visual Synthesis - consistent visual synthesis across multiple samples ==best one== - distill generative priors over a set of images synchronously - zoom, video, panoramas - [[https://syncdiffusion.github.io/][SyncDiffusion]]: Coherent Montage via Synchronized Joint Diffusions - plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss - [[https://chengzhag.github.io/publication/panfusion/][Taming Stable]] Diffusion for Text to 360° Panorama Image Generation - minimize distortion during the collaborative denoising process **** OUTPAINTING - [[BETTER INPAINTING]] - [[https://arxiv.org/abs/2401.15652][Continuous-Multiple Image]] Outpainting in One-Step via Positional Query and A Diffusion-based Approach - generate content beyond boundaries using relative positional information - [[https://tencentarc.github.io/BrushNet/][BrushNet]]: [[https://github.com/TencentARC/BrushNet][A Plug-and-Play]] [[https://github.com/nullquant/ComfyUI-BrushNet][Image]] Inpainting Model with Decomposed Dual-Branch Diffusion - pre-trained SD model, useful in product exhibitions, virtual try-on, or background replacement *** RESOLUTION - [[https://twitter.com/_akhaliq/status/1697522827992150206][Any-Size-Diffusion]]: Toward Efficient Text-Driven Synthesis for Any-Size HD Images - training on images of unlimited sizes is unfeasible - Fast Seamless Tiled Diffusion (FSTD) - [[https://yingqinghe.github.io/scalecrafter/][ScaleCrafter]]: Tuning-free Higher-Resolution Visual Generation with Diffusion Models (video too) - generating images at much higher resolutions than the training image sizes - does not require any training or optimization - [[https://twitter.com/iScienceLuvr/status/1716789813750493468][Matryoshka]] [[https://twitter.com/_akhaliq/status/1716831652545208407][Diffusion]] Models - diffusion process that denoises inputs at multiple resolutions jointly - [[id:bbc5a347-bc62-4b5e-b659-1c6a57d6a2a5][FIT TRANSFORMER]] - [[https://lemmy.dbzer0.com/post/17799119][Upsample Guidance]]: Scale Up Diffusion Models without Training - technique that adapts pretrained model to generate higher-resolution images by adding a single term in the sampling process, without any additional training or 
relying on external models - can be applied to various models, such as pixel-space, latent space, and video diffusion models **** ARBITRARY - [[https://github.com/MoayedHajiAli/ElasticDiffusion-official][ElasticDiffusion]]: Training-free Arbitrary Size Image Generation - decoding method better than MultiDiffusion - [[https://lemmy.dbzer0.com/post/15814254][ResAdapter]]: Domain Consistent Resolution Adapter for Diffusion Models - unlike post-process, directly generates images with the dynamical resolution - compatible with ControlNet, IP-Adapter and LCM-LoRA; can be integrated with ElasticDiffusion * SAMPLERS - [[https://arxiv.org/pdf/2210.05475.pdf][GENIE]]: Higher-Order Denoising Diffusion Solvers - faster diffusion equation? - DDIM vs GENIE - 4 time less expensive upsampling - fastest solver https://arxiv.org/abs/2301.12935 - another accelerator: https://arxiv.org/abs/2301.11558 - unipc sampler (sampling in 5 steps) - [[https://blog.novelai.net/introducing-nai-smea-higher-image-generation-resolutions-9b0034ffdc4b][smea]]: (nai) global attention sampling - Karras no blurry improvement [[https://www.reddit.com/r/StableDiffusion/comments/11mulj6/quality_improvements_to_dpm_2m_karras_sampling/][reddit]] - [[https://twitter.com/_akhaliq/status/1716332535142117852][DPM-Solver-v3]]: Improved Diffusion ODE Solver with Empirical Model Statistics - several coefficients efficiently computed on the pretrained mode, faster - [[id:bc0dd47c-4f46-4cd0-9606-555990c06626][STABLESR]] novel approach - [[DIRECT CONSISTENCY OPTIMIZATION]]: controls intensity of style * IMAGE EDITING - [[id:b4052ea2-df86-4c37-91b0-e2c2448ab08c][3D-AWARE IMAGE EDITING]] - [[https://arxiv.org/pdf/2211.09794.pdf][null-text]] [[https://github.com/cccntu/efficient-prompt-to-prompt][inversion]]: prompttoprompt but better - [[https://github.com/ShivamShrirao/diffusers/tree/main/examples/imagic][imagic]]: editing photo with prompt ** IMAGE SCULPTING ==best== :PROPERTIES: :ID: 303a8796-8fc8-4c2f-92f6-62516c8a6ea1 :END: - [[https://github.com/vision-x-nyu/image-sculpting][Image]] Sculpting: Precise Object Editing with 3D Geometry Control - enables direct interaction with their 3D geometry - pose editing, translation, rotation, carving, serial addition, space deformation - turned into nerf using Zero-1-to-3, then returned to image including features ** STYLE - [[https://huggingface.co/papers/2306.00983][StyleDrop]]: [[https://styledrop.github.io/][Text-to-Image]] [[https://github.com/zideliu/StyleDrop-PyTorch][Generation]] in Any Style (muse architecture) - 1% of parameters (painting style) - [[https://twitter.com/_akhaliq/status/1685898061221076992][PromptStyler]]: Prompt-driven Style Generation for Source-free Domain Generalization - learnable style word vectors, style-content features to be located nearby - [[https://arxiv.org/abs/2304.03119][Zero-shot]] [[https://arxiv.org/pdf/2304.03119.pdf][Generative]] [[https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation][Model]] Adaptation via Image-specific Prompt Learning - adapt style to concept - [[https://twitter.com/_akhaliq/status/1699267731332182491][StyleAdapter]]: A Single-Pass LoRA-Free Model for Stylized Image Generation - process the prompt and style features separately - [[https://twitter.com/_akhaliq/status/1702193640687235295][DreamStyler]]: Paint by Style Inversion with Text-to-Image Diffusion Models - textual embedding with style guidance - [[https://garibida.github.io/cross-image-attention/][Cross-Image]] Attention for Zero-Shot Appearance Transfer 
- zero-shot appearance transfer by building on the self-attention layers of image diffusion models - architectural transfer - [[id:4c93f57d-43b7-4fbe-9415-e007a06efd46][STYLECRAFTER]] transfer to video - [[https://github.com/google/style-aligned/][Style Aligned]] Image Generation via Shared Attention ==best== ([[https://github.com/Mikubill/sd-webui-controlnet/commit/47dfefa54fb128035cc6e84c2fca0b4bc28be62f][as controlnet extension]]) - color palette too - [[https://freestylefreelunch.github.io/][FreeStyle]]: Free Lunch for Text-guided Style Transfer using Diffusion Models - style transfer built upon sd, dual-stream encoder and single-stream decoder architecture - content into pixelart, origami, anime - [[https://cszy98.github.io/PLACE/][PLACE]]: [[https://lemmy.dbzer0.com/post/15768553][Adaptive]] Layout-Semantic Fusion for Semantic Image Synthesis - image from segmentation map and also using semantic features - [[https://curryjung.github.io/VisualStylePrompt/][Visual Style]] Prompting with Swapping Self-Attention - consistent style across generations - unlike others (ip-adapter) disentangle other semantics away (like pose) - [[https://tianhao-qi.github.io/DEADiff/][DEADiff]]: An Efficient Stylization Diffusion Model with Disentangled Representations ==best== - decouple the style and semantics of reference images - optimal balance between the text controllability and style similarity - [[https://twitter.com/_akhaliq/status/1775718553448051022][InstantStyle]]: Free Lunch towards Style-Preserving in Text-to-Image Generation - decouples style and content from reference images within the feature space - [[https://mshu1.github.io/dreamwalk.github.io/][DreamWalk]]: Style Space Exploration using Diffusion Guidance - decompose the text prompt into conceptual elements, apply a separate guidance for each element - [[id:28ff20ec-5501-47da-9ac1-8adc65303376][LCM-LOOKAHEAD]] *** B-LoRA - [[https://arxiv.org/abs/2403.14572][Implicit Style-Content]] [[https://twitter.com/yarden343/status/1772894805313405151#m][Separation]] using B-LoRA - preserving its underlying objects, structures, and concepts - LoRA of two specific blocks - image style transfer, text-based stylization, consistent style generation, and style-content mixing *** STYLE TOOLS - [[https://github.com/learn2phoenix/CSD][Measuring]] Style Similarity in Diffusion Models - compute similarity score *** DIRECT CONSISTENCY OPTIMIZATION - DCO: [[https://lemmy.dbzer0.com/post/14778281][Direct Consistency]] Optimization for Compositional Text-to-Image Personalization - minimally fine-tuning pretrained to achieve consistency - new sampling method that controls the tradeoff between image fidelity and prompt fidelity ** REGIONS - different inpainting ways with diffusers: https://github.com/huggingface/diffusers/pull/1585 - [[https://zengyu.me/scenec/][SceneComposer]]: paint with words but cooler - bounding boxes instead: [[https://github.com/gligen/GLIGEN][GLIGEN]]: image grounding - better VAE and better masks: https://lipurple.github.io/Grounded_Diffusion/ - [[https://arxiv.org/abs/2403.05018][InstructGIE]]: Towards Generalizable Image Editing - leveraging the VMamba Block, aligns language embeddings with editing semantics - editing instructions dataset *** REGIONS MERGE - [[id:6a66690f-b76f-441a-a093-3c83ca73af2d][MULTIPLE DIFFUSION]] [[id:b211cec9-6cf2-4f6d-9e1e-10186f513da1][BIGGER COHERENCE]] [[HARMONIZATION]] [[id:3f126569-6deb-45e1-9535-77883fc7ad8b][MULTIPLE LORA]] - [[https://arxiv.org/pdf/2303.13126.pdf][MagicFusion]]: 
[[https://magicfusion.github.io/][Boosting]] Text-to-Image Generation Performance by Fusing Diffusion Models - blending the predicted noises of two diffusion models in a saliency-aware manner (composite) - [[https://twitter.com/_akhaliq/status/1681865088838270978][Text2Layer]]: [[https://huggingface.co/papers/2307.09781][Layered]] Image Generation using Latent Diffusion Model - train an autoencoder to reconstruct layered images and train models on the latent representation - generate background, foreground, layer mask, and the composed image simultaneously - [[https://lemmy.dbzer0.com/post/17448456][Isolated Diffusion]]: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance - bind each attachment to corresponding subjects separately with split text prompts - object segmentation to obtain the layouts of subjects, then isolate and resynthesize individually - [[https://lemmy.dbzer0.com/post/17243698][Be Yourself]]: Bounded Attention for Multi-Subject Text-to-Image Generation - bounded attention, training-free method; bounding information flow in the sampling process - prevents leakage, promotes each subject’s individuality, even with complex multi-subject conditioning **** INTERPOLATION - [[https://github.com/lunarring/latentblending][Latent]] Blending (interpolate latents) - latent couple, multidiffusion, [[https://note.com/gcem156/n/nb3d516e376d7][attention couple]] - comfy ui like but [[https://github.com/omerbt/MultiDiffusion][masks]] - [[https://twitter.com/_akhaliq/status/1683753746315239425][Interpolating]] between Images with Diffusion Models - convincing interpolations across diverse subject poses, image styles, and image content - [[https://twitter.com/_akhaliq/status/1732973286206636454][Smooth Diffusion]]: [[https://github.com/SHI-Labs/Smooth-Diffusion][Crafting]] [[https://arxiv.org/abs/2312.04410][Smooth]] [[https://github.com/SHI-Labs/Smooth-Diffusion][Latent]] Spaces in Diffusion Models ==best== - steady change in the output image, plug-and-play Smooth-LoRA; best interpolation - perhaps for video or drag diffusion - [[https://kongzhecn.github.io/omg-project/][OMG]]: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models - integrate multiple concepts within a single image - combined with LoRA and InstantID ***** DIFFMORPHER :PROPERTIES: :ID: 60a63fe6-8088-4b2b-af55-f1d5e23e804b :END: - [[https://twitter.com/_akhaliq/status/1734778250574840146][DiffMorpher]]: [[https://twitter.com/sze68zkw/status/1738407559009366025][Unleashing]] the Capability of Diffusion Models for Image Morphing ==best== - alternative to gan; interpolate between their loras (not just their latents) *** MINIMAL CHANGES - [[id:db81202f-abf0-410e-98c2-c202fa2ca350][SEMANTICALLY DEFORMED]] - [[https://delta-denoising-score.github.io/][Delta]] [[https://arxiv.org/abs/2304.07090][Denoising]] Score: minimal modifications, keeping the image **** HARMONIZATION - [[REGIONS MERGE]] - SEELE: [[https://yikai-wang.github.io/seele/][Repositioning]] The Subject Within Image - minimal changes like moving people, subject removal, subject completion and harmonization - [[https://arxiv.org/pdf/2303.00262.pdf][Collage]] [[https://twitter.com/VSarukkai/status/1701293909647958490][Diffusion]] (harmonize collaged images) - [[https://twitter.com/_akhaliq/status/1770645980767011268][Magic Fixup]]: [[https://twitter.com/HadiZayer/status/1773457936309682661][Streamlining]] Photo Editing by Watching Dynamic Videos - given a coarsely edited image (cut and move blob), 
*** MINIMAL CHANGES
- [[id:db81202f-abf0-410e-98c2-c202fa2ca350][SEMANTICALLY DEFORMED]]
- [[https://delta-denoising-score.github.io/][Delta]] [[https://arxiv.org/abs/2304.07090][Denoising]] Score: minimal modifications, keeping the image
**** HARMONIZATION
- [[REGIONS MERGE]]
- SEELE: [[https://yikai-wang.github.io/seele/][Repositioning]] The Subject Within Image
- minimal changes like moving people, subject removal, subject completion and harmonization
- [[https://arxiv.org/pdf/2303.00262.pdf][Collage]] [[https://twitter.com/VSarukkai/status/1701293909647958490][Diffusion]] (harmonize collaged images)
- [[https://twitter.com/_akhaliq/status/1770645980767011268][Magic Fixup]]: [[https://twitter.com/HadiZayer/status/1773457936309682661][Streamlining]] Photo Editing by Watching Dynamic Videos
- given a coarsely edited image (cut and move a blob), synthesizes a photorealistic output
***** SWAPANYTHING
- [[https://twitter.com/_akhaliq/status/1777551248775901647][SwapAnything]]: Enabling Arbitrary Object Swapping in Personalized Visual Editing
- keeping the context unchanged (e.g. textures, clothes)
**** REGION EXCHANGE
- [[id:992f12e2-c595-4aca-8129-6dace7d2f3ba][VIDEO EXCHANGE]] [[SWAPANYTHING]]
- [[https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model][RDM-Region-Aware-Diffusion-Model]] edits only the region of interest
- [[https://github.com/cloneofsimo/magicmix][magicmix]] merge their noise shapes
- [[https://omriavrahami.com/blended-latent-diffusion-page/][Blended]] Latent Diffusion
- input image and a mask, modifies the masked area according to a guiding text prompt
***** SUBJECT SWAPPING
- [[https://huggingface.co/papers/2305.18286][Photoswap]]: Personalized Subject Swapping in Images
- [[https://arxiv.org/abs/2402.18351][LatentSwap]]: An Efficient Latent Code Mapping Framework for Face Swapping
***** BETTER INPAINTING
- [[OUTPAINTING]]
- [[https://powerpaint.github.io/][A Task]] [[https://github.com/open-mmlab/mmagic/tree/main/projects/powerpaint][is Worth]] One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
- inpainting model: context-aware image and shape-guided object inpainting, object removal, controlnet
- [[https://huggingface.co/spaces/modelscope/ReplaceAnything][ReplaceAnything]] [[https://github.com/AIGCDesignGroup/ReplaceAnything][as you want]]: Ultra-high quality content replacement
- masked region is strictly retained
- [[https://arxiv.org/pdf/2404.03642.pdf][DiffBody]]: Human Body Restoration by Imagining with Generative Diffusion Prior
- good proportions, (clothes) texture quality, no limb distortions
- [[https://github.com/htyjers/StrDiffusion][StrDiffusion]]: Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
- semantically sparse structure in early stage, dense texture in late stage
****** MAPPED INPAINTING
- [[https://dangeng.github.io/motion_guidance/][Motion Guidance]]: Diffusion-Based Image Editing with Differentiable Motion Estimators
******* DIFFERENTIAL DIFFUSION
- [[https://lemmy.dbzer0.com/post/13246157][Differential]] [[https://github.com/exx8/differential-diffusion][Diffusion]]: [[https://differential-diffusion.github.io/][Giving]] Each Pixel Its Strength ==best==
- control of the extent to which individual objects are modified, or the ability to introduce gradual spatial changes
- using change maps: a grayscale map of how much each region may change
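A minimal sketch of the change-map idea: at each denoising step the grayscale map is thresholded by the current progress, and pixels that are not yet "unlocked" are reset to the re-noised original. This is the gist only, not the paper's exact schedule; function names and the progress convention are illustrative.
#+begin_src python
# Hedged sketch: per-pixel edit strength via a grayscale change map.
import torch

def differential_mask(change_map: torch.Tensor, progress: torch.Tensor) -> torch.Tensor:
    """change_map in [0, 1]: 1 = may change a lot, 0 = keep original.
    progress goes 1 -> 0 over the denoising run; early on only the most
    changeable pixels are unlocked, later everything may change."""
    return (change_map >= progress).float()

def blend_step(denoised_latents, renoised_original, change_map, progress):
    """One step of a differential-diffusion-style edit: keep the freshly
    denoised value where allowed, otherwise fall back to the re-noised
    original latents at the same noise level."""
    m = differential_mask(change_map, progress)
    return m * denoised_latents + (1.0 - m) * renoised_original
#+end_src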
****** CLOTHES OUTFITS
- [[https://twitter.com/_akhaliq/status/1750737690553692570][Diffuse to]] Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
- virtually place any e-commerce item in any setting
***** PIX2PIX REGION
- pix2pix-zero (prompt2prompt without prompt)
- [[https://github.com/pix2pixzero/pix2pix-zero][no fine]] tuning, using BLIP captions <>; [[https://huggingface.co/docs/diffusers/api/pipelines/pix2pix_zero][docs]]
- plug-and-[[https://github.com/MichalGeyer/plug-and-play][play]]: like pix2pix but features extracted
***** FORCE IT WHERE IT FITS
- [[https://arxiv.org/abs/2303.16765][MDP]]: [[https://github.com/QianWangX/MDP-Diffusion][A Generalized]] Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
- no training or finetuning; instead forces the prompt (exchanges the noise)
- [[https://twitter.com/_akhaliq/status/1644557225103335425][PAIR-Diffusion]]: [[https://twitter.com/ViditGoel7/status/1713031352709435736][Object-Level]] Image Editing with Structure-and-Appearance
- forces input image into edited image, object-level
***** PROMPT IS TARGET
- [[https://arxiv.org/abs/2211.07825][Direct Inversion]]: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
- only changes where the prompt fits
- [[https://twitter.com/aysegl_dndr/status/1691011667394527232][Inst-Inpaint]]: Instructing to Remove Objects with Diffusion Models
- erasing unwanted pixels; estimates which object to be removed
- [[https://arxiv.org/pdf/2303.09618.pdf][HIVE]]: Harnessing Human Feedback for Instructional Visual Editing (reward model)
- rlhf, editing instruction, to get output to adhere to the correct instructions
- [[https://twitter.com/_akhaliq/status/1735516803625893936][LIME]]: Localized Image Editing via Attention Regularization in Diffusion Models
- does not require specified regions or additional text input
- clustering technique = segmentation maps; without re-training and fine-tuning
****** DDIM
- [[https://github.com/MirrorDiffusion/MirrorDiffusion][MirrorDiffusion]]: [[https://mirrordiffusion.github.io/][Stabilizing]] Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond ==best==
- prompt redescription strategy, revised DDIM inversion
- [[https://arxiv.org/abs/2403.09468][Eta Inversion]]: [[https://github.com/furiosa-ai/eta-inversion][Designing]] an Optimal Eta Function for Diffusion-based Real Image Editing
- better DDIM
- [[https://twitter.com/_akhaliq/status/1771039688280723724][ReNoise]]: Real Image Inversion Through Iterative Noising
- building on reversing the diffusion sampling process to manipulate an image
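A minimal sketch of plain DDIM inversion, the baseline that MirrorDiffusion, Eta Inversion and ReNoise refine: run the deterministic DDIM update backwards, re-using the noise prediction at the current (less noisy) step. =unet= and =alphas_cumprod= stand for any epsilon-prediction UNet and its scheduler's cumulative alphas; the names are illustrative, not a specific library API.
#+begin_src python
# Hedged sketch: naive DDIM inversion of a clean latent back to noise.
import torch

@torch.no_grad()
def ddim_invert(latents, text_emb, unet, alphas_cumprod, timesteps):
    """Step through ascending noise levels, reversing the DDIM update."""
    x = latents
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):  # low -> high noise
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = unet(x, t_prev, encoder_hidden_states=text_emb).sample
        x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()   # predicted clean latent
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps           # re-noise one level up
    return x  # approximately the noise that regenerates `latents` under DDIM
#+end_src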
**** SEMANTIC CHANGE - DETECTION
- [[https://github.com/ml-research/semantic-image-editing][sega]] semantic guidance (apply a concept arithmetic after having a generation)
- [[https://twitter.com/SFResearch/status/1612886999152857088][EDICT]]: [[https://github.com/salesforce/EDICT][repo]] Exact Diffusion Inversion via Coupled Transformations
- edits/changes object types (dog breeds)
- adds noise, complex transformations but still getting perfect invertibility
- [[https://twitter.com/_akhaliq/status/1664485230151884800][The Hidden]] [[https://huggingface.co/papers/2306.00966][Language]] of Diffusion Models
- learning interpretable pseudotokens from interpolating unet concepts
- useful for: single-image decomposition to tokens, bias detection, and semantic image manipulation
***** SWAP PROMPT
- [[USING ATTENTION MAP]] [[TI-GUIDED-EDIT]]
- [[https://twitter.com/_akhaliq/status/1676071757994680321][LEDITS]]: Real Image Editing with DDPM Inversion and Semantic Guidance
- prompt changing, minimal variations <>
- [[https://twitter.com/kerstingAIML/status/1729778594790907914][LEDITS++]], [[https://twitter.com/MBrack_AIML/status/1729919347542356187][an efficient]], versatile & precise textual image manipulator ==best==
- no tuning, no optimization, few diffusion steps, multiple simultaneous edits
- architecture-agnostic, masking for local changes; building on SEGA
- [[https://arxiv.org/abs/2303.15649][StyleDiffusion]]: [[https://github.com/sen-mao/StyleDiffusion][Prompt-Embedding]] Inversion for Text-Based Editing
- preserve the object-like attention maps after editing
**** INSTRUCTIONS
- other: [[PIX2PIX REGION]] [[id:ddd3588a-dc3c-426d-a94e-9aa373fabff9][GUIDING FUNCTION]] [[TIP: text restoration]]
- [[https://twitter.com/_akhaliq/status/1670677370276028416][MagicBrush]]: A Manually Annotated Dataset for Instruction-Guided Image Editing
- [[https://github.com/timothybrooks/instruct-pix2pix][InstructPix2Pix]] [[https://arxiv.org/abs/2211.09800][paper]] (diffusers usage sketch at the end of this subtree)
- [[https://github.com/ethansmith2000/MegaEdit][MegaEdit]]: like instructPix2Pix but for any model
- based on EDICT and plug-and-play but using DDIM
***** IMAGE INSTRUCTIONS
- [[https://twitter.com/_akhaliq/status/1743108118630818039][Instruct-Imagen]]: Image Generation with Multi-modal Instruction
- example images as style, boundary, edges, sketch
- [[https://twitter.com/_akhaliq/status/1686919394415329281][ImageBrush]]: [[https://arxiv.org/abs/2403.18660][Learning Visual]] In-Context Instructions for Exemplar-Based Image Manipulation
- a pair of images as visual instructions
- instruction learning as inpainting problem, useful for pose transfer, image translation and video inpainting
***** IMAGE TRANSLATION
- [[SEVERAL CONTROLS IN ONE]] [[CCM]] [[id:9307c803-21ff-47bf-bdc1-15ea79d2444f][MESH TO MESH]] [[id:98791065-c5dc-4f12-8c0c-fffad5715a2e][SDXS]]
- [[id:d3c6d9ef-9dff-4c60-8f92-5a523c24c139][DRAG DIFFUSION]] dragging two points on the image
- [[https://twitter.com/_akhaliq/status/1691345566243201024][Jurassic World]] Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation
- zero-shot <> (I2I) across large domain gaps, like skeleton to dinosaur
- prompting provides target domain
- [[https://github.com/ader47/jittor-jieke-semantic_images_synthesis][IIDM]]: Image-to-Image Diffusion Model for Semantic Image Synthesis
- [[https://twitter.com/_akhaliq/status/1770089964744618320][One-Step]] Image Translation with Text-to-Image Models
- adapting a single-step diffusion model; preserve the input image structure
****** INTO MANGA
:PROPERTIES:
:ID: 56a81747-2a44-410e-9ca0-26f366829f3e
:END:
- [[https://arxiv.org/abs/2403.08266][Sketch2Manga]]: Shaded Manga Screening from Sketch with Diffusion Models
- normal generation into manga style but while fixing the light anomalies (actually looks manga)
- fixes the tones
****** ARTIST EDITING
:PROPERTIES:
:ID: 20a546c6-135e-45f3-88a6-d3e5869bd28f
:END:
- [[https://lemmy.dbzer0.com/post/12260609][Re:Draw]] -- Context Aware Translation as a Controllable Method for Artistic Production
- context-aware (style and emotion) inpainting; e.g. the color of the eyes
- [[https://arxiv.org/pdf/2402.02733.pdf][ToonAging]]: Face Re-Aging upon Artistic Portrait Style Transfer (including anime)
- and portrait style transfer, single generation step
****** SLIME
:PROPERTIES:
:ID: 7cd466fd-1feb-47ce-bf9a-033ba4838579
:END:
- [[https://twitter.com/_akhaliq/status/1699607375785705717][SLiMe]]: Segment Like Me
- extract attention maps, learn about the segmented region, then inference
***** EXPLICIT REGION
- [[https://huggingface.co/spaces/xdecoder/Instruct-X-Decoder][X-Decoder]]: instructPix2Pix [[https://github.com/microsoft/X-Decoder][per]] region (objects)
- comparable to [[vpd]] <>
- [[https://arxiv.org/pdf/2303.17546.pdf][PAIR-Diffusion]]: [[https://github.com/Picsart-AI-Research/PAIR-Diffusion][Object-Level]] Image Editing with Structure-and-Appearance Paired Diffusion Models (region editing)
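The instruction-editing sketch referenced above: a plain InstructPix2Pix call through diffusers. The checkpoint id is the public =timbrooks/instruct-pix2pix= release; the input file name is a placeholder.
#+begin_src python
# Hedged sketch: instruction-guided editing with InstructPix2Pix in diffusers.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("photo.png")  # placeholder input image
edited = pipe(
    "make it look like a watercolor painting",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # higher = stay closer to the input image
    guidance_scale=7.0,
).images[0]
edited.save("edited.png")
#+end_src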
** SPECIFIC CONCEPTS
- [[layout aware]]
- [[https://twitter.com/_akhaliq/status/1688747476382019584][ConceptLab]]: [[https://github.com/kfirgoldberg/ConceptLab][Creative]] Generation using Diffusion Prior Constraints
- generate a new, imaginary concept; adaptive constraints-optimization process
- [[https://github.com/dvirsamuel/SeedSelect][SeedSelect]]: rare concept images, generation of uncommon and ill-formed concepts
- selecting suitable generation seeds from few samples
- [[https://arxiv.org/abs/2403.10133][E4C]]: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance ==best==
- preserving the semantic structure
*** CONTEXT LEARNING
- [[https://twitter.com/_akhaliq/status/1673544034193924103][DomainStudio]]: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data
- keep the relative distances between adapted samples to achieve generation diversity
- [[https://twitter.com/WenhuChen/status/1643079958388940803][SuTi]]: [[https://open-vision-language.github.io/suti/][Subject-driven]] Text-to-Image Generation via Apprenticeship Learning (using examples)
- replaces subject-specific fine-tuning with in-context learning, <>
**** SEMANTIC CORRESPONDENCE
- [[https://arxiv.org/pdf/2305.15581.pdf][Unsupervised Semantic]] Correspondence Using Stable Diffusion ==no training== ==from other image==
- find locations in multiple images that have the same semantic meaning
- optimize prompt embeddings for maximum attention on the regions of interest
- capture semantic information about location, which can then be transferred to another image
**** IMAGE RELATIONSHIPS
- [[https://twitter.com/_akhaliq/status/1668450247385796609][Controlling]] [[https://github.com/Zeju1997/oft][Text-to-Image]] Diffusion by Orthogonal Finetuning
- preserves the hyperspherical energy of the pairwise neuron relationship
- preserves semantic coherence (relationships)
- [[id:bb79e50e-ed85-4f37-bd0c-6cad6acd0a6e][TOKENCOMPOSE]]
***** VERBS
- [[https://ziqihuangg.github.io/projects/reversion.html][ReVersion]]: [[https://github.com/ziqihuangg/ReVersion][Diffusion-Based]] [[https://github.com/ziqihuangg/ReVersion][Relation]] Inversion from Images
- like putting images on materials
- unlike inverting object appearance, inverting object relations
- ADI: [[https://lemmy.dbzer0.com/post/15105096][Learning]] Disentangled Identifiers for Action-Customized Text-to-Image Generation
- learn action-specific identifiers from the exemplar images ignoring appearances
- [[https://arxiv.org/abs/2402.11487][Visual Concept-driven]] Image Generation with Text-to-Image Diffusion Model
- concepts that can interact with other concepts, using masks to teach
*** EXTRA PRETRAINED
- [[id:ddd3588a-dc3c-426d-a94e-9aa373fabff9][GUIDING FUNCTION]] [[IDENTITY ZERO-SHOT]]
- [[https://github.com/mkshing/e4t-diffusion][E4T-diffusion]]: [[https://tuning-encoder.github.io/][Tuning]] [[https://arxiv.org/abs/2302.12228][encoder]]: the text embedding + offset weights <> (needs a >40GB GPU) (faces)
- [[https://dxli94.github.io/BLIP-Diffusion-website/][BLIP-Diffusion]]: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
- learned in 40 steps vs Textual Inversion 3000
- Subject-driven Style Transfer, Subject Interpolation
- concept replacement
- [[https://arxiv.org/pdf/2305.15779.pdf][Custom-Edit]]: Text-Guided Image Editing with Customized Diffusion Models <>
**** UNDERSTANDING NETWORK
- [[https://arxiv.org/pdf/2302.13848.pdf][Elite]]: [[https://github.com/csyxwei/ELITE][Encoding]] Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- extra neural network to get the text embedding, fastest text embeddings
- <>
- [[https://arxiv.org/abs/2306.00971][ViCo]]: [[https://github.com/haoosz/ViCo][Detail-Preserving]] Visual Condition for Personalized Text-to-Image Generation
- extra on top, does not finetune the original diffusion model, awesome quality, <>
- unlike elite: automatic mechanism to generate object mask: cross-attentions
- [[PHOTOMAKER]] faces
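A generic sketch of the "understanding network" idea above (ELITE / ViCo style): a small learned projector maps a frozen image encoder's embedding into one or more pseudo text tokens appended to the prompt embeddings. Dimensions, module names, and the single-linear design are illustrative assumptions, not any paper's actual code.
#+begin_src python
# Hedged sketch: project an image embedding into pseudo text tokens.
import torch
import torch.nn as nn

class ConceptProjector(nn.Module):
    def __init__(self, image_dim=1024, token_dim=768, num_tokens=1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(image_dim, token_dim * num_tokens),
            nn.LayerNorm(token_dim * num_tokens),
        )
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, image_emb, prompt_emb):
        # image_emb:  (B, image_dim) from a frozen CLIP image encoder
        # prompt_emb: (B, L, token_dim) from the frozen text encoder
        tokens = self.proj(image_emb).view(-1, self.num_tokens, self.token_dim)
        # the concatenated sequence is what gets fed to the UNet cross-attention
        return torch.cat([prompt_emb, tokens], dim=1)
#+end_src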
*** SEVERAL CONCEPTS
- [[id:6a66690f-b76f-441a-a093-3c83ca73af2d][MULTIPLE DIFFUSION]]
- [[https://rich-text-to-image.github.io/][Expressive Text-to-Image]] [[https://github.com/SongweiGe/rich-text-to-image][Generation with]] Rich Text (learn concept-map from maxed averages)
- [[https://arxiv.org/abs/2304.06027][Continual]] [[https://jamessealesmith.github.io/continual-diffusion/][Diffusion]]: Continual Customization of Text-to-Image Diffusion with C-LoRA
- sequentially learned concepts
- [[https://huggingface.co/papers/2305.16311][Break-A-Scene]]: [[https://twitter.com/Gradio/status/1696585736454349106][Extracting]] Multiple Concepts from a Single Image
- [[https://twitter.com/_akhaliq/status/1653620239735595010][Key-Locked]] Rank One Editing for Text-to-Image Personalization
- combine individually learned concepts into a single generated image
- [[https://huggingface.co/papers/2305.18292][Mix-of-Show]]: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
- solving concept conflicts
*** CONES
- [[https://arxiv.org/abs/2303.05125][Cones]]: [[https://github.com/Johanan528/Cones][Concept Neurons]] [[https://github.com/damo-vilab/Cones][in Diffusion]] [[https://github.com/ali-vilab/Cones-V2][Models]] for Customized Generation (better than Custom Diffusion)
- index only the locations in the layers that give rise to a subject, add them together to include multiple subjects in a new context
- [[https://twitter.com/__Johanan/status/1664495182379884549][Cones]] 2: [[https://arxiv.org/pdf/2305.19327.pdf][Customizable]] Image Synthesis with Multiple Subjects
- flexible composition of various subjects without any model tuning
- learning an extra embedding on top of a regular text embedding, and using layout to compose
*** SVDIFF
- SVDiff: [[https://arxiv.org/pdf/2303.11305.pdf][Compact Parameter]] [[https://arxiv.org/abs/2303.11305][Space]] for Diffusion Fine-Tuning, [[https://twitter.com/mk1stats/status/1643992102853038080][code]] ([[https://twitter.com/mk1stats/status/1644830152118120448][soon]])
- multisubject learning, like D3S
- personalized concepts, combinable; training a gan out of its conv
- Singular Value Decomposition (SVD) = gene coefficient vs expression level
- CoSINE: Compact parameter space for SINgle image Editing (remove from prompt after finetuning it)
- [[https://arxiv.org/abs/2304.06648][DiffFit]]: [[https://github.com/mkshing/DiffFit-pytorch][Unlocking]] Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
- it's PEFT for diffusion
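A minimal sketch of the SVDiff idea: decompose a pretrained weight with SVD and fine-tune only a small "spectral shift" added to the singular values, keeping U and V frozen. Illustrative only, not the authors' implementation; conv kernels would need reshaping to 2-D first.
#+begin_src python
# Hedged sketch: trainable spectral shift over frozen singular vectors.
import torch
import torch.nn as nn

class SpectralShift(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)  # weight: 2-D
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.delta = nn.Parameter(torch.zeros_like(S))  # the only trainable part

    def forward(self) -> torch.Tensor:
        # ReLU keeps the shifted singular values non-negative
        return self.U @ torch.diag(torch.relu(self.S + self.delta)) @ self.Vh
#+end_src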
*** LIKE ORIGINAL ONES
- 2 passes to make bigger: Standard High-Res fix or Deep Shrink High-Res Fix ([[https://twitter.com/ai_characters/status/1726369195296960994][kohya]])
- [[https://twitter.com/_akhaliq/status/1714490671233454134][VeRA]]: Vector-based Random Matrix Adaptation
- single pair of low-rank matrices shared across all layers and learning small scaling vectors instead
- 10x fewer parameters
- [[https://twitter.com/_akhaliq/status/1715240693403185496][An Image]] is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
- Multi-Concept Prompt Learning (MCPL)
- disentangled concepts with enhanced word-concept correlation
- [[https://twitter.com/_akhaliq/status/1732236367982162080][X-Adapter]]: [[https://showlab.github.io/X-Adapter/][Adding Universal]] [[https://github.com/showlab/X-Adapter][Compatibility]] of Plugins for Upgraded Diffusion Model
- feature remapping from SD 1.5 to SDXL for all loras and controlnets
- so you can train at lower resources and map to higher
- [[COGCARTOON]]
- [[P+]]: learning text embeddings for each layer of the unet
- [[https://lemmy.dbzer0.com/post/12196023?scrollToComments=true][PALP]]: Prompt Aligned Personalization of Text-to-Image Models
- input: image and prompt
- display ALL the tokens, not just some
- [[https://eclipse-t2i.github.io/Lambda-ECLIPSE/][λ-ECLIPSE]]: [[https://huggingface.co/spaces/ECLIPSE-Community/lambda-eclipse-personalized-t2i][Multi-Concept]] Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
- [[https://twitter.com/_akhaliq/status/1758354431588938213][DreamMatcher]]: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image
- (Personalization for Kandinsky) trained using projection loss and clip contrastive loss
- plug-in method that does semantic matching instead of replacement-disruption
- [[https://twitter.com/wuyang_ly/status/1769213318877598110][UniHDA]]: A Unified and Versatile framework for generative Hybrid Domain Adaptation
- blends all characteristics at once, maintains robust cross-domain consistency
**** TARGETING CONTEXTUAL CONSISTENCY
- [[https://lemmy.dbzer0.com/post/13450998][Pick-and-Draw]]: Training-free Semantic Guidance for Text-to-Image Personalization
- approach to boost identity consistency and generative diversity for personalization methods
- [[https://lemmy.dbzer0.com/post/13386940][Object-Driven]] One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
- class-characterizing regularization to preserve prior knowledge of object classes, so it integrates seamlessly with existing concepts
**** LORA
:PROPERTIES:
:ID: e261c214-31a2-4d93-a62b-61d7d53b702c
:END:
- lora, lycoris, loha, lokr
- loha handles multiple concepts better
- https://www.canva.com/design/DAFeAteHW18/view#5
- use regularization images with lora https://rentry.org/59xed3#regularization-images
- [[https://twitter.com/_akhaliq/status/1668828166499041281][GLORA]]: One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
- individual adapter for each layer
- superior accuracy, fewer parameters-computations
- [[https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference][PEFT]] x Diffusers Integration
- [[https://twitter.com/_akhaliq/status/1725357173155271120][Tied-Lora]]: Enhancing parameter efficiency of LoRA with weight tying
- 13% of the parameters of LoRA, parameter efficiency
- [[https://twitter.com/_akhaliq/status/1727177584759161126][Concept]] [[https://twitter.com/davidbau/status/1730788830876229776][Sliders]]: LoRA Adaptors for Precise Control in Diffusion Models, plug and play ==best==
- concept sliders that enable precise control over attributes
- intuitive editing of visual concepts for which a textual description is difficult
- repair of object deformations and fixing distorted hands
- [[https://twitter.com/_akhaliq/status/1727574713751249102][ZipLoRA]]: [[https://twitter.com/_akhaliq/status/1728086020267078100][Any]] Subject in Any Style by Effectively Merging LoRAs
- cheaply and effectively merge independently trained style and subject LoRAs
- [[https://arxiv.org/abs/2402.09353][DoRA]]: Weight-Decomposed Low-Rank Adaptation
- decomposes the pre-trained weight into two components, magnitude and direction; directional updates
- [[https://lemmy.dbzer0.com/post/15399911][DiffuseKronA]]: [[https://github.com/IBM/DiffuseKronA][A Parameter]] Efficient Fine-tuning Method for Personalized Diffusion Model
- Kronecker product-based adaptation, reduces the parameter count by up to 35% vs LoRA
- [[B-LoRA]]
- [[https://lemmy.dbzer0.com/post/18313588][CAT]]: Contrastive Adapter Training for Personalized Image Generation
- no loss of diversity in object generation, no token = no effect
- [[CTRLORA]]
***** MULTIPLE LORA
:PROPERTIES:
:ID: 3f126569-6deb-45e1-9535-77883fc7ad8b
:END:
- [[https://twitter.com/_akhaliq/status/1721759353437311461][S-LoRA]]: Serving Thousands of Concurrent LoRA Adapters
- scalable serving of many LoRA adapters, all adapters in the main memory, fetches for the current queries
- [[https://twitter.com/_akhaliq/status/1726793541253280249][MultiLoRA]]: Democratizing LoRA for Better Multi-Task Learning
- changes parameter initialization of adaptation matrices to reduce parameter dependency
- [[https://twitter.com/_akhaliq/status/1732237243610210536][Orthogonal]] Adaptation for Modular Customization of Diffusion Models
- customized models can be summed with minimal interference, and jointly synthesize
- scalable customization of diffusion models by encouraging orthogonal weights
- [[https://twitter.com/_akhaliq/status/1762334024561787339][Multi-LoRA]] Composition for Image Generation
- [[https://arxiv.org/abs/2403.19776][CLoRA]]: A Contrastive Approach to Compose Multiple LoRA Models
- enables the creation of composite images that truly reflect the characteristics of each LoRA
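A minimal sketch of composing several LoRAs at inference time with the diffusers/PEFT integration linked above; the =load_lora_weights= / =set_adapters= calls follow the diffusers tutorial, while the adapter repo ids are placeholders.
#+begin_src python
# Hedged sketch: stack a style LoRA and a subject LoRA with adapter weights.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/or/repo/style_lora", adapter_name="style")      # placeholder
pipe.load_lora_weights("path/or/repo/subject_lora", adapter_name="subject")  # placeholder
pipe.set_adapters(["style", "subject"], adapter_weights=[0.8, 1.0])

image = pipe("a portrait of sks person, watercolor style",
             num_inference_steps=30).images[0]
#+end_src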
**** TEXTUAL INVERSION
- [[https://t.co/DbEPmPZB1l][Multiresolution Textual]] [[https://github.com/giannisdaras/multires_textual_inversion][Inversion]]: better textual inversion (embedding)
- Extended Textual Inversion (XTI)
- [[https://prompt-plus.github.io/][P+]]: [[https://prompt-plus.github.io/files/PromptPlus.pdf][Extended Textual]] Conditioning in Text-to-Image Generation <>
- different text embedding per unet layer
- [[https://github.com/cloneofsimo/promptplusplus][code]]
- [[https://arxiv.org/abs/2305.05189][SUR-adapter]]: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models (llm)
- adapter to transfer the semantic understanding of llm to align complex vs simple prompts
- [[id:5762b4c1-e574-4ca5-9e38-032071698637][DREAMDISTRIBUTION]] is like Textual Inversion
- [[https://github.com/RoyZhao926/CatVersion][CatVersion]]: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
- learns the gap between the personalized concept and its base class
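For completeness, a minimal sketch of using an already-trained textual-inversion embedding at inference time; the concept repo id and its =<cat-toy>= placeholder token come from the public sd-concepts-library example and are interchangeable with any learned embedding.
#+begin_src python
# Hedged sketch: load a textual-inversion embedding and use its token.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")  # example concept repo

image = pipe("a <cat-toy> sitting on a bookshelf",
             num_inference_steps=30).images[0]
#+end_src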
* USE CASES
- [[image-to-image translation]] [[id:7f6f5bc1-ca59-4557-b908-0345e8127cde][ERASING CONCEPTS]]
** IMAGE COMPRESSION FILE
- [[https://arxiv.org/abs/2401.17789][Robustly overfitting]] latents for flexible neural image compression
- refine the latents of pre-trained neural image compression models
- [[https://arxiv.org/abs/2402.08643][Learned]] Image Compression with Text Quality Enhancement
- text logit loss function
** DIFFUSION AS ENCODER - RETRIEVE PROMPT
:PROPERTIES:
:ID: 40792f03-5726-453b-af13-ba0667592497
:END:
- [[https://twitter.com/_akhaliq/status/1719899183430169056][De-Diffusion]] [[https://dediffusion.github.io/][Makes]] Text a Strong Cross-Modal Interface
- text as a cross-modal interface
- autoencoder uses a pre-trained text-to-image diffusion model for decoding
- encoder is trained to transform an input image into text
- PH2P: [[https://arxiv.org/abs/2312.12416][Prompting]] Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- projection scheme to optimize for prompts representative of the space in the model (meaningful prompts)
** DIFFUSING TEXT
- [[RESTORING HANDS]]
- [[https://ds-fusion.github.io/static/pdf/dsfusion.pdf][DS-Fusion]]: [[https://ds-fusion.github.io/][Artistic]] Typography via Discriminated and Stylized Diffusion (fonts)
- [[https://1073521013.github.io/glyph-draw.github.io/][GlyphDraw]]: [[https://arxiv.org/pdf/2303.17870.pdf][Learning]] [[https://twitter.com/_akhaliq/status/1642696550529867779][to Draw]] Chinese Characters in Image Synthesis Models Coherently
- [[https://arxiv.org/pdf/2402.14314.pdf][Typographic]] Text Generation with Off-the-Shelf Diffusion Model
- complex effects while preserving its overall coherence
- [[https://huggingface.co/papers/2305.18259][GlyphControl]]: [[https://github.com/AIGText/GlyphControl-release][Glyph Conditional]] Control for Visual Text Generation ==this==
- [[https://github.com/microsoft/unilm/tree/master/textdiffuser][TextDiffuser]]: [[https://arxiv.org/pdf/2305.10855.pdf][Diffusion]] [[https://huggingface.co/spaces/microsoft/TextDiffuser][Models]] as Text Painters
- [[https://github.com/microsoft/unilm/tree/master/textdiffuser-2][TextDiffuser-2]]: two language models: for layout planning and layout encoding; before the unet
- [[TIP: text restoration]]
- [[https://arxiv.org/abs/2403.16422][Refining Text-to-Image]] Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation
- training-free framework to enhance the layout generator and the image generator conditioned on it
- generating images with long and rare text sequences
*** GENERATE VECTORS
- [[https://twitter.com/_akhaliq/status/1736998105969459522][VecFusion]]: Vector Font Generation with Diffusion
- rasterized fonts, then a vector model synthesizes vector fonts
- [[https://twitter.com/_akhaliq/status/1737304904400613558][StarVector]]: Generating Scalable Vector Graphics Code from Images
- CLIP image encoder, learning to align the visual and code tokens, generate SVGs
- [[https://arxiv.org/abs/2401.17093][StrokeNUWA]]: Tokenizing Strokes for Vector Graphic Synthesis
- encoding into stroke tokens, naturally compatible with LLMs
- [[https://arxiv.org/abs/2404.00412][SVGCraft]]: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout
- creation of vector graphics depicting entire scenes from textual descriptions
- optimized using a pre-trained encoder
*** INPAINTING TEXT
- DiffSTE: Inpainting to edit text in images with a prompt ([[https://drive.google.com/file/d/1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f/view][model]])
- [[https://github.com/UCSB-NLP-Chang/DiffSTE][Improving]] Diffusion Models for Scene Text Editing with Dual Encoders
**** DERIVED FROM SD
- [[https://github.com/ZYM-PKU/UDiffText][UDiffText]]: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models (with training code)
- [[https://arxiv.org/abs/2312.12232][Brush Your]] Text: Synthesize Any Scene Text on Images via Diffusion Model (Diff-Text)
- attention constraint to address unreasonable positioning, more accurate scene text, any language
- it's just a prompt and canny: "sign", "billboard", "label", "promotions", "notice", "marquee", "board", "blackboard", "slogan", "whiteboard", "logo" (sketch at the end of this block)
- [[https://github.com/tyxsspa/AnyText][AnyText]]: [[https://twitter.com/_akhaliq/status/1741239193215344810][Multilingual]] [[https://youtu.be/hrk_b_CQ36M?si=6cwXFAd1106D3aHK][Visual]] Text Generation And Editing ==best==
- inputs: glyph, position, and masked image to generate latent features for text generation-editing
- text curved into shapes-textures
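A minimal sketch of the "prompt + canny" recipe noted for Diff-Text: render the target string, take its Canny edges, and condition a ControlNet-canny pipeline with a "sign"-style prompt. The checkpoint ids are the usual public ones; the glyph rendering (font, size, placement) is illustrative and should be larger/sharper in practice.
#+begin_src python
# Hedged sketch: scene text via rendered glyphs + ControlNet canny.
import cv2
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1. render the desired text as a high-contrast glyph image
canvas = Image.new("RGB", (512, 512), "black")
draw = ImageDraw.Draw(canvas)
draw.text((40, 220), "OPEN 24H", fill="white", font=ImageFont.load_default())

# 2. canny edges of the glyphs become the control image
edges = cv2.Canny(np.array(canvas), 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 3. prompt a scene that invites text surfaces ("sign", "billboard", ...)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
image = pipe("a neon sign on a shop window at night",
             image=control, num_inference_steps=30).images[0]
#+end_src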
** IMAGE RESTORATION, SUPER-RESOLUTION
- [[https://twitter.com/_akhaliq/status/1678804195229433861][NILUT]]: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
- image signal processing pipeline, multiple blendable styles into a single network
- [[https://arxiv.org/abs/2303.09833][FreeDoM]]: Training-Free Energy-Guided Conditional Diffusion Model
- [[https://arxiv.org/abs/2304.08291][refusion]]: Image Restoration with Mean-Reverting Stochastic Differential Equations
- [[https://arxiv.org/pdf/2212.00490.pdf][image]] restoration IR, [[https://github.com/wyhuai/DDNM][DDNM]] using NULL-SPACE
- unlimited [[https://arxiv.org/pdf/2303.00354.pdf][superresolution]]
- [[https://twitter.com/_akhaliq/status/1674249594421608448][SVNR]]: Spatially-variant Noise Removal with Denoising Diffusion
- real-life noise fixing
- [[https://github.com/WindVChen/INR-Harmonization][Dense]] [[https://github.com/WindVChen/INR-Harmonization][Pixel-to-Pixel]] Harmonization via Continuous Image Representation
- fixes images stretched by a change in resolution
- [[https://github.com/WindVChen/Diff-Harmonization][Zero-Shot Image]] Harmonization with Generative Model Prior
- [[https://github.com/xpixelgroup/diffbir][DiffBIR]]: Towards Blind Image Restoration with Generative Diffusion Prior
- using a SwinIR, then refine with sd
*** SUPERRESOLUTION
- [[https://github.com/csslc/CCSR][CCSR]]: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
- Swintormer: [[https://github.com/bnm6900030/swintormer][Image]] Deblurring based on Diffusion Models (limited memory)
- [[https://twitter.com/_akhaliq/status/1749254341507039644][Inflation]] with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
- for videos, temporal adapter to ensure temporal coherence
- [[https://lemmy.dbzer0.com/post/13460000][YONOS-SR]]: You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- start by training a teacher model on a smaller magnification scale
- instead of 200 steps, and a finetuned decoder on top of it
- SUPIR: [[https://github.com/Fanghua-Yu/SUPIR][Scaling Up]] to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
- based on a large-scale diffusion generative prior
- [[https://arxiv.org/pdf/2401.15366.pdf][Face to Cartoon]] Incremental Super-Resolution using Knowledge Distillation
- faces and anime restoration at various levels of detail
- [[https://twitter.com/_akhaliq/status/1769784456112374077][APISR]]: [[https://twitter.com/_akhaliq/status/1769784456112374077][Anime]] Production Inspired Real-World Anime Super-Resolution
- [[https://arxiv.org/abs/2403.12915][Ultra-High-Resolution]] Image Synthesis with Pyramid Diffusion Model
- pyramid latent representation
- [[https://github.com/mit-han-lab/efficientvit][EfficientViT]]: Multi-Scale Linear Attention for High-Resolution Dense Prediction ==best==
**** STABLESR
:PROPERTIES:
:ID: bc0dd47c-4f46-4cd0-9606-555990c06626
:END:
- [[https://github.com/IceClear/StableSR][StableSR]]: [[https://huggingface.co/Iceclear/StableSR][Exploiting]] Diffusion Prior for Real-World Image Super-Resolution
- develops a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models
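A quick baseline next to the methods above: diffusion upscaling with the public SD x4 upscaler through diffusers. The checkpoint id is the standard =stabilityai/stable-diffusion-x4-upscaler=; the input file name and the downsizing step are placeholders to keep memory small.
#+begin_src python
# Hedged sketch: generic x4 diffusion upscaling with diffusers.
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = load_image("low_res.png").resize((192, 192))  # placeholder input
upscaled = pipe(prompt="a sharp, detailed photo",
                image=low_res, num_inference_steps=25).images[0]
upscaled.save("upscaled_x4.png")
#+end_src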
**** DEMOFUSION
- [[https://arxiv.org/abs/2311.16973][DemoFusion]]: Democratising High-Resolution Image Generation With No $$$
- achieve higher-resolution image generation
- [[https://twitter.com/radamar/status/1732978064026706425][Enhance]] This: DemoFusion SDXL
- [[https://github.com/ttulttul/ComfyUI-Iterative-Mixer][ComfyUI]] Iterative Mixing Nodes ==best==
- iterative mixing of samples to help with upscaling quality
- SD 1.5 generating at higher resolutions
- evolution from [[https://github.com/Ttl/ComfyUi_NNLatentUpscale][NNLatentUpscale]]
***** PASD MAGNIFY
- [[https://twitter.com/fffiloni/status/1743306262379475304][PASD]] Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
- image slider custom component
** DEPTH GENERATION
- [[https://twitter.com/_akhaliq/status/1630747135909015552][depth map]] from diffusion, build a 3d environment with it
- [[https://github.com/wl-zhao/VPD][VPD]]: using diffusion for depth estimation, image segmentation (better) <> comparable to [[x-decoder]]
- [[https://github.com/isl-org/ZoeDepth][ZoeDepth]]: [[https://arxiv.org/abs/2302.12288][Combining]] relative and metric depth
- [[https://github.com/BillFSmith/TilingZoeDepth][tiling ZoeDepth]]
- [[https://twitter.com/zhenyu_li9955/status/1732669069717909672][PatchFusion]]: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- [[https://twitter.com/AntonObukhov1/status/1732946419663667464][Marigold]]: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (70s inference)
- [[https://twitter.com/radamar/status/1691137538734583808][LDM3D]] by intel, generates image & depth from text prompts
- [[https://twitter.com/_akhaliq/status/1721793148391760196][LDM3D-VR]]: Latent Diffusion Model for 3D VR
- generating depth together, panoramic RGBD
- DMD (Diffusion for Metric Depth)
- [[https://twitter.com/_akhaliq/status/1737699544542973965][Zero-Shot]] Metric Depth with a Field-of-View Conditioned Diffusion Model (depth from image)
- [[https://twitter.com/pythontrending/status/1750141129314468051][Depth Anything]]: [[https://twitter.com/mervenoyann/status/1750531698008498431][Unleashing]] the Power of Large-Scale Unlabeled Data (temporal coherence, no flickering; usage sketch at the end of this section)
- [[id:fa7469f5-948a-42b2-8787-14109bc9ed5a][GIBR]]
*** DEPTH DIFFUSION
:PROPERTIES:
:ID: 277f7cda-963c-48ff-8e43-169986d8cff6
:END:
- [[https://twitter.com/_akhaliq/status/1734051086175027595][MVDD]]: Multi-View Depth Diffusion Models
- 3D shape generation, depth completion, and its potential as a 3D prior
- enforce 3D consistency in multi-view depth
- [[https://twitter.com/_akhaliq/status/1770673356821442847][DepthFM]]: Fast Monocular Depth Estimation with Flow Matching
- a pre-trained image diffusion model can become a flow matching depth model
*** NORMAL MAPS
- [[https://github.com/baegwangbin/DSINE][DSine]]: Rethinking Inductive Biases for Surface Normal Estimation
- better than bae and midas
- [[https://github.com/Mikubill/sd-webui-controlnet/discussions/2703][preprocessor]]
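The depth sketch referenced above: monocular depth through the transformers depth-estimation pipeline with a Depth Anything checkpoint. The model id is one published small variant and may differ locally; ZoeDepth or Marigold can be swapped in by changing the model.
#+begin_src python
# Hedged sketch: single-image depth map via the transformers pipeline.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(task="depth-estimation",
                           model="LiheYoung/depth-anything-small-hf")
result = depth_estimator(Image.open("photo.png"))  # placeholder input image
result["depth"].save("depth.png")                  # PIL image of the predicted depth
#+end_src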