:PROPERTIES:
:ID: 58c585b9-a03e-4320-a313-e00e68c4ce7e
:END:
#+title: diffusion video
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup
- CogVideoX1.5-5B LoRA finetuning on 1.5-5B-I2V inference
  - https://github.com/a-r-r-o-w/cogvideox-factory
- parent: [[id:c7fe7e79-73d3-4cc7-a673-2c2e259ab5b5][stable_diffusion]]
- [[https://arxiv.org/abs/2403.07711][SSM Meets]] [[https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models][Video Diffusion]] Models: Efficient Video Generation with Structured State Spaces
  - no longer exponential for more frames
* TUNING
- [[https://noise-rectification.github.io/][Tuning-Free Noise]] Rectification for High Fidelity Image-to-Video Generation (dataset alleviation)
  - prevents loss of image details and noise prediction biases during the denoising process
  - adds noise, then denoises the noisy latent with a proper rectification to alleviate the noise prediction biases
- [[https://github.com/wgcban/apt][Attention Prompt]] [[https://arxiv.org/abs/2403.06978][Tuning]]: Parameter-Efficient Adaptation of Pre-Trained Models for Action Recognition
  - efficient prompt tuning for video applications such as action recognition
* ANIMATION
- [[https://twitter.com/_akhaliq/status/1756867429047640065][Keyframer]]: Empowering Animation Design using Large Language Models
  - animating static images (SVGs) with natural language
- [[https://arxiv.org/abs/2402.06088][Animated]] Stickers: Bringing Stickers to Life with Video Diffusion (animated emojis)
* 4D CONTROL
- [[id:b211cec9-6cf2-4f6d-9e1e-10186f513da1][BIGGER COHERENCE]] [[id:5a81561e-9e1c-4fc8-bfba-de467b4de033][FACE]] [[id:20a546c6-135e-45f3-88a6-d3e5869bd28f][ARTIST EDITING]]
- [[https://t.co/dLjkJDBfJa][DiffDreamer]]: [[https://twitter.com/prime_cai/status/1680429147146063874][Consistent]] Single-view Perpetual View Generation with Conditional Diffusion Models
  - landscape (mountains) fly-overs
- [[https://twitter.com/_akhaliq/status/1676084523006566403][DisCo]]: Disentangled Control for Referring Human Dance Generation in Real World
  - human dance (movement) images and videos (using skeleton rigs)
- [[https://youtu.be/5qjR9aFRg1A?si=tVxadTz7HXLr93_L][paintsundo]]: instead of just diffusing, the AI paints like humans paint
  - makes speedpaints and (extracted) sketches, faking the drawing process
** GIF
:PROPERTIES:
:ID: de20b51a-2de5-4011-9561-2ce400d75af8
:END:
- [[https://twitter.com/_akhaliq/status/1709342103434617240][Hotshot-XL]]: text-to-GIF model for Stable Diffusion XL
- [[https://twitter.com/_akhaliq/status/1702568561422553175][Generative]] Image Dynamics: interactive GIFs (looping dynamic videos); rough sketch after this list
  - frequency-coordinated diffusion sampling process
  - neural stochastic motion texture
- [[https://twitter.com/_akhaliq/status/1765974266510492028][Pix2Gif]]: Motion-Guided Diffusion for GIF Generation
  - the transformed feature map (motion) remains within the same space as the target, thus consistency/coherence
- [[https://twitter.com/_akhaliq/status/1768642913901183419][dynamicrafter]]: generative frame interpolation and looping video generation (320x512)
- [[https://time-reversal.github.io/][Explorative]] Inbetweening of Time and Space
  - bounded generation with a pre-trained image-to-video model, without any tuning or optimization
  - two images that capture a subject motion, a translation between different viewpoints, or a loop
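A rough, hedged illustration of the "neural stochastic motion texture" idea behind the Generative Image Dynamics entry above: assuming a network has already predicted a spectral volume (per-pixel Fourier coefficients of displacement at a few low frequencies), the sketch only shows how such coefficients could be turned into per-frame displacement fields and used to warp the input image. The array shapes, the simple backward warp, and the sampling loop are my assumptions, not the paper's implementation.
#+begin_src python
# Turn a predicted "spectral volume" of per-pixel motion into a looping clip.
import numpy as np
from scipy.ndimage import map_coordinates

def animate(image, spectral, freqs, num_frames=60, fps=30):
    """image:    (H, W, 3) float array.
    spectral: (F, H, W, 2) complex array, Fourier coefficients of the x/y
              displacement of each pixel at F low frequencies (assumed given).
    freqs:    (F,) frequencies in Hz."""
    h, w, _ = image.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    frames = []
    for n in range(num_frames):
        t = n / fps
        # inverse Fourier synthesis: displacement(t) = sum_f Re(coef_f * e^{2*pi*i*f*t})
        phase = np.exp(2j * np.pi * freqs * t)                              # (F,)
        disp = np.real(spectral * phase[:, None, None, None]).sum(axis=0)   # (H, W, 2)
        # crude backward warp: sample the source image at (position - displacement)
        coords = np.stack([yy - disp[..., 1], xx - disp[..., 0]])
        frame = np.stack([map_coordinates(image[..., c], coords, order=1, mode="nearest")
                          for c in range(3)], axis=-1)
        frames.append(frame)
    return frames
#+end_src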
** INTERACTIVE
:PROPERTIES:
:ID: 7ab592e4-5e55-40eb-8c93-9d779ea2bcf7
:END:
*** COLORIZATION
:PROPERTIES:
:ID: 247ce640-4a9c-4cb9-94f9-d013848f47ce
:END:
- [[https://github.com/ykdai/BasicPBC][Learning]] Inclusion Matching for Animation Paint Bucket Colorization
  - for hand-drawn cel animation
  - comprehends the inclusion relationships between segments
  - paints based on the previous frame
*** DRAG
:PROPERTIES:
:ID: 208c064d-f700-4e8f-a4ab-2c73c557f9e3
:END:
- ==DragGAN==: [[https://huggingface.co/papers/2305.10973][Drag Your]] [[https://github.com/Zeqiang-Lai/DragGAN][GAN]]: Interactive Point-based Manipulation on the Generative Image Manifold
  - dragging as the input primitive, using pairs of points; excellent results; StyleGAN derivative
- [[https://mc-e.github.io/project/DragonDiffusion/][DragonDiffusion]]: Enabling Drag-style Manipulation on Diffusion Models
  - moving, resizing, appearance replacement, dragging
- [[https://twitter.com/_akhaliq/status/1765944027134750925][StableDrag]]: Stable Dragging for Point-based Image Editing
  - models: StableDrag-GAN and StableDrag-Diff
  - confidence-based latent enhancement strategy for motion supervision
**** DRAG DIFFUSION
:PROPERTIES:
:ID: d3c6d9ef-9dff-4c60-8f92-5a523c24c139
:END:
- [[https://twitter.com/_akhaliq/status/1673570232429051906][DragDiffusion]]: [[https://github.com/Yujun-Shi/DragDiffusion][Harnessing]] Diffusion Models for Interactive Point-based Image Editing (motion-supervision sketch after this list)
- [[https://lemmy.dbzer0.com/post/12388550][RotationDrag]]: [[https://github.com/Tony-Lowe/RotationDrag][Point-based]] Image Editing with Rotated Diffusion Features
  - utilizes the feature map to rotate and move images
- [[https://twitter.com/_akhaliq/status/1676808539317182464][DragonDiffusion]]: Enabling Drag-style Manipulation on Diffusion Models
- [[https://twitter.com/_akhaliq/status/1692061631638114356][DragNUWA]]: [[https://github.com/ProjectNUWA/DragNUWA][Fine-grained]] Control in Video Generation by Integrating Text, Image, and Trajectory
  - controls trajectories at different granularities
- [[https://arxiv.org/abs/2404.01050][Drag Your]] [[https://github.com/haofengl/DragNoise][Noise]]: Interactive Point-based Editing via Diffusion Semantic Propagation
  - superior control and semantic retention, reducing optimization time by 50% compared to DragDiffusion
- [[https://github.com/alibaba/Tora][Tora]]: Trajectory-oriented Diffusion Transformer for Video Generation
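A hedged sketch of the "motion supervision" loop that DragGAN/DragDiffusion-style editors share: nudge a latent so that features around each handle point move one small step toward its target point. ~feature_map_fn~ (latent to feature map), the step size, and the patch radius are placeholders I introduce for illustration; this is not any repo's code.
#+begin_src python
import torch
import torch.nn.functional as F

def drag_step(latent, feature_map_fn, handles, targets, lr=0.01, radius=3):
    """One motion-supervision step.
    latent: (1, C, H, W) tensor being optimized (placeholder for a GAN latent
            or an inverted diffusion latent).
    handles, targets: lists of integer (x, y) points, assumed away from borders."""
    latent = latent.detach().requires_grad_(True)
    feats = feature_map_fn(latent)                       # (1, D, H, W) feature map
    loss = feats.new_zeros(())
    for (hx, hy), (tx, ty) in zip(handles, targets):
        dx = (tx > hx) - (tx < hx)                       # one-pixel step toward the target
        dy = (ty > hy) - (ty < hy)
        src = feats[..., hy - radius:hy + radius + 1, hx - radius:hx + radius + 1]
        dst = feats[..., hy + dy - radius:hy + dy + radius + 1,
                         hx + dx - radius:hx + dx + radius + 1]
        # the features one step toward the target should match the current (frozen)
        # handle features, which drags content along the handle -> target path
        loss = loss + F.l1_loss(dst, src.detach())
    loss.backward()
    with torch.no_grad():
        new_latent = latent - lr * latent.grad           # one optimization step
    return new_latent.detach()
#+end_src
Real implementations also re-locate the handle points after every step (point tracking) and usually restrict the update to a user mask.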
*** HEAD POSE
:PROPERTIES:
:ID: 4ee8ad27-c3eb-43db-9086-d689ad44b2c6
:END:
- [[TALKING FACES]]
- [[https://twitter.com/_akhaliq/status/1664084264349040640][Control4D]]: [[https://huggingface.co/papers/2305.20082][Dynamic]] Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor <>
  - 4D GAN, 2D diffusion, consistent 4D, ==best one==
  - changes the face in a video
- [[https://twitter.com/_akhaliq/status/1699336073636147476][AniPortraitGAN]]: Animatable 3D Portrait Generation from 2D Image Collections
  - facial expression, head pose, and shoulder movements
  - trained on unstructured 2D images
- [[https://twitter.com/_akhaliq/status/1702155485183361455][MagiCapture]]: High-Resolution Multi-Concept Portrait Customization
  - generates high-resolution portrait images given a handful of random selfies
- DiffPortrait3D: [[https://twitter.com/_akhaliq/status/1738247956019524048][Controllable]] Diffusion for Zero-Shot Portrait View Synthesis
  - input: unposed portrait image; retains identity and facial expression
- [[https://xiyichen.github.io/morphablediffusion/][Morphable]] Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
  - novel view synthesis; input: single image and a morphable mesh for the desired facial expression (emotion)
* SEMANTICALLY DEFORMED
:PROPERTIES:
:ID: db81202f-abf0-410e-98c2-c202fa2ca350
:END:
- [[https://research.nvidia.com/labs/toronto-ai/VideoLDM/][VideoLDM]]: HD, but still semantically deformed (NVIDIA)
** SEMANTICAL FIELD
- [[https://twitter.com/_akhaliq/status/1682206212203376642][TokenFlow]]: [[https://github.com/omerbt/TokenFlow][Consistent]] Diffusion Features for Consistent Video Editing
  - consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space
- [[https://twitter.com/_akhaliq/status/1691931947935908151][CoDeF]]: Content Deformation Fields for Temporally Consistent Video Processing
  - video to video, frame consistency
  - aggregates the entire video and then applies a deformation field to one image ==best one==
- [[https://s2dm.github.io/S2DM/][S2DM]]: Sector-Shaped Diffusion Models for Video Generation ==best==
  - explores the use of optical flow as a temporal condition
  - prompt correctness while keeping semantic consistency; can integrate with other temporal conditions
  - decouples the generation of temporal features from semantic-content features
** SD BASED
- [[https://arxiv.org/abs/2304.08477][Latent-Shift]]: [[https://latent-shift.github.io/][Latent]] Diffusion with Temporal Shift for Efficient Text-to-Video Generation
  - temporal shift module that can leverage the spatial UNet as is
- [[https://twitter.com/_akhaliq/status/1668808284575342594][Rerender A]] [[https://twitter.com/_akhaliq/status/1669726589737631745][Video]]: [[https://huggingface.co/spaces/Anonymous-sub/Rerender][Zero-Shot]] Text-Guided Video-to-Video Translation
  - compatible with existing diffusion models ==best one==
  - hierarchical cross-frame constraints applied to enforce coherence
- [[https://tuneavideo.github.io/][Tune-A-Video]]: [[https://github.com/showlab/Tune-A-Video][One-Shot]] Tuning of Image Diffusion Models for Text-to-Video Generation
  - inflates an SD model into a video model; FROZEN SD (cross-frame attention sketched after this list)
- [[https://fate-zero-edit.github.io/][Fate/Zero]]: [[https://github.com/ChenyangQiQi/FateZero][Fusing Attentions]] (MIT) for Zero-shot Text-based Video Editing
  - the most fluid one, without training
- [[https://github.com/rehg-lab/RAVE][RAVE]]: [[https://rave-video.github.io/][Randomized]] Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models ==best==
  - employs a novel noise shuffling strategy to leverage temporal interactions (coherence)
  - guidance with ControlNet
- [[https://twitter.com/_akhaliq/status/1741666796770390207][FlowVid]]: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
  - doesn't strictly adhere to optical flow
  - first frame = supplementary reference in the diffusion model
  - works seamlessly with existing I2I models
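A minimal sketch (my illustration, not any repo's code) of the cross-frame attention trick shared by the frozen-SD editors above (Tune-A-Video-style inflation, Rerender's cross-frame constraints, and Text2Video-Zero further down): every frame's self-attention reuses the keys/values of an anchor frame, which keeps appearance consistent across frames. The tensor layout here is an assumption.
#+begin_src python
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v, anchor=0):
    """q, k, v: (frames, heads, tokens, dim) projections coming out of a frozen
    self-attention layer, one slice per video frame."""
    f = q.shape[0]
    k_anchor = k[anchor:anchor + 1].expand(f, -1, -1, -1)   # broadcast anchor keys
    v_anchor = v[anchor:anchor + 1].expand(f, -1, -1, -1)   # broadcast anchor values
    # each frame queries the anchor frame's tokens instead of its own
    return F.scaled_dot_product_attention(q, k_anchor, v_anchor)
#+end_src
In practice these editors often mix the anchor frame with the previous frame's keys/values, but the consistency mechanism is this key/value sharing.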
*** I2VGEN-XL
- [[https://huggingface.co/damo-vilab/MS-Image2Video][I2VGen-XL]] [[https://twitter.com/fffiloni/status/1696845630583296177][MS-Image2Video]] *non commercial*: good consistency and continuity, animates an image
  - built on SD; UNet designed to perform spatiotemporal modeling in the latent space
  - pre-trained on video and images
- [[https://twitter.com/_akhaliq/status/1735677931903418838][I2VGen-XL]]: [[https://i2vgen-xl.github.io/][High-Quality]] Image-to-Video Synthesis via Cascaded Diffusion Models
  - utilizes static images as a form of training guidance
*** ANIMATEDIFF
- ==best one==
- [[https://twitter.com/_akhaliq/status/1678610810644451328][AnimateDiff]]: [[https://www.reddit.com/r/StableDiffusion/comments/14wgv2p/animatediff_animate_your_personalized_texttoimage/][Animate]] [[https://github.com/talesofai/AnimateDiff][Your]] [[https://github.com/guoyww/AnimateDiff][Personalized]] Text-to-Image Diffusion Models without Specific Tuning
  - inserts a motion module into a frozen (normal SD) text-to-image model (sketched below, after this list)
  - examples: (nsfw) [[https://www.reddit.com/r/WaifuDiffusion/comments/15b7o19/what_character_next/][video1]] [[https://www.reddit.com/r/WaifuDiffusion/comments/15afld0/ai_prompt_lol/][video2]] [[https://boards.4channel.org/g/thread/95968853#p95968853][video3]] [[https://boards.4channel.org/g/thread/96099609#p96099609][video4>>96101928]] notnsfw: [[https://boards.4channel.org/g/thread/96050730>>96052621][video1>>96052859]] [[https://boards.4channel.org/g/thread/96149191#p96149191][sword and sun>>96155685]]
  - best one: [[https://twitter.com/Zuntan03/status/1717809494271660276][video6]] [[https://www.reddit.com/r/StableDiffusion/comments/17ssbkj/soon_there_will_be_nothing_left_to_fix/][video7]] [[https://www.reddit.com/r/StableDiffusion/comments/188je4b/do_you_like_this_knife/?utm_source=share&utm_medium=web2x&context=3][video8]] [[https://www.instagram.com/reel/C3EpNVUvn5d/?utm_source=ig_web_copy_link][video9]]
  - current state / ways: https://banodoco.ai/Animatediff, more [[https://www.reddit.com/r/StableDiffusion/comments/16a4j5j/summary_of_community_progress_in_expanding/][insight]]
  - techniques:
    - Animatediff-cli-[[https://github.com/s9roll7/animatediff-cli-prompt-travel/tree/main][prompt-travel]]+Upscale: https://twitter.com/toyxyz3/status/1695134607317012749
    - [[https://www.reddit.com/r/StableDiffusion/comments/15xq294/controlling_animateddiff_using_starting_and/][Controlling AnimatedDiff]] using starting and ending frames (from Twitter user [[https://twitter.com/TDS_95514874/status/1679606124130205702][@TDS_95514874]])
- [[https://vvictoryuki.github.io/animatezero.github.io/][AnimateZero]]: Video Diffusion Models are Zero-Shot Image Animators
  - T2I generation is more controllable and efficient compared to T2V
  - we can transform pre-trained T2V models into I2V models
- [[https://huggingface.co/Lightricks/LongAnimateDiff][LongAnimateDiff]]: now 64 frames
- [[https://github.com/arthur-qiu/FreeNoise-AnimateDiff][FreeNoise]]: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise-AnimateDiff)
  - removes the semantic flickering
- [[https://huggingface.co/ByteDance/AnimateDiff-Lightning][AnimateDiff-Lightning]]: fast text-to-video model; can generate videos ten times faster than AnimateDiff
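A hedged sketch of the AnimateDiff idea named above: a temporal transformer ("motion module") slotted between the frozen spatial layers of SD, attending only across the frame axis. Shapes, layout, and initialization are illustrative assumptions, not the official module.
#+begin_src python
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    def __init__(self, channels, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)   # zero-init output: the module starts as identity,
        nn.init.zeros_(self.proj.bias)     # so the frozen T2I model is untouched at first

    def forward(self, x, num_frames):
        # x: (batch * frames, tokens, channels), output of a frozen spatial block,
        # assuming the leading axis is (batch, frames) flattened frame-major per sample
        bf, t, c = x.shape
        b = bf // num_frames
        h = x.reshape(b, num_frames, t, c).permute(0, 2, 1, 3).reshape(b * t, num_frames, c)
        h_norm = self.norm(h)
        attn, _ = self.attn(h_norm, h_norm, h_norm)   # attention along the frame axis only
        h = h + self.proj(attn)                       # residual; spatial SD weights stay frozen
        return h.reshape(b, t, num_frames, c).permute(0, 2, 1, 3).reshape(bf, t, c)
#+end_src
The real module also adds positional encodings along the frame axis; only these temporal layers are trained while the SD weights stay frozen.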
**** DIFFDIRECTOR
:PROPERTIES:
:ID: 0518240a-8adf-4bec-8460-034b49b7195e
:END:
- [[https://github.com/ExponentialML/AnimateDiff-MotionDirector][DiffDirector]]: AnimateDiff-MotionDirector; train a MotionLoRA with MotionDirector and run it on any compatible AnimateDiff UI
**** PIA
- [[https://huggingface.co/papers/2312.13964][PIA]]: [[https://twitter.com/_akhaliq/status/1738029201033175308][Your]] Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
  - motion controllability by text: temporal alignment layers (TA) out of token
**** ANIMATELCM
:PROPERTIES:
:ID: ccc8f98c-34eb-448b-b2d8-6ef662627fa4
:END:
- [[https://twitter.com/_akhaliq/status/1753244353974214745][AnimateLCM]]: [[https://huggingface.co/wangfuyun/AnimateLCM][decouples]] [[https://huggingface.co/wangfuyun/AnimateLCM-SVD-xt][the distillation]] of image generation priors and motion generation priors
**** CMD
- [[https://twitter.com/_akhaliq/status/1770999135555956830][Efficient]] Video Diffusion Models via Content-Frame Motion-Latent Decomposition ==best==
  - content-motion latent diffusion model (CMD)
  - autoencoder that succinctly encodes a video as a combination of an image and a low-dimensional motion latent representation (decomposition sketched after this list)
  - pretrained image diffusion model plus a lightweight motion diffusion model
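A simplified illustration of the content-frame / motion-latent decomposition summarized above (my reading of the summary, not the paper's code): the "content frame" is a weighted average over time, and the motion latent is a low-dimensional projection of what remains. Dimensions and layer choices are assumptions.
#+begin_src python
import torch
import torch.nn as nn

class ContentMotionAE(nn.Module):
    def __init__(self, frames, pixels, motion_dim=64):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(frames))       # softmax -> temporal weights
        self.to_motion = nn.Linear(pixels, motion_dim)          # encode per-frame residual
        self.from_motion = nn.Linear(motion_dim, pixels)        # decode it back

    def encode(self, video):                  # video: (batch, frames, pixels)
        w = self.weights.softmax(dim=0)[None, :, None]
        content = (video * w).sum(dim=1)                        # (batch, pixels) content frame
        motion = self.to_motion(video - content[:, None, :])    # (batch, frames, motion_dim)
        return content, motion

    def decode(self, content, motion):
        return content[:, None, :] + self.from_motion(motion)   # reconstruct all frames

# Two diffusion models would then be trained: a (pretrained) image diffusion model
# on `content` and a small one on `motion`, which is the efficiency win claimed above.
#+end_src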
** 3D SD
- [[https://yingqinghe.github.io/LVDM/][VideoCrafter]]: Open Diffusion Models for High-Quality Video Generation and Editing ([[https://github.com/VideoCrafter/VideoCrafter][A Toolkit]] [[https://github.com/VideoCrafter/VideoCrafter][for Text-to-Video]])
  - has LoRAs and ControlNet, 3D UNet; [[https://twitter.com/jfischoff/status/1643649328723144705/photo/1][deeper lesson]]
- VideoFusion: damo/text-to-video-[[https://modelscope.cn/models/damo/text-to-video-synthesis/files][synthesis]], [[https://www.modelscope.cn/models/damo/cv_diffusion_text-to-image-synthesis_tiny/summary][summary]] [[https://www.modelscope.cn/models/damo/cv_diffusion_text-to-image-synthesis_tiny/summary][tiny]], [[https://arxiv.org/pdf/2303.08320.pdf][paper]]
  - https://rentry.org/f34hy, [[https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/commit/ac7fbae73c65a6bbde3814d0198e16bb8e886cef][license change commit]]
* BY INPUT
** VIDEO COHERENCE
:PROPERTIES:
:ID: 18c951a2-6883-4010-ad9d-9dee396b9839
:END:
- [[id:b211cec9-6cf2-4f6d-9e1e-10186f513da1][BIGGER COHERENCE]] from normal SD image generation
- [[https://twitter.com/_akhaliq/status/1737672598832447635][InstructVideo]]: Instructing Video Diffusion Models with Human Feedback
  - recasts reward fine-tuning as editing: processes a corrupted video rated by an image reward model
** VCHITECT
:PROPERTIES:
:ID: 89b433ed-e943-4fe6-8c18-bdaa834298fa
:END:
- [[https://twitter.com/liuziwei7/status/1730518084350521384][Vchitect]]: [[LAVIE][LaVie]] (Text2Video), [[SEINE]] (Image2Video)
** IMAGES
- [[https://twitter.com/_akhaliq/status/1719561992740933900][SEINE]]: [[https://twitter.com/liuziwei7/status/1719732214521544984][Short-to-Long]] Video Diffusion Model for Generative Transition and Prediction ==best== <<SEINE>>
  - [[https://github.com/Vchitect/SEINE][SEINE]]: images of different scenes as inputs, plus text-based control, generates transition videos
- [[https://twitter.com/camenduru/status/1734193274796011994][DynamiCrafter]]: [[https://arxiv.org/abs/2310.12190][Animating]] [[https://twitter.com/_akhaliq/status/1754344043209797702][Open-domain]] [[https://replicate.com/camenduru/dynami-crafter][Images]] with Video Diffusion Priors (prompt and image) ==best==
- [[https://atomo-video.github.io/][AtomoVideo]]: High Fidelity Image-to-Video Generation ==best==
  - from input images, motion intensity and consistency; compatible with SD models without specific tuning
  - pre-trained SD, adds 1D temporal convolution and temporal attention
- [[https://github.com/ToonCrafter/ToonCrafter][ToonCrafter]]: Generative Cartoon Interpolation ==best==
*** DANCING
:PROPERTIES:
:ID: 778f3c47-0420-41be-b5bb-f4d4c7f23cb9
:END:
- [[id:7ed28066-314a-44f4-94d3-d1dc73aeb3df][CLOTH]]
- [[https://twitter.com/_akhaliq/status/1726823455272644865][PixelDance]]: Make Pixels Dance: High-Dynamic Video Generation
  - synthesizes videos with complex scenes and intricate motions
  - incorporates image instructions (not just text instructions)
- [[https://twitter.com/_akhaliq/status/1731754853238501522][MagicAnimate]]: Temporally Consistent Human Image Animation using Diffusion Model
  - video diffusion model to encode temporal information
  - [[REFERENCENET]]
- [[https://abdo-eldesokey.github.io/text2ac-zero/][Text2AC-Zero]]: Consistent Synthesis of Animated Characters using 2D Diffusion
  - zero-shot on existing T2I, no training or fine-tuning
  - pixel-wise guidance to steer the diffusion to minimize visual discrepancies
- [[https://twitter.com/_akhaliq/status/1734050153093308820][DreaMoving]]: [[https://twitter.com/_akhaliq/status/1737536184862076979][A Human]] [[https://twitter.com/_akhaliq/status/1740380744726594003][Dance]] Video Generation Framework based on Diffusion Models
  - Video ControlNet for motion control and a Content Guider for identity preservation
- [[https://github.com/aigc3d/motionshop][Motionshop]]: [[https://aigc3d.github.io/motionshop/][An application]] of replacing the human motion in the video with a virtual 3D human
  - segment, retarget, and inpaint (with light awareness)
- [[https://lemmy.dbzer0.com/post/13388517][Diffutoon]]: [[https://ecnu-cilab.github.io/DiffutoonProjectPage/][High-Resolution]] [[https://github.com/Artiprocher/DiffSynth-Studio][Editable]] Toon Shading via Diffusion Models
  - aims to directly render (turn) photorealistic videos into anime styles while keeping consistency
- [[https://arxiv.org/abs/2402.03549][AnaMoDiff]]: 2D Analogical Motion Diffusion via Disentangled Denoising
  - best trade-off between motion analogy and identity preservation
- [[https://boese0601.github.io/magicdance/][MagicDance]]: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer ==best==
  - real people references
**** TALKING FACES
:PROPERTIES:
:ID: cb95a484-0d7c-4702-a256-41e21110c1aa
:END:
- [[https://github.com/ali-vilab/dreamtalk][DreamTalk]]: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
  - inputs: songs, speech in multiple languages, noisy audio, and out-of-domain portraits
- [[https://peterfanfan.github.io/EmoSpeaker/][EmoSpeaker]]: One-shot Fine-grained Emotion-Controlled Talking Face Generation
  - emotion input; different emotional intensities by adjusting the fine-grained emotion
- [[https://arxiv.org/abs/2402.06149][HeadStudio]]: Text to Animatable Head Avatars with 3D Gaussian Splatting
  - generates animatable avatars from textual prompts, visually appealing
- [[https://arxiv.org/abs/2402.10636][PEGASUS]]: Personalized Generative 3D Avatars with Composable Attributes
  - disentangled controls while preserving the identity, realistic
  - trained using synthetic data at first
** VIDEO INPUT
- [[https://twitter.com/_akhaliq/status/1699323163178402032][MagicProp]]: Diffusion-based Video Editing via Motion-aware Appearance Propagation
  - edit one frame, then propagate to all (propagation sketched after this list)
- [[https://twitter.com/_akhaliq/status/1699330779334029624][Hierarchical]] Masked 3D Diffusion Model for Video Outpainting
- [[https://github.com/williamyang1991/FRESCO][FRESCO]]: Spatial-Temporal Correspondence for Zero-Shot Video Translation
  - zero-shot and EBSynth come together for a new vid2vid
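A hedged illustration of the "edit one frame, then propagate" pattern behind the MagicProp / FRESCO / EBSynth-style entries above: warp the edited keyframe along precomputed optical flow and fall back to the original frame where the flow is unreliable. The flow source (RAFT, Farneback, ...), the long-range flows to the keyframe, and the occlusion rule are my assumptions.
#+begin_src python
import numpy as np
import cv2

def propagate_edit(edited_key, frames, flows, thresh=30.0):
    """edited_key: edited version of frames[0], (H, W, 3) uint8.
    frames: list of original frames, (H, W, 3) uint8.
    flows:  flows[i] is the backward flow frame i+1 -> frame 0, (H, W, 2) float
            (e.g. chained from pairwise flows)."""
    h, w = edited_key.shape[:2]
    grid = np.dstack(np.meshgrid(np.arange(w), np.arange(h))).astype(np.float32)
    outputs = [edited_key]
    for frame, flow in zip(frames[1:], flows):
        coords = (grid + flow).astype(np.float32)          # sample positions in frame 0
        mapx, mapy = coords[..., 0], coords[..., 1]
        warped_edit = cv2.remap(edited_key, mapx, mapy, cv2.INTER_LINEAR)
        warped_orig = cv2.remap(frames[0], mapx, mapy, cv2.INTER_LINEAR)
        # where warping the *original* keyframe no longer matches this frame,
        # the flow is unreliable (occlusion / new content): keep the frame as-is
        err = np.linalg.norm(warped_orig.astype(np.float32) - frame.astype(np.float32), axis=-1)
        mask = (err < thresh)[..., None]
        outputs.append(np.where(mask, warped_edit, frame).astype(frame.dtype))
    return outputs
#+end_src
The actual methods do this inside the diffusion process (feature/latent space) rather than on pixels, which is what removes the EBSynth-style ghosting.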
** BY PROMPT
- [[https://arxiv.org/abs/2303.13439][Text2Video-Zero]]: [[https://github.com/Picsart-AI-Research/Text2Video-Zero][Text-to-Image]] [[https://github.com/JiauZhang/Text2Video-Zero][Diffusion Models]] are Zero-Shot Video Generators
  - DDIM enhanced with motion dynamics, plus cross-frame attention to protect identity
- [[https://arxiv.org/abs/2303.17599][Zero-Shot Video]] [[https://github.com/baaivision/vid2vid-zero][Editing Using]] Off-The-Shelf Image Diffusion Models (vid2vid-zero)
- [[https://arxiv.org/abs/2303.07945][Edit-A-Video]]: Single Video Editing with Object-Aware Consistency
- [[https://video-p2p.github.io/][video-p2p]]: [[https://arxiv.org/abs/2303.04761][cross]] attention control (more coherence than instruct-pix2pix) (Adobe)
- [[https://twitter.com/_akhaliq/status/1669574695232888832][VidEdit]]: Zero-Shot and Spatially Aware Text-Driven Video Editing (temporal smoothness)
- [[https://twitter.com/_akhaliq/status/1694614951997137083][StableVideo]]: [[https://github.com/rese1f/stablevideo][Text-driven]] Consistency-aware Diffusion Video Editing (14 GB VRAM)
  - temporal dependency = consistent appearance for the edited objects ==best one==
- [[http://haonanqiu.com/projects/FreeNoise.html][FreeNoise]]: [[https://twitter.com/iScienceLuvr/status/1716644241961836683][Tuning-Free]] Longer Video Diffusion via Noise Rescheduling ==best==
  - reschedules a sequence of noises with a window-based function = longer videos conditioned on multiple texts (rescheduling sketched after this list)
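A hedged sketch of the FreeNoise-style noise rescheduling named above, as I read it: instead of sampling fresh noise for every frame of a long video, reuse the noise frames the model was trained with and only shuffle them locally, so far-apart frames stay correlated while nearby frames still differ. The window size and shuffle rule are illustrative assumptions.
#+begin_src python
import torch

def reschedule_noise(base_noise, total_frames, window=4, generator=None):
    """base_noise: (train_frames, C, H, W) noise at the model's native clip length."""
    train_frames = base_noise.shape[0]
    chunks = [base_noise]                       # first window: keep the original order
    while sum(c.shape[0] for c in chunks) < total_frames:
        idx = torch.arange(train_frames)
        for i in range(0, train_frames, window):
            seg = idx[i:i + window]
            # shuffle indices only within each small window
            idx[i:i + window] = seg[torch.randperm(seg.numel(), generator=generator)]
        chunks.append(base_noise[idx])          # same noise frames, locally shuffled
    return torch.cat(chunks)[:total_frames]
#+end_src
At sampling time FreeNoise additionally fuses attention over sliding windows; this sketch only covers the noise initialization.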
*** MODELS
- [[https://twitter.com/dreamingtulpa/status/1727076204115595359][Stable Video]] Diffusion
  - LoRAs for camera control, multiview generation
- [[https://twitter.com/_akhaliq/status/1744914502703874215][MagicVideo-V2]]: Multi-Stage High-Aesthetic Video Generation ==best==
  - more coherent movements
*** LATENT OF BOTH IMAGES AND VIDEO
- [[https://arxiv.org/pdf/2210.02399.pdf][Phenaki]]
  - C-ViT is the video encoder, [[https://arxiv.org/pdf/2103.15691.pdf][Vivit]] [[https://github.com/google-research/scenic/tree/main/scenic/projects/vivit][repo]]
  - single images are treated like videos
- [[https://twitter.com/_akhaliq/status/1734266117516845119][Photorealistic]] Video Generation with Diffusion Models
  - compresses images and videos within a unified latent space
- [[I2VGEN-XL]]
*** WITH ARCHITECTURE STRUCTURE
- [[https://arxiv.org/abs/2309.15818][Show-1]]: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
  - first pixel-based T2V generation, then latent-based upscaling
**** CASCADED
- [[https://twitter.com/_akhaliq/status/1706845034786492694][LAVIE]]: [[https://twitter.com/_akhaliq/status/1730264972658188320][High-Quality]] [[https://twitter.com/camenduru/status/1730356331238801494][Video]] [[https://twitter.com/cocktailpeanut/status/1730620352022167890][Generation]] with Cascaded Latent Diffusion Models <<LAVIE>>
  - cascaded video latent diffusion models, temporal interpolation model
  - incorporates simple temporal self-attentions with rotary positional encoding, capturing correlations inherent in video ==best one==
- [[https://twitter.com/_akhaliq/status/1722078229601661046][I2VGen-XL]]: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models ==best==
  - utilizes static images as a form of crucial guidance
  - guarantees coherent semantics by using two hierarchical encoders
* EXTRA PRIORS
- [[id:d3c6d9ef-9dff-4c60-8f92-5a523c24c139][DRAG DIFFUSION]] [[id:8872aa4e-0394-4066-822b-9145f14caf6f][IDENTITY IN VIDEO]]
- [[https://twitter.com/_akhaliq/status/1692064144789475524][Dual-Stream]] Diffusion Net for Text-to-Video Generation
  - two diffusion streams, video content and motion branches = video variations; continuous with no flickers
** STYLECRAFTER
:PROPERTIES:
:ID: 4c93f57d-43b7-4fbe-9415-e007a06efd46
:END:
- [[https://twitter.com/_akhaliq/status/1731499506745651217][StyleCrafter]]: [[https://github.com/GongyeLiu/StyleCrafter][Enhancing]] Stylized Text-to-Video Generation with Style Adapter ==best==
  - high-quality stylized videos that align with the content of the texts
  - trains a style control adapter on an image dataset, then transfers it to video
** MOTION
- [[https://arxiv.org/pdf/2304.14404.pdf][MCDiff]]: Motion-Conditioned Diffusion Model for Controllable Video Synthesis
- [[https://twitter.com/_akhaliq/status/1670219559511420929][VideoComposer]]: Compositional Video Synthesis with Motion Controllability (temporal consistency)
  - motion vectors as the control signal
- [[https://twitter.com/_akhaliq/status/1712833464846709046][MotionDirector]]: Motion Customization of Text-to-Video Diffusion Models
  - dual-path LoRA architecture to decouple the learning of appearance and motion
- [[ANIMATEDIFF]]
- [[https://twitter.com/_akhaliq/status/1714485412834517440][LAMP]]: Learn A Motion Pattern for Few-Shot-Based Video Generation (8~16 videos = 1 motion)
  - expands pretrained 2D T2I convolution layers into temporal-spatial motion learning layers
  - shared-noise sampling = improves the stability of videos
- [[https://github.com/showlab/MotionDirector][MotionDirector]]: Motion Customization of Text-to-Video Diffusion Models ==best==
- [[https://twitter.com/_akhaliq/status/1732965060090183803][DreamVideo]]: Composing Your Dream Videos with Customized Subject and Motion
  - desired subject and a few videos of the target motion (subject and motion learning on top of a video model)
*** SVD
- [[https://github.com/alibaba/animate-anything][AnimateAnything]]: Fine Grained Open Domain Image Animation with Motion Guidance (anything)
  - finetunes Stable Video Diffusion
*** CONTROLLER
- [[https://twitter.com/_akhaliq/status/1732595480679330242][MagicStick]]: Controllable Video Editing via Control Handle Transformations
  - keyframe transformations can easily propagate to other frames to provide generation guidance
  - inflates the image model and ControlNet to the temporal dimension, trains a LoRA to fit the specific scenes
- [[https://twitter.com/_akhaliq/status/1734049216073269509][Customizing]] Motion in Text-to-Video Diffusion Models
  - maps the depicted motion to a new unique token, and can invoke the motion in combination with other motions
- [[https://jinga-lala.github.io/projects/Peekaboo/][Peekaboo]]: Interactive Video Generation via Masked-Diffusion
  - based on masking attention; controls size and position
- [[id:0518240a-8adf-4bec-8460-034b49b7195e][DIFFDIRECTOR]] [[id:ccc8f98c-34eb-448b-b2d8-6ef662627fa4][ANIMATELCM]]
- [[https://lemmy.dbzer0.com/post/13525471][Motion Guidance]]: Diffusion-Based Image Editing with Differentiable Motion Estimators
  - a guidance loss that encourages the sample to have the desired motion (guidance step sketched after this list)
- [[https://hohonu-vicml.github.io/Trailblazer.Page/][TrailBlazer]]: Trajectory Control for Diffusion-Based Video Generation
  - pre-trained model without further model training (bounding boxes to guide)
- [[https://browse.arxiv.org/abs/2402.01566][Boximator]]: Generating Rich and Controllable Motions for Video Synthesis
  - hard box and soft box
  - plug-in for existing video diffusion models, training only a module
- [[https://twitter.com/_akhaliq/status/1768112006702211223][Follow-Your-Click]]: [[https://twitter.com/Gradio/status/1768223127840952592][Open-domain]] Regional Image Animation via Short Prompts
  - locally aware, without moving the entire scene
- [[https://twitter.com/_akhaliq/status/1775345383943659966][CameraCtrl]]: Enabling Camera Control for Text-to-Video Generation
  - camera pose control, parameterizing the camera trajectory
  - [[https://github.com/hehao13/CameraCtrl][AnimateDiff]] [[https://twitter.com/angrypenguinPNG/status/1775364454328447039][more]] [[https://twitter.com/angrypenguinPNG/status/1776679692139188639][more]]
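A hedged sketch of guidance with a differentiable motion estimator, the idea behind the Motion Guidance entry above: at each denoising step, push the predicted clean image so that its optical flow with respect to a reference matches a user-given target flow. ~flow_net~, the guidance scale, and the correction formula follow generic classifier-guidance practice and are my assumptions.
#+begin_src python
import torch

def motion_guided_noise(eps_pred, x_t, alpha_bar_t, reference, target_flow,
                        flow_net, guidance_scale=1.0):
    """eps_pred: the model's noise prediction at step t (same shape as x_t).
    alpha_bar_t: cumulative alpha for step t (scalar).
    flow_net(a, b) -> (B, 2, H, W) differentiable optical flow."""
    a = torch.as_tensor(alpha_bar_t, dtype=x_t.dtype, device=x_t.device)
    eps = eps_pred.detach()
    x_t = x_t.detach().requires_grad_(True)
    # predicted clean image x0 from (x_t, eps)
    x0_pred = (x_t - (1 - a).sqrt() * eps) / a.sqrt()
    # guidance loss: flow of the prediction should match the desired motion
    loss = (flow_net(reference, x0_pred) - target_flow).pow(2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    # classifier-guidance-style correction of the noise prediction
    return eps + guidance_scale * (1 - a).sqrt() * grad
#+end_src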
**** MOTION FROM VIDEO
- [[https://geonyeong-park.github.io/spectral-motion-alignment/][Spectral Motion]] Alignment for Video Motion Transfer using Diffusion Models
  - aligns motion vectors using Fourier and wavelet transforms
  - maintains computational efficiency and compatibility with other customizations
- [[https://wileewang.github.io/MotionInversion/][Motion Inversion]] for Video Customization
  - motion embeddings: temporally coherent, derived from a given video
  - less than 10 minutes of training time
**** DRAGANYTHING
:PROPERTIES:
:ID: 044077e6-d2ea-415f-9ec0-5ae727626dc1
:END:
- [[https://twitter.com/_akhaliq/status/1767737635064127749][DragAnything]]: Motion Control for Anything using Entity Representation
  - trajectory-based control is more user-friendly; controls motion for diverse entities
**** DEFINE CAMERA MOVEMENT
- [[https://lemmy.dbzer0.com/post/lemmy.dbzer0.com/9849411][LivePhoto]]: [[https://arxiv.org/abs/2312.02928][Real]] [[https://github.com/XavierCHEN34/LivePhoto][Image]] Animation with Text-guided Motion Control
  - motion-related textual instructions: actions, camera movements, new contents
  - motion intensity estimation module (control signal)
- [[https://twitter.com/_akhaliq/status/1732597717044486599][MotionCtrl]]: [[https://twitter.com/xinntao/status/1739927557888520373][A Unified]] [[https://twitter.com/_akhaliq/status/1732597717044486599][and Flexible]] [[https://x.com/_akhaliq/status/1747275860607177121?s=20][Motion]] Controller for Video Generation
  - independently controls camera and object motion, determined by camera poses and trajectories
  - using drawn lines
  - MotionCtrl for [[https://github.com/TencentARC/MotionCtrl][SVD]], [[https://github.com/chaojie/ComfyUI-MotionCtrl-SVD][comfy]]
- [[https://direct-a-video.github.io/][Direct-a-Video]]: Customized Video Generation with User-Directed Camera Movement and Object Motion
  - define camera movement and then object motion using bounding boxes
*** REFERENCENET
:PROPERTIES:
:ID: 33903015-49dd-4a1a-81b5-78350c074fff
:END:
- [[https://humanaigc.github.io/animate-anyone/][Animate Anyone]]: [[https://twitter.com/dreamingtulpa/status/1730876691755450572][Consistent]] [[https://github.com/HumanAIGC/AnimateAnyone][and]] Controllable Image-to-Video Synthesis for Character Animation
  - ReferenceNet (ControlNet-like), to merge detail features via spatial attention (temporal modeling for inter-frame transitions between video frames)
  - [[https://github.com/MooreThreads/Moore-AnimateAnyone][Moore-AnimateAnyone]] (over SD 1.5)
** LONG VIDEO
- [[https://arxiv.org/abs/2303.12346][NUWA-XL]]: Diffusion over Diffusion for eXtremely Long Video Generation
  - coarse-to-fine process, iteratively completes the middle frames
  - sparseformer
- [[https://sites.google.com/view/mebt-cvpr2023][Towards End-to-End]] [[https://arxiv.org/abs/2303.11251][Generative]] Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
  - autoregressive with patches
- [[https://twitter.com/_akhaliq/status/1727571442374455513][FusionFrames]]: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
  - keyframe synthesis to set the storyline of a video, then interpolation
- [[SEINE]]
* GENERATED VIDEO ENHANCEMENT
- optical flow background [[https://www.ccoderun.ca/programming/doxygen/opencv/tutorial_background_subtraction.html][removal]]
** TRICKS
- script cinema [[https://xanthius.itch.io/multi-frame-rendering-for-stablediffusion][inspired]] ([[https://www.reddit.com/r/StableDiffusion/comments/11mlleh/custom_animation_script_for_automatic1111_in_beta/][reddit]])
- [[https://www.reddit.com/r/StableDiffusion/comments/11yejrj/another_temporal_consistency_experiment_the_real/][grid]] of [[https://www.reddit.com/r/StableDiffusion/comments/11zeb17/tips_for_temporal_stability_while_changing_the/][frames]] (sketched below)
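Sketch of the community "grid of frames" trick linked above: tile N frames into one big image, run a single img2img pass over the grid so the model sees all frames in one context, then split the result back into frames. The img2img call itself is left abstract (hypothetical).
#+begin_src python
import numpy as np

def frames_to_grid(frames, cols=4):
    """frames: list of equally sized (H, W, 3) arrays -> one (rows*H, cols*W, 3) grid."""
    h, w, c = frames[0].shape
    rows = -(-len(frames) // cols)                       # ceil division
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid

def grid_to_frames(grid, num_frames, frame_hw, cols=4):
    h, w = frame_hw
    return [grid[(i // cols) * h:(i // cols + 1) * h,
                 (i % cols) * w:(i % cols + 1) * w]
            for i in range(num_frames)]

# usage sketch:
#   grid = frames_to_grid(frames)
#   edited = img2img(grid, prompt)                # hypothetical img2img call
#   new_frames = grid_to_frames(edited, len(frames), frames[0].shape[:2])
#+end_src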
** USING MODEL
- [[https://twitter.com/_akhaliq/status/1696581953225683446][MS-Vid2Vid]]
  - enhances the resolution and spatiotemporal continuity of text-generated and image-generated videos
* OTHERS EDITING VIDEO
- [[https://arxiv.org/abs/2303.15893][VIVE3D]]: [[https://afruehstueck.github.io/vive3D/][Viewpoint-Independent]] Video Editing using 3D-Aware GANs
- [[https://showlab.github.io/Moonshot/][MoonShot]]: Towards Controllable Video Generation and Editing with Multimodal Conditions
  - zero-shot subject customization, ControlNet only, video transformation
- [[https://twitter.com/_akhaliq/status/1749275207552942328][ActAnywhere]]: Subject-Aware Video Background Generation
  - input: segmented subject and a contextual image
- [[id:55829fe3-d777-4723-8b48-5c9454822b5e][STABLEIDENTITY]]: inserting identity
** VIDEO INPAINT
- [[https://anythinginanyscene.github.io/][Anything in]] Any Scene: Photorealistic Video Object Insertion (realism, lighting realism, and photorealism)
- [[https://invictus717.github.io/InteractiveVideo/][InteractiveVideo]]: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
  - uses human painting and drag-and-drop as priors for inpainting generation; dynamic interaction
- [[https://place-anything.github.io/][Place Anything]] into Any Video
  - using just a photograph of the object; looks like enhanced VR
- [[https://arxiv.org/abs/2403.14617][Videoshop]]: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
  - add or remove objects, semantically change objects, insert stock photos into videos
*** OUTPAINTER
- [[https://be-your-outpainter.github.io/][Be-Your-Outpainter]]: Mastering Video Outpainting through Input-Specific Adaptation ==best==
  - input-specific adaptation and pattern-aware outpainting
** VIDEO EXCHANGE
:PROPERTIES:
:ID: 992f12e2-c595-4aca-8129-6dace7d2f3ba
:END:
- [[https://twitter.com/_akhaliq/status/1731906167285121471][VideoSwap]]: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
  - exploits semantic point correspondences
  - only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape
** CONTROLNET VIDEO
:PROPERTIES:
:ID: fd3d677f-1b5e-46a3-8ee9-6524baa07339
:END:
- [[https://github.com/CiaraStrawberry/svd-temporal-controlnet][Stable]] Video Diffusion Temporal ControlNet
** FRAME INTERPOLATION
:PROPERTIES:
:ID: 497486c7-7284-4087-86a2-223084f9901a
:END:
- MA-VFI: [[https://arxiv.org/abs/2402.02892][Motion-Aware]] Video Frame Interpolation
- [[https://arxiv.org/abs/2403.06243][BlazeBVD]]: Make Scale-Time Equalization Great Again for Blind Video Deflickering
  - illumination histograms that precisely capture flickering and local exposure variation
  - restores faithful and consistent texture affected by lighting changes; 10 times faster (a generic deflicker pass is sketched below)
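A generic deflickering pass as a simple stand-in (this is not BlazeBVD's scale-time equalization): normalize each frame's illumination statistics toward a temporally smoothed reference so exposure no longer jumps from frame to frame.
#+begin_src python
import numpy as np

def deflicker(frames, smooth=0.9):
    """frames: list of (H, W, 3) float arrays in [0, 1]."""
    out = []
    ref_mean, ref_std = None, None
    for frame in frames:
        mean, std = frame.mean(), frame.std() + 1e-8
        if ref_mean is None:
            ref_mean, ref_std = mean, std
        else:
            # exponential moving average of the illumination statistics
            ref_mean = smooth * ref_mean + (1 - smooth) * mean
            ref_std = smooth * ref_std + (1 - smooth) * std
        # remap this frame's statistics onto the smoothed reference
        out.append(np.clip((frame - mean) / std * ref_std + ref_mean, 0.0, 1.0))
    return out
#+end_src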