CogVideoX1.5-5B LoRA finetuning and 1.5-5B-I2V inference (inference sketch below)
parent: stable_diffusion
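A minimal inference sketch using diffusers' CogVideoXImageToVideoPipeline; the hub id, LoRA path, frame count, and sampler settings are assumptions, adjust to the checkpoint actually trained.

```python
# Hedged sketch: CogVideoX 1.5-5B image-to-video inference with a LoRA, via diffusers.
# The hub id, LoRA path, and generation settings are assumptions -- adjust to your checkpoint.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16  # assumed hub id
)
pipe.load_lora_weights("path/to/finetuned_lora")  # hypothetical path to the finetuned LoRA
pipe.enable_model_cpu_offload()                   # trades speed for fitting on one GPU

image = load_image("first_frame.png")
video = pipe(
    prompt="a short description of the desired motion",
    image=image,
    num_frames=49,           # assumed; 1.5 checkpoints may expect a different length
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```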
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
memory no longer blows up with more frames (SSM temporal layers scale linearly instead of quadratically)
Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation (dataset alleviation)
prevents loss of image details and noise prediction biases during the denoising process
adds noise, then denoises the noisy latent with proper rectification to alleviate the noise prediction biases
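Rough idea as a hedged sketch, not the paper's exact recipe: since the noise injected into the image latent is known, the model's noise prediction can be blended toward it during the early denoising steps; function names, the schedule cutoff, and the blending weight are illustrative assumptions.

```python
# Hedged sketch of the noise-rectification idea: start from a noised image latent whose
# injected noise is known, then pull the model's noise prediction toward that known noise
# in the early denoising steps to counter prediction bias. Names, schedule, and weights
# are illustrative assumptions, not the paper's exact recipe.
import torch

def rectified_eps(eps_pred, eps_injected, step, num_steps, rectify_until=0.6, weight=0.5):
    """Blend predicted noise with the known injected noise early in sampling."""
    if step < rectify_until * num_steps:
        return weight * eps_injected + (1.0 - weight) * eps_pred
    return eps_pred

# usage inside a standard sampling loop (scheduler/unet/text_emb are placeholders):
# eps_injected = torch.randn_like(image_latent)
# latent = scheduler.add_noise(image_latent, eps_injected, timesteps[0])
# for i, t in enumerate(timesteps):
#     eps_pred = unet(latent, t, encoder_hidden_states=text_emb).sample
#     eps = rectified_eps(eps_pred, eps_injected, i, len(timesteps))
#     latent = scheduler.step(eps, t, latent).prev_sample
```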
Attention Prompt Tuning Parameter-Efficient Adaptation of Pre-Trained Models for Action Recognition
efficient prompt tuning for video applications such as action recognition
Keyframer Empowering Animation Design using Large Language Models
animating static images (SVGs) with natural language
Animated Stickers: Bringing Stickers to Life with Video Diffusion (animated emojis)
DiffDreamer Consistent Single-view Perpetual View Generation with Conditional Diffusion Models
landscape (mountains) flyovers
DisCo Disentangled Control for Referring Human Dance Generation in Real World
human dance (movement) images and videos (using skeleton rigs)
PaintsUndo instead of just diffusing, the AI paints like humans paint
make speedpaints and (extract) sketches, a fake drawing process
Hotshot-XL text-to-GIF model for Stable Diffusion XL
Generative Image Dynamics, interactive GIFs (looping dynamic videos)
frequency-coordinated diffusion sampling process
neural stochastic motion texture
Pix2Gif Motion-Guided Diffusion for GIF Generation
transformed feature map (motion) remains within the same space as the target, thus consistency-coherence
dynamicrafter generative frame interpolation and looping video generation (320x512)
Explorative Inbetweening of Time and Space
bounded generation of a pre-trained image-to-video model without any tuning and optimization
two images that capture a subject motion, translation between different viewpoints, or looping
Learning Inclusion Matching for Animation Paint Bucket Colorization
for hand-drawn cel animation
comprehend the inclusion relationships between segments
paint based on previous frame
==DragGAN==: Drag Your GAN Interactive Point-based Manipulation on the Generative Image Manifold
dragging as input primitive, using pairs of points, excellent results, StyleGAN derivative
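A hedged sketch of the DragGAN-style motion supervision term (point tracking and the region mask are omitted): features in a patch around each handle point are pulled one unit step toward the target point, with the un-shifted features detached so gradients move the latent rather than the feature map. Patch handling and sampling are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def motion_supervision_loss(feat, handles, targets, radius=3):
    """feat: (1, C, H, W) generator feature map; handles/targets: (N, 2) float pixel coords (x, y)."""
    _, _, H, W = feat.shape
    loss = feat.new_zeros(())
    for p, t in zip(handles, targets):
        d = (t - p) / (torch.norm(t - p) + 1e-6)           # unit step toward the target
        ys, xs = torch.meshgrid(
            torch.arange(-radius, radius + 1), torch.arange(-radius, radius + 1), indexing="ij"
        )
        patch = torch.stack([xs, ys], dim=-1).reshape(-1, 2).to(feat) + p  # (P, 2) coords
        shifted = patch + d

        def sample(coords):
            grid = coords.clone()
            grid[:, 0] = coords[:, 0] / (W - 1) * 2 - 1    # normalize x to [-1, 1]
            grid[:, 1] = coords[:, 1] / (H - 1) * 2 - 1    # normalize y to [-1, 1]
            return F.grid_sample(feat, grid.view(1, 1, -1, 2), align_corners=True)

        # shifted features should match the (detached) current features: drag, don't blur
        loss = loss + F.l1_loss(sample(shifted), sample(patch).detach())
    return loss
```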
DragonDiffusion Enabling Drag-style Manipulation on Diffusion Models
moving, resizing, appearance replacement, dragging
StableDrag Stable Dragging for Point-based Image Editing
models: StableDrag-GAN and StableDrag-Diff
confidence-based latent enhancement strategy for motion supervision
DragDiffusion Harnessing Diffusion Models for Interactive Point-based Image Editing
RotationDrag Point-based Image Editing with Rotated Diffusion Features
utilizing the feature map to rotate/move images
DragNUWA Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
control trajectories in different granularities
Drag Your Noise Interactive Point-based Editing via Diffusion Semantic Propagation
superior control and semantic retention, reducing optimization time by 50% compared to DragDiffusion
Tora Trajectory-oriented Diffusion Transformer for Video Generation
Control4D Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor <<Control4D>>
4D GAN, 2D diffusion, consistent 4D, ==best one==
change face of video
AniPortraitGAN Animatable 3D Portrait Generation from 2D Image Collections
facial expression, head pose, and shoulder movements
trained on unstructured 2D images
MagiCapture High-Resolution Multi-Concept Portrait Customization
generate high-resolution portrait images given a handful of random selfies
DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
input: unposed portrait image, retains identity and facial expression
MorphableDiffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
novel view synthesis; input: single image and morphable mesh for desired facial expression (emotion)
VideoLDM HD, but still semantically deformed (NVIDIA)
TokenFlow Consistent Diffusion Features for Consistent Video Editing
consistency in edited video can be obtained by enforcing consistency in the diffusion feature space
CoDeF Content Deformation Fields for Temporally Consistent Video Processing
video to video, frame consistency
aggregating the entire video and then using a deformation field on one canonical image ==best one==
S2DM Sector-Shaped Diffusion Models for Video Generation ==best==
explore the use of optical flow as temporal conditions
prompt correctness while keeping semantic consistency; can integrate with other temporal conditions
decouple the generation of temporal features from semantic-content features
Latent-Shift Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
temporal shift module that can leverage the spatial UNet as-is
Rerender A Video Zero-Shot Text-Guided Video-to-Video Translation
compatible with existing diffusion models ==best one==
hierarchical cross-frame constraints applied to enforce coherence
Tune-A-Video One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
inflated SD model into video
FROZEN SD
FateZero Fusing Attentions (MIT) for Zero-shot Text-based Video Editing
most fluid one, without training
RAVE Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models ==best==
employs a novel noise shuffling strategy to leverage temporal interactions (coherence)
guidance with ControlNet
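A hedged sketch of the grid-and-shuffle trick as described here: frame latents are tiled into one large grid latent with a random frame-to-cell permutation at each denoising step, processed by the image model (optionally with ControlNet), then untiled. Grid size and tensor shapes are illustrative assumptions.

```python
import torch
from einops import rearrange

def to_shuffled_grid(latents, grid=3):
    """latents: (F, C, h, w) with F == grid*grid -> ((C, grid*h, grid*w), permutation)."""
    perm = torch.randperm(latents.shape[0])
    tiled = rearrange(latents[perm], "(gh gw) c h w -> c (gh h) (gw w)", gh=grid, gw=grid)
    return tiled, perm

def from_grid(tiled, perm, grid=3):
    """Invert the tiling and the shuffle so latents line up with their original frames."""
    latents = rearrange(tiled, "c (gh h) (gw w) -> (gh gw) c h w", gh=grid, gw=grid)
    out = torch.empty_like(latents)
    out[perm] = latents           # row i came from original frame perm[i]
    return out
```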
FlowVid Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
doesn't strictly adhere to optical flow
first frame = supplementary reference in the diffusion model
works seamlessly with existing I2I models
I2VGen-XL (MS-Image2Video) non-commercial, good consistency and continuity, animate image
built on SD; designed UNet to perform spatiotemporal modeling in the latent space
pre-trained on video and images
I2VGen-XL High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
utilizing static images as a form of training guidance
==best one==
AnimateDiff Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
insert motion module into frozen (normal SD) text-to-image model (loading sketch below)
examples: (NSFW) video1, video2, video3, video4 >>96101928; not NSFW: video1 >>96052859, sword and sun >>96155685
current state of things: https://banodoco.ai/Animatediff for more insight
techniques:
AnimateDiff-cli-prompt-travel + upscale: https://twitter.com/toyxyz3/status/1695134607317012749
controlling AnimateDiff starting and ending frames (from Twitter user @TDS_95514874)
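Loading sketch with diffusers' AnimateDiffPipeline and MotionAdapter; the adapter/checkpoint ids and sampler settings below are the commonly used ones but should be treated as assumptions.

```python
# Hedged sketch: loading an AnimateDiff motion adapter on top of a frozen SD 1.5 checkpoint.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="a beautiful landscape, clouds drifting, highly detailed",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
).frames[0]
export_to_gif(frames, "animatediff.gif")
```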
AnimateZero Video Diffusion Models are Zero-Shot Image Animators
T2I generation is more controllable and efficient compared to T2V
we can transform pre-trained T2V models into I2V models
LongAnimateDiff now 64 frames
FreeNoise Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise-AnimateDiff)
removed the semantic flickering
AnimateDiff-Lightning fast text-to-video model; can generate videos more than ten times faster than AnimateDiff
AnimateDiff-MotionDirector (DiffDirector): train a MotionLoRA with MotionDirector and run it on any compatible AnimateDiff UI
PIA Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
motion controllability by text: temporal alignment layers (TA) out of token
AnimateLCM decouples the distillation of image generation priors and motion generation priors
Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition ==best==
content-motion latent diffusion model (CMD)
autoencoder that succinctly encodes a video as a combination of image and a low-dimensional motion latent representation
pretrained image diffusion model plus lightweight diffusion motion model
VideoCrafter Open Diffusion Models for High-Quality Video Generation and Editing (A Toolkit for Text-to-Video)
has LoRAs and ControlNet, 3D UNet; deeper lesson
BIGGER COHERENCE from normal SD image generation
InstructVideo Instructing Video Diffusion Models with Human Feedback
recast reward fine-tuning as editing: process corrupted video rated by image reward model
SEINE Short-to-Long Video Diffusion Model for Generative Transition and Prediction ==best== <<SEINE>>
takes images of different scenes as inputs, plus text-based control, and generates transition videos
DynamiCrafter Animating Open-domain Images with Video Diffusion Priors (prompt and image) ==best==
AtomoVideo High Fidelity Image-to-Video Generation ==best==
from input images, motion intensity and consistency; compatible with SD models without specific tuning
pre-trained SD, add 1D temporal convolution and temporal attention (inflation sketch below)
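A hedged sketch of this common inflation recipe (not AtomoVideo's exact module): the frozen spatial layers keep their (B·F, C, H, W) layout while a temporal block applies a 1D convolution plus self-attention along the frame axis only. Shapes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from einops import rearrange

class TemporalBlock(nn.Module):
    def __init__(self, channels, heads=8):  # channels assumed divisible by heads
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x, num_frames):
        # x: (B*F, C, H, W) -- the layout the frozen spatial UNet blocks already use
        b = x.shape[0] // num_frames
        height, width = x.shape[2], x.shape[3]
        h = rearrange(x, "(b f) c hh ww -> (b hh ww) c f", b=b, f=num_frames)
        h = h + self.conv(h)                               # 1D temporal conv, residual
        h = h.transpose(1, 2)                              # (N, F, C) for attention
        h = h + self.attn(self.norm(h), self.norm(h), self.norm(h), need_weights=False)[0]
        return rearrange(h, "(b hh ww) f c -> (b f) c hh ww", b=b, hh=height, ww=width)
```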
ToonCrafter Generative Cartoon Interpolation ==best==
PixelDance Make Pixels Dance: High-Dynamic Video Generation
synthesizing videos with complex scenes and intricate motions
incorporates image instructions (not just text instructions)
MagicAnimate Temporally Consistent Human Image Animation using Diffusion Model
video diffusion model to encode temporal information
Text2AC-Zero Consistent Synthesis of Animated Characters using 2D Diffusion
zero shot on existing t2i, no training or fine-tuning
pixel-wise guidance to steer the diffusion to minimize visual discrepancies
DreaMoving A Human Dance Video Generation Framework based on Diffusion Models
Video ControlNet for motion-controlling and a Content Guider for identity preserving
Motionshop An application for replacing the human motion in the video with a virtual 3D human
segment, retarget, and inpaint (with light awareness)
Diffutoon High-Resolution Editable Toon Shading via Diffusion Models
aiming to directly render (turn) photorealistic videos into anime styles; keeping consistency
AnaMoDiff 2D Analogical Motion Diffusion via Disentangled Denoising
best trade-off between motion analogy and identity preservation
MagicDance Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer ==best==
real people references
DreamTalk When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
inputs: songs, speech in multiple languages, noisy audio, and out-of-domain portraits
EmoSpeaker One-shot Fine-grained Emotion-Controlled Talking Face Generation
emotion input, different emotional intensities by adjusting the fine-grained emotion
HeadStudio Text to Animatable Head Avatars with 3D Gaussian Splatting
generating animatable avatars from textual prompts, visually appealing
PEGASUS Personalized Generative 3D Avatars with Composable Attributes
disentangled controls while preserving the identity, realistic
trained using synthetic data at first
MagicProp Diffusion-based Video Editing via Motion-aware Appearance Propagation
edit one frame, then propagate to all
Hierarchical Masked 3D Diffusion Model for Video Outpainting
FRESCO Spatial-Temporal Correspondence for Zero-Shot Video Translation
Zero shot and EBsynth come together for a new vid2vid
Text2Video-Zero Text-to-Image Diffusion Models are Zero-Shot Video Generators
DDIM latents enriched with motion dynamics, plus cross-frame attention to the first frame to protect identity
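Minimal usage sketch via diffusers' TextToVideoZeroPipeline, which implements the cross-frame attention + latent motion-dynamics idea zero-shot on a plain SD checkpoint; the checkpoint id and settings are assumptions.

```python
import imageio
import torch
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# returns a list of float frames in [0, 1]; convert to uint8 before writing the video
frames = pipe(prompt="a panda surfing a wave", video_length=8).images
imageio.mimsave("t2v_zero.mp4", [(f * 255).astype("uint8") for f in frames], fps=4)
```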
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models (vid2vid-zero)
Edit-A-Video Single Video Editing with Object-Aware Consistency
Video-P2P cross-attention control (more coherence than InstructPix2Pix) (Adobe)
VidEdit Zero-Shot and Spatially Aware Text-Driven Video Editing (temporal smoothness)
StableVideo Text-driven Consistency-aware Diffusion Video Editing (14 GB VRAM)
temporal dependency = consistent appearance for the edited objects ==best one==
FreeNoise Tuning-Free Longer Video Diffusion via Noise Rescheduling ==best==
reschedule a sequence of noises performing a window-based function = longer videos conditioned on multiple texts (sketch below)
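A hedged, simplified sketch of the noise-rescheduling idea: initial noise for a long clip is built by repeating the first window's noise frames with local shuffles, so far-apart frames share correlated noise without being exact copies; window size and the exact schedule are assumptions.

```python
import torch

def rescheduled_noise(num_frames, window=16, shape=(4, 64, 64), generator=None):
    base = torch.randn(window, *shape, generator=generator)   # noise for the first window
    chunks = [base]
    while sum(c.shape[0] for c in chunks) < num_frames:
        perm = torch.randperm(window, generator=generator)    # locally shuffle the window
        chunks.append(base[perm])
    return torch.cat(chunks, dim=0)[:num_frames]              # (num_frames, C, H, W)
```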
Stable Video Diffusion
LoRAs for camera control, multiview generation
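Image-to-video inference sketch with diffusers' StableVideoDiffusionPipeline; the checkpoint id and generation settings follow the commonly published defaults but are assumptions here.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

image = load_image("first_frame.png").resize((1024, 576))
frames = pipe(image, decode_chunk_size=8, motion_bucket_id=127).frames[0]
export_to_video(frames, "svd.mp4", fps=7)
```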
MagicVideo-V2 Multi-Stage High-Aesthetic Video Generation ==best==
more coherent movements
Photorealistic Video Generation with Diffusion Models
compress images and videos within a unified latent space
Show-1 Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
first pixel-based t2v generation then latent-based upscaling
LAVIE High-Quality Video Generation with Cascaded Latent Diffusion Models <<LAVIE>>
cascaded video latent diffusion models, temporal interpolation model
incorporation of simple temporal self-attentions with rotary positional encoding, captures correlations inherent in video ==best one==
I2VGen-XL High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models ==best==
utilizing static images as a form of crucial guidance
guarantee coherent semantics by using two hierarchical encoders
Dual-Stream Diffusion Net for Text-to-Video Generation
two diffusion streams, video content and motion branches = video variations; continuous with no flickers
StyleCrafter Enhancing Stylized Text-to-Video Generation with Style Adapter ==best==
high-quality stylized videos that align with the content of the texts
train a style control adapter from image dataset then transfer to video
MCDiff Motion-Conditioned Diffusion Model for Controllable Video Synthesis
VideoComposer Compositional Video Synthesis with Motion Controllability (temporal consistency)
motion vectors as control signal
MotionDirector Motion Customization of Text-to-Video Diffusion Models ==best==
dual-path LoRAs architecture to decouple the learning of appearance and motion
LAMP Learn A Motion Pattern for Few-Shot-Based Video Generation (8~16 videos = 1 Motion)
expand pretrained 2D T2I convolution layers to temporal-spatial motion learning layers
shared-noise sampling = improve the stability of videos
DreamVideo Composing Your Dream Videos with Customized Subject and Motion
desired subject and a few videos of target motion (subject, motion learning on top of video model)
AnimateAnything Fine-Grained Open Domain Image Animation with Motion Guidance
finetuning a Stable Diffusion video model
MagicStick Controllable Video Editing via Control Handle Transformations
keyframe transformations can easily propagate to other frames to provide generation guidance
inflate image model and ControlNet to temporal dimension, train lora to fit the specific scenes
Customizing Motion in Text-to-Video Diffusion Models
map depicted motion to a new unique token, and can invoke the motion in combination with other motions
Peekaboo Interactive Video Generation via Masked-Diffusion
based on masking attention, control size and position (sketch below)
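A hedged sketch of masked attention for layout control, a generic version rather than Peekaboo's exact formulation: attention scores between latent positions outside a user-drawn box and the subject's tokens get a large negative bias, so the subject is generated inside the box. Shapes and the bias value are illustrative.

```python
import torch

def masked_attention(q, k, v, box_mask):
    """q: (B, Nq, D) latent queries; k, v: (B, Nt, D) text tokens;
    box_mask: (B, Nq, Nt) bool, True where a latent position may attend to a token."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~box_mask, -1e4)       # suppress attention outside the box
    return torch.softmax(scores, dim=-1) @ v
```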
Motion Guidance Diffusion-Based Image Editing with Differentiable Motion Estimators
a guidance loss that encourages the sample to have the desired motion
TrailBlazer Trajectory Control for Diffusion-Based Video Generation
pre-trained model without further model training (bounding boxes to guide)
Boximator Generating Rich and Controllable Motions for Video Synthesis
hard box and soft box
plug-in for existing video diffusion models, training only a module
Follow-Your-Click Open-domain Regional Image Animation via Short Prompts
locally aware and not moving the entire scene
CameraCtrl Enabling Camera Control for Text-to-Video Generation
camera pose control, parameterizing the camera trajectory
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
aligns motion vectors using Fourier and wavelet transforms
maintaining computational efficiency and compatibility with other customizations
Motion Inversion for Video Customization
Motion Embeddings: temporally coherent embeddings derived from a given video
less than 10 minutes of training time
DragAnything Motion Control for Anything using Entity Representation
trajectory-based control is more user-friendly; control of motion for diverse entities
LivePhoto Real Image Animation with Text-guided Motion Control
motion-related textual instructions: actions, camera movements, new contents
motion intensity estimation module (control signal)
MotionCtrl A Unified and Flexible Motion Controller for Video Generation
Direct-a-Video Customized Video Generation with User-Directed Camera Movement and Object Motion
define camera movement and then object motion using bounding box
Animate Anyone Consistent and Controllable Image-to-Video Synthesis for Character Animation
ReferenceNet (ControlNet-like) to merge detail features via spatial attention; temporal modeling for inter-frame transitions between video frames (merging sketch below)
Moore-AnimateAnyone (over SD 1.5)
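A hedged sketch of ReferenceNet-style feature merging: the denoising UNet's spatial self-attention attends over the concatenation of its own tokens and same-resolution tokens from the reference branch, so identity detail flows in without extra conditioning channels. The module below is a stand-in, not the released implementation.

```python
import torch
import torch.nn as nn

class ReferenceSpatialAttention(nn.Module):
    def __init__(self, dim, heads=8):  # dim assumed divisible by heads
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, ref):
        # x: (B, N, C) tokens of the current frame; ref: (B, M, C) tokens from ReferenceNet
        kv = torch.cat([x, ref], dim=1)                 # keys/values see both streams
        out, _ = self.attn(self.norm(x), self.norm(kv), self.norm(kv), need_weights=False)
        return x + out
```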
NUWA-XL Diffusion over Diffusion for eXtremely Long Video Generation
coarse-to-fine process, iteratively complete the middle frames (sketch below)
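A hedged sketch of the coarse-to-fine idea: sparse keyframes first, then recursive mid-frame completion until the target frame rate is reached; `generate_middle` is a hypothetical stand-in for the local diffusion model.

```python
def fill_between(frames, depth, generate_middle):
    """frames: ordered list of keyframes; each recursion roughly doubles temporal resolution."""
    if depth == 0:
        return frames
    dense = []
    for a, b in zip(frames, frames[1:]):
        dense += [a, generate_middle(a, b)]   # fill the frame midway between each pair
    dense.append(frames[-1])
    return fill_between(dense, depth - 1, generate_middle)
```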
sparseformer
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
autoregressive with patches
FusionFrames Efficient Architectural Aspects for Text-to-Video Generation Pipeline
keyframe synthesis to figure out the storyline of a video, then interpolation
optical flow background removal
script cinema https://xanthius.itch.io/multi-frame-rendering-for-stablediffusion (inspired by reddit: https://www.reddit.com/r/StableDiffusion/comments/11mlleh/custom_animation_script_for_automatic1111_in_beta/)
enhance the resolution and spatiotemporal continuity of text-generated videos and image-generated videos
VIVE3D Viewpoint-Independent Video Editing using 3D-Aware GANs
MoonShot Towards Controllable Video Generation and Editing with Multimodal Conditions
zero-shot subject customized, controlnet only, video transformation
ActAnywhere Subject-Aware Video Background Generation
input: segmented subject and contextual image input
StableIdentity inserting identity
Anything in Any Scene Photorealistic Video Object Insertion (realism, lighting realism, and photorealism)
InteractiveVideo User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
use human painting and drag-and-drop as priors for inpainting generation, dynamic interaction
Place Anything into Any Video
using just a photograph of the object, looks like enhanced VR
Videoshop Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
add or remove objects, semantically change objects, insert stock photos into videos
Be-Your-Outpainter Mastering Video Outpainting through Input-Specific Adaptation ==best==
input-specific adaptation and pattern-aware outpainting
VideoSwap Customized Video Subject Swapping with Interactive Semantic Point Correspondence
exploits semantic point correspondences,
only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape
Stable Video Diffusion Temporal ControlNet
MA-VFI: Motion-Aware Video Frame Interpolation
BlazeBVD Make Scale-Time Equalization Great Again for Blind Video Deflickering
illumination histograms that precisely capture flickering and local exposure variation
to restore faithful and consistent texture affected by lighting changes; 10 times faster