parent: domain
models: https://rentry.org/AIVoiceStuff
tortoise dvae
MusicHiFi: Fast High-Fidelity Stereo Vocoding
from mel-spectrogram to higher quality mono and stereo
FoundationTTS Text-to-Speech for ASR Customization with Generative Language Model (automatic phonemes, coarse and fine composition)
artificial tongue-throat
Voicebox Text-Guided Multilingual Universal Speech Generation at Scale (20x faster than VALL-E)
Open-sourcing AudioCraft: Generative AI for audio made simple and available to all
MusicGen, AudioGen, and EnCodec
F5-TTS A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching ==best==
extremely fast and you can add emotions
MAGNeT Masked Audio Generation using a Single Non-Autoregressive Transformer ==best==
==comparison of them all==
trained: predict spans of masked tokens
single non-autoregressive model, for text-to-music and text-to-sound generation
quality on par with SOTA models, while being 7x faster
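The span-masking training above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's code; `span_len`, `mask_ratio`, and `mask_id` are assumed hyperparameters:

```python
import numpy as np

def mask_spans(tokens, span_len=3, mask_ratio=0.5, mask_id=-1, rng=None):
    """MAGNeT-style span masking: hide contiguous spans of audio tokens so a
    single non-autoregressive model can be trained to predict them in parallel."""
    if rng is None:
        rng = np.random.default_rng(0)
    tokens = np.asarray(tokens)
    n = len(tokens)
    n_spans = max(1, int(mask_ratio * n / span_len))
    masked = tokens.copy()
    is_masked = np.zeros(n, dtype=bool)
    for _ in range(n_spans):
        # pick a random span start; spans may overlap, as in span-based masking
        start = rng.integers(0, n - span_len + 1)
        masked[start:start + span_len] = mask_id
        is_masked[start:start + span_len] = True
    return masked, is_masked

seq = np.arange(12)
masked, is_masked = mask_spans(seq)
# the training loss is computed only on positions where is_masked is True
```

The point of masking whole spans rather than independent positions is that adjacent audio tokens are highly correlated, so single-token masking would be too easy a pretext task.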
AUDIO DIFFUSION (SOUND MUSIC VOICE)
parent: diffusion
NaturalSpeech 2 Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
multi-band diffusion, generates any type of audio
music diffusion https://www.arxiv-vanity.com/papers/2301.11757/
JEN-1 Text-Guided Universal Music Generation with Omnidirectional Diffusion Models
text-guided music generation, music inpainting, and continuation
Re-AudioLDM Retrieval-Augmented Text-to-Audio Generation (CLAP, audio clip), complex scenes
Stable Audio Tools audio training ==by Stability AI==
Controllable Music Production with Diffusion Models and Guidance Gradients
continuation, inpainting and regeneration; style transfer
StyleTTS2: ElevenLabs quality ==best==
E3TTS: Easy End-to-End Diffusion-based Text to Speech
Music ControlNet Multiple Time-varying Controls for Music Generation
melody, dynamics, and rhythm controls, 35x fewer parameters, 11x less data
Mustango Toward Controllable Text-to-Music Generation
conditioned on prompts and various musical features
Fast Timing-Conditioned Latent Audio Diffusion
conditioned on text prompts as well as timing embeddings, can generate structure and stereo sounds
Schrödinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
issue: diffusion starts from a noisy prior that carries little information about the generation target
solution: Bridge-TTS starts from strong structural information of the target
Schrödinger bridge between the latent from the text input and the ground-truth mel-spectrogram
better synthesis quality and sampling efficiency
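The data-to-data bridge idea can be illustrated with a Brownian bridge pinned at both endpoints, which is a simplified stand-in for Bridge-TTS's formulation; `sigma` is an assumed noise scale, and the endpoints here are just toy vectors:

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1):
        x_t ~ N((1 - t) * x0 + t * x1,  sigma^2 * t * (1 - t) * I)
    In Bridge-TTS terms x1 would play the role of the text-derived latent and
    x0 the ground-truth mel-spectrogram, so the process starts from a
    structured prior rather than pure Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    mean = (1 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1 - t))
    return mean + std * rng.standard_normal(np.shape(x0))

x0 = np.zeros(4)          # toy "mel" endpoint
x1 = np.ones(4)           # toy "text latent" endpoint
x_mid = brownian_bridge_sample(x0, x1, 0.5)
```

Note the variance vanishes at both t=0 and t=1, so the process is deterministic exactly at the two data endpoints; that is what makes the prior informative compared with a diffusion that ends in pure noise.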
AudioLM a Language Modeling Approach to Audio Generation <<gpt voice only>>
semantic tokens from w2v-BERT, acoustic tokens from SoundStream
also does TTS, and the approach was extended into VALL-E, <<AudioLM>>
SoundStorm Efficient Parallel Audio Generation
two orders of magnitude faster than AudioLM, 50 fps, 30 seconds of speech continuation within 2 seconds
Bark ==best so far== not just voices
Mega-TTS Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
decomposed attributes, uses spectrograms, large in-the-wild dataset, phase reconstructed, best zero-shot
UniAudio An Audio Foundation Model Toward Universal Audio Generation
transformer, LMs techniques, simple fine-tuning ==best==
OpenAI's Text-to-Speech (TTS)
EmotiVoice a Multi-Voice and Prompt-Controlled TTS Engine
Piper: a fast, local neural text-to-speech system ==best==
Improving Joint Speech-Text Representations Without Alignment
the sequence-length mismatch resolves naturally by simply assuming the best alignment
WavJourney Compositional Audio Creation with Large Language Models
script compiler: encompassing speech, music, effects, guided by instructions; creative control
Audio Style Transfer (using a DSP, i.e. a DAW plugin)
gradient estimation instead of having to replace the plugin with a proxy network
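One common way to estimate gradients through a non-differentiable black box like a plugin is simultaneous perturbation (SPSA); this is a generic sketch of that idea, not the paper's method, and `plugin`, `loss`, and `eps` are all illustrative stand-ins:

```python
import numpy as np

def spsa_grad(loss_fn, params, eps=1e-2, rng=None):
    """SPSA gradient estimate for a black-box loss: perturb all parameters
    at once with a random sign vector and take a central difference.
    Only two black-box evaluations per estimate, whatever the dimension."""
    if rng is None:
        rng = np.random.default_rng(0)
    delta = rng.choice([-1.0, 1.0], size=np.shape(params))
    f_plus = loss_fn(params + eps * delta)
    f_minus = loss_fn(params - eps * delta)
    return (f_plus - f_minus) / (2 * eps) * delta

# usage: treat a gain "plugin" as a black box and match a target level
target = 2.0
plugin = lambda p: p[0] * np.ones(8)            # stand-in for the real DSP call
loss = lambda p: np.mean((plugin(p) - target) ** 2)
g = spsa_grad(loss, np.array([0.5]))            # ~ d/dp (p - 2)^2 at p = 0.5
```

The appeal over a proxy network is that no surrogate has to be trained: the real plugin is evaluated directly inside the optimization loop.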
SpeechX Neural Codec Language Model as a Versatile Speech Transformer
phoneme intrinsics; task-selectable speech transformations (like voice transfer)
Text-to-Sing: generate a melody, then sing it with your own lyrics
ChatMusician Understanding and Generating Music Intrinsically with LLM
music-notation is treated as a second language
also an excellent compressor for music
MusicLang Llama 2-based music generation model
trained from scratch; runs on cpu
using chords
Disen: Disentangled Feature Learning for Real-Time Neural Speech Coding
voice conversion in real-time communications
==Codec== a separate codebook each for speaker and content
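The split-codebook idea above can be sketched with plain nearest-neighbour vector quantization; the codebook sizes, dimensions, and the one-vector-per-utterance speaker stream are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-neighbour vector quantization of rows of x against one codebook."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (m, k) distances
    idx = d.argmin(-1)
    return codebook[idx], idx

# Disen-style split: one codebook for speaker identity, one for content,
# so the two token streams can be recombined for real-time voice conversion.
rng = np.random.default_rng(0)
speaker_cb = rng.standard_normal((8, 4))       # small codebook: global identity
content_cb = rng.standard_normal((64, 4))      # larger codebook: per-frame content
speaker_feat = rng.standard_normal((1, 4))     # one vector per utterance
content_feat = rng.standard_normal((10, 4))    # one vector per frame
q_spk, spk_idx = quantize(speaker_feat, speaker_cb)
q_cnt, cnt_idx = quantize(content_feat, content_cb)
```

Voice conversion then amounts to keeping the content indices and swapping in another speaker's codes before decoding.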
VALL-E concept modeling (building up decoder)
VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning
clone with only 3 seconds of reference audio
High-Fidelity Audio Compression with Improved RVQGAN (8 kbps)
3x better than Facebook's EnCodec ==best codec==
LMCodec A Low Bitrate Speech Codec With Causal Transformer Models
MusicGen vs Google MusicLM: interleaving of discrete sound tokens conditioned on text input
Accelerating Transducers through Adjacent Token Merging
reduce 57% of tokens, improve speed by 70%
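The merging step can be illustrated by collapsing runs of near-identical adjacent embeddings into their average; this is a rough sketch of the general idea, and the similarity rule and `threshold` are assumptions, not the paper's exact criterion:

```python
import numpy as np

def merge_adjacent(tokens, threshold=0.9):
    """Merge runs of adjacent token embeddings whose cosine similarity to the
    running merged token exceeds `threshold`, averaging each run into one
    token, so the downstream joiner sees a shorter sequence."""
    merged = [tokens[0]]
    counts = [1]
    for t in tokens[1:]:
        prev = merged[-1]
        cos = prev @ t / (np.linalg.norm(prev) * np.linalg.norm(t) + 1e-8)
        if cos > threshold:
            # fold t into the running average of the current run
            merged[-1] = (prev * counts[-1] + t) / (counts[-1] + 1)
            counts[-1] += 1
        else:
            merged.append(t)
            counts.append(1)
    return np.stack(merged)

# three near-duplicate frames followed by a distinct one collapse to 2 tokens
toks = np.array([[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [0.0, 1.0]])
out = merge_adjacent(toks)
```

Acoustic frames are heavily redundant (neighbouring frames often encode the same phone), which is why aggressive adjacent merging can drop over half the tokens with little accuracy loss.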
Natural Language Supervision for General-Purpose Audio Representations
audio representations trained on 22 tasks, instead of only sound event classification
an autoregressive decoder-only language model, then joint training with contrastive learning
CLARA Multilingual Contrastive Learning for Audio Representation Acquisition
contrastive audio-text model, with understanding of implicit aspects of speech: emotions
HierSpeech++ Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
AudioSR Versatile Audio Super-resolution at Scale (upsample, enhance)
end2endvc End-to-End Voice Conversion with Information Perturbation (==better MOS than NVC==) (better MOS than FreeVC)
TriAAN-VC Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
best similarity and close-to-best naturalness, speaker encoding
gpt voice only (best similarity, semantic tokens)
FREEVC TOWARDS HIGH-QUALITY TEXT-FREE ONE-SHOT VOICE CONVERSION (==VITS==)
CONTROLVC ZERO-SHOT VOICE CONVERSION WITH TIME-VARYING CONTROLS ON PITCH AND SPEED
StyleTTS-VC One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models (phonemes)
HierVST Hierarchical Adaptive Zero-shot Voice Style Transfer
end-to-end zero-shot VST model (better than DiffVC)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild ==best==
to clone or edit an unseen voice, voicecraft needs only a few seconds of reference
LVC-VC Voice Conversion with Location-Variable Convolutions
simultaneously performing voice conversion while generating audio
smaller than NVC-Net
==has charts==
NVC-Net End-to-End Adversarial Voice Conversion (==SONY==)
voice conversion directly on the raw audio waveform
==best one== fastest: ~3600 kHz generation rate on GPU
Speech Representation Extractor: divides speech into parts (voice, pitch, context); zero-shot; Nvidia
PITS: VITS with pitch control (monotonic alignment)
SpeechSplit disentangling speech into content, timbre, rhythm and pitch.
AutoVC implementation
speaker embeddings: https://github.com/yistLin/dvector
FragmentVC timbre transfer (better than AutoVC), keeps frequency
RGSM better than FragmentVC
MFC-StyleVC: DELIVERING SPEAKING STYLE IN LOW-RESOURCE VOICE CONVERSION WITH MULTI-FACTOR CONSTRAINTS
repeat the utterance; different training objective for adaptation, normalizing
content, speaker, style on/off
Nonparallel Emotional Voice Conversion for Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing (==SONY==)
emotion transfer