:PROPERTIES:
:ID: 73ac7415-61d5-4266-964a-647a4243ac6c
:ROAM_ALIASES: speech audio sound
:END:
#+title: voice
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:e9be16f7-8032-4509-9aa9-7843836eacd9][domain]]
- [[id:f03ccf94-1aa5-4705-89af-617a22570e26][AUDIO VISION]]
- models: https://rentry.org/AIVoiceStuff
- tortoise [[https://huggingface.co/jbetker/tortoise-tts-v2/blob/3704aea61678e7e468a06d8eea121dba368a798e/.models/dvae.pth][dvae]]
- MusicHiFi: Fast High-Fidelity Stereo Vocoding
  - from mel-spectrogram to higher-quality mono and stereo

* GENERATION
- [[https://arxiv.org/abs/2303.02939][FoundationTTS]]: Text-to-Speech for ASR Customization with Generative Language Model
  - automatic phonemes, coarse and fine composition
- [[https://twitter.com/ConcreteSciFi/status/1642249097104220160][artificial]] tongue-throat
- [[https://twitter.com/_akhaliq/status/1669736556301631496][Voicebox]]: Text-Guided Multilingual Universal Speech Generation at Scale (20 times faster than VALL-E)
- [[https://twitter.com/_akhaliq/status/1686780937630011392][Open sourcing]] AudioCraft: Generative AI for audio made simple and available to all
  - MusicGen, AudioGen, and EnCodec
- [[https://github.com/SWivid/F5-TTS][F5-TTS]]: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching ==best==
  - extremely fast, and you can add emotions
** MAGNET
- [[https://twitter.com/lonziks/status/1746951479334768777][MAGNeT]]: [[https://twitter.com/AIatMeta/status/1757825426272235661][Masked]] Audio Generation using a Single Non-Autoregressive Transformer ==best==
  - ==comparison of them all==
  - trained to predict spans of masked tokens
  - a single non-autoregressive model for both text-to-music and text-to-sound generation
  - quality comparable to SOTA models while being 7x faster
  - ships in AudioCraft; minimal sketch below
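MAGNeT is distributed through Meta's AudioCraft library, next to MusicGen and AudioGen. A minimal text-to-music sketch using the MusicGen API from the AudioCraft README (shown because that API is documented; loading MAGNeT through ~audiocraft.models.MAGNeT~ should be analogous, but its exact checkpoint names are an assumption):

#+begin_src python
# Minimal AudioCraft text-to-music sketch (MusicGen API from the README).
# MAGNeT lives in the same library; audiocraft.models.MAGNeT.get_pretrained(...)
# should work analogously, but its checkpoint names are an assumption here.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio per sample

# One waveform per text description: tensor of shape (batch, channels, samples).
wavs = model.generate(["lo-fi hip hop beat with warm piano"])

for i, one_wav in enumerate(wavs):
    # Writes a loudness-normalized out_0.wav next to the script.
    audio_write(f"out_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
#+end_src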
** AUDIO DIFFUSION
:PROPERTIES:
:ID: 4131b9b5-0018-4639-8285-546810516cae
:END:
- AUDIO DIFFUSION (SOUND MUSIC VOICE)
- parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]]
- [[https://twitter.com/_akhaliq/status/1648510180009844738][NaturalSpeech 2]]: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- [[https://twitter.com/_akhaliq/status/1688780256419684352][From Discrete]] [[https://huggingface.co/papers/2308.02560][Tokens]] to High-Fidelity Audio Using Multi-Band Diffusion
  - multi-band diffusion; generates any type of audio
- music diffusion: https://www.arxiv-vanity.com/papers/2301.11757/
- [[https://twitter.com/_akhaliq/status/1689463206643650560][JEN-1]]: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models
  - text-guided music generation, music inpainting, and continuation
- [[https://twitter.com/_akhaliq/status/1703661711322914843][Re-AudioLDM]]: Retrieval-Augmented Text-to-Audio Generation (CLAP, audio clips); handles complex scenes
- [[https://twitter.com/dadabots/status/1712534241073058089][Stable Audio Tools]]: audio training ==by stable diffusion==
- [[https://twitter.com/_akhaliq/status/1719909644745814197][Controllable]] Music Production with Diffusion Models and Guidance Gradients
  - continuation, inpainting, and regeneration; style transfer
- [[https://styletts2.github.io/][StyleTTS2]]: [[https://boards.4channel.org/g/thread/97069589][ElevenLabs]] quality ==best==
- [[https://twitter.com/_akhaliq/status/1720452643137441946][E3]] TTS: Easy End-to-End Diffusion-based Text to Speech
- [[https://twitter.com/_akhaliq/status/1724277147177541920][Music ControlNet]]: [[https://musiccontrolnet.github.io/web/][Multiple]] Time-varying Controls for Music Generation
  - melody, dynamics, and rhythm controls; 35x fewer parameters, 11x less data
- [[https://huggingface.co/declare-lab/mustango][Mustango]]: [[https://twitter.com/soujanyaporia/status/1725377945215373687][Toward]] Controllable Text-to-Music Generation
  - conditioned on prompts and various musical features
- [[https://twitter.com/_akhaliq/status/1755461022494699663][Fast Timing-Conditioned]] Latent Audio Diffusion
  - conditioned on text prompts as well as timing embeddings; can generate structured and stereo sounds
*** OUTPERFORMED?
- [[https://twitter.com/_akhaliq/status/1732606971876966849][Schrodinger]] [[https://bridge-tts.github.io/][Bridges]] Beat Diffusion Models on Text-to-Speech Synthesis
  - issue: diffusion starts from a noisy prior that carries little information about the generation target
  - solution: Bridge-TTS starts from the strong structural information of the target
  - a Schrodinger bridge between the latent from the text input and the ground-truth mel-spectrogram
  - better synthesis quality and sampling efficiency
** TTS GPT
- [[https://arxiv.org/abs/2209.03143][AudioLM]]: [[https://github.com/lucidrains/audiolm-pytorch][a Language]] [[https://google-research.github.io/seanet/audiolm/examples/][Modeling]] Approach to Audio Generation <<AudioLM>>
  - actually BERT, and uses SoundStream
  - also does TTS, and was extended into VALL-E
- [[https://arxiv.org/abs/2305.09636][SoundStorm]]: Efficient Parallel Audio Generation
  - 2x faster than AudioLM, 50 fps; 30 seconds of speech continuation within 2 seconds
- [[https://github.com/suno-ai/bark][bark]] ==best so far== not just voices (sketch at the end of this list)
- [[https://twitter.com/_akhaliq/status/1666255898749042689][Mega-TTS]]: [[https://mega-tts.github.io/demo-page/][Zero-Shot]] Text-to-Speech at Scale with Intrinsic Inductive Bias
  - decomposed; uses spectrograms; big in-the-wild dataset; phase reconstructed; best zero-shot
- [[https://twitter.com/_akhaliq/status/1710112638422642732][UniAudio]]: An Audio Foundation Model Toward Universal Audio Generation
  - transformer, LM techniques, simple fine-tuning ==best==
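bark above is flagged ==best so far==; a minimal generation sketch following the suno-ai/bark README (checkpoints download on first run):

#+begin_src python
# Minimal bark text-to-audio sketch, following the suno-ai/bark README.
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # downloads and caches the bark checkpoints on first call

# bark renders more than voices: bracketed cues like [laughs] or [music] work too.
text = "Hello, my name is Suno. [laughs] And I like to sing."
audio_array = generate_audio(text)  # numpy array at SAMPLE_RATE (24 kHz)

write_wav("bark_out.wav", SAMPLE_RATE, audio_array)
#+end_src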
*** TORTOISE LIKE
- [[https://nonint.com/2022/04/25/tortoise-architectural-design-doc/][tortoise]] [[https://152334h.github.io/blog/tortoise-fine-tuning/][finetuning]]
- [[https://twitter.com/angrypenguinPNG/status/1721660569332408336][OpenAI's]] Text to Speech TTS
- [[https://twitter.com/_akhaliq/status/1724093477321855403][EmotiVoice]]: [[https://github.com/netease-youdao/EmotiVoice][a]] Multi-Voice and Prompt-Controlled TTS Engine
- [[https://github.com/rhasspy/piper][Piper]]: a fast, local neural text-to-speech system ==best==
** ALIGNMENT
- [[https://twitter.com/_akhaliq/status/1690947635354693632][Improving]] Joint Speech-Text Representations Without Alignment
  - the sequence-length mismatch resolves naturally by simply assuming the best alignment
* INSTRUMENT LIKE
- [[https://twitter.com/_akhaliq/status/1684375640643096577][WavJourney]]: [[https://twitter.com/LiuXub/status/1694999774800302444][Compositional]] Audio Creation with Large Language Models
  - a script compiler encompassing speech, music, and effects, guided by instructions; creative control
- [[https://www.youtube.com/watch?v=63cXyngKD_s][Audio]] Style Transfer (using a DSP, a DAW plugin)
  - gradient estimation instead of having to replace the plugin with a [[https://youtu.be/63cXyngKD_s?t=1235][proxy network]]
- [[https://twitter.com/_akhaliq/status/1691330070760321024][SpeechX]]: [[https://www.microsoft.com/en-us/research/project/speechx/][Neural Codec]] Language Model as a Versatile Speech Transformer
  - phoneme intrinsics; task-selectable voice transforms (like voice transfer)
- [[https://twitter.com/Gradio/status/1696489361741680691][Text-to-Sing]]: generate a melody, then sing it with your own lyrics
- [[https://twitter.com/_akhaliq/status/1762339575299551316][ChatMusician]]: Understanding and Generating Music Intrinsically with LLM
  - music notation is treated as a second language
  - also an excellent compressor for music
- [[https://twitter.com/reach_vb/status/1766115433885651137][MusicLang]]: Llama 2 based music generation model
  - trained from scratch; runs on CPU; works with chords
* AUDIO CODEC
- Disen: [[https://arxiv.org/abs/2211.11960][Disentangled Feature]] Learning for Real-Time Neural Speech Coding
  - voice conversion in real-time communications
  - ==Codec== with a codebook each for speaker and content
- [[https://github.com/enhuiz/vall-e][valle]] [[https://valle-demo.github.io/][concept]]: modeling (building up the decoder)
- VALL-E X: [[https://arxiv.org/pdf/2303.03926.pdf][Multilingual]] [[https://github.com/Plachtaa/VALL-E-X][Text]]-[[https://vallex-demo.github.io/][to-Speech]] Synthesis and Voice Cloning
  - clones a voice with only 3 seconds of reference audio
- [[AudioLM]]
- [[https://twitter.com/_akhaliq/status/1668430703128707078][High-Fidelity Audio]] [[https://github.com/descriptinc/descript-audio-codec][Compression]] [[https://twitter.com/arankomatsuzaki/status/1668435803373191168][with]] Improved RVQGAN (8 kbps)
  - 3x better than facebook's [[https://github.com/facebookresearch/encodec][encodec]] ==best codec== (EnCodec round-trip sketch below)
- [[https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec/][LMCodec]]: A Low Bitrate Speech Codec With Causal Transformer Models
- [[https://ai.googleblog.com/2021/08/soundstream-end-to-end-neural-audio.html][SoundStream]] [[https://google-research.github.io/seanet/soundstream/examples/][encoder]] [[https://github.com/wesbz/SoundStream][implementation]]
- [[https://youtu.be/lX0S0ZdWdDw][MusicGen]]: vs google MusicLM; interleaving of discrete sound tokens conditioned on text input
- [[https://twitter.com/_akhaliq/status/1674254559428968448][Accelerating]] Transducers through Adjacent Token Merging
  - reduces tokens by 57% and improves speed by 70%
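Since encodec is the common baseline here, a minimal encode/decode round trip following the facebookresearch/encodec README (~test.wav~ is a placeholder input file):

#+begin_src python
# Minimal EnCodec round trip, following the facebookresearch/encodec README.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; the 24 kHz model supports 1.5/3/6/12/24

wav, sr = torchaudio.load("test.wav")  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # (batch, channels, samples)

with torch.no_grad():
    encoded_frames = model.encode(wav)
    # Discrete RVQ codes, concatenated over frames: (batch, n_codebooks, time).
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
    decoded = model.decode(encoded_frames)

print(codes.shape)
torchaudio.save("roundtrip.wav", decoded.squeeze(0), model.sample_rate)
#+end_src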
** LANGUAGE ENCODING
- [[https://twitter.com/_akhaliq/status/1701767789910843836][Natural Language]] Supervision for General-Purpose Audio Representations
  - audio representations trained on 22 tasks, instead of only sound event classification
  - an autoregressive decoder-only language model, then joint training with contrastive learning
- [[https://twitter.com/laion_ai/status/1715027028582228175][CLARA]]: Multilingual Contrastive Learning for Audio Representation Acquisition
  - a contrastive audio-text model that understands implicit aspects of speech, e.g. emotion
- [[https://twitter.com/_akhaliq/status/1727210336497893438][HierSpeech++]]: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
** SUPER-RESOLUTION
- [[https://twitter.com/_akhaliq/status/1702565321842769984][AudioSR]]: Versatile Audio Super-resolution at Scale (upsample, enhance)
* VOICE CONVERSION
- [[https://qicongxie.github.io/end2endvc/][end2endvc]]: [[https://arxiv.org/pdf/2206.07569.pdf][End-to-End Voice]] Conversion with Information Perturbation
  - ==better MOS than NVC-Net==, and better MOS than FreeVC
- [[https://arxiv.org/pdf/2302.08296.pdf][QuickVC]] ([[https://github.com/quickvc/QuickVC-VoiceConversion][5000 kHz]] **fastest**) ==vits==
- [[https://arxiv.org/abs/2303.09057][TriAAN-VC]]: [[https://winddori2002.github.io/vc-demo.github.io/][Triple Adaptive]] Attention Normalization for Any-to-Any Voice Conversion
  - **best similarity** with close-to-best naturalness; uses a speaker encoding
- [[gpt voice only]] (best similarity, semantic tokens)
** FEW SHOT
- [[https://arxiv.org/pdf/2210.15418.pdf][FreeVC]]: [[https://github.com/olawod/freevc][Towards]] High-Quality Text-Free One-Shot Voice Conversion (==vits==)
- [[https://arxiv.org/pdf/2209.11866.pdf][ControlVC]]: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
- [[https://arxiv.org/abs/2212.14227][StyleTTS-VC]]: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models (phonemes)
- [[https://hiervst.github.io/][HierVST]]: Hierarchical Adaptive Zero-shot Voice Style Transfer
  - end-to-end zero-shot VST model (better than DiffVC)
** ZERO SHOT
- VoiceCraft: [[https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf][Zero-Shot]] [[https://github.com/jasonppy/VoiceCraft][Speech]] Editing and Text-to-Speech in the Wild ==best==
  - to clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference
* STYLE CONVERSION
- [[https://arxiv.org/pdf/2205.09784.pdf][LVC-VC]]: Voice Conversion with Location-Variable Convolutions
  - performs voice conversion while simultaneously [[https://lvc-vc.github.io/lvc-vc-demo/][generating audio]]
  - smaller than NVC-Net ==has charts==
- [[https://arxiv.org/abs/2106.00992][NVC-Net]]: End-to-End Adversarial Voice Conversion (==SONY==)
  - voice conversion directly on the raw audio waveform
  - ==best one==, fastest at 3600 kHz
  - https://github.com/sony/ai-research-code [[https://github.com/sony/ai-research-code/tree/master/nvcnet][nvc]] [[https://nvcnet.github.io/][voices]]
** ATOMIC TRANSFERS
- [[https://arxiv.org/abs/2302.08137][Speech]] [[https://paarthneekhara.github.io/ace/code.html][Representation]] Extractor: divides speech into parts (voice, pitch, context); zero-shot; [[https://github.com/NVIDIA/NeMo][Nvidia]]
- [[https://github.com/anonymous-pits/pits][Pits]]: VITS with pitch control (monotonic alignment)
- [[https://github.com/auspicious3000/SpeechSplit][SpeechSplit]]: disentangles speech into content, timbre, rhythm, and pitch
- [[https://github.com/cyhuang-tw/AutoVC][AutoVC]] implementation
  - speaker embeddings: https://github.com/yistLin/dvector (similarity sketch at the end of this note)
- [[https://yistlin.github.io/FragmentVC/][FragmentVC]]: timbre transfer (better than AutoVC), keeps frequency
- [[https://arxiv.org/pdf/2203.16037.pdf][RGSM]]: better than FragmentVC
- MFC-StyleVC: [[https://arxiv.org/pdf/2211.08857.pdf][Delivering Speaking]] [[https://kerwinchao.github.io/lowresourcevc.github.io/][Style in Low-Resource]] Conversion with Multi-Factor Constraints
  - repeats the utterance; a different training objective for **adaptation**; normalizing
  - content, speaker, and style constraints can be toggled on/off
- [[https://arxiv.org/abs/2302.10536][Nonparallel Emotional]] [[https://demosamplesites.github.io/EVCUP/][Voice Conversion]] For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing (==SONY==)
  - emotion transfer
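Most of the conversion papers above report speaker similarity via d-vector style embeddings (the yistLin/dvector repo linked under ATOMIC TRANSFERS). A minimal similarity check, using Resemblyzer as a stand-in GE2E d-vector encoder; ~target.wav~ and ~converted.wav~ are placeholder file names:

#+begin_src python
# Speaker-similarity sketch with a GE2E d-vector encoder (Resemblyzer,
# a stand-in for the linked yistLin/dvector implementation).
# target.wav / converted.wav are placeholder file names.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

target = encoder.embed_utterance(preprocess_wav("target.wav"))
converted = encoder.embed_utterance(preprocess_wav("converted.wav"))

# Resemblyzer embeddings are L2-normalized, so a dot product is cosine similarity.
print(f"speaker similarity: {np.dot(target, converted):.3f}")
#+end_src

Higher cosine similarity between converted and target utterances is the usual proxy for conversion quality; naturalness still needs a MOS-style listening test.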