:PROPERTIES:
:ID: 73ac7415-61d5-4266-964a-647a4243ac6c
:ROAM_ALIASES: speech audio sound
:END:
#+title: voice
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:e9be16f7-8032-4509-9aa9-7843836eacd9][domain]]
- [[id:f03ccf94-1aa5-4705-89af-617a22570e26][AUDIO VISION]]
- models: https://rentry.org/AIVoiceStuff
- tortoise [[https://huggingface.co/jbetker/tortoise-tts-v2/blob/3704aea61678e7e468a06d8eea121dba368a798e/.models/dvae.pth][dvae]]
- MusicHiFi: Fast High-Fidelity Stereo Vocoding
  - from mel-spectrogram to higher-quality mono and stereo

* GENERATION
- [[https://arxiv.org/abs/2303.02939][FoundationTTS]]: Text-to-Speech for ASR Customization with Generative Language Model
  - automatic phonemes, coarse and fine composition
- [[https://twitter.com/ConcreteSciFi/status/1642249097104220160][artificial]] tongue-throat
- [[https://twitter.com/_akhaliq/status/1669736556301631496][Voicebox]]: Text-Guided Multilingual Universal Speech Generation at Scale (20 times faster than VALL-E)
- [[https://twitter.com/_akhaliq/status/1686780937630011392][Open sourcing]] AudioCraft: Generative AI for audio made simple and available to all
  - MusicGen, AudioGen, and EnCodec
- [[https://github.com/SWivid/F5-TTS][F5-TTS]]: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching ==best==
  - extremely fast, and you can add emotions
** MAGNET
- [[https://twitter.com/lonziks/status/1746951479334768777][MAGNeT]]: [[https://twitter.com/AIatMeta/status/1757825426272235661][Masked]] Audio Generation using a Single Non-Autoregressive Transformer ==best==
  - ==comparison of them all==
  - trained to predict spans of masked tokens
  - a single non-autoregressive model for both text-to-music and text-to-sound generation
  - quality comparable to SOTA models while being 7x faster
  - ships in AudioCraft; minimal sketch below
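MAGNeT is distributed through Meta's AudioCraft library, next to MusicGen and AudioGen. A minimal text-to-music sketch using the MusicGen API from the AudioCraft README (shown because that API is documented; loading MAGNeT through ~audiocraft.models.MAGNeT~ should be analogous, but its exact checkpoint names are an assumption):

#+begin_src python
# Minimal AudioCraft text-to-music sketch (MusicGen API from the README).
# MAGNeT lives in the same library; audiocraft.models.MAGNeT.get_pretrained(...)
# should work analogously, but its checkpoint names are an assumption here.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio per sample

# One waveform per text description: tensor of shape (batch, channels, samples).
wavs = model.generate(["lo-fi hip hop beat with warm piano"])

for i, one_wav in enumerate(wavs):
    # Writes a loudness-normalized out_0.wav next to the script.
    audio_write(f"out_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
#+end_src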
** AUDIO DIFFUSION
:PROPERTIES:
:ID: 4131b9b5-0018-4639-8285-546810516cae
:END:
- AUDIO DIFFUSION (SOUND MUSIC VOICE)
- parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]]
- [[https://twitter.com/_akhaliq/status/1648510180009844738][NaturalSpeech 2]]: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- [[https://twitter.com/_akhaliq/status/1688780256419684352][From Discrete]] [[https://huggingface.co/papers/2308.02560][Tokens]] to High-Fidelity Audio Using Multi-Band Diffusion
  - multi-band diffusion; generates any type of audio
- music diffusion: https://www.arxiv-vanity.com/papers/2301.11757/
- [[https://twitter.com/_akhaliq/status/1689463206643650560][JEN-1]]: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models
  - text-guided music generation, music inpainting, and continuation
- [[https://twitter.com/_akhaliq/status/1703661711322914843][Re-AudioLDM]]: Retrieval-Augmented Text-to-Audio Generation (CLAP, audio clips); handles complex scenes
- [[https://twitter.com/dadabots/status/1712534241073058089][Stable Audio Tools]]: audio training ==by stable diffusion==
- [[https://twitter.com/_akhaliq/status/1719909644745814197][Controllable]] Music Production with Diffusion Models and Guidance Gradients
  - continuation, inpainting, and regeneration; style transfer
- [[https://styletts2.github.io/][StyleTTS2]]: [[https://boards.4channel.org/g/thread/97069589][ElevenLabs]] quality ==best==
- [[https://twitter.com/_akhaliq/status/1720452643137441946][E3]] TTS: Easy End-to-End Diffusion-based Text to Speech
- [[https://twitter.com/_akhaliq/status/1724277147177541920][Music ControlNet]]: [[https://musiccontrolnet.github.io/web/][Multiple]] Time-varying Controls for Music Generation
  - melody, dynamics, and rhythm controls; 35x fewer parameters, 11x less data
- [[https://huggingface.co/declare-lab/mustango][Mustango]]: [[https://twitter.com/soujanyaporia/status/1725377945215373687][Toward]] Controllable Text-to-Music Generation
  - conditioned on prompts and various musical features
- [[https://twitter.com/_akhaliq/status/1755461022494699663][Fast Timing-Conditioned]] Latent Audio Diffusion
  - conditioned on text prompts as well as timing embeddings; can generate structured and stereo sounds
*** OUTPERFORMED?
- [[https://twitter.com/_akhaliq/status/1732606971876966849][Schrodinger]] [[https://bridge-tts.github.io/][Bridges]] Beat Diffusion Models on Text-to-Speech Synthesis
  - issue: diffusion starts from a noisy prior that carries little information about the generation target
  - solution: Bridge-TTS starts from the strong structural information of the target
  - a Schrodinger bridge between the latent from the text input and the ground-truth mel-spectrogram
  - better synthesis quality and sampling efficiency
** TTS GPT
- [[https://arxiv.org/abs/2209.03143][AudioLM]]: [[https://github.com/lucidrains/audiolm-pytorch][a Language]] [[https://google-research.github.io/seanet/audiolm/examples/][Modeling]] Approach to Audio Generation <<AudioLM>>
  - actually BERT, and uses SoundStream
  - also does TTS, and was extended into VALL-E
- [[https://arxiv.org/abs/2305.09636][SoundStorm]]: Efficient Parallel Audio Generation
  - 2x faster than AudioLM, 50 fps; 30 seconds of speech continuation within 2 seconds
- [[https://github.com/suno-ai/bark][bark]] ==best so far== not just voices (sketch at the end of this list)
- [[https://twitter.com/_akhaliq/status/1666255898749042689][Mega-TTS]]: [[https://mega-tts.github.io/demo-page/][Zero-Shot]] Text-to-Speech at Scale with Intrinsic Inductive Bias
  - decomposed; uses spectrograms; big in-the-wild dataset; phase reconstructed; best zero-shot
- [[https://twitter.com/_akhaliq/status/1710112638422642732][UniAudio]]: An Audio Foundation Model Toward Universal Audio Generation
  - transformer, LM techniques, simple fine-tuning ==best==
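bark above is flagged ==best so far==; a minimal generation sketch following the suno-ai/bark README (checkpoints download on first run):

#+begin_src python
# Minimal bark text-to-audio sketch, following the suno-ai/bark README.
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # downloads and caches the bark checkpoints on first call

# bark renders more than voices: bracketed cues like [laughs] or [music] work too.
text = "Hello, my name is Suno. [laughs] And I like to sing."
audio_array = generate_audio(text)  # numpy array at SAMPLE_RATE (24 kHz)

write_wav("bark_out.wav", SAMPLE_RATE, audio_array)
#+end_src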
*** TORTOISE LIKE
- [[https://nonint.com/2022/04/25/tortoise-architectural-design-doc/][tortoise]] [[https://152334h.github.io/blog/tortoise-fine-tuning/][finetuning]]
- [[https://twitter.com/angrypenguinPNG/status/1721660569332408336][OpenAI's]] Text to Speech TTS
- [[https://twitter.com/_akhaliq/status/1724093477321855403][EmotiVoice]]: [[https://github.com/netease-youdao/EmotiVoice][a]] Multi-Voice and Prompt-Controlled TTS Engine
- [[https://github.com/rhasspy/piper][Piper]]: a fast, local neural text-to-speech system ==best==
** ALIGNMENT
- [[https://twitter.com/_akhaliq/status/1690947635354693632][Improving]] Joint Speech-Text Representations Without Alignment
  - the sequence-length mismatch resolves naturally by simply assuming the best alignment
* INSTRUMENT LIKE
- [[https://twitter.com/_akhaliq/status/1684375640643096577][WavJourney]]: [[https://twitter.com/LiuXub/status/1694999774800302444][Compositional]] Audio Creation with Large Language Models
  - a script compiler encompassing speech, music, and effects, guided by instructions; creative control
- [[https://www.youtube.com/watch?v=63cXyngKD_s][Audio]] Style Transfer (using a DSP, a DAW plugin)
  - gradient estimation instead of having to replace the plugin with a [[https://youtu.be/63cXyngKD_s?t=1235][proxy network]]
- [[https://twitter.com/_akhaliq/status/1691330070760321024][SpeechX]]: [[https://www.microsoft.com/en-us/research/project/speechx/][Neural Codec]] Language Model as a Versatile Speech Transformer
  - phoneme intrinsics; task-selectable voice transforms (like voice transfer)
- [[https://twitter.com/Gradio/status/1696489361741680691][Text-to-Sing]]: generate a melody, then sing it with your own lyrics
- [[https://twitter.com/_akhaliq/status/1762339575299551316][ChatMusician]]: Understanding and Generating Music Intrinsically with LLM
  - music notation is treated as a second language
  - also an excellent compressor for music
- [[https://twitter.com/reach_vb/status/1766115433885651137][MusicLang]]: Llama 2 based music generation model
  - trained from scratch; runs on CPU; works with chords
* AUDIO CODEC
- Disen: [[https://arxiv.org/abs/2211.11960][Disentangled Feature]] Learning for Real-Time Neural Speech Coding
  - voice conversion in real-time communications
  - ==Codec== with a codebook each for speaker and content
- [[https://github.com/enhuiz/vall-e][valle]] [[https://valle-demo.github.io/][concept]]: modeling (building up the decoder)
- VALL-E X: [[https://arxiv.org/pdf/2303.03926.pdf][Multilingual]] [[https://github.com/Plachtaa/VALL-E-X][Text]]-[[https://vallex-demo.github.io/][to-Speech]] Synthesis and Voice Cloning
  - clones a voice with only 3 seconds of reference audio
- [[AudioLM]]
- [[https://twitter.com/_akhaliq/status/1668430703128707078][High-Fidelity Audio]] [[https://github.com/descriptinc/descript-audio-codec][Compression]] [[https://twitter.com/arankomatsuzaki/status/1668435803373191168][with]] Improved RVQGAN (8 kbps)
  - 3x better than facebook's [[https://github.com/facebookresearch/encodec][encodec]] ==best codec== (EnCodec round-trip sketch below)
- [[https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec/][LMCodec]]: A Low Bitrate Speech Codec With Causal Transformer Models
- [[https://ai.googleblog.com/2021/08/soundstream-end-to-end-neural-audio.html][SoundStream]] [[https://google-research.github.io/seanet/soundstream/examples/][encoder]] [[https://github.com/wesbz/SoundStream][implementation]]
- [[https://youtu.be/lX0S0ZdWdDw][MusicGen]]: vs google MusicLM; interleaving of discrete sound tokens conditioned on text input
- [[https://twitter.com/_akhaliq/status/1674254559428968448][Accelerating]] Transducers through Adjacent Token Merging
  - reduces tokens by 57% and improves speed by 70%
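Since encodec is the common baseline here, a minimal encode/decode round trip following the facebookresearch/encodec README (~test.wav~ is a placeholder input file):

#+begin_src python
# Minimal EnCodec round trip, following the facebookresearch/encodec README.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; the 24 kHz model supports 1.5/3/6/12/24

wav, sr = torchaudio.load("test.wav")  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # (batch, channels, samples)

with torch.no_grad():
    encoded_frames = model.encode(wav)
    # Discrete RVQ codes, concatenated over frames: (batch, n_codebooks, time).
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
    decoded = model.decode(encoded_frames)

print(codes.shape)
torchaudio.save("roundtrip.wav", decoded.squeeze(0), model.sample_rate)
#+end_src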
** LANGUAGE ENCODING
- [[https://twitter.com/_akhaliq/status/1701767789910843836][Natural Language]] Supervision for General-Purpose Audio Representations
  - audio representations trained on 22 tasks, instead of only sound event classification
  - an autoregressive decoder-only language model, then joint training with contrastive learning
- [[https://twitter.com/laion_ai/status/1715027028582228175][CLARA]]: Multilingual Contrastive Learning for Audio Representation Acquisition
  - a contrastive audio-text model that understands implicit aspects of speech, e.g. emotion
- [[https://twitter.com/_akhaliq/status/1727210336497893438][HierSpeech++]]: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
** SUPER-RESOLUTION
- [[https://twitter.com/_akhaliq/status/1702565321842769984][AudioSR]]: Versatile Audio Super-resolution at Scale (upsample, enhance)
* VOICE CONVERSION
- [[https://qicongxie.github.io/end2endvc/][end2endvc]]: [[https://arxiv.org/pdf/2206.07569.pdf][End-to-End Voice]] Conversion with Information Perturbation
  - ==better MOS than NVC-Net==, and better MOS than FreeVC
- [[https://arxiv.org/pdf/2302.08296.pdf][QuickVC]] ([[https://github.com/quickvc/QuickVC-VoiceConversion][5000 kHz]] **fastest**) ==vits==
- [[https://arxiv.org/abs/2303.09057][TriAAN-VC]]: [[https://winddori2002.github.io/vc-demo.github.io/][Triple Adaptive]] Attention Normalization for Any-to-Any Voice Conversion
  - **best similarity** with close-to-best naturalness; uses a speaker encoding
- [[gpt voice only]] (best similarity, semantic tokens)
** FEW SHOT
- [[https://arxiv.org/pdf/2210.15418.pdf][FreeVC]]: [[https://github.com/olawod/freevc][Towards]] High-Quality Text-Free One-Shot Voice Conversion (==vits==)
- [[https://arxiv.org/pdf/2209.11866.pdf][ControlVC]]: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
- [[https://arxiv.org/abs/2212.14227][StyleTTS-VC]]: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models (phonemes)
- [[https://hiervst.github.io/][HierVST]]: Hierarchical Adaptive Zero-shot Voice Style Transfer
  - end-to-end zero-shot VST model (better than DiffVC)
** ZERO SHOT
- VoiceCraft: [[https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf][Zero-Shot]] [[https://github.com/jasonppy/VoiceCraft][Speech]] Editing and Text-to-Speech in the Wild ==best==
  - to clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference
* STYLE CONVERSION
- [[https://arxiv.org/pdf/2205.09784.pdf][LVC-VC]]: Voice Conversion with Location-Variable Convolutions
  - performs voice conversion while simultaneously [[https://lvc-vc.github.io/lvc-vc-demo/][generating audio]]
  - smaller than NVC-Net ==has charts==
- [[https://arxiv.org/abs/2106.00992][NVC-Net]]: End-to-End Adversarial Voice Conversion (==SONY==)
  - voice conversion directly on the raw audio waveform
  - ==best one==, fastest at 3600 kHz
  - https://github.com/sony/ai-research-code [[https://github.com/sony/ai-research-code/tree/master/nvcnet][nvc]] [[https://nvcnet.github.io/][voices]]
** ATOMIC TRANSFERS
- [[https://arxiv.org/abs/2302.08137][Speech]] [[https://paarthneekhara.github.io/ace/code.html][Representation]] Extractor: divides speech into parts (voice, pitch, context); zero-shot; [[https://github.com/NVIDIA/NeMo][Nvidia]]
- [[https://github.com/anonymous-pits/pits][Pits]]: VITS with pitch control (monotonic alignment)
- [[https://github.com/auspicious3000/SpeechSplit][SpeechSplit]]: disentangles speech into content, timbre, rhythm, and pitch
- [[https://github.com/cyhuang-tw/AutoVC][AutoVC]] implementation
  - speaker embeddings: https://github.com/yistLin/dvector (similarity sketch at the end of this note)
- [[https://yistlin.github.io/FragmentVC/][FragmentVC]]: timbre transfer (better than AutoVC), keeps frequency
- [[https://arxiv.org/pdf/2203.16037.pdf][RGSM]]: better than FragmentVC
- MFC-StyleVC: [[https://arxiv.org/pdf/2211.08857.pdf][Delivering Speaking]] [[https://kerwinchao.github.io/lowresourcevc.github.io/][Style in Low-Resource]] Conversion with Multi-Factor Constraints
  - repeats the utterance; a different training objective for **adaptation**; normalizing
  - content, speaker, and style constraints can be toggled on/off
- [[https://arxiv.org/abs/2302.10536][Nonparallel Emotional]] [[https://demosamplesites.github.io/EVCUP/][Voice Conversion]] For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing (==SONY==)
  - emotion transfer
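Most of the conversion papers above report speaker similarity via d-vector style embeddings (the yistLin/dvector repo linked under ATOMIC TRANSFERS). A minimal similarity check, using Resemblyzer as a stand-in GE2E d-vector encoder; ~target.wav~ and ~converted.wav~ are placeholder file names:

#+begin_src python
# Speaker-similarity sketch with a GE2E d-vector encoder (Resemblyzer,
# a stand-in for the linked yistLin/dvector implementation).
# target.wav / converted.wav are placeholder file names.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

target = encoder.embed_utterance(preprocess_wav("target.wav"))
converted = encoder.embed_utterance(preprocess_wav("converted.wav"))

# Resemblyzer embeddings are L2-normalized, so a dot product is cosine similarity.
print(f"speaker similarity: {np.dot(target, converted):.3f}")
#+end_src

Higher cosine similarity between converted and target utterances is the usual proxy for conversion quality; naturalness still needs a MOS-style listening test.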