:PROPERTIES:
:ID: a76fa223-70da-4b76-bf82-1d3ffef3698c
:ROAM_ALIASES: llm
:END:
#+title: text
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- [[https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/][OpenSource]] [[https://twitter.com/rskuzma/status/1640721436179308545][Model]], but for new hardware
- C++ generation library and list of supported models (GPT, RWKV): [[https://github.com/ggerganov/ggml][ggml]]
- [[https://twitter.com/iScienceLuvr/status/1729310393200148618][Language]] Model Inversion
  - given the output, reconstruct the original prompt
- [[https://cloud.google.com/vertex-ai/docs/model-garden/lora-qlora][LoRA or]] QLoRA by Google

* ADDED - EXTRAS TO LLM
- llama plugins: https://twitter.com/algo_diver/status/1639681733468753925
- llama tools: https://github.com/OpenBMB/ToolBench
- [[https://twitter.com/osanseviero/status/1692517354784043367][streaming]] vs non-streaming generation

** VECTOR DB
- langchain, and https://github.com/srush/MiniChain
- [[https://arxiv.org/pdf/2305.14564.pdf][PEARL]]: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
- [[https://github.com/cpacker/MemGPT][MemGPT]]: [[https://www.youtube.com/watch?v=jSLcc3opedQ&t=566][manages]] memory tiers to effectively provide extended context within the LLM's limited context window
  - the LLM is taught to manage its own memory, resembling paging in an OS (main context, external context) ==best==
  - trained to generate function calls

* SPECIALIZED USES
- [[id:adc6ba5b-a1de-40ed-a65e-993c14d1fee8][QUERYING MODELS - MULTIMODAL]]
- [[https://arxiv.org/abs/2305.12031][Clinical]] [[https://github.com/bowang-lab/clinical-camel][Camel]]: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding; medical, doctor
- [[https://twitter.com/_akhaliq/status/1676052985544155136][Personality Traits]] in Large Language Models, quantifying personalities
- [[https://twitter.com/_akhaliq/status/1719883839382655255][ChipNeMo]]: Domain-Adapted LLMs for Chip Design
- [[https://twitter.com/_akhaliq/status/1741661714515431833][LARP]]: Language-Agent Role Play for Open-World Games
  - decision-making assistant; the framework refines interactions between users and agents

** LAYOUT LLM
:PROPERTIES:
:ID: 3a1a687e-63e6-4552-8803-a06deeb494c6
:END:
- [[https://arxiv.org/abs/2404.00995][PosterLlama]]: Bridging Design Ability of Language Model to Contents-Aware Layout Generation
  - reformats layout elements into HTML code
  - unconditional layout generation, element-conditional layout generation, layout completion

** PLOT
- [[https://twitter.com/NielsRogge/status/1644388959416352783][Pix2Struct]]: plot-to-text parsing
  - DePlot: plot-to-text model helping LLMs understand plots
  - MatCha: strong chart & math capabilities via plot deconstruction & numerical reasoning objectives
- [[https://twitter.com/_akhaliq/status/1762349999919071528][StructLM]]: Towards Building Generalist Models for Structured Knowledge Grounding
  - based on the CodeLlama architecture

** LEGAL
- [[https://twitter.com/_akhaliq/status/1765614083875738028][SaulLM-7B]]: A pioneering Large Language Model for Law
  - designed explicitly for legal text comprehension and generation

** VISUAL
- [[https://twitter.com/_akhaliq/status/1735509186547486848][Pixel]] Aligned Language Models
  - can take locations (sets of points, boxes) as inputs or outputs
  - location-aware vision-language tasks

** CODE ASSISTANT
- [[id:bb65f50b-04af-4161-afcd-acdc4821a0c4][ROBOTS]]
- [[id:e84c6d77-0e77-4084-a912-06d6846ba539][WEB MOCKING]]
- [[https://twitter.com/_akhaliq/status/1714482353689464844][CrossCodeEval]]: A Diverse and Multilingual Benchmark for Cross-File Code Completion
  - cross-file contextual understanding
- [[https://twitter.com/karpathy/status/1734251375163511203][Mixtral-8x7B]] > CodeLlama-34B (on HumanEval)

*** MATH
- [[https://twitter.com/_akhaliq/status/1714130148784497116][Llemma]]: An Open Language Model For Mathematics
  - capable of tool use and formal theorem proving
- [[https://twitter.com/_akhaliq/status/1732945544287334811][Large Language]] Models for Mathematicians (academic)
  - mathematical description of the transformer model used in all modern language models
- [[https://twitter.com/_akhaliq/status/1767754905584824647][Chronos]]: Learning the Language of Time Series
  - improves zero-shot accuracy on unseen forecasting tasks; forecasting pipeline
- [[https://twitter.com/_akhaliq/status/1771019526265618889][MathVerse]]: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
  - extracts crucial reasoning steps to reveal the intermediate reasoning quality of MLLMs

*** CODE COMPLETION
- [[https://twitter.com/BrianRoemmele/status/1691852347377594764][DeciCoder]]: decoder-only code completion model
  - groups tokens into clusters and has each token attend only to others within its cluster
- [[https://twitter.com/_akhaliq/status/1731865391205425575][Magicoder]]: [[https://twitter.com/_akhaliq/status/1732121442223853950][Source]] Code Is All You Need
  - MagicoderS-CL-7B based on CodeLlama
- [[https://twitter.com/_akhaliq/status/1754340063675056536][StepCoder]]: Improve Code Generation with Reinforcement Learning from Compiler Feedback
  - breaks long-sequence code generation into a curriculum of code completion subtasks
  - masks segments so that optimization is applied properly

**** OPERATOR
- [[https://twitter.com/_akhaliq/status/1690946387171749888][Enhancing]] Network Management Using Code Generated by Large Language Models
  - program synthesis: generate task-specific code from natural language queries
  - analyzing network topologies and communication graphs

*** DIFFUSION
- [[https://twitter.com/_akhaliq/status/1718824268060893533][CodeFusion]]: A Pre-trained Diffusion Model for Code Generation ==diffusion== (75M vs 1B auto-regressive)
  - iterative denoising, no need to start from scratch
- [[https://twitter.com/_akhaliq/status/1719902516358328711][Text Rendering]] Strategies for Pixel Language Models
  - characters as images, handles any script; PIXEL model

*** TOOL-USE TOOLS
- [[https://huggingface.co/papers/2305.19234][Grammar Prompting]] for Domain-Specific Language Generation with Large Language Models
  - for DSLs such as programming languages
  - predicts a BNF grammar given an input, then generates the output according to the rules of that grammar
- [[https://twitter.com/_akhaliq/status/1686569710001758208][Tool Documentation]] Enables Zero-Shot Tool-Usage with Large Language Models
  - zero-shot prompts with only documentation are sufficient for tool usage
  - tool documentation > demonstrations
- [[https://twitter.com/_akhaliq/status/1718819055228907581][ControlLLM]]: Augment Language Models with Tools by Searching on Graphs
  - breaks a complex task down into clear subtasks, then searches for the optimal solution path
- [[https://github.com/xszyou/Fay][Fay]]: integrating language models and digital characters

** TRANSLATION
- [[https://github.com/emorynlp/elit][elit]]: NLP tools for tokenization, tagging, and language recognition
- translation prompt: https://boards.4channel.org/g/thread/92468569#p92470651
- [[https://twitter.com/_akhaliq/status/1732950154146169191][EMMA]]: Efficient Monotonic Multihead Attention
  - simultaneous speech-to-text translation on the Spanish-English translation task

** OPTIMIZATION
- OPRO: Optimization by PROmpting, [[https://twitter.com/_akhaliq/status/1699963552952397952][Large Language]] Models as Optimizers
  - each step generates new solutions from previously generated solutions
- [[https://twitter.com/_akhaliq/status/1702146221178015744][Large Language]] Models for Compiler Optimization
  - reducing instruction counts over the compiler baseline
- [[https://twitter.com/_akhaliq/status/1703583311526813829][EvoPrompt]]: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

*** CACHE
- [[https://twitter.com/_akhaliq/status/1734046582805344492][SparQ]] Attention: Bandwidth-Efficient LLM Inference
  - reduces memory bandwidth requirements within the attention blocks through selective fetching of the cached history (up to 8x)

** SUMMARIZATION
- thread summarizer https://labs.kagi.com/ai/sum?url=%3E%3E248633369
- [[https://twitter.com/RLanceMartin/status/1687130407425417216][LLM Use]] Case: Summarization (using langchain)
- [[https://twitter.com/_akhaliq/status/1701043650015207871][From Sparse]] to Dense: GPT-4 Summarization with Chain of Density Prompting
  - iteratively incorporates missing salient entities without increasing the length
- [[https://twitter.com/_akhaliq/status/1704677505821487613][LMDX]]: Language Model-based Document Information Extraction and Localization
  - methodology to adapt arbitrary LLMs for document information extraction (without hallucination)

* TEXT DIFFUSION
- parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]]
- [[https://arxiv.org/abs/2212.11685][GENIE]]: Large Scale Pre-training for Text Generation with Diffusion Model
- [[https://arxiv.org/abs/2305.08379][TESS]]: Text-to-Text Self-Conditioned Simplex Diffusion
- [[https://arxiv.org/abs/2305.09515][AR-Diffusion]]: Auto-Regressive Diffusion Model for Text Generation
- [[https://twitter.com/_akhaliq/status/1665936266372739074][PLANNER]]: Generating Diversified Paragraph via Latent Language Diffusion Model
- [[https://arxiv.org/abs/2404.06760][DiffusionDialog]]: A Diffusion Model for Diverse Dialog Generation with Latent Space
  - enhances the diversity of dialog responses while maintaining coherence

* TEXT GENERATION
- [[https://github.com/allenai/OLMo][allenai / OLMo]]: actually open-source AI model

** INFERENCE
*** BETTER
**** FOCUS THE ATTENTION
- [[https://twitter.com/_akhaliq/status/1721793530706682120][PASTA]]: Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
  - identifies a small subset of attention heads, then applies precise attention reweighting on them
  - applied alongside prompting
- S2A: [[https://twitter.com/_akhaliq/status/1726815889268359293][System]] 2 Attention (is something you might need too)
  - regenerates the context to include only the relevant portions before responding

*** FASTER
- [[https://twitter.com/_akhaliq/status/1689462088626782209][Accelerating]] LLM Inference with Staged Speculative Decoding
  - restructures the speculative batch as a tree (see the sketch below for the base idea)
- [[https://twitter.com/_akhaliq/status/1666646646103441410][MobileNMT]]: Enabling Translation in 15MB and 30ms
- [[https://twitter.com/_akhaliq/status/1720447630084276329][FlashDecoding++]]: Faster Large Language Model Inference on GPUs
  - inference engine, 2-4x speedup; flat GEMM optimization
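
To make the speculative-decoding idea above concrete, here is a minimal greedy sketch (the staged/tree variants build on this): a cheap draft model proposes a few tokens, the target model verifies them in a single forward pass, and the longest agreeing prefix plus one corrected token is kept. The names draft_model and target_model are hypothetical callables returning per-position logits, not a real API.

#+begin_src python
# Minimal sketch of (greedy) speculative decoding. `draft_model` and
# `target_model` are hypothetical stand-ins: callables that take a token
# list and return next-token logits for every position.
import numpy as np

def greedy_next(model, tokens):
    """Greedy next token according to `model` (hypothetical interface)."""
    return int(np.argmax(model(tokens)[-1]))

def speculative_step(draft_model, target_model, tokens, k=4):
    """Draft k tokens cheaply, verify them with one target pass, and keep
    the longest agreeing prefix plus one corrected token."""
    draft = list(tokens)
    for _ in range(k):                       # cheap proposals
        draft.append(greedy_next(draft_model, draft))

    # One target forward pass over the drafted sequence gives the target's
    # greedy choice at every drafted position.
    logits = target_model(draft)             # shape: [len(draft), vocab]
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        target_choice = int(np.argmax(logits[i - 1]))
        accepted.append(target_choice)       # always gain at least this token
        if target_choice != draft[i]:        # disagreement: stop accepting
            break
    return accepted
#+end_src
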
- [[https://twitter.com/_akhaliq/status/1726798167683862744][Exponentially]] Faster Language Modelling
  - replaces feedforward networks with fast feedforward networks (FFFs)
  - engages just 12 out of 4095 neurons for each layer inference, 78x speedup
- [[https://twitter.com/hongyangzh/status/1733169111625064833][EAGLE]]: [[https://github.com/SafeAILab/EAGLE][LLM decoding]] based on compression (compared against Medusa, Lookahead, Vanilla)
  - the sequence of second-top-layer features is compressible, making prediction of subsequent feature vectors from previous ones easy for a small model

*** MODELS
- [[https://twitter.com/jinaai_/status/1717067977819046007][jina-embeddings-v2]]: [[https://huggingface.co/jinaai/jina-embeddings-v2-base-en][8k]] context length, BERT architecture
- [[https://huggingface.co/01-ai/Yi-34B][Yi-34B]]: [[https://github.com/01-ai/Yi][6B]] [[https://twitter.com/AdeenaY8/status/1721475753441894498][and]] 34B, better than Llama 2 (has a benchmarks list)
- [[https://arxiv.org/abs/2403.04652][Yi]]: Open Foundation Models by 01.AI

**** QWEN
- Qwen-7B: surpasses both LLaMA 2 7B and 13B on MMLU, math, and code
- [[https://twitter.com/huybery/status/1754537742892232972][Qwen-1.5]] [[https://twitter.com/_akhaliq/status/1754545091434139732][space]]

**** LLAMA
- LLaMA [[https://www.reddit.com/r/StableDiffusion/comments/11h2wpv/comment/jb59lgt/][ipfs]]
- [[https://github.com/cocktailpeanut/dalai][in browser]] (there is also the cpp one)
- [[https://twitter.com/lvwerra/status/1681701409677246471][train all]] Llama-2 models on your own data

***** ALTERNATIVES
- [[https://github.com/openlm-research/open_llama][Open LLama]], [[https://huggingface.co/openlm-research/open_llama_7b_400bt_preview][Open-Source]] Reproduction, permissively licensed; [[https://github.com/Lightning-AI/lit-llama][Lit-LLaMA]], RedPajama dataset
- [[https://twitter.com/pcuenq/status/1664605575882366980][Falcon]]: new open-source family ==instruct finetuned too==
- [[https://twitter.com/_akhaliq/status/1743135851238805685][LLaMA Pro]]: Progressive LLaMA with Block Expansion
  - take a pretrained model, freeze its parameters, then add new blocks
  - adapts the model to new data without forgetting the old
- [[https://twitter.com/_akhaliq/status/1744009616562819526][LiteLlama]]: 460M parameters trained on 1T tokens
- [[https://twitter.com/_akhaliq/status/1762353396688806172][MobiLlama]]: Small Language Models (SLMs), open-source 0.5 billion (0.5B) parameters

**** MISTRAL
- [[https://huggingface.co/mistralai/Mistral-7B-v0.1][Mistral-7B]]: outperforms Llama 2 13B, Apache 2.0 licensed
- [[https://twitter.com/skunkworks_ai/status/1713372586225156392/photo/3][BakLLaVA]]: Mistral + vision model
- [[https://ollama.ai/library/zephyr][zephyr]]: [[https://twitter.com/_lewtun/status/1717816585786626550][fine-tuned]] [[https://twitter.com/_akhaliq/status/1718009133570408824][using]] Direct Preference Optimization
  - dataset ranked by a teacher model for intent alignment; much smaller: 7B vs 70B Llama
- [[https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF][OpenHermes-2]]: roleplay, GPT-4 dataset
- https://huggingface.co/TheBloke/openchat_3.5-GGUF
- [[https://huggingface.co/argilla/notux-8x7b-v1][notux]]: chat data

** TRAINING
- [[https://arxiv.org/pdf/2304.04947.pdf][Conditional Adapters]]: Parameter-efficient Transfer Learning with Fast Inference
- [[https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/imagebind_LLM][LLaMa-Adapter Multimodal]]!
  ([[https://twitter.com/lupantech/status/1664316926003396608][vision]])
- [[https://arxiv.org/abs/2304.05511][Training]] Large Language Models Efficiently with Sparsity and Dataflow
- [[https://huggingface.co/papers/2305.16843][Randomized]] Positional Encodings Boost Length Generalization of Transformers
- [[https://huggingface.co/papers/2305.16958][MixCE]]: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
  - reverse cross-entropy rather than maximum likelihood estimation (MLE) alone
- [[https://twitter.com/_akhaliq/status/1701423258430492710][Neurons in Large]] Language Models: Dead, N-gram, Positional
  - study: many neurons per layer are dead; some neurons specialize in removing information from the input
- [[https://huggingface.co/papers/2305.16765][Backpack Language]] Models: non-contextual sense vectors that specialize in encoding different aspects of a word
- [[https://twitter.com/_akhaliq/status/1717019165054144688][In-Context]] Learning Creates Task Vectors
  - in-context learning = compressing the training set into a single task vector, then using it to modulate the transformer to produce the output
- [[https://www.youtube.com/watch?v=409tNlaByds&t=1588][Efficient]] Streaming Language Models with Attention Sinks (==better inference or training==)
  - ==naive context-window caching is bad==, just keep the first tokens around (as is)
  - or it is better to have a static null token at the beginning of the window
  - related to the "Vision Transformers Need Registers" paper

*** CHEAPNESS
- [[https://research.myshell.ai/jetmoe][JetMoE]]: [[https://twitter.com/qinzytech/status/1775916338822709755][Reaching]] LLaMA2 Performance with 0.1M Dollars
  - and can be finetuned with a very limited computing budget

*** STRUCTURE
- [[https://twitter.com/_akhaliq/status/1672046849400909824][From Word Models]] [[https://arxiv.org/pdf/2306.12672.pdf][to World]] [[https://github.com/gabegrand/world-models][Models]]: Translating from Natural Language to the Probabilistic Language of Thought
  - probabilistic programming language for commonsense reasoning, linguistics
- [[https://twitter.com/_akhaliq/status/1768485746225156339][Quiet-STaR]]: Language Models Can Teach Themselves to Think Before Speaking
  - learn to generate rationales at each token to explain future text, improving their predictions

**** MERGING
- [[https://twitter.com/_akhaliq/status/1665887472335695873][LLM-Blender]]: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
  - (specialized) text model merging (using rankings)
- [[https://twitter.com/_akhaliq/status/1762344073820578145][FuseChat]]: Knowledge Fusion of Chat Models
  - knowledge fusion for LLMs of structurally diverse architectures and scales

**** SKELETON
- [[https://twitter.com/_akhaliq/status/1685900593779634176][Skeleton-of-Thought]]: Large Language Models Can Do Parallel Decoding
  - first a skeleton, then parallel filling; faster and better
- [[https://arxiv.org/abs/2303.09014][ART]]: [[https://github.com/bhargaviparanjape/language-programmes/][Automatic]] multi-step reasoning and tool-use for large language models
  - bubbles of logic
- [[https://twitter.com/_akhaliq/status/1726800556667085263][Orca 2]]: [[https://huggingface.co/microsoft/Orca-2-13b][Teaching]] Small Language Models How to Reason
  - reasoning techniques: step-by-step, recall then generate, recall-reason-generate, direct answer
- [[https://twitter.com/_akhaliq/status/1734047664516346022][PathFinder]]: Guided Search over Multi-Step Reasoning Paths
  - tree-search-based reasoning path generation approach (beam search algorithm)
  - improved commonsense reasoning tasks and complex arithmetic
- [[https://twitter.com/_akhaliq/status/1777177284954194213][Stream of]] Search (SoS): Learning to Search in Language
  - models can be taught to search by representing the process of search in language, as a flattened string

***** META-PROCESS TOKENS
- [[https://twitter.com/_akhaliq/status/1692045755178226053][Teach LLMs]] to Personalize -- An Approach inspired by Writing Education
  - retrieval, ranking, summarization, synthesis, and generation
- [[https://twitter.com/_akhaliq/status/1691730510408822885][Link-Context]] Learning for Multimodal LLMs
  - causal associations between data points = cause and effect
  - In-Context Learning (ICL) = learning to learn: from limited tasks (provided as demonstrations), generalize to unseen tasks
- LoGiPT: [[https://twitter.com/_akhaliq/status/1723898514885824936][Language Models]] can be Logical Solvers
  - parses natural-language logical questions into symbolic representations, emulates logical solvers

**** CORPUS STRUCTURE, RETRIEVAL
- [[https://github.com/facebookresearch/NPM][NPM]]: Nonparametric Masked Language Modeling, vs GPT-3, text-corpus based
  - other code implementations: https://www.catalyzex.com/paper/arxiv:2212.01349/code
- [[https://twitter.com/_akhaliq/status/1691734334057963733][RAVEN]]: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
  - in-context learning in retrieval-augmented language models

***** LLM AS ENCODER
- [[id:316325a1-f24b-487d-9238-ca35db3a6b0c][GZIP VS GPT]]
- [[https://twitter.com/_akhaliq/status/1680740847128653829][Copy]] Is All You Need
  - the task of text generation decomposed into a series of copy-and-paste operations
  - text spans rather than vocabulary tokens
  - learning = a text compression algorithm? (see the gzip+kNN sketch below)
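
The bullets above (and the gzip/KNN line that follows) boil down to the idea that a compressor plus a distance metric can act as a classifier. A minimal sketch of that idea using zlib and normalized compression distance; the training examples below are made up for illustration.

#+begin_src python
# Compression-as-classifier sketch: no training, just normalized compression
# distance (NCD) and a k-nearest-neighbour vote over labelled texts.
import zlib

def clen(s: str) -> int:
    """Compressed length of a string."""
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label `query` by majority vote over its k NCD-nearest training texts."""
    neighbours = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Toy labelled corpus (illustrative only).
train = [
    ("the model overfits without dropout", "ml"),
    ("gradient descent converges slowly here", "ml"),
    ("the pasta needs more salt and basil", "cooking"),
    ("simmer the sauce for twenty minutes", "cooking"),
]
print(knn_classify("tune the learning rate for gradient descent", train, k=3))
#+end_src
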
- Decoding the ACL Paper: Gzip and KNN Rival BERT in Text Classification
- [[https://arxiv.org/abs/2404.05961][LLM2Vec]]: Large Language Models Are Secretly Powerful Text Encoders
  - LLMs can be effectively transformed into universal text encoders without the need for expensive adaptation

*** QUANTIZATION
- int-3 quantization: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and [[https://twitter.com/NolanoOrg/status/1635409631530057728][twitter]]
- llama.cpp [[https://github.com/ggerganov/llama.cpp/pull/301][quantization]]
- [[https://www.reddit.com/r/LocalLLaMA/comments/13yehfn/new_quantization_method_awq_outperforms_gptq_in/][AWQ]]: Activation-aware Weight Quantization for LLM Compression and Acceleration
  - outperforms GPTQ in 4-bit and 3-bit with 1.45x speedup, and works with multimodal LLMs
- [[https://github.com/Vahe1994/SpQR][SpQR]] [[https://www.reddit.com/r/LocalLLaMA/comments/142ij29/yet_another_quantization_method_spqr_by_tim/][method]] for LLM compression: highly sensitive parameters are not quantized
- [[https://twitter.com/_akhaliq/status/1696057203978100775][OmniQuant]]: Omnidirectionally Calibrated Quantization for Large Language Models
  - no more hand-crafted quantization parameters
- [[https://twitter.com/_akhaliq/status/1717384066750685620][LLM-FP4]]: 4-Bit Floating-Point Quantized Transformers, only 5.8% lower on reasoning than the full-precision model
- [[https://twitter.com/_akhaliq/status/1755416334685417928][BiLLM]]: Pushing the Limit of Post-Training Quantization for LLMs
  - identifies and structurally selects salient weights
  - quantizes 7 billion weights within 0.5 hours
- [[https://twitter.com/_akhaliq/status/1765209290053238788][EasyQuant]]: An Efficient Data-free Quantization Algorithm for LLMs
  - leaves the outliers (less than 1%) unchanged, implemented in parallel

**** 1-BIT
- [[https://twitter.com/_akhaliq/status/1714483549716320739][BitNet]]: Scaling 1-bit Transformers for Large Language Models
  - vs 8-bit quantization architectures
- [[https://twitter.com/_akhaliq/status/1717385001031946494][QMoE]]: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  - can compress a 1.6-trillion-parameter model to less than 160GB (20x compression, 0.8 bits per parameter)

**** LORA WITH QUANTIZATION
- [[https://twitter.com/_akhaliq/status/1706863594917269514][QA-LoRA]]: Quantization-Aware Low-Rank Adaptation of Large Language Models
- [[https://twitter.com/_akhaliq/status/1661177995049172992][QLoRA]]: Efficient Finetuning of Quantized LLMs, 24 hours on a single 48GB GPU
- [[https://twitter.com/_akhaliq/status/1713739581097398618][LoftQ]]: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
  - outperforms QLoRA

*** FINETUNING
- [[https://arxiv.org/abs/2303.16199][LLaMA-Adapter]]: [[https://github.com/ZrrSkywalker/LLaMA-Adapter][Efficient Fine-tuning]] of Language Models with Zero-init Attention
- [[https://arxiv.org/pdf/2302.14691.pdf][In-Context]] [[https://github.com/seonghyeonye/ICIL][Instruction]] Learning (ICIL)
- [[https://twitter.com/_akhaliq/status/1719217779406954805][LoRAShear]]: Efficient Large Language Model Structured Pruning and Knowledge Recovery
  - distillation

**** FEEDBACK AS TARGET
:PROPERTIES:
:ID: ad5a8c1e-10c2-4155-86fe-ecbfa1ffcd07
:END:
- [[MULTIPLE LLM]]
- RLHF = Reinforcement Learning from Human Feedback
- [[https://arxiv.org/abs/2305.18290][Direct Preference]] Optimization: Your Language Model is Secretly a Reward Model (DPO)
  - can fine-tune LMs to align with human preferences, better than RLHF
- RAD: [[https://twitter.com/_akhaliq/status/1714099101690642724][Reward-Augmented]] Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
  - generation that uses an extra reward model to produce text with certain properties
- [[https://twitter.com/_akhaliq/status/1747820246268887199][ReFT]]: Reasoning with Reinforced Fine-Tuning
  - learns from multiple annotated reasoning paths
  - rewards are naturally derived from the ground-truth answers (as in math)

***** SELF TRAIN
- [[https://twitter.com/_akhaliq/status/1716302566592479486][TriPosT]]: Teaching Language Models to Self-Improve through Interactive Demonstrations
  - self-improvement ability for small models: revise their own outputs, correcting their own mistakes
- [[https://selfrefine.info/][Self-Refine]]: Iterative Refinement with Self-Feedback

**** CHEAPNESS
- [[https://huggingface.co/papers/2305.17333][Fine-Tuning Language]] Models with Just Forward Passes, less RAM
- [[https://twitter.com/_akhaliq/status/1670678532349915138][Full Parameter]] Fine-tuning for Large Language Models with Limited Resources, low-memory optimizer

***** MULTIPLE LLM
- EFT: [[https://twitter.com/_akhaliq/status/1715236713436418120][An Emulator]] for Fine-Tuning Large Language Models using Small Language Models
  - avoids resource-intensive fine-tuning of LLMs by ensembling them with small fine-tuned models
  - also: scaling up finetuning improves helpfulness, scaling up pre-training improves factuality
- [[https://twitter.com/_akhaliq/status/1716301330283671719][Tuna]]: Instruction Tuning using Feedback from Large Language Models
  - finetuning with contextual ranking
- [[https://twitter.com/_akhaliq/status/1715237306506813678][AutoMix]]: Automatically Mixing Language Models
  - strategically routes queries to a larger LLM, based on the outputs from a smaller LM

**** ADDITIVE METHODS
- [[LORA WITH QUANTIZATION]]
- [[https://twitter.com/_akhaliq/status/1684030297661403136][LoraHub]]: [[https://twitter.com/sivil_taram/status/1684513568950210560][Efficient]] Cross-Task Generalization via Dynamic LoRA Composition
  - LoRA composability for cross-task generalization; neither more parameters nor gradients
- [[https://twitter.com/_akhaliq/status/1723910609857663257][Parameter-Efficient]] Orthogonal Finetuning via Butterfly Factorization

***** LORA
- [[https://github.com/tloen/alpaca-lora][alpaca-lora]]
- sentence transformers: [[https://github.com/huggingface/setfit][SetFit]] - efficient few-shot learning
- [[https://huggingface.co/blog/trl-peft][peft]] [[https://twitter.com/younesbelkada/status/1633867640564486144][twitter]] [[https://github.com/huggingface/peft][repo]]
- [[https://www.youtube.com/watch?v=oPS-8nKGu8U][PEFT]] w/ Multi LoRA explained (LLM fine-tuning)

*** MEMORY
- [[https://arxiv.org/abs/2203.08913][Memorizing]] [[https://twitter.com/nearcyan/status/1637891562385317897][Transformers]] [[https://github.com/google-research/meliad][repo]]
  - the Memorizing Transformer does not need to be pre-trained from scratch; it is possible to add memory to an existing pre-trained model and then fine-tune it
- [[https://huggingface.co/papers/2305.16338][Think Before]] You Act: Decision Transformers with Internal Working Memory, task-specialized memory
- [[https://twitter.com/_akhaliq/status/1726796663979643174][Memory]] Augmented Language Models through Mixture of Word Experts
  - Mixture of Word Experts (MoWE), a Mixture-of-Experts (MoE) variant
  - a set of word-specific experts plays the role of a sparse memory; similar performance to more complex memory-augmented approaches
- [[https://twitter.com/_akhaliq/status/1757235218316996896][Fiddler]]: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
  - minimizes data movement between the CPU and GPU
  - Mixtral-8x7B model, 90GB of parameters, over 3 tokens per second on a single GPU with 24GB memory
- [[https://twitter.com/AnimaAnandkumar/status/1765613815146893348][GaLore]]: Memory-Efficient LLM Training by Gradient Low-Rank Projection ==best==
  - feasibility of pre-training a 7B model on GPUs with 24GB memory; unlike LoRA
  - 82.5% reduction in memory

**** CONTEXT LENGTH
- [[VECTOR DB]]
- [[https://twitter.com/_akhaliq/status/1668436285822836737][Augmenting]] Language Models with Long-Term Memory (unlimited context)
- [[https://twitter.com/_akhaliq/status/1698497385230389585][YaRN]]: Efficient Context Window Extension of Large Language Models
- [[https://twitter.com/_akhaliq/status/1701774889659572288][Efficient]] Memory Management for Large Language Model Serving with PagedAttention
  - vLLM: near-zero waste in KV cache memory, and flexible
- [[https://nitter.net/tri_dao/status/1712904220519944411][Flash-Decoding]]: make long-context LLM inference up to 8x faster
  - load the KV cache in parallel as fast as possible, then separately rescale to combine the results
- [[https://twitter.com/_akhaliq/status/1744181094025433327][Infinite-LLM]]: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  - LLM serving system dynamically managing the KV cache, orchestrated across the data center
- [[https://twitter.com/_akhaliq/status/1747515567492174185][Extending]] LLMs' Context Window with 100 Samples
  - introduces a novel extension to RoPE so that it can adapt to larger context windows (efficiently)
  - demonstrated on Llama

*** DATASET
- [[SKELETON]]
- [[https://arxiv.org/pdf/2305.11206.pdf][LIMA]]: [[https://twitter.com/_akhaliq/status/1660458199504556034][Less]] Is More for Alignment
  - trained on only 1,000 carefully curated prompts and responses
- [[https://arxiv.org/abs/2304.14318][q2d]]: Turning Questions into Dialogs to Teach Models How to Search
  - synthetically generated data achieves 90%--97% of the performance of training on human-generated data
- [[https://huggingface.co/papers/2305.16635][Impossible Distillation]]: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
  - high-quality model and dataset from a low-quality teacher model
- [[https://twitter.com/_akhaliq/status/1689120315832483841][Simple synthetic]] data reduces sycophancy in large language models
  - sycophancy = adapting answers to a user's stated views, even toward statements that are objectively incorrect
  - lightweight finetuning step
- [[https://twitter.com/_akhaliq/status/1699951105927512399][GPT Can]] Solve Mathematical Problems Without a Calculator; when the training data includes multi-digit arithmetic
- [[https://twitter.com/_akhaliq/status/1719220065655013542][TeacherLM]]: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
  - annotating the dataset with "why" instead of only "what"
- LeMa: [[https://twitter.com/_akhaliq/status/1719542744710824024][Learning]] From Mistakes Makes LLM Better Reasoner
  - identify, explain, and correct mistakes, using the LLM itself to build finetuning data (learn from mistakes); see the sketch below
- [[https://twitter.com/_akhaliq/status/1721759755847303314][Ziya2]]: Data-centric Learning is All LLMs Need
  - focuses on pre-training techniques and data-centric optimization to enhance the learning process
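
As a rough illustration of the LeMa-style "learn from mistakes" recipe noted above, here is a sketch of the data-building step. The generate/correct callables, the prompt wording, and the exact-match grading are assumptions for illustration, not the paper's exact setup.

#+begin_src python
# Sketch of a "learn from mistakes" data pipeline: sample an answer from the
# student model; when it is wrong, ask a corrector/teacher model to identify,
# explain, and fix the mistake; store (prompt, correction) pairs as finetuning
# data. `generate` and `correct` are user-supplied callables (hypothetical
# interfaces to whatever inference API you use).
import json
from typing import Callable, Iterable, Tuple

def build_mistake_correction_set(
    problems: Iterable[Tuple[str, str]],       # (question, reference answer)
    generate: Callable[[str], str],            # student model
    correct: Callable[[str, str, str], str],   # teacher: (q, attempt, ref) -> explanation + fix
    outfile: str = "learn_from_mistakes.jsonl",
) -> int:
    """Write one JSONL record per corrected mistake; return how many were kept."""
    kept = 0
    with open(outfile, "w") as f:
        for question, reference in problems:
            attempt = generate(question)
            if attempt.strip() == reference.strip():
                continue                        # correct answer: nothing to learn from
            record = {
                "prompt": f"{question}\nStudent answer: {attempt}\n"
                          "Identify the mistake, explain it, and give the corrected answer.",
                "completion": correct(question, attempt, reference),
            }
            f.write(json.dumps(record) + "\n")
            kept += 1
    return kept
#+end_src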