:PROPERTIES:
:ID: a76fa223-70da-4b76-bf82-1d3ffef3698c
:ROAM_ALIASES: llm
:END:
#+title: text
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- [[https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/][OpenSource]] [[https://twitter.com/rskuzma/status/1640721436179308545][Model]], but for new hardware
- C++ generation library and list of supported models (GPT, RWKV): [[https://github.com/ggerganov/ggml][ggml]]
- [[https://twitter.com/iScienceLuvr/status/1729310393200148618][Language]] Model Inversion
  - given the output, reconstruct the original prompt
- [[https://cloud.google.com/vertex-ai/docs/model-garden/lora-qlora][LoRA or]] QLoRA by Google

* ADDED - EXTRAS TO LLM
- llama plugins: https://twitter.com/algo_diver/status/1639681733468753925
- llama tools: https://github.com/OpenBMB/ToolBench
- [[https://twitter.com/osanseviero/status/1692517354784043367][streaming]] vs non-streaming generation

** VECTOR DB
- langchain, and https://github.com/srush/MiniChain
- [[https://arxiv.org/pdf/2305.14564.pdf][PEARL]]: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
- [[https://github.com/cpacker/MemGPT][MemGPT]]: [[https://www.youtube.com/watch?v=jSLcc3opedQ&t=566][manages]] memory tiers to effectively provide extended context within the LLM's limited context window
  - the LLM is taught to manage its own memory, resembling paging in an OS (main context, external context) ==best==
  - trained to generate function calls

* SPECIALIZED USES
- [[id:adc6ba5b-a1de-40ed-a65e-993c14d1fee8][QUERYING MODELS - MULTIMODAL]]
- [[https://arxiv.org/abs/2305.12031][Clinical]] [[https://github.com/bowang-lab/clinical-camel][Camel]]: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding; medical, doctor
- [[https://twitter.com/_akhaliq/status/1676052985544155136][Personality Traits]] in Large Language Models, quantifying personalities
- [[https://twitter.com/_akhaliq/status/1719883839382655255][ChipNeMo]]: Domain-Adapted LLMs for Chip Design
- [[https://twitter.com/_akhaliq/status/1741661714515431833][LARP]]: Language-Agent Role Play for Open-World Games
  - decision-making assistant; the framework refines interactions between users and agents

** LAYOUT LLM
:PROPERTIES:
:ID: 3a1a687e-63e6-4552-8803-a06deeb494c6
:END:
- [[https://arxiv.org/abs/2404.00995][PosterLlama]]: Bridging Design Ability of Language Model to Contents-Aware Layout Generation
  - reformats layout elements into HTML code
  - unconditional layout generation, element-conditional layout generation, layout completion

** PLOT
- [[https://twitter.com/NielsRogge/status/1644388959416352783][Pix2Struct]]: plot-to-text parsing
  - DePlot: plot-to-text model helping LLMs understand plots
  - MatCha: strong chart & math capabilities via plot deconstruction & numerical reasoning objectives
- [[https://twitter.com/_akhaliq/status/1762349999919071528][StructLM]]: Towards Building Generalist Models for Structured Knowledge Grounding
  - based on the CodeLlama architecture

** LEGAL
- [[https://twitter.com/_akhaliq/status/1765614083875738028][SaulLM-7B]]: A pioneering Large Language Model for Law
  - designed explicitly for legal text comprehension and generation

** VISUAL
- [[https://twitter.com/_akhaliq/status/1735509186547486848][Pixel]] Aligned Language Models
  - can take locations (sets of points, boxes) as inputs or outputs
  - location-aware vision-language tasks

** CODE ASSISTANT
- [[id:bb65f50b-04af-4161-afcd-acdc4821a0c4][ROBOTS]]
- [[id:e84c6d77-0e77-4084-a912-06d6846ba539][WEB MOCKING]]
- [[https://twitter.com/_akhaliq/status/1714482353689464844][CrossCodeEval]]: A Diverse and Multilingual Benchmark for Cross-File Code Completion
  - cross-file contextual understanding
- [[https://twitter.com/karpathy/status/1734251375163511203][Mixtral-8x7B]] > CodeLlama-34B (on HumanEval)

*** MATH
- [[https://twitter.com/_akhaliq/status/1714130148784497116][Llemma]]: An Open Language Model For Mathematics
  - capable of tool use and formal theorem proving
- [[https://twitter.com/_akhaliq/status/1732945544287334811][Large Language]] Models for Mathematicians (academic)
  - mathematical description of the transformer model used in all modern language models
- [[https://twitter.com/_akhaliq/status/1767754905584824647][Chronos]]: Learning the Language of Time Series
  - improves zero-shot accuracy on unseen forecasting tasks; forecasting pipeline
- [[https://twitter.com/_akhaliq/status/1771019526265618889][MathVerse]]: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
  - extracts crucial reasoning steps to reveal the intermediate reasoning quality of MLLMs

*** CODE COMPLETION
- [[https://twitter.com/BrianRoemmele/status/1691852347377594764][DeciCoder]]: decoder-only code completion model
  - groups tokens into clusters and has each token attend only to others within its cluster
- [[https://twitter.com/_akhaliq/status/1731865391205425575][Magicoder]]: [[https://twitter.com/_akhaliq/status/1732121442223853950][Source]] Code Is All You Need
  - MagicoderS-CL-7B based on CodeLlama
- [[https://twitter.com/_akhaliq/status/1754340063675056536][StepCoder]]: Improve Code Generation with Reinforcement Learning from Compiler Feedback
  - breaks long-sequence code generation into a curriculum of code completion subtasks
  - masks segments so that optimization is applied properly

**** OPERATOR
- [[https://twitter.com/_akhaliq/status/1690946387171749888][Enhancing]] Network Management Using Code Generated by Large Language Models
  - program synthesis: generate task-specific code from natural language queries
  - analyzing network topologies and communication graphs

*** DIFFUSION
- [[https://twitter.com/_akhaliq/status/1718824268060893533][CodeFusion]]: A Pre-trained Diffusion Model for Code Generation ==diffusion== (75M vs 1B auto-regressive)
  - iterative denoising, no need to start from scratch
- [[https://twitter.com/_akhaliq/status/1719902516358328711][Text Rendering]] Strategies for Pixel Language Models
  - characters as images, handles any script; PIXEL model

*** TOOL-USE TOOLS
- [[https://huggingface.co/papers/2305.19234][Grammar Prompting]] for Domain-Specific Language Generation with Large Language Models
  - for DSLs such as programming languages
  - predicts a BNF grammar given an input, then generates the output according to the rules of that grammar
- [[https://twitter.com/_akhaliq/status/1686569710001758208][Tool Documentation]] Enables Zero-Shot Tool-Usage with Large Language Models
  - zero-shot prompts with only documentation are sufficient for tool usage
  - tool documentation > demonstrations
- [[https://twitter.com/_akhaliq/status/1718819055228907581][ControlLLM]]: Augment Language Models with Tools by Searching on Graphs
  - breaks a complex task down into clear subtasks, then searches for the optimal solution path
- [[https://github.com/xszyou/Fay][Fay]]: integrating language models and digital characters

** TRANSLATION
- [[https://github.com/emorynlp/elit][elit]]: NLP tools for tokenization, tagging, and language recognition
- translation prompt: https://boards.4channel.org/g/thread/92468569#p92470651
- [[https://twitter.com/_akhaliq/status/1732950154146169191][EMMA]]: Efficient Monotonic Multihead Attention
  - simultaneous speech-to-text translation on the Spanish-English translation task

** OPTIMIZATION
- OPRO: Optimization by PROmpting, [[https://twitter.com/_akhaliq/status/1699963552952397952][Large Language]] Models as Optimizers
  - each step generates new solutions from previously generated solutions
- [[https://twitter.com/_akhaliq/status/1702146221178015744][Large Language]] Models for Compiler Optimization
  - reducing instruction counts over the compiler baseline
- [[https://twitter.com/_akhaliq/status/1703583311526813829][EvoPrompt]]: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

*** CACHE
- [[https://twitter.com/_akhaliq/status/1734046582805344492][SparQ]] Attention: Bandwidth-Efficient LLM Inference
  - reduces memory bandwidth requirements within the attention blocks through selective fetching of the cached history (up to 8x)

** SUMMARIZATION
- thread summarizer https://labs.kagi.com/ai/sum?url=%3E%3E248633369
- [[https://twitter.com/RLanceMartin/status/1687130407425417216][LLM Use]] Case: Summarization (using langchain)
- [[https://twitter.com/_akhaliq/status/1701043650015207871][From Sparse]] to Dense: GPT-4 Summarization with Chain of Density Prompting
  - iteratively incorporates missing salient entities without increasing the length
- [[https://twitter.com/_akhaliq/status/1704677505821487613][LMDX]]: Language Model-based Document Information Extraction and Localization
  - methodology to adapt arbitrary LLMs for document information extraction (without hallucination)

* TEXT DIFFUSION
- parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]]
- [[https://arxiv.org/abs/2212.11685][GENIE]]: Large Scale Pre-training for Text Generation with Diffusion Model
- [[https://arxiv.org/abs/2305.08379][TESS]]: Text-to-Text Self-Conditioned Simplex Diffusion
- [[https://arxiv.org/abs/2305.09515][AR-Diffusion]]: Auto-Regressive Diffusion Model for Text Generation
- [[https://twitter.com/_akhaliq/status/1665936266372739074][PLANNER]]: Generating Diversified Paragraph via Latent Language Diffusion Model
- [[https://arxiv.org/abs/2404.06760][DiffusionDialog]]: A Diffusion Model for Diverse Dialog Generation with Latent Space
  - enhances the diversity of dialog responses while maintaining coherence

* TEXT GENERATION
- [[https://github.com/allenai/OLMo][allenai / OLMo]]: actually open-source AI model

** INFERENCE
*** BETTER
**** FOCUS THE ATTENTION
- [[https://twitter.com/_akhaliq/status/1721793530706682120][PASTA]]: Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
  - identifies a small subset of attention heads, then applies precise attention reweighting on them
  - applied alongside prompting
- S2A: [[https://twitter.com/_akhaliq/status/1726815889268359293][System]] 2 Attention (is something you might need too)
  - regenerates the context to include only the relevant portions before responding

*** FASTER
- [[https://twitter.com/_akhaliq/status/1689462088626782209][Accelerating]] LLM Inference with Staged Speculative Decoding
  - restructures the speculative batch as a tree (see the sketch below for the base idea)
- [[https://twitter.com/_akhaliq/status/1666646646103441410][MobileNMT]]: Enabling Translation in 15MB and 30ms
- [[https://twitter.com/_akhaliq/status/1720447630084276329][FlashDecoding++]]: Faster Large Language Model Inference on GPUs
  - inference engine, 2-4x speedup; flat GEMM optimization
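
To make the speculative-decoding idea above concrete, here is a minimal greedy sketch (the staged/tree variants build on this): a cheap draft model proposes a few tokens, the target model verifies them in a single forward pass, and the longest agreeing prefix plus one corrected token is kept. The names draft_model and target_model are hypothetical callables returning per-position logits, not a real API.

#+begin_src python
# Minimal sketch of (greedy) speculative decoding. `draft_model` and
# `target_model` are hypothetical stand-ins: callables that take a token
# list and return next-token logits for every position.
import numpy as np

def greedy_next(model, tokens):
    """Greedy next token according to `model` (hypothetical interface)."""
    return int(np.argmax(model(tokens)[-1]))

def speculative_step(draft_model, target_model, tokens, k=4):
    """Draft k tokens cheaply, verify them with one target pass, and keep
    the longest agreeing prefix plus one corrected token."""
    draft = list(tokens)
    for _ in range(k):                       # cheap proposals
        draft.append(greedy_next(draft_model, draft))

    # One target forward pass over the drafted sequence gives the target's
    # greedy choice at every drafted position.
    logits = target_model(draft)             # shape: [len(draft), vocab]
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        target_choice = int(np.argmax(logits[i - 1]))
        accepted.append(target_choice)       # always gain at least this token
        if target_choice != draft[i]:        # disagreement: stop accepting
            break
    return accepted
#+end_src
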
- [[https://twitter.com/_akhaliq/status/1726798167683862744][Exponentially]] Faster Language Modelling
  - replaces feedforward networks with fast feedforward networks (FFFs)
  - engages just 12 out of 4095 neurons for each layer inference, 78x speedup
- [[https://twitter.com/hongyangzh/status/1733169111625064833][EAGLE]]: [[https://github.com/SafeAILab/EAGLE][LLM decoding]] based on compression (compared against Medusa, Lookahead, Vanilla)
  - the sequence of second-top-layer features is compressible, making prediction of subsequent feature vectors from previous ones easy for a small model

*** MODELS
- [[https://twitter.com/jinaai_/status/1717067977819046007][jina-embeddings-v2]]: [[https://huggingface.co/jinaai/jina-embeddings-v2-base-en][8k]] context length, BERT architecture
- [[https://huggingface.co/01-ai/Yi-34B][Yi-34B]]: [[https://github.com/01-ai/Yi][6B]] [[https://twitter.com/AdeenaY8/status/1721475753441894498][and]] 34B, better than Llama 2 (has a benchmarks list)
- [[https://arxiv.org/abs/2403.04652][Yi]]: Open Foundation Models by 01.AI

**** QWEN
- Qwen-7B: surpasses both LLaMA 2 7B and 13B on MMLU, math, and code
- [[https://twitter.com/huybery/status/1754537742892232972][Qwen-1.5]] [[https://twitter.com/_akhaliq/status/1754545091434139732][space]]

**** LLAMA
- LLaMA [[https://www.reddit.com/r/StableDiffusion/comments/11h2wpv/comment/jb59lgt/][ipfs]]
- [[https://github.com/cocktailpeanut/dalai][in browser]] (there is also the cpp one)
- [[https://twitter.com/lvwerra/status/1681701409677246471][train all]] Llama-2 models on your own data

***** ALTERNATIVES
- [[https://github.com/openlm-research/open_llama][Open LLama]], [[https://huggingface.co/openlm-research/open_llama_7b_400bt_preview][Open-Source]] Reproduction, permissively licensed; [[https://github.com/Lightning-AI/lit-llama][Lit-LLaMA]], RedPajama dataset
- [[https://twitter.com/pcuenq/status/1664605575882366980][Falcon]]: new open-source family ==instruct finetuned too==
- [[https://twitter.com/_akhaliq/status/1743135851238805685][LLaMA Pro]]: Progressive LLaMA with Block Expansion
  - take a pretrained model, freeze its parameters, then add new blocks
  - adapts the model to new data without forgetting the old
- [[https://twitter.com/_akhaliq/status/1744009616562819526][LiteLlama]]: 460M parameters trained on 1T tokens
- [[https://twitter.com/_akhaliq/status/1762353396688806172][MobiLlama]]: Small Language Models (SLMs), open-source 0.5 billion (0.5B) parameters

**** MISTRAL
- [[https://huggingface.co/mistralai/Mistral-7B-v0.1][Mistral-7B]]: outperforms Llama 2 13B, Apache 2.0 licensed
- [[https://twitter.com/skunkworks_ai/status/1713372586225156392/photo/3][BakLLaVA]]: Mistral + vision model
- [[https://ollama.ai/library/zephyr][zephyr]]: [[https://twitter.com/_lewtun/status/1717816585786626550][fine-tuned]] [[https://twitter.com/_akhaliq/status/1718009133570408824][using]] Direct Preference Optimization
  - dataset ranked by a teacher model for intent alignment; much smaller: 7B vs 70B Llama
- [[https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF][OpenHermes-2]]: roleplay, GPT-4 dataset
- https://huggingface.co/TheBloke/openchat_3.5-GGUF
- [[https://huggingface.co/argilla/notux-8x7b-v1][notux]]: chat data

** TRAINING
- [[https://arxiv.org/pdf/2304.04947.pdf][Conditional Adapters]]: Parameter-efficient Transfer Learning with Fast Inference
- [[https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/imagebind_LLM][LLaMa-Adapter Multimodal]]!
  ([[https://twitter.com/lupantech/status/1664316926003396608][vision]])
- [[https://arxiv.org/abs/2304.05511][Training]] Large Language Models Efficiently with Sparsity and Dataflow
- [[https://huggingface.co/papers/2305.16843][Randomized]] Positional Encodings Boost Length Generalization of Transformers
- [[https://huggingface.co/papers/2305.16958][MixCE]]: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
  - reverse cross-entropy rather than maximum likelihood estimation (MLE) alone
- [[https://twitter.com/_akhaliq/status/1701423258430492710][Neurons in Large]] Language Models: Dead, N-gram, Positional
  - study: many neurons per layer are dead; some neurons specialize in removing information from the input
- [[https://huggingface.co/papers/2305.16765][Backpack Language]] Models: non-contextual sense vectors that specialize in encoding different aspects of a word
- [[https://twitter.com/_akhaliq/status/1717019165054144688][In-Context]] Learning Creates Task Vectors
  - in-context learning = compressing the training set into a single task vector, then using it to modulate the transformer to produce the output
- [[https://www.youtube.com/watch?v=409tNlaByds&t=1588][Efficient]] Streaming Language Models with Attention Sinks (==better inference or training==)
  - ==naive context-window caching is bad==, just keep the first tokens around (as is)
  - or it is better to have a static null token at the beginning of the window
  - related to the "Vision Transformers Need Registers" paper

*** CHEAPNESS
- [[https://research.myshell.ai/jetmoe][JetMoE]]: [[https://twitter.com/qinzytech/status/1775916338822709755][Reaching]] LLaMA2 Performance with 0.1M Dollars
  - and can be finetuned with a very limited computing budget

*** STRUCTURE
- [[https://twitter.com/_akhaliq/status/1672046849400909824][From Word Models]] [[https://arxiv.org/pdf/2306.12672.pdf][to World]] [[https://github.com/gabegrand/world-models][Models]]: Translating from Natural Language to the Probabilistic Language of Thought
  - probabilistic programming language for commonsense reasoning, linguistics
- [[https://twitter.com/_akhaliq/status/1768485746225156339][Quiet-STaR]]: Language Models Can Teach Themselves to Think Before Speaking
  - learn to generate rationales at each token to explain future text, improving their predictions

**** MERGING
- [[https://twitter.com/_akhaliq/status/1665887472335695873][LLM-Blender]]: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
  - (specialized) text model merging (using rankings)
- [[https://twitter.com/_akhaliq/status/1762344073820578145][FuseChat]]: Knowledge Fusion of Chat Models
  - knowledge fusion for LLMs of structurally diverse architectures and scales

**** SKELETON
- [[https://twitter.com/_akhaliq/status/1685900593779634176][Skeleton-of-Thought]]: Large Language Models Can Do Parallel Decoding
  - first a skeleton, then parallel filling; faster and better
- [[https://arxiv.org/abs/2303.09014][ART]]: [[https://github.com/bhargaviparanjape/language-programmes/][Automatic]] multi-step reasoning and tool-use for large language models
  - bubbles of logic
- [[https://twitter.com/_akhaliq/status/1726800556667085263][Orca 2]]: [[https://huggingface.co/microsoft/Orca-2-13b][Teaching]] Small Language Models How to Reason
  - reasoning techniques: step-by-step, recall then generate, recall-reason-generate, direct answer
- [[https://twitter.com/_akhaliq/status/1734047664516346022][PathFinder]]: Guided Search over Multi-Step Reasoning Paths
  - tree-search-based reasoning path generation approach (beam search algorithm)
  - improved commonsense reasoning tasks and complex arithmetic
- [[https://twitter.com/_akhaliq/status/1777177284954194213][Stream of]] Search (SoS): Learning to Search in Language
  - models can be taught to search by representing the process of search in language, as a flattened string

***** META-PROCESS TOKENS
- [[https://twitter.com/_akhaliq/status/1692045755178226053][Teach LLMs]] to Personalize -- An Approach inspired by Writing Education
  - retrieval, ranking, summarization, synthesis, and generation
- [[https://twitter.com/_akhaliq/status/1691730510408822885][Link-Context]] Learning for Multimodal LLMs
  - causal associations between data points = cause and effect
  - In-Context Learning (ICL) = learning to learn: from limited tasks (provided as demonstrations), generalize to unseen tasks
- LoGiPT: [[https://twitter.com/_akhaliq/status/1723898514885824936][Language Models]] can be Logical Solvers
  - parses natural-language logical questions into symbolic representations, emulates logical solvers

**** CORPUS STRUCTURE, RETRIEVAL
- [[https://github.com/facebookresearch/NPM][NPM]]: Nonparametric Masked Language Modeling, vs GPT-3, text-corpus based
  - other code implementations: https://www.catalyzex.com/paper/arxiv:2212.01349/code
- [[https://twitter.com/_akhaliq/status/1691734334057963733][RAVEN]]: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
  - in-context learning in retrieval-augmented language models

***** LLM AS ENCODER
- [[id:316325a1-f24b-487d-9238-ca35db3a6b0c][GZIP VS GPT]]
- [[https://twitter.com/_akhaliq/status/1680740847128653829][Copy]] Is All You Need
  - the task of text generation decomposed into a series of copy-and-paste operations
  - text spans rather than vocabulary tokens
  - learning = a text compression algorithm? (see the gzip+kNN sketch below)
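
The bullets above (and the gzip/KNN line that follows) boil down to the idea that a compressor plus a distance metric can act as a classifier. A minimal sketch of that idea using zlib and normalized compression distance; the training examples below are made up for illustration.

#+begin_src python
# Compression-as-classifier sketch: no training, just normalized compression
# distance (NCD) and a k-nearest-neighbour vote over labelled texts.
import zlib

def clen(s: str) -> int:
    """Compressed length of a string."""
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label `query` by majority vote over its k NCD-nearest training texts."""
    neighbours = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Toy labelled corpus (illustrative only).
train = [
    ("the model overfits without dropout", "ml"),
    ("gradient descent converges slowly here", "ml"),
    ("the pasta needs more salt and basil", "cooking"),
    ("simmer the sauce for twenty minutes", "cooking"),
]
print(knn_classify("tune the learning rate for gradient descent", train, k=3))
#+end_src
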
- Decoding the ACL Paper: Gzip and KNN Rival BERT in Text Classification
- [[https://arxiv.org/abs/2404.05961][LLM2Vec]]: Large Language Models Are Secretly Powerful Text Encoders
  - LLMs can be effectively transformed into universal text encoders without the need for expensive adaptation

*** QUANTIZATION
- int-3 quantization: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and [[https://twitter.com/NolanoOrg/status/1635409631530057728][twitter]]
- llama.cpp [[https://github.com/ggerganov/llama.cpp/pull/301][quantization]]
- [[https://www.reddit.com/r/LocalLLaMA/comments/13yehfn/new_quantization_method_awq_outperforms_gptq_in/][AWQ]]: Activation-aware Weight Quantization for LLM Compression and Acceleration
  - outperforms GPTQ in 4-bit and 3-bit with 1.45x speedup, and works with multimodal LLMs
- [[https://github.com/Vahe1994/SpQR][SpQR]] [[https://www.reddit.com/r/LocalLLaMA/comments/142ij29/yet_another_quantization_method_spqr_by_tim/][method]] for LLM compression: highly sensitive parameters are not quantized
- [[https://twitter.com/_akhaliq/status/1696057203978100775][OmniQuant]]: Omnidirectionally Calibrated Quantization for Large Language Models
  - no more hand-crafted quantization parameters
- [[https://twitter.com/_akhaliq/status/1717384066750685620][LLM-FP4]]: 4-Bit Floating-Point Quantized Transformers, only 5.8% lower on reasoning than the full-precision model
- [[https://twitter.com/_akhaliq/status/1755416334685417928][BiLLM]]: Pushing the Limit of Post-Training Quantization for LLMs
  - identifies and structurally selects salient weights
  - quantizes 7 billion weights within 0.5 hours
- [[https://twitter.com/_akhaliq/status/1765209290053238788][EasyQuant]]: An Efficient Data-free Quantization Algorithm for LLMs
  - leaves the outliers (less than 1%) unchanged, implemented in parallel

**** 1-BIT
- [[https://twitter.com/_akhaliq/status/1714483549716320739][BitNet]]: Scaling 1-bit Transformers for Large Language Models
  - vs 8-bit quantization architectures
- [[https://twitter.com/_akhaliq/status/1717385001031946494][QMoE]]: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  - can compress a 1.6-trillion-parameter model to less than 160GB (20x compression, 0.8 bits per parameter)

**** LORA WITH QUANTIZATION
- [[https://twitter.com/_akhaliq/status/1706863594917269514][QA-LoRA]]: Quantization-Aware Low-Rank Adaptation of Large Language Models
- [[https://twitter.com/_akhaliq/status/1661177995049172992][QLoRA]]: Efficient Finetuning of Quantized LLMs, 24 hours on a single 48GB GPU
- [[https://twitter.com/_akhaliq/status/1713739581097398618][LoftQ]]: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
  - outperforms QLoRA

*** FINETUNING
- [[https://arxiv.org/abs/2303.16199][LLaMA-Adapter]]: [[https://github.com/ZrrSkywalker/LLaMA-Adapter][Efficient Fine-tuning]] of Language Models with Zero-init Attention
- [[https://arxiv.org/pdf/2302.14691.pdf][In-Context]] [[https://github.com/seonghyeonye/ICIL][Instruction]] Learning (ICIL)
- [[https://twitter.com/_akhaliq/status/1719217779406954805][LoRAShear]]: Efficient Large Language Model Structured Pruning and Knowledge Recovery
  - distillation

**** FEEDBACK AS TARGET
:PROPERTIES:
:ID: ad5a8c1e-10c2-4155-86fe-ecbfa1ffcd07
:END:
- [[MULTIPLE LLM]]
- RLHF = Reinforcement Learning from Human Feedback
- [[https://arxiv.org/abs/2305.18290][Direct Preference]] Optimization: Your Language Model is Secretly a Reward Model (DPO)
  - can fine-tune LMs to align with human preferences, better than RLHF
- RAD: [[https://twitter.com/_akhaliq/status/1714099101690642724][Reward-Augmented]] Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
  - generation that uses an extra reward model to produce text with certain properties
- [[https://twitter.com/_akhaliq/status/1747820246268887199][ReFT]]: Reasoning with Reinforced Fine-Tuning
  - learns from multiple annotated reasoning paths
  - rewards are naturally derived from the ground-truth answers (as in math)

***** SELF TRAIN
- [[https://twitter.com/_akhaliq/status/1716302566592479486][TriPosT]]: Teaching Language Models to Self-Improve through Interactive Demonstrations
  - self-improvement ability for small models: revise their own outputs, correcting their own mistakes
- [[https://selfrefine.info/][Self-Refine]]: Iterative Refinement with Self-Feedback

**** CHEAPNESS
- [[https://huggingface.co/papers/2305.17333][Fine-Tuning Language]] Models with Just Forward Passes, less RAM
- [[https://twitter.com/_akhaliq/status/1670678532349915138][Full Parameter]] Fine-tuning for Large Language Models with Limited Resources, low-memory optimizer

***** MULTIPLE LLM
- EFT: [[https://twitter.com/_akhaliq/status/1715236713436418120][An Emulator]] for Fine-Tuning Large Language Models using Small Language Models
  - avoids resource-intensive fine-tuning of LLMs by ensembling them with small fine-tuned models
  - also: scaling up finetuning improves helpfulness, scaling up pre-training improves factuality
- [[https://twitter.com/_akhaliq/status/1716301330283671719][Tuna]]: Instruction Tuning using Feedback from Large Language Models
  - finetuning with contextual ranking
- [[https://twitter.com/_akhaliq/status/1715237306506813678][AutoMix]]: Automatically Mixing Language Models
  - strategically routes queries to a larger LLM, based on the outputs from a smaller LM

**** ADDITIVE METHODS
- [[LORA WITH QUANTIZATION]]
- [[https://twitter.com/_akhaliq/status/1684030297661403136][LoraHub]]: [[https://twitter.com/sivil_taram/status/1684513568950210560][Efficient]] Cross-Task Generalization via Dynamic LoRA Composition
  - LoRA composability for cross-task generalization; neither more parameters nor gradients
- [[https://twitter.com/_akhaliq/status/1723910609857663257][Parameter-Efficient]] Orthogonal Finetuning via Butterfly Factorization

***** LORA
- [[https://github.com/tloen/alpaca-lora][alpaca-lora]]
- sentence transformers: [[https://github.com/huggingface/setfit][SetFit]] - efficient few-shot learning
- [[https://huggingface.co/blog/trl-peft][peft]] [[https://twitter.com/younesbelkada/status/1633867640564486144][twitter]] [[https://github.com/huggingface/peft][repo]]
- [[https://www.youtube.com/watch?v=oPS-8nKGu8U][PEFT]] w/ Multi LoRA explained (LLM fine-tuning)

*** MEMORY
- [[https://arxiv.org/abs/2203.08913][Memorizing]] [[https://twitter.com/nearcyan/status/1637891562385317897][Transformers]] [[https://github.com/google-research/meliad][repo]]
  - the Memorizing Transformer does not need to be pre-trained from scratch; it is possible to add memory to an existing pre-trained model and then fine-tune it
- [[https://huggingface.co/papers/2305.16338][Think Before]] You Act: Decision Transformers with Internal Working Memory, task-specialized memory
- [[https://twitter.com/_akhaliq/status/1726796663979643174][Memory]] Augmented Language Models through Mixture of Word Experts
  - Mixture of Word Experts (MoWE), a Mixture-of-Experts (MoE) variant
  - a set of word-specific experts plays the role of a sparse memory; similar performance to more complex memory-augmented approaches
- [[https://twitter.com/_akhaliq/status/1757235218316996896][Fiddler]]: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
  - minimizes data movement between the CPU and GPU
  - Mixtral-8x7B model, 90GB of parameters, over 3 tokens per second on a single GPU with 24GB memory
- [[https://twitter.com/AnimaAnandkumar/status/1765613815146893348][GaLore]]: Memory-Efficient LLM Training by Gradient Low-Rank Projection ==best==
  - feasibility of pre-training a 7B model on GPUs with 24GB memory; unlike LoRA
  - 82.5% reduction in memory

**** CONTEXT LENGTH
- [[VECTOR DB]]
- [[https://twitter.com/_akhaliq/status/1668436285822836737][Augmenting]] Language Models with Long-Term Memory (unlimited context)
- [[https://twitter.com/_akhaliq/status/1698497385230389585][YaRN]]: Efficient Context Window Extension of Large Language Models
- [[https://twitter.com/_akhaliq/status/1701774889659572288][Efficient]] Memory Management for Large Language Model Serving with PagedAttention
  - vLLM: near-zero waste in KV cache memory, and flexible
- [[https://nitter.net/tri_dao/status/1712904220519944411][Flash-Decoding]]: make long-context LLM inference up to 8x faster
  - load the KV cache in parallel as fast as possible, then separately rescale to combine the results
- [[https://twitter.com/_akhaliq/status/1744181094025433327][Infinite-LLM]]: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  - LLM serving system dynamically managing the KV cache, orchestrated across the data center
- [[https://twitter.com/_akhaliq/status/1747515567492174185][Extending]] LLMs' Context Window with 100 Samples
  - introduces a novel extension to RoPE so that it can adapt to larger context windows (efficiently)
  - demonstrated on Llama

*** DATASET
- [[SKELETON]]
- [[https://arxiv.org/pdf/2305.11206.pdf][LIMA]]: [[https://twitter.com/_akhaliq/status/1660458199504556034][Less]] Is More for Alignment
  - trained on only 1,000 carefully curated prompts and responses
- [[https://arxiv.org/abs/2304.14318][q2d]]: Turning Questions into Dialogs to Teach Models How to Search
  - synthetically generated data achieves 90%--97% of the performance of training on human-generated data
- [[https://huggingface.co/papers/2305.16635][Impossible Distillation]]: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
  - high-quality model and dataset from a low-quality teacher model
- [[https://twitter.com/_akhaliq/status/1689120315832483841][Simple synthetic]] data reduces sycophancy in large language models
  - sycophancy = adapting answers to a user's stated views, even toward statements that are objectively incorrect
  - lightweight finetuning step
- [[https://twitter.com/_akhaliq/status/1699951105927512399][GPT Can]] Solve Mathematical Problems Without a Calculator; when the training data includes multi-digit arithmetic
- [[https://twitter.com/_akhaliq/status/1719220065655013542][TeacherLM]]: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
  - annotating the dataset with "why" instead of only "what"
- LeMa: [[https://twitter.com/_akhaliq/status/1719542744710824024][Learning]] From Mistakes Makes LLM Better Reasoner
  - identify, explain, and correct mistakes, using the LLM itself to build finetuning data (learn from mistakes); see the sketch below
- [[https://twitter.com/_akhaliq/status/1721759755847303314][Ziya2]]: Data-centric Learning is All LLMs Need
  - focuses on pre-training techniques and data-centric optimization to enhance the learning process
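
As a rough illustration of the LeMa-style "learn from mistakes" recipe noted above, here is a sketch of the data-building step. The generate/correct callables, the prompt wording, and the exact-match grading are assumptions for illustration, not the paper's exact setup.

#+begin_src python
# Sketch of a "learn from mistakes" data pipeline: sample an answer from the
# student model; when it is wrong, ask a corrector/teacher model to identify,
# explain, and fix the mistake; store (prompt, correction) pairs as finetuning
# data. `generate` and `correct` are user-supplied callables (hypothetical
# interfaces to whatever inference API you use).
import json
from typing import Callable, Iterable, Tuple

def build_mistake_correction_set(
    problems: Iterable[Tuple[str, str]],       # (question, reference answer)
    generate: Callable[[str], str],            # student model
    correct: Callable[[str, str, str], str],   # teacher: (q, attempt, ref) -> explanation + fix
    outfile: str = "learn_from_mistakes.jsonl",
) -> int:
    """Write one JSONL record per corrected mistake; return how many were kept."""
    kept = 0
    with open(outfile, "w") as f:
        for question, reference in problems:
            attempt = generate(question)
            if attempt.strip() == reference.strip():
                continue                        # correct answer: nothing to learn from
            record = {
                "prompt": f"{question}\nStudent answer: {attempt}\n"
                          "Identify the mistake, explain it, and give the corrected answer.",
                "completion": correct(question, attempt, reference),
            }
            f.write(json.dumps(record) + "\n")
            kept += 1
    return kept
#+end_src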