:PROPERTIES:
:ID: cb192d74-71e5-40c3-8763-6f68ffde8e27
:END:
#+title: train
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- multiple models working together:
  - diffusion: [[id:6a66690f-b76f-441a-a093-3c83ca73af2d][MULTIPLE DIFFUSION]], GIT RE-BASIN
  - text: RAD, EFT
  - logistic: Auto-Instruct
- [[https://twitter.com/_akhaliq/status/1678970340033150977][Self-Supervised]] Learning with Lie Symmetries for Partial Differential Equations
  - computationally efficient alternatives to numerical solvers
  - self-supervised learning of general-purpose representations of PDEs from heterogeneous data
- [[https://twitter.com/HarperSCarroll/status/1728061267624014219][Q*]]: [[https://www.youtube.com/watch?v=4qkKpNnSrlY&t=9][New Objective]]: Q-Learning and Q* - Decision Making Under Uncertainty (CS238/AA228)
  - Q-learning parallels biological reward neurocircuitry; reinforcement learning (RL)
- [[https://twitter.com/_akhaliq/status/1737716034088386604][Model-Based]] Control with Sparse Neural Dynamics (aggressive sparsification, distillation)
  - sparsify the network by removing redundant neurons; applicable to a wide variety of DNNs
- [[https://twitter.com/_akhaliq/status/1749289751058710685][Zero Bubble]] Pipeline Parallelism
  - algorithm for the optimal schedule given a configuration and memory limit
- [[https://twitter.com/_akhaliq/status/1767393991727657262][Fuyou]]: Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
  - training with a low-end GPU and limited CPU memory capacity
- [[https://twitter.com/_akhaliq/status/1777155790417150111][Direct Nash]] Optimization: Teaching Language Models to Self-Improve with General Preferences
  - post-trains an LLM using preference feedback from a teacher model to iteratively improve over itself
  - marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences
- [[https://twitter.com/_akhaliq/status/1778592009067925649][From Words]] to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
  - rivals supervised methods such as Random Forest, Bagging, or Gradient Boosting
* RESEARCH
- [[https://arxiv.org/abs/1606.01981][Deep neural]] networks are robust to weight binarization and other non-linear distortions
  - 0.68 effective bits per weight (below 1-bit models)
  - points to the idea that a stochastic memory element can be used
** MATMUL FREE
:PROPERTIES:
:ID: b420e2cc-c219-43ef-baa6-e913a4690872
:END:
- [[https://twitter.com/rohanpaul_ai/status/1799122826114330866][Scalable]] [[https://github.com/ridgerchu/matmulfreellm][MatMul-free]] Language Modeling
  - replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1} (see the sketch below)
  - reduces computational cost and memory while preserving network expressiveness
  - GPUs? they removed matmuls but still use the Hadamard product (element-wise product), which is also embarrassingly parallel and GPU-friendly
  - using flexible GPUs for training and FPGAs/ASICs for inference is the optimal tradeoff here
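A minimal sketch of the ternary trick, assuming a PyTorch setting (the absmean scaling and the inference-only style are assumptions, not the repo's exact BitLinear code): with weights constrained to {-1, 0, +1}, every "multiplication" reduces to keep / negate / drop followed by accumulation, which is what makes multiplier-free FPGA/ASIC inference attractive.

#+begin_src python
import torch

def ternarize(w: torch.Tensor):
    # Per-tensor absmean scale, then round weights into {-1, 0, +1}
    # (assumption: a BitNet-style recipe; training would also need a
    # straight-through estimator, omitted here).
    scale = w.abs().mean().clamp(min=1e-8)
    w_ternary = torch.clamp(torch.round(w / scale), -1, 1)
    return w_ternary, scale

def ternary_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # y = x @ W^T with W in {-1, 0, +1}: every product is +x, -x, or 0,
    # so the dense layer is pure accumulation.  On a GPU it is still
    # expressed as a matmul; dedicated hardware needs no multipliers.
    w_ternary, scale = ternarize(w)
    return (x @ w_ternary.t()) * scale

x = torch.randn(2, 8)
w = torch.randn(16, 8)
print(ternary_linear(x, w).shape)  # torch.Size([2, 16])
#+end_src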
* SOFTWARE WISE
- optimizer from 32 bits to 8 bits
- https://github.com/pyg-team/pytorch_geometric
- faster matrix multiplication using approximations - https://github.com/dblalock/bolt
* WITH REWARD
- feedback: [[id:ad5a8c1e-10c2-4155-86fe-ecbfa1ffcd07][FEEDBACK AS TARGET]] [[id:59d1d337-eff3-42bb-9398-1e51b0739074][HUMAN FEEDBACK]] [[id:4daacc49-2790-49c2-a32a-880c5f99e681][PROPER-ING INSTRUCTIONS]]
- [[https://arxiv.org/abs/2310.03739][AlignProp]]: Aligning Text-to-Image Diffusion Models with Reward Backpropagation
  - aligns diffusion models to reward functions
- [[https://twitter.com/_akhaliq/status/1716305579101069478][CPL]]: Contrastive Preference Learning: Learning from Human Feedback without RL
  - learns optimal policies from preferences without learning reward functions
  - regret-based model of human preferences instead of reward
** CLIP AS REWARD
:PROPERTIES:
:ID: 9bec56a3-a402-418d-bc67-40b3165089c3
:END:
- [[https://twitter.com/_akhaliq/status/1715244883659661790][Vision-Language]] Models are Zero-Shot Reward Models for Reinforcement Learning
  - hand-written reward functions are often infeasible; reward models from human feedback are often very expensive
  - VLMs (CLIP) as reward models: a single-sentence text prompt describing the desired task
** REINFORCEMENT LEARNING
- [[https://twitter.com/_akhaliq/status/1717390896788873543][TD-MPC2]]: Scalable, Robust World Models for Continuous Control
  - a single agent performs 80 tasks across multiple task domains, embodiments, and action spaces
  - performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model
** LLM AS REWARD
:PROPERTIES:
:ID: 2fefb31b-1809-49c0-b925-a7b9a6fa3b0b
:END:
- [[https://twitter.com/arankomatsuzaki/status/1706311844829487153][Text2Reward]]: [[https://text-to-reward.github.io/][Automated]] Dense Reward Function Generation for Reinforcement Learning
  - automates the generation of dense reward functions with an LLM
- [[https://twitter.com/_akhaliq/status/1715184868294889490][Eureka]]: [[https://twitter.com/DrJimFan/status/1715397393842401440][Human-Level]] Reward Design via Coding Large Language Models
  - generates reward functions that outperform expert human-engineered rewards
  - complex skills can now be acquired via reinforcement learning by optimizing over the generated reward, enabling sequential decision-making tasks
  - in-context RLHF to incorporate feedback and steer/align the reward function
  - outer loop: an inference-only LLM instructs a learnable NN to refine the reward function
  - inner loop: reinforcement learning trains a controller
  - example: pen spinning
* STRUCTURE
- [[id:e261c214-31a2-4d93-a62b-61d7d53b702c][LORA]]
- [[https://github.com/facebookresearch/ConvNeXt][ConvNeXt]] (vs ViT, for image classification)
  - accurate, efficient, scalable and very simple in design
  - for: zero-shot image classification, image and text retrieval
  - CLIP ConvNeXt: https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft (320x320)
- [[https://twitter.com/_akhaliq/status/1734422847915721136][CNCA]]: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing
  - replaces linear recurrence with a special temporal convolutional network (see the sketch below this list)
  - permits a larger receptive field with shallower networks
  - reduces the computational complexity to O(L)
- [[https://twitter.com/_akhaliq/status/1741659775673184467][PanGu-π]]: Enhancing Language Model Architectures via Nonlinearity Compensation
  - a shortcut is used to enhance the model's nonlinearity; ~10% inference speed-up
  - nonlinearity is usually studied in convolutional networks for vision tasks
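A generic dilated causal temporal-convolution sketch, illustrating the receptive-field and O(L) points above (this is not CNCA's actual architecture; channel counts, kernel size, and depth are made up):

#+begin_src python
import torch
import torch.nn as nn

class DilatedCausalTCN(nn.Module):
    # Stack of dilated causal 1-D convolutions: per-layer cost is O(L) in the
    # sequence length, and doubling the dilation each layer makes the
    # receptive field grow exponentially with depth.
    def __init__(self, channels: int = 64, kernel_size: int = 4, num_layers: int = 6):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=2 ** i, padding=(kernel_size - 1) * 2 ** i)
            for i in range(num_layers)
        ])
        self.receptive_field = 1 + sum((kernel_size - 1) * 2 ** i for i in range(num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, length)
        for conv in self.convs:
            h = conv(x)[..., : x.shape[-1]]  # trim right padding -> causal
            x = torch.relu(h) + x            # residual connection
        return x

tcn = DilatedCausalTCN()
print(tcn.receptive_field)                 # 190 timesteps from only 6 layers
print(tcn(torch.randn(1, 64, 256)).shape)  # torch.Size([1, 64, 256])
#+end_src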
** HYPERPARAMETER
- muP proposes the "right way to scale": an effective weight-init scheme for searching the optimal hyperparameters
* CLASSIFIER
** GZIP VS GPT
:PROPERTIES:
:ID: 316325a1-f24b-487d-9238-ca35db3a6b0c
:END:
- are LLMs just text compression algorithms?
- [[https://twitter.com/_akhaliq/status/1666644201705029632][LLMZip]]: Lossless Text Compression using Large Language Models
- gzip instead of parameters for classification
  - "[[https://aclanthology.org/2023.findings-acl.426.pdf][Low-Resource]]" Text Classification: A Parameter-Free Classification Method with Compressors
* SMALLER
** COMPRESSION
- [[https://lemmy.dbzer0.com/post/12260097][Knowledge Translation]]: A New Pathway for Model Compression
  - a teacher-student model that receives parameters and generates compressed ones
** QUANTIZATION
- [[id:385118af-d780-4cce-ad5a-46b3ecb11db7][DIFFUSION QUANTIZATION]]
- [[https://arxiv.org/pdf/2303.10512.pdf][AdaLoRA]] adaptively allocates the parameter budget among weight matrices according to their importance (adaptive LoRA)
- [[https://twitter.com/_akhaliq/status/1688791126080126976][FLIQS]]: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
  - mixed-precision quantization that eliminates the need for retraining
* OPTIMIZER
- [[https://twitter.com/DrJimFan/status/1625920782332489729][Lion]]: optimizer, better than Adam
- [[https://arxiv.org/abs/2302.03764][Sketchy]]: [[https://twitter.com/FeinbergVlad/status/1623540032832413696][Memory-efficient]] Adaptive Regularization with Frequent Directions
  - Kronecker-factored diagonal eigenvalues, Frequent Directions
* CHEAPNESS
- [[https://huggingface.co/papers/2307.03576][One Step of]] Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
- [[https://twitter.com/_akhaliq/status/1683670100350742528][Optimized]] Network Architectures for Large Language Model Training with Billions of Parameters
  - only small subgroups of GPUs require high-bandwidth any-to-any communication within them
* DATASET
:PROPERTIES:
:ID: 3b228325-e1af-4fc5-857b-fe5933e20b03
:END:
- [[id:aeca80bb-38f3-4343-a214-67e3b4df245e][CAPTIONING]]
- dimensionality reduction algorithms
  - t-SNE and UMAP had long been the favorites
  - "Deep TDA" combines self-supervised learning and Topological Data Analysis (TDA)
    - unlocks new insights from complex datasets
    - more robust to noise and outliers in the data
- [[https://twitter.com/_akhaliq/status/1732966139762819390][Gen2Det]]: Generate to Detect
  - directly generates scene-centric (synthetic) images
  - improves performance on rare categories
- [[https://arxiv.org/abs/2401.04441][Image]] classification network enhancement methods based on knowledge injection
  - a knowledge-injection dataset to improve interpretability and classification performance of hidden layers
- [[https://twitter.com/_akhaliq/status/1765052609952436422][MovieLLM]]: Enhancing Long Video Understanding with AI-Generated Movies
  - generates a script and the corresponding video as a dataset
** MISTAKES
- [[https://twitter.com/_akhaliq/status/1755784642827874416][In-Context]] Principle Learning from Mistakes
  - induce the model to make mistakes, reflect on those mistakes, and learn explicit task-specific "principles" from them that help avoid similar mistakes (see the sketch below)
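A rough sketch of that two-stage prompting idea (the =llm= callable, the prompt wording, and the exact-match correctness check are all made-up placeholders, not the paper's protocol):

#+begin_src python
def learn_principles(llm, train_examples):
    """Stage 1: collect the model's mistakes on a few labelled examples and
    ask it to distill explicit, task-specific principles from them."""
    mistakes = []
    for question, gold in train_examples:
        answer = llm(f"Question: {question}\nAnswer:")
        if answer.strip().lower() != gold.strip().lower():  # naive correctness check
            mistakes.append((question, answer, gold))
    report = "\n\n".join(
        f"Question: {q}\nWrong answer: {a}\nCorrect answer: {g}" for q, a, g in mistakes
    )
    return llm(
        "Below are mistakes made on this task, with corrections.\n\n"
        f"{report}\n\n"
        "Write a short list of general principles that would avoid such mistakes."
    )

def answer_with_principles(llm, principles, question):
    """Stage 2: prepend the learned principles to every test-time prompt."""
    return llm(
        f"Principles for this task:\n{principles}\n\n"
        f"Question: {question}\nAnswer:"
    )

# usage (hypothetical chat-completion callable `my_llm`):
# principles = learn_principles(my_llm, [("2+2?", "4"), ("3*3?", "9")])
# answer_with_principles(my_llm, principles, "7*8?")
#+end_src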
** ACTUAL DATASET
- [[https://huggingface.co/datasets/gvecchio/MatSynth][MatSynth]]: Physically Based Rendering (PBR) materials dataset (4,000 ultra-high resolution)
- [[https://browse.arxiv.org/abs/2402.01355][FindingEmo]]: An Image Dataset for Emotion Recognition in the Wild
  - annotated dimensions include valence, arousal and emotion
- [[https://twitter.com/storytracer/status/1765410706638160303][English public]] domain books
*** HANDS DATASET
:PROPERTIES:
:ID: 3f752b46-cae4-49d9-948d-50e3c500727e
:END:
- [[https://arxiv.org/abs/2401.15075][Annotated Hands]] for Generative Models
  - three additional channels provide annotations for the hands in the image, adding structure
** ENHANCEMENT
- [[id:f03ccf94-1aa5-4705-89af-617a22570e26][AUDIO VISION]]
- [[https://twitter.com/_akhaliq/status/1691914689926840809][Learning]] to Identify Critical States for Reinforcement Learning from Videos
  - mask-based sensitivity analysis to extract/identify important critical states
  - recognizes relevant states/actions/rewards from untagged videos
- [[https://twitter.com/_akhaliq/status/1716307038764933189][Let's Synthesize]] Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
  - uses an LLM to extrapolate the errors made by a small model trained on the synthesized dataset
- [[https://arxiv.org/abs/2312.02548][GeNIe]]: Generative Hard Negative Images Through Diffusion (synthetically enhanced dataset)
  - generates challenging samples for the target category
- DistDiff: [[https://github.com/haoweiz23/DistDiff][Distribution-Aware]] Data Expansion with Diffusion Models
  - dataset expansion framework based on a distribution-aware diffusion model
  - hierarchical prototypes to approximate the real data distribution
** SIMULATION
:PROPERTIES:
:ID: ba0ec473-43d7-4211-98a8-da6ad853b696
:END:
- [[https://twitter.com/kayvonf/status/1688582905394757633][madrona-engine]]: [[https://madrona-engine.github.io/][ECS-based]] game engine that runs 10,000s of environments in parallel on a single GPU
- [[https://virl-platform.github.io/][V-IRL]]: Grounding Virtual Intelligence in Real Life
  - tests foundation models in virtual real-world cities, with geospatial data and street view imagery
* FINETUNING
- [[https://arxiv.org/abs/2401.04105][Dr2Net]]: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
  - a surrogate network to finetune a pretrained model with substantially reduced memory consumption
  - comparable performance to conventional finetuning but with significantly less memory usage
- [[https://arxiv.org/abs/2401.15657][Data-Free]] [[https://github.com/ylong4/DFZSL][Generalized]] Zero-Shot Learning (using only its CLIP features)
- [[https://arxiv.org/abs/2403.02334][Gradient Correlation]] Subspace Learning against Catastrophic Forgetting
  - detects a subspace of the weights that is least affected by previous tasks and trains the new task within that subspace
- [[https://twitter.com/_akhaliq/status/1770675608575435046][Evolutionary]] Optimization of Model Merging Recipes
  - facilitates cross-domain merging, automated model composition
- [[https://twitter.com/_akhaliq/status/1772828395107192981][The Unreasonable]] Ineffectiveness of the Deeper Layers
  - identifies the optimal block of layers to prune by considering similarity across layers (see the sketch below)
  - then, to "heal" the damage, a small amount of finetuning is performed
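A rough sketch of that layer-similarity pruning criterion (a reading of the idea, not the paper's code; the =hidden_states= list of per-layer activations is assumed to come from a forward pass that returns all hidden states):

#+begin_src python
import torch
import torch.nn.functional as F

def most_prunable_block(hidden_states, n):
    """hidden_states: list of (tokens, dim) activations entering each layer.
    Return the start index l of the n-layer block whose input and output
    representations are most similar (smallest angular distance), i.e. the
    block whose removal should perturb the residual stream the least."""
    best_l, best_dist = 0, float("inf")
    for l in range(len(hidden_states) - n):
        a = F.normalize(hidden_states[l], dim=-1)
        b = F.normalize(hidden_states[l + n], dim=-1)
        cos = (a * b).sum(-1).clamp(-1.0, 1.0)             # per-token cosine similarity
        dist = torch.arccos(cos).mean().item() / torch.pi  # mean angular distance
        if dist < best_dist:
            best_l, best_dist = l, dist
    return best_l

# Toy usage with random activations standing in for a real model's layers;
# after dropping layers best_l .. best_l+n-1, a short (Q)LoRA finetune "heals" the model.
layers = [torch.randn(128, 512) for _ in range(25)]
print(most_prunable_block(layers, n=4))
#+end_src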
** FINETUNES
*** YOLO
- https://docs.ultralytics.com/yolov5/tutorials/train_custom_data/#train-on-custom-data
* GAN ALTERNATIVE
- [[id:9e94f7d8-752f-48e9-9ef1-9c79eba258e3][ONE STEP DIFFUSION]]