multiple models working together:
diffusion: MULTIPLE DIFFUSION GIT RE-BASIN
text: RAD, EFT
logistic: Auto-Instruct
Self-Supervised Learning with Lie Symmetries for Partial Differential Equations
computationally efficient alternatives to numerical solvers
self-supervised learning of general-purpose representations of PDEs from heterogeneous data
Q* New Objective Q-Learning and Q* - Decision Making Under Uncertainty (CS238/AA228)
Q-learning parallels biological reward neurocircuitry (reinforcement learning, RL)
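A minimal tabular Q-learning sketch, assuming a toy chain environment I made up for illustration; only the Bellman update line is the standard algorithm:

```python
import numpy as np

# Hypothetical toy MDP: a 5-state chain with 2 actions (illustrative only).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    # Placeholder dynamics: action 1 moves right, action 0 moves left;
    # reward 1 only when reaching the last state.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for episode in range(500):
    s = 0
    for t in range(50):
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if r == 1.0:
            break
```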
Model-Based Control with Sparse Neural Dynamics (aggressive sparsification, distillation)
sparsify it by removing redundant neurons; applicable to a wide variety of DNNs
Zero Bubble Pipeline Parallelism
an algorithm that finds the optimal schedule given the configuration and memory limit
Fuyou Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
training with a low-end GPU and limited CPU memory capacity
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
post-trains an LLM using preference feedback from a teacher model to iteratively improve over itself
marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
rivaling supervised methods such as Random Forest, Bagging, or Gradient Boosting
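A hedged sketch of how such in-context regression can be probed: serialize (x, y) pairs as text, append the query x, and parse the numeric completion. The prompt format and the `query_llm` callable are assumptions, not the paper's exact setup:

```python
import re

def build_regression_prompt(train_xs, train_ys, query_x):
    # Serialize in-context examples as plain "Input/Output" lines.
    lines = [f"Input: {x:.3f}\nOutput: {y:.3f}" for x, y in zip(train_xs, train_ys)]
    lines.append(f"Input: {query_x:.3f}\nOutput:")
    return "\n\n".join(lines)

def llm_regress(query_llm, train_xs, train_ys, query_x):
    # query_llm is a placeholder for any text-completion endpoint.
    completion = query_llm(build_regression_prompt(train_xs, train_ys, query_x))
    # Parse the first number in the completion as the prediction.
    match = re.search(r"-?\d+(\.\d+)?", completion)
    return float(match.group()) if match else None
```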
Deep neural networks are robust to weight binarization and other non-linear distortions
0.68 effective bits per weight (below 1 bit models)
points to the idea that a stochastic memory element can be used
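A minimal sketch of the kind of distortion being tested: binarize each weight matrix to ±alpha and check how little accuracy drops. The layer-wise scale alpha = mean(|W|) is a common choice and an assumption here, not necessarily the paper's exact recipe:

```python
import torch

@torch.no_grad()
def binarize_weights(model: torch.nn.Module) -> torch.nn.Module:
    # Replace each weight tensor W with alpha * sign(W), where
    # alpha = mean(|W|) roughly preserves the layer's weight scale.
    for name, param in model.named_parameters():
        if param.dim() >= 2:  # only matrix/conv weights, not biases or norms
            alpha = param.abs().mean()
            param.copy_(alpha * param.sign())
    return model
```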
Scalable MatMul-free Language Modeling
replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1}
reducing computational cost and memory while preserving network expressiveness
GPUs? they removed matmuls but still use Hadamard products (element-wise products), which are also embarrassingly parallel and can be GPU-accelerated
using flexible GPUs for training and FPGAs/ASICs for inference is the optimal tradeoff here
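A toy sketch of the core trick: with weights constrained to {-1, 0, +1}, a matrix product reduces to selective additions and subtractions of activations. This is a plain NumPy illustration of the idea, not the paper's fused kernel:

```python
import numpy as np

def ternary_accumulate(x, W_ternary):
    # x: (d_in,) activations; W_ternary: (d_in, d_out) with entries in {-1, 0, +1}.
    # Equivalent to x @ W_ternary, but written as add/subtract accumulations
    # to show that no multiplications are needed.
    out = np.zeros(W_ternary.shape[1], dtype=x.dtype)
    for j in range(W_ternary.shape[1]):
        col = W_ternary[:, j]
        out[j] = x[col == 1].sum() - x[col == -1].sum()
    return out

x = np.random.randn(8).astype(np.float32)
W = np.random.choice([-1, 0, 1], size=(8, 4)).astype(np.float32)
assert np.allclose(ternary_accumulate(x, W), x @ W, atol=1e-5)
```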
optimizer states reduced from 32 bits to 8 bits
faster matrix multiplication using approximations
feedback: FEEDBACK AS TARGET, HUMAN FEEDBACK, PROPER-ING INSTRUCTIONS
AlignProp Aligning Text-to-Image Diffusion Models with Reward Backpropagation
aligns the diffusion model to reward functions by backpropagating reward gradients through the sampling process
CPL Contrastive Preference Learning: Learning from Human Feedback without RL
learning optimal policies from preferences without learning reward functions
regret-based model of human preferences instead of reward
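A rough sketch of my reading of the idea, not the paper's exact objective: score each trajectory segment by its discounted sum of policy log-probabilities (a regret proxy) and apply a Bradley-Terry-style contrastive loss between preferred and dispreferred segments. The `policy.log_prob` interface is an assumed placeholder:

```python
import torch
import torch.nn.functional as F

def segment_score(policy, states, actions, gamma=0.99):
    # Discounted sum of log pi(a_t | s_t) over a trajectory segment.
    logps = policy.log_prob(states, actions)  # assumed API: returns (T,) log-probs
    discounts = gamma ** torch.arange(len(logps), dtype=logps.dtype)
    return (discounts * logps).sum()

def contrastive_preference_loss(policy, preferred_seg, dispreferred_seg):
    # Bradley-Terry style: push the preferred segment's score above the other's,
    # with no reward model and no RL rollouts involved.
    s_pos = segment_score(policy, *preferred_seg)
    s_neg = segment_score(policy, *dispreferred_seg)
    return -F.logsigmoid(s_pos - s_neg)
```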
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
hand-specifying a reward function is often infeasible; learning a reward model from human feedback is often very expensive
VLMs (CLIP) as reward models: a single-sentence text prompt describing the desired task
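A minimal sketch of CLIP-as-reward: the reward is the cosine similarity between the current frame's image embedding and the task description's text embedding. The checkpoint name and task prompt below are placeholders:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
task_prompt = "a robot arm stacking the red block on the blue block"  # hypothetical task

@torch.no_grad()
def clip_reward(frame):  # frame: PIL image of the current observation
    inputs = processor(text=[task_prompt], images=frame, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()  # cosine similarity used as the reward
```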
TD-MPC2 Scalable, Robust World Models for Continuous Control
a single agent that performs 80 tasks across multiple task domains, embodiments, and action spaces
performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model
Text2Reward Automated Dense Reward Function Generation for Reinforcement Learning
automates the generation of dense reward functions using an LLM
Eureka Human-Level Reward Design via Coding Large Language Models
generates reward functions that outperform expert human-engineered rewards
so agents can now acquire complex skills via reinforcement learning, optimizing over the generated reward
to solve sequential decision-making tasks
in-context RLHF to incorporate feedback, steering and aligning the reward function
outer loop: inference-only LLM instructs a learnable NN to refine the reward function
inner loop: reinforcement learning to train a controller
pen spinning
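A high-level sketch of the outer/inner loop described above; every function here (`llm_propose_reward`, `train_policy`, `evaluate`, `summarize_feedback`) is a hypothetical placeholder for the corresponding stage, not an API from the paper:

```python
def reward_design_loop(env_source_code, task_description, n_iterations=5, n_samples=4):
    best_reward_fn, best_score = None, float("-inf")
    feedback = ""
    for it in range(n_iterations):
        # Outer loop: an inference-only LLM proposes candidate reward functions as code,
        # conditioned on the environment source, the task, and feedback from past rounds.
        candidates = [llm_propose_reward(env_source_code, task_description, feedback)
                      for _ in range(n_samples)]
        results = []
        for reward_fn in candidates:
            # Inner loop: reinforcement learning trains a controller against the candidate reward.
            policy = train_policy(reward_fn)
            results.append((evaluate(policy), reward_fn))
        score, reward_fn = max(results, key=lambda r: r[0])
        if score > best_score:
            best_score, best_reward_fn = score, reward_fn
        # Summarize training statistics as textual feedback for the next round.
        feedback = summarize_feedback(results)
    return best_reward_fn
```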
ConvNeXt (vs ViT, for image classification)
accurate, efficient, scalable and very simple in design
for: zero-shot image classification, image and text retrieval
clip convnext: https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft (320 vs 320)
CNCA Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing
replacing linear recurrence with a special temporal convolutional network
permits larger receptive field size with shallower networks
reduces the computational complexity to O(L)
PanGu-π Enhancing Language Model Architectures via Nonlinearity Compensation
shortcut used to enhance the model nonlinearity, 10% inference speed-up
nonlinearity of this kind is usual in convolutional networks for vision tasks
muP proposes the "right way to scale": an effective weight-init scheme for finding optimal hyperparameters on a small model and transferring them to a larger one
are LLMs just text compression algorithms?
LLMZip Lossless Text Compression using Large Language Models
gzip instead of parameters for classification
Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors
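A minimal version of the compressor-based classifier: standard gzip, normalized compression distance, and kNN voting. This follows the general recipe rather than the paper's exact code:

```python
import gzip
import numpy as np

def clen(s: str) -> int:
    # Compressed length of a string in bytes.
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance between two strings.
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query: str, train_texts, train_labels, k: int = 3):
    # Parameter-free kNN over compression distances (no trained weights at all).
    dists = np.array([ncd(query, t) for t in train_texts])
    top_k = np.array(train_labels)[np.argsort(dists)[:k]]
    values, counts = np.unique(top_k, return_counts=True)
    return values[counts.argmax()]
```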
Knowledge Translation A New Pathway for Model Compression
teacher-student model that receives parameters and generates compressed ones
AdaLoRA adaptively allocates the parameter budget among weight matrices according to their importance (adaptive LoRA)
FLIQS One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
mixed-precision quantization, eliminates the need for retraining
Lion, an optimizer reported to be better than Adam
Sketchy Memory-efficient Adaptive Regularization with Frequent Directions
Kronecker-factored diagonal eigenvalues, Frequent Directions
One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
Optimized Network Architectures for Large Language Model Training with Billions of Parameters
only small subgroups of GPUs require high-bandwidth any-to-any communication within them
dimensionality reduction algorithms
t-SNE and UMAP have long been the favorites
"Deep TDA" combines self-supervised learning and Topological Data Analysis (TDA)
unlock new insights from complex datasets
more robust to noise and outliers in the data
Gen2Det Generate to Detect
directly generating scene-centric images (synthetic)
improves the performance on rare categories
Image classification network enhancement methods based on knowledge injection
knowledge injection dataset to improve interpretability and classification performance of hidden layers
MovieLLM Enhancing Long Video Understanding with AI-Generated Movies
generates a script and corresponding video as a dataset
In-Context Principle Learning from Mistakes
induce the model to make mistakes; then reflect on these mistakes and learn explicit task-specific "principles" from them, which help avoid similar mistakes
MatSynth Physically Based Rendering (PBR) materials dataset (4,000 ultra-high resolution)
FindingEmo An Image Dataset for Emotion Recognition in the Wild
annotated dimensions include: valence, arousal and emotion
English public-domain books
Annotated Hands for Generative Models
with three additional channels that provide annotations for hands in the image, adding structure
Learning to Identify Critical States for Reinforcement Learning from Videos
mask-based sensitivity analysis to ==identify important== critical states
recognizes relevant states/actions/rewards from untagged videos
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
uses an LLM to extrapolate the errors made by a small model trained on the synthesized dataset
GeNIe Generative Hard Negative Images Through Diffusion (synthetic enhanced dataset)
generate challenging samples for the target category
DistDiff: Distribution-Aware Data Expansion with Diffusion Models
dataset expansion framework based on the distribution-aware diffusion model
hierarchical prototypes to approximate the real data distribution
madrona-engine ECS-based game engine that runs 10,000s of environments in parallel on a single GPU
V-IRL Grounding Virtual Intelligence in Real Life
tests foundation models in virtual real-world cities using geospatial data and street-view imagery
Dr2Net Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
surrogate network to finetune a pretrained model with substantially reduced memory consumption
comparable performance to conventional finetuning but with significantly less memory usage
Data-Free Generalized Zero-Shot Learning (using only its CLIP features)
Gradient Correlation Subspace Learning against Catastrophic Forgetting
detects a subspace of the weights that is least affected by previous tasks and trains the new task within that subspace
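One way to picture the mechanism (my interpretation, not the paper's exact procedure): collect gradients from previous-task data, take an SVD, and treat the directions with the smallest singular values as the "least affected" subspace into which the new task's updates are projected:

```python
import torch

def least_affected_subspace(prev_task_grads, k):
    # prev_task_grads: (n_samples, n_params) matrix of flattened gradients
    # computed on previous-task data. Within the span of those gradients,
    # the directions with the smallest singular values are the ones they barely use.
    _, _, Vh = torch.linalg.svd(prev_task_grads, full_matrices=False)
    return Vh[-k:]                      # (k, n_params) basis of the "quiet" subspace

def project_update(grad, basis):
    # Restrict the new task's gradient to the selected subspace
    # so it interferes as little as possible with earlier tasks.
    coeffs = basis @ grad               # (k,)
    return basis.T @ coeffs             # (n_params,)
```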
Evolutionary Optimization of Model Merging Recipes
facilitates cross-domain merging, automated model composition
The Unreasonable Ineffectiveness of the Deeper Layers
identify optimal block of layers to prune by considering similarity across layers
then, to "heal" the damage, we perform a small amount of finetuning