:PROPERTIES:
:ID: d4eebb0c-b7d1-4f56-baf5-004fc69fbd6c
:END:
#+title: transformer
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:cb192d74-71e5-40c3-8763-6f68ffde8e27][train]]
- [[https://twitter.com/_akhaliq/status/1664497650702471169][Bytes Are All You Need]]: [[https://huggingface.co/papers/2306.00238][Transformers]] Operating Directly On File Bytes
- [[https://twitter.com/cloneofsimo/status/1664365355266105344][NoPE]]: don't use positional encoding (PE) in Transformer decoders (GPTs)
- [[https://twitter.com/_akhaliq/status/1682248055637041152][Meta-Transformer]]: A Unified Framework for Multimodal Learning
  - a unified data tokenizer, a modality-shared encoder, and task-specific heads

* IMPROVEMENTS ON
** FASTER
- [[https://arxiv.org/abs/2303.09752][CoLT5]]: Faster Long-Range Transformers with Conditional Computation
  - [[https://twitter.com/papers_daily/status/1637748540653936641][strong gains]] up to 64k input length
- [[https://twitter.com/_akhaliq/status/1735125272200953965][SwitchHead]]: Accelerating Transformers with Mixture-of-Experts Attention
  - reduces compute and memory, 4 to 8 times fewer attention matrices
- [[https://github.com/LeapLabTHU/Agent-Attention][Agent Attention]]: balance between computational efficiency and representation power
  - generalized linear attention integrated with softmax, preserving global context modelling capability
** CONTEXT
- [[https://twitter.com/MishaLaskin/status/1585265436723236864][contextual transformers]] (Algorithm Distillation): learns from itself, reinforcement learning
- [[https://twitter.com/xiaolonw/status/1677003542249484289][Elastic Decision]] Transformer
  - using the full history of states as input is not optimal for decisions; a shorter history works better
- [[https://twitter.com/_akhaliq/status/1737680304737800291][Cached Transformers]]: Improving Transformers with Differentiable Memory Cache
  - Gated Recurrent Cached (GRC) attention, extending the self-attention mechanism

* ABOUT ATTENTION
- [[https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190][What]] is Q, K, V? multihead attention? (see the sketch after this list)
- SpectFormer: Frequency and Attention is what you need in a Vision Transformer
- [[https://arxiv.org/pdf/1808.03867.pdf][Pervasive]] [[https://github.com/elbayadm/attn2d][Attention]]: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
  - two-dimensional convolutions to jointly encode the source-target sequences (translation)
- [[https://arxiv.org/abs/1901.03429][On the]] [[https://twitter.com/kfountou/status/1682936558532407296][Turing]] Completeness of Modern Neural Network Architectures; Attention is Turing-complete
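A minimal NumPy sketch of scaled dot-product and multi-head attention, to make the Q, K, V question above concrete; the shapes, head count, and random weights are illustrative assumptions, not taken from any of the linked papers.

#+begin_src python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (n_heads, seq, seq)
    return softmax(scores) @ V                        # (n_heads, seq, d_head)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Project x into per-head Q, K, V, attend per head, concatenate, project out."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # project and split into heads: (n_heads, seq, d_head)
    split = lambda W: (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    heads = attention(Q, K, V)                        # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                # (seq, d_model)

# toy usage with random, illustrative weights (not from any paper)
rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads).shape)  # (5, 16)
#+end_src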
* SHAPES
- Star-Transformer: https://arxiv.org/abs/1902.09113
- [[https://github.com/HazyResearch/safari][Hungry Hungry Hippos]]: State Space Models
  - next: [[https://arxiv.org/pdf/2302.10866.pdf][Hyena Hierarchy]]: Towards Larger Convolutional Language Models; gating (cached attention)
- simpler transformer: [[https://twitter.com/_akhaliq/status/1699332742154997916][One Wide]] Feedforward is All You Need
  - attention (interdependencies) and the Feed Forward Network (now removed from the decoder, cheaper in parameters)
- [[https://twitter.com/_akhaliq/status/1714480762358014417][Approximating]] Two-Layer Feedforward Networks for Efficient Transformers
  - Mixtures of Experts (MoEs) vs dense transformers, more resource efficient

** BEHAVIORAL TRANSFORMER
:PROPERTIES:
:ID: d1967bb7-3782-4052-8725-c799c2630893
:END:
- [[https://twitter.com/_akhaliq/status/1682248458231480321][PASTA]]: Pretrained Action-State Transformer Agents
  - self-supervised reinforcement learning
  - learns behavioral and sensor-adaptation trajectories
  - no need to pretrain-tailor to specific downstream applications

** ALTERNATIVE
- [[id:b420e2cc-c219-43ef-baa6-e913a4690872][MATMUL FREE]]
- [[https://youtu.be/ec56a8wmfRk?si=8lhhpWSzN61VFYXG][Retentive Network]]: A Successor to Transformer for Large Language Models (RetNet)
  - low-cost inference, training parallelism, strong performance
- [[https://twitter.com/_akhaliq/status/1726428891399483537][Rethinking]] Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
  - potential to streamline complex architectures for sequence-to-sequence tasks
- [[https://twitter.com/_akhaliq/status/1749646258245927405][WARM]]: On the Benefits of Weight Averaged Reward Models
  - averages weights to deal with inconsistencies

*** RNN
- RNNs: [[https://twitter.com/_akhaliq/status/1699332382766039051][Gated]] recurrent neural networks discover attention
- [[https://github.com/BlinkDL/RWKV-LM][RWKV]]: RNN instead of transformers
- [[https://github.com/BlinkDL/nanoRWKV][nanoRWKV]]: minGPT-like, does not require a custom CUDA kernel to train

*** STATE SPACE
:PROPERTIES:
:ID: bd80ad1d-64de-4445-98e8-0cec31e1ab32
:END:
- [[https://arxiv.org/abs/2212.14052][H3]]: [[https://www.reddit.com/r/MachineLearning/comments/10kdeex/h3_a_new_generative_language_models_that/][hungry hippos]]: state space model instead of transformers
- [[https://arxiv.org/pdf/2202.07765.pdf][Perceiver]]: few latents instead of transformers
- [[https://twitter.com/_akhaliq/status/1754334655405326482][Repeat After Me]]: Transformers are Better than State Space Models at Copying
  - GSSMs (generalized state space models): fixed-size latent state that does not depend on sequence length (see the recurrence sketch after this list)
  - limited compared to transformer models on tasks that require copying from the input context
  - repeat after me: are dilated convolutions better than transformers?
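A minimal sketch of the linear state space recurrence behind GSSM/H3/Mamba-style layers, to make the fixed-size latent state point concrete; the matrices, dimensions, and random initialization are illustrative assumptions (selective models such as Mamba additionally make the recurrence parameters input-dependent).

#+begin_src python
import numpy as np

def ssm_scan(A, B, C, x):
    """Linear state space recurrence over a 1-D input sequence:

        h_t = A h_{t-1} + B x_t
        y_t = C h_t

    The latent state h has a fixed size d_state regardless of sequence length,
    unlike attention, whose key/value cache grows with the number of tokens.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                 # O(seq_len) time, O(d_state) memory
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

# toy usage: a roughly stable random SSM (illustrative, not from any paper)
rng = np.random.default_rng(0)
d_state = 8
A = 0.9 * np.eye(d_state) + 0.01 * rng.normal(size=(d_state, d_state))
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)
x = rng.normal(size=32)
print(ssm_scan(A, B, C, x).shape)  # (32,)
#+end_src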
**** MAMBA
- [[https://twitter.com/_akhaliq/status/1731507030538543171][Mamba]]: Linear-Time Sequence Modeling with Selective State Spaces
  - architecture without attention or MLP blocks
  - allows the model to selectively propagate or forget information
  - outperforms Transformers of the same size, matches Transformers twice its size
- [[https://twitter.com/_akhaliq/status/1771033002748837953][Cobra]]: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
  - comparable performance to LLaVA with about 43% of the number of parameters
- [[https://twitter.com/_akhaliq/status/1774664709821567290][MambaMixer]]: Efficient Selective State Space Models with Dual Token and Channel Selection
  - data-dependent weights with a dual selection mechanism across tokens and channels
- [[https://twitter.com/_akhaliq/status/1774645753383682344][Jamba]]: A Hybrid Transformer-Mamba Language Model
  - interleaves blocks of Transformer and Mamba layers

** ATTENTION FREE
- gzip vs attention: [[id:316325a1-f24b-487d-9238-ca35db3a6b0c][GZIP VS GPT]]
- [[https://arxiv.org/pdf/2111.15588.pdf][SimpleTRON]]: Simple Transformer with O(N) Complexity (no transformer)
  - vs [[https://arxiv.org/abs/2111.11418][Metaformer]] (poolformer, pureformer)
  - maybe not the same: [[https://github.com/ThilinaRajapakse/simpletransformers][the github]]

** COMPOSABLE TRANSFORMERS :composable:
:PROPERTIES:
:ID: 0fc2c1cb-406d-4eec-a328-c5e3838d8eac
:END:
- [[https://twitter.com/_akhaliq/status/1690955025236004864][Composable]] Function-preserving Expansions for Transformer Architectures ==best==
  - training pipelines for larger models by progressively expanding the architecture throughout training

* TRANSFORMER VISION
- parent: [[id:39d30d24-c374-4d0c-8037-b03ecbf983fa][computer_vision]]
- [[id:c0fc01e0-6db2-42c1-8053-55cddfcd496c][SEED]]
- DINO: self-supervised Vision Transformers https://youtu.be/h3ij3F3cPIk
- [[https://twitter.com/_akhaliq/status/1717385905214759421][ConvNets]] Match Vision Transformers at Scale
  - match the performance of Vision Transformers with comparable compute budgets
- [[https://twitter.com/_akhaliq/status/1744194239104217273][Denoising]] Vision Transformers: removes artifacts, improves quality

** VIDEO
- [[https://twitter.com/_akhaliq/status/1686200777470058496][MovieChat]]: From Dense Token to Sparse Memory for Long Video Understanding
  - memory mechanism of rapidly updated short-term memory and sustained long-term memory
- [[https://twitter.com/_akhaliq/status/1696112258806673905][Eventful]] Transformers: Leveraging Temporal Redundancy in Vision Transformers
  - computational savings for video visual recognition
  - identifies the tokens that have changed significantly
  - existing transformers can be converted
- [[https://twitter.com/_akhaliq/status/1700024227007537235][ProPainter]]: Improving Propagation and Transformer for Video Inpainting
  - inpainting = make a mask, then remove
  - image and feature warping, discard redundant tokens, attention to distant frames

** TOKENS
:PROPERTIES:
:ID: bb5bc5a8-876c-43ae-8fa0-ea3d6b7da69f
:END:
- [[https://arxiv.org/abs/2307.02486][LongNet]]: Scaling Transformers to 1,000,000,000 Tokens
- [[https://twitter.com/_akhaliq/status/1645278535878049792][SparseFormer]]: Sparse Visual Recognition via Limited Latent Tokens
  - codebook for video tokens, not optical flow, 49 tokens
- [[https://arxiv.org/pdf/2303.11114.pdf][SeiT]]: [[https://github.com/naver-ai/seit][Storage-efficient]] Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
  - token-based storage (a basic patch-tokenization sketch follows this list)
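For background on what a visual token is in the first place, here is a minimal sketch of standard ViT-style patch tokenization (non-overlapping patches, flattened and linearly projected); the patch size and embedding dimension are illustrative assumptions, not the settings used by SparseFormer, SeiT, or the subobject tokenizer below.

#+begin_src python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)              # (grid_h, grid_w, patch, patch, C)
            .reshape(-1, patch * patch * C))       # (n_tokens, patch_dim)

def tokenize(image, W_embed, patch=16):
    """Linear projection of patches = the visual tokens fed to the Transformer."""
    return patchify(image, patch) @ W_embed        # (n_tokens, d_model)

# toy usage: a 224x224 RGB image becomes 14*14 = 196 tokens of width d_model
# (illustrative dimensions, not from any of the linked papers)
rng = np.random.default_rng(0)
d_model = 64
image = rng.normal(size=(224, 224, 3))
W_embed = rng.normal(size=(16 * 16 * 3, d_model)) * 0.02
print(tokenize(image, W_embed).shape)  # (196, 64)
#+end_src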
*** VISUAL TOKENS
- [[https://twitter.com/_akhaliq/status/1760869569248289151][Subobject-level]] Image Tokenization
  - subobjects are represented by semantically meaningful image segments obtained from segmentation models

**** SEED
:PROPERTIES:
:ID: c0fc01e0-6db2-42c1-8053-55cddfcd496c
:END:
- [[https://twitter.com/_akhaliq/status/1681128614949994496][SEED]]: Planting a SEED of Vision in Large Language Model
  - unifies visual and textual representations
  - image tokens are independent of 2D patch positions, with only a 1D causal dependency
  - high-level semantics consistent with the semantic abstraction of words
- SEED-LLaMA: [[https://ailab-cvc.github.io/seed/seed_llama.html][Making LLaMA]] [[https://twitter.com/ge_yixiao/status/1715294783705665791][SEE]] [[https://github.com/AILab-CVC/SEED][and]] Draw with SEED Tokenizer (autoregressive Transformer)
  - comprehension and generation of images

** HIERARCHICAL DETAILS
- [[https://twitter.com/_akhaliq/status/1645603021248778241][Slide-Transformer]]: Hierarchical Vision Transformer with Local Self-Attention
- [[https://twitter.com/_akhaliq/status/1668459325805699073][FasterViT]]: Fast Vision Transformers with Hierarchical Attention
- [[https://twitter.com/_akhaliq/status/1679344960150151168][Patch]] n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- [[https://twitter.com/_akhaliq/status/1686922270604709888][From Sparse]] to Soft Mixtures of Experts; sparse Transformer
  - passes different weighted combinations of all input tokens to each expert (see the sketch below)
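A minimal sketch of the Soft MoE idea from the last bullet: every expert slot receives a softmax-weighted combination of all input tokens (no hard routing), and every token receives a weighted combination of all slot outputs. The dimensions, the number of experts and slots, and the toy experts (one linear map each) are illustrative assumptions, not the paper's configuration.

#+begin_src python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, expert_weights):
    """Soft mixture of experts over tokens X of shape (n_tokens, d).

    dispatch: each slot gets a weighted mix of *all* tokens (softmax over tokens)
    experts:  each expert processes its own slots (here: one linear map per expert)
    combine:  each token gets a weighted mix of all slot outputs (softmax over slots)
    """
    n_tokens, d = X.shape
    logits = X @ Phi                               # (n_tokens, n_slots)
    dispatch = softmax(logits, axis=0)             # weights over tokens, per slot
    combine = softmax(logits, axis=1)              # weights over slots, per token
    slots = dispatch.T @ X                         # (n_slots, d)
    slots = slots.reshape(len(expert_weights), -1, d)
    outs = np.concatenate([s @ W for s, W in zip(slots, expert_weights)])  # (n_slots, d)
    return combine @ outs                          # (n_tokens, d)

# toy usage: 10 tokens, 4 hypothetical experts with 2 slots each
rng = np.random.default_rng(0)
n_tokens, d, n_experts, slots_per_expert = 10, 16, 4, 2
X = rng.normal(size=(n_tokens, d))
Phi = rng.normal(size=(d, n_experts * slots_per_expert)) * 0.1
expert_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
print(soft_moe(X, Phi, expert_weights).shape)  # (10, 16)
#+end_src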