:PROPERTIES:
:ID: d4eebb0c-b7d1-4f56-baf5-004fc69fbd6c
:END:
#+title: transformer
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:cb192d74-71e5-40c3-8763-6f68ffde8e27][train]]
- [[https://twitter.com/_akhaliq/status/1664497650702471169][Bytes Are All You Need]]: [[https://huggingface.co/papers/2306.00238][Transformers]] Operating Directly On File Bytes
- [[https://twitter.com/cloneofsimo/status/1664365355266105344][NoPE]]: don't use positional encoding (PE) in Transformer decoders (GPTs)
- [[https://twitter.com/_akhaliq/status/1682248055637041152][Meta-Transformer]]: A Unified Framework for Multimodal Learning
  - a unified data tokenizer, a modality-shared encoder, and task-specific heads

* IMPROVEMENTS ON
** FASTER
- [[https://arxiv.org/abs/2303.09752][CoLT5]]: Faster Long-Range Transformers with Conditional Computation
  - [[https://twitter.com/papers_daily/status/1637748540653936641][strong gains]] up to 64k input length
- [[https://twitter.com/_akhaliq/status/1735125272200953965][SwitchHead]]: Accelerating Transformers with Mixture-of-Experts Attention
  - reduces compute and memory, 4 to 8 times fewer attention matrices
- [[https://github.com/LeapLabTHU/Agent-Attention][Agent Attention]]: balance between computational efficiency and representation power
  - generalized linear attention integrated with softmax, preserving global context modelling capability
** CONTEXT
- [[https://twitter.com/MishaLaskin/status/1585265436723236864][contextual transformers]] (Algorithm Distillation): learns from itself, reinforcement learning
- [[https://twitter.com/xiaolonw/status/1677003542249484289][Elastic Decision]] Transformer
  - using the full history of states as input is not optimal for decisions; a shorter history works better
- [[https://twitter.com/_akhaliq/status/1737680304737800291][Cached Transformers]]: Improving Transformers with Differentiable Memory Cache
  - Gated Recurrent Cached (GRC) attention, extending the self-attention mechanism

* ABOUT ATTENTION
- [[https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190][What]] is Q, K, V? multihead attention? (see the sketch after this list)
- SpectFormer: Frequency and Attention is what you need in a Vision Transformer
- [[https://arxiv.org/pdf/1808.03867.pdf][Pervasive]] [[https://github.com/elbayadm/attn2d][Attention]]: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
  - two-dimensional convolutions to jointly encode the source-target sequences (translation)
- [[https://arxiv.org/abs/1901.03429][On the]] [[https://twitter.com/kfountou/status/1682936558532407296][Turing]] Completeness of Modern Neural Network Architectures; Attention is Turing-complete
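A minimal NumPy sketch of scaled dot-product and multi-head attention, to make the Q, K, V question above concrete; the shapes, head count, and random weights are illustrative assumptions, not taken from any of the linked papers.

#+begin_src python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (n_heads, seq, seq)
    return softmax(scores) @ V                        # (n_heads, seq, d_head)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Project x into per-head Q, K, V, attend per head, concatenate, project out."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # project and split into heads: (n_heads, seq, d_head)
    split = lambda W: (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    heads = attention(Q, K, V)                        # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                # (seq, d_model)

# toy usage with random, illustrative weights (not from any paper)
rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads).shape)  # (5, 16)
#+end_src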
* SHAPES
- Star-Transformer: https://arxiv.org/abs/1902.09113
- [[https://github.com/HazyResearch/safari][Hungry Hungry Hippos]]: State Space Models
  - next: [[https://arxiv.org/pdf/2302.10866.pdf][Hyena Hierarchy]]: Towards Larger Convolutional Language Models; gating (cached attention)
- simpler transformer: [[https://twitter.com/_akhaliq/status/1699332742154997916][One Wide]] Feedforward is All You Need
  - attention (interdependencies) and the Feed Forward Network (now removed from the decoder, cheaper in parameters)
- [[https://twitter.com/_akhaliq/status/1714480762358014417][Approximating]] Two-Layer Feedforward Networks for Efficient Transformers
  - Mixtures of Experts (MoEs) vs dense transformers, more resource efficient

** BEHAVIORAL TRANSFORMER
:PROPERTIES:
:ID: d1967bb7-3782-4052-8725-c799c2630893
:END:
- [[https://twitter.com/_akhaliq/status/1682248458231480321][PASTA]]: Pretrained Action-State Transformer Agents
  - self-supervised reinforcement learning
  - learns behavioral and sensor-adaptation trajectories
  - no need to pretrain-tailor to specific downstream applications

** ALTERNATIVE
- [[id:b420e2cc-c219-43ef-baa6-e913a4690872][MATMUL FREE]]
- [[https://youtu.be/ec56a8wmfRk?si=8lhhpWSzN61VFYXG][Retentive Network]]: A Successor to Transformer for Large Language Models (RetNet)
  - low-cost inference, training parallelism, strong performance
- [[https://twitter.com/_akhaliq/status/1726428891399483537][Rethinking]] Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
  - potential to streamline complex architectures for sequence-to-sequence tasks
- [[https://twitter.com/_akhaliq/status/1749646258245927405][WARM]]: On the Benefits of Weight Averaged Reward Models
  - averages weights to deal with inconsistencies

*** RNN
- RNNs: [[https://twitter.com/_akhaliq/status/1699332382766039051][Gated]] recurrent neural networks discover attention
- [[https://github.com/BlinkDL/RWKV-LM][RWKV]]: RNN instead of transformers
- [[https://github.com/BlinkDL/nanoRWKV][nanoRWKV]]: minGPT-like, does not require a custom CUDA kernel to train

*** STATE SPACE
:PROPERTIES:
:ID: bd80ad1d-64de-4445-98e8-0cec31e1ab32
:END:
- [[https://arxiv.org/abs/2212.14052][H3]]: [[https://www.reddit.com/r/MachineLearning/comments/10kdeex/h3_a_new_generative_language_models_that/][hungry hippos]]: state space model instead of transformers
- [[https://arxiv.org/pdf/2202.07765.pdf][Perceiver]]: few latents instead of transformers
- [[https://twitter.com/_akhaliq/status/1754334655405326482][Repeat After Me]]: Transformers are Better than State Space Models at Copying
  - GSSMs (generalized state space models): fixed-size latent state that does not depend on sequence length (see the recurrence sketch after this list)
  - limited compared to transformer models on tasks that require copying from the input context
  - repeat after me: are dilated convolutions better than transformers?
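A minimal sketch of the linear state space recurrence behind GSSM/H3/Mamba-style layers, to make the fixed-size latent state point concrete; the matrices, dimensions, and random initialization are illustrative assumptions (selective models such as Mamba additionally make the recurrence parameters input-dependent).

#+begin_src python
import numpy as np

def ssm_scan(A, B, C, x):
    """Linear state space recurrence over a 1-D input sequence:

        h_t = A h_{t-1} + B x_t
        y_t = C h_t

    The latent state h has a fixed size d_state regardless of sequence length,
    unlike attention, whose key/value cache grows with the number of tokens.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                 # O(seq_len) time, O(d_state) memory
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

# toy usage: a roughly stable random SSM (illustrative, not from any paper)
rng = np.random.default_rng(0)
d_state = 8
A = 0.9 * np.eye(d_state) + 0.01 * rng.normal(size=(d_state, d_state))
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)
x = rng.normal(size=32)
print(ssm_scan(A, B, C, x).shape)  # (32,)
#+end_src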
**** MAMBA
- [[https://twitter.com/_akhaliq/status/1731507030538543171][Mamba]]: Linear-Time Sequence Modeling with Selective State Spaces
  - architecture without attention or MLP blocks
  - allows the model to selectively propagate or forget information
  - outperforms Transformers of the same size, matches Transformers twice its size
- [[https://twitter.com/_akhaliq/status/1771033002748837953][Cobra]]: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
  - comparable performance to LLaVA with about 43% of the number of parameters
- [[https://twitter.com/_akhaliq/status/1774664709821567290][MambaMixer]]: Efficient Selective State Space Models with Dual Token and Channel Selection
  - data-dependent weights with a dual selection mechanism across tokens and channels
- [[https://twitter.com/_akhaliq/status/1774645753383682344][Jamba]]: A Hybrid Transformer-Mamba Language Model
  - interleaves blocks of Transformer and Mamba layers

** ATTENTION FREE
- gzip vs attention: [[id:316325a1-f24b-487d-9238-ca35db3a6b0c][GZIP VS GPT]]
- [[https://arxiv.org/pdf/2111.15588.pdf][SimpleTRON]]: Simple Transformer with O(N) Complexity (no transformer)
  - vs [[https://arxiv.org/abs/2111.11418][Metaformer]] (poolformer, pureformer)
  - maybe not the same: [[https://github.com/ThilinaRajapakse/simpletransformers][the github]]

** COMPOSABLE TRANSFORMERS :composable:
:PROPERTIES:
:ID: 0fc2c1cb-406d-4eec-a328-c5e3838d8eac
:END:
- [[https://twitter.com/_akhaliq/status/1690955025236004864][Composable]] Function-preserving Expansions for Transformer Architectures ==best==
  - training pipelines for larger models by progressively expanding the architecture throughout training

* TRANSFORMER VISION
- parent: [[id:39d30d24-c374-4d0c-8037-b03ecbf983fa][computer_vision]]
- [[id:c0fc01e0-6db2-42c1-8053-55cddfcd496c][SEED]]
- DINO: self-supervised Vision Transformers https://youtu.be/h3ij3F3cPIk
- [[https://twitter.com/_akhaliq/status/1717385905214759421][ConvNets]] Match Vision Transformers at Scale
  - match the performance of Vision Transformers with comparable compute budgets
- [[https://twitter.com/_akhaliq/status/1744194239104217273][Denoising]] Vision Transformers: removes artifacts, improves quality

** VIDEO
- [[https://twitter.com/_akhaliq/status/1686200777470058496][MovieChat]]: From Dense Token to Sparse Memory for Long Video Understanding
  - memory mechanism of rapidly updated short-term memory and sustained long-term memory
- [[https://twitter.com/_akhaliq/status/1696112258806673905][Eventful]] Transformers: Leveraging Temporal Redundancy in Vision Transformers
  - computational savings for video visual recognition
  - identifies the tokens that have changed significantly
  - existing transformers can be converted
- [[https://twitter.com/_akhaliq/status/1700024227007537235][ProPainter]]: Improving Propagation and Transformer for Video Inpainting
  - inpainting = make a mask, then remove
  - image and feature warping, discard redundant tokens, attention to distant frames

** TOKENS
:PROPERTIES:
:ID: bb5bc5a8-876c-43ae-8fa0-ea3d6b7da69f
:END:
- [[https://arxiv.org/abs/2307.02486][LongNet]]: Scaling Transformers to 1,000,000,000 Tokens
- [[https://twitter.com/_akhaliq/status/1645278535878049792][SparseFormer]]: Sparse Visual Recognition via Limited Latent Tokens
  - codebook for video tokens, not optical flow, 49 tokens
- [[https://arxiv.org/pdf/2303.11114.pdf][SeiT]]: [[https://github.com/naver-ai/seit][Storage-efficient]] Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
  - token-based storage (a basic patch-tokenization sketch follows this list)
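For background on what a visual token is in the first place, here is a minimal sketch of standard ViT-style patch tokenization (non-overlapping patches, flattened and linearly projected); the patch size and embedding dimension are illustrative assumptions, not the settings used by SparseFormer, SeiT, or the subobject tokenizer below.

#+begin_src python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)              # (grid_h, grid_w, patch, patch, C)
            .reshape(-1, patch * patch * C))       # (n_tokens, patch_dim)

def tokenize(image, W_embed, patch=16):
    """Linear projection of patches = the visual tokens fed to the Transformer."""
    return patchify(image, patch) @ W_embed        # (n_tokens, d_model)

# toy usage: a 224x224 RGB image becomes 14*14 = 196 tokens of width d_model
# (illustrative dimensions, not from any of the linked papers)
rng = np.random.default_rng(0)
d_model = 64
image = rng.normal(size=(224, 224, 3))
W_embed = rng.normal(size=(16 * 16 * 3, d_model)) * 0.02
print(tokenize(image, W_embed).shape)  # (196, 64)
#+end_src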
*** VISUAL TOKENS
- [[https://twitter.com/_akhaliq/status/1760869569248289151][Subobject-level]] Image Tokenization
  - subobjects are represented by semantically meaningful image segments obtained from segmentation models

**** SEED
:PROPERTIES:
:ID: c0fc01e0-6db2-42c1-8053-55cddfcd496c
:END:
- [[https://twitter.com/_akhaliq/status/1681128614949994496][SEED]]: Planting a SEED of Vision in Large Language Model
  - unifies visual and textual representations
  - image tokens are independent of 2D patch positions, with only a 1D causal dependency
  - high-level semantics consistent with the semantic abstraction of words
- SEED-LLaMA: [[https://ailab-cvc.github.io/seed/seed_llama.html][Making LLaMA]] [[https://twitter.com/ge_yixiao/status/1715294783705665791][SEE]] [[https://github.com/AILab-CVC/SEED][and]] Draw with SEED Tokenizer (autoregressive Transformer)
  - comprehension and generation of images

** HIERARCHICAL DETAILS
- [[https://twitter.com/_akhaliq/status/1645603021248778241][Slide-Transformer]]: Hierarchical Vision Transformer with Local Self-Attention
- [[https://twitter.com/_akhaliq/status/1668459325805699073][FasterViT]]: Fast Vision Transformers with Hierarchical Attention
- [[https://twitter.com/_akhaliq/status/1679344960150151168][Patch]] n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- [[https://twitter.com/_akhaliq/status/1686922270604709888][From Sparse]] to Soft Mixtures of Experts; sparse Transformer
  - passes different weighted combinations of all input tokens to each expert (see the sketch below)
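A minimal sketch of the Soft MoE idea from the last bullet: every expert slot receives a softmax-weighted combination of all input tokens (no hard routing), and every token receives a weighted combination of all slot outputs. The dimensions, the number of experts and slots, and the toy experts (one linear map each) are illustrative assumptions, not the paper's configuration.

#+begin_src python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, expert_weights):
    """Soft mixture of experts over tokens X of shape (n_tokens, d).

    dispatch: each slot gets a weighted mix of *all* tokens (softmax over tokens)
    experts:  each expert processes its own slots (here: one linear map per expert)
    combine:  each token gets a weighted mix of all slot outputs (softmax over slots)
    """
    n_tokens, d = X.shape
    logits = X @ Phi                               # (n_tokens, n_slots)
    dispatch = softmax(logits, axis=0)             # weights over tokens, per slot
    combine = softmax(logits, axis=1)              # weights over slots, per token
    slots = dispatch.T @ X                         # (n_slots, d)
    slots = slots.reshape(len(expert_weights), -1, d)
    outs = np.concatenate([s @ W for s, W in zip(slots, expert_weights)])  # (n_slots, d)
    return combine @ outs                          # (n_tokens, d)

# toy usage: 10 tokens, 4 hypothetical experts with 2 slots each
rng = np.random.default_rng(0)
n_tokens, d, n_experts, slots_per_expert = 10, 16, 4, 2
X = rng.normal(size=(n_tokens, d))
Phi = rng.normal(size=(d, n_experts * slots_per_expert)) * 0.1
expert_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
print(soft_moe(X, Phi, expert_weights).shape)  # (10, 16)
#+end_src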