parent: train
Bytes Are All You Need: Transformers Operating Directly On File Bytes
NoPE: don't use positional encoding (PE) in Transformer decoders (GPTs)
Meta-Transformer A Unified Framework for Multimodal Learning
a unified data tokenizer, a modality-shared encoder, and task-specific heads
CoLT5 Faster Long-Range Transformers with Conditional Computation
strong gains up to 64k input length
SwitchHead Accelerating Transformers with Mixture-of-Experts Attention
reduces compute and memory, 4 to 8 times fewer attention matrices
Agent Attention: a balance between computational efficiency and representation power
generalized linear attention integrated with softmax, preserving global context modelling capability
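A minimal sketch of the two-stage agent-attention idea, assuming a small set of learned agent tokens as proxies (names and shapes are illustrative, not the paper's code):
```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agent):
    """Agent tokens (M << N) first aggregate the sequence, then queries
    attend only to the agents, cutting cost from O(N^2) to O(N*M).
    q, k, v: (B, N, d); agent: (B, M, d)."""
    d = q.shape[-1]
    # stage 1: agents read the whole sequence (M x N softmax attention)
    agent_v = F.softmax(agent @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v
    # stage 2: each query reads only from the M agents (N x M)
    return F.softmax(q @ agent.transpose(-2, -1) / d**0.5, dim=-1) @ agent_v
```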
contextual transformers (Algorithm Distillation): learns from itself, reinforcement learning
Elastic Decision Transformer
it is not optimal to use the full history of states as input for decisions; a shorter history works better
Cached Transformers Improving Transformers with Differentiable Memory Cache
Gated Recurrent Cached (GRC), extend the self-attention mechanism
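A hypothetical sketch of a gated, differentiable memory cache in this spirit; the token summary and sigmoid gate are my assumptions, not the paper's exact GRC equations:
```python
import torch

def gated_cache_update(cache, tokens, gate_proj):
    """Blend a fixed-size memory cache with a summary of the new tokens via a
    learned sigmoid gate, keeping the cache differentiable end to end.
    cache: (B, M, d); tokens: (B, N, d); gate_proj: nn.Linear(d, d)."""
    summary = tokens.mean(dim=1, keepdim=True).expand_as(cache)  # crude summary of new tokens
    g = torch.sigmoid(gate_proj(cache))                          # per-slot, per-channel gate
    return g * cache + (1.0 - g) * summary                       # keep vs. write; attend to [tokens; cache]
```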
What are Q, K, V? multi-head attention?
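As a reminder, the standard formulation: queries ask, keys index, values carry content, and heads are parallel low-dimensional copies of the same mechanism.
```python
import torch
import torch.nn.functional as F

def multihead_attention(x, wq, wk, wv, wo, n_heads):
    """Standard scaled dot-product multi-head attention.
    x: (B, N, d); wq, wk, wv, wo: (d, d) projection matrices."""
    B, N, d = x.shape
    hd = d // n_heads
    # project to Q, K, V and split into heads: (B, heads, N, head_dim)
    q = (x @ wq).view(B, N, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, N, n_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, N, n_heads, hd).transpose(1, 2)
    att = F.softmax(q @ k.transpose(-2, -1) / hd**0.5, dim=-1)  # (B, heads, N, N)
    out = (att @ v).transpose(1, 2).reshape(B, N, d)            # merge heads back
    return out @ wo
```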
SpectFormer: Frequency and Attention is what you need in a Vision Transformer
Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
two-dimensional convolutions to jointly encode the source-target sequences (translation)
On the Turing Completeness of Modern Neural Network Architectures; Attention is Turing-Complete
Star-Transformer: https://arxiv.org/abs/1902.09113
Hungry Hungry Hippos State Space Models
next: Hyena Hierarchy: Towards Larger Convolutional Language Models; gating (cached attention)
simpler transformer: One Wide Feedforward is All You Need
Attention (interdependencies) and the Feed Forward Network (removed from the decoder, fewer parameters)
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Mixtures of Experts (MoEs) vs dense transformers, more resource efficient
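A minimal sketch of the sparse MoE feed-forward layer being compared against a dense FFN; the router, expert list, and top-k choice are generic assumptions, not the paper's exact method:
```python
import torch
import torch.nn.functional as F

def moe_ffn(x, experts, router, k=2):
    """Route each token to its top-k experts out of E small MLPs.
    x: (N, d); router: nn.Linear(d, E); experts: list of E MLPs."""
    scores = F.softmax(router(x), dim=-1)          # (N, E) routing probabilities
    topv, topi = scores.topk(k, dim=-1)            # keep only the k best experts per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        hit = (topi == e).any(dim=-1)              # tokens routed to expert e
        if hit.any():
            w = topv[hit][topi[hit] == e].unsqueeze(-1)  # their routing weights
            out[hit] += w * expert(x[hit])         # only these tokens pay for expert e
    return out
```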
PASTA Pretrained Action-State Transformer Agents
self-supervised reinforcement learning
learning from behavioral and sensor-adaptation trajectories
no need to tailor pretraining to specific downstream applications
Retentive Network A Successor to Transformer for Large Language Models (RetNet)
low cost inference, training parallelism, strong performance
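The recurrent view of retention is what gives the cheap inference; a simplified single-head sketch, with the decay gamma as a scalar and normalization omitted:
```python
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    """Decayed running state S replaces a growing KV cache: O(1) memory per step.
    q, k, v: (T, d)."""
    T, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(T):
        S = gamma * S + k[t].unsqueeze(-1) @ v[t].unsqueeze(0)  # decayed state update
        outs.append(q[t] @ S)                                    # read out with the query
    return torch.stack(outs)  # matches the parallel (training) form up to normalization
```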
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
potential to streamline complex architectures for sequence-to-sequence tasks
WARM On the Benefits of Weight Averaged Reward Models
averaging the weights of multiple reward models to deal with inconsistencies
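The core operation is just parameter averaging across several fine-tuned reward models; a minimal sketch:
```python
import torch

def average_reward_models(state_dicts):
    """Average the parameters of reward models fine-tuned from the same init.
    Returns a state dict usable with model.load_state_dict(...)."""
    return {
        name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        for name in state_dicts[0]
    }
```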
RNNs: Gated recurrent neural networks discover attention
RWKV RNN instead of transformers
nanoRWKV: minGPT-like, does not require a custom CUDA kernel to train
H3 (Hungry Hungry Hippos): state space model instead of transformers
Perceiver: few latents instead of transformers
Repeat After Me: Transformers are Better than State Space Models at Copying
GSSMs (generalized state space models): a fixed-size latent state that doesn't depend on sequence length
limited compared to transformer models on tasks that require copying from the input context
repeat after me, dilated convolutions are better than transformers?
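The copying argument hinges on the fixed-size state; a minimal linear SSM recurrence makes the bottleneck explicit (shapes are illustrative):
```python
import torch

def linear_ssm(x, A, B, C):
    """h_t = A h_{t-1} + B x_t, y_t = C h_t: everything seen so far must be
    squeezed into a state of constant size n, independent of sequence length.
    x: (T, d_in); A: (n, n); B: (n, d_in); C: (d_out, n)."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # compress the whole past into n numbers
        ys.append(C @ h)
    return torch.stack(ys)
```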
Mamba Linear-Time Sequence Modeling with Selective State Spaces
architecture without attention or MLP
allowing the model to selectively propagate or forget information
outperforms Transformers of the same size, matches Transformers twice its size
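A rough sketch of the "selective" part: the SSM step size and read/write vectors are functions of the current input, so the recurrence decides per token what to propagate or forget. The projection layers are assumptions, and the real kernel is a parallel scan rather than a Python loop:
```python
import torch
import torch.nn.functional as F

def selective_scan(x, to_dt, to_B, to_C, A):
    """x: (T, d); A: (d, n) negative per-channel decays (diagonal SSM assumed);
    to_dt: nn.Linear(d, d); to_B, to_C: nn.Linear(d, n)."""
    T, d = x.shape
    n = A.shape[1]
    h = torch.zeros(d, n)
    ys = []
    for t in range(T):
        dt = F.softplus(to_dt(x[t]))                      # (d,) input-dependent step size
        decay = torch.exp(dt.unsqueeze(-1) * A)           # (d, n) how much old state survives
        B = to_B(x[t])                                    # (n,) input-dependent write direction
        C = to_C(x[t])                                    # (n,) input-dependent read direction
        h = decay * h + dt.unsqueeze(-1) * x[t].unsqueeze(-1) * B  # selective write
        ys.append(h @ C)                                  # selective read, (d,)
    return torch.stack(ys)
```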
Cobra Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
comparable performance to LLaVA with about 43% of the number of parameters
MambaMixer Efficient Selective State Space Models with Dual Token and Channel Selection
data-dependent weights that use a dual selection mechanism across tokens and channels
Jamba A Hybrid Transformer-Mamba Language Model
interleaves blocks of Transformer and Mamba layers
gzip vs attention: GZIP VS GPT
SimpleTRON Simple Transformer with O(N) Complexity (no transformer)
vs MetaFormer (PoolFormer, PureFormer)
maybe not the same: the github
Composable Function-preserving Expansions for Transformer Architectures ==best==
training pipelines for larger models by progressively expanding the architecture throughout training
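One concrete function-preserving expansion, in the spirit of the paper: widen a two-layer FFN by adding hidden units whose output weights are zero, so the expanded model computes exactly the same function and can keep training from there (shapes are illustrative):
```python
import torch

def widen_ffn_preserving(w_in, b_in, w_out, extra):
    """Grow hidden size m -> m + extra without changing the network's output.
    w_in: (m, d); b_in: (m,); w_out: (d, m)."""
    m, d = w_in.shape
    new_w_in = torch.cat([w_in, torch.randn(extra, d) * 0.02], dim=0)  # new units: arbitrary init
    new_b_in = torch.cat([b_in, torch.zeros(extra)], dim=0)
    new_w_out = torch.cat([w_out, torch.zeros(d, extra)], dim=1)       # zero out-weights => no effect yet
    return new_w_in, new_b_in, new_w_out
```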
parent: computer_vision
DINO: self-supervised Vision Transformers https://youtu.be/h3ij3F3cPIk
ConvNets Match Vision Transformers at Scale
match the performance of Vision Transformers with comparable compute budgets
Denoising Vision Transformers: removes artifacts, improves quality
MovieChat From Dense Token to Sparse Memory for Long Video Understanding
memory mechanism with a rapidly updated short-term memory and a sustained long-term memory
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
video visual recognition = computational savings
identifying the tokens that have changed significantly
can be converted from existing transformers
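The gist, as a hedged sketch: between frames, only tokens whose features changed beyond a threshold get re-processed, the rest reuse last frame's outputs. The L2-change test and the token-wise `block` are illustrative assumptions, not the paper's exact gating modules:
```python
import torch

def eventful_token_gate(prev_tokens, new_tokens, prev_out, block, threshold=0.1):
    """prev_tokens, new_tokens, prev_out: (N, d); block: a module applied
    token-wise (e.g. an MLP sub-block)."""
    delta = (new_tokens - prev_tokens).norm(dim=-1)   # (N,) per-token change vs. last frame
    active = delta > threshold                        # tokens worth recomputing
    out = prev_out.clone()
    if active.any():
        out[active] = block(new_tokens[active])       # recompute only the changed tokens
    return out, active
```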
ProPainter Improving Propagation and Transformer for Video Inpainting
inpainting = make mask, then remove
image and feature warping, discard redundant tokens, attention to distant frames
LongNet Scaling Transformers to 1,000,000,000 Tokens
SparseFormer Sparse Visual Recognition via Limited Latent Tokens <<sparseformer>>
codebook for video tokens, not optical flow, 49 tokens
SeiT: Storage-efficient Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
Token-based Storage
Subobject-level Image Tokenization
subobjects are represented by semantically meaningful image segments obtained by segmentation models
SEED Planting a SEED of Vision in Large Language Model
unify visual and textual representations
Image tokens independent of 2D patch positions, only 1D causal dependency
high-level semantics consistent with semantic abstraction of words
SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer (autoregressive Transformer)
comprehension and generation of images
Slide-Transformer Hierarchical Vision Transformer with Local Self-Attention
FasterViT Fast Vision Transformers with Hierarchical Attention
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
From Sparse to Soft Mixtures of Experts; sparse Transformer
passing different weighted combinations of all input tokens to each expert
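A minimal sketch of that soft dispatch (one slot per expert; names are illustrative): every slot gets a softmax-weighted mix of all tokens, and every token gets back a softmax-weighted mix of slot outputs.
```python
import torch
import torch.nn.functional as F

def soft_moe(x, phi, experts):
    """x: (N, d); phi: (d, S) learned slot parameters; experts: S small MLPs."""
    logits = x @ phi                       # (N, S) token-slot affinities
    dispatch = F.softmax(logits, dim=0)    # normalize over tokens: what each slot reads
    combine = F.softmax(logits, dim=1)     # normalize over slots: what each token gets back
    slots = dispatch.t() @ x               # (S, d) soft mixes of all input tokens
    slot_out = torch.stack([f(s) for f, s in zip(experts, slots)])  # (S, d)
    return combine @ slot_out              # (N, d)
```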