parent: train
Bytes Are All You Need: Transformers Operating Directly On File Bytes
NoPE: don't use positional encoding (PE) in Transformer decoders (GPTs)
Meta-Transformer A Unified Framework for Multimodal Learning
a unified data tokenizer, a modality-shared encoder, and task-specific heads
CoLT5 Faster Long-Range Transformers with Conditional Computation
strong gains up to 64k input length
SwitchHead Accelerating Transformers with Mixture-of-Experts Attention
reduces compute and memory, 4 to 8 times fewer attention matrices
Agent Attention: a balance between computational efficiency and representation power
generalized linear attention integrated with softmax, preserving global context modelling capability
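A minimal sketch of the two-stage agent-attention idea, assuming a small set of learned agent tokens as proxies (names and shapes are illustrative, not the paper's code):
```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agent):
    """Agent tokens (M << N) first aggregate the sequence, then queries
    attend only to the agents, cutting cost from O(N^2) to O(N*M).
    q, k, v: (B, N, d); agent: (B, M, d)."""
    d = q.shape[-1]
    # stage 1: agents read the whole sequence (M x N softmax attention)
    agent_v = F.softmax(agent @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v
    # stage 2: each query reads only from the M agents (N x M)
    return F.softmax(q @ agent.transpose(-2, -1) / d**0.5, dim=-1) @ agent_v
```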
contextual transformers (Algorithm Distillation): learns from itself, reinforcement learning
Elastic Decision Transformer
it is not optimal to use the full history of states as input for decisions; a shorter history works better
Cached Transformers Improving Transformers with Differentiable Memory Cache
Gated Recurrent Cached (GRC), extend the self-attention mechanism
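A hypothetical sketch of a gated, differentiable memory cache in this spirit; the token summary and sigmoid gate are my assumptions, not the paper's exact GRC equations:
```python
import torch

def gated_cache_update(cache, tokens, gate_proj):
    """Blend a fixed-size memory cache with a summary of the new tokens via a
    learned sigmoid gate, keeping the cache differentiable end to end.
    cache: (B, M, d); tokens: (B, N, d); gate_proj: nn.Linear(d, d)."""
    summary = tokens.mean(dim=1, keepdim=True).expand_as(cache)  # crude summary of new tokens
    g = torch.sigmoid(gate_proj(cache))                          # per-slot, per-channel gate
    return g * cache + (1.0 - g) * summary                       # keep vs. write; attend to [tokens; cache]
```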
What are Q, K, V? multi-head attention?
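As a reminder, the standard formulation: queries ask, keys index, values carry content, and heads are parallel low-dimensional copies of the same mechanism.
```python
import torch
import torch.nn.functional as F

def multihead_attention(x, wq, wk, wv, wo, n_heads):
    """Standard scaled dot-product multi-head attention.
    x: (B, N, d); wq, wk, wv, wo: (d, d) projection matrices."""
    B, N, d = x.shape
    hd = d // n_heads
    # project to Q, K, V and split into heads: (B, heads, N, head_dim)
    q = (x @ wq).view(B, N, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, N, n_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, N, n_heads, hd).transpose(1, 2)
    att = F.softmax(q @ k.transpose(-2, -1) / hd**0.5, dim=-1)  # (B, heads, N, N)
    out = (att @ v).transpose(1, 2).reshape(B, N, d)            # merge heads back
    return out @ wo
```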
SpectFormer: Frequency and Attention is what you need in a Vision Transformer
Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
two-dimensional convolutions to jointly encode the source-target sequences (translation)
On the Turing Completeness of Modern Neural Network Architectures; Attention is Turing-Complete
Star-Transformer: https://arxiv.org/abs/1902.09113
Hungry Hungry Hippos State Space Models
next: Hyena Hierarchy: Towards Larger Convolutional Language Models; gating (cached attention)
simpler transformer: One Wide Feedforward is All You Need
Attention (interdependencies) and the Feed Forward Network (removed from the decoder, fewer parameters)
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Mixtures of Experts (MoEs) vs dense transformers, more resource efficient
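A minimal sketch of the sparse MoE feed-forward layer being compared against a dense FFN; the router, expert list, and top-k choice are generic assumptions, not the paper's exact method:
```python
import torch
import torch.nn.functional as F

def moe_ffn(x, experts, router, k=2):
    """Route each token to its top-k experts out of E small MLPs.
    x: (N, d); router: nn.Linear(d, E); experts: list of E MLPs."""
    scores = F.softmax(router(x), dim=-1)          # (N, E) routing probabilities
    topv, topi = scores.topk(k, dim=-1)            # keep only the k best experts per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        hit = (topi == e).any(dim=-1)              # tokens routed to expert e
        if hit.any():
            w = topv[hit][topi[hit] == e].unsqueeze(-1)  # their routing weights
            out[hit] += w * expert(x[hit])         # only these tokens pay for expert e
    return out
```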
PASTA Pretrained Action-State Transformer Agents
self-supervised reinforcement learning
learning from behavioral and sensor-adaptation trajectories
no need to tailor pretraining to specific downstream applications
Retentive Network A Successor to Transformer for Large Language Models (RetNet)
low cost inference, training parallelism, strong performance
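The recurrent view of retention is what gives the cheap inference; a simplified single-head sketch, with the decay gamma as a scalar and normalization omitted:
```python
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    """Decayed running state S replaces a growing KV cache: O(1) memory per step.
    q, k, v: (T, d)."""
    T, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(T):
        S = gamma * S + k[t].unsqueeze(-1) @ v[t].unsqueeze(0)  # decayed state update
        outs.append(q[t] @ S)                                    # read out with the query
    return torch.stack(outs)  # matches the parallel (training) form up to normalization
```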
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
potential to streamline complex architectures for sequence-to-sequence tasks
WARM On the Benefits of Weight Averaged Reward Models
averaging the weights of multiple reward models to deal with inconsistencies
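The core operation is just parameter averaging across several fine-tuned reward models; a minimal sketch:
```python
import torch

def average_reward_models(state_dicts):
    """Average the parameters of reward models fine-tuned from the same init.
    Returns a state dict usable with model.load_state_dict(...)."""
    return {
        name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        for name in state_dicts[0]
    }
```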
RNNs: Gated recurrent neural networks discover attention
RWKV RNN instead of transformers
nanoRWKV: minGPT-like, does not require a custom CUDA kernel to train
H3 (Hungry Hungry Hippos): state space model instead of transformers
Perceiver: few latents instead of transformers
Repeat After Me: Transformers are Better than State Space Models at Copying
GSSMs (generalized state space models): a fixed-size latent state that doesn't depend on sequence length
limited compared to transformer models on tasks that require copying from the input context
repeat after me, dilated convolutions are better than transformers?
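The copying argument hinges on the fixed-size state; a minimal linear SSM recurrence makes the bottleneck explicit (shapes are illustrative):
```python
import torch

def linear_ssm(x, A, B, C):
    """h_t = A h_{t-1} + B x_t, y_t = C h_t: everything seen so far must be
    squeezed into a state of constant size n, independent of sequence length.
    x: (T, d_in); A: (n, n); B: (n, d_in); C: (d_out, n)."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # compress the whole past into n numbers
        ys.append(C @ h)
    return torch.stack(ys)
```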
Mamba Linear-Time Sequence Modeling with Selective State Spaces
architecture without attention or MLP
allowing the model to selectively propagate or forget information
outperforms Transformers of the same size, matches Transformers twice its size
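A rough sketch of the "selective" part: the SSM step size and read/write vectors are functions of the current input, so the recurrence decides per token what to propagate or forget. The projection layers are assumptions, and the real kernel is a parallel scan rather than a Python loop:
```python
import torch
import torch.nn.functional as F

def selective_scan(x, to_dt, to_B, to_C, A):
    """x: (T, d); A: (d, n) negative per-channel decays (diagonal SSM assumed);
    to_dt: nn.Linear(d, d); to_B, to_C: nn.Linear(d, n)."""
    T, d = x.shape
    n = A.shape[1]
    h = torch.zeros(d, n)
    ys = []
    for t in range(T):
        dt = F.softplus(to_dt(x[t]))                      # (d,) input-dependent step size
        decay = torch.exp(dt.unsqueeze(-1) * A)           # (d, n) how much old state survives
        B = to_B(x[t])                                    # (n,) input-dependent write direction
        C = to_C(x[t])                                    # (n,) input-dependent read direction
        h = decay * h + dt.unsqueeze(-1) * x[t].unsqueeze(-1) * B  # selective write
        ys.append(h @ C)                                  # selective read, (d,)
    return torch.stack(ys)
```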
Cobra Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
comparable performance to LLaVA with about 43% of the number of parameters
MambaMixer Efficient Selective State Space Models with Dual Token and Channel Selection
data-dependent weights that use a dual selection mechanism across tokens and channels
Jamba A Hybrid Transformer-Mamba Language Model
interleaves blocks of Transformer and Mamba layers
gzip vs attention: GZIP VS GPT
SimpleTRON Simple Transformer with O(N) Complexity (no transformer)
vs MetaFormer (PoolFormer, PureFormer)
maybe not the same: the github
Composable Function-preserving Expansions for Transformer Architectures ==best==
training pipelines for larger models by progressively expanding the architecture throughout training
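One concrete function-preserving expansion, in the spirit of the paper: widen a two-layer FFN by adding hidden units whose output weights are zero, so the expanded model computes exactly the same function and can keep training from there (shapes are illustrative):
```python
import torch

def widen_ffn_preserving(w_in, b_in, w_out, extra):
    """Grow hidden size m -> m + extra without changing the network's output.
    w_in: (m, d); b_in: (m,); w_out: (d, m)."""
    m, d = w_in.shape
    new_w_in = torch.cat([w_in, torch.randn(extra, d) * 0.02], dim=0)  # new units: arbitrary init
    new_b_in = torch.cat([b_in, torch.zeros(extra)], dim=0)
    new_w_out = torch.cat([w_out, torch.zeros(d, extra)], dim=1)       # zero out-weights => no effect yet
    return new_w_in, new_b_in, new_w_out
```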
parent: computer_vision
DINO: self-supervised Vision Transformers https://youtu.be/h3ij3F3cPIk
ConvNets Match Vision Transformers at Scale
match the performance of Vision Transformers with comparable compute budgets
Denoising Vision Transformers: removes artifacts, improves quality
MovieChat From Dense Token to Sparse Memory for Long Video Understanding
memory mechanism with a rapidly updated short-term memory and a sustained long-term memory
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
video visual recognition = computational savings
identifying the tokens that have changed significantly
can be converted from existing transformers
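The gist, as a hedged sketch: between frames, only tokens whose features changed beyond a threshold get re-processed, the rest reuse last frame's outputs. The L2-change test and the token-wise `block` are illustrative assumptions, not the paper's exact gating modules:
```python
import torch

def eventful_token_gate(prev_tokens, new_tokens, prev_out, block, threshold=0.1):
    """prev_tokens, new_tokens, prev_out: (N, d); block: a module applied
    token-wise (e.g. an MLP sub-block)."""
    delta = (new_tokens - prev_tokens).norm(dim=-1)   # (N,) per-token change vs. last frame
    active = delta > threshold                        # tokens worth recomputing
    out = prev_out.clone()
    if active.any():
        out[active] = block(new_tokens[active])       # recompute only the changed tokens
    return out, active
```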
ProPainter Improving Propagation and Transformer for Video Inpainting
inpainting = make mask, then remove
image and feature warping, discard redundant tokens, attention to distant frames
LongNet Scaling Transformers to 1,000,000,000 Tokens
SparseFormer Sparse Visual Recognition via Limited Latent Tokens <<sparseformer>>
codebook for video tokens, not optical flow, 49 tokens
SeiT: Storage-efficient Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
Token-based Storage
Subobject-level Image Tokenization
subobjects are represented by semantically meaningful image segments obtained by segmentation models
SEED Planting a SEED of Vision in Large Language Model
unify visual and textual representations
Image tokens independent of 2D patch positions, only 1D causal dependency
high-level semantics consistent with semantic abstraction of words
SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer (autoregressive Transformer)
comprehension and generation of images
Slide-Transformer Hierarchical Vision Transformer with Local Self-Attention
FasterViT Fast Vision Transformers with Hierarchical Attention
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
From Sparse to Soft Mixtures of Experts; sparse Transformer
passing different weighted combinations of all input tokens to each expert
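A minimal sketch of that soft dispatch (one slot per expert; names are illustrative): every slot gets a softmax-weighted mix of all tokens, and every token gets back a softmax-weighted mix of slot outputs.
```python
import torch
import torch.nn.functional as F

def soft_moe(x, phi, experts):
    """x: (N, d); phi: (d, S) learned slot parameters; experts: S small MLPs."""
    logits = x @ phi                       # (N, S) token-slot affinities
    dispatch = F.softmax(logits, dim=0)    # normalize over tokens: what each slot reads
    combine = F.softmax(logits, dim=1)     # normalize over slots: what each token gets back
    slots = dispatch.t() @ x               # (S, d) soft mixes of all input tokens
    slot_out = torch.stack([f(s) for f, s in zip(experts, slots)])  # (S, d)
    return combine @ slot_out              # (N, d)
```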