VISSL computer VIsion library for Self-Supervised Learning
OpenFlamingo An Open-Source Framework for Training Large Autoregressive Vision-Language Models
FastV An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
plug-and-play inference acceleration method that prunes redundant visual tokens
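a minimal sketch of the pruning idea, assuming access to one early layer's attention maps (all tensor names hypothetical):
```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_len, keep_ratio=0.5):
    """Drop the least-attended visual tokens after an early layer (FastV-style sketch).

    hidden: (B, T, D) hidden states; attn: (B, heads, T, T) attention weights;
    visual tokens occupy positions [vis_start, vis_start + vis_len).
    """
    # average attention each visual token receives, over heads and query positions
    score = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_start + vis_len]  # (B, vis_len)
    k = max(1, int(vis_len * keep_ratio))
    keep = score.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original token order
    vis = hidden[:, vis_start:vis_start + vis_len]
    kept = torch.gather(vis, 1, keep.unsqueeze(-1).expand(-1, -1, vis.size(-1)))
    # reassemble: prefix + pruned visual tokens + suffix
    return torch.cat([hidden[:, :vis_start], kept, hidden[:, vis_start + vis_len:]], dim=1)
```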
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
MyVLM Personalizing VLMs for User-Specific Queries
personalization of VLMs, enabling them to learn and reason over user-provided concepts
PolyMaX General Dense Prediction with Mask Transformer
cluster-prediction instead of per-pixel prediction
segmentation, depth, and normals from a single image
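the cluster-prediction decoding can be sketched as mask-weighted mixing of per-cluster values (shapes assumed):
```python
import torch

def cluster_to_dense(masks, values):
    """PolyMaX-style decoding sketch: dense maps from K cluster masks + per-cluster values.

    masks: (K, H, W) mask logits; values: (K, C) one prediction per cluster
    (class distribution, depth-bin center, or normal vector).
    """
    probs = masks.softmax(dim=0)                       # soft pixel-to-cluster assignment
    return torch.einsum("khw,kc->chw", probs, values)  # (C, H, W) dense prediction
```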
DSINE Rethinking Inductive Biases for Surface Normal Estimation (single image)
per-pixel ray direction as an additional input to the network
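the extra input can be computed from the camera intrinsics alone; a minimal sketch (parameter names are the usual pinhole intrinsics):
```python
import torch

def pixel_ray_directions(H, W, fx, fy, cx, cy):
    """Per-pixel unit ray directions from pinhole intrinsics, to concatenate with the RGB input."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    rays = torch.stack([(xs - cx) / fx, (ys - cy) / fy, torch.ones(H, W)], dim=0)
    return rays / rays.norm(dim=0, keepdim=True)  # (3, H, W)
```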
V-JEPA learns to understand and model the physical world by watching videos
predicts masked parts of the video; learns to inpaint
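a sketch of the masked latent-prediction step, with encoder/ema_encoder/predictor as placeholder modules:
```python
import torch
import torch.nn.functional as F

def jepa_step(encoder, ema_encoder, predictor, video_tokens, mask):
    """One V-JEPA-style step: predict latents of masked video patches (sketch).

    video_tokens: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked.
    """
    ctx = encoder(video_tokens * (~mask).unsqueeze(-1))  # encode only the visible context
    with torch.no_grad():
        target = ema_encoder(video_tokens)               # full-video targets, no gradient
    pred = predictor(ctx, mask)                          # predict the masked latents
    return F.l1_loss(pred[mask], target[mask])           # loss only on masked positions
```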
World Model on Million-Length Video And Language With RingAttention
gradually increase context size from 4K to 1M tokens
DocLLM A layout-aware generative language model for multimodal document understanding (JPMorgan)
taking into account both textual semantics and spatial layout
learns to infill text segments
MiniGPT-4, with image detector, image tokenizer https://github.com/Vision-CAIR/MiniGPT-4
MSViT Dynamic Mixed-Scale Tokenization for Vision Transformers
dynamic tokenizer for ViTs, where the scale at which an image is processed varies based on semantic details
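a sketch of the gating idea, with `gate` as a placeholder scorer (the real model projects both scales to one embedding dim and batches the ragged output):
```python
import torch

def mixed_scale_tokens(img, gate, coarse=32, fine=16):
    """Tokenize semantically rich regions finely, the rest coarsely (MSViT-style sketch).

    img: (B, C, H, W); gate: callable, (B, N, C*coarse*coarse) -> (B, N) bool.
    """
    B, C, H, W = img.shape
    p = img.unfold(2, coarse, coarse).unfold(3, coarse, coarse)  # (B, C, h, w, cs, cs)
    p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * coarse * coarse)
    refine = gate(p)                          # which regions get fine-scale tokens
    tokens = []
    for b in range(B):
        for n in range(p.size(1)):
            region = p[b, n].reshape(C, coarse, coarse)
            if refine[b, n]:                  # split rich regions into fine sub-patches
                sub = region.unfold(1, fine, fine).unfold(2, fine, fine)
                tokens.append(sub.permute(1, 2, 0, 3, 4).reshape(-1, C * fine * fine))
            else:                             # keep plain regions as a single coarse token
                tokens.append(region.reshape(1, -1))
    return tokens  # ragged token list, one entry per region
```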
DualToken-ViT Position-aware Efficient Vision Transformer with Dual Token Fusion
fuses local information via convolution (CNN) and global information via self-attention (ViT-style)
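a minimal sketch of the dual-branch fusion (layer sizes illustrative):
```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    """Fuse conv-local and attention-global tokens (DualToken-ViT-style sketch)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)       # local info: depthwise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global info: self-attention
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                                # x: (B, dim, H, W)
        B, D, H, W = x.shape
        loc = self.local(x).flatten(2).transpose(1, 2)   # (B, HW, D) local tokens
        seq = x.flatten(2).transpose(1, 2)
        glob, _ = self.attn(seq, seq, seq)               # (B, HW, D) global tokens
        return self.fuse(torch.cat([loc, glob], dim=-1)) # fused dual tokens
```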
models: ==LLaVA, Qwen-VL==
InstructBLIP Towards General-purpose Vision-Language Models with Instruction Tuning
image understanding
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
reasoning over independent vision modules
NeVA NeMo Vision and Language Assistant, informative responses (wiki-like answers)
Fuyu-8B has no image encoder, interleaving of text and images at arbitrary image resolutions
understanding diagrams, charts, and graphs ==best==
Answering UI-based questions, bounding boxes, OCR
OtterHD A High-Resolution Multi-modality Model
to interpret high-resolution visual inputs with granular precision
CogVLM Visual Expert for Pretrained Language Models
frozen LLM and image encoder connected with a trainable visual expert module
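the routing trick can be sketched per linear layer: frozen text weights, a trainable copy for image positions (names hypothetical):
```python
import torch
import torch.nn as nn

class VisualExpertLinear(nn.Module):
    """CogVLM-style sketch: image tokens go through a trainable copy of a frozen LLM layer."""
    def __init__(self, frozen: nn.Linear):
        super().__init__()
        self.text = frozen
        for param in self.text.parameters():
            param.requires_grad = False          # pretrained LLM weights stay frozen
        self.image = nn.Linear(frozen.in_features, frozen.out_features)  # trainable expert

    def forward(self, x, image_mask):            # x: (B, T, D); image_mask: (B, T) bool
        # route image positions through the expert, text positions through the frozen layer
        return torch.where(image_mask.unsqueeze(-1), self.image(x), self.text(x))
```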
MoE-LLaVA Mixture-of-Experts for Large Vision-Language Models
sparse model with an outrageous number of parameters but a constant computational cost
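the constant-compute property comes from top-k routing; a generic sparse-MoE FFN sketch:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Many experts, but each token pays for only k of them (sketch)."""
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                              # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)      # each token visits only k experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (topi == e).any(dim=-1)              # tokens routed to expert e
            if hit.any():
                w = topw[hit][topi[hit] == e].unsqueeze(-1)
                out[hit] += w * expert(x[hit])
        return out
```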
InternLM-XComposer A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
InstaGen Enhancing Object Detection by Training on Synthetic Dataset
training on synthetic dataset generated from diffusion models
self-training scheme on (novel) categories not covered by the detector
Qwen-VL A Frontier Large Vision-Language Model with Versatile Abilities
accepts and can output bounding boxes
Qwen-Audio audio-based QA
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
image transformation instructions, has image segmentation
Draw-and-Understand Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
can also use numbered marks
WebSight converting web screenshots into HTML code
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
better localization and visually-situated text understanding
BLIVA A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
with architectural comparison to previous models
query embeddings mapped to visual-patch embeddings
GLaMM Pixel Grounding Large Multimodal Model (generates masks for objects)
Ferret Refer and Ground Anything Anywhere at Any Granularity
both boxes and free-form shapes
CoDA for 3D object detection: discovers and classifies novel objects, without a 2D model
SceneScript Reconstructing Scenes With An Autoregressive Structured Language Model
gets layout (boxes) from a 3D view of a scene (dynamic, in-game)
Video-LLaVA Learning United Visual Representation by Alignment Before Projection
existing approaches encode images and videos into separate feature spaces
mixed dataset of images and videos
Soft Video Understanding
audio is crucial for overall understanding, helping the LLM generate a summary
GeneCIS A Benchmark for General Conditional Image Similarity
models should adapt to the notion of similarity dynamically
Vocabulary-free Image Classification
SITTA A Semantic Image-Text Alignment for Image Captioning
linear semantic mapping enables image captioning without access to gradient information; less computation
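the gradient-free part can be sketched as a closed-form least-squares fit between the two embedding spaces (shapes assumed):
```python
import torch

def fit_semantic_mapping(img_emb, txt_emb):
    """Fit a linear map from image-encoder space to the LM's token-embedding space.

    img_emb: (N, d_img) image embeddings; txt_emb: (N, d_lm) paired text embeddings.
    Closed-form least squares, so neither model needs to expose gradients.
    """
    return torch.linalg.lstsq(img_emb, txt_emb).solution  # (d_img, d_lm)

# usage sketch: soft prompt for a frozen LM
# prompt = image_embedding @ W  ->  prepend to the LM's input embeddings
```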
Guiding Image Captioning Models Toward More Specific Captions
CIC A framework for Culturally-aware Image Captioning
extracts cultural visual elements via Visual Question Answering (VQA)
VideoReCap: Recursive Captioning of Hour-Long Videos
video has hierarchical structure spanning different temporal granularities
exploit the synergy between different video hierarchies
LoSA Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
classifying action snippets in an untrimmed video
memory-and-parameter-efficient backbone
Segment and Caption Anything
generate regional captions
RegionGPT Towards Region Understanding Vision Language Model
region-level captions, description, reasoning, object classification, and referring expressions comprehension
Text-Guided Image Clustering
text representations obtained via VQA often outperform image features
DIffusion FeaTures (DIFT) Emergent Correspondence from Image Diffusion
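a correspondence sketch, with `feat_fn` as a placeholder for hooking one U-Net layer at a chosen noise timestep:
```python
import torch
import torch.nn.functional as F

def dift_match(feat_fn, img_a, img_b, point_a, t=261):
    """Match a point across images via intermediate diffusion features (DIFT-style sketch).

    feat_fn(img, t) -> (D, H, W) feature map; point_a = (row, col) in feature coordinates.
    """
    fa, fb = feat_fn(img_a, t), feat_fn(img_b, t)
    q = fa[:, point_a[0], point_a[1]]                       # query descriptor
    sim = F.cosine_similarity(q[:, None, None], fb, dim=0)  # (H, W) similarity map
    idx = sim.flatten().argmax()
    return divmod(idx.item(), sim.size(1))                  # best (row, col) in image B
```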
Your Diffusion Model is Secretly a Zero-Shot Classifier
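the idea in a sketch: score each class by how well the text-conditioned model predicts the injected noise (forward process simplified, no alpha schedule; all names hypothetical):
```python
import torch

def diffusion_classify(eps_model, x, class_embs, n_trials=4):
    """Pick the class whose conditioning gives the lowest denoising error (sketch).

    eps_model(noisy_x, t, cond) -> predicted noise; class_embs: one text embedding per class.
    """
    errors = []
    for cond in class_embs:
        err = 0.0
        for _ in range(n_trials):                 # average over random timesteps/noise
            t = torch.randint(0, 1000, (1,))
            noise = torch.randn_like(x)
            noisy = x + noise                     # simplified forward process
            err += (eps_model(noisy, t, cond) - noise).pow(2).mean()
        errors.append(err / n_trials)
    return int(torch.stack(errors).argmin())
```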
CLaMP CLIP for music
CLAP (Contrastive Language-Audio Pretraining)
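the core objective is a CLIP-style symmetric contrastive loss over paired audio/text embeddings:
```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over B matching audio-text pairs (sketch)."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                   # (B, B) pairwise similarities
    labels = torch.arange(len(a), device=a.device)   # matching pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```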
FLAP Fast Language-Audio Pre-training
learns to reconstruct the masked portion of audio tokens
Pengi An Audio Language Model for Audio Tasks
audio understanding
MusicAgent An AI Agent for Music Understanding and Generation with Large Language Models
decompose user requests into multiple sub-tasks and invoke corresponding music tools
Whisper-AT Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
Whisper's audio representation is actually not noise-invariant
audio tagging model on top, <1% extra computation, a single forward pass
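the cheap-tagging idea can be sketched as a small head over the encoder states the ASR pass already computes (shapes hypothetical; the real model uses a time/layer transformer):
```python
import torch
import torch.nn as nn

class AudioTagHead(nn.Module):
    """Tag audio events from intermediate Whisper encoder states (sketch)."""
    def __init__(self, n_layers, dim, n_events=527):       # 527 = AudioSet classes
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # learned mix over layers
        self.head = nn.Linear(dim, n_events)

    def forward(self, layer_states):                       # (n_layers, B, T, dim)
        w = self.layer_weights.softmax(0).view(-1, 1, 1, 1)
        pooled = (layer_states * w).sum(0).mean(1)         # mix layers, average over time
        return self.head(pooled)                           # (B, n_events) tag logits
```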
Distil-Whisper a distilled Whisper: 6x faster, 50% smaller
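usage sketch via the standard Hugging Face ASR pipeline (model id assumed to be the published distil-large-v2 checkpoint):
```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v2")
print(asr("speech.wav")["text"])
```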
Whisper Large-v3
word-level timestamps w/ Whisper
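with the openai-whisper package this is a transcribe flag:
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("speech.wav", word_timestamps=True)
for seg in result["segments"]:
    for w in seg["words"]:                          # each word carries start/end times
        print(f'{w["start"]:6.2f}-{w["end"]:6.2f} {w["word"]}')
```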
fast whisper now with speaker diarisation
SeamlessM4T Speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
Inverted Whisper = WhisperSpeech (everything open-sourced) ==best==