VISSL computer VIsion library for Self-Supervised Learning
OpenFlamingo An Open-Source Framework for Training Large Autoregressive Vision-Language Models
FastV An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
plug-and-play inference acceleration method that prunes redundant visual tokens
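a minimal sketch of the pruning idea, assuming access to one early layer's attention maps (all tensor names hypothetical):
```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_len, keep_ratio=0.5):
    """Drop the least-attended visual tokens after an early layer (FastV-style sketch).

    hidden: (B, T, D) hidden states; attn: (B, heads, T, T) attention weights;
    visual tokens occupy positions [vis_start, vis_start + vis_len).
    """
    # average attention each visual token receives, over heads and query positions
    score = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_start + vis_len]  # (B, vis_len)
    k = max(1, int(vis_len * keep_ratio))
    keep = score.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original token order
    vis = hidden[:, vis_start:vis_start + vis_len]
    kept = torch.gather(vis, 1, keep.unsqueeze(-1).expand(-1, -1, vis.size(-1)))
    # reassemble: prefix + pruned visual tokens + suffix
    return torch.cat([hidden[:, :vis_start], kept, hidden[:, vis_start + vis_len:]], dim=1)
```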
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
MyVLM Personalizing VLMs for User-Specific Queries
personalization of VLMs, enabling them to learn and reason over user-provided concepts
PolyMaX General Dense Prediction with Mask Transformer
cluster-prediction instead of per-pixel prediction
segmentation, depth, and normals from a single image
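the cluster-prediction decoding can be sketched as mask-weighted mixing of per-cluster values (shapes assumed):
```python
import torch

def cluster_to_dense(masks, values):
    """PolyMaX-style decoding sketch: dense maps from K cluster masks + per-cluster values.

    masks: (K, H, W) mask logits; values: (K, C) one prediction per cluster
    (class distribution, depth-bin center, or normal vector).
    """
    probs = masks.softmax(dim=0)                       # soft pixel-to-cluster assignment
    return torch.einsum("khw,kc->chw", probs, values)  # (C, H, W) dense prediction
```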
DSINE Rethinking Inductive Biases for Surface Normal Estimation (single image)
per-pixel ray direction as an additional input to the network
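the extra input can be computed from the camera intrinsics alone; a minimal sketch (parameter names are the usual pinhole intrinsics):
```python
import torch

def pixel_ray_directions(H, W, fx, fy, cx, cy):
    """Per-pixel unit ray directions from pinhole intrinsics, to concatenate with the RGB input."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    rays = torch.stack([(xs - cx) / fx, (ys - cy) / fy, torch.ones(H, W)], dim=0)
    return rays / rays.norm(dim=0, keepdim=True)  # (3, H, W)
```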
V-JEPA learns to understand and model the physical world by watching videos
predicts masked parts of the video; learns to inpaint
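a sketch of the masked latent-prediction step, with encoder/ema_encoder/predictor as placeholder modules:
```python
import torch
import torch.nn.functional as F

def jepa_step(encoder, ema_encoder, predictor, video_tokens, mask):
    """One V-JEPA-style step: predict latents of masked video patches (sketch).

    video_tokens: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked.
    """
    ctx = encoder(video_tokens * (~mask).unsqueeze(-1))  # encode only the visible context
    with torch.no_grad():
        target = ema_encoder(video_tokens)               # full-video targets, no gradient
    pred = predictor(ctx, mask)                          # predict the masked latents
    return F.l1_loss(pred[mask], target[mask])           # loss only on masked positions
```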
World Model on Million-Length Video And Language With RingAttention
gradually increase context size from 4K to 1M tokens
DocLLM A layout-aware generative language model for multimodal document understanding (JPMorgan)
taking into account both textual semantics and spatial layout
learns to infill text segments
MiniGPT-4, with image detector, image tokenizer https://github.com/Vision-CAIR/MiniGPT-4
MSViT Dynamic Mixed-Scale Tokenization for Vision Transformers
dynamic tokenizer for ViTs, where the scale at which an image is processed varies based on semantic details
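a sketch of the gating idea, with `gate` as a placeholder scorer (the real model projects both scales to one embedding dim and batches the ragged output):
```python
import torch

def mixed_scale_tokens(img, gate, coarse=32, fine=16):
    """Tokenize semantically rich regions finely, the rest coarsely (MSViT-style sketch).

    img: (B, C, H, W); gate: callable, (B, N, C*coarse*coarse) -> (B, N) bool.
    """
    B, C, H, W = img.shape
    p = img.unfold(2, coarse, coarse).unfold(3, coarse, coarse)  # (B, C, h, w, cs, cs)
    p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * coarse * coarse)
    refine = gate(p)                          # which regions get fine-scale tokens
    tokens = []
    for b in range(B):
        for n in range(p.size(1)):
            region = p[b, n].reshape(C, coarse, coarse)
            if refine[b, n]:                  # split rich regions into fine sub-patches
                sub = region.unfold(1, fine, fine).unfold(2, fine, fine)
                tokens.append(sub.permute(1, 2, 0, 3, 4).reshape(-1, C * fine * fine))
            else:                             # keep plain regions as a single coarse token
                tokens.append(region.reshape(1, -1))
    return tokens  # ragged token list, one entry per region
```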
DualToken-ViT Position-aware Efficient Vision Transformer with Dual Token Fusion
fuses local information via convolution (CNN) and global information via self-attention (ViT-style)
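a minimal sketch of the dual-branch fusion (layer sizes illustrative):
```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    """Fuse conv-local and attention-global tokens (DualToken-ViT-style sketch)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)       # local info: depthwise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global info: self-attention
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                                # x: (B, dim, H, W)
        B, D, H, W = x.shape
        loc = self.local(x).flatten(2).transpose(1, 2)   # (B, HW, D) local tokens
        seq = x.flatten(2).transpose(1, 2)
        glob, _ = self.attn(seq, seq, seq)               # (B, HW, D) global tokens
        return self.fuse(torch.cat([loc, glob], dim=-1)) # fused dual tokens
```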
models: ==LLaVA, Qwen-VL==
InstructBLIP Towards General-purpose Vision-Language Models with Instruction Tuning
image understanding
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
reasoning over independent vision modules
NeVA NeMo Vision and Language Assistant, informative responses (wiki-like answers)
Fuyu-8B has no image encoder, interleaving of text and images at arbitrary image resolutions
understanding diagrams, charts, and graphs ==best==
Answering UI-based questions, bounding boxes, OCR
OtterHD A High-Resolution Multi-modality Model
to interpret high-resolution visual inputs with granular precision
CogVLM Visual Expert for Pretrained Language Models
frozen LLM and image encoder connected with a trainable visual expert module
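the routing trick can be sketched per linear layer: frozen text weights, a trainable copy for image positions (names hypothetical):
```python
import torch
import torch.nn as nn

class VisualExpertLinear(nn.Module):
    """CogVLM-style sketch: image tokens go through a trainable copy of a frozen LLM layer."""
    def __init__(self, frozen: nn.Linear):
        super().__init__()
        self.text = frozen
        for param in self.text.parameters():
            param.requires_grad = False          # pretrained LLM weights stay frozen
        self.image = nn.Linear(frozen.in_features, frozen.out_features)  # trainable expert

    def forward(self, x, image_mask):            # x: (B, T, D); image_mask: (B, T) bool
        # route image positions through the expert, text positions through the frozen layer
        return torch.where(image_mask.unsqueeze(-1), self.image(x), self.text(x))
```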
MoE-LLaVA Mixture-of-Experts for Large Vision-Language Models
sparse model with an outrageous number of parameters but a constant computational cost
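the constant-compute property comes from top-k routing; a generic sparse-MoE FFN sketch:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Many experts, but each token pays for only k of them (sketch)."""
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                              # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)      # each token visits only k experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (topi == e).any(dim=-1)              # tokens routed to expert e
            if hit.any():
                w = topw[hit][topi[hit] == e].unsqueeze(-1)
                out[hit] += w * expert(x[hit])
        return out
```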
InternLM-XComposer A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
InstaGen Enhancing Object Detection by Training on Synthetic Dataset
training on synthetic dataset generated from diffusion models
self-training scheme on (novel) categories not covered by the detector
Qwen-VL A Frontier Large Vision-Language Model with Versatile Abilities
accepts and can output bounding boxes
Qwen-Audio audio-based QA
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
image transformation instructions, has image segmentation
Draw-and-Understand Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
can also use numbered marks
WebSight converting web screenshots into HTML code
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
better localization and visually-situated text understanding
BLIVA A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
with architectural comparison to previous models
query embeddings mapped to visual-patch embeddings
GLaMM Pixel Grounding Large Multimodal Model (generates masks for objects)
Ferret Refer and Ground Anything Anywhere at Any Granularity
both boxes and free-form shapes
CoDA for 3D object detection: discovers and classifies novel objects, without a 2D model
SceneScript Reconstructing Scenes With An Autoregressive Structured Language Model
gets layout (boxes) from a 3D view of a scene (dynamic, in-game)
Video-LLaVA Learning United Visual Representation by Alignment Before Projection
existing approaches encode images and videos into separate feature spaces
mixed dataset of images and videos
Soft Video Understanding
audio is crucial for overall understanding, helping the LLM generate a summary
GeneCIS A Benchmark for General Conditional Image Similarity
models should adapt to the notion of similarity dynamically
Vocabulary-free Image Classification
SITTA A Semantic Image-Text Alignment for Image Captioning
linear semantic mapping enables image captioning without access to gradient information; less computation
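the gradient-free part can be sketched as a closed-form least-squares fit between the two embedding spaces (shapes assumed):
```python
import torch

def fit_semantic_mapping(img_emb, txt_emb):
    """Fit a linear map from image-encoder space to the LM's token-embedding space.

    img_emb: (N, d_img) image embeddings; txt_emb: (N, d_lm) paired text embeddings.
    Closed-form least squares, so neither model needs to expose gradients.
    """
    return torch.linalg.lstsq(img_emb, txt_emb).solution  # (d_img, d_lm)

# usage sketch: soft prompt for a frozen LM
# prompt = image_embedding @ W  ->  prepend to the LM's input embeddings
```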
Guiding Image Captioning Models Toward More Specific Captions
CIC A framework for Culturally-aware Image Captioning
extracts cultural visual elements via Visual Question Answering (VQA)
VideoReCap: Recursive Captioning of Hour-Long Videos
video has hierarchical structure spanning different temporal granularities
exploit the synergy between different video hierarchies
LoSA Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
classifying action snippets in an untrimmed video
memory-and-parameter-efficient backbone
Segment and Caption Anything
generate regional captions
RegionGPT Towards Region Understanding Vision Language Model
region-level captions, description, reasoning, object classification, and referring expressions comprehension
Text-Guided Image Clustering
text representations obtained via VQA often outperform image features
DIffusion FeaTures (DIFT) Emergent Correspondence from Image Diffusion
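a correspondence sketch, with `feat_fn` as a placeholder for hooking one U-Net layer at a chosen noise timestep:
```python
import torch
import torch.nn.functional as F

def dift_match(feat_fn, img_a, img_b, point_a, t=261):
    """Match a point across images via intermediate diffusion features (DIFT-style sketch).

    feat_fn(img, t) -> (D, H, W) feature map; point_a = (row, col) in feature coordinates.
    """
    fa, fb = feat_fn(img_a, t), feat_fn(img_b, t)
    q = fa[:, point_a[0], point_a[1]]                       # query descriptor
    sim = F.cosine_similarity(q[:, None, None], fb, dim=0)  # (H, W) similarity map
    idx = sim.flatten().argmax()
    return divmod(idx.item(), sim.size(1))                  # best (row, col) in image B
```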
Your Diffusion Model is Secretly a Zero-Shot Classifier
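the idea in a sketch: score each class by how well the text-conditioned model predicts the injected noise (forward process simplified, no alpha schedule; all names hypothetical):
```python
import torch

def diffusion_classify(eps_model, x, class_embs, n_trials=4):
    """Pick the class whose conditioning gives the lowest denoising error (sketch).

    eps_model(noisy_x, t, cond) -> predicted noise; class_embs: one text embedding per class.
    """
    errors = []
    for cond in class_embs:
        err = 0.0
        for _ in range(n_trials):                 # average over random timesteps/noise
            t = torch.randint(0, 1000, (1,))
            noise = torch.randn_like(x)
            noisy = x + noise                     # simplified forward process
            err += (eps_model(noisy, t, cond) - noise).pow(2).mean()
        errors.append(err / n_trials)
    return int(torch.stack(errors).argmin())
```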
CLaMP CLIP for music
CLAP (Contrastive Language-Audio Pretraining)
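the core objective is a CLIP-style symmetric contrastive loss over paired audio/text embeddings:
```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over B matching audio-text pairs (sketch)."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                   # (B, B) pairwise similarities
    labels = torch.arange(len(a), device=a.device)   # matching pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```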
FLAP Fast Language-Audio Pre-training
learns to reconstruct the masked portion of audio tokens
Pengi An Audio Language Model for Audio Tasks
audio understanding
MusicAgent An AI Agent for Music Understanding and Generation with Large Language Models
decompose user requests into multiple sub-tasks and invoke corresponding music tools
Whisper-AT Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
Whisper's audio representation is actually not noise-invariant
audio tagging model on top, <1% extra computation, a single forward pass
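the cheap-tagging idea can be sketched as a small head over the encoder states the ASR pass already computes (shapes hypothetical; the real model uses a time/layer transformer):
```python
import torch
import torch.nn as nn

class AudioTagHead(nn.Module):
    """Tag audio events from intermediate Whisper encoder states (sketch)."""
    def __init__(self, n_layers, dim, n_events=527):       # 527 = AudioSet classes
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # learned mix over layers
        self.head = nn.Linear(dim, n_events)

    def forward(self, layer_states):                       # (n_layers, B, T, dim)
        w = self.layer_weights.softmax(0).view(-1, 1, 1, 1)
        pooled = (layer_states * w).sum(0).mean(1)         # mix layers, average over time
        return self.head(pooled)                           # (B, n_events) tag logits
```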
Distil-Whisper a distilled Whisper: 6x faster, 50% smaller
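usage sketch via the standard Hugging Face ASR pipeline (model id assumed to be the published distil-large-v2 checkpoint):
```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v2")
print(asr("speech.wav")["text"])
```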
Whisper Large-v3
word-level timestamps w/ Whisper
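with the openai-whisper package this is a transcribe flag:
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("speech.wav", word_timestamps=True)
for seg in result["segments"]:
    for w in seg["words"]:                          # each word carries start/end times
        print(f'{w["start"]:6.2f}-{w["end"]:6.2f} {w["word"]}')
```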
fast whisper now with speaker diarisation
SeamlessM4T Speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
Inverted Whisper = WhisperSpeech (everything open-sourced) ==best==