:PROPERTIES:
:ID: 39d30d24-c374-4d0c-8037-b03ecbf983fa
:ROAM_ALIASES: VITS
:END:
#+title: computer_vision
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup
- [[id:cdb139f5-cc9e-4dc8-99b3-9aa39ece63ad][UNDERSTANDING]]
- [[https://github.com/facebookresearch/vissl][VISSL]]: computer VIsion library for Self-Supervised Learning
- [[https://twitter.com/_akhaliq/status/1687275105464852480][OpenFlamingo]]: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- [[https://github.com/pkunlp-icler/FastV][FastV]]: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
  - plug-and-play inference acceleration method that exploits the redundancy of visual tokens
* CUSTOMIZATION
- [[https://arxiv.org/abs/2403.05231][Tracking]] Meets LoRA: Faster Training, Larger Model, Stronger Performance
- [[https://twitter.com/_akhaliq/status/1771034847516950560][MyVLM]]: Personalizing VLMs for User-Specific Queries
  - personalization of VLMs, enabling them to learn and reason over user-provided concepts
* MAP AS OUTPUT
- [[https://twitter.com/_akhaliq/status/1723912622565662944][PolyMaX]]: General Dense Prediction with Mask Transformer
  - cluster prediction instead of per-pixel prediction
  - segmentation, depth and normals from a single image
- [[https://baegwangbin.github.io/DSINE/][DSINE]]: Rethinking Inductive Biases for Surface Normal Estimation (single image)
  - per-pixel ray direction as an additional input to the network
* LEARNING FROM VIDEO
- [[https://twitter.com/_akhaliq/status/1758178700741255582][V-JEPA]]: teaching a model to understand the physical world by watching videos
  - learns to predict masked parts of the video (inpainting)
- [[https://arxiv.org/abs/2402.08268][World]] Model on Million-Length Video And Language With RingAttention
  - gradually increase context size from 4K to 1M tokens
* DOCUMENTS
- [[https://twitter.com/_akhaliq/status/1742369195034099731][DocLLM]]: A layout-aware generative language model for multimodal document understanding (JPMorgan)
  - takes into account both textual semantics and spatial layout
  - learns to infill text segments
* TOKENIZER
- [[https://github.com/Vision-CAIR/MiniGPT-4][MiniGPT-4]]: with an image detector and image tokenizer
- LLaVA: https://llava-vl.github.io/ https://arxiv.org/abs/2304.08485
  - visual features projected into the LLM token space (a minimal sketch follows at the end of this section)
- [[https://twitter.com/_akhaliq/status/1676813163080175616][MSViT]]: Dynamic Mixed-Scale Tokenization for Vision Transformers
  - dynamic tokenizer for ViTs, where the scale at which an image is processed varies based on semantic detail
- [[https://twitter.com/_akhaliq/status/1706179080238854399][DualToken-ViT]]: Position-aware Efficient Vision Transformer with Dual Token Fusion
  - fuses local information from a convolutional (CNN) branch with global information from a self-attention (ViT) branch
- [[id:40792f03-5726-453b-af13-ba0667592497][DIFFUSION AS ENCODER]]
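A minimal sketch of the common recipe behind LLaVA-style image tokenizers: patchify the image into visual tokens, then project them into the LLM's embedding space so they can be concatenated with the text token embeddings. The dimensions, the ViT-style Conv2d patchifier, and the two-layer MLP connector are illustrative assumptions, not the exact published configuration.

#+begin_src python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """ViT-style image tokenizer: non-overlapping patches -> token embeddings."""
    def __init__(self, patch=14, dim=1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                 # (B, 3, H, W)
        x = self.proj(images)                  # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

class VisionToLLMProjector(nn.Module):
    """Connector that maps visual tokens into the LLM embedding space,
    so they can be prepended to the text token embeddings."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):          # (B, num_patches, vision_dim)
        return self.mlp(visual_tokens)         # (B, num_patches, llm_dim)

# toy usage: a 224x224 image becomes 256 "visual tokens" in the LLM space
images = torch.randn(1, 3, 224, 224)
visual_tokens = PatchTokenizer()(images)             # (1, 256, 1024)
llm_ready = VisionToLLMProjector()(visual_tokens)    # concatenate with text embeddings
#+end_src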
* QUERYING MODELS - MULTIMODAL
:PROPERTIES:
:ID: adc6ba5b-a1de-40ed-a65e-993c14d1fee8
:END:
- models: ==Llava, Qwen-VL==
- [[https://github.com/salesforce/LAVIS/tree/main/projects/instructblip][InstructBLIP]]: [[http://arxiv.org/abs/2305.06500][Towards General-purpose]] Vision-Language Models with Instruction Tuning
  - image understanding
- [[https://twitter.com/_akhaliq/status/1674237851536334849][Towards Language]] Models That Can See: Computer Vision Through the LENS of Natural Language
  - reasoning over independent vision modules
- [[https://twitter.com/_akhaliq/status/1691165256696107010][NeVA]]: NeMo Vision and Language Assistant
  - informative responses (wiki-like answers)
- [[https://huggingface.co/adept/fuyu-8b][Fuyu-8B]] [[https://twitter.com/AdeptAILabs/status/1714682075763405257][twitter]]
  - has no image encoder; interleaves text and images at arbitrary image resolutions
  - understanding diagrams, charts, and graphs ==best==
  - answering UI-based questions, bounding boxes, OCR
- [[https://twitter.com/_akhaliq/status/1722103131528397286][OtterHD]]: A High-Resolution Multi-modality Model
  - interprets high-resolution visual inputs with granular precision
- [[https://twitter.com/_akhaliq/status/1721758951396524259][CogVLM]]: Visual Expert for Pretrained Language Models
  - frozen LLM and image encoder connected with a trainable visual expert module
- [[https://arxiv.org/abs/2401.15947][MoE-LLaVA]]: Mixture-of-Experts for Large Vision-Language Models
  - sparse model with an outrageous number of parameters but a constant computational cost (a minimal top-k routing sketch follows this list)
- [[https://github.com/InternLM/InternLM-XComposer][InternLM-XComposer]]: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- [[https://twitter.com/_akhaliq/status/1755804127102214605][InstaGen]]: Enhancing Object Detection by Training on Synthetic Dataset
  - training on a synthetic dataset generated from diffusion models
  - self-training scheme on (novel) categories not covered by the detector
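A minimal sketch of the sparse mixture-of-experts idea behind models like MoE-LLaVA: a router picks the top-k experts per token, so the parameter count grows with the number of experts while per-token compute stays roughly constant. The expert sizes, the softmax-over-top-k weighting, and the plain feed-forward experts are illustrative assumptions, not the paper's exact design.

#+begin_src python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Token-level MoE layer: each token is processed by only k of the experts."""
    def __init__(self, dim=512, hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.num_experts, self.k = num_experts, k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (batch, seq, dim)
        b, s, d = x.shape
        tokens = x.reshape(-1, d)                      # (N, dim) with N = batch * seq
        gate_logits = self.router(tokens)              # (N, num_experts)
        top_logits, top_idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            hit_tokens, slot = (top_idx == e).nonzero(as_tuple=True)
            if hit_tokens.numel() == 0:
                continue                               # this expert is unused for the batch
            out[hit_tokens] += weights[hit_tokens, slot].unsqueeze(-1) * expert(tokens[hit_tokens])
        return out.reshape(b, s, d)

# toy usage: only 2 of the 8 expert MLPs run per token
layer = SparseMoE()
y = layer(torch.randn(2, 16, 512))                     # (2, 16, 512)
#+end_src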
** 2D+ VISION
*** 3D VISION
- [[https://twitter.com/yangcao_/status/1715034880785199492][CoDA for]] [[https://github.com/yangcaoai/CoDA_NeurIPS2023][3D]] object detection: discover and classify novel objects, without a 2D model
- [[https://twitter.com/_akhaliq/status/1770644575582818415][SceneScript]]: Reconstructing Scenes With An Autoregressive Structured Language Model
  - gets the layout (boxes) from a 3D view of the scene (dynamic, in-game)
*** VIDEO VISION
- [[https://twitter.com/_akhaliq/status/1727760039799070785][Video-LLaVA]]: Learning United Visual Representation by Alignment Before Projection
  - existing approaches encode images and videos into separate feature spaces
  - mixed dataset of images and videos
- [[https://twitter.com/fffiloni/status/1766142252387008738][Soft Video]] Understanding
  - audio is crucial for the overall understanding and helps the LLM generate a summary
* CLASSIFICATION
- [[https://twitter.com/_akhaliq/status/1668828834181836800][GeneCIS]]: A Benchmark for General Conditional Image Similarity
  - models should adapt to the notion of similarity dynamically
- [[https://twitter.com/_akhaliq/status/1665736170100097024][Vocabulary-free]] Image Classification
** CAPTIONING :CLIPREGION:
:PROPERTIES:
:ID: aeca80bb-38f3-4343-a214-67e3b4df245e
:END:
- [[id:daae8285-8325-4096-b421-61bb9df79d4a][SPEECH RECOGNITION]] [[id:bdd9160a-2438-4af0-a6f9-618b87096727][STORYTELLING CAPTIONING]]
- [[https://twitter.com/_akhaliq/status/1679308968521261056][SITTA]]: A Semantic Image-Text Alignment for Image Captioning
  - linear semantic mappings enable image captioning without access to gradient information; less computation
- [[https://twitter.com/_akhaliq/status/1686201499557224448][Guiding]] Image Captioning Models Toward More Specific Captions
- [[https://arxiv.org/abs/2402.05374][CIC]]: A framework for Culturally-aware Image Captioning
  - extracts cultural visual elements via Visual Question Answering (VQA)
*** CAPTIONING VIDEO
- [[https://twitter.com/_akhaliq/status/1760148818207641796][Video]] ReCap: Recursive Captioning of Hour-Long Videos
  - video has a hierarchical structure spanning different temporal granularities
  - exploits the synergy between different video hierarchies
- [[https://arxiv.org/abs/2404.01282][LoSA]]: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
  - classifies action snippets in an untrimmed video
  - memory- and parameter-efficient backbone
*** REGIONS
- [[https://twitter.com/_akhaliq/status/1731875294951190677][Segment]] and Caption Anything
  - generates regional captions
- [[https://arxiv.org/abs/2403.02330][RegionGPT]]: Towards Region Understanding Vision Language Model
  - region-level captions, description, reasoning, object classification, and referring-expression comprehension
** IMAGE CLUSTERING
:PROPERTIES:
:ID: 6076a4ad-cfaf-479d-b85d-eebccd3dbc26
:END:
- [[id:a1ecf144-3fba-4eb6-8ef9-d51200b32846][ATLAS]]
- [[https://arxiv.org/abs/2402.02996][Text-Guided]] Image Clustering
  - VQA-obtained text representations often outperform image features
*** DIFFUSION FEATURES
:PROPERTIES:
:ID: a1ad9f95-47be-450d-be59-72ae98205845
:END:
- [[https://twitter.com/_akhaliq/status/1666262910081875970][DIffusion FeaTures (DIFT)]]: Emergent Correspondence from Image Diffusion
- [[https://diffusion-classifier.github.io/][Your Diffusion]] [[https://arxiv.org/abs/2303.16203][Model is]] Secretly a Zero-Shot Classifier (a minimal sketch follows this subsection)
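A minimal sketch of the "diffusion model as zero-shot classifier" idea referenced above: condition the denoiser on one text prompt per candidate label, measure how well it predicts the noise added to the image latent, and pick the label with the lowest error. The checkpoint id, the prompt template, and the small fixed set of shared (noise, timestep) draws are illustrative assumptions (the paper averages over many more samples), and the calls assume a recent diffusers version.

#+begin_src python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# any Stable Diffusion 1.x checkpoint should work here
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

@torch.no_grad()
def encode_image(path):
    img = Image.open(path).convert("RGB").resize((512, 512))
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0      # scale to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0).to(device)
    return pipe.vae.encode(x).latent_dist.mean * pipe.vae.config.scaling_factor

@torch.no_grad()
def encode_prompt(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    return pipe.text_encoder(ids)[0]

@torch.no_grad()
def classify(image_path, labels, n_trials=4, seed=0):
    latents = encode_image(image_path)
    g = torch.Generator(device=device).manual_seed(seed)
    # share the same noise/timestep draws across labels so the scores are comparable
    noises = [torch.randn(latents.shape, generator=g, device=device) for _ in range(n_trials)]
    timesteps = torch.randint(100, 900, (n_trials,), generator=g, device=device)
    scores = {}
    for label in labels:
        cond = encode_prompt(f"a photo of a {label}")
        err = 0.0
        for noise, t in zip(noises, timesteps):
            noisy = pipe.scheduler.add_noise(latents, noise, t.unsqueeze(0))
            pred = pipe.unet(noisy, t, encoder_hidden_states=cond).sample
            err += F.mse_loss(pred, noise).item()                  # denoising error under this label
        scores[label] = err / n_trials
    return min(scores, key=scores.get), scores

# label, scores = classify("photo.jpg", ["dog", "cat", "car"])
#+end_src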
* AUDIO VISION
:PROPERTIES:
:ID: f03ccf94-1aa5-4705-89af-617a22570e26
:END:
- [[https://github.com/microsoft/muzic/tree/main/clamp][CLaMP]]: CLIP for music
- [[https://huggingface.co/docs/transformers/model_doc/clap][CLAP]] ([[https://twitter.com/yoachlacombe/status/1719720650464805259][Contrastive]] Language-Audio Pretraining)
- [[https://twitter.com/_akhaliq/status/1721401053642449340][FLAP]]: Fast Language-Audio Pre-training
  - learns to reconstruct the masked portion of audio tokens
- [[https://arxiv.org/pdf/2305.11834.pdf][Pengi]]: An Audio Language Model for Audio Tasks
  - audio understanding
- [[https://twitter.com/_akhaliq/status/1714877890725110022][MusicAgent]]: An AI Agent for Music Understanding and Generation with Large Language Models
  - decomposes user requests into multiple sub-tasks and invokes the corresponding music tools
** WHISPER
:PROPERTIES:
:ID: e54caacc-519a-4187-bafc-4d32c33f1e2b
:END:
- [[https://github.com/Vaibhavs10/translate-with-whisper][whisper]] [[https://twitter.com/reach_vb/status/1673363113888948224][translator]], fast
- [[https://twitter.com/_akhaliq/status/1677150590516834305][Whisper-AT]]: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
  - the audio representation is actually not noise-invariant
  - audio tagging model on top, <1% extra computation, a single forward pass
- [[https://github.com/huggingface/distil-whisper][Distil-Whisper]]: distilled Whisper, 6x faster, 50% smaller
- [[https://twitter.com/reach_vb/status/1724912958826770711][Whisper]] Large-v3
- [[https://twitter.com/xenovacom/status/1678180605836533762][word-level]] timestamps w/ Whisper (see the sketch at the end of this section)
- [[https://twitter.com/reach_vb/status/1729251580371689821][fast whisper]] [[https://github.com/Vaibhavs10/insanely-fast-whisper][now]] with speaker diarisation
- [[https://twitter.com/camenduru/status/1731130645022196091][SeamlessM4T]]: speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
- [[https://twitter.com/1littlecoder/status/1747976274537013606][Inverted]] [[https://github.com/collabora/WhisperSpeech][Whisper]] = WhisperSpeech (everything open-sourced) ==best==
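A minimal sketch of chunked transcription with word-level timestamps via the transformers ASR pipeline, tying together the Whisper Large-v3, Distil-Whisper, and word-level-timestamp notes above. The model ids and exact flags are assumptions that can shift between library versions; swapping in a Distil-Whisper checkpoint (e.g. distil-whisper/distil-large-v3) is how the "6x faster" variant is typically used.

#+begin_src python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",        # or e.g. "distil-whisper/distil-large-v3"
    chunk_length_s=30,                      # long audio is processed in 30 s chunks
    device=0 if torch.cuda.is_available() else -1,
)

result = asr(
    "speech.wav",
    return_timestamps="word",               # per-word (start, end) times
    generate_kwargs={"task": "transcribe"}, # use "translate" for X -> English
)

print(result["text"])
for word in result["chunks"]:
    print(word["timestamp"], word["text"])
#+end_src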