:PROPERTIES:
:ID: 39d30d24-c374-4d0c-8037-b03ecbf983fa
:ROAM_ALIASES: VITS
:END:
#+title: computer_vision
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup
- [[id:cdb139f5-cc9e-4dc8-99b3-9aa39ece63ad][UNDERSTANDING]]
- [[https://github.com/facebookresearch/vissl][VISSL]]: computer VIsion library for Self-Supervised Learning
- [[https://twitter.com/_akhaliq/status/1687275105464852480][OpenFlamingo]]: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- [[https://github.com/pkunlp-icler/FastV][FastV]]: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
  - plug-and-play inference acceleration method that exploits the redundancy of visual tokens
* CUSTOMIZATION
- [[https://arxiv.org/abs/2403.05231][Tracking]] Meets LoRA: Faster Training, Larger Model, Stronger Performance
- [[https://twitter.com/_akhaliq/status/1771034847516950560][MyVLM]]: Personalizing VLMs for User-Specific Queries
  - personalization of VLMs, enabling them to learn and reason over user-provided concepts
* MAP AS OUTPUT
- [[https://twitter.com/_akhaliq/status/1723912622565662944][PolyMaX]]: General Dense Prediction with Mask Transformer
  - cluster prediction instead of per-pixel prediction
  - segmentation, depth and normals from a single image
- [[https://baegwangbin.github.io/DSINE/][DSINE]]: Rethinking Inductive Biases for Surface Normal Estimation (single image)
  - per-pixel ray direction as an additional input to the network
* LEARNING FROM VIDEO
- [[https://twitter.com/_akhaliq/status/1758178700741255582][V-JEPA]]: teaching a model to understand the physical world by watching videos
  - learns to predict masked parts of the video (inpainting)
- [[https://arxiv.org/abs/2402.08268][World]] Model on Million-Length Video And Language With RingAttention
  - gradually increase context size from 4K to 1M tokens
* DOCUMENTS
- [[https://twitter.com/_akhaliq/status/1742369195034099731][DocLLM]]: A layout-aware generative language model for multimodal document understanding (JPMorgan)
  - takes into account both textual semantics and spatial layout
  - learns to infill text segments
* TOKENIZER
- [[https://github.com/Vision-CAIR/MiniGPT-4][MiniGPT-4]]: with an image detector and image tokenizer
- LLaVA: https://llava-vl.github.io/ https://arxiv.org/abs/2304.08485
  - visual features projected into the LLM token space (a minimal sketch follows at the end of this section)
- [[https://twitter.com/_akhaliq/status/1676813163080175616][MSViT]]: Dynamic Mixed-Scale Tokenization for Vision Transformers
  - dynamic tokenizer for ViTs, where the scale at which an image is processed varies based on semantic detail
- [[https://twitter.com/_akhaliq/status/1706179080238854399][DualToken-ViT]]: Position-aware Efficient Vision Transformer with Dual Token Fusion
  - fuses local information from a convolutional (CNN) branch with global information from a self-attention (ViT) branch
- [[id:40792f03-5726-453b-af13-ba0667592497][DIFFUSION AS ENCODER]]
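A minimal sketch of the common recipe behind LLaVA-style image tokenizers: patchify the image into visual tokens, then project them into the LLM's embedding space so they can be concatenated with the text token embeddings. The dimensions, the ViT-style Conv2d patchifier, and the two-layer MLP connector are illustrative assumptions, not the exact published configuration.

#+begin_src python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """ViT-style image tokenizer: non-overlapping patches -> token embeddings."""
    def __init__(self, patch=14, dim=1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                 # (B, 3, H, W)
        x = self.proj(images)                  # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

class VisionToLLMProjector(nn.Module):
    """Connector that maps visual tokens into the LLM embedding space,
    so they can be prepended to the text token embeddings."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):          # (B, num_patches, vision_dim)
        return self.mlp(visual_tokens)         # (B, num_patches, llm_dim)

# toy usage: a 224x224 image becomes 256 "visual tokens" in the LLM space
images = torch.randn(1, 3, 224, 224)
visual_tokens = PatchTokenizer()(images)             # (1, 256, 1024)
llm_ready = VisionToLLMProjector()(visual_tokens)    # concatenate with text embeddings
#+end_src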
* QUERYING MODELS - MULTIMODAL
:PROPERTIES:
:ID: adc6ba5b-a1de-40ed-a65e-993c14d1fee8
:END:
- models: ==Llava, Qwen-VL==
- [[https://github.com/salesforce/LAVIS/tree/main/projects/instructblip][InstructBLIP]]: [[http://arxiv.org/abs/2305.06500][Towards General-purpose]] Vision-Language Models with Instruction Tuning
  - image understanding
- [[https://twitter.com/_akhaliq/status/1674237851536334849][Towards Language]] Models That Can See: Computer Vision Through the LENS of Natural Language
  - reasoning over independent vision modules
- [[https://twitter.com/_akhaliq/status/1691165256696107010][NeVA]]: NeMo Vision and Language Assistant
  - informative responses (wiki-like answers)
- [[https://huggingface.co/adept/fuyu-8b][Fuyu-8B]] [[https://twitter.com/AdeptAILabs/status/1714682075763405257][twitter]]
  - has no image encoder; interleaves text and images at arbitrary image resolutions
  - understanding diagrams, charts, and graphs ==best==
  - answering UI-based questions, bounding boxes, OCR
- [[https://twitter.com/_akhaliq/status/1722103131528397286][OtterHD]]: A High-Resolution Multi-modality Model
  - interprets high-resolution visual inputs with granular precision
- [[https://twitter.com/_akhaliq/status/1721758951396524259][CogVLM]]: Visual Expert for Pretrained Language Models
  - frozen LLM and image encoder connected with a trainable visual expert module
- [[https://arxiv.org/abs/2401.15947][MoE-LLaVA]]: Mixture-of-Experts for Large Vision-Language Models
  - sparse model with an outrageous number of parameters but a constant computational cost (a minimal top-k routing sketch follows this list)
- [[https://github.com/InternLM/InternLM-XComposer][InternLM-XComposer]]: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- [[https://twitter.com/_akhaliq/status/1755804127102214605][InstaGen]]: Enhancing Object Detection by Training on Synthetic Dataset
  - training on a synthetic dataset generated from diffusion models
  - self-training scheme on (novel) categories not covered by the detector
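A minimal sketch of the sparse mixture-of-experts idea behind models like MoE-LLaVA: a router picks the top-k experts per token, so the parameter count grows with the number of experts while per-token compute stays roughly constant. The expert sizes, the softmax-over-top-k weighting, and the plain feed-forward experts are illustrative assumptions, not the paper's exact design.

#+begin_src python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Token-level MoE layer: each token is processed by only k of the experts."""
    def __init__(self, dim=512, hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.num_experts, self.k = num_experts, k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (batch, seq, dim)
        b, s, d = x.shape
        tokens = x.reshape(-1, d)                      # (N, dim) with N = batch * seq
        gate_logits = self.router(tokens)              # (N, num_experts)
        top_logits, top_idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            hit_tokens, slot = (top_idx == e).nonzero(as_tuple=True)
            if hit_tokens.numel() == 0:
                continue                               # this expert is unused for the batch
            out[hit_tokens] += weights[hit_tokens, slot].unsqueeze(-1) * expert(tokens[hit_tokens])
        return out.reshape(b, s, d)

# toy usage: only 2 of the 8 expert MLPs run per token
layer = SparseMoE()
y = layer(torch.randn(2, 16, 512))                     # (2, 16, 512)
#+end_src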
** 2D+ VISION
*** 3D VISION
- [[https://twitter.com/yangcao_/status/1715034880785199492][CoDA for]] [[https://github.com/yangcaoai/CoDA_NeurIPS2023][3D]] object detection: discover and classify novel objects, without a 2D model
- [[https://twitter.com/_akhaliq/status/1770644575582818415][SceneScript]]: Reconstructing Scenes With An Autoregressive Structured Language Model
  - gets the layout (boxes) from a 3D view of the scene (dynamic, in-game)
*** VIDEO VISION
- [[https://twitter.com/_akhaliq/status/1727760039799070785][Video-LLaVA]]: Learning United Visual Representation by Alignment Before Projection
  - existing approaches encode images and videos into separate feature spaces
  - mixed dataset of images and videos
- [[https://twitter.com/fffiloni/status/1766142252387008738][Soft Video]] Understanding
  - audio is crucial for the overall understanding and helps the LLM generate a summary
* CLASSIFICATION
- [[https://twitter.com/_akhaliq/status/1668828834181836800][GeneCIS]]: A Benchmark for General Conditional Image Similarity
  - models should adapt to the notion of similarity dynamically
- [[https://twitter.com/_akhaliq/status/1665736170100097024][Vocabulary-free]] Image Classification
** CAPTIONING :CLIPREGION:
:PROPERTIES:
:ID: aeca80bb-38f3-4343-a214-67e3b4df245e
:END:
- [[id:daae8285-8325-4096-b421-61bb9df79d4a][SPEECH RECOGNITION]] [[id:bdd9160a-2438-4af0-a6f9-618b87096727][STORYTELLING CAPTIONING]]
- [[https://twitter.com/_akhaliq/status/1679308968521261056][SITTA]]: A Semantic Image-Text Alignment for Image Captioning
  - linear semantic mappings enable image captioning without access to gradient information; less computation
- [[https://twitter.com/_akhaliq/status/1686201499557224448][Guiding]] Image Captioning Models Toward More Specific Captions
- [[https://arxiv.org/abs/2402.05374][CIC]]: A framework for Culturally-aware Image Captioning
  - extracts cultural visual elements via Visual Question Answering (VQA)
*** CAPTIONING VIDEO
- [[https://twitter.com/_akhaliq/status/1760148818207641796][Video]] ReCap: Recursive Captioning of Hour-Long Videos
  - video has a hierarchical structure spanning different temporal granularities
  - exploits the synergy between different video hierarchies
- [[https://arxiv.org/abs/2404.01282][LoSA]]: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
  - classifies action snippets in an untrimmed video
  - memory- and parameter-efficient backbone
*** REGIONS
- [[https://twitter.com/_akhaliq/status/1731875294951190677][Segment]] and Caption Anything
  - generates regional captions
- [[https://arxiv.org/abs/2403.02330][RegionGPT]]: Towards Region Understanding Vision Language Model
  - region-level captions, description, reasoning, object classification, and referring-expression comprehension
** IMAGE CLUSTERING
:PROPERTIES:
:ID: 6076a4ad-cfaf-479d-b85d-eebccd3dbc26
:END:
- [[id:a1ecf144-3fba-4eb6-8ef9-d51200b32846][ATLAS]]
- [[https://arxiv.org/abs/2402.02996][Text-Guided]] Image Clustering
  - VQA-obtained text representations often outperform image features
*** DIFFUSION FEATURES
:PROPERTIES:
:ID: a1ad9f95-47be-450d-be59-72ae98205845
:END:
- [[https://twitter.com/_akhaliq/status/1666262910081875970][DIffusion FeaTures (DIFT)]]: Emergent Correspondence from Image Diffusion
- [[https://diffusion-classifier.github.io/][Your Diffusion]] [[https://arxiv.org/abs/2303.16203][Model is]] Secretly a Zero-Shot Classifier (a minimal sketch follows this subsection)
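A minimal sketch of the "diffusion model as zero-shot classifier" idea referenced above: condition the denoiser on one text prompt per candidate label, measure how well it predicts the noise added to the image latent, and pick the label with the lowest error. The checkpoint id, the prompt template, and the small fixed set of shared (noise, timestep) draws are illustrative assumptions (the paper averages over many more samples), and the calls assume a recent diffusers version.

#+begin_src python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# any Stable Diffusion 1.x checkpoint should work here
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

@torch.no_grad()
def encode_image(path):
    img = Image.open(path).convert("RGB").resize((512, 512))
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0      # scale to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0).to(device)
    return pipe.vae.encode(x).latent_dist.mean * pipe.vae.config.scaling_factor

@torch.no_grad()
def encode_prompt(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    return pipe.text_encoder(ids)[0]

@torch.no_grad()
def classify(image_path, labels, n_trials=4, seed=0):
    latents = encode_image(image_path)
    g = torch.Generator(device=device).manual_seed(seed)
    # share the same noise/timestep draws across labels so the scores are comparable
    noises = [torch.randn(latents.shape, generator=g, device=device) for _ in range(n_trials)]
    timesteps = torch.randint(100, 900, (n_trials,), generator=g, device=device)
    scores = {}
    for label in labels:
        cond = encode_prompt(f"a photo of a {label}")
        err = 0.0
        for noise, t in zip(noises, timesteps):
            noisy = pipe.scheduler.add_noise(latents, noise, t.unsqueeze(0))
            pred = pipe.unet(noisy, t, encoder_hidden_states=cond).sample
            err += F.mse_loss(pred, noise).item()                  # denoising error under this label
        scores[label] = err / n_trials
    return min(scores, key=scores.get), scores

# label, scores = classify("photo.jpg", ["dog", "cat", "car"])
#+end_src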
* AUDIO VISION
:PROPERTIES:
:ID: f03ccf94-1aa5-4705-89af-617a22570e26
:END:
- [[https://github.com/microsoft/muzic/tree/main/clamp][CLaMP]]: CLIP for music
- [[https://huggingface.co/docs/transformers/model_doc/clap][CLAP]] ([[https://twitter.com/yoachlacombe/status/1719720650464805259][Contrastive]] Language-Audio Pretraining)
- [[https://twitter.com/_akhaliq/status/1721401053642449340][FLAP]]: Fast Language-Audio Pre-training
  - learns to reconstruct the masked portion of audio tokens
- [[https://arxiv.org/pdf/2305.11834.pdf][Pengi]]: An Audio Language Model for Audio Tasks
  - audio understanding
- [[https://twitter.com/_akhaliq/status/1714877890725110022][MusicAgent]]: An AI Agent for Music Understanding and Generation with Large Language Models
  - decomposes user requests into multiple sub-tasks and invokes the corresponding music tools
** WHISPER
:PROPERTIES:
:ID: e54caacc-519a-4187-bafc-4d32c33f1e2b
:END:
- [[https://github.com/Vaibhavs10/translate-with-whisper][whisper]] [[https://twitter.com/reach_vb/status/1673363113888948224][translator]], fast
- [[https://twitter.com/_akhaliq/status/1677150590516834305][Whisper-AT]]: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
  - the audio representation is actually not noise-invariant
  - audio tagging model on top, <1% extra computation, a single forward pass
- [[https://github.com/huggingface/distil-whisper][Distil-Whisper]]: distilled Whisper, 6x faster, 50% smaller
- [[https://twitter.com/reach_vb/status/1724912958826770711][Whisper]] Large-v3
- [[https://twitter.com/xenovacom/status/1678180605836533762][word-level]] timestamps w/ Whisper (see the sketch at the end of this section)
- [[https://twitter.com/reach_vb/status/1729251580371689821][fast whisper]] [[https://github.com/Vaibhavs10/insanely-fast-whisper][now]] with speaker diarisation
- [[https://twitter.com/camenduru/status/1731130645022196091][SeamlessM4T]]: speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
- [[https://twitter.com/1littlecoder/status/1747976274537013606][Inverted]] [[https://github.com/collabora/WhisperSpeech][Whisper]] = WhisperSpeech (everything open-sourced) ==best==
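A minimal sketch of chunked transcription with word-level timestamps via the transformers ASR pipeline, tying together the Whisper Large-v3, Distil-Whisper, and word-level-timestamp notes above. The model ids and exact flags are assumptions that can shift between library versions; swapping in a Distil-Whisper checkpoint (e.g. distil-whisper/distil-large-v3) is how the "6x faster" variant is typically used.

#+begin_src python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",        # or e.g. "distil-whisper/distil-large-v3"
    chunk_length_s=30,                      # long audio is processed in 30 s chunks
    device=0 if torch.cuda.is_available() else -1,
)

result = asr(
    "speech.wav",
    return_timestamps="word",               # per-word (start, end) times
    generate_kwargs={"task": "transcribe"}, # use "translate" for X -> English
)

print(result["text"])
for word in result["chunks"]:
    print(word["timestamp"], word["text"])
#+end_src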