:PROPERTIES:
:ID: e06c9ae6-abb6-4f82-b951-44ee3a44a1cf
:END:
#+title: clip
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup
- parent: [[id:39d30d24-c374-4d0c-8037-b03ecbf983fa][computer_vision]]
- [[id:9bec56a3-a402-418d-bc67-40b3165089c3][CLIP AS REWARD]]
- simo's take on better CLIP papers: https://twitter.com/cloneofsimo/status/1666086583005769728
- [[https://twitter.com/_akhaliq/status/1655395363283431424][COLA]]: How to adapt vision-language models to Compose Objects Localized with Attributes?
  - attributes (adjectives) with their subjects properly identified
- [[https://twitter.com/_akhaliq/status/1646683395055919104][What does]] CLIP know about a red circle? Visual prompt engineering for VLMs
  - a drawn red circle directs the model's attention to that region while also maintaining global information
- [[https://twitter.com/_akhaliq/status/1739516350907736070][Parrot]] Captions Teach CLIP to Spot Text
  - urgent to redesign CLIP-like models so they do not simply learn to spot the caption text rendered inside images
- [[https://arxiv.org/abs/2401.06397][UMG-CLIP]]: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
  - image-level, region-level, and pixel-level captions/tags
- [[https://arxiv.org/abs/2401.09763][CLIP Model]] for Images to Textual Prompts Based on Top-k Neighbors
  - combines the CLIP model with a k-nearest-neighbors (KNN) algorithm
- [[https://arxiv.org/abs/2402.15120][ParaCLIP]]: Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
  - fine-tunes the text encoder on paraphrases while freezing the image encoder
* NOT TO WORDS
- [[https://arxiv.org/abs/2402.03251][CLIP]] Can Understand Depth
  - extends CLIP to non-human-language prompts
- [[https://arxiv.org/pdf/2404.01123.pdf][CLIPtone]]: Unsupervised Learning for Text-based Image Tone Adjustment
  - quick editing of image styles
* TRAIN CLIP
:PROPERTIES:
:ID: 7202d6c8-5d07-4afe-906f-c78d50353505
:END:
- [[https://github.com/AI4LIFE-GROUP/SpLiCE][SpLiCE]]: decomposes CLIP embeddings into sparse combinations of human-interpretable, semantic concepts (toy sketch at the end of this section)
  - can be used for concept bottleneck models and spurious-correlation detection
- [[https://arxiv.org/abs/2402.10099][Any-Shift]] Prompting for Generalization over Distributions
  - encodes the distribution information and their relationships
  - guides the generalization of the CLIP image-language model from training to any test distribution
  - faster testing
- CoN-CLIP: [[https://arxiv.org/abs/2403.20312][Learn "No"]] to Say "Yes" Better: Improving Vision-Language Models via Negations
  - highlights the limitations of popular VLMs such as CLIP at understanding the implications of negations
  - showcases emergent compositional understanding of objects, relations, and attributes in text
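
A toy sketch of the sparse-decomposition idea behind SpLiCE, assuming a precomputed vocabulary of concept text embeddings; the concept list, the image path, and the Lasso solver are illustrative stand-ins, not the repo's actual pipeline.

#+begin_src python
# Decompose a CLIP image embedding into a sparse, non-negative combination
# of concept text embeddings (SpLiCE-style illustration, not the official code).
import torch
from PIL import Image
from sklearn.linear_model import Lasso
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["dog", "grass", "frisbee", "beach", "snow"]  # hypothetical concept vocabulary

with torch.no_grad():
    text_in = processor(text=concepts, return_tensors="pt", padding=True)
    dictionary = model.get_text_features(**text_in)               # (n_concepts, dim)
    dictionary = torch.nn.functional.normalize(dictionary, dim=-1).numpy()

    image = Image.open("example.jpg")                             # any local image
    img_in = processor(images=image, return_tensors="pt")
    embed = model.get_image_features(**img_in)                    # (1, dim)
    embed = torch.nn.functional.normalize(embed, dim=-1).numpy()[0]

# Solve embed ~= dictionary.T @ w with sparse, non-negative weights w.
solver = Lasso(alpha=0.01, positive=True, fit_intercept=False, max_iter=5000)
solver.fit(dictionary.T, embed)
for concept, weight in zip(concepts, solver.coef_):
    if weight > 1e-4:
        print(f"{concept}: {weight:.3f}")
#+end_src
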
* 3D+ CLIP
** LIFT3D
:PROPERTIES:
:ID: 89276877-2243-411e-8943-bea0427264f3
:END:
- Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
  - the method trains to predict unseen views in the feature spaces generated by 2D vision models (e.g. DINO, CLIP)
  - then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open-vocabulary segmentation, and image colorization
* VIDEO CLIP
- [[https://arxiv.org/abs/2312.08010][EZ-CLIP]]: Efficient Zeroshot Video Action Recognition
  - no fundamental alterations to CLIP; guides visual prompts to focus on capturing motion
- [[https://twitter.com/SammieAtman/status/1738707513015738554][VideoCLIP]]
  - compute similarity with text and perform vector retrieval
* PRIOR ALTERNATIVES
- [[id:1c014bca-d8db-4d28-9c49-5297626d4484][SEECODERS]]
- better CLIP, nearest neighbor: https://arxiv.org/pdf/2110.05208.pdf
- nearest neighbor, contrastive: https://arxiv.org/abs/2111.07783
- image-and-language, pixels only, no strings: [[https://arxiv.org/abs/2212.08045][CLIPPO]]
  - [[https://arxiv.org/abs/2105.13626][ByT5]]: token-free, no tokenizer
  - character-aware models [[https://arxiv.org/pdf/2212.10562.pdf][can spell]], like ByT5
  - maybe hands-aware models?
- [[https://arxiv.org/pdf/2212.00653.pdf][Hyperbolic Contrastive]] [[https://github.com/shlokk/HCL/][Learning]] for Visual Representations beyond Objects
- [[https://twitter.com/_akhaliq/status/1668464076651937792][Retrieval-Enhanced]] Contrastive Vision-Text Models
  - trains a frozen CLIP to retrieve knowledge from an external memory
- [[https://twitter.com/_akhaliq/status/1688402448094838784][Convolutions]] Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
  - open-vocabulary classification ability plus a strong mask generator
- [[https://twitter.com/_akhaliq/status/1689119573155446784][ReCLIP]]: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
  - learns pseudo-labels, then refines them
  - source-free domain adaptation; mitigates misaligned embeddings
- [[https://twitter.com/_akhaliq/status/1717044670935580719][SAM-CLIP]]: Merging Vision Foundation Models towards Semantic and Spatial Understanding
  - CLIP and SAM (strong at localizing objects) merged into a single model
- [[https://aleafy.github.io/alpha-clip/][Alpha-CLIP]]: [[https://github.com/SunzeY/AlphaCLIP][A CLIP]] Model Focusing on Wherever You Want
  - auxiliary alpha channel to suggest attentive regions; gives control over the emphasis
** TEXT MANIPULATION
- [[https://huggingface.co/papers/2305.20088][Improving]] [[https://github.com/LijieFan/LaCLIP][CLIP]] Training with Language Rewrites
  - rewrites the text descriptions associated with each image using an LLM
- [[https://twitter.com/_akhaliq/status/1673518661926264832][Language]] models are weak learners
  - better-than-random performance; usable as a boosting component for other models
** CLIP YET BETTER
- [[https://twitter.com/NielsRogge/status/1717259646602236136][MetaCLIP]]: a fully open-source replication of CLIP (the contrastive objective it replicates is sketched at the end of this subsection)
- [[https://twitter.com/_akhaliq/status/1717377075818938625][TiC-CLIP]]: Continual Training of CLIP Models
  - continues training from the last checkpoint while replaying old data; reduces compute by ~2.5x vs training from scratch
- [[https://twitter.com/_akhaliq/status/1734036192817971630][ECLIPSE]]: A Resource-Efficient Text-to-Image Prior for Image Generations
  - performance on par with bigger models
  - distills CLIP knowledge into the prior model
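
MetaCLIP and TiC-CLIP (like the language-rewrite work above) keep CLIP's standard symmetric contrastive objective and mainly change the data or the training schedule. A minimal PyTorch sketch of that loss, with random tensors standing in for the encoder outputs:

#+begin_src python
# CLIP's symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, logit_scale):
    """image_emb, text_emb: (batch, dim); the i-th image pairs with the i-th text."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale.exp() * image_emb @ text_emb.t()    # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image -> text and text -> image.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Toy usage with random "embeddings" and a learnable temperature.
batch, dim = 8, 512
img = torch.randn(batch, dim)
txt = torch.randn(batch, dim)
logit_scale = torch.nn.Parameter(torch.tensor(2.659))        # ~ log(1/0.07), as in CLIP
print(clip_loss(img, txt, logit_scale))
#+end_src
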
* CHEAPNESS
- [[https://arxiv.org/abs/2304.06028][RECLIP]]: Resource-efficient CLIP by Training with Small Images
- [[https://twitter.com/_akhaliq/status/1673884289287725057][CLIPA-v2]]: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget
- [[https://twitter.com/_akhaliq/status/1707574203371618415][AutoCLIP]]: Auto-tuning Zero-Shot Classifiers for Vision-Language Models (fully unsupervised; the prompt-ensemble baseline it starts from is sketched at the end of this note)
* SCALENESS
- federated CLIP
  - FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning
  - https://arxiv.org/abs/2302.13485
- [[https://github.com/baaivision/EVA/tree/master/EVA-CLIP][EVA-CLIP]]: [[https://arxiv.org/abs/2303.15389][Improved]] Training Techniques for CLIP at Scale
* FASTNESS
- unum: trained in a day
  - https://github.com/unum-cloud/uform
  - https://www.unum.cloud/blog/2023-02-20-efficient-multimodality
- [[https://twitter.com/_akhaliq/status/1656908423278084096][An Inverse Scaling]] [[https://arxiv.org/abs/2305.07017][Law for]] [[https://github.com/UCSC-VLAA/CLIPA][CLIP Training]]: training CLIP cheaply in 2 days
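
For reference, the plain prompt-ensemble zero-shot classifier that work like AutoCLIP tunes and RECLIP trains more cheaply: embed each class name under several prompt templates, average and renormalize per class, then score images by cosine similarity. A sketch under assumed inputs; the class names, templates, and image path are illustrative placeholders.

#+begin_src python
# Minimal prompt-ensemble zero-shot classification with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "car"]
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

with torch.no_grad():
    # One averaged, renormalized text embedding per class.
    class_embs = []
    for name in classes:
        prompts = [t.format(name) for t in templates]
        tok = processor(text=prompts, return_tensors="pt", padding=True)
        emb = torch.nn.functional.normalize(model.get_text_features(**tok), dim=-1)
        class_embs.append(torch.nn.functional.normalize(emb.mean(dim=0), dim=0))
    class_embs = torch.stack(class_embs)                     # (n_classes, dim)

    image = Image.open("example.jpg")                        # any local image
    pix = processor(images=image, return_tensors="pt")
    img_emb = torch.nn.functional.normalize(model.get_image_features(**pix), dim=-1)

    probs = (100.0 * img_emb @ class_embs.t()).softmax(dim=-1)[0]
    for name, p in zip(classes, probs.tolist()):
        print(f"{name}: {p:.3f}")
#+end_src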