parent: computer_vision
simo take on better clip papers: https://twitter.com/cloneofsimo/status/1666086583005769728
COLA How to adapt vision-language models to Compose Objects Localized with Attributes?
attributes (adjectives) properly bound to their subjects
What does CLIP know about a red circle? Visual prompt engineering for VLMs
drawing a red circle around a region directs the model's attention to that region while also maintaining global information
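A minimal sketch of the red-circle visual prompt, assuming a HuggingFace CLIP checkpoint and a placeholder image/box; it draws the circle onto the pixels and scores region-specific text:

```python
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")  # placeholder image
box = (120, 80, 260, 220)  # (x0, y0, x1, y1) of the region to highlight (example values)

# Draw the visual prompt directly onto the pixels; the rest of the image is kept,
# so global context is preserved while attention is drawn to the circled region.
prompted = image.copy()
ImageDraw.Draw(prompted).ellipse(box, outline=(255, 0, 0), width=4)

texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=prompted, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```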
Parrot Captions Teach CLIP to Spot Text
urgent to redesign CLIP-like models (or their data curation) so they stop learning from captions that merely parrot text rendered in the image
UMG-CLIP A Unified Multi-Granularity Vision Generalist for Open-World Understanding
image-level, region-level, and pixel-level captions/tags
CLIP Model for Images to Textual Prompts Based on Top-k Neighbors
CLIP model with K-nearest neighbors (KNN) algorithm
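A rough sketch of the retrieval idea: embed a caption bank and the query image with CLIP, then take the top-k most similar captions by cosine similarity. The caption bank, image path, and model name are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption_bank = ["a dog running on the beach", "a plate of pasta", "a city skyline at night"]

with torch.no_grad():
    txt = processor(text=caption_bank, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**txt)
    img = processor(images=Image.open("query.jpg"), return_tensors="pt")
    img_emb = model.get_image_features(**img)

# Cosine similarity = dot product of L2-normalized embeddings.
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
scores = img_emb @ text_emb.T
topk = scores.topk(k=2, dim=-1)
print([caption_bank[i] for i in topk.indices[0].tolist()])
```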
ParaCLIP Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
fine-tunes the text encoder on paraphrases while freezing the image encoder
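A hedged sketch of fine-tuning only the text side of CLIP on paraphrased captions while the vision tower stays frozen; the pairing and loss below are a generic contrastive stand-in, not the paper's exact two-step recipe:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Freeze the vision tower; only text-side parameters receive gradients.
for p in model.vision_model.parameters():
    p.requires_grad_(False)
for p in model.visual_projection.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

def train_step(images, paraphrases):
    # images: list of PIL images; paraphrases: list of paraphrased captions
    inputs = processor(text=paraphrases, images=images, return_tensors="pt", padding=True)
    out = model(**inputs, return_loss=True)  # built-in symmetric contrastive loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```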
CLIP Can Understand Depth
extended to non-human language prompts
CLIPtone Unsupervised Learning for Text-based Image Tone Adjustment
quick text-driven editing of image tone/style
SpLiCE decomposes CLIP embeddings into sparse combinations of human-interpretable, semantic concepts
can be used for concept bottleneck models and spurious correlation detection
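A rough sketch of the decomposition idea: express a CLIP image embedding as a sparse non-negative combination of concept-word embeddings. The toy concept vocabulary and plain Lasso solver are simplifications, not SpLiCE's exact setup:

```python
import torch
from PIL import Image
from sklearn.linear_model import Lasso
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["dog", "grass", "frisbee", "beach", "car", "snow"]  # toy vocabulary
with torch.no_grad():
    txt = processor(text=concepts, return_tensors="pt", padding=True)
    dictionary = torch.nn.functional.normalize(model.get_text_features(**txt), dim=-1)
    img = processor(images=Image.open("photo.jpg"), return_tensors="pt")
    target = torch.nn.functional.normalize(model.get_image_features(**img), dim=-1)

# Solve target ≈ dictionary.T @ weights with an L1 penalty and non-negative weights.
lasso = Lasso(alpha=0.01, positive=True, fit_intercept=False, max_iter=10000)
lasso.fit(dictionary.numpy().T, target.numpy().ravel())
for name, w in zip(concepts, lasso.coef_):
    if w > 0:
        print(f"{name}: {w:.3f}")  # the sparse, human-readable concepts for this image
```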
Any-Shift Prompting for Generalization over Distributions
encodes distribution information and the relationships between training and test distributions
guides the generalization of the CLIP image-language model from training to any test distribution
faster testing
CoN-CLIP: Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
highlights limitations of popular VLMs such as CLIP at understanding the implications of negations
showcases emergent compositional understanding of objects, relations, and attributes in text
Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
method trains to predict unseen views in the feature spaces of 2D vision models (e.g., DINO, CLIP)
but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open-vocabulary segmentation, and image colorization
EZ-CLIP Efficient Zero-shot Video Action Recognition
no fundamental alterations to CLIP; guides visual prompts to focus on capturing motion
compute similarity with text and perform vector retrieval
better clip, nearest neighbor
https://arxiv.org/pdf/2110.05208.pdf
nearest neighbor, contrastive
CLIPPO: Image-and-Language understanding from pixels only, no strings
Hyperbolic Contrastive Learning for Visual Representations beyond Objects
Retrieval-Enhanced Contrastive Vision-Text Models
trains a frozen CLIP to use knowledge retrieved from an external memory
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
a single frozen convolutional CLIP keeps its open-vocabulary classification ability and also serves as a strong mask generator
ReCLIP Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
learns pseudo labels, then refines them
source-free domain adaptation, mitigates misaligned embeddings
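A simplified sketch of the pseudo-labeling step: run zero-shot CLIP on unlabeled target-domain images and keep only high-confidence predictions as pseudo labels for adaptation. The threshold, prompts, and file names are placeholders; the paper's refinement stage is not reproduced here:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["airplane", "bird", "car"]
prompts = [f"a photo of a {c}" for c in class_names]
paths = ["target_1.jpg", "target_2.jpg"]  # unlabeled target-domain images

pseudo_labels = {}
with torch.no_grad():
    for path in paths:
        inputs = processor(text=prompts, images=Image.open(path),
                           return_tensors="pt", padding=True)
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
        conf, idx = probs.max(dim=-1)
        if conf > 0.8:  # keep only confident zero-shot predictions as pseudo labels
            pseudo_labels[path] = class_names[idx.item()]
print(pseudo_labels)
```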
SAM-CLIP Merging Vision Foundation Models towards Semantic and Spatial Understanding
CLIP and SAM (strong at localizing objects) merged into a single model
Alpha-CLIP A CLIP Model Focusing on Wherever You Want
auxiliary alpha channel to suggest attentive regions, control over the emphasis
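A conceptual sketch of the alpha-channel input: stack a soft region mask as a fourth channel so the model knows where to focus. The 4-channel patch embedding below is a stand-in for how a ViT stem could accept the extra channel, not Alpha-CLIP's released code:

```python
import torch
import torch.nn as nn

rgb = torch.rand(1, 3, 224, 224)          # normalized image
alpha = torch.zeros(1, 1, 224, 224)       # attention hint: 1 inside the region of interest
alpha[:, :, 60:160, 80:200] = 1.0

x = torch.cat([rgb, alpha], dim=1)        # (1, 4, 224, 224) RGBA-style input

# A ViT-style patch embedding extended from 3 to 4 input channels; the extra channel
# can be zero-initialized so behaviour matches plain CLIP when the alpha map is flat.
patch_embed = nn.Conv2d(4, 768, kernel_size=16, stride=16)
tokens = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, 768) patch tokens
print(tokens.shape)
```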
Improving CLIP Training with Language Rewrites
rewrite the text descriptions associated with each image using an LLM
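A small sketch of the augmentation idea: each image keeps its original caption plus several LLM rewrites, and every training step samples one of them at random. The rewrites here are hard-coded placeholders for what an LLM would produce offline:

```python
import random

sample = {
    "image": "dog.jpg",
    "caption": "dog on grass",
    "rewrites": [
        "a happy dog lying on a green lawn",
        "a small dog resting in the grass on a sunny day",
    ],
}

def pick_caption(sample):
    # Treat the original caption and its rewrites as interchangeable text views.
    return random.choice([sample["caption"]] + sample["rewrites"])

print(pick_caption(sample))
```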
Language models are weak learners
better-than-random performance, boosting component for other models
MetaCLIP a fully open-source replication of CLIP
TiC-CLIP Continual Training of CLIP Models
continues training from the last checkpoint while replaying old data; reduces compute by ~2.5× vs. training from scratch
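A toy sketch of the replay idea: keep training from the last checkpoint on a mix of new data and a random subset of previously seen data instead of retraining from scratch. The mixing ratio is an illustrative choice, not the paper's schedule:

```python
import random

old_data = [f"old_{i}.jpg" for i in range(10_000)]  # previously seen samples
new_data = [f"new_{i}.jpg" for i in range(2_000)]   # newly arrived samples

replay_ratio = 0.5  # fraction of old data relative to the new batch (example value)
replay = random.sample(old_data, int(len(new_data) * replay_ratio))
training_pool = new_data + replay
random.shuffle(training_pool)
print(len(training_pool), "samples for this continual-training round")
```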
ECLIPSE A Resource-Efficient Text-to-Image Prior for Image Generations
performance on par with bigger models
distill clip knowledge into the prior model
RECLIP Resource-efficient CLIP by Training with Small Images
CLIPA-v2 Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget
AutoCLIP Auto-tuning Zero-Shot Classifiers for Vision-Language Models (fully unsupervised)
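A hedged simplification of per-image prompt weighting: instead of uniformly averaging prompt templates, weight each template by how well its class embeddings match this particular image, then classify with the weighted prototypes. Temperatures and shapes are illustrative, not the paper's exact procedure:

```python
import torch

def weighted_zero_shot(image_emb, template_class_embs, temp=100.0):
    """image_emb: (d,), template_class_embs: (T, C, d), all L2-normalized."""
    # Per-template evidence: how strongly any class under this template matches the image.
    sims = template_class_embs @ image_emb                 # (T, C)
    template_scores = torch.logsumexp(temp * sims, dim=1)  # (T,)
    weights = torch.softmax(template_scores, dim=0)        # per-image template weights
    # Weighted class prototypes, then the usual cosine-similarity classification.
    prototypes = (weights[:, None, None] * template_class_embs).sum(dim=0)  # (C, d)
    prototypes = torch.nn.functional.normalize(prototypes, dim=-1)
    return prototypes @ image_emb                          # (C,) class scores

# Toy shapes: 3 templates, 5 classes, 512-dim embeddings.
img = torch.nn.functional.normalize(torch.randn(512), dim=0)
embs = torch.nn.functional.normalize(torch.randn(3, 5, 512), dim=-1)
print(weighted_zero_shot(img, embs).argmax())
```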
federated clip
FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning
unum: trained in a day
https://www.unum.cloud/blog/2023-02-20-efficient-multimodality
An Inverse Scaling Law for CLIP Training: training CLIP cheaply in 2 days