:PROPERTIES:
:ID: 8300ca3c-deff-4147-9c31-b7c54e5780d3
:END:
#+title: segmentation
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:39d30d24-c374-4d0c-8037-b03ecbf983fa][computer_vision]]
- [[https://github.com/vijishmadhavan/ArtLine][ArtLine]]: GAN that extracts line art from an image; maybe usable instead of Canny for ControlNet?
- [[https://twitter.com/_akhaliq/status/1697527572840456558][Emergence]] of Segmentation with Minimalistic White-Box Transformers
- [[https://twitter.com/_akhaliq/status/1742379761026937009][Boundary]] Attention: Learning to Find Faint Boundaries at Any Resolution ==best==
  - infers boundaries, including contours, corners, and junctions
- [[https://arxiv.org/abs/2401.04403][MST]]: Adaptive Multi-Scale Tokens Guided Interactive Segmentation
  - leverages token similarity so fewer tokens suffice while maintaining multi-scale token interaction

* TARGET-ING
- [[https://twitter.com/prafull7/status/1686560498974957568][Materialistic]]: Selecting Similar Materials in Images
- [[https://twitter.com/_akhaliq/status/1667053581944455174][Background]] Prompting for Improved Object Depth
  - learned background prompt, so the model focuses on the object
- [[https://twitter.com/RainbowYuhui/status/1687654715704967168][LISA]]: [[https://arxiv.org/abs/2308.00692][Reasoning]] Segmentation via Large Language Model
  - Language Instructed Segmentation Assistant: speak to it and it segments
- [[https://arxiv.org/abs/2304.03284][SegGPT]]: Segmenting Everything In Context
  - [[https://github.com/baaivision/Painter][Painter]] & SegGPT series: vision foundation models from BAAI (radiography components, top of box)
- [[https://github.com/WalBouss/GEM][Grounding Everything]]: Emerging Localization Properties in Vision-Language Transformers
  - CLIP can perform zero-shot open-vocabulary segmentation; probability-like outputs
- [[https://github.com/CartoonSegmentation/CartoonSegmentation][CartoonSegmentation]]: Instance-guided Cartoon Editing with a Large-scale Dataset (anime fine details) ==best==

** OBJECT DETECTION
- [[https://twitter.com/_akhaliq/status/1737317495642427506][Tracking]] Any Object Amodally
  - comprehends complete objects from partial visibility; boxes for occluded objects

*** CUTLER
- [[https://github.com/facebookresearch/CutLER][CutLER]]: object detection and segmentation
- [[https://github.com/natethegreate/hent-AI][Detecting censors]] with deep learning and computer vision; localizes them (to later inpaint over them)
- [[https://twitter.com/_akhaliq/status/1740596579185102997][U2Seg]]: Unsupervised Universal Image Segmentation (vs CutLER) ==best==
  - clustering of pseudo semantic labels

*** CONTROLNET FOR 3D
:PROPERTIES:
:ID: d1d1a9ff-670e-4bed-9087-ad0b8b71ee7a
:END:
- [[https://twitter.com/_akhaliq/status/1722468688819867693][3DiffTection]]: 3D Object Detection with Geometry-Aware Diffusion Features
  - fine-tunes (via ControlNet) a 2D diffusion model to perform novel view synthesis from a single image (using an epipolar warp operator) ==best==
  - 3D detection and identifying cross-view point correspondences

*** NERF SEGMENTATION
:PROPERTIES:
:ID: d66a336f-083e-4515-b68e-67141ae4776c
:END:
- [[https://twitter.com/_akhaliq/status/1684818691161264128][NeRF-Det]]: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
  - indoor 3D detection (and depth) with images as input; handles unseen scenes without per-scene optimization
- [[https://twitter.com/_akhaliq/status/1721404421475569933][EmerNeRF]]: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
  - captures scene geometry, appearance, and motion; represents highly dynamic scenes self-sufficiently
- [[https://github.com/Jumpat/SegAnyGAussians][SAGA]]: [[https://twitter.com/_akhaliq/status/1731886323051417790][Segment]] Any 3D Gaussians
  - multi-granularity segmentation, instantaneous (unlike SA3D)
- [[https://twitter.com/_akhaliq/status/1747863699581218821][GARField]]: Group Anything with Radiance Fields
  - uses SAM 2D masks, coarse-to-fine hierarchy

* SAM
:PROPERTIES:
:ID: 1eb158d5-47a5-42a4-8692-86c42376d25a
:END:
- [[https://twitter.com/_akhaliq/status/1645115958594351106][SAM + DINO]], [[https://github.com/mattyamonaca/PBRemTools][segment]] anything, image region editing
- [[https://huggingface.co/papers/2306.01567][HQ-SAM]]: Segment Anything in High Quality
- [[https://twitter.com/_akhaliq/status/1666273462766170113][Recognize]] Anything: A Strong Image Tagging Model
- [[https://arxiv.org/abs/2304.06718][Segment-Everything-Everywhere-All-At-Once]] ([[https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once][code]])
- [[https://github.com/geekyutao/Inpaint-Anything][inpainting]]
- [[https://twitter.com/_akhaliq/status/1678599147455119363][Semantic-SAM]]: [[https://github.com/UX-Decoder/Semantic-SAM][Segment]] and Recognize Anything at Any Granularity
  - generates masks at multiple levels
- [[https://twitter.com/_akhaliq/status/1744189586157273234][Open-Vocabulary]] SAM: Segment and Recognize Twenty-thousand Classes Interactively
  - CLIP-like real-world recognition
- [[https://arxiv.org/abs/2401.04651][Learning]] to Prompt Segment Anything Models
  - optimizes the prompts using few-shot data
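A minimal sketch of the plain SAM API, for reference alongside the variants above: automatic "segment everything" plus a single point prompt. It assumes the ~segment-anything~ package and a locally downloaded ViT-B checkpoint; the image path and click coordinates are placeholders.

#+begin_src python
import cv2
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, SamPredictor, sam_model_registry

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)  # SAM expects RGB
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local path (assumption)

# 1) "Segment everything": grid-prompted automatic mask generation.
masks = SamAutomaticMaskGenerator(sam).generate(image)
masks.sort(key=lambda m: m["area"], reverse=True)  # each m["segmentation"] is a bool HxW array

# 2) Interactive: one positive click.
predictor = SamPredictor(sam)
predictor.set_image(image)
point_masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) of the click, placeholder
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # three candidate granularities
)
best = point_masks[np.argmax(scores)]
#+end_src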
** FASTER
- [[https://arxiv.org/abs/2306.12156][Fast Segment]] Anything ([[https://github.com/casia-iva-lab/fastsam][FastSAM]]): 40ms per image, [[https://twitter.com/giswqs/status/1691541059703074817][on PyPI]]
- [[https://twitter.com/fiandola/status/1732171016783180132][EfficientSAM]]: 20x fewer parameters and 20x faster runtime
- [[https://twitter.com/horseeeMa/status/1737039910299959505][SlimSAM]]: 0.1% Data Makes Segment Anything Slim
  - 0.9% (5.7M) of the parameters, 0.1% of the data
- [[https://twitter.com/_akhaliq/status/1738019802709471395][TinySAM]]: Pushing the Envelope for Efficient Segment Anything Model
  - knowledge distillation to distill a lightweight student model

** VIDEOS
- [[https://github.com/gaomingqi/Track-Anything][Track-Anything]]: segment videos
- [[https://twitter.com/_akhaliq/status/1700030823926280448][Tracking Anything]] with Decoupled Video Segmentation
- [[https://twitter.com/_akhaliq/status/1722112134866141193][Video]] Instance Matting
  - estimates a matte for each instance at each frame of a video sequence
- [[https://twitter.com/_akhaliq/status/1739894833076945307][UniRef++]]: Segment Every Reference Object in Spatial and Temporal Spaces
  - unifies four reference-based object segmentation tasks in a single architecture (box or area from a prompt)
- [[https://arxiv.org/abs/2402.09883][Lester]]: rotoscope animation through video object segmentation and tracking
  - mask and track across frames

** USE CASES
- [[https://twitter.com/_akhaliq/status/1667027179308195843][Matting]] Anything Model (MAM): green-screen-style matting
- [[id:bb79e50e-ed85-4f37-bd0c-6cad6acd0a6e][TOKENCOMPOSE]]: enhanced prompting

*** UNDERSTANDING
- [[https://github.com/Luodian/RelateAnything][RelateAnything]]: see relationships between segmented objects
- [[https://twitter.com/CircleRadonqq/status/1737338671219843076][Osprey]]: Pixel Understanding with Visual Instruction Tuning; understanding on top of SAM
  - click and get a description of that cluster of pixels

*** FOLLOW AREA
- [[https://twitter.com/_akhaliq/status/1676092343148064770][Segment]] Anything Meets Point Tracking: follow pixels, [[id:88e29751-d7d6-41e4-8375-3c7ac24cb77b][OPTICAL FLOW]]
- [[https://twitter.com/_akhaliq/status/1680772619392385025][DreamTeacher]]: Pretraining Image Backbones with Deep Generative Models
  - following 3D concepts with 3D understanding

* DIFFUSION SEGMENTATION
- parent: [[id:c7fe7e79-73d3-4cc7-a673-2c2e259ab5b5][stable_diffusion]]
- [[id:7cd466fd-1feb-47ce-bf9a-033ba4838579][SLIME]]
- [[https://weichen582.github.io/diffmae.html][Diffusion Models]] [[https://arxiv.org/abs/2304.03283][as Masked]] Autoencoders
- [[https://jerryxu.net/ODISE/][ODISE]]: Open-Vocabulary [[https://github.com/NVlabs/ODISE][Panoptic]] Segmentation with Text-to-Image Diffusion Models
- [[https://twitter.com/_akhaliq/status/1669588008117338113][Diffusion Models]] for Zero-Shot Open-Vocabulary Segmentation (considers the contextual background)
- [[https://twitter.com/_akhaliq/status/1706177856353505346][MosaicFusion]]: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
  - generates synthetic labeled data for rare and novel categories, then trains segmentation on it
- [[https://github.com/UX-Decoder/FIND][FIND]]: Interface Foundation Models' Embeddings
  - segment and correlate to a prompt token
- [[https://github.com/MengyuWang826/SegRefiner][SegRefiner]]: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process ==best==
  - improves segmentation accuracy by denoising the mask (exceedingly fine details)
- [[https://twitter.com/_akhaliq/status/1749676782226231415][EmerDiff]]: Emerging Pixel-level Semantic Knowledge in Diffusion Models
  - identifies correspondences between pixels and latent-space features
- [[https://bcorrad.github.io/freesegdiff/][FreeSeg-Diff]]: Training-Free Open-Vocabulary Segmentation with Diffusion Models
  - through a diffusion model and an image captioner model, both frozen

** 3D SD SEG
- [[https://twitter.com/_akhaliq/status/1722468688819867693][3DiffTection]]: 3D Object Detection with Geometry-Aware Diffusion Features
  - synthesis conditioned on a single image using an epipolar warp operator
  - 3D-aware features for 3D detection, identifying cross-view point correspondences

* AUDIO
- [[https://twitter.com/LiuXub/status/1689311290513063937][AudioSep]]: Separate Anything You Describe, Separate Anything Audio Model

* 3D SEGMENTATION
- [[3D SD SEG]] [[NERF SEGMENTATION]] [[id:89276877-2243-411e-8943-bea0427264f3][LIFT3D]]
- [[https://github.com/Jumpat/SegmentAnythingin3D][Segment]] Anything in 3D with NeRFs (SA3D)
- [[https://twitter.com/_akhaliq/status/1665926124487036929][SAM3D]]: Zero-Shot 3D Object Detection via Segment Anything Model
- [[https://twitter.com/liuziwei7/status/1651461200956514306][SAD]] is able to perform 3D segmentation (segment out any 3D object) with RGBD inputs
- [[https://github.com/dvlab-research/VoxelNeXt][VoxelNeXt]]: Fully Sparse VoxelNet for 3D Object Detection and Tracking (ConvNeXt-inspired)
  - predicts objects directly from sparse voxel features
  - no sparse-to-dense conversion, anchors, or center proxies needed anymore
- use: 2D segmentation mask into 3D boxes: [[https://github.com/IDEA-Research/Grounded-Segment-Anything#install-without-docker][code]] (see the sketch after this list)
- [[https://nitter.poast.org/_akhaliq/status/1773194569909190975#m][EgoLifter]]: Open-world 3D Segmentation for Egocentric Perception
  - segments scenes captured from egocentric sensors into a complete decomposition of individual 3D objects
- [[https://threedle.github.io/iSeg/][iSeg]]: Interactive 3D Segmentation via Interactive Attention
  - click-based: positive and negative clicks directly on the shape's surface
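A hedged sketch of the mask-to-3D-box step referenced above: given a binary mask, an aligned depth map in metres, and pinhole intrinsics, back-project the masked pixels and take an axis-aligned box. The Grounded-Segment-Anything pipeline is more involved; this is only the geometric core, and all values below are illustrative.

#+begin_src python
import numpy as np

def mask_to_3d_box(mask, depth, fx, fy, cx, cy):
    """mask: HxW bool, depth: HxW float -> (min_xyz, max_xyz) in the camera frame."""
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows/cols inside the mask
    z = depth[v, u]
    x = (u - cx) / fx * z                  # pinhole back-projection
    y = (v - cy) / fy * z
    pts = np.stack([x, y, z], axis=1)
    # Trim depth outliers so stray background pixels don't inflate the box.
    lo, hi = np.percentile(pts[:, 2], [2, 98])
    pts = pts[(pts[:, 2] >= lo) & (pts[:, 2] <= hi)]
    return pts.min(axis=0), pts.max(axis=0)

# Toy usage: a 2x2 object patch at 2m depth in a 4x4 frame.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(mask_to_3d_box(mask, depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0))
#+end_src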
** SUPERPRIMITIVE
:PROPERTIES:
:ID: 35add1fe-b835-49c7-99f4-8aa4321a3904
:END:
- into point cloud:
- [[https://github.com/makezur/super_primitive][SuperPrimitive]]: Scene Reconstruction at a Primitive Level
  - splits images into semantically correlated local regions, then enhances them with normals
  - for tasks: depth completion (per pixel), few-view structure from motion, and monocular dense visual odometry (recovering camera poses)

** GAUSSIAN
- [[https://twitter.com/_akhaliq/status/1739889391064031717][LangSplat]]: 3D Language Gaussian Splatting
  - grounds CLIP features into 3D language Gaussians; faster than LERF
- SA-GS: [[https://arxiv.org/abs/2401.17857][Segment Anything]] in 3D Gaussians
  - without any training process or learned parameters

* OPTICAL FLOW
:PROPERTIES:
:ID: 88e29751-d7d6-41e4-8375-3c7ac24cb77b
:END:
- [[https://arxiv.org/abs/2003.12039][RAFT]]: [[https://github.com/princeton-vl/RAFT][Recurrent]] All-Pairs Field Transforms for Optical Flow (video optical flow; see the sketch after this list)
- [[https://twitter.com/_akhaliq/status/1667052177146126336][OmniMotion]]: Tracking Everything Everywhere All at Once (following pixels, optical flow)
- [[https://twitter.com/_akhaliq/status/1681162394393886720][INVE]]: Interactive Neural Video Editing; paint pixels, then follow them
- [[https://twitter.com/_akhaliq/status/1684516609728421888][Tracking]] Anything in High Quality
  - a pretrained mask-refinement (MR) model is employed to refine the tracking result
- [[https://twitter.com/MetaAI/status/1696628347357536730][CoTracker]]: models the correlation of points over time, using attention
  - [[https://twitter.com/QingdiZhang/status/1696826848586711531][can]] track every pixel or only selected ones
  - [[https://twitter.com/CarlDoersch/status/1721919205975412848][generates]] rainbow visualizations from a set of point tracks
- [[https://twitter.com/_akhaliq/status/1777562507973927126][SpatialTracker]]: Tracking Any 2D Pixels in 3D Space
  - deals with occlusions and discontinuities in 2D, mitigating the issues caused by image projection
  - uses monocular depth estimators
- [[FOLLOW AREA]]
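A quick way to try RAFT without the original repo is torchvision's port; a minimal sketch, assuming torchvision >= 0.13 and two same-sized RGB frames on disk (the paths and resize target are placeholders; RAFT wants H and W divisible by 8).

#+begin_src python
import torch
import torchvision.transforms.functional as TF
from torchvision.io import read_image
from torchvision.models.optical_flow import Raft_Large_Weights, raft_large

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

frame1 = read_image("frame1.png").unsqueeze(0)  # 1x3xHxW uint8
frame2 = read_image("frame2.png").unsqueeze(0)
frame1 = TF.resize(frame1, size=[520, 960], antialias=False)  # divisible by 8
frame2 = TF.resize(frame2, size=[520, 960], antialias=False)
frame1, frame2 = weights.transforms()(frame1, frame2)  # float conversion + normalization

with torch.no_grad():
    flows = model(frame1, frame2)  # list of iterative refinements
flow = flows[-1]                   # final 1x2xHxW (dx, dy) field
#+end_src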
** DIFFUSION OPTICAL FLOW
- parent: [[id:82127d6a-b3bb-40bf-a912-51fa5134dacc][diffusion]]
- [[https://twitter.com/_akhaliq/status/1665929002668662784][The Surprising]] Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

* FINETUNING
- [[https://arxiv.org/abs/2403.20126][ECLIPSE]]: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
  - freezes the model parameters, fine-tunes a small set of prompt embeddings
  - addresses both catastrophic forgetting and plasticity
  - significantly reduces the trainable parameters
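The general recipe behind such prompt tuning, as a generic sketch (not ECLIPSE's actual code): freeze the backbone, prepend a few learnable prompt tokens, and let the optimizer see only the prompts and a small head. The stand-in transformer, shapes, and hyperparameters are illustrative.

#+begin_src python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, dim=256, num_prompts=8, num_classes=21):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # frozen stand-in
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, num_classes)  # the only other trainable part

    def forward(self, patch_tokens):  # patch_tokens: B x N x dim
        b = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.backbone(x)
        return self.head(x[:, self.prompts.size(1):])  # per-patch class logits

model = PromptTunedEncoder()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # only prompts + head update
logits = model(torch.randn(2, 196, 256))           # e.g. a 14x14 patch grid
#+end_src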