:PROPERTIES:
:ID: 5274c3ad-4ade-44d0-ab29-1145a0fbfeee
:END:
#+title: logistic
#+filetags: :neuralnomicon:
#+SETUPFILE: https://fniessen.github.io/org-html-themes/org/theme-readtheorg.setup

- parent: [[id:e9be16f7-8032-4509-9aa9-7843836eacd9][domain]]
- [[https://huggingface.co/papers/2306.00148][SafeDiffuser]]: [[https://safediffuser.github.io/safediffuser/][Safe]] Planning with Diffusion Probabilistic Models

* BEHAVIOURAL
- [[id:d1967bb7-3782-4052-8725-c799c2630893][BEHAVIORAL TRANSFORMER]]
- [[https://twitter.com/_akhaliq/status/1676768086697885699][Building Cooperative]] Embodied Agents Modularly with Large Language Models

** PLANNING
- [[https://diffusion-planning.github.io/][Planning with]] [[https://arxiv.org/abs/2205.09991][Diffusion for]] [[https://twitter.com/neurosp1ke/status/1530525256871444480][Flexible Behavior]] Synthesis
- [[https://huggingface.co/papers/2305.19472][PlaSma]]: Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning
  - LLM revision of a plan to cope with a counterfactual situation
- [[https://twitter.com/_akhaliq/status/1716298129799106820][ToolChain*]]: Efficient Action Space Navigation in Large Language Models with A* Search
  - treats the entire action space as a decision tree, then identifies the lowest-cost valid path as the solution (a toy A* sketch follows at the end of this list)
  - ==cheapest-cost decision==
- [[https://twitter.com/_akhaliq/status/1754348339426890107][K-Level]] Reasoning with Large Language Models
  - decision-making in evolving environments, dynamic reasoning
- [[https://twitter.com/_akhaliq/status/1754353852155879549][TravelPlanner]]: A Benchmark for Real-World Planning with Language Agents
  - LLMs have a success rate of only 0.6% on travel planning
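A minimal sketch of the ToolChain*-style idea: A* over a tree of candidate actions, returning the cheapest valid action sequence. The ~CHILDREN~ table, goal set, step costs, and heuristic below are all invented for illustration; the actual method scores expansions with an LLM rather than a hard-coded table.

#+begin_src python
import heapq

# Toy action tree: in a ToolChain*-style system the expansions would be
# proposed and scored by an LLM; here they are hard-coded for illustration.
CHILDREN = {
    "start": [("search_flights", 3.0), ("search_trains", 2.0)],
    "search_flights": [("book_flight", 5.0)],
    "search_trains": [("book_train", 4.0)],
    "book_flight": [], "book_train": [],
}
GOALS = {"book_flight", "book_train"}  # valid terminal actions

def heuristic(node: str) -> float:
    # Invented estimate of remaining cost (the paper learns/scores this).
    return 0.0 if node in GOALS else 1.0

def a_star(start: str):
    frontier = [(heuristic(start), 0.0, start, [start])]  # (f, g, node, path)
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node in GOALS:
            return path, g  # cheapest-cost valid path
        if node in seen:
            continue
        seen.add(node)
        for child, cost in CHILDREN[node]:
            new_g = g + cost
            heapq.heappush(frontier,
                           (new_g + heuristic(child), new_g, child, path + [child]))
    return None, float("inf")

print(a_star("start"))  # (['start', 'search_trains', 'book_train'], 6.0)
#+end_src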
*** CODE PLANNING
:PROPERTIES:
:ID: 897b3a4e-c8ce-4e7c-bb58-688e3d299370
:END:
- [[https://twitter.com/_akhaliq/status/1706118271311724994][CodePlan]]: Repository-level Coding using LLMs and Planning
  - context derived from the entire repository and previous code changes
  - package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications

** ROBOTS
:PROPERTIES:
:ID: bb65f50b-04af-4161-afcd-acdc4821a0c4
:END:
- [[id:2fefb31b-1809-49c0-b925-a7b9a6fa3b0b][LLM AS REWARD]] [[PROPER-ING INSTRUCTIONS]] [[id:2eaf93e5-1612-40d5-901d-4a4da911b086][MOTION SYNTHESIS]]
- [[https://twitter.com/_akhaliq/status/1645257919997394945][Generative Agents]]: Interactive Simulacra of Human Behavior, sims
- [[https://twitter.com/_akhaliq/status/1686223946914439170][Discovering]] Adaptable Symbolic Algorithms from Scratch
  - evolves (activates) safe control policies that avoid falling when individual limbs suddenly break
- [[https://twitter.com/_akhaliq/status/1687296882312101890][Dynalang]]: Learning to Model the World with Language
  - agents that leverage diverse language describing the state of the world, with feedback
- Diffusion-CCSP: [[https://twitter.com/_akhaliq/status/1699325117904375839][Compositional]] Diffusion-Based Continuous Constraint Solvers
  - novel combinations of known constraints
- [[https://twitter.com/_akhaliq/status/1731494449094570060][Dolphins]]: Multimodal Language Model for Driving
  - holistic understanding of intricate driving scenarios and multimodal instructions
- [[https://arxiv.org/pdf/2401.11061.pdf][PhotoBot]]: Reference-Guided Interactive Photography via Natural Language
  - takes photos at the best poses (cinematography), best perspectives and POVs

*** MINECRAFT
- [[https://twitter.com/_akhaliq/status/1683312064935395328][STEVE-1]]: A Generative Model for Text-to-Behavior in Minecraft
  - unCLIP is effective for creating instruction-following sequential decision-making agents
  - built on pretrained models like VPT and MineCLIP; STEVE-1 costs just $60 to train
- [[https://twitter.com/_akhaliq/status/1723910028753564151][JARVIS-1]]: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
  - multimodal memory; planning using both pre-trained knowledge and actual game experiences
- [[https://twitter.com/_akhaliq/status/1760888473576227305][BeTAIL]]: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay
  - without requiring hand-designed reward functions

*** DAG
:PROPERTIES:
:ID: 6d9df8fd-398d-44da-b7d7-cd7146b1b7a8
:END:
- [[https://www.youtube.com/watch?v=IXqj4HqNbPE][DAG Amendment]] for Inverse Control of Parametric Shapes
  - depending on the size and location of the brush, infers the intention
  - modifies the hyperparameters, not just one axis but whole arm-mechanisms

*** DATASET
- [[https://robotics-transformer-x.github.io/][RT-X]]: the largest open-source robot dataset
- [[https://twitter.com/DrJimFan/status/1720491210383749136][MimicPlay]]: imitation learning algorithm that extracts the most signal from unlabeled human motions

* ECONOMY
- [[https://arxiv.org/pdf/2303.17564.pdf][BloombergGPT]]: A Large Language Model for Finance (economy)

* SCENE
- [[https://scenediffuser.github.io/][Diffusion-based Generation]], [[https://arxiv.org/abs/2301.06015][Optimization, and]] Planning in 3D Scenes
- [[https://twitter.com/_akhaliq/status/1707577996951916689][ConceptGraphs]]: Open-Vocabulary 3D Scene Graphs for Perception and Planning
  - 2D foundation models, then fusing their output into 3D by multi-view association
  - complex reasoning over spatial and semantic concepts
- [[https://twitter.com/_akhaliq/status/1712634496208498813][LangNav]]: Language as a Perceptual Representation for Navigation
  - selects an action (from an instruction) based on the current view and the trajectory history

** SCENE SYNTHESIS
:PROPERTIES:
:ID: 802df88e-d2f7-4849-9def-43190e1cebde
:END:
- [[id:a4f8fda0-bd4a-42a8-aad0-7a256a696bcd][SCENE TEXTURES]]
- [[https://twitter.com/_akhaliq/status/1715235725069693415][3D-GPT]]: Procedural 3D Modeling with Large Language Models
  - instruction-driven 3D modeling
  - evolves (and enhances) detailed forms while dynamically adapting to subsequent instructions
- [[https://arxiv.org/abs/2401.14111][Image]] Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
  - leverages CLIP scene understanding instead of layouts; GAN based
- [[https://arxiv.org/abs/2402.04504][Text2Street]]: Controllable Text-to-image Generation for Street Views
  - text-to-map generation integrating road structure/topology, object layout and weather description
- [[https://sglab.kaist.ac.kr/SemCity/][SemCity]]: Semantic Scene Generation with Triplane Diffusion (refinement and inpainting)
- [[https://arxiv.org/pdf/2403.08782.pdf][Procedural]] terrain generation with style transfer
  - drawing style from real-world height maps onto Perlin noise (a toy noise-heightmap sketch follows after this list)
- [[https://twitter.com/_akhaliq/status/1778235336721666203][RealmDreamer]]: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
  - optimizes a 3D Gaussian splatting; allows 3D synthesis from a single image
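A toy version of the noise-plus-style idea from the terrain bullet above: build a smooth fractal heightmap, then impose the value distribution of a "real" heightmap by rank-based histogram matching. Everything here is synthetic stand-in data, and the histogram match is a stand-in for the paper's learned style transfer.

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)

def smooth_noise(size: int, octaves: int = 4) -> np.ndarray:
    """Cheap Perlin-like fractal noise: a sum of upsampled random grids."""
    out = np.zeros((size, size))
    for o in range(octaves):
        cells = 2 ** (o + 2)
        grid = rng.standard_normal((cells, cells))
        # nearest-neighbour upsample; weight finer octaves less
        out += np.kron(grid, np.ones((size // cells, size // cells))) * 0.5 ** o
    return out

def match_histogram(src: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Impose ref's value distribution onto src (rank-based matching)."""
    order = np.argsort(src, axis=None)
    matched = np.empty_like(src).ravel()
    matched[order] = np.sort(ref, axis=None)
    return matched.reshape(src.shape)

noise = smooth_noise(64)                  # synthetic base terrain
real = rng.gamma(2.0, 200.0, (64, 64))    # stand-in for a real-world height map
styled = match_histogram(noise, real)     # noise's shape, real-world height stats
print(styled.min(), styled.max())
#+end_src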
*** ROOM
- [[https://dreamscene-project.github.io/][DreamScene]]: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
  - multi-timestep sampling strategy guided by the formation patterns of 3D objects
  - enables targeted adjustments

**** ROOM LAYOUT
:PROPERTIES:
:ID: 5e1ee0b4-8493-44e4-b0cf-89b429a78532
:END:
- [[https://arxiv.org/abs/2401.17053][BlockFusion]]: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation
  - semantically and geometrically meaningful transitions that harmoniously blend with the existing scene
  - 2D layout conditioning/control
- [[https://chenguolin.github.io/projects/InstructScene/][InstructScene]]: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
  - with a semantic graph prior and a layout decoder
- [[https://twitter.com/_akhaliq/status/1757238210906685918][GALA3D]]: [[https://gala3d.github.io/][Towards]] Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
  - an LLM generates the initial layout as a geometric constraint
- [[https://zrealli.github.io/sketch2arc/][Sketch-to-Architecture]]: Generative AI-aided Architectural Design
  - generates conceptual floorplans and 3D models from simple sketches

**** ROOMDREAMER
:PROPERTIES:
:ID: 3c4595ca-1c60-4c23-b305-9068e85dc22d
:END:
- [[https://arxiv.org/abs/2305.11337][RoomDreamer]]: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
  - uses a cubemap, with a depth map (object plane to screen plane) and a distance map (see the conversion sketch after this section)
- [[https://twitter.com/YueYangAI/status/1736745220367057032][Holodeck]]: [[https://github.com/allenai/Holodeck][promptable]] system that can generate diverse, customized, and interactive 3D environments
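The depth vs. distance distinction in the RoomDreamer bullet is worth pinning down: a depth map stores the perpendicular z from the object plane to the image plane, while a distance map stores the euclidean distance from the camera center along each pixel's ray. A minimal pinhole-camera conversion, with made-up intrinsics:

#+begin_src python
import numpy as np

def depth_to_distance(depth: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Convert a z-depth map to a euclidean distance map (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Ray direction per pixel, with the z component normalized to 1.
    ray = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(depth)], axis=-1)
    # distance = depth * |ray|, since depth is the z component along the ray
    return depth * np.linalg.norm(ray, axis=-1)

# Made-up intrinsics for a 640x480 camera; a flat wall 2 m away
depth = np.full((480, 640), 2.0)
dist = depth_to_distance(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(dist[240, 320], dist[0, 0])  # 2.0 at the center, larger at the corners
#+end_src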
** INTERACTIONS
:PROPERTIES:
:ID: 38d684fe-bc58-4132-bc2f-407e70198230
:END:
- [[id:eb05005c-b0d8-4c3d-b4ec-6915de1d970c][GEOMETRY INTERACTIONS]] [[id:2af8a345-7338-42f0-8dda-03b2eb69f22e][POSE - POSITION]]

*** MOTION SYNTHESIS
:PROPERTIES:
:ID: 2eaf93e5-1612-40d5-901d-4a4da911b086
:END:
- [[id:4d3d8e2a-08c3-4624-9389-cd54e06850b9][motion]]
- [[https://github.com/sebastianstarke/AI4Animation][AI4Animation]]: [[https://www.youtube.com/watch?v=7c6oQP1u2eQ][Neural State]] Machine for Character-Scene Interactions
  - [[id:88490b18-3eaf-402d-b8ef-eca7a125ce93][PAE - PHASE]]
- [[https://twitter.com/_akhaliq/status/1680773352791703552][NIFTY]]: Neural Object Interaction Fields for Guided Human Motion Synthesis
  - neural interaction field attached to a specific object
  - guided diffusion model trained on generated synthetic data
- [[https://twitter.com/_akhaliq/status/1724274588962439517][Story-to-Motion]]: Synthesizing Infinite and Controllable Character Animation from Long Text
  - text-to-motion across various locations with specific motions
  - motion semantic trajectory constraint
- CHOIS: [[https://twitter.com/_akhaliq/status/1732974053382557995][Controllable]] Human-Object Interaction Synthesis
  - diffusion with constraints: 1. language informs style and intent; 2. waypoints ground the motion and can be effectively extracted using high-level planning methods
- [[https://vcai.mpi-inf.mpg.de/projects/ROAM/][ROAM]]: Robust and Object-aware Motion Generation using Neural Pose Descriptors
  - method for human-object interaction synthesis
  - given an unseen object, optimise for the closest one in the feature space
- [[https://twitter.com/_akhaliq/status/1768132939206836671][TRUMANS]]: Scaling Up Dynamic Human-Scene Interaction Modeling
  - 15 hours of human interactions across 100 indoor scenes
  - diffusion-based autoregressive model that efficiently generates HSI sequences of any length

*** LLM
- [[https://twitter.com/_akhaliq/status/1683704549817868288][3D-LLM]]: Injecting the 3D World into Large Language Models
  - LLM with 3D understanding such as spatial relationships, affordances, physics, layout
  - can take 3D point clouds and their features as input
- [[https://twitter.com/_akhaliq/status/1699601314798346309][Physically]] Grounded Vision-Language Models for Robotic Manipulation
  - planning on tasks that require reasoning about physical object concepts
- [[https://twitter.com/_akhaliq/status/1767750847239262532][Motion Mamba]]: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
  - long-sequence and efficient motion

**** PROPER-ING INSTRUCTIONS
:PROPERTIES:
:ID: 4daacc49-2790-49c2-a32a-880c5f99e681
:END:
- [[https://twitter.com/_akhaliq/status/1716297684078702810][Auto-Instruct]]: Automatic Instruction Generation and Ranking for Black-Box Language Models
  - method to automatically improve the quality of LLM instructions
- [[https://twitter.com/_akhaliq/status/1716280700473581733][Creative]] Robot Tool Use with Large Language Models
  - takes instructions as input and outputs executable code for controlling robots (tools)

*** INSIDE COMPUTER
- [[https://twitter.com/_akhaliq/status/1683681368818192386][A Real-World]] WebAgent with Planning, Long Context Understanding, and Program Synthesis (website as scene)
  - LLM-driven agent to complete instruction tasks on real websites
- [[https://twitter.com/_akhaliq/status/1713773812670476336][A Zero-Shot]] Language Agent for Computer Control with Structured Reflection
  - partially observed environment; iteratively learns from its mistakes; structured thought management

*** GENERATE BLENDER
:PROPERTIES:
:ID: b27d9d84-d4a1-49b3-9c03-be873e8aa18b
:END:
- [[https://arxiv.org/abs/2403.01248][SceneCraft]]: An LLM Agent for Synthesizing 3D Scene as Blender Code
  - models a scene graph as a blueprint, detailing spatial relationships among assets in the scene
  - then writes Blender Python scripts based on this graph, translating relationships into numerical constraints for asset layout (a toy sketch of this graph-to-bpy step follows below)
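A toy version of the SceneCraft-style graph-to-script step, meant to run inside Blender's Python environment: a hand-written scene graph of spatial relations is resolved into positions and emitted as bpy calls. The graph, the relation resolver, and the sizes are all invented here; SceneCraft generates and iteratively refines this kind of script with an LLM.

#+begin_src python
import bpy

# Hand-written scene graph: (object, relation, anchor). SceneCraft would
# have an LLM produce this blueprint from a text description instead.
SCENE_GRAPH = [
    ("table", None, None),
    ("lamp", "on_top_of", "table"),
    ("chair", "next_to", "table"),
]
SIZES = {"table": 1.0, "lamp": 0.3, "chair": 0.6}  # invented cube edge lengths

def resolve(relation, anchor_loc, anchor_size, size):
    """Translate a spatial relation into a numerical position constraint."""
    x, y, z = anchor_loc
    if relation == "on_top_of":
        return (x, y, z + anchor_size / 2 + size / 2)
    if relation == "next_to":
        return (x + anchor_size / 2 + size / 2 + 0.2, y, z)
    return (0.0, 0.0, size / 2)  # default: rest on the ground plane

placed = {}
for name, relation, anchor in SCENE_GRAPH:
    size = SIZES[name]
    loc = resolve(relation, placed.get(anchor, (0.0, 0.0, 0.0)),
                  SIZES.get(anchor, 0.0), size)
    # Every asset is a placeholder cube; a real pipeline would import meshes.
    bpy.ops.mesh.primitive_cube_add(size=size, location=loc)
    bpy.context.object.name = name
    placed[name] = loc
#+end_src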
* UNDERSTANDING
:PROPERTIES:
:ID: cdb139f5-cc9e-4dc8-99b3-9aa39ece63ad
:END:
- [[id:38d684fe-bc58-4132-bc2f-407e70198230][INTERACTIONS]]
- [[https://arxiv.org/abs/2402.02922][Pixel-Wise]] Color Constancy via Smoothness Techniques in Multi-Illuminant Scenes
  - anti-abnormal-light filter; learns pixel-wise illumination maps caused by multiple light sources

** POSE - POSITION
:PROPERTIES:
:ID: 2af8a345-7338-42f0-8dda-03b2eb69f22e
:END:
- [[id:c8b0c87f-b5e4-4720-a50b-253bd7f3a329][DETECTING HUMAN]] [[id:2eaf93e5-1612-40d5-901d-4a4da911b086][MOTION SYNTHESIS]]
- [[https://twitter.com/_akhaliq/status/1673879084760440833][PoseDiffusion]]: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
  - models the distribution of camera poses given input images
- [[https://twitter.com/rsasaki0109/status/1674362772388458497][Detector-Free]] Structure from Motion
- [[https://github.com/IDEA-Research/DWPose][Effective]] Whole-body Pose Estimation with Two-stages Distillation
  - used instead of the OpenPose preprocessor
- [[https://twitter.com/_akhaliq/status/1707278993097928709][DECO]]: Dense Estimation of 3D Human-Scene Contact In The Wild
  - recognizes 3D contact between the body and objects
- [[https://twitter.com/_akhaliq/status/1738147746060657144][Pose Anything]]: [[https://github.com/orhir/PoseAnything][A Graph-Based]] [[https://arxiv.org/abs/2311.17891][Approach]] for [[https://lemmy.dbzer0.com/post/11078405][Category-Agnostic]] Pose Estimation
  - people, animals, furniture, faces
- [[https://arxiv.org/abs/2401.16173][Reconstructing]] Close Human Interactions from Multiple Views
  - inputs multi-view 2D keypoint heatmaps and reconstructs the pose of each individual
- Extreme Two-View Geometry From Object Poses with Diffusion Models
  - extreme viewpoint changes, with no co-visible regions in the images

*** MONOCULAR
- [[https://twitter.com/_akhaliq/status/1676094652020031488][Real-time]] Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning
  - body tracking; only one view needed
- [[https://arxiv.org/abs/2401.03914][D3PRefiner]]: A Diffusion-based Denoise Method for 3D Human Pose Refinement
  - refines the output of any existing 3D pose estimator (monocular camera-based 3D pose estimation)
- [[https://twitter.com/liuziwei7/status/1764312276419649728][SMPLer]]: monocular 3D human motion capture; Motion Capture from Any Video

*** GET HEAD POSE
:PROPERTIES:
:ID: ffcb8293-fcbe-4531-8d9e-87cbe68fd4b5
:END:
- [[https://arxiv.org/abs/2401.10215][GPAvatar]]: [[https://xg-chu.github.io/project_gpavatar/][Generalizable]] and Precise Head Avatar from Image(s)
  - recreates the head avatar and precisely controls expressions and postures
- [[https://twitter.com/_akhaliq/status/1755087437217337709][IMUSIC]]: IMU-based Facial Expression Capture
  - facial expression capture using purely IMU signals
  - privacy-protecting; hybrid capture robust against occlusions; detects movements that are often invisible

*** SKELETON
- [[https://arxiv.org/abs/2401.12946][Coverage Axis++]]: Efficient Inner Point Selection for 3D Shape Skeletonization
  - strategy that considers both shape coverage and uniformity to derive skeletal points

** GEOMETRY INTERACTIONS
:PROPERTIES:
:ID: eb05005c-b0d8-4c3d-b4ec-6915de1d970c
:END:
- [[https://jasonqsy.github.io/3DOI/][Understanding]] 3D Object Interaction from a Single Image
- [[https://twitter.com/_akhaliq/status/1692040105211621595][Distilled]] Feature Fields Enable Few-Shot Language-Guided Manipulation
  - 3D geometry understanding (tokens) with rich 2D semantics (see the query sketch at the end of this note)
- [[https://twitter.com/_akhaliq/status/1747862771851514245][SceneVerse]]: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
  - scene and object captioning, object referral
- Learning Generalizable Feature Fields for Mobile Manipulation
  - [[https://twitter.com/_akhaliq/status/1767753512199295463][GeFF]] (Generalizable Feature Fields)
  - for both navigation and manipulation in real time
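The feature-field bullets above share one inference-time primitive: every 3D point carries a language-aligned feature, and a text embedding selects points by cosine similarity. A self-contained sketch with random stand-in data (a real system would use CLIP-distilled or GeFF features and an actual text encoder):

#+begin_src python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: 1,000 scene points, each with a 512-d language-aligned
# feature. In DFF/GeFF these come from distilling a 2D VLM into a 3D field.
points = rng.uniform(-1, 1, (1000, 3))
features = rng.standard_normal((1000, 512))
text_embedding = rng.standard_normal(512)   # stand-in for a "red mug" query

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity between each point feature and the text query.
sims = normalize(features) @ normalize(text_embedding)
top = np.argsort(sims)[-5:][::-1]           # 5 most query-relevant points
print(points[top])                          # candidate 3D locations to act on
#+end_src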