BEHAVIOURAL

BEHAVIORAL TRANSFORMER
Building Cooperative Embodied Agents Modularly with Large Language Models

PLANNING

Planning with Diffusion for Flexible Behavior Synthesis
PlaSma Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning
- LLM revises a plan to cope with a counterfactual situation
ToolChain* Efficient Action Space Navigation in Large Language Models with A* Search
- treats the entire action space as a decision tree, then identifies the lowest-cost valid path as the solution
  - ==lowest-cost decision==
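The idea of searching an action tree for the lowest-cost valid path can be sketched with a generic A* search. This is a minimal illustration, not the ToolChain* implementation: the toy tree, the tool names, and the edge costs are all made up, and the heuristic is set to zero (which reduces A* to uniform-cost search) because the note does not say how the paper estimates future cost.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for the cheapest path from start to goal, or None."""
    # Frontier entries are (f = g + h, g, node, path so far).
    frontier = [(heuristic(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for nxt, step_cost in neighbors(node):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic(nxt), new_g, nxt, path + [nxt]),
                )
    return None

# Hypothetical action tree: each edge is a tool call with an invented cost.
TREE = {
    "start": [("search_api", 2), ("browse_web", 5)],
    "search_api": [("parse_json", 1), ("ask_user", 4)],
    "browse_web": [("done", 1)],
    "parse_json": [("done", 2)],
    "ask_user": [("done", 1)],
    "done": [],
}

result = a_star("start", "done", lambda n: TREE[n], lambda n: 0)
print(result)  # (5, ['start', 'search_api', 'parse_json', 'done'])
```

Any admissible heuristic (one that never overestimates the remaining cost) can replace the zero function without changing the returned path cost.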
K-Level Reasoning with Large Language Models
- decision-making in evolving environments, dynamic reasoning
TravelPlanner A Benchmark for Real-World Planning with Language Agents
- LLMs reach a success rate of only 0.6% on travel planning

CODE PLANNING

CodePlan Repository-level Coding using LLMs and Planning
- context derived from the entire repository, previous code changes
- package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications

ROBOTS

LLM AS REWARD PROPER-ING INSTRUCTIONS MOTION SYNTHESIS
Generative Agents Interactive Simulacra of Human Behavior, sims
Discovering Adaptable Symbolic Algorithms from Scratch
- evolve (activate) safe control policies that avoid falling when individual limbs suddenly break
Dynalang Learning to Model the World with Language
- agents that leverage diverse language that describes state of the world with feedback
Diffusion-CCSP: Compositional Diffusion-Based Continuous Constraint Solvers
- novel combinations of known constraints
Dolphins Multimodal Language Model for Driving
- holistic understanding of intricate driving scenarios and multimodal instructions
PhotoBot Reference-Guided Interactive Photography via Natural Language
- takes photos with the best poses (cinematography), perspectives and POVs

MINECRAFT

STEVE-1 A Generative Model for Text-to-Behavior in Minecraft
- unCLIP is effective for creating instruction-following sequential decision-making agents
- by building on pretrained models like VPT and MineCLIP, STEVE-1 costs just $60 to train
JARVIS-1 Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
- multimodal memory, planning using both pre-trained knowledge and actual game experiences
BeTAIL Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay
- without requiring hand-designed reward functions

DAG

DAG Amendment for Inverse Control of Parametric Shapes
- depending on the size of the brush and the location, infers the intention
  - and modifies the hyperparameters, not just one axis but whole arm-mechanisms

DATASET

RT-X the largest open-source robot dataset
MimicPlay imitation learning algorithm that extracts the most signals from unlabeled human motions

ECONOMY

BloombergGPT A Large Language Model for Finance (economy)

SCENE

Diffusion-based Generation, Optimization, and Planning in 3D Scenes
ConceptGraphs Open-Vocabulary 3D Scene Graphs for Perception and Planning
- uses 2D foundation models, then fuses their output to 3D by multi-view association
- complex reasoning over spatial and semantic concepts.
LangNav Language as a Perceptual Representation for Navigation
- selects an action (from the instruction) based on the current view and the trajectory history

SCENE SYNTHESIS

SCENE TEXTURES
3D-GPT Procedural 3D Modeling with Large Language Models
- instruction-driven 3D modeling
  - evolving (and enhancing) their detailed forms while dynamically adapting to subsequent instructions
Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
- leveraging CLIP scene understanding instead of layouts, GAN based
Text2Street Controllable Text-to-image Generation for Street Views
- text-to-map generation integrating road structure-topology, object layout and weather description
SemCity Semantic Scene Generation with Triplane Diffusion (refinement and inpainting)
Procedural terrain generation with style transfer
- transfers style from real-world height maps onto Perlin noise
RealmDreamer Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
- optimizes a 3D Gaussian Splatting representation, allows 3D synthesis from a single image

ROOM

DreamScene 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
- multi-timestep sampling strategy guided by the formation patterns of 3D objects
- enables targeted adjustments

ROOM LAYOUT

BlockFusion Expandable 3D Scene Generation using Latent Tri-plane Extrapolation
- semantically and geometrically meaningful transitions that harmoniously blend with the existing scene
- 2D layout conditioning-control
InstructScene Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
- with semantic graph prior and a layout decoder
GALA3D Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
- LLM generates an initial layout as geometric constraints
Sketch-to-Architecture Generative AI-aided Architectural Design
- generate conceptual floorplans and 3D models from simple sketches

ROOMDREAMER

RoomDreamer Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
- using a cubemap, with depth (object plane to screen plane) and distance maps
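The difference between the two maps mentioned above can be made concrete with a small conversion. This is a sketch of the general geometry, not RoomDreamer's code: it assumes a 90-degree cubemap face with unit focal length and normalized pixel coordinates `u, v` in `[-1, 1]`, so the ray through a pixel has direction `(u, v, 1)`. The function names are illustrative.

```python
import math

def depth_to_distance(depth, u, v):
    """Planar depth (along the face's view axis) -> Euclidean distance along the ray."""
    return depth * math.sqrt(u * u + v * v + 1.0)

def distance_to_depth(distance, u, v):
    """Euclidean distance along the ray -> planar depth along the view axis."""
    return distance / math.sqrt(u * u + v * v + 1.0)

# At the face centre both values agree; towards the corners distance exceeds depth.
centre = depth_to_distance(2.0, 0.0, 0.0)
corner = depth_to_distance(2.0, 1.0, 1.0)
print(centre, corner)
```

Under these assumptions, a flat wall facing the camera has constant depth but a distance that grows towards the edges of the face.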
Holodeck promptable system that can generate diverse, customized, and interactive 3D environments

INTERACTIONS

GEOMETRY INTERACTIONS POSE - POSITION

MOTION SYNTHESIS

motion
AI4Animation Neural State Machine for Character-Scene Interactions
- PAE - PHASE
NIFTY Neural Object Interaction Fields for Guided Human Motion Synthesis
- neural interaction field attached to a specific object
- guided diffusion model trained on generated synthetic data
Story-to-Motion Synthesizing Infinite and Controllable Character Animation from Long Text
- text-to-motion various locations and specific motions
- motion semantic trajectory constraint
CHOIS: Controllable Human-Object Interaction Synthesis
- diffusion with constraints
  1. language informs style and intent
  2. waypoints ground the motion and can be effectively extracted using high-level planning methods
ROAM Robust and Object-aware Motion Generation using Neural Pose Descriptors
- method for human-object interaction synthesis
- given an unseen object, optimises for the closest match in the feature space
TRUMANS Scaling Up Dynamic Human-Scene Interaction Modeling
- 15 hours of human interactions across 100 indoor scenes
- diffusion-based autoregressive model that efficiently generates HSI sequences of any length

LLM

3D-LLM Injecting the 3D World into Large Language Models
- llm with 3d understanding such as spatial relationships, affordances, physics, layout
- can take 3D point clouds and their features as input
Physically Grounded Vision-Language Models for Robotic Manipulation
- planning on tasks that require reasoning about physical object concepts
Motion Mamba Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
- long-sequence and efficient motion

PROPER-ING INSTRUCTIONS

Auto-Instruct Automatic Instruction Generation and Ranking for Black-Box Language Models
- method to automatically improve the quality of LLM instructions
Creative Robot Tool Use with Large Language Models
- takes instructions as input and outputs executable code for controlling robots (tools)

INSIDE COMPUTER

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (website is scene)
- LLM-driven agent to complete instruction tasks on real websites
A Zero-Shot Language Agent for Computer Control with Structured Reflection
- partially observed environment, iteratively learning from its mistakes, structured thought management

GENERATE BLENDER

SceneCraft An LLM Agent for Synthesizing 3D Scene as Blender Code
- models a scene graph as a blueprint, detailing spatial relationships among assets in the scene
- then writes Blender Python scripts based on this graph, translating relationships into numerical constraints for asset layout
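One way to picture "translating relationships into numerical constraints" is to map each scene-graph edge to a predicate over asset coordinates. The sketch below is an illustration only and is not SceneCraft's code: the `Asset` fields, the relation names `on_top_of` and `left_of`, and the tolerances are all assumptions, and it runs in plain Python rather than through Blender's `bpy` API.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    x: float
    y: float
    z: float       # height of the asset's base
    height: float  # vertical extent of the asset

def on_top_of(a, b, tol=1e-3):
    """a rests on b: a's base sits at b's top and they share an xy position."""
    return (abs(a.z - (b.z + b.height)) <= tol
            and abs(a.x - b.x) <= tol
            and abs(a.y - b.y) <= tol)

def left_of(a, b, min_gap=0.1):
    """a is at least min_gap to the left of b along the x axis."""
    return b.x - a.x >= min_gap

RELATIONS = {"on_top_of": on_top_of, "left_of": left_of}

def violated(graph, assets):
    """Return the scene-graph edges whose numerical constraint does not hold."""
    return [(s, r, o) for s, r, o in graph
            if not RELATIONS[r](assets[s], assets[o])]

assets = {
    "table": Asset("table", 0.0, 0.0, 0.0, 0.75),
    "lamp": Asset("lamp", 0.0, 0.0, 0.75, 0.4),
    "chair": Asset("chair", -1.0, 0.0, 0.0, 0.9),
}
graph = [("lamp", "on_top_of", "table"), ("chair", "left_of", "table")]

ok_layout = violated(graph, assets)  # every constraint holds
assets["lamp"].z = 0.0               # put the lamp on the floor instead
bad_layout = violated(graph, assets)
print(ok_layout, bad_layout)
```

A layout step could then adjust asset positions until `violated` returns an empty list, before the coordinates are written into a Blender script.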

UNDERSTANDING

INTERACTIONS
Pixel-Wise Color Constancy via Smoothness Techniques in Multi-Illuminant Scenes
- anti-abnormal-light filter by learning pixel-wise illumination maps caused by multiple light sources

POSE - POSITION

DETECTING HUMAN MOTION SYNTHESIS
PoseDiffusion Solving Pose Estimation via Diffusion-aided Bundle Adjustment
- modelling the distribution of camera poses given input images
- Detector-Free Structure from Motion
Effective Whole-body Pose Estimation with Two-stages Distillation
- instead of openpose preprocessor
DECO Dense Estimation of 3D Human-Scene Contact In The Wild
- recognize 3D contact between body and objects
Pose Anything A Graph-Based Approach for Category-Agnostic Pose Estimation
- people, animals, furniture, faces
Reconstructing Close Human Interactions from Multiple Views
- input multi-view 2D keypoint heatmaps and reconstruct the pose of each individual
Extreme Two-View Geometry From Object Poses with Diffusion Models
- extreme viewpoint changes, with no co-visible regions in the images

MONOCULAR

Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning
- body tracking, only one view needed
D3PRefiner A Diffusion-based Denoise Method for 3D Human Pose Refinement
- refine the output of any existing 3D pose estimator (monocular camera-based 3D pose estimation)
SMPLer monocular 3D human motion capture, Motion Capture from Any Video

GET HEAD POSE

GPAvatar Generalizable and Precise Head Avatar from Image(s)
- recreate the head avatar and precisely control expressions and postures
IMUSIC IMU-based Facial Expression Capture
- facial expression capture using purely IMU signals
- privacy-protecting, hybrid capture against occlusions, detecting movements often invisible

SKELETON

Coverage Axis++ Efficient Inner Point Selection for 3D Shape Skeletonization
- strategy that considers both shape coverage and uniformity to derive skeletal points

GEOMETRY INTERACTIONS

Understanding 3D Object Interaction from a Single Image
Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
- 3D geometry understanding (tokens) with 2D rich semantics
SceneVerse Scaling 3D Vision-Language Learning for Grounded Scene Understanding
- scene and object caption, object referral
Learning Generalizable Feature Fields for Mobile Manipulation
- GeFF (Generalizable Feature Fields)
- for both navigation and manipulation in real time