Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Source: arXiv
Original Author: Jing Tan et al.

Talk2Move is a reinforcement learning framework that applies spatial transformations to objects in a scene based on text instructions. It targets a gap in existing methods: geometric adjustments such as rotation and resizing, learned without extensive paired data. Using Group Relative Policy Optimization and object-centric spatial rewards, Talk2Move improves learning efficiency and the accuracy of object transformations. Experiments show it outperforms current text-guided editing techniques while producing interpretable and coherent spatial manipulations.

Talk2Move: Advancing Object-Level Geometric Transformation Through Reinforcement Learning

A new framework, Talk2Move, uses reinforcement learning to carry out text-instructed spatial transformations of objects within scenes. It addresses a limitation of existing multimodal generation systems, which struggle with object-level geometric adjustments such as translating, rotating, or resizing.

Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations. The framework also integrates a spatial reward model that aligns each geometric transformation with its linguistic description.
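
The group-relative idea behind GRPO can be illustrated with a minimal sketch. It assumes the standard GRPO formulation, in which each rollout's reward is normalized against the mean and standard deviation of its own group. The function name and the sample reward values are hypothetical and are not taken from the Talk2Move paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Assumption: the standard GRPO normalization (reward minus group mean,
    divided by group standard deviation). `eps` avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical spatial rewards for four rollouts of the same instruction.
rollout_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rollout_rewards)
for reward, advantage in zip(rollout_rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Rollouts that score above the group average receive a positive advantage and are reinforced; those below average are discouraged. Because the baseline comes from the group itself, no separate value network is needed.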

Key Features of Talk2Move

  • Off-Policy Step Evaluation: Enhances learning efficiency by focusing on informative stages of transformation.
  • Active Step Sampling: Concentrates training on the most informative transformation steps, further improving learning efficiency.
  • Object-Centric Spatial Rewards: Directly assess behaviors such as displacement, rotation, and scaling.
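
To make the object-centric rewards listed above more concrete, here is a rough sketch of how displacement, rotation, and scaling could each be scored. Everything in it is an illustrative assumption: the `ObjectPose` fields, the exponential scoring, and the tolerance values are hypothetical, and the paper's actual reward definitions may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class ObjectPose:
    """Hypothetical object state: center position, orientation, relative size."""
    cx: float
    cy: float
    angle_deg: float
    scale: float


def translation_reward(before, after, target_dx, target_dy, tol=1.0):
    # Distance between the achieved and the requested displacement.
    dx = after.cx - before.cx
    dy = after.cy - before.cy
    err = math.hypot(dx - target_dx, dy - target_dy)
    return math.exp(-err / tol)


def rotation_reward(before, after, target_deg, tol=15.0):
    # Angular error wrapped into [-180, 180) so 350 degrees counts as -10.
    turned = after.angle_deg - before.angle_deg
    err = abs((turned - target_deg + 180.0) % 360.0 - 180.0)
    return math.exp(-err / tol)


def scale_reward(before, after, target_ratio, tol=0.25):
    # Log-ratio error treats "twice too big" and "twice too small" alike.
    ratio = after.scale / before.scale
    err = abs(math.log(ratio / target_ratio))
    return math.exp(-err / tol)


before = ObjectPose(cx=0.0, cy=0.0, angle_deg=0.0, scale=1.0)
after = ObjectPose(cx=9.0, cy=0.0, angle_deg=80.0, scale=2.0)
print(translation_reward(before, after, target_dx=10.0, target_dy=0.0))
print(rotation_reward(before, after, target_deg=90.0))
print(scale_reward(before, after, target_ratio=2.0))
```

Each function returns a value in (0, 1], reaching 1.0 only when the object's measured change matches the instruction exactly. Scoring each behavior separately is what makes such rewards easy to interpret.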

Experimental results indicate that Talk2Move transforms objects more precisely and consistently than existing text-guided editing methods, improving both spatial accuracy and scene coherence.

Original Source: https://arxiv.org/abs/2601.02356v1

All rights and credit belong to the original publisher.