🎯 Grasp Any Region (GAR)

Region-level Multimodal Understanding for Vision-Language Models

This demo showcases GAR's ability to understand and describe specific regions in images:

  • 🎨 Single Region Understanding: Describe specific areas using points, boxes, or masks
  • 🔍 SAM Integration: Generate masks interactively using the Segment Anything Model
  • 💡 Detailed Descriptions: Get comprehensive descriptions of any region

Built on top of Perception-LM with an RoI-aligned feature replay technique.
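GAR's internal implementation is not reproduced in this demo description, but as a rough illustration of the RoI-aligned pooling that the feature-replay idea builds on, torchvision's generic roi_align op can extract fixed-size features for a region. The feature-map shape and region coordinates below are assumptions for the sketch, not GAR's actual values:

import torch
from torchvision.ops import roi_align

# Hypothetical vision-encoder feature map: batch=1, 256 channels, 32x32 grid.
feature_map = torch.randn(1, 256, 32, 32)

# One region of interest, given as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 28.0]])

# Re-sample ("replay") the region's features into a fixed 7x7 grid with bilinear sampling.
region_feats = roi_align(feature_map, rois, output_size=(7, 7), aligned=True)
print(region_feats.shape)  # torch.Size([1, 256, 7, 7])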

📄 Paper | 💻 GitHub | 🤗 Model

Click points on the image or enter coordinates to segment and describe a region

Example images are provided. Point coordinates are entered in the format x1,y1;x2,y2;...
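A tiny parser like the one below could turn that string into coordinate pairs; the function name and behavior are illustrative assumptions, not part of the demo's actual code:

# Hypothetical helper for the "x1,y1;x2,y2;..." point string (illustration only).
def parse_points(text: str) -> list[tuple[float, float]]:
    points = []
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        x_str, y_str = pair.split(",")
        points.append((float(x_str), float(y_str)))
    return points

print(parse_points("120,80; 240,160"))  # [(120.0, 80.0), (240.0, 160.0)]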

📖 How to Use:

  1. Points → Describe: Click or enter point coordinates, generate a mask, then describe (see the sketch after this list)
  2. Box → Describe: Draw or enter a bounding box, generate a mask, then describe
  3. Mask → Describe: Upload a pre-made mask directly and describe
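The points-to-mask step relies on Meta's segment-anything package; the sketch below shows its standard predictor API. The image path, the SAM ViT-Huge checkpoint file, and the final describe_region() call are assumptions for illustration, since the demo's actual GAR inference code is not shown here:

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM ViT-Huge and wrap it in a predictor (checkpoint path assumed).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click at (x, y); label 1 marks a foreground point.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean HxW mask for the clicked region

# description = describe_region(image, best_mask)  # hypothetical GAR call

For the box workflow, the same predictor accepts box=np.array([x1, y1, x2, y2]) in place of the point inputs.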

🔧 Technical Details:

  • Model: GAR-1B (1 billion parameters)
  • Base: Facebook Perception-LM with RoI-aligned feature replay
  • Segmentation: Segment Anything Model (SAM ViT-Huge)
  • Hardware: Powered by ZeroGPU (NVIDIA H200, 70GB VRAM)
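On ZeroGPU Spaces, a GPU is attached only while a decorated function runs. A minimal sketch of how inference is typically wrapped, assuming a placeholder generate_description function and a 120-second budget (neither is taken from the demo's code):

import spaces

@spaces.GPU(duration=120)  # request up to 120 s of GPU time per call
def generate_description(image, mask):
    # ... run GAR-1B inference on the masked region here (placeholder body) ...
    return "description"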

📚 Citation:

@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Wang, Haochen and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}