🎯 Grasp Any Region (GAR)

Region-level Multimodal Understanding for Vision-Language Models

This demo showcases GAR's ability to understand and describe specific regions in images:

  • 🎨 Single Region Understanding: Describe specific areas using points, boxes, or masks
  • 🔍 SAM Integration: Generate masks interactively using the Segment Anything Model
  • 💡 Detailed Descriptions: Get comprehensive descriptions of any region

Built on top of Perception-LM with an RoI-aligned feature replay technique.
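GAR's internal implementation is not reproduced in this demo description, but as a rough illustration of the RoI-aligned pooling that the feature-replay idea builds on, torchvision's generic roi_align op can extract fixed-size features for a region. The feature-map shape and region coordinates below are assumptions for the sketch, not GAR's actual values:

import torch
from torchvision.ops import roi_align

# Hypothetical vision-encoder feature map: batch=1, 256 channels, 32x32 grid.
feature_map = torch.randn(1, 256, 32, 32)

# One region of interest, given as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 28.0]])

# Re-sample ("replay") the region's features into a fixed 7x7 grid with bilinear sampling.
region_feats = roi_align(feature_map, rois, output_size=(7, 7), aligned=True)
print(region_feats.shape)  # torch.Size([1, 256, 7, 7])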

📄 Paper | 💻 GitHub | 🤗 Model

Click points on the image or enter coordinates to segment and describe a region

Example images are provided. Point coordinates are entered in the format x1,y1;x2,y2;...
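A tiny parser like the one below could turn that string into coordinate pairs; the function name and behavior are illustrative assumptions, not part of the demo's actual code:

# Hypothetical helper for the "x1,y1;x2,y2;..." point string (illustration only).
def parse_points(text: str) -> list[tuple[float, float]]:
    points = []
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        x_str, y_str = pair.split(",")
        points.append((float(x_str), float(y_str)))
    return points

print(parse_points("120,80; 240,160"))  # [(120.0, 80.0), (240.0, 160.0)]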

📖 How to Use:

  1. Points → Describe: Click or enter point coordinates, generate a mask, then describe (see the sketch after this list)
  2. Box → Describe: Draw or enter a bounding box, generate a mask, then describe
  3. Mask → Describe: Upload a pre-made mask directly and describe
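The points-to-mask step relies on Meta's segment-anything package; the sketch below shows its standard predictor API. The image path, the SAM ViT-Huge checkpoint file, and the final describe_region() call are assumptions for illustration, since the demo's actual GAR inference code is not shown here:

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM ViT-Huge and wrap it in a predictor (checkpoint path assumed).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click at (x, y); label 1 marks a foreground point.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean HxW mask for the clicked region

# description = describe_region(image, best_mask)  # hypothetical GAR call

For the box workflow, the same predictor accepts box=np.array([x1, y1, x2, y2]) in place of the point inputs.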

🔧 Technical Details:

  • Model: GAR-1B (1 billion parameters)
  • Base: Facebook Perception-LM with RoI-aligned feature replay
  • Segmentation: Segment Anything Model (SAM ViT-Huge)
  • Hardware: Powered by ZeroGPU (NVIDIA H200, 70GB VRAM)
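On ZeroGPU Spaces, a GPU is attached only while a decorated function runs. A minimal sketch of how inference is typically wrapped, assuming a placeholder generate_description function and a 120-second budget (neither is taken from the demo's code):

import spaces

@spaces.GPU(duration=120)  # request up to 120 s of GPU time per call
def generate_description(image, mask):
    # ... run GAR-1B inference on the masked region here (placeholder body) ...
    return "description"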

📚 Citation:

@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Wang, Haochen and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}