🎯 Grasp Any Region (GAR)
Region-level Multimodal Understanding for Vision-Language Models
This demo showcases GAR's ability to understand and describe specific regions in images:
- 🎨 Single Region Understanding: Describe specific areas using points, boxes, or masks
- 🔍 SAM Integration: Generate masks interactively using the Segment Anything Model
- 💡 Detailed Descriptions: Get comprehensive descriptions of any region
Built on top of Perception-LM with an RoI-aligned feature-replay technique; the snippet below illustrates the generic RoI-align operation behind that idea.
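A minimal sketch of generic RoI-aligned feature pooling using torchvision's `roi_align`, assuming a made-up feature map and box; this is not GAR's actual feature-replay implementation:

```python
# Generic RoI-aligned feature pooling (torchvision op); all values are
# made up for illustration. NOT GAR's actual feature-replay code.
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 256, 64, 64)              # (N, C, H, W) vision-encoder feature map
rois = torch.tensor([[0., 10., 12., 40., 44.]])  # (batch_index, x1, y1, x2, y2) in feature coords
region_feats = roi_align(feats, rois, output_size=(7, 7), aligned=True)
print(region_feats.shape)                        # torch.Size([1, 256, 7, 7])
```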
Click points on the image or enter point coordinates to segment and describe a region (a mask-generation sketch follows the example table below).
Example Images
| Input Image | Points (format: x1,y1;x2,y2;...) |
|---|---|
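To reproduce the point-to-mask step outside the demo, a minimal sketch with the official `segment_anything` package looks like the following; the checkpoint path, image path, and point coordinates are placeholder assumptions:

```python
# Point-prompted mask generation with SAM (ViT-H). The checkpoint path,
# image path, and point coordinates are placeholders.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("example.jpg").convert("RGB")))

points = np.array([[320, 240], [400, 260]])  # (x, y) pixel coordinates
labels = np.array([1, 1])                    # 1 = foreground point
masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=labels, multimask_output=True
)
mask = masks[np.argmax(scores)]              # highest-scoring boolean (H, W) mask
```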
Draw a bounding box or enter box coordinates to segment and describe a region (see the sketch after the example table).
Example Images
| Input Image | Bounding Box (format: x1,y1,x2,y2) |
|---|---|
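The box tab maps onto SAM's box prompt. A minimal self-contained sketch, again with placeholder paths and coordinates:

```python
# Box-prompted mask generation with SAM; the box is in x1, y1, x2, y2
# (XYXY) pixel coordinates. All paths and values are placeholders.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("example.jpg").convert("RGB")))

masks, _, _ = predictor.predict(
    box=np.array([100, 80, 420, 360]),  # x1, y1, x2, y2
    multimask_output=False,
)
mask = masks[0]  # boolean (H, W) mask
```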
Upload a pre-made mask to describe a region directly (see the mask-loading sketch below).
Example Images
| Input Image | Upload Mask |
|---|---|
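For the mask tab, the uploaded image only needs to be binarized into a boolean array at the image's resolution. A minimal sketch, assuming a single-channel PNG with a placeholder file name:

```python
# Load an uploaded mask image and binarize it to a boolean (H, W) array.
# "region_mask.png" is a placeholder; any single-channel mask image works.
import numpy as np
from PIL import Image

mask = np.array(Image.open("region_mask.png").convert("L")) > 127
print(mask.shape, mask.dtype)  # (H, W) bool
```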
📖 How to Use:
- Points → Describe: Click or enter point coordinates, generate a mask, then describe
- Box → Describe: Draw or enter a bounding box, generate a mask, then describe
- Mask → Describe: Upload a pre-made mask directly and describe (an end-to-end sketch follows this list)
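All three workflows converge on the same final step: hand the image plus a binary region mask to GAR. The sketch below shows only the shape of that pipeline; `load_gar` and `describe_region` are hypothetical placeholder names, not the demo's actual API:

```python
# Hypothetical end-to-end flow. `load_gar` and `describe_region` are
# placeholder names standing in for the demo's actual GAR inference code.
import numpy as np
from PIL import Image

image = np.array(Image.open("example.jpg").convert("RGB"))
mask = np.array(Image.open("region_mask.png").convert("L")) > 127  # from SAM or upload

model = load_gar("GAR-1B")                     # placeholder loader
caption = describe_region(model, image, mask)  # placeholder inference call
print(caption)
```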
🔧 Technical Details:
- Model: GAR-1B (1 billion parameters)
- Base: Facebook Perception-LM with RoI-aligned feature replay
- Segmentation: Segment Anything Model (SAM ViT-Huge)
- Hardware: Powered by ZeroGPU (NVIDIA H200, 70GB VRAM)
📚 Citation:
```bibtex
@article{wang2025grasp,
  title   = {Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author  = {Wang, Haochen and others},
  journal = {arXiv preprint arXiv:2510.18876},
  year    = {2025}
}
```