LocateAnything: Fast Visual Grounding AI

Detect and label objects in images and videos. LocateAnything is an NVIDIA vision-language model that finds objects, text, GUI elements, and points in images with natural language prompts.

LocateAnything is a vision-language model for visual grounding and detection. It helps AI systems locate what a user describes in an image, such as “the search button,” “people wearing red shirts,” or “all text on the document.”

Unlike classic object detectors that usually work with fixed categories, LocateAnything uses natural language instructions and returns structured locations such as bounding boxes or points.

Model Overview

Item Details
Model name LocateAnything / LocateAnything-3B
Developer NVIDIA / LocateAnything Team
Model type Vision-language grounding and detection model
Main purpose Locate objects, text, GUI elements, layout regions, and points in images
Core method Parallel Box Decoding (PBD)
Architecture MoonViT vision encoder + Qwen2.5-3B-Instruct language decoder + MLP projector
Parameters 3B model variant
Input Image and natural language prompt
Output Text sequence with structured bounding boxes or points
Training data 12M unique images, 138M+ language queries, and 785M bounding boxes
Supported tasks Object detection, dense detection, GUI grounding, OCR localization, document layout analysis, referring expression grounding, pointing
Performance claim 12.7 boxes per second on a single H100 in default hybrid mode
License Non-commercial research use only
Pricing No public paid API pricing found

Features

Natural Language Visual Grounding

LocateAnything can find visual targets from plain text prompts. Instead of only detecting predefined labels, it can respond to flexible descriptions like “the traffic light” or “the search button.”

Parallel Box Decoding

The model predicts a complete bounding box as one structured unit instead of generating coordinates token by token. This is designed to improve both decoding speed and geometric consistency.

Dense Object Detection

LocateAnything can detect many objects in cluttered scenes. This makes it useful for images where standard single-object grounding is not enough.

GUI Element Grounding

The model can locate buttons, input fields, icons, and other interface elements. This is especially relevant for computer-use agents and automation workflows.

OCR and Text Localization

LocateAnything can detect and ground text regions in images. It can be used for document understanding, screenshot analysis, and scene text detection.

Document Layout Understanding

The model supports layout grounding, helping identify regions such as tables, blocks, figures, or document sections.

Point-Based Localization

Besides bounding boxes, LocateAnything can return points. This is useful when an application needs a precise click target rather than a full object box.

Hybrid Inference Mode

LocateAnything supports fast, slow, and hybrid generation modes. Hybrid mode uses fast parallel decoding first and falls back to slower autoregressive decoding when more stable output is needed.

How to use

Upload an Image or Video, or pick a Quick Sandbox example below.

upload

Choose a Task Type

upload

Enter Categories in the search bar

upload

Optinally tune Advanced parameters

upload

Click Run Inference.

FAQ

What is LocateAnything used for?

LocateAnything is used to locate visual targets in images from natural language prompts. Common use cases include object detection, GUI automation, OCR localization, document layout analysis, robotics perception, and dataset annotation.

Is LocateAnything an object detection model?

Yes, but it is broader than a traditional object detector. It combines object detection with vision-language grounding, so it can locate objects or regions based on flexible text descriptions.

Who developed LocateAnything?

LocateAnything was released by NVIDIA / the LocateAnything Team as part of the Eagle VLM ecosystem.

Is LocateAnything open source?

The code and model are publicly available, but the model license is non-commercial. It is intended for research and development, not unrestricted commercial use.

Can LocateAnything be used commercially?

The public model card states that the model is released for non-commercial use and that commercial use is not permitted except by NVIDIA and its affiliates. For commercial projects, review the license carefully before using it.

What makes LocateAnything different?

Its main difference is Parallel Box Decoding. Instead of generating box coordinates one token at a time, it predicts boxes or points as structured atomic units, which improves decoding speed and helps preserve box geometry.

What inputs does LocateAnything accept?

The model accepts an image and a text prompt. The prompt can describe an object category, a phrase, a GUI element, text to locate, or a point target.

What does LocateAnything output?

It outputs structured text containing labels and coordinates. The coordinates can represent bounding boxes or points that can be converted into pixel positions.

Does LocateAnything support OCR?

Yes. It supports text localization and scene text detection, making it useful for OCR-related workflows where the system needs to know where text appears in an image.

Official examples use NVIDIA GPUs, with H100 used for reported benchmark throughput. The model is designed for GPU-accelerated inference, and performance may vary by hardware and configuration.