LocateAnything

LocateAnything is a vision-language model for visual grounding and detection. It helps AI systems locate what a user describes in an image, such as “the search button,” “people wearing red shirts,” or “all text on the document.”

Unlike classic object detectors that usually work with fixed categories, LocateAnything uses natural language instructions and returns structured locations such as bounding boxes or points.

Model Overview

Item	Details
Model name	LocateAnything / LocateAnything-3B
Developer	NVIDIA / LocateAnything Team
Model type	Vision-language grounding and detection model
Main purpose	Locate objects, text, GUI elements, layout regions, and points in images
Core method	Parallel Box Decoding (PBD)
Architecture	MoonViT vision encoder + Qwen2.5-3B-Instruct language decoder + MLP projector
Parameters	3B model variant
Input	Image and natural language prompt
Output	Text sequence with structured bounding boxes or points
Training data	12M unique images, 138M+ language queries, and 785M bounding boxes
Supported tasks	Object detection, dense detection, GUI grounding, OCR localization, document layout analysis, referring expression grounding, pointing
Performance claim	12.7 boxes per second on a single H100 in default hybrid mode
License	Non-commercial research use only
Pricing	No public paid API pricing found

Features

Natural Language Visual Grounding

LocateAnything can find visual targets from plain text prompts. Instead of only detecting predefined labels, it can respond to flexible descriptions like “the traffic light” or “the search button.”

Parallel Box Decoding

The model predicts a complete bounding box as one structured unit instead of generating coordinates token by token. This is designed to improve both decoding speed and geometric consistency.

Dense Object Detection

LocateAnything can detect many objects in cluttered scenes. This makes it useful for images where standard single-object grounding is not enough.

GUI Element Grounding

The model can locate buttons, input fields, icons, and other interface elements. This is especially relevant for computer-use agents and automation workflows.

OCR and Text Localization

LocateAnything can detect and ground text regions in images. It can be used for document understanding, screenshot analysis, and scene text detection.

Document Layout Understanding

The model supports layout grounding, helping identify regions such as tables, blocks, figures, or document sections.

Point-Based Localization

Besides bounding boxes, LocateAnything can return points. This is useful when an application needs a precise click target rather than a full object box.

Hybrid Inference Mode

LocateAnything supports fast, slow, and hybrid generation modes. Hybrid mode uses fast parallel decoding first and falls back to slower autoregressive decoding when more stable output is needed.

How to use

Upload an Image or Video, or pick a Quick Sandbox example below.

upload

Choose a Task Type

upload

Enter Categories in the search bar

Optinally tune Advanced parameters

upload

Click Run Inference.

FAQ

What is LocateAnything used for?

LocateAnything is used to locate visual targets in images from natural language prompts. Common use cases include object detection, GUI automation, OCR localization, document layout analysis, robotics perception, and dataset annotation.

Is LocateAnything an object detection model?

Yes, but it is broader than a traditional object detector. It combines object detection with vision-language grounding, so it can locate objects or regions based on flexible text descriptions.

Who developed LocateAnything?

LocateAnything was released by NVIDIA / the LocateAnything Team as part of the Eagle VLM ecosystem.

Is LocateAnything open source?

The code and model are publicly available, but the model license is non-commercial. It is intended for research and development, not unrestricted commercial use.

Can LocateAnything be used commercially?

The public model card states that the model is released for non-commercial use and that commercial use is not permitted except by NVIDIA and its affiliates. For commercial projects, review the license carefully before using it.

What makes LocateAnything different?

Its main difference is Parallel Box Decoding. Instead of generating box coordinates one token at a time, it predicts boxes or points as structured atomic units, which improves decoding speed and helps preserve box geometry.

What inputs does LocateAnything accept?

The model accepts an image and a text prompt. The prompt can describe an object category, a phrase, a GUI element, text to locate, or a point target.

What does LocateAnything output?

It outputs structured text containing labels and coordinates. The coordinates can represent bounding boxes or points that can be converted into pixel positions.

Does LocateAnything support OCR?

Yes. It supports text localization and scene text detection, making it useful for OCR-related workflows where the system needs to know where text appears in an image.

What hardware is recommended?

Official examples use NVIDIA GPUs, with H100 used for reported benchmark throughput. The model is designed for GPU-accelerated inference, and performance may vary by hardware and configuration.

LocateAnything

#Model Overview

#Features

#Natural Language Visual Grounding

#Parallel Box Decoding

#Dense Object Detection

#GUI Element Grounding

#OCR and Text Localization

#Document Layout Understanding

#Point-Based Localization

#Hybrid Inference Mode

#How to use

#FAQ

#What is LocateAnything used for?

#Is LocateAnything an object detection model?

#Who developed LocateAnything?

#Is LocateAnything open source?

#Can LocateAnything be used commercially?

#What makes LocateAnything different?

#What inputs does LocateAnything accept?

#What does LocateAnything output?

#Does LocateAnything support OCR?

#What hardware is recommended?

Model Overview

Features

Natural Language Visual Grounding

Parallel Box Decoding

Dense Object Detection

GUI Element Grounding

OCR and Text Localization

Document Layout Understanding

Point-Based Localization

Hybrid Inference Mode

How to use

FAQ

What is LocateAnything used for?

Is LocateAnything an object detection model?

Who developed LocateAnything?

Is LocateAnything open source?

Can LocateAnything be used commercially?

What makes LocateAnything different?

What inputs does LocateAnything accept?

What does LocateAnything output?

Does LocateAnything support OCR?

What hardware is recommended?