The human data market has undergone a dramatic transformation. What was once a landscape dominated by a single large contractor has fragmented into a diverse ecosystem of specialized providers, talent marketplaces, and domain-specific annotation firms. For CTOs and Heads of AI evaluating their data strategy, understanding who the key players are, what differentiates them, and how to select the right partner has become a critical business decision. This guide breaks down the current market, examines the major categories of providers, and offers a practical framework for choosing the right fit.
Scale AI built a dominant position by becoming the default data contractor for nearly every major AI lab. The company generated over $1.4 billion in revenue in 2024, serving OpenAI, Anthropic, Google, and others simultaneously. Scale’s breadth made it indispensable but created a concentration risk the industry largely ignored.
When Meta acquired Scale AI, that risk materialized. AI labs were uncomfortable having their most sensitive training data — the tasks, environments, and preference signals that define their competitive edge — flowing through a company owned by a direct competitor. Contracts were wound down. Some of Scale’s team joined Meta’s Superintelligence group, primarily in leadership and safety roles. Scale retains some contracts but no longer serves the labs at anything like its previous scale.
The result has been productive fragmentation. Labs now work with more vendors, increasing management overhead but also reducing single-vendor dependency. The ecosystem has become more specialized: rather than one company attempting to cover every domain, different providers have developed genuine expertise in specific areas.
Full-service providers handle the entire data production lifecycle: scoping data needs and edge cases, sourcing and vetting annotators, managing production workflows, running multi-layered quality control, and delivering model-ready data. They are best suited for teams that want to outsource the operational complexity of human data.
Careerflow operates in this category, offering fully managed human data services spanning instruction-tuned labeling, RLHF and DPO pipelines, red-teaming and quality testing, and VLM and LLM pipeline support. Their approach combines access to over one million skilled experts with enterprise-grade QC including multi-layer validation, bias checking, and project tracking. Full-service providers are particularly valuable for enterprise teams running multiple concurrent data projects without extensive internal annotation infrastructure.
Firms like Mercor, Surge, and Handshake operate as connectors between AI labs and domain-specific contractors. They source, vet, and supply expert annotators — PhDs, licensed professionals, senior engineers — who then embed into the lab’s own data pipeline. Surge is the most established of the three, with revenue that industry sources estimate is approaching $1 billion in ARR. These marketplaces are most valuable for teams that have strong internal data operations but need specialized talent they cannot source independently.
A growing segment focuses on single verticals: medical annotation, legal document labeling, financial data processing, geospatial imagery, or code review. These companies are typically smaller but offer deeper expertise in their domain than generalist providers. They work best as supplementary vendors alongside a primary full-service partner, handling domain-specific tasks that require specialized knowledge.
RL environment providers are a newer and rapidly growing category. Companies like Habitat, DeepTune, Fleet, Vmax, Turing, and Mechanize build simulated environments — cloned websites, software platforms, coding sandboxes — that AI labs use for reinforcement learning. These “UI gyms” can cost roughly $20,000 per website, and OpenAI has purchased hundreds for ChatGPT Agent training. Other environments simulate platforms like Slack, Salesforce, AWS terminals, and Gmail. Anthropic alone works with more than a dozen RL environment companies. For labs scaling RL compute, these providers have become as strategically important as traditional annotation vendors.
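To make the “UI gym” idea concrete, here is a minimal sketch of what such an environment can look like from the model’s side: a gym-style reset/step loop over a simulated page with a sparse reward for completing the task. The class, task, and action format below are invented for illustration and do not reflect any particular vendor’s product.

```python
from dataclasses import dataclass, field

@dataclass
class BrowserState:
    """Snapshot of the simulated page the agent observes at each step."""
    url: str
    dom_text: str
    form_fields: dict = field(default_factory=dict)

class CheckoutGym:
    """Toy 'UI gym': the agent must fill a checkout form and submit it.

    The reset/step interface and sparse task reward mirror the common
    gym convention; the task itself is hypothetical.
    """

    def reset(self) -> BrowserState:
        self.submitted = False
        self.state = BrowserState(
            url="https://shop.example/checkout",
            dom_text="Checkout: enter email and card, then submit.",
        )
        return self.state

    def step(self, action: dict):
        # Actions are plain dicts, e.g. {"type": "fill", "field": "email", "value": "a@b.c"}
        if action.get("type") == "fill":
            self.state.form_fields[action["field"]] = action["value"]
        elif action.get("type") == "click" and action.get("target") == "submit":
            self.submitted = {"email", "card"} <= self.state.form_fields.keys()
        reward = 1.0 if self.submitted else 0.0   # sparse reward on task completion
        done = self.submitted
        return self.state, reward, done
```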
Workforce quality is the first criterion to evaluate. The gap between a provider that recruits from general crowdsourcing platforms and one that sources through professional networks and academic channels is substantial. This directly impacts cost per unit of model improvement, which matters far more than cost per label. Teams that understand why domain knowledge matters more than speed in annotation will prioritize workforce quality in vendor evaluation.
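As a rough illustration of why sticker price misleads, the sketch below computes the cost of a label that actually survives review, under assumed rejection and rework rates (all numbers are hypothetical). The gap widens further once you measure cost per unit of model improvement rather than per label.

```python
def effective_cost_per_usable_label(cost_per_label: float,
                                    rejection_rate: float,
                                    rework_multiplier: float = 1.0) -> float:
    """Cost per label that survives QC, including the cost of redoing rejected work."""
    usable_fraction = 1.0 - rejection_rate
    total_cost = cost_per_label * (1.0 + rejection_rate * rework_multiplier)
    return total_cost / usable_fraction

# Illustrative numbers only: a $1 crowdsourced label with 30% rejection costs
# far more per usable label than it first appears, while a $5 expert label
# with 5% rejection stays close to its sticker price.
print(effective_cost_per_usable_label(1.00, 0.30))   # ~1.86
print(effective_cost_per_usable_label(5.00, 0.05))   # ~5.53
```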
Multi-layered review processes, automated consistency monitoring, inter-annotator agreement tracking, and systematic bias auditing are table stakes for enterprise work. But sophistication varies enormously. Some providers run QC as manual spot-checks on a small sample. Others have built automated pipelines that flag anomalies in real time and feed issues back into annotator calibration. The quality of a provider’s QC infrastructure is the single best predictor of long-term data quality.
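For example, inter-annotator agreement tracking often reduces to a chance-corrected statistic such as Cohen’s kappa computed per batch. The sketch below shows one minimal way to flag low-agreement batches for calibration review; the labels and the 0.6 threshold are chosen purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Flag batches whose agreement drops below a calibration threshold.
batch = {
    "annotator_1": ["helpful", "harmful", "helpful", "helpful", "harmful"],
    "annotator_2": ["helpful", "harmful", "harmful", "helpful", "helpful"],
}
kappa = cohens_kappa(batch["annotator_1"], batch["annotator_2"])
if kappa < 0.6:   # threshold chosen for illustration
    print(f"Low agreement (kappa={kappa:.2f}); route batch to calibration review")
```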
Some teams need a 50-annotator pilot for three months. Others need 2,000 annotators ramped in two weeks. Scaling up and down without sacrificing quality is a genuine differentiator. Providers that can demonstrate consistent quality metrics across different scales and timelines have a significant edge.
For teams working on frontier models, training data sensitivity is extreme. Enterprise-grade security, clear data handling policies, anonymization capabilities, and regulatory compliance are structural requirements. This is not a checkbox exercise — it eliminates many smaller or newer providers from consideration for enterprise work.
Anthropic has taken a deliberate approach to vendor diversification. The company works with more than a dozen RL environment companies as contractors, often serving as a first customer for newer vendors. Anthropic appears to favor a broad ecosystem that commoditizes certain types of environments, driving down costs while attracting investor capital into the vendor ecosystem. The company has also been ramping up in domains beyond code, including computer use and biology.
OpenAI has moved in the opposite direction. While still contracting with firms like Surge, Mercor, and Handshake, the company is building a significant in-house human data team. At OpenAI’s scale of data consumption, the margins paid to external vendors become substantial enough to justify internal investment. OpenAI has also built an internal platform called Feather for processing contractor data across its various programs.
Google DeepMind’s procurement is decentralized, driven by researchers from different teams. Google spent a relatively small proportion of compute on post-training for early Gemini versions but has been scaling this up. Google is uniquely positioned because it owns platforms like Sheets, Slides, Docs, and Gmail, giving it access to real user behavior data that can inform model training — though leveraging this data requires cross-organizational coordination.
For enterprise AI teams, the typical approach is one or two primary providers supplemented by domain specialists. A structured evaluation framework — like the one in our guide on criteria for evaluating human data partners — ensures decisions are based on substance rather than sales presentations.
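One lightweight way to structure such an evaluation is a weighted scorecard over the criteria discussed above (workforce quality, QC infrastructure, scalability, security). The weights and scores below are placeholders, not a recommendation.

```python
# Illustrative weights only; calibrate them against your own priorities.
WEIGHTS = {
    "workforce_quality": 0.35,
    "qc_infrastructure": 0.30,
    "scalability": 0.20,
    "security_compliance": 0.15,
}

def score_vendor(scores: dict) -> float:
    """Weighted sum of 1-5 criterion scores from pilot results and due diligence."""
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)

candidates = {
    "vendor_a": {"workforce_quality": 4, "qc_infrastructure": 5, "scalability": 3, "security_compliance": 5},
    "vendor_b": {"workforce_quality": 5, "qc_infrastructure": 3, "scalability": 4, "security_compliance": 4},
}
for name, scores in sorted(candidates.items(), key=lambda kv: -score_vendor(kv[1])):
    print(f"{name}: {score_vendor(scores):.2f}")
```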
There is no single best annotation provider. The right choice depends on domain, volume, timeline, expertise requirements, and data sensitivity. Several principles apply across contexts.
The human data market is consolidating around expertise, not volume. Providers are moving upmarket, investing in domain knowledge and enterprise infrastructure rather than competing on price. The boundary between annotation companies and RL environment companies is blurring as labs demand integrated solutions. And the role of human data in the AI pipeline is expanding from basic labeling into evaluation, red-teaming, preference optimization, and even scientific research.
For AI leaders making procurement decisions, the core insight is this: annotation is not a commodity. The providers you choose and the data they produce will directly determine the quality of your models. Teams that recognize the competitive advantage of human data quality and invest in finding the right partners will build better AI. Those that optimize for the lowest cost per label will discover, too late, that cheap data is the most expensive kind.