When a new AI model tops a benchmark, the narrative is predictable: a breakthrough in architecture, a novel training technique, a leap in compute. The press release credits the researchers. The technical report details the engineering. The discussion focuses on algorithms and scaling laws. What almost never gets mentioned is the human labor — the thousands of people who spent months producing the data that made the model work. The annotators who labeled training examples. The domain experts who designed evaluation tasks. The human raters who provided preference signals for RLHF. The red teamers who stress-tested outputs for safety. Without their work, the architecture is just code, the compute is just electricity, and the model learns nothing useful.
The numbers are staggering. Scale AI generated over $1.4 billion in revenue in 2024 — revenue almost entirely derived from human labor performing data tasks for AI labs. Surge, one of the current market leaders, is believed to be approaching $1 billion in annual recurring revenue. Mercor, Handshake, and dozens of smaller firms collectively employ tens of thousands of contractors across the globe. These contractors perform a remarkable range of work. At the basic end, they label images, classify text, and tag entities. At the high end, they write complex grading rubrics, design RL environments, author expert-level solutions to domain-specific problems, evaluate model outputs against professional standards, and provide the nuanced preference judgments that shape AI behavior through RLHF.
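To make the high end of that work concrete, here is a minimal sketch of the kind of record a single pairwise preference judgment might produce for RLHF. The schema and field names are assumptions for illustration, not any lab's or vendor's actual format.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    """One pairwise comparison produced by a human rater for RLHF.

    Field names are illustrative, not any lab's actual schema.
    """
    prompt: str        # the instruction shown to the model
    response_a: str    # first candidate completion
    response_b: str    # second candidate completion
    preferred: str     # "a" or "b": the rater's judgment
    rationale: str     # free-text explanation, useful for auditing
    rater_id: str      # anonymized rater identifier

# Reward models are commonly trained on many such records: the model
# learns to score the preferred response above the rejected one.
record = PreferenceRecord(
    prompt="Summarize this contract clause in plain English.",
    response_a="The clause means the tenant pays for minor repairs.",
    response_b="Pursuant to the aforementioned obligations herein...",
    preferred="a",
    rationale="Response A is plainer and more accurate.",
    rater_id="rater-0042",
)
```

Every field except the prompt represents human judgment, and the rationale in particular is skilled writing work that quality reviewers depend on.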
The work has diversified well beyond traditional annotation. As labs scale reinforcement learning across new domains, demand has expanded into areas like photography, music, design, healthcare, finance, and law. The people providing preference signals and expert evaluations come from increasingly specialized professional backgrounds. Coding remains the highest-demand domain, but the breadth of expertise required is growing rapidly.
AI labs treat their data operations as some of their most sensitive intellectual property. The specific tasks, annotation guidelines, evaluation criteria, and vendor relationships they use directly reflect their training strategy. Discussing the human workforce means discussing the data strategy, which no lab wants to do publicly. This secrecy extends to the vendors themselves — most operate under exclusive contracts with strict confidentiality requirements.
AI research is published in papers that emphasize algorithmic innovations, not data production processes. The incentive structures of academic publishing and corporate communications both favor stories about clever engineering over stories about careful human work. A paper describing a new RL algorithm gets cited. A paper describing how 500 contractors spent three months labeling training data does not.
Much of the basic annotation work is performed by workers in lower-income countries at rates that would be unacceptable in Silicon Valley. Highlighting this labor raises questions about fairness, compensation, and the ethics of building multi-billion-dollar AI systems on low-wage work. The industry has generally chosen to avoid this conversation rather than engage with it.
When the narrative says progress comes from better algorithms and more compute, budgets flow to researchers and GPU clusters, not annotation operations. Data quality becomes an afterthought. Teams discover too late that models underperform because training data was produced by undertrained annotators following ambiguous guidelines. The assumption that human oversight will eventually become unnecessary reinforces this chronic underinvestment.
When annotation is treated as an invisible commodity rather than a skilled professional service, the pressure is always to reduce cost and increase speed. This leads to undertrained generalists producing high-volume, low-quality data. The errors are systematic rather than random, teaching models consistent mistakes that are difficult to diagnose and expensive to correct.
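Systematic error has a standard diagnostic: inter-annotator agreement. Below is a minimal sketch that computes Cohen's kappa, a common agreement statistic, on a pair of label sets; low kappa on a pilot batch usually points back at ambiguous guidelines rather than careless annotators. The labels here are made up for the example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators beyond chance.

    Values near 1.0 indicate strong agreement; values near 0 mean the
    annotators agree no more often than chance would predict.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently pick
    # the same label, given their marginal label frequencies.
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    return (observed - expected) / (1 - expected)

# Illustrative: two annotators labeling the same 8 items.
a = ["spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"]
b = ["spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.50
```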
Invisible labor is labor without leverage. When the contribution of annotators is not recognized, they have less ability to negotiate fair compensation, reasonable working conditions, and professional development opportunities. This is not just an ethical concern — unhappy, underpaid workers produce lower-quality data.
To make the invisible visible, it helps to trace the human labor behind specific achievements.
When OpenAI’s GPT-5 series showed dramatic improvements in real-world utility — creating slides, writing reports, filing tax returns — those improvements were directly downstream of human data. Contractors designed the evaluation tasks. Domain experts with an average of 14 years of experience authored expected solutions. Human raters graded model outputs against professional standards using the GDPval evaluation framework covering over 1,000 tasks across 44 occupations.
When models improved at coding, it was because software engineers constructed coding environments from GitHub repositories, designed test cases from real pull requests, authored solution patches, and graded model-generated code. DeepSeek used 24,667 coding tasks for training V3.2, all requiring human construction and validation.
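At its simplest, each such task pairs a human-written problem statement with a human-authored reference solution and executable tests. The sketch below shows one plausible representation and a pass-rate grader; the structure and names are assumptions for illustration, not DeepSeek's or any lab's actual pipeline, and real pipelines sandbox execution rather than calling exec directly.

```python
from dataclasses import dataclass, field

@dataclass
class CodingTask:
    """A single graded coding task, loosely modeled on tasks mined
    from real repositories. Structure is illustrative only."""
    task_id: str
    problem_statement: str   # written or curated by a human expert
    reference_patch: str     # human-authored solution, for calibration
    tests: list = field(default_factory=list)  # callables: namespace -> bool

def grade(task: CodingTask, candidate_source: str) -> float:
    """Run a candidate solution and return the fraction of tests passed.

    Real pipelines sandbox this step; bare exec() is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # never do this on untrusted code
    except Exception:
        return 0.0
    passed = sum(1 for test in task.tests if test(namespace))
    return passed / len(task.tests)

# Illustrative task: implement a clamp() helper.
task = CodingTask(
    task_id="demo-001",
    problem_statement="Write clamp(x, lo, hi) that bounds x to [lo, hi].",
    reference_patch="def clamp(x, lo, hi): return max(lo, min(x, hi))",
    tests=[
        lambda ns: ns["clamp"](5, 0, 3) == 3,
        lambda ns: ns["clamp"](-1, 0, 3) == 0,
        lambda ns: ns["clamp"](2, 0, 3) == 2,
    ],
)
print(grade(task, "def clamp(x, lo, hi): return max(lo, min(x, hi))"))  # 1.0
```

Every element of that record except the grading arithmetic is human work: someone chose the problem, wrote the solution, and designed tests that actually discriminate correct code from plausible-looking failures.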
When models got better at using software tools, it was because developers built UI gyms — replicated websites costing approximately $20,000 each — that simulated platforms like DoorDash, Uber Eats, Slack, and Salesforce. OpenAI purchased hundreds of these environments for ChatGPT Agent training.
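A UI gym is, in spirit, a reinforcement-learning environment wrapped around a replicated web app. The toy sketch below shows the reset/step loop such an environment is built around; the class, actions, and reward scheme are invented for illustration and are not OpenAI's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """What the agent sees at each step of a simulated web app."""
    dom_snapshot: str      # serialized page structure
    screenshot_path: str   # rendered view of the current page

class FoodOrderingGym:
    """A toy stand-in for a replicated food-delivery site.

    Real UI gyms replicate a full web application; this sketch only
    models the reset/step loop that RL training is built around.
    """

    def __init__(self):
        self.cart: list[str] = []

    def reset(self) -> Observation:
        self.cart = []
        return Observation("<html>menu page</html>", "step_0.png")

    def step(self, action: str) -> tuple[Observation, float, bool]:
        """Apply a UI action and return (observation, reward, done).

        Reward is 1.0 only on a successful checkout, which is what
        makes the environment usable as an RL training signal."""
        if action.startswith("click:add_"):
            self.cart.append(action.removeprefix("click:add_"))
            return Observation("<html>cart updated</html>", "step_n.png"), 0.0, False
        if action == "click:checkout" and self.cart:
            return Observation("<html>order placed</html>", "done.png"), 1.0, True
        return Observation("<html>no-op</html>", "step_n.png"), 0.0, False

env = FoodOrderingGym()
obs = env.reset()
for action in ["click:add_burger", "click:checkout"]:
    obs, reward, done = env.step(action)
print(reward, done)  # 1.0 True
```

The $20,000 price tag starts to make sense from this angle: replicating a real platform's pages, states, and failure modes faithfully enough to train against is painstaking human engineering, not a script.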
The labs winning the current generation of competition understand the strategic importance of human data, even if they rarely discuss it publicly. Anthropic’s aggressive investment in a diverse vendor ecosystem. OpenAI’s decision to build a large in-house data team. Google’s growing efforts to leverage internal user behavior data. These are all acknowledgments that human data is a strategic asset. Providers like Careerflow and the leading annotation companies have built their businesses around the same recognition: that human data work deserves investment, structure, professional standards, and respect.
When companies recognize the importance of their human data operations, they invest in them properly. They hire better annotators. They build better training programs. They develop more sophisticated quality control. They create working conditions that attract and retain talent. And they produce better training data, which produces better models.
Every AI breakthrough is a collaboration between human and machine. The algorithms are essential. The compute is essential. But the human labor that produces the training data — the annotations, the evaluations, the preference signals, the expert demonstrations — is equally essential.
The companies that invest in making their human data operations visible, valued, and well-resourced will build better AI. The ones that keep this labor invisible will keep wondering why their models underperform. The choice is straightforward. The industry just needs to make it.