watching the fast ascent of a bunch of data labeling startups, lots of people have been asking about what the moat could be for these types of companies. one interesting way to view these companies is as defense primes - the lockheed martins of the ai world. they have concentrated revenue from the largest customers they can find, own a core asset, and their main moat is their relationship with their customers (knowing exactly what products they want before others). keep in mind that i’m not running any of these companies, so my perspective should be taken with a grain of salt
companies in the space
here are some examples of data-labeling companies in this category:
- datacurve: founded by serena, a force of nature who dropped out of waterloo. zero to $9m run rate in 13 months. at afore, we are proud to have backed serena’s first round. focused on selling labeled code data to foundational model companies
- surge: bootstrapped to $1b run rate and surpassed scale ai around 2024. focused on selling premium-quality data as the differentiator - the founder edwin (ex-google / meta) built tools to make sure their data quality is very high
- mercor: founded by thiel fellows, started as an ai recruiting platform and added data labeling on top. raised a seed from general catalyst in september 2023, a $30m series a from benchmark, a $100m series b from felicis, and is closing a series c at a $10b valuation
- turing: also started out as a platform to hire remote talent, recently added data labeling on top and went from $100m to $300m in revenue last year.
- micro1: like mercor and turing, micro1 also started as a platform to hire remote engineering talent, which there has always been high demand for. realized these people could be great for data labeling - revenue grew from $10m early this year to over $50m by mid-2025 (projected to exceed $100m by september)
- scale: pretty well-known, generated $870m in 2024, 165% cagr over the past five years, has raised $1.6b to date (series f in may 2024 alone brought in $1b), roughly valued at $13.8b before meta invested ~$15b for a 49% stake and brought alex wang to meta
- invisible: the company has only raised $23m in venture funding, revenue grew from $3m in 2020 to ~$60m in 2023 to ~$134m in 2024 ($15m in ebitda). in an interesting move, the founder francis borrowed $20m in debt to buy out early investors
- handshake: started as a career network for college students, recently started a new business matching these people on its platform to ai labs for hiring / data labeling, reached a $50m run rate in its first four months / will blow past $100m run rate in its first year
- and many more competitors - the space is frothy, and when you do something that starts to work, tons of copycats spring up (as they should)
overview
let’s start with the similarities:
- concentrated revenue from massive customers
- moat is brand + customer relationships
- base assets that you build on
- other parallels
and important differences
concentrated revenue from massive customers
revenue concentration is not ideal, and many tech investors cite “customer concentration” as a reason for not investing in companies. but i think it’s not as bad as people think. in the defense industry, your main customer is the united states government, one of the biggest customers in the world (~$2.4t in department of defense contracts went to private contractors from 2020-2024), and maybe other allied governments. similarly, the large and well-capitalized foundational model companies like openai, anthropic, etc have huge budgets to spend on improving their core product, and pay large sums to data labeling / evals / rl environments providers. in defense, there are a few primes (lockheed, raytheon, northrop grumman, boeing, etc - five defense contractors received $771b in pentagon contracts from 2020-2024) and the data labeling companies may consolidate in a similar way, riding on the backs of a few major ai “prime” contracts. losing one of these contracts is pretty serious: for example, i heard that google had a $500m contract with scale that became “up-for-grabs” after the meta deal (meta being google’s competitor), and many of the other ai data-labeling primes looked to snatch it up
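to make “customer concentration” a bit more concrete, here is a rough sketch of two common ways to measure it: the share of revenue coming from the single largest customer, and the herfindahl-hirschman index (the sum of squared revenue shares). the customer names and revenue figures below are hypothetical and only there for illustration; they are not real numbers for any company mentioned in this post.

```python
def concentration(revenue_by_customer):
    """return (top-customer share, herfindahl-hirschman index) for a revenue mix."""
    total = sum(revenue_by_customer.values())
    shares = [revenue / total for revenue in revenue_by_customer.values()]
    top_share = max(shares)
    # hhi is the sum of squared shares: 1.0 means one customer, near 0 means many small ones
    hhi = sum(share * share for share in shares)
    return top_share, hhi


# hypothetical annual revenue in $m (illustrative only, not real figures)
labeling_prime = {"lab_a": 50, "lab_b": 30, "lab_c": 20}
diversified_saas = {f"customer_{i}": 1 for i in range(100)}

for name, mix in [("labeling prime", labeling_prime), ("diversified saas", diversified_saas)]:
    top_share, hhi = concentration(mix)
    print(f"{name}: top customer {top_share:.0%} of revenue, hhi {hhi:.2f}")
```

under these made-up numbers, the labeling prime looks far more concentrated than the diversified business on both measures, which is the pattern investors tend to flag. the argument in this section is that the same pattern shows up at defense primes and hasn’t stopped them from being durable businesses.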
moat is brand + customer relationships
if there’s one takeaway here, it’s that brand (trust) and customer relationships are a core moat of data labeling companies, which is exactly what keeps defense primes entrenched. defense contractors have deep, decades-long relationships with government customers, and they know exactly what those customers need (better than most tech startups who have only talked to a few end customers). this creates inertia and loyalty (if you were the pentagon, would you rather stick with a company that has delivered successfully for decades [maybe even on time and on-budget], or make a bet on an upstart with a much shorter track record?). “no one gets fired for buying ibm.” it’s possible of course for smaller players to win contracts, but the lion’s share of money will most likely go to the usual suspects. data labeling companies similarly can build reputations for reliability and quality with the large foundational model companies. and more importantly in this case, forward-deployed engineers / project managers often work very closely with teams at the ai labs and can anticipate what those teams will need next. today perhaps it’s data labeling, tomorrow it’s rl environments, evals, or something else. it’s hard to predict what will be needed next, but it’s very likely that these large data-labeling primes will be the first to know. simply knowing what’s on your customer’s roadmap is a huge competitive advantage, because you can start building a product preemptively (starting r&d on a new technology you think will be wanted, or assembling a specific dataset / talent pool for ai labs)
base assets
tl;dr - data labeling companies build assets like a human talent network, or algorithms to verify data quality; defense primes build in-house engineering talent and manufacturing capabilities (physical facilities, supply chain, etc)
it’s hard to replicate each of these things. handshake spent years building up a massive database of students and new-grads, so when openai needs 500 phds in a certain field to label data, few other companies will have access in the way that handshake does. mercor / turing / micro1 / other remote talent marketplaces have millions of developers and experts, plus the software layer to coordinate search / quality-vetting / proof-of-work / payments / etc, all of which would be hard to match if you’re starting up a competitor from scratch. since companies like datacurve and surge sell complete datasets (instead of human data labelers) to the foundational model companies, they have robust verification systems that validate data quality, which are harder than you’d think to build from scratch.
and of course on the defense primes side, it’s more obvious that it’s hard to start a competitor from scratch given the capital-intensive nature of the industry’s operations (factories, shipyards, specialized tech, etc)
other parallels
- governments and ai labs are both in an “arms race” situation, racing to outspend their competitors for mission-critical infrastructure and high-quality products
- execution and scale matter a lot. for defense primes / data-labeling primes, product innovation matters (building a better jet, or a more efficient ai interviewer or data quality evaluation suite), but being able to simply execute is really important (delivering huge projects reliably)
- you need a mix of highly-skilled labor and generally massive manpower - managing this workforce efficiently could be a moat in itself
- because of the service-oriented nature of defense / data labeling companies, margins may be lower than you think. defense primes often give up margin for guaranteed volume, and labeling firms may also experience margin compression as competition strengthens. but since these businesses are hard to run profitably at scale, that can be a moat too (few can do it well)
important differences
- the customer is very different for defense / ai primes. the us government is bureaucratic / political which affects procurement (so you need to be good at lobbying - anduril, for example, is really good at this). foundational model companies like openai / anthropic are different - first of all, there are a few of them (google / meta are huge too, for example) versus selling into a single monolithic government. second, ai labs can sign contracts faster and these contracts could be a lot shorter (their needs could change on a dime - datasets / evals could shift quickly if modeling techniques shift)
- capital intensiveness differs too. defense primes need a lot more money to get started (plants, r&d, etc), while data-labeling businesses need less (their main expenses probably scale with labor), so their barriers to entry are lower. it’s easier to get a thousand data-labeling contractors than to build an f-35 fighter
who will win
it’s very unclear. but what is clear is that founders / ceos are forced to become much more aggressive in the current environment, and the ones that are the most aggressive may win. check out “current hype cycle” for a few thoughts here (aggression begets aggression, pivots are the new constant, momentum is the new moat). great entrepreneurs move at a breakneck pace to bring their vision to life - scale ai has constantly re-invented itself in order to stay ahead (autonomous vehicle data labeling, other types of data labeling, ml services support, human feedback / evals, government contracts, etc). the data-labeling primes might not know what the big ai labs will need in five years, but what they can do is position themselves to be in the best place to serve the labs’ needs when they come up