Why ATS Keywords Fail to Predict Job Performance
Applicant tracking system keywords fail because they measure what a candidate has already done, not how that person will perform in a specific role. Keywords capture prior environments. They do not capture adaptability to a new one. In a study of 10,765 hires, not one of 3,597 tested keywords predicted production after correction, and prior industry experience was anti-predictive. The mechanism is simple: a keyword is a proxy for a skill, and the proxy and the real thing often point in opposite directions.
Source: "Decision Traces," Saad Bin Shafiq, NODES, 2026, plus the Harvard Business School and Accenture report "Hidden Workers: Untapped Talent" (2021). Read the study on arXiv.
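The phrase "after correction" matters because testing thousands of keywords at once produces false hits by chance alone. With 3,597 keywords each tested at a 0.05 threshold, about 180 would look significant even if none carried any signal (3,597 × 0.05). The summary here does not say which correction the study applied, so the sketch below assumes the Benjamini-Hochberg procedure as one common choice. The function name and the p-values are illustrative, not taken from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where a test survives BH correction."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff counts as a discovery.
    survives = [False] * m
    for idx in order[:cutoff_rank]:
        survives[idx] = True
    return survives


# Expected chance "hits" if 3,597 useless keywords were each tested at 0.05.
expected_false_positives = 3597 * 0.05

# Hypothetical p-values for ten keywords (illustrative, not study data).
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
survivors = benjamini_hochberg(toy_p_values)

print(f"Expected chance hits without correction: {expected_false_positives:.0f}")
print(f"Toy keywords under 0.05 before correction: {sum(p < 0.05 for p in toy_p_values)}")
print(f"Toy keywords surviving correction: {sum(survivors)}")
```

In the toy data, five keywords look significant before correction and two survive it. In the study, none of the 3,597 did.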
Keywords describe the past, not the role
ATS keyword parsing records what appears on a resume. The skills, certifications, and domain terms a candidate lists reflect the jobs they have already held. They say little about whether that person will succeed in a new environment with different processes, different incentives, and different coaching. A resume is a record of the past, and the ATS treats it as a forecast.
Experience can be a liability, not only an asset
In the study, insurance experience and a license were both anti-predictive (odds ratios of 0.763 and 0.668, respectively; a value below 1 means lower odds of the outcome). That sounds backwards until you look at the mechanism. Experienced agents arrive with habits, expectations, and compensation anchors from prior roles. They can be harder to coach, slower to adopt new processes, and more likely to revert to approaches that worked somewhere else. The personnel selection literature has long held that domain experience can be an asset or a liability, depending on how similar the old environment is to the new one. The carrier's data showed the liability side clearly.
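To get a feel for what those odds ratios mean in practice, here is a minimal sketch that converts an odds ratio into a change in probability. The 30% baseline rate is a made-up figure, not a number from the study, and the sketch assumes the odds ratios describe a binary "produced or did not" outcome. The function name is hypothetical.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Shift a baseline probability by an odds ratio and return the new probability."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1 - baseline_prob)
    shifted_odds = baseline_odds * odds_ratio
    return shifted_odds / (1 + shifted_odds)


# Assumed baseline: 30% of hires without the trait reach the production bar.
# This is an illustrative value only; the study's baseline is not given here.
ASSUMED_BASELINE = 0.30

for label, odds_ratio in [("insurance experience", 0.763), ("license", 0.668)]:
    shifted = apply_odds_ratio(ASSUMED_BASELINE, odds_ratio)
    print(f"{label}: {ASSUMED_BASELINE:.0%} baseline -> {shifted:.1%}")
```

Under that assumed baseline, the two odds ratios correspond to roughly 25% and 22% instead of 30%. A different baseline gives different percentages, which is why an odds ratio should not be read as a direct percentage drop.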
The keyword-count trap
A higher keyword count correlated with worse production, not better, though only weakly (a correlation of about -0.09). One likely reason is resume padding: candidates who know an ATS rewards keyword density write to the filter rather than to the role. The filter rewards the behavior it should penalize.
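A check like this is easy to run on your own hires. The sketch below computes a Pearson correlation between keyword count and a production figure. It assumes the reported value is a Pearson coefficient, which the summary does not state, and the data is invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical hires: matched keywords on the resume vs. first-year production.
keyword_counts = [4, 12, 7, 15, 3, 9, 11, 6]
production = [52, 41, 60, 38, 47, 55, 44, 58]

r = pearson_r(keyword_counts, production)
print(f"r = {r:.2f}, share of variance explained = {r * r:.1%}")
```

Squaring the coefficient gives the share of variance explained under a linear reading. At about -0.09, that is under 1%, so the practical takeaway is less "keywords hurt" than "keyword count tells you almost nothing, and what little it tells you points the wrong way."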
The bigger pattern: hidden workers
This is not unique to one carrier. In the Harvard Business School and Accenture Hidden Workers study, 88% of surveyed executives agreed that qualified, high-skills candidates are vetted out of the hiring process because they do not match the exact criteria. The researchers estimate more than 27 million hidden workers in the United States alone, capable people repeatedly filtered out by automated screening. The Decision Traces study quantifies the same failure across an entire ATS keyword vocabulary, with production outcomes attached.
What predicts performance instead
If keywords do not work, something has to. In this dataset, personality assessment was the strongest single signal, and fusing behavioral, personality, and ATS data improved prediction beyond any system alone. See the fusion results. The practical move is to stop trusting screening rules on faith and start testing them against outcomes through a decision trace. See how.
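Testing a screening rule against outcomes can start small. The sketch below scores a single keyword rule with AUC, the metric cited in the related reading, using the pairwise definition: the chance that a randomly chosen producer outranks a randomly chosen non-producer. The data and names are hypothetical, and this is a minimal illustration, not the study's method.

```python
def auc(scores, outcomes):
    """AUC via pairwise comparison: P(score of a producer > score of a non-producer).

    Ties count as half a win. 0.5 means no signal; below 0.5 means the
    score ranks candidates in the wrong direction.
    """
    producers = [s for s, o in zip(scores, outcomes) if o]
    non_producers = [s for s, o in zip(scores, outcomes) if not o]
    if not producers or not non_producers:
        raise ValueError("need at least one producer and one non-producer")
    wins = 0.0
    for p in producers:
        for n in non_producers:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(producers) * len(non_producers))


# Hypothetical hires: 1 if the resume matched the keyword rule, else 0,
# paired with whether that hire went on to hit the production bar.
has_keyword = [1, 1, 1, 0, 0, 0, 1, 0]
produced = [0, 1, 0, 1, 1, 0, 0, 1]

rule_auc = auc(has_keyword, produced)
print(f"AUC of the keyword rule: {rule_auc:.2f}")
```

A rule that scores near 0.5 is not doing any work, and one that scores below 0.5 is screening out the people you want. Any continuous score, such as an assessment result, can be passed in the same way for a side-by-side comparison.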
Frequently asked questions
Why do applicant tracking systems reject qualified candidates? Because they screen on keyword matches that stand in for experience, and those proxies do not reliably track performance. In the Decision Traces study of 10,765 hires, prior-experience keywords were anti-predictive.
Are ATS keyword filters biased? Keyword filters can produce adverse impact while also failing to predict production. Research on the diversity-validity dilemma finds that removing weak filters can improve both fairness and accuracy at the same time.
If keywords fail, what should we screen on? In this dataset, behavioral and personality signals carried far more predictive power than keywords, especially when fused together across systems.
Should we stop using an ATS? No. The ATS is the system of record and it does that job well. The problem is screening on keywords inside it without ever checking those keywords against who actually produced.
Related reading
- What 10,765 hires revealed about resume keywords
- From AUC 0.647 to 0.735: what predicts performance
- Decision traces, explained