Sonatype finds live data beats larger AI models on upgrades
Sonatype has published research on how AI models handle open source dependency upgrades, assessing recommendations for about 37,000 unique package versions across several leading models.
The study examined newer models from Anthropic, Google and OpenAI, including Claude Sonnet 3.7 and 4.5, Claude Opus 4.6, Gemini 2.5 Pro and 3 Pro, and GPT-5 and GPT-5.2, alongside additional tests on smaller models. It focused on software dependency upgrade suggestions across Maven Central, npm, PyPI and NuGet.
The central finding was that larger or newer models did not consistently produce the safest results when working without live software context. The best outcomes came when AI recommendations were grounded in real-time software intelligence that could validate package availability, identify upgrade paths and account for known vulnerabilities.
Hallucinations had fallen, but remained a material problem. Even the strongest ungrounded systems still fabricated roughly one in 16 dependency recommendations, the study found.
That matters because software teams must spend time checking proposed fixes, discarding invalid advice and verifying whether suggested versions actually exist. Invalid recommendations also increase token use and add clean-up work for developers and security teams.
Cautious Models
The research also found a rise in model restraint. Newer systems were more likely to recommend no change to an open source component rather than propose an upgrade.
That caution reduced hallucinations, but often left significant risk in place: even the most cautious models still left roughly 800 to 900 Critical- and High-severity vulnerabilities unaddressed.
Sonatype also compared standalone frontier models with what it described as a hybrid approach that selects the most secure available upgrade path using real-time software intelligence. Across Maven Central, npm and PyPI, that approach delivered mean security score improvements of 269% to 309%, compared with 24% to 68% for the best-performing large language model in each ecosystem.
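The selection logic behind such a hybrid approach can be sketched in a few lines. This is a minimal illustration, not Sonatype's implementation: the package versions, advisory lists and severity weights below are all hypothetical, and stand in for the real-time software intelligence the study describes.

```python
# Hypothetical vulnerability data: version -> severities of known advisories
advisories = {
    "1.0.0": ["CRITICAL", "HIGH", "HIGH"],
    "1.1.0": ["HIGH"],
    "1.2.0": ["LOW"],
    "1.2.1": [],
}

# Illustrative weights; real scoring schemes differ
SEVERITY_WEIGHT = {"CRITICAL": 10, "HIGH": 7, "MEDIUM": 4, "LOW": 1}

def security_score(version: str) -> int:
    """Aggregate severity of all known advisories for a version."""
    return sum(SEVERITY_WEIGHT[s] for s in advisories[version])

def safest_upgrade(current: str) -> str:
    # Pick the available version with the lowest aggregate severity,
    # preferring the current version on ties to avoid needless churn.
    return min(advisories, key=lambda v: (security_score(v), v != current))

print(safest_upgrade("1.0.0"))  # "1.2.1"
```

The point of the sketch is that the decision is a data lookup, not a reasoning step: once the live advisory data is available, picking the safest existing version is trivial, which is why grounding matters more than model size here.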
Cost was another point of comparison. In one test, a smaller grounded model produced 19 fewer Critical and 38 fewer High-severity vulnerabilities than Claude Opus 4.6, while running at a per-token inference cost up to 71 times lower.
Data Problem
The findings point to a broader issue in using AI for software maintenance: suggested actions depend on fast-changing package registries, vulnerability data and the specifics of a customer environment. In that setting, model reasoning alone may not be enough if the underlying information is incomplete or stale.
Brian Fox, co-founder and CTO of Sonatype, said the study showed why current data matters more than model size alone.
"Larger models may be improving at reasoning, but dependency management is not a reasoning problem alone - it is a data problem. If a model does not know your actual environment, current vulnerability data, and the policies you operate under, it is just making educated guesses," he said.
"Grounding AI in that reality is what makes its recommendations useful, credible, and safe for enterprise use."
The methodology used direct dependencies drawn from enterprise applications scanned over a three-month period. That produced roughly 37,000 unique package-version pairs and around 258,000 recommendations, which were evaluated across seven frontier models from OpenAI, Anthropic and Google.
Each model received the same prompt. Every recommended version was then checked against Sonatype's package registry, with non-existent versions classified as hallucinations and same-version recommendations treated as inaction.
Security outcomes were measured using the company's severity scoring and deduplicated advisory counts. Hallucinated versions were treated as no-ops in the assessment because package managers would reject them in practice.
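The classification rule the methodology describes can be expressed as a small validation function. This is a hedged sketch under the assumptions in the article: the registry snapshot here is a hypothetical set of known versions, standing in for a real package-registry lookup.

```python
from enum import Enum

class Outcome(Enum):
    HALLUCINATION = "hallucination"  # recommended version does not exist
    INACTION = "inaction"            # model recommended the current version
    UPGRADE = "upgrade"              # a real, different version was proposed

def classify(current: str, recommended: str, known_versions: set[str]) -> Outcome:
    """Classify a model's version recommendation against registry data,
    following the study's rules: non-existent versions are hallucinations,
    same-version recommendations count as inaction."""
    if recommended not in known_versions:
        return Outcome.HALLUCINATION
    if recommended == current:
        return Outcome.INACTION
    return Outcome.UPGRADE

# Hypothetical registry snapshot for one package
versions = {"4.17.19", "4.17.20", "4.17.21"}
assert classify("4.17.20", "4.17.21", versions) is Outcome.UPGRADE
assert classify("4.17.20", "4.17.20", versions) is Outcome.INACTION
assert classify("4.17.20", "4.18.99", versions) is Outcome.HALLUCINATION
```

Treating hallucinated versions as no-ops, as the study does, mirrors real behaviour: a package manager would simply fail to resolve a version that does not exist.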
Sonatype also ran a separate test on a 397-component sample designed around known failure modes. That exercise evaluated GPT-5 Nano with a single function-calling tool linked to Sonatype Guide's version recommendation API and applied the same validation and security methodology.
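A single function-calling tool of the kind described would look roughly like the schema below. This is an illustrative sketch only: the tool name, field names and ecosystem list are assumptions for the example, not Sonatype Guide's actual API.

```python
# Hedged sketch of an OpenAI-style tool definition a small model could
# call to fetch grounded version recommendations. All names here are
# hypothetical placeholders, not the real Sonatype Guide endpoint.
recommend_tool = {
    "type": "function",
    "function": {
        "name": "recommend_version",
        "description": "Return the safest available upgrade for a package.",
        "parameters": {
            "type": "object",
            "properties": {
                "ecosystem": {
                    "type": "string",
                    "enum": ["maven", "npm", "pypi", "nuget"],
                },
                "package": {"type": "string"},
                "current_version": {"type": "string"},
            },
            "required": ["ecosystem", "package", "current_version"],
        },
    },
}
```

Wiring even a small model to one such tool is what the study means by grounding: the model delegates the version choice to live data rather than recalling it from training.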
The work adds to a growing debate over whether general-purpose AI models can be trusted to make low-level software maintenance decisions without access to live operational data.
For businesses using AI to support developers, the study suggests reliability may depend less on choosing the biggest model and more on tying systems to current software supply chain information.