AI agents are becoming more sophisticated, progressing from answering questions to independently performing complex, multi-step tasks.
But before these agents can be trusted to book trips or perform financial analysis on behalf of users, model providers and startups building such agents want to ensure they perform reliably across a wide range of scenarios.
AI labs often use benchmarks to show off their models’ prowess, but a high score, even on an agent-oriented benchmark, doesn’t prove that an AI can accomplish many complex real-world jobs correctly.
Patronus AI, a startup founded in 2023 by former Meta AI researchers Anand Kannappan and Rebecca Qian, helps model developers and companies improve their models to do just that by building simulated digital environments to evaluate agents’ performance.
The San Francisco-based startup appears to be solving an important problem. Almost all leading AI labs and many emerging startups are now clients, according to Glenn Solomon, managing director at Notable Capital, who describes the demand for simulated corporate environments as almost insatiable.
Patronus’ revenues have increased 15-fold over the past year, generating significant interest from investors. The company on Thursday announced a $50 million Series B round led by Greenfield Partners, with participation from Notable Capital, Lightspeed, Datadog and Samsung. The round brings the company’s total funding to $70 million.
Patronus uses what it calls “digital world models” to create exact copies of websites and internal systems. In these environments, agents are stress-tested after training with reinforcement learning, which repeatedly rewards successful task completion and penalizes errors.
AI labs see great value in these digital simulations because they give customers the opportunity to try out different, sometimes unpredictable, scenarios. The company compares its approach to how Waymo trains self-driving cars by first building artificial worlds to test vehicles against rare hazards, such as severe weather or a child running after a ball.
The difference with AI agents is that they tend to take shortcuts, which means they fail to complete the task properly. “Patronus is really good at detecting hacks and making sure they hold models accountable,” Solomon said.
Patronus currently offers its own simulated digital worlds for software engineering and finance, but that’s just the beginning, according to Kannappan.
“Today we are very focused on verifiable issues, that is, issues that you can immediately examine and verify, but there are a lot of areas that cannot be verified or are difficult to verify,” he said.
Just because these processes can be verified doesn’t mean they are simple. “We want to be able to actually create an environment where you can run an agent that can work for 10 hours, 10 days, or 10 weeks,” Kannappan said.
As for competitors, Patronus believes it is primarily competing against internal teams that AI labs have already set up to evaluate agent behavior. While human data companies like Mercor and Surge help model developers with reinforcement learning, Patronus works differently by evaluating how agents behave without any human intervention.