A new study looks at how large language models perform in a variety of medical contexts, including real emergency room situations — where at least one model appears more accurate than human doctors.
The study, published this week in the journal Science, comes from a research team led by doctors and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted a variety of experiments to measure how the OpenAI models compared to human doctors.
In one experiment, researchers focused on 76 patients who came to the emergency room at Beth Israel, and compared the diagnoses made by two internal medicine doctors to those generated by the OpenAI o1 and 4o models. These diagnoses were evaluated by two other doctors, who did not know which ones came from humans and which came from artificial intelligence.
“At each diagnostic touchpoint, o1 performed nominally better than or on par with the two treating physicians and 4o,” the study said, adding that the differences “were particularly pronounced at the first diagnostic touchpoint (initial emergency triage), where the least information about the patient is available and the need to make the right decision is most urgent.”
In a Harvard Medical School press release about the study, the researchers stressed that they did not “pre-process the data at all” — the AI models were presented with the same information that was available in the electronic medical records at the time of each diagnosis.
Using this information, the o1 model was able to provide an “accurate or very close diagnosis” in 67% of triage cases, compared to one doctor getting an accurate or close diagnosis in 55% of cases, and the other doctor hitting the mark in 50% of cases.
“We tested the AI model against almost every benchmark, and it outperformed previous models and our clinical baselines,” Arjun Manray, who heads an AI laboratory at Harvard Medical School and is one of the study’s lead authors, said in the press release.
To be clear, the study did not claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it said the results show “an urgent need for future trials to evaluate these technologies in real-world patient care settings.”
The researchers also noted that they only studied how the models performed when presented with text-based information, and that “existing studies suggest that current foundation models are more limited in considering non-textual input.”
Adam Rudman, MD, a Beth Israel physician who is also one of the study’s lead authors, warned in The Guardian that there is “currently no formal framework for accountability” around AI diagnoses, and that patients still want humans “to guide them through life-or-death decisions [and] to guide them through difficult treatment decisions.”
In a blog post about the study, Christine Panthagani, an emergency physician, said this was “an interesting study about AI that has led to some very exaggerated headlines,” especially because it compared AI diagnoses to those made by internal medicine doctors, not emergency doctors.
“If we want to compare AI tools to the clinical ability of doctors, we should start by comparing them to doctors who are already practicing,” Panthagani said. “I wouldn’t be surprised if an LLM could beat a dermatologist on the neurosurgery board exam; [but] this is not a particularly useful thing to know.”
She also said, “As an emergency physician seeing a patient for the first time, my primary goal is not to guess your final diagnosis. My primary goal is to determine if you have a condition that could kill you.”
This article and its headline have been updated to reflect the fact that the diagnoses in the study came from treating internal medicine physicians, and to include commentary from Christine Panthagani.