A new independent evaluation of OpenAI's GPT-5 has found that the model demonstrates no measurable improvement over its predecessor, GPT-4o, in producing consistent medical recommendations across different sociodemographic patient groups. Researchers applied established testing pipelines to 500 emergency clinical vignettes, varying 32 sociodemographic labels while keeping all clinical content identical. The findings raise fresh concerns about the readiness of large language models for deployment in healthcare settings.
Among the most striking results, several LGBTQIA+ patient subgroups were flagged for mental health screening in 100% of test cases, regardless of their presenting clinical condition. Researchers also documented variation in recommended levels of care — ranging from outpatient treatment to ICU admission — and differences in urgent referral rates, all driven solely by changes in patient demographic labels rather than medical information. The pattern closely mirrored what the same team had previously documented in GPT-4o, suggesting the underlying issue has not been addressed in the newer model.
The evaluation also tested GPT-5's susceptibility to adversarial hallucinations, a phenomenon where a model incorporates false or fabricated details introduced into a prompt as if they were factually accurate. GPT-5 accepted and reproduced fabricated clinical details at a rate of 65%, compared to 53% recorded for GPT-4o under the same conditions, representing a measurable regression in this area. However, researchers found that applying a specific mitigation prompt dramatically reduced the hallucination rate to 7.67%, indicating the vulnerability may be partially addressable through targeted prompt engineering.
The study's authors stress that their evaluation represents a snapshot rather than a comprehensive audit, and that the findings do not necessarily reflect performance across all clinical use cases. Nevertheless, they argue the results highlight systemic issues that persist across model generations, including differential treatment escalation and inconsistent urgent referral recommendations tied to patient identity rather than medical need. The researchers published their methodology and pipelines, calling for standardized bias and safety testing before AI tools are integrated into clinical workflows.
The findings arrive at a moment of rapid expansion in AI-assisted healthcare tools, with hospitals and health systems in multiple countries actively piloting or deploying large language models for triage support, clinical documentation, and diagnostic assistance. Critics and patient advocacy groups have previously raised concerns that AI systems trained on historically biased medical data risk embedding and amplifying existing disparities in healthcare delivery. OpenAI had not issued a public response to the specific findings at the time of publication.
The researchers concluded that model updates alone, without targeted interventions addressing bias and adversarial robustness, are unlikely to resolve these concerns. They recommended that future model evaluations include mandatory sociodemographic consistency testing as a baseline safety benchmark. The full methodology and results have been made publicly available to allow independent replication and verification by other research teams.