AI judges not reliable for evaluating dental advice

Dental Tribune International

Mon. 11. May 2026


XI’AN, China: As patients increasingly turn to artificial intelligence (AI) tools for oral health advice, questions arise not only about the reliability of the information provided but also about the use of one AI system to assess another AI system’s answers for quality and safety. A new study comparing multiple large language models (LLMs) with human dental clinicians highlights both the promise of chatbots for providing oral health information and the continuing need for expert oversight.

Researchers assessed six major LLMs using nine oral health consultation questions based on material from FDI World Dental Federation. Topics included oral care for infants, pregnancy-related oral health, dry mouth in older adults, oral disease prevention and dental trauma. The LLMs' responses were scored by two experienced dental clinicians and, separately, by three additional LLMs acting as AI judges.

DeepSeek-V3 and Doubao-1.8-Pro achieved the strongest overall performance, both scoring highly on a rubric assessing scientific accuracy, logical rigour, clinical practicality, terminology and completeness. The study found significant differences between the models, suggesting that performance in dental consultations depends heavily on the specific architecture and training data of each system. GPT-5, Gemini 3, Qwen3-Max and Kimi K2 also performed well overall, although with greater variability.

Importantly, the study did not conclude that AI systems were unsafe for providing general oral health information. Instead, the main concern centred on the reliability of AI evaluation systems. Agreement between the two human clinicians was high, indicating strong consistency in expert assessment. In contrast, consistency among the AI judges was much lower, and agreement between the AI judges and human clinicians was extremely poor.

The AI evaluators also tended systematically to score responses more harshly than the human experts did. Despite this stricter scoring, the AI judges still failed to reliably identify some clinically important omissions in the LLMs' responses, particularly in preventive advice and guidance for higher-risk patient groups.

The researchers suggested that this may reflect a limitation in how current LLMs evaluate clinical information: they may give too much weight to fluency and general completeness, while giving too little weight to the clinical importance of risks and patient-specific cautions. In their view, this is likely because LLMs still rely on patterns in text rather than independent clinical reasoning.

The findings suggest that current LLMs have the potential to become useful tools for delivering standardised oral health information and supporting patient education, particularly where immediate access to dental professionals is limited. However, the study strongly cautions against relying on AI systems alone to evaluate the quality or safety of clinical advice.

The researchers concluded that current “AI-as-a-judge” frameworks are not reliable substitutes for expert human review in dentistry. The authors argued that future systems should focus less on language fluency and more on clinical reasoning, patient safety and evidence-based decision-making. The findings sit alongside other recent research suggesting that AI chatbots have value as supervised educational adjuncts in endodontics, particularly for supporting clinical learning and board-style examination preparation, while reinforcing the need for expert oversight rather than replacement of clinician judgement.

The paper, titled “Performance of large language models in oral health consultations and the consistency of the ‘AI-as-a-judge’ framework”, was published online in the August 2026 issue of the International Dental Journal.