Responsible AI for
Safer Healthcare

The Centre for Responsible Autonomous Systems in Healthcare (CRASH Lab) is a clinician-led collaborative research group directed by Dr. Suvrankar Datta and anchored at the Koita Centre for Digital Health, Ashoka University. We develop the benchmarks, datasets, and evaluation frameworks needed to make medical AI safe and reliable in clinical settings.


Built on Institutional Trust and Credibility

Supported By

Grants & Institutional Funding

In Collaboration With

Researchers and clinical groups at these institutions have collaborated with Dr. Suvrankar Datta and CRASH Lab on published or ongoing work.

Featured At

Recent collaborative research of Dr. Suvrankar Datta and CRASH Lab has been published and presented at:

Radiology's Last Exam.


A reasoning-heavy benchmark of 50 expert-level spot diagnoses across CT, MRI, and radiography. Frontier multimodal AI models are tested against board-certified radiologists and trainees, with reproducibility measured across three independent runs.

50 expert cases · 12 human readers · 5 frontier models · RSNA 2025 · Cutting Edge

RadLE Benchmark

Mean diagnostic accuracy with 95% Wilson confidence intervals. Humans on the left, frontier AI models on the right.

N = 50 cases

Human readers · Frontier AI · Trainee benchmark (45%)

Expert: 83%
Best AI: 30%
Expert-AI gap: 53 pts
Reproducibility: κ ≈ 0.64
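
The chart above reports 95% Wilson confidence intervals. As a minimal sketch of how such an interval can be computed for a single accuracy figure, the Python below applies the standard Wilson score formula. The function name `wilson_interval` and the count of 15 correct out of 50 (chosen only to match the 30% headline figure) are illustrative assumptions; this page does not state how counts were pooled across readers or runs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p_hat = successes / n
    denom = 1 + z * z / n
    # The centre is pulled from p_hat towards 0.5, more strongly for small n.
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical count: 15 correct out of 50 cases (30% accuracy).
low, high = wilson_interval(15, 50)
print(f"30% of 50 cases: 95% Wilson CI = [{low:.1%}, {high:.1%}]")
```

With only 50 cases the interval spans more than twenty percentage points, which is why the intervals matter when comparing models whose headline scores are close.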

RSNA 2025 • Cutting Edge Oral Presentation • crashlab.in/radle

What the field said

A benchmark the community
could not stop sharing.

RadLE drew responses from radiologists, AI safety researchers, clinical educators, and health-tech writers across the first week of release.



Demis Hassabis

@demishassabis · Nov 20

@rohanpaul_ai awesome to see!


Rohan Paul

@rohanpaul_ai · Nov 20

Wow. Gemini 3.0 on Radiology's Last Exam The first time a general-purpose model has beaten radiology residents with 51% accuracy. Radiology trainees are at 45%. The main significance is that a general model has finally reached a level where it can compete with early-stage human training on a specialized medical exam. Congratulations to @GoogleDeepMind team. @GeminiApp


Dr. Datta M.D. (Radiology) M.B.B.S. 🇮🇳

@DrDatta_AIIMS · Nov 20

🔥 Gemini 3.0 vs Radiologists: RadLE Benchmark Results Are OUT! ☠️ Is it game over for Radiology? Let us find out! ⬇️ 🫨 Since yesterday, Gemini 3.0 has been everywhere for crushing benchmarks. My inbox exploded asking: “But how did it do on the hardest visual reasoning benchmark in healthcare?” So we ran it! And here you go. 👇 ➡️ Gemini 3.0 Pro on RadLE v1: ✅ 51% accuracy; first time a general-purpose model has beaten radiology residents ✅ Radiology residents: 45% ✅ Board-certified radiologists: ~83% ✅ Shows clean step-by-step reasoning in some tough cases (appendix localization, mimics ruled out, etc.) 🚀 This is the first time ever that a generalist model has crossed the trainee bar on RadLE v1! Congratulations to @GoogleDeepMind and @Google team including @vivnat, @alan_karthi and all others for cooking this time! Full breakdown here: 🔗 Link in comments / bio 🔥 Huge shoutout to Lakshmi, Divya, Upasana, Hakikat, Kautik & the entire #CRASHLab team at @KCDH_A for turning around in under a day. 🙌 If you are a medical AI lab and want to improve your performances and want our expert insights, reach out!


Dr. Datta M.D. (Radiology) M.B.B.S. 🇮🇳

@DrDatta_AIIMS · Oct 1

🚨 Just published! All frontier AI models have failed “Radiology’s Last Exam” - the toughest benchmark in radiology launched today! ✅ Board-certified radiologists scored 83%, trainees 45%, but the best performing AI from frontier labs, GPT-5, managed only 30%. ❌ These results shatter repeated claims of “doctor-level” AI in medicine and give you a reality check! 🇮🇳 The Centre for Responsible Autonomous Systems in Healthcare (#CRASHLab), @KCDH_A @AshokaUniv, India has launched v1 of one of the hardest benchmarks in medicine and we share our results with the world! 1/n


Simon Smith

@_simonsmith · Nov 20

Here's a very practical real-world benchmark where Gemini 3 Pro shows dramatic progress: Radiology's Last Exam. A general AI model now beats trainee radiologists, with a 70% improvement over the previous best model (which was released in August!).


Healthcare AI Guy

@HealthcareAIGuy · Nov 21

NEW: Gemini 3.0 Pro just passed radiology trainees on Radiology’s Last Exam (51% vs 45%) A general-purpose frontier model is now performing at the level of early-stage human training on a real medical imaging task.


Rohan Paul

@rohanpaul_ai · Oct 2

Paper – https://arxiv.org/abs/2509.25559 Paper Title: "Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology"


Haider.

@haider1 · Oct 4

"Radiology's Last Exam" — the toughest benchmark in radiology According to the paper: GPT-5 scored 30% with "substantial" consistency on 50 expert-level radiology cases across CT, MRI, and X-ray, performing best on MRI but still below humans surely it will be saturated by 2027


Dominik Filkus

@DominikFilkus · Nov 24

Fortunately, AI is not just about image, video creation or coding, it is here to help humanity against diseases or at least help recognize them with high precision. In Radiology's Last Exam (RadLE v1), Gemini 3 Pro was the first SOTA model which outperformed the trainees. Its score, after multiple runs was still far below the score of the certified radiologists but it's still a milestone. No GPT-5, Gemini 2.5 Pro, Grok or Claude models were capable of finishing with a higher score than the trainees before. At some point, AI will be better than humans in most areas and focusing on this specific case, it will make fewer mistakes or no mistakes at all in the future, hopefully.


International coverage

RadLE reached global audiences in Japanese, German, and Hindi.

チェリ@AIエンジニア•メタAIインフルエンサー

@rN1oO71GTPiEMks · Oct 5

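Dr. Datta of AIIMS has released the radiology diagnosis benchmark "Radiology’s Last Exam" and reported that every frontier AI model failed it. Against 83% for board-certified radiologists and 45% for residents, GPT-5 scored 30%, Gemini 2.5 Pro 29%, and Claude Opus 4.1 1%. These results come from an evaluation on 50 difficult cases across CT, MRI, and X-ray. https://x.com/DrDatta_AIIMS/status/1973373655251038701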


チェリ@AIエンジニア•メタAIインフルエンサー

@rN1oO71GTPiEMks · Nov 20

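A result is being shared in which Gemini 3.0 Pro exceeded the average score of radiology residents on the demanding radiology benchmark "Radiology’s Last Exam (RadLE)". The thread compares a general-purpose model with human experts on the same chest imaging questions and shows how close AI has come. https://x.com/DrDatta_AIIMS/status/1991378471604334604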


حمید (شیرازی سودوفیکیک سابق)

@pseudophakic_sh · Nov 20

The Radiology’s Last Exam (RadLE) benchmark: a benchmark at the level of the radiology board exam, which showed that frontier AI models perform worse than even first-year residents, plus its update for Gemini 3.0. The original paper for this research was published 2 months ago, and its update for Gemini 3.0 came out today. #هوش_مصنوعی_و_پزشکی 🧵1/4


Chubby♨️

@kimmonismus · Oct 4

Radiology’s last exam: human radiologists achieve about 83% accuracy, where as GPT-5 achieves ~30%. - for now. Let’s see if we get a updated GPT-5 version on Monday. Anyways, can’t imagine this benchmark will last longer than 6 months until saturated by AI.


AI benchmarks have drifted from real-world clinical practice.

Real-world clinical cases demand more than current generic benchmarks can offer. At CRASH Lab, our focus is building evaluations and benchmarks that probe how frontier AI models actually perform on hard real-world cases.

We evaluate where these models fail, characterise their failure modes, and use those insights to make AI safer and more reliable before it reaches the clinic.

While we lead in building evaluations, our work extends from benchmarking methods to responsible deployment of AI in real clinical workflows.

Evaluation

Many AI benchmarks reuse cases that models already saw during training, which inflates reported performance.

We build hard, contamination-resistant evaluations from fresh clinical cases and test AI models against expert clinicians so scores reflect genuine clinical competence.

Radiology's Last Exam (RadLE paper)

Commission an evaluation

Infrastructure

The future of medical AI evaluation lives inside evaluation harnesses.

We work with academic and commercial partners to embed evaluations directly into clinical AI pipelines, enabling continuous and reproducible testing as models evolve.

Partner on infrastructure

Community

The clinicians of the next decade will work alongside AI every day.

CRASH Lab is cultivating evaluation expertise by collaborating with leading clinicians to identify hard and novel cases, analyse frontier model failures, and build practical AI evaluation judgment.

Join our expert community

Join Our Mission

We are building a new ecosystem for responsible healthcare AI. Whether you are developing an AI-driven product or seeking a dedicated, high-resource research environment, CRASH Lab provides the clinical expertise and infrastructure needed to turn innovation into reality.

Please share as much detail as possible in the form below. This helps us direct your inquiry to the right specialist so we can begin exploring how to work together.
