Introduction: Artificial intelligence (AI) and machine learning (ML) are increasingly positioned to revolutionize healthcare, promising significant improvements in diagnostic accuracy, speed, and personalization. A proliferation of studies has demonstrated the potential of AI models across diverse medical domains, including oncology, cardiology, and infectious diseases. However, this rapid development has created a fragmented landscape, making it difficult to ascertain the true state of the art and to identify the most significant barriers to clinical translation. While individual studies often report high performance, there is a critical need to synthesize this evidence to understand overarching trends, methodological consistencies, and persistent challenges. This review synthesizes findings from recent literature to evaluate the diagnostic performance of AI models, characterize common methodological approaches, and highlight the critical gaps that must be addressed to facilitate their responsible integration into clinical practice.
Methods: A systematic literature search was conducted across Scopus, PubMed, Nature, and ScienceDirect for English-language articles published between January 2024 and June 2025. Keywords included terms such as "generative artificial intelligence," "machine learning in medicine," and "diagnostic accuracy." The review included original studies evaluating AI-based diagnostic or predictive models in humans and excluded reviews and preclinical reports. After screening of titles, abstracts, and full texts, relevant articles were selected for synthesis. This review focused on summarizing the methodologies and results reported in the selected studies, including the types of AI models used (e.g., convolutional neural networks [CNNs], generative adversarial networks [GANs]), their validation frameworks, and the performance metrics achieved.
Results: Our synthesis reveals that AI models consistently achieve high diagnostic performance across numerous applications. A meta-analysis reported a strong pooled area under the curve (AUC) of 0.9025, indicating high overall accuracy of AI-based diagnostics. Specifically, CNNs demonstrated >90% accuracy in early tumor detection and achieved up to 95% accuracy on histopathology images for breast cancer, outperforming traditional benchmarks. Similarly, ensemble classifiers exceeded 98% accuracy for both tuberculosis identification and myocardial infarction prediction. Generative models have shown unique utility in predicting drug response and disease progression (AUCs ≥0.90) and have successfully augmented datasets, boosting downstream classification performance by up to 15%. This high performance, however, is contrasted by significant methodological issues. The meta-analysis identified substantial heterogeneity across studies (I² = 91.01%), with performance varying significantly by model type and clinical domain. More critically, multiple sources highlight systemic biases. One systematic review found that a third of studies on generative AI were at high risk of bias, frequently lacking external validation or transparent reporting. This is corroborated by meta-analytic evidence of significant publication bias (p < 0.001), suggesting over-reporting of positive results. Furthermore, the performance of large language models (LLMs) appears contested: while some models, such as ChatGPT-4, perform comparably to human trainees on examinations, they were substantially outperformed by specialized deep-learning models in chest X-ray interpretation (40.5% vs. 70.5% accuracy) and carry a noted risk of hallucination.
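For context on the heterogeneity estimate above, the I² statistic quantifies the proportion of total variation in effect estimates across studies that is attributable to between-study heterogeneity rather than chance. It is standardly derived from Cochran's Q over k studies as:

```latex
I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%
```

An I² of 91.01% therefore implies that over 90% of the observed variance in reported diagnostic performance reflects genuine between-study differences (e.g., model type, clinical domain, validation design) rather than sampling error.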
Conclusion: The collective evidence strongly indicates that AI models, especially deep and generative learning architectures, possess the technical capacity to transform medical diagnostics by delivering unprecedented accuracy and enabling novel applications in personalized medicine. These tools can enhance early disease detection, streamline workflows, and simulate individualized treatment effects. However, the transition from algorithmic potential to clinical reality is impeded by critical, unresolved challenges. The field is characterized by significant methodological heterogeneity, a high risk of bias in published studies, and a tendency towards selective reporting of positive outcomes. To realize the clinical promise of AI, future research must pivot from isolated proof-of-concept studies to rigorous, large-scale validation. The immediate priorities should be: 1) Establishing standardized evaluation frameworks and transparent reporting guidelines to ensure reproducibility and comparability. 2) Mandating external validation on diverse, multi-center cohorts to assess model generalizability and fairness. 3) Advancing the development of explainable AI (XAI) systems to foster clinical trust and illuminate model decision-making. 4) Addressing the prevalent issues of algorithmic bias and data privacy through robust ethical and regulatory oversight. Only by confronting these challenges directly can the field ensure that AI-driven diagnostics are developed and deployed in a manner that is effective, equitable, and safe for all patients.
Keywords: Generative Artificial Intelligence, Machine Learning in Medicine, Diagnostic Accuracy.