There is growing interest in applying machine learning models and advanced analytics to healthcare processes and operations, including the generation of new clinical discoveries, the development of high-quality predictions, and the optimization of administrative processes. Machine learning models for prediction and classification rely on extensive and robust datasets, particularly the deep learning models common in health, creating an urgent need for large health datasets. Yet datasets can be insufficiently large because of the rapid evolution of diseases such as coronavirus disease 2019 (COVID-19), the rarity of a disease, or the myriad obstacles to sharing and acquiring existing health data, including ethical, legal, political, economic, cultural, and technical barriers. Synthetic data provide a unique opportunity for health dataset expansion or creation by addressing privacy concerns and other barriers. In this paper, we review prior literature and discuss the landscape of machine learning models used for synthetic health data generation (SHDG), outlining challenges and limitations. We build on existing research on the state of the art in SHDG and on prior broad explorations of the potential risks and opportunities for large language models (LLMs) in healthcare. We contribute to the literature with a focused assessment of LLMs for SHDG, including a review of early research in the area and recommendations for future research directions. We identify six promising research directions for further investigation of LLMs for SHDG: evaluation metrics, LLM adoption, data efficiency, generalization, health equity, and regulatory challenges.
Daniel Smolyak, Department of Computer Science, University of Maryland
Margret V. Bjarnadottir, Robert H. Smith School of Business, University of Maryland
Kenyon Crowley, Accenture Federal Services
Ritu Agarwal, Center for Digital Health and Artificial Intelligence, Carey Business School