Structuring the Unstructured: Advancing Health Data with AI
From early ICU risk signals hidden in clinicians’ notes to automated extraction of
sensor metadata across exposome studies, Fatemeh Shah Mohammadi and her research teams
are transforming unstructured text into trustworthy, interoperable data for real-world
health research. By addressing one of health data science’s most persistent bottlenecks
(valuable insight locked inside narrative documentation), her teams demonstrate how
large language models can illuminate early clinical risk and scale the labor-intensive
work of metadata extraction, while embedding transparency and standardization to make
AI outputs safer and more actionable.
Early ICU risk: notes, GPT-4o, and uncertainty
When a patient arrives in the ICU, the richest clues to outcome often live in clinicians’ free-text notes, but those notes are hard for traditional models to use. The team applied GPT-4o to interpret the first 24 hours of clinical documentation and identify language patterns associated with in-hospital mortality. Crucially, they paired the language model with conformal prediction to quantify uncertainty for every prediction, creating a more transparent and clinically credible output.
Earlier, trustable risk signals can help clinicians prioritize care when structured
data are sparse. By surfacing not only a risk score but also the model’s confidence,
the approach addresses the trust barrier that often keeps AI out of bedside decision
support.
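The uncertainty quantification described above can be illustrated with a minimal split conformal prediction sketch. This is not the team's actual pipeline; the model probabilities are simulated here (in practice they would come from GPT-4o-derived risk scores), and the 90% coverage level is an assumed parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated P(mortality) scores and true labels for a held-out
# calibration set; a real pipeline would use model outputs here.
cal_probs = rng.uniform(0, 1, 500)
cal_labels = (rng.uniform(0, 1, 500) < cal_probs).astype(int)

# Nonconformity score: 1 - probability assigned to the true class.
cal_scores = np.where(cal_labels == 1, 1 - cal_probs, cal_probs)

# Conformal quantile giving ~90% marginal coverage (alpha = 0.1).
alpha = 0.1
n = len(cal_scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(cal_scores, q_level, method="higher")

def prediction_set(p_mortality: float) -> set:
    """Return every label whose nonconformity score is within qhat."""
    labels = set()
    if p_mortality >= 1 - qhat:        # score for "mortality" is 1 - p
        labels.add("mortality")
    if (1 - p_mortality) >= 1 - qhat:  # score for "survival" is p
        labels.add("survival")
    return labels

# A confident case yields a single-label set; an ambiguous case yields
# both labels, flagging the patient for clinician review.
print(prediction_set(0.99))
print(prediction_set(0.50))
```

The key design property is that the prediction set's size communicates confidence directly: a two-label set is an honest "the model is unsure" signal rather than a falsely precise score.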
Key contributors: Alexander Millar, Julio Facelli, Ramkiran Gouripeddi
Scaling sensor metadata extraction with LLMs
Exposure-health studies rely on sensor data from many device types, but metadata (instrument, measured entities, manufacturer, sampling details) are inconsistently reported and costly to curate. Fatemeh's team trained large language models to read unstructured exposure-health literature and automatically extract, structure, and harmonize key sensor metadata fields. The result: dramatically reduced manual effort and much more consistent, FAIR-aligned metadata for exposome research.
Standardized, machine-readable metadata unlock integration across studies, improve
reproducibility, and accelerate downstream analytics in environmental health.
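The extract-structure-harmonize step can be sketched as a prompt-and-parse scaffold. The schema fields, prompt wording, and `call_llm` stub below are illustrative assumptions, not the team's actual implementation; the stub returns a canned JSON response so the example runs without an API key.

```python
import json
from dataclasses import dataclass, field

@dataclass
class SensorMetadata:
    """Structured target schema for extracted sensor metadata (assumed fields)."""
    instrument: str = ""
    measured_entities: list = field(default_factory=list)
    manufacturer: str = ""
    sampling_interval: str = ""

PROMPT_TEMPLATE = (
    "Extract sensor metadata from the passage below. Respond with JSON "
    "containing the keys: instrument, measured_entities, manufacturer, "
    "sampling_interval.\n\n{passage}"
)

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed response for the demo."""
    return json.dumps({
        "instrument": "DustTrak II 8530",
        "measured_entities": ["PM2.5"],
        "manufacturer": "TSI",
        "sampling_interval": "1 minute",
    })

def extract_metadata(passage: str) -> SensorMetadata:
    # Prompt the model, then parse its JSON into the fixed schema so every
    # paper yields the same machine-readable fields.
    raw = call_llm(PROMPT_TEMPLATE.format(passage=passage))
    fields = json.loads(raw)
    return SensorMetadata(
        instrument=fields.get("instrument", ""),
        measured_entities=fields.get("measured_entities", []),
        manufacturer=fields.get("manufacturer", ""),
        sampling_interval=fields.get("sampling_interval", ""),
    )

meta = extract_metadata(
    "PM2.5 was monitored with a TSI DustTrak II 8530 at 1-minute intervals."
)
print(meta.instrument, meta.measured_entities)
```

Forcing the model's output through a fixed schema is what makes the metadata harmonized and FAIR-aligned: downstream tools see identical fields regardless of how each paper phrased its methods section.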
Key contributors: Sunho Im, Julio C. Facelli, Mollie R. Cummins, Ramkiran Gouripeddi
“Large language models allow us to scale what used to be manual, time-intensive work — whether that’s identifying early clinical risk or discovering and harmonizing metadata across heterogeneous health sources — while maintaining the rigor needed for research.” — Fatemeh Shah Mohammadi
Themes & Impact
Both projects show a pragmatic roadmap for applying LLMs in health research: (1) target domain text that humans already use (clinical notes, literature), (2) extract structured signals or metadata, and (3) attach measures of confidence or harmonization rules that make outputs trustworthy and reusable. Together they advance earlier detection, better prioritization, and broader data interoperability — outcomes that help clinicians, researchers, and public-health teams act faster and with more confidence.
For more information, contact Fatemeh.Shah-Mohammadi@utah.edu.