Structuring the Unstructured: Advancing Health Data with AI
From early ICU risk signals hidden in clinicians’ notes to automated extraction of
sensor metadata across exposome studies, Fatemeh Shah Mohammadi and her research teams
are transforming unstructured text into trustworthy, interoperable data for real-world
health research. By addressing one of health data science’s most persistent bottlenecks
(valuable insight locked inside narrative documentation), her teams demonstrate how
large language models can illuminate early clinical risk and scale the labor-intensive
work of metadata extraction, while embedding transparency and standardization to make
AI outputs safer and more actionable.
Early ICU risk: notes, GPT-4o, and uncertainty
When a patient arrives in the ICU, the richest clues to outcome often live in clinicians’ free-text notes, but those notes are hard for traditional models to use. The team applied GPT-4o to interpret the first 24 hours of clinical documentation and identify language patterns associated with in-hospital mortality. Crucially, they paired the language model with conformal prediction to quantify uncertainty for every prediction, creating a more transparent and clinically credible output.
Earlier, trustable risk signals can help clinicians prioritize care when structured
data are sparse. By surfacing not only a risk score but also the model’s confidence,
the approach addresses the trust barrier that often keeps AI out of bedside decision
support.
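The uncertainty quantification described above can be illustrated with a minimal split conformal prediction sketch. This is not the team's actual pipeline; the model probabilities are simulated here (in practice they would come from GPT-4o-derived risk scores), and the 90% coverage level is an assumed parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated P(mortality) scores and true labels for a held-out
# calibration set; a real pipeline would use model outputs here.
cal_probs = rng.uniform(0, 1, 500)
cal_labels = (rng.uniform(0, 1, 500) < cal_probs).astype(int)

# Nonconformity score: 1 - probability assigned to the true class.
cal_scores = np.where(cal_labels == 1, 1 - cal_probs, cal_probs)

# Conformal quantile giving ~90% marginal coverage (alpha = 0.1).
alpha = 0.1
n = len(cal_scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(cal_scores, q_level, method="higher")

def prediction_set(p_mortality: float) -> set:
    """Return every label whose nonconformity score is within qhat."""
    labels = set()
    if p_mortality >= 1 - qhat:        # score for "mortality" is 1 - p
        labels.add("mortality")
    if (1 - p_mortality) >= 1 - qhat:  # score for "survival" is p
        labels.add("survival")
    return labels

# A confident case yields a single-label set; an ambiguous case yields
# both labels, flagging the patient for clinician review.
print(prediction_set(0.99))
print(prediction_set(0.50))
```

The key design property is that the prediction set's size communicates confidence directly: a two-label set is an honest "the model is unsure" signal rather than a falsely precise score.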
Key contributors: Alexander Millar, Julio Facelli, Ramkiran Gouripeddi
Scaling sensor metadata extraction with LLMs
Exposure-health studies rely on sensor data from many device types, but metadata (instrument, measured entities, manufacturer, sampling details) are inconsistently reported and costly to curate. Fatemeh's team trained large language models to read unstructured exposure-health literature and automatically extract, structure, and harmonize key sensor metadata fields. The result: dramatically reduced manual effort and much more consistent, FAIR-aligned metadata for exposome research.
Standardized, machine-readable metadata unlock integration across studies, improve
reproducibility, and accelerate downstream analytics in environmental health.
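The extract-structure-harmonize step can be sketched as a prompt-and-parse scaffold. The schema fields, prompt wording, and `call_llm` stub below are illustrative assumptions, not the team's actual implementation; the stub returns a canned JSON response so the example runs without an API key.

```python
import json
from dataclasses import dataclass, field

@dataclass
class SensorMetadata:
    """Structured target schema for extracted sensor metadata (assumed fields)."""
    instrument: str = ""
    measured_entities: list = field(default_factory=list)
    manufacturer: str = ""
    sampling_interval: str = ""

PROMPT_TEMPLATE = (
    "Extract sensor metadata from the passage below. Respond with JSON "
    "containing the keys: instrument, measured_entities, manufacturer, "
    "sampling_interval.\n\n{passage}"
)

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed response for the demo."""
    return json.dumps({
        "instrument": "DustTrak II 8530",
        "measured_entities": ["PM2.5"],
        "manufacturer": "TSI",
        "sampling_interval": "1 minute",
    })

def extract_metadata(passage: str) -> SensorMetadata:
    # Prompt the model, then parse its JSON into the fixed schema so every
    # paper yields the same machine-readable fields.
    raw = call_llm(PROMPT_TEMPLATE.format(passage=passage))
    fields = json.loads(raw)
    return SensorMetadata(
        instrument=fields.get("instrument", ""),
        measured_entities=fields.get("measured_entities", []),
        manufacturer=fields.get("manufacturer", ""),
        sampling_interval=fields.get("sampling_interval", ""),
    )

meta = extract_metadata(
    "PM2.5 was monitored with a TSI DustTrak II 8530 at 1-minute intervals."
)
print(meta.instrument, meta.measured_entities)
```

Forcing the model's output through a fixed schema is what makes the metadata harmonized and FAIR-aligned: downstream tools see identical fields regardless of how each paper phrased its methods section.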
Key contributors: Sunho Im, Julio C. Facelli, Mollie R. Cummins, Ramkiran Gouripeddi
“Large language models allow us to scale what used to be manual, time-intensive work — whether that’s identifying early clinical risk or discovering and harmonizing metadata across heterogeneous health sources — while maintaining the rigor needed for research.” — Fatemeh Shah Mohammadi
Themes & Impact
Both projects show a pragmatic roadmap for applying LLMs in health research: (1) target domain text that humans already use (clinical notes, literature), (2) extract structured signals or metadata, and (3) attach measures of confidence or harmonization rules that make outputs trustworthy and reusable. Together they advance earlier detection, better prioritization, and broader data interoperability — outcomes that help clinicians, researchers, and public-health teams act faster and with more confidence.
For more information, contact Fatemeh.Shah-Mohammadi@utah.edu.