Vibhu Agarwal

Large Language Models: Wrecking Balls for Impregnable Clinical Data Silos (Part II)

Introduction

Advancing health data interoperability holds significant benefits for health research and practice. Fast Healthcare Interoperability Resources (FHIR), a well-established healthcare data standard developed by HL7 International, addresses this urgent need for interoperability. However, challenges arise when transforming Electronic Health Record (EHR) data into FHIR resources because of the heterogeneous structures and formats of health data. The difficulty is particularly pronounced when health information is embedded in unstructured text rather than in well-organized, structured formats.


FHIR organizes healthcare data into resources, represented in standardized formats such as JSON or XML, that cover entities like patients, medications, and diagnoses. It establishes interoperability through the standardization of formats, clinical terminologies, and protocols. Natural Language Processing (NLP) tools that transform health data into the relevant FHIR resources through expert-curated rules or probabilistic approaches have been proposed in the past. However, these require significant investments in development, data collection and processing, training, and integration into patient care workflows (Li, Wang, Yerebakan, Shinagawa, & Luo, 2023).


As discussed in Part I of this post, Large Language Models (LLMs) offer a path to automatically map unstructured medical data to FHIR profiles but, unlike traditional supervised machine learning methods, do not require copious amounts of annotated data.


Translating Clinical Notes to FHIR v4 with LLMs

We considered two LLMs, OpenAI's GPT-3.5 and Anthropic's Claude 2.1, which are known to excel at several NLP tasks and rank highly in many independent assessments (Henshall, 2023). The objective was to map the text in discharge summaries from the MIMIC-III dataset, a large, de-identified, and publicly available collection of medical records (Johnson, Pollard, & Mark, 2016), to the corresponding FHIR v4 resources.


We illustrate the approach by transforming mentions of medications in the discharge summaries into the MedicationStatement resource. A MedicationStatement is a FHIR v4 resource that represents medications being consumed by a patient, whether in the past, present, or future ("MedicationStatement," n.d.). MedicationStatement references resources containing key medication-related information such as Medication.Ingredient and Medication.doseForm. Medication.Ingredient identifies an active or inactive ingredient in the medication, while Medication.doseForm represents the specific form of the medication, such as powder, tablets, or capsules ("Medication Definitions," n.d.). The text-to-FHIR transformation was achieved using prompts as shown in Figures 1 and 2. This approach formulates semantic mapping against short, domain-specific vocabularies as a few-shot task for the LLMs.

[INSTRUCTIONS]
You're a helpful assistant proficient in extracting medication data based on user-provided drug administration descriptions. Users will provide details of drug usage, and your task is to identify the drug's ingredient without assigning any medical code...

[NARRATIVE]
<The free text>

[TEMPLATE]
"ingredient" : [{
    "item" : {
      "concept" : {
        "coding" : [{       // optional and MUST look up the table below
          "system" :        // system (eg: "http://snomed.info/sct")
          "code" :          // SNOMED code
          "display" :       // the display of the code
        }]
      }
    },
    "isActive" :            // optional boolean
    "strengthRatio" : {
      "numerator" : {       // optional and MUST look up the table below
        "value" :           // Quantity
        "system" :          // system (eg: "http://unitsofmeasure.org")
        "code" :            // unit of measurement code
      },
      "denominator" : {     // optional and MUST look up the table below
        "value" :           // SimpleQuantity
        "system" :          // system (eg: "http://unitsofmeasure.org")
        "code" :            // unit of measurement code
      }
    }
  }]

[TERMINOLOGIES]
Code  Display
9     Gram
<More such codes>

[EXAMPLES]
<4-5 conversion examples>

Don't include any example ingredient codes for items in your answer and you must return just the JSON.

Figure 1 Prompt template for generating the Medication.Ingredient resource from discharge summary text

[INSTRUCTIONS]
You're a helpful assistant proficient in extracting medication data based on user-provided drug administration descriptions. Users will provide details of drug usage, and your task is to identify the dose form without assigning any medical codes...

[NARRATIVE]
<The free text>

[TEMPLATE]
"doseForm" : {
    "coding" : [{           // optional and MUST look up the table below
      "system" :            // system (eg: "http://snomed.info/sct")
      "code" :              // SNOMED code
      "display" :           // the display of the code
    }]
  }

[TERMINOLOGIES]
Code       System                  Display
736542009  http://snomed.info/sct  Pharmaceutical dose form (dose form)
<More such codes>

[EXAMPLES]
<4-5 conversion examples>

Don't include any example doseform codes for items in your answer and you must return just the JSON.

Figure 2 Prompt template for generating the Medication.doseForm resource from discharge summary text


The prompts were constructed from a template that includes the task instructions, the expected FHIR v4 output in JSON format, 4-5 input-output examples, a list of codes from the controlled vocabulary referenced in the resource definition, and the input medication mention. Both GPT-3.5 and Claude 2.1 were prompted to transform the medication mentions in the discharge summaries (Figure 4) into output in JSON format. The prompt for Medication.Ingredient asked the LLM to identify an active or inactive ingredient from the medication mentions, as well as the medication strength (mapped to Medication.Ingredient.strength). Similarly, the prompt for Medication.doseForm asked the LLM to identify the specific dose form from the given medication mentions.
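
As a rough illustration of this assembly step, the Python sketch below stitches the parts of Figure 1 into a single prompt string. The function name, the abbreviated instruction and template text, and the example data are hypothetical stand-ins rather than the exact code used in our experiments.

```
# A minimal sketch of assembling the few-shot prompt of Figure 1.
# INSTRUCTIONS and TEMPLATE are abbreviated; the full text appears in Figure 1.

INSTRUCTIONS = (
    "You're a helpful assistant proficient in extracting medication data "
    "based on user-provided drug administration descriptions..."
)

TEMPLATE = '"ingredient" : [{ "item" : { "concept" : { "coding" : [{ ... }] } }, ... }]'

def build_prompt(medication_mention: str,
                 terminology_rows: list[tuple[str, str]],
                 examples: list[tuple[str, str]]) -> str:
    """Concatenate instructions, the input narrative, the output template,
    candidate vocabulary codes, and input-output examples into one prompt."""
    terminologies = "\n".join(f"{code}\t{display}" for code, display in terminology_rows)
    shots = "\n\n".join(f"Input: {text}\nOutput: {fhir_json}" for text, fhir_json in examples)
    return (
        f"[INSTRUCTIONS]\n{INSTRUCTIONS}\n\n"
        f"[NARRATIVE]\n{medication_mention}\n\n"
        f"[TEMPLATE]\n{TEMPLATE}\n\n"
        f"[TERMINOLOGIES]\nCode\tDisplay\n{terminologies}\n\n"
        f"[EXAMPLES]\n{shots}\n\n"
        "Don't include any example ingredient codes in your answer and "
        "you must return just the JSON."
    )

prompt = build_prompt(
    "Albuterol Sulfate 0.083% Solution Sig: One (1) Inhalation Q6H",
    terminology_rows=[("9", "Gram")],
    examples=[("<mention>", "<expected FHIR JSON>")],
)
```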


Data interoperability requires the use of standard vocabularies for representing medical concepts. FHIR v4 requires medication information to be standardized to the Systematized Nomenclature of Medicine — Clinical Terms (SNOMED CT) (SNOMED International, n.d.). One way to achieve this mapping is to include a list of candidate codes in the prompt itself. For example, the prompt for Medication.doseForm includes the list of dose forms curated within SNOMED CT (Figure 2). This approach relies on the LLM's ability to recall the relevant parametric knowledge it acquired during pre-training. Similarly, candidate codes from the Unified Code for Units of Measure (UCUM) can be included in the prompt to map Medication.Ingredient.strength units to standard units. This becomes challenging when the code list is long, since LLMs impose an upper limit on the length of the text that can be provided as context.


An alternative approach is to store standard vocabularies externally and access them through vector stores that are optimized for semantic indexing and search. For instance, the complete list of SNOMED CT codes for medications can be stored and retrieved based on lexical as well as semantic matches against a given medication mention. The top k matches can then be inserted into a prompt via a template to provide the necessary context to the LLM. Conceptually, this is similar to the Retrieval Augmented Generation (RAG) method of augmenting an LLM's capabilities with non-parametric knowledge (Figure 3).


Figure 3 Accessing extended codelists via external vector stores 
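
A rough sketch of this retrieval step is shown below. It assumes the sentence-transformers library, an off-the-shelf embedding model, and a tiny in-memory code list standing in for a real vector store; the model name and the SNOMED CT subset are illustrative.

```
# A conceptual sketch of retrieving candidate SNOMED CT dose form codes
# for a prompt, in the spirit of RAG. The tiny in-memory "vector store"
# below stands in for a production vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# (code, display) pairs from the SNOMED CT dose form hierarchy (illustrative subset).
snomed_dose_forms = [
    ("385023001", "Oral solution"),
    ("385055001", "Tablet"),
    ("420641004", "Solution for inhalation"),
    ("385049006", "Capsule"),
]
corpus_embeddings = encoder.encode(
    [display for _, display in snomed_dose_forms], normalize_embeddings=True
)

def top_k_codes(mention: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k SNOMED CT entries most semantically similar to the mention."""
    query = encoder.encode([mention], normalize_embeddings=True)
    scores = (corpus_embeddings @ query.T).ravel()  # cosine similarity (unit vectors)
    return [snomed_dose_forms[i] for i in np.argsort(-scores)[:k]]

# The retrieved rows are then substituted into the [TERMINOLOGIES] prompt section.
candidates = top_k_codes("Albuterol Sulfate 0.083% Solution ... Inhalation Q6H")
terminology_block = "\n".join(
    f"{code}\thttp://snomed.info/sct\t{display}" for code, display in candidates
)
```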


Text-to-FHIR Transformation

The outputs from GPT-3.5 and Claude 2.1 were nearly identical, with minor differences related to the handling of missing data. Figure 4 shows examples of the input text (medication mentions) for Medication.Ingredient and Medication.doseForm, and the corresponding text-to-FHIR transformations produced by the LLMs.

Medication Mention → FHIR v4


```
Medication Discharge:
1. Albuterol Sulfate 0.083% Solution Sig: One (1) Inhalation Q6H (every 6 hours) as needed for wheezing.
```


Medication.Ingredient


"ingredient" : [{

    "item" : {

      "concept" : {

        "coding" : [{

          "system" : "http://www.nlm.nih.gov/research/umls/rxnorm",

          "display" : "Albuterol Sulfate"

        }]

      }

    },

    "strengthRatio" : {

      "numerator" : {

        "value" : 0.083,

        "system" : "http://unitsofmeasure.org",

        "code" : "%"

      },

      "denominator" : {

        "value" : 1,

        "system" : "http://unitsofmeasure.org",

        "code" : "mL"

      }

    }


Medication.doseForm


"doseForm" : {

"coding" : [{

   "system" : "http://snomed.info/sct",

   "code" : "420641004",

   "display" : "Solution for inhalation"

}]

  }

```
Medication Discharge:
3. Fluticasone-Salmeterol 250-50 mcg/Dose Disk with Device Sig: One (1) Disk with Device Inhalation [**Hospital1 **] (2 times a day).
```

Medication.Ingredient


"ingredient" : [{

"item" : {

   "concept" : {}

},

"strengthRatio" : {

   "numerator" : {

     "value" : 250,

     "system" : "http://unitsofmeasure.org",

     "code" : "μg"

   },

   "denominator" : {

     "value" : 1,

     "system" : "http://terminology.hl7.org/CodeSystem/v3-orderableDrugForm",

     "code" : "TAB"

   }

}

  },

  {

"item" : {

   "concept" : {}

},

"strengthRatio" : {

   "numerator" : {

     "value" : 50,

     "system" : "http://unitsofmeasure.org",

     "code" : "μg"

   },

   "denominator" : {

     "value" : 1,

     "system" : "http://terminology.hl7.org/CodeSystem/v3-orderableDrugForm",

     "code" : "TAB"  

   }

}

  }]


Medication.doseForm


"doseForm" : {

"coding" : [{

   "system" : "http://snomed.info/sct",

   "code" : "420641004",

   "display" : "Solution for inhalation"

}]

  }

Figure 4 Transformations produced with Claude 2.1 for Medication.Ingredient and Medication.doseForm. Transformations produced with GPT-3.5 were nearly identical.


GPT-3.5 and Claude 2.1 produce nearly identical outputs. In certain cases, GPT-3.5 assigns incorrect codes for medication strength (e.g., the strengthRatio.denominator.code for Fluticasone-Salmeterol 250-50 mcg/Dose produced by GPT-3.5 is 'DOSE', which is not a valid code in the drug form terminology specified under FHIR v4). GPT-3.5 also incorrectly infers that the applicable terminology is 'http://unitsofmeasure.org'. The deep domain knowledge required for such a task may not have been covered in the broad corpus of text on which the model was pre-trained. Plug-ins (discussed below) or retrieval-augmented generation methods may be better suited for such domain-specific text processing needs.


While GPT-3.5 is based on a decoder-only transformer, Claude 2.1 is reported to follow an encoder-decoder architecture. Unlike GPT-3.5, which was aligned via Reinforcement Learning from Human Feedback (RLHF), Claude 2.1 was trained with the self-supervision techniques referred to as Constitutional AI. Together, the training methodology and architecture are said to give Claude 2.1 greater control over the text generation mechanism than GPT-3.5 has. However, this wasn't immediately obvious in the text-to-FHIR transformation tasks that we carried out with both models. Model flexibility, as well as the ability to handle a much larger context, gives Claude 2.1 an edge in certain medical text processing tasks, such as question generation, that we will discuss in a later post.


Deployment

Production-grade text-to-FHIR conversion requires a few other pieces.


LLM responses are known to be sensitive to changes in the prompt. Prompt engineering is an iterative process, requiring incremental updates to the prompt text until the desired output is achieved consistently across test data. It therefore makes sense to compile robust prompts into a prompt library and share them with LLM application developers. For instance, a prompt library for text-to-FHIR transformations may curate the set of prompts needed to create a FHIR v4 profile, as sketched below.
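
A minimal sketch of what such a library might look like follows; the registry layout, the version field, and the placeholder templates are illustrative choices, not a prescribed design.

```
# A minimal sketch of a shared prompt library keyed by FHIR element.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptSpec:
    fhir_element: str  # e.g., "Medication.Ingredient"
    version: str       # bumped whenever the prompt is re-tuned
    template: str      # full prompt text with a {narrative} placeholder

PROMPT_LIBRARY = {
    "Medication.Ingredient": PromptSpec(
        fhir_element="Medication.Ingredient",
        version="1.3",
        template="[INSTRUCTIONS]\n...\n[NARRATIVE]\n{narrative}\n[TEMPLATE]\n...",
    ),
    "Medication.doseForm": PromptSpec(
        fhir_element="Medication.doseForm",
        version="1.1",
        template="[INSTRUCTIONS]\n...\n[NARRATIVE]\n{narrative}\n[TEMPLATE]\n...",
    ),
}

def render(element: str, narrative: str) -> str:
    """Fetch a vetted prompt by FHIR element and fill in the medication mention."""
    return PROMPT_LIBRARY[element].template.format(narrative=narrative)
```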


LLM output frequently requires post-processing before it can be piped to downstream applications. This can be achieved with the help of plug-ins that act as behind-the-scenes assistants to LLMs and perform specialized tasks. For example, the LLM output may need to be checked for the presence of required fields, correct data types, and clinical values lying within acceptable ranges. Running validation code, integrating with third-party services, and accessing external data stores are some common use cases for extending core LLM functionality with plug-ins.
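
As an example of what such a plug-in might check, the sketch below validates Medication.Ingredient output for required fields, numeric types, and known unit codes; the UCUM subset and the rules themselves are illustrative, and a check like this would catch the invalid 'DOSE' code discussed above.

```
# A sketch of post-processing that a validation plug-in might perform on
# LLM output before it is piped downstream. The rules are illustrative.
import json

KNOWN_UCUM_CODES = {"mg", "g", "mL", "%", "ug"}  # illustrative subset

def validate_ingredient(raw_llm_output: str) -> list[str]:
    """Return a list of validation errors; an empty list means the output passed."""
    try:
        doc = json.loads(raw_llm_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    errors = []
    for entry in doc.get("ingredient", []):
        ratio = entry.get("strengthRatio", {})
        for part in ("numerator", "denominator"):
            qty = ratio.get(part, {})
            value = qty.get("value")
            if not isinstance(value, (int, float)) or value <= 0:
                errors.append(f"{part}.value must be a positive number, got {value!r}")
            if (qty.get("system") == "http://unitsofmeasure.org"
                    and qty.get("code") not in KNOWN_UCUM_CODES):
                errors.append(f"{part}.code {qty.get('code')!r} is not a known UCUM code")
    return errors
```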


Lastly, it is important to address the concerns around data privacy. State-of-the-art LLMs are typically accessed via a public API endpoint and do not appear to be well suited for most text-processing tasks on patient data. Working with sensitive patient data requires accessing the advanced capabilities of LLMs inside a secure compute infrastructure that is fully compliant with all applicable data privacy laws. Smaller LLMs (also known as small language models or SLMs) have a considerably simpler architecture and can be deployed on enterprise-grade compute infrastructure. Domain-adapted SLMs have been shown to deliver performance equivalent to that of LLMs when trained on high-quality data (Javaheripi & Bubeck, 2023; Zhou, Li, Chen, & Li, 2023; Dettmers, Pagnoni, Holtzman, & Zettlemoyer, 2023), and could provide a solution for safely and ethically processing sensitive medical text.
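
As a sketch of what local deployment could look like, the snippet below runs Phi-2 (Javaheripi & Bubeck, 2023) with the Hugging Face transformers library; in practice, a domain-adapted checkpoint and a compliant, secured environment would be prerequisites.

```
# A minimal sketch of running a small language model entirely within a
# private environment using the Hugging Face transformers library.
# "microsoft/phi-2" is an example checkpoint; a domain-adapted SLM would
# replace it in practice.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "[INSTRUCTIONS]\nIdentify the dose form...\n[NARRATIVE]\n<de-identified text>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```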


Acknowledgements

We thank Niralee Gupta for her help in the preparation of this post.


References

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023, May 23). QLoRA: Efficient finetuning of quantized LLMs. arXiv. https://arxiv.org/abs/2305.14314

Henshall, W. (2023, July 18). What to Know About Claude 2, Anthropic’s Rival to ChatGPT. TIME. https://time.com/6295523/claude-2-anthropic-chatgpt/

Javaheripi, M., & Bubeck, S. (2023, December 16). Phi-2: The surprising power of small language models. Microsoft Research. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.

Li, Y., Wang, H., Yerebakan, H. Z., Shinagawa, Y., & Luo, Y. (2023). FHIR-GPT Enhances Health Interoperability with Large Language Models. medRxiv. https://doi.org/10.1101/2023.10.17.23297028

Medication Definitions - FHIR v4.0.0-cibuild. (n.d.). http://hl7.org/fhir/r4/medication-definitions.html

MedicationStatement - FHIR v4. (n.d.). http://hl7.org/fhir/r4/medicationstatement.html

SNOMED International. (n.d.). Use SNOMED. https://www.snomed.org

Zhou, Z., Li, L., Chen, X., & Li, A. (2023, July 17). Mini-giants: “small” language models and open source win-win. arXiv.org. https://doi.org/10.48550/arXiv.2307.08189



