Clinical concept normalization involves linking free-text concept mentions extracted from EHRs to standardized vocabularies like SNOMED CT or RxNorm. This process improves data quality and makes data accessible for downstream applications such as research and decision support. Generic concept normalization systems struggle with the unique challenges of medical texts, especially multilingual datasets and highly specialized medical terminologies.
We compared our solution to open-source alternatives on two common sub-tasks of concept normalization: Named Entity Recognition (NER) and Entity Linking (EL). Our goal was to evaluate how well open-source systems perform on clinical data compared to bespoke, domain-specific solutions tailored for healthcare settings, such as the LynxCare NLP pipeline.
A first baseline is provided by QuickUMLS, a system that relies on approximate string matching to map clinical text to Unified Medical Language System (UMLS) concepts. Given the extensive orthographic variation in EHR data, we hypothesized that QuickUMLS would perform poorly on clinical narratives.
For our second baseline, we developed a pipeline that uses a large language model (LLM) for concept extraction, followed by similarity search over Transformer-based embeddings for entity linking. Specifically, we prompt the Mistral 7B Instruct model for the NER task and use a pretrained SapBERT model for entity linking. These models, while not fine-tuned for specific domains, have shown impressive results on other biomedical datasets.
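The linking step of this baseline can be sketched as nearest-neighbor search by cosine similarity over concept embeddings. In the actual pipeline the encoder is SapBERT; the 4-dimensional vectors and concept identifiers below are made-up stand-ins so the example stays self-contained.

```python
import numpy as np

# Sketch of embedding-based entity linking: encode each mention, then pick
# the nearest concept by cosine similarity. The vectors and CUIs below are
# hypothetical placeholders, not real SapBERT embeddings.

concept_ids = ["C0027051", "C0153676", "C0000737"]
concept_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],   # stands in for "myocardial infarction"
    [0.1, 0.8, 0.3, 0.0],   # stands in for "lung metastases"
    [0.0, 0.2, 0.9, 0.2],   # stands in for "abdominal pain"
])

def link(mention_vec: np.ndarray) -> tuple[str, float]:
    """Return (concept_id, cosine similarity) of the nearest concept."""
    c = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    m = mention_vec / np.linalg.norm(mention_vec)
    sims = c @ m
    best = int(np.argmax(sims))
    return concept_ids[best], float(sims[best])

# A mention embedding near the second concept vector links to its CUI.
cui, sim = link(np.array([0.2, 0.7, 0.2, 0.1]))
print(cui)  # C0153676
```

In practice both mention and concept vectors would come from the same SapBERT encoder, and the search would run over hundreds of thousands of UMLS concepts rather than three.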
We selected four datasets (Mantra, Quaero, E3C and EHR) across three languages (English, Dutch, and French) for our experiments. These datasets represent different clinical domains, such as oncology and cardiology, and include annotated medical records, EMA documents, and other clinical texts.
Our results indicate that our customized components outperform generic systems, particularly on the EHR and E3C datasets, demonstrating the importance of tailored solutions for the specific subdomains.
For instance, on the Mantra dataset, the highest-scoring pipeline combined a custom NER model with SapBERT, significantly outperforming the Mistral + SapBERT pipeline by nearly 8 percent in F1 score. However, on datasets like Quaero and Mantra, the QuickUMLS pipeline also performed strongly, owing to their close alignment with the contents of the UMLS thesaurus.
Interestingly, while Mistral + SapBERT outperformed the custom pipeline on some precision scores, the fine-tuned custom model retains an advantage when handling discontinuous concepts, which are common in medical texts. These results underline the trade-off between precision and recall, as well as the need for customized models that understand the complex relationships expressed in clinical data.
Despite the promising results, several challenges remain in clinical concept normalization, particularly when dealing with discontinuous NER, clinical entity linking, and multilingual data. For example:
• Nested and Discontinuous Entities: Clinical texts often include nested concepts or mentions separated by unrelated words, such as “sharp chest and intermittent abdominal pain.” Few systems support extraction of these, yet they are highly frequent in clinical text.
• Lack of Open Domain-specific Benchmarks: The lack of subdomain-specific benchmarks, particularly for specialized areas like oncology or cardiology, makes it difficult to create and evaluate models that can effectively handle synonymous concepts. For instance, two different UMLS concept identifiers for "metastatic neoplasm" illustrate the difficulty in linking concepts with varying degrees of synonymy in medical ontologies.
• Multilingualism: Handling multilingual medical data remains a significant challenge, especially when translations do not capture the nuances of medical terminology. For example, translating “longmetastasen” from Dutch to English reveals a discrepancy between “lung metastases” and “pulmonary metastases,” which can affect the performance of entity linking systems.
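The discontinuous-entity challenge above can be made concrete with a toy heuristic that expands a coordinated mention such as "sharp chest and intermittent abdominal pain" into its two full spans. This sketch is not a method from our pipeline: it assumes a single conjunction and a shared final head noun, assumptions that real clinical text frequently violates, which is why learned or syntax-aware models are needed.

```python
def expand_coordination(mention: str, conj: str = " and ") -> list[str]:
    """Naively expand 'A and B HEAD' into ['A HEAD', 'B HEAD'].

    Toy heuristic only: assumes exactly one conjunction and that the
    final token is a head noun (e.g. 'pain') shared by both conjuncts.
    """
    if conj not in mention:
        return [mention]
    left, right = mention.split(conj, 1)
    head = right.split()[-1]  # shared head noun, e.g. "pain"
    return [f"{left} {head}", right]

print(expand_coordination("sharp chest and intermittent abdominal pain"))
# → ['sharp chest pain', 'intermittent abdominal pain']
```

Even this simple case shows why discontinuous mentions are hard: the gold concept "sharp chest pain" never appears as a contiguous span in the text, so span-based NER systems miss it entirely.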
Our research paper underscores the importance of domain-specific, bespoke models for clinical concept normalization tasks. While open-source solutions can provide baseline performance, particularly for highly common concepts, our results show that tailored models significantly outperform generic approaches, especially on real-world healthcare data.
Moreover, we emphasize the importance of smaller, scalable models that are fine-tuned to specific subdomains like oncology, cardiology, and psychiatry. As the field moves forward, the continued multilingual alignment of biomedical datasets will be crucial for democratizing healthcare research and improving clinical data accessibility.