In our study, we evaluated the performance of LynxCare’s clinical NLP pipeline against two open-source alternatives. Our goal was to assess out-of-the-box performance on multilingual biomedical text, particularly electronic health record (EHR) data.
We benchmarked the following tools and pipeline configurations:
• QuickUMLS: A widely used concept-matching tool based on approximate string matching (a minimal usage sketch follows this list).
• Mistral 7B Instruct (4-bit): We used this LLM for entity recognition via few-shot prompting (see the prompting sketch after this list).
o Exhibited inconsistent output, requiring multiple prompt repetitions for optimal results.
o Its biomedical counterpart, BioMistral-7B, showed a higher tendency to hallucinate.
• SapBERT: A multilingual language model pre-trained on the UMLS 2020AB knowledge base, used to embed textual concept mentions and to perform entity linking (EL) by similarity with embedded representations of concept codes (see the linking sketch after this list).
• LynxCare’s custom discontinuous NER (C-DNER) combined with SapBERT.
• The Mistral LLM combined with LynxCare’s custom entity linking (C-EL).
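For reference, here is a minimal QuickUMLS matching sketch. It assumes a local QuickUMLS index built from a UMLS release; the data path below is a placeholder, and the parameters shown are the library's documented defaults.

```python
# Minimal QuickUMLS usage sketch (placeholder data path).
from quickumls import QuickUMLS

matcher = QuickUMLS(
    "/path/to/quickumls_index",  # directory produced by the QuickUMLS installer
    threshold=0.7,               # minimum approximate-match similarity
    similarity_name="jaccard",   # string-similarity measure
)

text = "Patient presents with acute myocardial infarction."
for candidates in matcher.match(text, best_match=True, ignore_syntax=False):
    for c in candidates:
        print(c["ngram"], c["cui"], round(c["similarity"], 2))
```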
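The few-shot prompting setup for Mistral can be sketched as follows. This is illustrative only: the exact model revision, quantization settings, and prompt wording are assumptions, not the prompts used in our study.

```python
# Illustrative few-shot entity extraction with a 4-bit Mistral 7B Instruct.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed revision
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

prompt = (
    "Extract the medical concepts from the last sentence.\n"
    "Sentence: The patient was diagnosed with type 2 diabetes.\n"
    "Concepts: type 2 diabetes\n"
    "Sentence: MRI showed a lesion in the left temporal lobe.\n"
    "Concepts: lesion; left temporal lobe\n"
    "Sentence: He reports chest pain radiating to the left arm.\n"
    "Concepts:"
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The format of the generated concept list can drift between runs, which is the output inconsistency noted in the list above.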
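Entity linking with SapBERT reduces to nearest-neighbor search over concept-name embeddings. The sketch below uses the public multilingual SapBERT checkpoint trained on UMLS 2020AB; the two-concept dictionary is purely illustrative, whereas a real linker indexes millions of UMLS synonyms (e.g. with FAISS).

```python
# Embedding-based entity linking with multilingual SapBERT (toy dictionary).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] embeddings, as in the SapBERT paper

# Toy dictionary: UMLS concept code -> preferred name.
dictionary = {"C0027051": "myocardial infarction", "C0020538": "hypertension"}
codes, names = list(dictionary), list(dictionary.values())
code_vecs = embed(names)

mention_vec = embed(["hartinfarct"])        # Dutch mention
scores = (mention_vec @ code_vecs.T).squeeze(0)
print(codes[int(scores.argmax())])          # expected: C0027051
```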
We selected four datasets spanning three languages (English, Dutch, and French):
• EHR dataset (Dutch) – 35 patient records (oncology & cardiology).
• Mantra GSC corpus (Dutch) – EMEA subset (362 concepts).
• Quaero medical corpus (French) – EMEA subset (1,970 concepts).
• E3C corpus (English) – Clinical records (2,389 concepts).
• Named entity recognition (NER) was evaluated on span overlap and concept identification rather than on label assignment, because annotation schemes vary across the datasets.
• We measured precision, recall, and F1-score of code assignments by comparing the codes each model extracted against the ground-truth annotations, for all NER extractions that overlap a gold annotation (a sketch of this scoring follows below).
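A minimal sketch of one plausible reading of this scoring protocol follows; the paper's exact overlap and tie-breaking rules may differ.

```python
# Precision/recall/F1 over code assignments, restricted to predicted spans
# that overlap a gold span (one plausible reading of the protocol above).
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def evaluate_codes(predictions, gold):
    """predictions, gold: lists of ((start, end), code) pairs."""
    # Pair every prediction with the gold annotations its span overlaps.
    pairs = [
        (p_code, g_code)
        for p_span, p_code in predictions
        for g_span, g_code in gold
        if overlaps(p_span, g_span)
    ]
    tp = sum(p == g for p, g in pairs)
    precision = tp / len(pairs) if pairs else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```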
Key findings from our evaluation:
• Our pipeline outperforms all others on clinical narratives, highlighting the value of domain-specific fine-tuning.
• The Mistral + SapBERT pipeline slightly outperforms our model on precision, owing to Mistral's preference for shorter, flat concepts (often single words), which are easier to map but do not capture all of the information.
• Our custom DNER model identifies nested and discontinuous concepts; this is more expressive but leads to lower precision in some cases.
• The highest-scoring pipeline combines a custom DNER model + SapBERT, outperforming Mistral + SapBERT by nearly 8 percentage points in F1-score.
• This underscores the importance of custom entity recognition models even when combined with a generic entity linker.
• Interestingly, QuickUMLS performs best on the French Quaero corpus, benefiting from its close alignment with UMLS content.
• Based on automated metrics, the DNER + C-EL pipeline underperforms QuickUMLS here, likely because it often extracts longer, more specific concepts, while 74% of Quaero’s annotations are unigrams.
• This highlights that differences in annotation and ranking standards across datasets significantly impact results.
• Our Dutch fine-tuned model outperforms others, demonstrating strong cross-lingual generalization in clinical narratives.
• The best-performing approach also involved combining custom and open-source components, further emphasizing the importance of domain-specific adaptation.
• Out-of-the-box LLM performance varies significantly depending on dataset language, annotation guidelines, and domain.
• Clinical NLP models require constant adaptation due to evolving medical knowledge, making static benchmarks insufficient.
• Expert validation remains crucial—final scores change based on medical experts' feedback and ontology adjustments.
• This study highlights the necessity of domain-specific customization in clinical NLP to ensure high-quality results in real-world healthcare applications.