Publication

NLP | Technical research paper | Evaluating the performance of LynxCare’s proprietary clinical NLP pipeline against two open-source alternatives

In our study, we evaluated the performance of LynxCare’s clinical NLP pipeline against two open-source alternatives. Our goal was to assess out-of-the-box performance on multilingual biomedical text, particularly electronic health record (EHR) data.

Continue reading below to learn more about our methodology and findings.

Download our full technical research paper by completing the form on the right.

In this article you’ll learn:

Methodology

Baselines & Open-Source Pipelines

QuickUMLS: A widely used concept-matching tool based on approximate string matching.
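
For illustration, here is a minimal sketch of how an approximate string-matching baseline such as QuickUMLS is typically invoked. The index path, matcher parameters, and sample sentence are assumptions for demonstration, not the exact configuration used in our study.

```python
# Minimal QuickUMLS usage sketch (illustrative; path and parameters are assumptions).
from quickumls import QuickUMLS

# Point the matcher at a local QuickUMLS index built from a UMLS release.
matcher = QuickUMLS("/path/to/quickumls_index", threshold=0.7, similarity_name="jaccard")

text = "Patient presents with atrial fibrillation and shortness of breath."

# Each candidate carries the matched n-gram, a UMLS CUI, and a similarity score.
for candidate_group in matcher.match(text, best_match=True, ignore_syntax=False):
    for candidate in candidate_group:
        print(candidate["ngram"], candidate["cui"], round(candidate["similarity"], 2))
```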

Mistral 7B Instruct (4-bit): We used this LLM for entity recognition with few-shot learning (a minimal prompting sketch follows below).

  o Exhibited output inconsistency, requiring multiple prompt repetitions for optimal results.

  o Its biomedical counterpart, BioMistral-7B, showed a higher tendency to hallucinate.
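
As an illustration of this setup, the sketch below loads a 4-bit-quantized Mistral 7B Instruct checkpoint and prompts it with a few-shot entity-recognition instruction. The model ID, prompt wording, and generation settings are illustrative assumptions, not the exact prompts used in the study; running it requires a CUDA GPU with the bitsandbytes library installed.

```python
# Few-shot entity recognition with a 4-bit quantized instruct LLM (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumption: any Mistral 7B Instruct checkpoint
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

# A few-shot prompt: two worked examples followed by the sentence to annotate.
prompt = (
    "Extract the medical concepts mentioned in the sentence.\n"
    "Sentence: The patient was started on metformin for type 2 diabetes.\n"
    "Concepts: metformin; type 2 diabetes\n"
    "Sentence: MRI showed no evidence of metastasis.\n"
    "Concepts: MRI; metastasis\n"
    "Sentence: She reports chest pain radiating to the left arm.\n"
    "Concepts:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Print only the newly generated tokens (the model's concept list).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```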

SapBERT: A multilingual language model, pre-trained on the UMLS 2020AB knowledge base, used to embed textual concept mentions and to perform entity linking (EL) based on similarity with embedded representations of codes.
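
To make the embedding-based linking step concrete, here is a minimal sketch of ranking candidate codes by cosine similarity between a mention embedding and embeddings of code descriptions. The checkpoint name, the mention, and the tiny candidate dictionary are illustrative assumptions, not our production setup.

```python
# Embedding-based entity linking sketch: embed a mention and rank candidate codes
# by cosine similarity. Checkpoint and candidate list are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumption: an English SapBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def embed(terms):
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return torch.nn.functional.normalize(out, dim=-1)

# Tiny illustrative candidate dictionary: UMLS CUI -> preferred term.
candidates = {"C0004238": "atrial fibrillation", "C0020538": "hypertensive disease"}
code_vecs = embed(list(candidates.values()))

mention_vec = embed(["afib"])
scores = mention_vec @ code_vecs.T  # cosine similarity (vectors are unit-normalized)
best = list(candidates)[scores.argmax().item()]
print("Linked code:", best)
```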

To further validate our findings, we conducted an ablation study, comparing:

• LynxCare’s custom, discontinuous NER (C-DNER) combined with SapBERT

• The Mistral LLM with LynxCare’s custom Entity Linking (C-EL)

Datasets & Languages

We selected four datasets spanning three languages (English, Dutch, and French):

• EHR dataset (Dutch) – 35 patient records (oncology & cardiology).

• Mantra GSC corpus (Dutch) – EMA subset (362 concepts).

• Quaero medical corpus (French) – EMA subset (1,970 concepts).

• E3C corpus (English) – Clinical records (2,389 concepts).

Evaluation Approach

• Named entity recognition (NER) was evaluated solely on overlap and concept identification rather than label assignment, due to the varying annotation schemes across the datasets.

• We measured the precision, recall, and F1-score of code assignments, comparing the codes extracted by each model to the ground-truth annotations, for all overlapping extractions from the NER step (a minimal scoring sketch follows below).
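
As a rough illustration of this evaluation, the sketch below computes micro-averaged precision, recall, and F1 over code assignments for predicted mentions that overlap a gold mention. The span and code tuples are made-up toy data, and this is a simplified sketch rather than the study's exact scoring script.

```python
# Illustrative scoring sketch: precision/recall/F1 over code assignments,
# restricted to predicted spans that overlap a gold span (toy data only).
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

# (start, end, code) tuples; the codes are illustrative UMLS CUIs.
gold = [(0, 19, "C0004238"), (24, 43, "C0013404")]
pred = [(0, 19, "C0004238"), (24, 30, "C0008031"), (50, 60, "C0020538")]

# Keep only predictions that overlap some gold mention (NER judged on overlap only).
matched = [(p, g) for p in pred for g in gold if overlaps(p[:2], g[:2])]

tp = sum(p[2] == g[2] for p, g in matched)          # correct code assignments
precision = tp / len(matched) if matched else 0.0   # over overlapping predictions
recall = tp / len(gold) if gold else 0.0            # over gold annotations
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```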

Results

View the video for a visual representation of our findings.

Findings for EHR Data (Dutch)

• Our pipeline outperforms all others on clinical narratives, highlighting the value of domain-specific fine-tuning.

• The Mistral + SapBERT pipeline slightly outperforms our model on precision, due to Mistral's preference for shorter (often single-word), flat concepts, which are easier to map but do not capture all of the information.

• Our custom DNER model identifies nested and discontinuous concepts, which is more expressive but leads to lower precision in some cases.

Findings for Mantra GSC (Dutch)

• The highest-scoring pipeline combines a custom DNER model + SapBERT, outperforming Mistral + SapBERT by nearly 8 percentage points in F1-score.

• This underscores the importance of custom entity recognition models even when combined with a generic entity linker.

Findings for Quaero (French)

• Interestingly, QuickUMLS performs best here, benefiting from its close alignment with UMLS content.

• Based on automated metrics, the DNER + C-EL pipeline underperforms compared to QuickUMLS. This is likely because it often extracts longer, more specific concepts, while 74% of Quaero’s annotations are unigrams.

• This highlights that differences in annotation and ranking standards across datasets significantly impact results.

Findings for E3C (English)

• Our Dutch fine-tuned model outperforms others, demonstrating strong cross-lingual generalization in clinical narratives.

• The best-performing approach also involved combining custom and open-source components, further emphasizing the importance of domain-specific adaptation.

Broader Implications

• Out-of-the-box LLM performance varies significantly depending on dataset language, annotation guidelines, and domain.

• Clinical NLP models require constant adaptation due to evolving medical knowledge, making static benchmarks insufficient.

• Expert validation remains crucial: final scores change based on medical experts' feedback and ontology adjustments.

• This study highlights the necessity of domain-specific customization in clinical NLP to ensure high-quality results in real-world healthcare applications.
