GSoC Week 7
Hybrid Retrieval for EC→SBO: A 96% In-Domain Win That Still Struggles to Generalize
I built a hybrid retrieval pipeline to map EC text to SBO terms: generate semantic embeddings, run BM25 lexical search, and fuse the two scores with a weighted sum. On in-domain validation, Top-1 accuracy reached 96%. But on unseen, out-of-distribution (OOD) ECs, predictions were largely incorrect, so generalization remains the key challenge.
Objective
Automatically match EC text to the most appropriate SBO term using hybrid retrieval: semantic embedding + BM25 with weighted sum fusion.
Data
- EC index: ec_index.csv (ec_number, text, depth)
- SBO index: sbo_index.csv (sbo_id, sbo_name, sbo_comment)
- Gold set: ec_to_sbo_full_fields_202508211435.csv (ec_num, sbo_id); a loading sketch follows this list
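For context, here is a minimal loading sketch. The column names are the ones listed above; the file paths, the pandas usage, and the concatenation into sbo_text (mirroring the Method section) are assumptions, not the project's exact code.

```python
import pandas as pd

# Hypothetical loading sketch: column names come from the post, everything else is assumed.
ec_df = pd.read_csv("ec_index.csv")                              # ec_number, text, depth
sbo_df = pd.read_csv("sbo_index.csv")                            # sbo_id, sbo_name, sbo_comment
gold_df = pd.read_csv("ec_to_sbo_full_fields_202508211435.csv")  # ec_num, sbo_id

# Build the retrieval text for each SBO term (name + comment, as described under Method).
sbo_df["sbo_text"] = (
    sbo_df["sbo_name"].fillna("") + " " + sbo_df["sbo_comment"].fillna("")
).str.strip()
```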
Method
- SentenceTransformer (all-MiniLM-L6-v2) to produce embeddings
- BM25Okapi over sbo_text = sbo_name + sbo_comment
- Weighted-sum fusion: score = α*vec + (1-α)*bm25, default α = 0.9 (see the pipeline sketch after this list)
- Cosine similarity with robust handling: safe L2 normalization, NaN/Inf cleaning, and clipping
- Evaluation via evaluate_topk: Top1_acc / Hit@K / MRR
- Error export via dump_top1_errors: writes top1_errors.csv and summarizes common True→Pred1 confusions
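Putting the pieces together, here is a simplified sketch of the retrieval step, continuing from the loading sketch above. It is an illustration under assumptions (whitespace tokenization, min-max scaling of the BM25 scores, and the hybrid_search helper are mine), not the project's exact code.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def safe_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize rows after cleaning NaN/Inf, so cosine similarity stays well-defined."""
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, eps, None)

# Index the SBO side: dense embeddings + BM25 over sbo_text.
model = SentenceTransformer("all-MiniLM-L6-v2")
sbo_texts = sbo_df["sbo_text"].tolist()
sbo_emb = safe_normalize(model.encode(sbo_texts, convert_to_numpy=True))
bm25 = BM25Okapi([t.lower().split() for t in sbo_texts])

def hybrid_search(query: str, alpha: float = 0.9, top_k: int = 5):
    """Weighted-sum fusion: score = alpha * cosine + (1 - alpha) * scaled BM25."""
    q_emb = safe_normalize(model.encode([query], convert_to_numpy=True))
    vec_scores = np.clip(sbo_emb @ q_emb[0], -1.0, 1.0)          # cosine similarity
    bm25_scores = np.asarray(bm25.get_scores(query.lower().split()))
    # Min-max scale BM25 so both signals live on a comparable [0, 1] range (an assumption).
    rng = bm25_scores.max() - bm25_scores.min()
    bm25_scaled = (bm25_scores - bm25_scores.min()) / rng if rng > 0 else np.zeros_like(bm25_scores)
    fused = alpha * vec_scores + (1 - alpha) * bm25_scaled
    order = np.argsort(-fused)[:top_k]
    return [(sbo_df["sbo_id"].iloc[i], float(fused[i])) for i in order]
```

With α = 0.9 the dense signal dominates the fused score, and BM25 mostly acts as a tiebreaker on exact keyword overlap.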
Results
- Top-1 accuracy (in-domain validation): 96%
- Unseen ECs: predictions mostly incorrect (generalization failure)
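For reference, the reported numbers correspond to metrics along these lines. This is an illustrative helper, not the project's evaluate_topk, whose signature may differ.

```python
def topk_metrics(gold_pairs, ranked_ids_per_query, k=5):
    """Top-1 accuracy, Hit@K, and MRR from ranked predictions (illustrative helper).

    gold_pairs: iterable of (ec_num, true_sbo_id)
    ranked_ids_per_query: dict mapping ec_num -> list of predicted sbo_ids, best first
    """
    gold_pairs = list(gold_pairs)
    top1 = hitk = rr_sum = 0.0
    for ec_num, true_id in gold_pairs:
        ranked = ranked_ids_per_query.get(ec_num, [])
        if ranked and ranked[0] == true_id:
            top1 += 1
        if true_id in ranked[:k]:
            hitk += 1
        if true_id in ranked:
            rr_sum += 1.0 / (ranked.index(true_id) + 1)
    n = max(len(gold_pairs), 1)
    return {"Top1_acc": top1 / n, f"Hit@{k}": hitk / n, "MRR": rr_sum / n}
```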
Why
- Dataset bias: only ~300 gold pairs with class imbalance. The model is effectively "memorizing" seen samples, and performance drops sharply off-distribution.
In short: the idea works and the in-domain metric looks strong, but generalization still needs targeted fixes. I’ll keep iterating on both data and evaluation splits to improve robustness.
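One way to make the validation split more honest, for example, is a group-aware split that holds out entire top-level EC classes. This is a hypothetical sketch using scikit-learn's GroupShuffleSplit; the grouping rule and variable names are assumptions, not something implemented yet.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical OOD-style split: group gold pairs by the top-level EC class
# (e.g. "1" in "1.1.1.1") so validation ECs come from classes never used for tuning.
groups = gold_df["ec_num"].str.split(".").str[0]
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(gold_df, groups=groups))
tune_gold, val_gold = gold_df.iloc[train_idx], gold_df.iloc[val_idx]
```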