GSoC Week 7
Hybrid Retrieval for EC→SBO: A 96% In-Domain Win That Still Struggles to Generalize
I built a hybrid retrieval pipeline to map EC text to SBO terms: generate semantic embeddings, run BM25 lexical search, and fuse the two scores with a weighted sum. On in-domain validation, Top-1 accuracy reached 96%. But on unseen, out-of-distribution (OOD) ECs, predictions were largely incorrect, so generalization remains the key challenge.
Objective
Automatically match EC text to the most appropriate SBO term using hybrid retrieval: semantic embedding + BM25 with weighted sum fusion.
Data
- EC index: ec_index.csv (ec_number, text, depth)
- SBO index: sbo_index.csv (sbo_id, sbo_name, sbo_comment)
- Gold set: ec_to_sbo_full_fields_202508211435.csv (ec_num, sbo_id); a loading sketch follows this list
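For context, here is a minimal loading sketch. The column names are the ones listed above; the file paths, the pandas usage, and the concatenation into sbo_text (mirroring the Method section) are assumptions, not the project's exact code.

```python
import pandas as pd

# Hypothetical loading sketch: column names come from the post, everything else is assumed.
ec_df = pd.read_csv("ec_index.csv")                              # ec_number, text, depth
sbo_df = pd.read_csv("sbo_index.csv")                            # sbo_id, sbo_name, sbo_comment
gold_df = pd.read_csv("ec_to_sbo_full_fields_202508211435.csv")  # ec_num, sbo_id

# Build the retrieval text for each SBO term (name + comment, as described under Method).
sbo_df["sbo_text"] = (
    sbo_df["sbo_name"].fillna("") + " " + sbo_df["sbo_comment"].fillna("")
).str.strip()
```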
Method
- SentenceTransformer (all-MiniLM-L6-v2) to produce embeddings
- BM25Okapi over sbo_text = sbo_name + sbo_comment
- Weighted-sum fusion: score = α*vec + (1-α)*bm25, default α = 0.9 (see the pipeline sketch after this list)
- Cosine similarity with robust handling: safe L2 normalization, NaN/Inf cleaning, and clipping
- Evaluation via evaluate_topk: Top1_acc / Hit@K / MRR
- Error export via dump_top1_errors: writes top1_errors.csv and summarizes common True→Pred1 confusions
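Putting the pieces together, here is a simplified sketch of the retrieval step, continuing from the loading sketch above. It is an illustration under assumptions (whitespace tokenization, min-max scaling of the BM25 scores, and the hybrid_search helper are mine), not the project's exact code.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def safe_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize rows after cleaning NaN/Inf, so cosine similarity stays well-defined."""
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, eps, None)

# Index the SBO side: dense embeddings + BM25 over sbo_text.
model = SentenceTransformer("all-MiniLM-L6-v2")
sbo_texts = sbo_df["sbo_text"].tolist()
sbo_emb = safe_normalize(model.encode(sbo_texts, convert_to_numpy=True))
bm25 = BM25Okapi([t.lower().split() for t in sbo_texts])

def hybrid_search(query: str, alpha: float = 0.9, top_k: int = 5):
    """Weighted-sum fusion: score = alpha * cosine + (1 - alpha) * scaled BM25."""
    q_emb = safe_normalize(model.encode([query], convert_to_numpy=True))
    vec_scores = np.clip(sbo_emb @ q_emb[0], -1.0, 1.0)          # cosine similarity
    bm25_scores = np.asarray(bm25.get_scores(query.lower().split()))
    # Min-max scale BM25 so both signals live on a comparable [0, 1] range (an assumption).
    rng = bm25_scores.max() - bm25_scores.min()
    bm25_scaled = (bm25_scores - bm25_scores.min()) / rng if rng > 0 else np.zeros_like(bm25_scores)
    fused = alpha * vec_scores + (1 - alpha) * bm25_scaled
    order = np.argsort(-fused)[:top_k]
    return [(sbo_df["sbo_id"].iloc[i], float(fused[i])) for i in order]
```

With α = 0.9 the dense signal dominates the fused score, and BM25 mostly acts as a tiebreaker on exact keyword overlap.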
Results
- Top-1 accuracy (in-domain validation): 96%
- Unseen ECs: predictions mostly incorrect (generalization failure)
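For reference, the reported numbers correspond to metrics along these lines. This is an illustrative helper, not the project's evaluate_topk, whose signature may differ.

```python
def topk_metrics(gold_pairs, ranked_ids_per_query, k=5):
    """Top-1 accuracy, Hit@K, and MRR from ranked predictions (illustrative helper).

    gold_pairs: iterable of (ec_num, true_sbo_id)
    ranked_ids_per_query: dict mapping ec_num -> list of predicted sbo_ids, best first
    """
    gold_pairs = list(gold_pairs)
    top1 = hitk = rr_sum = 0.0
    for ec_num, true_id in gold_pairs:
        ranked = ranked_ids_per_query.get(ec_num, [])
        if ranked and ranked[0] == true_id:
            top1 += 1
        if true_id in ranked[:k]:
            hitk += 1
        if true_id in ranked:
            rr_sum += 1.0 / (ranked.index(true_id) + 1)
    n = max(len(gold_pairs), 1)
    return {"Top1_acc": top1 / n, f"Hit@{k}": hitk / n, "MRR": rr_sum / n}
```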
Why
- Dataset bias: only ~300 gold pairs with class imbalance. The model is effectively "memorizing" seen samples, and performance drops sharply off-distribution.
In short: the idea works and the in-domain metric looks strong, but generalization still needs targeted fixes. I’ll keep iterating on both data and evaluation splits to improve robustness.
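One way to make the validation split more honest, for example, is a group-aware split that holds out entire top-level EC classes. This is a hypothetical sketch using scikit-learn's GroupShuffleSplit; the grouping rule and variable names are assumptions, not something implemented yet.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical OOD-style split: group gold pairs by the top-level EC class
# (e.g. "1" in "1.1.1.1") so validation ECs come from classes never used for tuning.
groups = gold_df["ec_num"].str.split(".").str[0]
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(gold_df, groups=groups))
tune_gold, val_gold = gold_df.iloc[train_idx], gold_df.iloc[val_idx]
```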