GSoC Week 7


Hybrid Retrieval for EC→SBO: A 96% In-Domain Win That Still Struggles to Generalize


I built a hybrid retrieval pipeline that maps EC text to SBO terms: generate semantic embeddings, run BM25 lexical search, then fuse the two with a weighted sum. On in-domain validation, Top-1 accuracy reached 96%. For unseen / out-of-distribution (OOD) ECs, however, predictions were largely incorrect, so generalization remains the key challenge.

Objective

Automatically match EC text to the most appropriate SBO term using hybrid retrieval: semantic embedding + BM25 with weighted sum fusion.

Data

  • EC index: ec_index.csv (ec_number, text, depth)

  • SBO index: sbo_index.csv (sbo_id, sbo_name, sbo_comment)

  • Gold set: ec_to_sbo_full_fields_202508211435.csv (ec_num, sbo_id)

Method

  • SentenceTransformer (all-MiniLM-L6-v2) to produce embeddings

  • BM25Okapi over sbo_text = sbo_name + sbo_comment

  • WeightedSum fusion: score = α*vec + (1-α)*bm25, default α = 0.9

  • Cosine similarity with robust handling: safe L2 normalize, NaN/Inf cleaning, and clipping

  • Evaluation via evaluate_topk: Top1_acc / Hit@K / MRR

  • Error export via dump_top1_errors: writes top1_errors.csv and summarizes common True→Pred1 confusions
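The fusion and robust cosine handling described above can be sketched in pure NumPy. Note one assumption: the two score vectors are min-max scaled before the weighted sum so that BM25 scores (unbounded) and cosine scores (in [-1, 1]) live on a comparable scale; the post only specifies `score = α*vec + (1-α)*bm25`.

```python
import numpy as np

def safe_normalize(X, eps=1e-12):
    # Clean NaN/Inf, then L2-normalize each row; all-zero rows stay zero
    # instead of producing a divide-by-zero.
    X = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0, posinf=0.0, neginf=0.0)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def min_max(scores, eps=1e-12):
    # Rescale one query's candidate scores to [0, 1] (assumption: used so
    # cosine and BM25 scores are comparable before fusing).
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / max(hi - lo, eps)

def fuse(vec_scores, bm25_scores, alpha=0.9):
    # Weighted-sum fusion: alpha * semantic + (1 - alpha) * lexical.
    # vec_scores would come from cosine similarity of safe-normalized
    # embeddings; bm25_scores from BM25Okapi over sbo_text.
    fused = alpha * min_max(vec_scores) + (1 - alpha) * min_max(bm25_scores)
    return np.clip(fused, 0.0, 1.0)
```

With `alpha = 0.9` the semantic signal dominates, which matches the strong in-domain numbers but also means a weak embedding on an unseen EC can't be rescued by the lexical channel.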
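The three metrics reported by evaluate_topk can be illustrated with a minimal implementation; the signature below (ranked candidate lists plus gold IDs) is hypothetical, only the metric names come from the pipeline:

```python
def evaluate_topk(ranked_ids, gold_ids, k=5):
    """ranked_ids: one list of candidate SBO IDs per query, best first.
    gold_ids: the true SBO ID for each query (hypothetical interface)."""
    top1 = hits = rr_sum = 0.0
    for preds, gold in zip(ranked_ids, gold_ids):
        if preds and preds[0] == gold:
            top1 += 1                      # Top-1 accuracy
        if gold in preds[:k]:
            hits += 1                      # Hit@K
        if gold in preds:
            rr_sum += 1.0 / (preds.index(gold) + 1)  # reciprocal rank
    n = len(gold_ids)
    return {"Top1_acc": top1 / n, f"Hit@{k}": hits / n, "MRR": rr_sum / n}
```

Reporting Hit@K and MRR alongside Top-1 matters here: on OOD ECs the gold term often still appears somewhere in the ranking, so these softer metrics show how far off the retriever actually is.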

Results

  • Top-1 accuracy (in-domain validation): 96%

  • Unseen ECs: predictions mostly incorrect (generalization failure)

Why

  • Dataset bias: only ~300 gold pairs, with heavy class imbalance. The model is effectively "memorizing" seen samples, so performance drops sharply off-distribution.

In short: the idea works and the in-domain metric looks strong, but generalization still needs targeted fixes. I’ll keep iterating on both data and evaluation splits to improve robustness.
