GSoC Week 7

Hybrid Retrieval for EC→SBO: A 96% In-Domain Win That Still Struggles to Generalize


I built a hybrid retrieval pipeline that maps EC text to SBO terms: semantic embeddings and BM25 lexical search, combined by weighted-sum fusion. On in-domain validation, Top-1 accuracy reached 96%. For unseen, out-of-distribution (OOD) ECs, however, predictions were largely incorrect, so generalization remains the key challenge.

Objective

Automatically match EC text to the most appropriate SBO term using hybrid retrieval: semantic embedding + BM25 with weighted sum fusion.

Data

  • EC index: ec_index.csv (ec_number, text, depth)

  • SBO index: sbo_index.csv (sbo_id, sbo_name, sbo_comment)

  • Gold set: ec_to_sbo_full_fields_202508211435.csv (ec_num, sbo_id)
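
For reference, a minimal loading sketch with pandas, assuming the column names listed above (the variable names are just illustrative):

```python
import pandas as pd

# Load the three tables described above (columns as listed).
ec_df = pd.read_csv("ec_index.csv")    # ec_number, text, depth
sbo_df = pd.read_csv("sbo_index.csv")  # sbo_id, sbo_name, sbo_comment
gold_df = pd.read_csv("ec_to_sbo_full_fields_202508211435.csv")  # ec_num, sbo_id

# sbo_text = sbo_name + sbo_comment, the field both BM25 and the embedder index.
sbo_df["sbo_text"] = (
    sbo_df["sbo_name"].fillna("") + " " + sbo_df["sbo_comment"].fillna("")
).str.strip()
```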

Method

  • SentenceTransformer (all-MiniLM-L6-v2) to produce embeddings (the retrieval core is sketched in code after this list)

  • BM25Okapi over sbo_text = sbo_name + sbo_comment

  • WeightedSum fusion: score = α*vec + (1-α)*bm25, default α = 0.9

  • Cosine similarity with robust handling: safe L2 normalization, NaN/Inf cleaning, and clipping

  • Evaluation via evaluate_topk: Top1_acc / Hit@K / MRR

  • Error export via dump_top1_errors: writes top1_errors.csv and summarizes common True→Pred1 confusions
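
Putting the retrieval core together, here is a condensed sketch assuming BM25Okapi comes from the rank_bm25 package. The helper names (`safe_normalize`, `hybrid_scores`) and the min-max scaling of BM25 scores before fusion are my illustrative choices, not necessarily the exact code in the repo:

```python
import numpy as np
import pandas as pd
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

ec_df = pd.read_csv("ec_index.csv")
sbo_df = pd.read_csv("sbo_index.csv")
sbo_df["sbo_text"] = (
    sbo_df["sbo_name"].fillna("") + " " + sbo_df["sbo_comment"].fillna("")
).str.strip()


def safe_normalize(mat: np.ndarray) -> np.ndarray:
    """L2-normalize rows after cleaning NaN/Inf, guarding against zero norms."""
    mat = np.nan_to_num(mat, nan=0.0, posinf=0.0, neginf=0.0)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.clip(norms, 1e-12, None)


# Dense side: embed all SBO texts once with all-MiniLM-L6-v2.
model = SentenceTransformer("all-MiniLM-L6-v2")
sbo_emb = safe_normalize(model.encode(sbo_df["sbo_text"].tolist(), convert_to_numpy=True))

# Lexical side: BM25 over whitespace-tokenized, lowercased SBO text.
bm25 = BM25Okapi([t.lower().split() for t in sbo_df["sbo_text"]])


def hybrid_scores(ec_text: str, alpha: float = 0.9) -> np.ndarray:
    """WeightedSum fusion: alpha * cosine + (1 - alpha) * bm25."""
    q = safe_normalize(model.encode([ec_text], convert_to_numpy=True))
    cos = np.clip(sbo_emb @ q[0], -1.0, 1.0)       # cosine similarity, clipped
    bm = bm25.get_scores(ec_text.lower().split())
    spread = bm.max() - bm.min()                   # min-max scale BM25 to [0, 1]
    bm = (bm - bm.min()) / spread if spread > 0 else np.zeros_like(bm)
    return alpha * cos + (1.0 - alpha) * bm


# Usage: top-5 SBO candidates for the first EC description in the index.
scores = hybrid_scores(ec_df["text"].iloc[0])
print(sbo_df["sbo_id"].iloc[np.argsort(-scores)[:5]].tolist())
```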

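Continuing that sketch, an evaluation loop in the spirit of evaluate_topk and dump_top1_errors (the real functions in the repo may be structured differently; the column names written to top1_errors.csv are illustrative):

```python
def evaluate_topk(gold: pd.DataFrame, k: int = 5) -> dict:
    """Top1_acc / Hit@K / MRR over (ec_num, sbo_id) gold pairs, plus a Top-1 error dump."""
    ec_text = ec_df.set_index("ec_number")["text"]
    sbo_ids = sbo_df["sbo_id"].tolist()
    top1 = hits = rr = 0.0
    errors = []
    for ec_num, true_sbo in zip(gold["ec_num"], gold["sbo_id"]):
        ranked = [sbo_ids[i] for i in np.argsort(-hybrid_scores(ec_text[ec_num]))]
        rank = ranked.index(true_sbo) + 1 if true_sbo in ranked else None
        top1 += ranked[0] == true_sbo
        hits += rank is not None and rank <= k
        rr += 1.0 / rank if rank else 0.0
        if ranked[0] != true_sbo:
            errors.append({"ec_num": ec_num, "true": true_sbo, "pred1": ranked[0]})
    pd.DataFrame(errors).to_csv("top1_errors.csv", index=False)
    n = len(gold)
    return {"Top1_acc": top1 / n, f"Hit@{k}": hits / n, "MRR": rr / n}


gold_df = pd.read_csv("ec_to_sbo_full_fields_202508211435.csv")  # ec_num, sbo_id
print(evaluate_topk(gold_df))
```
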
Results

  • Top-1 accuracy (in-domain validation): 96%

  • Unseen ECs: predictions mostly incorrect (generalization failure)

Why

  • Dataset bias: only ~300 gold pairs with heavy class imbalance, so the model effectively “memorizes” the seen samples, and performance drops sharply off-distribution.

In short: the idea works and the in-domain metric looks strong, but generalization still needs targeted fixes. I’ll keep iterating on both data and evaluation splits to improve robustness.
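
One concrete direction for the evaluation-split side (not yet in the repo; just a sketch of what I have in mind) is to hold out entire top-level EC classes so that validation ECs are genuinely out-of-distribution, e.g. with scikit-learn's GroupShuffleSplit:

```python
from sklearn.model_selection import GroupShuffleSplit

# Group the gold pairs by top-level EC class (the digit before the first dot),
# so validation ECs come from classes never seen during tuning.
groups = gold_df["ec_num"].astype(str).str.split(".").str[0]
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(gss.split(gold_df, groups=groups))
train_gold, val_gold = gold_df.iloc[train_idx], gold_df.iloc[val_idx]
print(evaluate_topk(val_gold))  # stricter, OOD-style validation number
```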
