Hybrid Retrieval for EC→SBO: A 96% In-Domain Win That Still Struggles to Generalize

I built a hybrid retrieval pipeline to map EC text to SBO terms: generate semantic embeddings, run BM25 lexical search, then combine the two with weighted-sum fusion. On in-domain validation, Top-1 accuracy reached 96%. But for unseen, out-of-distribution (OOD) ECs, predictions were largely incorrect, so generalization remains the key challenge.

Objective

Automatically match EC text to the most appropriate SBO term using hybrid retrieval: semantic embedding + BM25 with weighted-sum fusion.

Data

- EC index: ec_index.csv (ec_number, text, depth)
- SBO index: sbo_index.csv (sbo_id, sbo_name, sbo_comment)
- Gold set: ec_to_sbo_full_fields_202508211435.csv (ec_num, sbo_id)

Method

- SentenceTransformer (all-MiniLM-L6-v2) to produce embeddings
- BM25Okapi over sbo_text = sbo_name + sbo_comment
- WeightedSum fusion: score = α*vec + (1-α)*bm25, default α = 0.9
- Cosine similarity ...
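
To make the weighted-sum fusion step from the Method list concrete, here is a minimal sketch in plain Python. It takes per-candidate scores that are already computed (cosine similarities from the embedding model and raw BM25 scores) and applies score = α*vec + (1-α)*bm25 with α = 0.9. The function names (`min_max`, `weighted_sum_fusion`, `top_k`), the candidate labels, and the example scores are all hypothetical. The min-max rescaling is also an assumption on my part: the write-up does not say whether scores are normalized before fusion, but raw BM25 scores are unbounded while cosine similarity is not, so some rescaling is a common choice.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_sum_fusion(
    vec_scores: Sequence[float],
    bm25_scores: Sequence[float],
    alpha: float = 0.9,
    normalize: bool = True,  # assumption: not stated in the write-up
) -> List[float]:
    """Fuse two score lists as alpha * vec + (1 - alpha) * bm25."""
    if len(vec_scores) != len(bm25_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if normalize:
        vec_scores = min_max(vec_scores)
        bm25_scores = min_max(bm25_scores)
    return [alpha * v + (1.0 - alpha) * b
            for v, b in zip(vec_scores, bm25_scores)]


def top_k(
    candidate_ids: Sequence[str], fused: Sequence[float], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k candidates with the highest fused score."""
    ranked = sorted(zip(candidate_ids, fused),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Illustrative, made-up scores for three hypothetical SBO candidates.
candidates = ["sbo_a", "sbo_b", "sbo_c"]
vec = [0.2, 0.8, 0.5]     # cosine similarities from the embedding model
bm25 = [12.0, 3.0, 7.5]   # raw BM25 scores

fused = weighted_sum_fusion(vec, bm25, alpha=0.9)
best = top_k(candidates, fused, k=1)
```

With α = 0.9 the embedding score dominates, so the candidate the embedding model prefers wins even when BM25 ranks it last; setting α = 0 reduces the ranking to BM25 alone. If fusion is meant to operate on raw scores, pass `normalize=False`.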