Posts

GSoC Week 10

Overview
We built a deep-learning-based SBO biochemical reaction classification system that uses a BioBERT pretrained model with a two-stage training strategy to automatically classify 42 biochemical reaction categories. The system reveals important findings about handling heterogeneous data quality across sources and, through controlled comparisons, demonstrates the decisive role of high-quality labeled data.

Data Processing and Split Strategy
Two data sources were used: 331 human-labeled high-quality samples and 6,966 GPT-generated predictions. During preprocessing, we established a field backfilling mechanism that uses the SBO term dictionary as the authoritative standard to ensure label consistency across sources.

Split policy for the high-quality set (see the sketch below):
- First 100 samples → validation set
- Samples 101–200 → test set
- Remaining 131 → base training set
All 6,966 GPT-generated samples were retained as noise-adaptation training data for Stage 2. This design pr...
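Below is a minimal sketch of that split policy, assuming both sources are loaded with pandas; the file names and variables are illustrative, not the project's actual code.

```python
# Hypothetical sketch of the data split described above; paths are assumptions.
import pandas as pd

human = pd.read_csv("human_labeled.csv")    # 331 human-labeled, high-quality samples
gpt = pd.read_csv("gpt_predictions.csv")    # 6,966 GPT-generated samples

val_set = human.iloc[:100]        # first 100 samples -> validation
test_set = human.iloc[100:200]    # samples 101-200   -> test
base_train = human.iloc[200:]     # remaining 131     -> Stage 1 base training
stage2_train = gpt                # all GPT samples   -> Stage 2 noise adaptation

print(len(val_set), len(test_set), len(base_train), len(stage2_train))
```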

GSoC Week 9

Three production upgrades were shipped around data-source query strategy and annotation quality control:
- Config-driven query strategy management
- Target SBO term tracking
- Smart reaction filtering to reduce LLM ambiguity

1) Config-Driven Query Strategy Management
Presets (a configuration sketch follows below):
- Fast: BiGG only, for the fastest response
- Balanced: BiGG + KEGG, for a balance of recall and latency
- Comprehensive: all sources (BiGG/KEGG/SEED/Reactome), for maximum coverage
- Custom configuration: users can arbitrarily combine any of the four databases.
Interactive UI: supports live editing, input validation, and exception handling to prevent task failures due to misconfiguration.

2) Target SBO Term Tracking
Maintain a set of parent SBO terms to identify reactions that require further LLM processing. Perform in-pipeline monitoring during annotation; once a reaction matches a target term, immediately log details including the SBO term, the list of EC numbers, and annotation ...
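A minimal sketch of the config-driven presets: the preset names come from the post, but the structure and function names are my own illustration.

```python
# Hypothetical preset table mapping strategy names to the data sources to query.
PRESETS = {
    "fast": ["bigg"],                                        # BiGG only, fastest response
    "balanced": ["bigg", "kegg"],                            # recall/latency trade-off
    "comprehensive": ["bigg", "kegg", "seed", "reactome"],   # maximum coverage
}
ALLOWED_SOURCES = {"bigg", "kegg", "seed", "reactome"}

def resolve_sources(preset="balanced", custom=None):
    """Return the list of data sources to query, validating user input first."""
    if custom is not None:                      # custom configuration path
        invalid = set(custom) - ALLOWED_SOURCES
        if invalid:
            raise ValueError(f"Unknown data sources: {sorted(invalid)}")
        return list(custom)
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    return PRESETS[preset]
```

Validating the configuration up front is what keeps a misconfigured run from failing halfway through an annotation job.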

GSoC Week 8

Background & Problem
Insufficient human-labeled EC number → SBO term pairs limit downstream machine learning (ML) training effectiveness and generalization.

Solution
Leverage GPT-5's biochemical knowledge to generate 8,000 high-quality EC–SBO pairs as a data augmentation corpus to boost downstream model performance.

Architecture
- Model choice: use GPT-5 (stronger reasoning) instead of GPT-4.
- Structured output: a unified JSON schema to ensure parsability and consistency (a sketch follows below).

Prompt Engineering
- Inject biochemical rules (e.g., NAD⁺/NADH redox patterns).
- Leaf-first strategy in the SBO ontology to avoid over-generalization.
- Candidate list constraint: force selection from a provided list to reduce hallucinations.

Quality Control
- Require a reason field to enable consistency review.
- Extract keywords to aid subsequent rule/retrieval checks.
- Error handling: format validation, retry/backoff, and isolation of problematic samples to keep the p...
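A hypothetical sketch of the kind of unified output record and format check described above; the sbo_id, reason, and keywords fields follow the post, while the schema layout and helper are my own illustration.

```python
# Hypothetical JSON schema for one generated EC -> SBO record.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "ec_number": {"type": "string"},
        "sbo_id": {"type": "string"},      # must be chosen from the provided candidate list
        "reason": {"type": "string"},      # free-text justification for consistency review
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["ec_number", "sbo_id", "reason", "keywords"],
}

def validate_record(record: dict, candidates: set) -> bool:
    """Minimal format check: required fields present and sbo_id drawn from the candidates."""
    if not all(key in record for key in RESPONSE_SCHEMA["required"]):
        return False
    return record["sbo_id"] in candidates
```

Records that fail this check would be the ones routed to retry/backoff or isolated as problematic samples.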

GSoC Week 7

Hybrid Retrieval for EC→SBO: A 96% In-Domain Win That Still Struggles to Generalize
I built a hybrid retrieval pipeline to map EC text to SBO terms: generate semantic embeddings, run BM25 lexical search, then apply weighted-sum fusion. On in-domain validation, Top-1 accuracy reached 96%. But for unseen/OOD (out-of-distribution) ECs, predictions were largely incorrect, so generalization remains the key challenge.

Objective
Automatically match EC text to the most appropriate SBO term using hybrid retrieval: semantic embedding + BM25 with weighted-sum fusion.

Data
- EC index: ec_index.csv (ec_number, text, depth)
- SBO index: sbo_index.csv (sbo_id, sbo_name, sbo_comment)
- Gold set: ec_to_sbo_full_fields_202508211435.csv (ec_num, sbo_id)

Method (a minimal sketch follows below)
- SentenceTransformer (all-MiniLM-L6-v2) to produce embeddings
- BM25Okapi over sbo_text = sbo_name + sbo_comment
- Weighted-sum fusion: score = α*vec + (1-α)*bm25, default α = 0.9
- Cosine similarity ...
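A minimal sketch of the fusion step, assuming sbo_index.csv is loaded with pandas; the min-max scaling of BM25 scores before fusion is my assumption, not necessarily the pipeline's exact normalization.

```python
# Hybrid retrieval sketch: semantic embeddings + BM25, fused by weighted sum.
import numpy as np
import pandas as pd
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

sbo = pd.read_csv("sbo_index.csv")
sbo_text = (sbo["sbo_name"].fillna("") + " " + sbo["sbo_comment"].fillna("")).tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")
sbo_emb = model.encode(sbo_text, normalize_embeddings=True)   # unit vectors -> dot product = cosine
bm25 = BM25Okapi([text.lower().split() for text in sbo_text])

def rank_sbo(ec_text: str, alpha: float = 0.9) -> pd.DataFrame:
    """Score every SBO term for one EC description with weighted-sum fusion."""
    vec_scores = sbo_emb @ model.encode([ec_text], normalize_embeddings=True)[0]
    lex_scores = np.array(bm25.get_scores(ec_text.lower().split()))
    # Scale BM25 into [0, 1] so both signals are comparable before fusing (an assumption).
    if lex_scores.max() > lex_scores.min():
        lex_scores = (lex_scores - lex_scores.min()) / (lex_scores.max() - lex_scores.min())
    score = alpha * vec_scores + (1 - alpha) * lex_scores
    return sbo.assign(score=score).sort_values("score", ascending=False)
```

Top-1 accuracy is then just whether `rank_sbo(ec_text).iloc[0]["sbo_id"]` matches the gold sbo_id for that EC number.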

GSoC Week 6

To address the time issue, I introduced two key optimizations into the annotation pipeline (both sketched below):
- Early termination: as soon as a precise SBO term is obtained for the current reaction, stop querying further databases to avoid unnecessary requests.
- EC number truncation: truncate from the first non-digit character to enforce consistent EC formatting and reduce matching noise.
I then re-ran the full evaluation on 108 models.

Core Optimizations
Early Termination
- Goal: within the adapter chain, once a non-generic SBO (i.e., not SBO:0000176) is found, immediately stop querying other data sources to reduce I/O and network wait.
- Effect: significantly lowers total query volume and shortens overall wall time.

Results
- Across the 108 models, 3,317 reactions were converted from the generic SBO:0000176 to more specific SBO categories.
- Per-model average processing time: 432.99 seconds/model (≈ 7.22 minutes/model).
- Compared with the previous sequentia...
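A minimal sketch of both optimizations; the adapter interface is hypothetical, and the exact truncation rule (keeping leading digits and dots) is my reading of the post rather than the pipeline's literal code.

```python
# Hypothetical sketch of EC truncation and early termination in the adapter chain.
import re

GENERIC_SBO = "SBO:0000176"

def truncate_ec(ec: str) -> str:
    """Keep only the leading digits-and-dots part of an EC number,
    e.g. '1.1.1.1-RXN' -> '1.1.1.1' (the exact rule is an assumption)."""
    match = re.match(r"[\d.]+", ec)
    return match.group(0).rstrip(".") if match else ec

def annotate_reaction(reaction, adapters):
    """Query adapters in order and stop at the first non-generic SBO term."""
    for adapter in adapters:                   # e.g. BiGG, KEGG, SEED, Reactome
        sbo = adapter.lookup(reaction)         # hypothetical adapter interface
        if sbo and sbo != GENERIC_SBO:
            return sbo                         # early termination: precise term found
    return GENERIC_SBO
```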

GSoC Week 5

This week, I mainly focused on diagnosing why adding KEGG still failed to yield finer-grained SBO classifications across 108 models, extending the EC lookup pipeline, and experimenting with additional data sources. Here's the complete summary of the week's work.

Root-Cause Investigation
- Added an adviaecnumber() helper to attempt refinement via EC-number-driven mapping.
- Re-ran the enhanced pipeline on the 108 models; the reactions that remained in generic SBO classes were still missing EC numbers.
- Result: no additional reactions could be refined to more specific SBO classes.
- Conclusion: the main blocker is not the mapping rule itself but the absence of EC numbers for those reactions across the currently consulted sources (see the diagnostic sketch below).

New Data Sources Added
- SEED adapter: uses the Solr query interface with strict null filtering to avoid spurious entries.
- Reactome adapter: combines web parsing with the QuickGO API to perform a multi-step conversion...
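A minimal sketch of the root-cause check, assuming the per-reaction results are exported to a flat table; the file and column names are illustrative only.

```python
# Hypothetical diagnostic: how many still-generic reactions have no EC number at all?
import pandas as pd

reactions = pd.read_csv("annotation_results.csv")   # assumed export of the pipeline run

GENERIC_SBO = "SBO:0000176"
generic = reactions[reactions["sbo_term"] == GENERIC_SBO]
missing_ec = generic[generic["ec_numbers"].isna() | (generic["ec_numbers"] == "")]

print(f"{len(generic)} reactions still generic, "
      f"{len(missing_ec)} of them have no EC number from any consulted source")
```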