GSoC Week 8
Background & Problem
The shortage of human-labeled EC number → SBO term pairs limits the training effectiveness and generalization of downstream machine learning (ML) models.
Solution
Leverage GPT-5’s biochemical knowledge to generate 8,000 high-quality EC–SBO pairs as a data-augmentation corpus, boosting downstream model performance.
Architecture
- Model choice: use GPT-5 (stronger reasoning) instead of GPT-4.
- Structured output: a unified JSON schema to ensure parsability and consistency (a sketch follows this list).
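As a concrete illustration, here is a minimal sketch of what such a unified schema might look like. The field names (ec_number, sbo_term, reason, keywords) are illustrative assumptions rather than the project’s actual schema; note how the reason and keywords fields anticipate the quality-control steps described below.

```python
# A minimal sketch of a unified JSON schema for one EC–SBO record.
# Field names here are assumptions for illustration, not the project's exact schema.
EC_SBO_SCHEMA = {
    "type": "object",
    "properties": {
        "ec_number": {"type": "string"},  # e.g. "1.1.1.1"
        "sbo_term": {"type": "string"},   # an SBO identifier chosen from the candidate list
        "reason": {"type": "string"},     # justification, used for consistency review
        "keywords": {
            "type": "array",              # extracted keywords for rule/retrieval checks
            "items": {"type": "string"},
        },
    },
    "required": ["ec_number", "sbo_term", "reason", "keywords"],
    "additionalProperties": False,
}
```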
Prompt Engineering
- Inject biochemical rules (e.g., NAD⁺/NADH redox patterns).
- Leaf-first strategy in the SBO ontology to avoid over-generalization.
- Candidate-list constraint: force selection from a provided list to reduce hallucinations (see the template sketch after this list).
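The sketch below shows how these three ideas might come together in a single prompt template. The wording and the helper name build_prompt are illustrative assumptions, not the project’s actual prompt.

```python
# A hedged sketch of a prompt combining rule injection, the leaf-first
# strategy, and the candidate-list constraint. Wording is illustrative only.
PROMPT_TEMPLATE = """\
You are an expert enzymologist mapping EC numbers to SBO terms.

Biochemical rules to apply:
- Oxidoreductases (EC 1.x) often follow NAD+/NADH redox patterns.

Strategy:
- Prefer the most specific (leaf) SBO term; fall back to a parent term
  only when no leaf fits.

Constraint:
- You MUST choose the SBO term from the candidate list below; do not
  invent identifiers.

EC number: {ec_number}
Candidate SBO terms:
{candidates}

Respond with JSON containing the fields ec_number, sbo_term, reason,
and keywords.
"""

def build_prompt(ec_number: str, candidates: list[str]) -> str:
    """Fill the template with one EC number and its candidate SBO terms."""
    return PROMPT_TEMPLATE.format(
        ec_number=ec_number,
        candidates="\n".join(f"- {c}" for c in candidates),
    )
```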
Quality Control
- Require a reason field to enable consistency review.
- Extract keywords to aid subsequent rule/retrieval checks.
- Error handling: format validation, retry/backoff, and isolation of problematic samples to keep the pipeline stable (a minimal sketch follows).
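A minimal sketch of this error-handling loop, assuming a caller-supplied call_model function that returns the model’s raw JSON string; the function name, retry count, and quarantine structure are all illustrative assumptions.

```python
import json
import random
import time

MAX_RETRIES = 3
REQUIRED_FIELDS = ("ec_number", "sbo_term", "reason", "keywords")

def validate(record: dict, candidates: set[str]) -> bool:
    """Format check: all fields present and the SBO term taken from the candidate list."""
    return (
        all(k in record for k in REQUIRED_FIELDS)
        and record["sbo_term"] in candidates
    )

def generate_pair(call_model, ec_number: str, candidates: set[str],
                  quarantine: list[dict]) -> dict | None:
    """Query the model with retry/backoff; quarantine samples that never validate."""
    for attempt in range(MAX_RETRIES):
        try:
            raw = call_model(ec_number, sorted(candidates))  # returns a JSON string
            record = json.loads(raw)
            if validate(record, candidates):
                return record
        except (json.JSONDecodeError, TimeoutError):
            pass  # malformed output or timeout: fall through to backoff and retry
        time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
    quarantine.append({"ec_number": ec_number})  # isolate the problematic sample
    return None
```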
Value (Data Augmentation)
- Larger training set: 8,000 high-quality pairs can significantly improve classification/retrieval models.
- Long-tail coverage: fills labeling gaps for rare EC numbers.
- Cost-effective: compared to manual labeling, AI-assisted generation is faster and cheaper.
This Week’s Outputs
- Completed batch generation and automatic QC for 8,000+ EC–SBO pairs.
- Finalized the structured-output and candidate-constraint templates, producing reusable production scripts.