GSoC Week 8

Background & Problem

The scarcity of human-labeled EC number → SBO term pairs limits the training effectiveness and generalization of downstream machine learning (ML) models.

Solution

Leverage GPT-5’s biochemical knowledge to generate 8,000 high-quality EC–SBO pairs as a data-augmentation corpus that boosts downstream model performance.

Architecture

  • Model choice: Use GPT-5 (stronger reasoning) instead of GPT-4.

  • Structured output: a unified JSON schema keeps every response parsable and consistent (a minimal sketch follows below).
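
To make the schema concrete, here is a minimal sketch of the kind of record each call is expected to return. The field names and example values are illustrative, not the project's exact schema; the reason and keywords fields are discussed under Quality Control below.

```python
# Illustrative record matching the unified JSON schema.
# Field names and values are examples only.
EXAMPLE_RECORD = {
    "ec_number": "1.1.1.1",       # alcohol dehydrogenase
    "sbo_term": "SBO:0000200",    # must come from the supplied candidate list
    "sbo_label": "redox reaction",
    "reason": "EC 1.1.1.1 catalyzes NAD+-dependent oxidation of alcohols, "
              "a classic redox pattern.",
    "keywords": ["oxidoreductase", "NAD+", "redox"],
}
```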

Prompt Engineering

  • Inject biochemical rules (e.g., NAD⁺/NADH redox patterns).

  • Leaf-first strategy in the SBO ontology to avoid over-generalization.

  • Candidate-list constraint: force the model to select from a provided list of SBO terms to reduce hallucinations (see the prompt sketch below).
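
Putting the three tactics together, a prompt might look like the hypothetical template below. The wording and names (PROMPT_TEMPLATE, {ec_number}, {candidates}) are my sketch of the approach, not the exact production prompt.

```python
# Hypothetical prompt combining rule injection, the leaf-first
# instruction, and the candidate-list constraint.
PROMPT_TEMPLATE = """\
You are a biochemistry expert mapping EC numbers to SBO terms.

Rules:
- NAD+/NADH (or NADP+/NADPH) cofactor usage indicates a redox reaction.
- Prefer the most specific (leaf) SBO term that applies; avoid broad parents.
- You MUST pick exactly one SBO term from the candidate list below.

EC number: {ec_number}
Candidate SBO terms:
{candidates}

Respond with JSON only, using the keys ec_number, sbo_term, sbo_label,
reason, and keywords.
"""

prompt = PROMPT_TEMPLATE.format(
    ec_number="1.1.1.1",
    candidates="\n".join([
        "SBO:0000200 redox reaction",
        "SBO:0000176 biochemical reaction",
    ]),
)
```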

Quality Control

  • Require a reason field to enable consistency review.

  • Extract keywords to aid subsequent rule/retrieval checks.

  • Error handling: format validation, retry with exponential backoff, and isolation of problematic samples keep the pipeline stable (see the sketch below).
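
The error-handling loop is roughly the following sketch. Here call_model is a hypothetical stand-in for the actual GPT-5 API call, and the field names match the schema sketch above.

```python
import json
import time

REQUIRED_FIELDS = {"ec_number", "sbo_term", "sbo_label", "reason", "keywords"}

def validate(record, candidates):
    """Format check: record is a dict, all fields are present, and the
    chosen SBO term actually comes from the supplied candidate list."""
    return (isinstance(record, dict)
            and REQUIRED_FIELDS <= record.keys()
            and record["sbo_term"] in candidates)

def generate_with_qc(call_model, prompt, candidates, max_retries=3):
    """Retry with exponential backoff; return None so the caller can
    quarantine samples that never produce a valid record."""
    for attempt in range(max_retries):
        try:
            record = json.loads(call_model(prompt))
            if validate(record, candidates):
                return record
        except json.JSONDecodeError:
            pass                      # malformed JSON counts as a failed attempt
        time.sleep(2 ** attempt)      # backoff: 1s, 2s, 4s, ...
    return None                       # isolate for manual review
```

Returning None instead of raising keeps one bad sample from stalling the whole batch; the caller logs it to a quarantine set for later inspection.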

Value (Data Augmentation)

  • Larger training set: 8,000 high-quality pairs can significantly improve classification/retrieval models.

  • Long-tail coverage: fills labeling gaps for rare EC numbers.

  • Cost-effective: compared to manual labeling, AI-assisted generation is faster and cheaper.

This Week’s Outputs

  • Completed batch generation and auto-QC for 8,000+ EC–SBO pairs.

  • Finalized the structured-output and candidate-constraint prompt templates and turned them into reusable production scripts.
