GSoC Week 8
Background & Problem
The shortage of human-labeled EC number → SBO term pairs limits the training effectiveness and generalization of downstream machine learning (ML) models.
Solution
Leverage GPT-5’s biochemical knowledge to generate 8,000 high-quality EC–SBO pairs as a data-augmentation corpus, boosting downstream model performance.
Architecture
- Model choice: use GPT-5 (stronger reasoning) instead of GPT-4.
- Structured output: a unified JSON schema to ensure parsability and consistency (a sketch follows this list).
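As a concrete illustration, here is a minimal sketch of what such a unified schema might look like. The field names (ec_number, sbo_term, reason, keywords) are illustrative assumptions rather than the project’s actual schema; note how the reason and keywords fields anticipate the quality-control steps described below.

```python
# A minimal sketch of a unified JSON schema for one EC–SBO record.
# Field names here are assumptions for illustration, not the project's exact schema.
EC_SBO_SCHEMA = {
    "type": "object",
    "properties": {
        "ec_number": {"type": "string"},  # e.g. "1.1.1.1"
        "sbo_term": {"type": "string"},   # an SBO identifier chosen from the candidate list
        "reason": {"type": "string"},     # justification, used for consistency review
        "keywords": {
            "type": "array",              # extracted keywords for rule/retrieval checks
            "items": {"type": "string"},
        },
    },
    "required": ["ec_number", "sbo_term", "reason", "keywords"],
    "additionalProperties": False,
}
```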
Prompt Engineering
- Inject biochemical rules (e.g., NAD⁺/NADH redox patterns).
- Leaf-first strategy in the SBO ontology to avoid over-generalization.
- Candidate-list constraint: force selection from a provided list to reduce hallucinations (see the template sketch after this list).
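The sketch below shows how these three ideas might come together in a single prompt template. The wording and the helper name build_prompt are illustrative assumptions, not the project’s actual prompt.

```python
# A hedged sketch of a prompt combining rule injection, the leaf-first
# strategy, and the candidate-list constraint. Wording is illustrative only.
PROMPT_TEMPLATE = """\
You are an expert enzymologist mapping EC numbers to SBO terms.

Biochemical rules to apply:
- Oxidoreductases (EC 1.x) often follow NAD+/NADH redox patterns.

Strategy:
- Prefer the most specific (leaf) SBO term; fall back to a parent term
  only when no leaf fits.

Constraint:
- You MUST choose the SBO term from the candidate list below; do not
  invent identifiers.

EC number: {ec_number}
Candidate SBO terms:
{candidates}

Respond with JSON containing the fields ec_number, sbo_term, reason,
and keywords.
"""

def build_prompt(ec_number: str, candidates: list[str]) -> str:
    """Fill the template with one EC number and its candidate SBO terms."""
    return PROMPT_TEMPLATE.format(
        ec_number=ec_number,
        candidates="\n".join(f"- {c}" for c in candidates),
    )
```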
Quality Control
- Require a reason field to enable consistency review.
- Extract keywords to aid subsequent rule/retrieval checks.
- Error handling: format validation, retry/backoff, and isolation of problematic samples to keep the pipeline stable (a minimal sketch follows).
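A minimal sketch of this error-handling loop, assuming a caller-supplied call_model function that returns the model’s raw JSON string; the function name, retry count, and quarantine structure are all illustrative assumptions.

```python
import json
import random
import time

MAX_RETRIES = 3
REQUIRED_FIELDS = ("ec_number", "sbo_term", "reason", "keywords")

def validate(record: dict, candidates: set[str]) -> bool:
    """Format check: all fields present and the SBO term taken from the candidate list."""
    return (
        all(k in record for k in REQUIRED_FIELDS)
        and record["sbo_term"] in candidates
    )

def generate_pair(call_model, ec_number: str, candidates: set[str],
                  quarantine: list[dict]) -> dict | None:
    """Query the model with retry/backoff; quarantine samples that never validate."""
    for attempt in range(MAX_RETRIES):
        try:
            raw = call_model(ec_number, sorted(candidates))  # returns a JSON string
            record = json.loads(raw)
            if validate(record, candidates):
                return record
        except (json.JSONDecodeError, TimeoutError):
            pass  # malformed output or timeout: fall through to backoff and retry
        time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
    quarantine.append({"ec_number": ec_number})  # isolate the problematic sample
    return None
```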
Value (Data Augmentation)
- Larger training set: 8,000 high-quality pairs can significantly improve classification/retrieval models.
- Long-tail coverage: fills labeling gaps for rare EC numbers.
- Cost-effective: compared to manual labeling, AI-assisted generation is faster and cheaper.
This Week’s Outputs
- Completed batch generation and automatic QC for 8,000+ EC–SBO pairs.
- Finalized the structured-output and candidate-constraint templates, producing reusable production scripts.