GSoC Week 10


Overview

We built a deep-learning-based SBO biochemical reaction classification system that uses a pretrained BioBERT model and a two-stage training strategy to automatically classify reactions into 42 SBO categories. The system yields important findings about handling heterogeneous data quality across sources and, through controlled comparisons, demonstrates the decisive role of high-quality labeled data.

Data Processing and Split Strategy

Two data sources were used: 331 human-labeled high-quality samples and 6,966 GPT-generated predictions.
During preprocessing, we established a field backfilling mechanism using the SBO term dictionary as the authoritative standard to ensure label consistency across sources.
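
As a minimal sketch of this step, assuming each record is a dict keyed by `sbo_id`, `name`, and `definition`, and that `SBO_TERMS` maps SBO IDs to their canonical fields (both are hypothetical stand-ins for the project's actual schema):

```python
# Minimal sketch of the backfilling step. SBO_TERMS and the record keys
# are hypothetical stand-ins for the project's actual SBO term dictionary
# and sample schema.
SBO_TERMS = {
    "SBO:0000217": {"name": "protonation", "definition": "addition of a proton"},
    # ... remaining entries loaded from the SBO ontology
}

def backfill(sample: dict) -> dict:
    """Overwrite label fields with the canonical SBO values so that
    human-labeled and GPT-generated records agree on every field."""
    canonical = SBO_TERMS.get(sample["sbo_id"])
    if canonical is None:
        raise KeyError(f"unknown SBO term: {sample['sbo_id']}")
    for field in ("name", "definition"):
        sample[field] = canonical[field]  # the dictionary is authoritative
    return sample
```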

Split policy for the high-quality set:

  • First 100 samples → validation set

  • Samples 101–200 → test set

  • Remaining 131 → base training set

All 6,966 GPT-generated samples were retained as noise-adaptation training data for Stage 2. This design preserves high evaluation quality while enabling experiments that leverage large-scale data. The final dataset contains 36 effective SBO classes, covering major biochemical processes such as hydroxylation, acetylation, and deamination.
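
Expressed as code, the policy is a simple positional slice; `hq_samples` and `gpt_samples` are hypothetical names for the two record lists:

```python
def split_high_quality(hq_samples: list, gpt_samples: list):
    """Split policy described above; argument names are hypothetical."""
    val_set   = hq_samples[:100]     # first 100 -> validation
    test_set  = hq_samples[100:200]  # samples 101-200 -> test
    train_set = hq_samples[200:]     # remaining 131 -> base training (Stage 1)
    noise_set = gpt_samples          # all 6,966 GPT predictions (Stage 2 only)
    return train_set, val_set, test_set, noise_set
```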

Model Architecture and Implementation

  • Encoder: dmis-lab/biobert-base-cased-v1.1, chosen for its strength in biomedical text understanding.

  • Classifier head: two fully connected layers mapping 768 → 384 → 42 categories.

  • Parameter count: 110,994,872, balancing performance and computational cost.

  • Optimization: AdamW with a linear warm-up scheduler (warm-up ratio 10%).

  • Class imbalance: integrated Focal Loss (a sketch of the head, loss, and optimizer setup follows this list).
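
To make the setup concrete, here is a minimal sketch using the PyTorch and Hugging Face transformers stack; the dropout rate, base learning rate, and total step count are assumptions, not values taken from the actual training runs:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, get_linear_schedule_with_warmup

class SBOClassifier(nn.Module):
    """BioBERT encoder with a two-layer head: 768 -> 384 -> 42."""
    def __init__(self, num_classes: int = 42):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
        self.head = nn.Sequential(
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Dropout(0.1),  # dropout rate is an assumption
            nn.Linear(384, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # classify on [CLS]

class FocalLoss(nn.Module):
    """Focal loss to down-weight easy examples; gamma=2.0 is an assumed default."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = nn.functional.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)  # estimated probability of the true class
        return ((1.0 - pt) ** self.gamma * ce).mean()

model = SBOClassifier()
criterion = FocalLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is an assumption
total_steps = 1000  # placeholder; epochs * batches per epoch in practice
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% warm-up, as above
    num_training_steps=total_steps,
)
```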

Two-Stage Training Experiments and Analysis

We designed a comparative two-stage training pipeline.

Stage 1 — Base
Trained on 131 high-quality samples for 80 epochs.

  • Validation F1 = 0.6875 (converged around epoch 50).

  • Test set: Top-1 accuracy = 94%, Top-3 accuracy = 94%.

  • Per-class highlights (computed as in the sketch after this list):

    • SBO:0000177 (non-covalent binding/association): Precision = 100%, Recall = 100% (~43% of the test set)

    • SBO:0000217 (protonation): Precision = 100%, Recall = 100% (~21% of the test set)
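
For reference, these metrics can be computed as in the following sketch; `logits` and `y_true` are random stand-ins for the real model outputs and gold labels:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Random stand-ins for real test-set model outputs and labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 42))
y_true = rng.integers(0, 42, size=100)

top1 = logits.argmax(axis=1)
top3 = np.argsort(logits, axis=1)[:, -3:]                  # 3 highest-scoring classes
top1_acc = (top1 == y_true).mean()
top3_acc = np.any(top3 == y_true[:, None], axis=1).mean()

# Per-class precision/recall, e.g. for SBO:0000177 and SBO:0000217.
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, top1, labels=np.arange(42), zero_division=0
)
```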

Stage 2 — NoiseAdapt
Introduced the 6,966 GPT samples for 10 epochs of noise adaptation; learning rate reduced to 10% of the Stage 1 rate; noise ratio 0.3.

  • Best F1 = 0.4808, an improvement over earlier baselines but still clearly below Stage 1.

  • Curve analysis shows a peak at epoch 1 followed by steady decline, indicating that systematic labeling bias in the GPT data harms training.

  • The system’s automatic model selection correctly chose the Stage 1 best checkpoint as the final model, maximizing overall performance (see the sketch below).
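
A compact sketch of the stage transition and checkpoint tracking, reusing `model` and `optimizer` from the architecture sketch above; `train_one_epoch`, `evaluate_f1`, and the data loaders are hypothetical placeholders for the project's actual pipeline:

```python
import copy

# Checkpoint tracking across both stages; the commented loops stand in
# for the project's real training code.
best = {"f1": -1.0, "state": None, "stage": None}

def maybe_save(model, f1: float, stage: str):
    """Keep the weights with the highest validation F1 seen so far."""
    if f1 > best["f1"]:
        best.update(f1=f1, state=copy.deepcopy(model.state_dict()), stage=stage)

# Stage 1: 80 epochs on the 131 high-quality samples.
# for epoch in range(80):
#     train_one_epoch(model, base_loader)
#     maybe_save(model, evaluate_f1(model, val_loader), "stage1")

# Stage 2: drop the LR to 10% of the Stage 1 rate, then mix in the GPT
# samples (noise ratio 0.3) for 10 adaptation epochs.
for group in optimizer.param_groups:
    group["lr"] *= 0.1
# for epoch in range(10):
#     train_one_epoch(model, mixed_loader)
#     maybe_save(model, evaluate_f1(model, val_loader), "stage2")

# Final model = best checkpoint overall; here the Stage 1 best wins.
if best["state"] is not None:
    model.load_state_dict(best["state"])
```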

Performance Evaluation and Application Analysis

The final model’s test-set performance validates the approach:

  • Macro F1 = 0.4563: moderate at first glance, but reasonable given the 42 fine-grained classes and severe class imbalance (on average fewer than 4 training samples per class).

  • Top-1 accuracy = 94%, indicating high-quality predictions for practical use.

  • Confidence statistics: mean 78.83%, median 81.41%, max 88.21%, demonstrating reasonable uncertainty estimation (computed as sketched after this list).

  • Class distribution: strong performance on well-represented classes; conservative behavior on long-tail classes avoids overconfident mistakes.
    Example: for SBO:0000401 (deamination) with only 3 test samples, the model still achieved F1 = 0.60. This aligns with real-world needs where correctly identifying common reaction types is typically more valuable than perfectly covering rare types.
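
A small sketch of how these statistics fall out of the softmax outputs; `logits` is a random stand-in for the final model's test-set outputs:

```python
import torch

# Random stand-in for the final model's test-set outputs (shape [N, 42]).
logits = torch.randn(100, 42)
probs = torch.softmax(logits, dim=1)
confidence = probs.max(dim=1).values  # top-1 probability per prediction

print(f"mean   {confidence.mean().item():.2%}")    # reported above: 78.83%
print(f"median {confidence.median().item():.2%}")  # reported above: 81.41%
print(f"max    {confidence.max().item():.2%}")     # reported above: 88.21%
```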

Technical Findings and Data Quality Assessment

The most important finding is the quantitative confirmation that data quality outweighs data quantity for deep learning performance:

  • Stage 1 (131 high-quality samples) → F1 = 0.6875

  • Stage 2 (+6,966 GPT samples) → F1 = 0.4808

Despite improving on earlier versions, Stage 2 still lags notably behind Stage 1. Further analysis reveals systematic labeling bias in the GPT-generated data, especially inconsistent distinctions among fine-grained classes, which introduces noise that harms training. The system’s automatic model selection recognized and avoided this degradation, reflecting the pipeline’s robustness.

The two-stage comparison clarifies the next steps for data optimization. The current 6,966 GPT samples need additional quality control and filtering, or can serve as raw material for data augmentation, with a high-precision subset extracted via confidence filtering and ontology consistency checks (sketched below). Beyond constructing the classifier, the project also established a reproducible data-quality evaluation framework, laying a scientific foundation for future model improvements and data scaling.
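
A minimal sketch of such a quality-control pass, reusing the hypothetical `SBO_TERMS` dictionary from the backfilling sketch; the confidence field and threshold are assumptions, not the project's final values:

```python
# Hypothetical quality-control pass over the GPT-generated samples.
CONF_THRESHOLD = 0.9  # assumed cutoff for the GPT prediction confidence

def is_consistent(sample: dict) -> bool:
    """Ontology consistency: the label must exist in the SBO dictionary
    and its name field must match the canonical term."""
    term = SBO_TERMS.get(sample["sbo_id"])
    return term is not None and term["name"] == sample["name"]

def high_precision_subset(gpt_samples: list) -> list:
    """Keep only confident, ontology-consistent GPT predictions."""
    return [
        s for s in gpt_samples
        if s.get("confidence", 0.0) >= CONF_THRESHOLD and is_consistent(s)
    ]
```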
