This research was carried out as part of a machine learning research scientist internship at Owkin. It probes the limits of the nnUNet pipeline in low-data regimes and compares it with SynthSeg, a segmentation approach trained on synthetic data.

Results

Experiments were performed using nnUNet and SynthSeg. SynthSeg is treated as an out-of-the-box benchmark requiring no training, while nnUNet serves as a data-driven baseline trained with very limited annotations (5 or 30 image–mask pairs).

Quantitative results are summarized below, followed by representative visualizations. For full experimental details and extended analysis, see the complete report.

Performance comparison

Model	OASIS DSC	OASIS HD95	OASIS NSD	ADNI DSC	ADNI HD95	ADNI NSD
nnUNet5	82.53	5.52	1.97	68.54	14.43*	7.75*
nnUNet30	88.97	1.58	29.15	84.85	4.72	2.05
SynthSeg	76.59	3.69	0.93	69.13	2.94	1.24

Performance comparison on OASIS and ADNI. An asterisk (*) marks a mean from which one example with infinite surface distance was excluded.
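The exclusion rule for the starred entries can be sketched as follows. This is a minimal illustration, not the code used to produce the table: the helper name `finite_mean` and the per-case values are made up for the example. An infinite surface distance commonly arises when one of the two masks is empty for a class, although the cause in this case is not stated here.

```python
import math

def finite_mean(values):
    """Mean of the finite entries, plus the number of entries left out.

    Hypothetical helper: infinite (or NaN) distances are dropped before
    averaging instead of propagating into the mean.
    """
    finite = [v for v in values if math.isfinite(v)]
    n_excluded = len(values) - len(finite)
    if not finite:
        return float("nan"), n_excluded
    return sum(finite) / len(finite), n_excluded

# Made-up per-case HD95 values; one case has an infinite distance.
hd95_per_case = [3.0, 5.0, float("inf"), 4.0]
mean_hd95, n_excluded = finite_mean(hd95_per_case)
print(mean_hd95, n_excluded)
```

Reporting the number of excluded cases alongside the mean, as the asterisk does in the table, keeps the comparison honest: the starred means are computed over one fewer example than the others.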

nnUNet trained on 5 images

nnUNet trained on only 5 OASIS images. Even in the extreme low-data regime, the model captures coarse anatomical structure.

nnUNet trained on 30 images

nnUNet trained on 30 ADNI images. Dice scores improve substantially, while surface metrics remain less stable.

SynthSeg (out-of-the-box)

SynthSeg performance on ADNI. Dice scores are lower than those of nnUNet30, but surface metrics are more consistent and robust.

Symmetry effect and qualitative analysis

Qualitative example illustrating the symmetry trap

Challenging ADNI example illustrating the symmetry trap. While nnUNet predictions have accurate shapes, labels are mixed across hemispheres. From left to right: ground truth, nnUNet5, nnUNet30, SynthSeg.

Cross-dataset inference with nnUNet reveals a strong bias toward symmetric structure placement: shapes are captured correctly, but left/right class assignment fails. Dice scores increase dramatically when left and right labels are merged per structure, or when all labels are merged into a single foreground class. This indicates that shape understanding is preserved while semantic consistency across domains is poor.
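The effect of label merging on Dice can be illustrated with a toy example. This is a sketch under stated assumptions, not the evaluation code behind the reported numbers: the label scheme (0 = background, 1 = left structure, 2 = right structure), the helper names, and the tiny 2D arrays are all hypothetical.

```python
import numpy as np

# Hypothetical label scheme, for illustration only.
LEFT, RIGHT = 1, 2

def dice(pred_mask, truth_mask):
    """Dice similarity coefficient between two boolean masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    truth_mask = np.asarray(truth_mask, dtype=bool)
    denom = pred_mask.sum() + truth_mask.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as agreement (a convention)
    return 2.0 * np.logical_and(pred_mask, truth_mask).sum() / denom

def per_label_dice(pred, truth, label):
    """Dice for a single class label."""
    return dice(pred == label, truth == label)

def merged_dice(pred, truth, labels):
    """Dice after merging several class labels into one mask."""
    return dice(np.isin(pred, labels), np.isin(truth, labels))

# Toy slice: shapes are predicted correctly, but hemispheres are swapped.
truth = np.array([[1, 1, 0, 2, 2],
                  [1, 1, 0, 2, 2]])
pred = np.array([[2, 2, 0, 1, 1],
                 [2, 2, 0, 1, 1]])

left_dice = per_label_dice(pred, truth, LEFT)
merged = merged_dice(pred, truth, [LEFT, RIGHT])
print(left_dice, merged)
```

In this toy case the per-label score collapses while the merged score is perfect, which is the pattern described above: a large gap between the two is a sign of left/right label confusion, not of poor shape prediction.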

This page presents a curated visual summary. For full metrics, ablations, and discussion, please refer to the complete technical report.