Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Xuefei (Julie) Wang1, Kai A. Horstmann2, Ethan Lin2, Jonathan Chen2, Alexander R. Farhang1, Sophia Stiles1, Atharva Sehgal3, Jonathan Light4, David Van Valen1, Yisong Yue1, Jennifer J. Sun2

1Caltech   2Cornell   3UT Austin   4Rensselaer Polytechnic Institute

Abstract

Adapting production-level computer vision tools to bespoke scientific datasets is a critical “last mile” bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.

Key Findings

1. Simple agents outperform human experts

A minimal two-agent system (coding agent + execution agent) consistently generates adaptation code that surpasses expert-engineered solutions across all three biomedical imaging tasks—replacing weeks to months of manual tuning.

2. More complexity does not mean better performance

Design choices such as expert functions, reasoning LLMs, and function banks show mixed effects: what helps one task hurts another. Only the data prompt is universally beneficial, and omitting the API list consistently improves scores.

3. Task structure determines which components help

We introduce a framework characterizing each task’s solution space along two axes: API space (concentrated vs. dispersed) and parameter space (easy vs. hard to optimize). This explains the mixed effects and provides a practical roadmap for agent design.

4. Minimal agents match proprietary tree-search systems

When benchmarked against the AIDE agent (tree-search architecture) with comparable compute budgets, our open-source minimal agents achieve equivalent performance with greater transparency and lower cost.

Method

Agent optimization pipeline diagram
Overview of the agent optimization loop. At each iteration, the code writer agent generates preprocessing/postprocessing function pairs, which are executed and evaluated. A function bank accumulates all solutions with their metrics, feeding the best and worst back into the prompt to guide exploration.
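
A minimal sketch of this loop in Python is shown below. It is illustrative only: `propose` stands in for the code writer agent and `evaluate` for the execution agent, and neither name reflects the repository's actual API.

python
# Minimal, illustrative sketch of the optimization loop described above.
# `propose(context) -> (pre_src, post_src)` stands in for the code writer agent;
# `evaluate(pre_src, post_src) -> score` stands in for the execution agent.
def optimize(propose, evaluate, n_iterations=20):
    bank = []  # function bank: (preprocess_src, postprocess_src, score)
    for _ in range(n_iterations):
        ranked = sorted(bank, key=lambda entry: entry[2])
        # Feed the best and worst solutions so far back into the prompt.
        context = {"best": ranked[-1:], "worst": ranked[:1]}
        pre_src, post_src = propose(context)      # code writer agent proposes a pair
        score = evaluate(pre_src, post_src)       # execution agent runs and scores it
        bank.append((pre_src, post_src, score))
    return max(bank, key=lambda entry: entry[2])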

Results

Design Choice Study. All agent settings except the “Small LLM” outperform the expert baseline. We observe mixed effects across many design choices: what helps one task can hurt another.

Setting             Polaris (F1)   Cellpose (AP@0.5)   MedSAM (NSD+DSC)
Expert Baseline     0.841          0.402               0.820
Base Agent          0.867          0.409               0.971
+ Expert Function   0.929          0.410               0.888
+ Function Bank     0.889          0.416               0.943
Reasoning LLM       0.844          0.412               1.020
Small LLM           0.805          0.397               0.918
No Data Prompt      0.856          0.406               0.952
No API List         0.868          0.417               1.037
Comparison of expert-optimized and agent-optimized MedSAM segmentation results, showing visual outputs and the corresponding preprocessing/postprocessing code.
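
For intuition, such an adaptation pair takes the form sketched below. This is a hypothetical example, not the expert's or the agent's actual code; the percentile normalization and largest-component cleanup are assumptions.

python
import numpy as np
from scipy import ndimage

def preprocess(image: np.ndarray) -> np.ndarray:
    """Hypothetical preprocessing: clip to the 1st-99th percentiles, rescale to uint8."""
    lo, hi = np.percentile(image, (1, 99))
    image = np.clip((image - lo) / max(hi - lo, 1e-8), 0, 1)
    return (image * 255).astype(np.uint8)

def postprocess(mask: np.ndarray) -> np.ndarray:
    """Hypothetical postprocessing: fill holes, keep the largest connected component."""
    mask = ndimage.binary_fill_holes(mask > 0)
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask.astype(np.uint8)
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return (labels == int(np.argmax(sizes)) + 1).astype(np.uint8)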

Task Showcase

Spot Detection (Molecular) · Polaris / DeepCell

Metric: F1 score · Expert: 0.841 · Best agent: 0.929

Detecting sub-pixel fluorescent spots for image-based spatial transcriptomics across modalities that use different RNA capture and tagging methods. Validated on 95 images. A sketch of the F1 computation follows below.
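
For reference, F1 for spot detection can be computed by matching predicted and ground-truth spot coordinates one-to-one within a pixel tolerance. The sketch below assumes Hungarian matching and a 3-pixel tolerance, which may differ from the Polaris evaluation.

python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def spot_f1(pred_xy: np.ndarray, true_xy: np.ndarray, tol: float = 3.0) -> float:
    """F1 for spot detection via one-to-one matching within `tol` pixels (tolerance assumed)."""
    if len(pred_xy) == 0 or len(true_xy) == 0:
        return 0.0
    dists = cdist(pred_xy, true_xy)
    rows, cols = linear_sum_assignment(dists)   # optimal one-to-one matching
    tp = int(np.sum(dists[rows, cols] <= tol))
    fp, fn = len(pred_xy) - tp, len(true_xy) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0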

Cell Segmentation (Cellular) · Cellpose 3 (cyto3)

Metric: AP @ IoU 0.5 · Expert: 0.402 · Best agent: 0.417

Cell instance segmentation across multiple modalities, including whole-cell and nucleus, fluorescence, and phase-contrast bacterial images. Validated on 100 images. A sketch of the AP computation follows below.
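
For reference, Cellpose-style average precision at a single IoU threshold is TP / (TP + FP + FN) over matched instances. The sketch below assumes labeled instance masks with consecutive integer labels (0 = background); the official Cellpose metric implementation may differ.

python
import numpy as np
from scipy.optimize import linear_sum_assignment

def average_precision(true_masks: np.ndarray, pred_masks: np.ndarray, iou_thr: float = 0.5) -> float:
    """AP at one IoU threshold as TP / (TP + FP + FN); assumes consecutive instance labels."""
    n_true, n_pred = int(true_masks.max()), int(pred_masks.max())
    if n_true == 0 or n_pred == 0:
        return 0.0
    # Joint label histogram gives pairwise pixel overlaps (row/col 0 = background).
    hist = np.histogram2d(true_masks.ravel(), pred_masks.ravel(),
                          bins=(np.arange(n_true + 2), np.arange(n_pred + 2)))[0]
    area_true = hist.sum(axis=1)[1:, None]
    area_pred = hist.sum(axis=0)[None, 1:]
    overlap = hist[1:, 1:]
    iou = overlap / (area_true + area_pred - overlap + 1e-9)
    rows, cols = linear_sum_assignment(-iou)    # match instances to maximize total IoU
    tp = int(np.sum(iou[rows, cols] >= iou_thr))
    return tp / (n_true + n_pred - tp)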

Medical Segmentation (Macroscopic) · MedSAM

Metric: NSD + DSC · Expert: 0.820 · Best agent: 1.037

Medical image segmentation using an extension of the Segment Anything Model adapted to medical domains, evaluated on dermoscopy data. Validated on 25 images.
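
For reference, the reported score is the sum of the Dice similarity coefficient (DSC) and the normalized surface Dice (NSD), so a perfect segmentation scores 2.0. The sketch below implements DSC directly and delegates NSD to DeepMind's surface-distance package; the spacing, tolerance, and exact package usage are assumptions and may differ from the paper's evaluation.

python
import numpy as np
# pip install surface-distance  (github.com/google-deepmind/surface-distance)
from surface_distance import compute_surface_distances, compute_surface_dice_at_tolerance

def dsc(gt: np.ndarray, pred: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    denom = gt.sum() + pred.sum()
    return 2.0 * np.logical_and(gt, pred).sum() / denom if denom else 1.0

def nsd(gt: np.ndarray, pred: np.ndarray, spacing_mm=(1.0, 1.0), tol_mm: float = 2.0) -> float:
    """Normalized surface Dice at a boundary tolerance (spacing and tolerance are placeholders)."""
    sd = compute_surface_distances(gt.astype(bool), pred.astype(bool), spacing_mm)
    return compute_surface_dice_at_tolerance(sd, tol_mm)

def nsd_plus_dsc(gt, pred):
    """Combined score as reported above (maximum 2.0)."""
    return nsd(gt, pred) + dsc(gt, pred)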

Getting Started

Clone the repository and install dependencies using uv. See the Adding a Task guide for extending the framework to new workflows.

bash
# Clone and install
git clone https://github.com/xuefei-wang/simple-agent-opt.git
cd simple-agent-opt
uv venv && source .venv/bin/activate
uv pip install -e ".[cellpose]"   # or .[polaris], .[medsam]

# Set your API key
export OPENAI_API_KEY="your-key-here"

# Run an experiment
python main.py \
    --dataset /path/to/data \
    --experiment_name cellpose_segmentation \
    --random_seed 42 -k 3

Citation

If you find this work useful, please cite our paper:

bibtex
@article{wang2025simple,
  title   = {Simple Agents Outperform Experts in Biomedical Imaging
             Workflow Optimization},
  author  = {Wang, Xuefei and Horstmann, Kai A. and Lin, Ethan and
             Chen, Jonathan and Farhang, Alexander R. and Stiles, Sophia and
             Sehgal, Atharva and Light, Jonathan and Van Valen, David and
             Yue, Yisong and Sun, Jennifer J.},
  journal = {arXiv preprint arXiv:2512.06006},
  year    = {2025}
}