Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Xuefei (Julie) Wang1, Kai A. Horstmann2, Ethan Lin2, Jonathan Chen2, Alexander R. Farhang1, Sophia Stiles1, Atharva Sehgal3, Jonathan Light4, David Van Valen1, Yisong Yue1, Jennifer J. Sun2

1Caltech   2Cornell   3UT Austin   4Rensselaer Polytechnic Institute

Abstract

Adapting production-level computer vision tools to bespoke scientific datasets is a critical “last mile” bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.

Key Findings

1. Simple agents outperform human experts

A minimal two-agent system (coding agent + execution agent) consistently generates adaptation code that surpasses expert-engineered solutions across all three biomedical imaging tasks—replacing weeks to months of manual tuning.

2. More complexity does not mean better performance

Design choices such as expert functions, reasoning LLMs, and function banks show mixed effects: what helps one task hurts another. Only the data prompt is universally beneficial, and omitting the API list consistently improves scores.

3. Task structure determines which components help

We introduce a framework characterizing each task’s solution space along two axes: API space (concentrated vs. dispersed) and parameter space (easy vs. hard to optimize). This explains the mixed effects and provides a practical roadmap for agent design.

4. Minimal agents match proprietary tree-search systems

When benchmarked against the AIDE agent (tree-search architecture) with comparable compute budgets, our open-source minimal agents achieve equivalent performance with greater transparency and lower cost.

Method

Agent optimization pipeline diagram
Overview of the agent optimization loop. At each iteration, the code writer agent generates preprocessing/postprocessing function pairs, which are executed and evaluated. A function bank accumulates all solutions with their metrics, feeding the best and worst back into the prompt to guide exploration.
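
A minimal sketch of this loop in Python is shown below. It is illustrative only: `propose` stands in for the code writer agent and `evaluate` for the execution agent, and neither name reflects the repository's actual API.

python
# Minimal, illustrative sketch of the optimization loop described above.
# `propose(context) -> (pre_src, post_src)` stands in for the code writer agent;
# `evaluate(pre_src, post_src) -> score` stands in for the execution agent.
def optimize(propose, evaluate, n_iterations=20):
    bank = []  # function bank: (preprocess_src, postprocess_src, score)
    for _ in range(n_iterations):
        ranked = sorted(bank, key=lambda entry: entry[2])
        # Feed the best and worst solutions so far back into the prompt.
        context = {"best": ranked[-1:], "worst": ranked[:1]}
        pre_src, post_src = propose(context)      # code writer agent proposes a pair
        score = evaluate(pre_src, post_src)       # execution agent runs and scores it
        bank.append((pre_src, post_src, score))
    return max(bank, key=lambda entry: entry[2])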

Results

Design Choice Study. All agent settings except the “Small LLM” outperform the expert baseline. We observe mixed effects across many design choices: what helps one task can hurt another.

Setting             Polaris (F1)   Cellpose (AP@0.5)   MedSAM (NSD+DSC)
Expert Baseline     0.841          0.402               0.820
Base Agent          0.867          0.409               0.971
+ Expert Function   0.929          0.410               0.888
+ Function Bank     0.889          0.416               0.943
Reasoning LLM       0.844          0.412               1.020
Small LLM           0.805          0.397               0.918
No Data Prompt      0.856          0.406               0.952
No API List         0.868          0.417               1.037
Comparison of expert-optimized and agent-optimized MedSAM segmentation results, showing visual outputs and the corresponding preprocessing/postprocessing code.
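
For intuition, such an adaptation pair takes the form sketched below. This is a hypothetical example, not the expert's or the agent's actual code; the percentile normalization and largest-component cleanup are assumptions.

python
import numpy as np
from scipy import ndimage

def preprocess(image: np.ndarray) -> np.ndarray:
    """Hypothetical preprocessing: clip to the 1st-99th percentiles, rescale to uint8."""
    lo, hi = np.percentile(image, (1, 99))
    image = np.clip((image - lo) / max(hi - lo, 1e-8), 0, 1)
    return (image * 255).astype(np.uint8)

def postprocess(mask: np.ndarray) -> np.ndarray:
    """Hypothetical postprocessing: fill holes, keep the largest connected component."""
    mask = ndimage.binary_fill_holes(mask > 0)
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask.astype(np.uint8)
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return (labels == int(np.argmax(sizes)) + 1).astype(np.uint8)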

Task Showcase

Spot Detection (Molecular) · Polaris / DeepCell

Metric: F1 score · Expert: 0.841 · Best agent: 0.929

Detecting sub-pixel fluorescent spots for image-based spatial transcriptomics across modalities that use different RNA capture and tagging methods. Validated on 95 images. A sketch of the F1 computation follows below.
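
For reference, F1 for spot detection can be computed by matching predicted and ground-truth spot coordinates one-to-one within a pixel tolerance. The sketch below assumes Hungarian matching and a 3-pixel tolerance, which may differ from the Polaris evaluation.

python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def spot_f1(pred_xy: np.ndarray, true_xy: np.ndarray, tol: float = 3.0) -> float:
    """F1 for spot detection via one-to-one matching within `tol` pixels (tolerance assumed)."""
    if len(pred_xy) == 0 or len(true_xy) == 0:
        return 0.0
    dists = cdist(pred_xy, true_xy)
    rows, cols = linear_sum_assignment(dists)   # optimal one-to-one matching
    tp = int(np.sum(dists[rows, cols] <= tol))
    fp, fn = len(pred_xy) - tp, len(true_xy) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0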

Cell Segmentation (Cellular) · Cellpose 3 (cyto3)

Metric: AP @ IoU 0.5 · Expert: 0.402 · Best agent: 0.417

Cell instance segmentation across multiple modalities, including whole-cell and nucleus, fluorescence, and phase-contrast bacterial images. Validated on 100 images. A sketch of the AP computation follows below.
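
For reference, Cellpose-style average precision at a single IoU threshold is TP / (TP + FP + FN) over matched instances. The sketch below assumes labeled instance masks with consecutive integer labels (0 = background); the official Cellpose metric implementation may differ.

python
import numpy as np
from scipy.optimize import linear_sum_assignment

def average_precision(true_masks: np.ndarray, pred_masks: np.ndarray, iou_thr: float = 0.5) -> float:
    """AP at one IoU threshold as TP / (TP + FP + FN); assumes consecutive instance labels."""
    n_true, n_pred = int(true_masks.max()), int(pred_masks.max())
    if n_true == 0 or n_pred == 0:
        return 0.0
    # Joint label histogram gives pairwise pixel overlaps (row/col 0 = background).
    hist = np.histogram2d(true_masks.ravel(), pred_masks.ravel(),
                          bins=(np.arange(n_true + 2), np.arange(n_pred + 2)))[0]
    area_true = hist.sum(axis=1)[1:, None]
    area_pred = hist.sum(axis=0)[None, 1:]
    overlap = hist[1:, 1:]
    iou = overlap / (area_true + area_pred - overlap + 1e-9)
    rows, cols = linear_sum_assignment(-iou)    # match instances to maximize total IoU
    tp = int(np.sum(iou[rows, cols] >= iou_thr))
    return tp / (n_true + n_pred - tp)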

Medical Segmentation (Macroscopic) · MedSAM

Metric: NSD + DSC · Expert: 0.820 · Best agent: 1.037

Medical image segmentation using an extension of the Segment Anything Model adapted to medical domains, evaluated on dermoscopy data. Validated on 25 images.
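
For reference, the reported score is the sum of the Dice similarity coefficient (DSC) and the normalized surface Dice (NSD), so a perfect segmentation scores 2.0. The sketch below implements DSC directly and delegates NSD to DeepMind's surface-distance package; the spacing, tolerance, and exact package usage are assumptions and may differ from the paper's evaluation.

python
import numpy as np
# pip install surface-distance  (github.com/google-deepmind/surface-distance)
from surface_distance import compute_surface_distances, compute_surface_dice_at_tolerance

def dsc(gt: np.ndarray, pred: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    denom = gt.sum() + pred.sum()
    return 2.0 * np.logical_and(gt, pred).sum() / denom if denom else 1.0

def nsd(gt: np.ndarray, pred: np.ndarray, spacing_mm=(1.0, 1.0), tol_mm: float = 2.0) -> float:
    """Normalized surface Dice at a boundary tolerance (spacing and tolerance are placeholders)."""
    sd = compute_surface_distances(gt.astype(bool), pred.astype(bool), spacing_mm)
    return compute_surface_dice_at_tolerance(sd, tol_mm)

def nsd_plus_dsc(gt, pred):
    """Combined score as reported above (maximum 2.0)."""
    return nsd(gt, pred) + dsc(gt, pred)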

Getting Started

Clone the repository and install dependencies using uv. See the Adding a Task guide for extending the framework to new workflows.

bash
# Clone and install
git clone https://github.com/xuefei-wang/simple-agent-opt.git
cd simple-agent-opt
uv venv && source .venv/bin/activate
uv pip install -e ".[cellpose]"   # or .[polaris], .[medsam]

# Set your API key
export OPENAI_API_KEY="your-key-here"

# Run an experiment
python main.py \
    --dataset /path/to/data \
    --experiment_name cellpose_segmentation \
    --random_seed 42 -k 3

Citation

If you find this work useful, please cite our paper:

bibtex
@article{wang2025simple,
  title   = {Simple Agents Outperform Experts in Biomedical Imaging
             Workflow Optimization},
  author  = {Wang, Xuefei and Horstmann, Kai A. and Lin, Ethan and
             Chen, Jonathan and Farhang, Alexander R. and Stiles, Sophia and
             Sehgal, Atharva and Light, Jonathan and Van Valen, David and
             Yue, Yisong and Sun, Jennifer J.},
  journal = {arXiv preprint arXiv:2512.06006},
  year    = {2025}
}