From DICOMs to fMRIPrep in One Conversation
Tags: fmri, skills
New skill for fMRI data preprocessing: dicom2fmriprep
The DICOM→BIDS→fMRIPrep pipeline has like five tools duct-taped together to help you format and preprocess your dilligently collected fMRI data (freshly minted DICOMs).
heudiconv
First you need a heuristic file to convert your DICOMs. To write that heuristic, you need to inspect dicominfo.tsv. To get dicominfo.tsv, you need to run heudiconv with -c none. To run heudiconv, you need to understand its CLI flags. It's circular. You're already tired.
And then there's the Siemens MoCo thing. Siemens scanners produce motion-corrected duplicates of your BOLD series. If you don't filter them out with is_motion_corrected, you get twice the data you expected, and half of it is garbage. The garbage looks perfectly valid. It will silently contaminate your analysis. You won't know until much later. Or maybe you'll never know. That's the worst part.
Oh, and if you forgot --minmeta, your JSON sidecars are now 500 lines of Siemens private headers. Fun.
Writing a good heuristic looks simple in the docs. It takes an entire afternoon in practice. Sometimes two.
BIDS validation
You finally have NIfTIs. You run the validator. It screams:
[ERR] func/sub-S001_task-rest_bold.json: TaskName is not defined
[ERR] fmap/sub-S001_dir-AP_epi.json: IntendedFor field is missing
[WARN] .DS_Store is not part of the BIDS specification
The TaskName one. Every time. You have one task. It's called "rest." The validator knows it's called "rest" because it's in the filename. But the JSON sidecar doesn't say "TaskName": "rest", so we all have to suffer. Why is this still a thing?
fMRIPrep
fMRIPrep is incredible software. Configuring it is not. Output spaces, CIFTI resolutions, thread counts, memory limits, FreeSurfer licensing, and the question of whether you set --omp-nthreads to one less than --n_cpus or not. Get one flag wrong and your 12-hour SLURM job dies at minute 3. I've done this more times than I'd like to admit.
BABS
If you're running fMRIPrep at scale, you're probably using BABS. BABS handles DataLad integration, container management, job submission. It's great. But its YAML schema is strict. Not "we'll warn you" strict. It just crashes. The traceback tells you nothing about which key is wrong.
The section has to be called input_datasets, not input_data. The args go under bids_app_args, not container_args. You need $SUBJECT_SELECTION_FLAG and $BABS_TMPDIR in exactly the right places. None of this is well-documented. You learn it from someone who's already been through it, or you learn it the hard way.
The problem with asking Claude
I've been using Claude Code for a lot of my research workflow. It's good at writing scripts. But when I asked it to generate a heudiconv heuristic, it produced something that looked right — correct function signatures, reasonable pattern matching — but it missed certain sections entirely. The BABS config it wrote had the wrong YAML section names.
So I built a Claude Code skill called dicom2fmriprep. It knows about Siemens MoCo series, the heudiconv two-pass workflow, --minmeta, POPULATE_INTENDED_FOR_OPTS, the exact BABS YAML schema, all the fMRIPrep flags you'll forget, the BIDS validation errors you'll hit. It doesn't just generate scripts. It asks about your data first, walks through the pipeline step by step, and explains why it's making each choice.
And I'm showing a little experiment on this task using Claude Code with skill and without skill.
What it generates
Real outputs from evaluation runs. I want to show these because the details matter.
The heudiconv heuristic
For a Siemens dataset with T1w MPRAGE, resting-state fMRI, and AP/PA fieldmaps:
POPULATE_INTENDED_FOR_OPTS = {
'matching_parameters': ['ImagingVolume', 'Shims'],
'criterion': 'Closest'
}
def infotodict(seqinfo):
# Single-session study — no {session} in templates
t1w = create_key('sub-{subject}/anat/sub-{subject}_T1w')
func_rest = create_key(
'sub-{subject}/func/sub-{subject}_task-rest_run-{item:02d}_bold'
)
fmap_ap = create_key('sub-{subject}/fmap/sub-{subject}_dir-AP_epi')
fmap_pa = create_key('sub-{subject}/fmap/sub-{subject}_dir-PA_epi')
info = {t1w: [], func_rest: [], fmap_ap: [], fmap_pa: []}
for s in seqinfo:
# ---- Skip motion-corrected and derived reconstructions ----
if s.is_motion_corrected or s.is_derived:
continue
protocol = s.protocol_name.lower()
# ... pattern matching for each modality ...
A few things worth noting.
That is_motion_corrected or s.is_derived check. This is the single most important thing in the whole file. Without it, Siemens data will have duplicate BOLD series that look valid but are the scanner's online motion correction output. The without-skill version didn't have this filter at all. At all.
POPULATE_INTENDED_FOR_OPTS with ImagingVolume and Shims matching. This tells heudiconv to automatically set IntendedFor in your fieldmap sidecars, matching fieldmaps to BOLD runs based on volume overlap and shim settings. The without-skill version used ModalityAcquisitionLabel, which is less specific and can mislink fieldmaps in multi-run protocols.
No {session} in the BIDS paths. Sounds trivial. But the without-skill version included {session} placeholders everywhere, which creates unnecessary directory nesting and can confuse downstream tools.
The BABS config
input_datasets:
BIDS:
required_files:
- "func/*_bold.nii*"
- "anat/*_T1w.nii*"
is_zipped: false
origin_url: "/project/data/bids_datalad"
path_in_babs: inputs/data/BIDS
cluster_resources:
interpreting_shell: "/bin/bash"
hard_memory_limit: 32G
temporary_disk_space: 200G
number_of_cpus: "8"
hard_runtime_limit: "24:00:00"
customized_text: |
#SBATCH -p normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --propagate=NONE
script_preamble: |
source "${CONDA_PREFIX}"/bin/activate babs
module load singularity/3.8
export TEMPLATEFLOW_HOME=/scratch/${USER}/templateflow
job_compute_space: "/scratch/${USER}/babs_tmp"
singularity_args:
- --cleanenv
bids_app_args:
$SUBJECT_SELECTION_FLAG: "--participant-label"
-w: "$BABS_TMPDIR"
--fs-license-file: "/path/to/freesurfer/license.txt"
--output-spaces: "MNI152NLin2009cAsym:res-2"
--cifti-output: "91k"
--force-bbr: ""
--n_cpus: "8"
--omp-nthreads: "7"
--mem-mb: "30000"
--skip-bids-validation: ""
--notrack: ""
zip_foldernames:
fmriprep: "24-1-1"
freesurfer: "24-1-1"
alert_log_messages:
stdout:
- "fMRIPrep failed"
- "Cannot allocate memory"
- "Excessive topologic defect encountered"
- "mris_curvature_stats: Could not open file"
- "Numerical result out of range"
- "No such file or directory"
If you've wrestled with BABS configs before, you'll notice the details. $SUBJECT_SELECTION_FLAG and $BABS_TMPDIR are BABS-specific variables interpolated at runtime — miss these and BABS can't parallelize across subjects. The alert_log_messages are stdout patterns BABS watches for to detect failures. "Cannot allocate memory" and "Excessive topologic defect encountered" are the ones that save you from wasting cluster hours. zip_foldernames version string (24-1-1) has to match your container. --omp-nthreads: "7" with --n_cpus: "8" — always one less, leaving a thread for orchestration.
The without-skill version didn't use the correct BABS YAML section names. The config would crash on babs init. You'd spend an hour figuring out why.
BIDS validation fixes
Both with-skill and without-skill handled BIDS fixes fine. Adding TaskName to functional sidecars, setting IntendedFor in fieldmap JSONs, removing .DS_Store. This makes sense — patching JSON files is just file manipulation. No deep domain knowledge required.
I actually think this is interesting. It shows exactly where the skill adds value and where it doesn't. Simple file manipulation? Claude already knows how to do that. Domain-specific gotchas that live in one person's head? That's where it falls apart without help.
The numbers
Three evaluations, each testing a different pipeline stage:
| Eval | With Skill | Without Skill | What the skill caught |
|---|---|---|---|
| heudiconv heuristic | 8/8 (100%) | 6/8 (75%) | MoCo filtering, --minmeta |
| BABS setup | 10/10 (100%) | 7/10 (70%) | YAML schema, container setup |
| BIDS fix | 5/5 (100%) | 5/5 (100%) | — (both nailed it) |
| Total | 23/23 (100%) | 18/23 (78%) |
The overhead:
| Metric | With Skill | Without Skill | Delta |
|---|---|---|---|
| Avg tokens | 28,766 | 17,140 | +11,626 (~1.7x) |
| Avg time | 101.8s | 98.7s | +3.1s |
About 3 extra seconds and some additional tokens. That's Claude reading the skill's reference material. In exchange, you avoid MoCo contamination that takes hours to debug, BABS crashes that waste a day of cluster allocation, and sidecar bloat that makes your dataset annoying to work with.
Try it
npx @yibeichen/claude-skills install dicom2fmriprep
You can say something like:
I have Siemens DICOM data from a resting-state study with T1w, BOLD, and AP/PA fieldmaps. Help me set up the full pipeline from DICOMs to fMRIPrep on our SLURM cluster.
It'll ask about your data — scanner manufacturer, protocol names, number of sessions, cluster setup — before generating anything.
Or jump to a specific step:
Write me a heudiconv heuristic for my dataset. Here's my dicominfo.tsv: [paste or attach]
Generate a BABS config for fMRIPrep 24.1.1 with CIFTI output on a SLURM cluster.
Enjoy Science!