CT-FM Data Download Guide
This guide details the steps to download and process the CT scan data used for both pre-training and downstream tasks in the CT-FM study. All datasets are publicly available, ensuring that every element of our study can be reproduced.
Pre-training Data
For our pre-training experiments, we utilize 148,394 CT scans from the Imaging Data Commons (IDC). Follow the steps below to obtain and prepare the exact dataset used in our study.
1. Run the Data Query in BigQuery
Execute the provided SQL query on Google BigQuery to filter for CT scans that meet our quality constraints. The query performs necessary quality checks on each scan.
Running this query returns a table of CT scan records that satisfy our criteria. We then convert these query results into a manifest file that can be used to download the data.
2. Generate a Download Manifest
This step has already been done, so you can skip to the next one unless you want to see how the manifest was built.
After reviewing the query results, use the prepare_download.ipynb Jupyter notebook to create a manifest file. This manifest lists every DICOM file that needs to be downloaded.
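As a rough illustration of what the manifest step amounts to, the sketch below writes one s5cmd-style copy line per series. It rests on two assumptions: that the query results expose a per-series bucket URL (IDC's BigQuery tables carry one in a series_aws_url column, but whether the provided query selects it is not confirmed here), and that the manifest uses the "cp <url> ." layout. write_manifest is a hypothetical helper, not code from the notebook.

```python
from pathlib import Path


def write_manifest(series_urls, manifest_path):
    """Write one s5cmd-style copy line per unique series URL.

    Returns the number of lines written. The "cp <url> ." layout is an
    assumption about the manifest format, not taken from the notebook.
    """
    unique_urls = sorted(set(series_urls))  # drop duplicates, sort for a stable order
    lines = [f"cp {url} ." for url in unique_urls]
    Path(manifest_path).write_text("".join(line + "\n" for line in lines))
    return len(lines)
```

Calling write_manifest(urls, "manifest.txt") with the URL column of the query results would then produce a file suitable for the download step, provided the assumptions above hold.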
3. Download the DICOM Files
To download the scans, first install the IDC Index tool:
Then, execute the following command, replacing <PATH_TO_MANIFEST.TXT> and <DOWNLOAD_DIR> with your manifest file path and desired download directory:
This command downloads all the specified DICOM files into the designated directory.
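For scripted runs, the download call can be assembled in Python. This is a minimal sketch, assuming the idc-index package (installable with pip as idc-index) provides an idc command whose download-from-manifest subcommand accepts --manifest-file and --download-dir. Confirm this with idc download-from-manifest --help on your installed version. idc_download_command is a hypothetical helper name.

```python
import shlex


def idc_download_command(manifest_path, download_dir):
    """Build the argument list for downloading every entry in a manifest.

    The subcommand and flag names are assumptions about the idc-index CLI.
    """
    return [
        "idc", "download-from-manifest",
        "--manifest-file", str(manifest_path),
        "--download-dir", str(download_dir),
    ]


def as_shell_line(command):
    """Render an argument list as a copy-pasteable, safely quoted shell line."""
    return shlex.join(command)
```

Passing the returned list to subprocess.run(command, check=True) runs the download and raises an error if the tool exits with a failure code.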
4. Sort and Convert the Data
The downloaded data is in DICOM format. To prepare it for your experiments, follow these steps:
- Sorting: Organize the DICOM files using the tool "dicomsort". While the specific usage may depend on your environment, a common workflow involves running a command to categorize files by patient or study. For example, you might first list the files and then run:
For more detailed instructions and options, please refer to the thedicomsort GitHub repository.
- Conversion: Convert the sorted DICOM files to NRRD format using Plastimatch. A typical command looks similar to:
For additional details and advanced options, consult the Plastimatch documentation or relevant online resources.
- Packaging: Finally, generate a .pkl file that lists the scans. This file serves as the required input for the pre-training experiments.
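The sorting and conversion steps can be sketched as follows. The snippet is neither dicomsort nor the study's own script. sorted_path only illustrates a patient/study/series layout by expanding a %TagName pattern (both the pattern and the helper names are hypothetical). plastimatch_convert_command assembles a plastimatch convert call with --input and --output-img. Check the flags against your installed Plastimatch version.

```python
import re

# Hypothetical target layout: one folder per patient, study, and series.
PATTERN = "%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm"


def sorted_path(tags, pattern=PATTERN):
    """Expand %TagName placeholders in pattern with values from tags."""
    def substitute(match):
        value = str(tags[match.group(1)])
        # Replace characters that are awkward in file names.
        return re.sub(r"[^A-Za-z0-9._-]", "_", value)

    return re.sub(r"%([A-Za-z]+)", substitute, pattern)


def plastimatch_convert_command(series_dir, output_nrrd):
    """Build the argument list for converting one DICOM series to NRRD."""
    return [
        "plastimatch", "convert",
        "--input", str(series_dir),
        "--output-img", str(output_nrrd),
    ]
```

Each command list can be passed to subprocess.run(command, check=True), once per sorted series directory.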
For a complete example of these final steps, refer again to the prepare_download.ipynb notebook.
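The packaging step might look like the sketch below. The exact structure the pre-training code expects inside the .pkl file is defined in the notebook. A plain Python list of NRRD paths is an assumption made here for illustration, and write_scan_list and read_scan_list are hypothetical helper names.

```python
import pickle
from pathlib import Path


def write_scan_list(nrrd_dir, pkl_path):
    """Pickle a sorted list of every .nrrd path found under nrrd_dir."""
    scans = sorted(str(path) for path in Path(nrrd_dir).rglob("*.nrrd"))
    with open(pkl_path, "wb") as handle:
        pickle.dump(scans, handle)
    return scans


def read_scan_list(pkl_path):
    """Load the scan list back, e.g. to check it before training."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)
```

Reading the file back with read_scan_list and checking its length against the number of converted scans is a cheap sanity check before starting a long pre-training run.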
Following these instructions will replicate the data download and preprocessing pipeline used in our study, enabling you to work with the same CT scan dataset.
Downstream Tasks Data
We use several publicly available datasets for our downstream tasks, including:
- Whole Body Segmentation: TotalSegmentator-v2 dataset
- Tumor Segmentation: MSD dataset
- Medical Image Retrieval: 3D-MIR, OrganMNIST-3D
- Stability Testing: RIDER