Data Manager — Multimodal Medical Foundation Models
About the Role
You will lead data operations for a cutting-edge research group developing 3D medical multimodal foundation models and agentic clinical AI systems. These models rely on extremely high-quality, well-structured, and compliant datasets, including 3D medical imaging volumes (MRI, CT, PET), clinical text corpora, annotations, and multimodal metadata.
Your job is to own the end-to-end data lifecycle: acquisition, ingestion, cleaning, versioning, labeling, quality control, governance, and delivery to researchers. You are the central node ensuring our foundation model teams and medical agent teams have clean, scalable, well-documented data pipelines.
This is a pivotal foundational role—without great data, large models cannot be great.
Responsibilities
- Oversee ingestion and processing of 3D medical volumes (DICOM, NIfTI, MHA) and associated clinical texts.
- Build automated pipelines for metadata extraction, de-identification, slice/series validation, and cohort structuring.
- Manage large-scale internal datasets and external research datasets (BraTS, LiTS, MIMIC-CXR, CheXpert, MosMed, etc.).
- Implement scalable data storage, cataloging, and retrieval systems for multimodal training data.
- Own dataset version control, lineage tracking, reproducibility, and dataset documentation.
- Collaborate with ML systems engineers on high-throughput data loaders, sharding strategies, and caching mechanisms.
- Lead medical annotation workflows with radiologists, medical students, and labeling vendors.
- Create guidelines for ROI labeling, segmentation, captioning, report alignment, and case-level curation.
- Build semi-automated labeling pipelines using model-assisted tools.
- Enforce strict standards on data quality, completeness, consistency, and bias control.
- Ensure adherence to medical data privacy, HIPAA-equivalent frameworks, and institutional data-sharing rules.
- Manage PHI de-identification, audit logs, access control, and compliance approvals.
- Work closely with foundation-model researchers to understand data needs for model training.
- Partner with agentic system designers to supply structured datasets for clinical reasoning tasks.
- Collaborate with foundational engineers on data access layers, performance bottlenecks, and dataset optimization.
Why This Role Matters
- The foundation model relies on high-quality 3D and textual data at scale.
- You shape the data pipelines enabling next-generation medical AI agents.
- You ensure clinical-grade governance, safety, reproducibility, and trust.
- Your systems become the backbone for research, experiments, and deployments.
For candidates motivated by the intersection of data, healthcare, and machine learning, this is a high-impact opportunity.
Requirements
- Strong experience managing large multimodal or imaging datasets, ideally medical imaging.
- Proficiency with DICOM/DICOMweb, NIfTI, PACS systems, and medical imaging toolkits (dicompyler, pydicom, MONAI, ITK).
- Experience with ETL pipelines, distributed data systems, and cloud/on-prem storage.
- Knowledge of metadata standards, ontologies, and text–image linking strategies.
- Comfortable working with Python, SQL, and data tooling (Airflow, Prefect, Dagster, DBT, Delta Lake, etc.).
- Understanding of data privacy, de-identification, and compliance requirements in healthcare.
- Strong communication skills and the ability to coordinate between engineers, researchers, clinicians, and data partners.
Nice to Have
- Experience with vector databases, multimodal retrieval, or embedding store design.
- Familiarity with annotation tools (Labelbox, CVAT, iMerit, custom MONAI Label pipelines).
- Prior work with clinical NLP datasets or multilingual Indian medical corpora.
- Experience conducting bias audits, dataset characterization, or quality scoring at scale.
- Contributions to open datasets, benchmarks, or data documentation frameworks.
What We Offer
Location & Eligibility
Listing Details
- First seen: March 26, 2026
- Last seen: May 8, 2026
Please let Saigroup know you found this job on Jobera.