SAIGroup · ~4mo ago

Data Manager — Multimodal Medical Foundation Models

India · Bangalore · mid

Data Science · Other · Data · Healthcare · Data Manager · Clinical Data Manager · Laboratory & Life Sciences

Technical Tools

Airflow, dbt, Python, SQL, ETL, machine learning

About the Role

You will lead data operations for a cutting-edge research group developing 3D medical multimodal foundation models and agentic clinical AI systems. These models rely on extremely high-quality, well-structured, and compliant datasets, including 3D medical imaging volumes (MRI, CT, PET), clinical text corpora, annotations, and multimodal metadata.

Your job is to own the end-to-end data lifecycle: acquisition, ingestion, cleaning, versioning, labeling, quality control, governance, and delivery to researchers. You are the central node ensuring our foundation model teams and medical agent teams have clean, scalable, well-documented data pipelines.

This is a foundational role: without great data, large models cannot be great.

Responsibilities

Oversee ingestion and processing of 3D medical volumes (DICOM, NIfTI, MHA) and associated clinical texts.
Build automated pipelines for metadata extraction, de-identification, slice/series validation, and cohort structuring.
Manage large-scale internal datasets and external research datasets (BraTS, LiTS, MIMIC-CXR, CheXpert, MosMed, etc.).

Implement scalable data storage, cataloging, and retrieval systems for multimodal training data.
Own dataset version control, lineage tracking, reproducibility, and dataset documentation.
Collaborate with ML systems engineers on high-throughput data loaders, sharding strategies, and caching mechanisms.

Lead medical annotation workflows with radiologists, medical students, and labeling vendors.
Create guidelines for ROI labeling, segmentation, captioning, report alignment, and case-level curation.
Build semi-automated labeling pipelines using model-assisted tools.

Enforce strict standards on data quality, completeness, consistency, and bias control.
Ensure adherence to medical data privacy, HIPAA-equivalent frameworks, and institutional data-sharing rules.
Manage PHI de-identification, audit logs, access control, and compliance approvals.

Work closely with foundation-model researchers to understand data needs for model training.
Partner with agentic system designers to supply structured datasets for clinical reasoning tasks.
Collaborate with foundational engineers on data access layers, performance bottlenecks, and dataset optimization.

The foundation model relies on high-quality 3D and textual data at scale.
You shape the data pipelines enabling next-generation medical AI agents.
You ensure clinical-grade governance, safety, reproducibility, and trust.
Your systems become the backbone for research, experiments, and deployments.

For candidates motivated by the intersection of data, healthcare, and machine learning, this is a high-impact opportunity.

Requirements

Strong experience managing large multimodal or imaging datasets, ideally medical imaging.
Proficiency with DICOM/DICOMweb, NIfTI, PACS, and medical imaging toolkits (dicompyler, pydicom, MONAI, ITK).
Experience with ETL pipelines, distributed data systems, and cloud/on-prem storage.
Knowledge of metadata standards, ontologies, and text–image linking strategies.
Comfortable working with Python, SQL, and data tooling (Airflow, Prefect, Dagster, dbt, Delta Lake, etc.).
Understanding of data privacy, de-identification, and compliance requirements in healthcare.
Strong communication skills and the ability to coordinate between engineers, researchers, clinicians, and data partners.

Nice to Have

Experience with vector databases, multimodal retrieval, or embedding store design.
Familiarity with annotation tools (Labelbox, CVAT, iMerit, custom MONAI Label pipelines).
Prior work with clinical NLP datasets or multilingual Indian medical corpora.
Experience conducting bias audits, dataset characterization, or quality scoring at scale.
Contributions to open datasets, benchmarks, or data documentation frameworks.