Senior AI Researcher - Pre-training Data

Heidelberg, full-time, senior

Technical Tools
python, pytorch, deep-learning, machine-learning

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers in finance, manufacturing, and public administration need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We're building that in Heidelberg.

We're growing our pre-training team and hiring someone passionate about data: defining what goes into our models, building the systems that source and prepare it, and ensuring our training team has the highest-quality data to push model capabilities forward.

Team Culture

At Aleph Alpha, we foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organisational structure with efficient, supportive management that enables quick decision‑making, open communication, and a strong sense of shared purpose.

About the role

As a Senior AI Researcher for Pre-training Data, you will shape and improve the scientific methodology behind our pre-training corpora while co-engineering the software and systems that support it. Working with engineers and other researchers to build scalable pipelines, you will focus on the theoretical and empirical research needed to understand which data makes models perform best on our target capabilities.

This role is for you if you have a strong background in large-scale language modeling and the scientific drive to answer complex questions about data scaling laws, synthetic data generation, and curriculum learning.

In your day-to-day work, you will design targeted ablations across various scales, derive and test hypotheses from training dynamics, develop novel algorithms for estimating data quality and performing data curation, and contribute to a range of engineering tasks that facilitate these research directions. Together with a collaborative team of engineers and researchers, you will have a direct impact on the fundamental knowledge and capabilities of the models we ship. You will also help write or lead technical reports for internal and external readers, and present at and contribute to technical meetings and conferences as needed.

Responsibilities

Requirements

  • A deep understanding of machine learning theory, specifically regarding foundation model training dynamics, scaling laws, and data-centric AI.

  • Experience designing and evaluating complex ML experiments on data composition, curriculum learning, or data quality in language model training.

  • Familiarity with statistical methods for evaluation and experiment design.

  • Ability to reason about the information-theoretic properties of a dataset and its predictive power for evaluated tasks: not just processing data, but understanding its signal.

  • Strong Python skills and comfort with ML tooling and deep learning frameworks (especially PyTorch).

  • Willingness to relocate to Heidelberg or travel at least fortnightly.

  • PhD in machine learning, NLP, or equivalent research experience focusing on large-scale language modeling or data curation.

  • A history of contributions to top-tier venues (NeurIPS, ICML, ICLR, ACL, etc.) specifically regarding data curation, scaling laws, synthetic data, or LLM pre-training.

  • Experience training foundation models from scratch and diagnosing data-induced training pathologies.

  • Bonus, but not required: German language proficiency can be helpful for curating and assessing German-language data.

What We Offer

Become part of an AI revolution!
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through nilo.health
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours for better work-life balance and hybrid working model
Virtual Stock Option Plan
JobRad® Bike Lease

Location & Eligibility

Where is the job: Heidelberg (hybrid; some on-site time required)
Who can apply: Same as job location

Listing Details

Posted: April 21, 2026
First seen: May 5, 2026
Last seen: May 8, 2026
