Senior Researcher

United Kingdom · London · Senior

Researcher · Recruitment & Talent Acquisition

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.

We're proud to be a Living Wage accredited Employer.

We are looking for a Senior Researcher to join Monolith’s Research team, now part of CoreWeave. This is a high-impact, high-ownership role for a researcher who combines deep technical expertise in machine learning, statistical modelling, optimisation, and large-scale systems data with the ability to take complex, ambiguous problems from first principles through to production.

The Monolith Data Science team is building a layered reliability and intelligence platform that shifts CoreWeave from reactive troubleshooting to proactive reliability engineering. The platform spans telemetry ingestion, feature engineering, anomaly detection, failure prediction, distributed straggler detection, performance modelling, workload optimisation, and agentic root cause analysis.

You will work closely with Fleet, Infrastructure, AI Platform, engineering, product, and client-facing teams to improve cluster reliability, increase effective utilisation, reduce MTTR, protect uptime, and turn large-scale GPU infrastructure telemetry into measurable operational and commercial impact.

This is not a traditional data science role focused on dashboards, business metrics, or standard forecasting. The role sits at the intersection of applied research, GPU infrastructure, high-performance computing, distributed systems, reliability engineering, telemetry, optimisation, and Physical AI. It demands rigorous scientific thinking, strong execution, and comfort working in a high-ambiguity environment where the right problem framing is often as important as the final model.

Responsibilities

Contribute meaningfully to Monolith and CoreWeave’s research direction by identifying high-leverage problems in GPU infrastructure analytics, cluster reliability, workload performance, scheduling, and utilisation.
Originate novel research directions for turning raw infrastructure telemetry into actionable intelligence, rather than simply applying standard machine learning or data science techniques.
Evaluate emerging methods across statistical modelling, machine learning, observability, optimisation, simulation, reinforcement learning, anomaly detection, and autonomous diagnostics, providing well-grounded technical judgement on which approaches are most likely to create real-world impact.
Champion rigour, reproducibility, and scientific integrity across research outputs, experiments, prototypes, and production validation.
Help establish a research foundation for understanding how large-scale GPU systems behave, why workloads underperform, where bottlenecks emerge, and how reliability can be improved proactively.

Lead the design and development of sophisticated statistical, machine learning, and optimisation systems for large-scale GPU infrastructure telemetry, including compute, networking, storage, workload, and distributed systems data.
Develop advanced models and methodologies to optimise GPU utilisation, workload scheduling, infrastructure efficiency, and system reliability.
Build models and methods for anomaly detection, failure prediction, distributed straggler detection, degraded workload identification, bottleneck diagnosis, and agentic root cause analysis.
Design experiments, analyse large-scale system telemetry, and prototype predictive and optimisation algorithms that directly inform production systems.
Drive technical decisions on difficult modelling problems involving noisy time-series data, high-dimensional telemetry, causal inference, uncertainty, robustness, generalisation, and out-of-distribution behaviour.
Explore simulation, digital-twin, reinforcement learning, and adaptive scheduling approaches where they can improve understanding or optimisation of GPU clusters and distributed training environments.
Take end-to-end ownership of research work from problem framing and exploratory analysis through prototype development, validation, and collaboration with engineering teams on production deployment.
Maintain deep personal technical expertise; remain a hands-on contributor in Python and modern scientific computing / machine learning tooling.

Serve as a strong technical voice within the research organisation, helping shape how Monolith approaches complex infrastructure intelligence problems.
Work closely with Fleet, Infrastructure, AI Platform, engineering, product, and customer-facing teams to ensure research work lands with real operational and commercial impact.
Translate research findings into production-ready prototypes, deployable solutions, and technical recommendations that improve performance, reliability, utilisation, and cost efficiency.
Contribute to research practices and norms that improve how the team handles ambiguous, high-dimensional, real-world systems problems.
Communicate complex technical work and its implications clearly to a range of audiences, from close technical collaborators to senior leadership and external stakeholders.
Help build a shared understanding of how large-scale AI infrastructure behaves, where it fails, and how it can be made more reliable, efficient, and intelligent.

Applied machine learning for GPU infrastructure and distributed systems
Large-scale telemetry ingestion, feature engineering, and infrastructure analytics
GPU cluster reliability, utilisation, observability, and performance analysis
Anomaly detection, degradation detection, and failure prediction
Distributed straggler detection and workload performance diagnosis
Agentic root cause analysis and autonomous diagnostic systems
Time-series, high-dimensional, structured, and operational systems data
Performance modelling for distributed workloads and AI training jobs
Workload scheduling, capacity planning, forecasting, and resource allocation modelling
Optimisation techniques including stochastic optimisation, convex optimisation, reinforcement learning, and adaptive scheduling
Simulation and digital-twin approaches for complex infrastructure systems
Causal inference, controlled experiments, hypothesis testing, and statistical validation
End-to-end research systems: data pipelines, prototypes, validation, deployment, and monitoring

8+ years of experience, or equivalent research experience, applying statistical modelling, machine learning, optimisation, or applied AI to large-scale datasets.
MS or PhD in Computer Science, Statistics, Applied Mathematics, Machine Learning, Physics, Engineering, or a related quantitative field.
Strong proficiency in Python and scientific computing libraries such as NumPy, pandas, SciPy, scikit-learn, PyTorch, or TensorFlow.
Experience working with large-scale structured datasets, time-series data, infrastructure telemetry, performance data, sensor data, or other complex operational data.
Experience designing and analysing controlled experiments, including A/B testing, hypothesis testing, causal inference, or rigorous model validation.
Experience building and validating predictive models in production or research environments.
Experience with distributed data systems such as Spark, Ray, Dask, or similar.
Proficiency in SQL and working with large-scale structured data.
Strong understanding of optimisation techniques such as linear programming, convex optimisation, stochastic optimisation, reinforcement learning, or adaptive scheduling.
Demonstrated ability to solve ambiguous technical problems where the right approach is not already known.
Ability to translate research findings into production-ready prototypes, deployable workflows, or operational tooling.
Strong scientific judgement, including experimental design, reproducibility, validation, and awareness of uncertainty.
The ability to communicate clearly and influence across research, engineering, product, infrastructure, and leadership audiences.

Nice to Have

PhD with published research in systems optimisation, distributed computing, ML systems, performance modelling, reliability engineering, scientific computing, or a related area.
Experience with GPU workloads, distributed training, AI infrastructure, HPC, or large-scale compute environments.
Familiarity with Kubernetes, containerised workloads, cloud-native systems, or distributed infrastructure.
Experience developing reinforcement learning, adaptive scheduling, autonomous diagnostics, or agentic systems.
Background in capacity planning, forecasting, resource allocation modelling, or infrastructure efficiency.
Experience with observability, hardware telemetry, performance monitoring, root cause analysis, or failure prediction.
Contributions to open-source machine learning, systems, infrastructure, or scientific computing projects.

We believe in investing in our people and value candidates who bring diverse experiences to our teams, even if they are not a 100% skill or experience match.

You may be a strong fit if:

You love uncovering hidden failure patterns in massive, noisy infrastructure datasets.
You are curious about building autonomous or agentic systems that investigate, explain, and optimise complex system behaviour.
You have deep expertise in predictive modelling, reinforcement learning, optimisation, statistical modelling, or large-scale data analysis.
You enjoy working from first principles on problems where the correct approach is not obvious.
You are interested in GPU infrastructure, distributed systems, AI training workloads, reliability engineering, and the operational behaviour of large-scale compute environments.
You want your research to move beyond analysis and into systems that improve real-world performance, uptime, utilisation, and cost.

At CoreWeave, we work hard, have fun, and move fast. We’re in an exciting stage of hyper-growth, operating at the centre of the demand for large-scale accelerated compute. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core
Act Like an Owner
Empower Employees
Deliver Best-in-Class Client Experiences
Achieve More Together

By joining Monolith’s Research team within CoreWeave, you will work on problems that sit directly at the frontier of AI infrastructure: how massive GPU systems behave, why workloads underperform, how they fail, and how they can be made more reliable, efficient, and intelligent.

This is an opportunity to help build a new category of infrastructure intelligence — one that moves beyond monitoring and dashboards toward systems that can understand, explain, predict, and optimise the behaviour of large-scale GPU clusters.

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As the organisation continues to grow, the opportunities to shape new technical directions are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.

To fulfill our obligation to protect client data, successful applicants offered employment with CoreWeave will be required to complete a basic criminal record check, conducted in compliance with GDPR. Employment offers are conditional upon receiving satisfactory check results.

What We Offer

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

✓ Family-level Medical Insurance
✓ Family-level Dental Insurance
✓ Generous Pension Contribution
✓ Life Assurance at 4x Salary
✓ Critical Illness Cover
✓ Employee Assistance Programme
✓ Tuition Reimbursement
✓ Work culture focused on innovative disruption

CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.

Recruitment Agencies

CoreWeave does not accept speculative CVs. Any unsolicited CVs received will be treated as the property of CoreWeave and your Terms & Conditions associated with the use of CVs will be considered null and void.

Any unsolicited CVs sent by your company to us – that is to say, in any situation where we have not directly engaged your company in writing to supply candidates for a specific vacancy – will be considered by us to be a “free gift”, leaving us liable for no fees whatsoever should we choose to contact the candidate directly and engage the candidate’s services, and will in no way establish any prior claim by your company to representation of that candidate should the candidate’s details also be submitted by any other party.

This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.

When you apply to a job on this site, the personal data contained in your application will be collected by CoreWeave UK Ltd. (“Controller”), which is located at Phosphor (6th Floor), 133 Park Street, London, SE1 9EA and can be contacted by emailing careers.eu@coreweave.com. Controller’s data protection officer can be contacted at privacy@coreweave.com. Your personal data will be processed for the purposes of managing Controller’s recruitment-related activities, which include setting up and conducting interviews and tests for applicants, evaluating and assessing the results thereof, and as is otherwise needed in the recruitment and hiring processes. Such processing is legally permissible under Art. 6(1)(f) of (i) Regulation (EU) 2016/679 (General Data Protection Regulation (“GDPR”)) and (ii) the GDPR as it forms part of the laws of the UK (“UK GDPR”), as necessary for the purposes of the legitimate interests pursued by the Controller, which are the solicitation, evaluation, and selection of applicants for employment. Your personal data will be shared with Greenhouse Software, Inc., a cloud services provider located in the United States of America and engaged by Controller to help manage its recruitment and hiring process on Controller’s behalf. With respect to transfers originating from the UK or the European Economic Area (“EEA”) to a country outside the UK or the EEA, we implement the appropriate transfer mechanism(s) and other appropriate solutions to address cross-border transfers as required by applicable law. You may request a copy of the suitable mechanisms we have in place by contacting us at privacy@coreweave.com.

Your personal data will be retained by Controller as long as Controller determines it is necessary to evaluate your application for employment. Where permitted by applicable law, we may also retain your personal data for a limited period after the recruitment process ends in order to consider you for future job opportunities, respond to legal claims, or comply with record-keeping obligations. Under the GDPR and the UK GDPR, you have the right to request access to your personal data, to request that your personal data be rectified or erased, and to request that processing of your personal data be restricted. You also have the right to data portability. In addition, you may lodge a complaint with the relevant supervisory authority: (i) A list of Europe’s data protection authorities can be found here; and (ii) for the UK, this is the Information Commissioner's Office.

For additional information, please see our Privacy Policy.