HPC Engineer

Shanghai, China · Mid-level

AlphaGrep is a leading global quantitative trading firm specializing in algorithmic trading across equities, commodities, FX, and fixed income. We hold significant market share internationally, supported by proprietary ultra-low-latency systems and strict risk controls.

AlphaGrep China is a dedicated RMB asset management platform focused on the Chinese market, serving institutional investors, family offices, and high-net-worth individuals. We combine deep expertise in China's equity and derivatives markets with AlphaGrep's global quantitative research capabilities to build diversified strategies that target stable, long-term returns.

Responsibilities

We are looking for an HPC / GPU Cluster Engineer to help design, operate, and continuously optimize the large-scale GPU compute environment we use for distributed model training and high-performance workloads. You will own the performance and reliability of the cluster end to end: the GPUs and interconnect fabric, the storage layer, the scheduler, monitoring, and the tooling that keeps hundreds of accelerators running efficiently.

Requirements

Experience operating large GPU or HPC clusters in production, including systematically identifying and resolving performance bottlenecks.

Hands-on experience with RDMA programming and low-level RDMA debugging, including RDMA verbs / libibverbs, memory registration, queue pairs, completion queues, RDMA read/write, send/recv, and performance benchmarking.

Practical experience with the technologies underpinning distributed training, including InfiniBand and/or RoCEv2, GPUDirect RDMA, GPUDirect Storage, NCCL, CUDA drivers, and OFED.

Experience with workload schedulers such as Slurm, including installation, configuration, queue/partition management, job troubleshooting, and monitoring.

Experience setting up and managing high-performance shared storage or parallel/distributed filesystems such as Lustre, BeeGFS, WEKA, VAST, DDN/ExaScaler, or similar systems.

Solid scripting and programming ability in Python and Bash, and preferably C/C++, for cluster automation, diagnostics, benchmarking, and monitoring.

Familiarity with HPC network design, including blocking vs. non-blocking fabric architectures, InfiniBand, high-performance Ethernet, network topology, congestion control, and end-to-end bandwidth/latency troubleshooting.

Strong Linux system administration skills, including networking, filesystems, kernel/driver issues, process/resource management, and performance debugging.
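As a small illustration of the "blocking vs. non-blocking fabric" distinction named in the requirements above, the sketch below computes the oversubscription ratio of a single leaf switch: host-facing bandwidth divided by spine-facing bandwidth. A ratio of 1:1 or lower is conventionally called non-blocking. The port counts, the 400 Gb/s link speed, and the function names are hypothetical and chosen only for the example; they do not describe AlphaGrep's actual cluster.

```python
from fractions import Fraction


def oversubscription(downlinks: int, uplinks: int,
                     down_gbps: int = 400, up_gbps: int = 400) -> Fraction:
    """Return host-facing / spine-facing bandwidth for one leaf switch.

    All arguments are illustrative assumptions, not real cluster values.
    """
    if downlinks <= 0 or uplinks <= 0:
        raise ValueError("port counts must be positive")
    return Fraction(downlinks * down_gbps, uplinks * up_gbps)


def is_non_blocking(ratio: Fraction) -> bool:
    """A leaf is non-blocking when uplink capacity covers all downlinks."""
    return ratio <= 1


if __name__ == "__main__":
    # Hypothetical leaf configurations: (host ports, spine ports).
    for down, up in [(32, 32), (48, 16)]:
        r = oversubscription(down, up)
        label = "non-blocking" if is_non_blocking(r) else "blocking"
        print(f"{down} down / {up} up -> {r.numerator}:{r.denominator} ({label})")
```

With these assumed inputs, the 32/32 leaf comes out at 1:1 and the 48/16 leaf at 3:1, which is the kind of quick check that helps explain why a collective runs below line rate once traffic crosses the spine.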

 

Trust is the foundation of collaboration.

We foster open communication and proactive ownership, so that every team member can grow with a strong sense of security, make autonomous decisions, and move forward together. This trust, built through mutual support and shared commitment, is our most valued asset.

Great People. We are curious engineers, mathematicians, and statisticians who like to have fun while achieving our goals.

Transparent Structure. Our employees know that we value their ideas and contributions.

Relaxed Environment. Flat organization with yearly offsites, happy hours, and more.

Health & Wellness Programs. Gym membership, stocked kitchen, and generous vacation.

Location & Eligibility

Where is the job: Shanghai, China (on-site at the office)
Who can apply: CN

Listing Details

Posted
June 9, 2026
First seen
June 9, 2026
Last seen
June 9, 2026
