Sr. SRE Platform Software Engineer

US · San Jose · Remote · Senior

Quick Summary

Key Responsibilities

Collection & Storage: collection-agent, customer-sdk-gateway, metrics-store, logs-store, traces-store, profiles-store, analytics-lake, enrichment-service, collection-monitor. Alert,

Bitdeer is a world-leading technology company for AI and Bitcoin mining infrastructure.

Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers and building AI computational infrastructure to support the AI revolution. Bitdeer handles complex processes involved in computing such as equipment procurement, transport logistics, data center design and construction, equipment management, and daily operations. Bitdeer also offers advanced cloud capabilities to customers with high demand for artificial intelligence.

Headquartered in Singapore, Bitdeer has deployed data centers across multiple countries, including the United States, Norway, Bhutan, and Ethiopia.

To learn more, visit https://ir.bitdeer.com/

Build and operate one or more bounded contexts of the NeoCloud SRE platform, the multi-region substrate that observes, protects, and operates a GPU rental fleet across self-built and OEM-rented data centers. You take an architect-approved design and turn it into production code that ships through GitOps and the CI/CD release pipeline, follows the Plugin Framework conventions, meets declared SLOs, and stays drift-free.

This is the build + run role. You don't only ship code; you ship a service that other squads, cloud-service teams, and tenants depend on. You take the on-call pager for what you build.

Responsibilities

→ Software Engineering Experience: 7+ years of production software engineering experience, including 2 or more years operating what you built (real on-call experience, not just shipping code).
→ Programming Languages: Production-depth mastery of at least one systems-grade language: Go (preferred), Rust, or Java. Proficiency in Python for tooling and SDK work.
→ Distributed Systems Fundamentals: Strong grasp of at-least-once vs. exactly-once trade-offs, idempotency, back-pressure, leader election, consistent hashing, gossip, and fan-out. Ability to evaluate CRDT vs. Raft vs. Paxos and select the right tool for the job.
→ Multi-Region Observability Stack: Experience at production scale with Prometheus, VictoriaMetrics, Mimir, Thanos, Loki, Elasticsearch, Tempo, Jaeger, or OpenTelemetry. Must have built or substantively contributed to the ingest, query, or storage paths of these systems.
→ GitOps & CI/CD: Hands-on experience with Argo, Flux, Helm, Kustomize, Cosign signing, signed-bundle promotion, and blast-radius-aware rollouts.
→ Kubernetes Operator Pattern: Proven experience writing a controller or CRD handling real production traffic, with a deep understanding of watch-cache mechanics, leader election, and reconcile loops.
→ mTLS & Secrets Management: Experience executing end-to-end mTLS bootstrap with certificate rotation. Hands-on experience with HashiCorp Vault or cloud KMS (AWS KMS / GCP KMS).
→ SQL & Time-Series Data: Ability to read a Prometheus query plan, build a recording-rule strategy, and write SQL that joins per-tenant telemetry against analytics-lake tables.
→ Testing Discipline: Rigorous approach to unit, integration, contract, chaos, and soak testing. Experience writing and maintaining your own comprehensive tests.
→ Technical Writing Fluency: Ability to author clear design docs that align with existing platform architecture, create runbooks optimized for 3 AM on-call responses, and write intent-driven PR descriptions.

Requirements

NVIDIA Internals: Deep understanding of DCGM and NVIDIA driver internals, including XID semantics and MIG / vGPU partitioning.
Networking & Fabrics: Experience with InfiniBand or RoCE fabrics, including subnet managers, partitioning, optical health, and NCCL collective tracing.
HPC Storage: Experience managing Lustre, NetApp, Pure, DDN, VAST, or NVMe-oF under multi-tenant loads.
Hardware Management: Hands-on experience with BMC, IPMI, and Redfish at OEM scale (Supermicro, Dell, HPE, Lenovo).
Cluster Platform Internals: Familiarity with Kubernetes GPU Operator, Slurm controller, or Ray GCS.
BS/MS in Computer Science or similar
Hyperscale or NeoCloud experience

--------------------------------------------------------------------

Bitdeer is committed to providing equal employment opportunities in accordance with country, state, and local laws. Bitdeer does not discriminate against employees or applicants based on conditions such as race, color, gender identity and/or expression, sexual orientation, marital and/or parental status, religion, political opinion, nationality, ethnic background or social origin, social status, disability, age, indigenous status, or union membership.

Location & Eligibility

Where is the job

San Jose, US

Remote within one country

Listing Details

Posted: June 23, 2026
First seen: June 24, 2026
Last seen: July 14, 2026

Bitdeer Technologies Group (NASDAQ: BTDR) is a world-leading technology company for Bitcoin mining and AI cloud computing. Headquartered in Singapore, Bitdeer provides comprehensive Bitcoin mining solutions including equipment procurement, infrastructure management, and high-performance computing services.

Employees: 240
Founded: 2018
