koantek · 4mo ago
Site Reliability Engineering (SRE)
Quick Summary
Overview
About the Role: We are seeking a highly skilled and experienced SRE Databricks Platform Administrator to join our Data Operations Team. In this critical role, you will be responsible for the availability, performance, reliability, and scalability of our enterprise Databricks platform.
Technical Tools
aws, azure, gcp, gitlab-ci, grafana, jenkins, prometheus, python, sql, terraform, ci-cd, mentoring, performance-optimization, security-best-practices
About the Role:
We are seeking a highly skilled and experienced SRE Databricks Platform Administrator to join our Data Operations Team. In this critical role, you will be responsible for the availability, performance, reliability, and scalability of our enterprise Databricks platform. You will blend deep expertise in Databricks administration with SRE principles to automate operations, proactively identify and resolve issues, and ensure a seamless experience for our data engineering, data science, and analytics teams. You will champion best practices for platform governance, security, and cost optimization, playing a pivotal role in our data ecosystem.

Key Responsibilities:

Platform Operations & Reliability:
- Design, implement, and maintain the Databricks platform infrastructure across multiple cloud environments (AWS, Azure, or GCP).
- Ensure high availability, disaster recovery, and business continuity of Databricks workspaces, clusters, and associated services.
- Develop and implement robust monitoring, alerting, and logging solutions for the Databricks platform using tools like Prometheus, Grafana, ELK stack, or cloud-native monitoring services (CloudWatch, Azure Monitor, GCP Operations Suite).
- Proactively identify and address performance bottlenecks, resource constraints, and potential issues within the Databricks environment.
- Participate in on-call rotations to respond to and resolve critical incidents swiftly, performing root cause analysis (RCA) and implementing preventative measures.
- Manage and optimize Databricks clusters, including auto-scaling, instance types, and cluster policies, for both interactive and job compute workloads to ensure cost-effectiveness and performance.

Automation & Tooling:
- Develop and maintain Infrastructure as Code (IaC) using tools like Bicep, Terraform, or CloudFormation to automate the provisioning, configuration, and management of Databricks resources.
- Automate repetitive operational tasks, deployments, and environment provisioning using scripting languages (Python, Bash) and CI/CD pipelines (Jenkins, Azure DevOps, GitLab CI).
- Build and maintain custom tools and scripts to enhance Databricks platform capabilities, improve observability, and streamline workflows.

Security & Governance:
- Implement and enforce Databricks security best practices, including identity and access management (IAM) with Unity Catalog, SSO integration (Azure AD, Okta), service principals, and granular access controls (RBAC, row-level/column-level security).
- Ensure compliance with organizational security policies, data governance standards, and regulatory requirements (e.g., GDPR, HIPAA, industry-specific compliance).
- Conduct security audits and vulnerability assessments of the Databricks environment.
- Manage secrets using Databricks secrets or a cloud provider secret manager.

Performance Optimization & Cost Management:
- Analyze Databricks usage patterns, DBU consumption, and cloud resource costs to identify opportunities for optimization and efficiency gains.
- Implement strategies for cost control, including spot instance utilization, intelligent cluster resizing, and effective use of instance pools.
- Work with data teams to optimize Spark jobs, notebooks, and SQL queries for performance and cost.

Collaboration & Mentorship:
- Collaborate closely with data engineers, data scientists, architects, and other SREs to understand their requirements and provide expert guidance on Databricks best practices.
- Provide technical leadership and mentorship to junior administrators and engineers, fostering a culture of reliability and operational excellence.
- Stay up-to-date with the latest Databricks features, cloud services, and SRE methodologies, evaluating and recommending new technologies.
Location & Eligibility
Where is the job
India
On-site within the country
Listing Details
- Posted: December 15, 2025
- First seen: May 6, 2026
- Last seen: May 8, 2026
Posting Health
- Days active: 0
- Repost count: 0
- Trust Level: 14%
- Scored at: May 6, 2026