koantek · 7mo ago

Site Reliability Engineering (SRE)

India · Mumbai · mid

Technical Tools

aws, azure, gcp, gitlab-ci, grafana, jenkins, prometheus, python, sql, terraform, ci-cd, mentoring, performance-optimization, security-best-practices

About the Role:

We are seeking a highly skilled and experienced SRE Databricks Platform Administrator to join our Data Operations team. In this critical role, you will be responsible for the availability, performance, reliability, and scalability of our enterprise Databricks platform. You will combine deep expertise in Databricks administration with SRE principles to automate operations, proactively identify and resolve issues, and ensure a seamless experience for our data engineering, data science, and analytics teams. You will champion best practices for platform governance, security, and cost optimization, playing a pivotal role in our data ecosystem.

Key Responsibilities:

Platform Operations & Reliability:
- Design, implement, and maintain the Databricks platform infrastructure across multiple cloud environments (AWS, Azure, or GCP).
- Ensure high availability, disaster recovery, and business continuity of Databricks workspaces, clusters, and associated services.
- Develop and implement robust monitoring, alerting, and logging solutions for the Databricks platform using tools like Prometheus, Grafana, the ELK stack, or cloud-native monitoring services (CloudWatch, Azure Monitor, GCP Operations Suite).
- Proactively identify and address performance bottlenecks, resource constraints, and potential issues within the Databricks environment.
- Participate in on-call rotations to respond to and resolve critical incidents swiftly, performing root cause analysis (RCA) and implementing preventative measures.
- Manage and optimize Databricks clusters, including auto-scaling, instance types, and cluster policies, for both interactive and job compute workloads to ensure cost-effectiveness and performance.

Automation & Tooling:
- Develop and maintain Infrastructure as Code (IaC) using tools like Bicep, Terraform, or CloudFormation to automate the provisioning, configuration, and management of Databricks resources.
- Automate repetitive operational tasks, deployments, and environment provisioning using scripting languages (Python, Bash) and CI/CD pipelines (Jenkins, Azure DevOps, GitLab CI).
- Build and maintain custom tools and scripts to enhance Databricks platform capabilities, improve observability, and streamline workflows.

Security & Governance:
- Implement and enforce Databricks security best practices, including identity and access management (IAM) with Unity Catalog, SSO integration (Azure AD, Okta), service principals, and granular access controls (RBAC, row-level and column-level security).
- Ensure compliance with organizational security policies, data governance standards, and regulatory requirements (e.g., GDPR, HIPAA, industry-specific compliance).
- Conduct security audits and vulnerability assessments of the Databricks environment.
- Manage secrets using Databricks secrets or a cloud provider's secret manager.

Performance Optimization & Cost Management:
- Analyze Databricks usage patterns, DBU consumption, and cloud resource costs to identify opportunities for optimization and efficiency gains.
- Implement strategies for cost control, including spot instance utilization, intelligent cluster resizing, and effective use of instance pools.
- Work with data teams to optimize Spark jobs, notebooks, and SQL queries for performance and cost.

Collaboration & Mentorship:
- Collaborate closely with data engineers, data scientists, architects, and other SREs to understand their requirements and provide expert guidance on Databricks best practices.
- Provide technical leadership and mentorship to junior administrators and engineers, fostering a culture of reliability and operational excellence.
- Stay up to date with the latest Databricks features, cloud services, and SRE methodologies, evaluating and recommending new technologies.