Senior Site Reliability Engineer (SRE)

BA | Senior

Founded in 2004, Bicom Systems has grown into a global communications company with team members and customers around the world. We create reliable, easy-to-use tools that help businesses stay connected through calling, messaging, and modern collaboration solutions. Our products are used in thousands of workplaces every day, supporting millions of people as they communicate and work together. What sets us apart is the strong partnerships we build and the supportive global ecosystem behind everything we do.

At Bicom Systems, we believe results come from teams that take ownership, act with trust, and are accountable for their work. We invest in professional growth, encourage thoughtful decision-making, and support a healthy work-life balance.

We are looking for a senior professional with 5+ years of experience who wants to take initiative, make meaningful contributions, and grow with a supportive team. Join us!

Responsibilities

→Operate and support a 24/7 production VoIP hosting environment, including monitoring, incident response, and on-call rotation;
→Build, maintain, and optimize virtualization and container platforms supporting VoIP softswitches;
→Design and maintain monitoring, alerting, and observability across infrastructure and network layers;
→Automate infrastructure provisioning and lifecycle using IaC and scripting tools;
→Troubleshoot complex Linux, storage, and network issues with a focus on latency, QoS, and reliability;
→Lead platform upgrades, maintenance, and migrations with minimal or no downtime;
→Perform capacity planning and performance optimization;
→Create and maintain operational documentation and runbooks.

Requirements

Strong Linux systems administration (Debian/Ubuntu, RHEL/CentOS/Alma, kernel troubleshooting).
Deep experience with LXC/LXD containers and at least one hypervisor (KVM/QEMU, Proxmox, VMware).
Hands-on experience with ZFS (pool design, tuning, snapshots, send/receive, scrub/repair).
Storage networking: iSCSI target/initiator administration and familiarity with NVMe over Fabrics (NVMe-oF, NVMe/TCP) concepts and troubleshooting.
Proficiency with server-grade hardware: ECC memory, RAID controllers, NVMe drives, BMC remote management, and firmware/BIOS upgrades.
Scripting and development skills (Bash + one higher-level language such as Python, Go, or Ruby) for automation and tooling.
Experience implementing monitoring/observability (Prometheus, Grafana, ELK/EFK, or equivalent) and defining alerting thresholds.
Strong low-level troubleshooting skills: I/O performance profiling, CPU pinning, network capture/analysis (tcpdump, Wireshark), and interpreting kernel logs.