Job Description
We are seeking a Site Reliability Engineer (SRE I / SRE II) to help manage our servers, deployments, and overall system reliability. The ideal candidate is passionate about scalability, automation, and troubleshooting production systems to ensure maximum uptime.

What You'll Do:
- Design & Maintain Reliable Systems: Ensure high availability and fault tolerance of deployed services by following best practices in cloud infrastructure and Kubernetes.
- Support SDLC & CI/CD Pipelines: Collaborate with development teams to improve the CI/CD process, ensuring smooth deployments and rollback strategies.
- Troubleshoot Production Issues: Investigate incidents, perform root cause analysis, and implement proactive monitoring to minimize downtime.
- Infrastructure as Code (IaC) & Automation: Build and manage infrastructure using Terraform and Kubernetes, reducing manual intervention.
- Logging, Monitoring, & Alerting: Set up observability tools (Datadog, Prometheus, Grafana, ELK, etc.) to detect and resolve system anomalies.
- Security & Compliance: Ensure patch management, security best practices, and cloud security policies are enforced.
- Participate in On-Call Rotation: Share responsibility for system health, responding to alerts, and maintaining service reliability.

Who You Are:
- 2-5 years of experience designing and maintaining cloud-based applications and infrastructure.
- Strong Linux administration skills.
- Hands-on experience with GCP or AWS (1-2 years minimum).
- Experience working with Terraform for infrastructure-as-code (1+ years).
- Knowledge of Kubernetes (GKE/EKS) and Docker for container orchestration.
- Strong troubleshooting skills, particularly in CI/CD failures, networking, and cloud services.
- Familiarity with monitoring tools like Prometheus, Grafana, and Datadog.
- Experience in scripting (Bash, Python, or Go) for automation.
- Understanding of best practices in SDLC, version control (Git), and release management.

Bonus Points:
- Experience supporting on-prem deployments and hybrid cloud models.
- Exposure to incident response & disaster recovery planning.
- Security compliance knowledge (SOC 2, NIST, etc.).

(ref: hirist.tech)