Ensure the availability and performance of a customer-facing platform by provisioning and maintaining Windows/Linux/Kubernetes infrastructure, implementing IaC (Terraform/ARM/CloudFormation), automating workflows, monitoring with observability tools, managing backups and DR, analyzing logs and metrics, and promoting security and ITIL best practices across DevOps, DBA, and development teams.
Job Summary
As a Site Reliability Engineer, you will play a critical role in ensuring the availability and performance of our customer-facing platform. You will work closely with DevOps, DBA, and Development teams to provision and maintain infrastructure, deploy and monitor our applications, and automate workflows. Your contributions will have a direct impact on customer satisfaction and overall user experience.
Responsibilities and Deliverables
- Manage, monitor, and maintain highly available systems (Windows and Linux).
- Analyze metrics and trends to ensure performance and rapid scalability.
- Address routine service requests while identifying ways to automate and simplify.
- Create infrastructure as code using Terraform, ARM templates, or CloudFormation.
- Maintain data backups and disaster recovery plans.
- Adhere to security best practices through all stages of the software development lifecycle.
- Follow and champion ITIL best practices and standards.
Organizational Alignment
- Reports to the Senior SRE Manager
- This role involves close collaboration with DevOps, DBA, and security teams.
Technical Proficiencies
- Hands-on experience with AWS is a must-have.
- Proficiency in analyzing application, IIS, system, and security logs, as well as CloudTrail events.
- Experience with CI/CD tools such as Jenkins and GitHub Actions.
- Experience maintaining and administering Windows, Linux, and Kubernetes.
- Experience in automation using scripting languages such as PowerShell, Bash, or Python.
- Good understanding of networking concepts (VPC, subnet, private link, peering).
- Familiarity with configuration management using Ansible, Azure Automation, or similar tools.
- Familiarity with observability tools such as New Relic, AppDynamics, or DataDog.
Experience
- 3+ years of experience in an SRE or System Administration role.
- Demonstrated ability to build and support high-availability Windows/Linux servers.
- 2+ years of experience working with cloud technologies such as AWS or Azure.
- Comfortable using Scrum, Kanban, or Lean methodologies.
Education
- Bachelor’s Degree or College Diploma in Computer Science, Information Systems, or equivalent experience.