
NatWest Group

Site Reliability Engineer (AWS & Kubernetes), VP

Posted 4 Hours Ago
In-Office
Chennai, Tamil Nadu, IND
Senior level

Join us as a Site Reliability Engineer

  • In this key role, you’ll improve, drive, and embed non-functional and operational characteristics such as availability, performance, efficiency, change management, monitoring, security, incident response, and capacity planning of our products and services
  • You’ll enjoy significant stakeholder interaction, working in collaboration with engineers to ensure a principled approach to deliver change in a safe and secure way
  • This is a chance to join an inclusive team with a collaborative ethos and a commitment to innovation and professional development
  • We're offering this role at vice president level
What you'll do

As a Senior Site Reliability Engineer, you’ll act as a hands‑on expert responsible for ensuring the reliability, availability and performance of critical production platforms.

You’ll lead the adoption of SRE practices, embedding resilience, observability and operational excellence into distributed systems running on AWS and Kubernetes. You’ll also take ownership of 24/7 production support models, ensuring systems are highly available and incidents are effectively managed and learned from.

In addition to this, your responsibilities will include:

  • Designing and operating highly resilient AWS-based Kubernetes platforms (EKS) aligned to enterprise standards
  • Owning and improving production reliability, availability, and SLA/SLO frameworks
  • Leading incident management, escalation and 24/7 on-call practices, including post-incident reviews
  • Embedding SRE principles such as error budgets, toil reduction, and reliability engineering into delivery teams
  • Implementing infrastructure and platform automation using Terraform and GitOps methodologies
  • Driving self-healing, auto-scaling and failure recovery mechanisms using tools such as Karpenter
  • Building secure, scalable networking and service communication (e.g. Cilium)
  • Defining and operating observability platforms using Grafana, Prometheus, Loki and Tempo
  • Partnering with DevOps and engineering teams to ensure production readiness and operational excellence
  • Leading complex troubleshooting across distributed systems and cloud-native environments
  • Developing reusable “golden paths”, operational runbooks and reliability patterns
  • Ensuring platforms meet regulatory, security and operational risk requirements
  • Using data, SLIs and metrics to drive continuous improvement and proactive reliability enhancements
The skills you'll need

We’re looking for a highly experienced SRE with a strong background in operating large-scale, business-critical platforms and a passion for reliability engineering.

We’re also looking for:

  • Deep expertise managing production systems on AWS and Kubernetes (EKS)
  • Strong experience in 24/7 support models, incident management and on-call leadership
  • Advanced knowledge of SRE principles (SLIs, SLOs, error budgets, toil reduction)
  • Proficiency in Terraform, GitOps, and cloud automation practices
  • Hands-on experience with GitLab CI/CD and Argo CD
  • Strong understanding of Kubernetes networking, security and service mesh technologies, ideally Cilium
  • Experience scaling infrastructure using Karpenter and auto-scaling strategies
  • Expertise in observability tooling (Grafana, Prometheus, Loki and Tempo)
  • Proven ability to troubleshoot and resolve complex, cross-system production issues
  • Experience operating in regulated or high-security environments
  • Strong leadership, mentoring, and stakeholder engagement capabilities
  • Ability to balance reliability, risk, and delivery in a fast-paced environment

Hours

45

Job Posting Closing Date:

16/06/2026

NatWest Group Chennai, Tamil Nadu, IND Office

Kosmo One, Plot No 14 3rd Main Road, Ambattur Industrial Estate, Chennai, Tamil Nadu, India, 600 058

Similar Jobs

12 Minutes Ago
In-Office
Chennai, Tamil Nadu, IND
Mid level
Artificial Intelligence • Big Data • Healthtech • Information Technology • Machine Learning • Software • Analytics
Develop and maintain full-stack applications, enhance APIs, support operational readiness, and collaborate with teams using modern technologies in a healthcare environment.
Top Skills: AWS, Azure, GCP, GitHub Actions, Java, Postgres, Python, React, Spring Boot
12 Minutes Ago
In-Office
Chennai, Tamil Nadu, IND
Senior level
Artificial Intelligence • Big Data • Healthtech • Information Technology • Machine Learning • Software • Analytics
Lead the design and development of scalable web applications using Java and React, implementing cloud infrastructure on Azure and CI/CD practices.
Top Skills: Spark, Azure, Azure DevOps, Azure Functions, Azure SQL, Context API, Cosmos DB, CSS3, Databricks, GitHub Actions, Hibernate/JPA, HTML5, Java 17+, Key Vault, React, Redux, RESTful APIs, Service Bus, Spring Boot, Spring Security, SQL, Terraform, TypeScript
7 Hours Ago
Easy Apply
Remote or Hybrid
India
Mid level
Cloud • Information Technology • Security • Software • Cybersecurity
The Lead Technical Enablement Engineer designs and delivers training programs for technical roles, focusing on onboarding, collaboration, and driving engagement for Solutions Consulting teams.
Top Skills: Cloud Security, Enterprise Security Technologies, Networking

What you need to know about the Chennai Tech Scene

To locals, it's no secret that South India is leading the charge in big data infrastructure. While the environmental impact of data centers has long been a concern, emerging hubs like Chennai are favored by companies seeking ready access to renewable energy resources, which provide more sustainable and cost-effective solutions. As a result, Chennai, along with neighboring Bengaluru and Hyderabad, is poised for significant growth, with a projected 65 percent increase in data center capacity over the next decade.
