Lead a team responsible for 24x7 infrastructure monitoring, incident management, and operational oversight, ensuring SLA adherence and service reliability.
Requisition Number: 2352103
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
The IO Engineering Supervisor - Infrastructure Operations & Monitoring leads a team of L1.5 / L2 infrastructure analysts responsible for 24x7 monitoring, queue management, and incident triage across enterprise compute, virtualization, and foundational network environments.
This role ensures SLA adherence, operational discipline, and service reliability, while acting as the primary escalation and incident leadership point during high-severity production events. The supervisor partners closely with Engineering, Platform, and Service Management teams to continuously improve monitoring effectiveness and operational maturity.
The ideal candidate brings hands-on infrastructure operations experience, solid people leadership skills, and deep familiarity with enterprise monitoring platforms such as WhatsUp Gold, SolarWinds, SCOM, and Nagios.
At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone - of every race, gender, sexuality, age, location and income - deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes - an enterprise priority reflected in our mission.
Primary Responsibilities:
- Team Leadership and Operational Oversight
  - Lead, mentor, and coach a team of infrastructure monitoring and operations analysts
  - Manage shift schedules, staffing coverage, and workload distribution to meet 24x7 operational needs
  - Conduct performance reviews, coaching sessions, and skills development planning
  - Drive consistent adherence to:
    - SLAs and OLAs
    - Ticket quality and documentation standards
    - Operational and escalation procedures
- Incident and Queue Management
  - Oversee the prioritization and handling of incidents, service requests, and problem records
  - Act as escalation leader and incident commander for high-severity and major incidents
  - Ensure timely, clear, and accurate incident communication to stakeholders
  - Validate escalation quality, technical accuracy, and completeness of RCA documentation
  - Enforce best practices for ticket categorization, routing, and hygiene
- Monitoring Operations and Infrastructure Health
  - Supervise real-time monitoring of compute, virtualization, and network components using enterprise tools including:
    - WhatsUp Gold
    - SolarWinds
    - System Center Operations Manager (SCOM)
    - Nagios (where applicable)
  - Review alerts to:
    - Identify recurring issues and systemic failures
    - Reduce alert noise and false positives
  - Ensure analysts consistently follow runbooks, SOPs, and response thresholds
  - Collaborate with engineering teams to:
    - Tune alerts
    - Improve monitoring coverage
    - Enhance event workflows and automation
- Technical Oversight (Compute / Virtualization / Network Fundamentals)
  - Provide senior-level guidance during escalations involving:
    - Windows Server
    - Linux systems
    - VMware vSphere
    - Basic networking and connectivity issues
  - Validate initial diagnostics and triage performed by analysts
  - Partner with engineering and platform teams to drive resolution of complex issues
  - Support readiness activities for planned changes, upgrades, and maintenance events
- Process, Governance and Continuous Improvement
  - Enforce ITIL-aligned practices for Incident, Change, and Problem Management
  - Own and maintain:
    - SOPs and runbooks
    - Escalation matrices
    - Operational documentation
  - Lead initiatives to improve:
    - Alert noise reduction
    - Queue efficiency
    - Analyst productivity and response quality
  - Drive cross-functional collaboration to raise overall operational maturity
- Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies in regards to flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so
Required Qualifications:
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- 7+ years of experience in:
  - Infrastructure operations
  - Compute administration
  - NOC or 24x7 monitoring environments
- 2+ years in a team lead or supervisory role within operations or NOC teams
- Solid hands-on experience with enterprise monitoring tools including:
  - WhatsUp Gold
  - SolarWinds
  - SCOM
- Experience with ticketing and ITSM platforms:
  - ServiceNow (preferred)
  - JIRA, Rally (acceptable alternatives)
- Solid understanding of:
  - Windows Server administration
  - Linux fundamentals
  - VMware vSphere concepts
- Working knowledge of networking fundamentals:
  - TCP/IP
  - DNS
  - Routing and ports
  - Firewalls
- Proven excellent communication, leadership, and incident management skills
- Demonstrated ability to operate effectively in a fast-paced, global operations environment
Preferred Qualifications:
- Solid understanding of ITIL processes; ITIL certification is a plus
- Hands-on experience with Nagios as an enterprise monitoring tool
- Experience with scripting or automation tools:
  - PowerShell
  - Python
  - Shell scripting
- Exposure to cloud platforms:
  - Microsoft Azure
  - AWS
  - Google Cloud Platform
- Industry certifications:
  - ITIL
  - VMware
  - Microsoft
  - Linux
Top Skills
AWS
DNS
Google Cloud Platform
JIRA
Linux
Azure
Nagios
PowerShell
Python
Rally
SCOM
ServiceNow
Shell Scripting
SolarWinds
TCP/IP
VMware vSphere
WhatsUp Gold
Windows Server
Optum Chennai, Tamil Nadu, IND Office
Chennai, Tamil Nadu, India

