Citi

Senior Incident Optimization & Reliability Specialist - End-User Technology – Vice President

Posted 12 Days Ago
In-Office
Chennai, Tamil Nadu, IND
Senior level
Lead incident reduction and reliability for end-user technology by applying SRE principles. Analyze alerts, design intelligent event correlation on AIOps platforms, build automation and self-healing playbooks, mature observability across desktop and collaboration services, and partner with engineering and platform teams to reduce operational toil and improve alert quality.
The summary above was generated by AI

Position Information

  • Job Title: Senior Incident Optimization & Reliability Specialist - End-User Technology
  • Job Level: C-13
  • Department: Foundational Services - Production Operations
  • Location: Chennai, India

Must have 10-16 years of proven experience in infrastructure operations or software engineering, primarily as an Incident Optimization Specialist or as a Site Reliability Engineer (SRE) for End-User Computing.

Position Summary

The Senior Incident Optimization & Reliability Specialist serves as a critical bridge between our Technology Incident Optimization Program and the core End-User Technology domains, including cloud desktop infrastructure, Microsoft productivity tools, content management, and conference/video platforms. This role demands deep technical expertise combined with a strategic, data-driven mindset to drive tactical incident reduction while architecting the future state of intelligent event management and automation for end-user services.

Applying core Site Reliability Engineering (SRE) principles, you will be responsible for maturing our observability posture, building automated incident remediation workflows, and achieving measurable reductions in operational toil. By focusing on intelligent event management, automation, and continuous improvement, you will enhance the reliability and performance of services that are critical to our end-users. This position offers the unique opportunity to shape the future of a highly reliable, automated enterprise environment.

Key Responsibilities

  • Incident & Alert Analysis: Conduct comprehensive analysis of alert and incident patterns to identify top sources of operational noise, determine root causes, and develop data-driven strategies for reduction.
  • Intelligent Event Management: Design, implement, and optimize rules for event correlation, de-duplication, and suppression on AIOps and event management platforms. Develop domain-specific correlation logic leveraging configuration management data and end-user service topology.
  • Automation & Toil Reduction: Architect and develop automation playbooks for incident data enrichment and create self-healing capabilities to reduce manual intervention (toil) for common and recurring end-user technology incident scenarios.
  • Observability Maturity: Assess the current observability footprint across all end-user technology domains. Identify gaps and drive enhancements in telemetry, logging, and tracing to provide deeper insights and enable proactive issue detection.
  • Apply SRE Principles: Champion and apply core SRE practices to systematically improve service reliability. This includes contributing to the definition of Service Level Objectives (SLOs) and taking a data-driven approach to continuous improvement.
  • Cross-Functional Collaboration: Partner closely with end-user services, engineering, and platform teams to understand incident drivers, validate correlation logic, and provide expert guidance on event management and reliability best practices.
  • Quality Assurance: Continuously validate the effectiveness of implemented rules and automation to ensure no business-impacting alerts are missed. Monitor and report on alert quality metrics and lead iterative improvements.

Required Qualifications

  • Education: Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related technical field.
  • Experience: A minimum of 8 years of hands-on experience in IT operations, end-user computing, or a related field, with proven experience in incident reduction and operational excellence.
  • Event Management & Incident Reduction: Demonstrated success in leading event management and incident reduction initiatives with quantifiable results. Direct, hands-on experience with modern AIOps and enterprise event management platforms (e.g., BigPanda) is required.
  • Technical Expertise:
    • Deep understanding of end-user technology ecosystems, including VMware-hosted cloud desktop infrastructure, Microsoft 365 suite (Teams, Outlook, Office), SharePoint, and collaboration platforms.
    • Expertise with a broad range of domain-specific monitoring and observability tools.
  • Automation & Orchestration: Hands-on experience developing robust automation solutions using scripting languages (e.g., Python, PowerShell) and modern automation frameworks to reduce manual tasks.
  • Data Analysis: Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms.
  • Problem-Solving & Analytical Skills: Excellent analytical abilities with a systematic approach to troubleshooting complex issues and a holistic view of technology systems.
  • Communication & Leadership: Exceptional communication skills with the ability to influence and collaborate effectively across diverse, cross-functional teams.

Preferred Qualifications

  • An advanced degree (Master's) in a relevant technical field.
  • Relevant industry certifications (e.g., Microsoft 365, VMware, ITIL).
  • Experience with Site Reliability Engineering (SRE) practices and applying them in an enterprise context.
  • Knowledge of ITSM platforms, CMDB management, and infrastructure-as-code (IaC) principles.
  • Familiarity with financial services regulatory requirements.

------------------------------------------------------

Job Family Group: Technology

------------------------------------------------------

Job Family: Infrastructure

------------------------------------------------------

Time Type: Full time

------------------------------------------------------

Most Relevant Skills: Please see the requirements listed above.

------------------------------------------------------

Other Relevant Skills: For complementary skills, please see above and/or contact the recruiter.

------------------------------------------------------

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

 

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.

Citi Chennai, Tamil Nadu, IND Office

C P Ramaswamy Road, Chennai, Tamil Nadu, India, 600018
