Lead SRE at JPMorgan Chase focusing on site reliability, defining requirements, managing incidents, mentoring, and developing AI/ML solutions.
Job Description
Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.
As a Principal Site Reliability Engineer at JPMorgan Chase within the AI/ML & Data platform team, you work with your fellow stakeholders to define non-functional requirements (NFRs) and availability targets for the services in your application and product lines. You will ensure those NFRs are accounted for in your products' design and test phases, that your service level indicators are effectively measuring customer experience, and that service level objectives are defined with stakeholders and implemented in production.
Job responsibilities
- Demonstrate expertise in application development and support with multiple technologies such as Databricks, Snowflake, AWS, Kubernetes, etc.
- Coordinate incident management coverage to ensure effective resolution of application issues.
- Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
- Mentor and guide team members to foster innovation and strategic change.
- Develop and support AI/ML solutions for troubleshooting and incident resolution.
Required qualifications, capabilities, and skills
- Formal training or certification in SRE concepts and 5+ years of applied experience
- Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform
- Proficiency in running production incident calls and managing incident resolution.
- Experience in observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Strong understanding of SLI/SLO/SLA and Error Budgets
- Proficiency in Python or PySpark for AI/ML modeling.
- Must be able to reduce toil by building new tools to automate repeated tasks.
- Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery
- Understanding of network topologies, load balancing, and content delivery networks.
- Awareness of risk controls and compliance with departmental and company-wide standards.
- Ability to work collaboratively in teams and build meaningful relationships to achieve common goals.
Preferred qualifications, capabilities, and skills
- Experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies.
- AWS and Databricks certifications.
Top Skills
AWS
Databricks
Datadog
Dynatrace
Grafana
Kubernetes
Prometheus
PySpark
Python
Snowflake
Splunk

