NVIDIA

Senior Site Reliability Engineer

Posted 20 Days Ago
Remote
Hiring Remotely in India
Senior level
As a Senior Site Reliability Engineer, you will design, build, and implement scalable cloud systems, improve deployment, and automate operations for DGX Cloud.
The summary above was generated by AI

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA is looking for a passionate member to join our DGX Cloud Engineering Team as a Sr. Site Reliability Engineer. In this role, you will play a significant part in helping to craft and guide the future of AI & GPUs in the Cloud. NVIDIA DGX Cloud is a cloud platform tailored for AI tasks, enabling organizations to transition AI projects from development to deployment in the age of intelligent AI. Are you passionate about cloud software development, and do you strive for quality? Do you pride yourself on building cloud-scale software systems? If so, join our team at NVIDIA, where we are dedicated to delivering GPU-powered services around the world!

What you'll be doing:

You will play a crucial role in the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, establishing world-class SRE measurements, creating automation tools that improve operational efficiency, and maintaining a high standard of service operability and reliability.

  • Design, build, and implement scalable cloud-based systems for PaaS/IaaS.

  • Work closely with other teams on new products or features/improvements of existing products.

  • Develop, maintain and improve cloud deployment of our software.

  • Participate in the triage & resolution of complex infra-related issues.

  • Collaborate with developers, QA, and Product teams to establish, refine, and streamline our software release process and software observability to ensure service operability, reliability, and availability.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.

  • Develop, maintain, and improve automation tools that improve the efficiency of SRE operations.

  • Practice balanced incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • BS or MS in Computer Science or equivalent program from an accredited University/College.

  • 8+ years of hands-on software engineering or equivalent experience.

  • Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.

  • Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.

  • Understanding of building agentic AI solutions, preferably with NVIDIA open-source AI solutions.

  • Demonstrated working experience with SRE principles such as emitting metrics for observability, and monitoring and alerting using logs, traces, and metrics.

  • Hands-on experience working with Docker, containers, Infrastructure as Code (such as Terraform), deployment, and CI/CD.

  • Exhibit knowledge of the concepts involved in working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route 53, etc.) and Azure.

Ways to stand out from the crowd:

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.

  • A track record of solving complex problems with elegant solutions.

  • Prior experience with Go, Python, and React.

  • Demonstrate delivery of complex projects in previous roles.

  • Showcase ability in developing frontend applications with concepts such as SSA and RBAC.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Top Skills

AWS
Azure
Docker
Go
Kubernetes
KubeVirt
Python
React
RESTful Web Services
Terraform
