RunPod

Infrastructure Solutions Engineer

Posted 11 Hours Ago
Remote
Hiring Remotely in USA
Mid level
As an Infrastructure Solutions Engineer, you will design and implement GPU-powered infrastructure solutions for AI/ML applications. You'll help customers optimize and scale their workloads across hybrid and cloud environments, troubleshoot issues, and collaborate with various teams to align infrastructure needs with product development.
The summary above was generated by AI

RunPod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded company with a remote-first organization spread globally. Our mission is to empower innovators and enterprises to unlock AI's true potential, driving technology and transforming industries. Join us as we shape the future of AI.

As our organization continues to expand rapidly and manage large-scale, distributed systems, we are looking for an Infrastructure Solutions Engineer to join our team. This is a unique opportunity to work at the forefront of AI infrastructure, helping customers design and implement GPU-accelerated solutions that power the next wave of innovation. If you thrive on technical problem-solving, love working with cutting-edge infrastructure, and want to shape the future of cloud-native AI, this role is for you.

As an Infrastructure Solutions Engineer, you’ll play a critical role in helping our customers unlock the full potential of GPU-powered infrastructure. From architecting high-performance solutions for AI/ML workflows to optimizing large-scale deployments in AI datacenters and edge environments, you’ll be at the heart of the action. This is a hybrid role that blends customer interaction, technical consulting, and hands-on engineering. You’ll work with enterprise customers, research teams, and internal stakeholders to deliver infrastructure solutions that are fast, reliable, and scalable. If you’re looking for a role where you can have a tangible impact on groundbreaking AI applications, this is it.

Responsibilities:

  • Design, build, and deploy GPU-centric infrastructure solutions that enable customers to accelerate their AI/ML workloads at scale.
  • Act as a trusted advisor, helping customers architect high-performance compute environments in GPU datacenters, edge environments, and hybrid cloud scenarios.
  • Lead technical onboarding for new customers, guiding them through best practices for managing and scaling GPU infrastructure.
  • Develop scripts, tools, and automations that improve efficiency and simplify large-scale GPU deployments.
  • Analyze and optimize infrastructure for performance, cost, and reliability across multi-cloud, on-premises AI datacenter, and hybrid deployments.
  • Troubleshoot customer issues related to GPU provisioning, container orchestration, and AI workload performance.
  • Work with Product, Sales, and Engineering teams to align customer needs with the development of new features and services.

Requirements:

  • 2-4 years of experience with GPU cloud platforms or in infrastructure engineering, DevOps, or SRE roles.
  • Experience with monitoring tools like Datadog, Grafana, or ELK (Elasticsearch, Logstash, Kibana) to support large-scale infrastructure environments.
  • Hands-on experience with NVIDIA GPUs (A100, H100, or similar) and a strong understanding of how GPUs accelerate AI/ML workloads.
  • Expertise with Kubernetes (K8s) and Docker for orchestrating containerized AI/ML workloads.
  • Proficiency in Python, Bash, or Go for automation, tooling, and infrastructure management.
  • Strong communication and interpersonal skills, with experience delivering technical solutions to both technical and non-technical stakeholders.

Preferred:

  • Experience supporting AI/ML frameworks like TensorFlow, PyTorch, or JAX in a production environment.
  • Familiarity with data center operations, including power, cooling, and rack deployment for GPU-heavy workloads.
  • Proficiency with cloud platforms like AWS, GCP, or Azure, with an emphasis on GPU instances, hybrid/multi-cloud deployments, and AI datacenter operations.

What You’ll Receive:

  • The competitive base pay for this position ranges from $100,000 to $160,000. Your actual pay may be determined by factors such as your specific job-related knowledge, skills, and experience.
  • Stock options
  • The flexibility of remote work with an inclusive, collaborative team.
  • An opportunity to grow with a company that values innovation and user-centric design.
  • Generous vacation policy to ensure work-life harmony and well-being.
  • The chance to contribute to a company with global impact and a presence in the US, Canada, and Europe.

RunPod is committed to maintaining a workplace free from discrimination and upholding the principles of equality and respect for all individuals. We believe that diversity in all its forms enhances our team. As an equal opportunity employer, RunPod is committed to creating an inclusive workforce at every level. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, marital status, protected veteran status, disability status, or any other characteristic protected by law.

Top Skills

Bash
Go
Python
