Voltage Park

Infrastructure Engineer (InfiniBand)

Posted 9 Days Ago
In-Office or Remote
2 Locations
Senior level
Responsibilities include designing and maintaining automation and frameworks for infrastructure management, tuning InfiniBand networking and NCCL configurations, and improving performance across GPU systems.
The summary above was generated by AI

We are seeking an Infrastructure Engineer with a focus on InfiniBand/NCCL to join our Infrastructure Engineering team. Our engineers design and build automation, tooling, and systems that bridge the gap between physical infrastructure and the platforms that power large-scale AI/ML and HPC workloads.

This role combines the breadth of a core infrastructure engineer with a specialty in high-performance networking and GPU communication. You’ll help ensure our InfiniBand fabric and NCCL stack are tuned, reliable, and efficient at scale — supporting some of the world’s largest GPU clusters.

This is a fully remote position, although candidates must be based in the continental United States. Unfortunately, we are unable to provide sponsorship for this role.

Responsibilities
  • Design, build, and maintain automation, APIs, and frameworks to manage physical infrastructure at scale.

  • Develop and extend systems for server lifecycle management.

  • Implement and tune InfiniBand networking and NCCL configurations for multi-GPU communication.

  • Collaborate with Network, Platform, and Infrastructure Operations teams to support new infrastructure rollouts.

  • Diagnose and improve performance across GPU, NVSwitch, PCIe, and InfiniBand layers.

  • Write clear design documents and technical documentation to capture best practices.

Qualifications
  • 8+ years of professional experience in infrastructure engineering, HPC, or related domains.

  • Strong experience with Linux in production environments.

  • Proficiency in Python or similar languages for automation.

  • Deep understanding of InfiniBand networking (CX7 HCAs, fabrics, partitioning, GPUDirect).

  • Familiarity with NCCL, CUDA, and GPU topology optimization.

  • Knowledge of containerization and orchestration concepts.

  • Strong written and verbal communication skills.

Ideal Experiences
  • Experience with Dell PowerEdge XE9680 or other GPU-dense servers.

  • Prior work with NVIDIA H100s, NVSwitch, and large-scale NCCL testing.

  • Familiarity with Mellanox OFED, UCX, and Redfish/iDRAC for management.

  • Broader experience across infrastructure areas (storage, virtualization, networking).

Culture
  • Enjoy collaborating with a motivated, execution-focused team.

  • Comfortable operating with autonomy while aligning to company objectives.

  • Value precision, documentation, and knowledge-sharing.

  • Excited to grow as both a domain specialist (InfiniBand/NCCL) and a generalist infrastructure engineer.

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.

Compensation Range: $140K - $180K



Top Skills

Containerization
CUDA
GPUDirect
InfiniBand
Linux
NCCL
Orchestration
Python
