Technical Competencies
Essential Skills
HPC & AI Infrastructure:
- Cluster Provisioning: Proficiency with NVIDIA Base Command Manager (BCM) for bare-metal provisioning and image management.
- Hardware Architecture: Deep understanding of NVIDIA DGX/HGX/NVL72/MGX architectures, including PCIe topology, NVLink/NVSwitch connectivity, and GPU memory hierarchy.
- Workload Management: Ability to troubleshoot basic Slurm issues (job dependencies, partition misconfigurations, node draining); a representative triage sketch follows this list.
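For context, "basic Slurm issues" at this level means first-pass triage of the kind sketched below (job IDs, partition, and node names are illustrative):

    # Why is job 12345 still pending? Check its state, pending reason, and dependency chain.
    squeue -j 12345 --format="%.10i %.9P %.2t %.20r %.20E"
    scontrol show job 12345 | grep -E "JobState|Reason|Dependency"

    # Partition misconfiguration: compare what the partition advertises with what its nodes report.
    scontrol show partition gpu-partition
    sinfo -p gpu-partition -Nel
    scontrol show node dgx-node01 | grep -E "State|Reason"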
Linux & Automation:
- Linux Administration: Strong command of RHEL/Ubuntu internals, including systemd, kernel modules, and package management.
- Automation: Ability to read and execute Ansible playbooks and write basic Python/Bash scripts for task automation; a typical playbook run and Git flow are sketched after this list.
- Version Control: Familiarity with Git workflows (pulling code, creating branches, committing config changes).
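"Read and execute" in practice looks like the sketch below (the inventory path, playbook, node, and branch names are hypothetical):

    # Dry-run an existing playbook against a single node before touching the fleet.
    ansible-playbook -i inventories/prod site.yml --limit dgx-node01 --check --diff
    # Apply it for real once the diff looks correct.
    ansible-playbook -i inventories/prod site.yml --limit dgx-node01

    # Typical Git flow for a configuration change.
    git pull --rebase origin main
    git checkout -b fix/ldap-sssd-timeout
    git add roles/auth/templates/sssd.conf.j2
    git commit -m "Increase SSSD timeout for LDAP referrals"
    git push -u origin fix/ldap-sssd-timeout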
Desirable Experience
- Networking: Experience with high-speed interconnects (InfiniBand NDR/HDR, RoCEv2) and debugging connectivity issues; typical first-line checks are sketched after this list.
- Containerisation: Experience with Docker and Kubernetes (specifically the NVIDIA GPU Operator).
- Cisco Integration: Familiarity with Cisco UCS or Cisco Nexus configurations in an AI context.
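By way of illustration, first-line checks in these areas usually look like the following (node and namespace names are illustrative, and the NVIDIA GPU Operator is assumed to be deployed in a namespace called gpu-operator):

    # InfiniBand: confirm links are up and running at the expected rate.
    ibstat | grep -E "State|Rate"
    # RoCEv2 / Ethernet: list the RDMA devices the node actually exposes.
    rdma link show

    # Kubernetes: verify GPU Operator pods are healthy and GPUs are advertised to the scheduler.
    kubectl get pods -n gpu-operator
    kubectl describe node dgx-node01 | grep -i nvidia.com/gpu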
Certifications
Highly Desirable:
- NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
- NVIDIA-Certified Professional: AI Infrastructure (NCP-AII)
- Red Hat Certified System Administrator (RHCSA)
Success Metrics (KPIs)
- Deployment Velocity: Achieving a turnaround of less than 24 hours from "Rack Handover" to "OS Provisioned" for assigned nodes.
- Validation Accuracy: 100% of handed-over nodes pass the "Golden Config" validation script (zero "Dead-on-Arrival" nodes handed to the client); a minimal sketch of such a check follows this list.
- Ticket Efficiency: Consistently meeting SLAs for L2 support tickets.
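The "Golden Config" validation script itself is maintained centrally; the sketch below only illustrates the kind of checks it aggregates (the expected driver version and GPU count are placeholders, not real baselines):

    #!/usr/bin/env bash
    # Minimal golden-config style check; expected values are illustrative placeholders.
    set -euo pipefail

    EXPECTED_DRIVER="550.90.07"   # placeholder - the real value comes from the LLD
    EXPECTED_GPUS=8

    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
    gpus=$(nvidia-smi --list-gpus | wc -l)

    [[ "$driver" == "$EXPECTED_DRIVER" ]] || { echo "FAIL: driver $driver"; exit 1; }
    [[ "$gpus" -eq "$EXPECTED_GPUS" ]] || { echo "FAIL: only $gpus GPUs visible"; exit 1; }

    # NVLink and ECC health.
    nvidia-smi nvlink --status | grep -qi "inactive" && { echo "FAIL: inactive NVLink"; exit 1; }
    nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total --format=csv,noheader \
      | grep -qv "^0" && { echo "FAIL: uncorrected ECC errors present"; exit 1; }

    echo "PASS: node matches golden config"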
Role Title: HPC Engineer – Compute
Location: India (Must align with Client Time Zone)
Employment Type: Full-Time
About the Role
The HPC Engineer – Compute acts as the primary execution engine for the physical and logical lifecycle of high-performance GPU compute fleets. While the Domain Architect designs the "Gold Standard" and the Squad Lead directs the workflow, you are the "Builder" responsible for hands-on configuration and validation of the infrastructure. You are a "doer" who is as comfortable flashing firmware on an NVIDIA B300 from the CLI as you are debugging a failed Ansible playbook.
As a System Integrator, we do not simply manage a static cloud; we design and deliver bespoke, high-scale AI factories for the world's leading enterprises. In this role, you will move beyond standard server administration to execute the repeatable, scalable, and automated deployment of NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments.
In this role, you will operate with a 100% focus on Delivery, executing the Low-Level Designs (LLD) assigned by your Squad Lead. You will own the "Build" phase in the critical "Compute-Network-Storage" triad, ensuring our clients receive defect-free, validated infrastructure.
CRITICAL REQUIREMENT: This role typically operates on shift hours to align with the onshore client's time zone (e.g., early shifts for Australian clients or split shifts for European clients).
Key Responsibilities
Build & Provisioning (Execution)
- Cluster Assembly: Execute the automated provisioning of bare-metal AI nodes (DGX H100/Blackwell, HGX, IGX, MGX, NVL72) using NVIDIA Base Command Manager (BCM, formerly Bright Cluster Manager).
- Infrastructure as Code (IaC) Execution: Run and maintain the Ansible playbooks required to configure Operating Systems (RHEL/Ubuntu), inject user authentication (LDAP/AD), and mount high-performance parallel file systems.
- Hardware Lifecycle: Perform complex firmware upgrades (SBIOS, BMC, GPU VBIOS, NVSwitch) across massive clusters without disrupting operations, ensuring strict adherence to the NVIDIA Firmware recipe.
- Network Integration: Configure compute-side networking (IPoIB, Netplan) to align with the Ethernet or InfiniBand fabric specifications provided by the Network Domain; a firmware and interface audit is sketched after this list.
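Before and after any firmware or network change window, the expected discipline is a read-only audit of the kind below (a sketch; the actual update procedure follows the vendor runbook and is not shown):

    # Record current firmware and driver levels before touching anything.
    nvidia-smi --query-gpu=index,name,vbios_version,driver_version --format=csv
    sudo ipmitool mc info | grep -i "firmware revision"   # BMC firmware
    sudo dmidecode -s bios-version                        # SBIOS
    ibstat | grep -i "firmware version"                   # HCA firmware, where fitted

    # Confirm the IPoIB interfaces came up as specified by the Network Domain.
    ip -br addr show | grep -E "^ib"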
Validation & Testing
- Performance Benchmarking: Execute and log standard acceptance tests, including HPL (High-Performance Linpack) and NCCL-tests, to verify node performance against the "Gold Standard" benchmarks; representative commands follow this list.
- Health Checks: Run "Burn-in" scripts on new hardware to identify "Dead on Arrival" components (e.g., faulty GPU memory, loose NVLink cables, Xid errors) before client handover.
- Test Reporting: Generate "As-Built" validation reports that the Field Solutions Engineers (FSEs) use for final client sign-off.
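For reference, acceptance runs of this kind typically reduce to commands such as the following (a sketch; binary paths, process counts, and pass thresholds come from the "Gold Standard" definition and are illustrative here):

    # DCGM diagnostic at run level 3 (long) as a burn-in / dead-on-arrival screen.
    dcgmi diag -r 3

    # NCCL all-reduce bandwidth across the 8 local GPUs (assumes an MPI-enabled nccl-tests build);
    # compare the reported bus bandwidth against the golden baseline.
    mpirun -np 8 ./all_reduce_perf -b 8 -e 8G -f 2 -g 1

    # HPL run (reads HPL.dat from the working directory); the xhpl binary path is illustrative.
    mpirun -np 8 ./xhpl

    # Scan the kernel log for Xid events raised during the runs.
    sudo dmesg -T | grep -i xid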
Operations & Support
- Ticket Resolution: Handle L2 support tickets escalated from L1, such as "Stuck Slurm Jobs," "GPU falling off the bus," "ECC Errors," or "Node not draining"; typical first-response commands follow this list.
- Routine Maintenance: Execute scheduled maintenance windows for kernel patching, driver updates, and security hardening.
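Typical first-response commands for the ticket categories above (a sketch; node names are illustrative):

    # "GPU falling off the bus" / Xid errors: check what the kernel and the PCIe bus actually see.
    sudo dmesg -T | grep -i -E "xid|nvrm"
    lspci | grep -c -i nvidia      # count of NVIDIA PCIe functions enumerated

    # ECC errors: volatile vs aggregate counts inform whether a reset or an RMA is warranted.
    nvidia-smi -q -d ECC | grep -i -A2 "uncorrect"

    # "Node not draining": check why the node is stuck, which jobs still hold it, and resume once clear.
    scontrol show node dgx-node01 | grep -E "State|Reason"
    squeue -w dgx-node01           # jobs still pinned to the node
    scontrol update NodeName=dgx-node01 State=RESUME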