World Wide Technology

Senior HPC Engineer

Posted Yesterday
Remote
Hiring Remotely in IND
Senior level
Lead hands-on deployment and automation of NVIDIA GPU-based HPC/AI clusters. Build and maintain Ansible/Terraform IaC, configure Slurm and Kubernetes integrations, run HPL/NCCL validation, debug GPU/kernel/fabric issues, and lead/mentor an offshore squad ensuring code quality and shift-based delivery aligned with client timezones.
The summary above was generated by AI
Job Summary & Responsibilities

Technical Competencies

Essential Skills

  • HPC & AI Infrastructure:
    • Expert-level knowledge of NVIDIA Base Command Manager (BCM) or Metal-as-a-Service (MaaS) provisioning tools.
    • Deep understanding of Slurm configuration (cgroups, plugin development, accounting).
    • Proficiency with NVIDIA DGX/HGX hardware architecture and the associated software stack (Drivers, CUDA, DCGM).
  • Linux & Automation (DevOps for Hardware):
    • Mastery of Red Hat Enterprise Linux (RHEL) / Ubuntu internals (Systemd, Kernel Tuning, Hugepages).
    • Advanced proficiency in Ansible (writing custom modules/roles) and Python (automating admin tasks).
    • Experience with Git workflows (Branching, PRs, CI/CD).
  • Containerisation:
    • Hands-on experience with Docker, Singularity/Apptainer, and Kubernetes (specifically NVIDIA GPU Operator and NVIDIA Network Operator).

Desirable Experience

  • Network Awareness: Ability to troubleshoot basic InfiniBand/RoCEv2 issues (ibstat, perfquery) to distinguish between a "Node Issue" and a "Network Issue."
  • Storage Integration: Experience mounting high-performance parallel file systems (VAST/Lustre/WEKA/GPFS) and tuning client-side performance.
  • Certifications:
    • NVIDIA Certified Associate - AI in the Data Center.
    • Red Hat Certified Engineer (RHCE).
    • CKA (Certified Kubernetes Administrator).

Success Metrics (KPIs)

  1. Deployment Velocity: Reduction in "Time-to-Hello-World" (time from power-on to running the first successful GPU job) for new clusters.
  2. Code Quality: >95% of Pull Requests pass automated linting and require fewer than 2 review cycles before merge.
  3. Stability: Zero "Configuration Drift" incidents in production (e.g., manual changes breaking the cluster), achieved through strict IaC enforcement.
Preferred Qualifications

Role Title: Senior HPC Engineer

Reports To: Domain Architect - AI Compute

Location: India (Must align with Client Time Zone)

Employment Type: Full-Time


About the Role

The Senior HPC Engineer is the "Foreman" of the "AI Factory". While the Domain Architect defines the architectural vision, you are responsible for the hands-on build and deployment. As Technical Squad Lead for our offshore engineering teams, you bridge the gap between that onshore vision and its execution.

As a System Integrator, we thrive on velocity and precision. You will not just "maintain" clusters; you will lead the automated deployment of NVIDIA SuperPOD and BasePOD infrastructure for global enterprise clients. You are the "Lieutenant" to the Domain Architect, translating High-Level Designs (HLDs) into executable Ansible playbooks and ensuring your squad of HPC Engineers delivers defect-free infrastructure.

In this role, you are 100% Delivery-Focused, split between Technical Leadership (40%) and Hands-on Engineering (60%). You are the escalation point for complex kernel panics, the guardian of our Infrastructure-as-Code (IaC) repository, and the mentor who unblocks junior engineers when a Slurm job fails to schedule.

CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).

Key Responsibilities

  1. Hands-on Engineering & Automation (60%)
  • Cluster Provisioning Factory:
    • Lead the deployment of NVIDIA Base Command Manager (BCM) (formerly Bright Cluster Manager) to provision bare-metal DGX/HGX nodes at scale.
    • Develop and maintain the Ansible / Terraform library used to configure OS settings, user authentication (LDAP/AD), and storage mounts across hundreds of nodes.
    • Execute HPL (High-Performance Linpack) and NCCL-tests to validate cluster performance, tuning BIOS and OS parameters to hit "Gold Standard" benchmarks.
  • Scheduler & Workload Orchestration:
    • Configure complex Slurm Workload Manager policies, including Fair Share, Preemption, and GPU Partitioning (MIG).
    • Integrate Kubernetes-based orchestrators (e.g., NVIDIA Base Command, Run:AI, or Red Hat OpenShift) with the underlying HPC hardware.
  • Deep-Dive Troubleshooting:
    • Debug "Silent Data Corruption" and "Xid Errors" on GPUs, analysing nvidia-smi logs and kernel message buffers (dmesg).
    • Diagnose fabric-related performance drops (e.g., identifying a specific flapping link that causes global slowdowns) in collaboration with the Network Squad.
  2. Squad Leadership & Quality Assurance (40%)
  • Technical Direction (The "Foreman"):
    • Translate the Low-Level Design (LLD) provided by the Domain Architect into granular Jira tasks for your squad of HPC Engineers.
    • Conduct daily stand-ups to unblock engineers, clarifying requirements and making technical decisions on the fly (e.g., "Use Ansible roles, not shell scripts for this").
  • Code Quality & Governance:
    • Act as the Primary Gatekeeper for the code repository. Perform mandatory Code Reviews on all Pull Requests (PRs) to ensure idempotency and error handling.
    • Enforce "Config-as-Code" discipline, ensuring no manual changes are made to production clusters without a committed playbook.
  • Mentorship:
    • Guide mid-level and junior engineers on best practices for Linux Systems Administration and HPC environments.

Top Skills

NVIDIA Base Command Manager (BCM), Metal-as-a-Service (MaaS), Slurm, NVIDIA DGX, NVIDIA HGX, CUDA, DCGM, Red Hat Enterprise Linux (RHEL), Ubuntu, Systemd, Hugepages, Ansible, Python, Git, Terraform, Docker, Singularity/Apptainer, Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, InfiniBand, RoCEv2, ibstat, perfquery, HPL, NCCL, Run:AI, Red Hat OpenShift, VAST, Lustre, WEKA, GPFS, LDAP, Active Directory, nvidia-smi, dmesg, Jira, CI/CD


