Technical Competencies
Essential Skills
High-Performance Storage:
- Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN Lustre (Exascaler), or IBM GPFS (Spectrum Scale).
- Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace (see the sketch after this list).
- RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, understanding the dependency on the underlying InfiniBand/RoCE network.
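For illustration, a minimal Python sketch of the client-side checks the two bullets above imply: it parses /proc/mounts to confirm an NFS mount was negotiated over RDMA and captures one extended iostat sample. The mount point is a placeholder; Lustre and GPFS clients report their transport through their own tooling rather than mount options.

```python
#!/usr/bin/env python3
"""Illustrative client-side I/O health check.
The mount point is a placeholder; devices and thresholds vary per cluster."""
import subprocess

MOUNT_POINT = "/mnt/scratch"  # hypothetical high-performance NFS-over-RDMA mount


def mount_uses_rdma(mount_point: str) -> bool:
    """Return True if the NFS mount's options include the RDMA transport (proto=rdma)."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _device, mnt, _fstype, options, *_ = line.split()
            if mnt == mount_point:
                return "proto=rdma" in options.split(",")
    raise RuntimeError(f"{mount_point} is not mounted")


def iostat_snapshot() -> str:
    """Capture one extended iostat sample (await/%util columns) for inspection."""
    return subprocess.run(
        ["iostat", "-x", "1", "2"], capture_output=True, text=True, check=True
    ).stdout


if __name__ == "__main__":
    print(f"RDMA transport on {MOUNT_POINT}: {mount_uses_rdma(MOUNT_POINT)}")
    print(iostat_snapshot())
```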
Automation & Containerisation:
- Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
- Kubernetes Storage: Understanding of StorageClasses, PersistentVolumeClaims (PVCs), and how to debug CSI driver pods by checking their logs for mount failures (see the sketch after this list).
- GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify if GDS is active.
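A sketch of the CSI-debugging workflow referenced above, using the official Kubernetes Python client; the namespace and label selector are assumptions, since driver deployments vary by vendor.

```python
#!/usr/bin/env python3
"""Illustrative CSI driver triage with the Kubernetes Python client.
The namespace and label selector are assumptions; real drivers vary by vendor."""
from kubernetes import client, config

CSI_NAMESPACE = "kube-system"          # assumed deployment namespace
CSI_LABEL_SELECTOR = "app=csi-driver"  # hypothetical pod label


def main() -> None:
    config.load_kube_config()  # use load_incluster_config() when run inside the cluster
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(CSI_NAMESPACE, label_selector=CSI_LABEL_SELECTOR)
    for pod in pods.items:
        phase = pod.status.phase
        print(f"{pod.metadata.name}: {phase}")
        if phase != "Running":
            # Tail recent logs to look for attach/mount failures.
            # Assumes a single-container pod; pass container=... otherwise.
            print(core.read_namespaced_pod_log(
                pod.metadata.name, CSI_NAMESPACE, tail_lines=50))


if __name__ == "__main__":
    main()
```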
Desirable Experience
- Vendor Specifics: Vendor certification or deep hands-on experience with Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
- Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval (see the sketch after this list).
- Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.
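A hedged example of the object-store retrieval described above, using boto3 against an S3-compatible endpoint as an SDK stand-in for the CLI workflow; the endpoint URL, credentials, bucket, and object key are placeholders.

```python
#!/usr/bin/env python3
"""Illustrative model-weight retrieval from an S3-compatible object store via boto3
(an SDK stand-in for the CLI workflow). Endpoint, credentials, bucket, and key
are placeholders, not real targets."""
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal",  # hypothetical on-prem endpoint
    aws_access_key_id="PLACEHOLDER",                   # injected from a secret in practice
    aws_secret_access_key="PLACEHOLDER",
)

# Stream a checkpoint from the object tier onto the parallel filesystem tier.
s3.download_file("model-weights", "checkpoints/model.pt", "/mnt/scratch/model.pt")
```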
Certifications
Highly Desirable:
- NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
- Vendor Certifications:
  - VAST Certified Administrator (VCP-AD1)
  - WEKA Technical Xpert Certification
  - Red Hat Certified Specialist in Storage Administration
Success Metrics (KPIs)
- I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients (see the validation sketch after this list).
- Mount Stability: Zero "Stale File Handles" or disconnected mounts across the cluster during the 72-hour burn-in period.
- Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.
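A minimal sketch of how the I/O performance KPI could be spot-checked on a provisioned client: run a sequential-read fio job, parse its JSON output, and compare measured bandwidth against a theoretical rate. The directory, job sizing, and target figure are assumptions, not project numbers.

```python
#!/usr/bin/env python3
"""Illustrative acceptance check: run a sequential-read fio job and compare the
measured bandwidth to a theoretical target. Directory, job sizing, and the
target figure are assumptions, not project numbers."""
import json
import subprocess

THEORETICAL_GBPS = 400.0       # assumed read target for the system under test, GB/s
TARGET_FRACTION = 0.95         # KPI: >= 95% of the theoretical rate
TEST_DIR = "/mnt/scratch/fio"  # hypothetical scratch directory


def run_fio() -> float:
    """Run a direct-I/O sequential-read job and return aggregate bandwidth in GB/s."""
    cmd = [
        "fio", "--name=seqread", f"--directory={TEST_DIR}",
        "--rw=read", "--bs=1M", "--ioengine=libaio", "--direct=1",
        "--numjobs=8", "--size=16G", "--group_reporting",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    bw_kib_s = json.loads(result.stdout)["jobs"][0]["read"]["bw"]  # fio reports KiB/s
    return bw_kib_s * 1024 / 1e9  # convert to decimal GB/s


if __name__ == "__main__":
    measured = run_fio()
    ratio = measured / THEORETICAL_GBPS
    verdict = "PASS" if ratio >= TARGET_FRACTION else "FAIL"
    print(f"{verdict}: {measured:.1f} GB/s ({ratio:.1%} of {THEORETICAL_GBPS:.0f} GB/s)")
```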
Role Title: HPC Engineer – Storage
Location: India (Must align with Client Time Zone)
Employment Type: Full-Time
About the Role
The HPC Engineer – Storage acts as the primary execution engine for the data persistence layer that feeds the AI Factory. While the Domain Architect designs the tiering strategy and namespace layout, you are the "Builder" responsible for mounting filesystems, tuning client-side I/O, and ensuring data reaches the GPUs at line rate. You are a "doer" who is as comfortable compiling a custom kernel module for a parallel filesystem as you are debugging a stale NFS handle.
As a System Integrator, we design and deliver bespoke, high-scale AI factories. In this role, you will move beyond standard enterprise NAS management to execute the deployment of ultra-high-performance parallel file systems for NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments.
You will operate with a 100% focus on Delivery, executing the Low-Level Designs (LLDs) assigned by your Squad Lead. You will own the "Storage" leg of the critical "Compute-Network-Storage" triad, ensuring that "I/O Wait" never becomes the bottleneck for training jobs.
CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).
Key Responsibilities
Storage Integration & Client Configuration:
- Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare-metal DGX/HGX nodes using Ansible.
- Protocol Configuration: Configure and tune RDMA-based protocols (NVMe-oF, NFS over RDMA, GPUDirect Storage) so that data bypasses CPU bounce buffers and is delivered directly to GPU memory (see the GDS verification sketch after this list).
- Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.
- Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.
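A minimal sketch of the GPUDirect Storage verification referenced in the Protocol Configuration bullet; the gdscheck install path is an assumption that varies with the CUDA release.

```python
#!/usr/bin/env python3
"""Illustrative GPUDirect Storage (GDS) verification.
The gdscheck install path is an assumption and varies with the CUDA release."""
import os
import subprocess

GDSCHECK = "/usr/local/cuda/gds/tools/gdscheck"  # assumed install path


def nvidia_fs_loaded() -> bool:
    """GDS depends on the nvidia-fs kernel module; check /proc/modules for it."""
    with open("/proc/modules") as modules:
        return any(line.split()[0] == "nvidia_fs" for line in modules)


def gdscheck_platform() -> str:
    """Run NVIDIA's gdscheck -p to report per-filesystem GDS support, if installed."""
    if not os.path.exists(GDSCHECK):
        return "gdscheck not found; is the GDS tooling installed?"
    return subprocess.run([GDSCHECK, "-p"], capture_output=True, text=True).stdout


if __name__ == "__main__":
    print(f"nvidia_fs module loaded: {nvidia_fs_loaded()}")
    print(gdscheck_platform())
```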
Validation & Performance Benchmarking:
- Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the "Gold Standard" read/write targets (e.g., 400 GB/s read throughput).
- Latency Tuning: Tune client-side kernel parameters (read-ahead buffers, queue depths, sysctl settings) to minimize latency for the small-file and random I/O patterns common in dataset loading and checkpointing (see the tuning sketch after this list).
- Acceptance Reporting: Generate "As-Built" storage validation reports, documenting effective throughput and IOPS for client sign-off.
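A sketch of the client-side tuning audit described in the Latency Tuning bullet; the device names and target values are assumptions, and parallel-filesystem clients add their own tunables beyond the generic block-layer knobs shown here.

```python
#!/usr/bin/env python3
"""Illustrative block-layer tuning audit: compare read_ahead_kb and nr_requests
against desired values. Device names and targets are assumptions, not a standard;
parallel-filesystem clients (Lustre, GPFS, etc.) also expose their own tunables."""
from pathlib import Path

TARGETS = {"read_ahead_kb": 4096, "nr_requests": 1023}  # assumed desired values
DEVICES = ["nvme0n1", "nvme1n1"]                        # assumed local staging devices


def audit(device: str) -> None:
    queue = Path("/sys/block") / device / "queue"
    for knob, desired in TARGETS.items():
        current = int((queue / knob).read_text().strip())
        status = "OK" if current == desired else f"expected {desired}"
        print(f"{device} {knob}={current} [{status}]")
        # To apply a change: (queue / knob).write_text(str(desired)) -- in practice
        # this is persisted via a udev rule or tuned profile so it survives reboots.


if __name__ == "__main__":
    for dev in DEVICES:
        audit(dev)
```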
Operations & Support:
- Capacity & Quotas: Implement project-level quotas and monitor usage trends to prevent "Disk Full" outages on critical scratch filesystems (see the monitoring sketch after this list).
- Ticket Resolution: Handle L2 support tickets for storage issues, such as "Stale file handles," "Slow dataset loading," or "CSI Driver crashes."
- Lifecycle Management: Execute non-disruptive client-side driver upgrades and firmware patches during maintenance windows.
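A minimal sketch of the routine mount-health and capacity sweep behind these operational duties; the mount points and alert threshold are placeholders.

```python
#!/usr/bin/env python3
"""Illustrative mount-health sweep: flag stale handles and near-full filesystems.
Mount points and the capacity threshold are placeholders."""
import errno
import os

MOUNTS = ["/mnt/scratch", "/mnt/datasets", "/mnt/checkpoints"]  # hypothetical mounts
CAPACITY_ALERT = 0.90  # warn when a filesystem is more than 90% used


def check(mount: str) -> str:
    try:
        st = os.statvfs(mount)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return f"{mount}: STALE FILE HANDLE"
        return f"{mount}: ERROR {exc.strerror}"
    used = 1 - (st.f_bavail / st.f_blocks)
    level = "WARN" if used >= CAPACITY_ALERT else "ok"
    return f"{mount}: {used:.0%} used [{level}]"


if __name__ == "__main__":
    for mount in MOUNTS:
        print(check(mount))
```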



