N-iX

SRE & DevOps Engineer (Ray.io)

Reposted 3 Days Ago
India
Senior level
You will build and support ML infrastructure, automate deployment, troubleshoot issues, and collaborate with teams to enhance operational excellence.
The summary above was generated by AI

N-iX is a global software development service company that helps businesses across the world develop successful software products. Founded in 2002, N-iX has come a long way, expanding its presence across Europe, the US, and Latin America. Today, we are a strong community of 2,000+ professionals and a reliable partner for global industry leaders and Fortune 500 companies. 

Our client is a global commerce leader where you can influence how the world buys, sells, and gives. You’ll be part of a work culture that’s been genuinely committed to diversity and inclusion since its founding over twenty-five years ago. Here, you can be yourself, do your best work along with a team of professionals, and have a meaningful impact on people across the globe. We seek people with drive, ideas, and a passion for helping small businesses succeed.

About the team: You will join the AI Platform Team, providing highly available, scalable, and automated machine learning infrastructure for researchers and data scientists globally. We are looking for a motivated, self-reliant SRE / DevOps engineer with Python and C++ experience to drive operational excellence, automation, and platform reliability, with a focus on Ray.io.

About the role: This role focuses on maintaining, deploying, and improving AI/ML platform services using Ray.io, with strong emphasis on DevOps, SRE practices, and automation. You will collaborate closely with developers, researchers, and infrastructure teams to ensure robust, scalable, and highly available distributed ML systems.

Responsibilities:

DevOps tasks (~60%):

  • Design, implement, and maintain CI/CD pipelines for AI/ML platform services. 
  • Manage and troubleshoot Kubernetes clusters, Docker containers, and cloud infrastructure.
  • Ensure high availability (99.999%), system reliability, and security across platforms.
  • Automate operational tasks, monitoring, and deployment workflows.
  • Deploy and maintain Ray.io clusters, ensuring workload scheduling and distributed job reliability.
  • Monitor production systems via Ray Dashboard and CLI tools, and integrate alerting and metrics.
  • Analyze and resolve production issues, performance bottlenecks, and functional problems.
  • Define operational standards and versioning practices, and advise teams on DevOps best practices.
  • Prepare documentation and training materials, and provide technical support to platform users.

Development tasks (~40%):

  • Design, build, and refactor Python and C++ services for Ray.io workflows.
  • Work with Ray ecosystem libraries such as Ray Train, Ray Tune, Ray Serve, Ray Data.
  • Integrate Ray with tools such as Airflow, MLflow, Dask, and DeepSpeed (experience with these is a plus).
  • Work with ML frameworks such as PyTorch, TensorFlow, and Triton.
  • Collaborate with developers to integrate distributed ML pipelines into automated CI/CD workflows.

Requirements:

  • Strong Python and C++ development experience (2–4 years).
  • Hands-on experience with Ray.io: cluster deployment, workload management, distributed task scheduling.
  • Familiarity with Ray ecosystem libraries (Train, Tune, Serve, Data) and integration with ML tooling.
  • Solid understanding of Kubernetes, Docker, Linux fundamentals, and DevOps practices.
  • Experience with CI/CD pipelines (Jenkins or similar), test automation, and monitoring.
  • Strong debugging and triaging skills for distributed systems.
  • Excellent communication and collaboration skills with cross-functional teams.
  • Strong organizational skills to manage multiple projects in a fast-paced environment.
  • Fluent in English (spoken and written).
  • 3–5 years of relevant DevOps / SRE experience overall.

We offer*:

  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

*not applicable for freelancers

Top Skills

Docker
Jenkins
Kubernetes
PyTorch
Ray.io
TensorFlow
