N-iX is a global software development service company that helps businesses across the world develop successful software products. Founded in 2002, N-iX has come a long way, expanding its presence across Europe, the US, and Latin America. Today, we are a strong community of 2,000+ professionals and a reliable partner for global industry leaders and Fortune 500 companies.
Our client is a global commerce leader where you can influence how the world buys, sells, and gives. You’ll be part of a work culture that’s been genuinely committed to diversity and inclusion since its founding over twenty-five years ago. Here, you can be yourself, do your best work along with a team of professionals, and have a meaningful impact on people across the globe. We seek people with drive, ideas, and a passion for helping small businesses succeed.
We are seeking a highly motivated, experienced SRE/MLOps engineer with Python and Ray.io experience to build and maintain the next-generation AI platform. This role focuses on developing software on top of open-source libraries such as Ray, enabling internal teams to run ML workloads efficiently.
Responsibilities:
- Build, refactor, and release software for the AI platform (feature development and bug fixes)
- Deploy and manage applications on Ray.io, including workload management, cluster deployment, distributed task scheduling, and troubleshooting
- Use Ray Dashboard and CLI tools to monitor and debug distributed jobs
- Work with Ray ecosystem libraries: Ray Train, Ray Tune, Ray Serve, Ray Data
- Integrate with tools such as Airflow, MLflow, Dask, DeepSpeed (a plus)
- Collaborate with AI platform developers to provide CI/CD pipelines for automated deployment and configuration
- Ensure high availability (target 99.999%) and monitor production systems
- Develop automation for problem management and operational efficiency
- Write documentation and provide technical support for internal users
- Follow best practices for development: versioning, source control, branching, and merging patterns
Requirements:
- Main coding language: Python (C++ good to have)
- Strong experience with Ray.io in at least two areas, such as Ray Train and Ray Serve
- Kubernetes / Docker: Proficient / Experienced
- Hands-on experience with distributed systems, cluster management, and cloud technologies
- Familiarity with DevOps practices, CI/CD pipelines, and test automation
- Excellent problem-solving, debugging, and triaging skills
- Strong communication skills for collaboration with partners, customers, and engineers
- Ability to manage multiple projects in a fast-paced environment
- TensorRT, DeepSpeed, PyTorch Distributed will be a plus
- English proficiency (oral and written)
Role specifics:
- Infra vs. coding requirements: 30% infrastructure (can be learned with guidance), 70% coding (essential for features and bug fixes)
- The role targets engineers rather than data scientists: focus on deployment, abstractions, monitoring, and alerting of Ray applications at scale
- Ray proficiency is critical; the second version of the platform will be built on Ray
- Understanding of Ray Serve for real-time serving and Ray Train for model training is required
We offer*:
- Flexible working format - remote, office-based or flexible
- A competitive salary and good compensation package
- Personalized career growth
- Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
- Active tech communities with regular knowledge sharing
- Education reimbursement
- Memorable anniversary presents
- Corporate events and team buildings
- Other location-specific benefits
*not applicable for freelancers
