BentoML

Site Reliability Engineer

Reposted 3 Hours Ago
In-Office
18 Locations
Senior level
The Senior Site Reliability Engineer at BentoML will manage infrastructure for AI services, focusing on Kubernetes, Terraform, GPU clusters, and observability tools, while mentoring and driving SRE best practices.
About BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. With support from investors such as DCM, enterprises around the world rely on us for consistent scalability and performance in production. Our portfolio includes both open source and commercial products, and our goal is to help each team build its own competitive advantage through AI.

Role

Join BentoML as a Senior Site Reliability Engineer and take charge of the infrastructure that delivers large language model and generative AI services worldwide. You will architect and operate Kubernetes clusters across AWS, Google Cloud, and on-premises environments, turning vast GPU fleets into responsive inference pools. Your work will span writing clean Terraform code, refining GitOps pipelines, tuning Prometheus, and leading incident response. You will set service level objectives that matter, guide teammates through complex production challenges, and build processes that keep our platform robust and fast. If you thrive on solving difficult problems at scale and want your decisions to shape how enterprises run AI in production, this role is for you.

Responsibilities
  • Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.

  • Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.

  • CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.

  • GPU fleet management – manage NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs as they are added to the fleet.

  • Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.

  • Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.

  • Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.

Qualifications
  • Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.

  • Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.

  • Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.

  • Deep understanding of Linux and networking fundamentals.

  • Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.

  • Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.

  • Solid background with Prometheus and Grafana at scale.

  • Clear written and spoken communication and comfort working across time zones.

Why join us
  • Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.

  • Technical scope – operate distributed LLM inference and large GPU clusters worldwide.

  • Customer reach – support organizations around the globe that rely on BentoML.

  • Influence – lead SRE practices and infrastructure choices.

  • Compensation – competitive salary, equity, learning budget, and paid conference travel.

Top Skills

AMD GPU
AWS
Azure
CI/CD
GitOps
GCP
Grafana
Kubernetes
NVIDIA GPU
Oracle Cloud
Prometheus
Pulumi
Terraform


