Expedite Commerce

LLM Ops Engineer - Serverless & CI/CD (AWS)

Posted 21 Days Ago
Remote
Hiring Remotely in India
Junior
As an LLM Ops Engineer, you'll build and maintain CI/CD infrastructure for AI systems on AWS, collaborating with teams to enhance automated deployments and ensure system reliability and cost efficiency.
The summary above was generated by AI
Description

This isn't your average DevOps role. This isn't just about pipelines or cloud provisioning. This is about engineering the backbone of Agentic AI systems that drive the next generation of enterprise SaaS—where conversational interfaces, dynamic UIs, and intelligent agents operate seamlessly on AWS Serverless infrastructure, with deep integration into Salesforce and cross-agent protocols.

This is for builders with something to prove. For engineers who’ve gone beyond cloud fluency to orchestrate complex, multi-agent ecosystems—who want to shape how enterprise applications are deployed, debugged, scaled, and observed in real time.

If you’re driven by deep automation, passionate about creating fault-tolerant agentic systems, and thrive where innovation is the expectation—not the exception—you’re in the right place. Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences.

The Role

We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. In this role, you will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments.

The Responsibilities

CI/CD Infrastructure for Agentic AI
  • Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools.
  • Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod.
Agent Development & Debugging
  • Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production.
  • Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops.
  • Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors.
Monitoring, Tracing & Reliability
  • Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking.
  • Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time.
  • Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis.
Cost Optimization & Usage Analysis
  • Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs.
  • Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance.
Collaboration & Continuous Improvement
  • Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows.
  • Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments.
  • Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability.
Requirements
  • 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems.
  • Deep expertise in AWS serverless architecture, including hands-on experience with:
    • AWS Lambda – function design, performance tuning, cold-start optimization.
    • Amazon API Gateway – managing REST/HTTP APIs and integrating with Lambda securely.
    • Step Functions – orchestrating agentic workflows and managing execution states.
    • S3, DynamoDB, EventBridge, SQS – event-driven and storage patterns for scalable AI systems.
  • Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates.
  • Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions.
  • Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production.
  • Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems.
  • Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry).
  • Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines.
  • Excellent debugging, documentation, and cross-team communication skills.
Benefits
  • Equity participation program
  • Health insurance, PTO, and leave time
  • Ongoing paid professional training and certifications
  • Fully remote work opportunity
  • Strong onboarding & training program

Work Timings: 1 pm to 10 pm IST

Next Steps

We’re looking for someone who embodies the spirit of a boundary-pushing Principal Architect—ready to own ambitious projects, craft scalable multi-cloud solutions, and skillfully integrate AI where it truly elevates outcomes.

  1. Apply Now: Send us your resume and a brief summary of your experience leading teams, including notable multi-platform or AI-driven projects.
  2. Show Us Your Ingenuity: Be prepared to discuss your boldest cross-platform solutions, how you integrated new technologies, and how you overcame tough technical hurdles.
  3. Collaborate & Ideate: If selected, you’ll workshop a real-world scenario with our leadership—so we can see firsthand how you approach challenges across AWS, AI, and beyond.

This is your opportunity to shape the future of enterprise solutions—across AWS, emerging AI platforms, and the occasional Salesforce ecosystem. We can’t wait to hear from you!

Our Belief

We believe extraordinary things happen when technology and human creativity unite. By empowering teams with cloud solutions, AI insights, and thoughtful architecture, we free them to focus on meaningful relationships, innovative strategies, and real impact. It’s more than just code—it’s about sparking a revolution in how people interact with systems, solve problems, and propel businesses forward.

If this resonates with you—if you’re driven, daring, and ready to build the next wave of multi-platform innovation—then let’s do this. Apply now and help us shape the future.

About Expedite Commerce

At Expedite Commerce, we believe that people achieve their best when technology enables them to build relationships and explore new ideas. So we build systems that free you up to focus on your customers and drive innovations. We have a great commerce platform that changes the way you do business!

See more about us at expeditecommerce.com. You can also read about us on G2/products/expedite-commerce and on Salesforce AppExchange/ExpediteCommerce.

EEO Statement

All qualified applicants to Expedite Commerce are considered for employment without regard to race, color, religion, age, sex, sexual orientation, gender identity, national origin, disability, veteran's status or any other protected characteristic.

Top Skills

Amazon API Gateway
AWS
AWS CodeBuild
AWS CodeDeploy
AWS CodePipeline
AWS Lambda
CloudWatch
DynamoDB
EventBridge
GitHub Actions
Grafana
OpenTelemetry
Python
S3
SQS
Step Functions
Terraform
