The Lead AI Engineer will deploy and manage AI models and infrastructure, ensure system performance, automate tasks, and collaborate with teams.
It's fun to work in a company where people truly BELIEVE in what they're doing!
Job Description:
Deployment & Infrastructure Management:
- Deploy, configure, and manage AI models, agentic systems, and supporting infrastructure in cloud (e.g., GCP) and on-premises environments.
- Implement and maintain CI/CD pipelines for AI/ML models and agentic applications (MLOps/Agent Ops).
- Manage and optimize cloud resources, ensuring cost-effectiveness and scalability for AI workloads.
- Collaborate with infrastructure teams to ensure network, storage, and compute resources meet the demands of AI systems.
Monitoring, Logging & Alerting:
- Develop and implement comprehensive monitoring, logging, and alerting solutions for AI agents and infrastructure to ensure high availability and performance.
- Proactively identify and address potential issues, performance bottlenecks, and anomalies in production AI systems.
- Track key operational metrics and create dashboards for system health and performance.
Incident Response & Troubleshooting:
- Provide operational support for production AI systems, including incident response, root cause analysis, and resolution of technical issues.
- Develop and maintain runbooks and standard operating procedures for common operational tasks and incident management.
- Participate in on-call rotations as needed to support critical AI services.
Automation & Operational Excellence:
- Automate routine operational tasks, deployment processes, and system maintenance activities using scripting (e.g., Python, Bash) and automation tools.
- Contribute to the development and enforcement of operational best practices, security standards, and compliance requirements for AI systems.
- Work with development teams to improve the deployability, manageability, and observability of AI applications.
Collaboration & Documentation:
- Collaborate effectively with AI developers, data scientists, AI architects, and other stakeholders to ensure smooth transitions from development to production.
- Maintain clear and comprehensive documentation for system configurations, operational procedures, and troubleshooting guides.
- Provide feedback to development teams on operational aspects and system performance.
Preferred Qualifications & Experience:
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field.
- 4-7+ years of experience in an MLOps or Agent Ops role, preferably supporting AI/ML or data-intensive applications.
- Hands-on experience with cloud computing platforms (e.g., Google Cloud Platform, especially Vertex AI) and managing cloud-based infrastructure.
- Proficiency in scripting languages such as Python, Bash, or PowerShell for automation.
- Experience with CI/CD tools and practices (e.g., Bitbucket Pipelines, GitLab CI, GitHub Actions).
- Familiarity with containerization technologies (e.g., Docker, Kubernetes) and orchestration.
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, Google Cloud Monitoring, Langfuse).
- Understanding of networking concepts, security best practices, and infrastructure-as-code (IaC) principles (e.g., Terraform, Ansible).
- Strong troubleshooting and problem-solving skills with an analytical mindset.
- Excellent communication skills and ability to work collaboratively in a team environment.
- A proactive approach to identifying and resolving issues and improving system reliability.
- Master's degree in a relevant field.
- Specific experience in MLOps or Agent Ops, including deploying and managing machine learning models or large language model applications in production.
- Familiarity with AI/ML frameworks and libraries (e.g., TensorFlow, PyTorch, scikit-learn).
- Understanding of agentic AI concepts and the operational challenges they present.
- Experience with managing vector databases or other specialized data stores for AI.
- Knowledge of data pipeline tools (e.g., Apache Airflow, Kubeflow Pipelines).
- Relevant cloud certifications (e.g., Google Cloud Professional ML Engineer).
- Experience working in an agile development environment.
Why Join Us?
- Play a critical role in operationalizing cutting-edge Agentic AI and AI systems for a global industry leader.
- Gain hands-on experience with the latest MLOps, Agent Ops, and cloud technologies.
- Work in a dynamic, innovative, and collaborative AI Center of Excellence.
- Opportunity to significantly impact the reliability and efficiency of transformative AI solutions.
- Competitive salary, bonus, and benefits package.
Top Skills
Bash
CI/CD
GCP
MLOps
Python