The Data Architect - Annotation will manage data workflows between teams, ensure data quality for AI systems, automate processes with Python and SQL, and analyze performance metrics for continuous improvement.
As a Data Architect - Annotation, you’ll serve as the critical bridge between the Prompt Engineering team and the Data Labeling team, ensuring that the data feeding our AI systems is clean, consistent, and production-ready. You will own the workflows that generate, organize, and maintain high-quality datasets across multiple modalities, while using LLMs, automation, and statistical analysis to detect anomalies and improve data quality at scale.
Your work will directly influence the reliability of our VoiceAI and AI-driven products by ensuring that labeling pipelines, annotation standards, and evaluation data are robust enough to support high-stakes, real-world restaurant operations.
Essential Job Functions:
- Data Operations & Workflow Ownership
- Act as the transition point between Prompt Engineering and Data Labeling, translating model and product requirements into concrete data and annotation workflows.
- Design, implement, and maintain scalable data workflows for dataset generation, curation, and ongoing maintenance.
- Ensure data quality and consistency across labeling projects, with a focus on operational reliability for production AI systems.
- Annotation & Quality Management
- Create, review, and maintain high-quality annotations across multiple modalities, including text, audio, conversational transcripts, and structured datasets.
- Identify labeling inconsistencies, data errors, and edge cases; propose and enforce corrective actions and improvements to annotation standards.
- Utilize platforms such as Labelbox, Label Studio, or Langfuse to manage large-scale labeling workflows and enforce consistent task execution.
- Automation, Tooling & LLM-Assisted QA
- Use Python and SQL for data extraction, validation, transformation, and workflow automation across labeling pipelines.
- Leverage LLMs (e.g., GPT-4, Claude, Gemini) for prompt-based quality checks, automated review, and data validation of annotation outputs.
- Implement automated QA checks and anomaly-detection mechanisms to scale quality assurance for large datasets.
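To illustrate the kind of automated QA check this function describes, here is a minimal Python sketch; the label set, field names, and thresholds are assumptions for the example, not a prescribed implementation:

```python
from collections import defaultdict
from statistics import mean, stdev

# Assumed label set for a restaurant VoiceAI intent-annotation task.
ALLOWED_LABELS = {"order", "question", "complaint", "other"}

def validate_record(record):
    """Return a list of QA issues found in a single annotation record."""
    issues = []
    if record.get("label") not in ALLOWED_LABELS:
        issues.append("invalid_label")
    if not record.get("transcript", "").strip():
        issues.append("empty_transcript")
    return issues

def flag_anomalous_annotators(records, z_threshold=2.0):
    """Flag annotators whose error rate is a statistical outlier in the batch."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["annotator"]] += 1
        errors[r["annotator"]] += bool(validate_record(r))
    rates = {a: errors[a] / totals[a] for a in totals}
    if len(rates) < 2:
        return []
    mu, sigma = mean(rates.values()), stdev(rates.values())
    if sigma == 0:
        return []
    return [a for a, rate in rates.items() if (rate - mu) / sigma > z_threshold]
```

Record-level checks like these can run on every labeling batch, while the annotator-level outlier flag scales review effort by pointing QA attention at the workers most likely to need guideline coaching.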
- Analysis, Metrics & Continuous Improvement
- Analyze annotation performance metrics and quality trends to surface actionable insights that improve labeling workflows and overall data accuracy.
- Apply statistical analysis to detect data anomalies, annotation bias, and quality issues, and partner with stakeholders to mitigate them.
- Collaborate with ML and Operations teams to refine labeling guidelines and enhance instructions based on observed patterns and error modes.
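A common statistical measure behind this kind of metrics work is inter-annotator agreement. As an illustrative sketch only (assuming paired labels from two annotators on the same items), Cohen's kappa can be computed in plain Python:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used a single label
    return (observed - expected) / (1 - expected)
```

Kappa near 1 indicates strong agreement, while values near 0 mean agreement is no better than chance, which is often a signal that the labeling guidelines need clarification rather than that annotators are careless.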
- Cross-Functional Collaboration & Documentation
- Work closely with Prompt Engineering, Data Labeling, and ML teams to ensure that data operations align with model requirements and product goals.
- Document data standards, annotation guidelines, and workflow best practices for use by internal teams and external labeling partners.
Requirements:
- Experience with data annotation and hands-on use of platforms such as Labelbox, Label Studio, or Langfuse for managing large-scale labeling workflows.
- Proficiency in Python and SQL for data extraction, validation, and workflow automation in a data operations or data engineering context.
- Hands-on experience using LLMs (e.g., GPT-4, Claude, Gemini) for prompt-based quality checks, automated review, and data validation.
- Demonstrated experience working with large-scale / high-volume datasets.
- At least one prior role where data workflow automation is explicitly part of the job scope or responsibilities.
- Ability to perform statistical analysis to detect data anomalies, annotation bias, and quality issues.
- Strong requirement-elicitation and communication skills, with a process-driven and detail-oriented mindset when working with cross-functional teams.
Qualifications:
- B.S. or higher in a quantitative discipline (Data Science, Computer Science, Engineering, or related field)
- 5+ years of relevant experience with a B.S. degree, or 3+ years of experience with a Master's degree
- Demonstrated proficiency in SQL for reporting and Python for automation and scripting
- Academic or applied research experience in NLP or LLM benchmarking datasets is a strong plus
Top Skills
Claude
Gemini
GPT-4
Label Studio
Labelbox
Langfuse
LLMs
Python
SQL