Blog · 3 min read

How to Label Data with Generative AI

Tips for approaching cost-effective data labeling with LLMs


Before LLMs, you needed some level of machine learning experience and heaps of data in order to deploy any custom data labeling pipeline. Now you can achieve similar results with a carefully worded prompt.

Why Label Data with AI

Labeling is one of the most common things people try to use AI for when processing their data. Common use cases I see are people labeling…

  • product reviews as positive or negative
  • GitHub issues into categories
  • essays with a letter grade
  • sales leads into their respective industries
  • emails into folders
  • research papers into fields of study

Generally, when someone is hoping to create a data labeling pipeline they’re trying to process data at scale. It should take them less time to build the solution than it would to manually go through the data. Thankfully, setting up a powerful data labeling pipeline only takes a few minutes with the help of generative AI.

Here are some general tips to leverage AI (GPT specifically) to categorize large amounts of data in a cost-effective and reliable way.


Use Functions

When categorizing with large language models, you might encounter the frustration of receiving more than just the answer. If you’re using GPT and aren’t using Functions/JSON mode yet, you’re missing out. I wrote more about functions here, but they are the absolute best way to ensure GPT returns nothing but the label/category as output.

Without functions, you may get a reply like “Sure, I’ve decided to categorize this object as…” which would most likely throw a wrench in your data processing pipeline.
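As a rough sketch of the pattern, here is what forcing label-only output looks like with a function definition. The schema and helper names below are illustrative, not from the original post; the commented-out call shows how the schema would be passed to the OpenAI Chat Completions API with `function_call` pinned so the model must use it.

```python
import json

# Hypothetical function schema: constrains the model to reply with
# nothing but a "label" argument instead of free-form prose.
LABEL_FUNCTION = {
    "name": "label",
    "description": "Assign a sentiment label to a product review.",
    "parameters": {
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": ["positive", "negative"]},
        },
        "required": ["label"],
    },
}

def extract_label(message: dict) -> str:
    """Pull the label out of a function-call style response message."""
    args = json.loads(message["function_call"]["arguments"])
    return args["label"]

# With the OpenAI SDK you would pass the schema like this (sketch):
#   response = client.chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user", "content": review_text}],
#       functions=[LABEL_FUNCTION],
#       function_call={"name": "label"},  # force the model to call it
#   )

# Simulated response message in the shape the API returns:
fake_message = {"function_call": {"name": "label", "arguments": '{"label": "positive"}'}}
print(extract_label(fake_message))  # positive
```

Because the model is forced into the function call, the "Sure, I've decided to..." preamble never appears in the arguments you parse.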

Include a Justification

Understanding why a label was selected is super helpful for debugging or justifying decisions. One trick we often use is requesting a short justification for why the decision was made. This is especially useful during the initial iterations of the pipeline because it surfaces incorrect assumptions and blind spots in the LLM’s understanding of the task.

We normally request a justification as part of the same function definition. Here’s an example of a GPT function we used to categorize news stories.

  "name": "categorize", 
  "description": "Categorize the input data into user defined buckets.", 
  "parameters": { 
    "type": "object", 
    "properties": { 
      "category": { 
        "type": "string", 
        "enum": ["US Politics", "Pandemic", "Economy", "Pop culture", "Other"], 
        "description": "US Politics: Related to US politics or US politicians, Pandemic: Related to the Coronavirus Pandemix, Economy: Related to the economy of a specific country or the world. , Pop culture: Related to pop culture, celebrity media or entertainment., Other: Doesn't fit in any of the defined categories. " 
      "justification": { 
        "type": "string", 
        "description": "A short justification explaining why the input data was categorized into the selected category." 
    "required": ["category", "justification"] 

Failure Labels

It’s essential to always include a failure category/label so the AI model can declare it’s confused when needed. Without a failure category like ‘Other’ or ‘Unknown’, the model will force an ill-fitting category much more often.

A false positive is the worst-case scenario because it casts doubt on all of your model’s responses. I tend to be hyper-cautious in my prompt and ask the model to default to the failure category if it has any uncertainty whatsoever.

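A cheap safety net on top of the prompt is to validate every returned label against the allowed set and fall back to the failure label yourself. This sketch (category names borrowed from the example schema above; the function is my own illustration) catches hallucinated or malformed categories:

```python
# Allowed labels, matching the enum in the function definition.
CATEGORIES = {"US Politics", "Pandemic", "Economy", "Pop culture", "Other"}
FAILURE_LABEL = "Other"

def safe_category(model_output: str) -> str:
    """Fall back to the failure label when the model returns anything
    outside the allowed set (a hallucinated or malformed category)."""
    category = model_output.strip()
    return category if category in CATEGORIES else FAILURE_LABEL

print(safe_category("Economy"))      # Economy
print(safe_category("Sports news"))  # Other
```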

Escalate Failures

There are multiple approaches you can take to handle these failure scenarios. The most reliable is to escalate directly to human intervention, but this is often not scalable for large tasks. A similarly reliable approach is to waterfall your labeling requests, escalating them to a more capable (and often more expensive) AI model.

You can use a base-level LLM like GPT-3.5 to handle the brunt of the work while prompting it to always default to the failure category when unsure. You can then feed all data labeled “Other” to increasingly capable models with more provided context.

This waterfall approach should help address the large majority of edge cases in a cost-effective, scalable, and reliable way.
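The waterfall can be expressed in a few lines. In this sketch the models are stand-in callables (in practice, wrappers around GPT-3.5 and GPT-4 prompts); anything no model can place escalates to human review. All names here are illustrative:

```python
FAILURE_LABEL = "Other"

def waterfall_label(text, models):
    """Try models from cheapest to most capable, escalating whenever
    the current model falls back to the failure label."""
    for label_fn in models:
        label = label_fn(text)
        if label != FAILURE_LABEL:
            return label
    return "needs-human-review"  # last resort: escalate to a person

# Stub models standing in for real LLM calls:
cheap_model = lambda text: FAILURE_LABEL  # always unsure
capable_model = lambda text: "Economy"    # resolves the edge case

print(waterfall_label("Fed raises rates", [cheap_model, capable_model]))  # Economy
```

Since most inputs never reach the expensive model, the average cost per label stays close to the cheap model's price.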


Limit Your Categories

Defining too many categories in a single prompt will negatively impact results. As a general rule, the larger your prompt, the more likely your LLM is to get confused and fail to follow instructions. GPT functions are no exception to this rule: a very long function definition will cause hallucinations or errors. One strategy we’ve used is sub-categorization, which yields more refined results.

If you’re labeling GitHub issues, for example, an initial frontend/backend split followed by two separate sets of more granular options should roughly halve your prompt size.
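The two-stage idea can be sketched like this. The label sets and `ask_model` callable are illustrative placeholders (in practice `ask_model` would be an LLM call constrained to the given options via a function enum):

```python
# Hypothetical two-stage label sets for GitHub issues: a small
# top-level prompt first, then a granular prompt scoped to the winner.
TOP_LEVEL = ["frontend", "backend"]
SUB_LEVELS = {
    "frontend": ["css", "accessibility", "state management"],
    "backend": ["database", "auth", "api"],
}

def two_stage_label(issue_text, ask_model):
    """ask_model(text, options) stands in for an LLM call constrained
    to the given option list."""
    top = ask_model(issue_text, TOP_LEVEL)         # small prompt #1
    sub = ask_model(issue_text, SUB_LEVELS[top])   # small prompt #2
    return top, sub

# Crude keyword-matching stub standing in for GPT:
def fake_model(text, options):
    for option in options:
        if option in text.lower():
            return option
    return options[0]

print(two_stage_label("The database migration fails on backend deploy", fake_model))
```

Each of the two prompts only ever carries one short option list, which keeps the model focused and the token count down.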

Data Labeling on AgentHub

AgentHub is a platform that allows you to build complex AI-powered automations with no code. One thing we pride ourselves on is the simplification of common tasks like scaled data labeling. We have a prebuilt ‘Categorizer’ node that lets you get up and running even more easily with only a few sentences.

Happy labeling!
