Cloud Strategy

ExtractLabel in Microsoft Fabric: Schema-Driven Data Extraction That Actually Works in Production

Leon Godwin
14 March 2026

The Challenge

Here's a pattern I see constantly with enterprise customers: someone builds a proof of concept that uses an LLM to pull structured data from free text. It works brilliantly in a notebook. Then they try to put it in a pipeline, and everything falls apart.

The problem isn't the extraction itself. LLMs are remarkably good at reading a support ticket and pulling out the product name, the issue category, and the customer's requested resolution. The problem is consistency. Run the same prompt ten times and you might get "defect" in one response, "Defective" in another, and "product malfunction" in a third. Field names drift. Types change. Downstream systems choke.

Data engineers end up writing more post-processing code to validate and normalise LLM output than they wrote for the extraction itself. At that point, you've traded one set of brittle parsing rules for another.
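To make the pain concrete, here's a hypothetical sketch of the kind of normalisation shim this produces. The alias table and category names are invented for illustration; real versions grow with every new label variant the model dreams up:

```python
# Hypothetical post-processing shim: map the free-form labels an LLM
# returns onto the fixed vocabulary downstream systems expect.
CATEGORY_ALIASES = {
    "defect": "defect",
    "defective": "defect",
    "broken": "defect",
    "product malfunction": "defect",
    "damage in transit": "damage_in_transit",
    "arrived damaged": "damage_in_transit",
    "missing part": "missing_part",
}

def normalise_category(raw: str) -> str:
    """Collapse casing/underscore/synonym drift into one canonical value."""
    key = raw.strip().lower().replace("_", " ")
    return CATEGORY_ALIASES.get(key, "other")
```

Every unrecognised variant silently falls through to "other", which is exactly the kind of quiet data-quality leak schema enforcement is meant to close.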

What's Changed

Microsoft has introduced ExtractLabel, a new class in the Fabric AI Functions library that solves this properly. Instead of passing loose label names to ai.extract() and hoping the output stays consistent, you define a full JSON Schema contract. The extraction engine enforces that contract on every single row.

The basic ai.extract() call is simple enough — pass in labels like "name" or "city" and get back columns with extracted values. ExtractLabel takes this further with typed fields, constrained enums, arrays, nullable fields, and descriptions that guide the model's interpretation of ambiguous text.

Here's what that looks like in practice. Say you're processing warranty claims — each arriving as a block of free text, but your downstream systems need structured fields:

from synapse.ml.aifunc import ExtractLabel

claim_schema = ExtractLabel(
    label="claim",
    max_items=1,
    type="object",
    description="Extract structured warranty claim information",
    properties={
        "product_name": {"type": "string"},
        "problem_category": {
            "type": "string",
            "enum": ["defect", "damage_in_transit", "missing_part", "other"],
            "description": "defect=stopped working, damage_in_transit=arrived damaged, missing_part=something not included"
        },
        "troubleshooting_tried": {
            "type": "array",
            "items": {"type": "string"}
        },
        "requested_resolution": {"type": ["string", "null"]}
    }
)

df[["claim"]] = df["text"].ai.extract(claim_schema)

One line of code. Structured output, every time, conforming to your schema. No model deployment. No ML infrastructure. No post-processing.

The enum constraint is particularly valuable. Instead of cleaning up "Defective", "defect", "DEFECT", and "broken" into a single category after extraction, you specify the allowed values upfront and the model maps to them. The description field acts as a disambiguation guide — you're effectively giving the model a classification rubric alongside the extraction task.

For teams using Pydantic (and most Python data teams are), you can define your schema as a normal Python class with type hints, then export it with model_json_schema(). This keeps your extraction schema in sync with your validation logic — define once, use everywhere.
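A minimal sketch of that pattern, mirroring the warranty-claim fields above. This uses Pydantic v2's real `model_json_schema()` method; how the exported schema is handed to ExtractLabel will depend on the library's Pydantic support, so treat the hand-off as an assumption:

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel

class ProblemCategory(str, Enum):
    """Fixed category vocabulary — becomes a JSON Schema enum on export."""
    defect = "defect"
    damage_in_transit = "damage_in_transit"
    missing_part = "missing_part"
    other = "other"

class WarrantyClaim(BaseModel):
    product_name: str
    problem_category: ProblemCategory
    troubleshooting_tried: list[str]
    requested_resolution: Optional[str] = None  # nullable field

# One definition serves both extraction and downstream validation.
schema = WarrantyClaim.model_json_schema()
```

The same `WarrantyClaim` class can then validate the extracted rows (`WarrantyClaim.model_validate(...)`), so the extraction contract and the validation logic can never drift apart.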

Getting Started

The barrier to entry is low. If you're already working in Microsoft Fabric notebooks, you're one import away:

  1. Start simple: Use ai.extract() with string labels to validate the approach works for your data
  2. Add structure: Define an ExtractLabel schema with typed fields and enums for your production use case
  3. Test thoroughly: Run extraction against labelled samples and iterate on your field descriptions — they're the most important lever for extraction quality
  4. Scale with PySpark: Switch from pandas to PySpark DataFrames (same schema, same call) to distribute extraction across your Fabric cluster

The example notebook in the fabric-toolbox repository walks through a complete warranty claims scenario with both JSON Schema and Pydantic approaches.

Key documentation:

  - AI Functions overview
  - ExtractLabel parameters reference

One practical note: check the billing updates for Fabric AI Functions. ExtractLabel operations consume Fabric capacity units, and at scale those costs need to be part of your pipeline economics.

What This Means

This is the kind of capability that quietly changes how data teams approach unstructured data. Not a flashy announcement — just a well-engineered tool that removes a genuine friction point.

The bigger picture: Microsoft is systematically embedding AI capabilities directly into Fabric's data engineering surface. You don't need a separate AI platform or ML ops pipeline. The extraction happens where your data already lives, using the tools your team already knows. That matters more than any individual feature, because it means data engineers can add AI-powered enrichment to existing pipelines without architectural changes.

For organisations sitting on large volumes of unstructured text — and that's most of them — ExtractLabel makes the path from "we should extract insights from this data" to "we have a production pipeline doing it" considerably shorter.


Leon Godwin, Principal Cloud Evangelist at Cloud Direct