Tutorial
A Prompting Guide for Structured Outputs / JSON mode
August 26, 2025


Author: Mike McCarthy
This post first appeared on our Substack.
Why Structured Outputs for Data Extraction?
When working with language models, receiving data in a well-defined format (like JSON) ensures that downstream systems can process the results automatically. Instead of dealing with paragraphs of text, you get a fixed set of fields that map directly to your use case. This approach:
- Helps integrate the model’s responses into business workflows without cleanup
- Prepares and transforms the data for direct use in an application
- Reduces ambiguity and hallucinations in output
- Makes validation easier
Who This Guide is For
Anyone who needs to reliably extract specific information from text using an LLM and wants to receive it in a structured format. Most likely you are a technical person setting up a data extraction pipeline from scratch, or a user of a platform like cloudsquid that abstracts away the code-based steps, making batch data processing and pipeline setup easy.
Examples include:
- Retrieving critical legal terms from customer contracts
- Pulling product details from technical manuals
- Getting line items from an invoice or purchase order
- Processing ID documents or other information for KYC checks
This guide is not intended specifically for cloudsquid customers but as a general-purpose guide to prompting for structured data extraction. Many online guides focus primarily on technical implementation and schema structure rather than deeply exploring how to write comprehensive, performant prompts. For cloudsquid-specific guidance, see the prompting guide on our docs page.
Theoretical Considerations
How to Think About Prompts
Think of prompts as instructions you'd give to someone unfamiliar with the task. Ask yourself: "Could someone perform this extraction accurately using only these instructions?" Clearly specify:
- What you want (e.g., "Extract the price for installation labor.")
- How you want it formatted (e.g., "Output as a decimal value.")
- Any constraints (e.g., "If multiple values appear, return null.")
Be explicit—vagueness often results in hallucinated outputs. Providing constraints and fallbacks limits this issue.
What LLMs Are Not Good At Doing
- Complex Math & Logic: If your task requires calculations, averages, or complicated logic, LLMs can make mistakes. It’s better to have the LLM extract the relevant components of the data, then perform complex calculations externally or with specialized tools. Note that reasoning models are much better at math, but for data extraction with structured outputs or JSON mode you will most likely be using a more basic model.
- Resolving Contradictory Data: Clearly instruct how to handle contradictory information to avoid random guesses.
Setting Up a System Prompt
Purpose of a System Prompt
A system prompt sets foundational rules for data extraction, independent of individual prompts.
Example System Prompt (Legal context example)
"You are a legal clerk assisting in data extraction. Always follow these rules:
- If extracting exact legal terms, retain original language and wording.
- Only extract numeric values with an exact document match.
- Return null or an empty string if conflicting values exist."
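As a concrete sketch, here is how such a system prompt might be wired up with the OpenAI Python SDK (the model name and user message are illustrative; other providers expose system prompts similarly):

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

SYSTEM_PROMPT = (
    "You are a legal clerk assisting in data extraction. Always follow these rules:\n"
    "- If extracting exact legal terms, retain original language and wording.\n"
    "- Only extract numeric values with an exact document match.\n"
    "- Return null or an empty string if conflicting values exist."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; use any model that supports JSON output
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Extract the effective date from this contract: ..."},
    ],
)
print(response.choices[0].message.content)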
Individual Prompts
How to Describe a Field You Want to Extract
- Name: Specify the exact field name (e.g., "title" vs. "product_title").
- Prompt: Aim for precision, adding details and context to help the model fully understand the task. For example, you could include examples, locations in the document, or words that are descriptive of the task, such as “Gross Salary” vs. “Salary”.
- Format or Data Type: If it’s a date, say something like "date": "YYYY-MM-DD". For numbers, specify if you need integers or floats.
- Fallbacks: If data for that field doesn’t exist, specify what to do. For example, “If no date is found, return null.”
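Putting those four elements together, a single field description might look like the following schema fragment (the invoice_date field is a hypothetical example; the description carries the prompt, format, and fallback):

"invoice_date": {
  "type": ["string", "null"],
  "description": "The invoice issue date, usually printed near the top of the first page. Output as YYYY-MM-DD. If no date is found, return null."
}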
Writing Prompts, Bad vs. Good
Bad Prompt
Extract all the details about products that you can find from the quote
Why it’s bad: It’s too vague, doesn’t specify structure, doesn’t explain the required fields, and doesn’t handle missing data.
Good Prompt (including schema for structured output mode)
Extract all product line items from the provided quote. Each product line item includes:
1. Product Description: A textual description of the product.
2. Product SKU: The unique identifier or stock-keeping unit (SKU) for the product.
3. Quantity: The number of units of the product (as a number).
4. Price Per Item: The price for one unit of the product (as a decimal).
Adhere strictly to the following schema:
{
  "type": "object",
  "properties": {
    "product_line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product_description": { "type": "string" },
          "product_sku": { "type": "string" },
          "quantity": { "type": "number" },
          "price_per_item": { "type": "number" }
        },
        "required": [
          "product_description",
          "product_sku",
          "quantity",
          "price_per_item"
        ]
      }
    }
  },
  "required": ["product_line_items"]
}
Note: The example above is the full input, including the schema, which is passed to the model directly with the prompt; the model then ensures the output conforms to the schema. The exact mechanism varies by foundation model; this example is specific to OpenAI.
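To make that concrete, here is a minimal end-to-end sketch using the OpenAI Python SDK's JSON-schema response format (the model name is illustrative, and parameter names may differ across SDK versions or providers). At the time of writing, OpenAI's strict mode also requires "additionalProperties": false on every object:

import json

from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "product_line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_description": {"type": "string"},
                    "product_sku": {"type": "string"},
                    "quantity": {"type": "number"},
                    "price_per_item": {"type": "number"},
                },
                "required": ["product_description", "product_sku", "quantity", "price_per_item"],
                "additionalProperties": False,  # required by OpenAI's strict mode
            },
        }
    },
    "required": ["product_line_items"],
    "additionalProperties": False,
}

quote_text = "..."  # the raw text of the quote document

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # illustrative; any structured-outputs-capable model
    messages=[
        {"role": "user", "content": "Extract all product line items from the provided quote.\n\n" + quote_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "quote_extraction", "schema": schema, "strict": True},
    },
)
line_items = json.loads(response.choices[0].message.content)["product_line_items"]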
Accepted Ranges and Validations
In some cases, you may want to limit or validate numeric fields. For example:
- Price Range: “If the price is below 0 or above 999.99, return null for price.”
- Rating: “Ratings should be between 1 and 5. If the extracted rating is out of range, set it to 5 if it’s above 5, or to 1 if it’s below 1.”
By including such rules, you decrease the chance of invalid data.
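On the consuming side, the same rules are cheap to re-enforce in code after parsing. A minimal post-validation sketch (the price and rating field names are illustrative):

def validate(record):
    # Null out prices outside the accepted 0-999.99 range
    price = record.get("price")
    if price is not None and not (0 <= price <= 999.99):
        record["price"] = None
    # Clamp ratings into the 1-5 range
    rating = record.get("rating")
    if rating is not None:
        record["rating"] = min(max(rating, 1), 5)
    return record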
Default Values for Extractions
Real-world data can be messy, and you may not always find a particular field. Provide defaults:
- Strings: Use an empty string ("").
- Numbers: Use 0 or null, depending on your system’s preference.
- Arrays: Use an empty array ([]).
Example instruction:
If you cannot find the manufacturer in the text, set "manufacturer": "".
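The same defaults can be applied defensively after parsing; a small sketch, assuming a parsed dict named record and these particular fields:

DEFAULTS = {"manufacturer": "", "quantity": 0, "tags": []}

def apply_defaults(record):
    # Fill in any missing or null fields with the agreed defaults
    for key, default in DEFAULTS.items():
        if record.get(key) is None:
            record[key] = default
    return record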
Extracting Lists or Arrays
For fields where multiple entries might appear, instruct the model to provide a list:
"tags": ["tag1", "tag2", ...]
If there’s a maximum:
Include only the first 5 tags you find. If none are found, return an empty array.
This helps keep the output standardized.
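In schema form, the same cap can be expressed with maxItems, as below. Be aware that some providers' strict modes ignore validation keywords like maxItems, so it is worth restating the limit in the prompt as well:

"tags": {
  "type": "array",
  "items": { "type": "string" },
  "maxItems": 5
}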
Using Categories to Force a Specific Output
Sometimes data must fit one of a few predefined options (e.g., "Low", "Medium", or "High" for a risk rating). The LLM might invent new categories like "Very High". To avoid this:
- Explicitly restrict it: “Possible values for risk_level are only: Low, Medium, High. If no relevant data is found, set risk_level to Low.”
- Validate on your side too. If the LLM tries returning something else, you can revert to your default or handle the error.
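With structured outputs, the cleanest way to enforce a fixed set of options is an enum in the schema, which constrains generation itself rather than relying on instructions alone:

"risk_level": {
  "type": "string",
  "enum": ["Low", "Medium", "High"]
}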
Summarization & Generation Tasks
While structured extraction is your main goal, you might also want short summaries or creative text under fixed keys:
"summary": "A single sentence summary of the text."
"headline": "A catchy 5-10 word headline describing the document."
Be clear with length limits:
“Summaries should be under 30 words. Truncate if longer.”
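Models do not always count words reliably, so it is worth enforcing such limits in code as well; a tiny sketch:

def truncate_words(text, max_words=30):
    # Keep only the first max_words words of a summary
    return " ".join(text.split()[:max_words])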
Additional Tips & Topics
Nested Objects vs. Flat Data
If your data is hierarchical (e.g., an address with street, city, zip), sometimes it’s clearer to keep those fields nested:
"location": {
"street": "123 Main St.",
"city": "Example City",
"zip": "12345"
}
Otherwise, a flat structure may be easier for quick lookups:
"street": "123 Main St.",
"city": "Example City",
"zip": "12345"
Choose based on how your system will use the data.
Testing & Iterating Prompts
- Test with examples that span a range of real-world data for each use case.
- Check if the model’s output adheres to the constraints (e.g., does it really truncate the description? Does it respect max array sizes?).
- Adjust your prompt until you consistently get the structure you want.
Handling Large Inputs
If your input text is too long (e.g., multiple pages of instructions or a large PDF), consider chunking the text and running multiple extractions. You can then merge or reconcile the results afterward.
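A minimal chunk-and-merge sketch (chunk sizes are illustrative, and extract_items stands in for your own structured-output LLM call):

def chunk_text(text, max_chars=8000, overlap=500):
    # Split text into overlapping character windows so items spanning
    # a boundary appear intact in at least one chunk
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def extract_all(text, extract_items):
    # extract_items: a function taking a chunk and returning a list of dicts
    items = []
    for chunk in chunk_text(text):
        items.extend(extract_items(chunk))
    # De-duplicate items that appear in overlapping chunks
    seen, merged = set(), []
    for item in items:
        key = (item.get("product_sku"), item.get("product_description"))
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged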
Conclusion
Key Takeaways
- Be Explicit: LLMs rely on clear, direct instructions to produce consistent JSON or other structured outputs.
- Provide Fallbacks: Ensure you specify default or null values for missing data.
- Validate: Even with a good prompt, always validate the output against a large enough sample of real data to catch unexpected results. Consider using a tool to run evals if you need to test on an ongoing basis.