How to Format JSONL for OpenAI Fine-Tuning

OpenAI fine-tuning takes a JSONL file: one JSON object per line, each holding a full chat conversation under a messages key. The format looks simple, but a single misstep (a JSON array instead of JSONL, a wrong role name, a missing assistant message) gets the whole job rejected. This guide covers the exact per-line schema that OpenAI's supervised fine-tuning documentation specifies, the system/user/assistant/tool roles, the optional weight field for multi-turn examples, token limits and silent truncation, and a checklist of the most common validation pitfalls. Run any file through the JSONL validator before you upload it.

What JSONL Is (and the Array Trap)

JSONL stands for JSON Lines. The rules are simple and strict:

One complete JSON object per line.
Lines are newline-delimited (a literal \n after each record).
The file is UTF-8.
There are no enclosing array brackets and no commas between records.

The single most common mistake is handing OpenAI a JSON array: the entire dataset wrapped in [ ... ] with commas separating the objects. That is perfectly valid JSON, but it is not valid JSONL. Compare:

# WRONG: a JSON array (one top-level array, commas between objects, brackets around)
[
  {"messages": [...]},
  {"messages": [...]}
]

# RIGHT — JSONL: one object per line, no brackets, no commas between lines
{"messages": [...]}
{"messages": [...]}

Because each line stands alone, the file can be read line by line — one malformed record doesn't poison the parse of the rest, and the dataset never has to be loaded into memory all at once.
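If a dataset already exists as a JSON array, converting it to JSONL is mechanical. The following is a minimal sketch in Python; the function name array_to_jsonl is illustrative and not part of any OpenAI tooling.

```python
import json


def array_to_jsonl(array_text: str) -> str:
    """Convert the text of a JSON array of objects into JSONL text."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without an indent argument keeps each record on one physical
    # line (newlines inside strings are escaped); ensure_ascii=False keeps
    # non-ASCII text as readable UTF-8.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


if __name__ == "__main__":
    sample = '[{"messages": []}, {"messages": []}]'
    print(array_to_jsonl(sample), end="")
```

Each record ends with its own newline, so the result can be written straight to a .jsonl file opened with encoding="utf-8".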

The messages Format

Each line is an object with a required top-level messages key. The array under it is exactly a Chat Completions conversation — a system instruction, a user turn, and the assistant response the model should learn to produce:

{"messages": [{"role": "system", "content": "You are a terse SQL assistant. Answer with SQL only."}, {"role": "user", "content": "Get all users who signed up in 2024."}, {"role": "assistant", "content": "SELECT * FROM users WHERE EXTRACT(YEAR FROM created_at) = 2024;"}]}

That whole thing is one line of the file. Pretty-printed for readability, the same record looks like this — but in the actual .jsonl file it must be collapsed onto a single physical line:

{
  "messages": [
    {"role": "system", "content": "You are a terse SQL assistant. Answer with SQL only."},
    {"role": "user", "content": "Get all users who signed up in 2024."},
    {"role": "assistant", "content": "SELECT * FROM users WHERE EXTRACT(YEAR FROM created_at) = 2024;"}
  ]
}

The top-level messages key is required. For function-calling fine-tunes you may also include an optional top-level tools array (the function definitions) and a parallel_tool_calls field alongside messages.
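Rather than collapsing records onto one line by hand, it is usually safer to build them as dictionaries and let a JSON serializer do it. A small sketch, assuming single-turn examples; make_example is a hypothetical helper name:

```python
import json


def make_example(system: str, user: str, assistant: str) -> str:
    """Return one JSONL line holding a single-turn training example."""
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }
    # No indent argument: the record is serialized onto one physical line,
    # and any newline inside a content string is escaped as \n.
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    print(make_example(
        "You are a terse SQL assistant. Answer with SQL only.",
        "Get all users who signed up in 2024.",
        "SELECT * FROM users WHERE EXTRACT(YEAR FROM created_at) = 2024;",
    ))
```

Writing each returned line followed by "\n" to a UTF-8 file yields valid JSONL without any manual escaping.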

Roles and content

The valid roles inside the messages array are:

system — the instruction that sets behavior. (Note: the newer developer role from the Responses API is not confirmed valid inside fine-tuning JSONL — use system.)
user — the input the model is responding to.
assistant — the response the model is being trained to produce. This is the training target.
tool — the result returned from a tool/function call, used in function-calling training.

In the standard case content is a string. An assistant message may instead carry a tool_calls array (a function name plus JSON-string arguments) when you're training the model to call tools:

{"messages": [{"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": "18C, clear"}, {"role": "assistant", "content": "It's 18C and clear in Paris right now."}]}

Note that the function's arguments value is itself a JSON string, so its inner quotes are escaped. Every example must contain at least one assistant message, because that turn is what the model is trained to generate.
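The double encoding of arguments is easy to get wrong when typed by hand. One way to avoid it, sketched here with a hypothetical helper named tool_call_message, is to serialize the arguments first and the whole record second:

```python
import json


def tool_call_message(call_id: str, name: str, arguments: dict) -> dict:
    """Build an assistant message that calls one function."""
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                # arguments must be a JSON string, so serialize the dict first.
                "function": {"name": name, "arguments": json.dumps(arguments)},
            }
        ],
    }


if __name__ == "__main__":
    record = {
        "messages": [
            {"role": "user", "content": "What's the weather in Paris?"},
            tool_call_message("call_1", "get_weather", {"city": "Paris"}),
            {"role": "tool", "tool_call_id": "call_1", "content": "18C, clear"},
            {"role": "assistant", "content": "It's 18C and clear in Paris right now."},
        ]
    }
    # The outer json.dumps escapes the quotes inside the arguments string.
    print(json.dumps(record))
```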

Multi-Turn Examples and the weight Field

A single example can hold a whole multi-turn conversation. By default the model is trained on every assistant turn in the example. Sometimes you only want it to learn from one turn — say, the corrected final answer — while keeping the earlier assistant turns as context. That's what the optional weight field is for.

weight goes on assistant messages only and takes the value 0 or 1:

weight: 1 — the default if omitted. The turn is included in the training loss; the model learns to reproduce it.
weight: 0 — the turn is excluded from the loss. The model sees it as context but isn't trained to generate it.

Here a first-draft assistant answer is kept as context (weight 0) and only the corrected answer is trained on (weight 1):

{"messages": [{"role": "system", "content": "You answer concisely."}, {"role": "user", "content": "How many moons does Mars have?"}, {"role": "assistant", "content": "Mars has one moon.", "weight": 0}, {"role": "user", "content": "Are you sure?"}, {"role": "assistant", "content": "You're right to check — Mars has two moons, Phobos and Deimos.", "weight": 1}]}

Setting weight on system, user, or tool messages has no defined effect — it only changes how assistant turns contribute to training.
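When many multi-turn examples should train only on the final answer, the weights can be set programmatically. A minimal sketch, assuming the last assistant turn is always the one to learn from; the function name is illustrative:

```python
import copy


def train_on_last_assistant_only(example: dict) -> dict:
    """Return a copy with weight 0 on every assistant turn except the last."""
    result = copy.deepcopy(example)  # leave the caller's data untouched
    assistant_turns = [
        m for m in result["messages"] if m.get("role") == "assistant"
    ]
    for turn in assistant_turns[:-1]:
        turn["weight"] = 0  # kept as context, excluded from the loss
    if assistant_turns:
        assistant_turns[-1]["weight"] = 1  # explicit, although 1 is the default
    return result
```

Only assistant messages are touched, so system, user, and tool messages never receive a weight key.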

How Many Examples, and Data Quality

OpenAI requires a minimum of 10 examples to create a fine-tuning job. Their supervised fine-tuning documentation notes that clear, measurable improvement typically starts around 50 to 100 well-crafted examples, and scales from there.

Consistency matters more than raw count. The model learns the patterns it sees repeatedly, so:

Keep the system message consistent across examples (or omit it consistently) — mixing styles teaches the model nothing stable.
Make the assistant responses uniformly demonstrate the format and tone you actually want at inference time.
Cover the real distribution of inputs you expect, including edge cases, rather than a hundred near-identical prompts.

A few hundred clean, on-distribution examples will beat thousands of noisy or contradictory ones.
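System-message consistency is one of the few quality properties that can be checked mechanically. The sketch below counts the distinct system prompts in a file's text; system_prompt_counts is a hypothetical helper, and it assumes every non-blank line already parses and has a messages array.

```python
import json
from collections import Counter


def system_prompt_counts(jsonl_text: str) -> Counter:
    """Count distinct system prompts; None stands for 'no system message'."""
    counts = Counter()
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        # Take the first system message if there is one, otherwise None.
        system = next(
            (m.get("content") for m in messages if m.get("role") == "system"),
            None,
        )
        counts[system] += 1
    return counts
```

More than one key in the result means the dataset mixes system prompts (or mixes examples with and without one), which is worth reviewing before training.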

Token Limits and Silent Truncation

Per-example token limits are model-dependent — each base model has a maximum context length that applies to the whole example (all messages combined). The dangerous part is what happens when you exceed it:

An over-length example is truncated, not rejected. The tail of the example is silently dropped to fit the limit — and since the assistant answer usually comes last, a too-long example can lose its training target entirely with no error message. The job runs, but those examples teach the model nothing useful (or teach it to produce cut-off output).

Before uploading, check the token length of your longest examples. A quick way is to count the tokens of the combined content of each line and flag any that approach the model's limit. The token counter gives you a fast estimate per example so you can split or trim the long ones before they get clipped.
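A rough pre-check might look like the sketch below. Everything in it is an assumption for illustration: the names are hypothetical, the limit is whatever the chosen base model documents, and the default counter is a crude 4-characters-per-token estimate rather than a real tokenizer. A proper tokenizer function can be passed in as count_tokens instead.

```python
import json
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token. Not a real tokenizer."""
    return (len(text) + 3) // 4


def flag_long_examples(
    jsonl_text: str,
    limit: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[tuple[int, int]]:
    """Return (line_number, estimated_tokens) for examples above `limit`."""
    flagged = []
    for number, line in enumerate(jsonl_text.split("\n"), 1):
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        # Sum string content only; tool calls and per-message overhead
        # are ignored, so treat the result as a lower bound.
        total = sum(
            count_tokens(m["content"])
            for m in messages
            if isinstance(m.get("content"), str)
        )
        if total > limit:
            flagged.append((number, total))
    return flagged
```

Because the estimate undercounts, it makes sense to flag examples well below the model's real limit and inspect them by hand.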

Preparing and Uploading the File

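Validate locally before uploading, line by line and never as one blob. jq reads a stream of JSON values, so it needs no extra flags for a quick check that the file contains only valid JSON. That check also passes for a JSON array or a pretty-printed object, so follow it with a per-line loop:

# Quick check: every value in the file parses as JSON; jq reports where a parse error occurs
jq -e . data.jsonl > /dev/null \
  || echo "Invalid JSON somewhere in the file"

# Stricter: check that each physical line is a JSON object on its own
n=0
while IFS= read -r line || [ -n "$line" ]; do
  n=$((n + 1))
  # printf avoids echo's shell-dependent handling of backslashes
  printf '%s\n' "$line" | jq -e 'type == "object"' > /dev/null 2>&1 \
    || echo "BAD LINE $n: $line"
done < data.jsonl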

In Python, read the file line by line and assert the schema you care about — that each line parses, has a messages array, and contains at least one assistant message:

import json

with open("data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            raise ValueError(f"Line {i}: blank line not allowed")
        obj = json.loads(line)              # raises on invalid JSON
        msgs = obj["messages"]              # raises if key missing
        assert isinstance(msgs, list) and msgs, f"Line {i}: messages must be a non-empty array"
        roles = {m["role"] for m in msgs}
        assert "assistant" in roles, f"Line {i}: no assistant message"
        for m in msgs:
            assert m["role"] in {"system", "user", "assistant", "tool"}, \
                f"Line {i}: bad role {m['role']!r}"

print("OK")

Reading the file line by line (rather than wrapping it in json.load) is also the test that the file is JSONL and not an array. The same format works unchanged on Azure OpenAI and Microsoft Foundry.

Validation Pitfalls Checklist

Run through this before every upload. Each item is mechanical and catchable ahead of time:

File is a JSON array — wrapped in [ ] with commas between objects. Must be one object per line instead.
Trailing commas — after the last property or array element. Invalid JSON.
Single quotes — JSON requires double quotes for strings and keys.
Unquoted keys — {messages: ...} is invalid; it must be {"messages": ...}.
BOM — a UTF-8 byte-order mark at the start of the file breaks the first line's parse.
Blank lines — empty lines between records aren't valid JSONL.
Wrong role names — bot, ai, or human instead of system/user/assistant/tool.
No assistant message — every example needs at least one assistant turn as the training target.
Empty content — empty content strings where text is expected.
Over-length examples — these are truncated silently, not rejected; check token counts.
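Several of these items can be checked in one pass over the raw bytes of the file. The following is a sketch, not a complete validator: find_pitfalls and its messages are hypothetical, token length is not checked, and all JSON syntax errors (trailing commas, single quotes, unquoted keys) are reported under one generic message.

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}


def find_pitfalls(raw: bytes) -> list[str]:
    """Return human-readable problems found in the raw bytes of a JSONL file."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[3:]
    text = raw.decode("utf-8")
    if text.lstrip().startswith("["):
        problems.append("file looks like a JSON array, not JSONL")
        return problems
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single trailing newline is fine
    for number, line in enumerate(lines, 1):
        if not line.strip():
            problems.append(f"line {number}: blank line")
            continue
        try:
            messages = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append(f"line {number}: invalid JSON or missing messages key")
            continue
        if not isinstance(messages, list) or not all(
            isinstance(m, dict) for m in messages
        ):
            problems.append(f"line {number}: messages is not an array of objects")
            continue
        for m in messages:
            if m.get("role") not in VALID_ROLES:
                problems.append(f"line {number}: bad role {m.get('role')!r}")
            elif m.get("content") == "":
                problems.append(f"line {number}: empty content in {m['role']} message")
        if not any(m.get("role") == "assistant" for m in messages):
            problems.append(f"line {number}: no assistant message")
    return problems
```

Reading the file in binary mode, for example with open("data.jsonl", "rb"), is what makes the BOM check possible; an empty result means none of the checked pitfalls were found.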

Try It Live

The JSONL validator checks your file line by line — it flags array-instead-of-JSONL, trailing commas, single quotes, blank lines, missing messages keys, and bad role names before OpenAI ever sees the file. Everything runs in your browser, so you can validate training data containing proprietary prompts without sending it anywhere. Pair it with the token counter to catch over-length examples before they get silently truncated.

Frequently Asked Questions

What is JSONL and why does OpenAI fine-tuning require it instead of JSON?

JSONL (JSON Lines) is a text format where each line is one complete, self-contained JSON object, the lines are newline-delimited, the file is UTF-8, and there are no enclosing array brackets and no commas between records. OpenAI fine-tuning uses it because each line is one independent training example — the file can be streamed and read line by line without loading the whole dataset into memory, and one malformed line doesn't invalidate the parse of the others. The single most common mistake is submitting a JSON array (the whole file wrapped in [ ... ] with commas between objects). That is valid JSON but invalid JSONL, and the job will be rejected. Each example must sit on its own line with a literal newline after it and nothing wrapping the set.

What is the per-line shape of an OpenAI fine-tuning example?

Each line is an object with a required top-level messages key holding an array of message objects, exactly like a Chat Completions request: {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Valid roles inside messages are system, user, assistant, and tool. In the standard case content is a string. Every example must contain at least one assistant message, because the assistant turn is the training target the model learns to produce. For function-calling fine-tunes you may also add an optional top-level tools array and a parallel_tool_calls field alongside messages.

What does the weight field do in a fine-tuning example?

The optional weight field goes on assistant messages only and takes the value 0 or 1. A weight of 1 (the default if omitted) includes that assistant turn in the training loss — the model learns from it. A weight of 0 excludes that turn from the loss, so the model sees it as context but is not trained to reproduce it. This is useful in multi-turn examples where you only want the model to learn the final answer, or one specific turn, while keeping earlier assistant turns as conversational context. It only affects assistant messages; setting weight on system, user, or tool messages has no defined effect.

How many examples do I need, and what happens if an example is too long?

OpenAI requires a minimum of 10 examples to create a fine-tuning job, but their supervised fine-tuning documentation notes that clear improvements typically start around 50 to 100 well-crafted examples. Quality and consistency matter far more than raw count — a few hundred clean, on-distribution examples beat thousands of noisy ones. Per-example token limits are model-dependent. Critically, an example that exceeds the limit is truncated, not rejected — the tail is silently dropped, so a long example can lose its assistant answer entirely without any error. Check the token length of your longest examples with a token counter before uploading.

What are the most common pitfalls that break a fine-tuning file?

The recurring ones: (1) submitting the file as a single JSON array instead of one object per line; (2) trailing commas after the last property or array element; (3) single quotes instead of double quotes; (4) unquoted object keys; (5) a UTF-8 BOM at the start of the file; (6) blank lines between records; (7) wrong role names such as bot, ai, or human instead of system/user/assistant/tool; (8) an example with no assistant message, so there is no training target; and (9) empty content strings. All of these are mechanical and catchable before upload — validate line by line rather than trusting that the file parses as one blob.