How to Diff Strings or Files in Python

Python ships with a complete diff toolkit in the standard library — no pip install required. The difflib module handles line-level, character-level, and ratio-based comparison with multiple output formats including unified diff (the format git diff and patch files use). This guide covers difflib end to end, plus when to reach for Levenshtein distance (single-character precision) or deepdiff (structured data). Verify any output against the diff checker tool.

The Standard Library: difflib

Everything you need for text diffing in Python is in the difflib module. The main entry points:

  • unified_diff(a, b) — generates unified diff format (what git diff produces)
  • context_diff(a, b) — older context diff format, used by some patch tools
  • ndiff(a, b) — line-by-line diff with + - ? prefixes, human-readable
  • Differ().compare(a, b) — the same as ndiff but as a class for reuse
  • SequenceMatcher(None, a, b) — algorithm class for similarity ratios and matching blocks
  • HtmlDiff().make_table(a, b) — generate a side-by-side HTML diff
  • get_close_matches(word, candidates) — find the closest matches to a target

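The line-oriented functions (unified_diff, context_diff, ndiff, Differ.compare, and HtmlDiff) take lists of strings, typically lines. Call .splitlines(keepends=True) to convert a multi-line string into the expected input. SequenceMatcher accepts any two sequences of hashable items, including plain strings, and get_close_matches takes a word plus a list of candidate strings.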
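As a small illustration of that input convention, the sketch below splits two strings with keepends=True and feeds them to context_diff, the one line-oriented function in the list that is not demonstrated elsewhere in this guide. The file names old.txt and new.txt are only labels chosen for the example.

```python
import difflib

old_text = "hello\nworld\nfoo\n"
new_text = "hello\nWORLD\nfoo\nbar\n"

# keepends=True keeps each line's "\n", which the diff functions pass through
old_lines = old_text.splitlines(keepends=True)
new_lines = new_text.splitlines(keepends=True)

# context_diff marks changed lines with "! " and added lines with "+ "
context = list(difflib.context_diff(old_lines, new_lines,
                                    fromfile="old.txt", tofile="new.txt"))
print("".join(context))
```

Because the input lines already end in newlines, the output can be joined with an empty string.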

Unified Diff

The format git diff, diff -u, and patch files use. Most useful for storing or transmitting diffs:

import difflib

old = """hello
world
foo
""".splitlines(keepends=True)

new = """hello
WORLD
foo
bar
""".splitlines(keepends=True)

diff = difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
print("".join(diff))
# --- old.txt
# +++ new.txt
# @@ -1,3 +1,4 @@
#  hello
# -world
# +WORLD
#  foo
# +bar

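The n parameter sets the number of context lines (default 3). If your input lines have no trailing newlines (for example, from plain .splitlines()), pass lineterm="" so the ---, +++, and @@ control lines come without newlines too, then join the output with "\n". Otherwise the control lines carry a newline that the content lines lack, and the joined output ends up with stray blank lines or run-together text.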
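A minimal sketch of the newline-free variant, reusing the same sample text; n=1 is chosen here only to show the context setting.

```python
import difflib

# Plain splitlines() drops the newline from every line
old = "hello\nworld\nfoo".splitlines()
new = "hello\nWORLD\nfoo\nbar".splitlines()

# lineterm="" keeps the ---/+++/@@ control lines newline-free as well,
# and n=1 trims the context to one line around each change
diff_lines = list(difflib.unified_diff(
    old, new, fromfile="old.txt", tofile="new.txt", n=1, lineterm=""
))
print("\n".join(diff_lines))
```

Joining with "\n" puts exactly one newline between lines, which is why the two settings belong together.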

Human-Readable Diff (ndiff)

For showing a diff to a user, ndiff produces a more readable output with intra-line hints:

import difflib

old = "The quick brown fox\njumps over the lazy dog\n"
new = "The slow brown fox\njumps over the sleepy cat\n"

diff = difflib.ndiff(old.splitlines(keepends=True), new.splitlines(keepends=True))
print("".join(diff))
# - The quick brown fox
# ?     ^^^^^
# + The slow brown fox
# ?     ^^^^
# - jumps over the lazy dog
# + jumps over the sleepy cat
# (the second pair is too dissimilar for ndiff to pair up, so it gets no ? hints)

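The ? lines mark the character positions that changed within a line, a hint that unified and context diffs do not provide. ndiff adds them only when it judges two lines similar enough to be versions of each other; otherwise the pair appears as a plain removal and addition. Don't try to parse the output; it's for display only.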
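If a program needs the changes in structured form rather than for display, one option is SequenceMatcher.get_opcodes(), which reports each change as index ranges into the two inputs. The sketch below is one way to use it, not the only one.

```python
import difflib

old = ["hello\n", "world\n", "foo\n"]
new = ["hello\n", "WORLD\n", "foo\n", "bar\n"]

matcher = difflib.SequenceMatcher(None, old, new)

# Each opcode is (tag, i1, i2, j1, j2): old[i1:i2] corresponds to new[j1:j2]
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    if tag != "equal":
        print(tag, old[i1:i2], "->", new[j1:j2])
```

The tags are "equal", "replace", "delete", and "insert", so downstream code can branch on them without parsing any text.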

Similarity Ratio

"How similar are these two strings?" — answered by SequenceMatcher.ratio():

import difflib

s = difflib.SequenceMatcher(None, "The quick brown fox", "The quick red fox")
print(s.ratio())
# 0.8333333333333334  (about 83% similar: 15 matching characters out of 36 total)

# quick_ratio is faster but only an upper bound — useful for filtering large
# candidate sets before running the full comparison
candidates = ["lorem ipsum", "the slow brown fox", "the quick brown bear"]  # plus any others
target = "the quick brown fox"
likely = [c for c in candidates if difflib.SequenceMatcher(None, target, c).quick_ratio() > 0.7]
# Now run the real ratio only on `likely`

# get_close_matches: find best matches in a list above a threshold
matches = difflib.get_close_matches("hellow", ["hello", "help", "world"], n=3, cutoff=0.6)
# ['hello', 'help']  ('help' scores exactly 0.6, and the cutoff is inclusive)

The first argument to SequenceMatcher is a "junk" function — pass None to ignore nothing, or a callable returning True for characters to skip (e.g., lambda c: c == " " to ignore spaces). Useful for fuzzy matching with controlled normalization.
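The sketch below builds one matcher with no junk function and one that treats spaces as junk, then checks the upper-bound relationship between the three ratio methods. The variable names are illustrative only.

```python
import difflib

a = "The quick brown fox"
b = "The quick red fox"

plain = difflib.SequenceMatcher(None, a, b)
# With this junk callable, a space can extend a match but never start one
no_spaces = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)

print(plain.ratio(), no_spaces.ratio())

# The cheaper estimates are upper bounds, so they never fall below ratio()
bounds_hold = plain.real_quick_ratio() >= plain.quick_ratio() >= plain.ratio()
print(bounds_hold)
```

Because quick_ratio() can only overestimate, filtering on it may let weak candidates through but should not discard any that the full ratio() would accept.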

HTML Diff (Side-by-Side)

For displaying diffs in a web page, HtmlDiff generates a complete side-by-side HTML table:

import difflib

old_lines = ["The quick brown fox", "jumps over", "the lazy dog"]
new_lines = ["The slow brown fox", "jumps over", "the sleepy cat"]

html_table = difflib.HtmlDiff().make_table(
    old_lines, new_lines,
    fromdesc="Original", todesc="Updated"
)
# Returns a complete <table> element with CSS classes for added/removed/changed
# Embed in a full page with HtmlDiff().make_file() instead

The output is a full HTML table with classes like diff_add, diff_sub, and diff_chg — style them however you want in your CSS. make_file() wraps the table in a complete HTML document with default styles.
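A short sketch of the make_file() route, writing the page to disk. The temporary directory and the diff.html file name are arbitrary choices for the example.

```python
import difflib
import tempfile
from pathlib import Path

old_lines = ["The quick brown fox", "jumps over", "the lazy dog"]
new_lines = ["The slow brown fox", "jumps over", "the sleepy cat"]

# make_file() returns one string holding a whole HTML document
html_page = difflib.HtmlDiff().make_file(
    old_lines, new_lines, fromdesc="Original", todesc="Updated"
)

# Hypothetical output location: a fresh temp directory, so nothing is overwritten
out_path = Path(tempfile.mkdtemp()) / "diff.html"
out_path.write_text(html_page, encoding="utf-8")
print(f"Wrote {out_path}")
```

Open the resulting file in a browser to see the side-by-side table with the default styles.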

When Stdlib Isn't Enough

Levenshtein distance for spell-check precision

difflib's ratio is good for "similar enough" decisions; Levenshtein distance is better when the exact number of single-character edits matters:

# pip install python-Levenshtein  (or rapidfuzz, which is faster)
import Levenshtein

Levenshtein.distance("color", "colour")    # 1
Levenshtein.distance("kitten", "sitting")  # 3

# Common spell-check pattern: candidates within 2 edits
candidates = ["hello", "help", "world", "hold"]
target = "hellp"
matches = [c for c in candidates if Levenshtein.distance(target, c) <= 2]
# ['hello', 'help']  ('hold' is 3 edits away, 'world' is 4)

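For modern projects, rapidfuzz is a common choice: it is actively maintained, implemented in C++, and exposes the same calculation as rapidfuzz.distance.Levenshtein.distance(a, b). Its import path differs from python-Levenshtein's, so check the documentation before swapping one for the other.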
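To make the metric concrete without installing anything, here is a dependency-free sketch of the classic dynamic-programming computation. The function name levenshtein is a hypothetical helper for illustration; the packages above do the same job far faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] = distance between the processed prefix of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


candidates = ["hello", "help", "world", "hold"]
within_two = [c for c in candidates if levenshtein("hellp", c) <= 2]
print(within_two)   # ['hello', 'help']
```

Keeping only two rows of the table holds memory at the length of the second string, which is enough when only the distance is needed and not the edit script.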

Structured data: deepdiff

For JSON, YAML, dataclasses, or any nested Python object, text diff is the wrong tool. Use deepdiff:

# pip install deepdiff
from deepdiff import DeepDiff

old = {"name": "Janeer", "tools": ["bcrypt", "jwt"], "version": 1}
new = {"name": "Janeer", "tools": ["bcrypt", "argon2", "jwt"], "version": 2}

diff = DeepDiff(old, new)
print(diff)
# {
#   'values_changed': {"root['version']": {'new_value': 2, 'old_value': 1}},
#   'iterable_item_added': {"root['tools'][1]": 'argon2'}
# }

deepdiff handles cycles, compares dicts by key regardless of insertion order, can ignore list order with ignore_order=True, and can serialize its result with to_json(). It's the right choice for comparing API responses, configuration files, or any structured payload.

Common Pitfalls

Forgetting splitlines

difflib's line-oriented functions expect lists of strings (one string per line), not raw text. A string is itself a sequence of characters, so passing a multi-line string directly compares it character by character, which is rarely what you want. Always call .splitlines(keepends=True) first.
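The difference is easy to see on a tiny input. In this sketch the same three characters are diffed once as a raw string and once as a one-line list.

```python
import difflib

# A raw string is a sequence of characters, so each character is treated as a "line"
char_level = list(difflib.ndiff("abc", "abd"))
print(char_level)   # ['  a', '  b', '- c', '+ d']

# Split into lines first and the whole line becomes the unit of comparison
line_level = list(difflib.ndiff("abc\n".splitlines(keepends=True),
                                "abd\n".splitlines(keepends=True)))
print("".join(line_level))
```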

Line endings

Files edited on Windows have \r\n line endings; Unix uses \n. The same content saved on both platforms can diff as 100% different. Normalize first: text.replace("\r\n", "\n").replace("\r", "\n"). Alternatively, open files in text mode with the default newline=None, which translates \r\n and \r to \n on read. Avoid newline="" here, since it preserves the original line endings.
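A minimal sketch of the normalization step, assuming both texts are already in memory as strings; normalize_newlines is a hypothetical helper name.

```python
import difflib


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


unix_text = "hello\nworld\n"
windows_text = "hello\r\nworld\r\n"

# Without normalization every line differs because of the trailing "\r\n"
raw = list(difflib.unified_diff(
    unix_text.splitlines(keepends=True),
    windows_text.splitlines(keepends=True)))

# After normalization the two texts are identical, so the diff is empty
clean = list(difflib.unified_diff(
    normalize_newlines(unix_text).splitlines(keepends=True),
    normalize_newlines(windows_text).splitlines(keepends=True)))

print(len(raw) > 0, len(clean) == 0)   # True True
```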

Encoding mismatches

Reading a UTF-8 file as Latin-1 (or vice versa) produces garbled text that diffs as completely different. Always specify encoding="utf-8" in open() unless you know otherwise. Files with a UTF-8 byte-order mark (BOM) need encoding="utf-8-sig" to strip the leading U+FEFF character.
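The BOM case can be reproduced without a file. The bytes below are a made-up sample standing in for what a BOM-prefixed file would contain.

```python
# Bytes as they might sit on disk: UTF-8 BOM followed by the text
raw_bytes = b"\xef\xbb\xbfhello\n"

with_bom = raw_bytes.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw_bytes.decode("utf-8-sig")  # BOM is stripped

print(repr(with_bom))      # '\ufeffhello\n'
print(repr(without_bom))   # 'hello\n'
```

The stray U+FEFF is invisible in most editors, so a first line that differs for no visible reason is a typical symptom.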

Diffing too much

Character-level diffs (passing a string instead of splitlines) on multi-MB inputs can take minutes — difflib is quadratic in the worst case. For large inputs, always work at the line level. For very large inputs (gigabytes), don't load the whole file into memory; diff chunks or use diff on the command line via subprocess.

Try It Live

The diff checker tool runs entirely in your browser and produces a line-level diff similar to difflib.ndiff — paste two pieces of text to see the differences highlighted. For the JavaScript equivalents, see the JavaScript diff guide.

Frequently Asked Questions

What is the easiest way to compare two strings in Python?

For equality, use a == b. For locale-aware comparison or sorting, use locale.strcoll(a, b). To list what differs between two pieces of text, use the stdlib difflib module, which needs no external dependency. The most common entry point is difflib.unified_diff(old.splitlines(keepends=True), new.splitlines(keepends=True)), which returns a generator of unified-diff lines in the same hunk format git diff uses.

What is the difference between difflib.Differ, ndiff, and unified_diff?

Differ is the class behind ndiff: call Differ().compare(a, b) to get a generator of human-readable change lines with + - ? prefixes. ndiff() is a convenience wrapper that builds a Differ for you and returns the same kind of generator, not a list; its default character-junk filter ignores spaces and tabs, whereas a bare Differ() has none. unified_diff() produces unified diff format (the patch-file format with @@ -1,3 +1,4 @@ hunks). context_diff() produces the older context-diff format. For showing diffs to users, ndiff is most readable; for generating patches or storing diffs, unified_diff is the standard.

How do I get a similarity ratio between two strings in Python?

difflib.SequenceMatcher(None, a, b).ratio() returns a float between 0.0 and 1.0 — how similar the two strings are. ratio() is the standard metric; quick_ratio() and real_quick_ratio() are faster upper-bound approximations useful for filtering large candidate sets before running the real comparison. For finding "the closest match" in a list, difflib.get_close_matches(word, candidates, n=3, cutoff=0.6) returns the top matches above a threshold — useful for spell-check suggestions or fuzzy lookup.

When should I use Levenshtein distance instead of difflib?

Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another — a specific algorithm with a specific output (an integer). difflib.SequenceMatcher uses a different algorithm (Ratcliff-Obershelp) that gives a similarity ratio, not an edit count. Use Levenshtein when the absolute number of edits matters (spell-check thresholds, fuzzy matching with character-level precision). The python-Levenshtein package and the newer rapidfuzz are both significantly faster than equivalent difflib operations on long strings.

How do I diff two JSON or YAML structures in Python?

Don't use text diff on serialized JSON or YAML: reformatting changes (key order, indentation, whitespace) will make equivalent documents look completely different. Parse both inputs first, then compare the resulting Python dictionaries. For a deep-equality check, use a == b after parsing. For a structured diff showing exactly which keys changed, use the deepdiff library (pip install deepdiff), which produces a categorized diff (added, removed, changed, type-changed) that's much more useful than a text diff of the serialized form. deepdiff also handles nested structures, lists with order-sensitivity options, and cycle detection.
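The parse-then-compare step needs only the standard library. In this sketch the two documents are invented samples that differ in key order and layout but carry the same data.

```python
import json

doc_a = '{"name": "Janeer", "version": 1}'
doc_b = '{\n  "version": 1,\n  "name": "Janeer"\n}'

same_text = doc_a == doc_b                          # False: the serialized text differs
same_data = json.loads(doc_a) == json.loads(doc_b)  # True: the parsed dicts are equal
print(same_text, same_data)
```

Dict equality ignores insertion order, which is why the parsed comparison succeeds where the text comparison fails.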