Python Regex Guide

Python's re module is a mature, well-documented regex engine that ships in the standard library. The basics — search, match, findall, sub — are straightforward, but a handful of Python-specific quirks (raw strings, the difference between match and search, named-group syntax, verbose mode) trip up developers coming from other languages. This guide is a practical reference. Pair it with the regex tester and the explainer to confirm any pattern you write.

Always Use Raw Strings

Python interprets backslash sequences like \n, \t, and \b in normal string literals before the regex engine sees them. An unrecognized escape such as \d happens to pass through unchanged (with a warning on recent Python versions), but "\b" silently becomes a backspace character, so in a regular string every regex backslash has to be doubled: "\\b\\d+\\b". Raw strings, prefixed with r, disable that interpretation:

import re

# Wrong — \d is fine but \b becomes a backspace
re.search("\b\d+\b", "abc 123 def")  # no match

# Right — raw string preserves the backslashes
re.search(r"\b\d+\b", "abc 123 def")  # <Match: '123'>

The official re documentation recommends raw string notation for patterns. Make it your default: the cost is one character, and the benefit is never debugging a phantom bug caused by string interpretation.
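To make the difference concrete, the short sketch below inspects what each literal actually contains and then runs the same search three ways. The sample text is the one used above; [0-9] stands in for \d only so the non-raw variants stay free of escape warnings.

```python
import re

plain = "\b"   # one character: backspace (U+0008)
raw = r"\b"    # two characters: a backslash and the letter b

print(len(plain), len(raw))   # 1 2
print(plain == "\x08")        # True

text = "abc 123 def"
doubled = re.search("\\b[0-9]+\\b", text)  # doubled backslashes work, but are noisy
rawed = re.search(r"\b[0-9]+\b", text)     # raw string: same pattern, easier to read
broken = re.search("\b[0-9]+\b", text)     # searches for backspace characters

print(doubled.group(), rawed.group(), broken)  # 123 123 None
```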

The Module Functions

The re module exposes nine main functions:

re.search(pattern, string)     # first match anywhere — most common
re.match(pattern, string)      # match at start (does NOT require whole-string match)
re.fullmatch(pattern, string)  # whole string must match
re.findall(pattern, string)    # list of all matches (strings or tuples for groups)
re.finditer(pattern, string)   # iterator of Match objects (preferred for large data)
re.sub(pattern, repl, string)  # replace all matches with repl
re.subn(pattern, repl, string) # like sub, but also returns the replacement count
re.split(pattern, string)      # split by matches
re.compile(pattern)            # compile to a Pattern object for reuse
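The less familiar entries in that list are subn, split, and the count argument of sub. A minimal sketch with made-up sample strings:

```python
import re

text = "a1 b22 c333"

# subn returns a (new_string, number_of_replacements) tuple
result, count = re.subn(r"\d+", "#", text)
print(result, count)   # a# b# c# 3

# split drops the separators by default...
plain_split = re.split(r"\s*,\s*", "x , y,z")
print(plain_split)     # ['x', 'y', 'z']

# ...but keeps them when the pattern contains a capture group
kept_split = re.split(r"(,)", "x,y,z")
print(kept_split)      # ['x', ',', 'y', ',', 'z']

# count= limits sub to the first N matches
first_only = re.sub(r"\d+", "#", text, count=1)
print(first_only)      # a# b22 c333
```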

match vs search vs fullmatch

This is the most common confusion for non-Python regex users:

re.match(r"\d+", "abc 123")       # None — \d+ doesn't match at start
re.match(r"\d+", "123 abc")       # Match: '123' — matches at start (does not need whole string)
re.search(r"\d+", "abc 123")      # Match: '123' — matches anywhere
re.fullmatch(r"\d+", "123")       # Match: '123' — whole string is digits
re.fullmatch(r"\d+", "123abc")    # None — extra chars at end

For form-input validation, use re.fullmatch. For "does this string contain ...", use re.search. re.match is a footgun; most code that uses it actually wants re.fullmatch.
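As an illustration of the validation advice, here is a small sketch. The username rule and the names USERNAME_RE and is_valid_username are hypothetical, chosen only to show how fullmatch and match differ on the same input.

```python
import re

# Hypothetical rule: 3 to 16 lowercase letters, digits, or underscores
USERNAME_RE = re.compile(r"[a-z0-9_]{3,16}")

def is_valid_username(value: str) -> bool:
    """Return True only if the entire string matches the rule."""
    return USERNAME_RE.fullmatch(value) is not None

print(is_valid_username("alice_01"))        # True
print(is_valid_username("alice_01; DROP"))  # False

# match() accepts the bad input because it only checks the prefix
prefix_only = USERNAME_RE.match("alice_01; DROP")
print(prefix_only.group())                  # alice_01
```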

Flags

Pass flags as a keyword argument or combine with |:

import re

re.search(r"hello", text, flags=re.IGNORECASE)
re.findall(r"^\d+", text, flags=re.MULTILINE | re.VERBOSE)

re.IGNORECASE (re.I) — case-insensitive matching.
re.MULTILINE (re.M) — ^ and $ match line boundaries instead of string boundaries.
re.DOTALL (re.S) — . matches newline characters.
re.VERBOSE (re.X) — allows whitespace and # comments inside the pattern.
re.UNICODE (re.U) — Unicode-aware character classes (default in Python 3, no-op).
re.ASCII (re.A) — restricts \w, \d, \s to ASCII characters only.
re.DEBUG — prints the parsed pattern to standard output at compile time. Useful when a complex pattern misbehaves.
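The effect of the three most commonly confused flags can be checked side by side. The sample text is invented for the sketch, and the non-ASCII digit is ARABIC-INDIC DIGIT THREE, written as an escape to keep the source ASCII.

```python
import re

text = "10 apples\n20 pears\n30 plums"

# MULTILINE: ^ matches at the start of every line, not just the string
start_only = re.findall(r"^\d+", text)                      # ['10']
every_line = re.findall(r"^\d+", text, flags=re.MULTILINE)  # ['10', '20', '30']

# DOTALL: . is allowed to cross the newline
no_dotall = re.search(r"apples.*pears", text)               # None
with_dotall = re.search(r"apples.*pears", text, flags=re.DOTALL)

# ASCII: \d stops matching non-ASCII decimal digits
mixed = "\u0663 3"
unicode_digits = re.findall(r"\d", mixed)                   # two matches
ascii_digits = re.findall(r"\d", mixed, flags=re.ASCII)     # ['3']

print(start_only, every_line)
print(no_dotall, repr(with_dotall.group()))
print(len(unicode_digits), ascii_digits)
```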

Inline Flags

You can also embed flags in the pattern itself:

re.search(r"(?i)hello", text)         # case-insensitive
re.search(r"(?i:hello) world", text)  # only "hello" is case-insensitive (Python 3.6+)

Inline flag groups ((?flags:...)) are scoped, which is useful when only part of the pattern needs the flag.
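A quick check of the scoping, using invented inputs: only the text inside the flag group ignores case, while the rest of the pattern stays case-sensitive.

```python
import re

scoped = r"(?i:hello) world"

hit = re.search(scoped, "HELLO world")    # matches: "hello" part ignores case
miss = re.search(scoped, "HELLO WORLD")   # None: "world" is still case-sensitive

# A global inline flag behaves like passing flags=re.IGNORECASE
global_hit = re.search(r"(?i)hello world", "HELLO WORLD")

print(hit.group())         # HELLO world
print(miss)                # None
print(global_hit.group())  # HELLO WORLD
```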

Verbose Mode for Readable Patterns

Without re.VERBOSE, a non-trivial regex is unreadable. With it, you can format the pattern across lines and add comments:

email_re = re.compile(r"""
    ^                      # start of string
    [a-zA-Z0-9._%+-]+      # local part
    @                      # literal @
    [a-zA-Z0-9.-]+         # domain
    \.                     # dot before TLD
    [a-zA-Z]{2,}           # TLD
    $                      # end of string
""", re.VERBOSE)

Whitespace inside character classes ([abc d]) is preserved. To match a literal space, escape it with a backslash or wrap it in a character class ([ ]); \s also works when any whitespace character is acceptable. Use verbose mode for any regex more than ~30 characters. Your future self will thank you.
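The literal-space rule is the part of verbose mode that most often goes wrong, so here is a small sketch of it. The "Lastname, Firstname" format and the names name_re and broken_re are assumptions made for the example.

```python
import re

# Assumed input format for illustration: "Lastname, Firstname"
name_re = re.compile(r"""
    (?P<last>[A-Za-z]+)    # family name
    ,[ ]                   # comma, then a literal space kept inside a class
    (?P<first>[A-Za-z]+)   # given name
""", re.VERBOSE)

m = name_re.fullmatch("Lovelace, Ada")
print(m.group("first"), m.group("last"))  # Ada Lovelace

# The bare spaces here are stripped by VERBOSE, so the pattern
# really means "last,first" and no longer matches the input
broken_re = re.compile(r"(?P<last>[A-Za-z]+) , (?P<first>[A-Za-z]+)", re.VERBOSE)
print(broken_re.fullmatch("Lovelace, Ada"))  # None
```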

Named Groups

Python uses the (?P<name>...) syntax for named groups (the P dates back to early Python history). Access via .group("name") or .groupdict():

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-04-30")
m.group("year")    # '2026'
m.groupdict()      # {'year': '2026', 'month': '04', 'day': '30'}

# Backreference to a named group
re.search(r"(?P<word>\w+) (?P=word)", "the the")  # matches "the the"

# In replacement strings
re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    "2026-04-30"
)
# '30/04/2026'

The standard re module does not accept the JavaScript/.NET-style (?<name>...) spelling and raises an error when it sees one; the third-party regex module accepts both forms. For code that has to run on the standard library, stick with (?P<name>...).

compile() and Pattern Objects

For patterns you reuse, re.compile() returns a Pattern object whose methods bypass the module-level cache lookup:

email_re = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

for line in lines:
    if email_re.search(line):
        process(line)

The speedup is real but small — Python's module-level functions cache up to a few hundred recently-used patterns internally. The clearer benefit is readability: a compiled pattern bound to a meaningful name documents intent better than an inline literal in every call site.

Pattern objects expose all the same methods (search, match, findall, sub) but as instance methods, so the pattern argument disappears from each call.
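A brief sketch of what working with a Pattern object looks like in practice. The name word_re is illustrative; the pos argument shown at the end is a standard feature of the Pattern methods that the module-level functions do not offer.

```python
import re

word_re = re.compile(r"[a-z]+", re.IGNORECASE)

words = word_re.findall("Red green BLUE")
masked = word_re.sub("_", "Red green")
print(words)    # ['Red', 'green', 'BLUE']
print(masked)   # _ _

# The original source and flags stay available for logging or debugging
print(word_re.pattern)                        # [a-z]+
print(bool(word_re.flags & re.IGNORECASE))    # True

# Pattern methods accept a start position, so you can resume a scan
# without slicing the string
from_offset = word_re.search("Red green BLUE", 4)
print(from_offset.group())                    # green
```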

findall, finditer, and Capture Groups

The behaviour of findall with capture groups is one of Python's stranger regex features:

# No groups — returns list of full matches
re.findall(r"\d+", "a1 b22 c333")
# ['1', '22', '333']

# One group — returns list of group values (NOT full matches)
re.findall(r"(\d)\d*", "a1 b22 c333")
# ['1', '2', '3']

# Multiple groups — returns list of tuples
re.findall(r"(\w+)=(\d+)", "x=1 y=22 z=333")
# [('x', '1'), ('y', '22'), ('z', '333')]

If you want full matches plus group access, use finditer — it returns Match objects so you keep both:

for m in re.finditer(r"(\w+)=(\d+)", "x=1 y=22 z=333"):
    print(m.group(0), m.group(1), m.group(2))
# x=1 x 1
# y=22 y 22
# z=333 z 333

finditer is also more memory-efficient on large inputs because it yields one match at a time rather than building a list.
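When a group is needed only for repetition or alternation, a non-capturing group keeps findall returning full matches. A minimal sketch with an invented date string:

```python
import re

text = "2026-04-30 and 2025-12-01"

# The capturing group takes over the result: only the group's last
# repetition is returned for each match
captured = re.findall(r"\d{4}(-\d{2}){2}", text)
print(captured)   # ['-30', '-01']

# (?:...) groups without capturing, so findall returns whole matches again
whole = re.findall(r"\d{4}(?:-\d{2}){2}", text)
print(whole)      # ['2026-04-30', '2025-12-01']

# finditer additionally exposes match positions
spans = [m.span() for m in re.finditer(r"\d{4}(?:-\d{2}){2}", text)]
print(spans)      # [(0, 10), (15, 25)]
```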

sub() with a Function

If the replacement depends on the match, pass a function instead of a string:

def shout(m):
    return m.group(0).upper() + "!"

re.sub(r"\b\w+\b", shout, "hello world")
# 'HELLO! WORLD!'

# Equivalent with lambda
re.sub(r"\b\w+\b", lambda m: m.group(0).upper() + "!", "hello world")

Common pattern — keyed replacement with a dict:

replacements = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}
re.sub(r"[&<>]", lambda m: replacements[m.group(0)], html)
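The dict approach generalizes if the pattern is built from the keys instead of being typed by hand. The sketch below is one way to do that, with escape_html as a hypothetical helper name; for real HTML escaping the standard library's html.escape is the better tool.

```python
import re

replacements = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}

# Build the alternation from the dict keys. re.escape protects keys that
# are regex metacharacters, and longest-first ordering matters as soon as
# one key is a prefix of another.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def escape_html(text: str) -> str:
    """Replace every key found in text with its mapped value."""
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(escape_html("a < b && c > d"))  # a &lt; b &amp;&amp; c &gt; d
```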

Lookbehind and Lookahead

Python supports all four lookaround forms. Lookbehind in Python's re module must be fixed-width — variable-length lookbehind is rejected. The third-party regex module supports variable-length lookbehind if you need it.

re.findall(r"\d+(?= dollars)", "100 dollars 50 euros")  # ['100']
re.findall(r"\d+(?! dollars)", "100 dollars 50 euros")  # ['10', '50']
re.findall(r"(?<=\$)\d+", "$50 and $100")               # ['50', '100']
re.findall(r"(?<!\$)\d+", "$50 and 100 items")          # ['0', '100']

Note the second example: \d+(?! dollars) backtracks and matches partial numbers because the lookahead can succeed at any non-final position. Anchor with \b if you want word-aligned matches: \b\d+\b(?! dollars).
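The fixed-width restriction and the word-boundary fix can both be verified directly. In this sketch the currency strings are invented; the only assumption is that re.error is the exception raised for an invalid pattern, which holds for the standard library.

```python
import re

# Variable-length lookbehind is rejected at compile time
rejected = False
try:
    re.compile(r"(?<=\$+)\d+")
except re.error as exc:
    rejected = True
    print("rejected:", exc)

# Alternatives of equal width are fine
amounts = re.findall(r"(?<=USD|EUR)\d+", "USD50 EUR100 GBP7")
print(amounts)    # ['50', '100']

# Unanchored negative lookahead matches a partial number...
partial = re.findall(r"\d+(?! dollars)", "100 dollars 50 euros")
# ...while word boundaries force whole numbers only
anchored = re.findall(r"\b\d+\b(?! dollars)", "100 dollars 50 euros")
print(partial, anchored)   # ['10', '50'] ['50']
```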

For deeper coverage, see the lookahead and lookbehind guide.

The Third-Party `regex` Module

For features the standard library does not support, install regex from PyPI:

pip install regex

Drop-in compatible with re for the basics, plus:

Variable-length lookbehind.
Recursive patterns with (?R) for matching balanced brackets and nested structures.
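Atomic groups (?>...) and possessive quantifiers *+, ++, ?+ for preventing catastrophic backtracking. The standard re module gained both in Python 3.11, so you only need regex for these on older versions.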
Unicode property escapes like \p{L} for any letter, \p{Sc} for currency symbols.
Fuzzy matching for approximate string matching within a bounded number of insertions, deletions, and substitutions.

For most code, the standard re module is sufficient. Reach for regex when you hit a specific limit — most often, variable-length lookbehind or fuzzy matching.

Common Gotchas

Forgetting r"..." on patterns with backslashes. Always use raw strings.
Using re.match when you wanted re.fullmatch. match only anchors the start.
findall returning groups instead of full matches when the pattern has any capture group. Use a non-capturing group (?:...), or switch to finditer.
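Catastrophic backtracking on patterns like (a+)+b. Avoid nested quantifiers; where that is not possible, use atomic groups or possessive quantifiers, which the standard re module supports from Python 3.11 and the third-party regex module supports on older versions.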
\b behaving unexpectedly with non-ASCII text. Word boundaries respect Unicode in Python 3 by default; if you wrote regex assuming ASCII semantics, results may differ.
Backreferences in replacement strings use \1 or \g<1>, not $1. Coming from JavaScript or Perl, this trips people up.
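The replacement-string gotcha is easy to demonstrate. This sketch uses invented inputs and shows the one case where \g<1> is required rather than optional.

```python
import re

# \1 and \g<1> are interchangeable in most replacements
swapped_short = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@example")
swapped_long = re.sub(r"(\w+)@(\w+)", r"\g<2> at \g<1>", "user@example")
print(swapped_short, "|", swapped_long)  # example at user | example at user

# \g<1> is required when a digit follows the reference,
# because \10 would be read as group 10
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)    # a10b20

# $1 has no special meaning in Python and is inserted literally
literal = re.sub(r"(\d)", "$1", "a1b2")
print(literal)   # a$1b$1
```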

Try Patterns Live

The regex tester uses JavaScript syntax, but the overlap with Python is large enough that most everyday patterns work in both engines. The main exceptions are Python-specific constructs such as (?P<name>...) and inline flag groups. Check the differences in the regex cheat sheet when migrating between languages, and use the regex explainer for a token-by-token breakdown of any unfamiliar pattern.

Frequently Asked Questions

Why should I use raw strings for Python regex patterns?

Python's string literals interpret backslash sequences like \n and \t before the regex engine ever sees them, which means every backslash meant for the regex has to be doubled ("\\d" instead of "\d"). Raw strings, prefixed with r, disable that interpretation, so r"\d+" passes a literal backslash-d to re. Always use r"..." for regex patterns; the official re documentation recommends it, because non-raw strings are a constant source of bugs.

When should I use re.compile() instead of re.search() directly?

Compile when you reuse the same pattern many times in a hot loop. re.compile() does the work of parsing the pattern once and returns a Pattern object whose methods are slightly faster than calling the module-level functions repeatedly. The module-level functions actually use an internal cache, so the speedup is small (10-20% in tight loops), but the compiled form is also more readable when the pattern is long and you want to bind it to a meaningful name. For one-off use, re.search and friends are fine.

What is the difference between re.match, re.search, and re.fullmatch?

re.match anchors at the start of the string but does not require matching the entire string — it returns a match if the start of the string matches. re.search scans the entire string and returns the first match anywhere. re.fullmatch (added in Python 3.4) requires the entire string to match the pattern. Most code wants re.search; re.match is a common source of bugs because developers expect it to match the whole string. For validation, use re.fullmatch to make the intent explicit.

How do I make my Python regex case-insensitive?

Pass the flag as a keyword argument: re.search(r"hello", text, flags=re.IGNORECASE) or its short form re.I. You can also embed the flag in the pattern with the inline syntax (?i) at the start of the pattern. Combine multiple flags with bitwise OR: re.IGNORECASE | re.MULTILINE | re.DOTALL. Inline flag groups like (?i:hello) (Python 3.6+) apply the flag to a specific section of the pattern only, which is useful when most of the pattern is case-sensitive but one part is not.

What is verbose mode in Python regex?

The re.VERBOSE flag (or re.X) lets you write multi-line, commented regex patterns by ignoring whitespace and # comments inside the pattern. Whitespace inside character classes [ ] is preserved, and you can still match a literal space with a backslash-escaped space or [ ], any whitespace with \s, and a literal # with \#. Verbose mode is essential for any regex more than a line long, because a complex pattern is unreadable without it. Pair it with raw triple-quoted strings to keep the source clean.