How to Extract Emails From Text Using Regex

Pulling email addresses out of free-form text (log files, scraped pages, customer messages, exported data dumps) is one of the most common reasons developers reach for regex. This guide covers the pragmatic pattern that handles the vast majority of real-world cases, working code in JavaScript, Python, and Bash, and the edge cases worth knowing about. Test any pattern below in the regex tester with your own data before deploying.

The Pattern

This is a widely used pragmatic email regex for extraction code:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

It breaks down into three parts:

  • Local part: [a-zA-Z0-9._%+-]+ — letters, digits, dot, underscore, percent, plus, hyphen, one or more times.
  • @ sign: a literal at-sign.
  • Domain plus TLD: [a-zA-Z0-9.-]+\.[a-zA-Z]{2,} — letters, digits, dot, hyphen, then a literal dot, then a TLD of at least two letters.
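
As a rough illustration, the same breakdown can be written out with Python's re.VERBOSE flag so that each piece carries its own comment. The names EMAIL_VERBOSE and sample are illustrative only, and the sample string is the one used in the examples later in this guide.

```python
import re

# Same pattern as above, split across lines; re.VERBOSE ignores the
# whitespace and the trailing comments outside character classes.
EMAIL_VERBOSE = re.compile(r"""
    [a-zA-Z0-9._%+-]+   # local part: letters, digits, . _ % + -
    @                   # literal at-sign
    [a-zA-Z0-9.-]+      # domain labels: letters, digits, dot, hyphen
    \.                  # literal dot before the TLD
    [a-zA-Z]{2,}        # TLD: at least two letters
""", re.VERBOSE)

sample = "Contact: jane@example.com or sales@foo.co.uk. Spam: bob@test"
found = EMAIL_VERBOSE.findall(sample)
print(found)  # ['jane@example.com', 'sales@foo.co.uk']
```

The verbose form behaves the same as the one-line pattern; it is only easier to review and annotate.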

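Add \b word boundaries on either side if you're scanning unstructured text. The trailing \b rejects an address that runs straight into more word characters (jane@example.com123 no longer yields a truncated match), and the leading \b stops a match from starting on punctuation such as a stray hyphen: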

\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b

This is the same pattern in the common regex patterns library — copy directly from there if you want.
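
To make the effect of the boundaries concrete, here is a small sketch comparing the two patterns on two awkward inputs. The variable names and the sample strings are made up for illustration.

```python
import re

PLAIN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
BOUNDED = re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b")

# An address glued to trailing digits: the plain pattern returns a
# truncated token, the bounded pattern returns nothing.
glued = "jane@example.com123"
plain_glued = PLAIN.findall(glued)        # ['jane@example.com']
bounded_glued = BOUNDED.findall(glued)    # []

# Leading punctuation that happens to be legal in a local part.
dashed = "-jane@example.com"
plain_dashed = PLAIN.findall(dashed)      # ['-jane@example.com']
bounded_dashed = BOUNDED.findall(dashed)  # ['jane@example.com']
```

Whether dropping the glued token is the right call depends on your data; in some dumps the truncated match is still the address you wanted.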

JavaScript

const text = "Contact: jane@example.com or sales@foo.co.uk. Spam: bob@test";
// match() returns null when nothing matches, so fall back to an empty array
const emails = text.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g) || [];
// ['jane@example.com', 'sales@foo.co.uk']
// Note: 'bob@test' is excluded because there's no .TLD

// Deduplicate
const unique = [...new Set(emails)];

// Lowercase for consistent comparison
const normalized = [...new Set(emails.map(e => e.toLowerCase()))];

// matchAll for richer iteration with positions
for (const m of text.matchAll(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g)) {
  console.log(m[0], 'at index', m.index);
}

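The g flag is essential: without it, match returns only the first match, and matchAll throws a TypeError. Note that match returns null rather than an empty array when nothing matches, so guard the result before calling array methods on it. See the JavaScript regex guide for more on the matchAll iterator and avoiding lastIndex bugs.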

Python

import re

text = "Contact: jane@example.com or sales@foo.co.uk. Spam: bob@test"
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
# ['jane@example.com', 'sales@foo.co.uk']

# Deduplicate while preserving order (Python 3.7+)
unique = list(dict.fromkeys(emails))

# Lowercase for consistent comparison
normalized = list(dict.fromkeys(e.lower() for e in emails))

# Use finditer for positions and Match objects
for m in re.finditer(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text):
    print(m.group(0), "at index", m.start())

Always use raw strings (r"...") for Python regex patterns. See the Python regex guide for the difference between findall, finditer, and match.

Bash / grep

# Extract emails from a file, one per line
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' input.txt

# Deduplicate
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' input.txt | sort -u

# Lowercase first, then dedupe
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' input.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort -u

# Count lines containing at least one email, per file (-c counts lines, not matches)
grep -rcE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' .

The -o flag prints only the matches (not the surrounding line); -E enables extended regex syntax. BSD grep on macOS accepts the same flags and pattern. For very large files, ripgrep (rg -oNI) is typically much faster.
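
grep's -c option counts matching lines, so a line holding two addresses counts once. If you need the number of individual addresses per file, a short script is one way to get it. The sketch below is a minimal Python version; count_emails_per_file is a hypothetical helper name, and it assumes the files can be read as UTF-8 text (undecodable bytes are ignored).

```python
import re
from collections import Counter
from pathlib import Path

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def count_emails_per_file(root):
    """Return a Counter mapping file path -> number of address matches."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # errors="ignore" drops bytes that are not valid UTF-8
        with path.open(encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                # Adding 0 still creates the key, so files without any
                # address show up with a count of 0.
                counts[str(path)] += len(EMAIL_RE.findall(line))
    return counts

if __name__ == "__main__":
    for name, total in sorted(count_emails_per_file(".").items()):
        print(f"{total}\t{name}")
```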

Common Edge Cases

Display names with angle brackets

Given Jane Doe <jane@example.com>, the regex extracts jane@example.com and leaves the brackets and display name behind. That's usually what you want. If you need the full RFC 5322 mailbox (name plus address), reach for a real parser:

# Python
from email.utils import parseaddr
parseaddr('Jane Doe <jane@example.com>')
# ('Jane Doe', 'jane@example.com')

Plus-tags and dots in the local part

Addresses like jane+newsletter@example.com and j.a.n.e@example.com match the pattern correctly. Many applications strip the plus-tag when deduplicating; that's an application-level decision, not a regex one.
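
If your application does decide to treat plus-tagged addresses as duplicates, the normalization can live in a small helper that runs after extraction. This is a sketch under that assumption; strip_plus_tag is a hypothetical name, and not every mail provider treats the tag as an alias, so apply it only where that holds for your data.

```python
def strip_plus_tag(address):
    """Lowercase an address and drop any +tag from the local part."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address.lower()  # no @ present, nothing to split
    base = local.split("+", 1)[0]
    return f"{base}@{domain}".lower()

addresses = [
    "Jane+newsletter@example.com",
    "jane@example.com",
    "j.a.n.e@example.com",
]
# dict.fromkeys keeps the first occurrence of each normalized address
deduped = list(dict.fromkeys(strip_plus_tag(a) for a in addresses))
print(deduped)  # ['jane@example.com', 'j.a.n.e@example.com']
```

Dots are left alone here: whether j.a.n.e and jane are the same mailbox is provider-specific.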

Internationalized addresses (IDN)

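The pattern is ASCII-only. Internationalized email (RFC 6531) allows Unicode in both the local part and the domain. Domains that appear in their punycode form (xn--...) already fit the domain character class, although a punycoded TLD contains digits and hyphens and so is not matched correctly by the letters-only TLD part. To match addresses written in Unicode, use a Unicode-aware character class instead. Such addresses are still rare in most data sources you'll encounter.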
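
One possible Unicode-aware variant in Python is sketched below. It assumes that "letter" should mean any Unicode letter, and it is slightly looser than the ASCII pattern because \w also admits underscores in the domain. UNICODE_EMAIL and the sample address are illustrative only. The last lines show converting a Unicode domain to and from its punycode form with Python's built-in idna codec.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware. [^\W\d_] means
# "word character that is not a digit or underscore", i.e. a letter.
UNICODE_EMAIL = re.compile(r"[\w.%+-]+@[\w.-]+\.[^\W\d_]{2,}")

text = "Reach j\u00f6rg@m\u00fcller.de or jane@example.com"
matches = UNICODE_EMAIL.findall(text)
print(matches)

# Round-trip a Unicode domain through its ASCII (punycode) form.
ascii_domain = "m\u00fcller.de".encode("idna").decode("ascii")
round_trip = ascii_domain.encode("ascii").decode("idna")
print(ascii_domain, round_trip)
```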

Quoted local parts

RFC 5322 technically allows quoted local parts like "jane doe"@example.com. These are rare in the wild — almost no real address uses them. If you need to handle them, use a dedicated email parser rather than expanding the regex.

Multiple addresses separated by commas or semicolons

The regex doesn't care about separators — it just finds matches. Given jane@a.com, sales@b.com; alice@c.com, the global match returns all three. No need to split first.
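
A quick check of that claim in Python, using the same string (EMAIL_RE and recipients are illustrative names):

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

recipients = "jane@a.com, sales@b.com; alice@c.com"
found = EMAIL_RE.findall(recipients)
print(found)  # ['jane@a.com', 'sales@b.com', 'alice@c.com']
```

Commas, semicolons, and spaces are outside both character classes, so each one ends the current match on its own.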

Performance Notes

For most extraction work, this pattern runs in microseconds per kilobyte of input. Three things to watch on very large inputs (gigabytes):

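  • Avoid backtracking-heavy patterns. The pattern above has no nested quantifiers, so it cannot blow up exponentially, but a backtracking engine such as Python's re can still do quadratic work on a very long run of local-part characters with no @ sign (a long hex or alphanumeric token, for example). A naive variation like .+@.+ does the same quadratic rescanning on any long line without an @ sign, and over-matches badly when there is one.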
  • Stream rather than slurp. Don't load a 10GB log into memory. Iterate line-by-line in Python (for line in f:) or pipe through grep. The regex engine works on small chunks just as well.
  • Compile patterns you reuse. Python's re.compile shaves a small constant per call. JavaScript's new RegExp is similar but RegExp literals are already cached by the engine.
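
The streaming and compile-once advice can be combined in a few lines. The following is a minimal sketch, not a tuned implementation; iter_emails and unique_emails are hypothetical helper names, and the file is assumed to be text that decodes as UTF-8 (bad bytes are replaced rather than raising).

```python
import re

# Compiled once at import time and reused for every line.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def iter_emails(path, encoding="utf-8"):
    """Yield lowercased addresses from a file, reading one line at a time."""
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            for match in EMAIL_RE.finditer(line):
                yield match.group(0).lower()

def unique_emails(path):
    """Collect distinct addresses in first-seen order."""
    return list(dict.fromkeys(iter_emails(path)))
```

Only the current line and the set of distinct addresses are held in memory, so the input file can be far larger than RAM as long as the number of unique addresses stays manageable.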

Try It Live

Paste any of these patterns into the regex tester with a sample of your real data. The tester runs entirely in your browser, so log files containing internal email addresses stay on your machine. For an explanation of any pattern token-by-token, the regex explainer breaks it down.

Frequently Asked Questions

What regex pattern extracts email addresses?

The practical pattern used in most extraction code is [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. It matches the local part (letters, digits, dot, underscore, percent, plus, hyphen), an @, the domain, and a TLD of at least two letters. The full RFC 5322 grammar accepts constructs, such as comments inside addresses, that you almost never want to match in real text, so the pragmatic pattern is shorter and closer to what you actually find in inboxes, log files, and web pages.

How do I extract every email address from a string?

In JavaScript, str.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g) returns an array of every match, or null if there are none. The g flag is essential: without it, only the first email is returned. In Python, re.findall(pattern, text) returns a list of every match (empty if there are none). In Bash, grep -oE produces one match per line. To deduplicate, wrap the result in new Set(...) in JavaScript, or use dict.fromkeys(...) in Python to keep the original order.

How do I handle false positives like 'foo@bar' (no TLD)?

The pattern above already requires a dot followed by a TLD of at least two letters, so foo@bar will not match. To be even stricter, use word boundaries: \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b. The trailing boundary rejects an address that runs straight into more word characters, such as jane@example.com123, instead of returning the truncated jane@example.com. For full validation, send a verification email; regex cannot prove deliverability.

Can the regex match emails wrapped in angle brackets?

The basic pattern matches the address portion regardless of surrounding characters. Given 'Jane Doe <jane@example.com>', the regex extracts 'jane@example.com' and leaves the brackets behind. If you want to capture the full RFC 5322 mailbox (display name plus angle-bracketed address), the regex grows considerably — easier to use a real email parser like Python's email.utils.parseaddr or a Node.js email-addresses library.

Should I use this regex to validate email input on a form?

For form validation, use type="email" on the input and let the browser's built-in validation handle it. The pattern behind it is defined in the HTML Standard; it is similar in spirit to the one above, though it does not require a dot in the domain. If you need to validate server-side, the same pragmatic pattern is fine as long as you anchor it to the whole input (^...$, or fullmatch in Python). Do not write a stricter regex hoping to reject every invalid address: strict regexes block valid users with unusual but legitimate addresses (apostrophes, plus tags, very long TLDs). Note that the pattern above itself rejects apostrophes in the local part. The only definitive validation is sending a verification email.
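
For the server-side case, the difference from extraction is that the pattern has to cover the whole input rather than a substring. A minimal Python sketch, with looks_like_email as a hypothetical helper name:

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def looks_like_email(value):
    """Loose sanity check on form input; not proof that the mailbox exists."""
    # fullmatch requires the pattern to cover the entire string,
    # unlike search or findall, which accept a match anywhere inside it.
    return EMAIL_RE.fullmatch(value.strip()) is not None

checks = {
    "jane@example.com": looks_like_email("jane@example.com"),          # True
    "see jane@example.com": looks_like_email("see jane@example.com"),  # False
    "foo@bar": looks_like_email("foo@bar"),                            # False
    # False: the apostrophe is legal in real addresses but not in this class
    "o'brien@example.com": looks_like_email("o'brien@example.com"),
}
print(checks)
```

The last case is a reminder that even this loose pattern turns away some legitimate addresses, which is why a verification email remains the only definitive check.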