How to Extract Phone Numbers From Text Using Regex
Phone numbers are some of the messiest data you will ever extract. Real text contains them with parentheses, dashes, dots, spaces, country codes, extensions, and every combination in between. This guide covers the pragmatic patterns that catch the formats you actually find in customer messages, signature blocks, and scraped pages, working code in JavaScript, Python, and Bash, plus a clear note on when to stop fighting regex and reach for libphonenumber. Test any pattern in the regex tester with your own data first.
The Pattern
This is the pragmatic pattern for US phone numbers in free-form text:
(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
It breaks down into four parts:
- Optional country code: (?:\+?1[-.\s]?)? matches an optional +, then 1, then an optional separator. The outer (?:...)? makes the whole country-code chunk optional.
- Area code: \(?\d{3}\)? matches three digits, optionally wrapped in parentheses.
- Separator + exchange: [-.\s]?\d{3} matches an optional dash, dot, or whitespace, then three digits.
- Separator + line number: [-.\s]?\d{4} matches the same optional separator, then four digits.
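The four parts can also be written out with inline comments. The following is a minimal sketch using Python's re.VERBOSE flag; the name US_PHONE and the sample strings are illustrative choices, not part of the original pattern.

```python
import re

# Same pattern as above, split over several lines with re.VERBOSE so that
# each part carries its own comment. US_PHONE is a hypothetical name.
US_PHONE = re.compile(
    r"""
    (?:\+?1[-.\s]?)?   # optional country code: +1 or 1, optional separator
    \(?\d{3}\)?        # area code, optionally wrapped in parentheses
    [-.\s]?\d{3}       # optional separator, then the three-digit exchange
    [-.\s]?\d{4}       # optional separator, then the four-digit line number
    """,
    re.VERBOSE,
)

samples = ["555-123-4567", "(555) 123-4567", "+1 555.123.4567", "5551234567"]
results = [bool(US_PHONE.fullmatch(s)) for s in samples]
print(results)
```

fullmatch is used here only to check each sample in isolation; for scanning free text you would use findall or finditer instead.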
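Add \b word boundaries on either side if you're scanning text mixed with code or part numbers and want to avoid mid-string matches. One caveat: \b needs a word character on one side, so it cannot match directly before ( or +. A match on (555) 123-4567 therefore starts at the first 5 rather than at the parenthesis, which is harmless as long as you strip non-digits afterwards: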
\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b
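To see how the boundary version behaves on real input, here is a small sketch. The names CORE, WITH_B, and WITH_LOOKAROUND are hypothetical, and the lookaround variant is an assumed alternative rather than something this guide prescribes.

```python
import re

CORE = r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

# The word-boundary version from the text above.
WITH_B = re.compile(r"\b" + CORE + r"\b")

# Assumed alternative: reject matches glued to a word character instead of
# requiring a boundary, so a leading "(" or "+" can still start the match.
WITH_LOOKAROUND = re.compile(r"(?<!\w)" + CORE + r"(?!\w)")

paren = "Call (555) 123-4567 today"
plus = "Dial +1 555-123-4567"
part = "SKU5551234567"

print(WITH_B.search(paren).group())           # 555) 123-4567
print(WITH_LOOKAROUND.search(paren).group())  # (555) 123-4567
print(WITH_B.search(plus).group())            # 1 555-123-4567
print(WITH_LOOKAROUND.search(plus).group())   # +1 555-123-4567
print(WITH_B.search(part), WITH_LOOKAROUND.search(part))  # None None
```

Both versions reject the number embedded in the SKU. They differ only in whether the leading ( or + is kept in the matched text, which matters if you display matches verbatim and does not matter if you normalize to digits.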
For international numbers written in E.164 style (a leading + and country code, with or without separators):
\+\d{1,3}[\d\s\-\.]{7,14}
This matches a leading +, a 1- to 3-digit country code, and 7 to 14 more digits or separators. Always normalize matches by stripping non-digit characters before storing or comparing.
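The normalization step matters more than it may seem, because the separator class also accepts trailing punctuation. A minimal sketch, with INTL and normalize as hypothetical names:

```python
import re

INTL = re.compile(r"\+\d{1,3}[\d\s\-\.]{7,14}")


def normalize(match_text: str) -> str:
    """Reduce a raw match to a leading + followed by digits only."""
    return "+" + re.sub(r"\D", "", match_text)


text = "International: +44 20 7946 0958."
raw = INTL.findall(text)
clean = [normalize(m) for m in raw]

print(raw)    # ['+44 20 7946 0958.']  (the sentence-ending dot is captured)
print(clean)  # ['+442079460958']
```

The raw match swallows the final period because . is in the character class, so comparing raw matches would treat the same number as different strings depending on where it sits in a sentence.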
JavaScript
const text = "Call (555) 123-4567 or 555.987.6543. International: +44 20 7946 0958.";
const usPattern = /(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g;
const matches = text.match(usPattern);
// ['(555) 123-4567', '555.987.6543']
// Normalize: strip everything except digits
const normalized = matches.map(m => m.replace(/\D/g, ''));
// ['5551234567', '5559876543']
// Add a leading 1 for E.164-ish storage
const e164 = normalized.map(d => d.length === 10 ? '+1' + d : '+' + d);
// ['+15551234567', '+15559876543']
// Deduplicate
const unique = [...new Set(normalized)];
// matchAll for positions and capture groups
for (const m of text.matchAll(usPattern)) {
  console.log(m[0], 'at index', m.index);
}
The g flag is essential: without it, match returns only the first phone number, and matchAll throws a TypeError. See the JavaScript regex guide for the difference between match, matchAll, and replace.
Python
import re
text = "Call (555) 123-4567 or 555.987.6543. International: +44 20 7946 0958."
us_pattern = r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
matches = re.findall(us_pattern, text)
# ['(555) 123-4567', '555.987.6543']
# Normalize: strip everything except digits
normalized = [re.sub(r"\D", "", m) for m in matches]
# ['5551234567', '5559876543']
# Deduplicate while preserving order
unique = list(dict.fromkeys(normalized))
# Use finditer for positions
for m in re.finditer(us_pattern, text):
    print(m.group(0), "at index", m.start())
# For production-grade parsing, use phonenumbers
# pip install phonenumbers
import phonenumbers
for match in phonenumbers.PhoneNumberMatcher(text, "US"):
    print(phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164))
The phonenumbers package is the Python port of Google's libphonenumber and handles every country's quirks. Use regex to extract candidates fast, then validate the ones that matter with the parser. See the Python regex guide for more on findall versus finditer.
Bash / grep
# Extract US phone numbers from a file, one per line
grep -oE '(\+?1[-.[:space:]]?)?\(?[0-9]{3}\)?[-.[:space:]]?[0-9]{3}[-.[:space:]]?[0-9]{4}' input.txt
# Deduplicate
grep -oE '(\+?1[-.[:space:]]?)?\(?[0-9]{3}\)?[-.[:space:]]?[0-9]{3}[-.[:space:]]?[0-9]{4}' input.txt | sort -u
# Strip non-digit characters from each match (normalize)
grep -oE '(\+?1[-.[:space:]]?)?\(?[0-9]{3}\)?[-.[:space:]]?[0-9]{3}[-.[:space:]]?[0-9]{4}' input.txt \
| tr -cd '0-9\n' | sort -u
# Find files containing phone numbers
grep -lrE '\(?[0-9]{3}\)?[-.[:space:]]?[0-9]{3}[-.[:space:]]?[0-9]{4}' .
The grep variant uses POSIX character classes ([:space:]) and [0-9] instead of \d, which most BSD/POSIX greps don't support. The -o flag prints only the matches; -E enables extended regex. For very large files, ripgrep (rg -oNI) supports PCRE-style \d and runs much faster.
Common Edge Cases
Extensions
Phone numbers in business signatures often include extensions like ext. 99, x99, or extension 99. Append an optional extension group:
(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:\s*(?:ext\.?|x|extension)\s*\d+)?
The non-capturing group keeps your other capture groups (if any) at the same positions. The pattern matches 555-123-4567 ext 99, 555.123.4567 x99, and (555) 123-4567 extension 99.
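If you want the extension as a separate value rather than part of one long match, named groups are one way to do it. This is a sketch under assumptions: the group names number and ext, the helper split_number, and the re.IGNORECASE flag (to accept Ext. or EXT) are additions for illustration.

```python
import re

# Base pattern plus the optional extension group, with named groups added
# (an assumption for this sketch) so both pieces can be read back.
PHONE_EXT = re.compile(
    r"(?P<number>(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"
    r"(?:\s*(?:ext\.?|x|extension)\s*(?P<ext>\d+))?",
    re.IGNORECASE,
)


def split_number(text):
    """Return (digits, extension or None) for the first match, else None."""
    m = PHONE_EXT.search(text)
    if m is None:
        return None
    return re.sub(r"\D", "", m.group("number")), m.group("ext")


print(split_number("555-123-4567 ext 99"))          # ('5551234567', '99')
print(split_number("(555) 123-4567 Extension 12"))  # ('5551234567', '12')
print(split_number("555.123.4567"))                 # ('5551234567', None)
```

Keeping the extension out of the normalized digits avoids storing 555123456799 as if it were a twelve-digit number.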
International formats
Real international numbers come in dozens of formats: +44 20 7946 0958 (UK), +33 1 42 86 82 00 (France), +81 3-3501-1234 (Japan). The pragmatic E.164 pattern catches them, but country-specific national formats (020 7946 0958 without the country code) need country-aware logic. Either require E.164 input or hand candidates to libphonenumber with a region hint.
False positives: dates, part numbers, IDs
The loose pattern matches anything that looks like 10 digits with separators. 12-25-2024 won't match (its first two groups have only two digits each, and the pattern needs three), but 123-4567890 (a part number) will. Three layers of defense:
- Word boundaries (\b) prevent matches inside larger tokens.
- Post-filter on context: check the surrounding words for "call", "phone", "tel", "mobile" if your input has them.
- Validate with libphonenumber: it knows that area code 123 doesn't exist in the North American Numbering Plan and rejects the candidate.
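The first two layers can be combined in a few lines. The sketch below is one possible shape, not a tuned filter: the function name likely_phones, the 30-character look-back window, and the exact keyword list are assumptions you would adjust to your data.

```python
import re

US_PHONE = re.compile(
    r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"
)
# Context words taken from the list above; matched case-insensitively.
CONTEXT = re.compile(r"\b(?:call|phone|tel|mobile)\b", re.IGNORECASE)


def likely_phones(text, window=30):
    """Keep matches that have a context word in the preceding `window` chars."""
    hits = []
    for m in US_PHONE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        if CONTEXT.search(before):
            hits.append(m.group(0))
    return hits


text = "Part 123-4567890 is in stock. Call 555-123-4567 to order."
print(likely_phones(text))  # ['555-123-4567']
```

A context filter trades recall for precision: numbers in signature blocks often have no keyword nearby, so this suits inputs where false positives are the bigger problem.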
Numbers split across lines
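Numbers in HTML or formatted documents sometimes break across lines: (555)\n123-4567. The separator class [-.\s] already covers a single line break, because \s matches newlines in JavaScript and Python regardless of flags (the s flag only changes what . matches). What it misses is a run of whitespace, such as \r\n or a newline followed by indentation, since each separator is limited to one character. Either loosen the separators (for example [-.\s]{0,3}) or pre-process the text by collapsing whitespace before matching. grep matches line by line by default, so the Bash variant needs the lines joined first.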
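A minimal sketch of the pre-processing option, assuming it is acceptable to lose the original character offsets; collapse_whitespace is a hypothetical helper name.

```python
import re

US_PHONE = re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace, including line breaks, with one space."""
    return re.sub(r"\s+", " ", text)


# A Windows line break followed by indentation: more than one whitespace
# character sits between the area code and the exchange.
broken = "Support: (555)\r\n    123-4567"

print(US_PHONE.findall(broken))                       # []
print(US_PHONE.findall(collapse_whitespace(broken)))  # ['(555) 123-4567']
```

Because collapsing changes string length, any index reported by finditer refers to the collapsed text, not the original document.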
Numbers with country code but no plus
Strings like 1-555-123-4567 match the US pattern. Strings like 44 20 7946 0958 (UK without leading +) won't match the E.164 pattern because there's no plus sign. If your data source omits the plus, the only reliable approach is to know which country the data comes from and apply the matching national format pattern.
When to Stop Using Regex
Regex is excellent for finding candidates in messy text. It is bad at validating phone numbers because the rules vary per country, per area code, and per number type (mobile vs landline vs toll-free). If any of these apply, use libphonenumber instead:
- You need to confirm a number is dialable, not just well-formed.
- You need to format extracted numbers consistently (E.164, international, national).
- You need to identify the country, region, or carrier of a number.
- You need to distinguish mobile from landline.
- Your input includes numbers from many countries.
libphonenumber is available as libphonenumber-js (JavaScript), phonenumbers (Python), libphonenumber (Java), libphonenumber.dart (Dart), and ports for most other languages. The PhoneNumberMatcher class extracts and validates in one pass — often the right tool when you'd otherwise be combining a regex extract with a validation step.
Try It Live
Paste any of these patterns into the regex tester with a sample of your real data — the tester runs entirely in your browser, so customer phone numbers stay on your machine. For an explanation of any pattern token-by-token, the regex explainer breaks it down. Common patterns including this one are also in the regex pattern library for one-click copy.
Frequently Asked Questions
What regex pattern extracts US phone numbers?
The practical pattern for US phone numbers in free-form text is (?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}. It matches an optional +1 country code, a three-digit area code with or without parentheses, and the remaining seven digits, with an optional dash, dot, or space between groups. This catches the formats you actually find in real data: 555-123-4567, (555) 123-4567, 555.123.4567, +1 555 123 4567, and 1-555-123-4567. For numbers with extensions or international formats, see the questions below.
How do I extract international phone numbers in E.164 format?
Strict E.164 is a + followed by at most 15 digits with no separators, but numbers in real text are usually written with spaces, dashes, or dots. The pattern \+\d{1,3}[\d\s\-\.]{7,14} matches the country code (1 to 3 digits) followed by 7 to 14 more characters that are digits, spaces, dashes, or dots. Strip non-digit characters from each match before storing or comparing: store the canonical form (+15551234567) and reformat for display. For full international validation across every country code and number length, use libphonenumber rather than a longer regex.
How do I match phone number extensions like x123 or ext. 4567?
Append an optional extension group to your base pattern: (?:\s*(?:ext\.?|x|extension)\s*\d+)?. The full US pattern with extension becomes (?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:\s*(?:ext\.?|x|extension)\s*\d+)?. This matches 555-123-4567 ext 99, 555.123.4567 x99, and (555) 123-4567 extension 99. Use a non-capturing group (?:...) so the extension does not affect your numbered capture groups elsewhere.
Why do I get false positives like dates or part numbers?
Loose phone patterns happily match anything that looks like a 10-digit string with separators. Part numbers (123-4567890), order IDs, and 10-digit Unix timestamps (1735084800) can all match. A date like 12-25-2024 does not, because its first two groups are too short for the pattern. Two practical fixes: add word boundaries (\b at each end) to avoid mid-string matches, and post-filter each match on its surrounding context, keeping those that appear near words like "call", "phone", "tel", or "mobile". For production validation, parse each candidate with libphonenumber, which knows valid area codes and exchange prefixes per country.
Should I use regex to validate user phone input on a form?
For user input validation, use type="tel" on the input and validate server-side with libphonenumber rather than regex. Phone number rules are not just digits and separators — area codes have specific valid ranges, country codes have specific lengths, and mobile prefixes vary by carrier. A regex that accepts every valid number across every country would be enormous and still wrong. Use regex for extracting candidates from messy text, then validate each candidate with a real parser.