How to Extract URLs From Text Using Regex

The Pattern

This is a pragmatic regex for finding HTTP/HTTPS links in text:

https?:\/\/[a-zA-Z0-9.-]+(?::\d+)?(?:\/[^\s]*)?

It breaks down into four parts:

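Protocol: https?:\/\/ matches http or https, then a literal ://. The backslashes before the slashes are only needed inside a /.../ regex literal, as in JavaScript; the Python and grep versions omit them.
Host: [a-zA-Z0-9.-]+ matches letters, digits, dots, hyphens. Loose enough to match most real hostnames, including punycoded IDNs. It does not validate label structure, so a host with a trailing dot or consecutive dots also matches.
Optional port: (?::\d+)? matches a literal colon plus digits.
Optional path / query / fragment: (?:\/[^\s]*)? matches a slash followed by anything that isn't whitespace. Whitespace ends the URL.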

This is the same pattern in the common regex patterns library — copy directly from there if you want.
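
To make the four parts easier to see, here is a minimal sketch of the same pattern written with re.VERBOSE and named groups. The group names (scheme, host, port, rest) and the sample sentence are illustrative assumptions, not part of the pattern itself.

```python
import re

# Same pattern as above, spelled out part by part.
# Group names are illustrative; the matching behavior is unchanged.
URL_RE = re.compile(
    r"""
    (?P<scheme>https?)://          # protocol: http or https, then ://
    (?P<host>[a-zA-Z0-9.-]+)       # host: letters, digits, dots, hyphens
    (?::(?P<port>\d+))?            # optional port: colon plus digits
    (?P<rest>/[^\s]*)?             # optional path / query / fragment
    """,
    re.VERBOSE,
)

m = URL_RE.search("API lives at http://example.com:8080/api?id=42 for now")
parts = m.groupdict()
print(parts)
```

Named groups are handy when you need the host or port separately; for plain extraction the compact one-line form is enough.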

JavaScript

const text = "Visit https://janeer.com or http://example.com:8080/api?id=42. Old: ftp://files.example.com.";
const urls = text.match(/https?:\/\/[a-zA-Z0-9.-]+(?::\d+)?(?:\/[^\s]*)?/g) ?? []; // match() returns null when nothing matches
// ['https://janeer.com', 'http://example.com:8080/api?id=42.']
// Note: ftp is excluded because the pattern requires http(s)
//       The trailing dot got included on the second URL — clean below

// Strip trailing punctuation
const cleaned = urls.map(u => u.replace(/[.,;!?)\]]+$/, ''));
// ['https://janeer.com', 'http://example.com:8080/api?id=42']

// Validate each match with the real URL parser
const valid = cleaned.filter(u => {
  try { new URL(u); return true; } catch { return false; }
});

// Deduplicate
const unique = [...new Set(valid)];

// Iterate with positions via matchAll
for (const m of text.matchAll(/https?:\/\/[a-zA-Z0-9.-]+(?::\d+)?(?:\/[^\s]*)?/g)) {
  console.log(m[0], 'at index', m.index);
}

Always validate matches with new URL() when correctness matters — the constructor throws on malformed URLs. See the JavaScript regex guide for more on matchAll and the g-flag lastIndex trap.

Python

import re
from urllib.parse import urlparse

text = "Visit https://janeer.com or http://example.com:8080/api?id=42. Old: ftp://files.example.com."
urls = re.findall(r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?", text)
# ['https://janeer.com', 'http://example.com:8080/api?id=42.']

# Strip trailing punctuation
cleaned = [re.sub(r"[.,;!?)\]]+$", "", u) for u in urls]

# Sanity-check with urlparse (lenient: this only confirms a scheme and a host are present)
valid = [u for u in cleaned if urlparse(u).scheme and urlparse(u).netloc]

# Deduplicate while preserving order
unique = list(dict.fromkeys(valid))

# Iterate with positions
for m in re.finditer(r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?", text):
    print(m.group(0), "at index", m.start())

Always use raw strings (r"..."). See the Python regex guide for the difference between findall, finditer, and match.
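
One difference worth seeing in practice: findall returns whole matches only when the pattern has no capturing groups, which is why the pattern uses (?:...) throughout. The sketch below, with an assumed sample string, compares the two forms and shows that finditer sidesteps the issue.

```python
import re

text = "Try http://example.com:8080/api and https://example.org"

# Non-capturing groups (?:...): findall returns the full match strings.
full = re.findall(r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?", text)

# Capturing groups (...): findall returns one tuple of group contents per match,
# with empty strings for groups that did not participate.
groups = re.findall(r"https?://[a-zA-Z0-9.-]+(:\d+)?(/[^\s]*)?", text)

# finditer yields match objects, so group(0) is the full match either way.
via_iter = [
    m.group(0)
    for m in re.finditer(r"https?://[a-zA-Z0-9.-]+(:\d+)?(/[^\s]*)?", text)
]

print(full)
print(groups)
print(via_iter)
```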

Bash / grep

# Extract URLs from a file, one per line
grep -oE 'https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/[^[:space:]]*)?' input.txt

# Strip trailing punctuation with sed
# (inside a POSIX bracket expression a literal ] must come first; \] does not escape it)
grep -oE 'https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/[^[:space:]]*)?' input.txt \
  | sed -E 's/[].,;!?)]+$//'

# Deduplicate after cleaning
grep -oE 'https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/[^[:space:]]*)?' input.txt \
  | sed -E 's/[].,;!?)]+$//' \
  | sort -u

# Count lines that contain a URL, per file (-c counts matching lines, not individual matches)
grep -rcE 'https?://[a-zA-Z0-9.-]+' .

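\d is not part of POSIX extended regex syntax and \s is an extension, so neither is portable under grep -E (BSD grep on macOS included). The POSIX character classes [0-9] and [[:space:]] work everywhere. ripgrep (rg -oNI) is typically much faster on large files; its default engine understands \d and \s, and PCRE2 syntax is available with -P when ripgrep is built with PCRE2 support.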

Common Edge Cases

Trailing punctuation

The biggest source of dirty matches. Punctuation that directly follows a URL is captured whenever it fits the host or path character class:

"Check https://example.com." → matches "https://example.com."
"Use https://api.com/v1, then..." → matches "https://api.com/v1,"
"See [https://example.com/docs] for details" → matches "https://example.com/docs]"

The .replace(/[.,;!?)\]]+$/, '') post-processing step above handles all of these. Excluding the punctuation inside the regex instead (e.g. [^\s.,;!?)\]]* at the end) gets ugly fast and breaks legitimate URLs that contain those characters in query strings.
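
A small sketch of why post-processing is the safer choice. TIGHT is a hypothetical variant that excludes punctuation from the path, and the sample sentence is an assumption chosen to have punctuation inside the query string.

```python
import re

BASE = r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?"
# Hypothetical "tightened" variant: punctuation is excluded from the path entirely.
TIGHT = r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s.,;!?)\]]*)?"

text = "Docs: https://example.com/search?q=a,b;c=1. Done."

# The tightened pattern stops at the first "?", losing the whole query string.
tight_match = re.search(TIGHT, text).group(0)

# The base pattern over-matches by one dot, which the strip step removes.
post_processed = re.sub(r"[.,;!?)\]]+$", "", re.search(BASE, text).group(0))

print(tight_match)
print(post_processed)
```

The strip step only touches the end of the match, so commas, semicolons, and question marks inside the URL survive.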

URLs in markdown

For Markdown text like [link text](https://example.com/page), the closing paren causes problems: the path matcher consumes it, because ) is not whitespace. (A URL with no path, such as https://example.com, stops before the paren, since ) is not a host character.) Fix by stripping the trailing close paren when no open paren appears in the URL, or extract Markdown links separately with the dedicated pattern from the patterns library.

URLs with parentheses in the path

Wikipedia URLs like https://en.wikipedia.org/wiki/Regex_(programming) contain legitimate parens. The basic pattern handles these correctly because the path matcher is greedy and the parens are not whitespace. Stripping trailing close-parens unconditionally (the simple post-process) would corrupt these — only strip if there's no matching open paren in the URL.
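
The "only strip an unmatched close paren" rule can be sketched as a small cleanup function. clean_url is a hypothetical helper name, and the sample text is an assumption covering a Markdown link, a parenthesized Wikipedia URL, and a sentence-final one.

```python
import re

URL_RE = re.compile(r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?")

def clean_url(url: str) -> str:
    """Strip trailing punctuation, keeping a ')' that closes a '(' inside the URL."""
    while url and url[-1] in ".,;!?)]":
        if url[-1] == ")" and url.count("(") >= url.count(")"):
            break  # parens are balanced, so this ')' belongs to the URL
        url = url[:-1]
    return url

text = (
    "See [the docs](https://example.com/page). "
    "Also (https://en.wikipedia.org/wiki/Regex_(programming)), "
    "and https://en.wikipedia.org/wiki/Regex_(programming)."
)
cleaned = [clean_url(u) for u in URL_RE.findall(text)]
print(cleaned)
```

Counting parens is a heuristic, not a parser: it assumes an unmatched trailing ) came from the surrounding text rather than from the URL.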

Bare domains and protocol-less URLs

The pattern requires http:// or https://. If your text contains bare domains like example.com without a protocol, you'll miss them. There's no clean regex for this — bare domains false-positive on filenames (script.js), version numbers (1.0.0), and prose (etc.com). For protocol-less URL detection, use a dedicated library like Linkify (JavaScript) or urlextract (Python) that ships with TLD lists and contextual heuristics.
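
To see the false-positive problem concretely, here is a deliberately naive bare-domain pattern run over an assumed sample sentence. NAIVE_DOMAIN is an illustration of what goes wrong, not a recommended pattern.

```python
import re

# Deliberately naive: "labels separated by dots". Shown only to illustrate false positives.
NAIVE_DOMAIN = re.compile(r"\b[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+\b")

text = "Open script.js, check version 1.0.0, then visit example.com"
hits = NAIVE_DOMAIN.findall(text)
print(hits)
```

Only the last hit is a domain; telling it apart from the other two takes a TLD list or context, which is what the dedicated libraries provide.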

Internationalized domain names (IDN)

The pattern is ASCII-only. Punycoded IDNs like https://xn--exmple-cua.com match because they're ASCII. Native Unicode IDNs like https://例え.テスト don't match. If you need them, extend the host character class to include Unicode letters (for example \p{L} with the u flag in JavaScript), or use a Unicode-aware library.
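
In Python, one way to sketch a Unicode-aware host is to swap the ASCII class for \w, which is Unicode-aware on str patterns. This is an assumption-laden shortcut: \w also admits the underscore, which is not valid in hostnames. The last line uses the standard library's idna codec (IDNA 2003) to show a Unicode host converted to its ASCII form.

```python
import re

ASCII_URL = re.compile(r"https?://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?")
# Assumption: \w stands in for "any letter or digit". It also matches "_".
UNICODE_URL = re.compile(r"https?://[\w.-]+(?::\d+)?(?:/[^\s]*)?")

text = "Punycode: https://xn--exmple-cua.com and native: https://例え.テスト/path"

ascii_hits = ASCII_URL.findall(text)      # misses the native Unicode host
unicode_hits = UNICODE_URL.findall(text)  # finds both

# Convert a Unicode hostname to its ASCII (punycode) form with the stdlib codec.
ascii_host = "bücher.example".encode("idna").decode("ascii")

print(ascii_hits)
print(unicode_hits)
print(ascii_host)
```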

Other protocols

To match ftp, ws, file, etc., either widen the protocol part:

(?:https?|ftp|ws|wss|file)://[a-zA-Z0-9.-]+(?::\d+)?(?:\/[^\s]*)?

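Or use a generic scheme matcher, [a-zA-Z][a-zA-Z0-9+.-]*://... per RFC 3986, though this is rarely worth it: most text contains only http(s) URLs. Note that file URLs usually have an empty host (file:///path), so the host part would have to be made optional to match them.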
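
If you do go generic, one possible approach is to capture the scheme and filter against an allowlist afterwards. In this sketch the ALLOWED set, the leading \b, and the sample text are all assumptions.

```python
import re

# Generic scheme followed by the same host / port / path parts as the basic pattern.
GENERIC_URL = re.compile(
    r"\b([a-zA-Z][a-zA-Z0-9+.-]*)://[a-zA-Z0-9.-]+(?::\d+)?(?:/[^\s]*)?"
)
ALLOWED = {"http", "https", "ftp", "ws", "wss"}  # hypothetical allowlist

text = (
    "Web https://example.com/a and files ftp://files.example.com/pub "
    "plus odd foo+bar://host/x"
)

all_urls = [m.group(0) for m in GENERIC_URL.finditer(text)]
allowed = [
    m.group(0)
    for m in GENERIC_URL.finditer(text)
    if m.group(1).lower() in ALLOWED  # schemes are case-insensitive
]

print(all_urls)
print(allowed)
```

Filtering after the match keeps the regex stable while the list of accepted schemes changes.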

HTML Is Different

If your input is HTML rather than plain text, do not use this regex. Use a real HTML parser:

// Browser JavaScript
const doc = new DOMParser().parseFromString(html, 'text/html');
const urls = [...doc.querySelectorAll('a[href]')].map(a => a.href);

// Node.js (with jsdom)
const { JSDOM } = require('jsdom');
const dom = new JSDOM(html);
const urls = [...dom.window.document.querySelectorAll('a[href]')].map(a => a.href);

# Python (with BeautifulSoup)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
urls = [a['href'] for a in soup.find_all('a', href=True)]

HTML can have URLs in href, src, action, data-*, inline styles, srcset, and many other attributes. A parser handles them all consistently; regex would need a different rule per attribute and would still get nested tags wrong.
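
As a rough illustration of covering several attributes with one rule, here is a sketch using Python's standard-library html.parser instead of BeautifulSoup. The URLCollector class, the URL_ATTRS set, and the sample markup are assumptions; srcset and inline styles would need extra handling.

```python
from html.parser import HTMLParser

# Assumption: these are the attributes of interest. Extend the set as needed.
URL_ATTRS = {"href", "src", "action"}

class URLCollector(HTMLParser):
    """Collects (tag, attribute, value) for URL-bearing attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.found.append((tag, name, value))

html = """
<a href="https://example.com/page">link</a>
<img src="https://example.com/logo.png" alt="logo">
<form action="/submit"><input name="q"></form>
"""

collector = URLCollector()
collector.feed(html)
collector.close()
print(collector.found)
```

Relative values such as /submit come back as written; resolve them against the page URL with urllib.parse.urljoin if you need absolute URLs.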

Try It Live

Paste the URL pattern and a sample of your real data into the regex tester to see matches highlighted in real time. The tester runs entirely in your browser. For an explanation of any token, the regex explainer breaks down the pattern piece by piece. For more pre-built patterns (email, phone, date, UUID, etc.), see the common patterns library.

Frequently Asked Questions

What regex pattern extracts URLs?

The pragmatic pattern for finding HTTP and HTTPS URLs is https?:\/\/[a-zA-Z0-9.-]+(?::\d+)?(?:\/[^\s]*)?. It matches http:// or https://, the host, an optional port, and an optional path. The path stops at the first whitespace, which is usually what you want, since a URL in running text is normally followed by a space or newline. Punctuation that directly follows the URL, such as a comma or period, is captured too and has to be stripped afterwards. For full URL validation, pass each match to a real URL parser like new URL() in JavaScript or urllib.parse in Python.

How do I extract every URL from a string?

In JavaScript, str.match(/https?:\/\/[a-zA-Z0-9.-]+(?::\d+)?(?:\/[^\s]*)?/g) returns an array of every match, or null when there are none. The g flag is essential. In Python, re.findall(pattern, text) returns a list of every match. In Bash, grep -oE produces one match per line. Deduplicate with new Set(...) in JavaScript or, to keep first-seen order, dict.fromkeys() in Python (a plain set() also works but loses order). Normalize scheme and host case before comparing: http://EXAMPLE.com and http://example.com refer to the same host, while paths remain case-sensitive.
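
A minimal sketch of the normalize-then-deduplicate step in Python. normalize is a hypothetical helper; it lowercases the whole netloc, which is fine for plain hosts but would also lowercase any user:password@ prefix, so treat that as an assumption about your input.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase the scheme and host; leave path, query, and fragment untouched."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )

urls = [
    "http://EXAMPLE.com/Path",
    "http://example.com/Path",
    "http://example.com/path",
]

# dict.fromkeys keeps the first occurrence of each key, preserving order.
unique = list(dict.fromkeys(normalize(u) for u in urls))
print(unique)
```

The two /Path entries collapse into one, while /path stays separate because paths are case-sensitive.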

How do I strip trailing punctuation from extracted URLs?

URLs followed by punctuation in prose — 'See https://example.com.' — get the trailing dot included by the basic pattern, which is usually wrong. Two fixes: tighten the pattern to exclude common trailing punctuation with [^\s.,;!?)\]] at the end of the path, or post-process each match with a strip step that removes a trailing dot, comma, semicolon, exclamation mark, question mark, closing paren, or closing bracket. Post-processing is simpler and more flexible.

Does the regex match URLs without http:// or https:// prefix?

No. The basic pattern requires http:// or https://. To also match bare domains like example.com or www.example.com that appear in text without a protocol, you need a much looser pattern, but those false-positive easily on words containing dots (file.txt, version 1.0.0). Better practice: require the protocol in your input text, or use a dedicated library like Linkify (the linkifyjs package, JavaScript) or urlextract (Python) that has heuristics for protocol-less URLs.

How do I extract URLs from HTML?

Don't use plain regex on HTML — it cannot reliably parse nested tags or handle quoted attributes correctly. For HTML, use a real parser: DOMParser in browser JavaScript, BeautifulSoup in Python, or jsdom in Node.js. Then iterate document.querySelectorAll('a[href]') (or the equivalent) and read the href attribute on each anchor. Regex is the right tool for plain text and log files; HTML is structured data and deserves a structured parser.