How to Parse Log Lines With Regex

Log files are mostly structured but rarely standard. Apache and Nginx have their own formats, syslog has two, and every application invents its own. Regex is the right tool for pulling fields out of one log format at a time: it is fast, needs no dependencies, and runs in any language. This guide covers patterns for the three text formats you actually see in production (the fourth, JSON, needs a parser rather than a pattern), working code in JavaScript, Python, and Bash, and the cases where regex stops being the right answer. Test any pattern in the regex tester against a sample line first.

The Four Formats You Actually See

Most log parsing in the wild involves one of these four formats:

  • Apache / Nginx access logs: the Combined Log Format. The same pattern works for both servers.
  • Syslog: traditional (RFC 3164) on most older Unix systems, modern (RFC 5424) from newer daemons such as rsyslog and syslog-ng when they are configured for it.
  • Application logs: your own framework's format, usually [timestamp] LEVEL message.
  • JSON logs: structured logs from modern apps. Don't use regex; use a JSON parser.

The patterns below cover the first three. Each uses named capture groups so the resulting code reads naturally.

Apache / Nginx Combined Log Format

A typical line:

192.168.1.1 - alice [10/Mar/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0 (Macintosh)"

The pattern, written as a single line:

^(?<ip>\S+) \S+ (?<user>\S+) \[(?<ts>[^\]]+)\] "(?<method>\S+) (?<path>\S+) (?<proto>\S+)" (?<status>\d{3}) (?<size>\d+|-) "(?<referer>[^"]*)" "(?<agent>[^"]*)"$

Captured fields, in order: client IP, remote user, timestamp, HTTP method, request path, protocol, status code, response size (or - for none), referer header, user-agent header.

JavaScript

const line = '192.168.1.1 - alice [10/Mar/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"';

const apachePattern = /^(?<ip>\S+) \S+ (?<user>\S+) \[(?<ts>[^\]]+)\] "(?<method>\S+) (?<path>\S+) (?<proto>\S+)" (?<status>\d{3}) (?<size>\d+|-) "(?<referer>[^"]*)" "(?<agent>[^"]*)"$/;

const m = line.match(apachePattern);
if (m) {
  const { ip, ts, method, path, status, size } = m.groups;
  console.log(`${method} ${path} → ${status} (${size}b) from ${ip} at ${ts}`);
}

// Stream a whole file with readline.
// Run this as an ES module (.mjs or "type": "module"): the loop uses top-level await.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

const rl = createInterface({
  input: createReadStream('access.log'),
  crlfDelay: Infinity,
});

let errors = 0;
for await (const line of rl) {
  const m = line.match(apachePattern);
  if (m && m.groups.status.startsWith('5')) errors++;
}
console.log(`5xx errors: ${errors}`);

Named groups land on match.groups as a plain object. Destructure to pull the fields you need. See the JavaScript regex guide for more on named groups and the matchAll iterator.

Python

import re
from collections import Counter

apache_pattern = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Top 5 paths returning 5xx
errors_by_path = Counter()
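# errors='replace' keeps stray non-UTF-8 bytes in a request from raising UnicodeDecodeError
with open('access.log', encoding='utf-8', errors='replace') as f: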
    for line in f:
        m = apache_pattern.match(line)
        if m and m.group('status').startswith('5'):
            errors_by_path[m.group('path')] += 1

for path, count in errors_by_path.most_common(5):
    print(f"{count:6d}  {path}")

# Convert each match to a dict for downstream processing
def parse(line):
    m = apache_pattern.match(line)
    return m.groupdict() if m else None

Compile the pattern once at module level — Python caches a small number of patterns, but explicit re.compile is clearer for anything reused. Use match.groupdict() to convert all named groups into a dict in one call. See the Python regex guide for the difference between match, search, and fullmatch.
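The dict from groupdict() holds only strings. As a minimal sketch of the next step, the hypothetical helper parse_typed below (not part of the original examples) converts the timestamp, status, and size into typed values. It assumes the default English month abbreviations, since strptime's %b is locale-dependent.

```python
import re
from datetime import datetime

APACHE_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_typed(line):
    """Return a dict with typed fields, or None if the line does not match."""
    m = APACHE_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Example timestamp: 10/Mar/2024:13:55:36 +0000 (timezone-aware result)
    rec['ts'] = datetime.strptime(rec['ts'], '%d/%b/%Y:%H:%M:%S %z')
    rec['status'] = int(rec['status'])
    # "-" means "no value" for both size and remote user
    rec['size'] = None if rec['size'] == '-' else int(rec['size'])
    rec['user'] = None if rec['user'] == '-' else rec['user']
    return rec

sample = ('192.168.1.1 - alice [10/Mar/2024:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"')
record = parse_typed(sample)
print(record['ts'].isoformat(), record['status'], record['size'])
```

Typed values make later steps such as sorting by time or summing bytes straightforward, and returning None for non-matching lines lets the caller count how many lines the pattern missed.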

Bash / grep / awk

# Count requests per status code
grep -oE '" [0-9]{3} ' access.log | tr -d '" ' | sort | uniq -c | sort -rn

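# Find all 5xx errors with their paths
# Splitting on '"' makes $2 the request line and $3 the " status size " pair.
# Anchoring with ^ stops a size such as 512 from being mistaken for a status.
awk -F'"' '$3 ~ /^ 5[0-9][0-9] / { split($2, req, " "); print req[2] }' access.log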

# Top 10 IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

# Filter by time window: 13:00 to 15:59 on 10 Mar 2024 (grep on the bracketed timestamp)
grep '\[10/Mar/2024:1[3-5]:' access.log | wc -l

# For complex parsing, prefer awk or jq (for JSON logs) over chained greps.
# The three-argument match() below is a GNU awk extension, so call gawk explicitly.
gawk '{
  match($0, /"([A-Z]+) ([^ ]+) ([^"]+)" ([0-9]+)/, m);
  print m[4], m[1], m[2]
}' access.log

For one-line summaries, grep + awk is fast and available on any Unix system, though the three-argument match() used above needs GNU awk (gawk). For anything more complex, write a 10-line Python or Node script; your future self will thank you. ripgrep with -N --no-filename is usually faster than grep on large log directories.

Syslog

Traditional syslog (RFC 3164) is what most older Unix systems still emit:

Mar 10 13:55:36 hostname program[123]: message text here

The pattern, with optional PID:

^(?<ts>\w{3}\s+\d+\s+\d+:\d+:\d+) (?<host>\S+) (?<program>\S+?)(?:\[(?<pid>\d+)\])?: (?<message>.*)$
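As a rough sketch, here is the same pattern in Python syntax, applied to one line with a PID and one without. The second sample line (host web01, program cron) is invented for illustration.

```python
import re

SYSLOG_3164 = re.compile(
    r'^(?P<ts>\w{3}\s+\d+\s+\d+:\d+:\d+) (?P<host>\S+) '
    r'(?P<program>\S+?)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$'
)

with_pid = SYSLOG_3164.match(
    'Mar 10 13:55:36 hostname program[123]: message text here'
)
# Single-digit days are padded with an extra space, which \s+ absorbs
no_pid = SYSLOG_3164.match('Mar  3 04:05:06 web01 cron: job started')

for m in (with_pid, no_pid):
    # pid is None when the [123] part is absent
    print(m.group('program'), m.group('pid'), m.group('message'))
```

RFC 3164 timestamps carry no year and no timezone, so anything that needs absolute times has to supply both from context, for example from the file's modification date.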

RFC 5424 (modern syslog, which daemons such as rsyslog and syslog-ng can emit) adds a priority field in angle brackets, a version number, and an ISO 8601 timestamp:

<13>1 2024-03-10T13:55:36.123Z hostname program 123 - - message text

The pattern:

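^<(?<pri>\d+)>\d+ (?<ts>\S+) (?<host>\S+) (?<program>\S+) (?<pid>\S+) (?<msgid>\S+) (?<sd>-|(?:\[[^\]]*\])+) (?<message>.*)$

Note the - placeholders for fields that aren't set (msgid, structured data). The pattern accepts a bare - for msgid through \S+ and for structured data through the first alternative. When structured data is present it is one or more [bracketed] elements that can contain spaces, so \S+ alone would cut it off at the first space. This simplified group does not handle an escaped \] inside a parameter value.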
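The priority number packs two values: facility is the priority divided by 8 and severity is the remainder. A minimal Python sketch follows, using a variant of the pattern whose structured-data group accepts either - or bracketed elements. The helper name parse_5424 and the second sample line are illustrative assumptions, and escaped \] inside structured data is not handled.

```python
import re

SYSLOG_5424 = re.compile(
    r'^<(?P<pri>\d+)>(?P<version>\d+) (?P<ts>\S+) (?P<host>\S+) '
    r'(?P<program>\S+) (?P<pid>\S+) (?P<msgid>\S+) '
    r'(?P<sd>-|(?:\[[^\]]*\])+) (?P<message>.*)$'
)

def parse_5424(line):
    """Return a dict of fields, or None if the line does not match."""
    m = SYSLOG_5424.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # PRI = facility * 8 + severity
    rec['facility'], rec['severity'] = divmod(int(rec['pri']), 8)
    # "-" is the nil value for header fields and structured data
    for key in ('host', 'program', 'pid', 'msgid', 'sd'):
        if rec[key] == '-':
            rec[key] = None
    return rec

plain = parse_5424(
    '<13>1 2024-03-10T13:55:36.123Z hostname program 123 - - message text'
)
with_sd = parse_5424(
    '<34>1 2024-03-10T13:55:36Z host app 99 ID47 [meta seq="1" user="bob"] hello'
)
print(plain['facility'], plain['severity'], plain['message'])
print(with_sd['sd'], with_sd['message'])
```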

Application Logs

Most app logs follow a [timestamp] LEVEL [context] message shape:

[2024-03-10 13:55:36] ERROR [worker-3] Database connection timeout after 30s

The pattern:

^\[(?<ts>[\d\- :]+)\] (?<level>\w+) \[(?<ctx>[^\]]+)\] (?<message>.*)$

If your logger omits the context bracket:

^\[(?<ts>[\d\- :]+)\] (?<level>\w+) (?<message>.*)$
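If the same file mixes both shapes, one option is to make the context bracket optional instead of keeping two patterns. The sketch below is an assumption about how you might combine them, not something your logger documents. Note the trade-off: a message that itself begins with a [bracketed] word is read as context.

```python
import re
from datetime import datetime

APP_LOG = re.compile(
    r'^\[(?P<ts>[\d\- :]+)\] (?P<level>\w+) '
    r'(?:\[(?P<ctx>[^\]]+)\] )?(?P<message>.*)$'
)

def parse_app_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = APP_LOG.match(line)
    if m is None:
        return None
    rec = m.groupdict()  # ctx is None when the bracket is absent
    rec['ts'] = datetime.strptime(rec['ts'], '%Y-%m-%d %H:%M:%S')
    return rec

with_ctx = parse_app_line(
    '[2024-03-10 13:55:36] ERROR [worker-3] Database connection timeout after 30s'
)
without_ctx = parse_app_line('[2024-03-10 13:55:36] ERROR Something failed')
print(with_ctx['ctx'], without_ctx['ctx'])
```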

For your own logs, prefer JSON output and skip regex entirely — JSON parsing is faster, less error-prone, and survives format changes.
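For comparison, reading JSON logs needs no pattern at all. This sketch assumes one JSON object per line; the field names ts, level, ctx, and msg are made up for the example and will differ in your logger.

```python
import json

def parse_json_lines(lines):
    """Yield one dict per valid JSON-object line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON: skip it (or count it, if you track bad lines)
        if isinstance(record, dict):
            yield record

sample = [
    '{"ts": "2024-03-10T13:55:36Z", "level": "ERROR", "ctx": "worker-3", '
    '"msg": "Database connection timeout after 30s"}',
    'not json at all',
    '{"ts": "2024-03-10T13:55:37Z", "level": "INFO", "msg": "Recovered"}',
]
errors = [r for r in parse_json_lines(sample) if r.get('level') == 'ERROR']
print(errors[0]['msg'])
```

Adding a field to the logger later does not break this reader, which is the "survives format changes" property a positional regex cannot offer.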

Common Edge Cases

Multi-line entries (stack traces)

A Python or Java stack trace spans many lines but is one logical log entry:

[2024-03-10 13:55:36] ERROR Something failed
Traceback (most recent call last):
  File "app.py", line 42, in handler
    raise ValueError("bad input")
ValueError: bad input

Two approaches:

  • Pre-process by joining continuation lines (lines that don't start with the entry-marker pattern, like [ or a date) into the parent entry. This is the approach behind the Logstash multiline codec, Fluent Bit's multiline parsers, and the multiline option on Vector's file source.
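  • Use a multi-line regex with the s (dotall) flag and anchor each entry by its leading timestamp pattern: \[\d{4}-\d{2}-\d{2} [^\]]+\] \w+ .+?(?=\n\[\d{4}-|\Z). The non-greedy .+? with a lookahead for the next entry marker captures everything up to the next entry. Including the year digits in the lookahead keeps a trace line that happens to start with [ from ending the entry early. \Z (end of input) is Python and PCRE syntax; in JavaScript, use $ without the m flag, and keep the ] escaped inside the character class.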
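The first approach can be sketched as a small generator. It assumes every entry starts with a [YYYY-MM-DD HH:MM:SS] marker as in the sample above, and the helper name join_entries is made up for this example.

```python
import re

# A line that starts a new logical entry
ENTRY_START = re.compile(r'^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ')

def join_entries(lines):
    """Yield one string per logical entry, continuation lines included."""
    current = []
    for raw in lines:
        line = raw.rstrip('\n')
        if ENTRY_START.match(line) and current:
            # A new entry begins: emit the one collected so far
            yield '\n'.join(current)
            current = []
        current.append(line)
    if current:
        yield '\n'.join(current)

sample = [
    '[2024-03-10 13:55:36] ERROR Something failed\n',
    'Traceback (most recent call last):\n',
    '  File "app.py", line 42, in handler\n',
    '    raise ValueError("bad input")\n',
    'ValueError: bad input\n',
    '[2024-03-10 13:55:37] INFO Recovered\n',
]
entries = list(join_entries(sample))
print(len(entries))
```

Because it is a generator over an iterable of lines, it can be fed an open file object directly and never holds more than one entry in memory. Lines that appear before the first marker are emitted as one orphan entry.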

Quoted fields with embedded quotes

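The Apache pattern uses "([^"]*)" for quoted fields, which fails on quotes-inside-quotes (rare, but possible in user-agent strings). Apache writes an embedded quote as \", so for strict parsing allow backslash escapes: "((?:[^"\\]|\\.)*)". Nginx's default escaping writes a quote as \x22 instead, so the simple version already handles it. For most log analysis the simple version is fine, because lines with embedded quotes are rare in typical traffic; count your non-matching lines if you need to be sure.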
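To see the difference on a single field, the sketch below compares a simple quoted-field pattern with one that accepts backslash escapes, then unescapes the captured value. The user-agent string is invented, and the unescape helper deliberately handles only \" and \\, leaving \xhh sequences untouched.

```python
import re

SIMPLE_FIELD = re.compile(r'"([^"]*)"')
# Either a character that is not a quote or backslash, or a backslash plus any character
STRICT_FIELD = re.compile(r'"((?:[^"\\]|\\.)*)"')

def unescape(value):
    """Turn \\" and \\\\ back into the characters they stand for."""
    return re.sub(r'\\(["\\])', r'\1', value)

raw = r'"Mozilla/5.0 (\"quoted\" build)"'

simple = SIMPLE_FIELD.fullmatch(raw)   # stops at the first inner quote, so no full match
strict = STRICT_FIELD.fullmatch(raw)
agent = unescape(strict.group(1))
print(simple, agent)
```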

IPv6 addresses

The \S+ for client IP captures both IPv4 (192.168.1.1) and IPv6 (2001:db8::1) without changes. If you need to validate the captured IP, do it in code after extraction — IP-validation regexes are notoriously long and error-prone.
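Validation after extraction can lean on Python's standard ipaddress module instead of a pattern. The helper name ip_version below is an assumption for the example.

```python
import ipaddress

def ip_version(text):
    """Return 4 or 6 for a valid IP literal, or None for anything else."""
    try:
        return ipaddress.ip_address(text).version
    except ValueError:
        # Not an IP literal: could be a hostname or a corrupted field
        return None

for candidate in ('192.168.1.1', '2001:db8::1', '999.1.1.1', 'proxy.internal'):
    print(candidate, ip_version(candidate))
```

A None result is worth counting rather than silently dropping, since it usually points at a nonstandard format or a field that shifted position.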

Custom log formats

Many sites tweak Apache's default format to add request duration, upstream backend, or upstream response time. Check your LogFormat directive (Apache) or log_format block (Nginx) and rebuild the regex from the field list. Each field becomes one named capture group; separators stay literal.
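One way to keep that rebuild mechanical is to generate the regex from a template. Everything here is a hypothetical sketch: the {name} template syntax, the field names, and the per-field sub-patterns are choices made for this example, not Apache or Nginx syntax.

```python
import re

# Sub-pattern for each field name used in a template (illustrative set)
FIELD_PATTERNS = {
    'ip': r'\S+',
    'ident': r'\S+',
    'user': r'\S+',
    'ts': r'[^\]]+',
    'request': r'[^"]*',
    'status': r'\d{3}',
    'size': r'\d+|-',
    'request_time': r'[\d.]+',
}

def build_pattern(template):
    """Turn '{ip} [{ts}] ...' into a compiled regex with named groups."""
    # re.split with a capture group alternates: literal, name, literal, name, ...
    parts = re.split(r'\{(\w+)\}', template)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            out.append(re.escape(part))  # separators stay literal
        else:
            out.append(f'(?P<{part}>{FIELD_PATTERNS[part]})')
    return re.compile('^' + ''.join(out) + '$')

pattern = build_pattern(
    '{ip} {ident} {user} [{ts}] "{request}" {status} {size} {request_time}'
)
m = pattern.match(
    '10.0.0.5 - - [10/Mar/2024:13:55:36 +0000] "GET /api HTTP/1.1" 200 512 0.042'
)
print(m.group('request'), m.group('request_time'))
```

When the server format changes, only the template string and possibly one entry in FIELD_PATTERNS need to change.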

Performance on large files

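The patterns above avoid nested quantifiers, so they do not backtrack catastrophically and matching time grows roughly linearly with line length. Throughput depends on the engine and hardware, so measure on your own data. Three rules for big files: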

  • Stream, don't slurp. Iterate line-by-line; never load a multi-GB file into memory.
  • Compile patterns once. In Python, re.compile at module scope; in JavaScript, define the regex outside the loop.
  • Filter before parsing. If you only care about 5xx, grep '" 5[0-9][0-9] ' first, then parse only the lines that survive. On a healthy service that is usually a small fraction of the file.
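The three rules combine naturally in one function. In this sketch the function takes any iterable of lines, so it works on an open file object as well as on the in-memory sample used here. The name count_5xx_paths and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Compiled once at module scope
PREFILTER = re.compile(r'" 5\d\d ')   # cheap check, same idea as the grep
FULL = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def count_5xx_paths(lines):
    """Count 5xx responses per path, reading one line at a time."""
    counts = Counter()
    for line in lines:
        if not PREFILTER.search(line):
            continue                  # most lines stop here
        m = FULL.match(line)
        if m and m.group('status').startswith('5'):
            counts[m.group('path')] += 1
    return counts

sample = [
    '10.0.0.1 - - [10/Mar/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "curl/8.0"',
    '10.0.0.2 - - [10/Mar/2024:13:55:37 +0000] "GET /b HTTP/1.1" 502 0 "-" "curl/8.0"',
    '10.0.0.3 - - [10/Mar/2024:13:55:38 +0000] "POST /b HTTP/1.1" 500 - "-" "curl/8.0"',
    'malformed line',
]
counts = count_5xx_paths(sample)
print(counts.most_common(5))
```

To run it on a real file, pass the file object itself, for example count_5xx_paths(open('access.log', encoding='utf-8', errors='replace')), which streams instead of loading the file.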

Try It Live

Paste any of these patterns into the regex tester with a sample log line — the tester runs entirely in your browser, so production logs stay on your machine. For named groups and a token-by-token explanation, the regex explainer walks through each part. Common patterns including timestamps, IPs, and HTTP fields are in the regex pattern library.

Frequently Asked Questions

What regex parses Apache or Nginx access log lines?

For the Apache Combined Log Format (the predefined default in Nginx and the usual choice in packaged Apache configurations), the pattern is ^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$. Captured groups in order are: client IP, timestamp, HTTP method, path, protocol, status code, response size, referer, and user-agent. Use named groups, (?P<name>...) in Python or (?<name>...) in JavaScript and modern regex engines, for clearer access by field name.

How do I use named capture groups in regex?

Named groups give each captured field a label so you can access it by name instead of by position. In JavaScript and PCRE, the syntax is (?<name>...). In Python and older PCRE, it is (?P<name>...). After matching, access the value via match.groups.name in JavaScript or match.group("name") in Python. In replacement strings, refer to a group as $<name> in JavaScript or \g<name> in Python. Named groups make log-parsing code much easier to read and maintain, because anyone reading the code can tell what each field is.

How do I parse syslog format?

Traditional syslog (RFC 3164) lines look like Mar 10 13:55:36 hostname program[123]: message. The pattern ^(?P<ts>\w{3}\s+\d+\s+\d+:\d+:\d+) (?P<host>\S+) (?P<program>\S+?)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$ extracts the timestamp, hostname, program name, optional PID, and message. RFC 5424 syslog (the modern variant) prefixes lines with a priority number in angle brackets and uses ISO 8601 timestamps; the pattern grows but the same approach works.

How do I handle multi-line log entries like stack traces?

A typical stack trace spans many lines, all logically part of one log entry. Two approaches work. First, pre-process the file by joining continuation lines into the parent entry. The most reliable signal is that a continuation line does not start with the entry marker (the leading timestamp); leading whitespace or the word "at" catches Java frames but misses lines such as Python's Traceback header and final exception message. Second, switch to a multi-line regex with the s (dotall) flag so . matches newlines, then anchor each entry by its leading timestamp pattern. Tools like Logstash, Fluent Bit, and Vector have built-in multi-line handlers, and for production pipelines they are easier than rolling your own.

When should I stop using regex and use a structured log parser?

Regex is the right tool when log lines follow a consistent, single-line format. It stops being the right tool when logs are JSON (use a JSON parser), when entries span many lines and have variable structure, when you need to query historical logs across many fields (use Loki, Elasticsearch, or a SQL-on-logs tool), or when the same logs come from different sources with different formats. The most reliable production setup is to emit JSON logs from your application and never need regex at all. Regex is for parsing logs you cannot change at the source.