Regular expressions (regex) are one of the most powerful text-manipulation utilities in software engineering. A single regex can replace dozens of lines of hand-written string parsing, whether you are validating email addresses, extracting fields from logs, or matching route parameters.
But regex is also notoriously opaque, and poorly written patterns can lead to severe performance bottlenecks. In the worst case, a bad pattern can trigger a **Regular Expression Denial of Service (ReDoS)**, locking up your server CPU. Let's explore the core syntax blocks and how to write secure, performant regex.
1. Core Building Blocks
To use regex effectively, you must understand how engines parse strings. Engines search for literal matches, modified by operators:
- Character Classes: `[a-z]` matches lowercase letters, `[0-9]` matches digits, and `[^a-z]` matches any character *except* lowercase letters. Special abbreviations like `\d` (digit), `\w` (word character), and `\s` (whitespace) keep patterns readable.
- Quantifiers: Denoted by `*` (zero or more), `+` (one or more), `?` (zero or one), and `{min,max}` (explicit repeat bounds).
- Anchors: `^` asserts the start of the string, while `$` asserts the end (in multiline mode they match at line boundaries instead). Anchors are vital for security: without them, a pattern like `[0-9]+` will find a match inside `"abc123xyz"` (the substring `123`) instead of enforcing purely numeric input.
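The building blocks above can be tried out in a few lines of Python using the standard `re` module. This is a minimal sketch: the helper name `is_numeric` is hypothetical and exists only for illustration.

```python
import re

def is_numeric(text: str) -> bool:
    """Hypothetical validator: True only if the whole input is digits."""
    # fullmatch requires the entire string to match the pattern.
    return re.fullmatch(r"[0-9]+", text) is not None

# Unanchored: search finds the digits buried inside the input.
print(re.search(r"[0-9]+", "abc123xyz").group())   # 123

# Anchored: the same input is rejected.
print(re.search(r"^[0-9]+$", "abc123xyz"))         # None

print(is_numeric("123"))        # True
print(is_numeric("abc123xyz"))  # False

# Character class plus bounded quantifier: 2 to 4 lowercase letters.
print(re.fullmatch(r"[a-z]{2,4}", "abc") is not None)    # True
print(re.fullmatch(r"[a-z]{2,4}", "abcde") is not None)  # False

# Python-specific caveat: `$` also matches just before a trailing newline,
# so the anchored search accepts "123\n" while fullmatch does not.
print(re.search(r"^[0-9]+$", "123\n") is not None)  # True
print(is_numeric("123\n"))                          # False
```

In Python, `re.fullmatch` is often the simpler choice for validation because it avoids the trailing-newline behavior of `$`.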
2. Greedy vs. Lazy Quantifiers
By default, quantifiers in regex are **greedy**—they match as much text as possible. For example, if you run the pattern `<.*>` against the HTML string `<div>Hello</div>`, the engine will match the entire string `<div>Hello</div>` rather than just the opening tag `<div>`.
To make a quantifier **lazy** (matching the shortest possible segment), append a `?` after it. The pattern `<.*?>` applied to the same HTML string will correctly stop at the first `>`, matching only `<div>`.
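The difference is easy to confirm in Python. The sketch below runs both patterns against the same HTML string. The third pattern, which uses a negated character class, is not from the discussion above; it is shown as a common alternative that avoids relying on laziness.

```python
import re

html = "<div>Hello</div>"

# Greedy: `.*` runs to the end of the string, then backs up to the last `>`.
greedy = re.search(r"<.*>", html).group()
print(greedy)   # <div>Hello</div>

# Lazy: `.*?` expands one character at a time and stops at the first `>`.
lazy = re.search(r"<.*?>", html).group()
print(lazy)     # <div>

# Alternative (an addition here, not part of the text above):
# a negated class can never run past a `>`, so no laziness is needed.
negated = re.search(r"<[^>]*>", html).group()
print(negated)  # <div>

# With findall, the lazy pattern picks up each tag separately.
tags = re.findall(r"<.*?>", html)
print(tags)     # ['<div>', '</div>']
```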
3. Lookahead and Lookbehind Assertions
Lookarounds allow you to match a pattern only if it is (or is not) preceded or followed by another pattern, without including the surrounding text in the match result.
- Positive Lookahead `(?=...)`: Matches if the suffix exists. E.g., `\d+(?=\sUSD)` matches `100` in the string `"100 USD"`, but not `"100 EUR"`.
- Negative Lookahead `(?!...)`: Matches if the suffix does *not* exist. E.g., `\d(?!\d)` matches only the last digit of a number sequence (the `3` in `"123"`), because every earlier digit is followed by another digit.
- Lookbehinds `(?<=...)` and `(?<!...)`: Perform similar checks looking backward from the current position.
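The four lookaround forms can be exercised with Python's `re` module. In this sketch, the currency strings for the lookbehind cases are made-up sample inputs. Note that Python's `re` requires lookbehind patterns to have a fixed width.

```python
import re

# Positive lookahead: digits only when followed by " USD".
usd = re.search(r"\d+(?=\sUSD)", "100 USD")
print(usd.group())                             # 100  (" USD" is not in the match)
print(re.search(r"\d+(?=\sUSD)", "100 EUR"))   # None

# Negative lookahead: a digit NOT followed by another digit.
last_digit = re.search(r"\d(?!\d)", "abc123").group()
print(last_digit)                              # 3
# With a greedy `\d+`, the whole run is consumed before the check.
whole_run = re.search(r"\d+(?!\d)", "abc123").group()
print(whole_run)                               # 123

# Positive lookbehind: digits only when preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "Price: $42, qty 17")
print(after_dollar)                            # ['42']

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "Price: $42, qty 17")
print(not_after_dollar)                        # ['17']
```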
4. The Danger of ReDoS (Catastrophic Backtracking)
The default regex engines in many popular languages (such as JavaScript, Python, Java, and Ruby) are backtracking engines. When such an engine hits a mismatch, it backs up to an earlier decision point and tries the next alternative to see whether another path yields a match.
If you write nested quantifiers (e.g., `(a+)+`), the number of possible matching paths grows exponentially with the input length: a run of n `a` characters can be split among the group's repetitions in roughly 2^n ways. If you run the pattern `^(a+)+$` against a string of 25 `"a"` characters followed by a single `"b"` (which prevents `$` from matching), the engine will try tens of millions of combinations before concluding there is no match, and every additional `"a"` roughly doubles the work. This is **catastrophic backtracking**, and it can peg a server's CPU core at 100%.
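The growth can be observed directly with a small timing experiment. The sketch below is illustrative only: the names `VULNERABLE`, `SAFE`, and `time_match` are hypothetical, and the input lengths are kept small on purpose so the script finishes quickly. Exact timings depend on the machine and Python version, so none are shown.

```python
import re
import time

# Nested quantifier: many ways to split a run of "a" across repetitions.
VULNERABLE = re.compile(r"^(a+)+$")
# Accepts exactly the same strings, with a single quantifier and no ambiguity.
SAFE = re.compile(r"^a+$")

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# Keep n small: each extra "a" roughly doubles the vulnerable pattern's work.
for n in (12, 14, 16, 18):
    attack = "a" * n + "b"   # the trailing "b" forces the overall match to fail
    _, slow = time_match(VULNERABLE, attack)
    _, fast = time_match(SAFE, attack)
    print(f"n={n:2d}  nested: {slow:.4f}s  flat: {fast:.6f}s")

# The flat pattern stays fast even on a much longer input.
result, elapsed = time_match(SAFE, "a" * 5000 + "b")
print(result, f"{elapsed:.6f}s")
```

Rewriting the pattern so that each character can be matched in only one way, as `SAFE` does, is one way to remove the exponential blow-up. Limiting input length before matching is a complementary safeguard.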