Chapter 23: Advanced Python Programming – Mastering Python Regular Expressions



23.1 Introduction to Regular Expressions (Regex)

Regular expressions (regex) describe patterns in text and are used to search, extract, validate, and transform strings. Python's standard library provides the re module for working with them. Regex can make text processing, data extraction, and input validation tasks considerably more concise.


23.2 The re Module in Python

The re module offers several functions:

  • re.match(): Determines if the RE matches at the beginning of the string.

  • re.search(): Scans through a string looking for any location where the RE matches.

  • re.findall(): Returns all non-overlapping matches of RE in the string.

  • re.finditer(): Returns an iterator yielding match objects over all matches.

  • re.sub(): Replaces matched substrings with a new string.

  • re.split(): Splits the string by occurrences of the pattern.

Example:

import re

text = "Python is powerful. Python is easy to learn."
matches = re.findall("Python", text)
print(matches)  # ['Python', 'Python']
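
The difference between re.match(), re.search(), and re.finditer() is easiest to see on the same text. The following is a small sketch; the variable names are illustrative only:

```python
import re

text = "Python is powerful. Python is easy to learn."

# re.match() only succeeds when the pattern matches at the start of the string.
at_start = re.match(r"Python", text)
not_at_start = re.match(r"easy", text)

# re.search() returns the first match found anywhere in the string.
first_easy = re.search(r"easy", text)

# re.finditer() yields match objects, which carry positions as well as text.
positions = [m.start() for m in re.finditer(r"Python", text)]

print(at_start.group())    # Python
print(not_at_start)        # None
print(first_easy.start())  # 30
print(positions)           # [0, 20]
```

Both re.match() and re.search() return None when nothing matches, so check the result before calling group() on it.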

23.3 Regex Metacharacters

Metacharacters are characters with special meanings in regular expressions:

Metacharacter Description
. Matches any character except newline
^ Matches the beginning of a string
$ Matches the end of a string
* Matches 0 or more repetitions
+ Matches 1 or more repetitions
? Matches 0 or 1 repetition
{m,n} Matches between m and n repetitions
[] Matches any character in the set
| Matches either the expression before it or the expression after it (alternation)
() Groups sub-patterns
\ Escapes special characters
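
As a rough illustration of the metacharacters in the table, the sketch below applies each one to a short, made-up sample string:

```python
import re

# . matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt")
print(dot)  # ['cat', 'cot']

# ^ and $ anchor the pattern; + requires one or more characters from the set.
all_digits = bool(re.search(r"^[0-9]+$", "12345"))
not_all_digits = bool(re.search(r"^[0-9]+$", "123a5"))
print(all_digits, not_all_digits)  # True False

# * allows zero or more repetitions of the preceding character.
star = re.findall(r"ab*", "a ab abbb")
print(star)  # ['a', 'ab', 'abbb']

# ? makes the preceding character optional.
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# {m,n} bounds the repetition count; [] defines a character set.
bounded = re.findall(r"[a-c]{2,3}", "abcab xyz")
print(bounded)  # ['abc', 'ab']

# | matches either alternative.
either = re.findall(r"cat|dog", "cat, dog, bird")
print(either)  # ['cat', 'dog']

# \ escapes a metacharacter so that it is matched literally.
literal_dots = re.findall(r"\.", "a.b.c")
print(literal_dots)  # ['.', '.']
```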

23.4 Special Sequences

Python Regex supports special sequences for common patterns:

Sequence Meaning
\d Matches any decimal digit (0-9; in str patterns also other Unicode digits unless re.ASCII is set)
\D Matches any non-digit
\w Matches any word character: letters, digits, and the underscore
\W Matches any non-word character
\s Matches any whitespace character
\S Matches any non-whitespace character
\b Matches word boundaries
\B Matches non-word boundaries
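
The sequences above can be tried on a few sample strings. This is a minimal sketch; the strings and variable names are invented for illustration:

```python
import re

line = "Order 66 shipped to Unit_7 on 2024-01-15"

# \d+ collects runs of digits.
digits = re.findall(r"\d+", line)
print(digits)  # ['66', '7', '2024', '01', '15']

# \w+ collects runs of word characters.
words = re.findall(r"\w+", line)
print(words)  # ['Order', '66', 'shipped', 'to', 'Unit_7', 'on', '2024', '01', '15']

# \D matches every non-digit, so removing those leaves only the digits.
only_digits = re.sub(r"\D", "", "(555) 123-4567")
print(only_digits)  # 5551234567

# \s+ matches runs of spaces, tabs, and newlines.
fields = re.split(r"\s+", "a  b\tc\nd")
print(fields)  # ['a', 'b', 'c', 'd']

# \b requires a word boundary on each side, so only standalone words match.
whole_word = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole_word)  # ['cat', 'cat']

# \B requires that there is no word boundary, so only "cat" inside "concat" matches.
inside_word = re.findall(r"\Bcat", "cat concat")
print(inside_word)  # ['cat']
```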

23.5 Compiling Regular Expressions

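For reusability, and a modest performance gain in code that applies the same pattern many times, compile the pattern once with re.compile() and call methods on the resulting pattern object: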

pattern = re.compile(r"\d{3}-\d{2}-\d{4}")
result = pattern.search("My SSN is 123-45-6789")
print(result.group())  # 123-45-6789
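
A compiled pattern pays off when it is applied to many inputs. The sketch below reuses one pattern object over a made-up list of records and shows that flags such as re.IGNORECASE can be set at compile time:

```python
import re

# Compile once, then reuse the pattern object's methods.
ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

records = [
    "My SSN is 123-45-6789",
    "No number here",
    "Two: 111-22-3333 and 444-55-6666",
]

results = [ssn_pattern.findall(record) for record in records]
print(results)
# [['123-45-6789'], [], ['111-22-3333', '444-55-6666']]

# Flags passed to re.compile() apply to every later call on the pattern object.
python_pattern = re.compile(r"python", re.IGNORECASE)
any_case = python_pattern.findall("Python PYTHON python")
print(any_case)  # ['Python', 'PYTHON', 'python']
```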

23.6 Grouping and Capturing

Parentheses () are used to group parts of a regex and capture the match:

match = re.search(r"(\d+)-(\d+)", "Phone: 123-456")
print(match.group(1))  # 123
print(match.group(2))  # 456

Named groups improve clarity:

match = re.search(r"(?P<area>\d+)-(?P<number>\d+)", "123-456")
print(match.group("area"))   # 123
print(match.group("number")) # 456
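
A match object can also return all of its groups at once. The following sketch uses groups() and groupdict() and guards against a failed match; the input strings are illustrative:

```python
import re

match = re.search(r"(?P<area>\d+)-(?P<number>\d+)", "Phone: 123-456")

whole = match.group(0)       # the entire matched text
as_tuple = match.groups()    # all captured groups, in order
as_dict = match.groupdict()  # named groups only, keyed by name

print(whole)     # 123-456
print(as_tuple)  # ('123', '456')
print(as_dict)   # {'area': '123', 'number': '456'}

# re.search() returns None when the pattern is not found,
# so test the result before asking for groups.
no_match = re.search(r"(\d+)-(\d+)", "no digits here")
if no_match is None:
    print("no match")
```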

23.7 Substitution with re.sub()

re.sub() replaces every part of the string that matches the pattern with a replacement string:

text = "apple, banana, mango"
new_text = re.sub(r"\b\w{5}\b", "fruit", text)
print(new_text)  # fruit, banana, fruit
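
The replacement passed to re.sub() can also refer to captured groups or be a function, and the count argument limits how many matches are replaced. A short sketch, in which the helper name double is hypothetical:

```python
import re

# \1 and \2 in the replacement refer back to the captured groups.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")
print(swapped)  # Ada Lovelace

# A function receives each match object and returns the replacement text.
def double(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears

# count limits the number of replacements.
first_only = re.sub(r"\b\w{5}\b", "fruit", "apple, banana, mango", count=1)
print(first_only)  # fruit, banana, mango
```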

23.8 Splitting Strings with re.split()

re.split() splits a string wherever the pattern matches, which is useful when the delimiter is more complex than a single fixed string:

data = "one, two;three|four"
items = re.split(r"[,;|]", data)
print(items)  # ['one', ' two', 'three', 'four']
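
Note the leading space in ' two' above: the pattern matches only the delimiter itself. One possible refinement, sketched here, is to let the pattern absorb surrounding whitespace. The maxsplit argument and capturing groups also change what re.split() returns:

```python
import re

data = "one, two;three|four"

# \s* on both sides swallows whitespace around each delimiter.
trimmed = re.split(r"\s*[,;|]\s*", data)
print(trimmed)  # ['one', 'two', 'three', 'four']

# maxsplit stops after the given number of splits.
head_and_rest = re.split(r"\s*[,;|]\s*", data, maxsplit=1)
print(head_and_rest)  # ['one', 'two;three|four']

# A capturing group keeps the delimiters in the result.
with_delimiters = re.split(r"([,;|])", "a,b;c")
print(with_delimiters)  # ['a', ',', 'b', ';', 'c']
```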

23.9 Greedy vs Non-Greedy Matching

  • Greedy: Matches as much text as possible.

  • Non-Greedy: Matches as little as possible.

greedy = re.search(r"<.*>", "<tag>content</tag>")
print(greedy.group())  # <tag>content</tag>

non_greedy = re.search(r"<.*?>", "<tag>content</tag>")
print(non_greedy.group())  # <tag>
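
The distinction matters most with re.findall(), where a greedy quantifier can merge several intended matches into one. A minimal sketch with a made-up sentence:

```python
import re

sentence = 'say "hi" and "bye"'

# Greedy: .* runs to the last quote in the string.
greedy_quotes = re.findall(r'"(.*)"', sentence)
print(greedy_quotes)  # ['hi" and "bye']

# Non-greedy: .*? stops at the first closing quote each time.
lazy_quotes = re.findall(r'"(.*?)"', sentence)
print(lazy_quotes)  # ['hi', 'bye']

# The same applies to the tag example: each tag is matched separately.
tags = re.findall(r"<.*?>", "<tag>content</tag>")
print(tags)  # ['<tag>', '</tag>']
```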

23.10 Lookahead and Lookbehind

Lookahead: Asserts what follows.

re.findall(r"\w+(?=\.)", "example.com")  # ['example']

Lookbehind: Asserts what precedes.

re.findall(r"(?<=@)\w+", "user@example.com")  # ['example']
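
Lookarounds check the surrounding text without including it in the match. The sketch below, using invented sample strings, extracts numbers based on what comes directly before or after them:

```python
import re

# Lookbehind: digits that come directly after a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "Cost: $45, discount 10, total $35")
print(prices)  # ['45', '35']

# Lookahead: digits that are directly followed by "kg".
weights = re.findall(r"\d+(?=kg)", "5kg of rice, 12 eggs, 20kg of flour")
print(weights)  # ['5', '20']
```

In both cases the "$" and "kg" are required for a match but are not part of the returned text.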

23.11 Real-world Examples

Example 1: Email Validation

email_pattern = r"^[\w\.-]+@[\w\.-]+\.\w{2,}$"
print(bool(re.match(email_pattern, "user@example.com")))  # True
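
One way to package this check is a small helper function. The sketch below assumes the same pattern as above; the names EMAIL_PATTERN and is_valid_email are hypothetical, and the pattern is a simple format check rather than a full implementation of the email address standards:

```python
import re

# Same pattern as in the example above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w{2,}$")

def is_valid_email(address):
    """Return True if address has the basic shape name@domain.tld."""
    return EMAIL_PATTERN.match(address) is not None

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("user@example"))      # False (no top-level domain)
print(is_valid_email("@example.com"))      # False (nothing before the @)
print(is_valid_email("user example.com"))  # False (no @ sign)
```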

Example 2: Phone Number Matching

phone = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
print(re.findall(phone, "Call me at 123-456-7890 or (123) 456-7890"))
# ['123-456-7890', '(123) 456-7890']

Example 3: HTML Tag Stripping

html = "<p>This is <b>bold</b></p>"
clean = re.sub(r"<.*?>", "", html)
print(clean)  # This is bold

23.12 Best Practices for Regex in Python

  • Use raw strings (r"") to avoid conflicts with escape sequences.

  • Compile patterns for repeated use.

  • Use meaningful variable names and comments for complex patterns.

  • Avoid overly complex expressions – they reduce readability.

  • Test patterns thoroughly using tools like regex101.com.
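
Two of these practices can be shown briefly. The sketch below first contrasts a raw string with a regular string, then rewrites the phone pattern from Example 2 with the re.VERBOSE flag so that each part carries a comment. The name PHONE is illustrative:

```python
import re

# Without the r prefix, "\b" is a backspace character, not a word boundary.
without_raw = re.findall("\bcat\b", "cat")
with_raw = re.findall(r"\bcat\b", "cat")
print(without_raw)  # []
print(with_raw)     # ['cat']

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
PHONE = re.compile(
    r"""
    \(?\d{3}\)?   # area code, optionally in parentheses
    [-.\s]?       # optional separator
    \d{3}         # next three digits
    [-.\s]?       # optional separator
    \d{4}         # last four digits
    """,
    re.VERBOSE,
)

numbers = PHONE.findall("Call me at 123-456-7890 or (123) 456-7890")
print(numbers)  # ['123-456-7890', '(123) 456-7890']
```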


23.13 Summary

Regular Expressions offer a powerful mechanism for text pattern matching and processing. With Python’s re module, one can write efficient and concise code for validating data, extracting information, and formatting text. Mastering regex enhances your ability to work with data, especially in fields like web scraping, natural language processing, and log analysis.


23.14 Exercises

  1. Write a regex to extract all dates in the format DD-MM-YYYY.

  2. Write a regex that checks whether a string is a valid IPv4 address.

  3. Replace every run of multiple spaces in a string with a single space.

  4. Extract hashtags from a tweet.

  5. Write a regex that matches all URLs in a string.
