Chapter 23: Advanced Python Programming – Mastering Python Regular Expressions
23.1 Introduction to Regular Expressions (Regex)
Regular expressions (regex) are patterns used to match and manipulate text in strings. Python's standard library includes the re module for working with them. Mastering regex can make text processing, data extraction, and input validation tasks considerably more efficient.
23.2 The re Module in Python
The re module offers several functions:
- re.match(): Determines if the RE matches at the beginning of the string.
- re.search(): Scans through a string looking for any location where the RE matches.
- re.findall(): Returns all non-overlapping matches of RE in the string.
- re.finditer(): Returns an iterator yielding match objects over all matches.
- re.sub(): Replaces matched substrings with a new string.
- re.split(): Splits the string by occurrences of the pattern.
Example:
import re
text = "Python is powerful. Python is easy to learn."
matches = re.findall("Python", text)
print(matches) # ['Python', 'Python']
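The difference between the functions listed above is easiest to see side by side. The following is a small sketch that reuses the same sample text; the variable names are illustrative only.

```python
import re

text = "Python is powerful. Python is easy to learn."

# re.match() only succeeds if the pattern matches at position 0.
at_start = re.match(r"Python", text)
not_at_start = re.match(r"powerful", text)  # None: "powerful" is not at the start

# re.search() finds the first match anywhere in the string.
first_hit = re.search(r"powerful", text)

# re.finditer() yields one match object per occurrence, including positions.
positions = [m.start() for m in re.finditer(r"Python", text)]

print(at_start is not None)  # True
print(not_at_start)          # None
print(first_hit.span())      # (10, 18)
print(positions)             # [0, 20]
```

Because re.match() and re.search() return None when nothing matches, it is usually worth checking the result before calling methods such as group() on it.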
23.3 Regex Metacharacters
Metacharacters are characters with special meanings in regular expressions:
Metacharacter | Description |
---|---|
. | Matches any character except newline |
^ | Matches the beginning of a string |
$ | Matches the end of a string |
* | Matches 0 or more repetitions |
+ | Matches 1 or more repetitions |
? | Matches 0 or 1 repetition |
{m,n} | Matches between m and n repetitions |
[] | Matches any character in the set |
`\|` | Matches either the pattern before it or the pattern after it (alternation) |
() | Groups sub-patterns |
\ | Escapes special characters |
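To make the table concrete, here is a short sketch that exercises each metacharacter once. The sample strings are made up for illustration.

```python
import re

# "." matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot cut c\nt")

# "^" and "$" anchor a match to the start and end of the string.
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))

# "*", "+", "?" and "{m,n}" repeat the item that precedes them.
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
optional = re.findall(r"colou?r", "color colour")
counted = re.findall(r"[0-9]{2,3}", "1 22 333 4444")

# "|" chooses between alternatives, "()" groups, and "\" escapes.
either = re.findall(r"cat|dog", "cat, dog, bird")
grouped = re.search(r"(ha)+", "hahaha!").group()
literal_dot = re.findall(r"3\.14", "3.14 3x14")

print(dot)           # ['cat', 'cot', 'cut']
print(starts, ends)  # True True
print(star)          # ['a', 'ab', 'abb']
print(plus)          # ['ab', 'abb']
print(optional)      # ['color', 'colour']
print(counted)       # ['22', '333', '444']
print(either)        # ['cat', 'dog']
print(grouped)       # hahaha
print(literal_dot)   # ['3.14']
```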
23.4 Special Sequences
Python Regex supports special sequences for common patterns:
Sequence | Meaning |
---|---|
\d | Matches any digit (0-9) |
\D | Matches any non-digit |
\w | Matches any word character (letters, digits, and underscore) |
\W | Matches any non-word character |
\s | Matches any whitespace character |
\S | Matches any non-whitespace character |
\b | Matches word boundaries |
\B | Matches non-word boundaries |
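A quick sketch of the sequences in action, using an invented sample string. Note that \w treats the underscore as a word character, so "Room_12" comes back as a single token.

```python
import re

sample = "Order 66 shipped to Room_12 at 09:30"

digits = re.findall(r"\d+", sample)     # runs of digits
words = re.findall(r"\w+", sample)      # runs of word characters
non_space = re.findall(r"\S+", sample)  # runs of non-whitespace

# \b requires a word boundary; \B requires that there is none.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = re.findall(r"\Bcat", "cat concat")

print(digits)       # ['66', '12', '09', '30']
print(words)        # ['Order', '66', 'shipped', 'to', 'Room_12', 'at', '09', '30']
print(non_space)    # ['Order', '66', 'shipped', 'to', 'Room_12', 'at', '09:30']
print(whole_word)   # ['cat', 'cat']
print(inside_word)  # ['cat']
```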
23.5 Compiling Regular Expressions
For better performance and reusability, compile regex patterns using re.compile():
pattern = re.compile(r"\d{3}-\d{2}-\d{4}")
result = pattern.search("My SSN is 123-45-6789")
print(result.group()) # 123-45-6789
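Compiling pays off mainly when one pattern is applied to many inputs. The sketch below assumes a small list of made-up lines and reuses a single compiled pattern object for all of them.

```python
import re

# Compile once, then reuse the pattern object across many inputs.
ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

lines = [
    "My SSN is 123-45-6789",
    "No identifier on this line",
    "Two here: 111-22-3333 and 444-55-6666",
]

# Compiled patterns expose the same methods as the module-level functions.
found = [ssn_pattern.findall(line) for line in lines]
print(found)
# [['123-45-6789'], [], ['111-22-3333', '444-55-6666']]
```

The re module also caches recently used patterns internally, so the speed gain is often modest; the clearer benefit is a named, reusable object.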
23.6 Grouping and Capturing
Parentheses () are used to group parts of a regex and capture the match:
match = re.search(r"(\d+)-(\d+)", "Phone: 123-456")
print(match.group(1)) # 123
print(match.group(2)) # 456
Named groups improve clarity:
match = re.search(r"(?P<area>\d+)-(?P<number>\d+)", "123-456")
print(match.group("area")) # 123
print(match.group("number")) # 456
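Match objects can also return every group at once. A minimal sketch using the same named-group pattern:

```python
import re

match = re.search(r"(?P<area>\d+)-(?P<number>\d+)", "Phone: 123-456")

if match:  # re.search() returns None when nothing matches
    whole = match.group(0)       # the entire matched text
    as_tuple = match.groups()    # all captured groups, in order
    as_dict = match.groupdict()  # named groups only, keyed by name
    print(whole)     # 123-456
    print(as_tuple)  # ('123', '456')
    print(as_dict)   # {'area': '123', 'number': '456'}
```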
23.7 Substitution with re.sub()
re.sub() replaces the parts of a string that match the pattern:
text = "apple, banana, mango"
new_text = re.sub(r"\b\w{5}\b", "fruit", text)
print(new_text) # fruit, banana, fruit
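The replacement does not have to be a fixed string. As a sketch, the examples below use a backreference, a replacement function (the helper name shout is hypothetical), and the count argument.

```python
import re

# Backreferences (\1, \2) reuse captured groups in the replacement.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")

# A function can compute each replacement from the match object.
def shout(m):
    return m.group().upper()

text = "apple, banana, mango"
upper_five = re.sub(r"\b\w{5}\b", shout, text)

# count limits how many replacements are made.
first_only = re.sub(r"\b\w{5}\b", "fruit", text, count=1)

print(swapped)     # Ada Lovelace
print(upper_five)  # APPLE, banana, MANGO
print(first_only)  # fruit, banana, mango
```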
23.8 Splitting Strings with re.split()
re.split() is useful for splitting on several delimiters at once or on more complex patterns:
data = "one, two;three|four"
items = re.split(r"[,;|]", data)
print(items) # ['one', ' two', 'three', 'four']
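Note the leading space in ' two' above. One way to avoid it, sketched here, is to let the pattern absorb whitespace around each delimiter; maxsplit caps the number of splits.

```python
import re

data = "one, two;three|four"

# \s* on both sides swallows any whitespace around the delimiter.
items = re.split(r"\s*[,;|]\s*", data)

# maxsplit=1 stops after the first delimiter.
head_and_rest = re.split(r"\s*[,;|]\s*", data, maxsplit=1)

print(items)          # ['one', 'two', 'three', 'four']
print(head_and_rest)  # ['one', 'two;three|four']
```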
23.9 Greedy vs Non-Greedy Matching
- Greedy: Matches as much text as possible.
- Non-Greedy: Matches as little as possible.
greedy = re.search(r"<.*>", "<tag>content</tag>")
print(greedy.group()) # <tag>content</tag>
non_greedy = re.search(r"<.*?>", "<tag>content</tag>")
print(non_greedy.group()) # <tag>
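The difference matters most when a string contains several candidate matches. A small sketch with an invented HTML fragment:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last "</b>", so both elements merge into one match.
greedy_hits = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the first "</b>" it can, giving one match per element.
lazy_hits = re.findall(r"<b>(.*?)</b>", html)

print(greedy_hits)  # ['one</b> and <b>two']
print(lazy_hits)    # ['one', 'two']
```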
23.10 Lookahead and Lookbehind
Lookahead: Asserts what follows the current position without including it in the match.
re.findall(r"\w+(?=\.)", "example.com") # ['example']
Lookbehind: Asserts what precedes the current position without including it in the match.
re.findall(r"(?<=@)\w+", "user@example.com") # ['example']
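As a further sketch, both assertions can pick values out of a made-up settings string without capturing the surrounding text. In Python's re module, a lookbehind must match text of a fixed length.

```python
import re

settings = "width:120px height:80px depth:5"

# Lookahead: digits that are followed by "px" ("px" is not part of the match).
px_values = re.findall(r"\d+(?=px)", settings)

# Lookbehind: digits that are preceded by "depth:" (a fixed-length prefix).
depth = re.findall(r"(?<=depth:)\d+", settings)

print(px_values)  # ['120', '80']
print(depth)      # ['5']
```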
23.11 Real-world Examples
Example 1: Email Validation
email_pattern = r"^[\w\.-]+@[\w\.-]+\.\w{2,}$"
print(bool(re.match(email_pattern, "user@example.com"))) # True
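One way to package this check is a small helper that returns a boolean. This is a sketch only: the name looks_like_email is hypothetical, and the pattern is a loose format check rather than a full validation of email addresses. It uses fullmatch() so that a trailing newline, which $ alone would tolerate, is rejected.

```python
import re

# Same pattern as above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w{2,}$")

def looks_like_email(value):
    """Return True if value has the rough shape name@domain.tld."""
    # fullmatch() requires the whole string to match the pattern.
    return EMAIL_PATTERN.fullmatch(value) is not None

print(looks_like_email("user@example.com"))    # True
print(looks_like_email("user@example"))        # False
print(looks_like_email("user@example.com\n"))  # False
```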
Example 2: Phone Number Matching
phone = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
print(re.findall(phone, "Call me at 123-456-7890 or (123) 456-7890")) # ['123-456-7890', '(123) 456-7890']
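Once the numbers are found, a common follow-up step is to normalize them. The sketch below strips every non-digit with \D; the variable names are illustrative.

```python
import re

phone = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
message = "Call me at 123-456-7890 or (123) 456-7890"

numbers = phone.findall(message)

# Remove everything that is not a digit to get a uniform representation.
normalized = [re.sub(r"\D", "", n) for n in numbers]

print(numbers)     # ['123-456-7890', '(123) 456-7890']
print(normalized)  # ['1234567890', '1234567890']
```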
Example 3: HTML Tag Stripping
html = "<p>This is <b>bold</b></p>"
clean = re.sub(r"<.*?>", "", html)
print(clean) # This is bold
23.12 Best Practices for Regex in Python
- Use raw strings (r"") to avoid conflicts with escape sequences.
- Compile patterns for repeated use.
- Use meaningful variable names and comments for complex patterns.
- Avoid overly complex expressions – they reduce readability.
- Test patterns thoroughly using tools like regex101.com.
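Several of these practices can be combined with the re.VERBOSE flag, which allows whitespace and comments inside the pattern itself. The sketch below rewrites the phone pattern from Example 2 in that style; the name PHONE_PATTERN is illustrative.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats "#" as a comment start.
PHONE_PATTERN = re.compile(
    r"""
    \(?\d{3}\)?   # area code, optionally in parentheses
    [-.\s]?       # optional separator
    \d{3}         # next three digits
    [-.\s]?       # optional separator
    \d{4}         # last four digits
    """,
    re.VERBOSE,
)

found = PHONE_PATTERN.findall("Call me at 123-456-7890 or (123) 456-7890")
print(found)  # ['123-456-7890', '(123) 456-7890']
```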
23.13 Summary
Regular Expressions offer a powerful mechanism for text pattern matching and processing. With Python’s re module, one can write efficient and concise code for validating data, extracting information, and formatting text. Mastering regex enhances your ability to work with data, especially in fields like web scraping, natural language processing, and log analysis.
23.14 Exercises
- Write a regex to extract all dates in the format DD-MM-YYYY.
- Validate that a string is a valid IPv4 address.
- Replace all runs of multiple spaces in a string with a single space.
- Extract hashtags from a tweet.
- Write a regex that matches all URLs in a string.