Ben Chuanlong Du's Blog

Regular Expression in Python

Online Regular Expression Tester

  1. The Python module re automatically compiles a pattern given as a plain string using re.compile and caches the result, so there is not much benefit to compiling such patterns yourself.

  2. Some regular expression patterns are defined using a single leading backslash, e.g., \s, \b, etc. However, since special characters (e.g., \) need to be escaped in strings in most programming languages, you need the string "\\s" to represent the regular expression pattern \s, and similarly for other patterns with a leading backslash. Python is special in that it provides raw strings (no escape processing) to make regular expression patterns easier to write. Python also goes one step further and tolerates improperly escaped strings: an unrecognized escape sequence keeps its backslash, so "\s" ends up equal to "\\s". This only applies to unrecognized escapes, though. "\b" is a valid escape (the backspace character), so it is not the same as the pattern \b, and recent Python versions emit a warning for unrecognized escapes (DeprecationWarning since Python 3.6, SyntaxWarning since Python 3.12). I personally dislike this behavior very much as it causes confusion, especially when you call other languages (e.g., Spark SQL) to perform regular expression operations from Python. It is suggested that you use raw strings as much as possible to write regular expressions in Python.

  3. It becomes tricky if you call another programming language to perform regular expression operations from Python. Taking \s as an example, since \ needs to be escaped in both languages, you end up writing "\\\\s" to represent \s. Luckily, raw strings simplify things: instead of "\\\\s", you can write r"\\s" in Python.

  4. The regular expression modifier (?i) turns on case-insensitive matching.

  5. re.match looks for a match only at the beginning of the string, while re.search looks for a match anywhere in the string. Since the regular expression symbol ^ stands for the beginning of a string, you can prefix your regular expression with ^ to make re.search look for a match only at the beginning of the string. To sum up, re.search is more flexible than re.match, and it is suggested that you always use re.search instead of re.match.

  6. Passing re.DOTALL to the flags argument makes the dot (.) match any character including a newline (by default, the dot does not match a newline).

  7. re.search searches for the first match anywhere in the string.

  8. re.match searches for a match at the beginning of the string.

  9. re.findall finds all non-overlapping matches in the string and returns them as a list.

  10. re.finditer finds all non-overlapping matches and returns an iterator of Match objects.

  11. The inline modifier (?s) at the start of a pattern has the same effect as passing re.DOTALL to the flags argument.
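
The notes on escaping and flags above can be checked with a short sketch. The sample strings and variable names are made up for illustration only.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character),
# while the raw string r"\b" keeps the backslash and the letter b (two characters).
assert len("\b") == 1
assert len(r"\b") == 2

# A raw string and a manually escaped string spell the same pattern.
assert r"\s" == "\\s"

# When the pattern passes through a second layer of escaping,
# r"\\s" is the more readable spelling of "\\\\s".
assert r"\\s" == "\\\\s"

# (?i) at the start of the pattern turns on case-insensitive matching.
case_insensitive = re.search(r"(?i)python", "I like PYTHON")

# By default the dot stops at a newline; re.DOTALL (or inline (?s)) lifts that limit.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", flags=re.DOTALL)
with_inline_flag = re.search(r"(?s)a.b", "a\nb")

print(case_insensitive.group(0))
print(without_dotall, with_dotall, with_inline_flag)
```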

In [1]:
import re

re.compile

The compiled object is of type re.Pattern and has methods search, match, sub, findall, finditer, etc.

In [4]:
p = re.compile(r"\d{4}-\d{2}-\d{2}$")
In [5]:
type(p)
Out[5]:
re.Pattern
In [8]:
[mem for mem in dir(p) if not mem.startswith("_")]
Out[8]:
['findall',
 'finditer',
 'flags',
 'fullmatch',
 'groupindex',
 'groups',
 'match',
 'pattern',
 'scanner',
 'search',
 'split',
 'sub',
 'subn']
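
As a rough illustration of how a compiled pattern can be reused through its methods, here is a minimal sketch. The name date_pattern and the sample text are hypothetical.

```python
import re

# Hypothetical date pattern, compiled once and reused below.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

text = "Released 2018-05-02, patched 2018-07-01."

first = date_pattern.search(text)              # first match anywhere in the text
all_dates = date_pattern.findall(text)         # every match as a list of strings
masked = date_pattern.sub("YYYY-mm-dd", text)  # replace every match
exact = date_pattern.fullmatch("2018-05-02")   # the whole string must match

print(first.group(0))
print(all_dates)
print(masked)
```

The module-level functions (re.search, re.sub, etc.) accept the same arguments with the pattern as an extra first parameter, so either style works.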

re.sub

In [10]:
re.sub(r"\d{4}-\d{2}-\d{2}$", "YYYY-mm-dd", "Today is 2018-05-02")
Out[10]:
'Today is YYYY-mm-dd'
In [2]:
re.sub(r"\s", "", "a b\tc")
Out[2]:
'abc'
In [5]:
s = """this is 
    /* BEGIN{NIMA}
    what 
    ever
    END{NIMA} */
    an example
    """
print(re.sub(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", "", s))
this is 
    
    an example
    

Make sure there is a space and only one space after a comma.

In [10]:
re.sub(", *", ", ", "ab,cd")
Out[10]:
'ab, cd'
In [11]:
re.sub(", *", ", ", "ab,    cd")
Out[11]:
'ab, cd'

Substitute using parts of matched patterns.

In [6]:
s = "aaa@gmail.com bbb@amazon.com ccc@walmart.com"

re.sub(r"([a-z]*)@", r"\1 19-@", s)
Out[6]:
'aaa 19-@gmail.com bbb 19-@amazon.com ccc 19-@walmart.com'
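
Besides numbered references such as \1, the replacement can refer to named groups or be computed by a function. The sketch below assumes a shortened version of the sample string above; the group names user and host are made up.

```python
import re

s = "aaa@gmail.com bbb@amazon.com"

# \g<name> refers to a named group in the replacement string.
swapped = re.sub(r"(?P<user>[a-z]+)@(?P<host>[a-z.]+)", r"\g<host>:\g<user>", s)

# A function can compute the replacement from each Match object.
upper_users = re.sub(r"([a-z]+)@", lambda m: m.group(1).upper() + "@", s)

print(swapped)
print(upper_users)
```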

re.split

In [16]:
re.split("[+-/*]", "a-b/c*d")
Out[16]:
['a', 'b', 'c', 'd']
In [17]:
re.split("[*+-/]", "a-b/c*d")
Out[17]:
['a', 'b', 'c', 'd']
In [18]:
re.split("[+*-/]", "a-b/c*d")
Out[18]:
['a', 'b', 'c', 'd']

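Inside [], a - placed between two characters is always a range operator, not a literal minus sign. The three patterns above only work by accident: +-/ and *-/ are valid ranges that happen to contain -, and as a side effect they also match , and . characters. The pattern below fails because +-* is an invalid range (+ comes after * in character order).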

In [19]:
re.split("[+-*/]", "a-b/c*d")
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-19-ea74b77d2b91> in <module>
----> 1 re.split("[+-*/]", "a-b/c*d")

/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
    229     and the remainder of the string is returned as the final element
    230     of the list."""
--> 231     return _compile(pattern, flags).split(string, maxsplit)
    232 
    233 def findall(pattern, string, flags=0):

/usr/lib/python3.8/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

/usr/lib/python3.8/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/usr/lib/python3.8/sre_parse.py in parse(str, flags, state)
    946 
    947     try:
--> 948         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    949     except Verbose:
    950         # the VERBOSE flag was switched on inside the pattern.  to be

/usr/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
    441     start = source.tell()
    442     while True:
--> 443         itemsappend(_parse(source, state, verbose, nested + 1,
    444                            not nested and not items))
    445         if not sourcematch("|"):

/usr/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
    596                     if hi < lo:
    597                         msg = "bad character range %s-%s" % (this, that)
--> 598                         raise source.error(msg, len(this) + 1 + len(that))
    599                     setappend((RANGE, (lo, hi)))
    600                 else:

error: bad character range +-* at position 1
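
To avoid the range ambiguity, a literal minus sign can be placed first or last in the character class, or escaped. A minimal sketch with a made-up expression string:

```python
import re

expr = "a-b/c*d+e"

# Three ways to include a literal minus sign in a character class.
hyphen_first = re.split(r"[-+*/]", expr)
hyphen_last = re.split(r"[+*/-]", expr)
hyphen_escaped = re.split(r"[+\-*/]", expr)

# "[+-/*]" contains the range from "+" to "/", so it also splits on "," and ".".
accidental = re.split(r"[+-/*]", "1.5,2")

print(hyphen_first)
print(accidental)
```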

re.match

In [20]:
re.match(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
Out[20]:
<re.Match object; span=(0, 10), match='2018-07-01'>
In [21]:
re.match(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")

re.search

In [22]:
import re

re.search(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
Out[22]:
<re.Match object; span=(0, 10), match='2018-07-01'>
In [23]:
import re

re.search(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")
Out[23]:
<re.Match object; span=(9, 19), match='2018-07-01'>
In [6]:
re.search(",", "ab,cd")
Out[6]:
<re.Match object; span=(2, 3), match=','>
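In [7]:
# A raw string is required here: in a normal string, "\b" is the backspace character.
re.search(r"\b,", "ab,cd")
Out[7]:
<re.Match object; span=(2, 3), match=','>
In [8]:
# No match: the position between "b" and "," is a word boundary.
re.search(r"\B,", "ab,cd")
In [ ]:
re.search(",", "ab ,cd")
In [ ]:
# No match: the position between " " and "," is not a word boundary.
re.search(r"\b,", "ab ,cd")
In [ ]:
# Matches the comma, since \B requires a position that is not a word boundary.
re.search(r"\B,", "ab ,cd")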
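
The relationship between re.match, re.search, and the ^ anchor can be summarized in a small sketch. The variable names are illustrative, and re.fullmatch is included only for comparison.

```python
import re

text = "Today is 2018-07-01."
pattern = r"\d{4}-\d{2}-\d{2}"

at_start = re.match(pattern, text)         # None: the date is not at the start
anywhere = re.search(pattern, text)        # finds the date inside the string
anchored = re.search("^" + pattern, text)  # None: behaves like re.match here

# re.fullmatch additionally requires the match to reach the end of the string.
whole = re.fullmatch(pattern, "2018-07-01")
partial = re.fullmatch(pattern, "2018-07-01.")

print(at_start, anchored, partial)
print(anywhere.span(), whole.group(0))
```

Note that with the re.MULTILINE flag, ^ also matches at the start of each line, so the equivalence between re.match and an anchored re.search only holds without that flag.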

re.Match.group / re.Match.groups

Matched strings in parentheses can be accessed using the method Match.group or Match.groups.

In [17]:
m = re.search(r"(\d{4}-\d{2}-\d{2})", "Today is 2018-07-01.")
m
Out[17]:
<re.Match object; span=(9, 19), match='2018-07-01'>
In [18]:
m.groups()
Out[18]:
('2018-07-01',)
In [19]:
m.group(0)
Out[19]:
'2018-07-01'
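
When a pattern has several groups, they can be accessed by position or by name. The following is a minimal sketch; the group names year, month, and day are chosen here for illustration.

```python
import re

m = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    "Today is 2018-07-01.",
)

whole = m.group(0)        # the entire match
year = m.group(1)         # first group, by position
month = m.group("month")  # group by name
parts = m.groups()        # all groups as a tuple
named = m.groupdict()     # named groups as a dict

print(whole, year, month)
print(parts)
print(named)
```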

re.findall

Find all matched strings.

In [5]:
import re

s = 'It is "a" good "day" today.'
re.findall('".*?"', s)
Out[5]:
['"a"', '"day"']
In [12]:
s = """this is 
    /* BEGIN{NIMA}
    what 
    ever
    END{NIMA} */
    an example
    """
re.findall(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", s)
Out[12]:
['/* BEGIN{NIMA}\n    what \n    ever\n    END{NIMA} */']
In [13]:
sql = """
    select ${cal_dt}, ${path} from some_table
    """
re.findall(r"\$\{\w+\}", sql)
Out[13]:
['${cal_dt}', '${path}']
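
Two related behaviors are worth sketching: re.findall returns group contents when the pattern has a capturing group, and re.finditer yields Match objects that also carry positions. The one-line SQL string is a simplified variant of the example above.

```python
import re

sql = "select ${cal_dt}, ${path} from some_table"

# With one capturing group, findall returns the group contents, not the whole match.
names = re.findall(r"\$\{(\w+)\}", sql)

# finditer yields Match objects, so the span of each match is available too.
spans = [(m.group(1), m.span()) for m in re.finditer(r"\$\{(\w+)\}", sql)]

print(names)
print(spans)
```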

The OR Operator |

  1. A|B matches A or B, where A and B can be any regular expressions. | has the lowest precedence of all operators, so it is not necessary to put A and B into parentheses (groups) when they are multi-character regular expressions. (A)|(B) matches the same strings as A|B for any valid regular expressions, although the parentheses additionally create capture groups.

  2. | can be used in groups.

In [21]:
re.search("ab|bcd", "abcd")
Out[21]:
<re.Match object; span=(0, 2), match='ab'>
In [24]:
re.search("a(b|b)cd", "abcd")
Out[24]:
<re.Match object; span=(0, 4), match='abcd'>
In [23]:
re.search("(ab|bc)d", "abcd")
Out[23]:
<re.Match object; span=(1, 4), match='bcd'>
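
The precedence and grouping behavior of | can be illustrated with a few more cases. This is a sketch with made-up patterns, not part of the original notebook.

```python
import re

# | has the lowest precedence: "gr(a|e)y" is not the same as "gra|ey".
grouped = re.fullmatch(r"gr(a|e)y", "grey")
ungrouped = re.fullmatch(r"gra|ey", "grey")

# Alternatives are tried left to right, and the first one that succeeds wins.
first_wins = re.search(r"ab|abcd", "abcd")

# Parentheses around the alternatives match the same text but add capture groups;
# (?:...) groups without capturing.
capturing = re.search(r"(ab)|(cd)", "cd")
non_capturing = re.search(r"(?:ab)|(?:cd)", "cd")

print(grouped.group(1), ungrouped)
print(first_wins.group(0))
print(capturing.groups(), non_capturing.groups())
```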

Lookahead and Lookbehind

Lookahead and lookbehind provide ways of matching patterns without consuming them. This is extremely useful when you want to split a string on a delimiter but keep the delimiter. Note that splitting on such zero-width matches with re.split requires Python 3.7 or later.

Split a string into lines but keep the trailing \n.

In [1]:
s = "line 1\nline 2\nline 3"
re.split("(?<=\n)", s)
Out[1]:
['line 1\n', 'line 2\n', 'line 3']

Split a string into lines but keep \n in the beginning of each line.

In [2]:
s = "line 1\nline 2\nline 3"
re.split("(?=\n)", s)
Out[2]:
['line 1', '\nline 2', '\nline 3']
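
The same idea applies beyond newlines. Here is a small sketch with made-up inputs, assuming Python 3.7 or later for the zero-width split.

```python
import re

# Split before each uppercase letter, keeping the letter with the word that follows.
words = re.split(r"(?=[A-Z])", "splitCamelCaseWords")

# Lookbehind keeps the punctuation at the end of each piece; only the space is consumed.
sentences = re.split(r"(?<=[.!?]) ", "One. Two! Three?")

# Lookahead can also require context without including it in the match.
prices = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

print(words)
print(sentences)
print(prices)
```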

Escape & Non-escape

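{ and } do not need to be escaped when they do not form a valid quantifier such as {2} or {1,3}. For example, {NIMA} in the patterns above is matched literally.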
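
A quick way to see when braces are literal and when they act as a quantifier is sketched below. The sample strings are made up for illustration.

```python
import re

# Braces that do not form a quantifier are matched literally.
literal = re.search(r"BEGIN{NIMA}", "/* BEGIN{NIMA} */")

# Braces that form a quantifier such as {2} repeat the preceding item instead.
quantifier = re.findall(r"a{2}", "a{2} aa")

# Escaping the braces forces a literal match.
escaped = re.findall(r"a\{2\}", "a{2} aa")

print(literal.group(0))
print(quantifier, escaped)
```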
