Online Regular Expression Tester
The Python module re automatically compiles a plain-text pattern using re.compile and caches it, so there is little benefit in compiling plain-text patterns yourself.
Some regular expression patterns are written with a single leading backslash, e.g., \s, \b, etc. However, since special characters (e.g., \) need to be escaped in string literals in most programming languages, you need the string "\\s" to represent the regular expression pattern \s, and similarly for other patterns with a leading backslash. Python is special in that it provides raw strings (no escape processing) to make regular expression patterns easier to write. Python also leaves unrecognized escape sequences in ordinary strings untouched, so "\s" ends up as the same two characters as "\\s". Recognized escapes are still converted, though ("\b" is a backspace character, not the pattern \b), and Python 3.6 and later emit a warning for invalid escape sequences such as "\s". I personally dislike this behavior very much as it causes confusion, especially when you call other languages (e.g., Spark SQL) to perform regular expression operations from Python. It is suggested that you use raw strings as much as possible to write regular expressions in Python.
It becomes tricky if you call another programming language to perform regular expression operations from Python. Taking \s for example, since \ needs to be escaped in both programming languages, you end up using "\\\\s" to represent \s. Luckily, you can use raw strings to simplify things. For example, instead of "\\\\s", you can use r"\\s" in Python.
The regular expression modifier (?i) turns on case-insensitive matching.
re.match looks for a match only at the beginning of the string while re.search looks for a match anywhere in the string. Since the regular expression symbol ^ stands for the beginning of a string, you can prefix your regular expression with ^ to make re.search look for a match only at the beginning of the string. To sum up, re.search is more flexible than re.match and it is suggested that you always use re.search instead of re.match.
Passing re.DOTALL to the flags argument makes the dot (.) match any character including a newline (by default the dot does not match a newline).
re.search searches for the first match anywhere in the string. re.match searches for a match at the beginning of the string. re.findall finds all matches in the string. re.finditer finds all matches and returns an iterator over them.
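The notes above can be checked with a small sketch. The sample strings and variable names here are made up for illustration and are not from the original notebook.

```python
import re

# "\\s" (escaped) and r"\s" (raw) are the same two-character string.
same_pattern = "\\s" == r"\s"

# The inline modifier (?i) and the flag re.IGNORECASE behave alike here.
inline_hit = re.search(r"(?i)hello", "Say HELLO")
flag_hit = re.search(r"hello", "Say HELLO", flags=re.IGNORECASE)

# re.match only looks at the start; re.search scans the whole string.
match_result = re.match(r"\d+", "abc 123")    # no match at position 0
search_result = re.search(r"\d+", "abc 123")  # finds "123"
anchored = re.search(r"^\d+", "abc 123")      # ^ makes search act like match

# Without re.DOTALL the dot does not match a newline.
default_dot = re.search(r"a.*c", "a\nc")
dotall_dot = re.search(r"a.*c", "a\nc", flags=re.DOTALL)
```

Here match_result, anchored and default_dot are None, while the other searches succeed.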
import re
re.compile
The compiled object is of type re.Pattern
and has methods search, match, sub, findall, finditer, etc.
p = re.compile(r"\d{4}-\d{2}-\d{2}$")
type(p)
[mem for mem in dir(p) if not mem.startswith("_")]
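As a rough illustration of when a compiled pattern is convenient, the sketch below reuses one pattern object across several strings. The name DATE_RE and the sample lines are hypothetical.

```python
import re

# Compile once, then reuse the pattern object's methods.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = ["released 2018-05-02", "no date here", "2018-07-01 and 2019-01-15"]

# search returns the first match per line (or None).
first_dates = []
for line in lines:
    m = DATE_RE.search(line)
    if m is not None:
        first_dates.append(m.group(0))

# findall returns every match in a line.
all_dates = [d for line in lines for d in DATE_RE.findall(line)]
```

This is mostly about readability; as noted earlier, module-level functions such as re.search cache compiled patterns anyway.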
re.sub
re.sub(r"\d{4}-\d{2}-\d{2}$", "YYYY-mm-dd", "Today is 2018-05-02")
re.sub(r"\s", "", "a b\tc")
s = """this is
/* BEGIN{NIMA}
what
ever
END{NIMA} */
an example
"""
print(re.sub(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", "", s))
Make sure there is a space and only one space after a comma.
re.sub(", *", ", ", "ab,cd")
re.sub(", *", ", ", "ab, cd")
Substitute using parts of matched patterns.
s = "aaa@gmail.com bbb@amazon.com ccc@walmart.com"
re.sub(r"([a-z]*)@", r"\1 19-@", s)
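Besides numbered references such as \1, re.sub also accepts named-group references and a function as the replacement. The following is a minimal sketch; the group names user and host and the helper upper_user are chosen here only for illustration.

```python
import re

s = "aaa@gmail.com bbb@amazon.com"

# \g<name> refers to a named group inside the replacement string.
swapped = re.sub(
    r"(?P<user>[a-z]+)@(?P<host>[a-z.]+)",
    r"\g<host>:\g<user>",
    s,
)

# A function replacement receives the Match object and returns a string.
def upper_user(m):
    return m.group(1).upper() + "@"

shouted = re.sub(r"([a-z]+)@", upper_user, s)
```

A function replacement is handy when the new text cannot be expressed as a fixed template.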
re.split
re.split("[+-/*]", "a-b/c*d")
re.split("[*+-/]", "a-b/c*d")
re.split("[+*-/]", "a-b/c*d")
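A - between two characters inside [] is always read as a range operator. The patterns above work only because the ranges +-/ and *-/ happen to include the minus sign (along with , and .). The range +-* is invalid because + sorts after *, so the pattern below raises an error. To match a literal minus sign, put - first or last inside the brackets or escape it as \-.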
re.split("[+-*/]", "a-b/c*d")
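A short sketch of the safer ways to include a literal minus sign in a character class, and of the accidental range described above. The input strings are invented for the example.

```python
import re

# Hyphen placed last: it is a literal minus sign.
ops_end = re.split(r"[+*/-]", "a-b/c*d+e")

# Hyphen escaped: also a literal minus sign, position does not matter.
ops_escaped = re.split(r"[+\-*/]", "a-b/c*d+e")

# [+-/] is the range from "+" to "/", which also contains "," and ".".
accidental = re.split(r"[+-/]", "1.5,2")

# [+-*/] is rejected because the range +-* runs backwards.
try:
    re.compile("[+-*/]")
    bad_range_raised = False
except re.error:
    bad_range_raised = True
```

The accidental variable shows why relying on the range is risky: the string is split on the dot and the comma even though neither was intended as a delimiter.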
re.match
re.match(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
re.match(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")
re.search
re.search(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
re.search(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")
re.search(",", "ab,cd")
re.search(r"\b,", "ab,cd")
re.search(r"\B,", "ab,cd")
re.search(",", "ab ,cd")
re.search(r"\b,", "ab ,cd")
re.search(r"\B,", "ab ,cd")
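The word-boundary examples are a case where raw strings matter: in an ordinary string literal "\b" is the backspace character, so the pattern never sees a word boundary. A minimal sketch (variable names are illustrative):

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08).
is_backspace = "\b" == "\x08"

# Non-raw: searches for backspace followed by a comma, so nothing is found.
non_raw = re.search("\b,", "ab,cd")

# Raw: \b is a word boundary, which exists between "b" and ",".
boundary = re.search(r"\b,", "ab,cd")

# With a space before the comma there is no word boundary at the comma,
# so \b fails and \B (not a word boundary) succeeds.
no_boundary = re.search(r"\b,", "ab ,cd")
non_boundary = re.search(r"\B,", "ab ,cd")
```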
re.Match.group / re.Match.groups
Matched strings in parentheses can be accessed using the method Match.group or Match.groups.
m = re.search(r"(\d{4}-\d{2}-\d{2})", "Today is 2018-07-01.")
m
m.groups()
m.group(0)
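With more than one group, the difference between group and groups becomes clearer. The sketch below uses named groups; the names year, month and day are an assumption made for this example.

```python
import re

m = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    "Today is 2018-07-01.",
)

whole = m.group(0)        # the entire match
parts = m.groups()        # tuple of all captured groups
year = m.group("year")    # a single group by name
second = m.group(2)       # a single group by position
as_dict = m.groupdict()   # named groups as a dictionary
```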
re.findall
Find all matched strings.
s = 'It is "a" good "day" today.'
re.findall('".*?"', s)
s = """this is
/* BEGIN{NIMA}
what
ever
END{NIMA} */
an example
"""
re.findall(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", s)
sql = """
select ${cal_dt}, ${path} from some_table
"""
re.findall(r"\$\{\w+\}", sql)
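Two details of re.findall are easy to miss: the non-greedy .*? used above matters, and a capture group changes what is returned. A small sketch reusing the strings from the examples above (variable names are illustrative):

```python
import re

s = 'It is "a" good "day" today.'

lazy = re.findall(r'".*?"', s)     # shortest possible quoted pieces
greedy = re.findall(r'".*"', s)    # one match from first to last quote
inner = re.findall(r'"(.*?)"', s)  # with a group, only the group is returned

sql = "select ${cal_dt}, ${path} from some_table"
names = re.findall(r"\$\{(\w+)\}", sql)

# finditer yields Match objects, so positions are available too.
first_start = next(re.finditer(r"\$\{\w+\}", sql)).start()
```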
The OR Operator |
A|B matches A or B, where A and B can be any regular expressions. Note that it is not necessary to put A and B in parentheses (groups) when they are multi-character regular expressions: | has the lowest precedence, so (A)|(B) matches the same strings as A|B for any (valid) regular expressions (the parentheses only add capture groups). | can also be used inside groups.
re.search("ab|bcd", "abcd")
re.search("a(b|b)cd", "abcd")
re.search("(ab|bc)d", "abcd")
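A brief sketch of how alternation behaves in the searches above, plus one extra case showing that alternatives are tried from left to right. Variable names are illustrative.

```python
import re

# "ab|bcd" means "ab" or "bcd"; the leftmost match in "abcd" is "ab".
plain = re.search(r"ab|bcd", "abcd").group(0)

# Inside a group, the alternation is limited to the group.
grouped = re.search(r"(ab|bc)d", "abcd")
grouped_whole = grouped.group(0)
grouped_inner = grouped.group(1)

# Alternatives are tried in order, so the first one that works wins,
# even if a later alternative would match more text.
first_wins = re.search(r"a|ab", "ab").group(0)
```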
Lookahead and Lookbehind
Lookahead and lookbehind provide ways to match patterns without consuming them. This is extremely useful when you want to split a string on a delimiter but keep the delimiter.
Split a string into lines but keep the trailing \n.
s = "line 1\nline 2\nline 3"
re.split("(?<=\n)", s)
Split a string into lines but keep \n in the beginning of each line.
s = "line 1\nline 2\nline 3"
re.split("(?=\n)", s)
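For reference, here is a sketch of what the two splits return, together with one extra lookahead used as a filter. Splitting on an empty match assumes Python 3.7 or later; the currency string is invented for the example.

```python
import re

s = "line 1\nline 2\nline 3"

# Lookbehind: split right after each newline, so "\n" stays at the end.
trailing = re.split(r"(?<=\n)", s)

# Lookahead: split right before each newline, so "\n" starts the next piece.
leading = re.split(r"(?=\n)", s)

# Lookahead as a filter: digits only when followed by " USD".
# The " USD" part is checked but not included in the match.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
```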
Escape & Non-escape
{ and } do not need to be escaped when they are meant literally and cannot be read as a repetition quantifier such as {2} or {1,3}.
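A small sketch of when braces are literal and when they form a quantifier. When in doubt, escaping them (or using re.escape on literal text) is a safe choice; the sample strings are made up.

```python
import re

# Not a valid quantifier, so the braces are matched literally.
literal = re.search(r"BEGIN{NIMA}", "/* BEGIN{NIMA}").group(0)

# {2} is a quantifier: "a" repeated twice.
quantified = re.search(r"a{2}", "caab").group(0)

# Escaped braces are always literal.
escaped = re.search(r"a\{2\}", "see a{2} here").group(0)

# re.escape builds a pattern that matches the given text literally.
template = "${x}"
round_trip = re.fullmatch(re.escape(template), template) is not None
```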