Online Regular Expression Tester
The Python module re automatically compiles a plain-text pattern with re.compile and caches the result, so there is little benefit in compiling plain-text patterns yourself.

Some regular expression patterns start with a single backslash, e.g., \s, \b, etc. However, since special characters such as \ need to be escaped in string literals in most programming languages, you need the string "\\s" to represent the pattern \s, and similarly for other patterns with a leading backslash. Python is special in that it provides raw strings (in which backslashes are not treated as escapes) to make regular expression patterns easier to write. Python is also lenient about improperly escaped strings: an unrecognized escape sequence such as "\s" is left unchanged, so it behaves the same as "\\s" (recent Python versions emit a warning for it). I personally dislike this behavior very much, as it causes confusion, especially when you call other languages (e.g., Spark SQL) from Python to perform regular expression operations. Note that the leniency covers only unrecognized escapes: "\b" is a backspace character, not the pattern \b. Use raw strings as much as possible when writing regular expressions in Python.

Things get tricky when you call another programming language from Python to perform regular expression operations. Taking \s for example, since \ needs to be escaped in both languages, you end up writing "\\\\s" to represent \s. Luckily, raw strings simplify things: instead of "\\\\s", you can use r"\\s" in Python.

The regular expression modifier (?i) turns on case-insensitive matching.

re.match looks for a match only at the beginning of the string, while re.search looks for a match anywhere in the string. Since the regular expression symbol ^ stands for the beginning of a string, you can prefix your pattern with ^ to make re.search look for a match only at the beginning of the string. To sum up, re.search is more flexible than re.match, and it is suggested that you always use re.search instead of re.match.

Passing re.DOTALL to the flags argument makes the dot (.) match any character, including a newline (by default the dot does not match a newline).

re.search searches for the first match anywhere in the string.
re.match searches for a match at the beginning of the string.
re.findall finds all matches in the string.
re.finditer finds all matches and returns an iterator over them.
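The points above can be checked with a short sketch. The sample strings and variable names are made up for illustration; only standard re functions are used.

```python
import re

# A raw string keeps backslashes as-is, so r"\s" and "\\s" are the same pattern.
assert r"\s" == "\\s"
# With a second layer of escaping, r"\\s" is the same string as "\\\\s".
assert r"\\s" == "\\\\s"

text = "Today is 2018-07-01."
date = r"\d{4}-\d{2}-\d{2}"

# re.match only tries the start of the string; re.search scans all of it.
at_start = re.match(date, text)          # None: the string starts with "Today"
anywhere = re.search(date, text)         # finds "2018-07-01"
anchored = re.search("^" + date, text)   # None: ^ pins the search to the start

# (?i) at the start of the pattern turns on case-insensitive matching.
ci = re.search(r"(?i)today", text)

# By default the dot does not match a newline; re.DOTALL lets it do so.
two_lines = "first\nsecond"
no_flag = re.search(r"first.second", two_lines)
with_flag = re.search(r"first.second", two_lines, flags=re.DOTALL)

print(at_start, anywhere.group(0), anchored, ci.group(0), no_flag)
print(repr(with_flag.group(0)))
```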
import re
re.compile¶
The compiled object is of type re.Pattern and has methods search, match, sub, findall, finditer, etc.
p = re.compile(r"\d{4}-\d{2}-\d{2}$")
type(p)
[mem for mem in dir(p) if not mem.startswith("_")]
re.sub¶
re.sub(r"\d{4}-\d{2}-\d{2}$", "YYYY-mm-dd", "Today is 2018-05-02")
re.sub(r"\s", "", "a b\tc")
s = """this is
/* BEGIN{NIMA}
what
ever
END{NIMA} */
an example
"""
print(re.sub(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", "", s))
Make sure there is a space and only one space after a comma.
re.sub(", *", ", ", "ab,cd")
re.sub(", *", ", ", "ab, cd")
Substitute using parts of matched patterns.
s = "aaa@gmail.com bbb@amazon.com ccc@walmart.com"
re.sub(r"([a-z]*)@", r"\1 19-@", s)
re.split¶
re.split("[+-/*]", "a-b/c*d")
re.split("[*+-/]", "a-b/c*d")
re.split("[+*-/]", "a-b/c*d")
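Inside [], a - between two characters is a range operator. The patterns above work only because +-/ and *-/ happen to be valid ranges that contain -. In the pattern below, +-* is an invalid range because + comes after * in character order, so re raises an error. To match a literal -, put it first or last inside the brackets, or escape it as \-.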
re.split("[+-*/]", "a-b/c*d")
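As a minimal sketch of the safer spellings, the hyphen can go first, last, or be escaped so that it is never read as a range operator. The variable names here are illustrative only.

```python
import re

expr = "a-b/c*d"

# Hyphen first, hyphen last, or escaped: always a literal minus sign.
first = re.split(r"[-+*/]", expr)
last = re.split(r"[+*/-]", expr)
escaped = re.split(r"[+\-*/]", expr)
print(first, last, escaped)

# "+-*" is read as a range whose start (+) sorts after its end (*).
try:
    re.compile("[+-*/]")
    error = None
except re.error as exc:
    error = str(exc)
print("invalid pattern:", error)

# "[+-/*]" compiles, but the range from + to / also covers "," and ".".
accidental = re.split("[+-/*]", "a,b.c")
print(accidental)
```

The last call shows why the patterns that happen to work are still risky: they split on characters that were never meant to be delimiters.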
re.match¶
re.match(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
re.match(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")
re.search¶
import re
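re.search(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
re.search(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")
re.search(",", "ab,cd")
# Raw strings matter here: in a normal string, "\b" is a backspace character.
re.search(r"\b,", "ab,cd")   # match: word boundary between "b" and ","
re.search(r"\B,", "ab,cd")   # None: there is a boundary before ","
re.search(",", "ab ,cd")
re.search(r"\b,", "ab ,cd")  # None: no boundary between " " and ","
re.search(r"\B,", "ab ,cd")  # match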
re.Match.group / re.Match.groups¶
Strings matched by parenthesized groups can be accessed using the methods Match.group and Match.groups.
m = re.search(r"(\d{4}-\d{2}-\d{2})", "Today is 2018-07-01.")
m
m.groups()
m.group(0)
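With more than one pair of parentheses the difference between the two methods is easier to see. The following sketch splits the same date into three groups; the three-group pattern is an illustrative variation, not part of the original example.

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Today is 2018-07-01.")

whole = m.group(0)   # the entire match
year = m.group(1)    # the first parenthesized group
parts = m.groups()   # all groups as a tuple
print(whole, year, parts)
```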
re.findall¶
Find all matched strings.
import re
s = 'It is "a" good "day" today.'
re.findall('".*?"', s)
s = """this is
/* BEGIN{NIMA}
what
ever
END{NIMA} */
an example
"""
re.findall(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", s)
sql = """
select ${cal_dt}, ${path} from some_table
"""
re.findall(r"\$\{\w+\}", sql)
The OR Operator |¶
A|B matches A or B, where A and B can be any regular expressions. Note that it is not necessary to put A and B in parentheses (groups) when they are multi-character regular expressions: (A)|(B) matches the same strings as A|B for any valid regular expressions A and B (the parentheses only add capture groups). | can also be used inside groups.
re.search("ab|bcd", "abcd")
re.search("a(b|b)cd", "abcd")
re.search("(ab|bc)d", "abcd")
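A small sketch of what these searches return, assuming the same input string "abcd". The variable names are illustrative.

```python
import re

# The whole of "ab" and the whole of "bcd" are the two alternatives.
plain = re.search("ab|bcd", "abcd")
# Wrapping each side in parentheses matches the same text but adds groups.
wrapped = re.search("(ab)|(bcd)", "abcd")
# Parentheses limit the alternation to one part of the pattern.
grouped = re.search("(ab|bc)d", "abcd")

print(plain.group(0), plain.span())
print(wrapped.group(0), wrapped.groups())
print(grouped.group(0), grouped.groups())
```

In the second search, the group belonging to the alternative that did not take part in the match is reported as None.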
Lookahead and Lookbehind¶
Lookahead and lookbehind provide ways of matching patterns without consuming them. This is extremely useful when you want to split a string on a delimiter but keep the delimiter. Note that splitting on a zero-width pattern like this requires Python 3.7 or later.
Split a string into lines but keep the trailing \n.
s = "line 1\nline 2\nline 3"
re.split("(?<=\n)", s)
Split a string into lines but keep \n at the beginning of each line.
s = "line 1\nline 2\nline 3"
re.split("(?=\n)", s)
Escape & Non-escape¶
{ and } do not need to be escaped when they do not form a valid quantifier such as {2} or {1,3}.
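A minimal sketch of the difference, using made-up sample strings: braces that do not form a quantifier match themselves, braces that do form one are interpreted as a repeat count unless escaped.

```python
import re

# "{NIMA}" is not a valid quantifier, so the braces are literal.
literal = re.search("BEGIN{NIMA}", "/* BEGIN{NIMA} */")
# "{2}" is a valid quantifier: "a{2}" means two consecutive "a" characters.
quantified = re.search("a{2}", "baab")
# Escaping forces a literal reading even when the contents look like a quantifier.
escaped = re.search(r"a\{2\}", "a{2}")

print(literal.group(0), quantified.group(0), escaped.group(0))
```

Escaping braces that are meant literally, as in the ${cal_dt} example earlier, is harmless and makes the intent explicit.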