Online Regular Expression Tester
The Python module re automatically compiles a plain-text pattern using re.compile and caches it, so there is not much benefit in compiling plain-text patterns yourself.
Some regular expression patterns are defined using a single leading backslash, e.g., \s, \b, etc. However, since special characters (e.g., \) need to be escaped in strings in most programming languages, you will need the string "\\s" to represent the regular expression pattern \s, and similarly for other regular expression patterns with a leading backslash. Python is special in that it provides raw strings (in which a backslash is not an escape character) to make it easier to write regular expression patterns. Python also leaves unrecognized escape sequences in ordinary strings unchanged, so "\s" ends up equal to "\\s" (recent Python versions emit a warning for this), while recognized escapes such as "\b" (a backspace character) are still converted. I personally dislike this behavior very much as it causes confusion, especially when you call other languages (e.g., Spark SQL) to perform regular expression operations from Python. It is suggested that you use raw strings as much as possible to write regular expressions in Python.
It becomes tricky if you call another programming language to perform regular expression operations from Python. Taking \s for example, since \ needs to be escaped in both programming languages, you will end up using "\\\\s" to represent \s. Luckily, you can use raw strings to simplify things. For example, instead of "\\\\s", you can use r"\\s" in Python.
The regular expression modifier (?i) turns on case-insensitive matching.
re.match looks for a match only at the beginning of the string while re.search looks for a match anywhere in the string. Since the regular expression symbol ^ stands for the beginning of a string, you can prefix your regular expression with ^ to make re.search look for a match only at the beginning of the string. To sum up, re.search is more flexible than re.match and it is suggested that you always use re.search instead of re.match.
Passing re.DOTALL to the flags argument makes the dot (.) match anything including a newline (by default the dot does not match a newline).
re.search searches for the first match anywhere in the string. re.match searches for a match at the beginning of the string. re.findall finds all matches in the string. re.finditer finds all matches and returns an iterator over them.
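The tips above can be checked with a few lines of code. The following is only a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# In an ordinary string the backslash must be doubled; a raw string avoids that.
assert "\\s" == r"\s"

# Remove every whitespace character.
no_spaces = re.sub(r"\s", "", "a b\tc")
print(no_spaces)  # abc

# "\b" in an ordinary string is a backspace character, not the regex word boundary.
print(len("\b"), len(r"\b"))  # 1 2

# (?i) turns on case-insensitive matching for the pattern.
ci = re.search(r"(?i)python", "I like PYTHON")
print(ci.group(0))  # PYTHON

# re.match only matches at the beginning; re.search with ^ behaves the same way.
text = "Today is 2018-07-01."
print(re.match(r"\d{4}", text))            # None
print(re.search(r"^\d{4}", text))          # None
print(re.search(r"\d{4}", text).group(0))  # 2018

# By default the dot does not match a newline; re.DOTALL changes that.
two_lines = "first\nsecond"
print(re.search(r"first.second", two_lines))  # None
dotall = re.search(r"first.second", two_lines, flags=re.DOTALL)
print(dotall.group(0) == two_lines)  # True
```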
import re
re.compile¶
The compiled object is of type re.Pattern
and has methods search, match, sub, findall, finditer, etc.
p = re.compile(r"\d{4}-\d{2}-\d{2}$")
type(p)
re.Pattern
[mem for mem in dir(p) if not mem.startswith("_")]
['findall',
'finditer',
'flags',
'fullmatch',
'groupindex',
'groups',
'match',
'pattern',
'scanner',
'search',
'split',
'sub',
'subn']
re.sub¶
re.sub(r"\d{4}-\d{2}-\d{2}$", "YYYY-mm-dd", "Today is 2018-05-02")
'Today is YYYY-mm-dd'
re.sub(r"\s", "", "a b\tc")
'abc'
s = """this is
/* BEGIN{NIMA}
what
ever
END{NIMA} */
an example
"""
print(re.sub(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", "", s))
this is
an example
Make sure there is a space and only one space after a comma.
re.sub(", *", ", ", "ab,cd")
'ab, cd'
re.sub(", *", ", ", "ab, cd")
'ab, cd'
Substitute using parts of matched patterns.
s = "aaa@gmail.com bbb@amazon.com ccc@walmart.com"
re.sub(r"([a-z]*)@", r"\1 19-@", s)
'aaa 19-@gmail.com bbb 19-@amazon.com ccc 19-@walmart.com'
re.split¶
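re.split("[+-/*]", "a-b/c*d")
['a', 'b', 'c', 'd']
re.split("[*+-/]", "a-b/c*d")
['a', 'b', 'c', 'd']
re.split("[+*-/]", "a-b/c*d")
['a', 'b', 'c', 'd']
Inside [], a - between two characters is a range operator rather than a literal minus sign. The patterns above work because the ranges +-/ and *-/ happen to include the minus sign (they also include the comma and the dot). The pattern [+-*/] fails because +-* is parsed as a range from + to *, and + comes after * in character order, so the range is invalid. To match a literal minus sign reliably, put - first or last inside the brackets or escape it as \-.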
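The range behavior described above can be checked directly. This is a minimal sketch, assuming the same sample string as the examples; the variable names are made up for illustration.

```python
import re

text = "a-b/c*d"

# Hyphen placed last: it is a literal minus sign, not a range operator.
parts_last = re.split(r"[+*/-]", text)
# Hyphen escaped: also a literal minus sign, wherever it appears.
parts_escaped = re.split(r"[+\-*/]", text)
print(parts_last)     # ['a', 'b', 'c', 'd']
print(parts_escaped)  # ['a', 'b', 'c', 'd']

# The range +-/ covers the characters + , - . / so it also splits on commas and dots.
side_effect = re.split(r"[+-/*]", "1.5,2")
print(side_effect)  # ['1', '5', '2']

# A reversed range such as +-* is rejected when the pattern is compiled.
try:
    re.compile(r"[+-*/]")
    range_error = None
except re.error as exc:
    range_error = str(exc)
print(range_error)
```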
re.split("[+-*/]", "a-b/c*d")
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-19-ea74b77d2b91> in <module>
----> 1 re.split("[+-*/]", "a-b/c*d")
/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
229 and the remainder of the string is returned as the final element
230 of the list."""
--> 231 return _compile(pattern, flags).split(string, maxsplit)
232
233 def findall(pattern, string, flags=0):
/usr/lib/python3.8/re.py in _compile(pattern, flags)
302 if not sre_compile.isstring(pattern):
303 raise TypeError("first argument must be string or compiled pattern")
--> 304 p = sre_compile.compile(pattern, flags)
305 if not (flags & DEBUG):
306 if len(_cache) >= _MAXCACHE:
/usr/lib/python3.8/sre_compile.py in compile(p, flags)
762 if isstring(p):
763 pattern = p
--> 764 p = sre_parse.parse(p, flags)
765 else:
766 pattern = None
/usr/lib/python3.8/sre_parse.py in parse(str, flags, state)
946
947 try:
--> 948 p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
949 except Verbose:
950 # the VERBOSE flag was switched on inside the pattern. to be
/usr/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
441 start = source.tell()
442 while True:
--> 443 itemsappend(_parse(source, state, verbose, nested + 1,
444 not nested and not items))
445 if not sourcematch("|"):
/usr/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
596 if hi < lo:
597 msg = "bad character range %s-%s" % (this, that)
--> 598 raise source.error(msg, len(this) + 1 + len(that))
599 setappend((RANGE, (lo, hi)))
600 else:
error: bad character range +-* at position 1
re.match¶
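re.match(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
<re.Match object; span=(0, 10), match='2018-07-01'>
re.match(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")  # no match: the date is not at the beginning of the string
re.search(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")
<re.Match object; span=(0, 10), match='2018-07-01'>
re.search(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")
<re.Match object; span=(9, 19), match='2018-07-01'>
re.search(",", "ab,cd")
<re.Match object; span=(2, 3), match=','>
# Use raw strings for \b: in an ordinary string, "\b" is a backspace character, not a word boundary.
re.search(r"\b,", "ab,cd")  # matches: there is a word boundary between "b" and ","
re.search(r"\B,", "ab,cd")  # no match
re.search(",", "ab ,cd")  # matches
re.search(r"\b,", "ab ,cd")  # no match: a space followed by "," is not a word boundary
re.search(r"\B,", "ab ,cd")  # matches
re.Match.group / re.Match.groups¶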
Matched strings in parentheses can be accessed using the method Match.group or Match.groups.
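As a rough sketch of how group and groups relate when a pattern has more than one pair of parentheses (the three-group pattern and the group names below are made up for illustration):

```python
import re

# Hypothetical pattern: year, month and day captured in separate groups.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Today is 2018-07-01.")

print(m.group(0))     # 2018-07-01 (the whole match)
print(m.group(1))     # 2018 (the first parenthesized group)
print(m.group(2, 3))  # ('07', '01')
print(m.groups())     # ('2018', '07', '01')

# Named groups, written (?P<name>...), can be read by name.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Today is 2018-07-01.")
print(named.group("year"))  # 2018
print(named.groupdict())    # {'year': '2018', 'month': '07'}
```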
m = re.search(r"(\d{4}-\d{2}-\d{2})", "Today is 2018-07-01.")
m
<re.Match object; span=(9, 19), match='2018-07-01'>
m.groups()
('2018-07-01',)
m.group(0)
'2018-07-01'
re.findall¶
Find all matched strings.
import re
s = 'It is "a" good "day" today.'
re.findall('".*?"', s)
['"a"', '"day"']
s = """this is
/* BEGIN{NIMA}
what
ever
END{NIMA} */
an example
"""
re.findall(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", s)
['/* BEGIN{NIMA}\n what \n ever\n END{NIMA} */']
sql = """
select ${cal_dt}, ${path} from some_table
"""
re.findall(r"\$\{\w+\}", sql)
['${cal_dt}', '${path}']
The OR Operator |¶
A|B matches A or B, where A and B can be any regular expressions. Notice that it is not necessary to put A and B into parentheses (groups) when they are multi-character regular expressions: (A)|(B) matches the same strings as A|B for any valid regular expressions A and B (the parentheses only add capture groups). | can also be used inside groups.
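One consequence worth checking is that | binds more loosely than concatenation, so anchors and neighboring characters belong to a single alternative unless a group is used. A minimal sketch, with made-up sample strings and variable names:

```python
import re

# ^ab|cd$ is read as (^ab)|(cd$): alternation binds more loosely than concatenation.
loose = re.search(r"^ab|cd$", "xxcd")
print(loose.group(0))  # cd

# Grouping the alternatives makes the anchors apply to both of them.
strict = re.search(r"^(ab|cd)$", "xxcd")
print(strict)  # None

# A|B and (A)|(B) match the same text; the parentheses only add capture groups.
plain = re.search(r"ab|bcd", "abcd")
grouped = re.search(r"(ab)|(bcd)", "abcd")
print(plain.group(0), grouped.group(0))  # ab ab
print(plain.groups(), grouped.groups())  # () ('ab', None)
```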
re.search("ab|bcd", "abcd")
<re.Match object; span=(0, 2), match='ab'>
re.search("a(b|b)cd", "abcd")
<re.Match object; span=(0, 4), match='abcd'>
re.search("(ab|bc)d", "abcd")
<re.Match object; span=(1, 4), match='bcd'>
Lookahead and Lookbehind¶
Lookahead and lookbehind provide ways of matching patterns without consuming them. This is extremely useful when you want to split a string on a delimiter but keep the delimiter.
Split a string into lines but keep the trailing \n.
s = "line 1\nline 2\nline 3"
re.split("(?<=\n)", s)
['line 1\n', 'line 2\n', 'line 3']
Split a string into lines but keep \n at the beginning of each line.
s = "line 1\nline 2\nline 3"
re.split("(?=\n)", s)
['line 1', '\nline 2', '\nline 3']
Escape & Non-escape¶
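{ and } do not need to be escaped when they cannot be read as a repetition quantifier such as {2} or {1,3}. For example, BEGIN{NIMA} matches the literal text BEGIN{NIMA}.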
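The difference between literal braces and quantifier braces can be sketched as follows. The sample strings are made up for illustration, and re.escape is shown as one possible way to avoid escaping by hand.

```python
import re

# {NIMA} cannot be a quantifier, so the braces are matched literally.
literal = re.search(r"BEGIN{NIMA}", "/* BEGIN{NIMA} */")
print(literal.group(0))  # BEGIN{NIMA}

# {2} is a quantifier: "a{2}" means two consecutive "a" characters.
as_quantifier = re.findall(r"a{2}", "a{2} aa")
print(as_quantifier)  # ['aa']

# Escaping the braces matches the literal text "a{2}" instead.
as_literal = re.findall(r"a\{2\}", "a{2} aa")
print(as_literal)  # ['a{2}']

# re.escape escapes every special character in arbitrary text.
escaped = re.escape("${cal_dt}")
found = re.findall(escaped, "select ${cal_dt} from some_table")
print(found)  # ['${cal_dt}']
```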