Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Regular Expression Equivalent

  1. The order of precedence of operators in POSIX extended regular expressions (from highest to lowest) is as follows.

    1. Collation-related bracket symbols [==], [::], [..]
    2. Escaped characters \
    3. Character set (bracket expression) []
    4. Grouping ()
    5. Single-character-ERE duplication *, +, ?, {m,n}
    6. Concatenation
    7. Anchoring ^, $
    8. Alternation |
  2. Some regular expression patterns are written with a single leading backslash, e.g., \s, \b, etc. However, since special characters (e.g., \) need to be escaped in string literals in most programming languages, you need the string "\\s" to represent the regular expression pattern \s, and similarly for other patterns with a leading backslash. Python is special in that it provides raw strings (in which backslashes are not treated as escapes) to make regular expression patterns easier to write. It also goes one step further and keeps unrecognized escape sequences such as "\s" unchanged instead of rejecting them, although recent Python versions emit a warning for them. For more discussion of Python regular expressions, please refer to Regular Expression in Python.

  3. It becomes tricky if you use one programming language to call another programming language to perform regular expression operations. Taking \s as an example, since \ needs to be escaped in both programming languages, you end up using "\\\\s" to represent \s. If you use Python to call other languages to perform regular expression operations, things can be simplified by using raw strings in Python. For example, instead of "\\\\s", you can use r"\\s" in Python.

  4. In some programming languages, you have to compile a plain-text pattern into a regular expression pattern object before using it. The Python module re automatically compiles a plain-text pattern (using re.compile) and caches it, so there is little benefit in compiling regular expressions yourself in Python.

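  5. \W matches exactly one non-word character, so it does not match the zero-width positions matched by the anchors ^ and $ (the start and the end of a line).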

  6. Regular expression modifiers make regular expressions more flexible and powerful. They are also more portable than remembering the different options of different programming languages or tools. It is suggested that you use regular expression modifiers when possible.

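  7. A word boundary (\b) is a zero-width position between a word character and a non-word character (or the start/end of the text). It occurs wherever a word touches white space (\s), but also next to punctuation and at the edges of the text, so it covers more cases than white space does.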

  8. [[:alnum:]] contains all letters and digits, while \w contains letters, digits, and also the underscore (_). In short, \w is a superset of [[:alnum:]].
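Several of the notes above can be checked quickly in Python. The following is a minimal sketch using only the standard re module. Note that re uses Perl-style syntax rather than POSIX ERE, so [A-Za-z0-9] stands in for [[:alnum:]] here (an ASCII-only assumption), and the sample strings are made up for illustration.

```python
import re

# Note 1: concatenation binds tighter than alternation,
# so "ab|cd" means "(ab)|(cd)", not "a(b|c)d".
assert re.fullmatch(r"ab|cd", "ab") is not None
assert re.fullmatch(r"ab|cd", "abd") is None
assert re.fullmatch(r"a(b|c)d", "abd") is not None

# Note 2: an escaped ordinary string and a raw string hold the same
# two characters: a backslash followed by "s".
plain = "\\s"
raw = r"\s"
assert plain == raw and len(raw) == 2

# Note 4: module-level functions compile (and cache) the pattern for you,
# so both forms give the same result.
compiled = re.compile(r"\d+")
assert compiled.findall("a1b22") == re.findall(r"\d+", "a1b22")

# Note 7: \b is a zero-width position, while \s consumes a character.
text = "foo,bar baz"
words_by_boundary = re.findall(r"\b\w+\b", text)  # punctuation also delimits words
words_by_space = re.split(r"\s", text)            # only white space delimits

# Note 8: \w accepts the underscore, an alphanumeric class does not.
underscore_is_word = re.fullmatch(r"\w", "_") is not None
underscore_is_alnum = re.fullmatch(r"[A-Za-z0-9]", "_") is not None
```

Here words_by_boundary splits at the comma as well as at the space, while words_by_space keeps "foo,bar" together.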

Tools compared, in column order: Vim search | Python | JavaScript | Teradata SQL | Oracle SQL | grep | sed
Each row below lists its entries in that order; a tool with no entry is skipped.

Modifiers: Partial[1] | Partial[1] | Full | No[2] | Full[3]
Greedy or not: Both[4]
Popular functions: re.search, re.sub | regexp_instr
White spaces: \s | "\\s" or r"\s" [5] | [[:blank:]] [[:space:]] | \s or [[:space:]] | [[:space:]] (recommended) or \s
Non-white space: \S | "\\S" or r"\S" | [[:blank:]] [[:space:]] | \S | [^[:space:]] or \S
Lower-case letters: [a-z] or \l | [a-z] | [a-z] | [a-z]
Non lower-case characters: [^a-z] or \L | [^a-z] | [^a-z] | [^a-z]
Upper-case letters: [A-Z] or \u | [A-Z] | [A-Z] | [A-Z]
Non upper-case characters: [^A-Z] or \U | [^A-Z] | [^A-Z] | [^A-Z]
Letters: [a-zA-Z] or \a | [a-zA-Z] | [a-zA-Z] | [a-zA-Z]
Non letters: [^a-zA-Z] or \A | [^a-zA-Z] | [^a-zA-Z] | [^a-zA-Z]
Digits: \d | "\\d" or r"\d" | [[:digit:]] | \d
Non digits: \D | "\\D" or r"\D" | [^[:digit:]] | \D
Hex digits: [0-9a-fA-F] or \x | [0-9a-fA-F] | [0-9a-fA-F] | [0-9a-fA-F]
Non-hex digit characters: [^0-9a-fA-F] or \X | [^0-9a-fA-F] | [^0-9a-fA-F] | [^0-9a-fA-F]
Octal digits: [0-7] or \o | [0-7] | [0-7] | [0-7]
Non-octal digit characters: [^0-7] or \O | [^0-7] | [^0-7] | [^0-7]
Head of word: [a-zA-Z_] or \h | [a-zA-Z_] | [a-zA-Z_] | [a-zA-Z_]
Non-head of word: [^a-zA-Z_] or \H | [^a-zA-Z_] | [^a-zA-Z_] | [^a-zA-Z_]
Printable characters: \p
Non printable characters: \P
Word characters: \w | "\\w" or r"\w" | \w | \w
Word boundary: \b | "\\b" or r"\b" | \b | \b
Non word characters: \W | \W | \W | \W
Grouping: \(\) | () | () | () | () | \(\) | ()
0 or more matches: * | * | * | *
0 or more matches (as few as possible): \{-\}
0 or 1 matches: \= | ? | ? | ?
1 or more matches: \+ | + | + | +
Exactly m matches: \{m\} | {m} | {m} | {m}
m or more matches: \{m,\} | {m,} | {m,} | {m,}
m or more matches (as few as possible): \{-m,\}
m to n matches: \{m,n\} | {m,n} | {m,n} | {m,n}
m to n matches (as few as possible): \{-m,n\}
Up to n matches: \{,n\} | {,n} | {,n} | {,n}
Up to n matches (as few as possible): \{-,n\}
Any character except a newline: . | . | . | .
Start of a line: ^ | ^ | ^ | ^
End of a line: $ | $ | $ | $
Literal /: \/ (need to escape) | / (no need to escape)
Literal dot: \\.
Lookahead (?=...) \\.
Negative lookahead (?!...) \\.
Positive lookbehind (?<=...) \\.
Negative lookbehind (?<!...) \\.
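
A few of the constructs listed above can be tried out in Python. This is only a sketch using the standard re module; the sample strings are invented for illustration and other tools may behave differently.

```python
import re

sample = "price: $42, cost: $7"

# Repetition counts: exactly m, and m to n.
two_digits = re.findall(r"\d{2}", sample)
one_or_two = re.findall(r"\d{1,2}", sample)

# Grouping with (): capture the label and the amount.
first_pair = re.search(r"(\w+): \$(\d+)", sample).groups()

# Literal dot: the dot must be escaped to lose its "any character" meaning.
parts = re.split(r"\.", "a.b.c")

# Lookahead and negative lookahead.
labels = re.findall(r"\w+(?=:)", sample)             # words followed by ":"
not_bar = re.search(r"foo(?!bar)", "foobar foobaz")  # "foo" not followed by "bar"

# Positive and negative lookbehind.
after_dollar = re.findall(r"(?<=\$)\d+", sample)
no_dollar = re.findall(r"(?<!\$)\b\d+", "5 apples cost $7")

# ^ matches at the start of every line once the multiline flag is on.
line_starts = re.findall(r"^\w+", "one two\nthree four", flags=re.MULTILINE)
```

The lookaround patterns consume no characters themselves, which is why after_dollar contains only the digits and not the dollar signs.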

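[1]: Python and JavaScript only partially support regular expression modifiers. More specifically, a modifier that is turned on globally (e.g., (?i) at the start of a Python pattern, or a flag such as /i in JavaScript) applies to the entire regular expression and cannot be turned off partway through. Python 3.6 and later additionally accept scoped groups such as (?i:...) and (?-i:...), which turn a modifier on or off for just that group.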
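
As a small illustration of modifiers in Python, the sketch below contrasts a global inline flag with the scoped flag groups accepted by the re module in Python 3.6 and later. The patterns and strings are made up for the example.

```python
import re

# A global inline flag at the start applies to the whole pattern.
global_flag = re.fullmatch(r"(?i)hello world", "Hello WORLD")

# A scoped group turns the flag on for that group only.
scoped_on = r"(?i:hello) world"
on_inside = re.fullmatch(scoped_on, "HELLO world")
on_outside = re.fullmatch(scoped_on, "HELLO WORLD")

# A scoped group can also turn off a flag that was set for the whole pattern.
scoped_off = re.compile(r"hello (?-i:world)", re.IGNORECASE)
off_outside = scoped_off.fullmatch("HELLO world")
off_inside = scoped_off.fullmatch("HELLO WORLD")
```

Only the text inside the scoped group changes its case sensitivity; the rest of the pattern keeps the surrounding setting.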

[2]: The behavior of regular expressions in Oracle SQL is controlled via parameters of the regular expression functions instead of via regular expression modifiers.

[3]: grep fully supports regular expression modifiers via Perl-style regular expressions (the -P option).

[4]: grep matches patterns greedily by default. However, in Perl-style syntax you can use the modifier ? after a quantifier to perform a non-greedy match. For example, instead of .* you can use .*? to do a non-greedy match.
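
The same ? suffix works in Python's re module, whose syntax is also Perl-style. The sketch below uses Python rather than grep, with a made-up input string, to show the difference between the two modes.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, from the first "<" to the last ">".
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first ">" that lets the pattern match.
lazy = re.findall(r"<.*?>", html)
```

The greedy pattern returns the whole string as a single match, while the non-greedy one returns each tag separately.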

[5]: As a matter of fact, "\s" also works in Python and is equivalent to "\\s" and r"\s". However, it is suggested that you avoid "\s" because it causes confusion, especially when you call other programming languages (e.g., Spark SQL) to run regular expression operations from Python. The raw string pattern r"\s" is preferred for its unambiguity and simplicity. For more discussion of Python regular expressions, please refer to Regular Expression in Python.
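
To make the two layers of escaping concrete, here is a sketch that only builds a SQL query string in Python and does not run it. It assumes the target SQL dialect treats the backslash as an escape character inside string literals, so the SQL engine needs to see \\s in order to end up with the pattern \s. The table name my_table and the column name my_col are hypothetical.

```python
# Ordinary string: every backslash that should reach SQL is doubled again.
query_plain = "SELECT regexp_replace(my_col, '\\\\s+', ' ') FROM my_table"

# Raw string: written exactly as the SQL engine should receive it.
query_raw = r"SELECT regexp_replace(my_col, '\\s+', ' ') FROM my_table"

# Both spellings produce the same query text.
same_query = query_plain == query_raw

# The pattern as it appears inside the SQL string literal:
# two backslashes, then "s+".
pattern_in_sql = query_raw.split("'")[1]
```

The raw string is easier to audit because what you read in the Python source is what the other language receives.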
