3. Regular Expressions

The regular expression syntax follows the underlying Perl Compatible Regular Expressions (PCRE) library, which is close to the syntax of Perl. (See [1] for further information and documentation.) A regular expression in Mathematica is denoted by the head RegularExpression.

The following basic elements can be used in regular expression strings:

The following represent classes of characters:

The following named classes can be used: alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit.

The following represent positions in strings:

The following set options for all regular expression elements that follow them:

The following are lookahead/lookbehind constructs:

Discussion of a few issues regarding regular expressions follows.

This looks for runs of word characters of length between 2 and 4.
In[28]:=
Out[28]=
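As a minimal sketch of such a search, the bounded quantifier {2,4} can be combined with word boundaries so that longer runs are rejected outright. The sample string is an assumption chosen for illustration.

```wolfram
(* \w{2,4} matches 2 to 4 word characters; \b on both sides
   prevents matching a piece of a longer word *)
StringCases["a bb ccc dddd eeeee", RegularExpression["\\b\\w{2,4}\\b"]]
(* → {"bb", "ccc", "dddd"} *)
```

Without the \\b anchors, the five-letter run would contribute a four-character partial match.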
With the possessive "+" quantifier, as many characters as possible are grabbed by the matcher, and no characters are given up, even if the rest of the pattern requires it.
In[29]:=
Out[29]=
In[30]:=
Out[30]=
In[31]:=
Out[31]=
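The difference from an ordinary greedy quantifier can be sketched as follows. The input string is a made-up example; both patterns ask for a run of digits followed by a literal 0.

```wolfram
(* Greedy: \d+ first takes all six digits, then gives one back
   so that the trailing 0 can match *)
StringCases["123450", RegularExpression["\\d+0"]]
(* → {"123450"} *)

(* Possessive: \d++ takes all the digits and never gives any back,
   so nothing is left for the trailing 0 and the match fails *)
StringCases["123450", RegularExpression["\\d++0"]]
(* → {} *)
```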
[[:xdigit:]] matches any character that can appear in a hexadecimal number (0-9, a-f, and A-F).
In[32]:=
Out[32]=
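A short sketch of the named class in use, assuming a made-up string containing one valid and one invalid hexadecimal literal:

```wolfram
(* "0x" followed by one or more hexadecimal digits *)
StringCases["0x1F and 0xZZ", RegularExpression["0x[[:xdigit:]]+"]]
(* → {"0x1F"} *)
```

The letters a and d in "and" are hexadecimal digits too, but they are not preceded by "0x", so they do not match.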
The complete list of characters that need to be escaped in a regular expression consists of ., \, ?, (, ), {, }, [, ], ^, $, *, +, and |. For instance, to write a literal period, use "\\." and to write a literal backslash, use "\\\\". Inside a character class "[...]", the complete list of characters that need to be escaped is ^, -, \, [, and ].

By default, ^ and $ match the beginning and end of the string, respectively. In multiline mode, these match the beginning/end of lines instead.
In[33]:=
Out[33]=
In[34]:=
Out[34]=
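Both points can be sketched briefly. The sample strings are assumptions, and (?m) is the standard PCRE inline flag for multiline mode.

```wolfram
(* An unescaped period matches any character; "\\." matches only a period *)
StringCases["a.b axb", RegularExpression["a.b"]]
(* → {"a.b", "axb"} *)
StringCases["a.b axb", RegularExpression["a\\.b"]]
(* → {"a.b"} *)

text = "first line\nsecond line";

(* Default mode: ^ matches only at the start of the whole string *)
StringCases[text, RegularExpression["^\\w+"]]
(* → {"first"} *)

(* Multiline mode: ^ also matches at the start of each line *)
StringCases[text, RegularExpression["(?m)^\\w+"]]
(* → {"first", "second"} *)
```

Note that every backslash is doubled because the pattern is written as a Mathematica string, where "\\" denotes a single backslash.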
In multiline mode, \\A and \\Z can be used to denote the beginning and end of the string.
In[35]:=
Out[35]=
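A sketch of the difference, reusing a hypothetical two-line string: even with (?m) switched on, \\A and \\Z stay tied to the whole string rather than to individual lines.

```wolfram
text = "first line\nsecond line";

(* \A anchors at the very start of the string, even in multiline mode *)
StringCases[text, RegularExpression["(?m)\\A\\w+"]]
(* → {"first"} *)

(* \Z anchors at the very end of the string, even in multiline mode *)
StringCases[text, RegularExpression["(?m)\\w+\\Z"]]
(* → {"line"} *)
```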
The (?x) modifier allows you to add whitespace and comments to a regular expression for readability.
In[36]:=
Out[36]=
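As an illustration, here is a sketch of a commented pattern. The phone-number-like input is an assumption. In extended mode, unescaped whitespace in the pattern is ignored and # starts a comment that runs to the end of the line, so each comment is terminated with "\n".

```wolfram
StringCases["call 555-1234 now",
 RegularExpression[
  "(?x) \\d{3}  # three digits\n -  # a literal hyphen\n \\d{4}  # four digits"]]
(* → {"555-1234"} *)
```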
To refer to part of a pattern later on, surround it with parentheses, (subpatt); it then becomes a numbered subpattern. Subpatterns are numbered in the order of their opening parentheses, counting from the start of the pattern. You can refer to these subpatterns using \\n for the nth subpattern later in the pattern, or by "$n" in the right-hand side of a rule. "$0" refers to all of the matched pattern.
In[37]:=
Out[37]=
In[38]:=
Out[38]=
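The three forms of reference can be sketched as follows; all input strings are invented for illustration.

```wolfram
(* Backreferences inside the pattern: \2 and \1 repeat what the
   second and first groups captured, so this finds 4-letter palindromes *)
StringCases["hello noon abba", RegularExpression["(\\w)(\\w)\\2\\1"]]
(* → {"noon", "abba"} *)

(* "$n" on the right-hand side of a rule: reorder the captured groups *)
StringReplace["2024-05-17",
 RegularExpression["(\\d+)-(\\d+)-(\\d+)"] -> "$3/$2/$1"]
(* → "17/05/2024" *)

(* "$0" stands for the whole match *)
StringReplace["abc", RegularExpression["b"] -> "<$0>"]
(* → "a<b>c" *)
```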
If you need a literal $ in this context (when the head of the left-hand side is RegularExpression), you can escape it with backslashes: "\\$2" gives the literal text $2 rather than the second subpattern.
In[39]:=
Out[39]=
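A hedged sketch of this escaping rule, with an assumed two-letter input: the first reference is substituted as usual, while the escaped one is meant to survive as literal text.

```wolfram
(* "$1" is replaced by the first captured group;
   "\\$2" is escaped, so per the rule above it should be
   kept as the literal characters $2 *)
StringReplace["ab", RegularExpression["(a)(b)"] -> "$1\\$2"]
```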
If you happen to need a single literal backslash followed by a literal $ under these circumstances, you need to be a bit tricky and split into two strings temporarily.
In[40]:=
Out[40]=
If you need to group a part of the pattern, but you do not want to count the group as a numbered subpattern, you can use the (?:patt) construct.
In[41]:=
Out[41]=
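A small sketch of the effect on numbering, using a made-up input: because the first group is non-capturing, "$1" refers to the digits rather than to the repeated "ab".

```wolfram
(* (?:ab)+ groups "ab" for repetition but is not numbered,
   so (\d+) is subpattern 1 *)
StringCases["abab12", RegularExpression["(?:ab)+(\\d+)"] -> "$1"]
(* → {"12"} *)
```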
Lookahead and lookbehind patterns are used to ensure a pattern is matched without actually including that text as part of the match. This picks out words following the string "the ".
In[42]:=
Out[42]=
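One way to write such a search is with the positive lookbehind (?<=...), sketched here on an assumed sentence:

```wolfram
(* (?<=the ) requires "the " immediately before the match
   but does not include it in the result *)
StringCases["the cat sat on the mat", RegularExpression["(?<=the )\\w+"]]
(* → {"cat", "mat"} *)
```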
This tries to pick out all even numbers in the string, but it will find matches that include partial numbers.
In[43]:=
Out[43]=
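The problem can be reproduced with a sketch like the following, where the input string and the pattern are assumptions chosen to expose the flaw:

```wolfram
(* Any run of digits ending in an even digit *)
StringCases["1234 135 246 81", RegularExpression["\\d*[02468]"]]
(* → {"1234", "246", "8"} *)
```

The final "8" is only the leading digit of the odd number 81, which is the kind of partial match the text warns about.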
Using lookbehind/lookahead, you can ensure that the characters before/after the match are not digits (note that the lookbehind test is superfluous in this particular case).
In[44]:=
Out[44]=
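A sketch of the guarded version, using an assumed input string and the negative lookbehind (?<!...) and negative lookahead (?!...) constructs:

```wolfram
(* (?<!\d) forbids a digit just before the match;
   (?!\d) forbids a digit just after it *)
StringCases["1234 135 246 81",
 RegularExpression["(?<!\\d)\\d*[02468](?!\\d)"]]
(* → {"1234", "246"} *)
```

Here the lookahead is what rejects the "8" in 81, since it is followed by another digit.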