8. Tips and Tricks for Efficient MatchingThis section addresses some issues involving efficiency in string pattern matching. StringExpression versus RegularExpressionSince a string pattern written in Mathematica syntax is immediately translated to a regular expression and then compiled and cached, there is very little overhead in using the Mathematica syntax as opposed to the regular expression syntax directly. One exception can happen if many different patterns are used a few times, in that case the overhead might be noticeable. Conditions and PatternTestsIf a pattern contains Condition (/;) or PatternTest (?) statements, the general Mathematica evaluator must be invoked during the match, thus slowing it down. If a pattern can be written without such constructs, it will typically be faster. In[146]:=  |
In[147]:=  |
Out[147]=
|
In[148]:=  |
Out[148]=
|
Avoid Nested QuantifiersBecause of the nondeterministic finite automaton (NFA) algorithm used in the match, patterns involving nested quantifiers (such as __ and patt.. or the regular expression equivalents) can become arbitrarily slow. Such patterns can usually be "unrolled" into more efficient versions (see Friedl [2] for additional information). Avoid Many Calls to a FunctionIf you are searching through a long list of strings for certain matches, it is more efficient to feed the whole list to a string function at once, rather than using something like Select and StringMatchQ (see the earlier dictionary example for an illustration). Here is another example that generates a list of 2000 strings with 10 characters each and searches for the strings that start with an "a" and contain "ggg" as a substring. In[149]:=  |
In[150]:=  |
Out[150]=
|
Here is the slower version, using Select and StringMatchQ. In[151]:=  |
Out[151]=
|
If you instead feed the whole list to StringMatchQ at once, it will be much faster. Then Pick can be used to extract the wanted elements. In[152]:=  |
Out[152]=
|
Alternatively, you could use StringCases, which is also fast. Note that you need to anchor the pattern using StartOfString to ensure that the "a" is at the start (the EndOfString is superfluous in this particular case). In[153]:=  |
Out[153]=
|
Rewrite General Expression Searches as String SearchesBecause the string matching algorithm is different than the algorithm Mathematica uses for general expression matching (string matching can assume a finite alphabet and a flat structure, for instance), there are cases where it is advantageous to translate a normal expression-matching problem to a string-matching problem. A typical case is matching a long list of symbols against a pattern involving several occurrences of __ and ___. As an example, assume you want to find primes (after prime number 1000000, say) that have at least four identical digits. Using ordinary pattern matching, it could be accomplished like this. In[154]:=  |
Out[154]=
|
By converting the list of integers to a string, you can use string matching instead. In[155]:=  |
Out[155]=
|
By using the previous tips of using Pick or StringCases, you can speed it up even more. In[156]:=  |
Out[156]=
|
In[157]:=  |
Out[157]=
|
For long sequences, the difference can be significant. In[158]:=  |
In[159]:=  |
Out[159]=
|
In[160]:=  |
Out[160]=
|
In[161]:=  |
In[162]:=  |
Out[162]=
|
In[163]:=  |
Out[163]=
|
|