Patterns

Purpose

Patterns match text when you do not know in advance exactly what the text will be. If you know the exact text you are looking for, use a simple string comparison such as is equal to. When the text of interest comes in several variations (for example, with optional parts), a simple comparison is insufficient, and a pattern lets you identify the content of interest.

Patterns are often used to identify content in the text flow of your document. You can also use them to match the value of any property or to match style names.

When you use a pattern in a rule condition’s data content test, the pattern is tested on the data content of all elements to which the rule applies. So, for paragraph-based rules, paragraph content is what the pattern is tested on. If you use patterns within a contains element test, then the data content is the content of the contained element.

Migrate patterns are based on the regular expression support of W3C XML Schema Datatypes.

Simple String Pattern

  • Do not use patterns for simple text comparisons. Migrate offers the following simple string comparisons:
    • is equal to
    • is not equal to
    • starts with
    • does not start with
    • contains
    • does not contain
    • ends with
    • does not end with

    It is more efficient to use a simple comparison than a matches comparison.

  • Patterns in Migrate are anchored to the start and end of the content they are evaluated on. Thus the pattern B will not match the inputs AB, BC, or ABC. In particular, take care to account for white space at the start or end of your input, or colons at the end, as appropriate. For example, the pattern Example will not match the input Example: (with a trailing colon).
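
The anchoring rule can be tried out outside Migrate. The sketch below uses Python's re module as a stand-in, which is an assumption: Python's dialect is not the XML Schema dialect Migrate uses, but re.fullmatch behaves like an anchored pattern for these simple cases. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Return True only if the pattern matches the entire content (anchored at both ends)."""
    return re.fullmatch(pattern, content) is not None

# The pattern B matches the input B, but not inputs that merely contain B.
print(matches_whole("B", "B"))      # True
print(matches_whole("B", "AB"))     # False
print(matches_whole("B", "ABC"))    # False

# A trailing colon or stray white space must be allowed for in the pattern itself.
print(matches_whole("Example", "Example:"))                # False
print(matches_whole("Example:?", "Example:"))              # True
print(matches_whole(r"\s*Example:?\s*", "  Example: "))    # True
```

When the text is fully known, a plain string comparison (in Python, == or str.startswith) is the analogue of Migrate's simple comparisons and avoids the pattern engine altogether.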

Alternatives in Patterns

A pattern can be used to match one of several specified alternatives. The pipe symbol | is used to separate the alternatives. For example, cat|dog will match either of the inputs cat or dog. Note that the pattern will not match catdog; it matches exactly one of the alternatives.
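
As a rough illustration, the same behaviour can be reproduced with Python's re.fullmatch, used here only as an approximation of Migrate's anchored matching. The helper matches_whole is a hypothetical name, not part of Migrate.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Anchored match: the pattern must account for the whole content."""
    return re.fullmatch(pattern, content) is not None

animal = "cat|dog"

# Each input must be exactly one of the alternatives.
results = {text: matches_whole(animal, text) for text in ["cat", "dog", "catdog", "bird"]}
print(results)  # {'cat': True, 'dog': True, 'catdog': False, 'bird': False}
```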

Quantifiers

Quantifiers are another mechanism for broadening the range of content identified by a pattern. Quantifiers allow you to identify repetition of content. The simplest case allows 0 or 1 repetitions, which makes the content optional. Here is the complete list of quantifiers.

Quantifier  Description                                                   Example
?           the content is optional and may appear 0 or 1 times           A? matches only the empty string or the string A
*           the content may appear 0 or more times                        A* matches the empty string, or the strings A, AA, AAA etc.
+           the content may appear 1 or more times                        A+ matches the strings A, AA, AAA etc.
{n,m}       the content must appear at least n times and at most m times  A{2,4} matches only the strings AA, AAA, AAAA
{n}         the content must appear exactly n times                       A{4} matches only the string AAAA
{n,}        the content must appear at least n times                      A{2,} matches the strings AA, AAA, AAAA etc.
  • The quantifiers *, +, and even ? are greedy: they consume as much input as they can. Migrate extends the usual set of quantifiers with non-greedy versions for your convenience.
Quantifier  Description
??          non-greedy version of ?
*?          non-greedy version of *
+?          non-greedy version of +
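
The basic quantifiers can be checked quickly with a small sketch. It assumes Python's re module as a stand-in for Migrate's pattern engine (the two dialects agree on these quantifiers), and the helper matches_whole is hypothetical.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Anchored match: the pattern must account for the whole content."""
    return re.fullmatch(pattern, content) is not None

# Each pattern is paired with a few inputs, some matching and some not.
examples = {
    "A?": ["", "A", "AA"],
    "A*": ["", "A", "AAA"],
    "A+": ["", "A", "AAA"],
    "A{2,4}": ["A", "AA", "AAAA", "AAAAA"],
    "A{4}": ["AAA", "AAAA"],
    "A{2,}": ["A", "AA", "AAAAAA"],
}

for pattern, inputs in examples.items():
    for text in inputs:
        print(f"{pattern!r:<10} {text!r:<10} {matches_whole(pattern, text)}")
```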

Example: greedy vs non-greedy patterns

Pattern      Input   Explanation
(.*)(\d+)    abc123  successful match; the first capture contains abc12 and the second 3
(.*?)(\d+)   abc123  successful match; the first capture contains abc and the second 123
(1?)(\d+)    123     successful match; the first capture contains 1 and the second 23
(1??)(\d+)   123     successful match; the first capture contains nothing and the second 123
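
The four rows above can be reproduced in a short sketch. It assumes Python's re module, which happens to support the same greedy and non-greedy quantifiers; the helper name captures is hypothetical and not part of Migrate.

```python
import re

def captures(pattern: str, content: str):
    """Return the captured groups of an anchored match, or None if there is no match."""
    match = re.fullmatch(pattern, content)
    return match.groups() if match else None

cases = [
    (r"(.*)(\d+)", "abc123"),    # greedy: the first group takes as much as it can
    (r"(.*?)(\d+)", "abc123"),   # non-greedy: the first group takes as little as it can
    (r"(1?)(\d+)", "123"),       # greedy optional 1 is consumed by the first group
    (r"(1??)(\d+)", "123"),      # non-greedy optional 1 is left for the second group
]

for pattern, content in cases:
    print(pattern, content, captures(pattern, content))
```

All four patterns match their input; only the way the input is divided between the two captures differs.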

Example: unanchored patterns

This example shows a very common technique for writing unanchored patterns. As previously mentioned, patterns are anchored to the start and end of the content they are tested on. If you are looking for text that may appear anywhere inside this content, simply place .* before and after your pattern, which means that anything can precede or follow the text you are interested in. To be precise, . does not match carriage return and newline characters, but this is not usually an issue: in Migrate, the content you are matching against typically does not contain these characters, at least for the content of paragraph, span, image and title elements.

Pattern        Explanation
.*Ice Nine.*   matches content that contains the string “Ice Nine”
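
A minimal sketch of the technique, assuming Python's re.fullmatch as an approximation of Migrate's anchored matching. The sample sentences are invented for illustration. Note one difference in the assumption: in Python, . excludes only the newline character, whereas the description above also excludes carriage return.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Anchored match: the pattern must account for the whole content."""
    return re.fullmatch(pattern, content) is not None

sentence = "The lab stored a sample of Ice Nine in the vault."

# Without the surrounding .* the anchored pattern fails on longer content.
print(matches_whole("Ice Nine", sentence))       # False
print(matches_whole(".*Ice Nine.*", sentence))   # True

# Because . does not match a newline, content spanning lines is not matched.
print(matches_whole(".*Ice Nine.*", "First line\nIce Nine"))   # False
```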

Character Classes

A character class identifies a set of characters that a pattern should match. There are a few possibilities.

Description                                     Example
explicit list of the characters to match        [aeiou] will match content consisting of a single lower case vowel
range of characters to match                    [0-9] will match any (base 10) digit
complement of character class                   [^aeiou] will match any single character that is not a lower case vowel (consonants, but also digits, punctuation and so on)
combination of two ranges                       [a-zA-Z] will match any upper or lower case letter
combination of a range and explicit characters  [_:a-z] will match underscore, colon or any lower case letter
character class subtraction                     [\S-[:-]] will match any non-white space character except for colons and dashes
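
The character classes above can be explored with the following sketch. It assumes Python's re module as a stand-in; Python has no character class subtraction, so the last example substitutes an equivalent negated class rather than the subtraction syntax itself. The helper matches_whole is hypothetical.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Anchored match: the pattern must account for the whole content."""
    return re.fullmatch(pattern, content) is not None

print(matches_whole("[aeiou]", "e"))      # True
print(matches_whole("[aeiou]", "ea"))     # False: a class matches exactly one character
print(matches_whole("[0-9]", "7"))        # True
print(matches_whole("[^aeiou]", "x"))     # True
print(matches_whole("[^aeiou]", "7"))     # True: the complement is not limited to letters
print(matches_whole("[a-zA-Z]", "Q"))     # True
print(matches_whole("[_:a-z]", ":"))      # True

# Stand-in for [\S-[:-]]: any character that is not white space, a colon or a dash.
NON_SPACE_EXCEPT_COLON_DASH = r"[^\s:\-]"
print(matches_whole(NON_SPACE_EXCEPT_COLON_DASH, "a"))   # True
print(matches_whole(NON_SPACE_EXCEPT_COLON_DASH, ":"))   # False
print(matches_whole(NON_SPACE_EXCEPT_COLON_DASH, "-"))   # False
```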

There are some built-in character classes. You can use these either inside a character class definition (i.e. inside the square brackets) or outside one.

Built-in Class  Description
\n              new line character (#xA)
\r              carriage return character (#xD)
\t              tab character (#x9)
.               anything except a newline or carriage return (i.e. [^\n\r])
\s              space, tab, newline or carriage return (i.e. [#x20\t\n\r])
\S              non-space character (i.e. [^\s])
\i              a letter, underscore or colon
\I              not a letter, underscore or colon (i.e. [^\i])
\d              same as [0-9]
\D              same as [^\d]
\w              common characters found in words, excludes punctuation and other separators
\W              same as [^\w]

Example: character class

Pattern    Explanation
ABC\tDEF   matches content that starts with ABC, is followed by a tab, and then ends with DEF
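
A few of the built-in classes can be tried with the sketch below. It assumes Python's re module, which shares \t, \s, \d and \D with the list above but has no \i or \I, so those are left out. The helper matches_whole is hypothetical.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Anchored match: the pattern must account for the whole content."""
    return re.fullmatch(pattern, content) is not None

# Raw strings (r"...") pass the backslash through to the pattern engine unchanged.
print(matches_whole(r"ABC\tDEF", "ABC\tDEF"))   # True: the input contains a real tab
print(matches_whole(r"ABC\tDEF", "ABC DEF"))    # False: a space is not a tab
print(matches_whole(r"ABC\sDEF", "ABC DEF"))    # True: \s accepts space or tab
print(matches_whole(r"\d\d\D", "42x"))          # True: two digits, then a non-digit
print(matches_whole(r"[\d\s]+", "4 2"))         # True: built-in classes inside brackets
```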

Grouping

It is sometimes necessary to group contiguous parts of your pattern. For example, if a quantifier is intended to apply to more than one part of your pattern, you need to group these parts. This is done using parentheses.

Example: grouping

Pattern          Explanation
([A-Z][a-z]*)+   Matches camel-cased strings (e.g. ThisIsAVeryLongIdentifier). Note that the * applies just to the character class for lower case letters, but the + applies to the combination of the two grouped character classes.
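
To see what the grouping contributes, the sketch below compares the grouped pattern with its ungrouped core. It assumes Python's re.fullmatch as an approximation of Migrate's anchored matching; the helper matches_whole is hypothetical.

```python
import re

def matches_whole(pattern: str, content: str) -> bool:
    """Anchored match: the pattern must account for the whole content."""
    return re.fullmatch(pattern, content) is not None

grouped = "([A-Z][a-z]*)+"    # one or more capitalised words
ungrouped = "[A-Z][a-z]*"     # exactly one capitalised word

print(matches_whole(grouped, "ThisIsAVeryLongIdentifier"))     # True
print(matches_whole(ungrouped, "ThisIsAVeryLongIdentifier"))   # False: only "This" fits
print(matches_whole(ungrouped, "This"))                        # True
print(matches_whole(grouped, "thisStartsLowerCase"))           # False
```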

Metacharacters

As we have seen, some characters have special meaning when constructing patterns. If you want to refer to these characters literally, you need to escape them with a backslash. So, if you’d like to match a question mark, you must type \?. Here is the full list of metacharacters:

. \ ? * + { } ( ) [ ] |
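
Escaping by hand is error-prone for longer literals, so a small helper can do it. The sketch below is an illustration only: escape_literal is a hypothetical function, not a Migrate feature, and its character set includes the pipe used for alternatives since that is special too. Python's re module is assumed as a stand-in for checking the result.

```python
import re

# Characters that need a backslash to be matched literally.
METACHARACTERS = set(".\\?*+{}()[]|")

def escape_literal(text: str) -> str:
    """Prefix every metacharacter in text with a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

escaped = escape_literal("Really? (yes)")
print(escaped)  # Really\? \(yes\)

# Unescaped, the ? makes the preceding y optional instead of matching a question mark.
print(re.fullmatch("Really?", "Really?") is not None)        # False
print(re.fullmatch(escaped, "Really? (yes)") is not None)    # True
```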

Anonymous Captures

Patterns are useful for identifying relevant content. They can also be used to pick out some of the interesting parts for later use. This ability to tease apart content is very powerful when creating your rules. In Migrate, these content references can be used in annotation arguments.

Anonymous captures are created by surrounding parts of your patterns in parentheses. You can then reference the matched portions with the backslash notation: \1 for the first capture, \2 for the second, and so on. The captures are counted left-to-right within a pattern, and top-down if you have more than one pattern in your rule. It is the placement of these parentheses in your patterns that permits the matched content to be referenced in this way. This is why they are called captures — you are capturing content.

Remember that parentheses are also used for grouping, as described earlier. This does not cause trouble in practice. Just identify your captures by counting opening parentheses from left to right in your pattern.

Example: anonymous captures

Pattern                      Use                                           Explanation
(.*)\.tiff                   set-attribute(src=\1.jpg)                     Change the extension of .tiff images to .jpg.
Version:\s+(\d+)\.(\d+)      p.map.product-version(version=\1;release=\2)  Extract the major and minor numbers from a version string in order to populate map metadata.
\((\d{3})\)\s*\d{3}-\d{4}    prolog.meta.othermeta((area code)(\1))        Extract the area code of a phone number and place it in prolog metadata. Note that in this example the pattern had to match literal parentheses. These have been escaped by a backslash. When escaped, the parentheses lose their special meaning for grouping and indicating captures; the inner, unescaped pair is what captures the three digits.
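
The way a capture feeds a \1 or \2 reference can be sketched with Python's re module, used here as an assumed stand-in for Migrate. The helper apply_rule and the sample inputs are hypothetical; Match.expand substitutes \1 and \2 much as the annotation arguments above do. The phone pattern uses an unescaped pair of parentheses inside the escaped ones so that the area code is actually captured.

```python
import re

def apply_rule(pattern: str, template: str, content: str):
    """Match the whole content, then fill the \\1, \\2 ... references in the template."""
    match = re.fullmatch(pattern, content)
    return match.expand(template) if match else None

# Swap the file extension.
print(apply_rule(r"(.*)\.tiff", r"\1.jpg", "diagram.tiff"))   # diagram.jpg

# Captures are numbered by their opening parenthesis, left to right.
print(apply_rule(r"Version:\s+(\d+)\.(\d+)",
                 r"version=\1;release=\2",
                 "Version: 4.2"))                              # version=4;release=2

# Escaped parentheses match literally; the inner pair captures the area code.
print(apply_rule(r"\((\d{3})\)\s*\d{3}-\d{4}", r"\1", "(613) 555-0123"))   # 613
```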

Named Captures

If you like, you can name your captures in your patterns. Doing so means that you won’t have to worry about counting your groups. This is also more robust because the count can be thrown off if you add a capture to the rule at some later time. A meaningful name also indicates the purpose of the capture more clearly to others working with the rule set.

You name a capture by placing the name in curly braces immediately after the capture's opening parenthesis. You reference the capture with a backslash followed by the capture name in braces.

Example: named captures

Pattern                                 Use
Version:\s+({major}\d+)\.({minor}\d+)   p.map.product-version(version=\{major};release=\{minor})
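
The named-capture syntax shown above is specific to Migrate. To experiment with the idea elsewhere, the sketch below rewrites ({name}...) into Python's (?P<name>...) form. This translation is an assumption made for illustration: to_python_named_groups is a hypothetical helper, it is not how Migrate works internally, and it does not handle an escaped parenthesis followed by a brace.

```python
import re

def to_python_named_groups(migrate_pattern: str) -> str:
    """Rewrite each ({name} into Python's (?P<name> named-group opener."""
    return re.sub(r"\(\{(\w+)\}", r"(?P<\1>", migrate_pattern)

migrate_pattern = r"Version:\s+({major}\d+)\.({minor}\d+)"
python_pattern = to_python_named_groups(migrate_pattern)
print(python_pattern)  # Version:\s+(?P<major>\d+)\.(?P<minor>\d+)

match = re.fullmatch(python_pattern, "Version: 4.2")
named = match.groupdict() if match else {}

# References by name stay valid even if another capture is added to the pattern later.
print(f"version={named['major']};release={named['minor']}")  # version=4;release=2
```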

Character Class Escapes

See the W3C XML Schema Datatypes specification for more details on character class escapes.