pcretest - a program for testing Perl-compatible regular expressions. 


pcretest [options] [source] [destination]

pcretest was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular expressions. This document describes the features of the test program; for details of the regular expressions themselves, see the pcrepattern documentation. For details of the PCRE library function calls and their options, see the pcreapi documentation.


-b
Behave as if each regex has the /B (show bytecode) modifier; the internal form is output after compilation.
-C
Output the version number of the PCRE library, and all available information about the optional features that are included, and then exit.
-d
Behave as if each regex has the /D (debug) modifier; the internal form and information about the compiled pattern is output after compilation; -d is equivalent to -b -i.
-dfa
Behave as if each data line contains the \D escape sequence; this causes the alternative matching function, pcre_dfa_exec(), to be used instead of the standard pcre_exec() function (more detail is given below).
-help
Output a brief summary of these options and then exit.
-i
Behave as if each regex has the /I modifier; information about the compiled pattern is given after compilation.
-m
Output the size of each compiled pattern after it has been compiled. This is equivalent to adding /M to each regular expression. For compatibility with earlier versions of pcretest, -s is a synonym for -m.
-o osize
Set the number of elements in the output vector that is used when calling pcre_exec() or pcre_dfa_exec() to be osize. The default value is 45, which is enough for 14 capturing subexpressions for pcre_exec() or 22 different matches for pcre_dfa_exec(). The vector size can be changed for individual matching calls by including \O in the data line (see below).
-p
Behave as if each regex has the /P modifier; the POSIX wrapper API is used to call PCRE. None of the other options has any effect when -p is set.
-q
Do not output the version number of pcretest at the start of execution.
-S size
On Unix-like systems, set the size of the runtime stack to size megabytes.
-t
Run each compile, study, and match many times with a timer, and output the resulting time per compile or match (in milliseconds). Do not set -m with -t, because you will then get the size output a zillion times, and the timing will be distorted. You can control the number of iterations that are used for timing by following -t with a number (as a separate item on the command line). For example, "-t 1000" would iterate 1000 times. The default is to iterate 500000 times.
-tm
This is like -t except that it times only the matching phase, not the compile or study phases.
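
The arithmetic behind the default output vector size of 45 can be sketched as follows. This is a minimal illustration, assuming that pcre_exec() needs three vector elements per substring slot (with one slot taken by the whole match) and that pcre_dfa_exec() needs two elements per match; the function names are hypothetical and are not part of pcretest or PCRE.

```python
def exec_capture_capacity(osize):
    """Capturing subexpressions that fit in a vector of osize elements.

    Assumption: three elements per slot, and slot 0 is used for the
    string that matched the whole pattern.
    """
    slots = osize // 3
    return max(slots - 1, 0)


def dfa_match_capacity(osize):
    """Matches that fit in a vector of osize elements.

    Assumption: each match needs one start offset and one end offset.
    """
    return osize // 2


if __name__ == "__main__":
    print(exec_capture_capacity(45))
    print(dfa_match_capacity(45))
```

With osize set to 45 the two functions give 14 and 22, the figures quoted for the -o default.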


If pcretest is given two filename arguments, it reads from the first and writes to the second. If it is given only one filename argument, it reads from that file and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and prompts for each line of input, using "re>" to prompt for regular expressions, and "data>" to prompt for data lines.

The program handles any number of sets of input on a single input file. Each set starts with a regular expression, and continues with any number of data lines to be matched against the pattern.

Each data line is matched separately and independently. If you want to do multi-line matches, you have to use the \n escape sequence (or \r or \r\n, etc., depending on the newline setting) in a single line of input to encode the newline sequences. There is no limit on the length of data lines; the input buffer is automatically extended if it is too small.

An empty line signals the end of the data lines, at which point a new regular expression is read. The regular expressions are given enclosed in any non-alphanumeric delimiters other than backslash, for example:

  /(a|bc)x+yz/

White space before the initial delimiter is ignored. A regular expression may be continued over several input lines, in which case the newline characters are included within it. It is possible to include the delimiter within the pattern by escaping it, for example

  /abc\/def/

If you do so, the escape and the delimiter form part of the pattern, but since delimiters are always non-alphanumeric, this does not affect its interpretation. If the terminating delimiter is immediately followed by a backslash, for example,

  /abc/\

then a backslash is added to the end of the pattern. This is done to provide a way of testing the error condition that arises if a pattern finishes with a backslash, because

  /abc\/

is interpreted as the first line of a pattern that starts with "abc/", causing pcretest to read the next line as a continuation of the regular expression.


A pattern may be followed by any number of modifiers, which are mostly single characters. Following Perl usage, these are referred to below as, for example, "the /i modifier", even though the delimiter of the pattern need not always be a slash, and no slash is used when writing modifiers. Whitespace may appear between the final pattern delimiter and the first modifier, and between the modifiers themselves.

The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_compile() is called. These four modifier letters have the same effect as they do in Perl. For example:

  /caseless/i

The following table shows additional modifiers for setting PCRE options that do not correspond to anything in Perl:

  /A          PCRE_ANCHORED
  /f          PCRE_FIRSTLINE
  /J          PCRE_DUPNAMES
  /U          PCRE_UNGREEDY
  /X          PCRE_EXTRA
  /<cr>       PCRE_NEWLINE_CR
  /<lf>       PCRE_NEWLINE_LF
  /<crlf>     PCRE_NEWLINE_CRLF
  /<any>      PCRE_NEWLINE_ANY

Those specifying line ending sequences are literal strings as shown, but the letters can be in either case. This example sets multiline matching with CRLF as the line ending sequence:

  /^abc/m<crlf>

Details of the meanings of these PCRE options are given in the pcreapi documentation.

Finding all matches in a string

Searching for all possible matches within each subject string can be requested by the /g or /G modifier. After finding a match, PCRE is called again to search the remainder of the subject string. The difference between /g and /G is that the former uses the startoffset argument to pcre_exec() to start searching at a new point within the entire string (which is in effect what Perl does), whereas the latter passes over a shortened substring. This makes a difference to the matching process if the pattern begins with a lookbehind assertion (including \b or \B).

If any call to pcre_exec() in a /g or /G sequence matches an empty string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set in order to search for another, non-empty, match at the same point. If this second match fails, the start offset is advanced by one, and the normal match is retried. This imitates the way Perl handles such cases when using the /g modifier or the split() function.
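
The difference between resuming inside the whole subject (/g) and passing a shortened substring (/G) can be illustrated outside pcretest. The following is a rough sketch using Python's re module, which is not PCRE; it assumes that the pos argument of search() plays a role similar to startoffset, the function names are hypothetical, and an empty match is simply stepped over rather than retried with PCRE_NOTEMPTY and PCRE_ANCHORED.

```python
import re


def matches_like_g(pattern, subject):
    """Resume each search at an offset within the full subject."""
    regex = re.compile(pattern)
    found, pos = [], 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Simplification: step past an empty match instead of the
        # NOTEMPTY/ANCHORED retry described in the text.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def matches_like_G(pattern, subject):
    """Resume each search on a shortened copy of the subject."""
    regex = re.compile(pattern)
    found, rest = [], subject
    while True:
        m = regex.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]
    return found


if __name__ == "__main__":
    print(matches_like_g(r"\Bi(\w\w)", "Mississippi"))
    print(matches_like_G(r"\Bi(\w\w)", "Mississippi"))
```

In the second function the leading \B cannot see the character that precedes the shortened substring, so the middle "iss" is not found; this is the kind of difference the text attributes to patterns that begin with a lookbehind assertion.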

Other modifiers

There are yet more modifiers for controlling the way pcretest operates.

The /+ modifier requests that as well as outputting the substring that matched the entire pattern, pcretest should in addition output the remainder of the subject string. This is useful for tests where the subject contains multiple copies of the same substring.

The /B modifier is a debugging feature. It requests that pcretest output a representation of the compiled byte code after compilation. Normally this information contains length and offset values; however, if /Z is also present, this data is replaced by spaces. This is a special feature for use in the automatic test scripts; it ensures that the same output is generated for different internal link sizes.

The /L modifier must be followed directly by the name of a locale, for example,

  /pattern/Lfr_FR

For this reason, it must be the last modifier. The given locale is set, pcre_maketables() is called to build a set of character tables for the locale, and this is then passed to pcre_compile() when compiling the regular expression. Without an /L modifier, NULL is passed as the tables pointer; that is, /L applies only to the expression on which it appears.

The /I modifier requests that pcretest output information about the compiled pattern (whether it is anchored, has a fixed first character, and so on). It does this by calling pcre_fullinfo() after compiling a pattern. If the pattern is studied, the results of that are also output.

The /D modifier is a PCRE debugging feature, and is equivalent to /BI, that is, both the /B and the /I modifiers.

The /F modifier causes pcretest to flip the byte order of the fields in the compiled pattern that contain 2-byte and 4-byte numbers. This facility is for testing the feature in PCRE that allows it to execute patterns that were compiled on a host with a different endianness. This feature is not available when the POSIX interface to PCRE is being used, that is, when the /P pattern modifier is specified. See also the section about saving and reloading compiled patterns below.

The /S modifier causes pcre_study() to be called after the expression has been compiled, and the results used when the expression is matched.

The /M modifier causes the size of the memory block used to hold the compiled pattern to be output.

The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather than its native API. When this is done, all other modifiers except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.

The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option set. This turns on support for UTF-8 character handling in PCRE, provided that it was compiled with this support enabled. This modifier also causes any non-printing characters in output strings to be printed using the \x{hh...} notation if they are valid UTF-8 sequences.

If the /? modifier is used with /8, it causes pcretest to call pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the checking of the string for UTF-8 validity.


Before each data line is passed to pcre_exec(), leading and trailing whitespace is removed, and it is then scanned for \ escapes. Some of these are pretty esoteric features, intended for checking out some of the more complicated features of PCRE. If you are just testing "ordinary" regular expressions, you probably don't need any of these. The following escapes are recognized:

  \a         alarm (BEL, \x07)
  \b         backspace (\x08)
  \e         escape (\x1b)
  \f         formfeed (\x0c)
  \n         newline (\x0a)
  \qdd       set the PCRE_MATCH_LIMIT limit to dd
               (any number of digits)
  \r         carriage return (\x0d)
  \t         tab (\x09)
  \v         vertical tab (\x0b)
  \nnn       octal character (up to 3 octal digits)
  \xhh       hexadecimal character (up to 2 hex digits)
  \x{hh...}  hexadecimal character, any number of digits
               in UTF-8 mode
  \A         pass the PCRE_ANCHORED option to pcre_exec()
               or pcre_dfa_exec()
  \B         pass the PCRE_NOTBOL option to pcre_exec()
               or pcre_dfa_exec()
  \Cdd       call pcre_copy_substring() for substring dd
               after a successful match (number less than 32)
  \Cname     call pcre_copy_named_substring() for substring
               "name" after a successful match (name terminated
               by next non-alphanumeric character)
  \C+        show the current captured substrings at callout
  \C-        do not supply a callout function
  \C!n       return 1 instead of 0 when callout number n is
               reached
  \C!n!m     return 1 instead of 0 when callout number n is
               reached for the mth time
  \C*n       pass the number n (may be negative) as callout
               data; this is used as the callout return value
  \D         use the pcre_dfa_exec() match function
  \F         only shortest match for pcre_dfa_exec()
  \Gdd       call pcre_get_substring() for substring dd
               after a successful match (number less than 32)
  \Gname     call pcre_get_named_substring() for substring
               "name" after a successful match (name terminated
               by next non-alphanumeric character)
  \L         call pcre_get_substringlist() after a
               successful match
  \M         discover the minimum MATCH_LIMIT and
               MATCH_LIMIT_RECURSION settings
  \N         pass the PCRE_NOTEMPTY option to pcre_exec()
               or pcre_dfa_exec()
  \Odd       set the size of the output vector passed to
               pcre_exec() to dd (any number of digits)
  \P         pass the PCRE_PARTIAL option to pcre_exec()
               or pcre_dfa_exec()
  \Qdd       set the PCRE_MATCH_LIMIT_RECURSION limit to dd
               (any number of digits)
  \R         pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
  \S         output details of memory get/free calls during matching
  \Z         pass the PCRE_NOTEOL option to pcre_exec()
               or pcre_dfa_exec()
  \?         pass the PCRE_NO_UTF8_CHECK option to
               pcre_exec() or pcre_dfa_exec()
  \>dd       start the match at offset dd (any number of digits);
               this sets the startoffset argument for pcre_exec()
               or pcre_dfa_exec()
  \<cr>      pass the PCRE_NEWLINE_CR option to pcre_exec()
               or pcre_dfa_exec()
  \<lf>      pass the PCRE_NEWLINE_LF option to pcre_exec()
               or pcre_dfa_exec()
  \<crlf>    pass the PCRE_NEWLINE_CRLF option to pcre_exec()
               or pcre_dfa_exec()
  \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre_exec()
               or pcre_dfa_exec()
  \<any>     pass the PCRE_NEWLINE_ANY option to pcre_exec()
               or pcre_dfa_exec()

The escapes that specify line ending sequences are literal strings, exactly as shown. No more than one newline setting should be present in any data line.

A backslash followed by anything else just escapes the anything else. If the very last character is a backslash, it is ignored. This gives a way of passing an empty line as data, since a real empty line terminates the data input.

If \M is present, pcretest calls pcre_exec() several times, with different values in the match_limit and match_limit_recursion fields of the pcre_extra data structure, until it finds the minimum numbers for each parameter that allow pcre_exec() to complete. The match_limit number is a measure of the amount of backtracking that takes place, and checking it out can be instructive. For most simple matches, the number is quite small, but for patterns with very large numbers of matching possibilities, it can become large very quickly with increasing length of subject string. The match_limit_recursion number is a measure of how much stack (or, if PCRE is compiled with NO_RECURSE, how much heap) memory is needed to complete the match attempt.

When \O is used, the value specified may be higher or lower than the size set by the -o command line option (or defaulted to 45); \O applies only to the call of pcre_exec() for the line in which it appears.

If the /P modifier was present on the pattern, causing the POSIX wrapper API to be used, the only option-setting sequences that have any effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively, to be passed to regexec().

The use of \x{hh...} to represent UTF-8 characters is not dependent on the use of the /8 modifier on the pattern. It is recognized always. There may be any number of hexadecimal digits inside the braces. The result is from one to six bytes, encoded according to the original UTF-8 rules of RFC 2279. This allows for values in the range 0 to 0x7FFFFFFF. Note that not all of those are valid Unicode code points, or indeed valid UTF-8 characters according to the later rules in RFC 3629.
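
The one-to-six-byte encoding mentioned for \x{hh...} can be sketched as below. This is an illustrative implementation of the original RFC 2279 scheme, not the code that pcretest itself uses; the function name is hypothetical.

```python
def encode_utf8_rfc2279(value):
    """Encode an integer in 0..0x7FFFFFFF as 1 to 6 bytes (RFC 2279 style)."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range for RFC 2279 UTF-8")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper bound, total bytes, lead-byte prefix)
    for limit, length, prefix in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if value < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, six bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    out[0] = prefix | value
    return bytes(out)


if __name__ == "__main__":
    print(encode_utf8_rfc2279(0x100).hex())
    print(encode_utf8_rfc2279(0x7FFFFFFF).hex())
```

Values above 0x10FFFF produce five- or six-byte sequences, which is why not every result is valid under the later rules of RFC 3629.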


By default, pcretest uses the standard PCRE matching function, pcre_exec(), to match each data line. From release 6.0, PCRE supports an alternative matching function, pcre_dfa_exec(), which operates in a different way, and has some restrictions. The differences between the two functions are described in the pcrematching documentation.

If a data line contains the \D escape sequence, or if the command line contains the -dfa option, the alternative matching function is called. This function finds all possible matches at a given point. If, however, the \F escape sequence is present in the data line, it stops after the first match is found. This is always the shortest possible match.


This section describes the output when the normal matching function, pcre_exec(), is being used.

When a match succeeds, pcretest outputs the list of captured substrings that pcre_exec() returns, starting with number 0 for the string that matched the whole pattern. Otherwise, it outputs "No match" or "Partial match" when pcre_exec() returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE negative error number. Here is an example of an interactive pcretest run.

  $ pcretest
  PCRE version 7.0 30-Nov-2006

    re> /^abc(\d+)/
  data> abc123
   0: abc123
   1: 123
  data> xyz
  No match

If the strings contain any non-printing characters, they are output as \xhh escapes, or as \x{...} escapes if the /8 modifier was present on the pattern. See below for the definition of non-printing characters. If the pattern has the /+ modifier, the output for substring 0 is followed by the rest of the subject string, identified by "0+" like this:

    re> /cat/+
  data> cataract
   0: cat
   0+ aract

If the pattern has the /g or /G modifier, the results of successive matching attempts are output in sequence, like this:

    re> /\Bi(\w\w)/g
  data> Mississippi
   0: iss
   1: ss
   0: iss
   1: ss
   0: ipp
   1: pp

"No match" is output only if the first match attempt fails.

If any of the sequences \C, \G, or \L are present in a data line that is successfully matched, the substrings extracted by the convenience functions are output with C, G, or L after the string number instead of a colon. This is in addition to the normal full list. The string length (that is, the return from the extraction function) is given in parentheses after each string for \C and \G.

Note that whereas patterns can be continued over several lines (a plain ">" prompt is used for continuations), data lines may not. However, newlines can be included in data by means of the \n escape (or \r, \r\n, etc., depending on the newline sequence setting).


When the alternative matching function, pcre_dfa_exec(), is used (by means of the \D escape sequence or the -dfa command line option), the output consists of a list of all the matches that start at the first point in the subject where there is at least one match. For example:

    re> /(tang|tangerine|tan)/
  data> yellow tangerine\D
   0: tangerine
   1: tang
   2: tan

(Using the normal matching function on this data finds only "tang".) The longest matching string is always given first (and numbered zero).

If /g is present on the pattern, the search for further matches resumes at the end of the longest match. For example:

    re> /(tang|tangerine|tan)/g
  data> yellow tangerine and tangy sultana\D
   0: tangerine
   1: tang
   2: tan
   0: tang
   1: tan
   0: tan

Since the matching function does not support substring capture, the escape sequences that are concerned with captured substrings are not relevant.
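
The shape of this output (all matches at the first matching point, longest first, with /g resuming after the longest) can be imitated for the simple case of literal alternatives. The sketch below is only an illustration of that reporting rule; it does not call pcre_dfa_exec(), it handles plain strings rather than regular expressions, and the function name is hypothetical.

```python
def dfa_style_matches(alternatives, subject, global_search=False):
    """List every literal alternative matching at the first matching point.

    Each group is ordered longest first. With global_search, the scan
    resumes at the end of the longest match of the previous group.
    """
    results = []
    pos = 0
    while pos <= len(subject):
        group = None
        for start in range(pos, len(subject)):
            # Empty alternatives are skipped so the scan always advances.
            found = {a for a in alternatives if a and subject.startswith(a, start)}
            if found:
                group = sorted(found, key=len, reverse=True)
                pos = start + len(group[0])
                break
        if group is None:
            break
        results.append(group)
        if not global_search:
            break
    return results


if __name__ == "__main__":
    words = ["tang", "tangerine", "tan"]
    for group in dfa_style_matches(words, "yellow tangerine and tangy sultana", True):
        for number, text in enumerate(group):
            print(" %d: %s" % (number, text))
```

Run on the subject used above, the sketch groups the matches as tangerine/tang/tan, then tang/tan, then tan.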


When the alternative matching function has given the PCRE_ERROR_PARTIAL return, indicating that the subject partially matched the pattern, you can restart the match with additional subject data by means of the \R escape sequence. For example:

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\P\D
  Partial match: 23ja
  data> n05\R\D
   0: n05

For further information about partial matching, see the pcrepartial documentation.


If the pattern contains any callout requests, pcretest's callout function is called during matching. This works with both matching functions. By default, the called function displays the callout number, the start and current positions in the text at the callout time, and the next pattern item to be tested. For example, the output

    0    ^  ^     \d

indicates that callout number 0 occurred for a match attempt starting at the fourth character of the subject string, when the pointer was at the seventh character of the data, and when the next pattern item was \d. Just one circumflex is output if the start and current positions are the same.

Callouts numbered 255 are assumed to be automatic callouts, inserted as a result of the /C pattern modifier. In this case, instead of showing the callout number, the offset in the pattern, preceded by a plus, is output. For example:

    re> /\d?[A-E]\*/C
  data> E*
   +0 ^      \d?
   +3 ^      [A-E]
   +8 ^^     \*
  +10 ^ ^
   0: E*

The callout function in pcretest returns zero (carry on matching) by default, but you can use a \C item in a data line (as described above) to change this.

Inserting callouts can be helpful when using pcretest to check complicated regular expressions. For further information about callouts, see the pcrecallout documentation.


When pcretest is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.

When pcretest is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the /L modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.
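
The default rule (bytes outside 32-126 are shown as hex escapes) can be sketched as follows. The exact spelling of the escape, a backslash followed by x and two lowercase hex digits, is an assumption made for this illustration, and the function name is hypothetical; the locale-dependent isprint() case is not covered.

```python
def escape_nonprinting(data):
    """Render bytes, replacing anything outside 32..126 with a hex escape.

    Assumption: escapes are written as \\xhh with two lowercase hex digits.
    """
    return "".join(
        chr(b) if 32 <= b <= 126 else "\\x%02x" % b
        for b in data
    )


if __name__ == "__main__":
    print(escape_nonprinting(b"abc\tdef\n"))
```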


The facilities described in this section are not available when the POSIX interface to PCRE is being used, that is, when the /P pattern modifier is specified.

When the POSIX interface is not in use, you can cause pcretest to write a compiled pattern to a file, by following the modifiers with > and a file name. For example:

  /pattern/im >/some/file

See the pcreprecompile documentation for a discussion about saving and re-using compiled patterns.

The data that is written is binary. The first eight bytes are the length of the compiled pattern data followed by the length of the optional study data, each written as four bytes in big-endian order (most significant byte first). If there is no study data (either the pattern was not studied, or studying did not return any data), the second length is zero. The lengths are followed by an exact copy of the compiled pattern. If there is additional study data, this follows immediately after the compiled pattern. After writing the file, pcretest expects to read a new pattern.

A saved pattern can be reloaded into pcretest by specifying < and a file name instead of a pattern. The name of the file must not contain a < character, as otherwise pcretest will interpret the line as a pattern delimited by < characters. For example:

   re> </some/file
  Compiled regex loaded from /some/file
  No study data

When the pattern has been loaded, pcretest proceeds to read data lines in the usual way.

You can copy a file written by pcretest to a different host and reload it there, even if the new host has opposite endianness to the one on which the pattern was compiled. For example, you can compile on an i86 machine and run on a SPARC machine.

File names for saving and reloading can be absolute or relative, but note that the shell facility of expanding a file name that starts with a tilde (~) is not available.

The ability to save and reload files in pcretest is intended for testing and experimentation. It is not intended for production use because only a single pattern can be written to a file. Furthermore, there is no facility for supplying custom character tables for use with a reloaded pattern. If the original pattern was compiled with custom tables, an attempt to match a subject string using a reloaded pattern is likely to cause pcretest to crash. Finally, if you attempt to load a file that is not in the correct format, the result is undefined.
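
The file layout described in this section (two four-byte big-endian lengths, then the compiled pattern, then any study data) can be sketched as below. The byte strings used here are opaque placeholders rather than real compiled patterns, and the function names are hypothetical; the sketch only illustrates the framing, not anything PCRE does with the contents.

```python
import struct


def pack_saved_pattern(compiled, study=b""):
    """Frame pattern and study bytes with two big-endian 32-bit lengths."""
    header = struct.pack(">II", len(compiled), len(study))
    return header + compiled + study


def unpack_saved_pattern(blob):
    """Split a framed blob back into (compiled, study) byte strings."""
    if len(blob) < 8:
        raise ValueError("missing eight-byte length header")
    pattern_len, study_len = struct.unpack(">II", blob[:8])
    if len(blob) < 8 + pattern_len + study_len:
        raise ValueError("file is shorter than its header claims")
    compiled = blob[8:8 + pattern_len]
    study = blob[8 + pattern_len:8 + pattern_len + study_len]
    return compiled, study


if __name__ == "__main__":
    blob = pack_saved_pattern(b"PLACEHOLDER")
    print(blob[:8].hex())
    print(unpack_saved_pattern(blob))
```

A second length of zero corresponds to the "No study data" case shown in the reload example.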


pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(3), pcrepattern(3), pcreprecompile(3).


Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.


