.TH PCRECOMPAT 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "DIFFERENCES BETWEEN PCRE AND PERL"
.rs
.sp
This document describes the differences in the ways that PCRE and Perl handle
regular expressions. The differences described here are mainly with respect to
Perl 5.8, though PCRE versions 7.0 and later contain some features that are
expected to be in the forthcoming Perl 5.10.
.P
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
it does have are given in the
.\" HTML <a href="pcre.html#utf8support">
.\" </a>
section on UTF-8 support
.\"
in the main
.\" HREF
\fBpcre\fP
.\"
page.
.P
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits
them, but they do not mean what you might think. For example, (?!a){3} does
not assert that the next three characters are not "a". It merely asserts three
times that the next character is not "a".
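.P
As a rough illustration of this point, the sketch below uses Python's \fBre\fP
module purely as a stand-in (an assumption made for convenience; it is not
PCRE, but it handles these two constructs in the same way). A lone (?!a)
inspects only the next character, whereas moving the repetition inside a
positive assertion, as in (?=[^a]{3}), checks the next three characters. The
variable names are hypothetical.
.sp
.nf
```python
import re

# A single negative lookahead inspects only the next character. It consumes
# nothing, so repeating it would simply re-test the same position.
single = re.match("(?!a)", "xaa")

# To assert that none of the next three characters is "a", put the
# repetition inside a positive lookahead instead.
three_ok = re.match("(?=[^a]{3})", "xyz")
three_bad = re.match("(?=[^a]{3})", "xya")

print(single is not None, three_ok is not None, three_bad is None)
```
.fi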
.P
3. Capturing subpatterns that occur inside negative lookahead assertions are
counted, but their entries in the offsets vector are never set. Perl sets its
numerical variables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but only if the
negative lookahead assertion contains just one branch.
.P
4. Though binary zero characters are supported in the subject string, they are
not allowed in a pattern string because it is passed as a normal C string,
terminated by zero. The escape sequence \e0 can be used in the pattern to
represent a binary zero.
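.P
The reason can be seen without PCRE at all. The sketch below uses Python's
\fBctypes\fP module (chosen only for illustration) to view two byte sequences
the way a C function receiving a zero-terminated string would. The bytes are
built from numeric values so that the listing contains no backslashes; the
variable names are hypothetical.
.sp
.nf
```python
import ctypes

# "a", a binary zero, "b": the pattern text with a literal zero byte in it.
raw_pattern = bytes([97, 0, 98])

# .value reads the buffer as a zero-terminated C string, so the text stops at
# the binary zero and the "b" is never seen.
seen_raw = ctypes.create_string_buffer(raw_pattern).value

# "a", backslash (byte 92), "0", "b": the escape sequence spelled out as text.
escaped_pattern = bytes([97, 92, 48, 98])

# No zero byte occurs before the terminator, so all four bytes survive and a
# regex compiler can turn the escape into a binary zero itself.
seen_escaped = ctypes.create_string_buffer(escaped_pattern).value

print(len(seen_raw), len(seen_escaped))
```
.fi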
.P
5. The following Perl escape sequences are not supported: \el, \eu, \eL,
\eU, and \eN. In fact these are implemented by Perl's general string-handling
and are not part of its pattern matching engine. If any of these are
encountered by PCRE, an error is generated.
.P
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE is
built with Unicode character property support. The properties that can be
tested with \ep and \eP are limited to the general category properties such as
Lu and Nd, script names such as Greek or Han, and the derived properties Any
and L&.
.P
7. PCRE does support the \eQ...\eE escape for quoting substrings. Characters in
between are treated as literals. This is slightly different from Perl in that $
and @ are also handled as literals inside the quotes. In Perl, they cause
variable interpolation (but of course PCRE does not have variables). Note the
following examples:
.sp
    Pattern            PCRE matches      Perl matches
.sp
.\" JOIN
    \eQabc$xyz\eE        abc$xyz           abc followed by the
                                           contents of $xyz
    \eQabc\e$xyz\eE       abc\e$xyz          abc\e$xyz
    \eQabc\eE\e$\eQxyz\eE   abc$xyz           abc$xyz
.sp
The \eQ...\eE sequence is recognized both inside and outside character classes.
.P
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE "callout"
feature allows an external function to be called during pattern matching. See
the
.\" HREF
\fBpcrecallout\fP
.\"
documentation for details.
.P
9. Subpatterns that are called recursively or as "subroutines" are always
treated as atomic groups in PCRE. This is like Python, but unlike Perl.
.P
10. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
.P
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), (*FAIL), (*F),
(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
argument. PCRE does not support (*MARK). If (*ACCEPT) is within capturing
parentheses, PCRE does not set that capture group; this differs from Perl.
.P
12. PCRE provides some extensions to the Perl regular expression facilities.
Perl 5.10 will include new features that are not in earlier versions, some of
which (such as named parentheses) have been in PCRE for some time. This list is
with respect to Perl 5.10:
.sp
(a) Although lookbehind assertions must match fixed length strings, each
alternative branch of a lookbehind assertion can match a different length of
string. Perl requires them all to have the same length.
.sp
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
.sp
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no special
meaning is faulted. Otherwise, like Perl, the backslash is quietly ignored.
(Perl can be made to issue a warning.)
.sp
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
.sp
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
.sp
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAPTURE
options for \fBpcre_exec()\fP have no Perl equivalents.
.sp
(g) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE_BSR_ANYCRLF option.
.sp
(h) The callout facility is PCRE-specific.
.sp
(i) The partial matching facility is PCRE-specific.
.sp
(j) Patterns compiled by PCRE can be saved and re-used at a later time, even on
different hosts that have the other endianness.
.sp
(k) The alternative matching function (\fBpcre_dfa_exec()\fP) matches in a
different way and is not Perl-compatible.
.sp
(l) PCRE recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 11 September 2007
Copyright (c) 1997-2007 University of Cambridge.
.fi