Perl Compatible Regular Expressions

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997.[3] PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors (BRE, ERE)[4] and than that of many other regular-expression libraries.

Perl Compatible Regular Expressions
Original author(s): Philip Hazel
Stable release(s):
  PCRE1: 8.45 / June 15, 2021[1]
  PCRE2: 10.42 / December 12, 2022[2]
Repository: github.com/PCRE2Project/pcre2
Written in: C
Operating system: Cross-platform
Type: Pattern matching library
License: BSD
Website: pcre.org

While PCRE originally aimed at feature-equivalence with Perl, the two implementations are not fully equivalent. During the PCRE 7.x and Perl 5.9.x phase, the two projects coordinated development, with features being ported between them in both directions.[5]

In 2015, a fork of PCRE was released with a revised programming interface (API). The original software, now called PCRE1 (the 1.xx–8.xx series), has had bugs mended, but no further development. As of 2020, it is considered obsolete, and the current 8.45 release is likely to be the last. The new PCRE2 code (the 10.xx series) has had a number of extensions and coding improvements and is where development takes place.

A number of prominent open-source programs, such as the Apache and Nginx HTTP servers, and the PHP and R scripting languages, incorporate the PCRE library; proprietary software can do likewise, as the library is BSD-licensed. As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular-expression engine through the re::engine::PCRE module.

The library can be built on Unix, Windows, and several other environments. PCRE2 is distributed with a POSIX C wrapper,[Note 1] several test programs, and the utility program pcre2grep that is built in tandem with the library.

Features

Just-in-time compiler support

This optional feature is available if enabled when the PCRE2 library is built. Large performance benefits are possible when (for example) the calling program utilizes the feature with compatible patterns that are executed repeatedly. The just-in-time compiler support was written by Zoltan Herczeg and is not addressed in the POSIX wrapper.

Flexible memory management

The use of the system stack for backtracking can be problematic in PCRE1, which is why this feature of the implementation was changed in PCRE2. The heap is now used for this purpose, and the total amount can be limited. The problem of stack overflow, which came up regularly with PCRE1, is no longer an issue with PCRE2 from release 10.30 (2017).

Consistent escaping rules

Like Perl, PCRE2 has consistent escaping rules: any non-alpha-numeric character may be escaped to mean its literal value by prefixing a \ (backslash) before the character. Any alpha-numeric character preceded by a backslash typically gives it a special meaning. In the case where the sequence has not been defined to be special, an error occurs. This is different to Perl, which gives an error only if it is in warning mode (PCRE2 does not have a warning mode). In basic POSIX regular expressions, sometimes backslashes escaped non-alpha-numerics (e.g. \.), and sometimes they introduced a special feature (e.g. \(\)).
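The escaping rule can be tried out with a small sketch. It uses Python's standard re module as a stand-in, on the assumption that it treats escaped punctuation and undefined escaped letters the same way as described for PCRE2; it is not the PCRE2 API, and \y is chosen only as an escape that is assumed to have no defined meaning.

```python
import re

# Escaped punctuation always stands for the literal character.
literal_dot = re.compile(r"3\.14")
matches_literal = literal_dot.search("pi is 3.14") is not None
matches_other = literal_dot.search("pi is 3x14") is not None

# An escaped letter with no defined meaning is rejected when the
# pattern is compiled, rather than being silently taken literally.
try:
    re.compile(r"\y")
    unknown_escape_rejected = False
except re.error:
    unknown_escape_rejected = True
```

The unescaped dot would have matched "3x14" as well; the backslash is what pins it to a literal full stop.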

Extended character classes

Single-letter character classes are supported in addition to the longer POSIX names. For example, \d matches any digit exactly as [[:digit:]] would in POSIX regular expressions.

Minimal matching (a.k.a. "ungreedy")

A ? may be placed after any repetition quantifier to indicate that the shortest match should be used. The default is to attempt the longest match first and backtrack through shorter matches: e.g. a.*?b would match only the first "ab" in "ababab", whereas a.*b would match the entire string.

If the U flag is set, then quantifiers are ungreedy (lazy) by default, while ? makes them greedy.
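The greedy and lazy forms can be compared directly. This sketch uses Python's re module, whose ? suffix on quantifiers behaves the same way for this example; Python has no equivalent of the U flag, so only the default mode is shown.

```python
import re

subject = "ababab"

# Greedy: .* grabs as much as possible, then backtracks to the last "b".
greedy = re.search(r"a.*b", subject).group()

# Lazy: .*? grabs as little as possible, stopping at the first "b".
lazy = re.search(r"a.*?b", subject).group()
```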

Unicode character properties

Unicode defines several properties for each character. Patterns in PCRE2 can match these properties: e.g. \p{Ps}.*?\p{Pe} would match a string beginning with any "opening punctuation" and ending with any "close punctuation" such as [abc].

Matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE2_UCP is set. The option can be set for a pattern by including (*UCP) at the start of the pattern. The option alters behavior of the following metacharacters: \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. For example, the set of characters matched by \w (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the library to have been built to include Unicode support (this is the default for PCRE2).

Very early versions of PCRE1 supported only ASCII code. Later, UTF-8 support was added. Support for UTF-16 was added in version 8.30, and support for UTF-32 in version 8.32. PCRE2 has always supported all three UTF encodings.
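The effect of property-driven \w can be illustrated by analogy. Python's re module is used here as a stand-in, and its defaults are the reverse of PCRE2's: for str patterns \w is Unicode-aware unless the re.ASCII flag is given, whereas PCRE2 is ASCII-only unless PCRE2_UCP or (*UCP) is set. The sample text is an arbitrary assumption.

```python
import re

text = "na\u00efve caf\u00e9"   # "naïve café", written with escapes

# Unicode-aware \w (comparable to PCRE2 with UCP enabled):
# accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# ASCII-only \w (comparable to PCRE2 without UCP):
# accented letters break the words apart.
ascii_words = re.findall(r"\w+", text, re.ASCII)
```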

Multiline matching

^ and $ can match at the beginning and end of a string only, or at the start and end of each "line" within the string, depending on what options are set.
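A short sketch of the two anchoring modes, using Python's re module and its re.MULTILINE flag as a stand-in for the corresponding PCRE option (the sample string is an assumption):

```python
import re

text = "one\ntwo\nthree"

# Default: ^ and $ anchor to the whole string, so a single-word
# pattern cannot span the embedded newlines.
whole_string = re.findall(r"^\w+$", text)

# Multiline: ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)
```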

Newline/linebreak options

When PCRE is compiled, a newline default is selected. The newline convention in effect determines where PCRE detects line beginnings (^) and line ends ($) in multiline mode, as well as what the dot matches (regardless of multiline mode, unless the dotall option (?s) is set). It also affects the PCRE matching procedure (since version 7.0): when an unanchored pattern fails to match at the start of a newline sequence, PCRE advances past the entire newline sequence before retrying the match. If the newline option alternative in effect includes CRLF as one of the valid linebreaks, it does not skip the \n in a CRLF if the pattern contains specific \r or \n references (since version 7.3). Since version 8.10, the metacharacter \N always matches any character other than linebreak characters. It has the same behavior as . when the dotall option, (?s), is not in effect.

The newline option can be altered with external options when PCRE is compiled and when it is run. Some applications using PCRE provide users with the means to apply this setting through an external option. Independently of that, the newline option can also be stated at the start of the pattern using one of the following:

  • (*LF) Newline is a linefeed character. Corresponding linebreaks can be matched with \n.
  • (*CR) Newline is a carriage return. Corresponding linebreaks can be matched with \r.
  • (*CRLF) Newline/linebreak is a carriage return followed by a linefeed. Corresponding linebreaks can be matched with \r\n.
  • (*ANYCRLF) Any of the above encountered in the data will trigger newline processing. Corresponding linebreaks can be matched with (?:\r\n?|\n) or with \R. See below for configuration and options concerning what matches backslash-R.
  • (*ANY) Any of the above plus special Unicode linebreaks.

When not in UTF-8 mode, corresponding linebreaks can be matched with (?:\r\n?|\n|\x0B|\f|\x85)[Note 2] or \R.

In UTF-8 mode, two additional characters are recognized as line breaks with (*ANY):

  • LS (line separator, U+2028),
  • PS (paragraph separator, U+2029).

On Windows, in non-Unicode data, some of the ANY linebreak characters have other meanings.

For example, \x85 can match a horizontal ellipsis, and if encountered while the ANY newline is in effect, it would trigger newline processing.

See below for configuration and options concerning what matches backslash-R.
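The explicit linebreak patterns quoted in this section can be exercised on their own. The sketch below runs them through Python's re module as a stand-in; the (*...) newline verbs and \R are not available there, so only the spelled-out alternations are used. Python strings are Unicode, so \x85 here denotes U+0085 rather than a raw byte, and the sample strings are assumptions.

```python
import re

# Linebreaks recognised under ANYCRLF, as spelled out in the text.
ANYCRLF = r"(?:\r\n?|\n)"

# Linebreaks recognised under ANY when not in UTF-8 mode.
ANY_NON_UTF = r"(?:\r\n?|\n|\x0B|\f|\x85)"

# CRLF is consumed as one break because \n? follows \r greedily.
lines_anycrlf = re.split(ANYCRLF, "one\r\ntwo\rthree\nfour")

# Vertical tab, form feed and NEL count as breaks only under ANY.
lines_any = re.split(ANY_NON_UTF, "one\x0btwo\x0cthree\x85four\r\nfive")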

Backslash-R options

When PCRE is compiled, a default is selected for what matches \R. The default can be either to match the linebreaks corresponding to ANYCRLF or those corresponding to ANY. The default can be overridden when necessary by including (*BSR_UNICODE) or (*BSR_ANYCRLF) at the start of the pattern. When providing a (*BSR..) option, you can also provide a (*newline) option, e.g., (*BSR_UNICODE)(*ANY)rest-of-pattern. The backslash-R options also can be changed with external options by the application calling PCRE2, when a pattern is compiled.

Beginning of pattern options

Several options can be set at the beginning of a pattern: linebreak options such as (*LF), documented above; backslash-R options such as (*BSR_ANYCRLF), documented above; the Unicode character properties option (*UCP), documented above; and the (*UTF) option, which works as follows: if your PCRE2 library has been compiled with UTF support, you can specify (*UTF) at the beginning of a pattern instead of setting an external option to invoke UTF-8, UTF-16, or UTF-32 mode.

Backreferences

A pattern may refer back to the results of a previous match. For example, (a|b)c\1 would match either "aca" or "bcb" and would not match, for example, "acb".
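The example can be checked with a few lines of Python, whose re module uses the same \1 syntax for numbered backreferences (used here as a stand-in for PCRE):

```python
import re

pattern = re.compile(r"(a|b)c\1")

# \1 must repeat whatever text group 1 actually matched,
# so the first and last characters have to agree.
results = {s: pattern.fullmatch(s) is not None for s in ("aca", "bcb", "acb")}
```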

Named subpatterns

A sub-pattern (surrounded by parentheses, like (...)) may be named by including a leading ?P<name> after the opening parenthesis. Named subpatterns are a feature that PCRE adopted from Python regular expressions.

This feature was subsequently adopted by Perl, so now named groups can also be defined using (?<name>...) or (?'name'...), as well as (?P<name>...). Named groups can be backreferenced with, for example: (?P=name) (Python syntax) or \k'name' (Perl syntax).
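Since the (?P<name>...) and (?P=name) forms originate in Python, they can be shown with Python's own re module. The doubled-word pattern and the group name "word" are illustrative assumptions; the Perl-style forms (?<name>...) and \k'name' are not accepted by Python and are not shown.

```python
import re

# Find a word that is immediately repeated, e.g. "is is".
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

match = doubled.search("this is is a test")
repeated_word = match.group("word")   # look the group up by name
position = match.start()
```

Referring to the group by name keeps the pattern valid if other capturing groups are later added in front of it, which would renumber a plain \1.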

Subroutines

While a backreference provides a mechanism to refer to that part of the subject that has previously matched a subpattern, a subroutine provides a mechanism to reuse an underlying previously defined subpattern. The subpattern's options, such as case independence, are fixed when the subpattern is defined. (a.c)(?1) would match "aacabc" or "abcadc", whereas using a backreference (a.c)\1 would not, though both would match "aacaac" or "abcabc". PCRE also supports a non-Perl Oniguruma construct for subroutines. They are specified using \g<subpat-number> or \g<subpat-name>.
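The difference between reusing a subpattern and reusing matched text can be sketched in Python. Python's re module has no (?1) subroutine call, so the sketch emulates it by pasting the subpattern text in twice; this is an assumption-laden approximation for illustration, not how PCRE implements subroutines.

```python
import re

SUB = r"a.c"   # the subpattern that is defined once

# Emulates (a.c)(?1): the second part re-runs the *pattern* a.c.
reuse_pattern = re.compile(f"({SUB})(?:{SUB})")

# (a.c)\1: the second part must repeat the *text* matched by group 1.
reuse_text = re.compile(r"(a.c)\1")

subjects = ("aacabc", "abcadc", "aacaac", "abcabc")
by_pattern = [reuse_pattern.fullmatch(s) is not None for s in subjects]
by_text = [reuse_text.fullmatch(s) is not None for s in subjects]
```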

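Atomic grouping

Atomic grouping is a way of preventing backtracking in a pattern. An atomic group is written (?>...), and a possessive quantifier such as a++ is shorthand for the atomic group (?>a+). For example, a++bc will match as many "a"s as possible and never back up to try one fewer.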
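The effect of refusing to backtrack is easiest to see with a pattern that only succeeds if the quantifier gives a character back. The sketch below uses Python's re module as a stand-in: possessive quantifiers are available there from Python 3.11, and for older versions it falls back to a common emulation (a lookahead plus a backreference), which is an assumption of this example rather than PCRE syntax.

```python
import re
import sys

if sys.version_info >= (3, 11):
    atomic = re.compile(r"a++ab")           # possessive quantifier
else:
    atomic = re.compile(r"(?=(a+))\1ab")    # emulation for older Pythons

backtracking = re.compile(r"a+ab")          # ordinary greedy quantifier

subject = "aaab"

# Greedy a+ first takes "aaa", then gives one "a" back so "ab" can match.
backtracking_matches = backtracking.search(subject) is not None

# The atomic form keeps all the "a"s it took, so "ab" can never follow.
atomic_matches = atomic.search(subject) is not None
```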

Look-ahead and look-behind assertions

Look-behind and look-ahead assertions in Perl regular expressions

Assertion   Lookbehind      Lookahead
Positive    (?<=pattern)    (?=pattern)
Negative    (?<!pattern)    (?!pattern)

Patterns may assert that previous text or subsequent text contains a pattern without consuming matched text (zero-width assertion). For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab itself.
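The tab example, together with a lookbehind, can be tried in Python, whose re module shares this assertion syntax (used here as a stand-in for PCRE; the sample strings are assumptions):

```python
import re

# Lookahead: the word must be followed by a tab, but the tab is not consumed.
m = re.search(r"\w+(?=\t)", "name\tvalue")
word = m.group()
end_of_match = m.end()   # index of the tab itself, which is left unmatched

# Positive lookbehind: digits that are preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $40, qty 12")

# Negative lookbehind: whole numbers that are not preceded by a dollar sign.
non_prices = re.findall(r"(?<!\$)\b\d+", "cost $40, qty 12")
```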

Look-behind assertions cannot be of uncertain length, though (unlike in Perl) each branch can be a different fixed length.

\K can be used in a pattern to reset the start of the current whole match. This provides a flexible alternative approach to look-behind assertions because the discarded part of the match (the part that precedes \K) need not be fixed in length.

Escape sequences for zero-width assertions

For example, \b matches zero-width "word boundaries", similar to (?<=\W)(?=\w)|(?<=\w)(?=\W)|^|$.
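The comparison with the lookaround expression can be checked by listing the positions at which each pattern matches. The sketch uses Python's re module as a stand-in; the helper name positions and the sample strings are assumptions. It also shows why the two are only similar: the ^ alternative matches at the start of a string even when the first character is not a word character.

```python
import re

WORD_BOUNDARY = re.compile(r"\b")
LOOKAROUND = re.compile(r"(?<=\W)(?=\w)|(?<=\w)(?=\W)|^|$")

def positions(pattern, text):
    """Return the offsets of every zero-width match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

# Same boundaries when the string starts and ends with word characters.
same_a = positions(WORD_BOUNDARY, "hello, world")
same_b = positions(LOOKAROUND, "hello, world")

# They differ when the string starts with a non-word character.
differ_a = positions(WORD_BOUNDARY, " hi")
differ_b = positions(LOOKAROUND, " hi")
```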

Comments

A comment begins with (?# and ends at the next closing parenthesis.

Recursive patterns

A pattern can refer back to itself recursively or to any subpattern. For example, the pattern \((a*|(?R))*\) will match any combination of balanced parentheses and "a"s.
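What the recursive pattern accepts can be made concrete with a hand-written recogniser. Python's re module has no (?R), so the sketch below is not a regular expression at all: it is a small recursive function, with hypothetical names group_end and is_balanced, that accepts the same strings as \((a*|(?R))*\) when the whole subject must match.

```python
def group_end(text, start=0):
    """Return the index just past the balanced group that opens at
    text[start], or -1 if no such group starts there."""
    if start >= len(text) or text[start] != "(":
        return -1
    pos = start + 1
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "a":
            pos += 1                      # the a* alternative
        elif text[pos] == "(":
            pos = group_end(text, pos)    # the recursive step, like (?R)
            if pos < 0:
                return -1
        else:
            return -1                     # any other character is rejected
    # A closing parenthesis must be present to finish the group.
    return pos + 1 if pos < len(text) else -1

def is_balanced(text):
    """True if the whole text is one balanced group of parentheses and a's."""
    return group_end(text) == len(text)
```

Each nested opening parenthesis triggers one more level of recursion, which mirrors how the pattern re-enters itself at (?R).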

Generic callouts

PCRE expressions can embed (?Cn), where n is some number. This will call out to an external user-defined function through the PCRE API and can be used to embed arbitrary code in a pattern.

Differences from Perl

Differences between PCRE2 and Perl (as of Perl 5.9.4) include but are not limited to:[6]

Until release 10.30, recursive matches were atomic in PCRE and non-atomic in Perl

This meant that "<<!>!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ would match in Perl but not in PCRE2 until release 10.30.

The value of a capture buffer deriving from the ? quantifier (match 1 or 0 times) when nested in another quantified capture buffer is different

In Perl "aba" =~ /^(a(b)?)+$/; will result in $1 containing "a" and $2 containing undef, but in PCRE will result in $2 containing "b".

PCRE allows named capture buffers to be given numeric names; Perl requires the name to follow the rule of barewords

This means that \g{} is unambiguous in Perl, but potentially ambiguous in PCRE.

This is no longer a difference since PCRE 8.34 (released on 2013-12-15), which no longer allows group names to start with a digit.[7]

PCRE allows alternatives within lookbehind to be different lengths

Within lookbehind assertions, both PCRE and Perl require fixed-length patterns.

That is, both PCRE and Perl disallow variable-length patterns using quantifiers within lookbehind assertions.

However, Perl requires all alternative branches of a lookbehind assertion to be the same length as each other, whereas PCRE allows those alternative branches to have different lengths from each other as long as each branch still has a fixed length.

PCRE does not support certain "experimental" Perl constructs

PCRE supports neither (??{...}) (a callback whose return is evaluated as being part of the pattern) nor the (?{}) construct, although the latter can be emulated using (?Cn).

Recursion control verbs added in the Perl 5.9.x series are also not supported.

Support for experimental backtracking control verbs (added in Perl 5.10) is available in PCRE since version 7.3.

They are (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT).

Perl's corresponding use of arguments with backtracking control verbs is not generally supported.

Note however that since version 8.10, PCRE supports the following verbs with a specified argument: (*MARK:markName), (*SKIP:markName), (*PRUNE:markName), and (*THEN:markName).

Since version 10.32 PCRE2 has supported (*ACCEPT:markName), (*FAIL:markName), and (*COMMIT:markName).

PCRE and Perl are slightly different in their tolerance of erroneous constructs

Perl allows quantifiers on the (?!...) construct, which is meaningless but harmless (albeit inefficient); PCRE produces an error in versions before 8.13.

PCRE has a hard limit on recursion depth, Perl does not

With default build options "bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/ will fail to match due to the limit, but Perl will match this correctly.

Perl uses the heap for recursion and has no hard limit for recursion depth, whereas PCRE2 has a compile-time default limit that can be adjusted up or down by the calling application.

Verification

With the exception of the above points, PCRE is capable of passing the tests in the Perl "t/op/re_tests" file, one of the main syntax-level regression tests for Perl's regular expression engine.

Notes and references

Notes

  1. ^ The core PCRE2 library provides both matching and match and replace functionality.
  2. ^ Sure the \x85 part is not \xC2\x85? (i.e. (?:\r\n?|\n|\x0B|\f|\xC2\x85), as U+0085 != 0x85)

    Caveat: If the pattern \xC2\x85 failed to work: experiment with the RegEx implementation's Unicode settings, or try substituting with the following:
    • \x{0085}
    • \u0085

References

[8]

  1. ^ Final release of PCRE1: https://lists.exim.org/lurker/message/20210615.162400.c16ff8a3.en.html
  2. ^ Releases: https://github.com/PCRE2Project/pcre2/releases
  3. ^ Exim and PCRE: How free software hijacked my life (1999-12), by Philip Hazel, p. 7: https://www.ukuug.org/events/winter99/proc/PH.ps

    What about PCRE?

    • Written summer 1997, placed on ftp site.
    • People found it, and started a mailing list.
    • There has been a trickle of enhancements.
  4. ^
  5. ^ PCRE2 - Perl-compatible regular expressions (revised API) (2020), by University of Cambridge: https://pcre.org/pcre2.txt
  6. ^ Differences Between PCRE2 and Perl (2019-07-13), by Philip Hazel: https://www.pcre.org/current/doc/html/pcre2compat.html
  7. ^ Quote PCRE changelog (https://www.pcre.org/original/changelog.txt): "Perl no longer allows group names to start with digits, so I have made this change also in PCRE."
  8. ^ ChangeLog for PCRE2: https://www.pcre.org/changelog.txt

See also edit

External links edit

perl, compatible, regular, expressions, pcre, library, written, which, implements, regular, expression, engine, inspired, capabilities, perl, programming, language, philip, hazel, started, writing, pcre, summer, 1997, pcre, syntax, much, more, powerful, flexib. Perl Compatible Regular Expressions PCRE is a library written in C which implements a regular expression engine inspired by the capabilities of the Perl programming language Philip Hazel started writing PCRE in summer 1997 3 PCRE s syntax is much more powerful and flexible than either of the POSIX regular expression flavors BRE ERE 4 and than that of many other regular expression libraries Perl Compatible Regular ExpressionsOriginal author s Philip HazelStable release s PCRE18 45 June 15 2021 2 years ago 2021 06 15 1 PCRE210 42 December 12 2022 14 months ago 2022 12 12 2 Repositorygithub wbr com wbr PCRE2Project wbr pcre2Written inCOperating systemCross platformTypePattern matching libraryLicenseBSDWebsitepcre wbr orgWhile PCRE originally aimed at feature equivalence with Perl the two implementations are not fully equivalent During the PCRE 7 x and Perl 5 9 x phase the two projects coordinated development with features being ported between them in both directions 5 In 2015 a fork of PCRE was released with a revised programming interface API The original software now called PCRE1 the 1 xx 8 xx series has had bugs mended but no further development As of 2020 update it is considered obsolete and the current 8 45 release is likely to be the last The new PCRE2 code the 10 xx series has had a number of extensions and coding improvements and is where development takes place A number of prominent open source programs such as the Apache and Nginx HTTP servers and the PHP and R scripting languages incorporate the PCRE library proprietary software can do likewise as the library is BSD licensed As of Perl 5 10 PCRE is also available as a replacement for Perl s default regular expression engine through the re engine PCRE 
module The library can be built on Unix Windows and several other environments PCRE2 is distributed with a POSIX C wrapper Note 1 several test programs and the utility program pcre2grep that is built in tandem with the library Contents 1 Features 1 1 Just in time compiler support 1 2 Flexible memory management 1 3 Consistent escaping rules 1 4 Extended character classes 1 5 Minimal matching a k a ungreedy 1 6 Unicode character properties 1 7 Multiline matching 1 8 Newline linebreak options 1 9 Backslash R options 1 10 Beginning of pattern options 1 11 Backreferences 1 12 Named subpatterns 1 13 Subroutines 1 14 Atomic grouping 1 15 Look ahead and look behind assertions 1 16 Escape sequences for zero width assertions 1 17 Comments 1 18 Recursive patterns 1 19 Generic callouts 2 Differences from Perl 2 1 Until release 10 30 recursive matches were atomic in PCRE and non atomic in Perl 2 2 The value of a capture buffer deriving from the quantifier match 1 or 0 times when nested in another quantified capture buffer is different 2 3 PCRE allows named capture buffers to be given numeric names Perl requires the name to follow the rule of barewords 2 4 PCRE allows alternatives within lookbehind to be different lengths 2 5 PCRE does not support certain experimental Perl constructs 2 6 PCRE and Perl are slightly different in their tolerance of erroneous constructs 2 7 PCRE has a hard limit on recursion depth Perl does not 2 8 Verification 3 Notes and references 3 1 Notes 3 2 References 4 See also 5 External linksFeatures editJust in time compiler support edit This optional feature is available if enabled when the PCRE2 library is built Large performance benefits are possible when for example the calling program utilizes the feature with compatible patterns that are executed repeatedly The just in time compiler support was written by Zoltan Herczeg and is not addressed in the POSIX wrapper Flexible memory management edit The use of the system stack for backtracking can be 
problematic in PCRE1 which is why this feature of the implementation was changed in PCRE2 The heap is now used for this purpose and the total amount can be limited The problem of stack overflow which came up regularly with PCRE1 is no longer an issue with PCRE2 from release 10 30 2017 Consistent escaping rules edit Like Perl PCRE2 has consistent escaping rules any non alpha numeric character may be escaped to mean its literal value by prefixing a backslash before the character Any alpha numeric character preceded by a backslash typically gives it a special meaning In the case where the sequence has not been defined to be special an error occurs This is different to Perl which gives an error only if it is in warning mode PCRE2 does not have a warning mode In basic POSIX regular expressions sometimes backslashes escaped non alpha numerics e g and sometimes they introduced a special feature e g Extended character classes edit Single letter character classes are supported in addition to the longer POSIX names For example d matches any digit exactly as digit would in POSIX regular expressions Minimal matching a k a ungreedy edit A may be placed after any repetition quantifier to indicate that the shortest match should be used The default is to attempt the longest match first and backtrack through shorter matches e g a b would match first ab in ababab where a b would match the entire string If the U flag is set then quantifiers are ungreedy lazy by default while makes them greedy Unicode character properties edit Unicode defines several properties for each character Patterns in PCRE2 can match these properties e g span class err span span class nv p span span class p span span class x Ps span span class p span span class o span span class err span span class nv p span span class p span span class x Pe span span class p span would match a string beginning with any opening punctuation and ending with any close punctuation such as abc Matching of certain normal 
metacharacters can be driven by Unicode properties when the compile option PCRE2 UCP is set The option can be set for a pattern by including UCP at the start of pattern The option alters behavior of the following metacharacters B b D d S s W w and some of the POSIX character classes For example the set of characters matched by w word characters is expanded to include letters and accented letters as defined by Unicode properties Such matching is slower than the normal ASCII only non UCP alternative Note that the UCP option requires the library to have been built to include Unicode support this is the default for PCRE2 Very early versions of PCRE1 supported only ASCII code Later UTF 8 support was added Support for UTF 16 was added in version 8 30 and support for UTF 32 in version 8 32 PCRE2 has always supported all three UTF encodings Multiline matching edit and can match at the beginning and end of a string only or at the start and end of each line within the string depending on what options are set Newline linebreak options edit When PCRE is compiled a newline default is selected Which newline linebreak is in effect affects where PCRE detects line beginnings and ends in multiline mode as well as what matches dot regardless of multiline mode unless the dotall option s is set It also affects PCRE matching procedure since version 7 0 when an unanchored pattern fails to match at the start of a newline sequence PCRE advances past the entire newline sequence before retrying the match If the newline option alternative in effect includes CRLF as one of the valid linebreaks it does not skip the n in a CRLF if the pattern contains specific r or n references since version 7 3 Since version 8 10 the metacharacter N always matches any character other than linebreak characters It has the same behavior as when the dotall option aka s is not in effect The newline option can be altered with external options when PCRE is compiled and when it is run Some applications using PCRE 
provide users with the means to apply this setting through an external option So the newline option can also be stated at the start of the pattern using one of the following LF Newline is a linefeed character Corresponding linebreaks can be matched with n CR Newline is a carriage return Corresponding linebreaks can be matched with r CRLF Newline linebreak is a carriage return followed by a linefeed Corresponding linebreaks can be matched with r n ANYCRLF Any of the above encountered in the data will trigger newline processing Corresponding linebreaks can be matched with span class o span span class err span span class nv r span span class err span span class nv n span span class o span span class err span span class nv n span span class o span or with R See below for configuration and options concerning what matches backslash R ANY Any of the above plus special Unicode linebreaks When not in UTF 8 mode corresponding linebreaks can be matched with span class o span span class err span span class nv r span span class err span span class nv n span span class o span span class err span span class nv n span span class o span span class err span span class nv x0B span span class o span span class err span span class nv f span span class o span span class err span span class nv x85 span span class o span Note 2 or R In UTF 8 mode two additional characters are recognized as line breaks with ANY LS line separator U 2028 PS paragraph separator U 2029 On Windows in non Unicode data some of the ANY linebreak characters have other meanings For example x85 can match a horizontal ellipsis and if encountered while the ANY newline is in effect it would trigger newline processing See below for configuration and options concerning what matches backslash R Backslash R options edit When PCRE is compiled a default is selected for what matches R The default can be either to match the linebreaks corresponding to ANYCRLF or those corresponding to ANY The default can be overridden when 
necessary by including BSR UNICODE or BSR ANYCRLF at the start of the pattern When providing a BSR option you can also provide a i newline i option e g BSR UNICODE ANY i rest of pattern i The backslash R options also can be changed with external options by the application calling PCRE2 when a pattern is compiled Beginning of pattern options edit Linebreak options such as LF documented above backslash R options such as BSR ANYCRLF documented above Unicode Character Properties option UCP documented above UTF8 option documented as follows if your PCRE2 library has been compiled with UTF support you can specify the UTF option at the beginning of a pattern instead of setting an external option to invoke UTF 8 UTF 16 or UTF 32 mode Backreferences edit A pattern may refer back to the results of a previous match For example a b c 1 would match either aca or bcb and would not match for example acb Named subpatterns edit A sub pattern surrounded by parentheses like may be named by including a leading P lt name gt after the opening parenthesis Named subpatterns are a feature that PCRE adopted from Python regular expressions This feature was subsequently adopted by Perl so now named groups can also be defined using lt name gt or name as well as P lt name gt Named groups can be backreferenced with for example P name Python syntax or k name Perl syntax Subroutines edit While a backreference provides a mechanism to refer to that part of the subject that has previously matched a subpattern a subroutine provides a mechanism to reuse an underlying previously defined subpattern The subpattern s options such as case independence are fixed when the subpattern is defined a c 1 would match aacabc or abcadc whereas using a backreference a c 1 would not though both would match aacaac or abcabc PCRE also supports a non Perl Oniguruma construct for subroutines They are specified using g lt subpat number gt or g lt subpat name gt Atomic grouping edit Atomic grouping is a way of preventing 
backtracking in a pattern For example a bc will match as many a s as possible and never back up to try one less Look ahead and look behind assertions edit Assertion Lookbehind LookaheadPositive lt pattern pattern Negative lt pattern pattern Look behind and look ahead assertionsin Perl regular expressionsPatterns may assert that previous text or subsequent text contains a pattern without consuming matched text zero width assertion For example w t matches a word followed by a tab without including the tab itself Look behind assertions cannot be of uncertain length though unlike Perl each branch can be a different fixed length K can be used in a pattern to reset the start of the current whole match This provides a flexible alternative approach to look behind assertions because the discarded part of the match the part that precedes K need not be fixed in length Escape sequences for zero width assertions edit E g b for matching zero width word boundaries similar to span class o span span class err lt span span class o span span class err span span class nv W span span class o span span class err span span class nv w span span class o span span class err lt span span class o span span class err span span class nv w span span class o span span class err span span class nv W span span class o span Comments edit A comment begins with and ends at the next closing parenthesis Recursive patterns edit A pattern can refer back to itself recursively or to any subpattern For example the pattern span class err span span class o span span class nv a span span class o span span class nv R span span class o span span class err span span class o span will match any combination of balanced parentheses and a s Generic callouts edit PCRE expressions can embed C i n i where n is some number This will call out to an external user defined function through the PCRE API and can be used to embed arbitrary code in a pattern Differences from Perl editThis section needs to be updated The reason 
Differences from Perl

Differences between PCRE2 and Perl (as of Perl 5.9.4) include but are not limited to:[6]

Until release 10.30, recursive matches were atomic in PCRE and non-atomic in Perl
This meant that "<<!>!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ would match in Perl but not in PCRE2 until release 10.30.

The value of a capture buffer deriving from the ? quantifier (match 1 or 0 times) when nested in another quantified capture buffer is different
In Perl, "aba" =~ /^(a(b)?)+$/; will result in $1 containing "a" and $2 containing undef, but in PCRE will result in $2 containing "b".

PCRE allows named capture buffers to be given numeric names; Perl requires the name to follow the rule of barewords
This means that \g{} is unambiguous in Perl, but potentially ambiguous in PCRE. This is no longer a difference since PCRE 8.34 (released on 2013-12-15), which no longer allows group names to start with a digit.[7]

PCRE allows alternatives within lookbehind to be different lengths
Within lookbehind assertions, both PCRE and Perl require fixed-length patterns. That is, both PCRE and Perl disallow variable-length patterns using quantifiers within lookbehind assertions. However, Perl requires all alternative branches of a lookbehind assertion to be the same length as each other, whereas PCRE allows those alternative branches to have different lengths from each other as long as each branch still has a fixed length.

PCRE does not support certain "experimental" Perl constructs
Such as (??{...}) (a callback whose return is evaluated as being part of the pattern) nor the (?{}) construct, although the latter can be emulated using (?Cn). Recursion control verbs added in the Perl 5.9.x series are also not supported. Support for experimental backtracking control verbs (added in Perl 5.10) is available in PCRE since version 7.3. They are (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT). Perl's corresponding use of arguments with backtracking control verbs is not generally supported. Note however that since version 8.10, PCRE supports the following verbs with a specified argument: (*MARK:markName), (*SKIP:markName), (*PRUNE:markName), and (*THEN:markName). Since version 10.32, PCRE2 has supported (*ACCEPT:markName), (*FAIL:markName), and (*COMMIT:markName).

PCRE and Perl are slightly different in their tolerance of erroneous constructs
Perl allows quantifiers on the (?!...) construct, which is meaningless but harmless (albeit inefficient); PCRE produces an error in versions before 8.13.

PCRE has a hard limit on recursion depth, Perl does not
With default build options, "bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/ will fail to match due to the limit, but Perl will match this correctly. Perl uses the heap for recursion and has no hard limit for recursion depth, whereas PCRE2 has a compile-time default limit that can be adjusted up or down by the calling application.

Verification

With the exception of the above points, PCRE is capable of passing the tests in the Perl "t/op/re_tests" file, one of the main syntax-level regression tests for Perl's regular expression engine.

Notes and references

Notes

1. The core PCRE2 library provides both matching and match-and-replace functionality.
2. Sure the \x85 part is not \xC2\x85? (i.e. (?:\r\n?|\n|\x0B|\f|\xC2\x85), as U+0085 != 0x85). Caveat: if the pattern \xC2\x85 failed to work, experiment with the RegEx implementation's Unicode settings, or try substituting with the following: \x{0085}, \u0085.

References

1. Final release of PCRE1: https://lists.exim.org/lurker/message/20210615.162400.c16ff8a3.en.html
2. Releases: https://github.com/PCRE2Project/pcre2/releases
3. Exim and PCRE: How free software hijacked my life (1999-12), by Philip Hazel, p. 7: https://www.ukuug.org/events/winter99/proc/PH.ps. "What about PCRE? Written summer 1997, placed on ftp site. People found it, and started a mailing list. There has been a trickle of enhancements."
4. Regular Expression, POSIX Standard. Utilities, Pattern Matching Notation: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/V3_chap02.html#tag_18_13. Base Definitions, Basic Regular Expressions: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap09.html#tag_09_03. Rationale, Regular Expressions: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/xrat/V4_xbd_chap09.html#tag_21_09
5. PCRE2 - Perl-compatible regular expressions (revised API) (2020), by University of Cambridge: https://pcre.org/pcre2.txt
6. Differences Between PCRE2 and Perl (2019-07-13), by Philip Hazel: https://www.pcre.org/current/doc/html/pcre2compat.html
7. PCRE changelog (https://www.pcre.org/original/changelog.txt), quote: "Perl no longer allows group names to start with digits, so I have made this change also in PCRE."
8. ChangeLog for PCRE2: https://www.pcre.org/changelog.txt

See also

  • Pcregrep
  • Comparison of regular expression engines

External links

  • Official website
  • PCRE Development mailing list: https://groups.google.com/g/pcre2-dev
  • PCRE Bug Tracker: https://github.com/PCRE2Project/pcre2/issues
  • Pattern Matching Using Regular Expressions (2010-03-02), by Nick Maclaren and Philip Hazel: https://www-uxsup.csx.cam.ac.uk/courses/moved.REs/paper.pdf
  • pcre-8.43 (2019-04), Windows Cygwin x86-64: https://www-uxsup.csx.cam.ac.uk/pub/windows/cygwin/x86_64/release/pcre/
