Need to understand regexp_replace logic of given code [duplicate] - sql
This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
What is this?
This is a collection of common Q&A. This is also a Community Wiki, so everyone is invited to participate in maintaining it.
Why is this?
regex is suffering from "give me ze code" type questions and poor answers with no explanation. This reference is meant to provide links to quality Q&A.
What's the scope?
This reference is meant for the following languages: php, perl, javascript, python, ruby, java, .net.
This might be too broad, but these languages share the same syntax. Language-specific features are marked with the tag of the language after them, for example:
What are regular expression Balancing Groups? .net
The Stack Overflow Regular Expressions FAQ
See also a lot of general hints and useful links at the regex tag details page.
Online tutorials
RegexOne ↪
Regular Expressions Info ↪
Quantifiers
Zero-or-more: *:greedy, *?:reluctant, *+:possessive
One-or-more: +:greedy, +?:reluctant, ++:possessive
?:optional (zero-or-one)
Min/max ranges (all inclusive): {n,m}:between n & m, {n,}:n-or-more, {n}:exactly n
Differences between greedy, reluctant (a.k.a. "lazy", "ungreedy") and possessive quantifier:
Greedy vs. Reluctant vs. Possessive Quantifiers
In-depth discussion on the differences between greedy versus non-greedy
What's the difference between {n} and {n}?
Can someone explain Possessive Quantifiers to me? php, perl, java, ruby
Emulating possessive quantifiers .net
Non-Stack Overflow references: From Oracle, regular-expressions.info
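The greedy/reluctant distinction and the inclusive {n,m} range above can be illustrated with a minimal sketch using Python's re module. The sample strings are made up for illustration. Possessive quantifiers are not shown because older Python versions do not support them.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, then backtracks to the LAST '>'.
greedy = re.findall(r"<.*>", text)

# Reluctant (lazy): .*? takes as little as it can, stopping at the FIRST '>'.
reluctant = re.findall(r"<.*?>", text)

# Inclusive range: {2,3} matches two or three digits, preferring three.
ranges = re.findall(r"\d{2,3}", "7 42 123 12345")

print(greedy)     # one match spanning the whole string
print(reluctant)  # four separate tags
print(ranges)     # ['42', '123', '123', '45']
```

Note that the single digit 7 is skipped and 12345 is consumed as 123 followed by 45, because the quantifier is greedy within its bounds.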
Character Classes
What is the difference between square brackets and parentheses?
[...]: any one character, [^...]: negated/any character but
[^] matches any one character including newlines javascript
[\w-[\d]] / [a-z-[qz]]: set subtraction .net, xml-schema, xpath, JGSoft
[\w&&[^\d]]: set intersection java, ruby 1.9+
[[:alpha:]]:POSIX character classes
[[:<:]] and [[:>:]] Word boundaries
Why do [^\\D2], [^[^0-9]2], [^2[^0-9]] get different results in Java? java
Shorthand:
Digit: \d:digit, \D:non-digit
Word character (Letter, digit, underscore): \w:word character, \W:non-word character
Whitespace: \s:whitespace, \S:non-whitespace
Unicode categories (\p{L}, \P{L}, etc.)
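As a small illustration of the shorthand classes and a negated bracket class listed above, here is a sketch in Python. The sample string is hypothetical, and other flavors may treat \w and \s differently for non-ASCII input.

```python
import re

sample = "id_42: value\tok"

digits = re.findall(r"\d", sample)      # single digits
words = re.findall(r"\w+", sample)      # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)      # whitespace characters
punct = re.findall(r"[^\w\s]", sample)  # negated class: neither word nor space

print(digits)  # ['4', '2']
print(words)   # ['id_42', 'value', 'ok']
print(spaces)  # [' ', '\t']
print(punct)   # [':']
```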
Escape Sequences
Horizontal whitespace: \h:space-or-tab, \t:tab
Newlines:
\r, \n:carriage return and line feed
\R:generic newline php java-8
Negated whitespace sequences: \H:Non horizontal whitespace character, \V:Non vertical whitespace character, \N:Non line feed character pcre php5 java-8
Other: \v:vertical tab, \e:the escape character
Anchors
anchor | matches | flavors
^ | Start of string | Common [*]
^ | Start of line | Common [m]
$ | End of line | Common [m]
$ | End of text | Common [*] except javascript
$ | Very end of string | javascript [*], php [D]
\A | Start of string | Common except javascript
\Z | End of text | Common except javascript python
\Z | Very end of string | python
\z | Very end of string | Common except javascript python
\b | Word boundary | Common
\B | Not a word boundary | Common
\G | End of previous match | Common except javascript, python

Term | Definition
Start of string | At the very start of the string.
Start of line | At the very start of the string, and after a non-terminal line terminator.
Very end of string | At the very end of the string.
End of text | At the very end of the string, and at a terminal line terminator.
End of line | At the very end of the string, and at a line terminator.
Word boundary | At a word character not preceded by a word character, and at a non-word character not preceded by a non-word character.
End of previous match | At a previously set position, usually where a previous match ended. At the very start of the string if no position was set.

"Common" refers to the following: icu java javascript .net objective-c pcre perl php python swift ruby
[*] Default | [m] Multi-line mode. | [D] Dollar end only mode.
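The anchor definitions above can be checked with a short sketch in Python, where $ means "end of text" by default and \Z means "very end of string". The sample strings are made up, and other flavors differ as listed in the table.

```python
import re

text = "first\nsecond\n"

# Default mode: ^ matches only at the very start of the string.
start_default = re.findall(r"^\w+", text)

# Multi-line mode: ^ also matches after each non-terminal line terminator.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# $ (end of text): matches before the terminal newline as well as at the end.
dollar_before_newline = re.search(r"second$", text) is not None

# \Z in Python (very end of string): the terminal newline must be consumed.
z_before_newline = re.search(r"second\Z", text) is not None
z_at_very_end = re.search(r"second\n\Z", text) is not None

# \b: "cat" inside "concat" is not at a word boundary, so it is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat!")

print(start_default, start_multiline)
print(dollar_before_newline, z_before_newline, z_at_very_end)
print(whole_words)
```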
Groups
(...):capture group, (?:):non-capture group
Why is my repeating capturing group only capturing the last match?
\1:backreference and capture-group reference, $1:capture group reference
What's the meaning of a number after a backslash in a regular expression?
\g<1>123:How to follow a numbered capture group, such as \1, with a number?: python
What does a subpattern (?i:regex) mean?
What does the 'P' in (?P<group_name>regexp) mean?
(?>):atomic group or independent group, (?|):branch reset
Equivalent of branch reset in .NET/C# .net
Named capture groups:
General named capturing group reference at regular-expressions.info
java: (?<groupname>regex): Overview and naming rules (Non-Stack Overflow links)
Other languages: (?P<groupname>regex) python, (?<groupname>regex) .net, (?<groupname>regex) perl, (?P<groupname>regex) and (?<groupname>regex) php
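A minimal sketch of the group features above, using Python syntax ((?P<name>...) for named groups and \g<1> in replacements). The input strings are hypothetical.

```python
import re

# Numbered capture groups.
m = re.search(r"(\d{4})-(\d{2})", "released 2021-07")
year, month = m.group(1), m.group(2)

# Backreference inside the pattern: \1 must repeat what group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine").group(0)

# \g<1> lets a digit follow the group reference; \10 would mean group 10.
widened = re.sub(r"(\d+)px", r"\g<1>0px", "width: 12px")

# Named capture groups.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2021-07")
named_year = named.group("year")

# A repeated capturing group keeps only its last iteration.
last_only = re.match(r"(?:(\d),?)+", "1,2,3").group(1)

print(year, month, doubled, widened, named_year, last_only)
```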
Lookarounds
Lookaheads: (?=...):positive, (?!...):negative
Lookbehinds: (?<=...):positive, (?<!...):negative
Lookbehind limits in:
Lookbehinds need to be constant-length php, perl, python, ruby
Lookarounds of limited length {0,n} java
Variable length lookbehinds are allowed .net
Lookbehind alternatives:
Using \K php, perl (Flavors that support \K)
Alternative regex module for Python python
The hacky way
JavaScript negative lookbehind equivalents External link
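The four lookaround forms above, and the constant-length lookbehind limit in Python's built-in re module, can be sketched as follows. The sample data is invented for illustration.

```python
import re

prices = "USD 100, EUR 250, USD 75"
pairs = "key: value, other: thing"

# Positive lookbehind: numbers preceded by "USD ".
usd = re.findall(r"(?<=USD )\d+", prices)

# Negative lookbehind: whole numbers NOT preceded by "USD ".
# \b stops the match from starting in the middle of a number.
non_usd = re.findall(r"(?<!USD )\b\d+", prices)

# Positive lookahead: words followed by a colon (the colon is not consumed).
keys = re.findall(r"\w+(?=:)", pairs)

# Negative lookahead: whole words NOT followed by a colon.
values = re.findall(r"\b\w+\b(?!:)", pairs)

# Python's re requires fixed-width lookbehinds, so a+ is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(usd, non_usd, keys, values, variable_lookbehind_ok)
```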
Modifiers
flag | modifier | flavors
a | ASCII | python
c | current position | perl
e | expression | php perl
g | global | most
i | case-insensitive | most
m | multiline | php perl python javascript .net java
m | (non)multiline | ruby
o | once | perl ruby
r | non-destructive | perl
S | study | php
s | single line | ruby
U | ungreedy | php r
u | unicode | most
x | whitespace-extended | most
y | sticky ↪ | javascript
How to convert preg_replace e to preg_replace_callback?
What are inline modifiers?
What is '?-mix' in a Ruby Regular Expression
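As an illustration of a few of the modifiers above, here is a sketch in Python, where flags are passed as re constants or written inline as (?i). The sample inputs are hypothetical, and flag names and semantics vary between flavors.

```python
import re

log = "Error ERROR error"

# i: case-insensitive, as a flag argument and as an inline modifier.
by_flag = re.findall(r"error", log, re.IGNORECASE)
by_inline = re.findall(r"(?i)error", log)

# m: multiline, so ^ and $ match at line boundaries too.
lines_default = re.findall(r"^\w+$", "one\ntwo")
lines_multiline = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)

# s (DOTALL in Python): the dot also matches a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# x: whitespace-extended, so the pattern can be laid out with comments.
phone = re.compile(r"""
    (\d{3})    # first group: three digits
    -          # literal hyphen
    (\d{4})    # second group: four digits
""", re.VERBOSE)
parts = phone.search("call 555-1234").groups()

print(by_flag, by_inline, lines_default, lines_multiline)
print(dot_default, dot_dotall, parts)
```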
Other:
|:alternation (OR) operator, .:any character, [.]:literal dot character
What special characters must be escaped?
Control verbs (php and perl): (*PRUNE), (*SKIP), (*FAIL) and (*F)
php only: (*BSR_ANYCRLF)
Recursion (php and perl): (?R), (?0) and (?1), (?-1), (?&groupname)
Common Tasks
Get a string between two curly braces: {...}
Match (or replace) a pattern except in situations s1, s2, s3...
How do I find all YouTube video ids in a string using a regex?
Validation:
Internet: email addresses, URLs (host/port: regex and non-regex alternatives), passwords
Numeric: a number, min-max ranges (such as 1-31), phone numbers, date
Parsing HTML with regex: See "General Information > When not to use Regex"
Advanced Regex-Fu
Strings and numbers:
Regular expression to match a line that doesn't contain a word
How does this PCRE pattern detect palindromes?
Match strings whose length is a fourth power
How does this regex find triangular numbers?
How to determine if a number is a prime with regex?
How to match the middle character in a string with regex?
Other:
How can we match a^n b^n?
Match nested brackets
Using a recursive pattern php, perl
Using balancing groups .net
“Vertical” regex matching in an ASCII “image”
List of highly up-voted regex questions on Code Golf
How to make two quantifiers repeat the same number of times?
An impossible-to-match regular expression: (?!a)a
Match/delete/replace this except in contexts A, B and C
Match nested brackets with regex without using recursion or balancing groups?
Flavor-Specific Information
(Except for those marked with *, this section contains non-Stack Overflow links.)
Java
Official documentation: Pattern Javadoc ↪, Oracle's regular expressions tutorial ↪
The differences between functions in java.util.regex.Matcher:
matches(): The match must be anchored to both input-start and -end
find(): A match may be anywhere in the input string (substrings)
lookingAt(): The match must be anchored to input-start only
(For anchors in general, see the section "Anchors")
The only java.lang.String functions that accept regular expressions: matches(s), replaceAll(s,s), replaceFirst(s,s), split(s), split(s,i)
*An (opinionated and) detailed discussion of the disadvantages of and missing features in java.util.regex
.NET
How to read a .NET regex with look-ahead, look-behind, capturing groups and back-references mixed together?
Official documentation:
Boost regex engine: General syntax, Perl syntax (used by TextPad, Sublime Text, UltraEdit, ...???)
JavaScript general info and RegExp object
.NET, MySQL, Oracle, Perl5 version 18.2
PHP: pattern syntax, preg_match
Python: Regular expression operations, search vs match, how-to
Rust: crate regex, struct regex::Regex
Splunk: regex terminology and syntax and regex command
Tcl: regex syntax, manpage, regexp command
Visual Studio Find and Replace
General information
(Links marked with * are non-Stack Overflow links.)
Other general documentation resources: Learning Regular Expressions, *Regular-expressions.info, *Wikipedia entry, *RexEgg, Open-Directory Project
DFA versus NFA
Generating Strings matching regex
Books: Jeffrey Friedl's Mastering Regular Expressions
When to not use regular expressions:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (blog post written by Stack Overflow's founder)*
Do not use regex to parse HTML:
Don't. Please, just don't
Well, maybe...if you're really determined (other answers in this question are also good)
Examples of regex that can cause regex engine to fail
Why does this regular expression kill the Java regex engine?
Tools: Testers and Explainers
(This section contains non-Stack Overflow links.)
Online (* includes replacement tester, + includes split tester):
Debuggex (Also has a repository of useful regexes) javascript, python, pcre
*Regular Expressions 101 php, pcre, python, javascript, java
Regex Pal, regular-expressions.info javascript
Rubular ruby
RegExr
Regex Hero dotnet
*+ regexstorm.net .net
*RegexPlanet: Java java, Go go, Haskell haskell, JavaScript javascript, .NET dotnet, Perl perl php PCRE php, Python python, Ruby ruby, XRegExp xregexp
freeformatter.com xregexp
*+regex.larsolavtorvik.com php PCRE and POSIX, javascript
Offline:
Microsoft Windows: RegexBuddy (analysis), RegexMagic (creation), Expresso (analysis, creation, free)
MySQL 8.0: Various syntax changes were made. Note especially the doubling of backslashes in some contexts. (This answer needs further editing to reflect the differences.)
Related
How can I remove everything after the last occurence of a character (_) and all digits in the end of a string in Snowflake SQL? [duplicate]
man page of fc builtin uses − rather than - for options
I know that BUILTIN commands don't have separate man pages, however I am curious about the following. Upon executing man fc in the terminal I searched for -l to look for its description. However, there is no result. The reason is that the man page for fc (and maybe other builtins?) uses − (which corresponds to <−> 8722, Hex 2212, Oct 21022, Digr -2) rather than - for option (even if the actual way to use them is the latter, not the former). Is this somehow intended?
fc is part of the POSIX Shell & Utilities, which means it is standardized for better portability. Its POSIX page has a description of the utility with all the portable options, all using the standard ASCII hyphen character (0x2d). Also, the Utility Conventions part of POSIX does mention: Guideline 4: All options should be preceded by the '-' delimiter character. In which - is the "standard" ASCII hyphen character (0x2D). So I'd say that the issue with the − is purely due to aesthetic reasons (probably to make the hyphens more distinguishable/easier to read).
How to split on unicode whitespace in kotlin
In Kotlin, string.split(Regex("\\s+")) splits a string into words separated by whitespace. However, the string val string = "a\u2000b" doesn't split, since the regex doesn't match Unicode whitespace characters. Is there a way to split the string on all whitespace characters?
Since Java 7, Pattern supports the UNICODE_CHARACTER_CLASS flag, which would also solve your issue: Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS) Unfortunately, this isn't directly supported via RegexOption in Kotlin's Regex yet. There is a known issue (KT-21094) that also describes a workaround using the embedded flag (?U): string.split("""(?U)\s+""".toRegex()) You (most probably) need Java 7+ for that to actually work. An alternative could be to use other predefined character classes instead. However, you need to look up the Pattern Javadoc for your Java version to make sure that it actually works (or find out by trial and error ;-)).
I've used the following regex to match Unicode whitespace: Regex("[\\p{javaWhitespace}\u00A0\u2007\u202F]+") This works because, while \s by default matches only the ASCII whitespace characters [ \t\n\x0B\f\r], \p{javaWhitespace} matches everything for which Character.isWhitespace() is true. That method excludes the non-breaking spaces \u00A0, \u2007 and \u202F, which I've listed separately. More info in the docs for Pattern. Related fact: although java.lang.String.trim() doesn't remove non-breaking spaces or figure spaces, kotlin.String.trim() does!
awk pattern to match an XML PI at the start of a line
I have an XML document containing a number of XML Processing Instructions which are of the form: <?cpdoc something?> I am trying to match them in awk with the pattern /^\<\?cpdoc/ but it's not returning anything. If I remove the ^ anchor, it works (but I have other similar PIs which don't start a line which I don't want matched). It looks as if it's being confused by the \<\? but why is it ignoring the line-start anchor?
Don't parse XML with regex; use a proper XML/HTML parser.
Theory: according to compiler theory, XML can't be parsed with a regex based on a finite-state machine. Because XML is hierarchical, you need a pushdown automaton, for example an LALR grammar handled by a tool like YACC.
realLife©®™ everyday tools in a shell: you can use one of the following:
xmllint
xmlstarlet
saxon-lint (my own project)
Check: Using regular expressions with HTML tags
Example using XPath:
xmllint --xpath '//processing-instruction()' file.xml
Solution by OP and explanation by Ed Morton. It works if the less-than sign is not escaped; when escaped, \< is treated as a word-boundary operator rather than a literal <. So instead of \<\? I should use a literal <, as in <\? which makes the full pattern /^<\?cpdoc/. This is because we can't just go escaping any character and hoping for the best; we have to know which characters are metacharacters and escape only those when we want them treated as literals.
Selenium: Part of text present
Is there a way to verify that only part of a text is present? If there is a text "Warning: A15P09 has not been activated." I need to verify the text is present. However, 'A15P09' is not always the same, so I cannot do something like Selenium.IsTextPresent("Warning: A15P09 has not been activated."); I might do something like: Selenium.IsTextPresent("has not been activated."); But is there another way to verify this in Selenium? Please let me know if there is. Thanks!
You could use getText and then do any normal regex that your language supplies for examining that result.
Edit: And for some languages you can do isTextPresent on a pattern modified string. The documentation states:
Various Pattern syntaxes are available for matching string values:
glob:pattern: Match a string against a "glob" (aka "wildmat") pattern. "Glob" is a kind of limited regular-expression syntax typically used in command-line shells. In a glob pattern, "*" represents any sequence of characters, and "?" represents any single character. Glob patterns match against the entire string.
regexp:regexp: Match a string using a regular-expression. The full power of JavaScript regular-expressions is available.
exact:string: Match a string exactly, verbatim, without any of that fancy wildcard stuff.
If no pattern prefix is specified, Selenium assumes that it's a "glob" pattern.
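The "get the text, then apply a regex" idea can be sketched in plain Python without any Selenium calls. Everything here is an assumption for illustration: page_text stands in for whatever getText returns, the helper names are hypothetical, and the variable code is assumed to consist of uppercase letters and digits. The glob variant uses Python's fnmatch module, which, like the glob patterns described above, matches against the entire string.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical page text, standing in for the result of a getText call.
page_text = "Warning: A15P09 has not been activated."

# Assumption: the variable code is one or more uppercase letters or digits.
WARNING_RE = re.compile(r"Warning: [A-Z0-9]+ has not been activated\.")

def has_activation_warning(text):
    """Return True if the warning appears anywhere in the text."""
    return WARNING_RE.search(text) is not None

def glob_matches(text):
    """Glob-style check; the pattern must match the whole string."""
    return fnmatchcase(text, "Warning: * has not been activated.")

print(has_activation_warning(page_text))
print(glob_matches(page_text))
```

The regex version finds the warning inside a longer page, while the glob version only succeeds when the text is exactly the warning, so a wildcard would be needed on both sides to search within a larger body of text.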