String to search:
VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)
Regex search so far:
VALUES\s*\(\s*'?\s*(.+?)\s*'?\s*,\s*'?\s*(.+?)\s*'?\s*,\s*'?\s*(.+?)\s*'?\s*\)
Regex replace to 3 groups: ie \1 \2 \3
I am aiming for a result of:
9gfdg to_date('1876/12/06' ,'YYYY/MM/DD') null
but instead get (because of that extra comma in to_Date and also lazy instead of greedy):
9gfdg to_date('1876/12/06 YYYY/MM/DD , null)
Note:
There are always exactly 3 fields (the values within the 3 fields may differ, but you get the idea of the format I am grappling with). Each field could contain commas (usually in character values), or it could be a keyword such as null, a number, or a to_date expression.
Regex engine is VBA/VBScript
Anyone have any pointers on fixing up this regex?
Here is a solution.
Notice the regex for $field: it is yet another application of the normal* (special normal*)* pattern, with normal being anything but a comma ([^,]) and special a comma as long as it is not followed by two single quotes (,(?!'')). The first normal, however, is made non empty using + instead of *.
Demonstration code in perl. The string concatenation operator in perl is a dot:
fge#erwin $ cat t.pl
#!/usr/bin/perl -W
use strict;
# Value separator: a comma optionally surrounded by spaces
my $value_separator = '\s*,\s*';
# Literal "null", and a number
my $null = 'null';
my $number = '\d+';
# Text field
my $normal = '[^,]'; # Anything but a comma
my $special = ",(?!'')"; # A comma, _not_ followed by two single quotes
my $field = "'$normal+(?:$special$normal*)*'"; # a text field
# A to_date() expression
my $to_date = 'to_date\(\s*' . $field . $value_separator . $field . '\s*\)';
# Any field
my $any_field = '(' . $null . '|' . $number . '|' . $field . '|' . $to_date . ')';
# The full regex
my $full_regex = '^\s*VALUES\s*\(\s*' . $any_field . $value_separator . $any_field
. $value_separator . $any_field . '\s*\)\s*$';
# This builds a compiled form of the regex
my $re = qr/$full_regex/;
# Read from stdin, try and match (m//), if match, print the three captured groups
while (<STDIN>) {
m/$re/ and print <<EOF;
Argument 1: -->$1<--
Argument 2: -->$2<--
Argument 3: -->$3<--
EOF
}
Demonstration output:
fge#erwin ~ $ perl t.pl
VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)
Argument 1: -->'9gfdg'<--
Argument 2: -->to_date('1876/12/06','YYYY/MM/DD')<--
Argument 3: -->null<--
VALUES('prout', 'ma', 'chere')
Argument 1: -->'prout'<--
Argument 2: -->'ma'<--
Argument 3: -->'chere'<--
VALUES(324, 'Aiie, a comma', to_date('whatever', 'is there, even commas'))
Argument 1: -->324<--
Argument 2: -->'Aiie, a comma'<--
Argument 3: -->to_date('whatever', 'is there, even commas')<--
One thing to note: you will notice that I don't ever use any lazy quantifiers, and not even the dot!
edit: special in a field is actually a comma not followed by two single quotes, not one
If only the second parameter can have commas in it, you could do something like:
^VALUES\s*\(\s*'?([^',]*)'?\s*,\s*(.*?)\s*,\s*'?([^',]*)'?\s*\)$
Otherwise I don't know what features that regex flavor supports, so it is hard to make something more fun. You could always write a limited-depth nested-parentheses regex if (?R) is not supported, though.
For the more general case you could try something like:
^\s*
VALUES\s*
\(
\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*,\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*,\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*
\)\s*
$
Spaces removed:
^\s*VALUES\s*\(\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*,\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*,\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*\)\s*$
Replace with:
\1\2 \3\4 \5\6
Should work for one nested level of parentheses without any quoted parenthesis in them.
PS: Not tested. You can usually use the spaced regex if your flavor supports the /x flag.
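Since the pattern above is marked as untested, here is a minimal sketch that tries it out with Python's re module. Python is only a convenient test bench here; the asker's engine is VBScript, which is not checked. The names FIELD, VALUES_RE and flatten_values are made up for this illustration.

```python
import re

# One field: either a quoted string (contents captured) or a bare word,
# optionally followed by one level of parentheses, as in the pattern above.
FIELD = r"(?:'([^']*)'|(\w+(?:\([^()]*\))?))"

# The compact pattern, assembled from three copies of FIELD (groups 1-6).
VALUES_RE = re.compile(
    r"^\s*VALUES\s*\(\s*" + FIELD + r"\s*,\s*" + FIELD
    + r"\s*,\s*" + FIELD + r"\s*\)\s*$"
)

def flatten_values(line):
    """Return the three fields separated by spaces; non-matching lines are returned unchanged."""
    # In Python 3.5+, groups that did not take part in the match expand to "".
    return VALUES_RE.sub(r"\1\2 \3\4 \5\6", line)

print(flatten_values("VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)"))
```

As the answer says, this only copes with one level of parentheses and would break on a quoted parenthesis inside to_date(...).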
Related
I am trying to find the relation names in the public schema that contain the string "main" or "parted".
So far I have tried.
\dt public.(*main* | *parted*)
-- ERROR: invalid regular expression: parentheses () not balanced
I also tried the following queries:
SELECT
regexp_match('parted1 new main', '(main|parted\S+)');
SELECT
regexp_matches('main new parted1', '(main|parted\S+)');
SELECT
regexp_substr ('main new parted1', '(main|parted\S+)');
I am looking for a pattern that matches any substring of the form "main" or "parted\S+".
In psql manual:
All regular expression special characters work as specified in Section
9.7.3, except for . which is taken as a separator as mentioned above, * which is translated to the regular-expression notation .*, ? which is translated to ., and $ which is matched literally.
Judging from this quote, it should be doable, right?
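To make the quoted rules concrete, here is a rough sketch of the translation they describe, written in Python. It is a simplification and not psql's actual code: double-quoted identifiers and case folding are ignored, anchoring the result to the whole name is an assumption of this sketch, and translate_psql_pattern is a hypothetical name. The pattern used below has no spaces around |, unlike the attempt in the question, which may be why that attempt is rejected.

```python
import re
from typing import Optional, Tuple

def translate_psql_pattern(pattern: str) -> Tuple[Optional[str], str]:
    """Translate a psql-style pattern into (schema_regex, name_regex)."""
    def convert(part: str) -> str:
        out = []
        for ch in part:
            if ch == "*":
                out.append(".*")    # * becomes .*
            elif ch == "?":
                out.append(".")     # ? becomes .
            elif ch == "$":
                out.append(r"\$")   # $ is matched literally
            else:
                out.append(ch)      # other regex characters pass through
        # Assumption: the whole name has to match, so anchor both ends.
        return "^(?:" + "".join(out) + ")$"

    # The first dot separates the schema part from the name part.
    schema, dot, name = pattern.partition(".")
    if not dot:
        return None, convert(pattern)
    return convert(schema), convert(name)

schema_re, name_re = translate_psql_pattern("public.(*main*|*parted*)")
print(schema_re, name_re)
```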
So I am using Snowflake and specifically the REGEXP_REPLACE function. I am looking for a Regex expression that will match any word with an # symbol in it in a text field.
Example:
RAW_DATA                                     | CLEANED_DATA
here is a sample and then an email#gmail.com | here is a sample and then an xxxxx
abc#test.com                                 | xxxxx
What I have tried so far is:
Select regexp_replace('ABC#gmail.com' , '(([a-zA-Z]+)(\W)?([a-zA-Z]+))', 'xxxxxxx') as result;
Result:
xxxxxxx#xxxxxxx.xxxxxxx
You can use
Select regexp_replace('here is a sample and then an email#gmail.com' , '\\S+#\\S+', 'xxxxx') as result;
Here,
\S+ - one or more non-whitespace chars
# - a # char
\S+ - one or more non-whitespace chars
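The same pattern can be checked quickly outside Snowflake. This is a minimal Python sketch; mask and WORD_WITH_HASH are made-up names, and the backslashes are not doubled because a Python raw string is used instead of a SQL string literal.

```python
import re

# One or more non-whitespace chars, a '#', one or more non-whitespace chars.
WORD_WITH_HASH = re.compile(r"\S+#\S+")

def mask(text, replacement="xxxxx"):
    """Replace every whitespace-delimited word containing '#' with the replacement."""
    return WORD_WITH_HASH.sub(replacement, text)

print(mask("here is a sample and then an email#gmail.com"))
```

Note that a lone # with whitespace on either side is left alone, because \S+ requires at least one character on each side.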
I have the following working function which is used in check constraint (I'll only post the relevant SQL part):
-- a comma should always be followed by a space
-- a period should always be followed by a space, except if it is the last character of the string OR the string contains 'caporal'
-- a question mark should always be followed by a space, except if it is the last character of the string
-- must not contain 2 or more spaces in a row
-- must not contain ((
-- must not contain ))
-- any open parenthesis should be closed: number of '(' should equal to number of ')'
SELECT
($1 !~ ',(?!\s)|\s{2}|[?](?!\s(?!$)|$)|[()]{2,}') AND
((array_length(string_to_array($1, '('), 1) - 1) = (array_length(string_to_array($1, ')'), 1) - 1)) AND
($1 ~ 'caporal' OR $1 !~ '[.](?!\s(?!$)|$)')
Over time, I realized that I need to allow a period without a following space in these cases:
.fr
.com
.net
.co.uk
Also, I realized that I need to allow float numbers to be written with comma/period as separator. The following cases should be valid:
2,5cm
10.4l
I was trying multiple things but it seems I'm just breaking the existing rules instead of adding "exceptions" to them.
My latest attempt was the following:
SELECT
($1 !~ '[[a-zA-Z]àâçéèêëîïôûùüÿæœ],(?!\s)|\s{2}|[?](?!\s(?!$)|$)|[()]{2,}') AND
((array_length(string_to_array($1, '('), 1) - 1) = (array_length(string_to_array($1, ')'), 1) - 1)) AND
($1 ~ 'caporal' OR $1 !~ '[[a-zA-Z]àâçéèêëîïôûùüÿæœ][.](?!\s(?!$)|(?!fr)|(?!com)|$)')
But it clearly isn't what I want. Thank you in advance for any hints and advice!
You should change the first regex to
,(?!\d(?<=\d,\d)|\s)|\s{2}|\?(?!\s(?!$)|$)|[()]{2,}
and the last one to
\.(?!\d(?<=\d\.\d)|(?:fr|com|co\.uk|(?<=\yco\.)uk|net)\y|\s(?!$)|$)
The changes are additions to the negative lookaheads that fail the match if their patterns match immediately to the right of the current location.
In the first case, ,(?!\d(?<=\d,\d)|\s) matches any comma that is followed neither by whitespace nor by a fractional digit (that is, a digit that is itself preceded by a digit and a comma).
In the second regex, a similar restriction is added: \d(?<=\d\.\d) stops \. from matching a period that is the decimal separator of a float number. The (?:fr|com|co\.uk|(?<=\yco\.)uk|net)\y part is added to avoid matching a . that is followed by fr, com, co.uk, uk (for the second period in co.uk) or net as whole words (\y is a word boundary). The (?<=\yco\.) lookbehind makes sure that a period before a uk that is not preceded by co. is still matched.
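The combined check can be tried out with a small Python sketch. It is a port under stated assumptions, not the PostgreSQL function itself: \y is written \b in Python, the parenthesis counting uses str.count instead of string_to_array, and is_valid, COMMA_ETC and PERIOD are hypothetical names.

```python
import re

# First regex: bad comma, double whitespace, bad question mark, doubled parentheses.
COMMA_ETC = re.compile(r",(?!\d(?<=\d,\d)|\s)|\s{2}|\?(?!\s(?!$)|$)|[()]{2,}")

# Last regex: a period that is not a decimal separator, not followed by an
# allowed domain suffix, and not followed by a space or the end of the string.
PERIOD = re.compile(
    r"\.(?!\d(?<=\d\.\d)|(?:fr|com|co\.uk|(?<=\bco\.)uk|net)\b|\s(?!$)|$)"
)

def is_valid(text):
    """Mirror the three AND-ed conditions of the SQL check."""
    if COMMA_ETC.search(text):
        return False
    if text.count("(") != text.count(")"):
        return False
    return "caporal" in text or PERIOD.search(text) is None

for sample in ["2,5cm", "10.4l", "example.co.uk", "a,b", "Hello.World"]:
    print(sample, is_valid(sample))
```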
I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:
A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.
EXAMPLE 1:
The following are valid literal strings:
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)
It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.
lexer grammar PdfStringLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ;
// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
mode STR;
LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ;
TEXT : . -> more ;
parser grammar PdfStringParser;
options { tokenVocab=PdfStringLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
When I process this file:
(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj
I get this error and parse tree:
line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'
So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
Edit
I left out the instructions regarding escape sequences within the string:
Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.
Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)
An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.
EXAMPLE 2:
(These \
two strings \
are the same.)
(These two strings are the same.)
EXAMPLE 3:
(This string has an end-of-line at the end of it.
)
(So does this one.\n)
Should I use this STRING definition:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?
You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
And with escape sequences, you could try:
STRING
: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
;
fragment ESCAPE_SEQUENCE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
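To see what such a rule is meant to accept, the same idea can be sketched outside ANTLR. The Python function below is an illustration only: it uses a depth counter instead of rule recursion, it skips the character after any backslash (a looser assumption than the ESCAPE_SEQUENCE fragment above, which lists specific escapes), and scan_literal_string is a made-up name.

```python
def scan_literal_string(text, start=0):
    """Return the balanced literal string that begins at text[start]."""
    if start >= len(text) or text[start] != "(":
        raise ValueError("expected '(' at position %d" % start)
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes, so that
            # \( and \) do not affect the nesting depth.
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
        i += 1
    raise ValueError("unbalanced parentheses")

line = "((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj"
print(scan_literal_string(line))
```

Decoding the escape sequences themselves (\n, \ddd and so on) would still have to happen in application code after the token has been recognized.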
Can there be multiple patterns in REGEXP_REPLACE?
Pattern 1 : '^#.*'
Pattern 2: '^//.*'
Pattern 3 : '^&&.*'
I want all three patterns in same regexp_replace function like
select REGEXP_REPLACE ('Unit testing last level','Pattern 1,Pattern 2,Pattern 3','',1,0,'m')
from dual;
You can use an alternation group where all alternative branches are |-separated.
^(#|//|&&).*
The (...) form a grouping construct where you may place your various #, &&, and other possible "branches". A | is an alternation operator.
The pattern will match:
^ - start of a line (as you are passing m match_parameter)
(#|//|&&) - either #, // or &&
.* - any 0+ chars other than a newline (since n match_parameter is not used).
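As a quick way to experiment with the alternation, here is a minimal Python sketch. It is not Oracle code: re.MULTILINE plays the role of the m match_parameter, and strip_marked_lines is a hypothetical name.

```python
import re

# ^ matches at the start of every line because of re.MULTILINE;
# . does not match a newline, so each match stops at the end of its line.
MARKED_LINE = re.compile(r"^(#|//|&&).*", re.MULTILINE)

def strip_marked_lines(text):
    """Blank out every line that starts with #, // or &&."""
    return MARKED_LINE.sub("", text)

sample = "# comment\ncode line\n// another\n&& third\nlast"
print(repr(strip_marked_lines(sample)))
```

The newlines themselves are kept, so removed lines leave empty lines behind, just as the single-pattern replacement would.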