Why does this space matter in this regular expression? - sql

Why does the space make all the difference?
select * from beds where id~'.*Extra large.* (Red).*';
and
select * from beds where id~'.*Extra large.*(Red).*';
The first one returned nothing and the second acted as I wanted. An example of what I want matched is:
"Extra large" (Red) {2012 model}
I thought the first would work since there is a space after (Red)?
EDIT:Even if I escape the brackets with '\' I still can't have a space there.

The problem is that you have not escaped your brackets around "Red". Your regex should be:
'.*Extra large.* \(Red\).*'
This makes the brackets literal brackets, but without escaping them they create a regex group (and not characters to be matched).
Your first regex grouped the characters Red and required a space to precede that group Red, so it would match "... Red...", but there is a bracket in your input before Red, so it doesn't match.
Your second regex accepts any character(s) (via .*) before Red, so it matches.

This is because you're not escaping the ().
The brackets around "Red" create a group and are not included in the match. This
is the reason why the regexp without the whitespace works.
The .* in the regexp without the whitespace matches " (, then comes Red and after that ) {2012 model}. The brackets are matched by the .* operators.
The .* in the regexp with the whitespace matches " and the ( is not included in the pattern.
So the right pattern would be this:
.*Extra large.*\(Red\).*

Related

Square bracket in regexp_replace pattern

I want to use REGEXP_REPLACE with this pattern. but I don't know how to add square bracket in square bracket. I try to put escape character but i did not work. in this screenshot i want to also keep the [XXX] these square bracket. I need to add this square bracket somehow in my pattern. thanks.
Right now the output is this:
MSD_40001_ME_SPE__XXXX__Technical__Specification_REV9_(2021_05_27)_xls
but I want to like that:
MSD_40001_ME_SPE_[XXXX]_Technical__Specification_REV9_(2021_05_27)_xls
I tried the escape character \ but it did not work
You could try this regex pattern: [^][a-z_A-Z0-9()]
SELECT REGEXP_REPLACE('MSD_40001_ME_SPE_[XXXX]_Technical_%Specification_REV#9_(2021_05_27)_xls', '[^][a-z_A-Z0-9()]', '_')
FROM DUAL
To specify a right bracket (]) in the bracket expression, place it first in the list (after the initial circumflex (^), if any).
See demo here
From Regexp.Info,
One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. So in POSIX, the regular expression [\d] matches a \ or a d. To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ]. Put together, []\d^-] matches ], , d, ^ or -.
How about this different take on the problem? Match one or more of the following: space OR a dash OR a percent sign OR a dollar sign OR an at sign OR a period (escaping the characters that have special regex meaning) and replace with an underscore. Note this takes care of the double underscore after "Technical".
with tbl(data) as (
select 'MSD 40001-ME-SPE-[XXXX] Technical%$Specification#REV9 (2021.05.27).xls' from dual
)
select regexp_replace(data, '( |\-|%|\$|#|\.)+', '_') fixed
from tbl;
FIXED
---------------------------------------------------------------------
MSD_40001_ME_SPE_[XXXX]_Technical_Specification_REV9_(2021_05_27)_xls

new lines are not getting eliminated

I'm trying to replace newline etc kind of values using regexp_replace. But when I open the result in query result window, I can still see the new lines in the text. Even when I copy the result, I can see new line characters. See output for example, I just copied from the result.
Below is my query
select regexp_replace('abc123
/n
CHAR(10)
头疼,'||CHR(10)||'allo','[^[:alpha:][:digit:][ \t]]','') from dual;
/ I just kept for testing characters.
Output:
abc123
/n
CHAR(10)
头疼,
allo
How to remove the new lines from the text?
Expected output:
abc123 /nCHAR(10)头疼,allo
There are two mistakes in your code. One of them causes the issue you noticed.
First, in a bracket expression, in Oracle regular expressions (which follow the POSIX standard), there are no escape sequences. You probably meant \t as escape sequence for tab - within the bracket expression. (Note also that in Oracle regular expressions, there are no escape sequences like \t and \n anyway. If you must preserve tabs, it can be done, but not like that.)
Second, regardless of this, you include two character classes, [:alpha:] and [:digit:], and also [ \t] in the (negated) bracket expression. The last one is not a character class, so the [ as well as the space, the backslash and the letter t are interpreted as literal characters - they stand in for themselves. The closing bracket, on the other hand, has special meaning. The first of your two closing brackets is interpreted as the end of the bracket expression; and the second closing bracket is interpreted as being an additional, literal character that must be matched! Since there is no such literal closing bracket anywhere in the string, nothing is replaced.
To fix both mistakes, replace [ \t] with the [:blank:] character class, which consists exactly of space and tab. (And, note that [:alpha:][:digit:] can be written more compactly as [:alnum:].)

REGEXP_REPLACE explanation

Hi may i know what does the below query means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-#A-Z''[''-`a-z{-~]', 'xy') ext_number
part 1
In terms of explaining what the function function call is doing:
It is a function call to analyse an input string 'number' with a regex (2nd argument) and replace any parts of the string which match a specific string. As for the name after the parenthesis I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless sql uses different syntax this would appear to be a non-functional regex. There are some red flags, e.g:
The entire regex is wrapped in square parenthesis, indicating a set of characters but seems to predominantly hold an expression
There is a range indicator between a single quote and a character (invalid range: if a dash was required in the match it should be escaped with a '\' (backslash))
One set of square brackets is never closed
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-#A-Z''[''-a-z{-~]`, but does not match anything I can think of, it is important to know what string is being examined/what the context is for the program in order to identify what the regex might be attempting to do
It seems like it is meant to replaces all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, hence all characters match, that are not in the class.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seams like characters enclosed in ' should be escaped (The second ' is to escape the ' in the string itself.) At least this would make some sense. (But for none of the DBMS I found having a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL), I found something in the docs, that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!)
Now if you take an ASCII table you'll see, that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to #, etc.. Actually it might have been shorter to express it as '' ''-~, space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.

Oracle REGEXP_LIKE doesn't work as expected

I was testing a regular expression in Oracle SQL and found something I could not understand:
-- NO MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof[^\s]*(\s|$)');
Above doesn't match, while the following matches:
-- MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof\S*(\s|$)');
In other regex flavors, It will be like \bProf[^\s]*\b versus \bProf\S*\b and have similar results. Note: Oracle SQL regex does not have \b or word boundary.
Question: Why don't [^\s]* and \S* work the same way in Oracle SQL?
I notice if I remove the (\s|$) at the end, the first regex will match.
In Oracle regular expressions, \s is indeed the escape sequence for a space, but NOT in a matching character set (that is, [.....], or [^....] for excluding one character). In a matching character set, only two characters have a special meaning, - for ranges and ] for closing the set enumeration. They can't be escaped; if needed in the matching set, ] must always be the first character right after the opening [ (it is the ONLY position in which a closing ] stands for itself as a character, and does not denote the end of the matching set), and - must be first or last (best to leave it always to the end of the matching set) - anywhere else it is seen as a range marker. To include (or exclude, if using the [^.....] syntax) a space, just type an actual physical space in the matching set.
Edit: What I said above is not entirely right. There is another special character in a matching set, namely ^. If it is used in the first position, it means "match any character OTHER THAN." In any other position it stands for itself. For example, '[^^]' will match any single character OTHER THAN ^ (the first ^ has special meaning, the second stands in for itself). And, a closing bracket ] stands for itself if it is the second character in brackets, if the first character is ^ (with its SPECIAL meaning). That is, to match any single character OTHER THAN ], we can use the matching pattern '[^]]'.

Trim trailing spaces with PostgreSQL

I have a column eventDate which contains trailing spaces. I am trying to remove them with the PostgreSQL function TRIM(). More specifically, I am running:
SELECT TRIM(both ' ' from eventDate)
FROM EventDates;
However, the trailing spaces don't go away. Furthermore, when I try and trim another character from the date (such as a number), it doesn't trim either. If I'm reading the manual correctly this should work. Any thoughts?
There are many different invisible characters. Many of them have the property WSpace=Y ("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.
<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>
The standard SQL trim() function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim() and ltrim() variants. Your call also only targets that particular character.
Use regular expressions with regexp_replace() instead.
Trailing
To remove all trailing white space (but not white space inside the string):
SELECT regexp_replace(eventdate, '\s+$', '') FROM eventdates;
The regular expression explained:
\s ... regular expression class shorthand for [[:space:]]
    - which is the set of white-space characters - see limitations below
+ ... 1 or more consecutive matches
$ ... end of string
Demo:
SELECT regexp_replace('inner white ', '\s+$', '') || '|'
Returns:
inner white|
Yes, that's a single backslash (\). Details in this related answer:
SQL select where column begins with \
Leading
To remove all leading white space (but not white space inside the string):
regexp_replace(eventdate, '^\s+', '')
^ .. start of string
Both
To remove both, you can chain above function calls:
regexp_replace(regexp_replace(eventdate, '^\s+', ''), '\s+$', '')
Or you can combine both in a single call with two branches.
Add 'g' as 4th parameter to replace all matches, not just the first:
regexp_replace(eventdate, '^\s+|\s+$', '', 'g')
But that should typically be faster with substring():
substring(eventdate, '\S(?:.*\S)*')
\S ... everything but white space
(?:re) ... non-capturing set of parentheses
.* ... any string of 0-n characters
Or one of these:
substring(eventdate, '^\s*(.*\S)')
substring(eventdate, '(\S.*\S)') -- only works for 2+ printing characters
(re) ... Capturing set of parentheses
Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
Whitespace?
There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]].
These print as invisible glyphs in pgAdmin for me: "mongolian vowel", "zero width space", "zero width non-joiner", "zero width joiner":
SELECT E'\u180e', E'\u200B', E'\u200C', E'\u200D';
'᠎' | '​' | '‌' | '‍'
Two more, printing as visible glyphs in pgAdmin, but invisible in my browser: "word joiner", "zero width non-breaking space":
SELECT E'\u2060', E'\uFEFF';
'⁠' | ''
Ultimately, whether characters are rendered invisible or not also depends on the font used for display.
To remove all of these as well, replace '\s' with '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]' or '[\s᠎​‌‍⁠]' (note trailing invisible characters!).
Example, instead of:
regexp_replace(eventdate, '\s+$', '')
use:
regexp_replace(eventdate, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')
or:
regexp_replace(eventdate, '[\s᠎​‌‍⁠]+$', '') -- note invisible characters
Limitations
There is also the Posix character class [[:graph:]] supposed to represent "visible characters". Example:
substring(eventdate, '([[:graph:]].*[[:graph:]])')
It works reliably for ASCII characters in every setup (where it boils down to [\x21-\x7E]), but beyond that you currently (incl. pg 10) depend on information provided by the underlying OS (to define ctype) and possibly locale settings.
Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. But you may have to add more characters to the character class [[:space:]] (shorthand \s) to catch all whitespace characters. Like: \u2007, \u202f and \u00a0 seem to also be missing for #XiCoN JFS.
The manual:
Within a bracket expression, the name of a character class enclosed in
[: and :] stands for the list of all characters belonging to that
class. Standard character class names are: alnum, alpha, blank, cntrl,
digit, graph, lower, print, punct, space, upper, xdigit.
These stand for the character classes defined in ctype.
A locale can provide others.
Bold emphasis mine.
Also note this limitation that was fixed with Postgres 10:
Fix regular expressions' character class handling for large character
codes, particularly Unicode characters above U+7FF (Tom Lane)
Previously, such characters were never recognized as belonging to
locale-dependent character classes such as [[:alpha:]].
It should work the way you're handling it, but it's hard to say without knowing the specific string.
If you're only trimming leading spaces, you might want to use the more concise form:
SELECT RTRIM(eventDate)
FROM EventDates;
This is a little test to show you that it works.
Tell us if it works out!
If your whitespace is more than just the space meta value than you will need to use regexp_replace:
SELECT '(' || REGEXP_REPLACE(eventDate, E'[[:space:]]', '', 'g') || ')'
FROM EventDates;
In the above example I am bounding the return value in ( and ) just so you can easily see that the regex replace is working in a psql prompt. So you'll want to remove those in your code.
SELECT replace((' devo system ') ,' ','');
It gives: devosystem
A tested one that works like a charm:
UPDATE company SET name = TRIM (BOTH FROM name) where id > 0