Using RegExp in sql to find rows that only contain 'x' - sql

How do I use a regexp to find only the rows where the first name consists of a single repeated character, 'x', no matter how many characters there are?
So far I came up with:
REGEXP_LIKE(LOWER(fst_name),'^x+$')
possible rows I am looking for:
'x'
'xx'
'xxx'
'xxxxxxxxx'
So I'm interpreting this as: find the rows where x is at the beginning and the end of the field, with only x's in between. Am I interpreting this correctly?
Or is it possible to match: 'xxxxxxaxxxxx'

Your regex is correct:
^x+$
^ is the "start" anchor
x is the character for which you are searching. I assume it isn't a regex metacharacter
+ is the "one or more" quantifier
$ is the "end" anchor
So I would interpret your regex to match all of the cases you supplied, and would not match something like 'xxxxaxxxx'. http://regex101.com/r/dE8vU6
It's been long enough since I used Oracle that I don't recall whether your REGEXP_LIKE syntax is correct there, but it seems right to me.
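The anchor-and-quantifier reading above can be checked outside the database. The following is a minimal sketch in Python's re module, which is an assumption standing in for Oracle's engine; the helper name is_only_x and the sample names are hypothetical.

```python
import re

# Same pattern as in the question: start anchor, one or more 'x', end anchor.
ONLY_X = re.compile(r"^x+$")

def is_only_x(first_name):
    """Return True when the lowercased name consists solely of one or more 'x'."""
    return ONLY_X.search(first_name.lower()) is not None

names = ["x", "xx", "XXX", "xxxxxxxxx", "xxxxxxaxxxxx", ""]
matching = [n for n in names if is_only_x(n)]
print(matching)
```

The value with an 'a' in the middle and the empty string are both rejected. One difference to keep in mind: in Python, $ also matches just before a trailing newline, so this sketch is not a byte-for-byte model of REGEXP_LIKE.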

Related

regex not working correctly when the test is fine

For my database, I have a list of company numbers where some of them start with two letters. I have created a regex which should eliminate these from a query and according to my tests, it should. But when executed, the result still contains the numbers with letters.
Here is my regex, which I've tested on https://www.regexpal.com
([^A-Z+|a-z+].*)
I've tested it against numerous variations such as SC08093, ZC000191 and NI232312 which shouldn't match and don't in the tests, which is fine.
My sql query looks like;
SELECT companyNumber FROM company_data
WHERE companyNumber ~ '([^A-Z+|a-z+].*)' order by companyNumber desc
To summarise, strings like SC08093 should not match as they start with letters.
I've read through the documentation for postgres but I couldn't seem to find anything regarding this. I'm not sure what I'm missing here. Thanks.
The ~ '([^A-Z+|a-z+].*)' condition does not work because ~ is a regex matching operation that returns true even on a partial match (it does not require a full string match, so the pattern can match anywhere in the string). [^A-Z+|a-z+].* matches any character other than a letter from A to Z, +, | or a letter from a to z, followed by zero or more of any characters, anywhere inside the string.
You may use
WHERE companyNumber NOT SIMILAR TO '[A-Za-z]{2}%'
See the online demo
Here, NOT SIMILAR TO returns the inverse result of the SIMILAR TO operation. This SIMILAR TO operator accepts patterns that are almost regex patterns, but are also like regular wildcard patterns. NOT SIMILAR TO '[A-Za-z]{2}%' means all records that start with two ASCII letters ([A-Za-z]{2}) and having anything after (%) are NOT returned and all others will be returned. Note that SIMILAR TO requires a full string match, same as LIKE.
Your pattern: [^A-Z+|a-z+].* means "a string where at least some characters are not A-Z" - to extend that to the whole string you would need to use an anchored regex as shown by S-Man (the group defined with (..) isn't really necessary btw)
I would probably use a regex that specifies what the valid pattern is and then use !~ instead.
where company !~ '^[0-9].*$'
^[0-9].*$ means "starts with a digit, followed by anything" (a digits-only check would be ^[0-9]+$) and the !~ means "does not match"
or
where not (company ~ '^[0-9].*$')
Not start with a letter could be done with
WHERE company ~ '^[^A-Za-z].*'
demo: db<>fiddle
The first ^ marks the beginning. The [^A-Za-z] says "no letter" (including small and capital letters).
Edit: Changed [A-z] into the more precise [A-Za-z] (Why is this regex allowing a caret?)
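The difference between a partial match and an anchored match can be sketched as follows. This uses Python's re module as a stand-in for PostgreSQL's ~ operator (an assumption: both accept a match anywhere in the string unless anchored), and the extra sample values 08093123 and 1234AB are hypothetical.

```python
import re

company_numbers = ["SC08093", "ZC000191", "NI232312", "08093123", "1234AB"]

# Unanchored pattern from the question: a partial match anywhere is enough.
unanchored = re.compile(r"[^A-Z+|a-z+].*")
# Anchored pattern from the answer: the first character must not be an ASCII letter.
anchored = re.compile(r"^[^A-Za-z].*")

kept_unanchored = [n for n in company_numbers if unanchored.search(n)]
kept_anchored = [n for n in company_numbers if anchored.search(n)]

print(kept_unanchored)  # every value, since each contains at least one digit
print(kept_anchored)    # only the values whose first character is not a letter

# [A-z] spans ASCII 65 to 122, which includes ^ and other punctuation.
caret_in_range = re.fullmatch(r"[A-z]", "^") is not None
print(caret_in_range)
```

The last check illustrates why the edit replaced [A-z] with [A-Za-z].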

REGEXP_REPLACE explanation

Hi, may I know what the query below means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-#A-Z''[''-`a-z{-~]', 'xy') ext_number
part 1
In terms of explaining what the function call is doing:
It is a function call that analyses the input string 'number' with a regex (the 2nd argument) and replaces any parts of the string that match it with a specific string (the 3rd argument). As for the name after the parenthesis I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless sql uses different syntax this would appear to be a non-functional regex. There are some red flags, e.g:
The entire regex is wrapped in square brackets, indicating a set of characters, but seems to predominantly hold an expression
There is a range indicator between a single quote and a character (invalid range: if a dash was required in the match it should be escaped with a '\' (backslash))
One set of square brackets is never closed
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-#A-Z''[''-a-z{-~]`
but it does not match anything I can think of. It is important to know what string is being examined, and what the context of the program is, in order to identify what the regex might be attempting to do.
It seems like it is meant to replace all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, hence all characters match, that are not in the class.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seems that characters enclosed in ' are meant to be escaped (the second ' escapes the ' within the string literal itself). At least this would make some sense. However, for none of the DBMSs I know to have a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL) did I find anything in the docs that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!
Now if you take an ASCII table you'll see, that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to #, etc.. Actually it might have been shorter to express it as '' ''-~, space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.
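The effect described above can be sketched with the shortened class the answer suggests, space to ~. This is an illustration in Python's re module, not the asker's DBMS, and the helper name mask_non_printable and the sample value are hypothetical.

```python
import re

# Negated class: anything outside the range space..~ (ASCII 32 to 126) matches.
NON_PRINTABLE = re.compile(r"[^ -~]")

def mask_non_printable(value):
    """Replace each character outside space..~ with 'xy', one by one."""
    return NON_PRINTABLE.sub("xy", value)

masked = mask_non_printable("AB-12\t34\x7f")
print(masked)
```

Here the tab and the DEL character are each replaced by xy while the printable characters pass through. In Python, non-ASCII characters would also fall outside the range and be replaced; whether that holds in a given DBMS depends on its character set handling.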

Massive change polish marks in notepad++ [duplicate]

Consider the following regex:
([a-zA-Z])([a-zA-Z]?)/([a-zA-Z])([a-zA-Z]?)
If the text is: a/b
the capturing groups will be:
/1 'a'
/2 ''
/3 'b'
/4 ''
And if the text is: aa/b
the capturing groups will be:
/1 'a'
/2 'a'
/3 'b'
/4 ''
Suppose, I want to find and replace this string in Notepad++ such that if /2 or /4 are empty (as in the first case above), I prepend c.
So, the text a/b becomes ca/cb.
And the text aa/b becomes aa/cb
I use the following regex for replacing:
(?(2)\1\2|c\1)/(?(4)\3\4|c\3)
But Notepad++ is treating ? literally in this case, and not as a conditional identifier. Any idea what I am doing wrong?
The syntax in the conditional replacement is
(?{GROUP_MATCHED?}REPLACEMENT_IF_YES:REPLACEMENT_IF_NO)
The { and } are necessary to avoid ambiguity when you deal with groups higher than 9 and with named capture groups.
Since Notepad++ uses Boost-Extended Format String Syntax, see this Boost documentation:
The character ? begins a conditional expression, the general form is:
?Ntrue-expression:false-expression
where N is decimal digit.
If sub-expression N was matched, then true-expression is evaluated and sent to output, otherwise false-expression is evaluated and sent to output.
You will normally need to surround a conditional-expression with parenthesis in order to prevent ambiguities.
For example, the format string (?1foo:bar) will replace each match found with foo if the sub-expression $1 was matched, and with bar otherwise.
For sub-expressions with an index greater than 9, or for access to named sub-expressions use:
?{INDEX}true-expression:false-expression
or
?{NAME}true-expression:false-expression
So, use ([a-zA-Z])([a-zA-Z])?/([a-zA-Z])([a-zA-Z])? and replace with (?{2}$1$2:c$1)/(?{4}$3$4:c$3).
The second problem is that you placed the ? quantifier inside the capturing group, making the pattern inside the group optional, but not the whole group. That made the group always "participating in the match", and the condition would be always "true" (always matched). ? should quantify the group.
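The point about where the ? quantifier sits can be illustrated outside Notepad++. The sketch below uses Python's re module, which has no conditional replacement syntax, so the condition is emulated with a replacement function; the function name prepend_c is hypothetical and this is not the Boost format-string mechanism itself.

```python
import re

# ? inside the group: the group always participates, possibly capturing ''.
inner_optional = re.compile(r"([a-zA-Z])([a-zA-Z]?)/([a-zA-Z])([a-zA-Z]?)")
# ? on the group: the group may not participate at all.
group_optional = re.compile(r"([a-zA-Z])([a-zA-Z])?/([a-zA-Z])([a-zA-Z])?")

inner_group2 = inner_optional.fullmatch("a/b").group(2)
outer_group2 = group_optional.fullmatch("a/b").group(2)
print(repr(inner_group2), repr(outer_group2))  # empty string vs. None

def prepend_c(match):
    """Emulate (?{2}$1$2:c$1)/(?{4}$3$4:c$3) with explicit participation checks."""
    def side(first, second):
        if match.group(second) is not None:
            return match.group(first) + match.group(second)
        return "c" + match.group(first)
    return side(1, 2) + "/" + side(3, 4)

short_result = group_optional.sub(prepend_c, "a/b")
long_result = group_optional.sub(prepend_c, "aa/b")
print(short_result, long_result)
```

With the quantifier inside the group, group 2 captures an empty string and so counts as matched, which is why the condition would always be true in that form.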

Oracle REGEXP_LIKE doesn't work as expected

I was testing a regular expression in Oracle SQL and found something I could not understand:
-- NO MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof[^\s]*(\s|$)');
Above doesn't match, while the following matches:
-- MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof\S*(\s|$)');
In other regex flavors, the equivalent would be \bProf[^\s]*\b versus \bProf\S*\b, and both would give similar results. Note: Oracle SQL regex does not have \b or word boundary.
Question: Why don't [^\s]* and \S* work the same way in Oracle SQL?
I notice if I remove the (\s|$) at the end, the first regex will match.
In Oracle regular expressions, \s is indeed the escape sequence for a space, but NOT in a matching character set (that is, [.....], or [^....] for excluding one character). In a matching character set, only two characters have a special meaning, - for ranges and ] for closing the set enumeration. They can't be escaped; if needed in the matching set, ] must always be the first character right after the opening [ (it is the ONLY position in which a closing ] stands for itself as a character, and does not denote the end of the matching set), and - must be first or last (best to leave it always to the end of the matching set) - anywhere else it is seen as a range marker. To include (or exclude, if using the [^.....] syntax) a space, just type an actual physical space in the matching set.
Edit: What I said above is not entirely right. There is another special character in a matching set, namely ^. If it is used in the first position, it means "match any character OTHER THAN." In any other position it stands for itself. For example, '[^^]' will match any single character OTHER THAN ^ (the first ^ has special meaning, the second stands in for itself). And, a closing bracket ] stands for itself if it is the second character in brackets, if the first character is ^ (with its SPECIAL meaning). That is, to match any single character OTHER THAN ], we can use the matching pattern '[^]]'.
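One way to see why the first query fails is to emulate the bracket-expression reading described above. Python's re treats [^\s] as "non-whitespace", so the sketch below writes the class as [^\\s] to reproduce the Oracle interpretation (any character except a backslash or the letter s). This is an emulation under that assumption, not Oracle's engine.

```python
import re

text = "Professor Frank"

# Emulated POSIX reading of [^\s]: any character except a backslash or 's'.
posix_like = re.compile(r"(^|\s)Prof[^\\s]*(\s|$)")
# \S outside a bracket expression: any non-whitespace character.
shorthand = re.compile(r"(^|\s)Prof\S*(\s|$)")

posix_match = posix_like.search(text)
shorthand_match = shorthand.search(text)
print(posix_match)               # no match
print(shorthand_match.group(0))  # matches through the following space

# Without the trailing (\s|$), the emulated pattern matches but stops before the first 's'.
truncated = re.search(r"(^|\s)Prof[^\\s]*", text).group(0)
print(truncated)
```

The run of permitted characters ends at the first s in "Professor", so the trailing (\s|$) can never be satisfied. That is consistent with the observation that removing (\s|$) makes the first regex match.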

regexp_matches() returns two matches for $ (end of string)

Can somebody explain this odd behavior of regexp_matches() in PostgreSQL 9.2.4 (same result in 9.1.9):
db=# SELECT regexp_matches('test string', '$') AS end_of_string;
end_of_string
---------------
{""}
(1 row)
db=# SELECT regexp_matches('test string', '$', 'g') AS end_of_string;
end_of_string
---------------
{""}
{""}
(2 rows)
-> SQLfiddle demo.
The second parameter is a regular expression. $ marks the end of the string.
The third parameter is for flags. g is for "globally", meaning the function doesn't stop at the first match.
The function seems to report the end of the string twice with the g flag, but that can only exist once per definition. It breaks my query. :(
Am I missing something?
I would need my query to return one more row at the end, for any possible string. I expected this query to do the job, but it adds two rows:
SELECT (regexp_matches('test & foo/bar', '(&|/|$)', 'ig'))[1] AS delim
I know how to manually add a row, but I want to let the function take care of it.
It looks like it was a bug in PostgreSQL. I verified for sure it is fixed in 9.3.8. Looking at the release notes, I see possible references in:
9.3.4
Allow regular-expression operators to be terminated early by query
cancel requests (Tom Lane)
This prevents scenarios wherein a pathological regular expression
could lock up a server process uninterruptably for a long time.
9.3.6
Fix incorrect search for shortest-first regular expression matches
(Tom Lane)
Matching would often fail when the number of allowed iterations is
limited by a ? quantifier or a bound expression.
Thanks to Erwin for narrowing it down to 9.3.x.
I am not sure about what I am going to say because I don't use PostgreSQL so this is just me thinking out loud.
Since you are trying to match the end of the string/line with $, the outcome in the first case is expected. With the global modifier g turned on, however, matching the end-of-line position does not consume any characters from the input, so the next match attempt starts where the first one left off, that is, at the end of the string. If this kept going it would cause an infinite loop, so the PostgreSQL engine may detect this and stop to prevent a crash or an infinite loop.
I tested the same expression in RegexBuddy with the POSIX ERE flavor, and it caused the program to become unresponsive and crash, which is what led me to this reasoning.
The same occurs in C#, for example, where I had the same problem recently, so I think this is normal behaviour for regexps.
This is because $ doesn't stand for a specific character but for a specific position.
So $ doesn't really match anything, and the parser stays at the same position.
You need to change your convention a little:
To test for an empty string you can use ^$
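That $ denotes a position rather than a character can be checked in another engine. The sketch below uses Python's re module (an assumption, not PostgreSQL), where a global search reports the end-of-string position once, which is the result the question expected.

```python
import re

# A global search for the end-of-string position yields one empty match.
end_only = re.findall(r"$", "test string")
# Delimiters plus one trailing empty match for the end of the string.
delims = re.findall(r"(&|/|$)", "test & foo/bar")
# ^$ matches only when the whole string is empty.
empty_hit = re.findall(r"^$", "")
nonempty_hit = re.findall(r"^$", "test")

print(end_only, delims, empty_hit, nonempty_hit)
```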
This was a bug that has been fixed in Postgres 9.3. See accepted answer.
For Postgres 9.2 or older: A halfway decent workaround for my situation would be to use the expression .$ instead - matches for any string once at the last character:
WITH x(id, t) AS (
VALUES
(1, 'test & foo/bar')
,(2, 'test')
,(3, '') -- empty string
,(4, 'test & foo/') -- other branch as last character
)
SELECT id, (regexp_matches(t, '(&|/|.$)', 'ig'))[1] AS delim
FROM x;
But it fails for empty strings.
And it fails if the last character happens to match another branch. Like: 'foo/bar/'.
And it isn't perfect to have the actual final character returned. An empty string would be much preferable.
-> SQLfiddle.
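The limitations listed above can be reproduced with the same four sample rows. This is a sketch in Python's re module rather than PostgreSQL, so it only illustrates the matching logic of the .$ workaround, not regexp_matches() itself.

```python
import re

rows = {
    1: "test & foo/bar",
    2: "test",
    3: "",             # empty string
    4: "test & foo/",  # other branch as last character
}

# Same alternation as the workaround: a delimiter, or the final character.
workaround = re.compile(r"(&|/|.$)", re.IGNORECASE)
found = {row_id: workaround.findall(text) for row_id, text in rows.items()}

for row_id, delims in found.items():
    print(row_id, delims)
```

Row 3 yields no match at all, and in row 4 the final / is indistinguishable from an ordinary delimiter, so no separate end marker is produced.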