Specific patterns in Postgresql - sql

I'm getting familiar with PostgreSQL, but I'm having some trouble with pattern matching. I read the documentation and looked through other questions, but couldn't solve this on my own.
I have a field with lots of text data. Somewhere in the middle of it are numbers with this pattern:
"2021-1234567" (four digits + - + seven digits)
The problem is that the text can contain other number sequences, like this:
"Project number 12345678912345 with id 2020-2583697 1456"
(in this case, I need to extract 2020-2583697)
In some cases it may be just eleven digits with no hyphen, like this:
"Project 12345678912345 sequence 20202583697 1456"
(in this case I need to extract 20202583697)
At first I tried to extract only the numbers (the text is mostly user input) with:
SELECT
SUBSTRING("my_field", '^[0-9]+$' )
FROM
my_table
That didn't help at all...
Can anyone help me?

This appears to do what you want:
select substring(str, '[0-9]{4}-?[0-9]{7}')
from (values ('asfasdf 2020-2583697 qererf i0iu0 1234234'),
('asfasdf 20202583697 qererf i0iu0 1234234')
) v(str)
It searches for 4 digits followed by an optional hyphen followed by 7 digits.
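One caveat is worth checking: the pattern is not tied to the edges of a number, so on the question's own sample text the 14-digit run comes first and its leading 11 digits match. The sketch below uses Python's re module (an assumption made purely for illustration, since the thread is about PostgreSQL) to show that behaviour next to a variant guarded by digit lookarounds. The name extract_id is hypothetical. PostgreSQL's regular expression syntax documents comparable constraints such as \m and \M, but the port back to SQL is untested here.

```python
import re

# Pattern from the answer above: 4 digits, optional hyphen, 7 digits.
LOOSE = re.compile(r"[0-9]{4}-?[0-9]{7}")
# Same pattern, but refusing to start or end inside a longer run of digits.
BOUNDED = re.compile(r"(?<![0-9])[0-9]{4}-?[0-9]{7}(?![0-9])")


def extract_id(text, pattern=BOUNDED):
    """Return the first match of pattern in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


samples = [
    "Project number 12345678912345 with id 2020-2583697 1456",
    "Project 12345678912345 sequence 20202583697 1456",
]

for s in samples:
    # Left column: unguarded pattern; right column: guarded pattern.
    print(extract_id(s, LOOSE), "|", extract_id(s))
```

With the guarded pattern, the long project number is skipped because every 11-digit window inside it has a digit on at least one side.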

Or this, as I could not manage to force checking for blanks around the pattern without returning those blanks otherwise:
WITH
indata(s) AS (
SELECT 'Project number 12345678912345 with id 2020-2583697 1456'
UNION ALL SELECT 'Project 12345678912345 sequence 20202583697 1456'
)
SELECT
REGEXP_REPLACE(s,'^.* (\d{4}-?\d{7}) .*$','\1') AS found_token
, s
FROM indata;
 found_token  |                            s
--------------+---------------------------------------------------------
 2020-2583697 | Project number 12345678912345 with id 2020-2583697 1456
 20202583697  | Project 12345678912345 sequence 20202583697 1456
(2 rows)
The call REGEXP_REPLACE(s,'^.* (\d{4}-?\d{7}) .*$','\1') reads as follows. ^.* followed by a blank matches from the beginning of the string through any number of characters up to a blank. (\d{4}-?\d{7}) matches four digits, zero or one dash (-?), and seven digits; the parentheses around it mean: remember this as group 1. A blank followed by .*$ matches a blank and then any number of characters up to the end of the string. The whole match, which is the entire string, is then replaced with group 1: \1.
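As a rough check of that reading, here is the same pattern in Python's re module (an illustrative stand-in, not PostgreSQL itself; found_token is a hypothetical helper name). It also shows a consequence of requiring a blank on both sides: when the token sits at the very end of the string, nothing matches and the input comes back unchanged.

```python
import re

# Whole-string pattern: anything, blank, the token (group 1), blank, anything.
PATTERN = re.compile(r"^.* (\d{4}-?\d{7}) .*$")


def found_token(s):
    """Mimic REGEXP_REPLACE(s, pattern, group 1): swap the whole match for group 1."""
    return PATTERN.sub(r"\1", s)


print(found_token("Project number 12345678912345 with id 2020-2583697 1456"))
print(found_token("Project 12345678912345 sequence 20202583697 1456"))
# No trailing blank after the token, so the pattern does not match at all.
print(found_token("id 2020-2583697"))
```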

Related

Remove space between number and character - PostgreSQL/REGEXP_REPLACE

I have a table with a medication_product_amount column where there are spaces between numbers and characters, like below:
medication_product_amount
1 UN DE 50 ML
20 UN
1 UN DE 600 G
What I want is to remove the single space ONLY between numbers and characters, something like this:
new_medication_product_amount
1UN DE 50ML
20UN
1UN DE 600G
To do this, I am looking for a regular expression to use in the function REGEXP_REPLACE. I tried using the pattern below, indicating to replace the single space after the numbers, but the output remained the same as the input:
select REGEXP_REPLACE(medication_product_amount, '(^[0-9])( )', '\1') as new_medication_product_amount
from medications
Can anyone help me come up with the right way to do this? Thanks!
Your regex is a little off. First, what yours does: '(^[0-9])( )', '\1'
(^[0-9]) starts capture group 1 at the beginning of the string and matches exactly 1 digit,
( ) then starts capture group 2 and matches 1 space.
The matched text is replaced by group 1.
The problems and corrections:
What you want to capture is not necessarily at the start of the string, so eliminate the anchor ^.
What you want to capture may be more than 1 digit long, so replace [0-9] with [0-9]+, i.e. one or more digits.
Not actually a problem, but a space holds no special meaning in a regexp; it is just a space, so there is no need to capture it unless it is used later. Replace ( ) with a plain space.
That is the end of the pattern, but there may be other occurrences. Tell Postgres to keep applying the pattern until the end of the string with the 'g' flag.
Resulting expression/query:
select regexp_replace(medication_product_amount, '([0-9]+) ', '\1','g') as new_medication_product_amount
from medications;
Match "digit space letter", capturing both the digit and the letter using '([0-9]) ([A-Z])', then put them back using back references. The 'g' flag makes the replacement apply to every occurrence instead of only the first.
select REGEXP_REPLACE(medication_product_amount, '([0-9]) ([A-Z])', '\1\2', 'g') as new_medication_product_amount
from medications
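The two answers differ slightly in what they remove. A small Python sketch (re module used as a stand-in for PostgreSQL; the helper names are hypothetical) makes the difference visible: one pattern drops the space after any run of digits, the other only when an uppercase letter follows. Python's re.sub replaces every occurrence by default, which corresponds to the 'g' flag; count=1 imitates leaving the flag out.

```python
import re

rows = ["1 UN DE 50 ML", "20 UN", "1 UN DE 600 G"]


def after_digits(text):
    """Drop one space following any run of digits."""
    return re.sub(r"([0-9]+) ", r"\1", text)


def digit_then_letter(text):
    """Drop the space only between a digit and an uppercase letter."""
    return re.sub(r"([0-9]) ([A-Z])", r"\1\2", text)


def first_only(text):
    """Same as digit_then_letter, but stop after the first replacement."""
    return re.sub(r"([0-9]) ([A-Z])", r"\1\2", text, count=1)


for row in rows:
    print(after_digits(row), "|", digit_then_letter(row), "|", first_only(row))

# The patterns disagree when two numbers are separated by a space.
print(after_digits("10 20 ML"), "|", digit_then_letter("10 20 ML"))
```

On the sample rows both full replacements agree; the last line is where the choice between them matters.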

Impala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore

I am new to regex and need to search a string field in Impala for multiple matches of this exact sequence of characters: ~FC* followed by 11 more *, which may or may not have letters/digits between them (they are basically delimiters in this string field). The 12th * (counting the one in ~FC* as #1) should be immediately followed by Y~.
Since the asterisks are not letters or digits, I am unsure how to search for these delimiters properly.
This is my SQL so far:
select
regexp_extract(col_name, '(~FC\\*).*(\\*Y~)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
data returned:
pattern_found
--------------
~FC*
(~FC\\*) in Impala SQL returns ~FC*, which is great (I got it from my other question).
I have been trying (~FC\\*).*(\\*Y~), which obviously isn't counting the number of asterisks, but it is also not picking up the Y.
This is a test string, it has 2 occurrences:
N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
The results should be these 2, which share an overlapping ~ between them, but I will settle for at least the first being found if both cannot be.
~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
I figured out a solution, but I'm happy to learn of a better way to accomplish this.
This is what worked in Impala SQL. It needed parentheses and double-escaped backslashes for all the asterisks:
(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)
Full SQL:
select
regexp_extract(col_name, '(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
and here is the RegexDemo without the additional syntax needed for Impala SQL
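Outside Impala, the eleven repeated [^*]*\* pieces can be folded into a counted group. The following Python sketch (re module; an assumption for illustration, and it does not verify what Impala's own regex engine accepts) does that, and wraps the pattern in a lookahead so that both segments are returned even though they share the ~ between them.

```python
import re

# ~FC* followed by 11 more "field then asterisk" pairs, then Y~.
# The lookahead wrapper lets consecutive segments share the boundary "~".
SEGMENT = re.compile(r"(?=(~FC\*(?:[^*]*\*){11}Y~))")

test_string = (
    "N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~"
    "FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~"
)

# findall returns the captured group for every position where the lookahead succeeds.
segments = SEGMENT.findall(test_string)
for segment in segments:
    print(segment)
```

Counting with {11} keeps the pattern short and makes the expected number of delimiters explicit, which is easier to adjust than a hand-copied sequence.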

How to regex_replace in 10th position from CLOB field

I have this code:
SELECT REGEXP_REPLACE(name,'^name\[([[:alpha:][:space:][:digit:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[[:alpha:][:space:][:punct:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:digit:][:punct:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:digit:][:alpha:][:space:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*).*','[p1=\10]') as replaced
FROM Dual
Editor's note: the above is a single unreadable line. Here is the same regex with line breaks for readability:
SELECT REGEXP_REPLACE(name
,'^name\[([[:alpha:][:space:][:digit:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[[:alpha:][:space:][:punct:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:digit:][:punct:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[:digit:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:digit:][:alpha:][:space:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*).*'
,'[p1=\10]') as replaced
FROM Dual
I want to select the tenth position out of it. I am able to select up to the ninth position, but I cannot get the tenth with the logic above.
If I use [p1=\9], I get the ninth position, but I want the tenth position's string from the above expression.
If my expression is [p1=\10], it returns the first position's value followed by 0.
Any help?
Here's a very basic example of a string that matches your regex:
name[a|||b|||c|||d|||0|||e|||f|||1|||g|||h|||i|||j|||k|||l|||m
So, you want to return 'h', the tenth field, but \10 returns a0: Oracle only supports back references \1 through \9, so \10 is read as \1 followed by a literal 0.
If you're only interested in the tenth capturing group and none of the previous ones, then you can just remove the parentheses from all capturing groups up to that one, and then use \1.
UPDATE: OP wants the 2nd, 3rd, 4th, 8th, 9th, 10th and 12th fields, so just keep parentheses around those fields.
Field | Capture Group number
====================================
2     | \1
3     | \2
4     | \3
8     | \4
9     | \5
10    | \6
12    | \7
The code:
select REGEXP_REPLACE(name
,'^name\[[[:alpha:][:space:][:digit:]]*\|\|\|
([[:alpha:]]*)\|\|\|
([[[:alpha:][:space:][:punct:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
[[:digit:][:punct:]]*\|\|\|
[[:alpha:][:space:]]*\|\|\|
[[:alpha:]]*\|\|\|
([[:digit:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
[[:digit:][:alpha:]]*\|\|\|
([[:digit:][:alpha:][:space:]]*)\|\|\|
[[:digit:][:alpha:]]*\|\|\|
[[:alpha:][:space:]]*\|\|\|
[[:alpha:]]*.*','[p1=\1]') as replaced
FROM Dual
(Linebreaks added to the regex for clarity)
I should add that it looks like the broader question you're asking is how to get the tenth field from a triple-pipe delimited string in Oracle, which may be achievable in other ways that don't involve lengthy regexes like this.
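To illustrate the "other ways" idea in the abstract, here is a minimal Python sketch that splits on the delimiter instead of matching every field with a regex. It is not Oracle code; the helper nth_field and its prefix and delimiter defaults are assumptions based on the sample string above.

```python
def nth_field(record, n, prefix="name[", delimiter="|||"):
    """Return the n-th (1-based) field of a prefix + delimiter-separated record."""
    if not record.startswith(prefix):
        raise ValueError("record does not start with the expected prefix")
    fields = record[len(prefix):].split(delimiter)
    return fields[n - 1]


sample = "name[a|||b|||c|||d|||0|||e|||f|||1|||g|||h|||i|||j|||k|||l|||m"

print(nth_field(sample, 10))
# The fields requested in the update: 2, 3, 4, 8, 9, 10 and 12.
print([nth_field(sample, n) for n in (2, 3, 4, 8, 9, 10, 12)])
```

Splitting sidesteps the nine-group limit entirely, at the cost of no longer validating the character classes of each field.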

SQL - need help in parsing text of a field

I have a select query that fetches a field with complex data. I need to parse that data into a specified format. Please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me write a SQL regular expression to produce this output.
The first step in figuring out the regular expression is to be able to describe it in plain language. Based on what we know from your post (and as others have said, more info is really needed), some assumptions have to be made.
I'd take a stab at it by describing it like this, based on the sample data you provided: I want the sets of one or more characters that follow the equal signs, up to but not including the following space or the end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered when reading the string from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string, and the last data element is followed not by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex:
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
SQL> with tbl(str) as (
2 select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
3 )
4 select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^        Match the beginning of the line
.        followed by any character
*        repeated 0 or more times (the * applies to the preceding .)
=        followed by an equal sign
(        start remembered group 1
[^ ]+    which is a set of one or more characters that are not a space
)        end remembered group 1
.*=      followed by any number of any characters, ending in an equal sign
([^ ]+)  followed by the second remembered group of non-space characters
$        followed by the end of the line
The replace string explained:
\1       The first remembered group
|        a pipe character
\2       The second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
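To see how tightly the pattern is bound to the sample, it can be tried in Python's re module (used here only as a convenient stand-in for Oracle's REGEXP_REPLACE; the helper name is hypothetical). The second call adds trailing text after the last value, which breaks the "non-space characters up to the end of the line" assumption, so nothing is replaced.

```python
import re

PATTERN = re.compile(r"^.*=([^ ]+).*=([^ ]+)$")


def two_values(s):
    """Replace the whole string with group 1, a pipe, and group 2 when it matches."""
    return PATTERN.sub(r"\1|\2", s)


exact = "complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I"
print(two_values(exact))
# Trailing " - testing" violates the end-of-line assumption: input comes back as is.
print(two_values(exact + " - testing"))
```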
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
SQL> with tbl(str) as (
2 select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
3 )
4 select regexp_substr(str, '=([^ |]*)( |||$)', 1, level, null, 1) output, level
5 from tbl
6 connect by level <= regexp_count(str, '=')
7 ORDER BY level;
OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5
SQL>
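The same idea, one value per equal sign with empty values allowed, can be sketched in Python, where a single findall call plays the role of the CONNECT BY loop. This is an illustration under the assumption that a value ends at the first space or pipe; it is not a translation of the Oracle query.

```python
import re

s = (
    "a=zz|complexType|ChannelCode=PB - Phone In A Box|"
    "IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing"
)

# After each "=", capture zero or more characters that are neither space nor pipe.
values = re.findall(r"=([^ |]*)", s)

for level, value in enumerate(values, start=1):
    print(level, repr(value))
```

The empty string at position 4 corresponds to the blank OUTPUT row for test1= in the session above.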

using oracle sql substr to get last digits

I have the result of a query and need to get the trailing digits of one column, say 'term'.
The value of column term can be like:
'term' 'number' (output)
---------------------------
xyz012 12
xyz112 112
xyz1 1
xyz02 2
xyz002 2
xyz88 88
Note: the data is not limited to the scenarios above; the requirement is that the last 3 or fewer characters can be digits.
Function I used: to_number(substr(term.name,-3))
(Initially I assumed the last 3 characters were always digits, but I was wrong.)
I am using to_number because if the last 3 digits are '012' then the number should be '12'.
But as one can see, some specific cases such as 'xyz88' and 'xyz1' would give
ORA-01722: invalid number
How can I achieve this using substr or regexp_substr? I have not explored regexp_substr much.
Using REGEXP_SUBSTR,
select column_name, to_number(regexp_substr(column_name,'\d+$'))
from table_name;
\d matches digits. Along with +, it becomes a group with one or more digits.
$ matches end of line.
Putting it together, this regex extracts a group of digits at the end of a string.
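A quick way to sanity-check the pattern against the question's table is a small Python sketch (re module as a stand-in for REGEXP_SUBSTR, int() as a stand-in for to_number; trailing_number is a hypothetical name). It returns None when there are no trailing digits, which roughly parallels REGEXP_SUBSTR returning NULL.

```python
import re


def trailing_number(term):
    """Return the digits at the end of term as an int, or None if there are none."""
    match = re.search(r"[0-9]+$", term)
    return int(match.group(0)) if match else None


terms = ["xyz012", "xyz112", "xyz1", "xyz02", "xyz002", "xyz88"]
for term in terms:
    print(term, trailing_number(term))
```

If the "last 3 or fewer" rule has to be enforced strictly, a bounded quantifier such as {1,3} in place of + would be the natural variation to try.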
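Oracle has the function regexp_instr(), which can be combined with reverse() and substr() to do what you want: find the first non-digit in the reversed string and take everything after that position from the end.
select term, cast(substr(term, 1-regexp_instr(reverse(term),'[^0-9]')) as int) as num
from table_name;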
select SUBSTRING(acc_no,len(acc_no)-1,len(acc_no)) from table_name;