How to regex_replace in 10th position from CLOB field


I have this code:
SELECT REGEXP_REPLACE(name,'^name\[([[:alpha:][:space:][:digit:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[[:alpha:][:space:][:punct:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:digit:][:punct:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:digit:][:alpha:][:space:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*).*','[p1=\10]') as replaced
FROM Dual
Editor's note: the above is a single unreadable line. Here is the same regex with line breaks for readability:
SELECT REGEXP_REPLACE(name
,'^name\[([[:alpha:][:space:][:digit:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[[:alpha:][:space:][:punct:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:digit:][:punct:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[:digit:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:digit:][:alpha:][:space:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*).*'
,'[p1=\10]') as replaced
FROM Dual
I want to select the tenth field from this string. I can select up to the ninth field, but I cannot get the tenth with the logic above.
With [p1=\9] I get the ninth field, but I want the tenth.
With [p1=\10] I get the first field's value followed by a 0.
Any help?

Here's a very basic example of a string that matches your regex:
name[a|||b|||c|||d|||0|||e|||f|||1|||g|||h|||i|||j|||k|||l|||m
So, you want to return 'h', the tenth field, but \10 returns a0. Oracle only supports the backreferences \1 through \9, so \10 is read as \1 followed by a literal 0.
If you're only interested in the tenth capturing group and none of the previous ones, then you can just remove the parentheses from all capturing groups before that one, and then use \1.
UPDATE: OP wants the 2nd, 3rd, 4th, 8th, 9th, 10th and 12th fields, so just keep parentheses around those fields.
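As a rough illustration of capturing only the tenth field, here is a minimal sketch in Python rather than Oracle SQL. It assumes no field contains a pipe character, and it uses a counted non-capturing group, (?:...){9}, to skip the first nine fields. As far as I know Oracle's regex flavor has no non-capturing groups, so this shows the idea and is not a drop-in replacement for the query.

```python
import re

# Sample string from the answer above; the tenth field is "h".
sample = "name[a|||b|||c|||d|||0|||e|||f|||1|||g|||h|||i|||j|||k|||l|||m"

# Skip nine "field|||" chunks without capturing them, then capture the tenth.
# Assumption: no field contains a pipe character.
pattern = r"^name\[(?:[^|]*\|\|\|){9}([^|]*).*"

# With a single capturing group, \1 is unambiguous.
replaced = re.sub(pattern, r"[p1=\1]", sample)
print(replaced)  # [p1=h]
```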
Field | Capture group number
======+=====================
2     | \1
3     | \2
4     | \3
8     | \4
9     | \5
10    | \6
12    | \7
The code:
select REGEXP_REPLACE(name
,'^name\[[[:alpha:][:space:][:digit:]]*\|\|\|
([[:alpha:]]*)\|\|\|
([[[:alpha:][:space:][:punct:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
[[:digit:][:punct:]]*\|\|\|
[[:alpha:][:space:]]*\|\|\|
[[:alpha:]]*\|\|\|
([[:digit:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
[[:digit:][:alpha:]]*\|\|\|
([[:digit:][:alpha:][:space:]]*)\|\|\|
[[:digit:][:alpha:]]*\|\|\|
[[:alpha:][:space:]]*\|\|\|
[[:alpha:]]*.*','[p1=\1]') as replaced
FROM Dual
(Line breaks added to the regex for clarity; remove them before running the query.)
I should add that it looks like the broader question you're asking is how to get the tenth field from a triple-pipe delimited string in Oracle, which may be achievable in other ways that don't involve lengthy regexes like this.
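To make the "other ways" remark concrete, here is a minimal sketch of the split-on-delimiter idea in Python, not Oracle SQL. The helper name fields and its parameters are made up for illustration, and it assumes the value always starts with name[ and that ||| never occurs inside a field.

```python
sample = "name[a|||b|||c|||d|||0|||e|||f|||1|||g|||h|||i|||j|||k|||l|||m"

def fields(value, prefix="name[", delimiter="|||"):
    """Split the delimited payload that follows the fixed prefix."""
    if not value.startswith(prefix):
        raise ValueError("unexpected prefix")
    return value[len(prefix):].split(delimiter)

parts = fields(sample)

# Field numbers in the question are 1-based; list indexes are 0-based.
tenth = parts[9]
wanted = [parts[n - 1] for n in (2, 3, 4, 8, 9, 10, 12)]

print(tenth)   # h
print(wanted)  # ['b', 'c', 'd', '1', 'g', 'h', 'j']
```

Splitting avoids backreference numbering altogether, at the cost of not validating what each field contains.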

Related

Impala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore

I am new to regex and need to search a string field in Impala for multiple matches of this exact sequence of characters: ~FC* followed by 11 more asterisks, which may or may not have letters or digits between them (they are basically delimiters in this string field). The 12th * (counting the one in ~FC* as the first) should be immediately followed by Y~.
Since the asterisks are not letters or digits, I am unsure how to search for these delimiters properly.
This is my SQL so far:
select
regexp_extract(col_name, '(~FC\\*).*(\\*Y~)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
data returned:
pattern_found
--------------
~FC*
In Impala SQL, (~FC\\*) returns ~FC*, which is great (I got it from my other question).
I have been trying (~FC\\*).*(\\*Y~), which obviously isn't counting the number of asterisks, but it is also not picking up the Y.
This is a test string, it has 2 occurrences:
N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
The results should be these two, which share an overlapping ~ between them. I will settle for at least the first being found if both cannot be.
~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~

I figured out a solution, but I am happy to learn of a better way to accomplish this.
This is what worked in Impala SQL. It needed parentheses and double-backslash escapes for all of the asterisks:
(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)
Full SQL:
select
regexp_extract(col_name, '(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
and here is the RegexDemo without the additional syntax needed for Impala SQL
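For comparison, here is a minimal sketch of the same idea in Python, using a counted group instead of eleven repeated pieces. It is only an illustration: it relies on a lookahead, (?=~), so the shared ~ is not consumed and both occurrences are found, and Impala's regex engine may not accept that construct.

```python
import re

text = (
    "N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y"
    "~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~"
)

# ~FC* followed by eleven "anything-but-asterisk, then asterisk" chunks, then Y.
# The lookahead checks for the trailing ~ without consuming it, so the next
# segment can start on that same character.
pattern = re.compile(r"~FC\*(?:[^*]*\*){11}Y(?=~)")

# Re-attach the ~ that the lookahead only peeked at.
matches = [m + "~" for m in pattern.findall(text)]
for m in matches:
    print(m)
# ~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
# ~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
```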

Replacing a value in a column- Snowflake (SQL)

I'm extracting a number from a string using the following code:
regexp_substr(data, '\\d+\.\\d+') AS Age
Where the value is 0 (within the string), I'm getting a null value. Is there any way to correct this within the wider query, so all the nulls are replaced with 0s?

It would be helpful if you had shared your sample data, but maybe the query below helps:
select
regexp_substr(column1, '\\d+(\.?\\d+)?')
from values
('test0me'),
('234.234'),
('test my age 25.')
;
+------------------------------------------+
| REGEXP_SUBSTR(COLUMN1, '\\D+(\.?\\D+)?') |
|------------------------------------------|
| 0                                        |
| 234.234                                  |
| 25                                       |
+------------------------------------------+
Pretty similar to what @kurt suggested, but without the leading "?:".

Note that \d+.\d+ means "one or more digits, followed by any character, followed by one or more digits". This pattern requires at least 3 characters to match, and the middle one doesn't even have to be a digit.
0 does not match that pattern. To match 0 or 42 or 1000000, i.e. any integer that is merely a string of digits, you only need \d+
However, \d+ alone captures only the integer part of 10.5, so if you also need to capture decimal values, you will need something more complex that handles digits followed by an optional decimal point and more digits:
\d+(?:\.\d+)?
This pattern will match 0, 0.0, 42, 98.6, etc.
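The difference between the two patterns can be checked quickly outside Snowflake. This is a small sketch in Python; the helper first_match and the sample strings are made up for illustration, and Python's regex flavor is assumed to behave like Snowflake's for these simple constructs.

```python
import re

loose = re.compile(r"\d+.\d+")         # unescaped dot: matches any character
strict = re.compile(r"\d+(?:\.\d+)?")  # integer with an optional decimal part

def first_match(pattern, text):
    """Return the first match as a string, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = ["0", "42", "98.6", "test0me", "test my age 25."]

loose_results = [first_match(loose, s) for s in samples]
strict_results = [first_match(strict, s) for s in samples]

print(loose_results)   # [None, None, '98.6', None, None]
print(strict_results)  # ['0', '42', '98.6', '0', '25']
```

The None results from the first pattern correspond to the null values described in the question.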

Specific patterns in Postgresql

I'm getting familiar with PostgreSQL, but I'm having some trouble with pattern matching. I read the documentation and looked through other questions, but couldn't solve this on my own.
I have a field with lots of text data and, in the middle of it, numbers with this pattern:
"2021-1234567" (four digits, a hyphen, seven digits)
The problem is that the field can contain other number sequences, like this:
"Project number 12345678912345 with id 2020-2583697 1456"
(in this case, I need to extract 2020-2583697)
In some cases it may be just eleven digits, like this:
"Project 12345678912345 sequence 20202583697 1456"
(in this case, I need to extract 20202583697)
At first I tried to extract only the numbers (the text is mostly user input)
with:
SELECT
SUBSTRING("my_field", '^[0-9]+$' )
FROM
my_table
That didn't help at all...
Can anyone help me?

This appears to do what you want:
select substring(str, '[0-9]{4}-?[0-9]{7}')
from (values ('asfasdf 2020-2583697 qererf i0iu0 1234234'),
('asfasdf 20202583697 qererf i0iu0 1234234')
) v(str)
It searches for 4 digits followed by an optional hyphen followed by 7 digits.
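One caveat worth checking against the question's own samples: without boundaries, the pattern can also match inside a longer run of digits such as 12345678912345. The sketch below uses Python to show the effect and one possible fix with word boundaries. Python spells the boundary \b; PostgreSQL's regex flavor uses \y (or \m and \M) instead, so treat this as an illustration of the idea, not a tested PostgreSQL query.

```python
import re

rows = [
    "Project number 12345678912345 with id 2020-2583697 1456",
    "Project 12345678912345 sequence 20202583697 1456",
]

unbounded = re.compile(r"[0-9]{4}-?[0-9]{7}")
bounded = re.compile(r"\b[0-9]{4}-?[0-9]{7}\b")  # \b = word boundary in Python

# Without boundaries the first 11 digits of the 14-digit number win.
unbounded_hits = [unbounded.search(r).group(0) for r in rows]
# With boundaries only a standalone token of the right shape matches.
bounded_hits = [bounded.search(r).group(0) for r in rows]

print(unbounded_hits)  # ['12345678912', '12345678912']
print(bounded_hits)    # ['2020-2583697', '20202583697']
```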

Or this, as I could not manage to force checking for blanks around the pattern without returning those blanks otherwise:
WITH
indata(s) AS (
SELECT 'Project number 12345678912345 with id 2020-2583697 1456'
UNION ALL SELECT 'Project 12345678912345 sequence 20202583697 1456'
)
SELECT
REGEXP_REPLACE(s,'^.* (\d{4}-?\d{7}) .*$','\1') AS found_token
, s
FROM indata;
found_token | s
--------------+---------------------------------------------------------
2020-2583697 | Project number 12345678912345 with id 2020-2583697 1456
20202583697 | Project 12345678912345 sequence 20202583697 1456
(2 rows)
The pattern used, REGEXP_REPLACE(s,'^.* (\d{4}-?\d{7}) .*$','\1'), means: match ^.* followed by a blank, i.e. the beginning of the string, any number of any characters, and then a blank; then (\d{4}-?\d{7}), i.e. four digits, zero or one dash (-?) and seven digits, where the parentheses around it mean "remember this as the first group"; finally a blank followed by .*$, i.e. any number of any characters up to the end of the string. The whole match is then replaced with group 1: \1.

Oracle regexp to match only digits after certain combination of signs

I have a string which roughly looks like: XXXXXXXXX - 1234567 XXXXXXXX,
where X can be a digit, a letter or a sign (<, >, . or space).
I need to extract only the digits after ' - '.
I have tried following:
select regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>','(- )[0-9]{1,7}') from dual
I end up with - 1234567.
How do I get rid of '- '?
Thank you in advance

This should work with Oracle 11g.
Place the capturing group around the pattern part you are interested in first. Since you need the digits, wrap the [0-9]{1,7} with the capturing parentheses.
Then, pass all six arguments to the REGEXP_SUBSTR function, where the sixth one indicates the number of the capturing group you want to extract:
select regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>',' - ([0-9]{1,7})', 1,1,NULL,1) from dual
Here, 1,1,NULL,1 means: start looking for a pattern match from Position 1, just for the first match, with no specific regex options, and return the contents of Group 1.
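The same "match the context, return only the group" idea looks like this in Python, shown purely as an illustration of what the sixth argument does. The variable names are made up, and group(1) plays the role of the trailing 1 in the Oracle call.

```python
import re

value = "17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>"

# The literal " - " is part of the match but sits outside the parentheses,
# so only the digits end up in group 1.
match = re.search(r" - ([0-9]{1,7})", value)

whole = match.group(0)   # full match, separator included
digits = match.group(1)  # just the captured digits

print(repr(whole))  # ' - 1234567'
print(digits)       # 1234567
```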

What @Gordon Linoff was trying to say was:
select substr(regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>','(- )[0-9]{1,7}'), 3)
from dual
Substr the remaining "- " off of your result.

SQL - need help in parsing text of a field

I have a select query that fetches a field with complex data. I need to parse that data into a specified format. Please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output = PB|I
Please help me write a SQL regular expression to produce this output.

The first step in figuring out the regular expression is to be able to describe it in plain language. Based on what we know from your post (and as others have said, more info is really needed), some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
with tbl(str) as (
  select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
)
select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^ Match the beginning of the line
. followed by any character
* repeated 0 or more times (the quantifier applies to the preceding .)
= followed by an equal sign
( start remembered group 1
[^ ]+ which is a set of one or more characters that are not a space
) end remembered group one
.*= followed by any number of any characters but ending in an equal sign
([^ ]+) followed by the second remembered group of non-space characters
$ followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 The second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
with tbl(str) as (
  select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
)
select regexp_substr(str, '=([^ |]*)( |||$)', 1, level, null, 1) output, level
from tbl
connect by level <= regexp_count(str, '=')
ORDER BY level;
OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5
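The "take whatever follows each equal sign, up to a space or pipe" rule can also be sketched in a few lines of Python. This is an illustration of the rule only, with made-up variable names, and it makes the same assumption as the answer: a value ends at the first space, pipe or end of string.

```python
import re

simple = "complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I"
extended = (
    "a=zz|complexType|ChannelCode=PB - Phone In A Box|"
    "IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing"
)

# After each "=", capture zero or more characters that are neither a space
# nor a pipe. Zero-length captures keep empty values such as "test1=".
value_after_equals = re.compile(r"=([^ |]*)")

simple_values = value_after_equals.findall(simple)
extended_values = value_after_equals.findall(extended)

print("|".join(simple_values))  # PB|I
print(extended_values)          # ['zz', 'PB', 'I', '', 'test2']
```

The empty string in the second result corresponds to the blank row at level 4 in the Oracle output.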
