Split specific chain of digits from a string - sql

There is this table (called data) below:
row comments
1 Fortune favors https://something.aaa.org/show_screen.cgi?id=548545 the 23 bold
2 No man 87485 is id# 548522 an island 65654.
3 125 Better id NEWLINE #546654 late than 5875565 never.
4 555 Better id546654 late than 565 never
I used the query below:
select row, substring(substring(comments::text, '((id|ID) [0-9]+)'), '[0-9]+') as id
from data
where comments::text ~* 'id [0-9]+';
This query output ignored rows 1 to 3. It just processed row 4:
row id
4 546654
Does anyone know how to properly extract the ID number? Note that the ID contains up to 9 digits.

Use regexp_replace():
SELECT c.rownr
, regexp_replace (c.comments, e'.*[Ii][Dd][^0-9]*([0-9]+).*', '\1' ) AS the_id
, c.comments AS comments
FROM comments c
;
.* matches the initial garbage
[Ii][Dd] matches the Id string, case-insensitively
[^0-9]* consumes all non-numeric characters
([0-9]+) matches the numeric string that you want
.* matches any trailing characters
'\1' (in the 3rd argument) says that you want the text matched inside the first ()
Results:
rownr | the_id | comments
-------+--------+--------------------------------------------------------------------------------
1 | 548545 | Fortune favors https://something.aaa.org/show_screen.cgi?id=548545 the 23 bold
2 | 548522 | No man 87485 is id# 548522 an island 65654.
3 | 546654 | 125 Better id NEWLINE #546654 late than 5875565 never.
4 | 546654 | 555 Better id546654 late than 565 never
(4 rows)
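For readers who want to try the pattern outside the database, here is a minimal Python sketch of the same idea. The function name extract_id is hypothetical, and Python's re module is only assumed to behave like PostgreSQL's regex engine for this simple pattern. Note that when nothing matches, the input comes back unchanged.

```python
import re

# Same pattern as the SQL answer: greedy prefix, "id" in any case,
# optional non-digits, the captured digits, then the rest of the string.
ID_PATTERN = re.compile(r".*[Ii][Dd][^0-9]*([0-9]+).*")


def extract_id(comment: str) -> str:
    """Return the digits that follow "id"; return the input unchanged if nothing matches."""
    return ID_PATTERN.sub(r"\1", comment)


comments = [
    "Fortune favors https://something.aaa.org/show_screen.cgi?id=548545 the 23 bold",
    "No man 87485 is id# 548522 an island 65654.",
    "125 Better id NEWLINE #546654 late than 5875565 never.",
    "555 Better id546654 late than 565 never",
]

for comment in comments:
    print(extract_id(comment))
```

Because the leading .* is greedy, a comment containing several "id" markers followed by digits would yield the number after the last one.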

Related

Padding inside of a string in SQL

I just started learning SQL and here is my problem.
I have a column that contains acronyms like "GP2", "MU1", "FR10", ... and I want to add '0's to the acronyms that don't have enough characters.
For example, I want acronyms like "FR10", "GP48", ... to stay as they are, but acronyms like "MU3" must be converted into "MU03" so that they are the same size as the others.
I already heard about LPAD and RPAD, but they just add the wanted character at the left or the right.
Thanks !
Is the minimum length 3 as in your examples and the padded value should always be in the 3rd position? If so, use a case expression and concat such as this:
with my_data as (
select 'GP2' as col1 union all
select 'MU1' union all
select 'FR10'
)
select col1,
case
when length(col1) = 3 then concat(left(col1, 2), '0', right(col1, 1))
else col1
end padded_col1
from my_data;
col1 | padded_col1
------+-------------
GP2 | GP02
MU1 | MU01
FR10 | FR10
Alternatively, with regexp_replace():
with tests(example) as (values
('ab02'),('ab1'),('A'),('1'),('A1'),('123'),('ABC'),('abc0'),('a123'),('abcd0123'),('1a'),('a1a'),('1a1') )
select example,
regexp_replace(
example,
'^(\D{0,4})(\d{0,4})$',
'\1' || repeat('0',4-length(example)) || '\2' )
from tests;
example | regexp_replace
----------+----------------
ab02 | ab02
ab1 | ab01
A | A000
1 | 0001
A1 | A001
123 | 0123
ABC | ABC0
abc0 | abc0
a123 | a123
abcd0123 | abcd0123 --caught, repeat('0',-4) is same as repeat('0',0), so nothing
1a | 1a --doesn't start with non-digits
a1a | a1a --doesn't end with digits
1a1 | 1a1 --doesn't start with non-digits
\D at the start of the string (^) catches the non-digits
\d at the end ($) catches the digits
{0,4} specifies 0 to 4 occurrences of each
the backreferences \1 and \2 refer to the two consecutive parenthesized groups ()
repeat() fills the space between them with zeros, up to a total length of 4
Strings that do not match the pattern are returned unchanged. It's good to consider additional test cases.
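The same padding rule can be checked quickly in Python. This is only a sketch under the assumption that Python's \D and \d behave like PostgreSQL's for ASCII input; the function name pad_middle is hypothetical.

```python
import re

# Non-digit prefix (0 to 4 chars) followed by digit suffix (0 to 4 chars).
SHAPE = re.compile(r"(\D{0,4})(\d{0,4})")


def pad_middle(value: str, width: int = 4) -> str:
    """Insert zeros between the letter part and the digit part up to `width`."""
    match = SHAPE.fullmatch(value)
    if match is None:
        # Values of any other shape are returned unchanged.
        return value
    # A negative count yields an empty string, like repeat('0', -4) in SQL.
    zeros = "0" * (width - len(value))
    return match.group(1) + zeros + match.group(2)


for example in ["ab02", "ab1", "A", "1", "ABC", "abcd0123", "1a", "a1a"]:
    print(example, "->", pad_middle(example))
```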
Thank you all for your responses. I think I did something similar to Isolated's answer.
Here is what I've done ("acronym" is the name of the column and "destination" is the name of the table):
SELECT CONCAT(LEFT(acronym, 2), LPAD(RIGHT(acronym, LENGTH(acronym) - 2), 2, '0')) AS acronym
FROM destination
ORDER BY acronym;
Thanks !
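A small Python sketch of the LEFT/RIGHT/LPAD combination above, assuming a two-letter prefix as in the examples. The helper lpad is hypothetical and mimics the truncating behaviour that LPAD has in PostgreSQL and MySQL, for example, so a longer numeric part gets cut to two characters.

```python
def lpad(text: str, length: int, fill: str = "0") -> str:
    """Mimic SQL LPAD for a single-character fill: pad on the left, truncate on the right."""
    if len(text) >= length:
        return text[:length]
    return fill * (length - len(text)) + text


def pad_acronym(acronym: str) -> str:
    """Keep the first two characters and left-pad the remainder to two characters."""
    return acronym[:2] + lpad(acronym[2:], 2)


for acronym in ["GP2", "MU1", "FR10", "FR100"]:
    print(acronym, "->", pad_acronym(acronym))
```

The last case shows the truncation: "FR100" comes out as "FR10", which may or may not be what you want.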

SQL split column value (if column value has more than 5 characters) into multiple rows

I have a table:
Name | Number
Lisa | P1234P6953
Monica | P0034
Hayley | P0021P5691
I want to achieve the result below. Can someone please help with this? When the Number column has more than 5 characters, it should be split after every fifth character into multiple rows.
Name | Number
Lisa | P1234
Lisa | P6953
Monica | P0034
Hayley | P0021
Hayley | P5691
-- SQL Server, but fairly adaptable to other platforms
-- ofs lists the 1-based start position of each 5-character chunk
with split as (select 1 as ofs union all select 6)
select Name, substring(Number, ofs, 5) as Number
from T inner join split on ofs <= len(Number)
Other platforms might have slightly better options. I'm assuming that you have a substring() function of some kind as well as len(), and that the table is called T. For convenience I also assumed that you can use a CTE. Note that the two offsets only cover values of up to 10 characters. Generally I would discourage you from saving data in this format, as it is not the preferred way to use a SQL database.
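To make the offset idea concrete, here is a minimal Python sketch that produces the expected rows from the question. The helper name split_fixed is hypothetical, and unlike the two fixed offsets in the query it handles any length.

```python
def split_fixed(value: str, size: int = 5) -> list[str]:
    """Cut `value` into consecutive chunks of `size` characters."""
    return [value[start:start + size] for start in range(0, len(value), size)]


rows = [("Lisa", "P1234P6953"), ("Monica", "P0034"), ("Hayley", "P0021P5691")]

# One output row per (name, chunk) pair, like the join against the offsets.
result = [(name, chunk) for name, number in rows for chunk in split_fixed(number)]

for name, chunk in result:
    print(name, "|", chunk)
```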

What is the difference to use CARET symbol in REGEXP_LIKE in oracle?

I am new to regex. So, I tried:
select * from ot.contacts where REGEXP_like(last_name,'^[A-C]');
Also, I tried:
select * from ot.contacts where REGEXP_like(last_name,'[A-C]');
Both of them give me output where last_name starts with A, B or C, and the number of records fetched is the same. Can you tell me when I would see a difference from using the caret symbol?
In this context, ^ represents the beginning of the string.
'^[A-C]' checks for A, B or C at the beginning of the string.
'[A-C]' checks for A, B or C anywhere in the string.
Depending on your dataset, the two expressions might or might not produce the same output. Here is an example where the result set would be different:
last_name | ^[A-C] | [A-C]
----------------- | ------- | -----
Arthur | match | match
Bill | match | match
Jean-Christophe | no match | match
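The same comparison can be reproduced with Python's re module, which treats ^ and character classes the same way for this case. This is only an illustrative sketch; the function name classify is hypothetical.

```python
import re


def classify(last_name: str) -> tuple[bool, bool]:
    """Return (matches '^[A-C]', matches '[A-C]') for one name."""
    anchored = re.search(r"^[A-C]", last_name) is not None
    anywhere = re.search(r"[A-C]", last_name) is not None
    return anchored, anywhere


for name in ["Arthur", "Bill", "Jean-Christophe"]:
    print(name, classify(name))
```

Only names that contain an uppercase A, B or C somewhere other than the first position show a difference between the two patterns.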

PostgreSQL search lists of substrings in string column

I have the following table in a PostgreSQL database (simplified for clarity):
| serverdate | name | value
|-------------------------------------
0 | 2019-12-01 | A LOC 123 DISP | 1
1 | 2019-12-01 | B LOC 456 DISP | 2
2 | 2019-12-01 | C LOC 777 DISP | 0
3 | 2019-12-01 | D LOC 000 DISP | 10
4 | 2019-12-01 | A LOC 700 DISP | 123
5 | 2019-12-01 | F LOC 777 DISP | 8
The name column is of type string. The substrings LOC and DISP can have other values of different lengths but are not of interest in this question.
The problem: I want to SELECT only the rows that contain certain substrings. There are several substrings, passed as an ARRAY, in the following format:
['A_123', 'F_777'] # this is an example only
I want to select all the rows that contain both the first part of a substring (splitting it at the underscore '_') and the second part. In this example, with the array above, I should obtain rows 0 and 5 (as these are the only ones that match both parts exactly):
| serverdate | name | value
|-------------------------------------
0 | 2019-12-01 | A LOC 123 DISP | 1
5 | 2019-12-01 | F LOC 777 DISP | 8
Row 4 has the first part of the substring correct, but not the other one, so it shouldn't be returned. Same thing with row 2 (only second part matches).
How could this query be done? I'm relatively new to SQL.
This query is part of process in Python, so I can adjust the input parameter (the substring array) if needed, but the behaviour must be the same as the one described.
Thanks!
Have you tried with regexp_replace and a subquery?
SELECT * FROM
(SELECT serverdate, substring(name from 1 for 1)||'_'||
regexp_replace(name, '\D*', '', 'g') AS name, value
FROM t) j
WHERE name IN('A_123', 'F_777');
Or using a CTE
WITH j AS (
SELECT serverdate, substring(name from 1 for 1)||'_'||
regexp_replace(name, '\D*', '', 'g') AS name2,
value,name
FROM t
) SELECT serverdate,name,value FROM j
WHERE name2 IN('A_123', 'F_777');
serverdate | name | value
------------+----------------+-------
2019-12-01 | A LOC 123 DISP | 1
2019-12-01 | F LOC 777 DISP | 8
(2 rows)
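Since the query is driven from Python anyway, the key-building step can be sketched there too. This is a hedged illustration, not a replacement for the query: make_key is a hypothetical name, and it assumes, like the SQL above, that the only digits in name are the ones you want to compare.

```python
import re


def make_key(name: str) -> str:
    """First character, an underscore, then every digit found in the name."""
    return name[:1] + "_" + re.sub(r"\D", "", name)


rows = [
    ("2019-12-01", "A LOC 123 DISP", 1),
    ("2019-12-01", "B LOC 456 DISP", 2),
    ("2019-12-01", "C LOC 777 DISP", 0),
    ("2019-12-01", "D LOC 000 DISP", 10),
    ("2019-12-01", "A LOC 700 DISP", 123),
    ("2019-12-01", "F LOC 777 DISP", 8),
]
wanted = {"A_123", "F_777"}

selected = [row for row in rows if make_key(row[1]) in wanted]
for row in selected:
    print(row)
```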
Just unnest the array and join the table using a like clause
select
*
from
Table1
join
(
select
'%'||replace(unnest, '_', '%')||'%' pat
from
unnest(array['A_123', 'F_777'])
) pat_table on "name" like "pat"
To pass the list as a single comma-separated string, replace unnest(array['A_123', 'F_777']) with unnest(string_to_array(str_variable, ','))
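To see what the generated LIKE patterns accept, here is a rough Python sketch that translates each key into an equivalent regular expression ('%A%123%' becomes "A, then anything, then 123"). The function name key_to_regex is hypothetical.

```python
import re


def key_to_regex(key: str) -> re.Pattern:
    """Build the regex equivalent of LIKE '%<first>%<second>%' for a key like 'A_123'."""
    first, second = key.split("_", 1)
    # DOTALL because % in LIKE also matches line breaks.
    return re.compile(re.escape(first) + ".*" + re.escape(second), re.DOTALL)


names = [
    "A LOC 123 DISP",
    "B LOC 456 DISP",
    "C LOC 777 DISP",
    "D LOC 000 DISP",
    "A LOC 700 DISP",
    "F LOC 777 DISP",
]
patterns = [key_to_regex(key) for key in ["A_123", "F_777"]]

matched = [name for name in names if any(p.search(name) for p in patterns)]
print(matched)
```

Keep in mind that this kind of pattern is looser than an exact comparison: a name such as "XA LOC 9123" would also match the key 'A_123'.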
Thanks for your answers! The solution by Larry B gave me an error, but it was caused by external factors: I run the queries using an internal tool developed by my company, and it threw errors when using the % wildcard. (Strange behaviour; I have already contacted the support team.) So I could not test it properly.
The solution by Jim Jones seemed like an alternative, but I found that, in some cases, the values in the name field look like this (I didn't notice it when writing the question, as it is a rare case):
ABC LOC 123 DISP
So I modified the solution a little to grab the first part of the name when splitting it at the ' ' character.
(TL;DR: the 1st substring of name can be of arbitrary length, but is always at the start)
My solution is this one:
SELECT * FROM
(SELECT serverdate, split_part(name, ' ', 1)||'_'||
regexp_replace(name, '\D*', '', 'g') AS name, value
FROM t) j
WHERE name IN('A_123', 'F_777');
split_part(name, ' ', 1) || '_' || split_part(name, ' ', 3) as name
This is the breakdown of the expression: 'A' || '_' || '123' = 'A_123'
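A short Python sketch of building the key from the space-separated parts of name, which also covers the "ABC LOC 123 DISP" case. It assumes that the number is always the third space-separated token, as in the sample data; the function name key_from_parts is hypothetical.

```python
def key_from_parts(name: str) -> str:
    """Join the 1st and 3rd space-separated tokens with an underscore."""
    parts = name.split(" ")
    # Assumption: names always look like '<prefix> <word> <number> <word>'.
    return parts[0] + "_" + parts[2]


for name in ["A LOC 123 DISP", "ABC LOC 123 DISP", "F LOC 777 DISP"]:
    print(name, "->", key_from_parts(name))
```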

regex to convert alphanumeric and special characters in a string to * in oracle

I have a requirement to convert all the characters in my string to *. My string can also contain special characters as well.
For Example:
abc_d$ should be converted to ******.
Can anybody help me with a regex like this in Oracle?
Thanks
Use REGEXP_REPLACE and replace any single character (.) with *.
SELECT
REGEXP_REPLACE (col, '.', '*')
FROM yourTable
Instead of regex you could also use
select rpad('*', length('abc_d$ s'),'*') from dual
-- use '*' and pad it until length fits with other *
Docs: rpad(string,length,appendWhat)
In databases that provide it (for example MySQL or PostgreSQL), repeat(string,count) with a string of '*' should work as well (not tested); Oracle itself has no built-in repeat().
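The two masking approaches can be compared side by side in Python. This is a sketch with hypothetical function names; it also shows one case where they differ, because '.' does not match a line break by default (in Oracle that needs the 'n' match parameter, in Python the DOTALL flag).

```python
import re


def mask_regex(text: str) -> str:
    """Replace every character matched by '.' with '*' (line breaks are not matched)."""
    return re.sub(r".", "*", text)


def mask_by_length(text: str) -> str:
    """Build a string of '*' with the same length as the input, like rpad('*', length(s), '*')."""
    return "*" * len(text)


print(mask_regex("abc_d$"))
print(mask_by_length("abc_d$"))
print(repr(mask_regex("a\nb")), repr(mask_by_length("a\nb")))
```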
regex or rpad makes no difference - they are optimized down to the same execution plan:
n-th try of rpad:
Plan Hash Value : 1388734953
-----------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 2 | 00:00:01 |
| 1 | FAST DUAL | | 1 | | 2 | 00:00:01 |
-----------------------------------------------------------------
n-th try of regexp_replace:
Plan Hash Value : 1388734953
-----------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 2 | 00:00:01 |
| 1 | FAST DUAL | | 1 | | 2 | 00:00:01 |
-----------------------------------------------------------------
So, judging by the execution plan, it does not matter which one you use.
THIS IS NOT AN ANSWER
As suggested by Tom Biegeleisen’s brother Tim, I ran a test to compare a solution based on regular expressions to one using just standard string functions. (Specifically, Tim's answer with regular expressions vs. Patrick Artner's solution using just LENGTH and RPAD.)
Details of the test are shown below.
CONCLUSION: On a table with 5 million rows, each consisting of one string of length 30 (in a single column), the regular expression query runs in 21 seconds. The query using LENGTH and RPAD runs in one second. Both solutions read all the data from the table; the only difference is the function used in the SELECT clause. As noted already, both queries have the same execution plan, AND the same estimated cost - because the cost does not take into account differences in function calculation time.
Setup:
create table tbl ( str varchar2(30) );
insert into tbl
select a.str
from ( select dbms_random.string('p', 30) as str
from dual
connect by level <= 100
) a
cross join
( select level
from dual
connect by level <= 50000
) b
;
commit;
Note that there are only 100 distinct values, and each is repeated 50,000 times for a total of 5 million values. We know the values are repeated; Oracle doesn't know that. It will really do "the same thing" 5 million times, it won't just do it 100 times and then simply copy the results; it's not that smart. This is something that would be known only by seeing the actual stored data, it's not known to Oracle beforehand, so it can't "prepare" for such shortcuts.
Queries:
The two queries - note that I didn't want to send 5 million rows to screen, nor did I want to populate another table with the "masked" values (and muddy the waters with the time it takes to INSERT the results into another table); rather, I compute all the new strings and take the MAX. Again, in this test all "new" strings are equal to each other - they are all strings of 30 asterisks - but there is no way for Oracle to know that. It really has to compute all 5 million new strings and take the max over them all.
select max(new_str)
from ( select regexp_replace(str, '.', '*' ) as new_str
from tbl
)
;
select max(new_str)
from ( select rpad('*', length(str), '*') as new_str
from tbl
)
;
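The shape of this test (a few distinct strings repeated many times, with MAX used to avoid shipping rows anywhere) can be imitated in Python as a rough sketch. All names here are hypothetical, the sizes are far smaller than in the Oracle test, and the timings say nothing about Oracle itself; they only illustrate how to time the function call rather than the plan.

```python
import random
import re
import string
import time


def build_rows(distinct=100, repeats=200, length=30, seed=0):
    """Create `distinct` random printable strings, each repeated `repeats` times."""
    rng = random.Random(seed)
    alphabet = string.digits + string.ascii_letters + string.punctuation
    values = ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(distinct)]
    return values * repeats


def mask_regex(text):
    return re.sub(r".", "*", text)


def mask_rpad(text):
    return "*" * len(text)


def timed_max(func, rows):
    """Apply `func` to every row, take the max, and report the elapsed seconds."""
    start = time.perf_counter()
    result = max(func(row) for row in rows)
    return result, time.perf_counter() - start


rows = build_rows()
for label, func in (("regex", mask_regex), ("rpad", mask_rpad)):
    result, seconds = timed_max(func, rows)
    print(f"{label}: {seconds:.4f}s, max={result}")
```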
Try this:
SELECT
REGEXP_REPLACE('B^%2',
'*([A-Z]|[a-z]|[0-9]|[ ]|([^A-Z]|[^a-z]|[^0-9]|[^ ]))', '*') "REGEXP_REPLACE"
FROM DUAL;
I have included white space too.
select name,lpad(regexp_replace(name,name,'*'),length(name),'*')
from customer;