SELECT only Unique values from Multiple Columns in SQL - sql

I have to concatenate around 35 columns in a table into a single string. The data within a column can be repeated with different letter case, as shown below.
COL_1
apple | ORANGE | APPLE | Orange
COL_2
GRAPE | grape | Grape
The data in each column is pipe-separated, and I am trying to concatenate the columns using '|' as the separator. I expect the final output to be "apple | orange | grape" (all lower case is fine).
But currently I am getting
apple | ORANGE | APPLE | Orange | GRAPE | grape | Grape
My current SQL is
SELECT COL_1 || '|' || COL_2 from TABLE_X;
Can someone explain how to extract the unique values from each column? This would reduce my string length drastically. My current SQL exceeds Oracle's 4000-character limit.

I tried doing this
WITH test AS
( SELECT 'Test | test | Test' str FROM dual
)
SELECT *
FROM
(SELECT DISTINCT(LOWER(regexp_substr (str, '[^ | ]+', 1, rownum))) split
FROM test
CONNECT BY level <= LENGTH (regexp_replace (str, '[^ | ]+')) + 1
)
WHERE SPLIT IS NOT NULL;
This query produces only 'test'.
So it does return the unique values after splitting a column's string on ' | '. But doing this for 35+ columns in a single SQL query would be cumbersome. Could someone suggest a better approach?
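For reference, the target behaviour can be written down outside the database. This is only a Python sketch of the logic (split on '|', trim, lower-case, keep the first occurrence), not an Oracle solution; the function name unique_tokens is made up for illustration.

```python
def unique_tokens(*columns: str, sep: str = "|") -> str:
    """Merge pipe-separated columns, keeping each value once (case-insensitive)."""
    seen = set()
    ordered = []
    for column in columns:
        for token in column.split(sep):
            key = token.strip().lower()  # normalise case and surrounding spaces
            if key and key not in seen:
                seen.add(key)
                ordered.append(key)
    return " | ".join(ordered)


if __name__ == "__main__":
    col_1 = "apple | ORANGE | APPLE | Orange"
    col_2 = "GRAPE | grape | Grape"
    print(unique_tokens(col_1, col_2))
```

Any number of columns can be passed in one call, which is the part that gets cumbersome when each column needs its own split-and-distinct subquery.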

Related

Padding inside of a string in SQL

I just started learning SQL and here is my problem.
I have a column that contains acronyms like "GP2", "MU1", "FR10", ... and I want to add '0's to the acronyms that don't have enough characters.
For example, I want acronyms like "FR10", "GP48", ... to stay as they are, but acronyms like "MU3" must be converted into "MU03" so that they are the same length as the others.
I have already heard about LPAD and RPAD, but they only add the padding character at the left or right end of the string.
Thanks !
Is the minimum length 3 as in your examples and the padded value should always be in the 3rd position? If so, use a case expression and concat such as this:
with my_data as (
select 'GP2' as col1 union all
select 'MU1' union all
select 'FR10'
)
select col1,
case
when length(col1) = 3 then concat(left(col1, 2), '0', right(col1, 1))
else col1
end padded_col1
from my_data;
col1 | padded_col1
-----+------------
GP2  | GP02
MU1  | MU01
FR10 | FR10
An alternative using regexp_replace():
with tests(example) as (values
('ab02'),('ab1'),('A'),('1'),('A1'),('123'),('ABC'),('abc0'),('a123'),('abcd0123'),('1a'),('a1a'),('1a1') )
select example,
regexp_replace(
example,
'^(\D{0,4})(\d{0,4})$',
'\1' || repeat('0',4-length(example)) || '\2' )
from tests;
example | regexp_replace
----------+----------------
ab02 | ab02
ab1 | ab01
A | A000
1 | 0001
A1 | A001
123 | 0123
ABC | ABC0
abc0 | abc0
a123 | a123
abcd0123 | abcd0123 --caught, repeat('0',-4) is same as repeat('0',0), so nothing
1a | 1a --doesn't start with non-digits
a1a | a1a --doesn't end with digits
1a1 | 1a1 --doesn't start with non-digits
The pattern works as follows:
\D catches non-digits at the start of the string (^).
\d catches digits at the end of the string ($).
{0,4} specifies that it is looking for 0 to 4 occurrences of each.
The two groups enclosed in consecutive parentheses () are referenced with the backreferences \1 and \2.
The space between them is filled with a repeat() of '0' up to a total length of 4.
It's good to consider additional test cases.
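To try further test cases without a database at hand, the same idea can be sketched in Python. This is an assumption-laden model of the query above, not the query itself: pad_acronym is a hypothetical name, and Python's re module stands in for the database's regex engine.

```python
import re

# Same shape as the SQL pattern: up to 4 non-digits, then up to 4 digits.
PATTERN = re.compile(r"^(\D{0,4})(\d{0,4})$")


def pad_acronym(value: str, width: int = 4) -> str:
    """Insert zeros between the letter part and the digit part up to `width`."""
    match = PATTERN.match(value)
    if match is None:
        return value  # e.g. '1a' or 'a1a': left unchanged, as in the SQL output
    letters, digits = match.groups()
    zeros = "0" * max(0, width - len(value))  # nothing to add if already long enough
    return letters + zeros + digits


if __name__ == "__main__":
    for example in ["ab02", "ab1", "A", "1", "ABC", "abcd0123", "1a", "a1a"]:
        print(example, "->", pad_acronym(example))
```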
Thank you all for your responses. I think I did something similar to Isolated's answer.
Here is what I've done ("acronym" is the name of the column and "destination" is the name of the table):
SELECT CONCAT(LEFT(acronym, 2), LPAD(RIGHT(acronym, LENGTH(acronym) - 2), 2, '0')) AS acronym
FROM destination
ORDER BY acronym;
Thanks !

Sort each character in a string from a specific column in Snowflake SQL

I am trying to alphabetically sort each value in a column with Snowflake. For example I have:
| NAME |
| ---- |
| abc |
| bca |
| acb |
and want
| NAME |
| ---- |
| abc |
| abc |
| abc |
How would I go about doing that? I've tried using SPLIT and then ordering the rows, but that doesn't seem to work without a specific delimiter.
Using REGEXP_REPLACE to introduce separator between each character, STRTOK_SPLIT_TO_TABLE to get individual letters as rows and LISTAGG to combine again as sorted string:
SELECT tab.col, LISTAGG(s.value) WITHIN GROUP (ORDER BY s.value) AS result
FROM tab
, TABLE(STRTOK_SPLIT_TO_TABLE(REGEXP_REPLACE(tab.col, '(.)', '\\1~'), '~')) AS s
GROUP BY tab.col;
For sample data:
CREATE OR REPLACE TABLE tab
AS
SELECT 'abc' AS col UNION
SELECT 'bca' UNION
SELECT 'acb';
Similar implementation as Lukasz's, but using regexp_extract_all to extract individual characters in the form of an array that we later split to rows using flatten . The listagg then stitches it back in the order we specify in within group clause.
with cte (col) as
(select 'abc' union
select 'bca' union
select 'acb')
select col, listagg(b.value) within group (order by b.value) as col2
from cte, lateral flatten(regexp_extract_all(col,'.')) b
group by col;
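Both answers follow the same three steps: split the string into single characters, order them, and join them back. As a rough illustration only (Python, not Snowflake; sort_chars is a made-up name, and Python orders by code point, which may differ from the collation Snowflake applies):

```python
def sort_chars(value: str) -> str:
    """Split into characters, sort them, and stitch them back together."""
    chars = list(value)    # plays the role of the split / flatten step
    chars.sort()           # plays the role of WITHIN GROUP (ORDER BY value)
    return "".join(chars)  # plays the role of LISTAGG


if __name__ == "__main__":
    for name in ["abc", "bca", "acb"]:
        print(name, "->", sort_chars(name))
```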

How to convert arrays from two different table columns to parallel rows?

I'm working with hive and I have a table of the following format (I present only one row, but it has many rows)
_______________________________
segments | rates | sessID
---------|-----------|---------
'1,2,3' | '10,20,30'| 555
Namely, two columns have a string representing arrays of the same length and the third column has some integer. I want to flatten the arrays such that first member of the first array appears in the same row with the first member of the second array, etc:
Something like:
----------------------------
segment | rate | sessId
--------|------|------------
1 | 10 | 555
2 | 20 | 555
3 | 30 | 555
I've tried the following query (for simplicity I've hardcoded the values):
SELECT explode(segments), explode (rates), sessID FROM
(SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) data ;
However, this does not produce the required result and returns an error instead:
FAILED: SemanticException 1:26 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'rates'
When I try to flatten just one column it does work:
The query:
SELECT explode(segments) FROM (
SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) data ;
the result:
1
2
3
How can I get the result I want?
I don't have access to Hive to test this, but the approach should basically work.
POSEXPLODE() can be used to get two columns, the position within an array and the item itself. Then you can use that position to look up the corresponding item from the other array...
SELECT
yourData.sessID,
segment.item AS segment,
SPLIT(yourData.rates, ',')[segment.pos] AS rate
FROM
yourData
LATERAL VIEW
POSEXPLODE(SPLIT(yourData.segments,',')) segment AS pos, item
POSEXPLODE() returns positions starting from 0, which matches Hive's 0-based array indexes, so [segment.pos] should work as written. If your results are shifted by one, adjust the index accordingly.
Please give this a try:
select sessID,tf1.val as segments, tf2.val as rates
from (SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) t
lateral view posexplode(segments) tf1
lateral view posexplode(rates) tf2
where tf1.pos = tf2.pos;
+---------+-----------+--------+--+
| sessid | segments | rates |
+---------+-----------+--------+--+
| 555 | 1 | 10 |
| 555 | 2 | 20 |
| 555 | 3 | 30 |
+---------+-----------+--------+--+
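The idea behind both answers is to pair elements by their position in the arrays. Here is a small Python sketch of that pairing, purely as an illustration of the logic (explode_parallel is a hypothetical helper, not a Hive function):

```python
def explode_parallel(segments: str, rates: str, sess_id: int):
    """Return one (segment, rate, sess_id) row per array position."""
    seg_items = segments.split(",")
    rate_items = rates.split(",")
    rows = []
    # enumerate() yields 0-based positions, used here to index the other array
    for pos, segment in enumerate(seg_items):
        rows.append((segment, rate_items[pos], sess_id))
    return rows


if __name__ == "__main__":
    for row in explode_parallel("1,2,3", "10,20,30", 555):
        print(row)
```

Looking up the second array by position corresponds to the first answer; the second answer explodes both arrays and keeps only the rows where the two positions are equal, which gives the same pairs.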

regex to convert alphanumeric and special characters in a string to * in oracle

I have a requirement to convert all the characters in my string to *. My string can also contain special characters as well.
For Example:
abc_d$ should be converted to ******.
Can anybody help me with a regex like this in Oracle?
Thanks
Use REGEXP_REPLACE and replace any single character (.) with *.
SELECT
REGEXP_REPLACE (col, '.', '*')
FROM yourTable
Instead of regex you could also use
select rpad('*', length('abc_d$ s'),'*') from dual
-- use '*' and pad it until length fits with other *
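Docs: rpad(string, length, appendWhat)
In databases that provide it (MySQL and PostgreSQL, for example), repeat(string, count) with a string of '*' should work as well (not tested). Oracle has no built-in REPEAT function, so use rpad there.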
regex or rpad makes no difference - they are optimized down to the same execution plan:
n-th try of rpad:
Plan Hash Value : 1388734953
-----------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 2 | 00:00:01 |
| 1 | FAST DUAL | | 1 | | 2 | 00:00:01 |
-----------------------------------------------------------------
n-th try of regexp_replace:
Plan Hash Value : 1388734953
-----------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 2 | 00:00:01 |
| 1 | FAST DUAL | | 1 | | 2 | 00:00:01 |
-----------------------------------------------------------------
So, judging by the execution plan alone, it does not matter which one you use.
THIS IS NOT AN ANSWER
As suggested by Tom Biegeleisen’s brother Tim, I ran a test to compare a solution based on regular expressions to one using just standard string functions. (Specifically, Tim's answer with regular expressions vs. Patrick Artner's solution using just LENGTH and RPAD.)
Details of the test are shown below.
CONCLUSION: On a table with 5 million rows, each consisting of one string of length 30 (in a single column), the regular expression query runs in 21 seconds. The query using LENGTH and RPAD runs in one second. Both solutions read all the data from the table; the only difference is the function used in the SELECT clause. As noted already, both queries have the same execution plan, AND the same estimated cost - because the cost does not take into account differences in function calculation time.
Setup:
create table tbl ( str varchar2(30) );
insert into tbl
select a.str
from ( select dbms_random.string('p', 30) as str
from dual
connect by level <= 100
) a
cross join
( select level
from dual
connect by level <= 50000
) b
;
commit;
Note that there are only 100 distinct values, and each is repeated 50,000 times for a total of 5 million values. We know the values are repeated; Oracle doesn't know that. It will really do "the same thing" 5 million times, it won't just do it 100 times and then simply copy the results; it's not that smart. This is something that would be known only by seeing the actual stored data, it's not known to Oracle beforehand, so it can't "prepare" for such shortcuts.
Queries:
The two queries - note that I didn't want to send 5 million rows to screen, nor did I want to populate another table with the "masked" values (and muddy the waters with the time it takes to INSERT the results into another table); rather, I compute all the new strings and take the MAX. Again, in this test all "new" strings are equal to each other - they are all strings of 30 asterisks - but there is no way for Oracle to know that. It really has to compute all 5 million new strings and take the max over them all.
select max(new_str)
from ( select regexp_replace(str, '.', '*' ) as new_str
from tbl
)
;
select max(new_str)
from ( select rpad('*', length(str), '*') as new_str
from tbl
)
;
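The same kind of comparison can be tried outside the database. The following is only a Python analogue of the experiment, under the assumption that re.sub stands in for regexp_replace and string repetition stands in for rpad; the function names and row count are made up, and the timings it prints say nothing about Oracle itself.

```python
import re
import timeit


def mask_regex(value: str) -> str:
    """Replace every character with '*' using a regular expression."""
    return re.sub(r".", "*", value)


def mask_pad(value: str) -> str:
    """Build a string of '*' with the same length, without any regex."""
    return "*" * len(value)


def compare(rows: int = 10_000) -> None:
    data = ["abc_d$ some text 0123456789 xy"] * rows  # 30-character strings
    for func in (mask_regex, mask_pad):
        # max() forces every masked string to be computed, as in the SQL test
        seconds = timeit.timeit(lambda: max(func(s) for s in data), number=1)
        print(f"{func.__name__}: {seconds:.4f}s")


if __name__ == "__main__":
    compare()
```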
Try this:
SELECT
REGEXP_REPLACE('B^%2',
'*([A-Z]|[a-z]|[0-9]|[ ]|([^A-Z]|[^a-z]|[^0-9]|[^ ]))', '*') "REGEXP_REPLACE"
FROM DUAL;
I have included whitespace too.
select name,lpad(regexp_replace(name,name,'*'),length(name),'*')
from customer;

How to use Regex in SQL for extracting values after repetitive numbers

I have the following table (table1):
+---+---------------------------------------------+
|   | att1                                        |
+---+---------------------------------------------+
| 1 | 10.2.5.4 4.3.2.1.in-addr.arpa |
| 2 | asd 100.99.98.97 97.3.2.1.a.b.c fsdf |
| 3 | fd 95.94.93.92 92.5.7.1.a.b.c |
| 4 | a 11.4.99.75 75.77.52.41.in-addr.arpa |
+---+---------------------------------------------+
I would like to get the following values (that are located after the repetitive numbers): in-addr.arpa, a.b.c, a.b.c, in-addr.arpa.
I tried to use the following format with no success:
SELECT att1
FROM table1
WHERE REGEXP_LIKE(att1 , '^(\d+?)\1$')
I would like it to run in Impala and Oracle.
Use REGEXP_SUBSTR (assuming you are using an Oracle DB).
select regexp_substr(att1,'[0-9]\.([^0-9]+)',1,1,null,1)
from table1
[0-9]\. matches a digit followed by a literal dot.
[^0-9]+ matches one or more non-digit characters, up to the next digit or the end of the string. The parentheses () around it make this the first capture group, and the last argument (1) tells REGEXP_SUBSTR to return only that group.
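To check the pattern against the sample rows, here is a quick Python sketch using the re module as a stand-in for Oracle's regex engine. Note that [^0-9]+ also matches spaces, so on row 2 the capture runs on into the trailing text. The TIGHTER pattern, which additionally excludes whitespace, is my own assumption and not part of the answer above.

```python
import re

ORIGINAL = re.compile(r"[0-9]\.([^0-9]+)")    # pattern from the answer
TIGHTER = re.compile(r"[0-9]\.([^0-9\s]+)")   # assumed variant: also stop at whitespace


def extract(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    rows = [
        "10.2.5.4 4.3.2.1.in-addr.arpa",
        "asd 100.99.98.97 97.3.2.1.a.b.c fsdf",
        "fd 95.94.93.92 92.5.7.1.a.b.c",
        "a 11.4.99.75 75.77.52.41.in-addr.arpa",
    ]
    for row in rows:
        print(repr(extract(ORIGINAL, row)), repr(extract(TIGHTER, row)))
```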