Oracle Regex - Remove duplicates including Chinese - SQL

I'm trying to remove the duplicates in the results of a query involving listagg.
I'm using this syntax:
REGEXP_REPLACE (LISTAGG (PR.NAME, ',' ) WITHIN GROUP (ORDER BY 1),
'([^,]+)(,\1)+',
'\1') AS PRODUCERS
However, occurrences including Chinese characters are not removed.
Any idea?

Your regular expression does not work. If the LISTAGG output is A,A,AA then the regular expression ([^,]+)(,\1)+ does not check that it has matched a complete element of your list: it will match A,A,A (two and a half elements of the list) and give the output AA instead of the expected A,AA. Worse, if you have the string BA,BABAB,BABD then the regular expression will first replace BA,BA with BA and then replace BAB,BAB with BAB, and you end up with the string BABABD, which does not match any of the elements of the original list.
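You can see the second failure mode directly with a string literal (nothing assumed here beyond DUAL and standard REGEXP_REPLACE behaviour):
SELECT REGEXP_REPLACE('BA,BABAB,BABD', '([^,]+)(,\1)+', '\1') AS deduped
FROM DUAL;
-- returns BABABD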
An example demonstrating this is:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE names ( id, name ) AS
SELECT 1, 'A' FROM DUAL UNION ALL
SELECT 2, 'A' FROM DUAL UNION ALL
SELECT 3, 'B' FROM DUAL UNION ALL
SELECT 4, 'C' FROM DUAL UNION ALL
SELECT 5, 'A' FROM DUAL UNION ALL
SELECT 6, 'AA' FROM DUAL UNION ALL
SELECT 7, 'A' FROM DUAL UNION ALL
SELECT 8, 'BA' FROM DUAL UNION ALL
SELECT 9, 'A' FROM DUAL
/
Query 1:
SELECT REGEXP_REPLACE (
LISTAGG (NAME, ',' ) WITHIN GROUP (ORDER BY 1),
'([^,]+)(,\1)+',
'\1'
) AS constant_sort
FROM names
Results:
| CONSTANT_SORT |
|---------------|
| AA,BA,C |
If you want to get the distinct elements then you can use DISTINCT (as per Littlefoot's answer) or you can COLLECT the values into a user-defined collection and then use the SET function to remove duplicates. You can then pass this de-duplicated collection to a table collection expression and use LISTAGG to get your output:
Oracle 11g R2 Schema Setup:
CREATE TYPE StringList IS TABLE OF VARCHAR2(4000)
/
Query 2:
SELECT (
SELECT LISTAGG( column_value, ',' )
WITHIN GROUP ( ORDER BY ROWNUM )
FROM TABLE( n.unique_names )
) AS agg_names
FROM (
SELECT SET( CAST( COLLECT( name ORDER BY NAME ) AS StringList ) )
AS unique_names
FROM names
) n
Results:
| AGG_NAMES |
|-------------|
| A,AA,B,BA,C |
Regarding your comment:
in the context of a bigger query involving a lot of joins and given my beginner's skills I would have no idea how to implement this model
For example, if your query was:
SELECT REGEXP_REPLACE(
LISTAGG (PR.NAME, ',' ) WITHIN GROUP (ORDER BY 1),
'([^,]+)(,\1)+',
'\1'
) AS PRODUCERS,
other_column1,
other_column2
FROM table1 pr
INNER JOIN table2 t2
ON (pr.some_condition = t2.some_condition )
WHERE t2.some_other_condition = 'TRUE'
GROUP BY other_column1, other_column2
Then you can change it to:
SELECT (
SELECT LISTAGG( COLUMN_VALUE, ',' ) WITHIN GROUP ( ORDER BY ROWNUM )
FROM TABLE( t.PRODUCERS )
) AS producers,
other_column1,
other_column2
FROM (
SELECT SET( CAST( COLLECT( PR.name ORDER BY PR.NAME ) AS StringList ) )
AS PRODUCERS,
other_column1,
other_column2
FROM table1 pr
INNER JOIN table2 t2
ON (pr.some_condition = t2.some_condition )
WHERE t2.some_other_condition = 'TRUE'
GROUP BY other_column1, other_column2
) t

(I can't see images; company policy).
Why wouldn't you remove duplicates before applying LISTAGG? Something like
select listagg(x.distinct_name, ',') within group (order by 1) producers
from (select DISTINCT name distinct_name
from some_table
) x
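If this needs to sit inside the bigger grouped query from the question, a rough sketch (reusing the hypothetical table1/table2/other_column names from the other answer; adjust to your real schema) would be:
select listagg(x.name, ',') within group (order by x.name) as producers,
       x.other_column1,
       x.other_column2
from (select distinct pr.name, other_column1, other_column2
      from table1 pr
      inner join table2 t2
      on (pr.some_condition = t2.some_condition)
      where t2.some_other_condition = 'TRUE'
     ) x
group by x.other_column1, x.other_column2;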

Another way to remove duplicates is to use window functions and case:
select listagg(case when seqnum = 1 then name end, ',') within group (order by 1) as producers
from (select . . .,
row_number() over (partition by name order by name) as seqnum
from . . .
) t
This does require modifications to the rest of the query, but you should still be able to do the rest of the aggregations and computations.
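As a sketch only, filled in with the hypothetical table1/table2/other_column names used elsewhere on this page (the ". . ." placeholders above are the answer's own), the pattern could look like:
select listagg(case when seqnum = 1 then name end, ',')
         within group (order by name) as producers,
       other_column1,
       other_column2
from (select pr.name, other_column1, other_column2,
             -- seqnum = 1 marks the first occurrence of each name per group;
             -- LISTAGG ignores the NULLs produced for the later duplicates
             row_number() over (partition by other_column1, other_column2, pr.name
                                order by pr.name) as seqnum
      from table1 pr
      inner join table2 t2
      on (pr.some_condition = t2.some_condition)
      where t2.some_other_condition = 'TRUE'
     ) t
group by other_column1, other_column2;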

Related

SELECT rows with a new DISTINCT from a VARCHAR with CSV in it

I have an Oracle database table with a field called Classification, which is VARCHAR. The VARCHAR is a CSV (using semicolons). Example:
;CHR;
;OTR;CHR;ROW;
;CHA;ROW;
;OTR;ROW;
I want to pull all the rows that each add a new value in the CSV compared to the rows already pulled. It is OK if a row has a previously found value, as long as it also has a new, different value.
For instance from the above dataset it would be:
;CHR;
;OTR;CHR;ROW;
;CHA;ROW;
If I do just:
Select DISTINCT Classification from Table1
I get rows with overlapping values, because it is the overall VARCHAR that is distinct.
I can get all the distinct values using:
select LISTAGG(val,',') WITHIN GROUP ( ORDER BY val ) as final
FROM
(
select distinct trim(regexp_substr("Classification",'[^;]+', 1, level) ) as val
from Table1
connect by regexp_substr("Classification", '[^;]+', 1, level) is not null
ORDER BY val
)
which gives me
FINAL
CHA,CHR,OTR,ROW
but I am unable to make the link to pull out one record per unique value.
Is this possible with SQL?
EDIT: This is a database created by a large corporation; my company purchased the product. Now I am tasked with data mining the backend database for BI and have absolutely no control over the database structure.
No offence, but I see many answers to the questions I have researched stating 'do better database design/normalization', and while I agree, most of the askers I have read have no control over the database and are asking for SO assistance with a problem because of this, not for ridicule over bad database design.
I apologize if I offend anyone.
There is no parent/child relationship. I cannot see the object layer, but I assume these values are changed in the object layer before propagating to the client, as there is no link to them in the actual database.
Clarification:
I see 2 ways to solve this:
1. One select statement that pulls out one row based on a new unique value within the VARCHAR CSV (Classification)
2. Use my select statement to loop through and pull one row containing that value in the VARCHAR CSV (Classification)
Thanks all for the input. I upvoted the ones that worked for me. In the end I will be using the one I developed, just because I can easily manipulate the output (to a CSV) for what the analyst wishes.
Here's one way to approach it:
1. Assign row numbers to the original CSV data
2. Split the CSV into rows
3. Assign the split CSV values row numbers, sorted by the CSV ordering from the first step
4. Return any rows where the row number from the previous step = 1
5. Return the distinct list of CSVs
For example:
with tab as (
select ';CHR;' str from dual union all
select ';OTR;CHR;ROW;' str from dual union all
select ';CHA;ROW;' str from dual union all
select ';OTR;ROW;' str from dual
), ranks as (
select row_number() over ( order by str ) rn, tab.* from tab
), rws as (
select trim ( regexp_substr(str,'[^;]+', 1, level ) ) as val, rn, str
from ranks
connect by regexp_substr ( str, '[^;]+', 1, level ) is not null
and prior rn = rn
and prior sys_guid () is not null
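-- the nondeterministic prior sys_guid() is always non-null; it is only there to stop
-- Oracle raising ORA-01436 (CONNECT BY loop) when prior rn = rn connects a row to itself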
), rns as (
select row_number () over (
partition by val
order by rn
) val_rn, r.*
from rws r
)
select distinct str
from rns
where val_rn = 1;
STR
;CHA;ROW;
;OTR;CHR;ROW;
;CHR;
This is an ad-hoc solution proposal if the generic answer yields suboptimal performance and some restrictions are fulfilled:
all the keys have a fixed length
the maximal number of keys is known
Then, to parse the CSV string, you may use this query (add further UNION ALL branches for longer strings):
with tab as (
select ';CHR;' str from dual union all
select ';OTR;CHR;ROW;' str from dual union all
select ';CHA;ROW;' str from dual union all
select ';OTR;ROW;' str from dual
), tab2 as (
select str, substr(str,2,3) val from tab union all
select str, substr(str,6,3) val from tab where substr(str,6,3) is not null union all
select str, substr(str,10,3) val from tab where substr(str,10,3) is not null)
select * from tab2;
which results in
STR VAL
------------- ------------
;CHR; CHR
;OTR;CHR;ROW; OTR
;CHA;ROW; CHA
;OTR;ROW; OTR
;OTR;CHR;ROW; CHR
;CHA;ROW; ROW
;OTR;ROW; ROW
;OTR;CHR;ROW; ROW
Now you only need to find the first occurrence of each key and get all distinct strings containing that first occurrence.
I'm reusing the approach from Chris Saxon's solution:
with tab as (
select ';CHR;' str from dual union all
select ';OTR;CHR;ROW;' str from dual union all
select ';CHA;ROW;' str from dual union all
select ';OTR;ROW;' str from dual
), tab2 as (
select str, substr(str,2,3) val from tab union all
select str, substr(str,6,3) val from tab where substr(str,6,3) is not null union all
select str, substr(str,10,3) val from tab where substr(str,10,3) is not null),
tab3 as (
select STR, VAL,
row_number() over (partition by val order by str) rn
from tab2)
select distinct str
from tab3
where rn = 1
You were very close since you had already gotten the list of distinct values. Instead of combining them with LISTAGG, you can use that list to find a row that contains that unique value. Below are two separate queries that will return a Classification for each unique value. You can try them both and see which performs better based on the data you have in the table.
Query Option 1
WITH
table1 (classification)
AS
(SELECT ';CHR;' FROM DUAL
UNION ALL
SELECT ';OTR;CHR;ROW;' FROM DUAL
UNION ALL
SELECT ';CHA;ROW;' FROM DUAL
UNION ALL
SELECT ';OTR;ROW;' FROM DUAL),
dist_vals (val)
AS
( SELECT DISTINCT TRIM (REGEXP_SUBSTR (classification,
'[^;]+',
1,
LEVEL)) AS val
FROM Table1
CONNECT BY LEVEL < REGEXP_COUNT (classification, ';'))
SELECT val, classification
FROM (SELECT dv.val,
t.classification,
ROW_NUMBER () OVER (PARTITION BY dv.val ORDER BY t.classification) AS occurrence
FROM dist_vals dv, table1 t
WHERE t.classification LIKE '%;' || dv.val || ';%')
WHERE occurrence = 1;
Query Option 2
WITH
table1 (classification)
AS
(SELECT ';CHR;' FROM DUAL
UNION ALL
SELECT ';OTR;CHR;ROW;' FROM DUAL
UNION ALL
SELECT ';CHA;ROW;' FROM DUAL
UNION ALL
SELECT ';OTR;ROW;' FROM DUAL),
dist_vals (val)
AS
( SELECT DISTINCT TRIM (REGEXP_SUBSTR (classification,
'[^;]+',
1,
LEVEL)) AS val
FROM Table1
CONNECT BY LEVEL < REGEXP_COUNT (classification, ';'))
SELECT dv.val,
(SELECT classification
FROM table1
WHERE classification LIKE '%;' || dv.val || ';%' AND ROWNUM = 1)
FROM dist_vals dv;
I figured it out this way and it runs fast (even once all my joins to other tables are added). Will test the other answers as I can and decide on the best one (the others look better than mine if they work, as I would rather not use dbms_output).
DECLARE
v_search_string varchar2(4000);
v_classification varchar2(4000);
BEGIN
select LISTAGG(val,',') WITHIN GROUP ( ORDER BY val ) as final
INTO v_search_string
FROM
(
select distinct trim(regexp_substr("Classification",'[^;]+', 1, level) ) as val
from mytable
connect by regexp_substr("Classification", '[^;]+', 1, level) is not null
ORDER BY val
);
FOR i IN
(SELECT trim(regexp_substr(v_search_string, '[^,]+', 1, LEVEL)) l
FROM dual
CONNECT BY LEVEL <= regexp_count(v_search_string, ',')+1
)
LOOP
SELECT "Classification"
INTO v_classification
FROM mytable
WHERE "Classification" LIKE '%' || i.l || '%'
FETCH NEXT 1 ROWS ONLY;
dbms_output.put_line(v_classification);
END LOOP;
END;

Oracle Select max where a certain key is matched

I'm working with Oracle and PL/SQL. I need to query a table and select the max id where a key is matched; right now I have this query:
select t.* from (
select distinct (TO_CHAR(I.DATE, 'YYMMDD') || I.AUTH_CODE || I.AMOUNT || I.CARD_NUMBER) as kies, I.SID as ids
from transactions I) t group by kies, ids order by ids desc;
It's displaying this data
If I remove the ID from the query, it displays the distinct keys (in the query I use the alias KIES because KEYS was shown in blue, so I thought it might be a reserved word).
How can I display the max id (the last one inserted) for every different key, without displaying all the data like in the first image?
Greetings.
Do you just want aggregation?
select thekey, max(sid)
from (select t.*,
(TO_CHAR(t.DATE, 'YYMMDD') || t.AUTH_CODE || t.AMOUNT || t.CARD_NUMBER) as thekey
from transactions t
) t
group by thekey
order by max(sid) desc;
Since you haven't provided the data in text format, it's difficult to type such long numbers and recreate the data.
However, I think you can simply use the MAX analytic function to achieve your results.
with data as (
select 1111 keys,1 id from dual
union
select 2222, 1 from dual
union
select 1111, 2 from dual
union
select 2222,3 from dual
union
select 9999, 1 from dual
union
select 1111, 5 from dual
)
select distinct keys, max(id) over( partition by (keys)) from data
This query returns -
KEYS MAX(ID)OVER(PARTITIONBY(KEYS))
1111 5
9999 1
2222 3

Oracle: SQL Dynamic cursor statement

I have a dynamic temporary table like below.
Table name for assumption: TB_EMP_TEMP_TABLE
Column1 | Column2 | Column3
Emp_NM  | EMP_ID  | TB_EMP_DTLS
Emp_Adr | EMP_ID  | TB_EMP_DTLS
Emp_Sal | EMP_ID  | TB_EMP_OTHER
The above data is retrieved as a cursor (Emp_cursor) and I need to construct a dynamic SQL query, as below, based on the cursor data.
Expected Output:
SELECT TB_EMP_DTLS.EMP_NM,TB_EMP_DTLS.EMP_Adr,TB_EMP_OTHER.EMP_SAL
FROM TB_EMP_DTLS,TB_EMP_OTHER
WHERE TB_EMP_DTLS.EMP_ID=TB_EMP_OTHER.EMP_ID
I haven't worked extensively with PL/SQL/cursor concepts. How can the cursor be looped to get the expected output?
If I understand it right, you want the column1 values selected from the column3 tables, joined on the column2 columns.
It's not elegant, but it should work:
select listagg(v, ' ') within group (order by n asc) my_cursor from (
with
tb as (select distinct column3 val from tb_emp_temp_table), --tables
sl as (select distinct column3||'.'||column1 val from tb_emp_temp_table), --selected columns
pr as (select distinct column3||'.'||column2 val from tb_emp_temp_table) --predicates
select 1 n, 'SELECT' v from dual
union
select 2 n, listagg(val, ', ') within group (order by val) v from sl
union
select 3 n, 'FROM' v from dual
union
select 4 n, listagg(val, ', ') within group (order by val) v from tb
union
select 5 n, 'WHERE' v from dual
union
select 6 n, listagg(pra.val||'='||prb.val, ' AND ') within group (order by pra.val) v from pr pra, pr prb where pra.val != prb.val
)
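To actually run the text this query produces, one option is to open a weakly typed ref cursor over it. This is a minimal sketch only, with the generated statement hard-coded and the variable datatypes guessed, since the question does not show them:
DECLARE
  -- v_stmt would normally be filled by SELECT ... INTO from the listagg query above
  v_stmt VARCHAR2(4000) :=
       'SELECT TB_EMP_DTLS.Emp_NM, TB_EMP_DTLS.Emp_Adr, TB_EMP_OTHER.Emp_Sal'
    || ' FROM TB_EMP_DTLS, TB_EMP_OTHER'
    || ' WHERE TB_EMP_DTLS.EMP_ID = TB_EMP_OTHER.EMP_ID';
  c_emp  SYS_REFCURSOR;
  v_nm   VARCHAR2(100);   -- assumed datatypes
  v_adr  VARCHAR2(200);
  v_sal  NUMBER;
BEGIN
  OPEN c_emp FOR v_stmt;                   -- execute the dynamically built SELECT
  LOOP
    FETCH c_emp INTO v_nm, v_adr, v_sal;   -- one variable per generated column
    EXIT WHEN c_emp%NOTFOUND;
    DBMS_OUTPUT.put_line(v_nm || ' | ' || v_adr || ' | ' || v_sal);
  END LOOP;
  CLOSE c_emp;
END;
/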

BigQuery - Concatenate multiple rows into a single row

I have a BigQuery table with 2 columns:
id|name
1|John
1|Tom
1|Bob
2|Jack
2|Tim
Expected output: Concatenate names grouped by id
id|Text
1|John,Tom,Bob
2|Jack,Tim
For BigQuery Standard SQL:
#standardSQL
--WITH yourTable AS (
-- SELECT 1 AS id, 'John' AS name UNION ALL
-- SELECT 1, 'Tom' UNION ALL
-- SELECT 1, 'Bob' UNION ALL
-- SELECT 2, 'Jack' UNION ALL
-- SELECT 2, 'Tim'
--)
SELECT
id,
STRING_AGG(name ORDER BY name) AS Text
FROM yourTable
GROUP BY id
The optional ORDER BY name within STRING_AGG allows you to get a sorted list of names, as below:
id Text
1 Bob,John,Tom
2 Jack,Tim
For Legacy SQL
#legacySQL
SELECT
id,
GROUP_CONCAT(name) AS Text
FROM yourTable
GROUP BY id
If you need a sorted list here, you can use the query below (formally, BigQuery Legacy SQL does not guarantee a sorted list, but in most practical cases I have had, it worked):
#legacySQL
SELECT
id,
GROUP_CONCAT(name) AS Text
FROM (
SELECT id, name
FROM yourTable
ORDER BY name
)
GROUP BY id
You can use GROUP_CONCAT
SELECT id, GROUP_CONCAT(name) AS Text FROM <dataset>.<table> GROUP BY id

distinct listagg in oracle

I have a query something like this:
select tab1.id,
(
select listagg(tab2.surna||' '||tab2.name||':'||tab2.addr||' '||tab2.numb,', ') within group( order by tab2.name)
from tab2
where tab1.id=tab2.id2id /*join tab1 with tab2 */
)as address
from tab1
and the result is like:
name_surname1:addr 1,name_surname1:addr 2,name_surname2:addr 3
but the expected result would be something like:
name_surname1:(addr 1,addr 2),name_surname2:(addr 3)
How can I implement this in order to avoid duplicate records in the displayed names?
Thanks
I think you need 2 levels of listagg for that. As you didn't provide any script to replicate your data structure, I provide a generic example of my own...
with tab as (
select 's' s,'n' n, 'addr1' addr from dual
union all
select 's' s,'n' n, 'addr2' addr from dual
union all
select 'd' s,'k' n, 'addr3' addr from dual
union all
select 'd' s,'k' n, 'addr4' addr from dual
)
select listagg(res,',') within group (order by res) final_res from (
select s || n || ':(' ||listagg( addr,', ') within group (order by s,n) || ')'res
from tab
group by s||n
)
result is
dk:(addr3, addr4),sn:(addr1, addr2)
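Adapted to the tab1/tab2 layout from the question, a rough sketch (column names are taken from the question; rows of tab1 with no match in tab2 get a NULL address) could be:
select t1.id,
       x.address
from tab1 t1
left join (
  -- second level: one aggregated string per id2id
  select id2id,
         listagg(res, ',') within group (order by res) as address
  from (
    -- first level: one "surname name:(addr ...)" string per person
    select t2.id2id,
           t2.surna || ' ' || t2.name || ':(' ||
             listagg(t2.addr || ' ' || t2.numb, ', ')
               within group (order by t2.addr) || ')' as res
    from tab2 t2
    group by t2.id2id, t2.surna, t2.name
  )
  group by id2id
) x
on x.id2id = t1.id;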