Abbreviation of Strings that Remains Unique - sql

I have a unique list of strings (the original idea was the column names in a table).
The task is to perform a maximal possible abbreviation of the list, so the list remains distinct.
For example AAA, AB can be abbreviated to AA, AB. (But not to A, AB – as A could be prefix of both AAA and AB).
AAAA, BAAAA can be shortened to A, B.
But A1, A2 can’t be abbreviated at all.
Here is the sample data:
create table tab as
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;
The expected result is:
COL     ABR_COL
------  -------
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2
I managed a brute-force solution consisting of four subqueries, which I am deliberately not posting, because I hope a simpler solution exists and I do not want to distract from it.
Btw, there is a similar function in R called abbreviate, but I'm looking for an SQL solution. Oracle is preferred, but solutions for other RDBMS are welcome.

I would do the filtering in the recursive CTE:
with potential_abbreviations(col, abbr, lev) as (
      select col, col as abbr, 1 as lev
      from tab
      union all
      select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
      from potential_abbreviations pa
      where length(abbr) > 1 and
            not exists (select 1
                        from tab
                        where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                              tab.col <> pa.col
                       )
     )
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
      from potential_abbreviations pa
     ) pa
where seqnum = 1
Here is a db<>fiddle.
The lev is not strictly necessary. You can order by length(abbr) instead (ascending, so that the shortest abbreviation gets seqnum = 1). But I usually include a recursion counter when I use recursive CTEs, so this is habit.
Doing the extra comparison in the CTE may look more complicated, but it simplifies the execution: the recursion stops at the correct value.
This is also tested on unique single letter col values.
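As a cross-check of the rule this recursion implements, here is a minimal Python sketch. It is not part of the SQL answer, and the function name shortest_unique_prefixes is made up for illustration. It assumes an abbreviation is the shortest prefix of a name that no other name in the list starts with, falling back to the full name when no such prefix exists.

```python
def shortest_unique_prefixes(names):
    """Map each name to its shortest prefix that no other name starts with.

    Falls back to the full name when every prefix collides with another
    name (for example 'AAA' next to 'AAAAAA').
    """
    result = {}
    for name in names:
        others = [other for other in names if other != name]
        result[name] = name  # fallback: the name itself
        for length in range(1, len(name) + 1):
            prefix = name[:length]
            if not any(other.startswith(prefix) for other in others):
                result[name] = prefix
                break
    return result


# Sample data from the question.
sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
abbreviations = shortest_unique_prefixes(sample)
for col in sorted(abbreviations):
    print(col, abbreviations[col])
```

On the sample data this should reproduce the expected result from the question, which makes it a convenient reference when testing an SQL variant.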

This is actually possible using a recursive CTE. I don't really get it shorter than three subqueries (plus one query), but at least it is not constrained by string length. The steps are roughly as follows:
1. Calculate all potential abbreviations with a recursive CTE. This selects all column names themselves and then the column names shortened by one letter, recursively:

   COL  ABBR
   ---  ----
   AAA  AAA
   AAA  AA
   AAA  A
   ...

2. For each abbreviation, count how often it occurs:

   ABBR  CONFLICT
   ----  --------
   AA    3
   AAA   2
   AABA  1
   ...

3. Select the abbreviations that occur only once, and also the abbreviations that are just the column name itself, and rank these by length of the abbreviation. In the example, you see that AAA conflicts with some other abbreviation but still must be chosen, as it is equal to its unshortened name:

   COL     ABBR    CONFLICT  POS
   ------  ------  --------  ---
   AAA     AAA     2         1
   AAAAAA  AAAA    1         1
   AAAAAA  AAAAA   1         2
   AAAAAA  AAAAAA  1         3
   AABA    AAB     1         1
   ...

4. Choose the first-ranked abbreviation (or the column name itself) for each column:

   COL     ABBR  POS
   ------  ----  ---
   AAA     AAA   1
   AAAAAA  AAAA  1
   AABA    AAB   1
   ...
Complete SQL
This results in the following SQL, with the above steps as CTEs:
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations
where length(abbr) > 1
)
, abbreviation_counts as (
select abbr
, count(*) as conflict
from potential_abbreviations
group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
p.col
, p.abbr
, conflict
, rank() over (partition by col order by p.abbr) as pos
from potential_abbreviations p
join abbreviation_counts c on p.abbr = c.abbr
where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
order by col, abbr
Result
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
SQL Fiddle
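The counting steps of this answer can be mirrored in a few lines of Python. This is only a sketch under the assumption that the input names are unique, as stated in the question; the function name abbreviate_by_counts is hypothetical.

```python
from collections import Counter


def abbreviate_by_counts(names):
    """Return (conflict counts per prefix, chosen abbreviation per name)."""
    # Step 1: every prefix of every name, shortest first.
    prefixes = {
        name: [name[:k] for k in range(1, len(name) + 1)] for name in names
    }
    # Step 2: how many names produce each prefix.
    conflict = Counter(p for plist in prefixes.values() for p in plist)
    # Steps 3 and 4: shortest prefix produced by exactly one name,
    # otherwise the name itself.
    chosen = {
        name: next((p for p in plist if conflict[p] == 1), name)
        for name, plist in prefixes.items()
    }
    return conflict, chosen


sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
conflict, chosen = abbreviate_by_counts(sample)
print(conflict["AA"], conflict["AAA"], conflict["AABA"])
print(chosen["AAA"], chosen["AAAAAA"], chosen["AABA"])
```

The conflict counts correspond to the CONFLICT column in the step tables, so the intermediate values can be compared directly.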

I found a second approach, not added to the first answer as it is shorter and different. The steps are as follows:
Calculate all potential abbreviations for each name, recursively
SQL
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
Results
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
Then calculate the conflicts between abbreviations. Also keep track of the column name that led to this abbreviation. We only want to keep abbreviations that cause no conflict, so the min() aggregate is of no concern.
SQL
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
Result
ABBR CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA 1 AAAAAA
AAAAA 1 AAAAAA
AAAAAA 1 AAAAAA
AAB 1 AABA
AABA 1 AABA
...
Finally, do a left join of the potential abbreviations with the best conflict-free candidates, and just use the column name if there was no conflict-free resolution:
SQL
select
p.col as col
, nvl(min(c.abbr), p.col) as abbr
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col
order by col, abbr
Complete SQL
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
)
, conflict_free as (
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
)
select
p.col as col
-- , c.best_candidate
, nvl(min(c.abbr), p.col) as abbr
-- , min(c.abbr) over (partition by c.best_candidate) shortest
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col, c.best_candidate
order by col, abbr
Result
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
SQL Fiddle
Note: In PostgreSQL, a recursive CTE must be introduced with WITH RECURSIVE, whereas Oracle does not accept the word RECURSIVE there at all.

Related

Hive query optimization

My requirement is to get the id and name of the students having more than one email id and type=1.
I am using a query like:
select distinct b.id, b.name, b.email, b.type,a.cnt
from (
select id, count(email) as cnt
from (
select distinct id, email
from table1
) c
group by id
) a
join table1 b on a.id = b.id
where b.type=1
order by b.id
Please let me know whether this is fine or whether a simpler version is available.
Sample data is like:
id name email type
123 AAA abc#xyz.com 1
123 AAA acd#xyz.com 1
123 AAA ayx#xyz.com 3
345 BBB nch#xyz.com 1
345 BBB nch#xyz.com 1
678 CCC iuy#xyz.com 1
Expected Output:
123 AAA abc#xyz.com 1 2
123 AAA acd#xyz.com 1 2
345 BBB nch#xyz.com 1 1
678 CCC iuy#xyz.com 1 1
You can use group by with having count() for this requirement:
select distinct b.id
     , b.name
     , b.email
     , b.type
from table1 b
where id in
      (select distinct id from table1 group by email, id having count(email) > 1)
  and b.type=1
order by b.id
You can try the analytic form of the count() function:
SELECT sub.ID, sub.NAME
FROM (SELECT ID, NAME, TYPE, COUNT (*) OVER (PARTITION BY ID, EMAIL) cnt
      FROM raw.crddacia_raw) sub
WHERE sub.cnt > 1 AND sub.TYPE = 1
I strongly recommend using window functions. However, Hive does not support count(distinct) as a window function. There are different methods to solve this. One is to add an ascending and a descending dense_rank() and subtract one:
select id, name, email, type, cnt
from (select t1.*,
             (dense_rank() over (partition by id order by email) +
              dense_rank() over (partition by id order by email desc) - 1
             ) as cnt
      from table1 t1
     ) t
where type = 1;
I would expect this to have better performance than your version. However, it is worth testing different versions to see which has the better performance (and feel free to come back to let others know which is better).
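To see why two dense ranks give a distinct count, here is a small Python sketch of the identity for a single partition: the ascending dense rank plus the descending dense rank, minus one, equals the number of distinct values on every row. The helper names dense_rank and distinct_count_per_row are made up for this illustration.

```python
def dense_rank(values, descending=False):
    """Dense rank of each value within one partition (ties share a rank)."""
    ordered = sorted(set(values), reverse=descending)
    rank_of = {value: rank for rank, value in enumerate(ordered, start=1)}
    return [rank_of[value] for value in values]


def distinct_count_per_row(values):
    """Ascending rank + descending rank - 1, computed for every row."""
    asc = dense_rank(values)
    desc = dense_rank(values, descending=True)
    return [a + d - 1 for a, d in zip(asc, desc)]


# One list per id partition, taken from the sample data.
emails_for_id_123 = ["abc#xyz.com", "acd#xyz.com", "ayx#xyz.com"]
emails_for_id_345 = ["nch#xyz.com", "nch#xyz.com"]

print(distinct_count_per_row(emails_for_id_123))
print(distinct_count_per_row(emails_for_id_345))
```

Note that the partitions here include every row for the id regardless of type, which is also what the window functions see in the query, because its type filter is applied in the outer query.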
One more method uses collect_set and takes the size of the returned array to count the distinct emails.
Demo:
--your data example
with table1 as ( --use your table instead of this
select stack(6,
123, 'AAA', 'abc#xyz.com', 1,
123, 'AAA', 'acd#xyz.com', 1,
123, 'AAA', 'ayx#xyz.com', 3,
345, 'BBB', 'nch#xyz.com', 1,
345, 'BBB', 'nch#xyz.com', 1,
678, 'CCC', 'iuy#xyz.com', 1
) as (id, name, email, type )
)
--query
select distinct id, name, email, type,
size(collect_set(email) over(partition by id)) cnt
from table1
where type=1
Result:
id name email type cnt
123 AAA abc#xyz.com 1 2
123 AAA acd#xyz.com 1 2
345 BBB nch#xyz.com 1 1
678 CCC iuy#xyz.com 1 1
We still need DISTINCT here because the analytic function does not remove duplicate rows, as in the case of 345 BBB nch#xyz.com.
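The same logic can be sketched in Python, with a set per id standing in for collect_set. This is only an illustration of the steps (filter on type, collect distinct emails per id, attach the set size to every row, drop duplicate rows), not Hive code.

```python
from collections import defaultdict

# Sample data from the question: (id, name, email, type).
rows = [
    (123, "AAA", "abc#xyz.com", 1),
    (123, "AAA", "acd#xyz.com", 1),
    (123, "AAA", "ayx#xyz.com", 3),
    (345, "BBB", "nch#xyz.com", 1),
    (345, "BBB", "nch#xyz.com", 1),
    (678, "CCC", "iuy#xyz.com", 1),
]

# The WHERE clause runs before the window function, so filter first.
type1_rows = [row for row in rows if row[3] == 1]

# collect_set(email) over (partition by id): one set of emails per id.
emails_by_id = defaultdict(set)
for student_id, _name, email, _type in type1_rows:
    emails_by_id[student_id].add(email)

# Attach the set size to every row; the set comprehension plays the
# role of SELECT DISTINCT.
result = sorted({row + (len(emails_by_id[row[0]]),) for row in type1_rows})
for row in result:
    print(*row)
```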
This is very similar to your query, but here I filter the data at the initial step (in the inner query) so that the join happens on less data:
select distinct orig_table.id, orig_table.name, orig_table.email, orig_table.type, intr_table.cnt
from table1 orig_table
join (
      select a.id, a.type, count(distinct a.email) as cnt
      from table1 a
      where a.type = 1
      group by a.id, a.type
     ) intr_table
  on intr_table.id = orig_table.id and intr_table.type = orig_table.type

How to get this type of result in Teradata sql without using inbuilt functions

This is the source table:
Id.  A   B
---------------
1    aa  bb
2    cc  dd
The output table I need is:
Id.  Col1  Col2
------------------------
1    A     aa
1    B     bb
2    A     cc
2    B     dd
You can use union all:
select id, 'A' as col1, a as col2 from t
union all
select id, 'B', b from t;
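For readers who want to check the shape of the result, the two branches of the union all can be sketched in Python as two passes over the source rows, each tagging the value with a literal column label. This is an illustration only; the sort at the end is just for display, since union all by itself does not guarantee an order.

```python
# Source table from the question: (id, a, b).
source = [(1, "aa", "bb"), (2, "cc", "dd")]

# First branch: select id, 'A' as col1, a as col2 from t
from_a = [(row_id, "A", a) for row_id, a, _b in source]
# Second branch: select id, 'B', b from t
from_b = [(row_id, "B", b) for row_id, _a, b in source]

# union all simply concatenates the two branches.
output = sorted(from_a + from_b)
for row in output:
    print(*row)
```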

Sum analytical function or any other easy way

I have the data below and need to select all columns along with the sum of one column:
id  size  desc1  desc2
1   13    xxx    yyy
1   13    xxx    yyy
1   10    mmm    kkk
1   10    mmm    kkk
I need the output below:
id  total_size  desc1  desc2
1   23          xxx    yyy
1   23          xxx    yyy
1   23          mmm    kkk
1   23          mmm    kkk
total_size should be sum(distinct size).
select a.id
,a.size
,sum(b.size) as 'total_size'
,a.desc1
,a.desc2
from (
select *, row_number() over (order by id, size, desc1, desc2) as 'RowNumber'
from #tmp
) a
left join (
select *, row_number() over(partition by id, size order by id) as 'dupe'
from #tmp
) b
on a.id = b.id
and b.dupe=1
group by a.RowNumber
,a.id
,a.size
,a.desc1
,a.desc2
Not here to argue, but you should really consider reviewing the data structure you're working with.
1. Select your data, adding a column to number the rows.
2. Join a copy of your data (with distinct records only).
3. Sum the size column from the list of distinct records.
You just need to add sum(distinct "size") over (partition by id) to compute the total_size column for each row in your SQL:
with tab(id,"size","desc1","desc2") as
(
select 1 ,13,'xxx','yyy' from dual union all
select 1 ,13,'xxx','yyy' from dual union all
select 1 ,10,'mmm','kkk' from dual union all
select 1 ,10,'mmm','kkk' from dual
)
select t.id,
sum(distinct t."size") over (partition by id) as "total_size",
t."desc1",t."desc2"
from tab t;
P.S. size is a reserved keyword, so it cannot be used as a column name unless it is quoted, as "size".
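What sum(distinct ...) over (partition by id) computes can be sketched in Python as follows: collect the distinct sizes per id, sum them once, and attach that total to every original row. This is only an illustration of the semantics on the sample data.

```python
from collections import defaultdict

# Sample data from the question: (id, size, desc1, desc2).
rows = [
    (1, 13, "xxx", "yyy"),
    (1, 13, "xxx", "yyy"),
    (1, 10, "mmm", "kkk"),
    (1, 10, "mmm", "kkk"),
]

# Distinct sizes per id (the partition).
sizes_by_id = defaultdict(set)
for row_id, size, _desc1, _desc2 in rows:
    sizes_by_id[row_id].add(size)

# Every input row is kept and receives the per-id total.
output = [
    (row_id, sum(sizes_by_id[row_id]), desc1, desc2)
    for row_id, _size, desc1, desc2 in rows
]
for row in output:
    print(*row)
```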

How to add several unpivot() functions in the same select statement ORACLE

Here is my example,
WITH TABLE1 (ID,COL1,COL2,SUBCOL1,SUBCOL2)
as (select 1, 'm1',null,'s1',null from dual
union all
select 2, null ,'m2', null,'s2' from dual
)
select * from TABLE1;
From the above table1 I want to create a view as follows:
ID | COLTYPE | col | SUBCOLTYPE | subcol
-------------------------------------------------------------
1 COL1 m1 SUBCOL1 s1
2 COL2 m2 SUBCOL2 s2
What I did was merge COL1 and COL2 into COL, and SUBCOL1 and SUBCOL2 into SUBCOL. Can I achieve this using the UNPIVOT() function?
My imaginary query is as follows:
select * from table1
unpivot(COL for COLTYPE in (COL1,COL2)) --- FIRST MERGE
unpivot(SUBCOL FOR SUBCOLTYPE IN (SUBCOL1,SUBCOL2)) ---SECOND MERGE
;
Each of the first and second merge lines works individually when the other one is commented out, but they do not work at the same time. How can I add several unpivot() functions in the same select statement? Is it possible?
Answer to the original question
Why do you want to use unpivot? I ask because you are not transposing columns into rows. Is it guaranteed that all the columns you want to merge are NULL except one of them? If yes you could use the coalesce function.
Source data:
ID COL1 COL2 SUBCOL1 SUBCOL2
---------- -------- -------- -------- --------
1 m1 s1
2 m2 s2
Example:
WITH TABLE1 (ID,COL1,COL2,SUBCOL1,SUBCOL2)
as (select 1, 'm1',null,'s1',null from dual
union all
select 2, null ,'m2', null,'s2' from dual
)
select id,
coalesce(col1,col2) as col,
coalesce(subcol1, subcol2) as subcol
from TABLE1;
Result:
ID COL SUBCOL
---------- -------- --------
1 m1 s1
2 m2 s2
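The coalesce approach amounts to "take the first non-NULL value from left to right". A minimal Python sketch of that behaviour on the same data, with None standing in for NULL and a hand-written coalesce helper:

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((value for value in values if value is not None), None)


# Source data: (id, col1, col2, subcol1, subcol2).
table1 = [
    (1, "m1", None, "s1", None),
    (2, None, "m2", None, "s2"),
]

merged = [
    (row_id, coalesce(col1, col2), coalesce(subcol1, subcol2))
    for row_id, col1, col2, subcol1, subcol2 in table1
]
for row in merged:
    print(*row)
```

This only gives a sensible merge under the assumption stated in the answer, namely that at most one column of each group is non-NULL.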
Answer to the updated question
After you edited your question, this syntax is probably what you are looking for:
WITH TABLE1 (ID,COL1,COL2,SUBCOL1,SUBCOL2)
as (select 1, 'm1',null,'s1',null from dual
union all
select 2, null ,'m2', null,'s2' from dual
)
select id, coltype, col, subcoltype, subcol from TABLE1
UNPIVOT ((col, subcol) FOR (coltype, subcoltype) IN ((col1, subcol1) AS ('col1', 'subcol1'), (col2, subcol2) AS ('col2', 'subcol2')));
Result:
ID COLTYPE COL SUBCOLTYPE SUBCOL
---------- ---------------- -------- ---------------------------- --------
1 col1 m1 subcol1 s1
2 col2 m2 subcol2 s2
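The multi-column unpivot can be pictured as looping over each row and each (col, subcol) pair, and emitting one output row per pair. The Python sketch below follows that picture; it assumes the default behaviour of dropping pairs whose values are all NULL, which is consistent with the two-row result shown above.

```python
# Source data as dictionaries, with None standing in for NULL.
table1 = [
    {"id": 1, "col1": "m1", "col2": None, "subcol1": "s1", "subcol2": None},
    {"id": 2, "col1": None, "col2": "m2", "subcol1": None, "subcol2": "s2"},
]

# Column pairs that are unpivoted together, as in the IN (...) list.
pairs = [("col1", "subcol1"), ("col2", "subcol2")]

unpivoted = []
for row in table1:
    for col_name, subcol_name in pairs:
        col, subcol = row[col_name], row[subcol_name]
        if col is None and subcol is None:
            continue  # assumption: all-NULL pairs are excluded
        unpivoted.append((row["id"], col_name, col, subcol_name, subcol))

for out_row in unpivoted:
    print(*out_row)
```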

SQL: return only those columns with different data

I have two rows from a table that has many columns. How do I return only those columns where the value for row1 does not equal the value for row2?
I'm using Oracle 11.1.0.07
~~ Edit: clarification ~~
Example:
So I've got a table with rows:
1 a b c d e f g h i j k l
2 a x c d e x g h y j k l
3 a b x d e x g h x y k z
I want to return rows where id (first column) is 1 or 3, only those columns that are different. So:
1 c f i j l
3 x x x y z
with column names.
In reality, the table I'm pulling from has 223007 rows, and 40 columns. The above is a simplified example. There are two rows (one each for primary key values) that I'm wanting to compare.
First, the SQL language was not designed for dynamic column generation. For that, you need to write dynamic SQL which should be done in a middle-tier or reporting component.
Second, if what you seek is to compare two specific rows, then the simplest solution would probably be to return those rows and analyze them in a middle-tier component. However, if you accept that we must return all columns and you insist on doing this in SQL, this is one solution:
With Inputs As
(
Select 1 As Col1,'a' As Col2,'b' As Col3,'c' As Col4,'d' As Col5,'e' As Col6,'f' As Col7,'g' As Col8,'h' As Col9,'i' As Col10,'j' As Col11,'k' As Col12,'l' As Col13 From dual
Union All Select 2,'a','x','c','d','e','x','g','h','y','j','k','l' From dual
Union All Select 3,'a','b','x','d','e','x','g','h','x','y','k','z' From dual
)
, TransposedInputs As
(
Select Col1, 2 As ColNum, Col2 As Value From Inputs
Union All Select Col1, 3, Col3 From Inputs
Union All Select Col1, 4, Col4 From Inputs
Union All Select Col1, 5, Col5 From Inputs
Union All Select Col1, 6, Col6 From Inputs
Union All Select Col1, 7, Col7 From Inputs
Union All Select Col1, 8, Col8 From Inputs
Union All Select Col1, 9, Col9 From Inputs
Union All Select Col1, 10, Col10 From Inputs
Union All Select Col1, 11, Col11 From Inputs
Union All Select Col1, 12, Col12 From Inputs
Union All Select Col1, 13, Col13 From Inputs
)
, UniqueValues As
(
Select Min(Col1) As Col1, ColNum, Value
From TransposedInputs
Where Col1 In(1,3)
Group By ColNum, Value
Having Count(*) = 1
)
Select Col1
, Min( Case When ColNum = 2 Then Value End ) As Col2
, Min( Case When ColNum = 3 Then Value End ) As Col3
, Min( Case When ColNum = 4 Then Value End ) As Col4
, Min( Case When ColNum = 5 Then Value End ) As Col5
, Min( Case When ColNum = 6 Then Value End ) As Col6
, Min( Case When ColNum = 7 Then Value End ) As Col7
, Min( Case When ColNum = 8 Then Value End ) As Col8
, Min( Case When ColNum = 9 Then Value End ) As Col9
, Min( Case When ColNum = 10 Then Value End ) As Col10
, Min( Case When ColNum = 11 Then Value End ) As Col11
, Min( Case When ColNum = 12 Then Value End ) As Col12
, Min( Case When ColNum = 13 Then Value End ) As Col13
From UniqueValues
Group By Col1
Results:
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Col11 | Col12 | Col13
1 | NULL | NULL | c | NULL | NULL | f | NULL | NULL | i | j | NULL | l
3 | NULL | NULL | x | NULL | NULL | x | NULL | NULL | x | y | NULL | z
If you're trying to transpose or pivot your row1 and row2 into columns, then these questions might help you:
Oracle PIVOT, twice?
Oracle SQL pivot query
etc (Google for Oracle PIVOT Query)
After pivoting, you can select only those tuples that have row1_pivoted <> row2_pivoted
Hmm, my first stab at an answer was wrong when I re-read the question. So, for clarification, you've got some rows/values
a b c
d e f
a b c
and you'd like only the 'd e f' row returned, because it doesn't have a duplicate row elsewhere?
The number of columns in the result set can't be dynamic (without resorting to dynamic SQL).
You might be interested in the Unpivot operator. That would let you return the columns as rows.
I haven't experimented with it myself yet, so unfortunately I'm unable to help you with it :/
Edit
I wanted to give manual pivoting a shot :)
select *
from inputs;
ID C1 C2 C3 C4 C5 C6
--- -- -- -- -- -- --
1 a b c d e f
2 a x c d e x
3 a b x d e x
with unpivoted as(
select id, 'c1' as cn, c1 as cv from inputs union all
select id, 'c2' as cn, c2 as cv from inputs union all
select id, 'c3' as cn, c3 as cv from inputs union all
select id, 'c4' as cn, c4 as cv from inputs union all
select id, 'c5' as cn, c5 as cv from inputs union all
select id, 'c6' as cn, c6 as cv from inputs
)
select cn
,max(case when id = 1 then cv end) as id1
,max(case when id = 3 then cv end) as id3
from unpivoted
where id in(1,3)
group
by cn
having count(distinct cv) = 2;
CN ID1 ID3
-- --- ---
c3 c x
c6 f x
The above works by creating one row for each column and ID (2 * 6 = 12 rows).
Then I group by the column name (assigned as a literal).
I will always get 6 groups (one for each column). In each group I will have exactly two rows (one for each selected ID).
In the having clause, I count the number of unique values for the column. If the rows have the same value, then the number of unique values = 1. Otherwise we have a mismatch.
Note 1. id in(x,y) is pushed into the view, so we are not selecting the entire table.
Note 2. This cannot be extended into comparing more than 2 rows.
Note 3. This does not deal with NULLS in either column
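If the comparison is moved to a middle tier, as suggested in an earlier answer, the same column-by-column check is short to write. The Python sketch below uses the six-column example data; the function name differing_columns is made up for illustration. Unlike the SQL above, it treats two None values as equal.

```python
# Example data keyed by id, then by column name.
inputs = {
    1: {"c1": "a", "c2": "b", "c3": "c", "c4": "d", "c5": "e", "c6": "f"},
    2: {"c1": "a", "c2": "x", "c3": "c", "c4": "d", "c5": "e", "c6": "x"},
    3: {"c1": "a", "c2": "b", "c3": "x", "c4": "d", "c5": "e", "c6": "x"},
}


def differing_columns(rows, first_id, second_id):
    """Return {column: (first value, second value)} where the rows differ."""
    first, second = rows[first_id], rows[second_id]
    return {
        name: (first[name], second[name])
        for name in first
        if first[name] != second[name]
    }


diff = differing_columns(inputs, 1, 3)
for column, (value1, value3) in diff.items():
    print(column, value1, value3)
```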