Consolidate records - SQL

I want to consolidate a set of records:
id referenceid
1 10
1 11
2 11
2 10
3 10
3 11
3 12
The result of the query should be
1 10
1 11
3 10
3 11
3 12
So, since id=1 and id=2 have the same set of corresponding referenceids {10,11}, they would be consolidated. But id=3's set of referenceids is not the same, hence it wouldn't be consolidated.
What would be a good way to get this done?
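For reference, a minimal setup to reproduce the sample data (the table and column names are assumptions taken from the answers below; the first answer uses MyTable, the second uses tblReferences for the same data):

-- hypothetical setup matching the sample data above
create table MyTable (id int, referenceid int);

insert into MyTable (id, referenceid) values
(1, 10), (1, 11),
(2, 11), (2, 10),
(3, 10), (3, 11), (3, 12);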

-- MySQL-style answer: build a per-id "signature" by concatenating its
-- referenceids in a fixed order, then keep only the smallest id for each
-- distinct signature. The Order By inside Group_Concat makes the signature
-- deterministic (ordering a derived table alone does not guarantee this).
Select id, referenceid
From MyTable
Where Id In (
    Select Min( Z.Id ) As Id
    From (
        Select Z1.id,
               Group_Concat( Z1.referenceid Order By Z1.referenceid ) As signature
        From MyTable As Z1
        Group By Z1.id
    ) As Z
    Group By Z.signature
)
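On SQL Server 2017+ the same signature idea can be written with the built-in STRING_AGG instead of GROUP_CONCAT (a sketch, assuming the same MyTable):

-- SQL Server 2017+: ordered concatenation via STRING_AGG ... WITHIN GROUP
with signatures as (
    select id,
           string_agg(cast(referenceid as varchar(20)), ',')
               within group (order by referenceid) as signature
    from MyTable
    group by id
),
keepers as (
    select min(id) as id
    from signatures
    group by signature
)
select t.id, t.referenceid
from MyTable t
where t.id in (select id from keepers);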

-- generate count of elements for each distinct id
with Counts as (
    select
        id,
        count(1) as ReferenceCount
    from
        tblReferences R
    group by
        R.id
)
-- generate every pairing of two different id's, along with
-- their counts, and how many are equivalent between the two
, Pairings as (
    select
        R1.id as id1
        , R2.id as id2
        , C1.ReferenceCount as count1
        , C2.ReferenceCount as count2
        , sum(case when R1.referenceid = R2.referenceid then 1 else 0 end) as samecount
    from
        tblReferences R1 join Counts C1 on R1.id = C1.id
        cross join
        tblReferences R2 join Counts C2 on R2.id = C2.id
    where
        R1.id < R2.id
    group by
        R1.id, C1.ReferenceCount, R2.id, C2.ReferenceCount
)
-- generate the list of ids that are safe to remove by picking
-- out any id's that have the same number of matches, and same
-- size of list, which means their reference lists are identical.
-- since id2 > id1, we can safely remove id2 as a copy of id1, and
-- the smallest id of which all id2 > id1 are copies will be left
, RemovableIds as (
    select distinct
        id2 as id
    from
        Pairings P
    where
        P.count1 = P.count2 and P.count1 = P.samecount
)
-- validate the results by just selecting to see which id's
-- will be removed. can also include id1 in the query above
-- to see which id was identified as the copy
select id from RemovableIds
-- comment out `select` above and uncomment `delete` below to
-- remove the records after verifying they are correct!
--delete from tblReferences where id in (select id from RemovableIds)
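To get the consolidated result the question asks for without deleting anything, the final select in the statement above could instead exclude the removable ids:

-- replaces the `select id from RemovableIds` line above (same CTEs)
select R.id, R.referenceid
from tblReferences R
where R.id not in (select id from RemovableIds)
order by R.id, R.referenceid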

Related

SQL like: How to calculate intersection and union of <item,user> data

Need help in SQL:
I have a data with the following columns:
ItemId
UserId
Each row indicates that some item was bought by some user.
Example:
ItemId UserId
200 user1
200 user3
200 user4
300 user5
300 user3
For each pair of items (i, j) I would like to calculate the following output table:
users(i) : number of users bought i
users(j) : number of users bought j
users(i, j) : number of users bought both i and j
users(i, ~j) : number of users bought i but not j
users(~i, j) : number of users bought j but not i
Example of output (from the example above):
i_itemId j_itemId users(i) users(j) users(i,j) users(i,~j) users(~i, j)
200 200 3 3 3 0 0
200 300 3 2 1 2 1
300 300 2 2 2 0 0
300 200 2 3 1 1 2
Note:
The data table is huge (11 GB) and located in the cloud. I have an SQL framework to work with, so I cannot download the file and run Python (for example).
So the solution has to be written in SQL in an efficient way.
The solution does not have to be one single SQL statement.
I am looking for an efficient solution
We can assume that (ItemId, UserId) is a key.
If someone has a better alternative for the question header, I will be glad to update it :)
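For reference, a minimal setup for the sample data (the answers below refer to the table as t or YourTable; the Oracle answer further down includes its own setup):

-- hypothetical setup matching the sample rows above
create table t (ItemId int, UserId varchar(10));

insert into t (ItemId, UserId) values
(200, 'user1'), (200, 'user3'), (200, 'user4'),
(300, 'user5'), (300, 'user3');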
I am not sure there is an "easy" way to accomplish this. One method is rather brute force: use a cross join to generate all the item pairs, then use subqueries for each of the individual counts:
select i1.itemid as i_itemid, i2.itemid as j_itemid, i1.num as cnt1, i2.num as cnt2,
       -- users who bought both i1 and i2
       (select count(*)
        from t u1 join
             t u2
             on u1.userid = u2.userid
        where u1.itemid = i1.itemid and u2.itemid = i2.itemid
       ) as cnt_1_2,
       -- users who bought i1 but not i2
       (select count(*)
        from t u1 left join
             t u2
             on u1.userid = u2.userid and u2.itemid = i2.itemid
        where u1.itemid = i1.itemid and u2.itemid is null
       ) as cnt_1_not2,
       -- users who bought i2 but not i1 (note the left join is reversed here)
       (select count(*)
        from t u2 left join
             t u1
             on u2.userid = u1.userid and u1.itemid = i1.itemid
        where u2.itemid = i2.itemid and u1.itemid is null
       ) as cnt_not1_2
from (select itemid, count(*) as num from t group by itemid) i1 cross join
     (select itemid, count(*) as num from t group by itemid) i2;
Here's a recipe
1) Create a temporary table to gather the I and J totals.
Disclaimer: this example uses the MS SQL Server datatype INT, so change it to a numeric type that your RDBMS supports. By the way, in MS SQL Server, temporary table names start with #.
create table TempTotals (iItemId int, jItemId int, TotalUsers int);
2) Fill it up with totals
delete from TempTotals;

insert into TempTotals (iItemId, jItemId, TotalUsers)
select
    t1.ItemId as iItemId,
    t2.ItemId as jItemId,
    count(distinct t1.UserId) as TotalUsers
from YourTable t1
full join YourTable t2 on (t1.UserId = t2.UserId)
group by t1.ItemId, t2.ItemId;
3) Self-join the temporary table to get all the totals
select
    ij.iItemId,
    ij.jItemId,
    i.TotalUsers as Users_I,
    j.TotalUsers as Users_J,
    ij.TotalUsers as Users_I_and_J,
    (i.TotalUsers - ij.TotalUsers) as Users_I_no_J,
    (j.TotalUsers - ij.TotalUsers) as Users_J_no_I
from TempTotals ij
left join TempTotals i on (i.iItemId = ij.iItemId and i.iItemId = i.jItemId)
left join TempTotals j on (j.jItemId = ij.jItemId and j.iItemId = j.jItemId)
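With the sample rows above, TempTotals would contain (200,200,3), (200,300,1), (300,200,1) and (300,300,2), so step 3 should produce:

iItemId  jItemId  Users_I  Users_J  Users_I_and_J  Users_I_no_J  Users_J_no_I
200      200      3        3        3              0             0
200      300      3        2        1              2             1
300      200      2        3        1              1             2
300      300      2        2        2              0             0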
If you're using Oracle Database, you can compare nested tables (collections) with the multiset operators. And get the number of elements in a collection with cardinality.
So what you can do is:
Group by itemid, collecting all the users into a nested table
Cross join the output of this with itself
Use the multiset intersect/except operators to get number of elements in the sets as needed
Which looks a little like:
create table t (
ItemId int, UserId varchar2(10)
);
insert into t values ( 200 , 'user1');
insert into t values ( 200 , 'user3');
insert into t values ( 200 , 'user4');
insert into t values ( 300 , 'user5');
insert into t values ( 300 , 'user3');
commit;
create or replace type users_t as table of varchar2(10);
/
with grps as (
    select itemid, cast ( collect ( userid ) as users_t ) users
    from t
    group by itemid
)
select g1.itemid i, g2.itemid j,
       cardinality ( g1.users ) num_i,
       cardinality ( g2.users ) num_j,
       cardinality ( g1.users multiset intersect g2.users ) i_and_j,
       cardinality ( g1.users multiset except g2.users ) i_not_j,
       cardinality ( g2.users multiset except g1.users ) j_not_i
from grps g1
cross join grps g2;
I J NUM_I NUM_J I_AND_J I_NOT_J J_NOT_I
200 200 3 3 3 0 0
200 300 3 2 1 2 1
300 200 2 3 1 1 2
300 300 2 2 2 0 0
If necessary, you may be able to get more performance by skipping the except operators when i = j, e.g.:
case
    when g1.itemid = g2.itemid then 0
    else cardinality ( g1.users multiset except g2.users )
end

Group parents with same children

EDIT: This is way harder to explain than I thought; I am constantly editing based on comments. Thank you all for taking an interest.
I have a table like this
ID Type ParentID
1 ChildTypeA 1
2 ChildTypeB 1
3 ChildTypeC 1
4 ChildTypeD 1
5 ChildTypeA 2
6 ChildTypeB 2
7 ChildTypeC 2
8 ChildTypeA 3
9 ChildTypeB 3
10 ChildTypeC 3
11 ChildTypeD 3
12 ChildTypeA 4
13 ChildTypeB 4
14 ChildTypeC 4
and I want to group parents that have the same children - meaning the same number of children of the same type.
From the parent's point of view, there is a finite set of possible configurations (max 10).
If any parents have the same set of children (by ChildType), I want to group them together (in what I call a configuration).
ChildTypeA-D = ConfigA
ChildTypeA-C = ConfigB
ChildTypeA, B, E, F = ConfigX
etc.
The output I need is parents grouped by Configurations.
Config Group ParentID
ConfigA 1
ConfigA 3
ConfigB 2
ConfigB 4
I have no idea where to even begin.
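For reference, a minimal setup for the sample rows (the first answer below calls the table t, the later one Table1; adjust the name as needed):

-- hypothetical setup matching the sample data above
create table t (ID int, [Type] varchar(20), ParentID int);

insert into t (ID, [Type], ParentID) values
(1, 'ChildTypeA', 1), (2, 'ChildTypeB', 1), (3, 'ChildTypeC', 1), (4, 'ChildTypeD', 1),
(5, 'ChildTypeA', 2), (6, 'ChildTypeB', 2), (7, 'ChildTypeC', 2),
(8, 'ChildTypeA', 3), (9, 'ChildTypeB', 3), (10, 'ChildTypeC', 3), (11, 'ChildTypeD', 3),
(12, 'ChildTypeA', 4), (13, 'ChildTypeB', 4), (14, 'ChildTypeC', 4);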
I named your table t. Please check whether this is what you are looking for.
It shows both matched and unmatched parents.
It looks for parentids with the same number of rows (t1.cnt = t2.cnt) where all the rows are matched (having COUNT(*) = t1.cnt).
;with t1 as (
    select parentid, type, id,
           count(*) over (partition by parentid) cnt
    from t
),
t3 as (
    select t1.parentid parentid1, t2.parentid parentid2, count(*) cn,
           t1.cnt cnt1, t2.cnt cnt2,
           ROW_NUMBER() over (order by t1.parentid) rn
    from t1
    join t1 as t2 on t1.type = t2.type and t1.parentid <> t2.parentid and t1.cnt = t2.cnt
    group by t1.parentid, t2.parentid, t1.cnt, t2.cnt
    having COUNT(*) = t1.cnt
),
notFound as (
    select t1.parentid, ROW_NUMBER() over (order by t1.parentid) rn
    from t1
    where not exists (select 1 from t3 where t1.parentid = t3.parentid1)
    group by t1.parentid
)
select 'Config' + char((select min(rn) + 64 from t3 as t4
                        where t3.parentid1 in (t4.parentid1, t4.parentid2))) config,
       t3.parentid1
from t3
union all
select 'Config' + char((select max(rn) + 64 + notFound.rn from t3)) config,
       notFound.parentid
from notFound
OUTPUT
config parentid1
ConfigA 1
ConfigA 3
ConfigB 2
ConfigB 4
If id 14 was ChildTypeZ then parentid 2 and 4 wouldn't match. This would be the output:
config parentid1
ConfigA 1
ConfigA 3
ConfigC 2
ConfigD 4
I happen to have a similar task. The data I'm working with is at a bit bigger scale, so I had to find an effective approach to this. Basically I've found 2 working approaches.
One is pure SQL - here's the core query. Basically it gives you the smallest ParentID with the same collection of children, which you can then use as a group id (you can also enumerate it with row_number). As a small note - I'm using a CTE here, but in the real world I'd suggest putting the grouped parents into a temporary table and adding indexes on that table as well.
;with cte_parents as (
    -- You can also use different statistics to narrow the search
    select
        [ParentID],
        count(*) as cnt,
        min([Type]) as min_Type,
        max([Type]) as max_Type
    from Table1
    group by
        [ParentID]
)
select
    h1.ParentID,
    k.ParentID as GroupID
from cte_parents as h1
outer apply (
    select top 1
        h2.[ParentID]
    from cte_parents as h2
    where
        h2.cnt = h1.cnt and
        h2.min_Type = h1.min_Type and
        h2.max_Type = h1.max_Type and
        not exists (
            select *
            from (select tt.[Type] from Table1 as tt where tt.[ParentID] = h2.[ParentID]) as tt1
            full join (select tt.[Type] from Table1 as tt where tt.[ParentID] = h1.[ParentID]) as tt2 on
                tt2.[Type] = tt1.[Type]
            where
                tt1.[Type] is null or tt2.[Type] is null
        )
    order by
        h2.[ParentID]
) as k
ParentID GroupID
----------- --------------
1 1
2 2
3 1
4 2
The other approach is a bit trickier and you have to be careful when using it, but surprisingly it works reasonably well. The idea is to concatenate the children into one big string and then group by these strings. You can use any available concatenation method (the XML trick, a CLR function, or STRING_AGG if you have SQL Server 2017). The important part is that you have to use ordered concatenation so every string represents your group precisely. I have created a special CLR function (dbo.f_ConcatAsc) for this.
;with cte1 as (
    select
        ParentID,
        dbo.f_ConcatAsc([Type], ',') as group_data
    from Table1
    group by
        ParentID
), cte2 as (
    select
        dbo.f_ConcatAsc(ParentID, ',') as parent_data,
        group_data,
        row_number() over(order by group_data) as rn
    from cte1
    group by
        group_data
)
select
    cast(p.value as int) as ParentID,
    c.rn as GroupID,
    c.group_data
from cte2 as c
cross apply string_split(c.parent_data, ',') as p
ParentID GroupID group_data
----------- -------------------- --------------------------------------------------
2 1 ChildTypeA,ChildTypeB,ChildTypeC
4 1 ChildTypeA,ChildTypeB,ChildTypeC
1 2 ChildTypeA,ChildTypeB,ChildTypeC,ChildTypeD
3 2 ChildTypeA,ChildTypeB,ChildTypeC,ChildTypeD
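On SQL Server 2017+ the ordered concatenation can be done without a CLR function; a sketch using the built-in STRING_AGG ... WITHIN GROUP, assuming the same Table1(ID, Type, ParentID):

-- group parents by an ordered signature of their child types
with signatures as (
    select ParentID,
           string_agg([Type], ',') within group (order by [Type]) as group_data
    from Table1
    group by ParentID
)
select ParentID,
       dense_rank() over (order by group_data) as GroupID,
       group_data
from signatures
order by GroupID, ParentID;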

Aggregate data from multiple rows into single row

In my table each row has some data columns and a Priority column (for example, a timestamp or just an integer). I want to group my data by ID and then, in each group, take the latest non-null value of each column. For example, I have the following table:
id A B C Priority
1 NULL 3 4 1
1 5 6 NULL 2
1 8 NULL NULL 3
2 634 346 359 1
2 34 NULL 734 2
Desired result is :
id A B C
1 8 6 4
2 34 346 734
In this example the table is small and has only 5 columns, but the real table will be much larger. I really want this script to work fast. I tried to do it myself, but my script only works on SQL Server 2012+, so I deleted it as not applicable.
Numbers: the table could have 150k rows, 20 columns, 20-80k unique ids, and the average SELECT COUNT(id) FROM T GROUP BY ID is 2..5.
Now I have working code (thanks to @ypercubeᵀᴹ), but it runs very slowly on big tables; in my case the script can take a minute or even more (with indexes and so on).
How can it be sped up?
SELECT
d.id,
d1.A,
d2.B,
d3.C
FROM
( SELECT id
FROM T
GROUP BY id
) AS d
OUTER APPLY
( SELECT TOP (1) A
FROM T
WHERE id = d.id
AND A IS NOT NULL
ORDER BY priority DESC
) AS d1
OUTER APPLY
( SELECT TOP (1) B
FROM T
WHERE id = d.id
AND B IS NOT NULL
ORDER BY priority DESC
) AS d2
OUTER APPLY
( SELECT TOP (1) C
FROM T
WHERE id = d.id
AND C IS NOT NULL
ORDER BY priority DESC
) AS d3 ;
In my test database with a realistic amount of data I get the following execution plan (screenshot not included).
This should do the trick: everything raised to the power 0 returns 1, except NULL:
DECLARE @t table(id int, A int, B int, C int, Priority int)
INSERT @t
VALUES (1, NULL, 3, 4, 1),
       (1, 5, 6, NULL, 2), (1, 8, NULL, NULL, 3),
       (2, 634, 346, 359, 1), (2, 34, NULL, 734, 2)
;WITH CTE as
(
SELECT id,
CASE WHEN row_number() over
(partition by id order by Priority*power(A,0) desc) = 1 THEN A END A,
CASE WHEN row_number() over
(partition by id order by Priority*power(B,0) desc) = 1 THEN B END B,
CASE WHEN row_number() over
(partition by id order by Priority*power(C,0) desc) = 1 THEN C END C
FROM @t
)
SELECT id, max(a) a, max(b) b, max(c) c
FROM CTE
GROUP BY id
Result:
id a b c
1 8 6 4
2 34 346 734
One alternative that might be faster is a multiple join approach. Get the priority for each column and then join back to the original table. For the first part:
select id,
max(case when a is not null then priority end) as pa,
max(case when b is not null then priority end) as pb,
max(case when c is not null then priority end) as pc
from t
group by id;
Then join back to this table:
with pabc as (
select id,
max(case when a is not null then priority end) as pa,
max(case when b is not null then priority end) as pb,
max(case when c is not null then priority end) as pc
from t
group by id
)
select pabc.id, ta.a, tb.b, tc.c
from pabc left join
t ta
on pabc.id = ta.id and pabc.pa = ta.priority left join
t tb
on pabc.id = tb.id and pabc.pb = tb.priority left join
t tc
on pabc.id = tc.id and pabc.pc = tc.priority ;
This can also take advantage of an index on t(id, priority).
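For example (the index name is hypothetical; INCLUDE is SQL Server syntax, so adjust for other engines):

-- supporting index for lookups by (id, priority); including a, b, c
-- lets the joins be answered from the index alone
create index ix_t_id_priority on t (id, priority) include (a, b, c);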
The previous code will work with the following syntax:
with pabc as (
select id,
max(case when a is not null then priority end) as pa,
max(case when b is not null then priority end) as pb,
max(case when c is not null then priority end) as pc
from t
group by id
)
select pabc.Id,ta.a, tb.b, tc.c
from pabc
left join t ta on pabc.id = ta.id and pabc.pa = ta.priority
left join t tb on pabc.id = tb.id and pabc.pb = tb.priority
left join t tc on pabc.id = tc.id and pabc.pc = tc.priority ;
This looks rather strange. You have a log table for all column changes, but no associated table with current data. Now you are looking for a query to collect your current values from the log table, which is a laborious task naturally.
The solution is simple: have an additional table with the current data. You can even link the tables with a trigger (so either every time a record gets inserted into your log table you update the current table, or every time a change is written to the current table you write a log entry).
Then just query your current table:
select id, a, b, c from currenttable order by id;
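A minimal sketch of the first variant (assuming SQL Server, the log table T from the question, and a hypothetical currenttable(id, a, b, c)):

-- hypothetical trigger: keep currenttable in sync as rows are inserted into T;
-- simplified - it does not re-order by Priority within a single multi-row insert
create trigger trg_T_insert on T
after insert
as
begin
    set nocount on;

    -- update ids that already exist, carrying over new non-null values
    update cur
    set a = coalesce(i.a, cur.a),
        b = coalesce(i.b, cur.b),
        c = coalesce(i.c, cur.c)
    from currenttable cur
    join inserted i on i.id = cur.id;

    -- insert ids that are not present yet
    insert into currenttable (id, a, b, c)
    select i.id, i.a, i.b, i.c
    from inserted i
    where not exists (select 1 from currenttable cur where cur.id = i.id);
end;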

Oracle get rows that exactly matches the list of values

I have a details table and I want to get the record that exactly matches the list of values in another table.
Here is a scenario:
OrderDetailTable
OrderID ItemID
1 1
1 2
1 3
1 4
2 1
2 2
2 4
3 1
3 2
3 3
4 1
4 2
OrderedTable
ItemID
1
2
Now I want to get the OrderID whose ItemIDs exactly match the OrderedTable ItemIDs. In the above scenario OrderID 1 is valid since ItemID 1,2,3 is exactly matched with the OrderedTable ItemIDs.
I used a join but it did not work; it gave me both OrderID 1 and 2. How do I do it? Any ideas?
Try this:
SELECT OrderID
FROM OrderDetailTable JOIN OrderedTable USING (ItemID)
GROUP BY OrderID
HAVING COUNT(DISTINCT ItemID) = (SELECT COUNT(DISTINCT ItemID) FROM OrderedTable)
The idea, in a nutshell, is as follows:
Count how many OrderDetailTable rows match OrderedTable by ItemID,
and then compare that to the total number of ItemIDs from OrderedTable.
If these two numbers are equal, the given OrderID "contains" all ItemIDs. If one is smaller than the other, there is at least one ItemID not contained in the given OrderID.
Depending on your primary keys, the DISTINCT may not be necessary (though it doesn't hurt).
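Note that this returns every OrderID that contains all of the OrderedTable items; if orders with extra items should be excluded, the order's own item count can be compared as well (a sketch against the same tables):

-- strict match: the order has exactly the OrderedTable items, nothing more
SELECT d.OrderID
FROM OrderDetailTable d
LEFT JOIN OrderedTable o ON o.ItemID = d.ItemID
GROUP BY d.OrderID
HAVING COUNT(DISTINCT o.ItemID) = (SELECT COUNT(DISTINCT ItemID) FROM OrderedTable)
   AND COUNT(DISTINCT d.ItemID) = (SELECT COUNT(DISTINCT ItemID) FROM OrderedTable)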
try
SELECT * FROM OrderDetailTable WHERE OrderID NOT IN
(
SELECT A.OrderID FROM
(
SELECT
Y.OrderID
, OT.ItemID
, (SELECT Z.ItemID
FROM OrderDetailTable Z
WHERE Z.ItemID = OT.ItemID AND Z.OrderID = Y.OrderID
) I
FROM OrderDetailTable Y, OrderedTable OT
) A
WHERE A.I IS NULL);
EDIT - as per request the better syntax:
SELECT * FROM
OrderDetailTable Z WHERE Z.ORDERID NOT IN
(
SELECT O1 FROM
(SELECT Y.ORDERID O1, YY.ORDERID O2 FROM
OrderDetailTable Y CROSS JOIN OrderedTable OT
LEFT OUTER JOIN OrderDetailTable YY ON
YY.ORDERID = Y.ORDERID AND YY.ITEMID = OT.ITEMID ) ZZ WHERE ZZ.O2 IS NULL);

SQL Group By Question

I have a table that has the columns below.
I need to find those people that have more than 2 ApplicantRowid rows with the same JobCategoryRowid, where AssessmentTestRowid has at least one NULL row, and with different AppstatusRowids.
The result should look exactly like the table below.
Rowid ApplicantRowid JobCategoryRowid AssessmentTestRowid AppstatusRowid
10770598 6952346 157 3 5
11619676 6952346 157 NULL 6
select t.*
from
(
    select ApplicantRowid, JobCategoryRowid
    from tbl
    group by ApplicantRowid, JobCategoryRowid
    having count(AssessmentTestRowid) < count(*)
       and count(distinct AppstatusRowid) > 1
) x
inner join tbl t on t.ApplicantRowid = x.ApplicantRowid
                and t.JobCategoryRowid = x.JobCategoryRowid
COUNT of a column does not include NULLs, so count(AssessmentTestRowid) < count(*) ensures there is at least one NULL.
count(distinct AppstatusRowid) > 1 ensures there are different AppstatusRowids.
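If the "more than 2 rows" part of the requirement also needs to be enforced (the query above does not check it), a count(*) predicate can be added to the HAVING clause; a sketch, with the same table assumption as above:

-- only keep applicant/category groups with more than 2 rows
select t.*
from
(
    select ApplicantRowid, JobCategoryRowid
    from tbl
    group by ApplicantRowid, JobCategoryRowid
    having count(*) > 2
       and count(AssessmentTestRowid) < count(*)
       and count(distinct AppstatusRowid) > 1
) x
inner join tbl t on t.ApplicantRowid = x.ApplicantRowid
                and t.JobCategoryRowid = x.JobCategoryRowid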