SQL like: How to calculate intersection and union of <item,user> data - sql

Need help in SQL:
I have a data set with the following columns:
ItemId
UserId
Each row indicates that some item was bought by some user.
Example:
ItemId UserId
200 user1
200 user3
200 user4
300 user5
300 user3
For each pair of items (i, j), I would like to calculate the following output table:
users(i) : number of users who bought i
users(j) : number of users who bought j
users(i, j) : number of users who bought both i and j
users(i, ~j) : number of users who bought i but not j
users(~i, j) : number of users who bought j but not i
Example of output (from the example above):
i_itemId j_itemId users(i) users(j) users(i,j) users(i,~j) users(~i, j)
200 200 3 3 3 0 0
200 300 3 2 1 2 1
300 300 2 2 2 0 0
300 200 2 3 1 1 2
Note:
The data table is huge (11 GB) and is located in the cloud. I only have a SQL framework to work with, so I cannot download the file and run Python (for example).
So the solution has to be written in SQL, in an efficient way.
The solution does not have to be one single SQL statement.
I am looking for an efficient solution.
We can assume that (ItemId, UserId) is a key.
If someone has a better alternative for the question title, I will be glad to update it :)

I am not sure if there is an "easy" way to accomplish this. One method is rather brute force: use a cross join to generate all the rows. Then use subqueries for each of the individual counts:
select i1.itemid, i2.itemid, i1.num as cnt1, i2.num as cnt2,
(select count(*)
from t u1 join
t u2
on u1.userid = u2.userid
where u1.itemid = i1.itemid and u2.itemid = i2.itemid
) as cnt_1_2,
(select count(*)
from t u1 left join
t u2
on u1.userid = u2.userid and u2.itemid = i2.itemid
where u1.itemid = i1.itemid and u2.itemid is null
) as cnt_1_not2,
(select count(*)
from t u2 left join
t u1
on u1.userid = u2.userid and u1.itemid = i1.itemid
where u2.itemid = i2.itemid and u1.itemid is null
) as cnt_not1_2
from (select itemid, count(*) as num from t group by itemid) i1 cross join
(select itemid, count(*) as num from t group by itemid) i2;

Here's a recipe
1) Create a temporary table to gather the I and J totals.
Disclaimer:
This example uses an MS SQL Server data type, INT, so change it to the numeric type that your RDBMS supports.
Btw, in MS SQL Server, temporary table names start with #.
create table TempTotals (iItemId int, jItemId int, TotalUsers int);
2) Fill it up with totals
delete from TempTotals;
insert into TempTotals (iItemId, jItemId, TotalUsers)
select
t1.ItemId as iItemId,
t2.ItemId as jItemId,
count(distinct t1.UserId) as TotalUsers
from YourTable t1
full join YourTable t2 on (t1.UserId = t2.UserId)
group by t1.ItemId, t2.ItemId;
3) Self-join the temporary table to get all the totals
select
ij.iItemId,
ij.jItemId,
i.TotalUsers as Users_I,
j.TotalUsers as Users_J,
ij.TotalUsers as Users_I_and_J,
(i.TotalUsers - ij.TotalUsers) as Users_I_no_J,
(j.TotalUsers - ij.TotalUsers) as Users_J_no_I
from TempTotals ij
left join TempTotals i on (i.iItemId = ij.iItemId and i.iItemId = i.jItemId)
left join TempTotals j on (j.jItemId = ij.jItemId and j.iItemId = j.jItemId)
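To try the recipe on the sample data from the question, here is a minimal sketch that runs the same idea in an in-memory SQLite database through Python's sqlite3 module. The table name purchases and the CTE name pair_totals are my own placeholders (the CTE stands in for TempTotals), and a plain inner join replaces the full join, which should be equivalent as long as UserId is never NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table purchases (ItemId int, UserId text);
insert into purchases values
  (200, 'user1'), (200, 'user3'), (200, 'user4'),
  (300, 'user5'), (300, 'user3');
""")

# pair_totals plays the role of TempTotals: one row per (i, j) pair that
# shares at least one user. The diagonal rows (i = j) hold users(i).
rows = conn.execute("""
with pair_totals as (
  select t1.ItemId as i, t2.ItemId as j,
         count(distinct t1.UserId) as n
  from purchases t1
  join purchases t2 on t1.UserId = t2.UserId
  group by t1.ItemId, t2.ItemId
)
select ij.i, ij.j,
       di.n,          -- users(i)
       dj.n,          -- users(j)
       ij.n,          -- users(i, j)
       di.n - ij.n,   -- users(i, ~j)
       dj.n - ij.n    -- users(~i, j)
from pair_totals ij
join pair_totals di on di.i = ij.i and di.j = ij.i
join pair_totals dj on dj.i = ij.j and dj.j = ij.j
order by ij.i, ij.j
""").fetchall()

for row in rows:
    print(row)
```

Note that pairs of items with no user in common never appear in pair_totals, so they are missing from the output rather than reported with a zero intersection.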

If you're using Oracle Database, you can compare nested tables (collections) with the multiset operators and get the number of elements in a collection with cardinality.
So what you can do is:
Group by itemid, collecting all the users into a nested table
Cross join the output of this with itself
Use the multiset intersect/except operators to get number of elements in the sets as needed
Which looks a little like:
create table t (
ItemId int, UserId varchar2(10)
);
insert into t values ( 200 , 'user1');
insert into t values ( 200 , 'user3');
insert into t values ( 200 , 'user4');
insert into t values ( 300 , 'user5');
insert into t values ( 300 , 'user3');
commit;
create or replace type users_t as table of varchar2(10);
/
with grps as (
select itemid, cast ( collect ( userid ) as users_t ) users
from t
group by itemid
)
select g1.itemid i, g2.itemid j,
cardinality ( g1.users ) num_i,
cardinality ( g2.users ) num_j,
cardinality ( g1.users multiset intersect g2.users ) i_and_j,
cardinality ( g1.users multiset except g2.users ) i_not_j,
cardinality ( g2.users multiset except g1.users ) j_not_i
from grps g1
cross join grps g2;
I J NUM_I NUM_J I_AND_J I_NOT_J J_NOT_I
200 200 3 3 3 0 0
200 300 3 2 1 2 1
300 200 2 3 1 1 2
300 300 2 2 2 0 0
If necessary, you may be able to get more performance by skipping the except operators when i = j, e.g.:
case
when g1.itemid = g2.itemid then 0
else cardinality ( g1.users multiset except g2.users )
end
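If the multiset operators are unfamiliar, the same arithmetic can be sketched with ordinary Python sets. This is only an illustration of what the query computes, not something you would run on 11 GB; it treats each collection as a set, which matches the multiset result only under the question's assumption that (ItemId, UserId) is unique. The helper name pair_counts is hypothetical.

```python
from collections import defaultdict
from itertools import product

purchases = [(200, "user1"), (200, "user3"), (200, "user4"),
             (300, "user5"), (300, "user3")]

# Equivalent of "group by itemid, collect(userid)".
users_by_item = defaultdict(set)
for item, user in purchases:
    users_by_item[item].add(user)

def pair_counts(i, j):
    """Return (i, j, num_i, num_j, i_and_j, i_not_j, j_not_i)."""
    a, b = users_by_item[i], users_by_item[j]
    return (i, j, len(a), len(b), len(a & b), len(a - b), len(b - a))

# Equivalent of the cross join of grps with itself.
table = [pair_counts(i, j)
         for i, j in product(sorted(users_by_item), repeat=2)]

for row in table:
    print(row)
```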

Related

In SQL, how can i segment users by number of items they have? (redshift)

I'm not a SQL expert so apologies if this is actually really simple.
I have a table that lists users and the different questionnaires they have taken. Users can take questionnaires in any order and take as many as they like. There are a total of 7 available and I want to get a view of how many have taken 1 out of 7, 2 of 7, 3 of 7 etc etc
So a really rough example is the table might look like this:
And I want a query that will show me:
count Users with 1 Q: 1
count Users with 2 Q: 2
count Users with 3 Q: 0
count Users with 4 Q: 0
count Users with 5 Q: 1
count Users with 6 Q: 0
count Users with 7 Q: 0
You can do this with two levels of aggregation:
select cnt_questionnaires, count(*) cnt_users
from (
select count(*) cnt_questionnaires from mytable group by userID
) t
group by cnt_questionnaires
IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t ;
create table #t (userid INT, q nvarchar(32));
insert into #t
values
(1,'Q1'),
(1,'Q3'),
(2,'Q2'),
(3,'Q1'),
(3,'Q2'),
(3,'Q3'),
(3,'Q4'),
(3,'Q5'),
(4,'Q2'),
(4,'Q3')
-- select * from #t
SELECT
v.qCount,
Count(c.userid) uCount
FROM
(VALUES (1),(2),(3),(4),(5),(6),(7)) v(qCount)
LEFT JOIN (
select
userid, count(q) qCount
from
#t
group by userid
) c ON c.qCount = v.qCount
GROUP BY
v.qCount
Assuming you have user_id on each row, the challenge is getting the zero values. Redshift is not very flexible when it comes to creating tables. Assuming your source data has enough rows, you can use:
select n.n, count(u.user_id) as num_users
from (select row_number() over () as n
from t
limit 7
) n left join
(select user_id, count(*) as cnt
from t
group by user_id
) u
on n.n = u.cnt
group by n.n
order by n.n;
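For anyone who wants to check the zero-filling logic locally, here is a small sketch using SQLite through Python's sqlite3 module, with the sample rows from the temp-table answer above. The table name answers is a placeholder, and the recursive CTE that generates 1..7 is my own substitution for the number list; I have not checked it against Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table answers (userid int, q text);
insert into answers values
  (1,'Q1'),(1,'Q3'),(2,'Q2'),
  (3,'Q1'),(3,'Q2'),(3,'Q3'),(3,'Q4'),(3,'Q5'),
  (4,'Q2'),(4,'Q3');
""")

histogram = conn.execute("""
with recursive n(k) as (          -- the numbers 1..7
  select 1 union all select k + 1 from n where k < 7
),
per_user as (                     -- first level: questionnaires per user
  select userid, count(*) as cnt from answers group by userid
)
select n.k, count(per_user.userid)   -- second level: users per count
from n left join per_user on per_user.cnt = n.k
group by n.k
order by n.k
""").fetchall()

for k, users in histogram:
    print(f"count Users with {k} Q: {users}")
```

Counting per_user.userid rather than using count(*) is what makes the unmatched numbers come out as 0, since the left join leaves that column NULL for them.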

How to update every rows which has a number greater than or equal to the joining table?

Here is an example of the tables I am joining together (note: the tables have the exact same schema but are in different databases, I am trying to combine them):
Database 1 Table
UniqID UniqID2 Number
100 150 1
100 151 2
Database 2 Table
UniqID UniqID2 Number
100 152 2
100 153 3
I am trying to merge Table2 into Table1 and I'm joining on Table1.UniqID = Table2.UniqID. I don't want any overlapping values in the Number column, this is what I want the result to look like:
Table 1
UniqID UniqID2 Number
100 150 1
100 151 2
100 152 3
100 153 4
This is the query I have so far, but it only updates the row in Table 2 where the Number = 2 and doesn't increment the Number = 3 row. How can I adjust my query to do so?
UPDATE db2
Set db2.Number = db2.Number +
(SELECT MAX(Number) FROM [Database 1]..db1 WHERE UniqID = db2.UniqID)
FROM [Database 2]..table db2
INNER JOIN [Database 1]..Table db1
ON db1.UniqID = db2.UniqID
AND db1.Number = db2.Number
And this is what my Database 2 Table results look like right now:
Database 2 Table
UniqID UniqID2 Number
100 152 3
100 153 3
Basically, the only difference is that I want the Number = 3 to be Number = 4 in the second column.
I think you want an insert ... select that assigns new numbers, continuing from the current maximum:
insert into table1(UniqID, UniqID2, Number)
select t2.UniqID, t2.UniqID2,
(x.maxn + row_number() over (order by (select null) ))
from table2 t2 cross join
(select max(number) as maxn from table1) x;
A different approach could be:
UPDATE t2
SET t2.Number = t1.T1Number + 1
FROM table2 t2
INNER JOIN (SELECT uniqid, uniqid2, number as T1Number from Table1
union
SELECT uniqid, uniqid2, number as T1Number from Table2
) t1
ON t1.uniqid = t2.uniqid and t1.UniqID2 = t2.UniqID2-1
One more approach, which works in SQL Server 2012 and later:
;With cte
as
(select *
from
#t
union all
select *
from #t1
)
select uniqid,uniqid2,
case when lag(number) over (order by uniqid,uniqid2) is null then number
when lead(number) over (order by uniqid,uniqid2) is null
then number+1 else lead(number) over (order by uniqid,uniqid2) end as nextnumber
from cte
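As a way to check the expected result from the question, here is a sketch of the "append with new numbers" idea run in SQLite via Python's sqlite3 module. The lowercase table names table1 and table2 stand for the two database tables. Restarting the numbering per UniqID (the partition by and the per-UniqID max) is my assumption about what the asker wants; the insert answer above numbers across the whole table instead. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table1 (UniqID int, UniqID2 int, Number int);
create table table2 (UniqID int, UniqID2 int, Number int);
insert into table1 values (100, 150, 1), (100, 151, 2);
insert into table2 values (100, 152, 2), (100, 153, 3);
""")

# Append table2's rows to table1, numbering them after table1's current
# maximum for the same UniqID (0 if that UniqID is new to table1).
conn.execute("""
insert into table1 (UniqID, UniqID2, Number)
select t2.UniqID, t2.UniqID2,
       coalesce(m.maxn, 0)
         + row_number() over (partition by t2.UniqID
                              order by t2.Number, t2.UniqID2)
from table2 t2
left join (select UniqID, max(Number) as maxn
           from table1
           group by UniqID) m
  on m.UniqID = t2.UniqID
""")

merged = conn.execute(
    "select UniqID, UniqID2, Number from table1 order by UniqID, Number"
).fetchall()
print(merged)
```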

Consolidate records

I want to consolidate a set of records
(id) / (referencedid)
1 10
1 11
2 11
2 10
3 10
3 11
3 12
The result of query should be
1 10
1 11
3 10
3 11
3 12
So, since id=1 and id=2 have the same set of corresponding referenceids {10,11}, they would be consolidated. But the referenceids for id=3 are not the same, hence it wouldn't be consolidated.
What would be a good way to get this done?
Select id, referenceid
From MyTable
Where Id In (
Select Min( Z.Id ) As Id
From (
-- order inside Group_Concat so the signature does not depend on row order
Select Z1.id, Group_Concat( Z1.referenceid Order By Z1.referenceid ) As signature
From MyTable As Z1
Group By Z1.id
) As Z
Group By Z.Signature
)
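The "signature" idea in the query above can be sketched outside the database too: build an order-independent signature of each id's reference set, then keep only the smallest id per signature. The following Python is just an illustration on the sample rows from the question; the variable names are mine.

```python
from collections import defaultdict

rows = [(1, 10), (1, 11), (2, 11), (2, 10), (3, 10), (3, 11), (3, 12)]

# Collect the set of referenceids for every id.
refs_by_id = defaultdict(set)
for id_, ref in rows:
    refs_by_id[id_].add(ref)

# Map each signature (sorted tuple of referenceids) to the smallest id
# that has it. Sorting the ids first means setdefault keeps the minimum.
keeper = {}
for id_ in sorted(refs_by_id):
    signature = tuple(sorted(refs_by_id[id_]))
    keeper.setdefault(signature, id_)

kept_ids = set(keeper.values())
consolidated = sorted((i, r) for i, r in rows if i in kept_ids)
print(consolidated)
```

Sorting the referenceids before building the signature is the important step: without it, {10, 11} and {11, 10} would look like different sets.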
-- generate count of elements for each distinct id
with Counts as (
select
id,
count(1) as ReferenceCount
from
tblReferences R
group by
R.id
)
-- generate every pairing of two different id's, along with
-- their counts, and how many are equivalent between the two
,Pairings as (
select
R1.id as id1
,R2.id as id2
,C1.ReferenceCount as count1
,C2.ReferenceCount as count2
,sum(case when R1.referenceid = R2.referenceid then 1 else 0 end) as samecount
from
tblReferences R1 join Counts C1 on R1.id = C1.id
cross join
tblReferences R2 join Counts C2 on R2.id = C2.id
where
R1.id < R2.id
group by
R1.id, C1.ReferenceCount, R2.id, C2.ReferenceCount
)
-- generate the list of ids that are safe to remove by picking
-- out any id's that have the same number of matches, and same
-- size of list, which means their reference lists are identical.
-- since id2 > id, we can safely remove id2 as a copy of id, and
-- the smallest id of which all id2 > id are copies will be left
,RemovableIds as (
select
distinct id2 as id
from
Pairings P
where
P.count1 = P.count2 and P.count1 = P.samecount
)
-- validate the results by just selecting to see which id's
-- will be removed. can also include id in the query above
-- to see which id was identified as the copy
select id from RemovableIds R
-- comment out `select` above and uncomment `delete` below to
-- remove the records after verifying they are correct!
--delete from tblReferences where id in (select id from RemovableIds)

SQL Group By Question

I have a table that has the below columns.
I need to find those people that have more than 2 ApplicantRowid rows with the same JobCategoryRowid, where at least one row has a NULL AssessmentTestRowid and the AppstatusRowid values differ.
The result should look exactly like the table below.
Rowid ApplicantRowid JobCategoryRowid AssessmentTestRowid AppstatusRowid
10770598 6952346 157 3 5
11619676 6952346 157 NULL 6
select t.*
from
(
select ApplicantRowid, JobCategoryRowid
from tbl
group by ApplicantRowid, JobCategoryRowid
having count(AssessmentTestRowid) < count(*)
and count(distinct AppstatusRowid) > 1
) x
inner join tbl t on t.ApplicantRowid = x.ApplicantRowid
and t.JobCategoryRowid = x.JobCategoryRowid
COUNT does not include NULLs, so count(AssessmentTestRowid) < count(*) ensures there is at least one NULL.
count(distinct AppstatusRowid) > 1 ensures there are different AppstatusRowids.
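To see the two HAVING conditions at work, here is a small sketch in SQLite via Python's sqlite3 module. The first two rows are the ones from the question; the other four are made-up rows (hypothetical applicants 111 and 222) that each fail one of the conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tbl (Rowid int, ApplicantRowid int, JobCategoryRowid int,
                  AssessmentTestRowid int, AppstatusRowid int);
insert into tbl values
  (10770598, 6952346, 157, 3,    5),
  (11619676, 6952346, 157, NULL, 6),
  (1, 111, 157, 3,    5),   -- hypothetical: no NULL assessment test
  (2, 111, 157, 4,    6),
  (3, 222, 157, NULL, 5),   -- hypothetical: only one distinct status
  (4, 222, 157, 2,    5);
""")

matches = conn.execute("""
select t.Rowid
from (
  select ApplicantRowid, JobCategoryRowid
  from tbl
  group by ApplicantRowid, JobCategoryRowid
  having count(AssessmentTestRowid) < count(*)   -- at least one NULL
     and count(distinct AppstatusRowid) > 1      -- statuses differ
) x
join tbl t on t.ApplicantRowid = x.ApplicantRowid
          and t.JobCategoryRowid = x.JobCategoryRowid
order by t.Rowid
""").fetchall()
print(matches)
```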

How to find "holes" in a table

I recently inherited a database on which one of the tables has the primary key composed of encoded values (Part1*1000 + Part2).
I normalized that column, but I cannot change the old values.
So now I have
select ID from table order by ID
ID
100001
100002
101001
...
I want to find the "holes" in the table (more precisely, the first "hole" after 100000) for new rows.
I'm using the following select, but is there a better way to do that?
select /* top 1 */ ID+1 as newID from table
where ID > 100000 and
ID + 1 not in (select ID from table)
order by ID
newID
100003
101029
...
The database is Microsoft SQL Server 2000. I'm ok with using SQL extensions.
select ID +1 From Table t1
where not exists (select * from Table t2 where t1.id +1 = t2.id);
Not sure if this version would be faster than the one you mentioned originally.
SELECT (t1.ID+1) FROM table AS t1
LEFT JOIN table as t2
ON t1.ID+1 = t2.ID
WHERE t2.ID IS NULL
This solution should give you the first and last ID values of the "holes" you are seeking. I use this in Firebird 1.5 on a table of 500K records, and although it does take a little while, it gives me what I want.
SELECT l.id + 1 start_id, MIN(fr.id) - 1 stop_id
FROM (table l
LEFT JOIN table r
ON l.id = r.id - 1)
LEFT JOIN table fr
ON l.id < fr.id
WHERE r.id IS NULL AND fr.id IS NOT NULL
GROUP BY l.id, r.id
For example, if your data looks like this:
ID
1001
1002
1005
1006
1007
1009
1011
You would receive this:
start_id stop_id
1003 1004
1008 1008
1010 1010
I wish I could take full credit for this solution, but I found it at Xaprb.
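If you want to reproduce the start/stop table above, this is a minimal sketch of the same join pattern in SQLite, driven from Python's sqlite3 module. The table name ids is a placeholder, and I use an inner join for the "next existing id" lookup, which should be equivalent to the left join plus the fr.id IS NOT NULL filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table ids (id integer primary key)")
conn.executemany("insert into ids values (?)",
                 [(i,) for i in (1001, 1002, 1005, 1006, 1007, 1009, 1011)])

gaps = conn.execute("""
select l.id + 1 as start_id,        -- first missing id after l
       min(fr.id) - 1 as stop_id    -- last missing id before the next row
from ids l
left join ids r on r.id = l.id + 1  -- r is NULL when l.id + 1 is missing
join ids fr on fr.id > l.id         -- every existing id after l
where r.id is null
group by l.id
order by start_id
""").fetchall()

for start_id, stop_id in gaps:
    print(start_id, stop_id)
```

The join on fr.id > l.id compares every gap start with all later rows, so on large tables an index on the id column matters a lot.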
From "How do I find a "gap" in running counter with SQL?":
select
MIN(ID)
from (
select
100001 ID
union all
select
[YourIdColumn]+1
from
[YourTable]
where
--Filter the rest of your key--
) foo
left join
[YourTable]
on [YourIdColumn]=ID
and --Filter the rest of your key--
where
[YourIdColumn] is null
The best way is building a temp table with all IDs,
then making a left join.
declare @maxId int
select @maxId = max(YOUR_COLUMN_ID) from YOUR_TABLE_HERE
declare @t table (id int)
declare @i int
set @i = 1
while @i <= @maxId
begin
insert into @t values (@i)
set @i = @i + 1
end
select t.id
from @t t
left join YOUR_TABLE_HERE x on x.YOUR_COLUMN_ID = t.id
where x.YOUR_COLUMN_ID is null
I have thought about this question recently, and this looks like the most elegant way to do it:
SELECT TOP(@MaxNumber) ROW_NUMBER() OVER (ORDER BY t1.number)
FROM master..spt_values t1 CROSS JOIN master..spt_values t2
EXCEPT
SELECT Id FROM <your_table>
This solution doesn't give all the holes in the table, only the first free ID of each gap plus the first free number past the table's maximum. It works if you want to fill gaps in the IDs, and it still gives you a free ID if there is no gap.
select numb + 1 from temp
minus
select numb from temp;
This will give you the complete picture, where 'Bottom' stands for gap start and 'Top' stands for gap end:
select *
from
(
(select <COL>+1 as id, 'Bottom' AS 'Pos' from <TABLENAME> /*where <CONDITION>*/
except
select <COL>, 'Bottom' AS 'Pos' from <TABLENAME> /*where <CONDITION>*/)
union
(select <COL>-1 as id, 'Top' AS 'Pos' from <TABLENAME> /*where <CONDITION>*/
except
select <COL>, 'Top' AS 'Pos' from <TABLENAME> /*where <CONDITION>*/)
) t
order by t.id, t.Pos
Note: the first and last results are WRONG and should be disregarded, but taking them out would make this query a lot more complicated, so this will do for now.
Many of the previous answers are quite good. However, they all fail to return the first value of the sequence and/or fail to consider the lower limit 100000. They all return intermediate holes but not the very first one (100001 if missing).
A full solution to the question is the following one:
select id + 1 as newid from
(select 100000 as id union select id from tbl) t
where (id + 1 not in (select id from tbl)) and
(id >= 100000)
order by id
limit 1;
The number 100000 is to be used if the first number of the sequence is 100001 (as in the original question); otherwise it is to be modified accordingly
"limit 1" is used in order to have just the first available number instead of the full sequence
For people using Oracle, the following can be used:
select a, b from (
select ID + 1 a, max(ID) over (order by ID rows between current row and 1 following) - 1 b from MY_TABLE
) where a <= b order by a desc;
The following SQL code works well with SQLite, and should also work, possibly with minor changes, on MySQL, MS SQL and so on.
On SQLite this takes only 2 seconds on a table with 1 million rows (and about 100 sparse missing rows).
WITH holes AS (
SELECT
IIF(c2.id IS NULL,c1.id+1,null) as start,
IIF(c3.id IS NULL,c1.id-1,null) AS stop,
ROW_NUMBER () OVER (
ORDER BY c1.id ASC
) AS rowNum
FROM |mytable| AS c1
LEFT JOIN |mytable| AS c2 ON c1.id+1 = c2.id
LEFT JOIN |mytable| AS c3 ON c1.id-1 = c3.id
WHERE c2.id IS NULL OR c3.id IS NULL
)
SELECT h1.start AS start, h2.stop AS stop FROM holes AS h1
LEFT JOIN holes AS h2 ON h1.rowNum+1 = h2.rowNum
WHERE h1.start IS NOT NULL AND h2.stop IS NOT NULL
UNION ALL
SELECT 1 AS start, h1.stop AS stop FROM holes AS h1
WHERE h1.rowNum = 1 AND h1.stop > 0
ORDER BY h1.start ASC