In SQL, how can I segment users by the number of items they have? (Redshift) - sql

I'm not a SQL expert so apologies if this is actually really simple.
I have a table that lists users and the different questionnaires they have taken. Users can take questionnaires in any order and take as many as they like. There are 7 available in total, and I want to see how many users have taken 1 of 7, 2 of 7, 3 of 7, and so on.
So a really rough example is the table might look like this:
And I want a query that will show me:
count Users with 1 Q: 1
count Users with 2 Q: 2
count Users with 3 Q: 0
count Users with 4 Q: 0
count Users with 5 Q: 1
count Users with 6 Q: 0
count Users with 7 Q: 0

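You can do this with two levels of aggregation: first count the questionnaires per user, then count the users per questionnaire count:
select cnt_questionnaires, count(*) cnt_users
from (
select count(*) cnt_questionnaires from mytable group by userID
) t
group by cnt_questionnaires
order by cnt_questionnaires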
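A minimal way to try the two-level aggregation end to end, sketched with Python's built-in sqlite3 module as a stand-in for Redshift. The table and column names (mytable, userID) follow the answer above; the sample rows are the ones used in the temp-table example in this thread, and the column q is an assumption.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (userID INTEGER, q TEXT)")

# Sample rows: user 1 took 2, user 2 took 1, user 3 took 5, user 4 took 2.
rows = [(1, "Q1"), (1, "Q3"), (2, "Q2"),
        (3, "Q1"), (3, "Q2"), (3, "Q3"), (3, "Q4"), (3, "Q5"),
        (4, "Q2"), (4, "Q3")]
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

# Inner query: questionnaires per user. Outer query: users per count.
two_level = conn.execute("""
    SELECT cnt_questionnaires, COUNT(*) AS cnt_users
    FROM (
        SELECT COUNT(*) AS cnt_questionnaires
        FROM mytable
        GROUP BY userID
    ) t
    GROUP BY cnt_questionnaires
    ORDER BY cnt_questionnaires
""").fetchall()

for cnt_questionnaires, cnt_users in two_level:
    print(f"count Users with {cnt_questionnaires} Q: {cnt_users}")
```

Note that this form only returns counts that actually occur, so the buckets with zero users are missing from the output.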

IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t ;
create table #t (userid INT, q nvarchar(32));
insert into #t
values
(1,'Q1'),
(1,'Q3'),
(2,'Q2'),
(3,'Q1'),
(3,'Q2'),
(3,'Q3'),
(3,'Q4'),
(3,'Q5'),
(4,'Q2'),
(4,'Q3')
-- select * from #t
SELECT
v.qCount,
Count(c.userid) uCount
FROM
(VALUES (1),(2),(3),(4),(5),(6),(7)) v(qCount)
LEFT JOIN (
select
userid, count(q) qCount
from
#t
group by userid
) c ON c.qCount = v.qCount
GROUP BY
v.qCount

Assuming you have user_id on each row, the challenge is getting the zero values. Redshift is not very flexible when it comes to creating tables. Assuming your source data has enough rows, you can use:
select n.n, count(u.user_id) as num_users
from (select row_number() over () as n
from t
limit 7
) n left join
(select user_id, count(*) as cnt
from t
group by user_id
) u
on n.n = u.cnt
group by n.n
order by n.n;
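To see the zero buckets appear, here is a small sketch of the same left-join idea using Python's sqlite3 module. It is not Redshift: the number list 1..7 is generated with a recursive CTE (named nums here, an assumption for illustration) instead of row_number() over the source table, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, q TEXT)")
# Hypothetical sample rows: users take 2, 1, 5 and 2 questionnaires.
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, "Q1"), (1, "Q3"), (2, "Q2"),
    (3, "Q1"), (3, "Q2"), (3, "Q3"), (3, "Q4"), (3, "Q5"),
    (4, "Q2"), (4, "Q3"),
])

# nums supplies every bucket 1..7; the LEFT JOIN keeps buckets that no
# user falls into, and COUNT(u.user_id) turns those into 0.
buckets = conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < 7
    ),
    u AS (
        SELECT user_id, COUNT(*) AS cnt
        FROM t
        GROUP BY user_id
    )
    SELECT nums.n, COUNT(u.user_id) AS num_users
    FROM nums
    LEFT JOIN u ON u.cnt = nums.n
    GROUP BY nums.n
    ORDER BY nums.n
""").fetchall()

print(buckets)
```

Counting u.user_id rather than using count(*) matters here: count(*) would report 1 for the empty buckets because the left join still produces one row for them.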

Related

how can I count some values for data in a table based on same key in another table in Bigquery?

I have one table like the one below. Each id is unique.
id      times_of_going_out
fef666  2
S335gg  1
9a2c50  1
And another table like this one. In this second table the "id" is not unique; a single id can have several different "category_name" values.
id      category_name          city
S335gg  Games & Game Supplies  tk
9a2c50  Telephone Companies    os
9a2c50  Recreation Centers     ky
fef666  Recreation Centers     ky
I want to find the difference between the destinations (category_name) of people who go out often (times_of_going_out > 5) and people who don't go out often (times_of_going_out <= 5).
** Both tables are a small sample of large tables.
 ・ Where do people who go out twice often go?
 ・ Where do people who go out 6 times often go?
Thank you
The expected result could be something like this, with one column per group:
less than 5: top ten "category_name" for uids with "times_of_going_out" less than 5 times
more than 5: top ten "category_name" for uids with "times_of_going_out" more than 5 times
Steps:
1) Combine the two tables and aggregate the total times_of_going_out per category.
2) Create the groups that you need: "less than equal to 5" and "more than 5". If you don't want 5 itself in the lower group, adjust the comparison in the code.
3) Rank the categories within each group using dense_rank(). This produces ranks starting from 1 based on the total times_of_going_out.
4) Filter on the rank so that only the top 10 of each group are kept.
with main as (
select
category_name,
sum(coalesce(times_of_going_out,0)) as total_time_per_category
from table1 as t1
left join table2 as t2
on t1.id = t2.id
group by 1
),
category as (
select
*,
if(total_time_per_category > 5, 'more than 5', 'less than equal to 5') as is_more_than_5_times
from main
),
ranking_ as (
select *,
case when
is_more_than_5_times = 'more than 5' then
dense_rank() over (partition by is_more_than_5_times order by total_time_per_category desc)
else NULL
end AS rank_more_than_5,
case when
is_more_than_5_times = 'less than equal to 5' then
dense_rank() over (partition by is_more_than_5_times order by total_time_per_category)
else NULL
end AS rank_less_than_equal_5
from category
)
select
is_more_than_5_times,
string_agg(category_name,',') as list
from ranking_
where rank_less_than_equal_5 <=10 or rank_more_than_5 <= 10
group by 1
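Another reading of the question is to bucket each id by its own times_of_going_out first and then count category rows per bucket. The sketch below tries that reading with Python's sqlite3 module rather than BigQuery. The table names table1 and table2 follow the answer above; the id 'hypo01' and its rows are hypothetical, added only so that the "more than 5" bucket is not empty, and TOP_N is an illustrative parameter.

```python
import sqlite3

TOP_N = 10  # how many categories to keep per bucket (illustrative)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT PRIMARY KEY, times_of_going_out INTEGER)")
conn.execute("CREATE TABLE table2 (id TEXT, category_name TEXT, city TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [
    ("fef666", 2), ("S335gg", 1), ("9a2c50", 1),
    ("hypo01", 6),  # hypothetical frequent goer, not in the question
])
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)", [
    ("S335gg", "Games & Game Supplies", "tk"),
    ("9a2c50", "Telephone Companies", "os"),
    ("9a2c50", "Recreation Centers", "ky"),
    ("fef666", "Recreation Centers", "ky"),
    ("hypo01", "Recreation Centers", "ky"),   # hypothetical
    ("hypo01", "Telephone Companies", "os"),  # hypothetical
])

# Bucket per id, then count category rows per (bucket, category).
counts = conn.execute("""
    SELECT CASE WHEN t1.times_of_going_out > 5
                THEN 'more than 5' ELSE '5 or fewer' END AS bucket,
           t2.category_name,
           COUNT(*) AS visits
    FROM table1 t1
    JOIN table2 t2 ON t2.id = t1.id
    GROUP BY bucket, t2.category_name
    ORDER BY bucket, visits DESC, t2.category_name
""").fetchall()

# Rows arrive sorted by visits within each bucket, so keep the first TOP_N.
top = {}
for bucket, category_name, visits in counts:
    kept = top.setdefault(bucket, [])
    if len(kept) < TOP_N:
        kept.append(category_name)

print(top)
```

In BigQuery itself the top-N step would more likely be a window function over the grouped counts; doing it in Python here just keeps the sketch short.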

SQL like: How to calculate intersection and union of <item,user> data

Need help in SQL:
I have a data with the following columns:
ItemId
UserId
Each row indicates that some item was bought by some user.
Example:
ItemId UserId
200 user1
200 user3
200 user4
300 user5
300 user3
For each pair of items (i, j) I would like to calculate the following output table:
users(i) : number of users bought i
users(j) : number of users bought j
users(i, j) : number of users bought both i and j
users(i, ~j) : number of users bought i but not j
users(~i, j) : number of users bought j but not i
Example of output (from the example above):
i_itemId  j_itemId  users(i)  users(j)  users(i,j)  users(i,~j)  users(~i,j)
200       200       3         3         3           0            0
200       300       3         2         1           2            1
300       300       2         2         2           0            0
300       200       2         3         1           1            2
Note:
The data table is huge (11 GB) located on the cloud. I have a framework of SQL to work with. So I cannot download the file and run python (for example)
So the solution has to be written in SQL in efficient way
The solution does not have to be one single SQL statement.
I am looking for an efficient solution
We can assume that the pair (ItemId, UserId) is a key.
If someone has a better alternative for the question title, I will be glad to update it :)
I am not sure if there is an "easy" way to accomplish this. One method is rather brute force: use a cross join to generate all the rows. Then use subqueries for each of the individual counts:
select i1.itemid, i2.itemid, i1.num as cnt1, i2.num as cnt2,
(select count(*)
from t u1 join
t u2
on u1.userid = u2.userid
where u1.itemid = i1.itemid and u2.itemid = i2.itemid
) as cnt_1_2,
(select count(*)
from t u1 left join
t u2
on u1.userid = u2.userid and u2.itemid = i2.itemid
where u1.itemid = i1.itemid and u2.itemid is null
) as cnt_1_not2,
(select count(*)
from t u2 left join
t u1
on u1.userid = u2.userid and u1.itemid = i1.itemid
where u2.itemid = i2.itemid and u1.itemid is null
) as cnt_not1_2
from (select itemid, count(*) as num from t group by itemid) i1 cross join
(select itemid, count(*) as num from t group by itemid) i2;
Here's a recipe
1) Create a temporary table to gather the I and J totals.
Disclaimer:
This example uses an MS SQL Server datatype, INT, so change it to the numeric type that your RDBMS supports.
By the way, in MS SQL Server, temporary table names start with #.
create table TempTotals (iItemId int, jItemId int, TotalUsers int);
2) Fill it up with totals
delete from TempTotals;
insert into TempTotals (iItemId, jItemId, TotalUsers)
select
t1.ItemId as iItemId,
t2.ItemId as jItemId,
count(distinct t1.UserId) as TotalUsers
from YourTable t1
full join YourTable t2 on (t1.UserId = t2.UserId)
group by t1.ItemId, t2.ItemId;
3) Self-join the temporary table to get all the totals
select
ij.iItemId,
ij.jItemId,
i.TotalUsers as Users_I,
j.TotalUsers as Users_J,
ij.TotalUsers as Users_I_and_J,
(i.TotalUsers - ij.TotalUsers) as Users_I_no_J,
(j.TotalUsers - ij.TotalUsers) as Users_J_no_I
from TempTotals ij
left join TempTotals i on (i.iItemId = ij.iItemId and i.iItemId = i.jItemId)
left join TempTotals j on (j.jItemId = ij.jItemId and j.iItemId = j.jItemId)
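The subtraction idea from the recipe (users of i minus users of both) can be sketched in one statement. This is a hedged illustration using Python's sqlite3 module, not the recipe's exact temp-table flow: the table name purchases and the CTE names totals and pair_counts are assumptions, and a left join is used so that item pairs with no common users still appear with 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (ItemId INTEGER, UserId TEXT)")
# Example data from the question.
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [
    (200, "user1"), (200, "user3"), (200, "user4"),
    (300, "user5"), (300, "user3"),
])

pairs = conn.execute("""
    WITH totals AS (            -- users(i) for every item
        SELECT ItemId, COUNT(DISTINCT UserId) AS n
        FROM purchases
        GROUP BY ItemId
    ),
    pair_counts AS (            -- users(i, j) for pairs sharing a user
        SELECT a.ItemId AS i, b.ItemId AS j,
               COUNT(DISTINCT a.UserId) AS n
        FROM purchases a
        JOIN purchases b ON a.UserId = b.UserId
        GROUP BY a.ItemId, b.ItemId
    )
    SELECT ti.ItemId, tj.ItemId,
           ti.n, tj.n,
           COALESCE(pc.n, 0),
           ti.n - COALESCE(pc.n, 0),   -- users(i, ~j)
           tj.n - COALESCE(pc.n, 0)    -- users(~i, j)
    FROM totals ti
    CROSS JOIN totals tj
    LEFT JOIN pair_counts pc
           ON pc.i = ti.ItemId AND pc.j = tj.ItemId
    ORDER BY ti.ItemId, tj.ItemId
""").fetchall()

for row in pairs:
    print(row)
```

On a large table the cross join over all item pairs is the expensive part; if only pairs with at least one common user are needed, starting from pair_counts instead avoids it.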
If you're using Oracle Database, you can compare nested tables (collections) with the multiset operators. And get the number of elements in a collection with cardinality.
So what you can do is:
Group by itemid, collecting all the users into a nested table
Cross join the output of this with itself
Use the multiset intersect/except operators to get number of elements in the sets as needed
Which looks a little like:
create table t (
ItemId int, UserId varchar2(10)
);
insert into t values ( 200 , 'user1');
insert into t values ( 200 , 'user3');
insert into t values ( 200 , 'user4');
insert into t values ( 300 , 'user5');
insert into t values ( 300 , 'user3');
commit;
create or replace type users_t as table of varchar2(10);
/
with grps as (
select itemid, cast ( collect ( userid ) as users_t ) users
from t
group by itemid
)
select g1.itemid i, g2.itemid j,
cardinality ( g1.users ) num_i,
cardinality ( g2.users ) num_j,
cardinality ( g1.users multiset intersect g2.users ) i_and_j,
cardinality ( g1.users multiset except g2.users ) i_not_j,
cardinality ( g2.users multiset except g1.users ) j_not_i
from grps g1
cross join grps g2;
I J NUM_I NUM_J I_AND_J I_NOT_J J_NOT_I
200 200 3 3 3 0 0
200 300 3 2 1 2 1
300 200 2 3 1 1 2
300 300 2 2 2 0 0
If necessary, you may be able to get more performance by skipping the except operators when i = j, e.g.:
case
when g1.itemid = g2.itemid then 0
else cardinality ( g1.users multiset except g2.users )
end

SQL: how to compare values from two tables and report per-row results

I have two Tables.
table A
id name Size
===================
1 Apple 7
2 Orange 15
3 Banana 22
4 Kiwi 2
5 Melon 28
6 Peach 9
And Table B
id size
==============
1 14
2 5
3 31
4 9
5 1
6 16
7 7
8 25
My desired result will be (add one column to Table A, which is the number of rows in Table B that have size smaller than Size in Table A)
id name Size Num.smaller.in.B
==============================
1 Apple 7 2
2 Orange 15 5
3 Banana 22 6
4 Kiwi 2 1
5 Melon 28 7
6 Peach 9 3
Both Table A and Table B are pretty huge. Is there a clever way of doing this? Thanks.
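This query does it with a correlated subquery. Note that the outer column must be qualified, otherwise Size resolves to TableB.size inside the subquery:
SELECT id,
name,
Size,
(SELECT count(*) FROM TableB WHERE TableB.size < TableA.Size) AS num_smaller_in_b
FROM TableA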
The standard way to get your result involves a non-equi-join, which will be a product join in Explain. First duplicating 20,000 rows, followed by 7,000,000 * 20,000 comparisons and a huge intermediate spool before the count.
There's a solution based on OLAP-functions which is usually quite efficient:
SELECT dt.*,
-- Do a cumulative count of the rows of table #2
-- sorted by size, i.e. count the rows of table #2 with a size less than the size in table #1
Sum(CASE WHEN NAME = '' THEN 1 ELSE 0 end)
Over (ORDER BY SIZE, NAME DESC ROWS Unbounded Preceding)
FROM
( -- mix the rows of both tables, an empty name indicates rows from table #2
SELECT id, name, size
FROM a
UNION ALL
SELECT id, '', size
FROM b
) AS dt
-- only return the rows of table #1
QUALIFY name <> ''
If there are multiple rows with the same size in table #2, you had better count them before the Union to reduce the number of rows:
SELECT dt.*,
-- Do a cumulative sum of the counts of table #2
-- sorted by size, i.e. count the rows of table #2 with a size less than the size in table #1
Sum(CASE WHEN NAME ='' THEN id ELSE 0 end)
Over (ORDER BY SIZE, NAME DESC ROWS Unbounded Preceding)
FROM
( -- mix the rows of both tables, an empty name indicates rows from table #2
SELECT id, name, size
FROM a
UNION ALL
SELECT Count(*), '', SIZE
FROM b
GROUP BY SIZE
) AS dt
-- only return the rows of table #1
QUALIFY NAME <> ''
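The idea behind the OLAP version is that once table #2 is sorted by size, the answer for each row of table #1 is just a position in that sorted order. A small sketch of that same idea outside the database, in plain Python with the data from the question (the variable and function names are illustrative):

```python
from bisect import bisect_left

# Data from the question.
table_a = [(1, "Apple", 7), (2, "Orange", 15), (3, "Banana", 22),
           (4, "Kiwi", 2), (5, "Melon", 28), (6, "Peach", 9)]
table_b_sizes = [14, 5, 31, 9, 1, 16, 7, 25]

# Sort table B once: O(m log m).
sorted_b = sorted(table_b_sizes)


def count_smaller(size):
    """Number of B sizes strictly smaller than `size`.

    bisect_left returns the index of the first element >= size,
    which equals the count of elements < size.
    """
    return bisect_left(sorted_b, size)


# One binary search per row of A: O(n log m) instead of n * m comparisons.
result = [(id_, name, size, count_smaller(size))
          for id_, name, size in table_a]

for row in result:
    print(row)
```

Using bisect_left rather than bisect_right is what makes the comparison strict, which corresponds to sorting the rows of table #1 ahead of equal-sized rows of table #2 in the SQL version.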
There is no clever way of doing that, you just need to join the tables like this:
select a.*, b.size
from TableA a join TableB b on a.id = b.id
To improve performance you'll need to have indexes on the id columns.
maybe
select
id,
name,
a.Size,
sum(cnt) as sum_cnt
from
a inner join
(select size, count(*) as cnt from b group by size) s on
s.size < a.size
group by id,name,a.size
If you're working with large tables, indexing table b's size field could help. I'm also assuming that the values in table B repeat a lot, i.e. that there are many duplicates you don't care about other than wanting to count them.
@Ritesh's solution is perfectly correct; another similar solution uses CROSS APPLY, as shown below:
use tempdb
create table dbo.A (id int identity, name varchar(30), size int );
create table dbo.B (id int identity, size int);
go
insert into dbo.A (name, size)
values ('Apple', 7)
,('Orange', 15)
,('Banana', 22)
,('Kiwi', 2 )
,('Melon', 28)
,('Peach', 9 )
insert into dbo.B (size)
values (14), (5),(31),(9),(1),(16), (7),(25)
go
-- using cross apply
select a.*, t.cnt
from dbo.A
cross apply (select cnt=count(*) from dbo.B where B.size < A.size) T(cnt)
Try this query:
SELECT
A.id, A.name, A.size, Count(B.size)
from A, B
where A.size > B.size
group by A.id, A.name, A.size
order by A.id;

How do I aggregate numbers from a string column in SQL

I am dealing with a poorly designed database column which has values like this
ID cid Score
1 1 3 out of 3
2 1 1 out of 5
3 2 3 out of 6
4 3 7 out of 10
I want the aggregate sum and percentage of Score column grouped on cid like this
cid sum percentage
1 4 out of 8 50
2 3 out of 6 50
3 7 out of 10 70
How do I do this?
You can try this way :
select
t.cid
, cast(sum(s.a) as varchar(5)) +
' out of ' +
cast(sum(s.b) as varchar(5)) as sum
, ((cast(sum(s.a) as decimal))/sum(s.b))*100 as percentage
from MyTable t
inner join
(select
id
, cast(left(score, charindex(' ', score) - 1) as int) a
, cast(substring(score,charindex('out of', score)+7,len(score)) as int) b
from MyTable
) s on s.id = t.id
group by t.cid
Redesign the table, but on-the-fly as a CTE. Here's a solution that's not as short as you could make it, but that takes advantage of the handy SQL Server function PARSENAME. You may need to tweak the percentage calculation if you want to truncate rather than round, or if you want it to be a decimal value, not an int.
In this or most any solution, you have to count on the column values for Score to be in the very specific format you show. If you have the slightest doubt, you should run some other checks so you don't miss or misinterpret anything.
with
P(ID, cid, Score2Parse) as (
select
ID,
cid,
replace(Score,space(1),'.')
from scores
),
S(ID,cid,pts,tot) as (
select
ID,
cid,
cast(parsename(Score2Parse,4) as int),
cast(parsename(Score2Parse,1) as int)
from P
)
select
cid, cast(round(100e0*sum(pts)/sum(tot),0) as int) as percentage
from S
group by cid;
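Because everything depends on the exact "x out of y" format, it can help to validate the strings while parsing them. The following is a minimal sketch of the same parse-then-aggregate logic in Python, using the sample rows from the question; the names SCORE_RE and summarize are illustrative, and rounding to a whole percent is an assumption.

```python
import re
from collections import defaultdict

# Sample rows from the question: (ID, cid, Score).
rows = [
    (1, 1, "3 out of 3"),
    (2, 1, "1 out of 5"),
    (3, 2, "3 out of 6"),
    (4, 3, "7 out of 10"),
]

# Accept only the exact "<points> out of <total>" shape.
SCORE_RE = re.compile(r"^\s*(\d+)\s+out\s+of\s+(\d+)\s*$")


def summarize(score_rows):
    """Sum points and totals per cid and compute a rounded percentage."""
    totals = defaultdict(lambda: [0, 0])
    for row_id, cid, score in score_rows:
        match = SCORE_RE.match(score)
        if match is None:
            # Fail loudly instead of silently misreading a malformed value.
            raise ValueError(f"row {row_id}: unexpected score {score!r}")
        totals[cid][0] += int(match.group(1))
        totals[cid][1] += int(match.group(2))

    summary = {}
    for cid, (pts, tot) in sorted(totals.items()):
        pct = round(100 * pts / tot) if tot else None
        summary[cid] = (f"{pts} out of {tot}", pct)
    return summary


summary = summarize(rows)
for cid, (text, pct) in summary.items():
    print(cid, text, pct)
```

Raising on a malformed value is a design choice for the sketch; in a real cleanup job you might instead collect the bad rows for review.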

How to fetch the rows from SQL Server based on GROUP BY

How do I fetch the rows for the top 2 pack IDs at a time, rather than all of the rows, in SQL Server?
Ex: Sample_table
tranid packid referencenum
1 1 123456
2 1 654982
3 2 894652
4 3 684521
5 3 684651
6 4 987566
Based on the sample table above, how do I get the rows for the first two packs (1 and 2), and then, on the next call, the rows for packs 3 and 4?
Can anyone help me sort this out?
If I didn't miss something, this:
SELECT *
FROM PacksTable p
WHERE p.packid IN (1, 2)
will give you only the rows for those two packid values in your table.
It is unclear what you are looking for here. You can group by packid and then take the top two packid values, but what do you want to do with the referencenum values of each grouped packid, i.e. which aggregate function will you use for this column: MIN, MAX, etc.?
In other words: if you are looking for the two lowest packid values (1 and 2 the first time), you will have to decide which aggregate function to use with the corresponding referencenum values.
For example, you can use MIN like this:
SELECT TOP(2) p.packid, MIN(p.referencenum)
FROM PacksTable p
GROUP BY(p.packid)
ORDER BY p.packid
please go through the following query.
select * from sample_table group by packid;
You could use variables combined with the DENSE_RANK function to window through two packid's at a time:
create table #packing (tranid int,packid int,referencenum int)
insert into #packing values
(1,1,123456)
, (2,1,654982)
, (3,2,894652)
, (4,3,684521)
, (5,3,684651)
, (6,4,987566)
go
declare @i int=-1;
declare @j int=0;
while @@ROWCOUNT>0 begin
set @i+=2;
set @j+=2;
; with a as (
select *, dr=dense_rank()over(order by packid) from #packing
)
select tranid, packid, referencenum
from a
where dr between @i and @j;
end
go
drop table #packing
go
Result: