What's the best way to group SQL results by batches of items?

For example, I have a simple table books:
author    book
Author-A  Book-A1
Author-A  Book-A2
Author-B  Book-B1
Author-C  Book-C1
Author-C  Book-C2
And I need to count books by each author, so I'll write:
select author, count(*) from books
group by author
# Author-A = 2
# Author-B = 1
# Author-C = 2
But now I need to count books by groups of authors:
groupA = ['Author-A', 'Author-C'],
groupB = ['Author-B']
select authorGroup, count(*) from books
group by {
case author in groupA -> 'groupA'
case author in groupB -> 'groupB'
} as authorGroup
# ['Author-A', 'Author-C'] = 4
# ['Author-B'] = 1
These groups can be different and come from another module.
What's the best way to write these queries? Maybe without a union such as:
select author as 'groupA', count(*) from books
where author in { groupA }
union
select author as 'groupB', count(*) from books
where author in { groupB }
because there could be a lot of groups in one request (~20-30).
The problem is that these groups can be completely dynamic: I can request ['Author-A', 'Author-B'] as one group in one request and ['Author-B', 'Author-C'] in another.
In other words, the group is not something like the author's country or genre. It can be totally dynamic.

The usual way is to JOIN on to a mapping table, which can be an in-line-view if need be (though I recommend an actual table, which can be indexed).
WITH
author_group AS
(
SELECT 'Author-A' AS author, 'Group-A' AS group_label
UNION ALL
SELECT 'Author-B' AS author, 'Group-B' AS group_label
UNION ALL
SELECT 'Author-C' AS author, 'Group-A' AS group_label
)
SELECT
author_group.group_label,
COUNT(*)
FROM
books
INNER JOIN
author_group
ON author_group.author = books.author
GROUP BY
author_group.group_label
Similar results can be achieved with CASE expressions, but it doesn't scale very well...
WITH
mapped_author AS
(
SELECT
*,
CASE author
WHEN 'Author-A' THEN 'Group-A'
WHEN 'Author-B' THEN 'Group-B'
WHEN 'Author-C' THEN 'Group-A'
END
AS author_group
FROM
books
)
SELECT
author_group,
COUNT(*)
FROM
mapped_author
GROUP BY
author_group
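Since the question says the groups are dynamic and come from another module, here is a minimal sketch of feeding such a mapping into the JOIN approach. It assumes Python with the standard sqlite3 module; the `groups` dictionary and the temporary table layout are illustrative assumptions, not part of the original question.

```python
import sqlite3

# Hypothetical group definitions, e.g. supplied by another module per request.
groups = {
    "Group-A": ["Author-A", "Author-C"],
    "Group-B": ["Author-B"],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (author TEXT, book TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("Author-A", "Book-A1"),
        ("Author-A", "Book-A2"),
        ("Author-B", "Book-B1"),
        ("Author-C", "Book-C1"),
        ("Author-C", "Book-C2"),
    ],
)

# Load the request-specific mapping into a temporary table (one row per author/group pair).
conn.execute("CREATE TEMP TABLE author_group (author TEXT, group_label TEXT)")
conn.executemany(
    "INSERT INTO author_group VALUES (?, ?)",
    [(author, label) for label, authors in groups.items() for author in authors],
)

# Same JOIN + GROUP BY as above, with ORDER BY added for a stable result order.
counts = conn.execute(
    """
    SELECT author_group.group_label, COUNT(*)
    FROM books
    INNER JOIN author_group ON author_group.author = books.author
    GROUP BY author_group.group_label
    ORDER BY author_group.group_label
    """
).fetchall()
print(counts)
```

Note that if an author appears in more than one group, the JOIN counts that author's books once in each of those groups.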

First you need a derived table that shows which group each author is in.
Then you just count.
Like this (groupA and groupB stand for the comma-separated author lists):
select distinct a.group_auth, count(a.book) over (partition by a.group_auth)
from
(select
case when b.Author in (groupA) then 'groupA'
when b.Author in (groupB) then 'groupB'
end as group_auth,
b.book as book
from books b
) as a
;

How to find combination of intersection from many tables?

I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count the daily active users that come from each combination of these channels. Each channel has its own table and tracks its respective users.
The tables look like the following:
channel A
date user_id
2020-08-01 A
2020-08-01 B
2020-08-01 C
channel B
date user_id
2020-08-01 C
2020-08-01 D
2020-08-01 G
channel C
date user_id
2020-08-01 A
2020-08-01 C
2020-08-01 F
I want to know the following combinations
Only visit channel A
Only visit channel A & B
Only visit channel B & C
Only visit channel B
etc.
However, when there are a lot of channels (I have around 8), the number of combinations gets large. What I've done so far is roughly this (this one covers the combinations that include channel A):
SELECT
a.date,
COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1
but this is extremely tedious when the total number of channels is 8 (28 variations for combinations of 2, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this? I was thinking of using FULL OUTER JOIN but can't quite get the hang of it. Answers are really appreciated.
I would approach this with union all and two levels of aggregation:
select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
from ((select distinct date, user_id, 'a' as channel from a) union all
(select distinct date, user_id, 'b' as channel from b) union all
(select distinct date, user_id, 'c' as channel from c)
) abc
group by date, user_id
) c
group by date, channels;
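To see what the two levels of aggregation produce on the question's sample data, here is a small sketch. It assumes Python with sqlite3; the UNION ALL runs in SQL, while the two grouping steps are done in Python to avoid depending on ordered string aggregation, which not every engine supports. The table names channel_a, channel_b, channel_c are assumptions.

```python
import sqlite3
from collections import Counter, defaultdict

conn = sqlite3.connect(":memory:")
# Sample data from the question; table names are assumed.
sample = {
    "channel_a": ["A", "B", "C"],
    "channel_b": ["C", "D", "G"],
    "channel_c": ["A", "C", "F"],
}
for table, users in sample.items():
    conn.execute(f"CREATE TABLE {table} (date TEXT, user_id TEXT)")
    conn.executemany(
        f"INSERT INTO {table} VALUES ('2020-08-01', ?)", [(u,) for u in users]
    )

# Stack all channel tables and tag each row with its channel.
rows = conn.execute(
    """
    SELECT DISTINCT date, user_id, 'A' AS channel FROM channel_a UNION ALL
    SELECT DISTINCT date, user_id, 'B' FROM channel_b UNION ALL
    SELECT DISTINCT date, user_id, 'C' FROM channel_c
    """
).fetchall()

# First level: the set of channels per (date, user_id).
channels_by_user = defaultdict(set)
for date, user_id, channel in rows:
    channels_by_user[(date, user_id)].add(channel)

# Second level: number of users per (date, channel combination).
dau = Counter(
    (date, " & ".join(sorted(channels)))
    for (date, _user), channels in channels_by_user.items()
)
print(sorted(dau.items()))
```

Only combinations that actually occur in the data show up, so nothing has to be enumerated by hand.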
However, when there are a lot of channels (I have around 8 channels) the combination is a lot
extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this?
The query below is for BigQuery Standard SQL and addresses exactly this aspect of the OP's concerns:
#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
var combine = function(a) {
var fn = function(n, src, got, all) {
if (n == 0) {
if (got.length > 0) {
all[all.length] = got;
} return;
}
for (var j = 0; j < src.length; j++) {
fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
} return;
}
var all = []; for (var i = 1; i < a.length; i++) {
fn(i, a, [], all);
}
all.push(a);
return all;
}
return combine(a)
''';
with users as (
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
), visits as (
select date, user_id,
string_agg(channel, ' & ' order by channel) combination
from users
group by date, user_id
), channels AS (
select channel, cast(row_number() over(order by channel) as string) channel_num
from (select distinct channel from users)
), combinations as (
select string_agg(channel, ' & ' order by channel_num) combination
from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items,
unnest(split(items)) AS channel_num
join channels using(channel_num)
group by items
)
select date,
combination as channels_visited_only,
count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination
Applied to the sample data from your question, this produces one row per date and channel combination, together with its dau count.
Some explanations to help with using the above:
CTE users simply unions all the tables and adds a channel column to show which table each row came from
CTE visits extracts the list of all visited channels for each user-date combination
CTE channels simply prepares the list of channels and assigns each a number for later use
CTE combinations uses the JS UDF to generate all combinations of channel numbers and then joins them back to channels to produce the channel combinations
and the final SELECT statement simply looks for those users whose list of visited channels matches a channel combination generated in the previous step
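The combination-generation step can also be sketched outside the database. This is not the UDF above, just an illustration in Python (an assumption; `channel_combinations` is a hypothetical helper) of what the combinations CTE ends up containing and how quickly the count grows.

```python
from itertools import combinations
from math import comb


def channel_combinations(channels):
    """Return every non-empty combination of channels as an ' & '-joined label."""
    ordered = sorted(channels)
    return [
        " & ".join(combo)
        for size in range(1, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


labels = channel_combinations(["A", "B", "C"])
print(labels)

# For n channels there are 2**n - 1 non-empty combinations.
total_for_8 = len(channel_combinations("ABCDEFGH"))
# Number of combinations of size 2, 3 and 4 out of 8 channels.
sizes_for_8 = [comb(8, k) for k in (2, 3, 4)]
print(total_for_8, sizes_for_8)
```

The per-size counts for 8 channels match the 28, 56 and 70 quoted in the question.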
Some recommendations for further streamlining the above code:
assuming your channel table names follow the channel_* pattern,
you can use the wildcard tables feature in the users CTE, and instead of
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
you can use something like the line below, so just one line instead of as many lines as you have channels
select distinct date, user_id, _TABLE_SUFFIX as channel from channel_*
I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
E.g.
Only visit channel A is (A except B) except C
Visit channels A & B is A intersect B
etc.
I am thinking full join and aggregation:
select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from (select 'a' channel_a, a.* from channel_a a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c

Merging two query results in a materialized view

I'm trying to merge two SELECT results into one view.
The first query returns the IDs of all registered users.
The second query goes through an entire table, counts how many victories each player has, and returns the player's ID and number of wins.
What I'm trying to do now is merge these two results, so that if the user has wins it states how many, but if they don't, it says 0.
I tried doing it like this:
SELECT profile.user_id
FROM profile
FULL JOIN ( SELECT player_game_data.user_id,
count(player_game_data.user_id) AS wins
FROM player_game_data
WHERE player_game_data.is_winner = 1
GROUP BY player_game_data.user_id) t2 ON profile.user_id::text = t2.user_id::text;
But in the end it only returns the IDs of the players, and there isn't a count column.
What am I doing wrong?
Is this what you want?
select p.*,
(select count(*)
from player_game_data pg
where pg.user_id = p.user_id and pg.is_winner = 1
) as num_wins
from profile p;
Or, if all users have played at least one game, you can use conditional aggregation:
select pg.user_id,
count(*) filter (where pg.is_winner = 1)
from player_game_data pg
group by pg.user_id;
Or, if is_winner only takes on the values of 0 and 1:
select pg.user_id, sum(pg.is_winner)
from player_game_data pg
group by pg.user_id;
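The difference between the first variant and the aggregation-only variants can be checked on a few rows. This is a sketch assuming Python with sqlite3; the user IDs and game rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile (user_id TEXT)")
conn.execute("CREATE TABLE player_game_data (user_id TEXT, is_winner INTEGER)")
# Hypothetical data: u1 has two wins, u2 has only a loss, u3 never played.
conn.executemany("INSERT INTO profile VALUES (?)", [("u1",), ("u2",), ("u3",)])
conn.executemany(
    "INSERT INTO player_game_data VALUES (?, ?)",
    [("u1", 1), ("u1", 0), ("u1", 1), ("u2", 0)],
)

# Correlated subquery: one row per profile, 0 when there are no wins.
wins = conn.execute(
    """
    SELECT p.user_id,
           (SELECT COUNT(*)
            FROM player_game_data pg
            WHERE pg.user_id = p.user_id AND pg.is_winner = 1) AS num_wins
    FROM profile p
    ORDER BY p.user_id
    """
).fetchall()

# Aggregation over player_game_data alone: users without any game are missing.
agg_only = conn.execute(
    """
    SELECT pg.user_id, SUM(pg.is_winner)
    FROM player_game_data pg
    GROUP BY pg.user_id
    ORDER BY pg.user_id
    """
).fetchall()
print(wins, agg_only)
```

The second result has no row for u3, which is why the aggregation-only variants carry the caveat that every user must have played at least one game.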
Thanks for the help Gordon. I've got it to work now.
The final query looks like this :
SELECT p.user_id,
( SELECT count(*) AS count
FROM player_game_data pg
WHERE pg.user_id::text = p.user_id::text AND pg.is_winner = 1) AS wins,
( SELECT count(*) AS count
FROM player_game_data pg
WHERE pg.user_id::text = p.user_id::text AND pg.is_winner = 0) AS losses,
( SELECT count(*) AS count
FROM player_game_data pg
WHERE pg.user_id::text = p.user_id::text) AS games_played
FROM profile p;
And when I run it, I get the result that I wanted.

finding value in a list created via subquery

Thank you Stack-Community,
This is probably obvious for most of you but I just don't understand why it doesn't work.
I am using the Northwind database, and let's say I am trying to find the countries that do not occur exactly twice but are listed either more than twice or less often.
I already figured out other ways of doing it with a having statement, so I am not looking for alternatives but trying to understand why my initial attempt is not working.
I look at it and look at it and it makes perfect sense to me. Can someone explain what's the problem?
SELECT country, count(country)
FROM Customers
WHERE 2 not in (SELECT count(country) FROM Customers GROUP BY country)
GROUP BY country
;
You need a correlated subquery:
SELECT country, count(country)
FROM Customers c
WHERE 2 not in (SELECT count(country) FROM Customers c2
WHERE c2.country = c.country )
GROUP BY country;
Otherwise you get something like:
SELECT country, count(country)
FROM Customers c
WHERE 2 not in (1,2,3) -- false in every case and empty resultset
GROUP BY country;
Imagine that you have:
1, 'UK' -- 1
2, 'DE' -- 2
3, 'DE'
4, 'RU' -- 1
Now you will get equivalent of
SELECT country, count(country)
FROM Customers c
WHERE 2 not in (1,2,1) -- false in every case and empty resultset
GROUP BY country;
-- 0 rows selected
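The four-row example above can be run as-is to compare both forms. This is a sketch assuming Python with sqlite3; the id column name is an assumption, and ORDER BY is added only to make the output order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "UK"), (2, "DE"), (3, "DE"), (4, "RU")],
)

# Uncorrelated: the subquery returns the counts of ALL countries,
# so the test "2 NOT IN (...)" is the same for every row.
uncorrelated = conn.execute(
    """
    SELECT country, count(country)
    FROM Customers
    WHERE 2 NOT IN (SELECT count(country) FROM Customers GROUP BY country)
    GROUP BY country
    ORDER BY country
    """
).fetchall()

# Correlated: the subquery counts only the rows of the current row's country.
correlated = conn.execute(
    """
    SELECT country, count(country)
    FROM Customers c
    WHERE 2 NOT IN (SELECT count(country) FROM Customers c2
                    WHERE c2.country = c.country)
    GROUP BY country
    ORDER BY country
    """
).fetchall()
print(uncorrelated, correlated)
```

Because DE occurs twice, the uncorrelated list contains 2 and every row is filtered out; the correlated form removes only the DE rows.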

SQL: multiple counts from same table

I am having a real problem trying to get a query with the data I need. I have tried a few methods without success. I can get the data with 4 separate queries, I just can't get them into 1 query. All data comes from 1 table. I will list as much info as I can.
My data looks like this. I have a customerID and 3 columns that record who has worked on the record for that customer as well as the assigned acct manager
RecID  Customer  CreatedBy    LastUser     AcctMan
1      1374      Bob Jones    Mary Willis  Bob Jones
2      1375      Mary Willis  Bob Jones    Bob Jones
3      1376      Jay Scott    Mary Willis  Mary Willis
4      1377      Jay Scott    Mary Willis  Jay Scott
5      1378      Bob Jones    Jay Scott    Jay Scott
I want the query to return the following data. See below for a description of how each is obtained.
Employee     Created  Modified  Mod Own  Created Own
Bob Jones    2        1         1        1
Mary Willis  1        2         1        0
Jay Scott    2        1         1        1
Created = the number of records created by each Employee
Modified = the number of records where the Employee is listed as LastUser (except where they created the record)
Mod Own = the number of records where LastUser = AcctMan (account manager)
Created Own = the number of records created by the Employee where they are the account manager for that customer
I can get each of these from a query, just need to somehow combine them:
Select CreatedBy, COUNT(CreatedBy) as Created
FROM [dbo].[Cust_REc] GROUP By CreatedBy
Select LastUser, COUNT(LastUser) as Modified
FROM [dbo].[Cust_REc] Where LastUser != CreatedBy GROUP By LastUser
Select AcctMan, COUNT(AcctMan) as CreatePort
FROM [dbo].[Cust_REc] Where AcctMan = CreatedBy GROUP By AcctMan
Select AcctMan, COUNT(AcctMan) as ModPort
FROM [dbo].[Cust_REc] Where AcctMan = LastUser AND NOT AcctMan = CreatedBy GROUP By AcctMan
Can someone see a way to do this? I may have to join the table to itself, but my attempts have not given me the correct data.
The following will give you the results you're looking for.
select
e.employee,
create_count=(select count(*) from customers c where c.createdby=e.employee),
mod_count=(select count(*) from customers c where c.lastmodifiedby=e.employee),
create_own_count=(select count(*) from customers c where c.createdby=e.employee and c.acctman=e.employee),
mod_own_count=(select count(*) from customers c where c.lastmodifiedby=e.employee and c.acctman=e.employee)
from (
select employee=createdby from customers
union
select employee=lastmodifiedby from customers
union
select employee=acctman from customers
) e
Note: there are other approaches that are more efficient than this but potentially far more complex as well. Specifically, I would bet there is a master Employee table somewhere that would prevent you from having to do the inline view just to get the list of names.
This seems pretty straightforward. Try this:
select a.employee,b.created,c.modified ....
from (select distinct created_by as employee from data) as a
inner join
(select created_by,count(*) as created from data group by created_by) as b
on a.employee = b.created_by
inner join ....
This highly inefficient query may be a rough start to what you are looking for. Once you validate the data then there are things you can do to tidy it up and make it more efficient.
Also, I don't think you need the DISTINCT on the UNION part because the UNION will return DISTINCT values unless UNION ALL is specified.
SELECT
Employees.EmployeeID,
Created =(SELECT COUNT(*) FROM Cust_Rec WHERE Cust_Rec.CreatedBy=Employees.EmployeeID),
Modified =(SELECT COUNT(*) FROM Cust_Rec WHERE Cust_Rec.LastUser=Employees.EmployeeID AND Cust_Rec.CreatedBy<>Employees.EmployeeID),
ModOwn =
CASE WHEN Employees.IsManager=0 THEN NULL ELSE
(SELECT COUNT(*) FROM Cust_Rec WHERE AcctMan=Employees.EmployeeID)
END,
CreatedOwn=(SELECT COUNT(*) FROM Cust_Rec WHERE AcctMan=Employees.EmployeeID AND CreatedBy=Employees.EmployeeID)
FROM
(
SELECT
EmployeeID,
IsManager=CASE WHEN EXISTS(SELECT AcctMan FROM Cust_Rec WHERE AcctMan=EmployeeID) THEN 1 ELSE 0 END
FROM
(
SELECT DISTINCT
EmployeeID
FROM
(
SELECT EmployeeID=CreatedBy FROM Cust_Rec
UNION
SELECT EmployeeID=LastUser FROM Cust_Rec
UNION
SELECT EmployeeID=AcctMan FROM Cust_Rec
)AS Z
)AS Y
)
AS Employees
I had the same issue with the Modified column. All the other columns worked okay. DCR example would work well with the join on an employees table if you have it.
SELECT CreatedBy AS [Employee],
COUNT(CreatedBy) AS [Created],
--Couldn't get modified to pull the right results
SUM(CASE WHEN LastUser = AcctMan THEN 1 ELSE 0 END) [Mod Own],
SUM(CASE WHEN CreatedBy = AcctMan THEN 1 ELSE 0 END) [Created Own]
FROM Cust_Rec
GROUP BY CreatedBy
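One way to get the Modified column into the same pass is to stack the table once per role and sum flags per employee. The sketch below assumes Python with sqlite3 and follows the question's written definitions literally; it is an illustration of the idea, not a drop-in T-SQL query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cust_Rec (RecID INTEGER, Customer INTEGER, "
    "CreatedBy TEXT, LastUser TEXT, AcctMan TEXT)"
)
conn.executemany(
    "INSERT INTO Cust_Rec VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1374, "Bob Jones", "Mary Willis", "Bob Jones"),
        (2, 1375, "Mary Willis", "Bob Jones", "Bob Jones"),
        (3, 1376, "Jay Scott", "Mary Willis", "Mary Willis"),
        (4, 1377, "Jay Scott", "Mary Willis", "Jay Scott"),
        (5, 1378, "Bob Jones", "Jay Scott", "Jay Scott"),
    ],
)

# One branch per role; each branch emits 0/1 flags that are summed per employee.
result = conn.execute(
    """
    SELECT employee,
           SUM(created), SUM(modified), SUM(mod_own), SUM(created_own)
    FROM (
        SELECT CreatedBy AS employee,
               1 AS created,
               0 AS modified,
               0 AS mod_own,
               CASE WHEN CreatedBy = AcctMan THEN 1 ELSE 0 END AS created_own
        FROM Cust_Rec
        UNION ALL
        SELECT LastUser,
               0,
               CASE WHEN LastUser <> CreatedBy THEN 1 ELSE 0 END,
               CASE WHEN LastUser = AcctMan THEN 1 ELSE 0 END,
               0
        FROM Cust_Rec
    ) AS roles
    GROUP BY employee
    ORDER BY employee
    """
).fetchall()
print(result)
```

On the five sample rows, the written definition of Modified gives Mary Willis 3 (records 1, 3 and 4), not the 2 shown in the expected output, so it is worth confirming which rule is actually intended before relying on either number.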

Having problems identifying my mistake

The tables, which are already created and cannot be modified, are Book and Author.
Book (Title, Price, Yeareleased)
Author(AName,btitle,position)
The italicized attributes are the keys,
and btitle in Author is a foreign key that references Book(Title).
My SQL query:
select distinct AName
from Author
where position in (2,3) AND position<>1
group by AName
When I run this I get all the authors that have a book in position 2 or 3. That is close to what I want, but I'm only trying to get those authors who are in position 2 or 3 for all their books.
Essentially, it should return every author who was in the 2nd or 3rd position in all of their books.
Maybe something like this would work:
select distinct AName
from #Author
where position in (2,3)
except
select distinct AName
from #Author
where position not in (2,3)
It builds the set of authors who are in position 2 or 3 and then removes the ones who also have another position.
It is not entirely clear whether someone who co-wrote 2 books and was listed second on one and third on the other should be selected or not. It is simpler to allow it; you can refine the query if you need the more stringent condition.
One way to answer this query makes the key observation that you're interested in authors for whom the count of the books they have written is equal to the count of the books where they are listed as second or third author.
Go for some TDQD — Test-Driven Query Design
Number of books each author wrote
SELECT Aname, COUNT(*) AS BookCount
FROM Author
GROUP BY AName
Number of books each author wrote as second or third author
SELECT Aname, COUNT(*) AS NonLeadAuthorCount
FROM Author
WHERE Position IN (2, 3)
GROUP BY Aname
Join those two on the author name where the counts are identical
SELECT X.Aname
FROM (SELECT Aname, COUNT(*) AS BookCount
FROM Author
GROUP BY AName
) AS X
JOIN (SELECT Aname, COUNT(*) AS NonLeadAuthorCount
FROM Author
WHERE Position IN (2, 3)
GROUP BY Aname
) AS Y
ON X.Aname = Y.Aname AND X.BookCount = Y.NonLeadAuthorCount
An alternative way of looking at it is 'the set of authors who have written a book in position 2 or 3 minus the set of authors who have written a book where the position is neither 2 nor 3'. For this, see the answer by jpw.
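Both formulations can be compared on a small data set. This is a sketch assuming Python with sqlite3; the author names and book titles are invented for illustration, and the count-based query joins on the author name as well as on the counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Author (AName TEXT, btitle TEXT, position INTEGER)")
# Hypothetical rows: Ann is a lead author once, Bob and Cy never are.
conn.executemany(
    "INSERT INTO Author VALUES (?, ?, ?)",
    [
        ("Ann", "Book-1", 1),
        ("Ann", "Book-2", 2),
        ("Bob", "Book-1", 2),
        ("Bob", "Book-3", 3),
        ("Cy", "Book-2", 3),
    ],
)

# Count-based: all of an author's books must be in position 2 or 3.
by_counts = [
    row[0]
    for row in conn.execute(
        """
        SELECT X.AName
        FROM (SELECT AName, COUNT(*) AS BookCount
              FROM Author GROUP BY AName) AS X
        JOIN (SELECT AName, COUNT(*) AS NonLeadAuthorCount
              FROM Author WHERE position IN (2, 3) GROUP BY AName) AS Y
          ON X.AName = Y.AName AND X.BookCount = Y.NonLeadAuthorCount
        ORDER BY X.AName
        """
    )
]

# Set-based: authors in position 2 or 3 minus authors with any other position.
by_except = [
    row[0]
    for row in conn.execute(
        """
        SELECT AName FROM Author WHERE position IN (2, 3)
        EXCEPT
        SELECT AName FROM Author WHERE position NOT IN (2, 3)
        ORDER BY AName
        """
    )
]
print(by_counts, by_except)
```

Ann is excluded by both queries because one of her two books lists her in position 1.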
Trying to write standard SQL:
SELECT AName FROM (
SELECT
AName,
COUNT(*) AS count_all,
(SELECT COUNT(*) FROM Author AS aa WHERE aa.AName = a.AName AND position=2) AS count_2,
(SELECT COUNT(*) FROM Author AS aa WHERE aa.AName = a.AName AND position=3) AS count_3
FROM Author AS a
GROUP BY AName
) AS t
WHERE count_all = count_2
OR count_all = count_3
I hope this works for you.
Try this:
select AName from Author where position=2 OR position=3 group by AName;
Try adding
and AName not in (select AName from Author where position != 2 and position != 3)
or something like that...