Doing a cross pivot in Google BigQuery - sql

I have asked a previous question about doing a multi-level aggregation query on the X-axis here: Get the top patent countries, codes in a BQ public dataset.
Here is the query (copied from the accepted answer), which returns the top 2 countries by count and, within those countries, the top 2 codes by count:
WITH A AS (
SELECT country_code
FROM `patents-public-data.patents.publications`
GROUP BY country_code
ORDER BY COUNT(1) DESC
LIMIT 2
), B AS (
SELECT
country_code,
application_kind,
COUNT(1) application_kind_count
FROM `patents-public-data.patents.publications`
WHERE country_code IN (SELECT country_code FROM A)
GROUP BY country_code, application_kind
), C AS (
SELECT
country_code,
application_kind,
application_kind_count,
DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
FROM B
)
SELECT
country_code,
application_kind,
application_kind_count
FROM C
WHERE application_kind_rank <= 2
And I get something like:
country_code application_kind count
JP A 125
JP U 124
CN A 118
CN U 101
Now I would like to add a pivot on the y-axis, to get the following:
X: Top 2 Countries by Count, and within those countries, top 2 Codes by Count
Y: Top 2 family_id by Count, Top 2 priority_date by Count
The final results would then look like:
I am able to build the Y-query in a second query:
WITH A AS (
SELECT family_id
FROM `patents-public-data.patents.publications`
GROUP BY family_id
ORDER BY COUNT(1) DESC
LIMIT 2
), B AS (
SELECT
family_id,
priority_date,
COUNT(1) priority_date_count
FROM `patents-public-data.patents.publications`
WHERE family_id IN (SELECT family_id FROM A)
GROUP BY family_id, priority_date
), C AS (
SELECT
family_id,
priority_date,
priority_date_count,
DENSE_RANK() OVER(PARTITION BY family_id ORDER BY priority_date_count DESC) AS priority_date_rank
FROM B
)
SELECT
family_id,
priority_date,
priority_date_count
FROM C
WHERE priority_date_rank <= 2
However, I am not quite sure how to merge them together, in a single query or in two.

Below is for BigQuery Standard SQL. It is just a demo of the approach and does not claim to implement the requested logic 100%.
WITH A_X AS (
SELECT country_code FROM `patents-public-data.patents.publications`
GROUP BY country_code ORDER BY COUNT(1) DESC LIMIT 2
), B_X AS (
SELECT country_code, application_kind, COUNT(1) application_kind_count
FROM `patents-public-data.patents.publications` WHERE country_code IN (SELECT country_code FROM A_X)
GROUP BY country_code, application_kind
), C_X AS (
SELECT country_code, application_kind, application_kind_count,
DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
FROM B_X
), X AS (
SELECT country_code, application_kind, application_kind_count
FROM C_X WHERE application_kind_rank <= 2
), A_Y AS (
SELECT family_id FROM `patents-public-data.patents.publications`
JOIN X USING(country_code, application_kind)
GROUP BY family_id
ORDER BY COUNT(1) DESC LIMIT 2
), B_Y AS (
SELECT family_id, priority_date, COUNT(1) priority_date_count
FROM `patents-public-data.patents.publications` WHERE family_id IN (SELECT family_id FROM A_Y)
GROUP BY family_id, priority_date
), C_Y AS (
SELECT family_id, priority_date, priority_date_count,
DENSE_RANK() OVER(PARTITION BY family_id ORDER BY priority_date_count DESC) AS pos_date
FROM B_Y
), Y AS (
SELECT family_id, priority_date, pos_date, DENSE_RANK() OVER(ORDER BY family_id) pos_family
FROM C_Y WHERE pos_date <= 2
)
SELECT country_code, application_kind,
COUNTIF(pos_family = 1 AND pos_date = 1) `family1_date1`,
COUNTIF(pos_family = 1 AND pos_date = 2) `family1_date2`,
COUNTIF(pos_family = 2 AND pos_date = 1) `family2_date1`,
COUNTIF(pos_family = 2 AND pos_date = 2) `family2_date2`
FROM `patents-public-data.patents.publications`
JOIN Y USING(family_id, priority_date)
WHERE country_code IN (SELECT country_code FROM X)
AND application_kind IN (SELECT application_kind FROM X)
GROUP BY country_code, application_kind
the result is
Obviously, the result contains a number of zeroes because of the intersection logic.
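The pivot step in the final SELECT is plain conditional aggregation: one counting expression per output column. A minimal sketch of just that step follows, assuming SQLite (through Python's sqlite3 module) as a stand-in for BigQuery, so COUNTIF(...) is written as SUM(CASE ...). The table `ranked` and its rows are made up for illustration and are not taken from the patents dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-ranked rows: one row per publication, already tagged with
# the rank of its family (pos_family) and of its priority date (pos_date).
conn.execute(
    "CREATE TABLE ranked (country_code TEXT, application_kind TEXT, "
    "pos_family INTEGER, pos_date INTEGER)"
)
conn.executemany(
    "INSERT INTO ranked VALUES (?, ?, ?, ?)",
    [
        ("JP", "A", 1, 1), ("JP", "A", 1, 1), ("JP", "A", 2, 1),
        ("JP", "U", 1, 2),
        ("CN", "A", 2, 2), ("CN", "A", 2, 2),
    ],
)

# SUM(CASE ...) plays the role of BigQuery's COUNTIF: each (family, date)
# combination becomes its own column, grouped by the X-axis keys.
pivot_sql = """
SELECT country_code, application_kind,
       SUM(CASE WHEN pos_family = 1 AND pos_date = 1 THEN 1 ELSE 0 END) AS family1_date1,
       SUM(CASE WHEN pos_family = 1 AND pos_date = 2 THEN 1 ELSE 0 END) AS family1_date2,
       SUM(CASE WHEN pos_family = 2 AND pos_date = 1 THEN 1 ELSE 0 END) AS family2_date1,
       SUM(CASE WHEN pos_family = 2 AND pos_date = 2 THEN 1 ELSE 0 END) AS family2_date2
FROM ranked
GROUP BY country_code, application_kind
ORDER BY country_code, application_kind
"""
pivot_rows = conn.execute(pivot_sql).fetchall()
for row in pivot_rows:
    print(row)
```

Combinations that never occur together show up as 0, which is the same effect as the zeroes mentioned above.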

Related

How to get rid of multiple branch_id?

I have a SQL Query of this:
SELECT
COUNT(PERMISSION_ID) AS USER_TOTAL_PERMISSION_PER_BRANCH,
USER_ID,
BRANCH_ID
FROM BRANCH_PERMISSION_USER
GROUP BY USER_ID, BRANCH_ID
ORDER BY USER_ID, USER_TOTAL_PERMISSION_PER_BRANCH DESC
But I have a problem because I only want the first row per USER_ID. The main goal is to get the list of users, each together with the branch that has the top USER_TOTAL_PERMISSION_PER_BRANCH for that user.
Here is the sample output:
Expected output should be:
[USER_TOTAL_PERMISSION_PER_BRANCH][USER_ID][BRANCH_ID]
135 1 1
135 2 1
134 3 1
1 4 1
1 5 1
1 6 1
You can use window functions:
SELECT USER_TOTAL_PERMISSION_PER_BRANCH, USER_ID, BRANCH_ID
FROM (SELECT COUNT(*) AS USER_TOTAL_PERMISSION_PER_BRANCH,
USER_ID, BRANCH_ID,
ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY COUNT(*) DESC) as seqnum
FROM BRANCH_PERMISSION_USER
GROUP BY USER_ID, BRANCH_ID
) ub
WHERE seqnum = 1
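To see the window-function approach run end to end, here is a small sketch using SQLite through Python's sqlite3 module (an assumption; the question does not name the database). The rows are invented, the aggregation is moved into a CTE for portability, and BRANCH_ID is added as a tie-breaker so the output is deterministic; neither change is part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BRANCH_PERMISSION_USER "
    "(USER_ID INTEGER, BRANCH_ID INTEGER, PERMISSION_ID INTEGER)"
)
# Made-up sample: user 1 has 3 permissions in branch 1 and 1 in branch 2, etc.
conn.executemany(
    "INSERT INTO BRANCH_PERMISSION_USER VALUES (?, ?, ?)",
    [
        (1, 1, 10), (1, 1, 11), (1, 1, 12), (1, 2, 10),
        (2, 3, 10), (2, 3, 11), (2, 1, 10),
        (3, 2, 10),
    ],
)

top_branch_sql = """
WITH per_branch AS (
    SELECT USER_ID, BRANCH_ID, COUNT(*) AS USER_TOTAL_PERMISSION_PER_BRANCH
    FROM BRANCH_PERMISSION_USER
    GROUP BY USER_ID, BRANCH_ID
), numbered AS (
    SELECT USER_TOTAL_PERMISSION_PER_BRANCH, USER_ID, BRANCH_ID,
           ROW_NUMBER() OVER (
               PARTITION BY USER_ID
               ORDER BY USER_TOTAL_PERMISSION_PER_BRANCH DESC, BRANCH_ID
           ) AS seqnum
    FROM per_branch
)
SELECT USER_TOTAL_PERMISSION_PER_BRANCH, USER_ID, BRANCH_ID
FROM numbered
WHERE seqnum = 1          -- keep only the best branch per user
ORDER BY USER_ID
"""
top_branch_rows = conn.execute(top_branch_sql).fetchall()
print(top_branch_rows)
```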
You can turn your query into a CTE and do the filtering with a correlated subquery:
with cte as (
select
count(permission_id) as user_total_permission_per_branch,
user_id,
branch_id
from branch_permission_user
group by user_id, branch_id
)
select c.*
from cte c
where c.user_total_permission_per_branch = (
select max(c1.user_total_permission_per_branch)
from cte c1
where c1.user_id = c.user_id
)
Thanks to @Gordon.
I used his logic. Here it is:
SELECT USER_TOTAL_PERMISSION_PER_BRANCH, USER_ID, BRANCH_ID
FROM (SELECT COUNT(*) AS USER_TOTAL_PERMISSION_PER_BRANCH,
USER_ID, BRANCH_ID,
ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY COUNT(*) DESC) as seqnum
FROM BRANCH_PERMISSION_USER
GROUP BY USER_ID, BRANCH_ID
) ub
WHERE seqnum = 1

find total unique number of hackers who made at least one submission every day and find the hacker_id who made maximum number of submissions each day

Find the total number of unique hackers who made at least one submission each day (starting on the first day of the contest), and find the hacker_id and name of the hacker who made the maximum number of submissions each day. If more than one such hacker has the maximum number of submissions, print the lowest hacker_id. The query should print this information for each day of the contest, sorted by the date.
Here is the sample data:
Hackers table:
15758 Rose
20703 Angela
36396 Frank
38289 Patrick
44065 Lisa
53473 Kimberly
62529 Bonnie
79722 Michael
Submissions table:
Submission_date submission_id hacker_id score
3/1/2016 8494 20703 0
3/1/2016 22403 53473 15
3/1/2016 23965 79722 60
3/1/2016 30173 36396 70
3/2/2016 34928 20703 0
3/2/2016 38740 15758 60
3/2/2016 42769 79722 25
3/2/2016 44364 79722 60
3/3/2016 45440 20703 0
3/3/2016 49050 36396 70
3/3/2016 50273 79722 5
3/4/2016 50344 20703 0
3/4/2016 51360 44065 90
3/4/2016 54404 53473 65
3/4/2016 61533 79722 45
3/5/2016 72852 20703 0
3/5/2016 74546 38289 0
3/5/2016 76487 62529 0
3/5/2016 82439 36396 10
3/5/2016 90006 36396 40
3/6/2016 90404 20703 0
For the above data, the expected result is:
2016-03-01 4 20703 Angela
2016-03-02 2 79722 Michael
2016-03-03 2 20703 Angela
2016-03-04 2 20703 Angela
2016-03-05 1 36396 Frank
2016-03-06 1 20703 Angela
My query below doesn't give me unique hacker_ids:
select submission_date, cnt, hacker_id, name from
(select s.submission_date
, count(s.hacker_id) over(partition by s.submission_date) cnt
, row_number() over(partition by s.submission_date order by s.hacker_id asc) rn
, s.hacker_id, h.name from submissions s
inner join hackers h on h.hacker_id = s.hacker_id) as tble
where tble.rn = 1;
How do I get the unique hacker_ids in the above results?
For MS SQL
with MaxSubEachDay as (
select submission_date,
hacker_id,
RANK() OVER(partition by submission_date order by SubCount desc, hacker_id) as Rn
FROM
(select submission_date, hacker_id, count(1) as SubCount
from submissions
group by submission_date, hacker_id
) subQuery
), DayWiseRank as (
select submission_date,
hacker_id,
DENSE_RANK() OVER(order by submission_date) as dayRn
from submissions
), HackerCntTillDate as (
select outtr.submission_date,
outtr.hacker_id,
case when outtr.submission_date='2016-03-01' then 1
else 1+(select count(distinct a.submission_date) from submissions a where a.hacker_id = outtr.hacker_id and a.submission_date<outtr.submission_date)
end as PrevCnt,
outtr.dayRn
from DayWiseRank outtr
), HackerSubEachDay as (
select submission_date,
count(distinct hacker_id) HackerCnt
from HackerCntTillDate
where PrevCnt = dayRn
group by submission_date
)
select HackerSubEachDay.submission_date,
HackerSubEachDay.HackerCnt,
MaxSubEachDay.hacker_id,
Hackers.name
from HackerSubEachDay
inner join MaxSubEachDay
on HackerSubEachDay.submission_date = MaxSubEachDay.submission_date
inner join Hackers
on Hackers.hacker_id = MaxSubEachDay.hacker_id
where MaxSubEachDay.Rn=1
You can use two levels of aggregation:
select s.submission_date, count(*) as num_hackers, sum(cnt) as num_hacks,
max(case when seqnum = 1 then h.hacker_id end) as hacker_id,
max(case when seqnum = 1 then h.name end) as name
from (select s.submission_date, s.hacker_id, count(*) as cnt,
row_number() over(partition by s.submission_date order by count(*) desc) as seqnum
from submissions s
group by s.submission_date, s.hacker_id
) s join
hackers h
on h.hacker_id = s.hacker_id
group by s.submission_date;
Note that the subquery is aggregating by the date and hacker_id, so there is one row per hacker_id on each date. The count(*) in the outer query is counting these rows, which is the number of hackers. I included the count for the number of hacks.
EDIT:
I realize that you can do an additional analytic function in the subquery and that will simplify the logic a bit:
select s.submission_date, s.num_hackers, num_hacks,
h.hacker_id, h.name
from (select s.submission_date, s.hacker_id, count(*) as cnt,
sum(count(*)) over (partition by s.submission_date) as num_hacks,
count(*) over (partition by s.submission_date) as num_hackers,
row_number() over(partition by s.submission_date order by count(*) desc) as seqnum
from submissions s
group by s.submission_date, s.hacker_id
) s join
hackers h
on h.hacker_id = s.hacker_id
where seqnum = 1;
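Here is a runnable sketch of the two-level aggregation idea, assuming SQLite via Python's sqlite3 module rather than the asker's database. It loads only the first three days of the sample data (dates rewritten in ISO form), splits the inner aggregation into CTEs, and adds hacker_id as a tie-breaker so that the lowest id wins on ties. Note that num_hackers here is the number of hackers active on each day, not the "submitted every day so far" count from the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hackers (hacker_id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE submissions "
    "(submission_date TEXT, submission_id INTEGER, hacker_id INTEGER, score INTEGER)"
)
conn.executemany("INSERT INTO hackers VALUES (?, ?)", [
    (15758, "Rose"), (20703, "Angela"), (36396, "Frank"),
    (53473, "Kimberly"), (79722, "Michael"),
])
conn.executemany("INSERT INTO submissions VALUES (?, ?, ?, ?)", [
    ("2016-03-01", 8494, 20703, 0), ("2016-03-01", 22403, 53473, 15),
    ("2016-03-01", 23965, 79722, 60), ("2016-03-01", 30173, 36396, 70),
    ("2016-03-02", 34928, 20703, 0), ("2016-03-02", 38740, 15758, 60),
    ("2016-03-02", 42769, 79722, 25), ("2016-03-02", 44364, 79722, 60),
    ("2016-03-03", 45440, 20703, 0), ("2016-03-03", 49050, 36396, 70),
    ("2016-03-03", 50273, 79722, 5),
])

daily_sql = """
WITH per_hacker AS (            -- level 1: one row per (date, hacker)
    SELECT submission_date, hacker_id, COUNT(*) AS cnt
    FROM submissions
    GROUP BY submission_date, hacker_id
), ranked AS (
    SELECT submission_date, hacker_id, cnt,
           ROW_NUMBER() OVER (PARTITION BY submission_date
                              ORDER BY cnt DESC, hacker_id) AS seqnum
    FROM per_hacker
)
SELECT r.submission_date,       -- level 2: one row per date
       COUNT(*)   AS num_hackers,
       SUM(r.cnt) AS num_hacks,
       MAX(CASE WHEN r.seqnum = 1 THEN h.hacker_id END) AS hacker_id,
       MAX(CASE WHEN r.seqnum = 1 THEN h.name END)      AS name
FROM ranked r
JOIN hackers h ON h.hacker_id = r.hacker_id
GROUP BY r.submission_date
ORDER BY r.submission_date
"""
daily_rows = conn.execute(daily_sql).fetchall()
for row in daily_rows:
    print(row)
```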
select big_1.submission_date, big_1.hkr_cnt, big_2.hacker_id, h.name
from
(select submission_date, count(distinct hacker_id) as hkr_cnt
from
(select s.*
, dense_rank() over(order by submission_date) as date_rank
--, row_number() over(order by submission_date) as rn_date_rank
,dense_rank() over(partition by hacker_id order by submission_date) as hacker_rank
--,row_number() over(partition by hacker_id order by submission_date) as rn_hacker_rank
from submissions s ) a
where a.date_rank = a.hacker_rank
group by submission_date) big_1
join
(select submission_date,hacker_id,
rank() over(partition by submission_date order by sub_cnt desc, hacker_id) as max_rank
from (select submission_date, hacker_id, count(*) as sub_cnt
from submissions
group by submission_date, hacker_id) b ) big_2
on big_1.submission_date = big_2.submission_date and big_2.max_rank = 1
join hackers h on h.hacker_id = big_2.hacker_id
order by 1 ;
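The query above counts "hackers who submitted every day so far" by comparing two dense ranks: the rank of the date overall and the rank of the date within each hacker's own history. They are equal only when the hacker has not skipped a day. A small sketch of just that part, assuming SQLite via Python's sqlite3 module and using the first three days of the sample data (dates rewritten in ISO form):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions "
    "(submission_date TEXT, submission_id INTEGER, hacker_id INTEGER, score INTEGER)"
)
conn.executemany("INSERT INTO submissions VALUES (?, ?, ?, ?)", [
    ("2016-03-01", 8494, 20703, 0), ("2016-03-01", 22403, 53473, 15),
    ("2016-03-01", 23965, 79722, 60), ("2016-03-01", 30173, 36396, 70),
    ("2016-03-02", 34928, 20703, 0), ("2016-03-02", 38740, 15758, 60),
    ("2016-03-02", 42769, 79722, 25), ("2016-03-02", 44364, 79722, 60),
    ("2016-03-03", 45440, 20703, 0), ("2016-03-03", 49050, 36396, 70),
    ("2016-03-03", 50273, 79722, 5),
])

streak_sql = """
SELECT submission_date, COUNT(DISTINCT hacker_id) AS hkr_cnt
FROM (
    SELECT submission_date, hacker_id,
           -- 1 for the first contest day, 2 for the second, ...
           DENSE_RANK() OVER (ORDER BY submission_date) AS date_rank,
           -- 1 for this hacker's first active day, 2 for the second, ...
           DENSE_RANK() OVER (PARTITION BY hacker_id
                              ORDER BY submission_date) AS hacker_rank
    FROM submissions
) a
WHERE date_rank = hacker_rank   -- no gap in this hacker's history so far
GROUP BY submission_date
ORDER BY submission_date
"""
streak_rows = conn.execute(streak_sql).fetchall()
print(streak_rows)
```

On 2016-03-03, hacker 36396 has hacker_rank 2 but date_rank 3 (no submission on 2016-03-02), so that row is filtered out, which matches the counts in the expected output for these days.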
select tt.submission_date,tt.hacker_count,ts.hacker_id,ts.name
from
(select t2.submission_date,count(t2.hacker_rank) as hacker_count from
(
select submission_date,count(distinct(hacker_id)) as hacker_count,
dense_rank() over(order by submission_date) as date_rank,
dense_rank() over(partition by hacker_id order by submission_date) as
hacker_rank
from submissions
group by submission_date,hacker_id
) as t2
where t2.hacker_rank = t2.date_rank
group by submission_date
) as tt
join (
select t1.submission_date,t1.hacker_id,t1.name
from (
select s.submission_date,count(s.hacker_id) as
count_hacker_id,s.hacker_id,h.name,
ROW_NUMBER() over(PARTITION BY s.submission_date order by count(*) desc)
as seqnum
from submissions s
left join hackers h
on h.hacker_id = s.hacker_id
group by s.submission_date,s.hacker_id,h.name
) as t1
where t1.seqnum = 1 ) as ts on ts.submission_date = tt.submission_date;

Find top N most frequent categories with top N most frequent sub-categories for each category

I'm trying to make a single query that will retrieve:
The top e.g. 3 most popular brands from a list of cars. For each of the top 3 brands I want to retrieve the top 5 most popular models.
I tried both a ranking/partitioning strategy and a DISTINCT ON strategy, but I cannot figure out how to get the limits to work across the two levels.
Here is some sample data: http://sqlfiddle.com/#!15/1e81d5/1
From the ranking query I would expect an output like this, given the sample data (order not important):
brand car_model count
'Audi' 'A4' 3
'Audi' 'A1' 3
'Audi' 'Q7' 2
'Audi' 'Q5' 2
'Audi' 'A3' 2
'VW' 'Passat' 3
'VW' 'Beetle' 3
'VW' 'Caravelle' 2
'VW' 'Golf' 2
'VW' 'Fox' 2
'Volvo' 'V70' 3
'Volvo' 'V40' 3
'Volvo' 'S60' 2
'Volvo' 'XC70' 2
'Volvo' 'V50' 2
Turns out I could use LATERAL join as suggested in comments. Thanks.
SELECT brand, car_model, the_count
FROM
(
SELECT brand FROM cars GROUP BY brand ORDER BY COUNT(*) DESC LIMIT 3
) o1
INNER JOIN LATERAL
(
SELECT car_model, count(*) as the_count
FROM cars
WHERE brand = o1.brand
GROUP BY brand, car_model
ORDER BY count(*) DESC LIMIT 5
) o2 ON true;
http://sqlfiddle.com/#!15/1e81d5/9
You can try using a CTE and the window function row_number():
with cte as
(
select brand,car_model,count(*) as cnt from cars group by brand,car_model
) , cte2 as
(
select * ,row_number() over(partition by brand order by cnt desc) rn from cte
)
select brand,car_model,cnt from cte2 where rn<=5
You can use window functions for this:
select brand, car_model, cnt_car
from (select c.*, dense_rank() over (order by cnt_brand desc, brand) as seqnum_b
from (select brand, car_model, count(*) as cnt_car,
row_number() over (partition by brand order by count(*) desc) as seqnum_bc,
sum(count(*)) over (partition by brand) as cnt_brand
from cars c
group by brand, car_model
) c
) c
where seqnum_bc <= 5 and seqnum_b <= 3
order by cnt_brand desc, brand, cnt_car desc;
If you know that each brand (or at least each top brand) has at least five car models, then you can simplify the query to:
select brand, car_model, cnt_car
from (select brand, car_model, count(*) as cnt_car,
row_number() over (partition by brand order by count(*) desc) as seqnum_bc,
sum(count(*)) over (partition by brand) as cnt_brand
from cars c
group by brand, car_model
) c
where seqnum_bc <= 5
order by cnt_brand desc, brand, cnt_car desc
limit 15
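For anyone who wants to try the window-function variant without a fiddle, here is a scaled-down sketch (top 2 brands, top 2 models each) assuming SQLite via Python's sqlite3 module instead of Postgres. The car rows are invented, and the steps are split into CTEs with car_model added as a tie-breaker; both are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (brand TEXT, car_model TEXT)")
# Invented sample: Audi has 6 cars, VW 5, Volvo 2.
sample = (
    [("Audi", "A4")] * 3 + [("Audi", "A1")] * 2 + [("Audi", "Q7")]
    + [("VW", "Golf")] * 2 + [("VW", "Passat")] * 2 + [("VW", "Fox")]
    + [("Volvo", "V70"), ("Volvo", "V40")]
)
conn.executemany("INSERT INTO cars VALUES (?, ?)", sample)

top_sql = """
WITH model_counts AS (
    SELECT brand, car_model, COUNT(*) AS cnt_car
    FROM cars
    GROUP BY brand, car_model
), ranked AS (
    SELECT brand, car_model, cnt_car,
           SUM(cnt_car) OVER (PARTITION BY brand) AS cnt_brand,
           ROW_NUMBER() OVER (PARTITION BY brand
                              ORDER BY cnt_car DESC, car_model) AS seqnum_bc
    FROM model_counts
), brand_ranked AS (
    SELECT ranked.*,
           -- all rows of one brand share the same rank
           DENSE_RANK() OVER (ORDER BY cnt_brand DESC, brand) AS seqnum_b
    FROM ranked
)
SELECT brand, car_model, cnt_car
FROM brand_ranked
WHERE seqnum_b <= 2 AND seqnum_bc <= 2
ORDER BY cnt_brand DESC, brand, cnt_car DESC, car_model
"""
top_rows = conn.execute(top_sql).fetchall()
for row in top_rows:
    print(row)
```

Ranking brands by descending cnt_brand is what makes seqnum_b <= 2 pick the most popular brands; Volvo is dropped here even though it has two models.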

Avoid Unions to get TOP count

Here are two tables:
LocationId Address City State Zip
1 2100, 1st St Austin TX 76819
2 2200, 2nd St Austin TX 76829
3 2300, 3rd St Austin TX 76839
4 2400, 4th St Austin TX 76849
5 2500, 5th St Austin TX 76859
6 2600, 6th St Austin TX 76869
TripId PassengerId FromLocationId ToLocationId
1 746896 1 2
2 746896 2 1
3 234456 1 3
4 234456 3 1
5 234456 1 4
6 234456 4 1
7 234456 1 6
8 234456 6 1
9 746896 1 2
10 746896 2 1
11 746896 1 2
12 746896 2 1
I want the TOP 5 locations that each passenger has traveled to (it does not matter whether it is the from or the to location). I can get it using a UNION, but was wondering if there is a better way to do this.
My Solution:
select top 5 *
from
(select count(l.LocationId) as cnt, l.LocationId, l.Address1, l.Address2, l.City, l.State, l.Zip
from
Trip t
join LOCATION l on t.FromLocationId = l.LocationId
where t.PassengerId = 746896
group by l.LocationId, l.Address1, l.Address2, l.City, l.State, l.Zip
UNION
select count(l.LocationId) as cnt, l.LocationId, l.Address1, l.Address2, l.City, l.State, l.Zip
from
Trip t
join LOCATION l on t.ToLocationId = l.LocationId
where t.PassengerId = 746896
group by l.LocationId, l.Address1, l.Address2, l.City, l.State, l.Zip
) as tbl
order by cnt desc
This will give you the top 5 locations.
SELECT TOP 5 tmp.fromlocationid AS locationid,
Count(tmp.fromlocationid) AS Times
FROM (SELECT fromlocationid
FROM trip
UNION ALL
SELECT tolocationid
FROM trip) tmp
GROUP BY tmp.fromlocationid
ORDER BY Times DESC
Method 1: This will give you the top 5 locations of each passenger.
WITH cte AS
( SELECT passengerid,
locationid,
Count(locationid) AS Times,
Row_number() OVER(partition BY passengerid ORDER BY Count(locationid) DESC) AS RowNum
FROM (SELECT tripid, passengerid, fromlocationid AS locationid
FROM trip
UNION ALL
SELECT tripid, passengerid, tolocationid AS locationid
FROM trip) tmp
GROUP BY passengerid, locationid )
SELECT *
FROM cte
WHERE rownum <= 5
ORDER BY passengerid, Times DESC
Method 2: Same result without the UNION operator (top 5 locations of each passenger)
WITH cte AS
( SELECT passengerid,
locationid,
Count(locationid) AS Times,
Row_number() OVER(partition BY passengerid ORDER BY Count(locationid) DESC) AS RowNum
FROM trip
UNPIVOT ( locationid
FOR subject IN (fromlocationid, tolocationid) ) u
GROUP BY passengerid, locationid )
SELECT *
FROM cte
WHERE rownum <= 5
ORDER BY passengerid, times DESC
If you also want to get the location details, you can simply join the location table.
SELECT cte.* , location.*
FROM cte
INNER JOIN location ON location.locationid = cte.locationid
WHERE rownum <= 5
ORDER BY passengerid, times DESC
Reference
- https://stackoverflow.com/a/19056083/6327676
You'll need to replace the SELECT *s with the columns you need; however, something like this should work:
WITH Visits AS (
SELECT *,
COUNT(*) OVER (PARTITION BY t.PassengerID, L.LocationID) AS Visits
FROM Trip T
JOIN [Location] L ON T.FromLocationId = L.LocationId),
Rankings AS (
SELECT *,
DENSE_RANK() OVER (PARTITION BY V.PassengerID ORDER BY Visits DESC) AS Ranking
FROM Visits V)
SELECT *
FROM Rankings
WHERE Ranking <= 5;
Further simplified solution
select top 3 * from
(
Select distinct count(locationId) as cnt, locationId from trip
unpivot
(
locationId
for direction in (fromLocationId, toLocationId)
)u
where passengerId IN (746896, 234456)
group by direction, locationId
)as tbl2
order by cnt desc;
Solution combining columns
The main issue for me is avoiding union to combine the two columns.
The UNPIVOT command can do this.
select top 3 * from (
select count(locationId) cnt, locationId
from
(
Select valu as locationId, passengerId from trip
unpivot
(
valu
for loc in (fromLocationId, toLocationId)
)u
)united
where passengerId IN (746896, 234456)
group by locationId
) as tbl
order by cnt desc;
http://sqlfiddle.com/#!18/cec8b/136
If you want to get the counts by direction:
select top 3 * from (
select count(locationId) cnt, locationId, direction
from
(
Select valu as locationId, direction, passengerId from trip
unpivot
(
valu
for direction in (fromLocationId, toLocationId)
)u
)united
where passengerId IN (746896, 234456)
group by locationId, direction
) as tbl
order by cnt desc;
http://sqlfiddle.com/#!18/cec8b/139
Same Results as you ( minus some minor descriptions )
select top 3 * from
(
select distinct * from (
select count(locationId) cnt, locationId
from
(
Select valu as locationId, direction, passengerId from trip
unpivot
(
valu
for direction in (fromLocationId, toLocationId)
)u
)united
where passengerId IN (746896, 234456)
group by locationId, direction
) as tbl
)as tbl2
order by cnt desc;
You can do this without union all:
select top (5) t.passengerid, v.locationid, count(*)
from trip t cross apply
(values (fromlocationid), (tolocationid)) v(locationid) join
location l
on v.locationid = l.locationid
where t.PassengerId = 746896
group by t.passengerid, v.locationid
order by count(*) desc;
If you want an answer for all passengers, it would be a similar idea, using row_number(), but your query suggests you want the answer only for one customer at a time.
You can include additional fields from location as well.
Here is a SQL Fiddle.
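For the all-passengers case mentioned above, a rough sketch of the row_number() idea follows. It assumes SQLite via Python's sqlite3 module, which has neither CROSS APPLY nor UNPIVOT, so the two location columns are unpivoted by joining each trip to a two-row VALUES list (named `side` here, a made-up name). It uses the sample trips from the question and keeps the top 2 locations per passenger to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trip (tripid INTEGER, passengerid INTEGER, "
    "fromlocationid INTEGER, tolocationid INTEGER)"
)
conn.executemany("INSERT INTO trip VALUES (?, ?, ?, ?)", [
    (1, 746896, 1, 2), (2, 746896, 2, 1), (3, 234456, 1, 3), (4, 234456, 3, 1),
    (5, 234456, 1, 4), (6, 234456, 4, 1), (7, 234456, 1, 6), (8, 234456, 6, 1),
    (9, 746896, 1, 2), (10, 746896, 2, 1), (11, 746896, 1, 2), (12, 746896, 2, 1),
])

per_passenger_sql = """
WITH side(name) AS (VALUES ('from'), ('to')),
visits AS (
    -- each trip row is emitted twice: once per location column
    SELECT t.passengerid,
           CASE side.name WHEN 'from' THEN t.fromlocationid
                          ELSE t.tolocationid END AS locationid
    FROM trip t CROSS JOIN side
), counted AS (
    SELECT passengerid, locationid, COUNT(*) AS times
    FROM visits
    GROUP BY passengerid, locationid
), ranked AS (
    SELECT passengerid, locationid, times,
           ROW_NUMBER() OVER (PARTITION BY passengerid
                              ORDER BY times DESC, locationid) AS rn
    FROM counted
)
SELECT passengerid, locationid, times
FROM ranked
WHERE rn <= 2
ORDER BY passengerid, rn
"""
per_passenger_rows = conn.execute(per_passenger_sql).fetchall()
for row in per_passenger_rows:
    print(row)
```

In SQL Server the `visits` step would be the CROSS APPLY (VALUES ...) from the answer above; the counting and ranking steps stay the same.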

Select Max two rows of each account SQL Server

I have this table
ID AGE ACCNUM NAME
--------------------------------
1 10 55409 Intro
2 6 55409 Chapter1
3 4 55409 Chapter2
4 3 69591 Intro
5 6 69591 Outro
6 0 40322 Intro
And I need a query that returns the two rows with the highest AGE for each ACCNUM; in this case, records:
1, 2, 4, 5, 6
I have tried many queries but nothing works for me.
I tried this query
Select
T1.accnum, T1.age
from
table1 as T1
inner join
(select
accnum, max(age) as max
from table1
group by accnum) as T2 on T1.accnum = T2.accnum
and (T1.age = T2.max or T1.age = T2.max -1)
TSQL Ranking Functions: Row_Number() https://msdn.microsoft.com/en-us/library/ms186734.aspx
select id, age, accnum, name
from
(
select id, age, accnum, name, ROW_NUMBER() Over (Partition By accnum order by age desc) as rn
from yourtable
) a
where a.rn <= 2
You can use row_number():
select accnum
, age
from ( select accnum
, age
, row_number() over(partition by accnum order by age desc) as r
from table1 as T1) t where r < 3
CODE:
WITH CTE AS (SELECT ID, AGE, ACCNUM, NAME,
ROW_NUMBER() OVER(PARTITION BY ACCNUM ORDER BY AGE DESC) AS ROW_NUM
FROM T1)
SELECT ID, AGE, ACCNUM, NAME
FROM CTE
WHERE ROW_NUM <= 2
Uses a common table expression to achieve the desired result.
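A quick way to check the ROW_NUMBER() pattern against the sample table from the question, sketched with SQLite via Python's sqlite3 module (an assumption; the question targets SQL Server, but this part of the syntax is the same). The table name T1 follows the CTE answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (ID INTEGER, AGE INTEGER, ACCNUM INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO T1 VALUES (?, ?, ?, ?)", [
    (1, 10, 55409, "Intro"), (2, 6, 55409, "Chapter1"), (3, 4, 55409, "Chapter2"),
    (4, 3, 69591, "Intro"), (5, 6, 69591, "Outro"),
    (6, 0, 40322, "Intro"),
])

top_two_sql = """
WITH CTE AS (
    SELECT ID, AGE, ACCNUM, NAME,
           -- number the rows of each account from highest AGE to lowest
           ROW_NUMBER() OVER (PARTITION BY ACCNUM ORDER BY AGE DESC) AS ROW_NUM
    FROM T1
)
SELECT ID, AGE, ACCNUM, NAME
FROM CTE
WHERE ROW_NUM <= 2
ORDER BY ID
"""
top_two_rows = conn.execute(top_two_sql).fetchall()
top_two_ids = [row[0] for row in top_two_rows]
print(top_two_ids)
```

Record 3 is the only one dropped, since it is the third-highest AGE within account 55409.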