SQL - aggregation with column value as column name

For a table like the one below, I need an aggregation such that, for each unique value in one column, I get the count of occurrences of each discrete value in another column.
The input table is:
id | model | datetime | driver | distance
---|-----|------------|--------|---------
1 | S | 04/03/2009 | john | 399
2 | X | 04/03/2009 | juliet | 244
3 | 3 | 04/03/2009 | borat | 555
4 | 3 | 03/03/2009 | john | 300
5 | X | 03/03/2009 | juliet | 200
6 | X | 03/03/2009 | borat | 500
7 | S | 24/12/2008 | borat | 600
8 | X | 01/01/2009 | borat | 700
Output required
model | john | juliet | borat
-----|--------|-------|------
S | 1 | 0 | 1
X | 0 | 2 | 2
3 | 1 | 0 | 1
One potential way is to group by model with an aggregation like
SUM (CASE WHEN driver = 'value' THEN 1 ELSE 0 END) AS value for each discrete value of the driver column. The challenge is that sometimes there are too many discrete values (around 50 in my case), or in some cases I do not even know all the possible values in advance. Is there an alternative way to do this?

The aggregation part needs a little more work.
Here are the details:
First calculate all the possible combinations of driver and model.
Then use a LEFT JOIN to find which combinations have no data.
DEMO
WITH "allDrivers" as (
SELECT DISTINCT "driver"
FROM Table1
),
"allModels" as (
SELECT DISTINCT "model"
FROM Table1
),
"source" as (
SELECT d."driver", m."model"
FROM "allDrivers" d
CROSS JOIN "allModels" m
)
SELECT s."model", s."driver", COUNT(t."datetime")
FROM "source" s
LEFT JOIN table1 t
ON s."model" = t."model"
AND s."driver" = t."driver"
GROUP BY s."model", s."driver"
OUTPUT
| model | driver | count |
|-------|--------|-------|
| 3 | borat | 1 |
| 3 | john | 1 |
| 3 | juliet | 0 |
| S | borat | 1 |
| S | john | 1 |
| S | juliet | 0 |
| X | borat | 2 |
| X | john | 0 |
| X | juliet | 2 |
Then you can do the dynamic pivot
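The dynamic pivot mentioned above is not shown in the answer. Below is a minimal sketch of one way to do it, assuming the pivot is built in application code: the distinct drivers are read first, and one SUM(CASE ...) column is generated per driver. It runs against an in-memory SQLite database rather than the original engine, and the helper names quote_ident and pivot_counts are hypothetical.

```python
import sqlite3

# Sample rows from the question: (id, model, datetime, driver, distance)
rows = [
    (1, "S", "04/03/2009", "john", 399),
    (2, "X", "04/03/2009", "juliet", 244),
    (3, "3", "04/03/2009", "borat", 555),
    (4, "3", "03/03/2009", "john", 300),
    (5, "X", "03/03/2009", "juliet", 200),
    (6, "X", "03/03/2009", "borat", 500),
    (7, "S", "24/12/2008", "borat", 600),
    (8, "X", "01/01/2009", "borat", 700),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 "
    "(id INTEGER, model TEXT, datetime TEXT, driver TEXT, distance INTEGER)"
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", rows)


def quote_ident(name):
    """Quote a value so it can be used safely as a column alias."""
    return '"' + name.replace('"', '""') + '"'


def pivot_counts(conn):
    """Build and run one SUM(CASE ...) column per distinct driver."""
    drivers = [
        r[0]
        for r in conn.execute("SELECT DISTINCT driver FROM table1 ORDER BY driver")
    ]
    # The driver value is bound as a parameter; only the alias is interpolated.
    columns = ", ".join(
        f"SUM(CASE WHEN driver = ? THEN 1 ELSE 0 END) AS {quote_ident(d)}"
        for d in drivers
    )
    sql = f"SELECT model, {columns} FROM table1 GROUP BY model ORDER BY model"
    cur = conn.execute(sql, drivers)
    header = [col[0] for col in cur.description]
    return header, cur.fetchall()


header, result = pivot_counts(conn)
print(header)
for row in result:
    print(row)
```

Because the column list is generated from the data, the query does not need to change when a new driver appears. The trade-off is that the statement has to be built at run time.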

Related

SQL Count depending on certain conditions

I have two tables.
One has userid and email (users table). The other has payment information (payments table) for each userid in users.
users
+--------+------------+
| Userid | Name |
+--------+------------+
| 1 | Alex T |
| 2 | Jeremy T |
| 3 | Frederic A |
+--------+------------+
payments
+--------+-----------+------------+----------+
| Userid | ValuePaid | PaidMonths | Refunded |
+--------+-----------+------------+----------+
| 1 | 1 | 12 | null |
| 1 | 20 | 12 | null |
| 1 | 20 | 12 | null |
| 1 | 20 | 1 | null |
| 2 | 1 | 1 | null |
| 2 | 20 | 12 | 1 |
| 2 | 20 | 12 | null |
| 2 | 20 | 1 | null |
| 3 | 1 | 12 | null |
| 3 | 20 | 1 | 1 |
| 3 | 20 | 1 | null |
+--------+-----------+------------+----------+
I want to sum PaidMonths, taking the following rules into consideration:
If ValuePaid < 10, PaidMonths should count as 0.23 (regardless of the value stored in the column).
If Refunded = 1, PaidMonths should count as 0.
Based on this, when I join both tables on userid and sum PaidMonths according to the rules above, I expect to see this result:
+--------+------------+------------+
| userid | Name | paidMonths |
+--------+------------+------------+
| 1 | Alex T | 25.23 |
| 2 | Jeremy T | 13.23 |
| 3 | Frederic A | 1.23 |
+--------+------------+------------+
Can you help me to achieve this in the most elegant way? Should a temporary table be used?
The following gives your desired results, using apply with case expression to map your values:
select u.UserID, u.Name, Sum(pm) PaidMonths
from users u join payments p on p.userid=u.userid
cross apply (values(
case
when valuepaid <10 then 0.23
when Refunded=1 then 0
else PaidMonths end
))x(pm)
group by u.UserID, u.Name
See Working Fiddle
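CROSS APPLY is not available in every engine. As a rough sketch of the same idea without it, the CASE expression can go directly inside SUM. The example below assumes an in-memory SQLite database loaded with the sample data from the question. ROUND is added only to keep floating-point noise out of the totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (Userid INTEGER, Name TEXT);
CREATE TABLE payments (Userid INTEGER, ValuePaid INTEGER,
                       PaidMonths INTEGER, Refunded INTEGER);
INSERT INTO users VALUES (1, 'Alex T'), (2, 'Jeremy T'), (3, 'Frederic A');
INSERT INTO payments VALUES
  (1, 1, 12, NULL), (1, 20, 12, NULL), (1, 20, 12, NULL), (1, 20, 1, NULL),
  (2, 1, 1, NULL),  (2, 20, 12, 1),    (2, 20, 12, NULL), (2, 20, 1, NULL),
  (3, 1, 12, NULL), (3, 20, 1, 1),     (3, 20, 1, NULL);
""")

# Same rule order as the answer: the ValuePaid rule is checked first,
# then the refund rule, otherwise the stored PaidMonths is used.
query = """
SELECT u.Userid, u.Name,
       ROUND(SUM(CASE WHEN p.ValuePaid < 10 THEN 0.23
                      WHEN p.Refunded = 1  THEN 0
                      ELSE p.PaidMonths END), 2) AS PaidMonths
FROM users u
JOIN payments p ON p.Userid = u.Userid
GROUP BY u.Userid, u.Name
ORDER BY u.Userid
"""

totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```

No temporary table is needed here. Note that a row with both ValuePaid < 10 and Refunded = 1 counts as 0.23 under this ordering. Swap the two WHEN branches if a refund should win.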

SQL - Create number of categories based on pre-defined number of splits

I am using BigQuery, and trying to assign categorical values to each of my records, based on the number of 'splits' assigned to it.
The table has a cumulative count of records, grouped at the STR level - i.e., if there are 4 SKUs at 2 STR, the SKUs will be labeled 1,2,3,4. Each STR is assigned a SPLIT value, so if the STR has a SPLIT value of 2, I want it to split its SKUs into 2 categories. I want to create another column that would assign SKUs labeled 1-2 as '1', and SKUs labeled 3-4 as '2'. (The actual data is on a much larger scale, but thought this would be easier.)
+-----+------+---------------+--------+
| STR | SKU | SKU_ROW_COUNT | SPLITS |
+-----+------+---------------+--------+
| 1 | 1230 | 1 | 3 |
| 1 | 1231 | 2 | 3 |
| 1 | 1232 | 3 | 3 |
| 1 | 1233 | 4 | 3 |
| 1 | 1234 | 5 | 3 |
| 1 | 1235 | 6 | 3 |
| 2 | 1310 | 1 | 2 |
| 2 | 1311 | 2 | 2 |
| 2 | 1312 | 3 | 2 |
| 2 | 1313 | 4 | 2 |
| 3 | 2345 | 1 | 1 |
| 3 | 2346 | 2 | 1 |
| 3 | 2347 | 3 | 1 |
+-----+------+---------------+--------+
The SPLITS column is dynamic, ranging from 1 to 3. The number of SKUs in each category should be relatively equal, but that's not a priority as much as just the number of groups that are created. Ideally, the final table with the new column (HOST_NUMBER) would look something like this:
+-----+------+---------------+--------+-------------+
| STR | SKU | SKU_ROW_COUNT | SPLITS | HOST_NUMBER |
+-----+------+---------------+--------+-------------+
| 1 | 1230 | 1 | 3 | 1 |
| 1 | 1231 | 2 | 3 | 1 |
| 1 | 1232 | 3 | 3 | 2 |
| 1 | 1233 | 4 | 3 | 2 |
| 1 | 1234 | 5 | 3 | 3 |
| 1 | 1235 | 6 | 3 | 3 |
| 2 | 1310 | 1 | 2 | 1 |
| 2 | 1311 | 2 | 2 | 1 |
| 2 | 1312 | 3 | 2 | 2 |
| 2 | 1313 | 4 | 2 | 2 |
| 3 | 2345 | 1 | 1 | 1 |
| 3 | 2346 | 2 | 1 | 1 |
| 3 | 2347 | 3 | 1 | 1 |
+-----+------+---------------+--------+-------------+
You can use window functions and arithmetic:
select
t.*,
1 + floor((sku_row_count - 1) * splits / count(*) over(partition by str)) host_number
from mytable t
order by sku
Actually, ntile() seems to do exactly what you want - and you don't even need the sku_row_count column (which basically mimics row_number() anyway):
select
t.*,
ntile(splits) over(partition by str order by sku) host_number
from mytable t
order by sku
If the ordering of the values in the groups doesn't matter, just use modulo arithmetic:
select t.*, (SKU_ROW_COUNT % SPLITS) as split_group
from t
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, 1 + MOD(SKU_ROW_COUNT, SPLITS) AS HOST_NUMBER
FROM `project.dataset.table`
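To compare the approaches above, here is a small check of the arithmetic in plain Python (not BigQuery), using the sample rows from the question. The function names host_floor and host_mod are hypothetical. They mirror the floor-based expression and the 1 + MOD(...) expression respectively.

```python
from collections import Counter

# Sample rows from the question: (STR, SKU, SKU_ROW_COUNT, SPLITS)
rows = [
    (1, 1230, 1, 3), (1, 1231, 2, 3), (1, 1232, 3, 3),
    (1, 1233, 4, 3), (1, 1234, 5, 3), (1, 1235, 6, 3),
    (2, 1310, 1, 2), (2, 1311, 2, 2), (2, 1312, 3, 2), (2, 1313, 4, 2),
    (3, 2345, 1, 1), (3, 2346, 2, 1), (3, 2347, 3, 1),
]

# Number of SKUs per store, the equivalent of count(*) over(partition by str)
per_store = Counter(store for store, _, _, _ in rows)


def host_floor(row_count, splits, total):
    """1 + floor((row_count - 1) * splits / total): contiguous blocks."""
    return 1 + (row_count - 1) * splits // total


def host_mod(row_count, splits):
    """1 + MOD(row_count, splits): round-robin assignment."""
    return 1 + row_count % splits


floor_hosts = [host_floor(rc, sp, per_store[s]) for s, _, rc, sp in rows]
mod_hosts = [host_mod(rc, sp) for _, _, rc, sp in rows]

print(floor_hosts)
print(mod_hosts)
```

Both variants produce the requested number of groups per STR. Only the floor-based one keeps consecutive SKUs together, which is what the desired HOST_NUMBER column in the question shows.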

SQL aggregation over one column giving a result from another

I am trying (and failing) to join some tables in a SQLite database. The data itself is complicated but I think I have boiled it down to an illustrative example.
Here are the three tables I want to join.
Table: Events
+----+---------+-------+-----------+
| id | user_id | class | timestamp |
+----+---------+-------+-----------+
| 1 | 'user1' | 6 | 100 |
| 2 | 'user1' | 12 | 400 |
| 3 | 'user1' | 4 | 900 |
| 4 | 'user2' | 6 | 400 |
| 5 | 'user2' | 3 | 800 |
| 6 | 'user2' | 8 | 900 |
+----+---------+-------+-----------+
Table: Games
+---------+---------+------------+-----------+
| user_id | game_id | game_class | timestamp |
+---------+---------+------------+-----------+
| 'user1' | 1 | 'A' | 200 |
| 'user2' | 2 | 'A' | 300 |
| 'user1' | 3 | 'B' | 500 |
| 'user1' | 4 | 'A' | 600 |
| 'user1' | 5 | 'A' | 700 |
+---------+---------+------------+-----------+
Table: AScores
+---------+-------+
| game_id | score |
+---------+-------+
| 1 | 8 |
| 2 | 2 |
| 4 | 9 |
| 5 | 6 |
+---------+-------+
I would like to join these to provide an additional column on the first table containing the user's current score in game class A at the time of the event. That is, I would like the result of the join to look like this:
Desired Result
+----+----------+-------+-----------+-----------------+
| id | user_id | class | timestamp | current_a_score |
+----+----------+-------+-----------+-----------------+
| 1 | 'user1' | 6 | 100 | (null) |
| 2 | 'user1' | 12 | 400 | 8 |
| 3 | 'user1' | 4 | 900 | 6 |
| 4 | 'user2' | 6 | 400 | 2 |
| 5 | 'user2' | 3 | 800 | 2 |
| 6 | 'user2' | 8 | 900 | 2 |
+----+----------+-------+-----------+-----------------+
The following simple join pulls together the two tables AScores and Games.
SELECT * FROM AScores
INNER JOIN Games
ON AScores.game_id = Games.game_id
And so I was hoping to join this to the Events table as a sub-query. Something like this:
SELECT Events.*, AScoredGames.time_stamp AS game_time_stamp, AScoredGames.score
FROM Events
LEFT OUTER JOIN (
SELECT AScores.score, Games.* FROM AScores
INNER JOIN Games
ON AScores.game_id = Games.game_id
) AS AScoredGames
ON Events.user_id = AScoredGames.user_id
AND Events.time_stamp >= AScoredGames.time_stamp
ORDER BY Events.time_stamp ASC
That results in the following:
+----+---------+-------+------------+-----------------+-------+
| id | user_id | class | time_stamp | game_time_stamp | score |
+----+---------+-------+------------+-----------------+-------+
| 1 | user1 | 6 | 100 | NULL | NULL |
| 2 | user1 | 12 | 400 | 200 | 8 |
| 4 | user2 | 6 | 400 | 300 | 2 |
| 5 | user2 | 3 | 800 | 300 | 2 |
| 6 | user2 | 8 | 900 | 300 | 2 |
| 3 | user1 | 4 | 900 | 200 | 8 |
| 3 | user1 | 4 | 900 | 600 | 9 |
| 3 | user1 | 4 | 900 | 700 | 6 |
+----+---------+-------+------------+-----------------+-------+
So I need to group by Events.id to get rid of the triplicated row with Events.id 3. But what I want to do is to choose the row with the maximum game_time_stamp but then use the row's score. If I do MAX(game_time_stamp) as my aggregation I still have to independently aggregate the score. Is there a way to tie the row choice in the score column's aggregation function to the result of the game_time_stamp column's aggregation function?
(N.B. Existing answers to questions like Select first record in a One-to-Many relation using left join and SQL Server: How to Join to first row seem to suggest I cannot and say one must use a WHERE clause over a sub-query. But I am struggling with that (I'll post another question about that) and I can think of at least one solution and I am hoping there are better ones.)
The following query should do it. It uses a NOT EXISTS condition with a correlated subquery to locate the relevant game record for each event.
SELECT e.*, s.score current_a_score
FROM
events e
LEFT JOIN games g
ON g.user_id = e.user_id
AND g.timestamp < e.timestamp
AND NOT EXISTS (
SELECT 1
FROM games g1
WHERE
g1.user_id = e.user_id
AND g1.timestamp < e.timestamp
AND g1.timestamp > g.timestamp
)
LEFT JOIN ascores s
ON s.game_id = g.game_id
ORDER BY e.id
This DB Fiddle demo with your test data returns :
| id | user_id | class | timestamp | current_a_score |
| --- | ------- | ----- | --------- | --------------- |
| 1 | user1 | 6 | 100 | |
| 2 | user1 | 12 | 400 | 8 |
| 3 | user1 | 4 | 900 | 6 |
| 4 | user2 | 6 | 400 | 2 |
| 5 | user2 | 3 | 800 | 2 |
| 6 | user2 | 8 | 900 | 2 |
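Another way to tie the score to the latest game is a correlated scalar subquery with ORDER BY ... LIMIT 1. This is a different technique from the NOT EXISTS query above, sketched here against an in-memory SQLite database with the sample data. It uses <= to match the comparison in the question's own join, and it only considers games that have a row in AScores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER, user_id TEXT, class INTEGER, timestamp INTEGER);
CREATE TABLE games (user_id TEXT, game_id INTEGER, game_class TEXT,
                    timestamp INTEGER);
CREATE TABLE ascores (game_id INTEGER, score INTEGER);
INSERT INTO events VALUES
  (1, 'user1', 6, 100), (2, 'user1', 12, 400), (3, 'user1', 4, 900),
  (4, 'user2', 6, 400), (5, 'user2', 3, 800),  (6, 'user2', 8, 900);
INSERT INTO games VALUES
  ('user1', 1, 'A', 200), ('user2', 2, 'A', 300), ('user1', 3, 'B', 500),
  ('user1', 4, 'A', 600), ('user1', 5, 'A', 700);
INSERT INTO ascores VALUES (1, 8), (2, 2), (4, 9), (5, 6);
""")

# For each event, pick the score of the most recent scored game
# played by the same user at or before the event's timestamp.
query = """
SELECT e.id, e.user_id, e.class, e.timestamp,
       (SELECT s.score
          FROM games g
          JOIN ascores s ON s.game_id = g.game_id
         WHERE g.user_id = e.user_id
           AND g.timestamp <= e.timestamp
         ORDER BY g.timestamp DESC
         LIMIT 1) AS current_a_score
FROM events e
ORDER BY e.id
"""

scored_events = conn.execute(query).fetchall()
for row in scored_events:
    print(row)
```

The subquery returns NULL when no earlier scored game exists, so each event still appears exactly once and no GROUP BY is needed.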
I have one workaround, but it feels hacky and relies on the specifics of my data. First note that the time_stamps are all multiples of 100 while the scores are all below 10. I can combine these in a way that will not interfere with my comparison but will mean they are both encoded in one numeric column. This query gives the desired result:
SELECT Events.id, MIN(Events.user_id) AS user_id, MIN(Events.class) AS class, MIN(Events.time_stamp) AS time_stamp, MAX(AScoredGames.combination) % 10 AS current_a_score
FROM Events
LEFT OUTER JOIN (
SELECT AScores.score, AScores.score + (Games.time_stamp - 10) AS combination, Games.* FROM AScores
INNER JOIN Games
ON AScores.game_id = Games.game_id) AS AScoredGames
ON Events.user_id = AScoredGames.user_id AND Events.time_stamp >= AScoredGames.time_stamp
GROUP BY Events.id
ORDER BY id ASC
(The combining is done in AScores.score + (Games.time_stamp - 10) and so the aggregate function becomes MAX(AScoredGames.combination) % 10.)
Actual Result
+----+---------+-------+------------+-----------------+
| id | user_id | class | time_stamp | current_a_score |
+----+---------+-------+------------+-----------------+
| 1 | user1 | 6 | 100 | NULL |
| 2 | user1 | 12 | 400 | 8 |
| 3 | user1 | 4 | 900 | 6 |
| 4 | user2 | 6 | 400 | 2 |
| 5 | user2 | 3 | 800 | 2 |
| 6 | user2 | 8 | 900 | 2 |
+----+---------+-------+------------+-----------------+

Percentage to total in BigQuery Legacy SQL (Subqueries?)

I can't work out how to calculate the percentage of the total in BigQuery Legacy SQL.
So, I have a table:
ID | Name | Group | Mark
1 | John | A | 10
2 | Lucy | A | 5
3 | Jane | A | 7
4 | Lily | B | 9
5 | Steve | B | 14
6 | Rita | B | 11
I want to calculate percentage like this:
ID | Name | Group | Mark | Percent
1 | John | A | 10 | 10/(10+5+7)=45%
2 | Lucy | A | 5 | 5/(10+5+7)=23%
3 | Jane | A | 7 | 7/(10+5+7)=32%
4 | Lily | B | 9 | 9/(9+14+11)=26%
5 | Steve | B | 14 | 14/(9+14+11)=41%
6 | Rita | B | 11 | 11/(9+14+11)=32%
My table is fairly large (3 million rows).
I thought that I could do it with subqueries, but I can't use subqueries in the SELECT list.
Does anyone know a way to do it?
SELECT
ID, Name, [Group], Mark,
RATIO_TO_REPORT(Mark) OVER(PARTITION BY [Group]) AS percent
FROM YourTable
Check more about RATIO_TO_REPORT
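For engines without RATIO_TO_REPORT, the same ratio can be computed by joining each row to its group total. The sketch below is an illustration on an in-memory SQLite database, not BigQuery Legacy SQL. The table name marks and the column name Grp are assumptions chosen to avoid the reserved word Group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE marks (ID INTEGER, Name TEXT, Grp TEXT, Mark INTEGER);
INSERT INTO marks VALUES
  (1, 'John', 'A', 10), (2, 'Lucy', 'A', 5),   (3, 'Jane', 'A', 7),
  (4, 'Lily', 'B', 9),  (5, 'Steve', 'B', 14), (6, 'Rita', 'B', 11);
""")

# The subquery lives in FROM (not in the SELECT list): it computes one
# total per group, and each row is then divided by its group's total.
query = """
SELECT m.ID, m.Name, m.Grp, m.Mark,
       ROUND(100.0 * m.Mark / t.total, 1) AS percent
FROM marks m
JOIN (SELECT Grp, SUM(Mark) AS total
        FROM marks
       GROUP BY Grp) t ON t.Grp = m.Grp
ORDER BY m.ID
"""

shares = conn.execute(query).fetchall()
for row in shares:
    print(row)
```

Multiplying by 100.0 forces floating-point division. With integer operands SQLite would truncate the ratio to 0.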

Select distinct combinations of values

I have a table with X values and Y values, both INT. What I want to do is group on the X value with the condition that it contains a distinct combination of Y values. I also want to see the total number of each combination.
I tried using SUM ( POWER (2, Y)), but that generates numbers that are too big as Y can get up to about 300 in some cases.
+--------------+--------------+
| X | Y |
+--------------+--------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 4 |
| 1 | 6 |
| 2 | 1 |
| 2 | 2 |
| 2 | 4 |
| 2 | 6 |
| 3 | 2 |
| 3 | 3 |
| 3 | 5 |
| 4 | 2 |
| 4 | 3 |
| 4 | 5 |
| 5 | 2 |
| 5 | 3 |
| 5 | 6 |
+--------------+--------------+
I want the result to look something like:
+--------------+--------------+
| X | COUNT |
+--------------+--------------+
| 1 | 2 |
| 3 | 2 |
| 5 | 1 |
+--------------+--------------+
Based on your description (but not on your sample data), the following query should do it:
select X, count(distinct Y)
from TBL
group by X
Thanks for trying to help. I realize that it might have been hard to understand what I was trying to do.
Anyway, I ended up solving it with the checksum_agg aggregate function.
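For comparison, here is a minimal sketch of the grouping the question describes, done in plain Python rather than SQL. It is not the checksum_agg solution. It compares the exact set of Y values per X, so it cannot be affected by checksum collisions. The function name count_combinations is hypothetical.

```python
from collections import defaultdict

# Sample (X, Y) pairs from the question
pairs = [
    (1, 1), (1, 2), (1, 4), (1, 6),
    (2, 1), (2, 2), (2, 4), (2, 6),
    (3, 2), (3, 3), (3, 5),
    (4, 2), (4, 3), (4, 5),
    (5, 2), (5, 3), (5, 6),
]


def count_combinations(pairs):
    """Return (smallest X, number of X values) per distinct set of Y values."""
    ys_by_x = defaultdict(set)
    for x, y in pairs:
        ys_by_x[x].add(y)

    # Group the X values by the exact set of Y values they carry.
    xs_by_combo = defaultdict(list)
    for x, ys in ys_by_x.items():
        xs_by_combo[frozenset(ys)].append(x)

    return sorted((min(xs), len(xs)) for xs in xs_by_combo.values())


combo_counts = count_combinations(pairs)
print(combo_counts)
```

Each tuple holds the first X that carries a given combination and how many X values share it, which matches the X and COUNT columns in the expected result.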