Calculating group percentage in SQL - sql

I have a sample db table with columns 'team' and 'result' that stores match results 'W' (win) or 'L' (loss) for teams 'A' and 'B':
team|result
A |W
A |W
A |W
A |L
B |W
B |W
B |L
B |L
I can get the number of wins/losses per team by grouping by team and result:
sqlite> select team,result,count(*) from results group by team,result;
A|L|1
A|W|3
B|L|2
B|W|2
However, I would also like to get a percentage of win/loss per team:
A|L|1|25
A|W|3|75
B|L|2|50
B|W|2|50
I have not succeeded in figuring out how to do this in SQL. I have managed to do it programmatically with a Python program that queries the db via the sqlite API, loops over the result set, stores the total count per group in a variable, and then calculates the percentage.
Can this be achieved directly in SQL?
Thanks

We can use SUM() as an analytic function:
SELECT team, result, COUNT(*) AS cnt,
100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY team) AS pct
FROM results
GROUP BY team, result;
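Since the question mentions driving SQLite from Python, here is a minimal sketch of that query run through the standard sqlite3 module against the sample rows. It assumes the table is named results as in the question, and that the SQLite library bundled with Python is 3.25 or newer (window functions are not available before that). The ORDER BY is my addition, only there to make the output order predictable.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (team TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("A", "W"), ("A", "W"), ("A", "W"), ("A", "L"),
     ("B", "W"), ("B", "W"), ("B", "L"), ("B", "L")],
)

# COUNT(*) is the per-(team, result) count; SUM(COUNT(*)) OVER (...)
# adds those counts up again per team, giving the denominator.
query = """
SELECT team, result, COUNT(*) AS cnt,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY team) AS pct
FROM results
GROUP BY team, result
ORDER BY team, result
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

This should print one row per team/result pair with the percentage as a float (25.0 and 75.0 for team A, 50.0 twice for team B).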

Related

Using Sum() with JOIN doesn't provide correct output

I'm trying to create a query that displays a user's Id, the sum of total steps, and sum of total calories burnt.
The data for steps and calories are in two separate tables, so I used a JOIN. However, the totals from the joined query look wrong, while each query on its own returns the correct data.
Below are my queries. I am fairly new to SQL, so I am not sure what I did wrong. How do I correct this? Thank you in advance for the help!
For the Steps table, "Id" and "StepTotal" are Integers. For the Calories table, "Id" and "Calories" are also Integers.
SELECT steps.Id,Sum(StepTotal) AS Total_steps,Sum(cal.Calories) as Total_calories
FROM fitbit.Daily_steps AS steps
JOIN fitbit.Daily_calories AS cal ON steps.Id=cal.Id
GROUP BY Id
Given Output(Picture)
Expected Output(Picture)
For Steps
SELECT Id,Sum(StepTotal) AS Total_steps
FROM fitbit.Daily_steps
group by Id
Id | Total_steps
1503960366 | 375619
1624580081 | 178061
1644430081 | 218489
For Calories
SELECT Id,Sum(Calories) AS Total_calories
FROM fitbit.Daily_calories
group by Id
Id | Total_calories
1503960366 | 56309
1624580081 | 45984
1644430081 | 84339
I believe your current solution is returning additional rows as the result of the JOIN.
Let's look at an example data set
Steps
id | total
a | 5
a | 7
b | 3
Calories
id | total
a | 100
a | 300
b | 400
Now, if we SELECT * FROM Calories, we'd get 3 rows. If we SELECT * FROM Calories GROUP BY id, we'd get two rows.
But if we use a JOIN:
SELECT Steps.id, Steps.total AS steps, Calories.total AS cals FROM Steps
JOIN Calories
ON Steps.id = Calories.id
WHERE Steps.id = 'a'
This would return the following:
Steps_Calories
id | steps | cals
a | 5 | 100
a | 5 | 300
a | 7 | 100
a | 7 | 300
So now if we GROUP BY & SUM(steps), we get 24, instead of the expected 12, because the JOIN returns each pairing of steps & calories.
To avoid this, we can group and sum inside sub-queries first, and then join the aggregated results:
SELECT step_totals.id, step_totals.steps, cal_totals.cals
FROM (SELECT id, SUM(total) AS steps FROM Steps GROUP BY id) AS step_totals
JOIN (SELECT id, SUM(total) AS cals FROM Calories GROUP BY id) AS cal_totals
ON cal_totals.id = step_totals.id
Now each subquery only returns a single row for each id, so the join only returns a single row as well.
Of course, you'll have to adapt this for your schema.
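To see the fan-out concretely, here is a small sketch that loads the example Steps and Calories rows from this answer into an in-memory SQLite database (via Python's sqlite3 module) and compares the naive join with the pre-aggregated join. The table and column names are the toy ones used above, not the fitbit schema from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Steps (id TEXT, total INTEGER);
CREATE TABLE Calories (id TEXT, total INTEGER);
INSERT INTO Steps VALUES ('a', 5), ('a', 7), ('b', 3);
INSERT INTO Calories VALUES ('a', 100), ('a', 300), ('b', 400);
""")

# Naive version: join first, aggregate afterwards.
# Every Steps row for an id is paired with every Calories row for that id.
naive = conn.execute("""
SELECT Steps.id, SUM(Steps.total), SUM(Calories.total)
FROM Steps
JOIN Calories ON Steps.id = Calories.id
GROUP BY Steps.id
ORDER BY Steps.id
""").fetchall()

# Fixed version: aggregate each table to one row per id, then join.
fixed = conn.execute("""
SELECT step_totals.id, step_totals.steps, cal_totals.cals
FROM (SELECT id, SUM(total) AS steps FROM Steps GROUP BY id) AS step_totals
JOIN (SELECT id, SUM(total) AS cals FROM Calories GROUP BY id) AS cal_totals
  ON cal_totals.id = step_totals.id
ORDER BY step_totals.id
""").fetchall()

print(naive)  # [('a', 24, 800), ('b', 3, 400)]
print(fixed)  # [('a', 12, 400), ('b', 3, 400)]
```

Id 'b' has one row in each table, so it is unaffected; id 'a' has two rows in each, so both of its sums are doubled in the naive version.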

SQL - The percentage of responses out of the total responses, grouped per country

Each country has questions, and each question has answers. I want the percentage of responses that selected a given answer to a specific question, out of all the responses to that question, grouped by country.
Note there are multiple instances of the same question per country, each with different number of individual answers. There is also a field containing the total_nr_responses per answer/entry.
Sample Data
question_id country answer_key total_nr_responses
A1 Austria A1_B1 3
A1 Austria A1_B1 0
A1 Austria A1_B2 4
A1 Belgium A1_B1 4
A1 Belgium A1_B1 10
A2 Austria A2_B1 2
...
Expected Result for question A1, answer A1_B1 as percentage of the total_nr_responses for a specific answer out of the total responses, per country (100x3/7):
Country Result
Austria percentage
Belgium percentage
I tried something like this but I am not sure how to get the percentage per country/ how to group in the sub query per country so that the whole query works:
Select Country, count(total_nr_responses)* 100 / (Select count(total_nr_responses) From my_table WHERE question_key = 'A1') as percentage
From my_table
WHERE question_id = 'A1' AND answer_key = 'A1_B1'
GROUP BY Country
Any help much appreciated.
How about using CROSS APPLY to get the total?
Query
SELECT mt.question_id, mt.country, mt.answer_key, (SUM(mt.total_nr_responses) * 100 / ca.total_nr_responses) AS result
FROM my_table mt
CROSS APPLY (SELECT SUM(total_nr_responses) AS total_nr_responses
FROM my_table
WHERE question_id = mt.question_id AND country = mt.country) ca
WHERE mt.question_id = 'A1' AND mt.answer_key = 'A1_B1'
GROUP BY mt.question_id, mt.country, mt.answer_key, ca.total_nr_responses
Output
+-------------+---------+------------+--------+
| question_id | country | answer_key | result |
+-------------+---------+------------+--------+
| A1 | Austria | A1_B1 | 42 |
| A1 | Belgium | A1_B1 | 100 |
+-------------+---------+------------+--------+
You can use SUM function with a window specification.
select distinct country,
question_id,
answer_key,
100.0*sum(total_nr_responses) over(partition by country,question_id,answer_key)/
sum(total_nr_responses) over(partition by country,question_id) as pct
from my_table
Add a where clause to restrict the result to specific questions/answers/countries, if needed.
Perhaps something like this is what you're looking for?
SELECT
mt.country,
SUM(mt.total_nr_responses) * 100 / p.total_sum_responses
FROM
my_table AS mt,
( SELECT country, SUM(total_nr_responses) AS total_sum_responses FROM my_table WHERE question_id = 'A1' GROUP BY country ) AS p
WHERE
question_id = 'A1' AND
answer_key = 'A1_B1' AND
p.country = mt.country
GROUP BY
mt.country,
p.total_sum_responses
I wasn't able to make it work with OVER(PARTITION BY) because of the percent calculation. It'd be great to see what Cade Roux had in mind fully spelled out in code.
Execution plans between the nested SELECT and the CROSS APPLY are pretty similar and all three (window function, cross-apply and nested-select) yield similar results. If you are dealing with large amounts of data, make sure you have a composite index on country and question_id. Great to see such diverse solutions to the same problem!
Normally, you would do this with a simple window function along with aggregation:
Select Country,
count(total_nr_responses) * 100 / sum(count(total_nr_responses)) over () as percentage
From my_table
where question_id = 'A1' AND answer_key = 'A1_B1'
group by Country;
Note: SQL Server does integer division. I would change the 100 to 100.0 and format the result after the division. Otherwise, the values won't come close to adding up to 100.
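The integer-division point is easy to check on the sample data. The sketch below is not one of the answers above: it uses conditional aggregation (a CASE inside SUM), run in SQLite through Python's sqlite3 module, and sums total_nr_responses the way the CROSS APPLY answer does. SQLite also truncates integer division, so the same effect shows up there. The table name my_table and its columns are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE my_table (
    question_id TEXT, country TEXT, answer_key TEXT, total_nr_responses INTEGER
)""")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", [
    ("A1", "Austria", "A1_B1", 3),
    ("A1", "Austria", "A1_B1", 0),
    ("A1", "Austria", "A1_B2", 4),
    ("A1", "Belgium", "A1_B1", 4),
    ("A1", "Belgium", "A1_B1", 10),
    ("A2", "Austria", "A2_B1", 2),
])

# Numerator: responses for answer A1_B1 only; denominator: all responses
# to question A1 in that country. pct_int multiplies by the integer 100,
# pct_float by 100.0, which forces floating-point division.
query = """
SELECT country,
       SUM(CASE WHEN answer_key = 'A1_B1' THEN total_nr_responses ELSE 0 END) * 100
           / SUM(total_nr_responses) AS pct_int,
       SUM(CASE WHEN answer_key = 'A1_B1' THEN total_nr_responses ELSE 0 END) * 100.0
           / SUM(total_nr_responses) AS pct_float
FROM my_table
WHERE question_id = 'A1'
GROUP BY country
ORDER BY country
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For Austria the integer version gives 42 (300 / 7 truncated) while the 100.0 version gives roughly 42.857; Belgium is 100 either way.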

Cumulative Sum and Percentage

Can someone help me get the cumulative sum and percentage? I have three fields in the table "capability":
vertical | Defects(F) | Defects(NF)
Billing | 193 | 678
Provi | 200 | 906
Billing | 232 | 111
Analyt | 67 | 0
Provi | 121 | 690
I would want the final output to be
Vertical | Total Defects | Cumulative Defects | Cumulative%
Billing | 1214 | 1214 | 37.96%
Provi | 1917 | 3131 | 97.90%
Analyt | 67 | 3198 | 100.00%
Please note that I have around 3mn rows and data keeps increasing day on day.
I find the easiest way to get the cumulative sum is by using a correlated subquery, unless you have the full power of window functions as in SQL Server 2012.
However this poses several challenges. First, you need to summarize by the verticals. Then you need to order the verticals in the right order. Finally, you need to output the results the way you want them:
This should come close to what you want:
with d as
(select vertical, sum(Defects_F + Defects_NF) as Defects,
(case when vertical = 'Billing' then 1
when vertical = 'Provi' then 2
else 3
end) as num
from t
group by vertical
)
select vertical, defects as TotalDefects, cumDefects,
100 * cast(cumDefects as float) / fulldefects as DefectPercent
from (select d.*,
(select sum(defects) from d d2 where d2.num <= d.num
) as cumdefects,
SUM(defects) over () as fulldefects
from d
) d
OK, maybe this will help someone trying to get similar output. Here is what I did to get the desired result. Thanks a lot, Gordon; your suggestion helped me make this possible.
with d as
(SELECT ROW_NUMBER () OVER
(ORDER BY SUM([DEFECTS(F)]+[DEFECTS(NF)]) asc) NUM,
VERTICAL,
SUM([DEFECTS(F)]+[DEFECTS(NF)]) AS [DEFECTS]
FROM Capability
GROUP BY VERTICAL
)
select left(vertical,5) as Vertical, defects as TotalDefects, cumDefects,
cast(cumDefects as float) / fulldefects as DefectPercent
from (select d.*,
(select sum(defects) from d d2 where d2.num <= d.num
) as cumdefects,
SUM(defects) over () as fulldefects
from d
) d
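Where ordered window functions are available (SQL Server 2012 onwards, as mentioned above), the correlated subquery can be swapped for a running SUM() OVER (ORDER BY ...). Here is a rough sketch of that idea, run in SQLite 3.25+ through Python's sqlite3 module rather than in SQL Server. The column names defects_f and defects_nf are my stand-ins for Defects(F) and Defects(NF), and the ordering uses the same hard-coded CASE as the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE capability (vertical TEXT, defects_f INTEGER, defects_nf INTEGER)"
)
conn.executemany("INSERT INTO capability VALUES (?, ?, ?)", [
    ("Billing", 193, 678), ("Provi", 200, 906), ("Billing", 232, 111),
    ("Analyt", 67, 0), ("Provi", 121, 690),
])

# d: one row per vertical with its total and a sort key.
# SUM(defects) OVER (ORDER BY num) is the running total up to the current row;
# SUM(defects) OVER () is the grand total.
query = """
WITH d AS (
    SELECT vertical,
           SUM(defects_f + defects_nf) AS defects,
           CASE vertical WHEN 'Billing' THEN 1 WHEN 'Provi' THEN 2 ELSE 3 END AS num
    FROM capability
    GROUP BY vertical
)
SELECT vertical,
       defects,
       SUM(defects) OVER (ORDER BY num) AS cum_defects,
       ROUND(100.0 * SUM(defects) OVER (ORDER BY num) / SUM(defects) OVER (), 2) AS cum_pct
FROM d
ORDER BY num
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample rows this reproduces the totals and cumulative figures from the question (1214, 3131, 3198). With millions of rows, a single ordered pass like this is usually cheaper than re-summing for every row, though that depends on the engine and indexes.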

How can you get a histogram of counts from a join table without using a subquery?

I have a lot of tables that look like this: (id, user_id, object_id). I am often interested in the question "how many users have one object? how many have two? etc." and would like to see the distribution.
The obvious answer to this looks like:
select x.ucount, count(*)
from (select count(*) as ucount from objects_users group by user_id) as x
group by x.ucount
order by x.ucount;
This produces results like:
ucount | count
-------|-------
1 | 15
2 | 17
3 | 23
4 | 104
5 | 76
7 | 12
Using a subquery here feels inelegant to me and I'd like to figure out how to produce the same result without. Further, if the question you're trying to ask is slightly more complicated it gets messy passing more information out of the subquery. For example, if you want the data further grouped by the user's creation date:
select
x.ucount,
(select cdate from users where id = x.user_id) as cdate,
count(*)
from (
select user_id, count(*) as ucount
from objects_users group by user_id
) as x
group by cdate, x.ucount
order by cdate, x.ucount;
Is there some way to avoid the explosion of subqueries? I suppose in the end my objection is aesthetic, but it makes the queries hard to read and hard to write.
I think a subquery is exactly the appropriate way to do this, regardless of your RDBMS. Why would it be inelegant?
For the second query, just join the users table like this:
SELECT
x.ucount,
u.cdate,
COUNT(*)
FROM (
SELECT
user_id,
COUNT(*) AS ucount
FROM objects_users
GROUP BY user_id
) AS x
LEFT JOIN users AS u
ON x.user_id = u.id
GROUP BY u.cdate, x.ucount
ORDER BY u.cdate, x.ucount
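If the objection is mostly about readability, the same derived table can be given a name with a WITH clause (a CTE), which keeps the outer query flat. This is still a subquery under the hood, so it is a cosmetic change only. A small sketch in SQLite via Python's sqlite3 module; the table and column names follow the question, while the sample rows and the per_user and n_users names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, cdate TEXT);
CREATE TABLE objects_users (id INTEGER PRIMARY KEY, user_id INTEGER, object_id INTEGER);
INSERT INTO users VALUES (1, '2020-01-01'), (2, '2020-01-01'), (3, '2020-01-02');
INSERT INTO objects_users (user_id, object_id) VALUES
    (1, 10), (1, 11), (2, 10), (2, 12), (3, 10);
""")

# per_user: number of objects per user.
# Outer query: how many users share each (creation date, object count) pair.
query = """
WITH per_user AS (
    SELECT user_id, COUNT(*) AS ucount
    FROM objects_users
    GROUP BY user_id
)
SELECT u.cdate, p.ucount, COUNT(*) AS n_users
FROM per_user AS p
LEFT JOIN users AS u ON p.user_id = u.id
GROUP BY u.cdate, p.ucount
ORDER BY u.cdate, p.ucount
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('2020-01-01', 2, 2), ('2020-01-02', 1, 1)]
```

Users 1 and 2 were both created on 2020-01-01 and have two objects each, so they collapse into a single histogram row.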

How to Select and Order By columns not in Group By SQL statement - Oracle

I have the following statement:
SELECT
IMPORTID,Region,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,RefObligor
There are some extra columns in table Positions that I want in the output as "display data" but don't want in the GROUP BY clause.
These are Site and Desk.
Final output would have the following columns:
IMPORTID,Region,Site,Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
Ideally I'd want the data sorted like:
Order BY
IMPORTID,Region,Site,Desk,RefObligor
How to achieve this?
It does not make sense to include columns that are not part of the GROUP BY clause. Consider if you have a MIN(X), MAX(Y) in the SELECT clause, which row should other columns (not grouped) come from?
If your Oracle version is recent enough, you can use SUM - OVER() to show the SUM (grouped) against every data row.
SELECT
IMPORTID,Site,Desk,Region,RefObligor,
SUM(NOTIONAL) OVER(PARTITION BY IMPORTID, Region,RefObligor) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
Order BY
IMPORTID,Region,Site,Desk,RefObligor
Alternatively, you need to make an aggregate out of the Site, Desk columns
SELECT
IMPORTID,Region,Min(Site) Site, Min(Desk) Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,Min(Site),Min(Desk),RefObligor
I believe this is what you're after:
select
IMPORTID,
Region,
Site,
Desk,
RefObligor,
Sum(Sum(Notional)) over (partition by IMPORTID, Region, RefObligor)
from
Positions
group by
IMPORTID, Region, Site, Desk, RefObligor
order by
IMPORTID, Region, RefObligor, Site, Desk;
... but it's hard to tell without further information and/or test data.
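In the absence of test data, here is a tiny made-up data set that shows what the Sum(Sum(...)) OVER variant returns. It is run in SQLite 3.25+ through Python's sqlite3 module, not in Oracle, so treat it as an illustration of the query shape only. The rows are invented, the WHERE clause from the question is left out, and row_notional is a hypothetical extra column added to show the per-row subtotal next to the partition total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Positions (
    IMPORTID INTEGER, Region TEXT, Site TEXT, Desk TEXT,
    RefObligor TEXT, NOTIONAL INTEGER
)""")
conn.executemany("INSERT INTO Positions VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "EU", "LON", "D1", "X", 10),
    (1, "EU", "LON", "D1", "X", 5),
    (1, "EU", "PAR", "D2", "X", 20),
    (1, "US", "NYC", "D3", "Y", 7),
])

# GROUP BY collapses rows per (IMPORTID, Region, Site, Desk, RefObligor);
# the window SUM then totals those subtotals per (IMPORTID, Region, RefObligor),
# so Site and Desk stay visible while the total ignores them.
query = """
SELECT IMPORTID, Region, Site, Desk, RefObligor,
       SUM(NOTIONAL) AS row_notional,
       SUM(SUM(NOTIONAL)) OVER (PARTITION BY IMPORTID, Region, RefObligor) AS SUM_NOTIONAL
FROM Positions
GROUP BY IMPORTID, Region, Site, Desk, RefObligor
ORDER BY IMPORTID, Region, RefObligor, Site, Desk
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Both EU rows carry the same SUM_NOTIONAL of 35 (15 + 20), which is the total the original GROUP BY IMPORTID, Region, RefObligor would have produced, repeated on each Site/Desk row.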
A great blog post that covers this dilemma in detail is here:
http://bernardoamc.github.io/sql/2015/05/04/group-by-non-aggregate-columns/
Here are some snippets of it:
Given:
CREATE TABLE games (
game_id serial PRIMARY KEY,
name VARCHAR,
price BIGINT,
released_at DATE,
publisher TEXT
);
INSERT INTO games (name, price, released_at, publisher) VALUES
('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
('Project Druid', 20, '2015-05-01', 'shortcircuit'),
('Chroma Squad', 40, '2015-04-30', 'Behold Studios'),
('Soul Locus', 30, '2015-04-30', 'Fat Loot Games'),
('Subterrain', 40, '2015-04-30', 'Pixellore');
SELECT * FROM games;
game_id | name | price | released_at | publisher
---------+--------------------+-------+-------------+----------------
1 | Metal Slug Defense | 30 | 2015-05-01 | SNK Playmore
2 | Project Druid | 20 | 2015-05-01 | shortcircuit
3 | Chroma Squad | 40 | 2015-04-30 | Behold Studios
4 | Soul Locus | 30 | 2015-04-30 | Fat Loot Games
5 | Subterrain | 40 | 2015-04-30 | Pixellore
(5 rows)
Trying to get something like this:
SELECT released_at, name, publisher, MAX(price) as most_expensive
FROM games
GROUP BY released_at;
But name and publisher are not added due to being ambiguous when aggregating...
Let’s make this clear:
Selecting the MAX(price) does not select the entire row.
The database can’t know, and when it can’t give the right answer every time for a given query it should give us an error, and that’s what it does!
Ok… Ok… It’s not so simple, what can we do?
Use an inner join to get the additional columns
SELECT g1.name, g1.publisher, g1.price, g1.released_at
FROM games AS g1
INNER JOIN (
SELECT released_at, MAX(price) as price
FROM games
GROUP BY released_at
) AS g2
ON g2.released_at = g1.released_at AND g2.price = g1.price;
Or use a left outer join to get the additional columns, and then filter by the NULL of a duplicate column...
SELECT g1.name, g1.publisher, g1.price, g2.price, g1.released_at
FROM games AS g1
LEFT OUTER JOIN games AS g2
ON g1.released_at = g2.released_at AND g1.price < g2.price
WHERE g2.price IS NULL;
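One thing worth checking with either query is what happens on ties. The sketch below loads the games rows from the snippet into SQLite via Python's sqlite3 module and runs both variants. It assumes INTEGER PRIMARY KEY as a stand-in for serial and TEXT for the date column, since SQLite has neither type; only the name column is selected to keep the comparison short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (
    game_id INTEGER PRIMARY KEY,  -- SQLite stand-in for serial
    name TEXT, price INTEGER, released_at TEXT, publisher TEXT
);
INSERT INTO games (name, price, released_at, publisher) VALUES
    ('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
    ('Project Druid', 20, '2015-05-01', 'shortcircuit'),
    ('Chroma Squad', 40, '2015-04-30', 'Behold Studios'),
    ('Soul Locus', 30, '2015-04-30', 'Fat Loot Games'),
    ('Subterrain', 40, '2015-04-30', 'Pixellore');
""")

# Variant 1: join back to the per-date maximum price.
inner_join = [r[0] for r in conn.execute("""
SELECT g1.name
FROM games AS g1
INNER JOIN (
    SELECT released_at, MAX(price) AS price
    FROM games
    GROUP BY released_at
) AS g2
  ON g2.released_at = g1.released_at AND g2.price = g1.price
ORDER BY g1.name
""")]

# Variant 2: anti-join, keep rows with no pricier game on the same date.
anti_join = [r[0] for r in conn.execute("""
SELECT g1.name
FROM games AS g1
LEFT OUTER JOIN games AS g2
  ON g1.released_at = g2.released_at AND g1.price < g2.price
WHERE g2.price IS NULL
ORDER BY g1.name
""")]

print(inner_join)
print(anti_join)
```

Both variants return three games, not two: Chroma Squad and Subterrain are tied at 40 on 2015-04-30, so each approach keeps both rows for that date.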
Hope that helps.