Group by after a partition by in MS SQL Server - sql

I am working on some car accident data and am stuck on how to get the data in the form I want.
select
sex_of_driver,
accident_severity,
count(accident_severity) over (partition by sex_of_driver, accident_severity)
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
This is my code, which counts the accidents had per each sex for each severity. I know I can do this with group by but I wanted to use a partition by in order to work out % too.
However I get a very large table (I assume for each row that is each sex/severity. When I do the following:
select
sex_of_driver,
accident_severity,
count(accident_severity) over (partition by sex_of_driver, accident_severity)
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
group by
sex_of_driver,
accident_severity
I get this:
sex_of_driver
accident_severity
(No column name)
1
1
1
1
2
1
-1
2
1
-1
1
1
1
3
1
I won't give you the whole table, but basically, the group by has caused the count to just be 1.
I can't figure out why group by isn't working. Is this an MS SQL-Server thing?
I want to get the same result as below (obv without the CASE etc)
select
accident.accident_severity,
count(accident.accident_severity) as num_accidents,
vehicle.sex_of_driver,
CASE vehicle.sex_of_driver WHEN '1' THEN 'Male' WHEN '2' THEN 'Female' end as sex_col,
CASE accident.accident_severity WHEN '1' THEN 'Fatal' WHEN '2' THEN 'Serious' WHEN '3' THEN 'Slight' end as serious_col
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
where
sex_of_driver != 3
and
sex_of_driver != -1
group by
accident.accident_severity,
vehicle.sex_of_driver
order by
accident.accident_severity

You seem to have a misunderstanding here.
GROUP BY will reduce your rows to a single row per grouping (ie per pair of sex_of_driver, accident_severity values. Any normal aggregates you use with this, such as COUNT(*), will return the aggregate value within that group.
Whereas OVER gives you a windowed aggregated, and means you are calculating it after reducing your rows. Therefore when you write count(accident_severity) over (partition by sex_of_driver, accident_severity) the aggregate only receives a single row in each partition, because the rows have already been reduced.
You say "I know I can do this with group by but I wanted to use a partition by in order to work out % too." but you are misunderstanding how to do that. You don't need PARTITION BY to work out percentage. All you need to calculate a percentage over the whole resultset is COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (), in other words a windowed aggregate over a normal aggregate.
Note also that count(accident_severity) does not give you the number of distinct accident_severity values, it gives you the number of non-null values, which is probably not what you intend. You also have a very strange join predicate, you probably want something like a.vehicle_id = v.vehicle_id
So you want something like this:
select
sex_of_driver,
accident_severity,
count(*) as Count,
count(*) * 1.0 /
sum(count(*)) over (partition by sex_of_driver) as PercentOfSex
count(*) * 1.0 /
sum(count(*)) over () as PercentOfTotal
from
dbo.accident as accident a
inner join dbo.vehicle as v on
a.vehicle_id = v.vehicle_id
group by
sex_of_driver,
accident_severity;

Related

TSQL "where ... group by ..." issue that needs solution like "having ..."

I have 3 sub-tables of different formats joined together with unions if this affects anything into full-table. There I have columns "location", "amount" and "time". Then to keep generality for my later needs I union full-table with location-table that has all possible "location" values and other fields are null into master-table.
I query master-table,
select location, sum(amount)
from master-table
where (time...)
group by location
However some "location" values are dropped because sum(amount) is 0 for those "location"s but I really want to have full list of those "location"s for my further steps.
Alternative would be to use HAVING clause but from what I understand HAVING is impossible here because i filter on "time" while grouping on "location" and I would need to add "time" in grouping which destroys the purpose. Keep in mind that the goal here is to get sum(amount) in each "location"
select location, sum(amount)
from master-table
group by location, time
having (time...)
To view the output:
with the first code I get
loc1, 5
loc3, 10
loc6, 1
but I want to get
loc1, 5
loc2, 0
loc3, 10
loc4, 0
loc5, 0
loc6, 1
Any suggestions on what can be done with this structure of master-table? Alternative solution to which I have no idea how to code would be to add numbers from the first query result to location-table (as a query, not actual table) with the final result query that I've posted above.
What you want will require a complete list of locations, then a left-outer join using that table and your calculated values, and IsNull (for tsql) to ensure you see the 0s you expect. You can do this with some CTEs, which I find valuable for clarity during development, or you can work on "putting it all together" in a more traditional SELECT...FROM... statement. The CTE approach might look like this:
WITH loc AS (
SELECT DISTINCT LocationID
FROM location_table
), summary_data as (
SELECT LocationID, SUM(amount) AS location_sum
FROM master-table
GROUP BY LocationID
)
SELECT loc.LocationID, IsNull(location_sum,0) AS location_sum
FROM loc
LEFT OUTER JOIN summary_data ON loc.LocationID = summary_data.LocationID
See if that gets you a step or two closer to the results you're looking for.
I can think of 2 options:
You could move the WHERE to a CASE WHEN construction:
-- Option 1
select
location,
sum(CASE WHEN time <'16:00' THEN amount ELSE 0 END)
from master_table
group by location
Or you could JOIN with the possible values of location (which is my first ever RIGHT JOIN in a very long time 😉):
-- Option 2
select
x.location,
sum(CASE WHEN m.time <'16:00' THEN m.amount ELSE 0 END)
from master_table m
right join (select distinct location from master_table) x ON x.location = m.location
group by x.location
see: DBFIDDLE
The version using T-SQL without CTEs would be:
SELECT l.location ,
ISNULL(m.location_sum, 0) as location_sum
FROM master-table l
LEFT JOIN (
SELECT location,
SUM(amount) as location_sum
FROM master-table
WHERE (time ... )
GROUP BY location
) m ON l.location = m.location
This assumes that you still have your initial UNION in place that ensures that master-table has all possible locations included.
It is the where clause that excludes some locations. To ensure you retain every location you could introduce "conditional aggregation" instead of using the where clause: e.g.
select location, sum(case when (time...) then amount else 0 end) as location_sum
from master-table
group by location
i.e. instead of excluding some rows from the result, place the conditions inside the sum function that equate to the conditions you would have used in the where clause. If those conditions are true, then it will aggregate the amount, but if the conditions evaluate to false then 0 is summed, but the location is retained in the result.

Using Count case

So I've been just re-familiarizing myself with SQL after some time away from it, and I am using Mode Analytics sample Data warehouse, where they have a dataset for SF police calls in 2014.
For reference, it's set up as this:
incident_num, category, descript, day_of_week, date, time, pd_district, Resolution, address, ID
What I am trying to do is figure out the total number of incidents for a category, and a new column of all the people who have been arrested. Ideally looking something like this
Category, Total_Incidents, Arrested
-------------------------------------
Battery 10 4
Murder 200 5
Something like that..
So far I've been trying this out:
SELECT category, COUNT (Resolution) AS Total_Incidents, (
Select COUNT (resolution)
from tutorial.sf_crime_incidents_2014_01
where Resolution like '%ARREST%') AS Arrested
from tutorial.sf_crime_incidents_2014_01
group by 1
order by 2 desc
That returns the total amount of incidents correctly, but for the Arrested, it keeps printing out 9014 Arrest
Any idea what I am doing wrong?
The subquery is not correlated. It just selects the count of all rows. Add a condition, that checks for the category to be equal to that of the outer query.
SELECT o.category,
count(o.resolution) total_incidents,
(SELECT count(i.resolution)
FROM tutorial.sf_crime_incidents_2014_01 i
WHERE i.resolution LIKE '%ARREST%'
AND i.category = o.category) arrested
FROM tutorial.sf_crime_incidents_2014_01 o
GROUP BY 1
You could use this:
SELECT category,
COUNT(Resolution) AS Total_Incidents,
SUM(CASE WHEN Resolution LIKE '%ARREST%' THEN 1 END) AS Arrested
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY category
ORDER BY 2 DESC;

SQLite3 database conditional summing

I am looking to order a list of keys based on the number of orders placed from a database containing order requests. Basically, on table, call it orders(o_partkey, o_returnflag) I am trying to get the total number of returns for each order. I have tried many variations of the following snippet with the goal schema returnlist(partkey, numreturns):
select O.o_partkey as partkey,
count(case when O.o_returnflag = 'R' then 1 else 0 end) as numreturns
from orders O
orderby quantity_returned desc;
I am very new to SQLite and am just jumping into the basics. This is an adjustment of a homework question (the actual question is more complex) but I have simplified down the issue I am having.
Consider using a derived table subquery with SUM() as the aggregate function:
SELECT dT.partkey, dT.numreturns
FROM
(SELECT O.o_partkey as partkey,
SUM(CASE WHEN O.o_returnflag = 'R' THEN 1 ELSE 0 END) as numreturns
FROM [ORDER] O
GROUP BY O.o_partkey) AS dT
ORDER BY dT.numreturns DESC;
Be sure to bracket name of table as [ORDER] is an SQLite key word.
Your problem is that COUNT counts rows, so it counts both 0 and 1 values.
You are not interested in any other rows, so you can just filter out the returns with WHERE:
SELECT o_partkey AS partkey,
COUNT(*) AS numreturns
FROM orders
WHERE o_returnflag = 'R'
ORDER BY 2 DESC;

How to group the outcomes of a query by the values in another column?

I have a table called vegetation with 2 columns:
type, count
I want to sum the count for all the rows where count value is smaller than the average count and all those for which it is higher.
I don't know how to reflect this in the group by clause... (or somewhere else?).
I guess that another way of doing it should be by assigning a value to all less-than-average data and another value to all higher-than-average data and then group by this value. But I just started and can not figure out how to do that either.
SELECT sum(CASE WHEN ct <= x.avg_ct THEN ct ELSE 0 END) AS sum_ct_low
,sum(CASE WHEN ct > x.avg_ct THEN ct ELSE 0 END) AS sum_ct_hi
FROM vegetation v
,(SELECT avg(ct) AS avg_ct FROM vegetation) x
The average is a single value, you can just CROSS JOIN the subquery to the base table (comma separated list of tables means cross-joining).
Then filter with a simple CASE statement.
I use ct instead of count, since that's a reserved word in SQL.

SQL query count divided by a distinct count of same query

Having some trouble with some SQL.
Take the following result for instance:
LOC_CODE CHANNEL
------------ --------------------
3ATEST-01 CHAN2
3ATEST-01 CHAN3
3ATEST-02 CHAN4
What I need to do is get a count of the above query, grouped by channel, but i want that count to be divided by the count that the "LOC_CODE" appears.
Example of the result I am after is:
CHANNEL COUNT
---------------- ----------
CHAN2 0.5
CHAN3 0.5
CHAN4 1
Above explaination is that the CHAN2 appears next to "3ATEST-01", but that LOC_CODE of "3ATEST-01" appears twice, so the count should be divided by 2.
I know I can do this by basically duplicating the query with a distinct count, but the underlying query is quite complex and don't really want to harm performance.
Please let me know if you would like more information!
Try:
select channel,
count(*) over (partition by channel, loc_code)
/ count(*) over (partition by loc_code) as count_ratio
from my_table
SELECT t.CHANNEL, COUNT(*) / gr.TotalCount
FROM my_table t JOIN (
SELECT LOC_CODE, COUNT(*) TotalCount
FROM my_table
GROUP BY LOC_CODE
) gr USING(LOC_CODE)
GROUP BY t.LOC_CODE, t.CHANNEL
Create a index on (LOC_CODE, CHANNEL)
If are no duplicate channels, replace COUNT(*) / gr.TotalCount with 1 / gr.TotalCount and remove the GROUP BY clause
First, find a query that gets you the correct results. Then, see if it can be optimised. My guess is that it's hard to optimise as you require two different groupings, one per Channel and one pre Loc_Code.
I'm not even sure that this fits your description:
SELECT t.CHANNEL
, COUNT(*) / SUM(grp.TotalCount)
FROM my_table t
JOIN
( SELECT LOC_CODE
, COUNT(*) TotalCount --- or is it perhaps?:
--- COUNT(DISTINCT CHANNEL)
FROM my_table
GROUP BY LOC_CODE
) grp
ON grp.LOC_CODE = t.LOC_CODE
GROUP BY t.CHANNEL
Your requirements are still a bit unclear to me when it comes to duplicate CHANNELs, but this should work if you want grouping on both CHANNEL and LOC_CODE to sum up later;
SELECT L1.CHANNEL, 1/COUNT(L2.LOC_CODE)
FROM Locations L1
LEFT JOIN Locations L2 ON L1.LOC_CODE = L2.LOC_CODE
GROUP BY L1.CHANNEL, L1.LOC_CODE
Demo here.