Keeping rows from double-counting in a GROUP BY - sql

Here's the basic guts of my schema and problem: http://sqlfiddle.com/#!1/72ec9/4/2
Note that the periods table can refer to a variable range of time - it could be an entire season, it could be a few games or one game. For a given team and year all period rows represent exclusive ranges of time.
I've got a query written which joins up tables and uses a GROUP BY periods.year to aggregate scores for a season (see sqlfiddle). However, if a coach had two positions in the same year the GROUP BY will count the same period row twice. How can I ditch the duplicates when a coach held two positions but still sum up periods when a year is comprised of multiple periods? If there's a better way to do the schema I'd also appreciate it if you pointed it out to me.

The underlying problem (joining to multiple tables with multiple matches) is explained in this related answer:
Two SQL LEFT JOINS produce incorrect result
To fix, I first simplified & formatted your query:
select pe.year
, sum(pe.wins) AS wins
, sum(pe.losses) AS losses
, sum(pe.ties) AS ties
, array_agg(po.id) AS position_id
, array_agg(po.name) AS position_names
from periods_positions_coaches_linking pp
join positions po ON po.id = pp.position
join periods pe ON pe.id = pp.period
where pp.coach = 1
group by pe.year
order by pe.year;
This yields the same incorrect result as your original, but it is simpler, faster, and easier to read.
No point in joining the table coaches as long as you don't use any of its columns in the SELECT list. I removed it completely and replaced the WHERE condition with where pp.coach = 1.
You don't need COALESCE. NULL values are ignored in the aggregate function sum(). No need to substitute 0.
Use table aliases to make it easier to read.
Next, I solved your problem like this:
SELECT *
FROM (
SELECT pe.year
, array_agg(DISTINCT po.id) AS position_id
, array_agg(DISTINCT po.name) AS position_names
FROM periods_positions_coaches_linking pp
JOIN positions po ON po.id = pp.position
JOIN periods pe ON pe.id = pp.period
WHERE pp.coach = 1
GROUP BY pe.year
) po
LEFT JOIN (
SELECT pe.year
, sum(pe.wins) AS wins
, sum(pe.losses) AS losses
, sum(pe.ties) AS ties
FROM (
SELECT period
FROM periods_positions_coaches_linking
WHERE coach = 1
GROUP BY period
) pp
JOIN periods pe ON pe.id = pp.period
GROUP BY pe.year
) pe USING (year)
ORDER BY year;
Aggregate positions and periods separately before joining them.
In the first sub-query po list positions only once with array_agg(DISTINCT ...).
In the second sub-query pe ...
GROUP BY period, because a coach can have multiple positions per period.
JOIN to periods-data after that, and then aggregate to get sums.
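A minimal sketch of why the sums double up and how collapsing the linking rows first avoids it. It uses Python's built-in sqlite3 module and a hypothetical cut-down schema (a `linking` table standing in for periods_positions_coaches_linking, and only a `wins` column), so the names and data are assumptions rather than the asker's real tables:

```python
import sqlite3

# Hypothetical miniature schema: one coach, two periods in 2010.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE periods (id INTEGER PRIMARY KEY, year INTEGER, wins INTEGER);
CREATE TABLE linking (period INTEGER, position INTEGER, coach INTEGER);
INSERT INTO periods VALUES (1, 2010, 5), (2, 2010, 3);
-- coach 1 held two positions in period 1 and one position in period 2
INSERT INTO linking VALUES (1, 10, 1), (1, 11, 1), (2, 10, 1);
""")

# Naive join: period 1 matches two linking rows, so its wins are summed twice.
naive = conn.execute("""
    SELECT pe.year, SUM(pe.wins)
    FROM linking pp
    JOIN periods pe ON pe.id = pp.period
    WHERE pp.coach = 1
    GROUP BY pe.year
""").fetchall()

# Collapse the linking rows to one row per period first, then join and sum.
fixed = conn.execute("""
    SELECT pe.year, SUM(pe.wins)
    FROM (SELECT period FROM linking WHERE coach = 1 GROUP BY period) pp
    JOIN periods pe ON pe.id = pp.period
    GROUP BY pe.year
""").fetchall()

print(naive)  # [(2010, 13)]  (5 + 5 + 3)
print(fixed)  # [(2010, 8)]   (5 + 3)
```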

Use DISTINCT, as shown here:
select periods.year as year,
sum(coalesce(periods.wins, 0)) as wins,
sum(coalesce(periods.losses, 0)) as losses,
sum(coalesce(periods.ties, 0)) as ties,
array_agg( distinct positions.id) as position_id,
array_agg( distinct positions.name) as position_names
from periods_positions_coaches_linking
join coaches on coaches.id = periods_positions_coaches_linking.coach
join positions on positions.id = periods_positions_coaches_linking.position
join periods on periods.id = periods_positions_coaches_linking.period
where coaches.id = 1
group by periods.year, positions.id
order by periods.year;

In your case, the easiest way might be to divide out the positions:
select periods.year as year,
sum(coalesce(periods.wins, 0))/COUNT(distinct positions.id) as wins,
sum(coalesce(periods.losses, 0))/COUNT(distinct positions.id) as losses,
sum(coalesce(periods.ties, 0))/COUNT(distinct positions.id) as ties,
array_agg(distinct positions.id) as position_id,
array_agg(distinct positions.name) as position_names
from periods_positions_coaches_linking join
coaches
on coaches.id = periods_positions_coaches_linking.coach join
positions
on positions.id = periods_positions_coaches_linking.position join
periods
on periods.id = periods_positions_coaches_linking.period
where coaches.id = 1
group by periods.year
order by periods.year;
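The join repeats each period once per position, which scales the wins, losses, and ties, so dividing by the number of distinct positions adjusts the counts. Note that this only gives the right totals when every period in the year is linked to the same set of positions.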

Related

Not getting 0 value in SQL count aggregate by inner join

I am using the basic chinook database and I am trying to get a query that will display the worst selling genres. I am mostly getting the answer, however there is one genre 'Opera' that has 0 sales, but the query result is ignoring that and moving on to the next lowest non-zero value.
I tried using left join instead of inner join but that returns different values.
This is my query currently:
create view max
as
select distinct
t1.name as genre,
count(*) as Sales
from
tracks t2
inner join
invoice_items t3 on t2.trackid == t3.trackid
left join
genres as t1 on t1.genreid == t2.genreid
group by
t1.genreid
order by
2
limit 10;
The result however skips past the opera value which is 0 sales. How can I include that? I tried using left join but it yields different results.
Any help is appreciated.
If you want to include genres with no sales then you should start the joins from genres and then do LEFT joins to the other tables.
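Also, you should not use count(*), which counts every row in the result set, including the NULL-extended row that a LEFT JOIN produces for a genre without sales. COUNT(i.trackid) counts only the rows where a matching invoice item exists.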
SELECT g.name Genre,
COUNT(i.trackid) Sales
FROM genres g
LEFT JOIN tracks t ON t.genreid = g.genreid
LEFT JOIN invoice_items i ON i.trackid = t.trackid
GROUP BY g.genreid
ORDER BY Sales LIMIT 10;
There is no need for the keyword DISTINCT, since the query returns 1 row for each genre.
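To see the difference between COUNT(*) and COUNT(column) after a LEFT JOIN, here is a small sketch using Python's built-in sqlite3 module. The three tables are hypothetical stand-ins with only the columns needed for the join, not the full chinook schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genres (genreid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tracks (trackid INTEGER PRIMARY KEY, genreid INTEGER);
CREATE TABLE invoice_items (id INTEGER PRIMARY KEY, trackid INTEGER);
INSERT INTO genres VALUES (1, 'Rock'), (2, 'Opera');
INSERT INTO tracks VALUES (100, 1), (200, 2);
-- only the Rock track was ever sold
INSERT INTO invoice_items VALUES (1, 100), (2, 100);
""")

rows = conn.execute("""
    SELECT g.name,
           COUNT(*)         AS count_star,  -- counts the NULL-extended row too
           COUNT(i.trackid) AS count_col    -- ignores rows where i.trackid IS NULL
    FROM genres g
    LEFT JOIN tracks t        ON t.genreid = g.genreid
    LEFT JOIN invoice_items i ON i.trackid = t.trackid
    GROUP BY g.genreid
    ORDER BY count_col, g.name
""").fetchall()

print(rows)  # [('Opera', 1, 0), ('Rock', 2, 2)]
```

Opera has one track and no invoice items, so the LEFT JOIN keeps a single row with NULLs on the invoice side: COUNT(*) reports 1, COUNT(i.trackid) reports 0.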
When asking for the top n one must always state how to deal with ties. If I am looking for the top 1, but there are three rows in the table, all with the same value, shall I select 3 rows? Zero rows? One row arbitrarily chosen? Most often we don't want arbitrary results, which excludes the last option. This excludes LIMIT, too, because LIMIT has no clause for ties in SQLite.
Here is an example with DENSE_RANK instead. You are looking for the worst selling genres, so we must probably look at the revenue per genre, which is the sum of price x quantity sold. In order to include genres without invoices (and maybe even without tracks?) we outer join this data to the genre table.
select total, genre_name
from
(
select
g.name as genre_name,
coalesce(sum(ii.unit_price * ii.quantity), 0) as total,
dense_rank() over (order by coalesce(sum(ii.unit_price * ii.quantity), 0)) as rnk
from genres g
left join tracks t on t.genreid = g.genreid
left join invoice_items ii on ii.trackid = t.trackid
group by g.name
) aggregated
where rnk <= 10
order by total, genre_name;
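The effect of ties can be illustrated with a small sketch in Python's built-in sqlite3 module. The `genre_totals` table is a hypothetical pre-aggregated stand-in for the revenue per genre, and the window function assumes SQLite 3.25 or later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genre_totals (genre_name TEXT, total REAL);
-- Opera and Latin tie for worst seller
INSERT INTO genre_totals VALUES ('Opera', 0), ('Latin', 0), ('Rock', 5), ('Jazz', 9);
""")

# LIMIT 1 returns exactly one row and silently drops the other tied genre.
limited = conn.execute("""
    SELECT genre_name FROM genre_totals
    ORDER BY total, genre_name
    LIMIT 1
""").fetchall()

# DENSE_RANK gives tied rows the same rank, so both worst sellers survive.
ranked = conn.execute("""
    SELECT genre_name
    FROM (
        SELECT genre_name,
               DENSE_RANK() OVER (ORDER BY total) AS rnk
        FROM genre_totals
    ) AS r
    WHERE rnk = 1
    ORDER BY genre_name
""").fetchall()

print(limited)  # [('Latin',)]
print(ranked)   # [('Latin',), ('Opera',)]
```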

SQL Query with row_number() not returning expected output

my goal is to write a query that should return the cities which produced the highest avg. sales for each item-category.
This is the expected output:
item_category|city
books |los_angeles
toys |austin
electronics |san_fransisco
My 3 table schemas look like this:
users
user_id|city
sales
user_id|item_id|sales_amt
items
item_id|item_category
These are further notes to consider:
1. sales_amt is the only column that may have Null values. if no users have placed a sale for a particular item-category (no rows in sales with a non-Null sales_amt), then the city name should be Null.
2. only 1 row per each distinct item. If more than 1 city qualifies, then pick the first one alphabetically.
The attempt I took looks like this but it does not produce the right output:
select a.item_category,a.city from (
select
i.item_category,
u.city,
row_number() over (partition by i.item_category,u.city order by avg(s.sales_amt) desc)rk
from sales s
join users u on s.user_id=u.user_id
join items i on i.item_id=s.item_id
group by i.item_category,u.city)a
where a.rk=1
My output does not return the Null cases for sales_amt. Also, I get non-unique rows. Therefore, I am worried that I am not properly incorporating the 2 notes.
I hope someone can help.
my goal is to write a query that should return the cities which produced the highest avg. sales for each item-category.
This can be calculated using aggregation and window functions:
select ic.*
from (select i.item_category, u.city,
row_number() over(partition by i.item_category order by avg(s.sales_amt) desc, u.city) as seqnum
from users u join
sales s
on s.user_id = u.user_id join
items i
on i.item_id = s.item_id
group by i.item_category, u.city
) ic
where seqnum = 1;
Your question explicitly says "average" which is why this uses avg(). However, I suspect that you really want the sum in each city, which would be sum().
Notes:
You want one row so row_number() instead of rank().
You need sales to calculate the average, so join, instead of left join.
You want one row per item_category, so that is used for partitioning.
And my take on it is a mix of GMB's and Gordon's advice. GMB points out that left joins are needed, but I think his starting table, partition, and choice of rank() are wrong: his query cannot generate null city names as requested, and it could generate duplicates for cities tied on the same avg. Gordon picked up on things like ordering by city on a tied avg, which GMB did not, but missed the "if no sales of any items in category X, put null for the city" requirement. Both left cancelled orders floating round the system, which introduces errors:
select *
from (
select
i.item_category,
u.city,
row_number() over(partition by i.item_category order by avg(s.sales_amt) desc, u.city asc) rn
from items i
left join (select * from sales where sales_amt is not null) s on i.item_id = s.item_id
left join users u on s.user_id = u.user_id
group by i.item_category, u.city
) t
where rn = 1
We start from items so that categories having no sales get nulls for their sale amount and city.
We also need to consider that any sales that didn't fulfil will have null in their amount. We exclude these with a subquery; otherwise they will link through to users and give a false positive (even though the avg will calculate as null for a category that only has cancelled orders, the city will still show when it should not). I could also have done this with an "and sales_amt is not null" predicate in the join, but I think this way is clearer. This should not be done with a predicate in the where clause, because that will eliminate the sale-less categories we are trying to preserve.
Row number is used on avg but with city name to break any ties. It's a simpler function than rank and cannot generate duplicate values
Finally we pull the rn 1s to get the top averaging cities
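A small sketch of this approach on made-up data, run through Python's built-in sqlite3 module (window functions assume SQLite 3.25 or later). The rows are hypothetical and chosen to exercise both notes from the question: a tie on the average, and a category whose only sale has a NULL amount:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_category TEXT);
CREATE TABLE sales (user_id INTEGER, item_id INTEGER, sales_amt REAL);
INSERT INTO users VALUES (1, 'austin'), (2, 'boston');
INSERT INTO items VALUES (10, 'books'), (20, 'toys');
-- books: both cities average 10.0, so the alphabetical tie-break decides
-- toys: the only sale has a NULL amount, so the city should come back NULL
INSERT INTO sales VALUES (1, 10, 10.0), (2, 10, 10.0), (2, 20, NULL);
""")

rows = conn.execute("""
    SELECT item_category, city
    FROM (
        SELECT i.item_category, u.city,
               ROW_NUMBER() OVER (PARTITION BY i.item_category
                                  ORDER BY AVG(s.sales_amt) DESC, u.city) AS rn
        FROM items i
        LEFT JOIN (SELECT * FROM sales WHERE sales_amt IS NOT NULL) s
               ON s.item_id = i.item_id
        LEFT JOIN users u ON u.user_id = s.user_id
        GROUP BY i.item_category, u.city
    ) t
    WHERE rn = 1
    ORDER BY item_category
""").fetchall()

print(rows)  # [('books', 'austin'), ('toys', None)]
```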
I think you want left joins starting from users in the inner query to preserve cities without sales.
As for the ranking: if you want one record per city, then do not put columns other than city in the partition (your current partition gives you one record per city and per category, which is not what you want).
Consider:
select *
from (
select
i.item_category,
u.city,
rank() over(partition by u.city order by avg(s.sales_amt) desc) rk
from users u
left join sales s on s.user_id = u.user_id
left join items i on i.item_id = s.item_id
group by i.item_category, u.city
) t
where rk = 1

Calculate percentage of group using Group By SQL

I have a set of data that contains multiple groups of data(Vehicle_Code), each item(PK: Cusip_Sedol) in the group has a certain code(GIC_Code) that is not unique. I am trying to find the percentage of each code(GIC_Code) within each group(Vehicle_Name) of data.
Here is my SQL Select statement thus far:
SELECT H.vehicle_code,
G.group_name,
Count(D.cusip_sedol) AS Total
FROM tbltrading_holdings AS H
INNER JOIN tbltrading_stocks_data_stocks AS D
ON H.cusip_sedol = D.cusip_sedol
LEFT JOIN tbltrading_gic AS G
ON D.gic_code = G.gic_code
WHERE vehicle_code IN (SELECT vehicle_code
FROM tbltrading_vehicles
WHERE vehicle_name LIKE 'J%')
AND D.gic_code IS NOT NULL
GROUP BY H.vehicle_code,
G.group_name
ORDER BY vehicle_code;
SELECT
H.vehicle_code,
G.group_name,
VehicleTotal = Count(D.cusip_sedol) OVER (PARTITION BY H.vehicle_code, G.group_name),
d.gic_code,
gic_codePercentPerVehicleName =
Count(d.gic_code) OVER () * 1.0 / Count(*) OVER (PARTITION BY V.vehicle_name),
gic_codePercentPerVehicleName2 =
Count(d.gic_code) * 1.0 / Count(*) OVER (PARTITION BY V.vehicle_name)
FROM
dbo.tbltrading_holdings H
INNER JOIN tbltrading_stocks_data_stocks D
ON H.cusip_sedol = D.cusip_sedol
LEFT JOIN dbo.tbltrading_gic G
ON D.gic_code = G.gic_code
INNER JOIN dbo.tbltrading_vehicles V
ON H.vehicle_code = V.vehicle_code
AND v.vehicle_name LIKE 'J%'
WHERE
D.gic_code IS NOT NULL
GROUP BY
H.vehicle_code,
D.gic_code,
G.group_name,
V.vehicle_name
ORDER BY
H.vehicle_code
;
There are some unknowns here that have forced me to make certain assumptions. You can see that I've come up with two different interpretations about what "gic code per vehicle name" could mean.
For starters, to provide the vehicle_name each gic_code is associated with, we have to do a real join, not an IN (which is effectively an EXISTS). However, is it possible for the same gic_code to join up to different vehicle_name values? (Since there is an intermediate vehicle_code that joins them?) I'm assuming that it's not possible for this to happen, and if it actually is, the query will give unuseful results, and you'll have to formulate better what exactly you're looking for before we can help you more.
Next, the results are all muddied by the fact that you're selecting so many columns, which forces them to be part of the GROUP BY. But once you do that, then all the windowing functions have to include partitions to "break" them out of the grouping. This query may run slowly, as it's being made to do a lot at once, which could result in many scans of the table. The way things are now, for each particular gic_code, you'll get many rows with the same value, because the query has to expose the (multiple) vehicle_code and group_name combinations for each one. Is that really what you want?
You might get better results if you removed some of the displayed columns, as this would let you remove at least some of the PARTITION BY expressions.
Last, I'm not sure I even got the partitions correct. Only you know the cardinality of each column in relation to the joins to other tables.
What you need is the total over all the rows . . . and you can get this using window functions. So, change the select to:
SELECT H.vehicle_code,
G.group_name,
Count(D.cusip_sedol) AS Total,
Count(D.Cusip_sedol)*1.0 / Sum(Count(D.Cusip_sedol)) Over () as p_total
. . .
Note that the *1.0 is there just to prevent integer division.
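As a rough illustration of a window function wrapped around an aggregate, the sketch below uses Python's built-in sqlite3 module (SQLite 3.25 or later) and a hypothetical single `holdings` table in place of the three joined tables. It also shows a PARTITION BY vehicle_code variant, which is an assumption about what "percentage within each group" is meant to be:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE holdings (vehicle_code TEXT, group_name TEXT);
INSERT INTO holdings VALUES
  ('V1', 'Energy'), ('V1', 'Energy'), ('V1', 'Tech'), ('V1', 'Tech'),
  ('V2', 'Tech'),   ('V2', 'Energy'), ('V2', 'Energy'), ('V2', 'Energy');
""")

rows = conn.execute("""
    SELECT vehicle_code,
           group_name,
           COUNT(*) AS total,
           -- share of all rows in the result
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS p_total,
           -- share within the same vehicle_code
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY vehicle_code) AS p_vehicle
    FROM holdings
    GROUP BY vehicle_code, group_name
    ORDER BY vehicle_code, group_name
""").fetchall()

for row in rows:
    print(row)
# ('V1', 'Energy', 2, 0.25, 0.5)
# ('V1', 'Tech', 2, 0.25, 0.5)
# ('V2', 'Energy', 3, 0.375, 0.75)
# ('V2', 'Tech', 1, 0.125, 0.25)
```

The inner COUNT(*) is evaluated per GROUP BY group; the outer SUM(...) OVER then adds those per-group counts across the window.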
I think you are pretty close. Is counting the Sedol working for you? If so then just divide that by the count of the group name for your percentage:
SELECT H.vehicle_code,
G.group_name,
cast(Count(DISTINCT D.cusip_sedol) as DECIMAL)/cast(count(DISTINCT G.group_name) as DECIMAL) AS Total --add this second part
FROM tbltrading_holdings AS H
INNER JOIN tbltrading_stocks_data_stocks AS D
ON H.cusip_sedol = D.cusip_sedol
LEFT JOIN tbltrading_gic AS G
ON D.gic_code = G.gic_code
WHERE vehicle_code IN (SELECT vehicle_code
FROM tbltrading_vehicles
WHERE vehicle_name LIKE 'J%')
AND D.gic_code IS NOT NULL
GROUP BY H.vehicle_code,
G.group_name
ORDER BY vehicle_code

How do I select the Max in this query? Help for exam

So, I'm going through a lot of exercises for a final SQL exam I have on Thursday and I came across another query I'm having doubts about.
The tables in the exercise are supposed to be from a hotel DB. You have three tables involved:
STAY             ROOM                ROOM_TYPE
===========      ============        ============
PK ID_STAY       PK ID_ROOM          PK ID_ROOM_TYPE
DAYS_QUANT       ID_ROOM_TYPE FK     DESCRIPTION
DATE             PRICE
ID_ROOM FK
The query they are asking me to do is "Show all data for the Room that has been rented for the highest amount of days (in total) in 2011, by room type (you have to show ID Room Type and Description)"
This is the way I solved it, I don't know if it's ok:
SELECT RT.ID_ROOM_TYPE, RT.DESCRIPTION, R.*, SUM(S.DAYS_QUANT)
FROM STAY S, ROOM R, ROOM_TYPE RT
WHERE YEAR(S.DATE) = '2011'
GROUP BY RT.ID_ROOM_TYPE, RT.DESCRIPTION, R.*
ORDER BY SUM(S.DAYS_QUANT) DESC
LIMIT 1
So, the first thing I'm not sure about, is that R.* I included. Can I put it like that in a SELECT? Can it also be included like that in a GROUP BY?
The other thing I'm not sure about if I will be allowed to use LIMIT or SELECT TOP 1 statements in the exam. Can anyone think of a way to solve this without using those? like with a MAX() statement or something?
I believe that you are not allowed to use CTEs, so I expanded the last part of Steve Kass's answer. You can get the desired results without TOP or LIMIT by comparing the total days each room was occupied with the maximum total days any room of the same type was occupied. To do so, first sum the days by room and then, using an identical derived table, get the maximum of days per room type. Joining the two by room type and days isolates the most used rooms. Then join the starting tables to show all the data. Unlike TOP or LIMIT, this will produce more than one record per room type in case of a tie.
P.S. this is NOT tested. I believe it will work, but there might be a typo.
select r.*, rt.*, roomDays.TotalDays
from Room r inner join Room_type rt
on r.id_room_type = rt.id_room_type
inner join
(select Room.id_room, Room.id_room_type, sum(days_quant) TotalDays
from Stay
inner join Room
on Stay.id_room = Room.id_room
where year(Date) = 2011
group by Room.id_room, Room.id_room_type) roomDays
on r.id_room = roomDays.id_room
inner join
(select id_room_type, max(TotalDays) TotalDays
from
(select Room.id_room, Room.id_room_type, sum(days_quant) TotalDays
from Stay
inner join Room
on Stay.id_room = Room.id_room
where year(Date) = 2011
group by Room.id_room, Room.id_room_type) roomDaysHelper
group by id_room_type) roomTypeDays
on r.id_room_type = roomTypeDays.id_room_type
and roomDays.TotalDays = roomTypeDays.TotalDays
select r.*, t.*
from room r
join room_type t on t.id_room_type = r.id_room_type
where r.id_room in
(select
(select r.id_room
from room r
join stay s on s.id_room = r.id_room
where year(s.date) = '2011'
and r.id_room_type = t.id_room_type
group by r.id_room
order by sum(s.days_quant) desc
limit 1) room_id
from room_type t)
It's always possible to avoid LIMIT 1 or SELECT TOP. One way is to express the top row as the row for which there is no higher row. WHERE NOT EXISTS expresses the idea of "for which there is no."
One way to think of this is as follows: Select those rooms (along with their total days and type information) for which there is no room of the same type with a greater number of total days. That gives you this query (not carefully proofread):
with StayTotals as (
select
STAY.ID_ROOM,
ROOM_TYPE.ID_ROOM_TYPE,
ROOM_TYPE.DESCRIPTION,
SUM(STAY.DAYS_QUANT) AS TotalDays2011
from STAY join ROOM on STAY.ID_ROOM = ROOM.ID_ROOM
join ROOM_TYPE on ROOM.ID_ROOM_TYPE = ROOM_TYPE.ID_ROOM_TYPE
where YEAR(STAY.DATE) = 2011
group by STAY.ID_ROOM, ROOM_TYPE.ID_ROOM_TYPE, ROOM_TYPE.DESCRIPTION
)
select *
from StayTotals as T1
where not exists (
select *
from StayTotals as T2
where T2.ID_ROOM_TYPE = T1.ID_ROOM_TYPE
and T2.TotalDays2011 > T1.TotalDays2011
);
If you can't use CTEs (the WITH clause), you can rewrite it using subqueries, but it's awkward.
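The NOT EXISTS idea can be tried on a tiny data set. This sketch uses Python's built-in sqlite3 module and a hypothetical `stay_totals` table that already holds the per-room sums (what the StayTotals CTE would produce), so the table name and values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stay_totals (id_room INTEGER, id_room_type INTEGER, total_days INTEGER);
-- type 1: room 101 wins outright; type 2: rooms 201 and 202 tie
INSERT INTO stay_totals VALUES (101, 1, 30), (102, 1, 12), (201, 2, 20), (202, 2, 20);
""")

# Keep a room only if no room of the same type has strictly more days.
rows = conn.execute("""
    SELECT t1.id_room_type, t1.id_room, t1.total_days
    FROM stay_totals t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM stay_totals t2
        WHERE t2.id_room_type = t1.id_room_type
          AND t2.total_days > t1.total_days
    )
    ORDER BY t1.id_room_type, t1.id_room
""").fetchall()

print(rows)  # [(1, 101, 30), (2, 201, 20), (2, 202, 20)]
```

Because the comparison is strict, tied rooms are all returned, which is the behaviour LIMIT 1 cannot give.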
Ranking functions have been part of the SQL standard for quite a while. If you can use them, this may also work:
with StayTotals as (
select
STAY.ID_ROOM,
ROOM_TYPE.ID_ROOM_TYPE,
ROOM_TYPE.DESCRIPTION,
SUM(STAY.DAYS_QUANT) AS TotalDays2011
from STAY join ROOM on STAY.ID_ROOM = ROOM.ID_ROOM
join ROOM_TYPE on ROOM.ID_ROOM_TYPE = ROOM_TYPE.ID_ROOM_TYPE
where YEAR(STAY.DATE) = 2011
group by STAY.ID_ROOM, ROOM_TYPE.ID_ROOM_TYPE, ROOM_TYPE.DESCRIPTION
), StayTotalsRankedByType as (
select
ID_ROOM,
ID_ROOM_TYPE,
DESCRIPTION,
TotalDays2011,
RANK() OVER (
PARTITION BY ID_ROOM_TYPE
ORDER BY TotalDays2011 DESC
) as RankInRoomType
from StayTotals
)
select
ID_ROOM,
ID_ROOM_TYPE,
DESCRIPTION,
TotalDays2011
from StayTotalsRankedByType
where RankInRoomType = 1;
Finally, one other way to pull in additional columns to describe the grouped MAX results is to use a "carryalong" sort, which was a handy technique before ranking functions were available. Adam Machanic gives an example here, and there are useful threads on the topic from Usenet, such as this one.
How about this?
select room.id_room, room_type.description, room.price
from room inner join room_type
on room.id_room_type = room_type.id_room_type
where room.id_room = (select id_room from stay
where year (date) = 2011
group by id_room
order by sum (days_quant) desc);
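Unfortunately, this query (as it is now) doesn't show for how many days the most popular room had been rented. But there's no 'limit 1'! Be aware that the = comparison expects the subquery to return a single row, so in most engines this fails as soon as more than one room was rented in 2011.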
Thank you all! with all the ideas you gave me I came up with this, let me know if you think it's ok please!
SELECT R.ID_ROOM, R.ID_ROOM_TYPE, T.DESCRIPTION, SUM(S.DAYS_QUANT)
FROM ROOM R, ROOM_TYPE T, STAY S,
(SELECT ID_STAY, SUM(S.DAYS_QUANT) TOTALDAYS
FROM STAY S
WHERE YEAR(S.DATE) = 2011
GROUP BY S.ID_STAY) STAYHELPER
WHERE YEAR(S.DATE) = 2011
GROUP BY R.ID_ROOM, R.ID_ROOM_TYPE, T.DESCRIPTION
HAVING SUM(S.DAYS_QUANT) >= ALL STAYHELPER.TOTALDAYS

Help in a Join query

SELECT game_ratingstblx245v.game_id,avg( game_ratingstblx245v.rating )
as avg_rating,
count(DISTINCT game_ratingstblx245v.userid)
as count,
game_data.name,
game_data.id ,
avg(game_ratings.critic_rating),count(DISTINCT game_ratings.critic)
as cr_count
FROM game_data
LEFT JOIN game_ratingstblx245v ON game_ratingstblx245v.game_id = game_data.id
LEFT JOIN game_ratings ON game_ratings.game_id = game_data.id
WHERE game_data.release_date < NOW()
GROUP BY game_ratingstblx245v.game_id
ORDER BY game_data.release_date DESC,
game_data.name
I am currently using this query to extract values from 3 tables:
game_data - id(foreign key), name, release_date \games info
game_ratings - game_id(foreign key),critic , rating \critic rating
game_ratingstblx245v - game_id(foreign key), rating, userid \user rating
What I want to do with this query is select all ids from table game_data, ordered by release_date descending, and then get the avg rating from tables game_ratings and game_ratingstblx245v for each id (if a game has not been rated, the fields from the latter two tables should return null). The problem I am facing is that the result is not coming out as expected: some games which have not been rated are showing up while others are not. Can you check my query and tell me where I am going wrong? Thanks.
You shouldn't use the game_ratingstblx245v.game_id column in your GROUP BY, since it could be NULL when there are no ratings for a given game id. Use game_data.id instead.
Here's how I would write the query:
SELECT g.id, g.name,
AVG( x.rating ) AS avg_user_rating,
COUNT( DISTINCT x.userid ) AS user_count,
AVG( r.critic_rating ) AS avg_critic_rating,
COUNT( DISTINCT r.critic ) AS critic_count
FROM game_data g
LEFT JOIN game_ratingstblx245v x ON (x.game_id = g.id)
LEFT JOIN game_ratings r ON (r.game_id = g.id)
WHERE g.release_date < NOW()
GROUP BY g.id
ORDER BY g.release_date DESC, g.name;
Note that although this query produces a Cartesian product between x and r, it doesn't affect the calculation of the average ratings. Just be aware in the future that if you were doing SUM() or COUNT(), the calculations could be exaggerated by an unintended Cartesian product.
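A quick way to convince yourself of this is to run both kinds of aggregate over the multiplied rows. The sketch below uses Python's built-in sqlite3 module; `user_ratings` and `critic_ratings` are hypothetical stand-ins for the two rating tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game_data (id INTEGER PRIMARY KEY);
CREATE TABLE user_ratings (game_id INTEGER, rating REAL);
CREATE TABLE critic_ratings (game_id INTEGER, critic_rating REAL);
INSERT INTO game_data VALUES (1);
INSERT INTO user_ratings VALUES (1, 2.0), (1, 4.0);
INSERT INTO critic_ratings VALUES (1, 6.0), (1, 8.0), (1, 10.0);
""")

# 2 user ratings x 3 critic ratings = 6 joined rows for game 1.
row = conn.execute("""
    SELECT COUNT(*)        AS joined_rows,
           AVG(x.rating)   AS avg_user,    -- each rating repeated 3 times: average unchanged
           SUM(x.rating)   AS sum_user,    -- inflated 3x (true sum is 6.0)
           COUNT(x.rating) AS count_user   -- inflated 3x (true count is 2)
    FROM game_data g
    LEFT JOIN user_ratings x   ON x.game_id = g.id
    LEFT JOIN critic_ratings r ON r.game_id = g.id
    GROUP BY g.id
""").fetchone()

print(row)  # (6, 3.0, 18.0, 6)
```

The average survives because every user rating is duplicated the same number of times; SUM and a plain COUNT do not, which is why the query above relies on COUNT(DISTINCT ...).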