Using multiple left joins to calculate averages and counts - sql

I am trying to figure out how to use multiple left outer joins to calculate average scores and number of cards. I have the following schema and test data. Each deck has 0 or more scores and 0 or more cards. I need to calculate an average score and card count for each deck. I'm using mysql for convenience, I eventually want this to run on sqlite on an Android phone.
mysql> select * from deck;
+----+-------+
| id | name |
+----+-------+
| 1 | one |
| 2 | two |
| 3 | three |
+----+-------+
mysql> select * from score;
+---------+-------+---------------------+--------+
| scoreId | value | date | deckId |
+---------+-------+---------------------+--------+
| 1 | 6.58 | 2009-10-05 20:54:52 | 1 |
| 2 | 7 | 2009-10-05 20:54:58 | 1 |
| 3 | 4.67 | 2009-10-05 20:55:04 | 1 |
| 4 | 7 | 2009-10-05 20:57:38 | 2 |
| 5 | 7 | 2009-10-05 20:57:41 | 2 |
+---------+-------+---------------------+--------+
mysql> select * from card;
+--------+-------+------+--------+
| cardId | front | back | deckId |
+--------+-------+------+--------+
| 1 | fron | back | 2 |
| 2 | fron | back | 1 |
| 3 | f1 | b2 | 1 |
+--------+-------+------+--------+
I run the following query...
mysql> select deck.name, sum(score.value)/count(score.value) "Ave",
-> count(card.front) "Count"
-> from deck
-> left outer join score on deck.id=score.deckId
-> left outer join card on deck.id=card.deckId
-> group by deck.id;
+-------+-----------------+-------+
| name | Ave | Count |
+-------+-----------------+-------+
| one | 6.0833333333333 | 6 |
| two | 7 | 2 |
| three | NULL | 0 |
+-------+-----------------+-------+
... and I get the right answer for the average, but the wrong answer for the number of cards. Can someone tell me what I am doing wrong before I pull my hair out?
Thanks!
John

It's running what you're asking--it's joining card 2 and 3 to scores 1, 2, and 3--creating a count of 6 (2 * 3). In card 1's case, it joins to scores 4 and 5, creating a count of 2 (1 * 2).
If you just want a count of cards, like you're currently doing, COUNT(Distinct Card.CardId).

select deck.name, coalesce(x.ave,0) as ave, count(card.*) as count -- card.* makes the intent more clear, i.e. to counting card itself, not the field. but do not do count(*), will make the result wrong
from deck
left join -- flatten the average result rows first
(
select deckId,sum(value)/count(*) as ave -- count the number of rows, not count the column name value. intent is more clear
from score
group by deckId
) as x on x.deckId = deck.id
left outer join card on card.deckId = deck.id -- then join the flattened results to cards
group by deck.id, x.ave, deck.name
order by deck.id
[EDIT]
sql has built-in average function, just use this:
select deckId, avg(value) as ave
from score
group by deckId

What's going wrong is that you're creating a Cartesian product between score and card.
Here's how it works: when you join deck to score, you may have multiple rows match. Then each of these multiple rows is joined to all of the matching rows in card. There's no condition preventing that from happening, and the default join behavior when no condition restricts it is to join all rows in one table to all rows in another table.
To see it in action, try this query, without the group by:
select *
from deck
left outer join score on deck.id=score.deckId
left outer join card on deck.id=card.deckId;
You'll see a lot of repeated data in the columns that come from score and card. When you calculate the AVG() over data that has repeats in it, the redundant values magically disappear (as long as the values are repeated uniformly). But when you COUNT() or SUM() them, the totals are way off.
There may be remedies for inadvertent Cartesian products. In your case, you can use COUNT(DISTINCT) to compensate:
select deck.name, avg(score.value) "Ave", count(DISTINCT card.front) "Count"
from deck
left outer join score on deck.id=score.deckId
left outer join card on deck.id=card.deckId
group by deck.id;
This solution doesn't solve all cases of inadvertent Cartesian products. The more general-purpose solution is to break it up into two separate queries:
select deck.name, avg(score.value) "Ave"
from deck
left outer join score on deck.id=score.deckId
group by deck.id;
select deck.name, count(card.front) "Count"
from deck
left outer join card on deck.id=card.deckId
group by deck.id;
Not every task in database programming must be done in a single query. It can even be more efficient (as well as simpler, easier to modify, and less error-prone) to use individual queries when you need multiple statistics.

Using left joins isn't a good approach, in my opinion. Here's a standard SQL query for the result you want.
select
name,
(select avg(value) from score where score.deckId = deck.id) as Ave,
(select count(*) from card where card.deckId = deck.id) as "Count"
from deck;

Related

Access SQL - divide value in each row by total (using self join)

I have a table (tblGoals) which shows how many goals each player has scored, e.g:
| Player | Goals |
--------------------
| John | 6 |
| Chris | 10 |
| Ben | 4 |
I am trying to write a query that will output each player along with the percentage of the teams total goals that they have scored:
| Player | PercentageGoals |
------------------------------
| John | 0.3 |
| Chris | 0.5 |
| Ben | 0.2 |
I have already figured out how to do this with a sub query as shown
SELECT
Player,
Goals/ (SELECT SUM(Goals)FROM tblGoals) AS PercentageGoals
FROM tblGoals
The example table I have shown is just to demonstrate what I am trying to do. The actual table I am using is a much larger dataset and trying to use a subquery to get the percentage in this way is running quite slowly.
I have noticed before in Access that self-joins are usually optimised more efficiently than sub queries, and so I am trying to figure out if the above query can be rewritten using a self join.
I have tried something along the lines of below but obviously this is incorrect as t2 is grouped by Player which means I am not getting the true total, but if I leave the Player name out then I can't join on it?
SELECT
t1.Player,
t1.Goals/t2.sumGoals AS PercentageGoals
FROM
tbl1 t1
INNER JOIN
(SELECT Player, SUM(Goals) as sumGoals
FROM tbl1
GROUP BY Player) t2
ON t1.Player= t2.Player
Is there any way to do this?
I'll try this as an answer - although I'm not sure if it will be faster as only using a three records.
Your second table in the join should just have a total of all the goals to join to each record in the first table - a cartesian product.
You can then divide one by the other:
SELECT Player
, Goals / TotalGoals AS Total
FROM tblGoals, (SELECT SUM(Goals) AS TotalGoals FROM tblGoals)
Is it any faster on a big table? The idea being that in your SQL it was calculating the total for each record, while this creates the total as a table and joins to it so should only be calculated once.

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez#gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille#gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the 'stuff 'function. The only drawback here is that, since your ids are ints, placing a comma between values will involve a work around (would need to be a string). Below is the method I can think of using a work around.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
#> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
#> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

How to cross join in Big Query using intervals?

How can I join two tables using intervals in Google Big Query?
I have two table:
Table CarsGPS:
ID | Car | Latitude | Longitude
1 | 1 | -22.123 | -43.123
2 | 1 | -22.234 | -43.234
3 | 2 | -22.567 | -43.567
4 | 2 | -22.678 | -43.678
...
Table Areas:
ID | LatitudeMin | LatitudeMax | LongitudeMin | LongitudeMax
1 | -22.124 | -22.120 | -43.124 | -43.120
2 | -22.128 | -22.124 | -43.128 | -43.124
...
I'd like to cross join these tables to check in which areas each car has passed by using Google Big Query.
In a regular SQL server I would make:
SELECT A.ID, C.Car
FROM Cars C, Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax
But Google Big Query only allows me to do joins (even JOIN EACH) using exact matches among joined tables. And the "FROM X, Y" means UNION, not JOINS.
So, this is not an option:
SELECT A.ID, C.Car
FROM Cars C
JOIN EACH
Areas A
ON C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax
Then, how can I run something similar to it to identify which cars passed inside each area?
BigQuery now supports CROSS JOIN. Your query would look like:
SELECT A.ID, C.Car
FROM Cars C
CROSS JOIN Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax

prevent from double/triple SUMing when JOINing

i am joining two tables: accn_demographics and accn_payments. The relationship between the two tables is one to many between accn_demographics.accn_id and accn_payments.accn_id
My question is when I am summing the PAID_AMT and COPAY_AMT, I am getting double/triple/quadrouple the number that I should be getting.
Is there an obvious problem with my join condition?
select sum(p.paid_amt) as SumPaidAmount
, sum(p.copay_amt) as SumCoPay
, p.pmt_date
, d.load_Date
, p.ACCN_ID
from accn_payments p
join
(
select distinct load_date, accn_id
from accn_demographics
) d
on p.ACCN_ID=d.ACCN_ID
where p.POSTED='Y'
and p.pmt_date between '20120701' and '20120731'
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
thanks so much for your guidance.
You need to do the summation in a subquery:
select sum(p.SumPaidAmount) as SumPaidAmount, sum(p.SumCoPay) as SumCoPay,
p.pmt_date, d.load_Date, p.ACCN_ID
from (select accn_id, p.pmt_date, sum(paid_amt) as SumPaidAmt,
sum(copay_amt) as SumCoPay
from accn_payments p
where p.POSTED='Y' and
p.pmt_date between '20120701' and '20120731'
group by accn_id, pmt_date
) p join
(select distinct load_date, accn_id from accn_demographics) d
on p.ACCN_ID=d.ACCN_ID
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
Question: do you really intend for pmt_date to be in the final results? It looks like you want to remove it from both the outer SELECT and the subquery.
The only thing I can see if that (select distinct load_date, accn_id from accn_demographics) might return several matches. Look at your data and run a separate query
select distinct load_date, accn_id from accn_demographics WHERE accn_id=SomeID
where SomeID is one of the result accounts that is returning double/triple values. That should pinpoint your problem.
Yes, but it's not so obvious for beginners. What happens is that for every accn_payments record, you're matching on ONLY the accn_id, which means if there are multiple records in accn_demographics for that particular accn_id, then you will get duplicate accn_payment records due to the join. Is there another limiting field on accn_demographics to join back to the payments?
Ultimately, think of it this way:
accn_payments (p):
accn_id | paid_amt | copay_amt | ...
----------------------------------------------------
1 | 100.00 | 20.00 | ...
accn_demographics (d):
accn_id | load_date | ...
------------------------------------
1 | 2012/01/01 | ...
1 | 2012/03/05 | ...
1 | 2012/06/23 | ...
After joining, your results will look like this:
p.accn_id | p.paid_amt | p.copay_amt | p... | d.accn_id | d.load_date | d...
----------------------------------------------------------------------------
1 | 100.00 | 20.00 | .... | 1 | 2012/01/01 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/03/05 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/06/21 | ....
As you can see, the same row from accn_payments gets replicated for every matching accn_demographics record, since you specified only the accn_id column to be the join criteria. It can't limit the results any further, so it the DB engine says "Hey, look, this p record matches for all these d records, this must be what he was asking for!" Obviously not what was intended, as when you sum on the p.paid_amt and p.copay_amt, it performs a sum for ALL ROWS (even though they are duplicated).
Ultimately, see if you can limit the join criteria for accn_demographics even further (by some date, perhaps), that way you limit the number of duplicate payment records during the join.

SQL: how to make a GROUP BY query considerin' also empty fields

I have to group some data into categories, based on the column "qualifier" that could be 1,2,3,4, or empty.
The problem is that "empty" is not considered into the "group by" categories.
Here's my query:
SELECT m.id, m.name, COUNT (*)
FROM _gialli_g2bff_distinct AS g
INNER JOIN flag.qualifier_flags AS m ON g.qualifier = m.id
GROUP BY m.name, m.id
ORDER BY m.id;
Here's the answer:
1 | "NOT_CONTRIBUTES_TO" | 2
2 | "CONTRIBUTES_TO" | 411
3 | "COLOCALIZES_WITH" | 200
4 | "NOT" | 983
The problem of this answer is that it does not take into account all the elements that have qualifier field EMPTY.
Here's what I would like to have as answer:
1 | "NOT_CONTRIBUTES_TO" | 2
2 | "CONTRIBUTES_TO" | 411
3 | "COLOCALIZES_WITH" | 200
4 | "NOT" | 983
5 | | 1854
How could I modify my query?
Thanks
Your problem is not occuring at the GROUP BY level, but rather in the JOIN. The rows with a NULL qualifier cannot be JOINed and, because you're using INNER JOIN, they fall out of the result set.
Use LEFT OUTER JOIN to see all the rows.