Summing cost by id that appears on multiple rows - sql

SOLUTION
I solved it simply by doing the following:
SELECT table_size, sum(cost) as total_cost, sum(num_players) as num_players
FROM
(
SELECT table_size, cost, count(tp.uid) as num_players
FROM tournament as t
LEFT JOIN takes_part AS tp ON tp.tid = t.tid
LEFT JOIN users as u on u.uid = tp.uid
JOIN attributes as a on a.aid = t.attrId
GROUP BY t.tid
) as res
GROUP BY table_size
I wasn't sure it would work, what with the other aggregate functions that I had to use in my real SQL, but it seems to be working OK. There may be problems in the future if I want to do other kinds of calculations, for instance a COUNT(DISTINCT tp.uid) over all tournaments. Still, in this case that is not all that important, so I am satisfied for now. Thank you all for your help.
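For anyone who wants to try the two-level aggregation without the fiddle, here is a minimal sketch that runs the same idea through Python's built-in sqlite3 module. The table contents are made up for illustration (they are not the fiddle's data), chosen so that the totals match the expected output of 110 and 80. The users table is left out because it does not affect the sums.

```python
import sqlite3

# Hypothetical sample data, not the data from the original fiddle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (aid INTEGER PRIMARY KEY, table_size INTEGER);
CREATE TABLE tournament (tid INTEGER PRIMARY KEY, cost INTEGER, attrId INTEGER);
CREATE TABLE takes_part (tid INTEGER, uid INTEGER);
INSERT INTO attributes VALUES (1, 5), (2, 8);
INSERT INTO tournament VALUES (1, 50, 1), (2, 60, 1), (3, 80, 2);
INSERT INTO takes_part VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2);
""")

rows = conn.execute("""
SELECT table_size, SUM(cost) AS total_cost, SUM(num_players) AS num_players
FROM (
    -- Inner level: collapse the join to exactly one row per tournament.
    SELECT a.table_size, t.cost, COUNT(tp.uid) AS num_players
    FROM tournament AS t
    JOIN attributes AS a ON a.aid = t.attrId
    LEFT JOIN takes_part AS tp ON tp.tid = t.tid
    GROUP BY t.tid, a.table_size, t.cost
) AS res
-- Outer level: each tournament's cost is now counted once per group.
GROUP BY table_size
ORDER BY table_size
""").fetchall()

print(rows)
```

In this sample data uid 1 plays in tournaments 1 and 2, so num_players for table size 5 comes out as 5 although only 4 distinct users are involved. That is the COUNT(DISTINCT ...) limitation mentioned above.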
UPDATE!!!
Here is a Fiddle that explains the problem:
http://www.sqlfiddle.com/#!2/e03ff/7
I want to get:
table_size | cost
-------------------------------
5 | 110
8 | 80
OLD POST
I'm sure that there is an easy solution to this that I'm just not seeing, but I can't seem to find a solution to it anywhere. What I'm trying to do is the following:
I need to sum 'costs' per tournament in a system. For other reasons, I've had to join with lots of other tables, making the same cost appear on multiple rows, like so:
id | name | cost | (hidden_id)
-----------------------------
0 | Abc | 100 | 1
1 | ASD | 100 | 1
2 | Das | 100 | 1
3 | Ads | 50 | 2
4 | Ads | 50 | 2
5 | Fsd | 0 | 3
6 | Ads | 0 | 3
7 | Dsa | 0 | 3
The costs in the table above are linked to an id value that is not necessarily selected by the SQL (this depends on what the user decides at runtime). What I want to get is the sum 100+50+0 = 150. Of course, if I just use SUM(cost) I will get a different answer. I tried using SUM(cost)/COUNT(*)*COUNT(tourney_ids) but this only gives the correct result under certain circumstances. A (very) simple form of the query looks like this:
SELECT SUM(cost) as tot_cost -- This will not work as it sums all rows where the cost appears.
FROM t
JOIN ta ON t.attr_id = ta.toaid
JOIN tr ON tr.toid = t.toid -- This row will cause multiple rows with same cost
GROUP BY *selected by user* -- This row enables the user to group by several attributes, such as weekday, hour or ids of different kinds.
UPDATE. A more correct SQL-query, perhaps:
SELECT
*some way to sum cost*
FROM tournament AS t
JOIN attribute AS ta ON t.attr_id = ta.toaid
JOIN registration AS tr ON tr.tourneyId = t.tourneyId
INNER JOIN pokerstuff as ga ON ta.game_attr_id = ga.gameId
LEFT JOIN people AS p ON p.userId = tr.userId
LEFT JOIN parttaking AS jlt ON (jlt.tourneyId = t.tourneyId AND tr.userId = jlt.userId)
LEFT JOIN (
SELECT t.tourneyId,
ta.a - (ta.b) - sum(c)*ta.cost AS cost
FROM tournament as t
JOIN attribute as ta ON (t.attr_id = ta.toaid)
JOIN registration tr ON (tr.tourneyId = t.tourneyId)
GROUP BY t.tourneyId, ta.b, ta.a
) as o on t.tourneyId = o.tourneyId
AND whereConditions
GROUP BY groupBySql
Description of the tables
tournament (tourneyId, name, attributeId)
attributes (attributeId, ..., gameid)
registration (userId, tourneyId, ...)
pokerstuff(gameid,...)
people(userId,...)
parttaking(userId, tourneyId,...)
Let's assume that we have the following (cost is actually calculated in a subquery, but since it's tied to tournament, I will treat it as an attribute here):
tournament:
tourneyId | name | cost
1 | MyTournament | 50
2 | MyTournament | 80
and
userId | tourneyId
1 | 1
2 | 1
3 | 1
4 | 1
1 | 2
4 | 2
The problem is rather simple. I need to be able to get the sum of the costs of the tournaments without counting a tournament more than once. The sum (and all other aggregates) will be dynamically grouped by the user.
A big problem is that many solutions that I've tried (such as SUM OVER...) would require that I group by certain attributes, and that I cannot do. The group by-clause must be completely decided by the user. The sum of the cost should sum over any group-by attributes, the only problem is of course the multiple rows in which the sum appears.
Does anyone have any good hints on what can be done?

Try the following:
select *selected by user*, sum(case rownum when 1 then a.cost end)
from
(
select
*selected by user*, cost,
row_number() over (partition by t.tid) as rownum
FROM t
JOIN ta ON t.attr_id = ta.toaid
JOIN tr ON tr.toid = t.toid
) a
group by *selected by user*
The row_number is used to number the rows that belong to the same tournament. When summing the costs we only consider those rows with a rownum of 1. All other rows are duplicates of this one with regard to the cost.
In terms of the fiddle:
select table_size, sum(case rownum when 1 then a.cost end)
from
(
SELECT
table_size, cost,
row_number() over (partition by t.tid) as rownum
FROM tournament as t
LEFT JOIN takes_part AS tp ON tp.tid = t.tid
LEFT JOIN users as u on u.uid = tp.uid
JOIN attributes as a on a.aid = t.attrId
) a
group by table_size
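A small runnable sketch of the row_number idea, using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The tables and rows are hypothetical stand-ins for the fiddle, and an ORDER BY is added inside the window only to make the numbering deterministic.

```python
import sqlite3

# Hypothetical sample data, not the data from the original fiddle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (aid INTEGER PRIMARY KEY, table_size INTEGER);
CREATE TABLE tournament (tid INTEGER PRIMARY KEY, cost INTEGER, attrId INTEGER);
CREATE TABLE takes_part (tid INTEGER, uid INTEGER);
INSERT INTO attributes VALUES (1, 5), (2, 8);
INSERT INTO tournament VALUES (1, 50, 1), (2, 60, 1), (3, 80, 2);
INSERT INTO takes_part VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2);
""")

rows = conn.execute("""
SELECT table_size,
       -- Only the first row of each tournament contributes its cost.
       SUM(CASE WHEN rownum = 1 THEN cost END) AS total_cost
FROM (
    SELECT a.table_size, t.cost,
           ROW_NUMBER() OVER (PARTITION BY t.tid ORDER BY tp.uid) AS rownum
    FROM tournament AS t
    JOIN attributes AS a ON a.aid = t.attrId
    LEFT JOIN takes_part AS tp ON tp.tid = t.tid
) AS numbered
GROUP BY table_size
ORDER BY table_size
""").fetchall()

print(rows)
```

One thing to watch: the whole cost is attributed to whichever group contains the row numbered 1, so grouping by a column that varies within a tournament (such as uid) puts the full cost into a single group.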

As the repeated costs are the same each time you can average them by their hidden id and do something like this:
WITH MrTable AS (
SELECT DISTINCT hidden_id, AVG(cost) OVER (PARTITION BY hidden_id) AS cost
FROM stuff
)
SELECT SUM(cost) FROM MrTable;
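To check this against the sample rows from the question, the following sketch loads them into an in-memory SQLite database via Python (SQLite 3.25+ for the window function). The table name stuff comes from the query above; the column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stuff (id INTEGER, name TEXT, cost INTEGER, hidden_id INTEGER)"
)
# The eight joined rows shown in the question.
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?, ?)",
    [
        (0, "Abc", 100, 1), (1, "ASD", 100, 1), (2, "Das", 100, 1),
        (3, "Ads", 50, 2), (4, "Ads", 50, 2),
        (5, "Fsd", 0, 3), (6, "Ads", 0, 3), (7, "Dsa", 0, 3),
    ],
)

# Plain SUM counts every repeated cost.
naive_total = conn.execute("SELECT SUM(cost) FROM stuff").fetchone()[0]

# DISTINCT over (hidden_id, per-id average) leaves one row per hidden_id.
dedup_total = conn.execute("""
WITH MrTable AS (
    SELECT DISTINCT hidden_id,
           AVG(cost) OVER (PARTITION BY hidden_id) AS cost
    FROM stuff
)
SELECT SUM(cost) FROM MrTable
""").fetchone()[0]

print(naive_total, dedup_total)
```

The plain sum is 400, while the deduplicated sum is the 150 the question asks for.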

(Updated) Given that the cost currently returned is the total cost per tournament, you could include a fractional value of cost on each line of an inner select, such that the total of all those values adds up to the total cost (allowing for the fact that each given tournament's values may be appearing multiple times), then sum that fractional cost in your outer select, like so:
select table_size, sum(frac_cost) as agg_cost from
(SELECT a.table_size , cost / count(*) over (partition by t.tid) as frac_cost
FROM tournament as t
LEFT JOIN takes_part AS tp ON tp.tid = t.tid
LEFT JOIN users as u on u.uid = tp.uid
JOIN attributes as a on a.aid = t.attrId) sq
GROUP BY table_size
SQLFiddle here.
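A runnable sketch of the fractional-cost idea using Python's sqlite3 module (SQLite 3.25+), with made-up data rather than the fiddle's. One caveat worth stating: if cost is an integer column, engines that do integer division (SQLite, PostgreSQL and SQL Server among them) truncate cost / count(*), so the sketch multiplies by 1.0 first. The rounding at the end is only there to absorb floating-point noise.

```python
import sqlite3

# Hypothetical sample data, not the data from the original fiddle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (aid INTEGER PRIMARY KEY, table_size INTEGER);
CREATE TABLE tournament (tid INTEGER PRIMARY KEY, cost INTEGER, attrId INTEGER);
CREATE TABLE takes_part (tid INTEGER, uid INTEGER);
INSERT INTO attributes VALUES (1, 5), (2, 8);
INSERT INTO tournament VALUES (1, 50, 1), (2, 60, 1), (3, 80, 2);
INSERT INTO takes_part VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2);
""")

rows = conn.execute("""
SELECT table_size, SUM(frac_cost) AS agg_cost
FROM (
    -- Each joined row carries an equal share of its tournament's cost.
    SELECT a.table_size,
           t.cost * 1.0 / COUNT(*) OVER (PARTITION BY t.tid) AS frac_cost
    FROM tournament AS t
    JOIN attributes AS a ON a.aid = t.attrId
    LEFT JOIN takes_part AS tp ON tp.tid = t.tid
) AS sq
GROUP BY table_size
ORDER BY table_size
""").fetchall()

# Round away floating-point noise from summing the fractions.
totals = {size: round(cost, 6) for size, cost in rows}
print(totals)
```

Compared with keeping only one row per tournament, this spreads the cost across all of a tournament's rows, so grouping by a column that varies within a tournament splits the cost proportionally instead of dropping it into one group.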

Related

How to fill in empty date rows multiple times?

I am trying to fill in dates with empty data, so that my query returned has every date and does not skip any.
My application needs to count bookings for activities by date in a report, and I cannot have skipped dates in what is returned by my SQL
I am trying to use a date table (I have a table with every date from 1/1/2000 to 12/31/2030) to accomplish this by doing a RIGHT OUTER JOIN on this date table, which works when dealing with one set of activities. But I have multiple sets of activities, each needing its own full range of dates regardless of whether there were bookings on that date.
I also have a function (DateRange) I found that allows for this:
SELECT IndividualDate FROM DateRange('d', '11/01/2017', '11/10/2018')
Let me give an example of what I am getting and what I want to get:
BAD: Without empty date rows:
date | activity_id | bookings
-----------------------------
1/2 | 1 | 5
1/4 | 1 | 4
1/3 | 2 | 6
1/4 | 2 | 2
GOOD: With empty date rows:
date | activity_id | bookings
-----------------------------
1/2 | 1 | 5
1/3 | 1 | NULL
1/4 | 1 | 4
1/2 | 2 | NULL
1/3 | 2 | 6
1/4 | 2 | 2
I hope this makes sense. I get the whole point of joining to a table of just a list of dates OR using the DateRange table function. But neither gets me the "GOOD" result above.
Use a cross join to generate the rows and then left join to fill in the values:
select d.date, a.activity_id, t.bookings
from DateRange('d', '2017-11-01', '2018-11-10') d cross join
(select distinct activity_id from t) a left join
t
on t.date = d.date and t.activity_id = a.activity_id;
It is a bit hard to follow what your data is and what comes from the function. But the idea is the same, wherever the data comes from.
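Here is a minimal sketch of the cross join plus left join pattern, run through Python's sqlite3 module so it can be tried without SQL Server. A small dates table stands in for the DateRange function, and the year 2018 and the table name t are assumptions made for the example. The bookings are the four rows from the "BAD" listing in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (date TEXT PRIMARY KEY);   -- stand-in for DateRange(...)
CREATE TABLE t (date TEXT, activity_id INTEGER, bookings INTEGER);
INSERT INTO dates VALUES ('2018-01-02'), ('2018-01-03'), ('2018-01-04');
INSERT INTO t VALUES
    ('2018-01-02', 1, 5), ('2018-01-04', 1, 4),
    ('2018-01-03', 2, 6), ('2018-01-04', 2, 2);
""")

rows = conn.execute("""
SELECT d.date, a.activity_id, t.bookings
FROM dates AS d
-- Every date paired with every activity: the full grid of rows.
CROSS JOIN (SELECT DISTINCT activity_id FROM t) AS a
-- Fill in bookings where they exist; the rest stay NULL.
LEFT JOIN t ON t.date = d.date AND t.activity_id = a.activity_id
ORDER BY a.activity_id, d.date
""").fetchall()

for row in rows:
    print(row)
```

The six rows that come back match the "GOOD" listing in the question, with None (NULL) for the two date/activity pairs that have no bookings.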
I figured it out:
SELECT TOP 100 PERCENT masterlist.dt, masterlist.activity_id, count(r_activity_sales_bymonth.bookings) AS totalbookings
FROM (SELECT c.activity_id, dateadd(d, b.incr, '2016-12-31') AS dt
FROM (SELECT TOP 365 incr = row_number() OVER (ORDER BY object_id, column_id), *
FROM (SELECT a.object_id, a.column_id
FROM sys.all_columns a CROSS JOIN
sys.all_columns b) AS a) AS b CROSS JOIN
(SELECT DISTINCT activity_id
FROM r_activity_sales_bymonth) AS c) AS masterlist LEFT OUTER JOIN
r_activity_sales_bymonth ON masterlist.dt = r_activity_sales_bymonth.purchase_date AND masterlist.activity_id = r_activity_sales_bymonth.activity_id
GROUP BY masterlist.dt, masterlist.activity_id
ORDER BY masterlist.dt, masterlist.activity_id

SQL Join or SUM is returning too many values when working with Redshift database

I'm working with a Redshift database and I can't understand why my join or SUM is bringing too many values. My query is below:
SELECT
date(u.created_at) AS date,
count(distinct c.user_id) AS active_users,
sum(distinct insights.spend) AS fbcosts,
count(c.transaction_amount) AS share_shake_costs,
round(((sum(distinct insights.spend) + count(c.transaction_amount)) /
count(distinct c.user_id)),2) AS cac
FROM
dbname.users AS u
LEFT JOIN
dbname.card_transaction AS c ON c.user_id = u.id
LEFT JOIN
facebookads.insights ON date(insights.date_start) = date(u.created_at)
LEFT JOIN
dbname.card_transaction AS c2 ON date(c2.timestamp) = date(u.created_at)
WHERE
c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%'
GROUP BY
date
ORDER BY
1 DESC;
This query returns the following data:
If we look at 2017-02-08, we can see a total of 1298 for "share_shake_costs". However, if I run the same query just on the card_transaction table I get the following results which are correct.
The query for this second table looks like this:
SELECT
date(timestamp),
sum(transaction_amount)
FROM
dbname.card_transaction AS c2
WHERE
c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%'
GROUP BY
1
ORDER BY
1 DESC;
I have a feeling that I have a similar issue for my "fbcosts" column. I think it has to do with my join since the SUM should be working fine.
I'm new to Redshift and SQL so perhaps there's a better way of doing this entire query. Is there anything obvious that I'm missing?
It seems you have a table that contains 1:n mapping and when you join over a common clause, that number is being counted n times.
Let us say one of your tables, orders contains user_id and the total bill_amount and the other table, order_details contains the detail of the sub-items placed by that user_id.
If you do a left join, by definition, each orders row will be repeated once for every order_details row with the same user_id, where
n = number of rows in order_details with that user_id
and the aggregation (sum, count etc) would then run over all n copies.
+------------------+ +----------------------+
| orders | | order_details |
+------------------+ +----------------------+
|amount user_id | | user_id items |
+------------------+ +----------------------+
| 1000 123 ---------> | 123 apple |
+ +----------------------+
+-------------> | 123 guava |
| +----------------------+
v-------------> | 123 mango |
+----------------------+
select sum(amount) from orders o left join order_details od
on o.user_id = od.user_id; -- result: 3000
select count(amount) from orders o left join order_details od
on o.user_id = od.user_id; -- result: 3
I hope the reason for large count is clear to you now.
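PS: Also, always prefer to enclose OR conditions in (). AND binds more tightly than OR, so without the parentheses any AND condition added later would apply only to one branch of the OR.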
WHERE
(c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%')
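The orders / order_details illustration above can be reproduced with a few lines of Python and sqlite3. The second query is one possible way around the fan-out (aggregating the child table before joining) and is a sketch, not the asker's actual Redshift query. The alias item_count is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (amount INTEGER, user_id INTEGER);
CREATE TABLE order_details (user_id INTEGER, items TEXT);
INSERT INTO orders VALUES (1000, 123);
INSERT INTO order_details VALUES (123, 'apple'), (123, 'guava'), (123, 'mango');
""")

# Joining first: the single orders row is repeated three times.
joined_sum, joined_count = conn.execute("""
SELECT SUM(o.amount), COUNT(o.amount)
FROM orders AS o
LEFT JOIN order_details AS od ON o.user_id = od.user_id
""").fetchone()

# Aggregating the child table first: one row per user_id, so no repetition.
safe_sum, item_count = conn.execute("""
SELECT SUM(o.amount), SUM(od.item_count)
FROM orders AS o
LEFT JOIN (
    SELECT user_id, COUNT(*) AS item_count
    FROM order_details
    GROUP BY user_id
) AS od ON o.user_id = od.user_id
""").fetchone()

print(joined_sum, joined_count, safe_sum, item_count)
```

The first query gives 3000 and 3, as in the diagram. The second keeps the amount at 1000 while still reporting 3 items.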

SQL 2 Left outer joins with Sum and Group By

Looking for some guidance on this. I am attempting to run a report in my complaint management system: Complaints by Year, Location, Subcategory, showing totals for TotalCredits (child table) and TotalCwts (child table) as well as total ExternalRootCause (on master table).
This is my SQL, but the TotalCwts and TotalCredits are not being calculated correctly. It calculates 1 time for each child record rather than the total for each master record.
SELECT
dbo.Complaints.Location,
YEAR(dbo.Complaints.ComDate) AS Year,
dbo.Complaints.ComplaintSubcategory,
COUNT(Distinct(dbo.Complaints.ComId)) AS CustomerComplaints,
SUM(DISTINCT CASE WHEN (dbo.Complaints.RootCauseSource = 'External' ) THEN 1 ELSE 0 END) as ExternalRootCause,
SUM(dbo.ComplaintProducts.Cwts) AS TotalCwts,
Coalesce(SUM(dbo.CreditDeductions.CreditAmount),0) AS TotalCredits
FROM dbo.Complaints
JOIN dbo.CustomerComplaints
ON dbo.Complaints.ComId = dbo.CustomerComplaints.ComId
LEFT OUTER JOIN dbo.CreditDeductions
ON dbo.Complaints.ComId = dbo.CreditDeductions.ComId
LEFT OUTER JOIN dbo.ComplaintProducts
ON dbo.Complaints.ComId = dbo.ComplaintProducts.ComId
WHERE
dbo.Complaints.Location = Coalesce(#Location,Location)
GROUP BY
YEAR(dbo.Complaints.ComDate),
dbo.Complaints.Location,
dbo.Complaints.ComplaintSubcategory
ORDER BY
[YEAR] desc,
dbo.Complaints.Location,
dbo.Complaints.ComplaintSubcategory
Data Results
Location | Year | Subcategory | Complaints | External RC | Total Cwts | Total Credits
---------------------------------------------------------------------------------------
Boston | 2016 | Documentation | 1 | 0 | 8 | 8.00
Data Should Read
Location | Year | Subcategory | Complaints | External RC | Total Cwts | Total Credits
---------------------------------------------------------------------------------------
Boston | 2016 | Documentation | 1 | 0 | 4 | 2.00
Above data reflects 1 complaint having 4 Product Records with 1cwt each and 2 credit records with 1.00 each.
What do I need to change in my query or should I approach this query a different way?
The problem is that the 1 complaint has 2 Deductions and 4 products. When you join in this manner then it will return every combination of Deduction/Product for the complaint which gives 8 rows as you're seeing.
One solution, which should work here, is to not query the Deduction and Product tables directly; instead, join to a subquery that returns one row per complaint for each table. In other words, replace:
LEFT OUTER JOIN dbo.CreditDeductions ON dbo.Complaints.ComId = dbo.CreditDeductions.ComId
LEFT OUTER JOIN dbo.ComplaintProducts ON dbo.Complaints.ComId = dbo.ComplaintProducts.ComId
...with this - showing the Deductions table only, you can work out the Products:
LEFT OUTER JOIN (
select ComId, count(*) CountDeductions, sum(CreditAmount) CreditAmount
from dbo.CreditDeductions
group by ComId
) d on d.ComId = Complaints.ComId
You'll have to change the references to dbo.CreditDeductions to just d (or whatever you want to call it).
Once you've done them both, you'll have one row from each subquery per complaint, which results in 1 row per complaint containing the counts and totals from the two sub-tables.
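To see both the problem and the fix on the numbers from the question (1 complaint, 4 product records of 1 cwt each, 2 credit records of 1.00 each), here is a sketch using Python's sqlite3 module. The schema is cut down to the columns that matter and the subquery alias p and the column name TotalCwts are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Complaints (ComId INTEGER PRIMARY KEY, Location TEXT);
CREATE TABLE CreditDeductions (ComId INTEGER, CreditAmount REAL);
CREATE TABLE ComplaintProducts (ComId INTEGER, Cwts INTEGER);
INSERT INTO Complaints VALUES (1, 'Boston');
INSERT INTO CreditDeductions VALUES (1, 1.00), (1, 1.00);
INSERT INTO ComplaintProducts VALUES (1, 1), (1, 1), (1, 1), (1, 1);
""")

# Direct joins: 2 deductions x 4 products = 8 rows for the one complaint.
naive = conn.execute("""
SELECT SUM(p.Cwts), SUM(d.CreditAmount)
FROM Complaints AS c
LEFT JOIN CreditDeductions AS d ON d.ComId = c.ComId
LEFT JOIN ComplaintProducts AS p ON p.ComId = c.ComId
GROUP BY c.Location
""").fetchone()

# Pre-aggregated joins: each subquery returns one row per complaint.
fixed = conn.execute("""
SELECT c.Location,
       COALESCE(SUM(p.TotalCwts), 0),
       COALESCE(SUM(d.CreditAmount), 0)
FROM Complaints AS c
LEFT JOIN (
    SELECT ComId, SUM(CreditAmount) AS CreditAmount
    FROM CreditDeductions
    GROUP BY ComId
) AS d ON d.ComId = c.ComId
LEFT JOIN (
    SELECT ComId, SUM(Cwts) AS TotalCwts
    FROM ComplaintProducts
    GROUP BY ComId
) AS p ON p.ComId = c.ComId
GROUP BY c.Location
""").fetchone()

print(naive, fixed)
```

The direct joins reproduce the wrong 8 and 8.00 from the question, and the pre-aggregated version returns the expected 4 and 2.00.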

PostgreSQL Using COUNT to form statistical results

I have a few tables that make up a media catalog of live/studio music, where each media item has zero-many show dates, CDs and Vinyl associated to it. The query I have at the moment pulls out statistics that result in a tabular set of data for all the media items available. I'm having trouble now extending the query to include finer grained statistics on each associated table.
Schema:
media(id , title)
cd(media_fk, type)
vinyl(media_fk)
gig(id, date)
media_gigs(gig_fk, media_fk)
Query I have thus far:
SELECT m.id, m.title, COUNT(DISTINCT c.id) as cds, COUNT(DISTINCT v.id) as vinyl, gig.id as gid, gig.date as gdate
FROM media m
LEFT JOIN cd c on m.id = c.media
LEFT JOIN vinyl v on m.id = v.media
LEFT JOIN media_gigs g on m.id = g.media
LEFT JOIN gig gig on g.gig = gig.id
GROUP BY m.id, gig.id;
Which produces:
id | title | cds | vinyl | gid | gdate
---+---------+-----+-------+--------------------------+------------
1 | title 1 | 5 | 1 | may-11-1989-kawasaki | 1989-05-11
1 | title 1 | 5 | 1 | may-13-1989-tokyo | 1989-05-13
2 | title 2 | 6 | 0 | apr-29-1998-nagoya | 1998-04-29
2 | title 2 | 6 | 0 | may-6-1998-tokyo | 1998-05-06
2 | title 2 | 6 | 0 | may-7-1998-tokyo | 1998-05-07
3 | title 3 | 6 | 2 | dec-1-1986-new-york-city | 1986-12-01
3 | title 3 | 6 | 2 | dec-5-1986-quebec-city | 1986-12-05
3 | title 3 | 6 | 2 | nov-19-1986-tokyo | 1986-11-19
3 | title 3 | 6 | 2 | nov-20-1986-tokyo | 1986-11-20
cd.type is an enum type of [silver,cdr,pro-cdr] that I want to add to the results. So the end goal is to have 3 additional columns that are a count of each type of cd associated with each media item. I've not found the correct syntax using COUNT or otherwise to aggregate the cd based on its type, so I'm looking for a push in the right direction. I'm fairly new to SQL so what I have so far may be a bit naive.
Using PG 9.3.
You can use the CASE function to determine the cd type and do a SUM based on the result, as below:
SELECT
m.id,
m.title,
COUNT(DISTINCT c.id) as cds,
COUNT(DISTINCT v.id) as vinyl,
gig.id as gid, gig.date as gdate,
SUM(case c.type
when 'silver' then 1
else 0
end) silver,
SUM(case c.type
when 'cdr' then 1
else 0
end) cdr,
SUM(case c.type
when 'pro-cdr' then 1
else 0
end) pro_cdr
FROM media m
LEFT JOIN cd c on m.id = c.media
LEFT JOIN vinyl v on m.id = v.media
LEFT JOIN media_gigs g on m.id = g.media
LEFT JOIN gig gig on g.gig = gig.id
GROUP BY m.id, gig.id;
References:
Conditional Expressions on PostgreSQL 9.3 Manual
Enumerated Types on PostgreSQL 9.3 Manual
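A minimal sketch of the SUM(CASE ...) counting, run through Python's sqlite3 module with made-up rows. It joins only media and cd on purpose: unlike COUNT(DISTINCT ...), a plain SUM is multiplied by any additional one-to-many join (for example a media item with two vinyl rows would double every per-type count), so it may be safer to compute these counts per media item before joining the other tables. The cd.id and cd.media columns follow the query rather than the schema listing and are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE media (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE cd (id INTEGER PRIMARY KEY, media INTEGER, type TEXT);
INSERT INTO media VALUES (1, 'title 1'), (2, 'title 2');
INSERT INTO cd VALUES
    (1, 1, 'silver'), (2, 1, 'silver'),
    (3, 1, 'cdr'), (4, 1, 'pro-cdr'), (5, 1, 'cdr');
""")

rows = conn.execute("""
SELECT m.id,
       COUNT(c.id) AS cds,
       -- Each CASE yields 1 for a matching row and 0 otherwise.
       SUM(CASE c.type WHEN 'silver'  THEN 1 ELSE 0 END) AS silver,
       SUM(CASE c.type WHEN 'cdr'     THEN 1 ELSE 0 END) AS cdr,
       SUM(CASE c.type WHEN 'pro-cdr' THEN 1 ELSE 0 END) AS pro_cdr
FROM media AS m
LEFT JOIN cd AS c ON c.media = m.id
GROUP BY m.id
ORDER BY m.id
""").fetchall()

print(rows)
```

Media item 2 has no cds, so the LEFT JOIN produces a single row with a NULL type, which falls through to the ELSE branch and gives zeros rather than NULLs.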
As the other poster has mentioned, you can do this with a SUM(CASE WHEN <cond1> THEN 1 ELSE 0 END) construction on the c.type column.
There are some other problems with your SQL I would like to mention:
Incorrect use of LEFT JOIN
You group on a value that might be NULL: gig.id. This is probably because of incorrect use of the LEFT JOIN. Only use left join if you want to keep rows in the result set that have no match in the joining table.
So on the CD table a left join is correct, because you also want to be able to show that there are 0 cd's. On the media_gigs and the gigs table you probably want an INNER JOIN, because there always has to be a match.
Edit: It's possible that I mistakenly thought this was incorrect. I assumed from the sample data that you don't want to display media for which there is no gig.
Non-grouping, non-aggregate columns
In your query you select columns that you don't group on, which are not aggregate functions (like SUM, COUNT). While some Db dialects may accept this, it is bad practice. For instance, take the following query:
SELECT x, y, SUM(z) FROM t
GROUP BY x;
If y is not functionally dependent on x, that is, if there can be different values of y for one value of x, it is not clear which of these values should be displayed. Therefore you should always write it like this:
SELECT x, y, SUM(z) FROM t
GROUP BY x, y;

A query to return a mix of SUM and COUNT in 5 joined tables

I have a table named Ads, containing one row for each ad.
| ID | AdTitle | AdDescription | ... |
I also have 3 tables named Applications, Referrals and Subscribers, with rows for each application, referral and subscriber associated with an Ad.
| ID | AdID | ApplicantName | ApplicantEmail | ... |
| ID | AdID | ReferrerEmail | ReferralEmail | ... |
| ID | AdID | SubscriberEmail | ... |
Finally I have a table named Views with one row for each ad, containing the total number of views for that ad.
| ID | AdID | Views |
I'm trying to write a query with a summary for each ad in 6 columns: Ad ID, Ad title, number of applications/referrals/subscribers and total views.
A simple query of all tables that I have been working with:
SELECT *
FROM Ads
LEFT JOIN Applications ON Ads.ID = Applications.AdID
LEFT JOIN Referrals ON Ads.ID = Referrals.AdID
LEFT JOIN Subscribers ON Ads.ID = Subscribers.AdID
LEFT JOIN Views ON Ads.ID = Views.AdID
I have tried a lot of combinations of LEFT and INNER joins, GROUP BY, COUNT(...), COUNT(DISTINCT ...) and SUM(CASE ...) but nothing so far has worked. I end up counting NULL values from previous columns, counting entries twice or not at all, counting the number of rows in the Views table instead of adding them together, and so on.
Is it better to split this up into multiple queries, or is there a good way to achieve what I want with a single one?
Try this
SELECT *,
(select count(*) from Applications as A1 where A.ID = A1.AdID ) as Applicants,
(select count(*) from Referrals as R where A.ID = R.AdID ) as Referrals,
(select count(*) from Subscribers as S where A.ID = S.AdID ) as Subscribers,
(select sum(V.Views) from Views as V where A.ID = V.AdID ) as Views
FROM Ads as A
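A runnable sketch of the correlated-subquery approach, using Python's sqlite3 module and invented rows. Because each subquery looks at one child table on its own, the child tables never multiply each other. The view total uses SUM over the Views column (the table stores a total per ad, so counting rows would not give the number of views), wrapped in COALESCE so ads without a Views row show 0. The alias TotalViews is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ads (ID INTEGER PRIMARY KEY, AdTitle TEXT);
CREATE TABLE Applications (ID INTEGER PRIMARY KEY, AdID INTEGER);
CREATE TABLE Referrals (ID INTEGER PRIMARY KEY, AdID INTEGER);
CREATE TABLE Subscribers (ID INTEGER PRIMARY KEY, AdID INTEGER);
CREATE TABLE Views (ID INTEGER PRIMARY KEY, AdID INTEGER, Views INTEGER);
INSERT INTO Ads VALUES (1, 'First ad'), (2, 'Second ad');
INSERT INTO Applications VALUES (1, 1), (2, 1);
INSERT INTO Referrals VALUES (1, 1), (2, 1), (3, 1);
INSERT INTO Subscribers VALUES (1, 2);
INSERT INTO Views VALUES (1, 1, 40);
""")

rows = conn.execute("""
SELECT A.ID, A.AdTitle,
       (SELECT COUNT(*) FROM Applications AS A1 WHERE A1.AdID = A.ID) AS Applicants,
       (SELECT COUNT(*) FROM Referrals AS R WHERE R.AdID = A.ID) AS Referrals,
       (SELECT COUNT(*) FROM Subscribers AS S WHERE S.AdID = A.ID) AS Subscribers,
       -- Views holds a total per ad, so add the values up instead of counting rows.
       (SELECT COALESCE(SUM(V.Views), 0) FROM Views AS V WHERE V.AdID = A.ID) AS TotalViews
FROM Ads AS A
ORDER BY A.ID
""").fetchall()

for row in rows:
    print(row)
```

Each ad comes back as exactly one row, including the second ad, which has no applications, referrals or views.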