SQL Join or SUM is returning too many values when working with Redshift database

I'm working with a Redshift database and I can't understand why my join or SUM is bringing too many values. My query is below:
SELECT
date(u.created_at) AS date,
count(distinct c.user_id) AS active_users,
sum(distinct insights.spend) AS fbcosts,
count(c.transaction_amount) AS share_shake_costs,
round(((sum(distinct insights.spend) + count(c.transaction_amount)) /
count(distinct c.user_id)),2) AS cac
FROM
dbname.users AS u
LEFT JOIN
dbname.card_transaction AS c ON c.user_id = u.id
LEFT JOIN
facebookads.insights ON date(insights.date_start) = date(u.created_at)
LEFT JOIN
dbname.card_transaction AS c2 ON date(c2.timestamp) = date(u.created_at)
WHERE
c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%'
GROUP BY
date
ORDER BY
1 DESC;
This query returns the following data:
If we look at 2017-02-08, we can see a total of 1298 for "share_shake_costs". However, if I run the same query just on the card_transaction table I get the following results which are correct.
The query for this second table looks like this:
SELECT
date(timestamp),
sum(transaction_amount)
FROM
dbname.card_transaction AS c2
WHERE
c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%'
GROUP BY
1
ORDER BY
1 DESC;
I have a feeling that I have a similar issue for my "fbcosts" column. I think it has to do with my join since the SUM should be working fine.
I'm new to Redshift and SQL so perhaps there's a better way of doing this entire query. Is there anything obvious that I'm missing?

It seems one of your joins is 1:n: a row on one side matches n rows on the other, so after the join that row (and every value in it) appears n times and is aggregated n times.
Let us say one of your tables, orders, contains user_id and the total bill amount (amount), and the other table, order_details, contains the individual items ordered by that user_id.
If you do a left join, each orders row is repeated once for every matching row in order_details, where
n = number of rows in order_details with the same user_id
so any aggregate (sum, count, etc.) over the joined rows counts the orders values n times.
+------------------+          +----------------------+
|      orders      |          |    order_details     |
+------------------+          +----------------------+
| amount   user_id |          | user_id    items     |
+------------------+          +----------------------+
| 1000     123  ---+----+---> | 123        apple     |
+------------------+    |     +----------------------+
                        +---> | 123        guava     |
                        |     +----------------------+
                        +---> | 123        mango     |
                              +----------------------+
select sum(amount) from orders o left join order_details od
on o.user_id = od.user_id; -- result: 3000
select count(amount) from orders o left join order_details od
on o.user_id = od.user_id; -- result: 3
I hope the reason for large count is clear to you now.
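To see the fan-out for yourself, here is a minimal sketch that loads the orders / order_details example into an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for Redshift here only so the snippet is self-contained; the join behaviour shown is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (amount INTEGER, user_id INTEGER);
    CREATE TABLE order_details (user_id INTEGER, items TEXT);
    INSERT INTO orders VALUES (1000, 123);
    INSERT INTO order_details VALUES (123, 'apple'), (123, 'guava'), (123, 'mango');
""")

join_sql = """
    FROM orders o
    LEFT JOIN order_details od ON o.user_id = od.user_id
"""

# The single orders row is repeated once per matching order_details row.
joined_rows = conn.execute("SELECT o.amount, od.items" + join_sql).fetchall()

# Aggregating over the joined rows therefore counts the amount three times.
inflated_sum = conn.execute("SELECT SUM(o.amount)" + join_sql).fetchone()[0]
inflated_count = conn.execute("SELECT COUNT(o.amount)" + join_sql).fetchone()[0]

# Aggregating orders on its own gives the intended total.
true_sum = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(len(joined_rows), inflated_sum, inflated_count, true_sum)
```

The one orders row comes back three times from the join, which is why the sum and count are three times too large.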
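PS: Also, always enclose OR conditions in parentheses. AND binds more tightly than OR, so adding another AND condition later would otherwise change the logic.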
WHERE
(c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%')
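One way to apply this to the query in the question is to aggregate each source down to one row per date first and only then join the per-date results, so nothing can fan out. The following is only a sketch under assumptions: it runs on SQLite rather than Redshift, the columns ts and description are shortened stand-ins for timestamp and vendor_transaction_description, the sample rows are made up, and LIKE replaces ilike (SQLite's LIKE is case-insensitive for ASCII text).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE insights (date_start TEXT, spend REAL);
    CREATE TABLE card_transaction (
        user_id INTEGER, ts TEXT, transaction_amount INTEGER, description TEXT);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO users VALUES
        (1, '2017-02-08 09:00:00'), (2, '2017-02-08 17:30:00'),
        (3, '2017-02-07 12:00:00');
    INSERT INTO insights VALUES ('2017-02-08', 10.0), ('2017-02-08', 5.0);
    INSERT INTO card_transaction VALUES
        (1, '2017-02-08 10:00:00', 3, 'Share bonus'),
        (2, '2017-02-08 18:00:00', 2, 'Shake to win'),
        (1, '2017-02-08 19:00:00', 100, 'Coffee'),
        (3, '2017-02-07 13:00:00', 4, 'share');
""")

# Each derived table has exactly one row per day, so the joins are 1:1.
rows = conn.execute("""
    SELECT u.day, u.new_users, f.fbcosts, s.share_shake_costs
    FROM (SELECT date(created_at) AS day, COUNT(*) AS new_users
          FROM users GROUP BY 1) AS u
    LEFT JOIN (SELECT date(date_start) AS day, SUM(spend) AS fbcosts
               FROM insights GROUP BY 1) AS f ON f.day = u.day
    LEFT JOIN (SELECT date(ts) AS day, SUM(transaction_amount) AS share_shake_costs
               FROM card_transaction
               WHERE (description LIKE '%share%'
                      OR description LIKE '%shake to win%')
               GROUP BY 1) AS s ON s.day = u.day
    ORDER BY u.day DESC
""").fetchall()

for row in rows:
    print(row)
```

Because every derived table is already grouped by day, there is no need for sum(distinct ...), which silently drops legitimately repeated values.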

Related

Oracle SQL query partially including the desired results

My requirement is to display country name, total number of invoices and their average amount. Moreover, I need to return only those countries where the average invoice amount is greater than the average invoice amount of all invoices.
Query for Oracle Database
SELECT cntry.NAME,
COUNT(inv.NUMBER),
AVG(inv.TOTAL_PRICE)
FROM COUNTRY cntry JOIN
CITY ct ON ct.COUNTRY_ID = cntry.ID JOIN
CUSTOMER cst ON cst.CITY_ID = ct.ID JOIN
INVOICE inv ON inv.CUSTOMER_ID = cst.ID
GROUP BY cntry.NAME,
inv.NUMBER,
inv.TOTAL_PRICE
HAVING AVG(inv.TOTAL_PRICE) > (SELECT AVG(TOTAL_PRICE)
FROM INVOICE);
Result: Austria 1 9500
Expected: Austria 2 4825
Schema
Country
ID(INT)(PK) | NAME(VARCHAR)
City
ID(INT)(PK) | NAME(VARCHAR) | POSTAL_CODE(VARCHAR) | COUNTRY_ID(INT)(FK)
Customer
ID(INT)(PK) | NAME(VARCHAR) | CITY_ID(INT)(FK) | ADDRS(VARCHAR) | POC(VARCHAR) | EMAIL(VARCHAR) | IS_ACTV(INT)(0/1)
Invoice
ID(INT)(PK) | NUMBER(VARCHAR) | CUSTOMER_ID(INT)(FK) | USER_ACC_ID(INT) | TOTAL_PRICE(INT)

With no sample data, we can't really tell whether this:
Expected: Austria 2 4825
is true or not.
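Anyway: the current GROUP BY also includes inv.NUMBER and inv.TOTAL_PRICE, so (assuming invoice numbers are unique) every invoice forms its own group, COUNT returns 1 and AVG returns that single invoice's price. Would changing the GROUP BY clause to
GROUP BY cntry.NAME
(i.e. removing the additional two columns from it) do any good?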
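Since no sample data was posted, here is a sketch with made-up rows that reproduces both outcomes. It runs on SQLite via Python's sqlite3 rather than Oracle, and the city, customer and invoice values are hypothetical, chosen only so that Austria has two invoices averaging 4825.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, country_id INTEGER);
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, number TEXT,
                          customer_id INTEGER, total_price INTEGER);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO country VALUES (1, 'Austria'), (2, 'Germany');
    INSERT INTO city VALUES (1, 'Vienna', 1), (2, 'Berlin', 2);
    INSERT INTO customer VALUES (1, 'A', 1), (2, 'B', 2);
    INSERT INTO invoice VALUES
        (1, 'in_1', 1, 9500), (2, 'in_2', 1, 150),
        (3, 'in_3', 2, 100), (4, 'in_4', 2, 200);
""")

query = """
    SELECT cntry.name, COUNT(inv.number), AVG(inv.total_price)
    FROM country cntry
    JOIN city ct ON ct.country_id = cntry.id
    JOIN customer cst ON cst.city_id = ct.id
    JOIN invoice inv ON inv.customer_id = cst.id
    GROUP BY {group_by}
    HAVING AVG(inv.total_price) > (SELECT AVG(total_price) FROM invoice)
"""

# Original grouping: one group per invoice, so COUNT is always 1.
per_invoice = conn.execute(
    query.format(group_by="cntry.name, inv.number, inv.total_price")).fetchall()

# Grouping by country only: one group per country.
per_country = conn.execute(query.format(group_by="cntry.name")).fetchall()

print(per_invoice)
print(per_country)
```

With these rows the overall average is 2487.5, so the per-invoice grouping lets only the single 9500 invoice through, while the per-country grouping returns Austria with both invoices.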

SELECT C.COUNTRY_NAME, COUNT(I.INVOICE_NUMBER), AVG(I.TOTAL_PRICE) AS AVERAGE
FROM COUNTRY C JOIN CITY CS ON C.ID = CS.COUNTRY_ID
JOIN CUSTOMER CUS ON CUS.CITY_ID = CS.ID
JOIN INVOICE I ON I.CUSTOMER_ID = CUS.ID
GROUP BY C.COUNTRY_NAME, C.ID
HAVING AVG(I.TOTAL_PRICE) > (SELECT AVG(TOTAL_PRICE) FROM INVOICE);

Would changing the GROUP BY clause to the following help?
GROUP BY cntry.NAME, cntry.ID

Fix your group by columns.
Keep only cntry.name.
It will work.
This is a hackerrank question.

Using COALESCE in Postgres and grouping by the resulting value

I have two tables in a Postgres database:
table a
transaction_id | city | store_name | amount
-------------------------------
123 | London | McDonalds | 6.20
999 | NULL | KFC | 8.40
etc...
table b
transaction_id | location | store_name | amount
-----------------------------------
123 | NULL | McDonalds | 6.20
999 | Sydney | KFC | 7.60
etc...
As you can see, the location might be missing in one table but present in the other. For example, for transaction 123 the location is present in table a but missing in table b. Apart from that, the rest of the data (amount, store_name etc.) is the same, row by row, assuming that we join on the transaction_id.
For a given merchant, I need to retrieve a list of locations and the total amount for that location.
An example of the desired result:
KFC sales Report:
suburb | suburb_total
---------------
London | 2500
Sydney | 3500
What I tried:
select
coalesce(a.city, b.location) as suburb,
sum(a.amount) as suburbTotal
from tablea a
join tableb b on a.transaction_id = b.transaction_id
where a.store_name ilike 'KFC'
group by(suburb);
But I get the error column "a.city" must appear in the GROUP BY clause or be used in an aggregate function
So I tried:
select
coalesce(a.city, b.location) as suburb,
sum(a.amount) as suburbTotal,
max(a.city) as city_max,
max(b.location) as location_max
from tablea a
join tableb b on a.transaction_id = b.transaction_id
where a.store_name ilike 'McDonalds'
group by(suburb);
But, surprisingly, I'm getting the same error, even though I'm now using that column in an aggregate function.
How could I achieve the desired result?
NB there are reasons why we have de-normalised data across two tables, that are currently outside of my control. I have to deal with it.
EDIT: added FROM and JOIN, sorry I forgot to type those...

I can only imagine getting that error with your query if suburb were a column in one of the tables. One way around this is to define the value in the from clause:
select v.suburb,
sum(a.amount) as suburbTotal,
max(a.city) as city_max,
max(b.location) as location_max
from tablea a join
tableb b
on a.transaction_id = b.transaction_id cross join lateral
(values (coalesce(a.city, b.location))) as v(suburb)
where a.store_name ilike 'McDonalds'
group by v.suburb;
This is one of the downsides of allowing column aliases in the group by. Sometimes, you might have conflicts with table columns.

Your queries are missing a from clause, which makes it unclear which logic you are trying to implement.
Based on your sample data and expected results, I think that's a full join on the transaction_id, and then aggregation. Using positional references in the group by clause avoids repeating the expressions. Since store_name and amount exist in both tables, they have to be qualified (here with coalesce, because either side of a full join can be null):
select
coalesce(a.store_name, b.store_name) as store_name,
coalesce(a.city, b.location) as suburb,
sum(coalesce(a.amount, b.amount)) as suburb_total
from tablea a
full join tableb b using(transaction_id)
group by 1, 2
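A third option that sidesteps alias resolution entirely is to repeat the COALESCE expression in the GROUP BY. The sketch below assumes an inner join as in the question and uses made-up KFC rows; it runs on SQLite through Python's sqlite3, so it illustrates the grouping logic rather than Postgres's specific alias rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (transaction_id INTEGER, city TEXT,
                         store_name TEXT, amount REAL);
    CREATE TABLE tableb (transaction_id INTEGER, location TEXT,
                         store_name TEXT, amount REAL);
    -- Hypothetical sample rows: each location is known in only one table.
    INSERT INTO tablea VALUES
        (1, 'London', 'KFC', 10.0), (2, NULL, 'KFC', 20.0),
        (3, NULL, 'KFC', 5.0), (4, 'London', 'McDonalds', 6.2);
    INSERT INTO tableb VALUES
        (1, NULL, 'KFC', 10.0), (2, 'Sydney', 'KFC', 20.0),
        (3, 'London', 'KFC', 5.0), (4, NULL, 'McDonalds', 6.2);
""")

# Grouping by the expression itself leaves no alias to be confused
# with a real column name.
rows = conn.execute("""
    SELECT COALESCE(a.city, b.location) AS suburb,
           SUM(a.amount) AS suburb_total
    FROM tablea a
    JOIN tableb b ON a.transaction_id = b.transaction_id
    WHERE a.store_name LIKE 'KFC'
    GROUP BY COALESCE(a.city, b.location)
    ORDER BY suburb
""").fetchall()

print(rows)
```

Transactions 1 and 3 both resolve to London even though the city comes from a different table in each case.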

Create multiple filtered result sets of a joined table for use in aggregate functions

I have a (heavily simplified) orders table, total being the dollar amount, containing:
| id | client_id | type | total |
|----|-----------|--------|-------|
| 1 | 1 | sale | 100 |
| 2 | 1 | refund | 100 |
| 3 | 1 | refund | 100 |
And clients table containing:
| id | name |
|----|------|
| 1 | test |
I am attempting to create a per-client breakdown of metrics such as the total number of sales and refunds, the sum of sales, the sum of refunds, etc.
To do this, I am querying the clients table and joining the orders table. The orders table contains both sales and refunds, specified by the type column.
My idea was to join the orders twice using subqueries and create aliases for those filtered tables. The aliases would then be used in aggregate functions to find the sum, average etc. I have tried many variations of joining the orders table twice to achieve this but it produces the same incorrect results. This query demonstrates this idea:
SELECT
clients.*,
SUM(sales.total) as total_sales,
SUM(refunds.total) as total_refunds,
AVG(sales.total) as avg_ticket,
COUNT(sales.*) as num_of_sales
FROM clients
LEFT JOIN (SELECT * FROM orders WHERE type = 'sale') as sales
ON sales.client_id = clients.id
LEFT JOIN (SELECT * FROM orders WHERE type = 'refund') as refunds
ON refunds.client_id = clients.id
GROUP BY clients.id
Result:
| id | name | total_sales | total_refunds | avg_ticket | num_of_sales |
|----|------|-------------|---------------|------------|--------------|
| 1 | test | 200 | 200 | 100 | 2 |
Expected result:
| id | name | total_sales | total_refunds | avg_ticket | num_of_sales |
|----|------|-------------|---------------|------------|--------------|
| 1 | test | 100 | 200 | 100 | 1 |
When the second join is included in the query, the rows returned from the first join are returned again with the second join. They are multiplied by the number of rows in the second join. It's clear my understanding of joining and/or subqueries is incomplete.
I understand that I can filter the orders table with each aggregate function. This produces correct results but seems inefficient:
SELECT
clients.*,
SUM(orders.total) FILTER (WHERE type = 'sale') as total_sales,
SUM(orders.total) FILTER (WHERE type = 'refund') as total_refunds,
AVG(orders.total) FILTER (WHERE type = 'sale') as avg_ticket,
COUNT(orders.*) FILTER (WHERE type = 'sale') as num_of_sales
FROM clients
LEFT JOIN orders
on orders.client_id = clients.id
GROUP BY clients.id
What is the appropriate way to create filtered and aliased versions of this joined table?
Also, what exactly is happening in my initial query, where the two subqueries are joined? I would expect them to be treated as separate subsets even though they are operating on the same (orders) table.

You should do the (filtered) aggregation once for all aggregates you want, and then join to the result of that. As your aggregation doesn't need any columns from the clients table, this can be done in a derived table. This is also typically faster than grouping the result of the join.
SELECT clients.*,
o.total_sales,
o.total_refunds,
o.avg_ticket,
o.num_of_sales
FROM clients
LEFT JOIN (
select client_id,
SUM(total) FILTER (WHERE type = 'sale') as total_sales,
SUM(total) FILTER (WHERE type = 'refund') as total_refunds,
AVG(total) FILTER (WHERE type = 'sale') as avg_ticket,
COUNT(*) FILTER (WHERE type = 'sale') as num_of_sales
from orders
group by client_id
) o on o.client_id = clients.id
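To make the difference concrete, the following sketch loads the sample rows from the question into SQLite (via Python's sqlite3) and runs both shapes of query. As an assumption for portability, it uses CASE expressions in place of FILTER, since FILTER on aggregates needs a fairly recent SQLite; the logic is otherwise the same as the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, client_id INTEGER,
                         type TEXT, total INTEGER);
    INSERT INTO clients VALUES (1, 'test');
    INSERT INTO orders VALUES
        (1, 1, 'sale', 100), (2, 1, 'refund', 100), (3, 1, 'refund', 100);
""")

# Joining sales and refunds separately pairs every sale with every refund
# of the same client: 1 sale x 2 refunds = 2 rows before grouping.
double_join = conn.execute("""
    SELECT clients.id, SUM(sales.total), SUM(refunds.total), COUNT(sales.id)
    FROM clients
    LEFT JOIN (SELECT * FROM orders WHERE type = 'sale') AS sales
        ON sales.client_id = clients.id
    LEFT JOIN (SELECT * FROM orders WHERE type = 'refund') AS refunds
        ON refunds.client_id = clients.id
    GROUP BY clients.id
""").fetchone()

# Aggregating orders once per client and then joining keeps one row per client.
pre_aggregated = conn.execute("""
    SELECT clients.id, o.total_sales, o.total_refunds, o.num_of_sales
    FROM clients
    LEFT JOIN (
        SELECT client_id,
               SUM(CASE WHEN type = 'sale' THEN total END) AS total_sales,
               SUM(CASE WHEN type = 'refund' THEN total END) AS total_refunds,
               COUNT(CASE WHEN type = 'sale' THEN 1 END) AS num_of_sales
        FROM orders
        GROUP BY client_id
    ) AS o ON o.client_id = clients.id
""").fetchone()

print(double_join)
print(pre_aggregated)
```

The first query reproduces the inflated result from the question; the second returns the expected totals.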

What is wrong with this LEFT JOIN/SELECT? (The SELECT alone works, and the LEFT JOIN without the SELECT also works)

MS Access
tables:
products:
+----+---------+
| id | p_name |
+----+---------+
warehouses:
+----+--------+
| id | w_name |
+----+--------+
how_many_should_be:
+--------------+------------+----------+
| warehouse_id | product_id | how_many |
+--------------+------------+----------+
intake
+--------------+------------+---------------+
| warehouse_id | product_id | intake_amount |
+--------------+------------+---------------+
the wanted query result:
+--------+--------+--------------------+------------+
| w_name | p_name | how_many_should_be | intake_sum |
+--------+--------+--------------------+------------+
This is the query. It works fine if I leave out the LEFT JOIN, and the SELECT inside the LEFT JOIN works fine if put in a query by itself, so I don't understand what's wrong.
SELECT
warehouses.w_name AS [w_name],
products.p_name AS [p_name],
how_many_should_be.how_many AS [how_many_should_be]
FROM (((how_many_should_be INNER JOIN warehouses ON how_many_should_be.warehouse_id = warehouses.id)
INNER JOIN products ON how_many_should_be.product_id = products.id)
LEFT JOIN
(SELECT intake.warehouse_id, intake.product_id, Sum(intake.units_amout) AS [intake_sum]
FROM intake
GROUP BY intake.warehouse_id, intake.product_id)
ON how_many_should_be.warehouse_id = intake.warehouse_id)
ORDER BY warehouses.w_name, products.p_name
the error: "syntax error on JOIN operation"

As a previous poster mentioned, you need to give the subquery an alias and use that alias in the JOIN condition and in the SELECT list. Two further points: the subquery is grouped by warehouse and product, so the join should match on both columns (joining on warehouse_id alone would repeat each row once per product in that warehouse), and Access needs the nested parentheses around multiple joins and does not accept comments inside the SQL. Here's what it would look like, with Agg_Intake as the subquery alias:
SELECT
warehouses.w_name AS [w_name],
products.p_name AS [p_name],
how_many_should_be.how_many AS [how_many_should_be],
Agg_Intake.[intake_sum]
FROM ((how_many_should_be
INNER JOIN warehouses
ON how_many_should_be.warehouse_id = warehouses.id)
INNER JOIN products
ON how_many_should_be.product_id = products.id)
LEFT JOIN (
SELECT intake.warehouse_id, intake.product_id, Sum(intake.units_amout) AS [intake_sum]
FROM intake
GROUP BY intake.warehouse_id, intake.product_id
) AS Agg_Intake
ON (how_many_should_be.warehouse_id = Agg_Intake.warehouse_id)
AND (how_many_should_be.product_id = Agg_Intake.product_id)
ORDER BY warehouses.w_name, products.p_name
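The effect of the join condition on an aliased, grouped subquery can be checked with a small sketch. It runs on SQLite through Python's sqlite3, not on Access, so it says nothing about Access syntax; the sample rows are made up, and the amount column is called intake_amount as in the table listing in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouses (id INTEGER PRIMARY KEY, w_name TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, p_name TEXT);
    CREATE TABLE how_many_should_be (warehouse_id INTEGER, product_id INTEGER,
                                     how_many INTEGER);
    CREATE TABLE intake (warehouse_id INTEGER, product_id INTEGER,
                         intake_amount INTEGER);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO warehouses VALUES (1, 'W1');
    INSERT INTO products VALUES (1, 'P1'), (2, 'P2');
    INSERT INTO how_many_should_be VALUES (1, 1, 10), (1, 2, 20);
    INSERT INTO intake VALUES (1, 1, 3), (1, 1, 4), (1, 2, 5);
""")

query = """
    SELECT w.w_name, p.p_name, h.how_many, agg.intake_sum
    FROM how_many_should_be AS h
    INNER JOIN warehouses AS w ON h.warehouse_id = w.id
    INNER JOIN products AS p ON h.product_id = p.id
    LEFT JOIN (
        SELECT warehouse_id, product_id, SUM(intake_amount) AS intake_sum
        FROM intake
        GROUP BY warehouse_id, product_id
    ) AS agg ON {condition}
    ORDER BY w.w_name, p.p_name, agg.intake_sum
"""

# Matching on the warehouse only pairs each product with every intake sum
# recorded for that warehouse.
warehouse_only = conn.execute(
    query.format(condition="h.warehouse_id = agg.warehouse_id")).fetchall()

# Matching on both grouping columns gives one row per warehouse and product.
both_keys = conn.execute(query.format(
    condition="h.warehouse_id = agg.warehouse_id "
              "AND h.product_id = agg.product_id")).fetchall()

print(warehouse_only)
print(both_keys)
```

With two products in the warehouse, the warehouse-only condition returns four rows instead of two.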

Summing cost by id that appears on multiple rows

SOLUTION
I solved it by simply doing the following.
SELECT table_size, sum(cost) as total_cost, sum(num_players) as num_players
FROM
(
SELECT table_size, cost, count(tp.uid) as num_players
FROM tournament as t
LEFT JOIN takes_part AS tp ON tp.tid = t.tid
LEFT JOIN users as u on u.uid = tp.uid
JOIN attributes as a on a.aid = t.attrId
GROUP BY t.tid
) as res
GROUP BY table_size
I wasn't sure it would work, what with the other aggregate functions that I had to use in my real sql, but it seems to be working ok. There may be problems in the future if I want to do other kind of calculations, for instance do a COUNT(DISTINCT tp.uid) over all tournaments. Still, in this case that is not all that important so I am satisfied for now. Thank you all for your help.
UPDATE!!!
Here is a Fiddle that explains the problem:
http://www.sqlfiddle.com/#!2/e03ff/7
I want to get:
table_size | cost
-------------------------------
5 | 110
8 | 80
OLD POST
I'm sure that there is an easy solution to this that I'm just not seeing, but I can't seem to find a solution to it anywhere. What I'm trying to do is the following:
I need to sum 'costs' per tournament in a system. For other reasons, I've had to join with lots of other tables, making the same cost appear on multiple rows, like so:
id | name | cost | (hidden_id)
-----------------------------
0 | Abc | 100 | 1
1 | ASD | 100 | 1
2 | Das | 100 | 1
3 | Ads | 50 | 2
4 | Ads | 50 | 2
5 | Fsd | 0 | 3
6 | Ads | 0 | 3
7 | Dsa | 0 | 3
The costs in the table above are linked to an id value that is not necessarily selected by the SQL (this depends on what the user decides at runtime). What I want to get is the sum 100+50+0 = 150. Of course, if I just use SUM(cost) I will get a different answer. I tried using SUM(cost)/COUNT(*)*COUNT(tourney_ids) but this only gives the correct result under certain circumstances. A (very) simple form of the query looks like this:
SELECT SUM(cost) as tot_cost -- This will not work as it sums all rows where the sum appears.
FROM t
JOIN ta ON t.attr_id = ta.toaid
JOIN tr ON tr.toid = t.toid -- This row will cause multiple rows with same cost
GROUP BY *selected by user* -- This row enables the user to group by several attributes, such as weekday, hour or ids of different kinds.
UPDATE. A more correct SQL-query, perhaps:
SELECT
*some way to sum cost*
FROM tournament AS t
JOIN attribute AS ta ON t.attr_id = ta.toaid
JOIN registration AS tr ON tr.tourneyId = t.tourneyId
INNER JOIN pokerstuff as ga ON ta.game_attr_id = ga.gameId
LEFT JOIN people AS p ON p.userId = tr.userId
LEFT JOIN parttaking AS jlt ON (jlt.tourneyId = t.tourneyId AND tr.userId = jlt.userId)
LEFT JOIN (
SELECT t.tourneyId,
ta.a - (ta.b) - sum(c)*ta.cost AS cost
FROM tournament as t
JOIN attribute as ta ON (t.attr_id = ta.toaid)
JOIN registration tr ON (tr.tourneyId = t.tourneyId)
GROUP BY t.tourneyId, ta.b, ta.a
) as o on t.tourneyId = o.tourneyId
AND whereConditions
GROUP BY groupBySql
Description of the tables
tournament (tourneyId, name, attributeId)
attributes (attributeId, ..., gameid)
registration (userId, tourneyId, ...)
pokerstuff(gameid,...)
people(userId,...)
parttaking(userId, tourneyId,...)
Let's assume that we have the following (cost is actually calculated in a subquery, but since it's tied to tournament, I will treat it as an attribute here):
tournament:
tourneyId | name | cost
1 | MyTournament | 50
2 | MyTournament | 80
and
userId | tourneyId
1 | 1
2 | 1
3 | 1
4 | 1
1 | 2
4 | 2
The problem is rather simple. I need to be able to get the sum of the costs of the tournaments without counting a tournament more than once. The sum (and all other aggregates) will be dynamically grouped by the user.
A big problem is that many solutions that I've tried (such as SUM OVER...) would require that I group by certain attributes, and that I cannot do. The group by-clause must be completely decided by the user. The sum of the cost should sum over any group-by attributes, the only problem is of course the multiple rows in which the sum appears.
Does anyone have any good hints on what can be done?

Try the following:
select *selected by user*, sum(case rownum when 1 then a.cost end)
from
(
select
*selected by user*, cost,
row_number() over (partition by t.toid) as rownum
FROM t
JOIN ta ON t.attr_id = ta.toaid
JOIN tr ON tr.toid = t.toid
) a
group by *selected by user*
The row_number numbers the rows that belong to the same tournament. When summing the costs we only consider the rows with a rownum of 1; all other rows for that tournament repeat the same cost.
In terms of the fiddle:
select table_size, sum(case rownum when 1 then a.cost end)
from
(
SELECT
table_size, cost,
row_number() over (partition by t.tid) as rownum
FROM tournament as t
LEFT JOIN takes_part AS tp ON tp.tid = t.tid
LEFT JOIN users as u on u.uid = tp.uid
JOIN attributes as a on a.aid = t.attrId
) a
group by table_size
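Here is a runnable sketch of the row_number approach, using the two tournaments and six registrations from the question. It runs on SQLite through Python's sqlite3 (window functions need SQLite 3.25 or later); the table_size values are made up, and the users and attributes joins are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tournament (tid INTEGER PRIMARY KEY, table_size INTEGER,
                             cost INTEGER);
    CREATE TABLE takes_part (uid INTEGER, tid INTEGER);
    -- Costs and registrations follow the question; table_size is hypothetical.
    INSERT INTO tournament VALUES (1, 5, 50), (2, 8, 80);
    INSERT INTO takes_part VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (4, 2);
""")

# Plain SUM counts each tournament's cost once per registered player.
naive = conn.execute("""
    SELECT t.table_size, SUM(t.cost)
    FROM tournament AS t
    LEFT JOIN takes_part AS tp ON tp.tid = t.tid
    GROUP BY t.table_size
    ORDER BY t.table_size
""").fetchall()

# Number the rows within each tournament and only sum the first one.
deduped = conn.execute("""
    SELECT table_size, SUM(CASE rn WHEN 1 THEN cost END) AS total_cost
    FROM (
        SELECT t.table_size, t.cost,
               ROW_NUMBER() OVER (PARTITION BY t.tid) AS rn
        FROM tournament AS t
        LEFT JOIN takes_part AS tp ON tp.tid = t.tid
    ) AS numbered
    GROUP BY table_size
    ORDER BY table_size
""").fetchall()

print(naive)
print(deduped)
```

Note that this works as long as the grouping columns are constant within a tournament; if the user groups by something that varies per player, the cost lands only in the group that happens to hold row number 1.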

As the repeated costs are the same each time you can average them by their hidden id and do something like this:
WITH MrTable AS (
SELECT DISTINCT hidden_id, AVG(cost) OVER (PARTITION BY hidden_id) AS cost
FROM stuff
)
SELECT SUM(cost) FROM MrTable;
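As a quick check, this sketch loads the eight rows from the table in the question into SQLite (via Python's sqlite3, which needs SQLite 3.25 or later for the window function) and runs the same query shape. The table name stuff follows the query above; the CTE is renamed per_id here purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stuff (id INTEGER, name TEXT, cost INTEGER, hidden_id INTEGER);
    INSERT INTO stuff VALUES
        (0, 'Abc', 100, 1), (1, 'ASD', 100, 1), (2, 'Das', 100, 1),
        (3, 'Ads', 50, 2), (4, 'Ads', 50, 2),
        (5, 'Fsd', 0, 3), (6, 'Ads', 0, 3), (7, 'Dsa', 0, 3);
""")

# Summing every row counts each hidden_id's cost once per duplicate row.
naive_total = conn.execute("SELECT SUM(cost) FROM stuff").fetchone()[0]

# Collapse to one (hidden_id, cost) pair per hidden_id, then sum those.
deduped_total = conn.execute("""
    WITH per_id AS (
        SELECT DISTINCT hidden_id,
               AVG(cost) OVER (PARTITION BY hidden_id) AS cost
        FROM stuff
    )
    SELECT SUM(cost) FROM per_id
""").fetchone()[0]

print(naive_total, deduped_total)
```

The deduplicated total is the 100 + 50 + 0 = 150 that the question asks for.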

(Updated) Given that the cost currently returned is the total cost per tournament, you could include a fractional value of cost on each line of an inner select, such that the total of all those values adds up to the total cost (allowing for the fact that each given tournament's values may be appearing multiple times), then sum that fractional cost in your outer select, like so:
select table_size, sum(frac_cost) as agg_cost from
(SELECT a.table_size , cost / count(*) over (partition by t.tid) as frac_cost
FROM tournament as t
LEFT JOIN takes_part AS tp ON tp.tid = t.tid
LEFT JOIN users as u on u.uid = tp.uid
JOIN attributes as a on a.aid = t.attrId) sq
GROUP BY table_size
SQLFiddle here.
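One caveat with the fractional approach: where cost is an integer column and the database truncates integer division (SQLite and PostgreSQL do; MySQL's / operator does not), the fractions lose their remainders. The sketch below shows both variants on SQLite through Python's sqlite3 (3.25 or later for window functions), with the tournament data from the question and made-up table_size values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tournament (tid INTEGER PRIMARY KEY, table_size INTEGER,
                             cost INTEGER);
    CREATE TABLE takes_part (uid INTEGER, tid INTEGER);
    -- Costs and registrations follow the question; table_size is hypothetical.
    INSERT INTO tournament VALUES (1, 5, 50), (2, 8, 80);
    INSERT INTO takes_part VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (4, 2);
""")

query = """
    SELECT table_size, SUM(frac_cost) AS agg_cost
    FROM (
        SELECT t.table_size,
               {numerator} / COUNT(*) OVER (PARTITION BY t.tid) AS frac_cost
        FROM tournament AS t
        LEFT JOIN takes_part AS tp ON tp.tid = t.tid
    ) AS sq
    GROUP BY table_size
    ORDER BY table_size
"""

# Integer division: 50 / 4 is truncated to 12, so four shares sum to 48.
truncated = conn.execute(query.format(numerator="t.cost")).fetchall()

# Forcing floating-point division keeps the shares exact for this data.
fractional = conn.execute(query.format(numerator="1.0 * t.cost")).fetchall()

print(truncated)
print(fractional)
```

Floating-point shares can still pick up rounding error when a cost does not divide evenly (for example 100 split three ways), so rounding the final sum may be needed.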
