Create multiple filtered result sets of a joined table for use in aggregate functions - sql

I have a (heavily simplified) orders table, total being the dollar amount, containing:
| id | client_id | type | total |
|----|-----------|--------|-------|
| 1 | 1 | sale | 100 |
| 2 | 1 | refund | 100 |
| 3 | 1 | refund | 100 |
And clients table containing:
| id | name |
|----|------|
| 1 | test |
I am attempting to build a per-client breakdown of metrics: the number of sales, the number of refunds, the sum of sales, the sum of refunds, and so on.
To do this, I am querying the clients table and joining the orders table. The orders table contains both sales and refunds, specified by the type column.
My idea was to join the orders twice using subqueries and create aliases for those filtered tables. The aliases would then be used in aggregate functions to find the sum, average etc. I have tried many variations of joining the orders table twice to achieve this but it produces the same incorrect results. This query demonstrates this idea:
SELECT
clients.*,
SUM(sales.total) as total_sales,
SUM(refunds.total) as total_refunds,
AVG(sales.total) as avg_ticket,
COUNT(sales.*) as num_of_sales
FROM clients
LEFT JOIN (SELECT * FROM orders WHERE type = 'sale') as sales
ON sales.client_id = clients.id
LEFT JOIN (SELECT * FROM orders WHERE type = 'refund') as refunds
ON refunds.client_id = clients.id
GROUP BY clients.id
Result:
| id | name | total_sales | total_refunds | avg_ticket | num_of_sales |
|----|------|-------------|---------------|------------|--------------|
| 1 | test | 200 | 200 | 100 | 2 |
Expected result:
| id | name | total_sales | total_refunds | avg_ticket | num_of_sales |
|----|------|-------------|---------------|------------|--------------|
| 1 | test | 100 | 200 | 100 | 1 |
When the second join is included in the query, the rows returned from the first join are returned again with the second join. They are multiplied by the number of rows in the second join. It's clear my understanding of joining and/or subqueries is incomplete.
I understand that I can filter the orders table with each aggregate function. This produces correct results but seems inefficient:
SELECT
clients.*,
SUM(orders.total) FILTER (WHERE type = 'sale') as total_sales,
SUM(orders.total) FILTER (WHERE type = 'refund') as total_refunds,
AVG(orders.total) FILTER (WHERE type = 'sale') as avg_ticket,
COUNT(orders.*) FILTER (WHERE type = 'sale') as num_of_sales
FROM clients
LEFT JOIN orders
on orders.client_id = clients.id
GROUP BY clients.id
What is the appropriate way to create filtered and aliased versions of this joined table?
Also, what exactly is happening in my initial query, where the two subqueries are joined? I would expect them to be treated as separate subsets even though they operate on the same (orders) table.

You should do the (filtered) aggregation once for all aggregates you want, and then join to the result of that. As your aggregation doesn't need any columns from the clients table, this can be done in a derived table. This is also typically faster than grouping the result of the join.
SELECT clients.*,
o.total_sales,
o.total_refunds,
o.avg_ticket,
o.num_of_sales
FROM clients
LEFT JOIN (
select client_id,
SUM(total) FILTER (WHERE type = 'sale') as total_sales,
SUM(total) FILTER (WHERE type = 'refund') as total_refunds,
AVG(total) FILTER (WHERE type = 'sale') as avg_ticket,
COUNT(*) FILTER (WHERE type = 'sale') as num_of_sales
from orders
group by client_id
) o on o.client_id = clients.id
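As for what happens in the original double-join query: each sale row for a client is paired with every refund row for the same client, so the aggregates run over the combinations rather than over two separate subsets. The sketch below reproduces this with Python's built-in sqlite3 module and the sample data from the question. SQLite stands in for the real database here, and CASE is used instead of FILTER only so the sketch also runs on older SQLite builds; both are assumptions of this example, not part of the answer above.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, client_id INTEGER, type TEXT, total INTEGER);
    INSERT INTO clients VALUES (1, 'test');
    INSERT INTO orders VALUES (1, 1, 'sale', 100), (2, 1, 'refund', 100), (3, 1, 'refund', 100);
""")

# Joining both filtered subsets: the single sale row is paired with each refund row.
joined_rows = conn.execute("""
    SELECT sales.id, refunds.id
    FROM clients
    LEFT JOIN (SELECT * FROM orders WHERE type = 'sale') AS sales
           ON sales.client_id = clients.id
    LEFT JOIN (SELECT * FROM orders WHERE type = 'refund') AS refunds
           ON refunds.client_id = clients.id
    ORDER BY sales.id, refunds.id
""").fetchall()
print(joined_rows)  # [(1, 2), (1, 3)] -> sale 1 appears twice, so SUM(sales.total) = 200

# Aggregating once per client and joining to that result avoids the duplication.
summary = conn.execute("""
    SELECT clients.id, clients.name,
           o.total_sales, o.total_refunds, o.avg_ticket, o.num_of_sales
    FROM clients
    LEFT JOIN (
        SELECT client_id,
               SUM(CASE WHEN type = 'sale' THEN total END)   AS total_sales,
               SUM(CASE WHEN type = 'refund' THEN total END) AS total_refunds,
               AVG(CASE WHEN type = 'sale' THEN total END)   AS avg_ticket,
               COUNT(CASE WHEN type = 'sale' THEN 1 END)     AS num_of_sales
        FROM orders
        GROUP BY client_id
    ) AS o ON o.client_id = clients.id
""").fetchall()
print(summary)  # [(1, 'test', 100, 200, 100.0, 1)]
```

The second result matches the expected output in the question: one sale of 100, refunds totalling 200.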

Related

Multi-Table Invoice SUM Comparison

Say I have 3 tables in a rails app:
invoices
id | customer_id | employee_id | notes
---------------------------------------------------------------
1 | 1 | 5 | An order with 2 items.
2 | 12 | 5 | An order with 1 item.
3 | 17 | 12 | An empty order.
4 | 17 | 12 | A brand new order.
invoice_items
id | invoice_id | price | name
---------------------------------------------------------
1 | 1 | 5.35 | widget
2 | 1 | 7.25 | thingy
3 | 2 | 1.25 | smaller thingy
4 | 2 | 1.25 | another smaller thingy
invoice_payments
id | invoice_id | amount | method | notes
---------------------------------------------------------
1 | 1 | 4.85 | credit card | Not enough
2 | 1 | 1.25 | credit card | Still not enough
3 | 2 | 1.25 | check | Paid in full
This represents 4 orders:
The first has 2 items, for a total of 12.60. It has two payments, for a total paid amount of 6.10. This order is partially paid.
The second has only one item, and one payment, both totaling 1.25. This order is paid in full.
The third order has no items or payments. This case matters to us because we sometimes use it. It is considered paid in full as well.
The final order has one item again, for a total of 1.25, but no payments as of yet.
Now I need a query:
Show me all orders that are not paid in full yet; that is, all orders such that the total of the items is greater than the total of the payments.
I can do it in pure sql:
SELECT invoices.*,
invoice_payment_amounts.amount_paid AS amount_paid,
invoice_item_amounts.total_amount AS total_amount
FROM invoices
LEFT JOIN (
SELECT invoices.id AS invoice_id,
COALESCE(SUM(invoice_payments.amount), 0) AS amount_paid
FROM invoices
LEFT JOIN invoice_payments
ON invoices.id = invoice_payments.invoice_id
GROUP BY invoices.id
) AS invoice_payment_amounts
ON invoices.id = invoice_payment_amounts.invoice_id
LEFT JOIN (
SELECT invoices.id AS invoice_id,
COALESCE(SUM(invoice_items.price), 0) AS total_amount
FROM invoices
LEFT JOIN invoice_items
ON invoices.id = invoice_items.invoice_id
GROUP BY invoices.id
) AS invoice_item_amounts
ON invoices.id = invoice_item_amounts.invoice_id
WHERE amount_paid < total_amount
But... now I need to get that into Rails (probably as a scope). I can use find_by_sql, but that returns an array rather than an ActiveRecord::Relation, which is not what I need, since I want to chain it with other scopes (there is, for example, an overdue scope, which uses this).
So raw SQL probably isn't the right way to go here... but what is? I've not been able to do this in ActiveRecord's query language.
The closest I've gotten so far was this:
Invoice.select('invoices.*, SUM(invoice_items.price) AS total, SUM(invoice_payments.amount) AS amount_paid').
joins(:invoice_payments, :invoice_items).
group('invoices.id').
where('amount_paid < total')
But that fails, since on orders like #1, with multiple payments, it incorrectly doubles the price of the order (due to multiple joins), showing it as still unpaid. I had the same problem in SQL, which is why I structured it in the way I did.
Any thoughts here?
You can get your results using MySQL's GROUP BY and HAVING clauses:
Pure MySQL Query:
SELECT `invoices`.* FROM `invoices`
INNER JOIN `invoice_items` ON
`invoice_items`.`invoice_id` = `invoices`.`id`
INNER JOIN `invoice_payments` ON
`invoice_payments`.`invoice_id` = `invoices`.`id`
GROUP BY invoices.id
HAVING sum(invoice_items.price) > sum(invoice_payments.amount)
ActiveRecord Query:
Invoice.joins(:invoice_items, :invoice_payments).group("invoices.id").having("sum(invoice_items.price) > sum(invoice_payments.amount)")
When building more complex queries in Rails, Arel ("Arel Really Exasperates Logicians") usually comes in handy.
Arel is a SQL AST manager for Ruby. It
simplifies the generation of complex SQL queries, and
adapts to various RDBMSes.
Here is a sample of how an Arel implementation could look, based on the requirements:
invoice_table = Invoice.arel_table
# Define invoice_payment_amounts
payment_arel_table = InvoicePayment.arel_table
invoice_payment_amounts = Arel::Table.new(:invoice_payment_amounts)
payment_cte = Arel::Nodes::As.new(
invoice_payment_amounts,
payment_arel_table
.project(payment_arel_table[:invoice_id],
payment_arel_table[:amount].sum.as("amount_paid"))
.group(payment_arel_table[:invoice_id])
)
# Define invoice_item_amounts
item_arel_table = InvoiceItem.arel_table
invoice_item_amounts = Arel::Table.new(:invoice_item_amounts)
item_cte = Arel::Nodes::As.new(
invoice_item_amounts,
item_arel_table
.project(item_arel_table[:invoice_id],
item_arel_table[:price].sum.as("total"))
.group(item_arel_table[:invoice_id])
)
# Define main query
query = invoice_table
.project(
invoice_table[Arel.sql('*')],
invoice_payment_amounts[:amount_paid],
invoice_item_amounts[:total]
)
.join(invoice_payment_amounts).on(
invoice_table[:id].eq(invoice_payment_amounts[:invoice_id])
)
.join(invoice_item_amounts).on(
invoice_table[:id].eq(invoice_item_amounts[:invoice_id])
)
.where(invoice_item_amounts[:total].gt(invoice_payment_amounts[:amount_paid]))
.with(payment_cte, item_cte)
res = Invoice.find_by_sql(query.to_sql)
for r in res do
puts "---- Invoice #{r.id} -----"
p r
puts "total: #{r[:total]}"
puts "amount_paid: #{r[:amount_paid]}"
puts "----"
end
This returns the same output as your SQL query for the sample data you provided in the question.
Output:
<Invoice id: 2, notes: "An order with 1 items.", created_at: "2017-12-18 21:15:47", updated_at: "2017-12-18 21:15:47">
total: 2.5
amount_paid: 1.25
----
---- Invoice 1 -----
<Invoice id: 1, notes: "An order with 2 items.", created_at: "2017-12-18 21:15:47", updated_at: "2017-12-18 21:15:47">
total: 12.6
amount_paid: 6.1
----
Arel is quite flexible so you can use this as a base and refine the query conditions based on more specific requirements you might have.
I would strongly recommend that you consider adding cache columns (total, amount_paid) to the invoices table and maintaining them, so you can avoid this complex query. At least the total column would be quite simple to create and populate.
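To check the "aggregate each child table first, then join" idea against the sample data, here is a small sketch using Python's built-in sqlite3 module. The tables are trimmed to the columns the comparison needs and the rows are copied as they appear in the question's tables (items 3 and 4 both belong to invoice 2 there). SQLite is only a stand-in for the real database, and the aliases i and p are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY);
    CREATE TABLE invoice_items (id INTEGER PRIMARY KEY, invoice_id INTEGER, price REAL);
    CREATE TABLE invoice_payments (id INTEGER PRIMARY KEY, invoice_id INTEGER, amount REAL);
    INSERT INTO invoices VALUES (1), (2), (3), (4);
    INSERT INTO invoice_items VALUES (1, 1, 5.35), (2, 1, 7.25), (3, 2, 1.25), (4, 2, 1.25);
    INSERT INTO invoice_payments VALUES (1, 1, 4.85), (2, 1, 1.25), (3, 2, 1.25);
""")

# Each derived table yields at most one row per invoice, so the two joins
# cannot multiply each other's rows. COALESCE covers invoices that have
# no items or no payments at all.
unpaid_ids = [row[0] for row in conn.execute("""
    SELECT invoices.id
    FROM invoices
    LEFT JOIN (SELECT invoice_id, SUM(price) AS total
               FROM invoice_items GROUP BY invoice_id) AS i
           ON i.invoice_id = invoices.id
    LEFT JOIN (SELECT invoice_id, SUM(amount) AS amount_paid
               FROM invoice_payments GROUP BY invoice_id) AS p
           ON p.invoice_id = invoices.id
    WHERE COALESCE(p.amount_paid, 0) < COALESCE(i.total, 0)
    ORDER BY invoices.id
""")]
print(unpaid_ids)  # [1, 2]
```

In Rails, the two LEFT JOIN fragments could probably be passed as SQL strings to joins, with the comparison in where, which would keep the result an ActiveRecord::Relation that chains with other scopes; treat that as an untested suggestion.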

SQL Join or SUM is returning too many values when working with Redshift database

I'm working with a Redshift database and I can't understand why my join or SUM is bringing too many values. My query is below:
SELECT
date(u.created_at) AS date,
count(distinct c.user_id) AS active_users,
sum(distinct insights.spend) AS fbcosts,
count(c.transaction_amount) AS share_shake_costs,
round(((sum(distinct insights.spend) + count(c.transaction_amount)) /
count(distinct c.user_id)),2) AS cac
FROM
dbname.users AS u
LEFT JOIN
dbname.card_transaction AS c ON c.user_id = u.id
LEFT JOIN
facebookads.insights ON date(insights.date_start) = date(u.created_at)
LEFT JOIN
dbname.card_transaction AS c2 ON date(c2.timestamp) = date(u.created_at)
WHERE
c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%'
GROUP BY
date
ORDER BY
1 DESC;
This query returns the following data:
If we look at 2017-02-08, we can see a total of 1298 for "share_shake_costs". However, if I run the same query just on the card_transaction table I get the following results which are correct.
The query for this second table looks like this:
SELECT
date(timestamp),
sum(transaction_amount)
FROM
dbname.card_transaction AS c2
WHERE
c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%'
GROUP BY
1
ORDER BY
1 DESC;
I have a feeling that I have a similar issue for my "fbcosts" column. I think it has to do with my join since the SUM should be working fine.
I'm new to Redshift and SQL so perhaps there's a better way of doing this entire query. Is there anything obvious that I'm missing?
It seems one of your tables has a 1:n relationship with another, so when you join on the common column, each value from the "one" side is counted n times.
Let us say one of your tables, orders, contains user_id and the total bill_amount, and the other table, order_details, contains the details of the sub-items ordered by that user_id.
If you do a left join, each row of orders is repeated once per matching row in order_details, so orders.user_id appears n times, where
n = number of rows in order_details with that user_id
and the aggregation (sum, count, etc.) then runs over all n copies.
+------------------+ +----------------------+
| orders | | order_details |
+------------------+ +----------------------+
|amount user_id | | user_id items |
+------------------+ +----------------------+
| 1000 123 ---------> | 123 apple |
+ +----------------------+
+-------------> | 123 guava |
| +----------------------+
v-------------> | 123 mango |
+----------------------+
select sum(amount) from orders o left join order_details od
on o.user_id = od.user_id; -- result: 3000
select count(amount) from orders o left join order_details od
on o.user_id = od.user_id; -- result: 3
I hope the reason for large count is clear to you now.
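The orders / order_details example above can be reproduced with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for Redshift, and the item_count column in the second query is a name made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (amount INTEGER, user_id INTEGER);
    CREATE TABLE order_details (user_id INTEGER, items TEXT);
    INSERT INTO orders VALUES (1000, 123);
    INSERT INTO order_details VALUES (123, 'apple'), (123, 'guava'), (123, 'mango');
""")

# The single orders row is repeated once per matching order_details row.
fanned_out = conn.execute("""
    SELECT SUM(o.amount), COUNT(o.amount)
    FROM orders o
    LEFT JOIN order_details od ON o.user_id = od.user_id
""").fetchone()
print(fanned_out)  # (3000, 3)

# Collapsing order_details to one row per user before joining keeps the totals intact.
collapsed = conn.execute("""
    SELECT SUM(o.amount), COUNT(o.amount)
    FROM orders o
    LEFT JOIN (SELECT user_id, COUNT(*) AS item_count
               FROM order_details
               GROUP BY user_id) od ON o.user_id = od.user_id
""").fetchone()
print(collapsed)  # (1000, 1)
```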
PS: Also, always prefer to enclose OR conditions in ().
WHERE
(c2.vendor_transaction_description ilike '%share%'
OR c2.vendor_transaction_description ilike '%shake to win%')
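The parentheses start to matter as soon as another condition is combined with AND, because AND binds more tightly than OR. A small sketch with Python's sqlite3 module shows the difference. The tx table, its rows, and the amount > 0 condition are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tx (id INTEGER, description TEXT, amount INTEGER);
    INSERT INTO tx VALUES (1, 'share bonus', 0), (2, 'shake to win', 5), (3, 'coffee', 5);
""")

# Without parentheses this is parsed as: share OR (shake AND amount > 0).
without_parens = [r[0] for r in conn.execute("""
    SELECT id FROM tx
    WHERE description LIKE '%share%'
       OR description LIKE '%shake to win%' AND amount > 0
    ORDER BY id
""")]
print(without_parens)  # [1, 2] -> row 1 slips through despite amount = 0

# With parentheses the amount filter applies to both descriptions.
with_parens = [r[0] for r in conn.execute("""
    SELECT id FROM tx
    WHERE (description LIKE '%share%'
        OR description LIKE '%shake to win%') AND amount > 0
    ORDER BY id
""")]
print(with_parens)  # [2]
```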

SQL 2 Left outer joins with Sum and Group By

Looking for some guidance on this. I am attempting to run a report in my complaint management system: complaints by year, location, and subcategory, showing totals for TotalCredits (child table) and TotalCwts (child table), as well as the total ExternalRootCause (on the master table).
This is my SQL, but the TotalCwts and TotalCredits are not being calculated correctly. It calculates 1 time for each child record rather than the total for each master record.
SELECT
dbo.Complaints.Location,
YEAR(dbo.Complaints.ComDate) AS Year,
dbo.Complaints.ComplaintSubcategory,
COUNT(Distinct(dbo.Complaints.ComId)) AS CustomerComplaints,
SUM(DISTINCT CASE WHEN (dbo.Complaints.RootCauseSource = 'External' ) THEN 1 ELSE 0 END) as ExternalRootCause,
SUM(dbo.ComplaintProducts.Cwts) AS TotalCwts,
Coalesce(SUM(dbo.CreditDeductions.CreditAmount),0) AS TotalCredits
FROM dbo.Complaints
JOIN dbo.CustomerComplaints
ON dbo.Complaints.ComId = dbo.CustomerComplaints.ComId
LEFT OUTER JOIN dbo.CreditDeductions
ON dbo.Complaints.ComId = dbo.CreditDeductions.ComId
LEFT OUTER JOIN dbo.ComplaintProducts
ON dbo.Complaints.ComId = dbo.ComplaintProducts.ComId
WHERE
dbo.Complaints.Location = Coalesce(@Location,Location)
GROUP BY
YEAR(dbo.Complaints.ComDate),
dbo.Complaints.Location,
dbo.Complaints.ComplaintSubcategory
ORDER BY
[YEAR] desc,
dbo.Complaints.Location,
dbo.Complaints.ComplaintSubcategory
Data Results
Location | Year | Subcategory | Complaints | External RC | Total Cwts | Total Credits
---------------------------------------------------------------------------------------
Boston | 2016 | Documentation | 1 | 0 | 8 | 8.00
Data Should Read
Location | Year | Subcategory | Complaints | External RC | Total Cwts | Total Credits
---------------------------------------------------------------------------------------
Boston | 2016 | Documentation | 1 | 0 | 4 | 2.00
Above data reflects 1 complaint having 4 Product Records with 1cwt each and 2 credit records with 1.00 each.
What do I need to change in my query or should I approach this query a different way?
The problem is that the 1 complaint has 2 Deductions and 4 products. When you join in this manner then it will return every combination of Deduction/Product for the complaint which gives 8 rows as you're seeing.
One solution, which should work here, is to not query the Deduction and Product tables directly; instead, join to a subquery that returns one row per complaint for each table. In other words, replace:
LEFT OUTER JOIN dbo.CreditDeductions ON dbo.Complaints.ComId = dbo.CreditDeductions.ComId
LEFT OUTER JOIN dbo.ComplaintProducts ON dbo.Complaints.ComId = dbo.ComplaintProducts.ComId
...with this - showing the Deductions table only, you can work out the Products:
LEFT OUTER JOIN (
select ComId, count(*) CountDeductions, sum(CreditAmount) CreditAmount
from dbo.CreditDeductions
group by ComId
) d on d.ComId = Complaints.ComId
You'll have to change the references to dbo.CreditDeductions to just d (or whatever you want to call it).
Once you've done this for both tables, you'll have one row from each per complaint, which results in 1 row per complaint containing the counts and totals from the two sub-tables.
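The effect can be checked on the scenario from the question (1 complaint, 2 credit records of 1.00 each, 4 product records of 1 cwt each). The sketch below uses Python's sqlite3 module as a stand-in for SQL Server, with tables trimmed to the relevant columns. The aliases c, d, and p and the column names TotalCredits and TotalCwts inside the subqueries are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Complaints (ComId INTEGER PRIMARY KEY, Location TEXT);
    CREATE TABLE CreditDeductions (ComId INTEGER, CreditAmount REAL);
    CREATE TABLE ComplaintProducts (ComId INTEGER, Cwts INTEGER);
    INSERT INTO Complaints VALUES (1, 'Boston');
    INSERT INTO CreditDeductions VALUES (1, 1.0), (1, 1.0);
    INSERT INTO ComplaintProducts VALUES (1, 1), (1, 1), (1, 1), (1, 1);
""")

# Joining both child tables directly yields 2 x 4 = 8 combined rows.
naive = conn.execute("""
    SELECT COUNT(*), SUM(p.Cwts), SUM(d.CreditAmount)
    FROM Complaints c
    LEFT JOIN CreditDeductions d ON d.ComId = c.ComId
    LEFT JOIN ComplaintProducts p ON p.ComId = c.ComId
""").fetchone()
print(naive)  # (8, 8, 8.0)

# Pre-aggregating each child table to one row per complaint gives the expected totals.
fixed = conn.execute("""
    SELECT c.Location, COALESCE(p.TotalCwts, 0), COALESCE(d.TotalCredits, 0)
    FROM Complaints c
    LEFT JOIN (SELECT ComId, SUM(CreditAmount) AS TotalCredits
               FROM CreditDeductions GROUP BY ComId) d ON d.ComId = c.ComId
    LEFT JOIN (SELECT ComId, SUM(Cwts) AS TotalCwts
               FROM ComplaintProducts GROUP BY ComId) p ON p.ComId = c.ComId
""").fetchone()
print(fixed)  # ('Boston', 4, 2.0)
```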

INNER JOIN Need to use column value twice in results

I've put in the requisite 2+ hours of digging and not getting an answer.
I'd like to merge 3 SQL tables, where Table A and B share a column in common, and Table B and C share a column in common--Tables A and C do not.
For example:
Table A - entity_list
entity_id | entity_name | Other, irrelevant columns
Example:
1 | Microsoft |
2 | Google |
Table B - transaction_history
transaction_id | purchasing_entity | supplying_entity | other, irrelevant columns
Example:
1 | 2 | 1
Table C - transaction_details
transaction_id | amount_of_purchase | Other, irrelevant columns
1 | 5000000 |
Using INNER JOIN, I've been able to get a result where I can link entity_name to either purchasing_entity or supplying_entity. And then, in the results, rather than seeing the entity_id, I get the entity name. But I want to substitute the entity name for both purchasing and supplying entity.
My ideal results would look like this:
1 [transaction ID] | Microsoft | Google | 5000000
The closest I've come is:
1 [transaction ID] | Microsoft | 2 [Supplying Entity] | 5000000
To get there, I've done:
SELECT transaction_history.transaction_id,
entity_list.entity_name,
transaction_history.supplying_entity,
transaction_details.amount_of_purchase
FROM transaction_history
INNER JOIN entity_list
ON transaction_history.purchasing_entity=entity_list.entity_id
INNER JOIN transaction_details
ON transaction_history.transaction_id=transaction_details.transaction_id
I can't get entity_name to feed to both purchasing_entity and supplying_entity.
Here is the query:
SELECT h.transaction_id,
h.purchasing_entity, purchaser.entity_name AS purchasing_entity_name,
h.supplying_entity, supplier.entity_name AS supplying_entity_name,
d.amount_of_purchase
FROM transaction_history h
INNER JOIN transaction_details d
ON h.transaction_id = d.transaction_id
INNER JOIN entity_list purchaser
ON h.purchasing_entity = purchaser.entity_id
INNER JOIN entity_list supplier
ON h.supplying_entity = supplier.entity_id
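The key point is that entity_list is joined twice under two different aliases, so each role gets its own lookup. Here is a runnable sketch with Python's sqlite3 module and the sample rows from the question. SQLite is a stand-in for whatever database is in use, the tables are cut down to the relevant columns, and the output column aliases are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_list (entity_id INTEGER PRIMARY KEY, entity_name TEXT);
    CREATE TABLE transaction_history (
        transaction_id INTEGER PRIMARY KEY,
        purchasing_entity INTEGER,
        supplying_entity INTEGER
    );
    CREATE TABLE transaction_details (
        transaction_id INTEGER PRIMARY KEY,
        amount_of_purchase INTEGER
    );
    INSERT INTO entity_list VALUES (1, 'Microsoft'), (2, 'Google');
    INSERT INTO transaction_history VALUES (1, 2, 1);
    INSERT INTO transaction_details VALUES (1, 5000000);
""")

# entity_list appears twice: once aliased as purchaser, once as supplier.
rows = conn.execute("""
    SELECT h.transaction_id,
           purchaser.entity_name AS purchasing_entity_name,
           supplier.entity_name  AS supplying_entity_name,
           d.amount_of_purchase
    FROM transaction_history h
    INNER JOIN transaction_details d ON h.transaction_id = d.transaction_id
    INNER JOIN entity_list purchaser ON h.purchasing_entity = purchaser.entity_id
    INNER JOIN entity_list supplier  ON h.supplying_entity = supplier.entity_id
""").fetchall()
# The sample row has purchasing_entity = 2 and supplying_entity = 1.
print(rows)  # [(1, 'Google', 'Microsoft', 5000000)]
```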

A query to return a mix of SUM and COUNT in 5 joined tables

I have a table named Ads, containing one row for each ad.
| ID | AdTitle | AdDescription | ... |
I also have 3 tables named Applications, Referrals and Subscribers, with rows for each application, referral and subscriber associated with an Ad.
| ID | AdID | ApplicantName | ApplicantEmail | ... |
| ID | AdID | ReferrerEmail | ReferralEmail | ... |
| ID | AdID | SubscriberEmail | ... |
Finally I have a table named Views with one row for each ad, containing the total number of views for that ad.
| ID | AdID | Views |
I'm trying to write a query with a summary for each ad in 6 columns: Ad ID, Ad title, number of applications/referrals/subscribers and total views.
A simple query of all tables that I have been working with:
SELECT *
FROM Ads
LEFT JOIN Applications ON Ads.ID = Applications.AdID
LEFT JOIN Referrals ON Ads.ID = Referrals.AdID
LEFT JOIN Subscribers ON Ads.ID = Subscribers.AdID
LEFT JOIN Views ON Ads.ID = Views.AdID
I have tried a lot of combinations of LEFT and INNER joins, GROUP BY, COUNT(...), COUNT(DISTINCT ...) and SUM(CASE ...), but nothing so far has worked. I end up counting NULL values from previous columns, counting entries twice or not at all, counting the number of rows in the Views table instead of adding them together, and so on.
Is it better to split this up into multiple queries, or is there a good way to achieve what I want with a single one?
Try this:
SELECT A.*,
(select count(*) from Applications as A1 where A.ID = A1.AdID) as Applicants,
(select count(*) from Referrals as R where A.ID = R.AdID) as Referrals,
(select count(*) from Subscribers as S where A.ID = S.AdID) as Subscribers,
(select sum(V.Views) from Views as V where A.ID = V.AdID) as Views
FROM Ads as A
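Because each correlated subquery is evaluated per ad on its own, the child tables never multiply each other's rows. The sketch below tries this with Python's sqlite3 module. All rows are invented for the illustration, the tables carry only the ID and AdID columns plus Views, and the view total is summed rather than counted, since the question describes Views as holding a total per ad.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ads (ID INTEGER PRIMARY KEY, AdTitle TEXT);
    CREATE TABLE Applications (ID INTEGER PRIMARY KEY, AdID INTEGER);
    CREATE TABLE Referrals (ID INTEGER PRIMARY KEY, AdID INTEGER);
    CREATE TABLE Subscribers (ID INTEGER PRIMARY KEY, AdID INTEGER);
    CREATE TABLE Views (ID INTEGER PRIMARY KEY, AdID INTEGER, Views INTEGER);
    -- Hypothetical sample data: ad 1 has activity, ad 2 has none.
    INSERT INTO Ads VALUES (1, 'First ad'), (2, 'Second ad');
    INSERT INTO Applications VALUES (1, 1), (2, 1);
    INSERT INTO Referrals VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO Subscribers VALUES (1, 1);
    INSERT INTO Views VALUES (1, 1, 250);
""")

summary = conn.execute("""
    SELECT A.ID, A.AdTitle,
           (SELECT COUNT(*) FROM Applications ap WHERE ap.AdID = A.ID) AS Applicants,
           (SELECT COUNT(*) FROM Referrals r     WHERE r.AdID  = A.ID) AS Referrals,
           (SELECT COUNT(*) FROM Subscribers s   WHERE s.AdID  = A.ID) AS Subscribers,
           -- COALESCE turns the NULL from an ad without a Views row into 0.
           (SELECT COALESCE(SUM(v.Views), 0) FROM Views v WHERE v.AdID = A.ID) AS TotalViews
    FROM Ads A
    ORDER BY A.ID
""").fetchall()
print(summary)  # [(1, 'First ad', 2, 3, 1, 250), (2, 'Second ad', 0, 0, 0, 0)]
```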