How to capture rows that match an aggregate


Say I have a table with (pseudocode):
TABLE Order
(
orderid int,
type int,
price NUMERIC(18,2),
)
Now I want to list those orders whose price matches the maximum price for a particular order type.
I start with the following, giving me the max price per order type:
SELECT type, MAX(price)
FROM Order
GROUP BY type
Now I know the maximum price by type. However, I want to, as efficiently as possible, get a result set of the actual orders whose price is that maximum price, instead of just the type/MAX(price).
The table is very large with potentially tens of millions of rows, so efficiency is key here (assuming proper indexes are in place, of course, such as on the type column in this case).
I start with something like:
SELECT orderid, price
FROM Order AS O
WHERE O.price=(SELECT MAX(O2.price)
FROM Order AS O2
WHERE O2.type=O.type)
It's not particularly fast, but it does the job.
Then I realize that orders appear multiple times in this table, because it's actually a denormalized order history table and it really looks more like:
TABLE Order
(
id int, -- This is just an identity column - the surrogate key
orderid int, -- multiple records exist for the same
-- orderid with different update times
type int,
price NUMERIC(18,2),
updatetime DATETIME2(3)
)
So, what I want is actually the latest version of those orders based on updatetime whose price matches the maximum price for their particular type. This is my question.
Extending:
SELECT *
FROM Order AS O
WHERE O.price=(SELECT MAX(O2.price)
FROM Order AS O2
WHERE O2.type=O.type)
..., to handle the new requirement seems like a mess waiting to happen. So I was wondering what a good, efficient (and hopefully readable) solution to the new requirement would be.
Based on Gordon's suggestion of:
select o.*
from (select o.*,
row_number() over (partition by type, price order by updatetime desc) as seqnum
from (select o.*, max(o.price) over (partition by type) as maxprice
from Orders o
) o
where price = maxprice
) o
where seqnum = 1;
I have come up with the following query, with comments added to describe my thought process. The comments should of course be read from the innermost query outward:
SELECT *
FROM
(
-- We want the max price for each order type, but we only want to
-- use the latest version of each order (i.e., seqnum=1). So, we
-- partition by type/seqnum, calculate the max price for each
-- partition and then only use the max prices from the seqnum=1
-- partitions for each type via the WHERE clause in the outer query
SELECT *,
MAX(price) OVER (PARTITION BY type, seqnum) AS maxprice
FROM
(
-- We only want to examine the latest version of each order.
-- BTW, the order price can change between versions.
-- So, let's start by marking the latest version of each order
-- with seqnum=1 which we will use as a "filter in" clause later
SELECT *,
row_number() OVER (PARTITION BY orderid
ORDER BY updatetime DESC) AS seqnum
FROM Order
) AS O
WHERE seqnum=1 -- Discard all but the latest versions of orders
) AS O
WHERE price=maxprice
I am not sure if this is correct though, because it is quite complicated...
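One way to check a query like this is to run it against a handful of rows where the answer is known in advance. Below is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The table name order_history and the sample rows are assumptions made up for illustration, and the query is a simplified variant of the one above with a FROM clause on the innermost SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_history (          -- hypothetical stand-in for the Order table
    id INTEGER PRIMARY KEY,
    orderid INTEGER,
    type INTEGER,
    price NUMERIC,
    updatetime TEXT
);
INSERT INTO order_history (orderid, type, price, updatetime) VALUES
    (1, 1, 50, '2020-01-01'),   -- old version of order 1
    (1, 1, 30, '2020-01-02'),   -- latest version of order 1 (price dropped)
    (2, 1, 40, '2020-01-01'),
    (3, 2, 10, '2020-01-01'),
    (3, 2, 20, '2020-01-03');
""")

query = """
SELECT orderid, type, price, updatetime
FROM (
    -- WHERE runs before the window function at the same level,
    -- so MAX() only sees the latest version of each order.
    SELECT *, MAX(price) OVER (PARTITION BY type) AS maxprice
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY orderid
                                  ORDER BY updatetime DESC) AS seqnum
        FROM order_history
    ) AS versions
    WHERE seqnum = 1
) AS latest
WHERE price = maxprice
ORDER BY type
"""
rows = conn.execute(query).fetchall()
print(rows)
```

On this sample the superseded price of 50 for order 1 is ignored, so order 2 is the maximum for type 1 and order 3 for type 2.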

Use window functions. Your original query can be written as:
select o.*
from (select o.*, max(o.price) over (partition by type) as maxprice
from Orders o
) o
where price = maxprice;
If you want the most recent order for the price:
select o.*
from (select o.*, max(o.price) over (partition by type) as maxprice,
row_number() over (partition by type, price order by updatetime desc) as seqnum
from Orders o
) o
where price = maxprice and seqnum = 1;
EDIT:
This would be a bit more efficient with an index on Orders(type, price, updatetime). You can also try writing this as:
select o.*
from (select o.*,
row_number() over (partition by type, price order by updatetime desc) as seqnum
from (select o.*, max(o.price) over (partition by type) as maxprice
from Orders o
) o
where price = maxprice
) o
where seqnum = 1;
This may greatly reduce the data being used for the second analytic function.
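It may help to see what the "most recent order for the price" query returns on a history table where prices change between versions. The sketch below uses Python's sqlite3 module (SQLite 3.25+ for window functions). The table name order_history and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_history (          -- hypothetical sample table
    id INTEGER PRIMARY KEY,
    orderid INTEGER,
    type INTEGER,
    price NUMERIC,
    updatetime TEXT
);
INSERT INTO order_history (orderid, type, price, updatetime) VALUES
    (1, 1, 50, '2020-01-01'),   -- old version of order 1
    (1, 1, 30, '2020-01-02'),   -- latest version of order 1
    (2, 1, 40, '2020-01-01'),
    (3, 2, 10, '2020-01-01'),
    (3, 2, 20, '2020-01-03');
""")

query = """
SELECT orderid, type, price, updatetime
FROM (
    SELECT o.*,
           MAX(price) OVER (PARTITION BY type) AS maxprice,
           ROW_NUMBER() OVER (PARTITION BY type, price
                              ORDER BY updatetime DESC) AS seqnum
    FROM order_history AS o
) AS ranked
WHERE price = maxprice AND seqnum = 1
ORDER BY type
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Here the maximum is taken over every version, so for type 1 the result is the old 50 row of order 1 even though that order was later repriced to 30. If only the latest version of each order should count, the rows would first have to be filtered per orderid, as the query in the question attempts to do.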

Related

Second minimum value for every customer

I am using MySQL database. So, there are two columns I am working on, CustomerId, and OrderDate. I want to find a second-order date (2nd minimum order date) for each customer.

If you are using MySQL 8+, then ROW_NUMBER can be used here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY OrderDate) rn
FROM yourTable
)
SELECT CustomerId, OrderDate
FROM cte
WHERE rn = 2;

I would recommend using DENSE_RANK, as it gives the correct result even if there are duplicate order dates:
SELECT * FROM
(SELECT t.*, DENSE_RANK() OVER (PARTITION BY CustomerId ORDER BY OrderDate) dr
FROM yourTable t
) t where dr = 2;
You can use a correlated subquery as follows if your MySQL version does not support analytic functions:
SELECT T.*
FROM YOURTABLE T
WHERE 1 = (
SELECT COUNT(DISTINCT TT.ORDERDATE)
FROM YOURTABLE TT
WHERE TT.CUSTOMERID = T.CUSTOMERID
AND TT.ORDERDATE < T.ORDERDATE
)
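To see how ROW_NUMBER and DENSE_RANK differ when a customer has two orders on the same date, here is a small sketch run through Python's sqlite3 module (SQLite 3.25+ is needed for window functions; MySQL 8+ should behave the same way for these queries). The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourTable (CustomerId INTEGER, OrderDate TEXT);
INSERT INTO yourTable VALUES
    (1, '2021-01-01'),
    (1, '2021-01-01'),   -- duplicate first date for customer 1
    (1, '2021-02-01'),
    (2, '2021-03-01'),
    (2, '2021-04-01');
""")

# Second row per customer: a duplicate first date counts as the "second" order.
by_row_number = conn.execute("""
    SELECT CustomerId, OrderDate
    FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY CustomerId
                                         ORDER BY OrderDate) AS rn
          FROM yourTable AS t) AS numbered
    WHERE rn = 2
    ORDER BY CustomerId
""").fetchall()

# Second distinct date per customer: duplicates share a rank.
by_dense_rank = conn.execute("""
    SELECT CustomerId, OrderDate
    FROM (SELECT t.*, DENSE_RANK() OVER (PARTITION BY CustomerId
                                         ORDER BY OrderDate) AS dr
          FROM yourTable AS t) AS ranked
    WHERE dr = 2
    ORDER BY CustomerId
""").fetchall()

print(by_row_number)
print(by_dense_rank)
```

Which one is right depends on whether "second order date" means the date of the second order or the second distinct date. Note that DENSE_RANK can return several rows per customer if the second date itself is duplicated.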

I would use a subquery like this:
select o.*
from orders o
where o.order_date = (select o2.order_date
from orders o2
where o2.customer_id = o.customer_id
order by o2.order_date
limit 1 offset 1
);
The subquery is a correlated subquery that returns the second date. If you want the second date with other columns, it can be moved to the select.
With an index on (customer_id, order_date), this is likely to be the fastest solution.
This assumes that there is one row per date (or that if there are multiple rows, "second" can be the earliest date). If you want the second distinct date, then use select distinct in the subquery. However, select distinct and group by would incur additional overhead.
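The LIMIT 1 OFFSET 1 approach needs no window functions, so it also runs on older engines. A minimal sketch with Python's sqlite3 module follows; the sample rows and the index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
-- Index suggested in the answer; the name is hypothetical.
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
INSERT INTO orders VALUES
    (1, '2021-01-01'),
    (1, '2021-02-01'),
    (1, '2021-03-01'),
    (2, '2021-05-01');   -- customer 2 has only one order
""")

rows = conn.execute("""
    SELECT o.customer_id, o.order_date
    FROM orders AS o
    WHERE o.order_date = (SELECT o2.order_date
                          FROM orders AS o2
                          WHERE o2.customer_id = o.customer_id
                          ORDER BY o2.order_date
                          LIMIT 1 OFFSET 1)
    ORDER BY o.customer_id
""").fetchall()
print(rows)
```

Customer 2 drops out because the subquery returns no row for a customer with a single order, and a comparison with NULL is never true.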

SQL Server : select only last record per customer from a join query

Assume I have these 3 tables :
The first 2 tables define customers of different types, i.e. the second table has other columns which are not included in table 1; I just left them the same to reduce complexity.
The third table defines orders for both types of customers. Each customer has more than one order.
I want to select the last order for every customer, i.e. the order with order_id 4 for customer 1, which was created on 23/12/2016, and the order with order_id 5 for customer 2, which was created on 26/12/2016.
I tried something like this :
select *
from customertype1
left join order on order.customer_id = customertype1.customer_id
order by order_id desc;
But this gives me multiple records for every customer, as I have stated above I want only the last order for every customertype1.

If you want the last order for each customer, then you only need the orders table:
select o.*
from (select o.*,
row_number() over (partition by customer_id order by datecreated desc) as seqnum
from orders o
) o
where seqnum = 1;
If you want to include all customers, then you need to combine the two tables. Assuming they are mutually exclusive:
with c as (
select customer_id from customers1 union all
select customer_id from customers2
)
select o.*
from c left join
(select o.*,
row_number() over (partition by customer_id order by datecreated desc) as seqnum
from orders o
) o
on c.customer_id = o.customer_id and seqnum = 1;
A note about your data structure: You should have one table for all customers. You can then define a foreign key constraint between orders and customers. For the additional columns, you can have additional tables for the different types of customers.

Use ROW_NUMBER() and PARTITION BY.
ROW_NUMBER(): assigns a sequence number to each row.
PARTITION BY: groups your data by the given column.
When you use ROW_NUMBER() and PARTITION BY together, PARTITION BY first groups your records and ROW_NUMBER() then numbers the rows within each group, so the sequence restarts from 1 in every group.
Help Link: Example of ROW_NUMBER() and PARTITION BY
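The numbering described above can be made visible by selecting the sequence number itself before filtering on it. This is a small sketch using Python's sqlite3 module (SQLite 3.25+). The rows are invented, chosen so that order 4 is the last order of customer 1 and order 5 the last of customer 2, as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, datecreated TEXT);
INSERT INTO orders VALUES
    (1, 1, '2016-12-20'),
    (2, 2, '2016-12-21'),
    (3, 1, '2016-12-22'),
    (4, 1, '2016-12-23'),
    (5, 2, '2016-12-26');
""")

numbering = """
    SELECT order_id, customer_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY datecreated DESC) AS seqnum
    FROM orders
"""

# Every row with its per-customer sequence number (1 = newest).
numbered = conn.execute(
    numbering + " ORDER BY customer_id, seqnum"
).fetchall()

# Keeping only seqnum = 1 leaves the last order of each customer.
latest = conn.execute(
    f"SELECT order_id, customer_id FROM ({numbering}) AS o "
    "WHERE seqnum = 1 ORDER BY customer_id"
).fetchall()

print(numbered)
print(latest)
```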

This is the general idea. You can work out the details.
with customers as
(select customer_id, customer_name
from table1
union
select customer_id, customer_name
from table2)
, lastOrder as
(select customer_id, max(order_id) maxOrderId
from orders
group by customer_id)
select *
from lastOrder join customers on lastOrder.Customer_id = customers.customer_id
join orders on order_id = maxOrderId

SQL Insert Statement that pulls top n from each set of categories that could have duplicates

I am trying to write an Insert statement that will go through sales numbers for a group of people, with each sale being marked as an R or C type of sale. I want to find the TOP 100 salespersons in ALL (both R and C), R, and C. I don't only have sales data, though: I have Sales, Margin, Count, and Sales/Count data that I want to do the same thing for. So far I have to run 12 SQL statements to accomplish this (4 categories x 3 sales types); each one is a slight variation of the following, which covers one of my 4 categories.
INSERT INTO ztbl_AllTopSalesPerson (SalesPerson)
SELECT TOP 100 tbl_Master.SalesPerson
FROM tbl_Master
WHERE tbl_Master.SaleType="C"
GROUP BY tbl_Master.SalesPerson
ORDER BY Sum(tbl_Master.Margin) DESC;
INSERT INTO ztbl_AllTopSalesPerson (SalesPerson)
SELECT TOP 100 tbl_Master.SalesPerson
FROM tbl_Master
WHERE tbl_Master.SaleType="R"
GROUP BY tbl_Master.SalesPerson
ORDER BY Sum(tbl_Master.Margin) DESC;
INSERT INTO ztbl_AllTopSalesPerson (SalesPerson)
SELECT TOP 100 tbl_Master.SalesPerson
FROM tbl_Master
GROUP BY tbl_Master.SalesPerson
ORDER BY Sum(tbl_Master.Margin) DESC;
Ideally I would like a way to make this all one statement. And (if it is not impossible) I would like to filter each one by date so I can do it by monthly data too, not just overall.
Just a few notes: I can't have duplicate names, so if a salesperson is top in all three sales types, they still only appear once. I'm using Access with a SQL Server back-end for only the main data table. I can't take the top 300 results, because there is so much overlap between the sales types, and I need the top from each (I do a separate query after this list is made that lines up the salespersons alphabetically with their 4 categories as fields). And lastly, I generally end up with a final list that has around 260-290 records.
THANKS!
p.s. thanks for your replies, stack exchange has saved my bacon 100s of times. I would post my attempts at this, but I think it would hurt more than it would help.

You might have to tweak it a little depending on what sort of output you want. You also might have to do a subquery for the COUNT(*) part of it, as this is untested. But I think this is the general idea of what you are looking for.
To get aggregated information, you can break it up into two CTE's:
WITH CTE1 AS (
SELECT SalesPerson,
SaleType,
SUM(Margin) OVER (PARTITION BY SalesPerson,SaleType) as Margin,
SUM(Sales) OVER (PARTITION BY SalesPerson,SaleType) as Sales,
SUM(Sales) OVER (PARTITION BY SalesPerson,SaleType)
/ COUNT(*) OVER (PARTITION BY SalesPerson,SaleType) as Sales_pct,
COUNT(*) OVER (PARTITION BY SalesPerson,SaleType) as Total,
SUM(Margin) OVER (PARTITION BY SalesPerson) as Margin_all,
SUM(Sales) OVER (PARTITION BY SalesPerson) as Sales_all,
SUM(Sales) OVER (PARTITION BY SalesPerson)
/ COUNT(*) OVER (PARTITION BY SalesPerson) as Sales_pct_all,
COUNT(*) OVER (PARTITION BY SalesPerson) as Total_all
FROM tbl_Master
)
,CTE2 AS (
-- DENSE_RANK rather than RANK: CTE1 repeats each salesperson's totals on
-- every sale row, and RANK would skip ahead by the number of repeated rows
SELECT SalesPerson
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Margin desc) as Margin
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Sales desc) as Sales
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Sales_pct desc) as Sales_pct
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Total desc) as Total
,DENSE_RANK() OVER (ORDER BY Margin_all desc) as Margin_all
,DENSE_RANK() OVER (ORDER BY Sales_all desc) as Sales_all
,DENSE_RANK() OVER (ORDER BY Sales_pct_all desc) as Sales_pct_all
,DENSE_RANK() OVER (ORDER BY Total_all desc) as Total_all
FROM CTE1 )
Select distinct SalesPerson from CTE2
Where Margin <= 100 Or Sales <= 100 Or Total <= 100 or Sales_pct <= 100
Or Margin_all <= 100 Or Sales_all <= 100 Or Total_all <= 100 or Sales_pct_all <= 100
I understand this is not perfect, but it should get you started. To filter by date, add DATEPART(month,[your date field]) to your PARTITION BY clauses (and the first CTE)
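The same idea can be tried at a small scale. The sketch below is not the query above: it swaps the per-row window sums for plain GROUP BY aggregates, ranks only by Margin, and uses a top-1 cutoff instead of 100 so the effect is visible on a few rows. It runs through Python's sqlite3 module (SQLite 3.25+), and the sample data and the top_n parameter are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_Master (SalesPerson TEXT, SaleType TEXT, Margin INTEGER);
INSERT INTO tbl_Master VALUES
    ('Ann', 'C', 60), ('Ann', 'C', 40), ('Ann', 'R', 10),
    ('Bob', 'R', 90), ('Bob', 'C', 5),
    ('Cid', 'C', 60), ('Cid', 'R', 60),
    ('Dee', 'C', 1);
""")

query = """
WITH per_type AS (          -- one row per salesperson and sale type
    SELECT SalesPerson, SaleType, SUM(Margin) AS Margin
    FROM tbl_Master
    GROUP BY SalesPerson, SaleType
),
overall AS (                -- one row per salesperson across both types
    SELECT SalesPerson, SUM(Margin) AS Margin
    FROM tbl_Master
    GROUP BY SalesPerson
),
ranked AS (
    SELECT SalesPerson,
           RANK() OVER (PARTITION BY SaleType ORDER BY Margin DESC) AS rnk
    FROM per_type
    UNION ALL
    SELECT SalesPerson, RANK() OVER (ORDER BY Margin DESC) AS rnk
    FROM overall
)
SELECT DISTINCT SalesPerson   -- a person who tops several lists appears once
FROM ranked
WHERE rnk <= :top_n
ORDER BY SalesPerson
"""
top_people = [row[0] for row in conn.execute(query, {"top_n": 1})]
print(top_people)
```

Ann leads type C, Bob leads type R, and Cid leads overall, so all three are returned once each while Dee is left out. Aggregating before ranking keeps one row per salesperson in each list, which is why plain RANK is enough here.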

Using max function without grouping

I have three sets of information:
productID
date
seller
I need to get information of last product sold for each productID and sold by. I tried using max value of date but that forces me to use grouping for seller as well but I don't want to group by seller. I want to group by productID, get the date it was sold last and by who. How can I avoid grouping on seller?

Use a window function, which will help you find the latest date in each group (ProductID):
SELECT ProductID,
[date],
seller
FROM (SELECT Row_number()
OVER(
partition BY ProductID
ORDER BY [date] desc) Rn,
*
FROM tablename) a
WHERE rn = 1
Or you can use the MAX aggregate with GROUP BY and join the result back to the table:
SELECT a.ProductID,
a.[date],
a.seller
FROM tablename a
JOIN (SELECT Max([date]) [date],
productid
FROM tablename
group by productid) b
ON a.productid = b.productid
AND a.[date] = b.[date]
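The join-back approach groups only by the product, then uses the (product, latest date) pairs to pick up the seller from the original rows. A minimal sketch with Python's sqlite3 module follows; the table name sales, the column name sale_date, and the rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (ProductID INTEGER, sale_date TEXT, seller TEXT);
INSERT INTO sales VALUES
    (1, '2022-01-01', 'amy'),
    (1, '2022-02-01', 'ben'),
    (2, '2022-01-15', 'cat');
""")

rows = conn.execute("""
    SELECT a.ProductID, a.sale_date, a.seller
    FROM sales AS a
    JOIN (SELECT ProductID, MAX(sale_date) AS sale_date
          FROM sales
          GROUP BY ProductID) AS b          -- seller is not grouped here
      ON a.ProductID = b.ProductID
     AND a.sale_date = b.sale_date
    ORDER BY a.ProductID
""").fetchall()
print(rows)
```

If a product was sold more than once on its latest date, this returns every such row, whereas the ROW_NUMBER version returns exactly one.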

SQL question about GROUP BY

I've been using SQL for a few years, and this type of problem comes up here and there, and I haven't found an answer. But perhaps I've been looking in the wrong places - I'm not really sure what to call it.
For the sake of brevity, let's say I have a table with 3 columns: Customer, Order_Amount, Order_Date. Each customer may have multiple orders, with one row for each order with the amount and date.
My Question: Is there a simple way in SQL to get the DATE of the maximum order per customer?
I can get the amount of the maximum order for each customer (and which customer made it) by doing something like:
SELECT Customer, MAX(Order_Amount) FROM orders GROUP BY Customer;
But I also want to get the date of the max order, which I haven't figured out a way to easily get. I would have thought that this would be a common type of question for a database, and would therefore be easy to do in SQL, but I haven't found an easy way to do it yet. Once I add Order_Date to the list of columns to select, I need to add it to the Group By clause, which I don't think will give me what I want.

One option is a self-join with GROUP BY and HAVING:
SELECT o1.Customer, o1.Order_Amount, o1.Order_Date
FROM orders o1 JOIN orders o2 ON o1.Customer = o2.Customer
GROUP BY o1.Customer, o1.Order_Amount, o1.Order_Date
HAVING o1.Order_Amount = MAX(o2.Order_Amount);
There's a good article reviewing various approaches.
And in Oracle, db2, Sybase, SQL Server 2005+ you would use RANK() OVER.
SELECT * FROM (
SELECT o.*,
RANK() OVER (PARTITION BY Customer ORDER BY Order_Amount DESC) r
FROM orders o) ranked
WHERE r = 1;
Note: If Customer has more than one order with maximum Order_Amount (i.e. ties), using RANK() function would get you all such orders; to get only first one, replace RANK() with ROW_NUMBER().
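The difference on ties can be checked with a customer who has two orders at the same maximum amount. This sketch uses Python's sqlite3 module (SQLite 3.25+), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (Customer TEXT, Order_Amount INTEGER, Order_Date TEXT);
INSERT INTO orders VALUES
    ('acme', 100, '2020-01-01'),
    ('acme', 100, '2020-02-01'),   -- ties with the row above
    ('acme',  50, '2020-03-01'),
    ('zed',   70, '2020-01-05');
""")

def top_orders(ranking_function):
    # ranking_function is one of two fixed names below, never user input.
    query = f"""
        SELECT Customer, Order_Amount, Order_Date
        FROM (SELECT o.*,
                     {ranking_function}() OVER (PARTITION BY Customer
                                                ORDER BY Order_Amount DESC) AS r
              FROM orders AS o) AS ranked
        WHERE r = 1
        ORDER BY Customer, Order_Date
    """
    return conn.execute(query).fetchall()

with_rank = top_orders("RANK")              # keeps both tied acme rows
with_row_number = top_orders("ROW_NUMBER")  # keeps one acme row

print(with_rank)
print(with_row_number)
```

With ROW_NUMBER, which of the tied rows is kept is arbitrary unless the ORDER BY includes a tie-breaker such as Order_Date.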

There's no short-cut... the easiest way is probably to join to a sub-query:
SELECT
*
FROM
orders JOIN
(
SELECT Customer, MAX(Order_Amount) AS Max_Order_Amount
FROM orders
GROUP BY Customer
) maxOrder
ON maxOrder.Customer = orders.Customer
AND maxOrder.Max_Order_Amount = orders.Order_Amount

you will want to join on the same table...
SELECT o.Customer, o.order_date, o2.amt
FROM orders o,
( SELECT Customer, MAX(Order_Amount) amt FROM orders GROUP BY Customer ) o2
WHERE o.customer = o2.customer
AND o.order_amount = o2.amt
;

Another approach for the collection:
WITH tempquery AS
(
SELECT
Customer
,Order_Amount
,Order_Date
,row_number() OVER (PARTITION BY Customer ORDER BY Order_Amount DESC) AS rn
FROM
orders
)
SELECT
Customer
,Order_Amount
,Order_Date
FROM
tempquery
WHERE
rn = 1

If your DB Supports CROSS APPLY you can do this as well, but it doesn't handle ties correctly
SELECT [....]
FROM Customer c
CROSS APPLY
(SELECT TOP 1 [...]
FROM Orders o
WHERE c.customerID = o.CustomerID
ORDER BY o.Order_Amount DESC) o
See this data.SE query

You could try something like this:
SELECT Customer, MAX(Order_Amount), Order_Date
FROM orders O
WHERE ORDER_AMOUNT = (SELECT MAX(ORDER_AMOUNT) FROM orders WHERE CUSTOMER = O.CUSTOMER)
GROUP BY CUSTOMER, Order_Date

with t as
(
select Customer, Order_Date, Order_Amount,
max(Order_Amount) over (partition by Customer) as max_amount
from orders
)
select * from t where t.Order_Amount = max_amount
