Easiest Approach to Selecting Top Result from Windowed Function? - sql

So, let's say I have a list of customers and I want to select details for all customers, as well as their most purchased product from a specific class of products. Even if they have not purchased one of these products I want to select the customer detail, while simply displaying null for their most purchased product from that class.
I would start with the following either as a CTE or temp table:
SELECT
CUST_NUMBER
,PRODUCT
,ROW_NUMBER() OVER (PARTITION BY CUST_NUMBER ORDER BY COUNT(ORDER_NUM) DESC) [ProdRank]
FROM ORDERS
WHERE PROD_CLASS = 'x'
GROUP BY
CUST_NUMBER
,PRODUCT
The thing is this - There can be many different products within this product class, and I am only interested in selecting where ProdRank = 1. As you might know though, I cannot specify either in the WHERE or in HAVING clause for ProdRank to = 1.
I get the error message "Windowed functions can only appear in the SELECT or ORDER BY clauses."
The situation is further complicated by the fact that many customers may have not ordered any products within the product class. Because of this I cannot simply left join the customer list to the above and specify WHERE ProdRank = 1, or else it mimics an inner join and I drop any customers where ProdRank is Null.
The method I've come up with in order to deal with this is to first create a temp table with the code above as #Products which includes the customer and every product with the respective ranking. I then create a second temp table called #TopProducts where I simply :
SELECT * FROM
#Products WHERE
ProdRank = 1
After that I just left join against #TopProducts from my Customers table.
It seems like there should be a simpler way of dealing with this though. Is there any way I can select the top partitioned result of ROW_NUMBER() or RANK() in one step, rather than creating two temp tables?

Use a Common Table Expression
WITH topProducts AS (
SELECT
CUST_NUMBER
,PRODUCT
,ROW_NUMBER() OVER (PARTITION BY CUST_NUMBER ORDER BY COUNT(ORDER_NUM) DESC) [ProdRank]
FROM ORDERS
WHERE PROD_CLASS = 'x'
GROUP BY
CUST_NUMBER
,PRODUCT
)
SELECT *
FROM CustomerDetails c
LEFT JOIN TopProducts p ON (ProdRank = 1 AND c.CUST_NUMBER = p.CUST_NUMBER)
Use a subquery:
SELECT *
FROM CustomerDetails c
LEFT JOIN (
SELECT
CUST_NUMBER
,PRODUCT
,ROW_NUMBER() OVER (PARTITION BY CUST_NUMBER ORDER BY COUNT(ORDER_NUM) DESC) [ProdRank]
FROM ORDERS
WHERE PROD_CLASS = 'x'
GROUP BY
CUST_NUMBER
,PRODUCT
) p ON (ProdRank = 1 AND c.CUST_NUMBER = p.CUST_NUMBER)

I would use outer apply and top in your scenario. Does that make sense?
Few examples here Real life example, when to use OUTER / CROSS APPLY in SQL
I would write a piece of code, but I'm on mobile and that's really not comfortable...

Related

SQL Server query for related products

I am trying to get related products but the issue which I'm facing is that there is product photos table which has one-to-many relationship with products table, so when I get products by matching category Id it also returns multiple product photos with that product which i do not want. I want only one product photo from product photos table of specific product. Is there any way to use distinct in joins or any other way? what I have done so far....
SELECT [Product].[ID],
,[Thumbnail]
,[ProductName]
,[Model]
,[SKU]
,[Price]
,[IsExclusive]
,[DiscountPercentage]
,[DiscountFixed]
,[NetPrice]
,[Url]
FROM [dbo].[Product]
INNER JOIN [ProductPhotos] ON [ProductPhotos].[ProductID]=[Product].[ID]
INNER JOIN [ProductCategories] ON [ProductCategories].[ProductID]=
[Product].[ID]
WHERE [ProductCategories].[CategoryID]=4
And the result I am getting is...
Product Photos table has
Is there any way to use distinct or group by on product Id column in product photos table to return only one row from photos table.
Instead of using inner join, use cross apply:
SELECT . . .
FROM dbo.Product p CROSS APPLY
(SELECT TOP (1) pp.*
FROM ProductPhotos pp
WHERE pp.ProductID = p.id
ORDER BY NEW_ID()
) pp INNER JOIN
ProductCategories pc
ON pc.ProductID = p.id
WHERE pc.CategoryID = 4;
Notes:
The ORDER BY NEWID() chooses a random photo. You can order by specific columns to get the earliest, latest, biggest, or whatever.
Note that I added table aliases. These make the query easier to write and to read.
You should qualify all column names in your query, so it is clear which tables they come from.
I removed the superfluous square braces. They just make the query harder to write and to read.
You can use ROW_NUMBER() to return one row for ProductID, like this:
JOIN (SELECT *,
ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY PhotoID) rn
FROM [ProductPhotos]) [ProductPhotos]
ON [ProductPhotos].[ProductID]=[Product].[ID] AND [ProductPhotos].rn = 1
Instead of this:
JOIN [ProductPhotos] ON [ProductPhotos].[ProductID]=[Product].[ID]
you can use sub query in join with distinct instead of joining table directly.
you can create alias and use that column as distinct in select statement, but it will create performance issues when having loads of data inside.
if you have 3 different photos for same product Id (like 2). you can use sub-query with top 1 order by PK desc to get latest picture.

stacking data from two tables with different dates

I have two tables, Sales and Returns. They have CustomerID, ProductCode, Name, SalesDate, SalesWeek, SalesAmount and ReturnsDate, ReturnsWeek, ReturnsAmount. What I really want to do is just join these tables and stack them on top of each other so the client has data in a single report for both sales and returns.
Sales and Returns dates are different but the product code, customer ID and Name can be same for a record in output table. For instance Customer A bought a product last month and returned it next month, so his record can appear in returns table.
To achieve this I have tried using union by selecting all columns between both tables but I am getting a mix of records for sales and returns with no consistency. All I want to do is see Nulls for Customers who have no business with Returns and vice versa. I was thinking Left Join in this case should work but it isn't working. So I am seeing mixed up data in all columns for Sales and Returns amounts. Attached is the picture that has two tables, the output I am seeing and the out put I want to see. Also I am performing Weekly aggregations for Sales and Returns amounts. What is the best and easiest way to achieve this ? I am sorry I may have not structured my question properly but the image might help #NewToSQL
Presumably, a customer can purchase a given product more than once. If so, a full join is the right solution, but you need a bit more logic to ensure non-duplication in the results:
select . . .
from (select s.*,
row_number() over (partition by product_code, customer_id order by (select NULL)) as seqnum
from sales s
) s full join
(select r.*,
row_number() over (partition by product_code, customer_id order by (select NULL)) as seqnum
from returns r
) r
on s.product_code = r.product_code and s.customer_id = r.customer_id;
I am leaving the name out, because I assume it is defined by the customer id.
You need to do a FULL JOIN sales table with returns table on those columns that you mentioned.
Something like :
SELECT *
FROM sales s
FULL JOIN returns r
ON s.product_code = r.product_code
AND s.customerid = r.customer_id

Distinct on multi-columns in sql

I have this query in sql
select cartlines.id,cartlines.pageId,cartlines.quantity,cartlines.price
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
I want to get rows distinct by pageid ,so in the end I will not have rows with same pageid more then once(duplicate)
any Ideas
Thanks
Baaroz
Going by what you're expecting in the output and your comment that says "...if there rows in output that contain same pageid only one will be shown...," it sounds like you're trying to get the top record for each page ID. This can be achieved with ROW_NUMBER() and PARTITION BY:
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.pageID) rowNumber,
c.id,
c.pageId,
c.quantity,
c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
) a
WHERE a.rowNumber = 1
You can also use ROW_NUMBER() OVER(PARTITION BY ... along with TOP 1 WITH TIES, but it runs a little slower (despite being WAY cleaner):
SELECT TOP 1 WITH TIES c.id, c.pageId, c.quantity, c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
ORDER BY ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.pageID)
If you wish to remove rows with all columns duplicated this is solved by simply adding a distinct in your query.
select distinct cartlines.id,cartlines.pageId,cartlines.quantity,cartlines.price
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
If however, this makes no difference, it means the other columns have different values, so the combinations of column values creates distinct (unique) rows.
As Michael Berkowski stated in comments:
DISTINCT - does operate over all columns in the SELECT list, so we
need to understand your special case better.
In the case that simply adding distinct does not cover you, you need to also remove the columns that are different from row to row, or use aggregate functions to get aggregate values per cartlines.
Example - total quantity per distinct pageId:
select distinct cartlines.id,cartlines.pageId, sum(cartlines.quantity)
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
If this is still not what you wish, you need to give us data and specify better what it is you want.

Efficiency of joining subqueries in SQL Server

I have a customers and orders table in SQL Server 2008 R2. Both have indexes on the customer id (called id). I need to return details about all customers in the customers table and information from the orders table, such as details of the first order.
I currently left join my customers table on a subquery of the orders table, with the subquery returning the information I need about the orders. For example:
SELECT c.id
,c.country
,First_orders.product
,First_orders.order_id
FROM customers c
LEFT JOIN SELECT( id,
product
FROM (SELECT id
,product
,order_id
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY Order_Date asc) as order_No
FROM orders) orders
WHERE Order_no = 1) First_Orders
ON c.id = First_orders.id
I'm quite new to SQL and want to understand if I'm doing this efficiently. I end up left joining quite a few subqueries like this onto the customers table in one select query and it can take tens of minutes to run.
So am I doing this efficiently or can it be improved? For example, I'm not sure if my index on id in the orders table is of any use and maybe I could speed up the query by creating a temporary table of what is in the subquery first and creating a unique index on id in the temporary table so SQL Server knows id is now a unique column and then joining my customers table to this temporary table? I typically have one or two million rows in the customers and orders tables.
Many thanks in advance!
You can remove one of your subqueries to make it a little more efficient:
SELECT c.id
,c.country
,First_orders.product
,First_orders.order_id
FROM customers c
LEFT JOIN (SELECT id
,product
,order_id
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY Order_Date asc) as order_No
FROM orders) First_Orders
ON c.id = First_orders.id AND First_Orders.order_No = 1
In your above query, you need to be careful where you place your parentheses as I don't think it will work. Also, you're returning product in your results, but not including in your nested subquery.
For someone who is just learning SQL, your query looks pretty good.
The index on customers may or may not be used for the query -- you would need to look at the execution plan. An index on orders(id, order_date) could be used quite effectively for the row_number function.
One comment is on the naming of fields. The field orders.id should not be the customer id. That should be something like 'orders.Customer_Id`. Keeping the naming system consistent across tables will help you in the future.
Try this...its easy to understand
;WITH cte
AS (
SELECT id
,product
,order_id
,ROW_NUMBER() OVER (
PARTITION BY id ORDER BY Order_Date ASC
) AS order_No
FROM orders
)
SELECT c.id
,c.country
,c1.Product
,c1.order_id
FROM customers c
INNER JOIN cte c1 ON c.id = c1.id
WHERE c1.order_No = 1

Refactoring a tsql view which uses row_number() to return rows with a unique column value

I have a sql view, which I'm using to retrieve data. Lets say its a large list of products, which are linked to the customers who have bought them. The view should return only one row per product, no matter how many customers it is linked to. I'm using the row_number function to achieve this. (This example is simplified, the generic situation would be a query where there should only be one row returned for each unique value of some column X. Which row is returned is not important)
CREATE VIEW productView AS
SELECT * FROM
(SELECT
Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
customer.Id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
) as temp
WHERE temp.prodcut_numbering = 1
Now lets say that the total number of rows in this view is ~1 million, and running select * from productView takes 10 seconds. Performing a query such as select * from productView where productID = 10 takes the same amount of time. I believe this is because the query gets evaluated to this
SELECT * FROM
(SELECT
Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
customer.Id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
) as temp
WHERE prodcut_numbering = 1 and prodcut.Id = 10
I think this is causing the inner subquery to be evaluated in full each time. Ideally I'd like to use something along the following lines
SELECT
Row_number() OVER(PARTITION BY products.productID ORDER BY products.productID) AS product_numbering,
customer.id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
WHERE prodcut_numbering = 1
But this doesn't seem to be allowed. Is there any way to do something similar?
EDIT -
After much experimentation, the actual problem I believe I am having is how to force a join to return exactly 1 row. I tried to use outer apply, as suggested below. Some sample code.
CREATE TABLE Products (id int not null PRIMARY KEY)
CREATE TABLE Customers (
id int not null PRIMARY KEY,
productId int not null,
value varchar(20) NOT NULL)
declare #count int = 1
while #count <= 150000
begin
insert into Customers (id, productID, value)
values (#count,#count/2, 'Value ' + cast(#count/2 as varchar))
insert into Products (id)
values (#count)
SET #count = #count + 1
end
CREATE NONCLUSTERED INDEX productId ON Customers (productID ASC)
With the above sample set, the 'get everything' query below
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
takes ~1000ms to run. Adding an explicit condition:
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
where Customers.value = 'Value 45872'
Takes some identical amount of time. This 1000ms for a fairly simple query is already too much, and scales the wrong way (upwards) when adding additional similar joins.
Try the following approach, using a Common Table Expression (CTE). With the test data you provided, it returns specific ProductIds in less than a second.
create view ProductTest as
with cte as (
select
row_number() over (partition by p.id order by p.id) as RN,
c.*
from
Products p
inner join Customers c
on p.id = c.productid
)
select *
from cte
where RN = 1
go
select * from ProductTest where ProductId = 25
What if you did something like:
SELECT ...
FROM products
OUTER APPLY (SELECT TOP 1 * from customer where customerid = products.buyerid) as customer
...
Then the filter on productId should help. It might be worse without filtering, though.
The problem is that your data model is flawed. You should have three tables:
Customers (customerId, ...)
Products (productId,...)
ProductSales (customerId, productId)
Furthermore, the sale table should probably be split into 1-to-many (Sales and SalesDetails). Unless you fix your data model you're just going to run circles around your tail chasing red-herring problems. If the system is not your design, fix it. If the boss doesn't let your fix it, then fix it. If you cannot fix it, then fix it. There isn't a easy way out for the bad data model you're proposing.
this will probably be fast enough if you really don't care which customer you bring back
select p1.*, c1.*
FROM products p1
Left Join (
select p2.id, max( c2.id) max_customer_id
From product p2
Join customer c2 on
c2.productID = p2.id
group by 1
) product_max_customer
Left join customer c1 on
c1.id = product_max_customer.max_customer_id
;