Joining to CTE transformation of table - sql

Frequently I encounter a situation like this, where I need to join a big table to a certain transformation of another table.
I have made an example with a big table and a smaller prices table.
Enter the table CarPrices, which has prices per car brand/model with starting and ending dates. I want to join all sold cars to the sales price in the CarPrices table, on the criterion SaleDate BETWEEN PriceStartingDate AND PriceEndingDate, but if there is no price for the period, I want to join to the newest price that can be found.
I can accomplish it like this, but it is terribly slow:
WITH CarPricesTransformation AS (
    SELECT CarBrand, CarModel, PriceStartingDate,
           CASE WHEN row_number() OVER (PARTITION BY CarBrand, CarModel
                                        ORDER BY PriceStartingDate DESC) = 1
                THEN NULL ELSE PriceEndingDate END AS PriceEndingDate,
           Price
    FROM CarPrices
)
SELECT SUM(Price)
FROM LargeCarDataBase C
INNER JOIN CarPricesTransformation P
    ON C.CarBrand = P.CarBrand
    AND C.CarModel = P.CarModel
    AND C.SaleDate >= P.PriceStartingDate
    AND (C.SaleDate <= P.PriceEndingDate OR P.PriceEndingDate IS NULL)
A reliable way to do it quicker is to forget about making a VIEW and create a stored procedure instead, where I first prepare the smaller prices table as a temporary table with the correct clustered index, and then join to that. This is much faster. But I would like to stick with a view.
Any thoughts...?

You can't make a "smaller prices table" since the price depends on the sale date. Also, why the CTE in the first place?
Select Sum(Coalesce(ActivePrice.Price, LatestPrice.Price))
From LargeCarDataBase As Sales
Left Outer Join CarPrices As ActivePrice
    On Sales.CarBrand = ActivePrice.CarBrand
    And Sales.CarModel = ActivePrice.CarModel
    And Sales.SaleDate >= ActivePrice.PriceStartingDate
    And (Sales.SaleDate <= ActivePrice.PriceEndingDate
         Or ActivePrice.PriceEndingDate Is Null)
Left Outer Join CarPrices As LatestPrice
    On Sales.CarBrand = LatestPrice.CarBrand
    And Sales.CarModel = LatestPrice.CarModel
    And LatestPrice.PriceEndingDate Is Null
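To sanity-check the two-join fallback above, here is a minimal sketch in Python with SQLite (table and column names mirror the question; the rows and prices are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CarPrices (CarBrand TEXT, CarModel TEXT,
    PriceStartingDate TEXT, PriceEndingDate TEXT, Price REAL);
CREATE TABLE LargeCarDataBase (CarBrand TEXT, CarModel TEXT, SaleDate TEXT);
-- Two price periods; the newest one is open-ended (NULL ending date).
INSERT INTO CarPrices VALUES
    ('VW', 'Golf', '2020-01-01', '2020-12-31', 100),
    ('VW', 'Golf', '2021-01-01', NULL,         120);
INSERT INTO LargeCarDataBase VALUES
    ('VW', 'Golf', '2020-06-15'),   -- covered by the first period  -> 100
    ('VW', 'Golf', '2021-03-01'),   -- covered by the open period   -> 120
    ('VW', 'Golf', '2019-05-01');   -- no covering period -> falls back to 120
""")

total = conn.execute("""
SELECT SUM(COALESCE(ActivePrice.Price, LatestPrice.Price))
FROM LargeCarDataBase AS Sales
LEFT JOIN CarPrices AS ActivePrice
    ON  Sales.CarBrand = ActivePrice.CarBrand
    AND Sales.CarModel = ActivePrice.CarModel
    AND Sales.SaleDate >= ActivePrice.PriceStartingDate
    AND (Sales.SaleDate <= ActivePrice.PriceEndingDate
         OR ActivePrice.PriceEndingDate IS NULL)
LEFT JOIN CarPrices AS LatestPrice
    ON  Sales.CarBrand = LatestPrice.CarBrand
    AND Sales.CarModel = LatestPrice.CarModel
    AND LatestPrice.PriceEndingDate IS NULL
""").fetchone()[0]
print(total)  # 100 + 120 + 120
```

Note that a sale covered by a date range produces exactly one ActivePrice match, so COALESCE picks it; only uncovered sales fall through to the open-ended "latest" row.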

Have you tried Indexed Views?
The results of an indexed view are persisted to disk and maintained automatically, so you can retrieve them very fast.
CREATE VIEW [dbo].[SuperFastCarPrices] WITH SCHEMABINDING AS
SELECT C.CarBrand,
       C.CarModel,
       C.SaleDate,
       SUM(P.Price) AS Price,
       COUNT_BIG(*) AS RecordCount  -- required in an indexed view that uses GROUP BY
FROM dbo.CarPrices P
INNER JOIN dbo.LargeCarDataBase C
    ON C.CarBrand = P.CarBrand
    AND C.CarModel = P.CarModel
    AND C.SaleDate >= P.PriceStartingDate
    AND (P.PriceEndingDate IS NULL OR C.SaleDate <= P.PriceEndingDate)
GROUP BY C.CarBrand, C.CarModel, C.SaleDate
CREATE UNIQUE CLUSTERED INDEX [IDX_SuperFastCarPrices]
ON [dbo].[SuperFastCarPrices](CarBrand, CarModel, SaleDate)
You can then select directly from this view, which will return records at the same speed as selecting from a table.
There is the downside that indexed views slow down changes to the underlying tables. If you are worried about the cost of inserting records into the table LargeCarDataBase after this view has been created, you can create an index on the columns CarBrand, CarModel and SaleDate, which should reduce the overhead of maintaining the view on insert and update.
For more on Indexed Views see Microsoft's Article.


Is there a more efficient approach than using a correlated query for solving the problem below?

I have three database tables:
product
product_manufacturer
product_manufacturer_warranties.
The product table has a one-to-one mapping with product_manufacturer, and the product_id is stored in the product_manufacturer table. The product_manufacturer table has a one-to-many mapping with the product_manufacturer_warranties table.
I need to write a query that retrieves all columns from the product table, plus two other columns that can be used to determine whether a valid join exists for product and product_manufacturer, and for product_manufacturer and product_manufacturer_warranties, respectively.
I have written the following correlated query that handles the above scenario:
select product.*, pm.web_id,
       (SELECT count(*)
        FROM product_manufacturer_warranty pmw
        WHERE pm.web_id = pmw.product_manufacturer_id) AS total_warranties
from product
left join product_manufacturer pm on product.web_id = pm.product_id
I wonder if there is a better or more efficient way of achieving this using SQL on the PostgreSQL server.
Since you are only interested in bare existence, don't run a potentially much more expensive count().
And while paging through the (potentially big) table, don't compute counts for all products, which would be yet more expensive.
This should give you optimal performance:
Paging up:
SELECT p.*
     , m.product_id IS NOT NULL AS has_manufacturer
     , EXISTS (
          SELECT FROM product_manufacturer_warranty w
          WHERE  w.product_manufacturer_id = m.web_id
       ) AS has_warranties
FROM (
   SELECT *
   FROM   product
   WHERE  product_id > $max_product_id_of_last_page
   -- more filters here?
   ORDER  BY product_id
   LIMIT  $page_size
   ) p
LEFT JOIN product_manufacturer m ON p.web_id = m.product_id
The query returns exactly what you asked for:
all columns from the product table and two other columns that can be used to determine if a valid join exists for product and product_manufacturer and product_manufacturer and product_manufacturer_warranties respectively.
First select products of interest in the subquery p
Then left-join to product_manufacturer. This cannot multiply rows, since the relationship is defined as one-to-one!
Then check for bare existence with EXISTS. Substantially cheaper if there can be many related rows. See:
PL/pgSQL checking if a row exists
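The shape of those two flag columns can be tried out in a tiny SQLite session (SQLite also evaluates `EXISTS (...)` in the select list, though it needs an explicit `SELECT 1` rather than PostgreSQL's empty select list). All data here is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER, web_id INTEGER);
CREATE TABLE product_manufacturer (product_id INTEGER, web_id INTEGER);
CREATE TABLE product_manufacturer_warranty (product_manufacturer_id INTEGER);
INSERT INTO product VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO product_manufacturer VALUES (10, 100), (20, 200);  -- product 3 has none
INSERT INTO product_manufacturer_warranty VALUES (100), (100); -- two warranties, one manufacturer
""")

rows = conn.execute("""
SELECT p.product_id
     , m.product_id IS NOT NULL AS has_manufacturer
     , EXISTS (SELECT 1 FROM product_manufacturer_warranty w
               WHERE w.product_manufacturer_id = m.web_id) AS has_warranties
FROM product p
LEFT JOIN product_manufacturer m ON p.web_id = m.product_id
ORDER BY p.product_id
""").fetchall()
print(rows)  # [(1, 1, 1), (2, 1, 0), (3, 0, 0)]
```

When the left join finds no manufacturer, `m.web_id` is NULL, the correlated EXISTS can never match, and both flags come back false.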
Paging down accordingly:
SELECT p.*
     , m.product_id IS NOT NULL AS has_manufacturer
     , EXISTS (
          SELECT FROM product_manufacturer_warranty w
          WHERE  w.product_manufacturer_id = m.web_id
       ) AS has_warranties
FROM (
   SELECT *
   FROM   product
   WHERE  product_id < $min_product_id_of_last_page
   -- more filters here?
   ORDER  BY product_id DESC
   LIMIT  $page_size
   ) p
LEFT JOIN product_manufacturer m ON p.web_id = m.product_id
ORDER BY product_id;
Adapt to your actual sort order. This gets trickier with multiple ORDER BY expressions. Either way, don't fall for LIMIT / OFFSET unless your table is trivially small. See:
Optimize query with OFFSET on large table
Do the aggregation once, and join to the result:
select product.*,
       pm.web_id,
       pmw.total_warranties
from product
left join product_manufacturer pm on product.web_id = pm.product_id
left join (
    select product_manufacturer_id, count(*) as total_warranties
    from product_manufacturer_warranty
    group by product_manufacturer_id
) as pmw on pm.web_id = pmw.product_manufacturer_id
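A quick way to convince yourself that the aggregate-once variant returns exactly one row per product (with NULL counts where no warranties exist) is to run it against toy data; this SQLite sketch reuses the question's names with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER, web_id INTEGER);
CREATE TABLE product_manufacturer (product_id INTEGER, web_id INTEGER);
CREATE TABLE product_manufacturer_warranty (product_manufacturer_id INTEGER);
INSERT INTO product VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO product_manufacturer VALUES (10, 100), (20, 200);
INSERT INTO product_manufacturer_warranty VALUES (100), (100);  -- manufacturer 100 has two
""")

rows = conn.execute("""
SELECT p.product_id,
       pm.web_id,
       pmw.total_warranties
FROM product p
LEFT JOIN product_manufacturer pm ON p.web_id = pm.product_id
LEFT JOIN (SELECT product_manufacturer_id, COUNT(*) AS total_warranties
           FROM product_manufacturer_warranty
           GROUP BY product_manufacturer_id) pmw
       ON pm.web_id = pmw.product_manufacturer_id
ORDER BY p.product_id
""").fetchall()
print(rows)  # [(1, 100, 2), (2, 200, None), (3, None, None)]
```

Wrapping `pmw.total_warranties` in COALESCE(..., 0) would turn the NULLs into zeros if the caller prefers that.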

How to do query optimization in SQL Server?

I am trying to speed up my query's execution time. What's wrong with my query, and what is a better way to approach query optimization?
The TransactionEntry table has 2 million records and the Transaction table has 5 billion records.
Here is my query. If I remove the TotalPrice column, I get results in about 10 seconds:
--Total Quantity
SELECT items.ItemLookupCode,
       SUM(transactionsEntry.Quantity) AS Quantity,
       SUM(transactionsEntry.Quantity * transactionsEntry.Price) AS TotalPrice
INTO ##temp_TotalPrice
FROM (
    SELECT TransactionNumber, StoreID, Time
    FROM [HQMatajer].[dbo].[Transaction]
    WHERE Time >= CONVERT(datetime,'2015-01-01 00:00:00',102)
      AND Time <= CONVERT(datetime,'2015-12-31 23:59:59',102)
) transactions
LEFT JOIN [HQMatajer].[dbo].[TransactionEntry] transactionsEntry
    ON transactionsEntry.TransactionNumber = transactions.TransactionNumber
    AND transactionsEntry.StoreID = transactions.StoreID
LEFT JOIN [HQMatajer].[dbo].[Item] items
    ON transactionsEntry.ItemID = items.ID
GROUP BY items.ItemLookupCode
ORDER BY items.ItemLookupCode
If I execute this (above) query, it produces the result in 22 seconds, which is too long. When I execute the subquery alone (below), it takes 11 seconds:
(
    SELECT TransactionNumber, StoreID, Time
    FROM [HQMatajer].[dbo].[Transaction]
    WHERE Time >= CONVERT(datetime,'2015-01-01 00:00:00',102)
      AND Time <= CONVERT(datetime,'2015-12-31 23:59:59',102)
) transactions
I have created one index on the TransactionEntry table:
`TransactionNumber, StoreID, ItemID, Quantity, Price`
one index on the Transaction table:
`Time, TransactionNumber, StoreID`
and one index on the Item table:
`ID`
Execution plan: the clustered index of TransactionEntry accounts for 59% of the cost. That index is on the column AutoID.
Assuming this is for SQL Server 2005 or above. If it's for SQL 2000, then instead of a CTE you can use a temp table with a proper index.
Also, since you are getting the values from [HQMatajer].[dbo].[TransactionEntry] and [HQMatajer].[dbo].[Item], why is a LEFT JOIN used?
Avoid subqueries. I have reframed the query. Please check and let me know whether it improves the performance:
;WITH transactions AS
(
    SELECT TransactionNumber, StoreID, Time
    FROM [HQMatajer].[dbo].[Transaction]
    WHERE Time >= CONVERT(datetime,'2015-01-01 00:00:00',102)
      AND Time <= CONVERT(datetime,'2015-12-31 23:59:59',102)
)
SELECT items.ItemLookupCode,
       SUM(transactionsEntry.Quantity) AS Quantity,
       SUM(transactionsEntry.Quantity * transactionsEntry.Price) AS TotalPrice
INTO ##temp_TotalPrice
FROM [HQMatajer].[dbo].[Item] items
INNER JOIN [HQMatajer].[dbo].[TransactionEntry] transactionsEntry
    ON transactionsEntry.ItemID = items.ID
WHERE EXISTS (
    SELECT 1 FROM transactions
    WHERE transactionsEntry.TransactionNumber = transactions.TransactionNumber
      AND transactionsEntry.StoreID = transactions.StoreID
)
GROUP BY items.ItemLookupCode
ORDER BY items.ItemLookupCode
This is your query, simplified and formatted a bit (the subquery makes no difference):
select i.ItemLookupCode,
sum(te.Quantity) as quantity,
sum(te.Quantity * te.Price) as TotalPrice
into ##temp_TotalPrice
from [HQMatajer].[dbo].[Transaction] t left join
[HQMatajer].[dbo].[TransactionEntry] te
on te.TransactionNumber = t.TransactionNumber and
te.StoreID = t.StoreID left join
[HQMatajer].[dbo].[Item] i
on te.ItemID = i.ID
where t.Time >= '2015-01-01' and
t.Time < '2016-01-01'
group by i.ItemLookupCode
order by i.ItemLookupCode;
For this query, you want indexes on `Transaction(Time, TransactionNumber, StoreId)`, `TransactionEntry(TransactionNumber, StoreId, ItemId, Quantity, Price)`, and `Item(Id, ItemLookupCode)`.
Even with the right indexes, this is processing a lot of data, so I would be surprised if this reduced the time to a few seconds.
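One detail worth calling out in that rewrite is the half-open date range (`Time >= '2015-01-01' and Time < '2016-01-01'`): unlike the original `<= '2015-12-31 23:59:59'`, it cannot silently drop rows with sub-second timestamps in the last second of the year. A minimal illustration with invented data (SQLite, comparing ISO-8601 strings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (t TEXT)")
conn.executemany("INSERT INTO tx VALUES (?)", [
    ("2015-06-01 12:00:00",),
    ("2015-12-31 23:59:59.500",),  # falls inside the last second of 2015
    ("2016-01-01 00:00:00",),      # next year, must be excluded either way
])

# Closed range with an explicit end-of-day timestamp.
closed = conn.execute("""
    SELECT COUNT(*) FROM tx
    WHERE t >= '2015-01-01 00:00:00' AND t <= '2015-12-31 23:59:59'
""").fetchone()[0]

# Half-open range: >= start of year, < start of next year.
half_open = conn.execute("""
    SELECT COUNT(*) FROM tx
    WHERE t >= '2015-01-01' AND t < '2016-01-01'
""").fetchone()[0]

print(closed, half_open)  # the closed range misses the 23:59:59.500 row
```

The same argument applies to SQL Server datetime values, where `<= '2015-12-31 23:59:59'` excludes times such as 23:59:59.997.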
This query is taking too much time because entries are inserted into the temporary table three times, which increases the time. If we insert the records into another table first and then query that, or turn it into a CTE, the cost decreases.

Efficiency of joining subqueries in SQL Server

I have a customers and orders table in SQL Server 2008 R2. Both have indexes on the customer id (called id). I need to return details about all customers in the customers table and information from the orders table, such as details of the first order.
I currently left join my customers table on a subquery of the orders table, with the subquery returning the information I need about the orders. For example:
SELECT c.id
,c.country
,First_orders.product
,First_orders.order_id
FROM customers c
LEFT JOIN SELECT( id,
product
FROM (SELECT id
,product
,order_id
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY Order_Date asc) as order_No
FROM orders) orders
WHERE Order_no = 1) First_Orders
ON c.id = First_orders.id
I'm quite new to SQL and want to understand if I'm doing this efficiently. I end up left joining quite a few subqueries like this onto the customers table in one select query and it can take tens of minutes to run.
So am I doing this efficiently, or can it be improved? For example, I'm not sure if my index on id in the orders table is of any use. Maybe I could speed up the query by first creating a temporary table of what is in the subquery, with a unique index on id so SQL Server knows id is now a unique column, and then joining my customers table to that temporary table? I typically have one or two million rows in the customers and orders tables.
Many thanks in advance!
You can remove one of your subqueries to make it a little more efficient:
SELECT c.id
,c.country
,First_orders.product
,First_orders.order_id
FROM customers c
LEFT JOIN (SELECT id
,product
,order_id
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY Order_Date asc) as order_No
FROM orders) First_Orders
ON c.id = First_orders.id AND First_Orders.order_No = 1
In your above query, you need to be careful where you place your parentheses, as I don't think it will work as written. Also, you're returning order_id in your results, but not including it in your nested subquery.
For someone who is just learning SQL, your query looks pretty good.
The index on customers may or may not be used for the query -- you would need to look at the execution plan. An index on orders(id, order_date) could be used quite effectively for the row_number function.
One comment is on the naming of fields. The field `orders.id` should not be the customer id. That should be something like `orders.customer_id`. Keeping the naming system consistent across tables will help you in the future.
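The suggested pattern (a row_number subquery with the `order_No = 1` filter moved into the join) can be tried end to end in a few lines. This sketch uses Python's sqlite3 (window functions need SQLite 3.25+) with invented tables matching the question's shape, where `orders.id` holds the customer id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, country TEXT);
CREATE TABLE orders (id INTEGER, product TEXT, order_id INTEGER, order_date TEXT);
INSERT INTO customers VALUES (1, 'DE'), (2, 'FR'), (3, 'US');
INSERT INTO orders VALUES
    (1, 'widget', 11, '2020-01-05'),
    (1, 'gadget', 12, '2020-02-01'),  -- later order, filtered out by order_no = 1
    (2, 'doodad', 21, '2020-03-01');  -- customer 3 has no orders at all
""")

rows = conn.execute("""
SELECT c.id, c.country, fo.product, fo.order_id
FROM customers c
LEFT JOIN (SELECT id, product, order_id,
                  ROW_NUMBER() OVER (PARTITION BY id ORDER BY order_date) AS order_no
           FROM orders) fo
       ON c.id = fo.id AND fo.order_no = 1
ORDER BY c.id
""").fetchall()
print(rows)  # customer 3 still appears, with NULL order columns
```

Because the `order_no = 1` condition lives in the join rather than a WHERE clause, customers without orders survive the left join instead of being filtered out.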
Try this... it's easy to understand:
;WITH cte
AS (
SELECT id
,product
,order_id
,ROW_NUMBER() OVER (
PARTITION BY id ORDER BY Order_Date ASC
) AS order_No
FROM orders
)
SELECT c.id
,c.country
,c1.Product
,c1.order_id
FROM customers c
INNER JOIN cte c1 ON c.id = c1.id
WHERE c1.order_No = 1

Refactoring a tsql view which uses row_number() to return rows with a unique column value

I have a SQL view which I'm using to retrieve data. Let's say it's a large list of products, which are linked to the customers who have bought them. The view should return only one row per product, no matter how many customers it is linked to. I'm using the row_number function to achieve this. (This example is simplified; the generic situation would be a query where there should be only one row returned for each unique value of some column X. Which row is returned is not important.)
CREATE VIEW productView AS
SELECT * FROM
    (SELECT
        ROW_NUMBER() OVER (PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
        customer.Id
        -- various other columns
     FROM products
     LEFT OUTER JOIN customer ON customer.productId = products.Id
     -- various other joins
    ) AS temp
WHERE temp.product_numbering = 1
Now let's say that the total number of rows in this view is ~1 million, and running select * from productView takes 10 seconds. Performing a query such as select * from productView where productID = 10 takes the same amount of time. I believe this is because the query gets evaluated to this:
SELECT * FROM
    (SELECT
        ROW_NUMBER() OVER (PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
        customer.Id
        -- various other columns
     FROM products
     LEFT OUTER JOIN customer ON customer.productId = products.Id
     -- various other joins
    ) AS temp
WHERE temp.product_numbering = 1 AND products.Id = 10
I think this is causing the inner subquery to be evaluated in full each time. Ideally I'd like to use something along the following lines
SELECT
    ROW_NUMBER() OVER (PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
    customer.Id
    -- various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = products.Id
-- various other joins
WHERE product_numbering = 1
But this doesn't seem to be allowed. Is there any way to do something similar?
EDIT:
After much experimentation, I believe the actual problem I am having is how to force a join to return exactly one row. I tried to use OUTER APPLY, as suggested below. Some sample code:
CREATE TABLE Products (id int not null PRIMARY KEY)
CREATE TABLE Customers (
    id int not null PRIMARY KEY,
    productId int not null,
    value varchar(20) NOT NULL)

declare @count int = 1
while @count <= 150000
begin
    insert into Customers (id, productId, value)
    values (@count, @count/2, 'Value ' + cast(@count/2 as varchar))
    insert into Products (id)
    values (@count)
    SET @count = @count + 1
end

CREATE NONCLUSTERED INDEX productId ON Customers (productId ASC)
With the above sample set, the 'get everything' query below
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
takes ~1000ms to run. Adding an explicit condition:
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
where Customers.value = 'Value 45872'
takes a nearly identical amount of time. This 1000 ms for a fairly simple query is already too much, and it scales the wrong way (upwards) when adding additional similar joins.
Try the following approach, using a Common Table Expression (CTE). With the test data you provided, it returns specific ProductIds in less than a second.
create view ProductTest as
with cte as (
select
row_number() over (partition by p.id order by p.id) as RN,
c.*
from
Products p
inner join Customers c
on p.id = c.productid
)
select *
from cte
where RN = 1
go
select * from ProductTest where ProductId = 25
What if you did something like:
SELECT ...
FROM products
OUTER APPLY (SELECT TOP 1 * from customer where customerid = products.buyerid) as customer
...
Then the filter on productId should help. It might be worse without filtering, though.
The problem is that your data model is flawed. You should have three tables:
Customers (customerId, ...)
Products (productId,...)
ProductSales (customerId, productId)
Furthermore, the sales table should probably be split into 1-to-many (Sales and SalesDetails). Unless you fix your data model you're just going to run circles around your tail chasing red-herring problems. If the system is not your design, fix it. If the boss doesn't let you fix it, then fix it. If you cannot fix it, then fix it. There isn't an easy way out of the bad data model you're proposing.
This will probably be fast enough if you really don't care which customer you bring back:
select p1.*, c1.*
FROM Products p1
Left Join (
    select p2.id, max(c2.id) as max_customer_id
    From Products p2
    Join Customers c2
      on c2.productId = p2.id
    group by p2.id
) product_max_customer
    on product_max_customer.id = p1.id
Left join Customers c1
    on c1.id = product_max_customer.max_customer_id;
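This max-id approach is easy to verify on a scaled-down version of the sample schema from the question's EDIT (here in Python/SQLite; the arbitrary "winner" per product is simply the customer with the highest id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (id INTEGER PRIMARY KEY);
CREATE TABLE Customers (id INTEGER PRIMARY KEY, productId INTEGER, value TEXT);
INSERT INTO Products VALUES (1), (2), (3);
INSERT INTO Customers VALUES
    (10, 1, 'Value a'),
    (11, 1, 'Value b'),   -- highest customer id for product 1
    (20, 2, 'Value c');   -- product 3 has no customers
""")

rows = conn.execute("""
SELECT p1.id, c1.id, c1.value
FROM Products p1
LEFT JOIN (SELECT productId, MAX(id) AS max_customer_id
           FROM Customers
           GROUP BY productId) pmc ON pmc.productId = p1.id
LEFT JOIN Customers c1 ON c1.id = pmc.max_customer_id
ORDER BY p1.id
""").fetchall()
print(rows)  # exactly one row per product, even for products with no customers
```

The inner GROUP BY guarantees at most one customer id per product, so the outer joins can never multiply rows.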

Using SQL(ite) how do I find the lowest unique child for each parent in a one to many relationship during a JOIN?

I have two tables with a many to one relationship which represent lots and bids within an auction system. Each lot can have zero or more bids associated with it. Each bid is associated with exactly one lot.
My table structure (with irrelevant fields removed) looks something like this:
For one type of auction the winning bid is the lowest unique bid for a given lot.
E.g. if there are four bids for a given lot: [1, 1, 2, 4] the lowest unique bid is 2 (not 1).
So far I have been able to construct a query which will find the lowest unique bid for a single specific lot (assuming the lot ID is 123):
SELECT id, value FROM bid
WHERE lot = 123
AND value = (
    SELECT value FROM bid
    WHERE lot = 123
    GROUP BY value HAVING COUNT(*) = 1
    ORDER BY value
    LIMIT 1
)
This works as I would expect (although I'm not sure it's the most graceful approach).
I would now like to construct a query which will get the lowest unique bids for all lots at once. Essentially I want to perform a JOIN on the two tables where one column is the result of something similar to the above query. I'm at a loss as to how to use the same approach for finding the lowest unique bid in a JOIN though.
Am I on the wrong track with this approach to finding the lowest unique bid? Is there another way I can achieve the same result?
Can anyone help me expand this query into a JOIN?
Is this even possible in SQL or will I have to do it in my application proper?
Thanks in advance.
(I am using SQLite 3.5.9 as found in Android 2.1)
You can use group by with a "having" condition to find the set of bids without duplicate amounts for each lot.
select lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lotname, amt having count(*) = 1
You can in turn make that query an inline view and select the lowest bid from it for each lot.
select lotname, min(amt)
from
(
select lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lotname, amt having count(*) = 1
) as X
group by X.lotname
EDIT: Here's how to get the bid id using this approach, using nested inline views:
select bid.id as WinningBidId, Y.lotname, bid.amt
from
bid
join
(
select x.lotid, lotname, min(amt) as TheMinAmt
from
(
select lot.id as lotid, lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lot.id, lotname, amt
having count(*)=1
) as X
group by x.lotid, x.lotname
) as Y
on Y.lotid = bid.lotid and Y.TheMinAmt = Bid.amt
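Since the question targets SQLite, the nested-inline-view answer above can be checked directly with Python's sqlite3 module, using the [1, 1, 2, 4] example from the question (a second lot with only tied bids shows that lots without any unique bid drop out):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lot (id INTEGER PRIMARY KEY, lotname TEXT);
CREATE TABLE bid (id INTEGER PRIMARY KEY, lotid INTEGER, amt INTEGER);
INSERT INTO lot VALUES (1, 'Painting'), (2, 'Vase');
INSERT INTO bid VALUES (1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 4),  -- unique: 2, 4
                       (5, 2, 3), (6, 2, 3);                        -- no unique bid
""")

rows = conn.execute("""
SELECT b.id AS winning_bid_id, y.lotname, b.amt
FROM bid b
JOIN (SELECT x.lotid, x.lotname, MIN(x.amt) AS min_amt
      FROM (SELECT lot.id AS lotid, lotname, amt
            FROM lot JOIN bid ON lot.id = bid.lotid
            GROUP BY lot.id, lotname, amt
            HAVING COUNT(*) = 1) x        -- amounts bid exactly once per lot
      GROUP BY x.lotid, x.lotname) y      -- lowest such amount per lot
  ON y.lotid = b.lotid AND y.min_amt = b.amt
""").fetchall()
print(rows)  # only the Painting lot has a lowest unique bid: amount 2, bid id 3
```

The 'Vase' lot disappears entirely because its HAVING clause filters out every tied amount; a LEFT JOIN from lot to the derived table y would keep it with NULLs instead.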
I think you need some subqueries to get to your desired data:
SELECT [b].[id] AS [BidID], [l].[id] AS [LotID],
       [l].[Name] AS [Lot], [b].[value] AS [BidValue]
FROM [bid] [b]
INNER JOIN [lot] [l] ON [b].[lot] = [l].[id]
WHERE [b].[id] =
    (SELECT [min].[id]
     FROM [bid] [min]
     WHERE [min].[lot] = [b].[lot]
       AND NOT EXISTS (SELECT *
                       FROM [bid] [check]
                       WHERE [check].[lot] = [min].[lot]
                         AND [check].[value] = [min].[value]
                         AND [check].[id] <> [min].[id])
     ORDER BY [min].[value] ASC
     LIMIT 1)
The innermost query (inside the NOT EXISTS) checks that there is no other bid on that lot with the same value.
The middle query (which takes the single lowest-value row) determines the minimum bid of all unique bids on that lot.
The outer query makes this happen for all lots, that have bids.
SELECT DISTINCT lot.name,
       (SELECT MIN(bid.value) FROM bid WHERE bid.lot = lot.ID) AS MinBid
FROM lot INNER JOIN
     bid ON lot.ID = bid.lot
If I understand you correctly, this will give you every lot (that has bids) and its smallest bid.