efficient way to JOIN in SQL - sql

I have to JOIN two tables: articles and sales, but not all the data of articles, just need the last load.
Is there a difference between this two ways? there is a most faster/efficient way? and more important, why?
1)
SELECT *
FROM sales S
INNER JOIN articles A
ON S.article_id = A.article_id AND A.load_date = (SELECT MAX(load_date) FROM articles)
2)
SELECT *
FROM sales s
INNER JOIN (
SELECT *
FROM articles
WHERE load_date = (SELECT MAX(load_date) FROM articles)
) A
ON s.article_id = a.article_id

Most DBMSes also support Windowed Aggregate Functions and a RANK might be more efficient (and easier to write if you're used to it):
SELECT *
FROM
(
SELECT *,
RANK() -- max date gets rank #1
OVER (PARTITION BY a.article_id
ORDER BY a.load_date DESC) rn
FROM sales s
INNER JOIN articles a
ON s.article_id = a.article_id
) dt
WHER rn = 1

Related

How optimize select with max subquery on the same table?

We have many old selects like this:
SELECT
tm."ID",tm."R_PERSONES",tm."R_DATASOURCE", ,tm."MATCHCODE",
d.NAME AS DATASOURCE,
p.PDID
FROM TABLE_MAPPINGS tm,
PERSONES p,
DATASOURCES d,
(select ID
from TABLE_MAPPINGS
where (R_PERSONES, MATCHCODE)
in (select
R_PERSONES, MATCHCODE
from TABLE_MAPPINGS
where
id in (select max(id)
from TABLE_MAPPINGS
group by MATCHCODE)
)
) tm2
WHERE tm.R_PERSONES = p.ID
AND tm.R_DATASOURCE=d.ID
and tm2.id = tm.id;
These are large tables, and queries take a long time.
How to rebuild them?
Thank you
You can query the table only once using something like (untested as you have not provided a minimal example of your create table statements or sample data):
SELECT *
FROM (
SELECT m.*,
COUNT(CASE WHEN rnk = 1 THEN 1 END)
OVER (PARTITION BY r_persones, matchcode) AS has_max_id
FROM (
SELECT tm.ID,
tm.R_PERSONES,
tm.R_DATASOURCE,
tm.MATCHCODE,
d.NAME AS DATASOURCE,
p.PDID,
RANK() OVER (PARTITION BY tm.matchcode ORDER BY tm.id DESC) As rnk
FROM TABLE_MAPPINGS tm
INNER JOIN PERSONES p ON tm.R_PERSONES = p.ID
INNER JOIN DATASOURCES d ON tm.R_DATASOURCE = d.ID
) m
)
WHERE has_max_id > 0;
First finding the maximum ID using the RANK analytic function and then finding all the relevant r_persones, matchcode pairs using conditional aggregation in a COUNT analytic function.
Note: you want to use the RANK or DENSE_RANK analytic functions to match the maximums as it can match multiple rows per partition; whereas ROW_NUMBER will only ever put a single row per partition first.
You're querying table_mappings 3 times; how about doing it only once?
WITH
tab_map
AS
(SELECT a.id,
a.r_persones,
a.matchcode,
a.datasource,
ROW_NUMBER ()
OVER (PARTITION BY a.matchcode ORDER BY a.id DESC) rn
FROM table_mappings a)
SELECT tm.id,
tm.r_persones,
tm.matchcode,
d.name AS datasource,
p.pdid
FROM tab_map tm
JOIN persones p ON p.id = tm.r_persones
JOIN datasources d ON d.id = tm.r_datasource
WHERE tm.rn = 1

Alternative SQL ANSI for TOP WITH TIES

I have two tables:
product
id_product
description
price
id_category
category
id_category
description
I would like to know the categories that have more products. For example, the category food has 10 products and the eletronics too. They are the same.
Now I'm using SQL Server and I'm using TOP WITH TIES.
SELECT TOP 1 WITH TIES p.id_category, COUNT(*) as amount FROM product p
JOIN category c ON p.id_category = c.id_category
GROUP BY p.id_category
ORDER BY amount
Is there another way to solve this with good performance?
I tried also with DENSE_RANK where the position is = 1.
It also works.
SELECT * FROM (
SELECT p.id_category, COUNT(*) as amount, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) position FROM product p
JOIN category c ON p.id_category = c.id_category
GROUP BY p.id_category
) rnk
WHERE rnk.position = 1
But I want this solution in SQL ANSI.
I tried using MAX(COUNT(*)) but it doesn't work.
Is there a general solution? Is This solution better than using TOP WITH TIES?
Here is a third option for SQL Server:
WITH cte AS (
SELECT p.id_category, COUNT(*) AS cnt
FROM product p
INNER JOIN category c
ON p.id_category = c.id_category
GROUP BY p.id_category
)
SELECT *
FROM cte
WHERE cnt = (SELECT MAX(cnt) FROM cte);
If you also cannot rely on CTEs being available, you can easily enough just inline the CTE into the query. From a performance point of view, DENSE_RANK would probably outperform my answer.
With the CTE removed this becomes:
SELECT *
FROM
(
SELECT p.id_category, COUNT(*) AS cnt
FROM product p
INNER JOIN category c
ON p.id_category = c.id_category
GROUP BY p.id_category
)
WHERE cnt = (SELECT MAX(cnt) FROM (
SELECT p.id_category, COUNT(*) AS cnt
FROM product p
INNER JOIN category c
ON p.id_category = c.id_category
GROUP BY p.id_category
));
This query would even run on MySQL. As you can see, the query is ugly, which is one reason why things like CTE and analytic functions were introduced into the ANSI standard.

Get Min date as condition

I have a table that contains invoices for all phone numbers, and each number has several invoices, I want to display only the first invoice for precise number but i don't really know how get only first invoice , this is my query
SELECT
b.contrno
a.AR_INVDATE
FROM P_STG_TABS.IVM_INVOICE_RECORD a
INNER JOIN P_EDW_TMP.invoice b
ON b.contrno=a.contrno
WHERE a.AR_INVDATE< (SELECT AR_INVDATE FROM P_STG_TABS.IVM_INVOICE_RECORD WHERE contrno=b.contrno )
Teradata supports a QUALIFY clause to filter the result of an OLAP-function (similar to HAVING after GROUP BY), which greatly simplifies Tim Biegeleisens's answer:
SELECT *
FROM P_STG_TABS.IVM_INVOICE_RECORD a
INNER JOIN P_EDW_TMP.invoice b
ON b.contrno = a.contrno
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY b.contrno
ORDER BY a.AR_INVDATE) = 1
Additionally you can apply the ROW_NUMBER before the join (might be more efficient depending on additional conditions):
SELECT *
FROM
( SELECT *
FROM P_STG_TABS.IVM_INVOICE_RECORD a
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY b.contrno
ORDER BY a.AR_INVDATE) = 1
) AS a
INNER JOIN P_EDW_TMP.invoice b
ON b.contrno = a.contrno
Use ROW_NUMBER():
SELECT
t.contrno,
t.AR_INVDATE
FROM
(
SELECT
b.contrno,
a.AR_INVDATE,
ROW_NUMBER() OVER (PARTITION BY b.contrno ORDER BY a.AR_INVDATE) rn
FROM P_STG_TABS.IVM_INVOICE_RECORD a
INNER JOIN P_EDW_TMP.invoice b
ON b.contrno = a.contrno
) t
WHERE t.rn = 1;
If you are worried about ties, and you want to display all ties, then you can replace ROW_NUMBER with either RANK or DENSE_RANK.
If I correctly understand, then one way is to use group by with min(a.AR_INVDATE):
SELECT
b.contrno,
min(a.AR_INVDATE)
FROM P_STG_TABS.IVM_INVOICE_RECORD a
INNER JOIN P_EDW_TMP.invoice b
ON b.contrno=a.contrno
group by b.contrno

SQL strategy to fetch maximum

Suppose I have these three tables:
I want to get, for all products, it's product_id and the client that bougth it most times (the biggest client of the product).
I solved it like this:
SELECT
product_id AS product,
(SELECT TOP 1 client_id FROM Bill_Item, Bill
WHERE Bill_Item.product_id = p.product_id
and Bill_Item.bill_id = Bill.bill_id
GROUP BY
client_id
ORDER BY
COUNT(*) DESC
) AS client
FROM Product p
Do you know a better way?
the inner query will give you the ranking. The outer query will give you the client that puchase the most for a product
SELECT *
(
SELECT i.product_id, b.client_id,
r = row_number() over (partition by i.product_id
order by count(*) desc)
FROM Bill b
INNER JOIN Bill_Item i ON b.bill_id = i.bill_id
GROUP BY i.product_id, b.client_id
) d
WHERE r = 1
I was going to submit pretty much the same thing as #Squirrell only with a Common Table Expression [CTE] rather than a derived table. So I wont duplicate that but there are some learning points concerning your query. First is IMPLICIT JOINS such as FROM Bill_Item, Bill are really easy to have uintended consequences (one of many questions: Queries that implicit SQL joins can't do?) Next for the Calculated column you can actually do this in a OUTER APPLY or CROSS APPLY which is a very useful technique.
So you could re-write your method as follows:
SELECT *
FROM
Product p
OUTER APPLY (SELECT TOP 1 b.client_id
FROM
Bill_Item bi
INNER JOIN Bill b
ON bi.bill_id = b.bill_id
WHERE
bi.product_id = p.product_id
GROUP BY
b.client_id
ORDER BY
COUNT(*) DESC) c
And to show you how squirell's answer can still include products that have never been sold all you need to do is join Products and LEFT JOIN to other tables:
;WITH cte AS (
SELECT
p.product_id
,b.client_id
,ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY COUNT(*) DESC) as RowNumber
FROM
Product p
LEFT JOIN Bill_Item bi
ON p.product_id = bi.product_id
LEFT JOIN Bill b
ON bi.bill_id = b.bill_id
GROUP BY
p.product_id
,b.client_id
)
SELECT *
FROM
cte
WHERE
RowNumber = 1
Techniques used in some of these that are useful.
CTE
APPLY (Outer & Cross)
Window Functions
Squirrel's answer doesn't return products that have never been sold. If you want to include those, then your approach is ok, although I would write the query as:
SELECT product_id as product,
(SELECT TOP 1 b.client_id
FROM Bill_Item bi JOIN
Bill b
ON bi.bill_id = b.bill_id
WHERE Bill_Item.product_id = p.product_id
GROUP BY client_id
ORDER BY COUNT(*) DESC
) as client
FROM Product p;
You can also express this using APPLY, but a correlated subquery is also fine.
Note the correct use of the explicit JOIN syntax.

Problems with SQL Inner join

Having some problems while trying to optimize my SQL.
I got 2 tables like this:
Names
id, analyseid, name
Analyses
id, date, analyseid.
I want to get the newest analyse from Analyses (ordered by date) for every name (they are unique) in Names. I can't really see how to do this without using 2 x nested selects.
My try (Dont get confused about the names. It's the same principle):
SELECT
B.id,
B.chosendatetime,
vStockNames.name
FROM
vStockNames
INNER JOIN
(
SELECT TOP 1
vAnalysesHistory.id,
vAnalysesHistory.chosendatetime,
vAnalysesHistory.companyid
FROM
vAnalysesHistory
ORDER BY
vAnalysesHistory.chosendatetime DESC
) AS B
ON
B.companyid = vStockNames.stockid
In my example the problem is that i only get 1 row returned (because of top 1). But if I exclude this, I can get multiple analyses of the same name.
Can you help me ? - THanks in advance.
SQL Server 2000+:
SELECT (SELECT TOP 1
a.id
FROM vAnalysesHistory AS a
WHERE a.companyid = n.stockid
ORDER BY a.chosendatetime DESC) AS id,
n.name,
(SELECT TOP 1
a.chosendatetime
FROM vAnalysesHistory AS a
WHERE a.companyid = n.stockid
ORDER BY a.chosendatetime DESC) AS chosendatetime
FROM vStockNames AS n
SQL Server 2005+, using CTE:
WITH cte AS (
SELECT a.id,
a.date,
a.analyseid,
ROW_NUMBER() OVER(PARTITION BY a.analyseid
ORDER BY a.date DESC) AS rk
FROM ANALYSES a)
SELECT n.id,
n.name,
c.date
FROM NAMES n
JOIN cte c ON c.analyseid = n.analyseid
AND c.rk = 1
...without CTE:
SELECT n.id,
n.name,
c.date
FROM NAMES n
JOIN (SELECT a.id,
a.date,
a.analyseid,
ROW_NUMBER() OVER(PARTITION BY a.analyseid
ORDER BY a.date DESC) AS rk
FROM ANALYSES a) c ON c.analyseid = n.analyseid
AND c.rk = 1
You're only asking for the TOP 1, so that's all you're getting. If you want one per companyId, you need to specify that in the SELECT on vAnalysesHistory. Of course, JOINs must be constant and do not allow this. Fortunately, CROSS APPLY comes to the rescue in cases like this.
SELECT
B.id,
B.chosendatetime,
vStockNames.name
FROM
vStockNames
CROSS APPLY
(
SELECT TOP 1
vAnalysesHistory.id,
vAnalysesHistory.chosendatetime,
vAnalysesHistory.companyid
FROM
vAnalysesHistory
WHERE companyid = vStockNames.stockid
ORDER BY
vAnalysesHistory.chosendatetime DESC
) AS B
You could also use ROW_NUMBER() to do the same:
SELECT
B.id,
B.chosendatetime,
vStockNames.name
FROM
vStockNames
INNER JOIN
(
SELECT
vAnalysesHistory.id,
vAnalysesHistory.chosendatetime,
vAnalysesHistory.companyid,
ROW_NUMBER() OVER (PARTITION BY companyid ORDER BY chosendatetime DESC) AS row
FROM
vAnalysesHistory
) AS B
ON
B.companyid = vStockNames.stockid AND b.row = 1
Personally I'm a fan of the first approach. It will likely be faster and is easier to read IMO.
Will something like this work for you?
;with RankedAnalysesHistory as
(
SELECT
vah.id,
vah.chosendatetime,
vah.companyid
,rank() over (partition by vah.companyid order by vah.chosendatetime desc) rnk
FROM
vAnalysesHistory vah
)
SELECT
B.id,
B.chosendatetime,
vsn.name
FROM
vStockNames vsn
join RankedAnalysesHistory as rah on rah.companyid = vsn.stockid and vah.rnk = 1
It seems to me that you only need SQL-92 for this. Of course, explicit documentation of the joining columns between the tables would help.
Simple names
SELECT B.ID, C.ChosenDate, N.Name
FROM (SELECT A.AnalyseID, MAX(A.Date) AS ChosenDate
FROM Analyses AS A
GROUP BY A.AnalyseID) AS C
JOIN Analyses AS B ON C.AnalyseID = B.AnalyseID AND C.ChosenDate = B.Date
JOIN Names AS N ON N.AnalyseID = C.AnalyseID
The sub-select generates the latest analysis for each company; the join with Analyses picks up the Analyse.ID value corresponding to that latest analysis, and the join with Names picks up the company name. (The C.ChosenDate in the select-list could be replaced by B.Date AS ChosenDate, of course.)
Complicated names
SELECT B.ID, C.ChosenDateTime, N.Name
FROM (SELECT A.CompanyID, MAX(A.ChosenDateTime) AS ChosenDateTime
FROM vAnalysesHistory AS A
GROUP BY A.CompanyID) AS C
JOIN vAnalysesHistory AS B ON C.CompanyID = B.CompanyID
AND C.ChosenDateTime = B.ChosenDateTime
JOIN vStockNames AS N ON N.AnalyseID = C.AnalyseID
Same query with systematic renaming (and slightly different layout to avoid horizontal scrollbars).