Cross Join, Compare Values, and Select Closest Match - More Efficient Way? - sql

I have two tables, each with two columns. I cross join them and subtract the values, then compute a row_number ordered by that difference and keep the rows where row_number = 1. In effect, for each t1.id I'm finding the t2.id with the closest val.
These tables are quite large. Is the row_number function doing a lot of extra, unneeded work by ordering everything beyond rank 1? I only need the lowest-ranked row. Is there a more efficient way to write this?
Table 1
id    val
A1     0.123456
A2     1.123456
A3    -0.712345
Table 2
id    val
B1     0.065432
B2     1.654321
B3    -0.654321
--find the t2.id with the closest value to t1.id's val
with cj as (
select
t1.id as t1_id, t2.id as t2_id,
row_number() over (partition by t1.id order by abs(t1.val - t2.val)) as rw
from t1
cross join t2
)
select * from cj where rw = 1

It is possible to run this faster - it depends on how many rows are in t1, t2, and how much flexibility you have to add indexes etc.
As @Chris says, sorting (especially sorting many times) can be a killer. Because the cost of sorting grows faster than linearly (roughly O(n log n)) with the number of values being sorted, it gets increasingly worse the more values you have. If t2 only had two rows, sorting would be trivial and your original method would probably be the most efficient. However, if t2 has many rows, it becomes much, much harder. And if t1 also has many rows and you're doing one of those sorts per row, that multiplies the cost again.
As such, for testing purposes, I have used 1,000 rows in each of t1 and t2 below.
Below I compare several approaches, with indicators of speed and processing. (Spoiler alert: if you can pre-sort the data, as in @Chris's suggestion, you can get some big improvements.)
I don't use Databricks (sorry) and cannot measure speeds on it. Therefore the below is written and tested in SQL Server - but it can be modified to work in Databricks fairly easily, I would guess. I think the main difference is the OUTER APPLY used here; in Databricks that becomes a lateral join (e.g., How to use outer apply in Spark sql - but note the mapping is easy to mix up: CROSS APPLY is equivalent to an inner/CROSS JOIN LATERAL, while OUTER APPLY is equivalent to LEFT JOIN LATERAL).
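To make that mapping concrete, here is a rough sketch using the question's t1/t2 tables. The SQL Server form is essentially Approach 3 below; the LATERAL form is Postgres-style syntax, and the exact lateral syntax Databricks accepts may vary by runtime version.
-- SQL Server: inner-style lateral join via CROSS APPLY
SELECT t1.id AS A_id, x.id AS closest_B_id
FROM t1
CROSS APPLY (SELECT TOP (1) t2.id FROM t2 ORDER BY ABS(t1.val - t2.val)) AS x;
-- Postgres-style LATERAL with the same inner semantics (Databricks/Spark syntax may differ slightly)
SELECT t1.id AS A_id, x.id AS closest_B_id
FROM t1
CROSS JOIN LATERAL (SELECT t2.id FROM t2 ORDER BY ABS(t1.val - t2.val) LIMIT 1) AS x;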
I created the two tables and filled them with 1,000 rows each.
CREATE TABLE #t1 (A_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
CREATE TABLE #t2 (B_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
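The population script isn't shown in the original; here is a minimal sketch of one way to fill each table with 1,000 pseudo-random rows (the value range is an assumption, loosely based on the question's sample data).
;WITH n AS (
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO #t1 (A_id, val)
SELECT CONCAT('A', i), CAST(RAND(CHECKSUM(NEWID())) * 4 - 2 AS decimal(10,8))  -- random value in [-2, 2)
FROM n;

;WITH n AS (
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO #t2 (B_id, val)
SELECT CONCAT('B', i), CAST(RAND(CHECKSUM(NEWID())) * 4 - 2 AS decimal(10,8))
FROM n;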
Original approach - sort all rows
Your original query does very few data reads, but the cost is the amount of sorting it has to do. Because ROW_NUMBER() sorts all the rows and you then keep only one per group, that sorting is your major cost (as @Chris says).
-- Original query
with cj as (
select
#t1.A_id, #t2.B_id,
row_number() over (partition by #t1.A_id order by abs(#t1.val - #t2.val)) as rw
from #t1
cross join #t2
)
select * from cj where rw = 1;
On my computer, this took 1600ms of CPU time.
Approach 2 - taking the MIN() value
However, as you only need the minimum row, there is no real need to sort the other rows. Taking a MIN() only requires one scan through the data for each row in t1, picking the smallest value as it goes.
However, once you have the smallest value, you then need to refer to t2 again to get the relevant t2 IDs.
In other words, the logic of this is
spend less time determining only the smallest absolute difference (instead of sorting them all)
do more reads, and spend more time, finding which value(s) of t2 produce that absolute difference
-- Using MIN() to find smallest difference
with cj as (
select
#t1.A_id, #t1.val,
MIN(abs(#t1.val - #t2.val)) AS minvaldif
from #t1
cross join #t2
GROUP BY #t1.A_id, #t1.val
)
select cj.A_ID,
#t2.B_id
FROM cj
CROSS JOIN #t2
WHERE abs(cj.val - #t2.val) = minvaldif;
This took about half the time of the original on my computer - about 800ms of CPU time - but it more than doubles the number of data reads. Note also that it can return several rows per A_id if (say) there are repeated values in t2.
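As an aside (not part of the original answer), if you need exactly one row per A_id even when values tie, one option is to collapse the ties after matching - here arbitrarily keeping the lowest B_id:
-- MIN() approach with a deterministic tie-break (smallest B_id)
with cj as (
select
#t1.A_id, #t1.val,
MIN(abs(#t1.val - #t2.val)) AS minvaldif
from #t1
cross join #t2
GROUP BY #t1.A_id, #t1.val
)
select cj.A_id,
MIN(#t2.B_id) AS B_id
FROM cj
CROSS JOIN #t2
WHERE abs(cj.val - #t2.val) = cj.minvaldif
GROUP BY cj.A_id;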
Approach 3 - lateral join
In this case, you do a lateral join (in SQL Server, it's an 'OUTER APPLY') to select just the 1 minimum value you need. Similar to above, you find the min value, but you do it individually for each row in t1. Therefore you need to do 1000 'min' values rather than 1000 sorts.
-- Lateral join with min difference
SELECT #t1.A_id, t2_calc.B_id
FROM #t1
OUTER APPLY
(SELECT TOP (1) #t2.B_Id
FROM #T2
ORDER BY abs(#t1.val - #t2.val)
) AS t2_calc;
This is the most efficient so far - with few reads and only 300ms of compute time. If you cannot add indexes, this is probably the best you could do.
Option 4 - pre-sort the data with an index
If you can pre-sort the data using an index, then you can boost your efficiency by a lot.
CREATE NONCLUSTERED INDEX #IX_t2_val ON #t2 (val);
The 'gotcha' is that even if you have an index on t2.val, databases will have a problem with min(abs(t1.val - t2.val)) - they will usually still need to read all the data rather than use the index.
However, you can use the logic you identified in your question - that min(abs(difference)) value is the one where t1.val is closest to t2.val.
In the method below:
For every t1.val, find the closest t2 row that is less than or equal to it.
Also find, for every t1.val, the closest t2 row that is above it.
Then, using the logic you identified in your question, pick whichever of those two is the closest.
This also uses lateral views
-- Using indexes
with cj as
(SELECT #t1.A_id, #t1.val AS A_val, t2_lessthan.B_id, t2_lessthan.val AS B_val
FROM #t1
CROSS APPLY
(SELECT TOP (1) #t2.B_Id, #t2.val
FROM #T2
WHERE #t2.val <= #t1.val
ORDER BY #t2.val DESC
) AS t2_lessthan
UNION ALL
SELECT #t1.A_id, #t1.val AS A_val, t2_greaterthan.B_id, t2_greaterthan.val AS B_val
FROM #t1
CROSS APPLY
(SELECT TOP (1) #t2.B_Id, #t2.val
FROM #T2
WHERE #t2.val > #t1.val
ORDER BY #t2.val
) AS t2_greaterthan
),
cj_rn AS
(SELECT A_id, B_id,
row_number() over (partition by A_id order by abs(A_val - B_val)) as rw
FROM cj
)
select * from cj_rn where rw = 1;
Compute time: 4ms.
For each value in t1, it simply does two index seeks in t2 and 'sorts' the two resulting values (which is trivial). As such, in this case, it is orders of magnitude faster.
So... really the best approach is if you can pre-sort the data (in this case by adding indexes) and then make sure you take advantage of that sort.

This is a case where procedural code is better than the set logic used in SQL. If you get a cursor on both table1 & table2 (separately) both ordered by val, you can take advantage of the ordering to not compare EVERY combination of As and Bs.
Using Table2 as the primary, prime the 'pump' by reading the first (lowest) value from Table1 into variable FirstA and the second value from Table1 into variable SecondA.
First, loop while the next B < FirstA, outputting B & FirstA; because the list is ordered, every A after that would only be farther away.
Now form a loop over the Table2 cursor, reading each B value in turn. While B > SecondA, move SecondA into FirstA and read another value from Table1 into SecondA (or note end of cursor). Now B lies between FirstA and SecondA; one of those two is closest, so compare the absolute differences, output the closer one, and proceed to the next loop iteration.
That's all there is to it. The time-consuming part is sorting the two tables inside their cursors, which is O(n log n) and O(m log m). The comparison pass itself is linear, O(n + m).
Hopefully, your database manager has a procedural scripting language that will make this easy.
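For illustration, here is a minimal T-SQL sketch of that two-cursor merge, using the question's t1/t2 tables. Variable and cursor names are made up, it assumes t1 is non-empty, and the roles can be swapped if you want the closest t2 for each t1 instead.
-- Two ordered cursors, merged in a single pass
DECLARE @FirstA_id nvarchar(10), @FirstA decimal(10,8),
        @SecondA_id nvarchar(10), @SecondA decimal(10,8),
        @B_id nvarchar(10), @B decimal(10,8);

DECLARE curA CURSOR FAST_FORWARD FOR SELECT id, val FROM t1 ORDER BY val;
DECLARE curB CURSOR FAST_FORWARD FOR SELECT id, val FROM t2 ORDER BY val;

OPEN curA;
FETCH NEXT FROM curA INTO @FirstA_id, @FirstA;      -- prime the pump with the two lowest A values
FETCH NEXT FROM curA INTO @SecondA_id, @SecondA;
IF @@FETCH_STATUS <> 0 SET @SecondA = NULL;         -- t1 had only one row

OPEN curB;
FETCH NEXT FROM curB INTO @B_id, @B;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- slide the A window up until B falls between FirstA and SecondA
    WHILE @SecondA IS NOT NULL AND @B > @SecondA
    BEGIN
        SELECT @FirstA_id = @SecondA_id, @FirstA = @SecondA;
        FETCH NEXT FROM curA INTO @SecondA_id, @SecondA;
        IF @@FETCH_STATUS <> 0 SET @SecondA = NULL; -- t1 exhausted
    END;

    -- whichever neighbour is closer wins
    IF @SecondA IS NULL OR ABS(@B - @FirstA) <= ABS(@B - @SecondA)
        PRINT CONCAT(@B_id, ' -> ', @FirstA_id);
    ELSE
        PRINT CONCAT(@B_id, ' -> ', @SecondA_id);

    FETCH NEXT FROM curB INTO @B_id, @B;
END;

CLOSE curB; DEALLOCATE curB;
CLOSE curA; DEALLOCATE curA;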

Related

How to improve SQL query performance containing partially common subqueries

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count sys_timestamp
100 167877672772
110 167877672769
121 167877672987
111 167877673877
... ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected, and returns in this particular example for each row a count of positive and negative excesses (range + / - 10) given the defined time window (+ 1000 [milliseconds]).
However, I will have to run such queries for tables with several million (perhaps even 100+ million) entries, and even with about 500k rows, the query takes a looooooong time to complete. Furthermore, whereas the time frame remains always the same within a given query [but the window size can change from query to query], in some instances I will have to use maybe 10 additional conditions similar to the positive / negative excesses in the same query.
Thus, I am looking for ways to improve the above query primarily to achieve better performance considering primarily the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field which is currently numeric, into a timestamp field, and attempt using any of the PostgreSQL Windows functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query aside from reusing the (partial) subquery that materially increases the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
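For example (index names here are made up):
CREATE INDEX tablea_sys_timestamp_idx ON tableA (sys_timestamp);
-- or, to cover the whole lateral subquery and enable index-only scans:
CREATE INDEX tablea_ts_count_idx ON tableA (sys_timestamp, event_count);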
Depending on the exact data distribution (most importantly, how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is "with" statements (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step includes creating parent-child detail rows - table multiplied by itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary key, or that you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
FROM tableA t1, tableA t2
where t2.sys_timestamp between t1.sys_timestamp AND t1.sys_timestamp + 1000
), posSummary as
(
select bq.startEventCount, count(*) as positive
from baseQuery bq
where t2EventCount between bq.startEventCount and bq.pEndEventCount
group by bq.startEventCount
), negSummary as
(
select bq.startEventCount, count(*) as negative
from baseQuery bq
where t2EventCount between bq.startEventCount and bq.nEndEventCount
group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count=ps.startEventCount
inner join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
With statements are used in multi-dimensional data warehouse queries because, when you have that much data to join across that many tables (dimensions and facts), a strategy of isolating the sub-queries helps you understand where indexes are needed, and perhaps how to minimize the rows the query has to deal with further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.

Which is best to use between the IN and JOIN operators in SQL server for the list of values as table two?

I heard that the IN operator is costlier than the JOIN operator.
Is that true?
Example case for IN operator:
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two)
Example case for JOIN operator:
SELECT *
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
tl;dr; - once the queries are fixed so that they will yield the same results, the performance is the same.
Both queries are not the same, and will yield different results.
The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.
That can be solved easily by replacing the * in the second query to table_one.*, or better yet, specify only the columns you want to get back from the query (which is best practice).
However, even if that issue is changed, the queries might still yield different results if the values on table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it fits multiple records in table_two, while the JOIN query will simply duplicate the records as many times as the criteria in the ON clause is met.
Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.
So, in the performance front:
The IN operator has a history of poor performance with a large values list - in earlier versions of SQL Server, if you used the IN operator with, say, 10,000 or more values, it would have suffered from performance issues.
With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.
However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.
UPDATE: I've done some performance research, on the only version available to me, which is SQL Server 2016.
First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.
CREATE TABLE Table_One
(
id int,
CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);
CREATE TABLE Table_Two
(
column_one int,
CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);
Then, I've populated both tables with 1,000,000 (one million) rows.
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY @@SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;
INSERT INTO Table_One (id)
SELECT N
FROM Tally;
INSERT INTO Table_Two (column_one)
SELECT N
FROM Tally;
Next, I ran four different ways of getting all the values of table_one that match values of table_two. The first two are from the original question (with minor changes), the third is a simplified version of the join query, and the fourth is a query that uses the EXISTS operator with a correlated subquery instead of the IN operator:
SELECT *
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);
SELECT TOne.*
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.id = TTwo.column_one;
SELECT TOne.*
FROM table_one TOne
JOIN table_two AS TTwo
ON TOne.id = TTwo.column_one;
SELECT *
FROM table_one
WHERE EXISTS
(
SELECT 1
FROM table_two
WHERE column_one = id
);
All four queries yielded the exact same result with the exact same execution plan - so it's safe to say that performance, under these circumstances, is exactly the same.
You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).
From a performance point of view, using EXISTS is often a better option than using the IN operator or a JOIN between the tables:
SELECT TOne.*
FROM table_one TOne
WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )
If you need the columns from both tables, and provided those tables have indexes on the column_one columns used in the join condition, using a JOIN would be better than using an IN operator, since you will be able to benefit from the indexes:
SELECT TOne.*, TTwo.*
FROM table_one TOne
JOIN table_two TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
The second (JOIN) query cannot be more optimal than the first unless you put a where clause within the sub-query, as follows:
Select * from table_one TOne
JOIN (select column_one from table_two where column_two = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one
However, the better decision can be based on execution plan with following points into consideration:
How many tasks the query has to perform to get the result
What is task type and execution time of each task
Variance between the estimated number of rows and the actual number of rows in each task - if the variance is too high, this can often be fixed by running UPDATE STATISTICS on the table.
In general, the logical processing order of the SELECT statement goes as follows (see the annotated example after the list). If you arrange the query so that it reads fewer rows/pages at the higher levels (i.e., earlier in the order below), it will incur less logical I/O and end up better optimized. In other words, it is better to filter rows in the FROM or WHERE clause than in the GROUP BY or HAVING clause.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
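As an illustration only, here is that logical order annotated on a simple query using the question's table names (this shows the logical order, not the physical plan the optimizer actually chooses; the WHERE filter assumes column_one is numeric):
SELECT TOne.column_one, COUNT(*) AS matches      -- 8) SELECT
FROM table_one TOne                              -- 1) FROM
JOIN table_two TTwo                              -- 3) JOIN
    ON TOne.column_one = TTwo.column_one         -- 2) ON
WHERE TOne.column_one > 0                        -- 4) WHERE
GROUP BY TOne.column_one                         -- 5) GROUP BY
HAVING COUNT(*) > 1                              -- 7) HAVING
ORDER BY TOne.column_one;                        -- 10) ORDER BY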

Why would CROSS APPLY not be equivalent to INNER JOIN

This runs in 2 minutes:
SELECT
G.GKey,
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
INNER JOIN #g G ON
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
GROUP BY G.GKey;
This runs in 8 mins:
SELECT
G.GKey,
C.Amount
FROM
#g G
CROSS APPLY
(
SELECT
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
WHERE
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
) C;
These are both quite simple scripts and they look logically the same to me.
Table #G has 50 rows with a clustered index ON #G(Livedate,GKey)
Table WH.dbo.vw_Fact has a billion rows.
I actually felt initially that applying the bigger table to the small table was going to be more efficient.
My experience using CROSS APPLY is limited - is there an obvious reason (without exploring execution plans) for the slow time?
Is there a 'third way' that is likely to be quicker?
Here's the logical difference between the two joins:
CROSS APPLY: evaluates the aggregation subquery for a given value of LiveDate and GKey; it gets re-executed for every row of #g.
INNER JOIN: matches rows of vw_Fact against every value of LiveDate and GKey, then sums across common values of GKey; this builds the joined set first, then applies the aggregate.
As some of the other answers mentioned, cross apply is convenient when you join to a table valued function that is parameterized by some row level data from another table.
Is there a third way that is faster? I would generally suggest not using open-ended operators (such as >=) in joins. Maybe try to pre-aggregate the large table on GKey and some date bucket (a sketch follows below). Also, set up a non-clustered index on LiveDate, including AmountEUR.
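A hedged sketch of that pre-aggregation idea (table and column names come from the question; the #fct_by_day work table is made up, and since vw_Fact is a view, the summary is materialised first):
-- 1) pre-aggregate the billion-row fact source to one row per GKey/DateKey
SELECT fct.GKey, fct.DateKey, SUM(fct.AmountEUR) AS Amount
INTO #fct_by_day
FROM WH.dbo.vw_Fact fct
GROUP BY fct.GKey, fct.DateKey;

-- 2) join the much smaller summary to #g with the same open-ended condition
SELECT G.GKey,
       Amount = SUM(f.Amount)
FROM #fct_by_day f
INNER JOIN #g G
        ON f.GKey = G.GKey
       AND f.DateKey >= G.Livedate
GROUP BY G.GKey;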
I think you are trying to get a rolling sum. Use the OVER() clause. Try this:
SELECT G.GKey,
Amount = Sum(fct.AmountEUR)
OVER(
partition BY G.GKey
ORDER BY id rows UNBOUNDED PRECEDING)
FROM WH.dbo.vw_Fact fct
INNER JOIN #g G
ON fct.GKey = G.GKey
APPLY works on a row-by-row basis and is useful for more complex joins such as joining on the first X rows of a table based upon a value in the first table or for joining a function with parameters.
See here for examples.
The obvious reason for the cross apply being slower is that it works on a row by row basis!
So for each row of #g you are running the aggregate query in the cross apply.

Use of temp table in joining for performance issue

Is there any basic difference in terms of performance and output between the two queries below?
select * from table1
left outer join table2 on table1.col=table2.col
and table2.col1='shhjs'
and
select * into #temp from table2 where table2.col1='shhjs'
select * from table1 left outer join #temp on table1.col=#temp.col
Here table2 has a huge number of records while #temp has far fewer.
Yes, there is. The second method is going to materialize a table in the temp database, which requires additional overhead.
The first method does not require such overhead. And, it can be better optimized. For instance, if an index existed on table2(col, col1), the first version might take advantage of it. The second would not.
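For illustration, a hedged sketch of the kind of index meant here (names assumed from the question):
CREATE NONCLUSTERED INDEX IX_table2_col_col1 ON table2 (col, col1);
-- alternatively, keyed on the filter column first (an assumption, not from the answer):
-- CREATE NONCLUSTERED INDEX IX_table2_col1_col ON table2 (col1) INCLUDE (col);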
However, you can always try the two queries on your system with your data and determine if one noticeably outperforms the other.

LEFT JOIN with ROW NUM DB2

I am using two tables that have a one-to-many mapping (in DB2).
I need to fetch 20 records at a time using ROW_NUMBER from the two tables using a LEFT JOIN. But due to the one-to-many mapping, the result is not consistent: I might get 20 records, but those records do not contain 20 unique records of the first table.
SELECT
A.*,
B.*,
ROW_NUMBER() OVER (ORDER BY A.COLUMN_1 DESC) as rn
from
table1 A
LEFT JOIN
table2 B ON A.COLUMN_3 = B.COLUMN3
where
rn between 1 and 20
Please suggest some solution.
Sure, this is easy... once you know that you can use subqueries as a table reference:
SELECT <relevant columns from Table1 and Table2>, rn
FROM (SELECT <relevant columns from Table1>,
ROW_NUMBER() OVER (ORDER BY <relevant columns> DESC) AS rn
FROM table1) Table1
LEFT JOIN Table2
ON <relevant equivalent columns>
WHERE rn >= :startOfRange
AND rn < :startOfRange + :numberOfElements
For production code, never do SELECT * - always explicitly list the columns you want (there are several reasons for this).
Prefer inclusive lower-bound (>=), exclusive upper-bound (<) for (positive) ranges. For everything except integral types, this is required to sanely/cleanly query the values. Do this with integral types both to be consistent, as well as for ease of querying (note that you don't actually need to know which value you "stop" on). Further, the pattern shown is considered the standard when dealing with iterated value constructs.
Note that this query currently has two problems:
You need to list sufficient columns for the ORDER BY to return consistent results. This is best done by using a unique value - you probably want something in an index that the optimizer can use.
Every time you run this query, you (usually) have to order the ENTIRE set of results before you can get whatever slice of them you want (especially for anything after the first page). If your dataset is large, look at the answers to this question for some ideas for performance improvements. The optimizer may be able to cache the results for you, but it's not guaranteed (especially on tables that receive many updates).
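For concreteness, here is the pattern above filled in with the question's column names (assumed), fetching the first 20 rows of table1 together with their table2 matches:
SELECT t1.COLUMN_1, t1.COLUMN_3, t2.COLUMN3, t1.rn
FROM (SELECT COLUMN_1, COLUMN_3,
             ROW_NUMBER() OVER (ORDER BY COLUMN_1 DESC) AS rn  -- add a unique tiebreaker column here for stable paging
      FROM table1) t1
LEFT JOIN table2 t2
       ON t1.COLUMN_3 = t2.COLUMN3
WHERE t1.rn >= 1
  AND t1.rn < 21;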