Why would CROSS APPLY not be equivalent to INNER JOIN - sql

This runs in 2 minutes:
SELECT
G.GKey,
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
INNER JOIN #g G ON
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
GROUP BY G.GKey;
This runs in 8 mins:
SELECT
G.GKey,
C.Amount
FROM
#g G
CROSS APPLY
(
SELECT
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
WHERE
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
) C;
These are both quite simple scripts and they look logically the same to me.
Table #G has 50 rows with a clustered index ON #G(Livedate,GKey)
Table WH.dbo.vw_Fact has a billion rows.
I actually felt initially that applying the bigger table to the small table was going to be more efficient.
My experience using CROSS APPLY is limited - is there an obvious reason (without exploring execution plans) for the slow time?
Is there a 'third way' that is likely to be quicker?

Here's the logical difference between the two joins:
CROSS APPLY: the aggregation for a given value of Livedate and GKey is logically re-executed for every row of #g, and its result is joined back onto that row.
INNER JOIN: matches every qualifying row of vw_Fact to each #g row on Livedate and GKey, builds that joined set first, and then sums across common values of GKey.
As some of the other answers mentioned, CROSS APPLY is convenient when you join to a table-valued function that is parameterized by row-level data from another table.
Is there a third way that is faster? I would generally suggest avoiding open-ended predicates (such as >=) in join conditions. Maybe try pre-aggregating the large table on GKey and some date bucket. Also, set up a non-clustered index on the fact table's date column that includes AmountEUR.
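As a rough sketch of the pre-aggregation idea (not tested against your schema; it assumes DateKey is an actual date, SQL Server 2012+ for DATEFROMPARTS, and that every Livedate in #g falls on a month boundary - otherwise the partial first month needs separate handling):

-- Roll the fact data up to (GKey, month) once, then join the tiny summary to #g.
SELECT
    fct.GKey,
    MonthStart = DATEFROMPARTS(YEAR(fct.DateKey), MONTH(fct.DateKey), 1),
    Amount     = SUM(fct.AmountEUR)
INTO #fact_by_month
FROM WH.dbo.vw_Fact fct
GROUP BY fct.GKey, DATEFROMPARTS(YEAR(fct.DateKey), MONTH(fct.DateKey), 1);

SELECT
    G.GKey,
    Amount = SUM(f.Amount)
FROM #g G
INNER JOIN #fact_by_month f ON
    f.GKey = G.GKey AND
    f.MonthStart >= G.Livedate
GROUP BY G.GKey;

The billion-row table is still scanned once to build #fact_by_month, but only once, and the per-GKey join then runs against a small summary instead of the full fact data.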

I think you are trying to get a rolling sum. Use the OVER() clause. Try this:
SELECT G.GKey,
       Amount = SUM(fct.AmountEUR)
                OVER (PARTITION BY G.GKey
                      ORDER BY id ROWS UNBOUNDED PRECEDING)
FROM   WH.dbo.vw_Fact fct
       INNER JOIN #g G
               ON fct.GKey = G.GKey

APPLY works on a row-by-row basis and is useful for more complex joins, such as joining on the first X rows of a table based upon a value in the first table, or for joining to a function that takes parameters.
See here for examples.
The obvious reason for the CROSS APPLY being slower is that it works on a row-by-row basis!
So for each row of #g you are running the aggregate query inside the CROSS APPLY.
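For illustration, here is a hedged example of the "first X rows per outer row" pattern mentioned above (the Customers and Orders tables are hypothetical, just to show the shape of the query):

-- The three most recent orders for every customer.
SELECT c.CustomerID, o.OrderID, o.OrderDate
FROM Customers AS c
CROSS APPLY
(
    SELECT TOP (3) OrderID, OrderDate
    FROM Orders
    WHERE Orders.CustomerID = c.CustomerID
    ORDER BY OrderDate DESC
) AS o;

An ordinary join cannot express "the first X rows per outer row" this directly, which is where APPLY earns its keep.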

Related

Cross Join, Compare Values, and Select Closest Match - More Efficient Way?

I have two tables with 2 columns. I cross join and subtract the values. I then find the row_number ordered by the subtraction and choose where row = 1. I'm finding the t2.id that has the closest val to t1.id
These tables are quite large. Is the row_number function doing a lot of extra unneeded work by ordering everything when I only keep row 1? I only need the lowest-ranked row. Is there a more efficient way to write this?
Table 1
id    val
A1    0.123456
A2    1.123456
A3    -0.712345

Table 2
id    val
B1    0.065432
B2    1.654321
B3    -0.654321
--find the t2.id with the closest value to t1.id's val
with cj as (
select
t1.id AS t1_id, t2.id AS t2_id,
row_number() over (partition by t1.id order by abs(t1.val - t2.val)) as rw
from t1
cross join t2
)
select * from cj where rw = 1
It is possible to run this faster - it depends on how many rows are in t1, t2, and how much flexibility you have to add indexes etc.
As @Chris says, sorting (especially sorting many times) can be a killer. Because the cost of sorting grows more than linearly (roughly O(n log n)) with the number of values you are sorting, it gets increasingly worse the more rows you have. If t2 only had two rows, then sorting would be easy and your original method would probably be the most efficient. However, if t2 has many rows, it becomes much harder; and if t1 also has many rows and you're doing a sort for every one of them, that multiplies the cost again.
As such, for testing purposes, I have used 1,000 rows in each of t1 and t2 below.
Below I compare several approaches, with indicators of speed and processing. (Spoiler alert) if you can pre-sort the data (as in @Chris' suggestion) then you can get some big improvements.
I don't use Databricks (sorry) and cannot measure speeds etc. on it. The queries below are therefore written and tested in SQL Server, but I would guess they can be modified to work in Databricks fairly easily. The main difference is the OUTER APPLY used here; in Databricks that corresponds to a lateral join (see e.g. How to use outer apply in Spark sql): CROSS APPLY is equivalent to an inner LATERAL join, while OUTER APPLY is equivalent to a LEFT OUTER lateral join.
I created the two tables and filled them with 1,000 rows each.
CREATE TABLE #t1 (A_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
CREATE TABLE #t2 (B_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
Original approach - sort all rows
Your original query takes very few data reads, but the cost is the amount of sorting it needs to do. Because ROW_NUMBER() sorts all the rows, and then you only take 1, this is your major cost (as #Chris says).
-- Original query
with cj as (
select
#t1.A_id, #t2.B_id,
row_number() over (partition by #t1.A_id order by abs(#t1.val - #t2.val)) as rw
from #t1
cross join #t2
)
select * from cj where rw = 1;
On my computer, this took 1600ms of CPU time.
Approach 2 - taking the MIN() value
However, as you only need the minimum row, there is no need to really sort the other rows. Taking a 'min' only requires one scan through the data for each data point in t1, picking the smallest value.
However, once you have the smallest value, you then need to refer to t2 again to get the relevant t2 IDs.
In other words, the logic of this is
spend less time determining only the smallest absolute difference (instead of sorting them all)
spend more reads and more time finding which value(s) of t2 get that absolute difference
-- Using MIN() to find smallest difference
with cj as (
select
#t1.A_id, #t1.val,
MIN(abs(#t1.val - #t2.val)) AS minvaldif
from #t1
cross join #t2
GROUP BY #t1.A_id, #t1.val
)
select cj.A_ID,
#t2.B_id
FROM cj
CROSS JOIN #t2
WHERE abs(cj.val - #t2.val) = minvaldif;
This took about half the time of the original on my computer - about 800ms of CPU time - but more than doubles the amount of data reads it does. Note also that it can return several rows if (say) there are repeated values in t2.
Approach 3 - lateral join
In this case, you do a lateral join (in SQL Server, it's an 'OUTER APPLY') to select just the 1 minimum value you need. Similar to above, you find the min value, but you do it individually for each row in t1. Therefore you need to do 1000 'min' values rather than 1000 sorts.
-- Lateral join with min difference
SELECT #t1.A_id, t2_calc.B_id
FROM #t1
OUTER APPLY
(SELECT TOP (1) #t2.B_Id
FROM #T2
ORDER BY abs(#t1.val - #t2.val)
) AS t2_calc;
This is the most efficient so far - with few reads and only 300ms of compute time. If you cannot add indexes, this is probably the best you could do.
Option 4 - pre-sort the data with an index
If you can pre-sort the data using an index, then you can boost your efficiency by a lot.
CREATE NONCLUSTERED INDEX #IX_t2_val ON #t2 (val);
The 'gotcha' is that even if you have an index on t2.val, databases will have a problem with min(abs(t1.val - t2.val)) - they will usually still need to read all the data rather than use the index.
However, you can use the logic you identified in your question - that min(abs(difference)) value is the one where t1.val is closest to t2.val.
In the method below:
- For every t1.val, find the closest t2 row that is less than or equal to it.
- Also find, for every t1.val, the closest t2 row that is above it.
- Then, using the logic from your original query, pick whichever of those two is closest.
This also uses lateral views (CROSS APPLY in SQL Server).
-- Using indexes
with cj as
(SELECT #t1.A_id, #t1.val AS A_val, t2_lessthan.B_id, t2_lessthan.val AS B_val
FROM #t1
CROSS APPLY
(SELECT TOP (1) #t2.B_Id, #t2.val
FROM #T2
WHERE #t2.val <= #t1.val
ORDER BY #t2.val DESC
) AS t2_lessthan
UNION ALL
SELECT #t1.A_id, #t1.val AS A_val, t2_greaterthan.B_id, t2_greaterthan.val AS B_val
FROM #t1
CROSS APPLY
(SELECT TOP (1) #t2.B_Id, #t2.val
FROM #T2
WHERE #t2.val > #t1.val
ORDER BY #t2.val
) AS t2_greaterthan
),
cj_rn AS
(SELECT A_id, B_id,
row_number() over (partition by A_id order by abs(A_val - B_val)) as rw
FROM cj
)
select * from cj_rn where rw = 1;
Compute time: 4ms.
For each value in t1, it simply does 2 index seeks in t2, and 'sorts' the two values (which is very easy). As such, in this case, it is orders of magnitude faster.
So... really the best approach is if you can pre-sort the data (in this case by adding indexes) and then make sure you take advantage of that sort.
This is a case where procedural code is better than the set logic used in SQL. If you get a cursor on both table1 & table2 (separately) both ordered by val, you can take advantage of the ordering to not compare EVERY combination of As and Bs.
Using Table2 as the primary, prime the 'pump' by reading the first (lowest) value from Table1 into variable FirstA and the second value from Table1 into variable SecondA.
First, loop while the next B < FirstA, outputting each such B with FirstA; every A after FirstA will be farther away because the lists are ordered.
Now loop over the Table2 cursor, reading each B value in turn. While B > SecondA, move SecondA into FirstA and read another value from Table1 into SecondA (or note that the cursor is exhausted). B now lies between FirstA and SecondA; one of those two is closest, so compare the absolute differences, output the closer one, and proceed to the next loop iteration.
That's all there is to it. The time-consuming part is sorting the two tables inside their cursors, which is O(n log n) and O(m log m). The comparison pass itself is linear, O(n + m).
Hopefully, your database manager has a procedural scripting language that will make this easy.
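If you want to try this in T-SQL, here is a rough cursor sketch of the merge idea, written against the #t1 (A_id, val) and #t2 (B_id, val) temp tables from the answer above. It follows the description here, so it finds the closest t1 row for each t2 row; swap the roles of the tables to go the other way.

-- Two-cursor merge: both cursors ordered by val, A values held in a two-value window.
DECLARE @B_id nvarchar(10), @B_val decimal(10,8);
DECLARE @FirstA_id nvarchar(10), @FirstA decimal(10,8);
DECLARE @SecondA_id nvarchar(10), @SecondA decimal(10,8);
DECLARE @result TABLE (B_id nvarchar(10), closest_A_id nvarchar(10));

DECLARE curA CURSOR LOCAL FAST_FORWARD FOR
    SELECT A_id, val FROM #t1 ORDER BY val;
DECLARE curB CURSOR LOCAL FAST_FORWARD FOR
    SELECT B_id, val FROM #t2 ORDER BY val;

OPEN curA;
FETCH NEXT FROM curA INTO @FirstA_id, @FirstA;    -- lowest A value
FETCH NEXT FROM curA INTO @SecondA_id, @SecondA;  -- next A value (stays NULL if only one row)

OPEN curB;
FETCH NEXT FROM curB INTO @B_id, @B_val;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Slide the A window forward while B has moved past SecondA.
    WHILE @SecondA IS NOT NULL AND @B_val > @SecondA
    BEGIN
        SELECT @FirstA_id = @SecondA_id, @FirstA = @SecondA;
        FETCH NEXT FROM curA INTO @SecondA_id, @SecondA;
        IF @@FETCH_STATUS <> 0 SET @SecondA = NULL;   -- A cursor exhausted
    END;

    -- B is now no greater than SecondA (or A is exhausted), so the closest A is FirstA or SecondA.
    IF @SecondA IS NULL OR ABS(@B_val - @FirstA) <= ABS(@B_val - @SecondA)
        INSERT @result VALUES (@B_id, @FirstA_id);
    ELSE
        INSERT @result VALUES (@B_id, @SecondA_id);

    FETCH NEXT FROM curB INTO @B_id, @B_val;
END;

CLOSE curB; DEALLOCATE curB;
CLOSE curA; DEALLOCATE curA;

SELECT * FROM @result;

The two ORDER BY val scans are the expensive part; the loop itself touches each row of each table only once.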

SQL subselect statement very slow on certain machines

I've got an sql statement where I get a list of all Ids from a table (Machines).
I then need the latest instance of another row in (Events) where the ids match, so I have been doing a subselect.
I need the latest instance of quite a few fields that match the id, so I have these subselects one after another within this single statement, and end up with results similar to this...
This works and the results are spot on, it's just becoming very slow as the Events Table has millions of records. The Machine table would have on average 100 records.
Is there a better solution than subselects? Maybe inner joins or a stored procedure?
Help appreciated :)
You can use apply. You don't specify how "latest instance" is defined. Let me assume it is based on the time column:
Select a.id, b.*
from TableA a outer apply
(select top(1) b.Name, b.time, b.weight
from b
where b.id = a.id
order by b.time desc
) b;
Both APPLY and the correlated subquery need an ORDER BY to do what you intend.
APPLY is a lot like a correlated query in the FROM clause -- with two convenient enhancements. A lateral join -- technically what APPLY does -- can return multiple rows and multiple columns.
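For comparison, here is a hedged sketch of the same "latest row" lookup written with correlated subqueries in the SELECT list, using the same hypothetical TableA/b names as above. Each subquery can return only one column, so the lookup has to be repeated per column - which is exactly what the single OUTER APPLY avoids:

Select a.id,
       (select top (1) b.Name   from b where b.id = a.id order by b.time desc) as Name,
       (select top (1) b.time   from b where b.id = a.id order by b.time desc) as time,
       (select top (1) b.weight from b where b.id = a.id order by b.time desc) as weight
from TableA a;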

MS SQL - Multiple Running Totals, each based on a different GROUP BY

I need to generate two running-total columns, each based on a different GROUP BY. I would PREFER that the solution use the OUTER APPLY method like the one below, modified to run multiple running totals/sums on different group-bys/columns. See the image for an example of the desired result.
SELECT t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount,
RunningTotal = SUM(t2.TicketAmount)
FROM dbo.SpeedingTickets AS t1
OUTER APPLY
(
SELECT TicketAmount
FROM dbo.SpeedingTickets
WHERE LicenseNumber = t1.LicenseNumber
AND IncidentDate <= t1.IncidentDate
) AS t2
GROUP BY t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount
ORDER BY t1.LicenseNumber, t1.IncidentDate;
Example + desired result:
i.stack.imgur.com/PvJQe.png
Use outer apply twice:
Here is how you get one running total:
SELECT st.*, r1.RunningTotal
FROM dbo.SpeedingTickets st OUTER APPLY
(SELECT SUM(st2.TicketAmount) as RunningTotal
FROM dbo.SpeedingTickets st2
WHERE st2.LicenseNumber = st.LicenseNumber AND
st2.IncidentDate <= st.IncidentDate
) r1
ORDER BY st.LicenseNumber, st.IncidentDate;
For two, you just add another OUTER APPLY. Your question doesn't specify what the second aggregation is, and the linked picture has no relevance to the description in the question.
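As a hedged sketch only (the second grouping below - a per-date running total across all licenses - is made up, since the question doesn't say what the second aggregation should be):

SELECT st.*, r1.RunningTotalByLicense, r2.RunningTotalByDate
FROM dbo.SpeedingTickets st
OUTER APPLY
    (SELECT SUM(st2.TicketAmount) AS RunningTotalByLicense
     FROM dbo.SpeedingTickets st2
     WHERE st2.LicenseNumber = st.LicenseNumber AND
           st2.IncidentDate <= st.IncidentDate
    ) r1
OUTER APPLY
    (SELECT SUM(st3.TicketAmount) AS RunningTotalByDate
     FROM dbo.SpeedingTickets st3
     WHERE st3.IncidentDate <= st.IncidentDate
    ) r2
ORDER BY st.LicenseNumber, st.IncidentDate;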
Notes:
The aggregation goes in the subquery, not in the outer query.
Use table abbreviations for table aliases. Such consistency makes it easier to follow the query.
When using correlated subqueries, always use qualified column names for all columns.

Google BigQuery; use subselect result in outer select; cross join

I have a query that produces a one-row table, and I need to use that result in a subsequent computation. Here is a non-working simplified example (just to depict what I'm trying to achieve):
SELECT amount / (SELECT SUM(amount) FROM [...]) FROM [...]
I tried some nested sub-selects and joins (cross join of the one row table with the other table) but didn't find any working solution. Is there a way to get this working in BigQuery?
Thanks, Radek
EDIT:
ok, I found solution:
select
t1.x / t2.y as z
from
(select 1 as k, amount as x from [...] limit 10) as t1
join
(select 1 as k, sum(amount) as y from [...]) as t2
on
t1.k = t2.k;
but I'm not sure if this is the best way to do it...
With the recently announced ratio_to_report() window function:
SELECT RATIO_TO_REPORT(amount) OVER() AS z
FROM [...]
ratio_to_report takes each amount and divides it by the sum of the amounts over all result rows.
The way you've found (essentially a cross join using a dummy key) is the best way I know of to do this query. We've thought about adding an explicit cross join operator to make it easier to see how to do this, but cross join can get expensive if not done correctly (e.g. if done on two large tables can create n^2 results).

What are the advantages of a query using a derived table(s) over a query not using them?

I know how derived tables are used, but I still can’t really see any real advantages of using them.
For example, in the following article http://techahead.wordpress.com/2007/10/01/sql-derived-tables/ the author tried to show benefits of a query using derived table over a query without one with an example, where we want to generate a report that shows off the total number of orders each customer placed in 1996, and we want this result set to include all customers, including those that didn’t place any orders that year and those that have never placed any orders at all( he’s using Northwind database ).
But when I compare the two queries, I fail to see any advantages of a query using a derived table ( if nothing else, use of a derived table doesn't appear to simplify our code, at least not in this example):
Regular query:
SELECT C.CustomerID, C.CompanyName, COUNT(O.OrderID) AS TotalOrders
FROM Customers C LEFT OUTER JOIN Orders O ON
C.CustomerID = O.CustomerID AND YEAR(O.OrderDate) = 1996
GROUP BY C.CustomerID, C.CompanyName
Query using a derived table:
SELECT C.CustomerID, C.CompanyName, COUNT(dOrders.OrderID) AS TotalOrders
FROM Customers C LEFT OUTER JOIN
(SELECT * FROM Orders WHERE YEAR(Orders.OrderDate) = 1996) AS dOrders
ON
C.CustomerID = dOrders.CustomerID
GROUP BY C.CustomerID, C.CompanyName
Perhaps this just wasn’t a good example, so could you show me an example where benefits of derived table are more obvious?
thanx
REPLY TO GBN:
In this case, you couldn't capture both products and order aggregates if there is no relation between Customers and Products.
Could you elaborate on what exactly you mean? Wouldn't the following query produce the same result set as your query:
SELECT
C.CustomerID, C.CompanyName,
COUNT(O.OrderID) AS TotalOrders,
COUNT(DISTINCT P.ProductID) AS DifferentProducts
FROM Customers C LEFT OUTER JOIN Orders O ON
C.CustomerID = O.CustomerID AND YEAR(O.OrderDate) = 1996
LEFT OUTER JOIN Products P ON
O.somethingID = P.somethingID
GROUP BY C.CustomerID, C.CompanyName
REPLY TO CADE ROUX:
In addition, if expressions are used to derive columns from derived columns with a lot of shared intermediate calculations, a set of nested derived tables or stacked CTEs is the only way to do it:
SELECT x, y, z1, z2
FROM (
SELECT *
,x + y AS z1
,x - y AS z2
FROM (
SELECT x * 2 AS y
FROM A
) AS A
) AS A
Wouldn't the following query produce the same result as your above query:
SELECT x, x * 2 AS y, x + x*2 AS z1, x - x*2 AS z2
FROM A
In your examples, the derived table is not strictly necessary. There are numerous cases where you might need to join to an aggregate or similar, and a derived table is really the only way to handle that:
SELECT *
FROM A
LEFT JOIN (
SELECT x, SUM(y)
FROM B
GROUP BY x
) AS B
ON B.x = A.x
In addition, if expressions are used to derive columns from derived columns with a lot of shared intermediate calculations, a set of nested derived tables or stacked CTEs is the only way to do it:
SELECT x, y, z1, z2
FROM (
SELECT *
,x + y AS z1
,x - y AS z2
FROM (
SELECT x * 2 AS y
FROM A
) AS A
) AS A
As far as maintainability, using stacked CTEs or derived tables (they are basically equivalent) can make for more readable and maintainable code, as well as facilitating cut-and-paste re-use and refactoring. The optimizer can typically flatten them very easily.
I typically use stacked CTEs instead of nesting for a little better readability (same two examples):
WITH B AS (
SELECT x, SUM(y)
FROM B
GROUP BY x
)
SELECT *
FROM A
LEFT JOIN B
ON B.x = A.x
WITH A1 AS (
SELECT x * 2 AS y
FROM A
)
,A2 AS (
SELECT *
,x + y AS z1
,x - y AS z2
FROM A1
)
SELECT x, y, z1, z2
FROM A2
Regarding your question about:
SELECT x, x * 2 AS y, x + x*2 AS z1, x - x*2 AS z2
FROM A
This has the x * 2 code repeated 3 times. If this business rule needs to change, it will have to change in 3 places - a recipe for injection of defects. This gets compounded any time you have intermediate calculations which need to be consistent and defined in only one place.
This would not be as much of a problem if SQL Server's scalar user-defined functions could be inlined (or if they performed acceptably): you could simply build your UDFs to stack your results and the optimizer would eliminate redundant calls. Unfortunately SQL Server's scalar UDF implementation cannot handle that well for large sets of rows.
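As an aside (a sketch of an alternative, not something from the answer above): in SQL Server you can also define the shared intermediate value once with CROSS APPLY over a one-row correlated expression, avoiding both the nesting and the repetition:

SELECT A.x, calc.y, A.x + calc.y AS z1, A.x - calc.y AS z2
FROM A
CROSS APPLY (SELECT A.x * 2 AS y) AS calc;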
I typically use a derived table (or a CTE, which is a sometimes-superior alternative to derived queries in SQL 2005/2008) to simplify reading and building queries, or in cases where SQL doesn't allow me to do a particular operation.
For example, one of the things you can't do without a derived table or CTE is use a window function (such as ROW_NUMBER) or an aggregate in a WHERE clause. This won't work:
SELECT name, city, joindate
FROM members
INNER JOIN cities ON cities.cityid = members.cityid
WHERE ROW_NUMBER() OVER (PARTITION BY cityid ORDER BY joindate) = 1
But this will work:
SELECT name, city, joindate
FROM
(
SELECT name,
cityid,
joindate,
ROW_NUMBER() OVER (PARTITION BY cityid ORDER BY joindate) AS rownum
FROM members
) derived INNER JOIN cities ON cities.cityid = derived.cityid
WHERE rownum = 1
Advanced caveats, especially for large-scale analytics
If you're working on relatively small data sets (not gigabytes) you can probably stop reading here. If you're working with gigabytes or terabytes of data and using derived tables, read on...
For very large-scale data operations, it's sometimes preferable to create a temporary table instead of using a derived query. This may happen if SQL Server's statistics suggest that your derived query will return many more rows than it actually does, which happens more often than you'd think. Queries where your main query self-joins with a non-recursive CTE are also problematic.
It's also possible that derived tables will generate unexpected query plans. For example, even if you put a strict WHERE clause in your derived table to make that query very selective, SQL Server may re-order your query plan so that the WHERE clause is evaluated later in the plan than you expect. See this Microsoft Connect feedback for a discussion of this issue and a workaround.
So, for very performance-intensive queries (especially data-warehousing queries on 100GB+ tables), I always like to prototype a temporary-table solution to see if it gives better performance than a derived table or CTE. This seems counter-intuitive since you're doing more I/O than an ideal single-query solution, but with temp tables you get total control over the query plan used and the order in which each subquery is evaluated. Sometimes this can increase performance 10x or more.
I also tend to prefer temp tables in cases where I have to use query hints to force SQL to do what I want -- if the SQL optimizer is already "misbehaving", temp tables are often a clearer way to force it to act the way you want.
I'm not suggesting this is a common case-- most of the time the temporary table solution will be at least a little worse and sometimes query hints are one's only recourse. But don't assume that a CTE or derived-query solution will be your fastest option either. Test, test, test!
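As a hedged sketch of that prototyping approach, using the A/B example from earlier in this thread: materialize the aggregated step into a temp table, index it if that helps, then join, so you control exactly when each step runs:

SELECT x, SUM(y) AS sum_y
INTO #b_agg
FROM B
GROUP BY x;

CREATE CLUSTERED INDEX IX_b_agg ON #b_agg (x);

SELECT A.*, b.sum_y
FROM A
LEFT JOIN #b_agg AS b ON b.x = A.x;

Each statement gets its own plan, so a bad cardinality estimate on the intermediate result can't poison the whole query the way it sometimes does with a derived table or CTE.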
Derived tables often replace correlated subqueries and are generally considerably faster.
They can also be used to greatly limit the number of records searched through for a large table, and thus may also improve the speed of the query.
As with all potentially performance improving techniques, you need to test to see if they did improve performance. A derived table will almost always strongly outperform a correlated subquery but there is the possibility it may not.
Further, there are times when you need to join to data containing an aggregate calculation, which is almost impossible to do without a derived table or CTE (which is essentially another way of writing a derived table in many cases).
Derived tables are one of my most useful ways of figuring out complex data for reporting as well. You can do this in pieces using table variables or temp tables too, but if you don't want to see the code in procedural steps, people often change them to derived tables once they work out what they want using temp tables.
Aggregating data from a union is another place where you need derived tables.
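For example (a minimal sketch; the Sales2023/Sales2024 tables are hypothetical), aggregating over a UNION has to go through a derived table or CTE:

SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM
(
    SELECT CustomerID, Amount FROM Sales2023
    UNION ALL
    SELECT CustomerID, Amount FROM Sales2024
) AS all_sales
GROUP BY CustomerID;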
Using your terminology and example, the derived table is only more complex, with no advantages. However, some things require a derived table (in the most complex cases these can be CTEs, as demonstrated above). Even simple joins can demonstrate the necessity of derived tables: all you need to do is craft a query that requires joining to an aggregate. Here we use a variant of the quota query to demonstrate this.
Select each customer's most expensive transaction(s):
SELECT transactions.*
FROM transactions
JOIN (
select user_id, max(spent) AS spent
from transactions
group by user_id
) as derived_table
ON derived_table.user_id = transactions.user_id
AND derived_table.spent = transactions.spent
In this case, the derived table allows YEAR(O.OrderDate) = 1996 in a WHERE clause.
In the outer where clause, it's useless because it would change the JOIN to INNER.
Personally, I prefer the derived table (or CTE) construct because it puts the filter into the correct place
Another example:
SELECT
C.CustomerID, C.CompanyName,
COUNT(D.OrderID) AS TotalOrders,
COUNT(DISTINCT D.ProductID) AS DifferentProducts
FROM
Customers C
LEFT OUTER JOIN
(
SELECT
O.OrderID, O.CustomerID, P.ProductID
FROM
Orders O
JOIN
Products P ON O.somethingID = P.somethingID
WHERE YEAR(O.OrderDate) = 1996
) D
ON C.CustomerID = D.CustomerID
GROUP BY
C.CustomerID, C.CompanyName
In this case, you couldn't capture both products and order aggregates if there is no relation between Customers and Products. Of course, this is contrived but I hope I've captured the concept
Edit:
I need to explicitly JOIN T1 and T2 before the JOIN onto MyTable. It does happen. The derived T1/T2 join can give different results from two LEFT JOINs with no derived table. It happens quite often:
SELECT
--stuff--
FROM
myTable M1
LEFT OUTER JOIN
(
SELECT
T1.ColA, T2.ColB
FROM
T1
JOIN
T2 ON T1.somethingID = T2.somethingID
WHERE
--filter--
) D
ON M1.ColA = D.ColA AND M1.ColB = D.ColB