I am using two tables with a one-to-many mapping (in DB2).
I need to fetch 20 records at a time using ROW_NUMBER from the two tables with a LEFT JOIN. But due to the one-to-many mapping, the result is not consistent: I may get 20 records, but they do not contain 20 unique records of the first table.
SELECT
    A.*,
    B.*,
    ROW_NUMBER() OVER (ORDER BY A.COLUMN_1 DESC) as rn
from
    table1 A
LEFT JOIN
    table2 B ON A.COLUMN_3 = B.COLUMN3
where
    rn between 1 and 20
Please suggest a solution.
Sure, this is easy... once you know that you can use subqueries as a table reference:
SELECT <relevant columns from Table1 and Table2>, rn
FROM (SELECT <relevant columns from Table1>,
ROW_NUMBER() OVER (ORDER BY <relevant columns> DESC) AS rn
FROM table1) Table1
LEFT JOIN Table2
ON <relevant equivalent columns>
WHERE rn >= :startOfRange
AND rn < :startOfRange + :numberOfElements
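For example, applied to the tables in the question, a hedged sketch could look like this (COLUMN_1, COLUMN_3 and COLUMN3 are taken from the question; everything else, including COLUMN_1 being unique enough to page on, is an assumption):
SELECT t1.COLUMN_1, t1.COLUMN_3, t2.COLUMN3, t1.rn
FROM (SELECT A.COLUMN_1, A.COLUMN_3,
             ROW_NUMBER() OVER (ORDER BY A.COLUMN_1 DESC) AS rn
      FROM table1 A) t1
LEFT JOIN table2 t2
       ON t1.COLUMN_3 = t2.COLUMN3
WHERE t1.rn >= :startOfRange
  AND t1.rn < :startOfRange + :numberOfElements
Because ROW_NUMBER() is computed over table1 alone, each page covers exactly :numberOfElements distinct table1 rows, even if the LEFT JOIN then expands some of them into several output rows.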
For production code, never do SELECT * - always explicitly list the columns you want (there are several reasons for this).
Prefer inclusive lower-bound (>=), exclusive upper-bound (<) for (positive) ranges. For everything except integral types, this is required to sanely/cleanly query the values. Do this with integral types both to be consistent, as well as for ease of querying (note that you don't actually need to know which value you "stop" on). Further, the pattern shown is considered the standard when dealing with iterated value constructs.
Note that this query currently has two problems:
You need to list sufficient columns for the ORDER BY to return consistent results. This is best done by using a unique value - you probably want something in an index that the optimizer can use.
Every time you run this query, you (usually) have to order the ENTIRE set of results before you can get whatever slice of them you want (especially for anything after the first page). If your dataset is large, look at the answers to this question for some ideas for performance improvements. The optimizer may be able to cache the results for you, but it's not guaranteed (especially on tables that receive many updates).
I have two tables with 2 columns. I cross join and subtract the values. I then compute the row_number ordered by the absolute difference and choose the row where row_number = 1; in other words, for each t1.id I'm finding the t2.id that has the closest val.
These tables are quite large. Is the row_number function doing a lot of extra, unneeded work by ordering everything beyond rank 1? I only need the lowest-ranked row. Is there a more efficient way to write this?
Table 1
id      val
A1       0.123456
A2       1.123456
A3      -0.712345

Table 2
id      val
B1       0.065432
B2       1.654321
B3      -0.654321
--find the t2.id with the closest value to t1.id's val
with cj as (
  select
    t1.id as t1_id, t2.id as t2_id,
    row_number() over (partition by t1.id order by abs(t1.val - t2.val)) as rw
  from t1
  cross join t2
)
select * from cj where rw = 1
It is possible to run this faster - it depends on how many rows are in t1, t2, and how much flexibility you have to add indexes etc.
As #Chris says, sorting (especially sorting many times) can be a killer. Because the cost of sorting grows super-linearly (roughly O(n log n)) with the number of values being sorted, it gets increasingly worse the more rows you have. If t2 only had two rows, sorting would be easy and your original method would probably be the most efficient. However, if t2 has many rows, it becomes much, much harder. And if t1 also has many rows and you're doing a sort for every one of them, that multiplies the cost again.
As such, for testing purposes, I have used 1,000 rows in each of t1 and t2 below.
Below I compare several approaches, with indicators of speed and processing. (Spoiler alert) If you can pre-sort the data (as in #Chris' suggestion) you can get some big improvements.
I don't use Databricks (sorry) and cannot measure speeds etc. on it. Therefore the below is written and tested in SQL Server, but I would guess it can be modified to work in Databricks fairly easily. I think the main difference is the OUTER APPLY used here - in Spark SQL the equivalent is a lateral join (e.g., How to use outer apply in Spark sql): CROSS APPLY corresponds to an inner lateral join, while OUTER APPLY corresponds to LEFT JOIN LATERAL.
I created the two tables and filled them with 1,000 rows each.
CREATE TABLE #t1 (A_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
CREATE TABLE #t2 (B_id nvarchar(10) PRIMARY KEY, val decimal(10,8));
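The fill itself isn't shown; purely as a hedged sketch (not from the original post), the 1,000 random rows per table could be generated in SQL Server along these lines - sys.all_objects is only used as a convenient row source:
-- hypothetical test-data fill (values in the range -1 to 1, like the sample data)
INSERT INTO #t1 (A_id, val)
SELECT 'A' + CAST(n AS nvarchar(9)),
       CAST(RAND(CHECKSUM(NEWID())) * 2 - 1 AS decimal(10,8))
FROM (SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects) AS nums;

INSERT INTO #t2 (B_id, val)
SELECT 'B' + CAST(n AS nvarchar(9)),
       CAST(RAND(CHECKSUM(NEWID())) * 2 - 1 AS decimal(10,8))
FROM (SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects) AS nums;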
Original approach - sort all rows
Your original query takes very few data reads, but the cost is the amount of sorting it needs to do. Because ROW_NUMBER() sorts all the rows, and then you only take 1, this is your major cost (as #Chris says).
-- Original query
with cj as (
select
#t1.A_id, #t2.B_id,
row_number() over (partition by #t1.A_id order by abs(#t1.val - #t2.val)) as rw
from #t1
cross join #t2
)
select * from cj where rw = 1;
On my computer, this took 1600ms of CPU time.
Approach 2 - taking the MIN() value
However, as you only need the minimum row, there is no need to sort the other rows at all. Taking a 'min' only requires one scan through the data for each data point in t1, picking the smallest value as it goes.
However, once you have the smallest value, you then need to refer to t2 again to get the relevant t2 IDs.
In other words, the logic of this is
spend less time determining only the smallest absolute difference (instead of sorting them all)
spend more reads and more time finding which value(s) of t2 get that absolute difference
-- Using MIN() to find smallest difference
with cj as (
select
#t1.A_id, #t1.val,
MIN(abs(#t1.val - #t2.val)) AS minvaldif
from #t1
cross join #t2
GROUP BY #t1.A_id, #t1.val
)
select cj.A_ID,
#t2.B_id
FROM cj
CROSS JOIN #t2
WHERE abs(cj.val - #t2.val) = minvaldif;
This took my computer about half the time of the original - about 800ms of computer time - but more than doubles the amount of data reads it does. Note also that it can return several rows if (say) there are repeats of values in t2.
Approach 3 - lateral join
In this case, you do a lateral join (in SQL Server, it's an 'OUTER APPLY') to select just the 1 minimum value you need. Similar to above, you find the min value, but you do it individually for each row in t1. Therefore you need to do 1000 'min' values rather than 1000 sorts.
-- Lateral join with min difference
SELECT #t1.A_id, t2_calc.B_id
FROM #t1
OUTER APPLY
(SELECT TOP (1) #t2.B_Id
FROM #T2
ORDER BY abs(#t1.val - #t2.val)
) AS t2_calc;
This is the most efficient so far - with few reads and only 300ms of compute time. If you cannot add indexes, this is probably the best you could do.
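For reference, a hedged sketch of the same pattern written as a standard lateral join (PostgreSQL-style syntax, using the question's t1/t2 names rather than the temp tables; this is not part of the original answer and is untested on Databricks):
SELECT t1.id AS t1_id, closest.id AS t2_id
FROM t1
LEFT JOIN LATERAL (
    SELECT t2.id
    FROM t2
    ORDER BY abs(t1.val - t2.val)
    LIMIT 1
) AS closest ON true;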
Option 4 - pre-sort the data with an index
If you can pre-sort the data using an index, then you can boost your efficiency by a lot.
CREATE NONCLUSTERED INDEX #IX_t2_val ON #t2 (val);
The 'gotcha' is that even if you have an index on t2.val, databases will have a problem with min(abs(t1.val - t2.val)) - they will usually still need to read all the data rather than use the index.
However, you can use the logic you identified in your question - that min(abs(difference)) value is the one where t1.val is closest to t2.val.
In the method below:
For every t1.val, find the closest t2 row that is less than or equal to it.
Also find, for every t1.val, the closest t2 row that is above it.
Then, using the logic from your original query, take whichever of those two is closest.
This also uses lateral views
-- Using indexes
with cj as
(SELECT #t1.A_id, #t1.val AS A_val, t2_lessthan.B_id, t2_lessthan.val AS B_val
FROM #t1
CROSS APPLY
(SELECT TOP (1) #t2.B_Id, #t2.val
FROM #T2
WHERE #t2.val <= #t1.val
ORDER BY #t2.val DESC
) AS t2_lessthan
UNION ALL
SELECT #t1.A_id, #t1.val AS A_val, t2_greaterthan.B_id, t2_greaterthan.val AS B_val
FROM #t1
CROSS APPLY
(SELECT TOP (1) #t2.B_Id, #t2.val
FROM #T2
WHERE #t2.val > #t1.val
ORDER BY #t2.val
) AS t2_greaterthan
),
cj_rn AS
(SELECT A_id, B_id,
row_number() over (partition by A_id order by abs(A_val - B_val)) as rw
FROM cj
)
select * from cj_rn where rw = 1;
Compute time: 4ms.
For each value in t1, it simply does 2 index seeks in t2 and 'sorts' the two values (which is very easy). As such, in this case, it is orders of magnitude faster.
So... really the best approach is if you can pre-sort the data (in this case by adding indexes) and then make sure you take advantage of that sort.
This is a case where procedural code is better than the set logic used in SQL. If you open a cursor on each of table1 and table2 (separately), both ordered by val, you can take advantage of the ordering and avoid comparing EVERY combination of As and Bs.
Using Table2 as the primary, prime the 'pump' by reading the first (lowest) value from Table1 into variable FirstA and the second value from Table1 into variable SecondA.
First, loop while the next B < FirstA, outputting B & FirstA each time: every A after FirstA will be farther away, because the list is ordered.
Now loop over the Table2 cursor, reading each B value in turn. While B > SecondA, move SecondA into FirstA and read the next value from Table1 into SecondA (or note end of cursor). At that point B lies between FirstA and SecondA, so one of those two is closest: compare the abs(differences), output the smaller, and proceed to the next loop iteration.
That's all there is to it. The time-consuming part is sorting the two tables inside their cursors, which is O(n log n) and O(m log m). The comparison pass itself is linear, O(n + m).
Hopefully, your database manager has a procedural scripting language that will make this easy.
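To make that concrete, here is a hedged T-SQL sketch of the merge-scan described above, reusing the #t1/#t2 test tables from the previous answer (the variable names, the PRINT output and the tie handling are my assumptions):
DECLARE @B_id nvarchar(10), @B_val decimal(10,8);
DECLARE @FirstA_id nvarchar(10), @FirstA_val decimal(10,8);
DECLARE @SecondA_id nvarchar(10), @SecondA_val decimal(10,8);

DECLARE curA CURSOR FAST_FORWARD FOR SELECT A_id, val FROM #t1 ORDER BY val;
DECLARE curB CURSOR FAST_FORWARD FOR SELECT B_id, val FROM #t2 ORDER BY val;

OPEN curA;
FETCH NEXT FROM curA INTO @FirstA_id, @FirstA_val;   -- lowest A
FETCH NEXT FROM curA INTO @SecondA_id, @SecondA_val; -- next A
IF @@FETCH_STATUS <> 0 SET @SecondA_id = NULL;

OPEN curB;
FETCH NEXT FROM curB INTO @B_id, @B_val;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- advance the A window until this B falls between FirstA and SecondA
    WHILE @SecondA_id IS NOT NULL AND @B_val > @SecondA_val
    BEGIN
        SELECT @FirstA_id = @SecondA_id, @FirstA_val = @SecondA_val;
        SET @SecondA_id = NULL;
        FETCH NEXT FROM curA INTO @SecondA_id, @SecondA_val;
        IF @@FETCH_STATUS <> 0 SET @SecondA_id = NULL;
    END;

    -- the closest A to this B is one of the two bracketing values
    IF @SecondA_id IS NULL
       OR ABS(@B_val - @FirstA_val) <= ABS(@B_val - @SecondA_val)
        PRINT CONCAT(@B_id, ' -> ', @FirstA_id);
    ELSE
        PRINT CONCAT(@B_id, ' -> ', @SecondA_id);

    FETCH NEXT FROM curB INTO @B_id, @B_val;
END;

CLOSE curA; DEALLOCATE curA;
CLOSE curB; DEALLOCATE curB;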
I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count sys_timestamp
100 167877672772
110 167877672769
121 167877672987
111 167877673877
... ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses (range +/- 10) within the defined time window (+1000 [milliseconds]).
However, I will have to run such queries for tables with several million (perhaps even 100+ million) entries, and even with about 500k rows the query takes a looooooong time to complete. Furthermore, whereas the time frame always remains the same within a given query [though the window size can change from query to query], in some instances I will have to use maybe 10 additional conditions similar to the positive / negative excesses in the same query.
Thus, I am looking for ways to improve the above query primarily to achieve better performance considering primarily the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query aside from reusing the (partial) subquery that materially increases the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
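For example (a hedged sketch; the index names are mine):
CREATE INDEX tablea_sys_timestamp_idx ON tableA (sys_timestamp);
-- or, usually better here, the multicolumn version that enables index-only scans:
CREATE INDEX tablea_sys_timestamp_event_count_idx ON tableA (sys_timestamp, event_count);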
Depending on the exact data distribution (most importantly, how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is "with" statements (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step includes creating parent-child detail rows - table multiplied by itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary index or you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
  SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
        ,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
  FROM tableA t1, tableA t2
  where t2.sys_timestamp between t1.sys_timestamp AND t1.sys_timestamp + 1000
), posSummary as
(
  select bq.startEventCount, count(*) as positive
  from baseQuery bq
  where bq.t2EventCount >= bq.pEndEventCount
  group by bq.startEventCount
), negSummary as
(
  select bq.startEventCount, count(*) as negative
  from baseQuery bq
  where bq.t2EventCount <= bq.nEndEventCount
  group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count=ps.startEventCount
inner join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
With statements are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and how to minimize the rows the query has to carry further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.
We're currently trying to improve a system that allows the user to filter and sort a large list of objects (> 100k) by fields that are not being displayed. Since the fields can be selected dynamically we'd plan to build the query dynamically as well.
That doesn't sound too hard and the basics are done easily, but the problem lies in how the data is structured. In some cases more or less expensive joins would be needed, which can add up to quite an expensive query, especially when those joins are combined (i.e. select * from table join some_expensive_join join another_expensive_join ...).
Filtering wouldn't be that big a problem since we can use intersections.
Ordering, however, would require us to first build a table that contains all the necessary data, which, if done via a huge select statement with all those joins, would become quite expensive.
So the question is: is there a more efficient way to do that?
I could think of doing it like this:
do a select query for the first column and order by that
for all elements that basically have the same order (e.g. same value) do another query to resolve that
repeat the step above until the order is unambiguous or we run out of sort criteria
Does that make sense? If yes, how could this be done in PostgreSQL 9.4? (We currently can't upgrade, so 9.5+ solutions, though welcome, wouldn't help us right now.)
Does this help, or is it too trivial? (the subqueries could be prefab join views)
SELECT t0.id, t0.a,t0.b,t0.c, ...
FROM main_table t0
JOIN ( SELECT t1.id AS id
, rank() OVER (ORDER BY whatever) AS rnk
FROM different_tables_or_JOINS
) AS t1 ON t1.id=t0.id
JOIN ( SELECT t2.id AS id
, rank() OVER (ORDER BY whatever) AS rnk
FROM different_tables_or_JOINS2
) AS t2 ON t2.id=t0.id
...
ORDER BY t1.rnk
, t2.rnk
...
, t0.id
;
Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate". It does this by summing records with an older TheDate value. Furthermore, records with different values for the compound field (Field1,Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use correlated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parallel with Method 1 can be drawn. The role is the same; the 2nd instance of Table1 is for accumulating the counts of the records which have TheDate less than that of any record in MainQuery (1st instance of Table1) with the same Field1 and Field2 values.
Note that in Method 2, Field3 is included in the GROUP BY clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP BY applies to Field3 in MainQuery.
I found that Method 1 is noticeably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is what happens in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practices or considerations in choosing an approach for time-based accumulation?
I've posted this to Microsoft Answers and Stack Exchange.
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, just as some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated subquery has to aggregate all the data in the table. The correlated version only has to aggregate matching fields, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). That means that 100 aggregations of 10 records takes less time than 1 aggregation of 1000 records.
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its results records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine a particular combination of GROUP BY field values matching (say) THREE records in Table1. You will then get a single SUM for all of them. The problem is, each of these 3 records in MainQuery also matches all 3 of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.
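For what it's worth, a hedged sketch of that last idea in Access SQL, using the field names from the question (whether your Access version accepts the derived table exactly like this is an assumption):
SELECT MainQuery.Field1,
       MainQuery.Field2,
       MainQuery.Field3,
       MainQuery.TheDate,
       SUM(1) AS RunningCounter
FROM (SELECT DISTINCT Field1, Field2, Field3, TheDate FROM Table1) AS MainQuery
     INNER JOIN Table1 AS InnerQuery
     ON InnerQuery.Field1 = MainQuery.Field1 AND
        InnerQuery.Field2 = MainQuery.Field2 AND
        InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.Field3,
         MainQuery.TheDate
ORDER BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.TheDate,
         MainQuery.Field3
With the DISTINCT in place, each InnerQuery row is counted only once per distinct combination, so the over-counting described above goes away (at the price of getting one output row per distinct combination rather than one per Table1 record). If Field3 can differ between otherwise-identical records, it would have to be left out of the DISTINCT and the GROUP BY as well.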
I've inherited a SQL Server based application and it has a stored procedure that contains the following, but it hits a timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives, such as ROW_NUMBER() OVER (PARTITION BY ...).
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query; the compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek of WHERE B.ICID = 2 you need an index on BItems(ICID).
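For example (a hedged sketch; the index names and the dbo schema are assumptions):
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);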
The query could also probably be expressed as a correlated APPLY, because it seems that that is what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
SQL Fiddle.
This is not semantically the same query as the OP's: the OP's query would return multiple rows on a StatusTime collision. My guess, though, is that this is what is desired ('the most recent BData for this BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick : avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue where a simple query involving max() is taking more than 1 hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the where clause condition will be fetched. In your case, every record in the table will need to be fetched before performing the max() function. Also, indexing BData.StatusTime on its own will not speed up the query: an index is useful for looking up a particular record, but it will not help with performing the comparison.
In my case, I didn't have the group by, so all I did was use an ORDER BY ... DESC clause and SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully, your query can be sped up.
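Purely as an illustration, that pattern for a single BID would look something like this (the @BID parameter is hypothetical, not from the question):
SELECT TOP (1) *
FROM dbo.BData
WHERE BID = @BID
ORDER BY StatusTime DESC;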
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE B.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery. But more than likely it won't yield any faster results.
The best option would probably be to add an index on BID with an INCLUDE containing StatusTime, and if possible filter it to the InternalIDs matching BItems.ICID = 2.
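A hedged sketch of that index (the name is mine; note that a filtered index cannot reference BItems directly, so the ICID = 2 restriction would have to be handled some other way):
CREATE NONCLUSTERED INDEX IX_BData_BID_incl_StatusTime
    ON dbo.BData (BID)
    INCLUDE (StatusTime);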
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so have given-up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
Thanks again for the suggestions.
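As a hedged sketch of that 'latest readings' idea (the table name, column types and MERGE-based upsert are all assumptions, not from the original application):
CREATE TABLE dbo.BDataLatest
(
    BID        int      NOT NULL PRIMARY KEY,
    StatusTime datetime NOT NULL
    -- ...plus whichever BData columns the reports actually need
);

-- run for each incoming reading (@BID / @StatusTime are hypothetical parameters)
MERGE dbo.BDataLatest AS tgt
USING (SELECT @BID AS BID, @StatusTime AS StatusTime) AS src
    ON tgt.BID = src.BID
WHEN MATCHED AND src.StatusTime > tgt.StatusTime THEN
    UPDATE SET StatusTime = src.StatusTime
WHEN NOT MATCHED THEN
    INSERT (BID, StatusTime) VALUES (src.BID, src.StatusTime);
Fetching the latest readings then becomes a straight join of dbo.BDataLatest to dbo.BItems filtered on ICID = 2.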