I have a SQL table that contains data of the form:
Id int
EventTime dateTime
CurrentValue int
The table may have multiple rows for a given id that represent changes to the value over time (the EventTime identifying the time at which the value changed).
Given a specific point in time, I would like to be able to calculate the count of distinct Ids for each given Value.
Right now, I am using a nested subquery and a temporary table, but it seems it could be much more efficient.
SELECT [Id],
(
SELECT
TOP 1 [CurrentValue]
FROM [ValueHistory]
WHERE [Ids].[Id]=[ValueHistory].[Id] AND
[EventTime] < @StartTime
ORDER BY [EventTime] DESC
) as [LastValue]
INTO #temp
FROM [Ids]
SELECT [LastValue], COUNT([LastValue])
FROM #temp
GROUP BY [LastValue]
DROP TABLE #temp
Here is my first go:
select ids.Id, count( distinct currentvalue)
from ids
join valuehistory vh on ids.id = vh.id
where vh.eventtime < @StartTime
group by ids.id
However, I am not sure I understand your table model very clearly, or the specific question you are trying to solve.
This would give you, for each Id, the count of distinct 'currentvalues' in valuehistory before a certain date.
Is that what you are looking for?
I think I understand your question.
You want to get the most recent value for each id, group by that value, and then see how many ids have that same value? Is this correct?
If so, here's my first shot:
declare @StartTime datetime
set @StartTime = '20090513'
select ValueHistory.CurrentValue, count(ValueHistory.id)
from
(
select id, max(EventTime) as LatestUpdateTime
from ValueHistory
where EventTime < @StartTime
group by id
) CurrentValues
inner join ValueHistory on CurrentValues.id = ValueHistory.id
and CurrentValues.LatestUpdateTime = ValueHistory.EventTime
group by ValueHistory.CurrentValue
No guarantee that this is actually faster though - for this to work with any decent speed you'll need an index on EventTime.
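For example, something along these lines (assuming SQL Server; the index name and the INCLUDE column are my own suggestion, not part of the original post):
CREATE NONCLUSTERED INDEX IX_ValueHistory_Id_EventTime
ON ValueHistory (Id, EventTime)
INCLUDE (CurrentValue);
Covering Id and EventTime (and carrying CurrentValue) lets both the MAX(EventTime) grouping and the self-join seek rather than scan.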
Let us keep in mind that, because the SQL language describes what you want and not how to get it, there are many ways of expressing a query that will eventually be turned into the same query execution plan by a good query optimizer. Of course, the level of "good" depends on the database you're using.
In general, subqueries are just a syntactically different way of describing joins. The query optimizer is going to recognize this and determine, to the best of its knowledge, the optimal way to execute the query. Temporary tables may be created as needed. So in many cases, re-working the query is going to do nothing for your actual execution time -- it may come out to the same query execution plan in the end.
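For instance, a correlated EXISTS subquery and the equivalent join-plus-DISTINCT (a small sketch borrowing the poster's table names) can be turned into the same semi-join plan by a good optimizer:
SELECT i.Id
FROM Ids i
WHERE EXISTS (SELECT 1 FROM ValueHistory v WHERE v.Id = i.Id);

SELECT DISTINCT i.Id
FROM Ids i
INNER JOIN ValueHistory v ON v.Id = i.Id;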
If you're going to attempt to optimize, you need to examine the query plan by asking the database to explain (describe) how it will execute that query. Make sure it's not doing full-table scans against large tables, and is picking the appropriate indices where possible. If, and only if, it is making sub-optimal choices here should you attempt to manually optimize the query.
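In SQL Server (which the #temp syntax here suggests), one way to capture the actual plan alongside the results is a sketch like this:
SET STATISTICS XML ON;
-- run the query (or the whole #temp batch) here; each statement returns its plan as XML
SET STATISTICS XML OFF;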
Now, having said all that, the query you pasted isn't entirely compatible with your stated goal of "calculat[ing] the count of distinct Ids for each given Value". So forgive me if I don't quite answer your need, but here's something to perf-test against your current query. (Syntax is approximate, sorry -- away from my desk).
SELECT ids.[Id], vh1.[CurrentValue], COUNT(vh2.[CurrentValue])
FROM [Ids] AS ids
JOIN [ValueHistory] AS vh1 ON ids.[Id] = vh1.[Id]
JOIN [ValueHistory] AS vh2 ON vh1.[CurrentValue] = vh2.[CurrentValue]
GROUP BY ids.[Id], vh1.[CurrentValue];
Note that you'll probably see better performance increases by adding indices to make those joins optimal than re-working the query, assuming you're willing to take the performance hit to update operations.
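A sketch of the kind of indexes meant here (the names are my own), matching the two join predicates above:
CREATE INDEX IX_ValueHistory_Id ON ValueHistory (Id);
CREATE INDEX IX_ValueHistory_CurrentValue ON ValueHistory (CurrentValue);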
Related
I have a Netezza query with a WHERE clause that includes several hundred potential strings. I'm surprised that it runs, but it takes time to complete and occasionally errors out ('transaction rolled back by client'). Here's a pseudo code version of my query.
SELECT
TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
X.I_SRC_NM AS CHANNEL,
X.I_CD AS CODE,
COUNT(DISTINCT CASE WHEN X.I_FLG = 1 THEN X.UID ELSE NULL END) AS WIDGETS
FROM
(SELECT
A.I_TS,
A.I_SRC_NM,
A.I_CD,
B.UID,
B.I_FLG
FROM
SCHEMA.DATABASE.TABLE_A A
LEFT JOIN SCHEMA.DATABASE.TABLE_B B ON A.UID = B.UID
WHERE
A.I_TS BETWEEN '2017-01-01' AND '2017-01-15'
AND B.TAB_CODE IN ('00AV', '00BX', '00C2', '00DJ'...
...
...
...
...
...
...
...)
) X
GROUP BY
X.I_TS,
X.I_SRC_NM,
X.I_CD
;
In my query, I'm limiting the results on B.TAB_CODE to about 1,200 values (out of more than 10k). I'm honestly surprised that it works at all, but it does most of the time.
Is there a more efficient way to handle this?
If the IN clause becomes too cumbersome, you can split your query into multiple parts. Create a temporary table containing the TAB_CODE set, then use it in a JOIN.
WITH tab_codes(tab_code) AS (
SELECT '00AV'
UNION ALL
SELECT '00BX'
--- etc ---
)
SELECT
TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
X.I_SRC_NM AS CHANNEL,
--- etc ---
INNER JOIN tab_codes Q ON B.TAB_CODE = Q.tab_code
If you want to boost performance even more, consider using a real temporary table (CTAS).
We've seen situations where it's "cheaper" to CTAS the original table to another, distributed on your primary condition, and then query that table instead.
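A rough sketch of that approach (the table and column names here are my own, and the exact temp-table/CTAS syntax should be checked against your Netezza version):
CREATE TEMP TABLE tmp_tab_codes (tab_code CHAR(4)) DISTRIBUTE ON (tab_code);
INSERT INTO tmp_tab_codes VALUES ('00AV');
INSERT INTO tmp_tab_codes VALUES ('00BX');
-- ... load the remaining ~1,200 codes the same way, or CTAS them from wherever they live ...
-- then replace the huge IN list with:
-- INNER JOIN tmp_tab_codes Q ON B.TAB_CODE = Q.tab_code
Distributing the temp table on tab_code keeps the join co-located if TABLE_B is distributed the same way.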
If I'm guessing correctly, X.I_TS is in fact a timestamp, and as such I expect it to contain many different values per day. Can you confirm that?
If I'm right, the query can possibly benefit from changing the 'GROUP BY X.I_TS, ...' to 'GROUP BY 1, ...'
Furthermore, the 'COUNT(DISTINCT CASE ...' can never return anything other than 1 or NULL. Can you confirm that?
If I'm right on that, you can get rid of the expensive DISTINCT by changing it to 'MAX(CASE ...'
Can you follow me :)
I tried running this query against two tables of very different sizes - #temp was about 15,000 rows, and Member is about 70,000,000 rows, about 68,000,000 of which do not have CompanyID 307.
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id as varchar) NOT IN (
SELECT IndividualID
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307)
This query ran for 18 hours, before I killed it and tried something else, which was:
SELECT IndividualID
INTO #source
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS VARCHAR) NOT IN (
SELECT IndividualID
FROM #source)
And this ran for less than a second before giving me a result.
I was pretty surprised by this. I'm a middle-tier developer rather than a SQL expert and my understanding of what goes on under the hood is a little murky, but I would have presumed that, since the sub-query in my first attempt is the exact same code, asking for the exact same data as in the second attempt, that these would be roughly equivalent.
But that's obviously wrong. I can't look at the execution plan for my original query to see what SQL Server is trying to do. So can someone kindly explain why splitting the data out into a temp table is so much faster?
EDIT: Table schemas and indexes
The #temp table has two columns, Individual_ID int and Source_Code varchar(50)
Member and Person are more complex. They have 29 and 13 columns respectively, so I don't really want to post them all in full. PersonID is an int and is the PK on Person and an FK on Member. IndividualID is a column on Person - this is not clear in the query as written.
I tried using a LEFT JOIN instead of NOT IN before asking the question. The performance on the second query wasn't noticeably different - both were sub-second. On the first query I let it run for an hour before stopping it, presuming it would make no significant difference.
I also added an index on #source, just like on the original table, so the performance impact should be identical.
First, your query has two faux pas that really stick out. You are converting to varchar(), but you do not include a length argument. This should not be allowed! The default length varies by context and you need to be explicit.
Second, you are matching two keys in different tables and they seemingly have different types. Foreign key references should always have the same type. This can have a very big impact on performance. If you are dealing with tables that have millions of rows, then you need to pay some attention to the data structure.
To understand the difference in performance, you need to understand execution plans. The two queries have very different execution plans. My (educated) guess is that the first version is using a nested loop join algorithm. The second version is using a more sophisticated algorithm. In your case, this would be due to the ability of SQL Server to maintain statistics on tables. So, instantiating the intermediate results actually helps the optimizer produce a better query plan.
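If you want to compare the two approaches concretely, SQL Server can report per-statement I/O and timing (a sketch; run each version separately):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the NOT IN version here, then the #source version,
-- and compare logical reads and CPU time in the Messages output along with the actual plans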
The subject of how best to write this logic has been investigated a lot. Here is a very good discussion on the subject by Aaron Bertrand.
I do agree with Aaron on the preference for not exists in this case:
SELECT COUNT(*)
FROM #temp t
WHERE NOT EXISTS (SELECT 1
                  FROM Member m
                  INNER JOIN Person p
                      ON p.PersonID = m.PersonID
                  WHERE CompanyID <> 307
                    AND p.IndividualID = t.individual_id
                 );
However, I don't know if this will have better performance in this particular case.
This line is probably what kills the first query
WHERE CAST(individual_id as varchar) NOT IN
My guess would be that this forces a table scan rather than using any indexes.
I have two tables that I need to join, the first table contains CustomerNumber and IdentificationNumber, and IdentificationType. The second table contains the IdentificationType, EffectiveDate, and EndDate.
My Query basically looks like this:
Select CustomerNumber, IdentificationNumber
From Identification i
Inner Join IdentificationType it On it.IdentificationType = i.IdentificationType
And it.EffectiveDate < @TodaysDate
And (it.EndDate IS NULL Or it.EndDate > @TodaysDate)
My execution plan is showing a clustered index scan on the identification type table, I'm assuming it's because of the OR in the join clause.
Is there a more efficient way to join, KNOWING that the EndDate field MUST allow Null, or a real datetime value?
I know you said the EndDate column MUST allow NULL, so just for the record: the most efficient way is to stop using NULLs in place of "no end date" in the IdentificationType table, and instead use 9999-12-31. Then your queries can skip the whole OR clause. (I understand this might require some application changes, but it would be worth it in my opinion for this exact reason--and I have seen this "NULL = open ended" pattern make queries difficult or perform badly over and over again in my own work and in SQL questions online.)
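As a sketch of what that buys you (assuming EndDate is populated with '9999-12-31' instead of NULL, and using the question's @TodaysDate):
SELECT CustomerNumber, IdentificationNumber
FROM dbo.Identification i
INNER JOIN dbo.IdentificationType it
    ON it.IdentificationType = i.IdentificationType
WHERE it.EffectiveDate < @TodaysDate
  AND it.EndDate > @TodaysDate;
With both predicates as plain range comparisons, an index on (IdentificationType, EffectiveDate, EndDate) is straightforward for the optimizer to use.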
Also, you might consider swapping the order of the two OR conditions--this may sound like voodoo but I believe I heard that there are some special cases where it can optimize better when the variable is first in this specific scenario (though I could be wrong).
In the meantime, would you try this and share how well it performs compared to your and other solutions?
SELECT
CustomerNumber, IdentificationNumber
FROM
dbo.Identification i
INNER JOIN dbo.IdentificationType it
ON it.IdentificationType = i.IdentificationType
WHERE
it.EffectiveDate < @TodaysDate
AND it.EndDate IS NULL
UNION ALL
SELECT
CustomerNumber, IdentificationNumber
FROM
dbo.Identification i
INNER JOIN dbo.IdentificationType it
ON it.IdentificationType = i.IdentificationType
WHERE
it.EffectiveDate < @TodaysDate
AND it.EndDate > @TodaysDate
;
I have recovered from poor performance with OR clauses by using this exact strategy. It is painful to explode the query size/complexity, but the possibility of getting just a few seeks is totally worth it compared to the scan you're dealing with now.
There is something fishy about your inequality comparisons: the first one should have an equals sign in it (<=). You didn't tell us the data type of the date columns and @TodaysDate, but best practice is to design the query so it does not fail for any input. So even if the variable is datetime and EffectiveDate has no time portion, that comparison should still be <= so that a query run at exactly midnight doesn't fail to include the data for that day.
P.S. Sorry about not preserving your formatting--I just understand queries better when formatted in my preferred style. Also, I moved the date conditions to the WHERE clause because in my opinion they are not part of the JOIN.
Try using isnull instead of the OR statement. I also think you should use Datediff instead of the comparison operator.
select CustomerNumber, IdentificationNumber
From Identification i
Inner Join IdentificationType it On it.IdentificationType = i.IdentificationType
And it.EffectiveDate < @TodaysDate
And (isnull(it.EndDate, @TodaysDate) >= @TodaysDate)
I've been waiting over an hour already for this query, so I know I'm probably doing something wrong. Is there an efficient way to tailor this query?
select RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
where RespondentID in (
1418283,
1419863,
1421188,
1422101,
1431384,
1435526,
1437284,
1441394,
/* etc etc THOUSANDS */
1579244 )
and EntryDate between
'07-11-2011' and '07-31-2012'
GROUP BY RespondentID
I know that my date range is pretty big, but I can't change that part (the dates are spread all over).
Also, the reason for MIN(SessionID) is that otherwise we get many SessionIDs for each Respondent, and one suffices (it takes the MIN of an alphanumeric ID like ach2a23a-adhsdx123... and gets the first alphabetically).
Thanks
Put your thousands of numbers in a temporary table.
Index the number field in that table.
Index the RespondentID field in BIG_SESSIONS
Join the two tables
eg:
select BIG_Sessions.RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
inner join RespondentsFilterTable
    on BIG_Sessions.RespondentID = RespondentsFilterTable.RespondentID
where EntryDate between '07-11-2011' and '07-31-2012'
GROUP BY BIG_Sessions.RespondentID
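The filter table itself might be created and indexed along these lines (a sketch; the loading step and index name are assumptions on my part):
CREATE TABLE RespondentsFilterTable (RespondentID int NOT NULL);  -- a #temp table works the same way
INSERT INTO RespondentsFilterTable (RespondentID)
VALUES (1418283), (1419863), (1421188);  -- ... and the rest of the list, or bulk-load them
CREATE INDEX IX_RespondentsFilter ON RespondentsFilterTable (RespondentID);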
You could add indexes to EntryDate and SessionID as well, but if you're adding to BIG_Sessions frequently, this could be counterproductive elsewhere.
In general, you can get hints of how the performance of a query can be improved by studying the estimated (or, if possible, actual) execution plans.
If the smallest and largest IDs in the IN statement are known beforehand, and depending on how many IDs are in the table, then adding RespondentID > [smallest_known_id - 1] AND RespondentID < [largest_known_id + 1] before the IN statement would help limit the problem.
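As a sketch of that guard (the bounds below are just the smallest and largest IDs visible in the posted list; it is equivalent to a BETWEEN):
select RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
where RespondentID between 1418283 and 1579244
  and RespondentID in (1418283, 1419863, /* ... the full list ... */ 1579244)
  and EntryDate between '07-11-2011' and '07-31-2012'
GROUP BY RespondentID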
I'm trying to optimise my query; it has an inner join and a coalesce.
The join table is simply a table with one integer field, and I've added a unique key.
For my where clause I've created a key for the three fields.
But when I look at the plan it still says it's using a table scan.
Where am I going wrong ?
Here's my query
select date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) as due
from billsndeposits a
inner join util_nums b on date(a.startdate, '+'||(b.n*a.interval)||'
'||a.intervaltype) <= coalesce(a.enddate, date('2013-02-26'))
where not (intervaltype = 'once' or interval = 0) and factid = 1
order by due, pid;
Most likely your JOIN expression cannot use any index, so it is evaluated with a full table scan, calculating date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) for every row.
BTW: That is a really weird join condition in itself. I suggest you find a better way to join billsndeposits to util_nums (if that is actually needed).
I think I understand what you are trying to achieve. But this kind of join is a recipe for slow performance. Even if you remove date computations and the coalesce (i.e. compare one date against another), it will still be slow (compared to integer joins) even with an index. And because you are creating new dates on the fly you cannot index them.
I suggest creating a temp table with 2 columns (1) pid (or whatever id you use in billsndeposits) and (2) recurrence_dt
populate the new table using this query:
INSERT INTO temp
SELECT a.pid, date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype)
FROM billsndeposits a, util_nums b;
Then create an index on the recurrence_dt column and run statistics (runstats). Now your select statement can look like this:
SELECT recurrence_dt
FROM temp t, billsndeposits a
WHERE t.pid = a.pid
AND recurrence_dt <= coalesce(a.enddate, date('2013-02-26'))
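The "index and runstats" step mentioned above could look like this (a sketch; SQLite is assumed from the date() modifier syntax, where ANALYZE plays the role of runstats):
CREATE INDEX IF NOT EXISTS temp_recurrence_idx ON temp (recurrence_dt);
ANALYZE temp;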
You can add an exp_ts column to this new table, and expire temporary data afterwards.
I know this adds more work to your original query, but this is a guaranteed performance improvement, and should fit naturally in a script that runs frequently.
Regards,
Edit
Another thing I would do is give enddate a default value of date('2013-02-26'), unless it would affect other code and/or does not make business sense. This way you don't have to work with coalesce.