Slow T-SQL query with DATEDIFF function

I have a query which runs fast when the date clause `and datediff(day, '2017-01-01', con2.DT_DateIncluded) <= 0` in the code below isn't used, but runs slowly when it is included. It runs fast, though, when I run just the part `select top 2 ID_Contact...`, even including the date clause. I have this query in a classic ASP application, and it can't be converted into a stored procedure (project scope reasons). Can you help me find a way to improve the performance of the full query just by changing the query code?
select distinct top 10
ID_Contact, NO_CodCompany
from
tblContacts con1
where
ID_Contact in (select top 2 ID_Contact
from tblContacts con2
inner join tblCompanies cp on con2.NO_CodCompany = cp.ID_Company
where con2.NO_CodCompany = con1.NO_CodCompany
and datediff(day, '2017-01-01', con2.DT_DateIncluded) <= 0)

Instead of `DATEDIFF(...) <= 0` try using:
and con2.DT_DateIncluded <= '2017-01-01'
Also, ensure that there is an index on the `DT_DateIncluded` column.
The reason the DATEDIFF() version runs slow is not so much the cost of the calculation itself: wrapping the column in a function makes the predicate non-sargable, so the query optimizer (probably) ends up evaluating it for every row in the table, and there is (probably) no index that can help it seek to the required rows.
When you remove that clause the query runs faster, but that is probably helped along by the fact that you're only selecting the first two rows in the inner query and ten rows in the outer query, allowing a table scan to be performant enough.
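If there is no such index, a minimal sketch of one (the index name is my invention) would be:
CREATE INDEX IX_tblContacts_DT_DateIncluded
    ON tblContacts (DT_DateIncluded);
-- With the sargable comparison above, the optimizer can seek this index
-- instead of scanning the whole table.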

This is essentially your query:
select distinct top 10 ID_Contact, NO_CodCompany
from tblContacts con1
where ID_Contact in (select top 2 ID_Contact
from tblContacts con2 inner join
tblCompanies cp
on con2.NO_CodCompany = cp.ID_Company
where con2.NO_CodCompany = con1.NO_CodCompany and
datediff(day, '2017-01-01', con2.DT_DateIncluded) <= 0
);
My first suggestion is to change the datediff() to a simple date comparison:
select distinct top 10 ID_Contact, NO_CodCompany
from tblContacts con1
where ID_Contact in (select top 2 ID_Contact
from tblContacts con2 inner join
tblCompanies cp
on con2.NO_CodCompany = cp.ID_Company
where con2.NO_CodCompany = con1.NO_CodCompany and
con2.DT_DateIncluded < '2017-01-02'
);
Then, I would remove the JOIN in the subquery. I'm not 100% sure this is exactly equivalent, because that might depend on nuances in the data:
select distinct top 10 ID_Contact, NO_CodCompany
from tblContacts con1
where con1.ID_Contact in (select top 2 con1.ID_Contact
from tblCompanies cp
where con1.NO_CodCompany = cp.ID_Company and
con1.DT_DateIncluded < '2017-01-02'
);
Then, if you can remove the select distinct in the outermost query, you should do that.

Try this instead:
con2.DT_DateIncluded < '20170102'
It's better because it still allows the server to make use of any indexes on the DT_DateIncluded column. With the column wrapped in DATEDIFF(), that is not possible; even worse, the query probably has to run that DATEDIFF() function on every record in the table.
Note that this is equivalent to what you posted, even if it might not match what you intended. I suspect con2.DT_DateIncluded < '20170101' is closer to what you really meant.
I also suspect you could do this either without the 2nd instance of tblContacts or with a windowing function to get much better results, or at least by using JOIN instead of IN to filter the results.
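As a sketch of the windowing-function approach (this assumes you want the two most recently included contacts per company, since the original TOP 2 has no ORDER BY and so picks an undefined pair, and it assumes every NO_CodCompany exists in tblCompanies, so the join can be dropped):
SELECT TOP 10 ID_Contact, NO_CodCompany
FROM (
    SELECT ID_Contact, NO_CodCompany,
           ROW_NUMBER() OVER (PARTITION BY NO_CodCompany
                              ORDER BY DT_DateIncluded DESC) AS rn
    FROM tblContacts
    WHERE DT_DateIncluded < '20170102'
) ranked
WHERE rn <= 2;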
Finally, when entering a date-only value for a datetime column, you should use the unseparated date format, because the interpretation of separated formats can depend on language and DATEFORMAT settings, as described here:
The ultimate guide to the datetime datatypes
For date/time values, you can still use the separated yyyy-mm-dd hh:mm:ss you're used to, but if you only have the date part, yyyymmdd is better.
Based on this comment:
My goal with this query is to obtain contacts from companies but limited to "n" contacts per company
You should look into the APPLY operator. Unfortunately, it's still not clear to me how everything fits together, but I will at least provide a demonstration using the APPLY operator to show two contacts per company that you can use as a starting point:
SELECT TOP 10 ct.ID_Contact, ct.NO_CodCompany
FROM tblCompanies cp
CROSS APPLY (
SELECT TOP 2 ID_Contact, NO_CodCompany
FROM tblContacts
WHERE NO_CodCompany = cp.ID_Company
AND DT_DateIncluded < '20170102'
ORDER BY DT_DateIncluded DESC
) ct
APPLY works kind of like a JOIN on a nested SELECT query, where there is no ON clause; the join condition is instead included as part of the WHERE clause in the nested SELECT statement.
Note the use of CROSS. This will exclude companies that have no contacts at all. If you want to include those companies, change it to OUTER.
You should also look at what indexes you have defined. A single index on the tblContacts table that covers NO_CodCompany and DT_DateIncluded (in that order!) might work wonders for this query, especially if it also has ID_Contact in the INCLUDE clause. Then the tblContacts portion of the query could be completed entirely from the index.
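As DDL, that suggestion might look like this sketch (the index name is invented):
CREATE INDEX IX_tblContacts_Company_Date
    ON tblContacts (NO_CodCompany, DT_DateIncluded)
    INCLUDE (ID_Contact);
-- The key columns support the WHERE and ORDER BY in the APPLY;
-- the INCLUDE column lets the subquery be answered from the index alone.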

How to improve SQL query performance containing partially common subqueries

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count sys_timestamp
100 167877672772
110 167877672769
121 167877672987
111 167877673877
... ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses (range +/- 10) within the defined time window (+1000 [milliseconds]).
However, I will have to run such queries for tables with several million (perhaps even 100+ million) entries, and even with about 500k rows the query takes a looooooong time to complete. Furthermore, while the time frame always remains the same within a given query [though the window size can change from query to query], in some instances I will have to use maybe 10 additional conditions similar to the positive / negative excesses in the same query.
Thus, I am looking for ways to improve the above query primarily to achieve better performance considering primarily the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query aside from reusing the (partial) subquery that materially increases the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
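As DDL, that might look like this sketch (pick your own index name):
CREATE INDEX tablea_sys_timestamp_event_count_idx
    ON tableA (sys_timestamp, event_count);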
Depending on the exact data distribution (most importantly how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster yet. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is WITH statements (AKA subquery factoring, or CTEs). The WITH query runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step creates the parent-child detail rows: the table multiplied by itself and filtered down by the timestamp window.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary key, or that you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
FROM tableA t1, tableA t2
where t2.sys_timestamp > t1.sys_timestamp AND t2.sys_timestamp <= t1.sys_timestamp + 1000
), posSummary as
(
select bq.startEventCount, count(*) as positive
from baseQuery bq
where t2EventCount >= bq.pEndEventCount
group by bq.startEventCount
), negSummary as
(
select bq.startEventCount, count(*) as negative
from baseQuery bq
where t2EventCount <= bq.nEndEventCount
group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count=ps.startEventCount
inner join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
WITH statements are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and perhaps how to minimize the rows the query needs to deal with further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.
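For instance, you can look at the plan of the expensive self-join step on its own (PostgreSQL syntax; note that ANALYZE actually executes the query, so try it on a sample first):
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.event_count, t2.event_count
FROM tableA t1, tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
  AND t2.sys_timestamp <= t1.sys_timestamp + 1000;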

Fastest way to count from a subquery

I have the following query to return a list of current employees and the number of 'corrections' they have. This is working correctly but is very slow.
I was previously not using a subquery, instead opting for a count (from...) as an aggregate subselect, but I have read that a subquery should be much faster. Changing the code to the below did improve performance, but not anywhere near what I was expecting.
SELECT DISTINCT
tblStaff.StaffID, CorrectionsOut.Count AS CorrectionsAssigned
FROM tblStaff
LEFT JOIN tblMeetings ON tblMeetings.StaffID = tblStaff.StaffID
JOIN tblTasks ON tblTasks.TaskID = tblMeetings.TaskID
--Get Corrections Issued
LEFT JOIN(
SELECT
COUNT(DISTINCT tblMeetings.TaskID) AS Count, tblMeetings.StaffID
FROM tblRegister
JOIN tblMeetings ON tblRegister.MeetingID = tblMeetings.MeetingID
WHERE tblRegister.FDescription IS NOT NULL
AND tblRegister.CorrectionOutDate IS NULL
GROUP BY tblMeetings.StaffID
) AS CorrectionsOut ON CorrectionsOut.StaffID = tblStaff.StaffID
WHERE tblStaff.CurrentEmployee = 1
I need a vendor-neutral solution as we are transitioning from SQL Server to Postgres. Note this is a simplified example of the query; the real one has quite a few counts. My current query time without the counts is less than half a second, but with the counts it is approx 20 seconds, if it runs at all without locking or otherwise failing.
I would get rid of the joins that you are not using which probably makes the SELECT DISTINCT unnecessary as well:
SELECT s.StaffID, co.Count AS CorrectionsAssigned
FROM tblStaff s LEFT JOIN
(SELECT COUNT(DISTINCT m.TaskID) AS Count, m.StaffID
FROM tblRegister r JOIN
tblMeetings m
ON r.MeetingID = m.MeetingID
WHERE r.FDescription IS NOT NULL AND
r.CorrectionOutDate IS NULL
GROUP BY m.StaffID
) co
ON co.StaffID = s.StaffID
WHERE s.CurrentEmployee = 1;
Getting rid of the SELECT DISTINCT and the duplicate rows added by the tasks should help performance.
For additional benefit, you would want to be sure you have indexes on the JOIN keys, and perhaps on the filtering criteria.
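For example, these are plausible starting points (a sketch only; the index names are invented, and the plain CREATE INDEX syntax below works in both SQL Server and Postgres):
CREATE INDEX IX_tblRegister_MeetingID ON tblRegister (MeetingID);
CREATE INDEX IX_tblMeetings_MeetingID_StaffID ON tblMeetings (MeetingID, StaffID);
CREATE INDEX IX_tblStaff_CurrentEmployee ON tblStaff (CurrentEmployee);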

SQL Query Performance Issues Using Subquery

I am having issues with my query run time. I want the query to automatically pull the max id for a column because the table is indexed off of that column. If I punch in the number manually, it runs in seconds, but I want the query to be more dynamic if possible.
I've tried placing the sub-query in different places with no luck.
SELECT *
FROM TABLE A
JOIN TABLE B
ON A.SLD_MENU_ITM_ID = B.SLD_MENU_ITM_ID
AND B.ACTV_FLG = 1
WHERE A.WK_END_THU_ID_NU >= (SELECT DISTINCT MAX (WK_END_THU_ID_NU) FROM TABLE A)
AND A.WK_END_THU_END_YR_NU = YEAR(GETDATE())
AND A.LGCY_NATL_STR_NU IN (7731)
AND B.SLD_MENU_ITM_ID = 4314
I just want this to run faster. Maybe there is a different approach I should be taking?
I would move the subquery to the FROM clause and change the WHERE clause to only refer to A:
SELECT *
FROM A JOIN
(SELECT MAX(WK_END_THU_ID_NU) as max_wet
FROM A
) am
ON a.WK_END_THU_ID_NU = am.max_wet JOIN
B
ON A.SLD_MENU_ITM_ID = B.SLD_MENU_ITM_ID AND
B.ACTV_FLG = 1
WHERE A.WK_END_THU_END_YR_NU = YEAR(GETDATE()) AND
A.LGCY_NATL_STR_NU IN (7731) AND
A.SLD_MENU_ITM_ID = 4314; -- is the same as B
Then you want indexes. I'm pretty sure you want indexes on:
A(SLD_MENU_ITM_ID, WK_END_THU_END_YR_NU, LGCY_NATL_STR_NU, WK_END_THU_ID_NU)
B(SLD_MENU_ITM_ID, ACTV_FLG)
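Spelled out as DDL, that would be something like this sketch (index names invented):
CREATE INDEX IX_A_filter ON A (SLD_MENU_ITM_ID, WK_END_THU_END_YR_NU, LGCY_NATL_STR_NU, WK_END_THU_ID_NU);
CREATE INDEX IX_B_join ON B (SLD_MENU_ITM_ID, ACTV_FLG);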
I will note that moving the subquery to the FROM clause probably does not affect performance, because SQL Server is smart enough to only execute it once. However, I prefer table references in the FROM clause when reasonable. I don't think a window function would actually help in this case.

MS SQL - Multiple Running Totals, each based on a different GROUP BY

I need to generate 2 running-total columns, each based on a different group-by. I would PREFER that the solution use the OUTER APPLY method like the one below, except modified to run multiple running totals/sums on different group-bys/columns. See the image for an example of the desired result.
SELECT t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount,
RunningTotal = SUM(t2.TicketAmount)
FROM dbo.SpeedingTickets AS t1
OUTER APPLY
(
SELECT TicketAmount
FROM dbo.SpeedingTickets
WHERE LicenseNumber = t1.LicenseNumber
AND IncidentDate <= t1.IncidentDate
) AS t2
GROUP BY t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount
ORDER BY t1.LicenseNumber, t1.IncidentDate;
Example + desired result:
i.stack.imgur.com/PvJQe.png
Use outer apply twice:
Here is how you get one running total:
SELECT st.*, r1.RunningTotal
FROM dbo.SpeedingTickets st OUTER APPLY
(SELECT SUM(st2.TicketAmount) as RunningTotal
FROM dbo.SpeedingTickets st2
WHERE st2.LicenseNumber = st.LicenseNumber AND
st2.IncidentDate <= st.IncidentDate
) r1
ORDER BY st.LicenseNumber, st.IncidentDate;
For two, you just add another OUTER APPLY. Your question doesn't specify what the second aggregation is, and the linked picture has no relevance to the description in the question.
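As a sketch, assuming the second running total is an overall total by date across all licenses (an assumption on my part, since the question doesn't say):
SELECT st.*, r1.RunningTotal, r2.OverallRunningTotal
FROM dbo.SpeedingTickets st OUTER APPLY
     (SELECT SUM(st2.TicketAmount) as RunningTotal
      FROM dbo.SpeedingTickets st2
      WHERE st2.LicenseNumber = st.LicenseNumber AND
            st2.IncidentDate <= st.IncidentDate
     ) r1 OUTER APPLY
     (SELECT SUM(st3.TicketAmount) as OverallRunningTotal
      FROM dbo.SpeedingTickets st3
      WHERE st3.IncidentDate <= st.IncidentDate
     ) r2
ORDER BY st.LicenseNumber, st.IncidentDate;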
Notes:
The aggregation goes in the subquery, not in the outer query.
Use table abbreviations for table aliases. Such consistency makes it easier to follow the query.
When using correlated subqueries, always use qualified column names for all columns.

Query not behaving as expected

I have a query:
select count(*) as total
from sheet_record right join
(select * from sheet_record limit 10) as sr
on 1=1;
If I understood correctly (which I think I did not), a right join is supposed to return all rows from the right table joined with the left table. There should be at least 10 rows. But the query returns only 1 row with 1 column, 'total'. And it doesn't matter whether it is a left, full, or inner join; the result is always the same.
If I reverse the tables and use a left join with a small modification of the query, then it works correctly (the modifications don't matter, because in that case I get exactly what I expected to get). But I am interested in finding out what I actually didn't understand about joins and why this query does not work as expected.
You are getting back a single row because the SELECT contains an aggregation function with no GROUP BY, turning this into an aggregation query. The count itself should be 10 times the number of rows in the sheet_record table.
Your query is effectively a cross join. So, if you did:
select *
from sheet_record right join
(select * from sheet_record limit 10) as sr
on 1=1;
You would get 10 rows for each record in sheet_record. Each of those records would have additional columns from one of ten records from the same table.
You are using a count(*) function without any groupings. This will pretty much result in a single row being retrieved. Try running your query without the count() to see if you get something closer to what you expect.
Eventually, with the help of the commenters, I understood what was wrong. Not wrong, actually, but what exactly I was not catching.
-- this code below works fine; the query returns page 15 with 10 records in it:
select * from sheet_record inner join (select count(*) as total from sheet_record) as sr on 1=1 limit 10 offset 140;
I was thinking that a join takes the table on the left and joins it with the table on the right. But in the script above I had, on the right side, a view (a table built by a subquery) instead of a plain table, and I was thinking that the left side was a view as well, made by (select * from sheet_record), which was a mistake.
The idea is to get a set of records from table X with an additional column holding the total number of records in the table.
(This is a common problem when there is a demand to show a table in a UI using paging: to know how many pages should still be available, I need to know how many records there are in total, so I can calculate how many pages are left.)
I think it should be something like:
select * from (
(here is some subquery which builds a view using the count(*) function on some table X; it will be used as the left table)
right join
(here is some subquery which gets some set of records from table X with limit and offset)
on 1=1 -- because I need all rows from the right table (view); in all cases it should be true
)
The version with the right join is a bit more complicated.
I am using Postgres.
So eventually I managed to get the result with a right join:
select * from (select count(*) as total from sheet_record) as srt right join (select * from sheet_record limit 10 offset 140) as sr on 1=1;
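For what it's worth, a more compact way to get the same result in Postgres (not the right-join approach above, just a common paging idiom) is a window function; count(*) over () is computed over the full result set before limit/offset are applied:
select *, count(*) over () as total
from sheet_record
limit 10 offset 140;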