SQL Server 2012 Estimated Row Numbers Much Different than Actual - sql

I have a query that cross joins two tables. TABLE_1 has 15,000 rows and TABLE_2 has 50,000 rows. A query very similar to this one has run in the past in roughly 10 minutes. Now it is running indefinitely with the same server situation (i.e. nothing else running), and the very similar query is also running indefinitely.
SELECT A.KEY_1
,A.FULL_TEXT_1
,B.FULL_TEXT_2
,B.KEY_2
,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) >= 0.9
When I run the estimated execution plan for this query, the Nested Loops (Inner Join) node is estimated at 96% of the execution. The estimated number of rows is 218 million, even though cross joining the tables should result in 15,000 * 50,000 = 750 million rows. When I add INSERT INTO #temp_table to the beginning of the query, the estimated execution plan puts Insert Into at 97% and estimates the number of rows as 218 million. In reality, there should be less than 100 matches that have a similarity score above 0.9.
I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?

I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?
Yes, this is true. It particularly affects optimizations involving join algorithms, aggregation algorithms, and indexes.
But it is not true for your query. Your query has to do a nested loops join with no indexes. All pairs of values in the two tables need to be compared. There is little algorithmic flexibility and (standard) indexes cannot really help.

For better performance, use the minScoreHint parameter. This allows to prevent doing the full similarity calculation for many pairs and early exit.
So this should run quicker:
SELECT A.KEY_1
,A.FULL_TEXT_1
,B.FULL_TEXT_2
,B.KEY_2
,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) >= 0.9
It is not clear from docs if 0.9 results would be included. If not, change 0.9 to 0.89

The link provided by scsimon will help you prove whether it's statistics or not. Have the estimates changed significantly since to when it was running fast?
Parallelism spring to mind. If the query was going parallel, but now isn't (e.g. if a server setting has been changed, or statistics) then that could cause significant performance degradation.

Related

SQL Server query plan sort operator cost spikes when column count increased

I use a ROW_NUMBER() function inside a CTE that causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but has recently spiked in cost after I increased the number of columns read from the CTE/query.
What confuses me is the increase in cost is not proportional to the column count. I can increase the column count without much issue, normally. However, it seems my query has past some threshold and now costs so much it has doubled the query execution time from 1hour to 2hour+.
I can't figure out what has caused the spike in cost and it's having an impact on business. Any ideas or next steps for troubleshooting you can advise?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
WHERE ([versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00')
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns that experiences the cost spike (all others look the same except this operator takes much less time (40-50mins):
I did a comparison of three executions, each with different column counts in the INSERT INTO SELECT FROM clause. I can't share the spreadsheet, but I will try convey my findings so far. The following is true of the query with the most columns:
It takes more than twice as long to execute than the other two executions
It performs more logical & physical reads and scans
It has more CPU time
It reads the most from Tempdb
The increase in execution time is not proportional with the increase in reads or other mentioned metrics
It is true that there is a memory spill level 8 happening. I have tried updating statistics, but it didn't help and all the versions of the query suffer the same problem so like-for-like is still compared.
I know it can be hard to help with this kind of problem without being able to poke around but I would be grateful if anyone could point me in the direction for what to check / try next.
P.S. the table it reads from is a heap and the table it joins to is indexed. The heap table needs to be a heap otherwise inserts into it will take too long and the problem is kicked down the road.
Also, when I say added more columns, I mean in the SELECT FROM versioned_events statement. The columns are replaced with "...." in the above example.
UPDATE
Using a temp table halved the execution time when the column count is the high number that caused the issue but actually takes longer with a reduced column count. It goes back to the idea that a threshold is crossed when the column count is increased :(. In any event, we've used a temp table for now to see if it helps in production.

Does SQL Server Table-Scan Time depend on the Query?

I observed that doing a full table scan takes a different time based on the query. I believed that under similar conditions (set of columns under select, column data types) a table scan should take a somewhat similar time. Seems like it's not the case. I just want to understand the reason behind that.
I have used "CHECKPOINT" and "DBCC DROPCLEANBUFFERS" before querying to make sure there is no impact from the query cache.
Table:
10 Columns
10M rows Each column has different densities ranging from 0.1 to 0.000001
No indexes
Queries:
Query A: returned 100 rows, time took: ~ 900ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL07 = 50000
Query B: returned 910595 rows, time took: ~ 15000ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL01 = 5
** Where column COL07 was randomly populated with integers ranging from 0 to 100000 and column COL01 was randomly populated with integers ranging from 0 to 10
Time Taken:
Query A: around 900 ms
Query B: around 18000 ms
What's the point I'm missing here?
Query A: (returned 100 rows, time took: ~ 900ms)
Query B: (returned 910595 rows, time took: ~ 15000ms)
I believe that what you are missing is that there are about x100 more rows to fetch in the second query. That only could explain why it took 20 times longer.
The two columns have different density of the data.
Query A, COL07: 10000000/100000 = 100
Query B, COL05: 10000000/10 = 1000000
The fact that both the search parameters are in the middle of the data range doesn't necessarily impact the speed of the search. This is depending on the number of times the engine scans the column to return the values of the search predicate.
In order to see if this is indeed the case, I would try the following:
COL04: 10000000/1000 = 10000. Filtering on WHERE COL04 = 500
COL08: 10000000/10000 = 1000. Filtering on WHERE COL05 = 5000
Considering the times from the initial test, you would expect to see COL04 at ~7200ms and COL05 at ~3600ms.
An interesting article about SQL Server COUNT() Function Performance Comparison
Full Table Scan (also known as Sequential Scan) is a scan made on a database where each row of the table under scan is read in a sequential (serial) order
Reference
In your case, full table scan scans sequentially (in ordered way) so that it does not need to scan whole table in order to advance next record because Col7 is ordered.
but in Query2 the case is not like that, Col01 is randomly distributed so full table scan is needed.
Query 1 is optimistic scan where as Query 2 is pessimistic can.

Teradata SQL Optimization

I hope this is concise. I am basically looking for a methodology on how to improve queries after watching one of my colleagues speed up my query almost 10 fold with a quick change
I had a query that had two tables t_item and t_action
t_item is basically an item with characteristics and t_action is the events or actions that are performed on this item with a time stamp for each action each action also has an id
My query joined the two tables on id. There were also some criteria made on t_action.action_type which is free text
My simplified original query was like below
SELECT *
FROM t_item
JOIN t_action
ON t_item.pk = t_action.fk
WHERE t_action.action_type LIKE ('%PURCHASE%')
AND t_item.location = 'DE'
This ran OK, it came back in roughly 8 mins
My colleague changed it so that the t_action.action_type ended up in the FROM portion of the SQL. This reduced the time to 2 mins
SELECT *
FROM t_item
JOIN t_action
ON t_item.pk = t_action.fk
t_action.action_type LIKE ('%PURCHASE%')
WHERE t_item.location = 'DE'
My question is, Generally, how do you know when to put limits in the FROM clause vs in the WHERE clause.
I thought that Teradata SQL optimizer does this automatically
Thank you for your help
In this case, you don't actually need to understand the plan. You just need to see if the two plans are the same. Teradata has a pretty good optimizer, so I would not expect there to be a difference between the two version (could be, but I would be surprised). Hence, caching is a possibility for explaining the difference in performance.
For this query:
SELECT *
FROM t_item JOIN
t_action
ON t_item.pk = t_action.fk
t_action.action_type LIKE '%PURCHASE%'
WHERE t_item.location = 'DE';
The best indexes are probably on t_item(location, pk) and t_action(action_type). However, you should try to get rid of the wildcards for a production query. This makes the query harder to optimize, which in turn might have a large impact on performance.
I tried to create similar query but didn't see any difference in the explain plan..though record counts were less trans(15k) and accounts(10k) with indexes on Account_number. Probably what Gordon has specified , try to run the query at different time and also check explain plan for both the queries to see any difference.
Explain select * from trans t
inner join
ap.accounts a
on t.account_number = a.account_number
where t.trans_id like '%DEP%';
4) We do an all-AMPs JOIN step from ap.a by way of a RowHash match
scan with no residual conditions, which is joined to ap.t by way
of a RowHash match scan with a condition of ("ap.t.Trans_ID LIKE
'%DEP%'"). ap.a and ap.t are joined using a merge join, with a
join condition of ("ap.t.Account_Number = ap.a.Account_Number").
The result goes into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 11,996 rows (1,511,496 bytes). The estimated time for this
step is 0.25 seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.25 seconds.
Explain select * from trans t
inner join
ap.accounts a
on t.account_number = a.account_number
and t.trans_id like '%DEP%';
4) We do an all-AMPs JOIN step from ap.a by way of a RowHash match
scan with no residual conditions, which is joined to ap.t by way
of a RowHash match scan with a condition of ("ap.t.Trans_ID LIKE
'%DEP%'"). ap.a and ap.t are joined using a merge join, with a
join condition of ("ap.t.Account_Number = ap.a.Account_Number").
The result goes into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 11,996 rows (1,511,496 bytes). The estimated time for this
step is 0.25 seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.25 seconds.
The general order of query processing on Teradata is :
Where/And + Joins
Aggregate
Having
Olap/Window
Qualify
Sample/Top
Order By
Format
An easy way to remember is WAHOQSOF - as in Wax On, Wax Off :)

Heavy cost (99%) for insert in query execution plan

I have a query as follows:
SELECT c.irn,
pLog.policingname,
ce.*
INTO #caselist
FROM employeereminder_ilog ce
JOIN cases c
ON ce.caseid = c.caseid
JOIN policinglog pLog
ON ( ce.logdatetimestamp BETWEEN
pLog.startdatetime AND pLog.finishdatetime )
WHERE ce.logdatetimestamp BETWEEN #start_pre AND #end_pre
employeereminder_iLOG is a pretty huge table, around 32M rows.
POLICINGLOG has around 50 rows.
CASES around 0.5m rows.
#start_pre and #end_pre are predefined variabled around 30 minutes apart.
This query took around 30 minutes to run, and returns around 600 results.
I was trying to find way to speed up the query by looking at the execution plan. I couldn't work out why however the insert was taking up 99% of the query, as opposed to the select from employeereminder_iLOG .
So, my questions are:
Why is the cost coming from the insert, and not the select from employeereminder_iLOG.
Is it possible to speed up this query?

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we got the worst case CROSS JOIN condition ever :
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow !
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on a using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache :
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache :
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes , ensure your db is being vacuumed, increase the shared buffers for your db install, work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?