SQL Server 2012 Paging and TotalRows - sql

I've just had to start paging in SQL Server 2012 and I'm trying to get the total row count before paging is applied, but the problem I have is that my view has a few too many function calls in it that massively slow it down.
I've looked at this post and I've ended up with a query that takes 39 secs to run without the full data set in the DB.
Get total row count while paging
SELECT *
, COUNT(TaskId) OVER()
FROM TaskVersionView
WHERE (.. ~10 predicates here .. )
ORDER BY StartDate
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY
Without the COUNT it takes <1 second.
I would have expected SQL to optimize it so that it only counts the TaskIds instead of calling the functions but that doesn't seem to be the case, because:
SELECT COUNT(TaskId)
FROM TaskVersionView
Takes <1 sec.

I would have expected SQL to optimize it so that it only counts the TaskIds instead of calling the functions
If the predicates are always 'true' then this 'optimization' would return the correct value. Not that SQL Server could, even in theory, guess that the functions will always return true. But if you know (as it seems to imply from your expectation) that the functions in the predicates always return true then obviously you should remove them from the WHERE clause...
If the predicates sometimes return 'false' then obviously they cannot be optimized away, as the returned values would be incorrect.
Something gotta give.
PS. Paging with total counts is a bad idea, as it forces a full scan on every visit. Paging with total counts on which the total count is returned for every row is a horrible bad idea (modeling wise, perf wise, sanity wise).

Related

Get filtered row count using dm_db_partition_stats

I'm using paging in my app but I've noticed that paging has gone very slow and the line below is the culprit:
SELECT COUNT (*) FROM MyTable
On my table, which only has 9 million rows, it takes 43 seconds to return the row count. I read in another article which states that to return the row count for 1.4 billion rows, it takes over 5 minutes. This obviously cannot be used with paging as it is far too slow and the only reason I need the row count is to calculate the number of available pages.
After a bit of research I found out that I get the row count pretty much instantly (and accurately) using the following:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
But the above returns me the count for the entire table which is fine if no filters are applied but how do I handle this if I need to apply filters such as a date range and/or a status?
For example, what is the row count for MyTable when the DateTime field is between 2013-04-05 and 2013-04-06 and status='warning'?
Thanks.
UPDATE-1
In case I wasn't clear, I require the total number of rows available so that I can determine the number of pages required that will match my query when using 'paging' feature. For example, if a page returns 20 records and my total number of records matching my query is 235, I know I'll need to display 12 buttons below my grid.
01 - (row 1 to 20) - 20 rows displayed in grid.
02 - (row 21 to 40) - 20 rows displayed in grid.
...
11 - (row 200 to 220) - 20 rows displayed in grid.
12 - (row 221 to 235) - 15 rows displayed in grid.
There will be additional logic added to handle a large amount of pages but that's a UI issue, so this is out of scope for this topic.
My problem with using "Select count(*) from MyTable" is that it is taking 40+ seconds on 9 million records (thought it isn't anymore and I need to find out why!) but using this method I was able to add the same filter as my query to determine the query. For example,
SELECT COUNT(*) FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning'
Once I determine the page count, I would then run the same query but include the fields instead of count(*), the CurrentPageNo and PageSize in order to filter my results by page number using the row ids and navigate to a specific pages if needed.
SELECT RowId, DateTime, Status, Message FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning' AND
RowId BETWEEN (CurrentPageNo * PageSize) AND ((CurrentPageNo + 1) * PageSize)
Now, if I use the other mentioned method to get the row count i.e.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
It returns the count instantly but how do I filter this so that I can include the same filters as if I was using the SELECT COUNT(*) method, so I could end up with something like:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable') AND
(index_id=0 or index_id=1) AND
([DateTime] BETWEEN '2018-04-05' AND '2018-04-06') AND
([Status] = 'Warning')
The above clearing won't work as I'm querying the dm_db_partition_stats but I would like to know if I can somehow perform a join or something similar to provide me with the total number of rows instantly but it needs to be filtered rather than apply to the entire table.
Thanks.
Have you ever asked for directions to alpha centauri? No? Well the answer is, you can't get there from here.
Adding indexes, re-orgs/re-builds, updating stats will only get you so far. You should consider changing your approach.
sp_spaceused will return the record count typically instantly; You may be able to use this, however depending (which you've not quite given us enough information) on what you are using the count for might not be adequate.
I am not sure if you are trying to use this count as a means to short circuit a larger operation or how you are using the count in your application. When you start to highlight 1.4 billion records and you're looking for a window in said set, it sounds like you might be a candidate for partitioned tables.
This allows you assign several smaller tables, typically separated by date, years / months, that act as a single table. When you give the date range on 1.4+ Billion records, SQL can meet performance expectations. This does depend on SQL Edition, but there is also view partitioning as well.
Kimberly Tripp has a blog and some videos out there, and Kendra Little also has some good content on how they are used and how to set them up. This would be a design change. It is a bit complex and not something you would want implement on a whim.
Here is a link to Kimberly's Blog: https://www.sqlskills.com/blogs/kimberly/sqlskills-sql101-partitioning/
Dev banter:
Also, I hear you blaming SQL, are you using entity framework by chance?

PDO SQL issue displaying multiple rows when using COUNT()

To display my results from PDO, I always use following PHP code for example:
$STH = $DBH->prepare("SELECT logo_id, guess_count, guessed, count(id) AS counter FROM guess WHERE user_id=:id");
$STH->bindParam(":id",$loginuser['id']);
$STH->execute();
while($row = $STH->fetch()){
print_r($row);
}
Now the issue is that I only get one result. I used to use $STH->rowCount() to check the amount of rows returned, but this method isn't really advised for SELECT statements because in some databases it doesn't react correctly. So I used the count(id) AS counter, but now I only get one result every time, even though the value of $row['counter'] is larger than one.
What is the correct way to count the amount of results in one query?
If you want to check the number of rows that are returned by a query, there are a couple of options.
You could do a ->fetchAll to get an array of all rows. (This isn't advisable for large result sets (i.e. a lot of rows returned by the query); you could add a LIMIT clause on your query to avoid returning more than a certain number of rows, if what you are checking is whether you get more than one row back, you would only need to retrieve two rows.) Checking the length of the array is trivial.
Another option is to run a another, separate query, to get the count separately, e.g.
SELECT COUNT(1) AS counter FROM guess WHERE user_id=:id
But, that approach requires another round trip to the database.
And the old standby SQL_CALC_ROUND_ROWS is another option, though that too can have problematic performance with large sets.
You could also just add a loop counter in your existing code:
$i = 0;
while($row = $STH->fetch()){
$i++
print_r($row);
}
print "fetched row count: ".$i;
If what you need is an exact count of the number of rows that satisfy a particular predicate PRIOR to running a query to return the rows, then the separate COUNT(1) query is likely the most suitable approach. Yes, it's extra code in your app; I recommend you preface the code with a comment that indicates the purpose of the code... to get an exact count of rows that satisfy a set of predicates, prior to running a query that will retrieve the rows.
If I had to process the rows anyway, and adding LIMIT 0,100 to the query was acceptable, I would go for the ->fetchAll(), get the count from the length of the array, and process the rows from the array.
You have to use GROUP BY. Your query should look like
SELECT logo_id, guess_count, guessed, COUNT(id) AS counter
FROM guess
WHERE user_id=:id
GROUP BY logo_id, guess_count, guessed

rowcount before and after query

Can someone explain the difference between these 2 simple queries ...
SET ROWCOUNT = 10
select * from t_Account order by account_date_last_maintenance
and this one
select * from t_Account order by account_date_last_maintenance
SET ROWCOUNT = 10
when executed both return only 10 rows, but the rows are different. There are millions of rows in the table if that matters. Also, the first query runs consistently 20% longer.
Thanks everyone
When you execute SET ROWCOUNT 10 you are telling SQL to stop the query after 10 results are returned. Your first SQL statement is the correct syntax (with the exception of the first line which should read SET ROWCOUNT 10).
The second statement as written will return all of the values ordered when initially executed and then set the row count to 10, so any subsequent execution will return the first 10 items.
ROWCOUNT must be set to 0 to get things back to "normal" execution.
As to why things were returning differently, the data might not be processed the same every time and given the size of your dataset it is most likely that you might sometimes get matching results, but it is not a sure thing. If you want consistent results and only want the first 10 results I would recommend using TOP.

SQL Server aggregate performance

I am wondering whether SQL Server knows to 'cache' if you like aggregates while in a query, if they are used again.
For example,
Select Sum(Field),
Sum(Field) / 12
From Table
Would SQL Server know that it has already calculated the Sum function on the first field and then just divide it by 12 for the second? Or would it run the Sum function again then divide it by 12?
Thanks
It calculates once
Select
Sum(Price),
Sum(Price) / 12
From
MyTable
The plan gives:
|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
|--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
|--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))
This table has 1.35 million rows
Expr1011 = SUM
Expr1003 = some internal thing to do with "no rows" etc but is Expr1011 basically
Expr1004 = Expr1011 / 12
According to the execution plan, it doesn't re-sum the column.
good question, i think the answer is no, it doesn't not cache it.
I ran a test query with around 3000 counts in it, and it was much slower than one with only a few. Still want to test if the query would be just as slow selecting just plain columns
edit: OK, i just tried selecting a large amount of columns or just one, and the amount of columns (when talking about thousands being returned) does effect the speed.
Overall, unless you are using that aggregate number a ton of times in your query, you should be fine. Push comes to shove, you could always save the outcome to a variable and do the math after the fact.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we got the worst case CROSS JOIN condition ever :
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow !
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on a using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache :
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache :
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes , ensure your db is being vacuumed, increase the shared buffers for your db install, work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?