Fastest Way To Get Count From A Table With Conditions?

I am using SQL Server 2017 and EF Core 2.2. One of my tables currently has 5 million records in it.
I want to group all these records by CategoryId and then get a count for each one.
I also need to filter with a WHERE clause.
However, even when I write the query in plain SQL it still takes around a minute to return these numbers.
This is far too slow and I need something faster.
select CategoryId, count(*)
from Items
where Deleted = 'False'
group by CategoryId
I am guessing that EF Core probably won't have a solution that is fast enough, so I am open to using ADO.NET if needed. I just need something fast.

Consider creating an indexed view to materialize the aggregation:
CREATE VIEW dbo.ItemCategory
WITH SCHEMABINDING
AS
SELECT CategoryId, COUNT_BIG(*) AS CountBig
FROM dbo.Items
WHERE Deleted = 'False'
GROUP BY CategoryId;
GO
CREATE UNIQUE CLUSTERED INDEX cdx_ItemCategory
ON dbo.ItemCategory (CategoryId);
GO
Using this view for the aggregated result will improve performance significantly:
SELECT CategoryId, CountBig
FROM dbo.ItemCategory;
Depending on your SQL Server edition, you may need to specify the NOEXPAND hint for the view index to be used:
SELECT CategoryId, CountBig
FROM dbo.ItemCategory WITH (NOEXPAND);

You had better add indexes on Deleted and CategoryId.
Or move all deleted items to a separate table.
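For example, something along these lines (a sketch against the Items table from the question; keying the index on Deleted and including CategoryId makes it cover this query, as the next answer demonstrates in detail):
CREATE NONCLUSTERED INDEX IX_Items_Deleted
ON dbo.Items (Deleted)
INCLUDE (CategoryId)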

You should have a covering index for your query to make it go fast; other than that there is no shortcut, because the query has to read every page holding a CategoryId to produce the counts.
I have a table with 5 million rows, of which almost 4.7 million have IsDeleted = 'False'. Without the covering index, my query takes about 12 seconds, and the execution plan shows a scan of the clustered index.
Once I create the following covering index on my table, the query executes in less than a second; the execution plan looks much the same, but it now does a seek on the nonclustered index rather than a scan on the clustered index:
Index Definition:
CREATE NONCLUSTERED INDEX [Test_Index]
ON [dbo].[Test] ([IsDeleted])
INCLUDE ([CategoryId])
With this covering index, SQL Server only needs to look at the index and return the results, rather than reading your whole table.
If you really want to speed this query up further, there is another, very specific option: create a filtered index tailored to your query.
Index definition would be:
CREATE NONCLUSTERED INDEX [Test_Index2]
ON [dbo].[Test] ([CategoryId])
WHERE IsDeleted = 'False'
With this filtered index my query was practically instant. I didn't measure IO time on the query, but I would expect a few milliseconds. The execution plan changed slightly with this index.

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
  id BIGSERIAL PRIMARY KEY,
  joincol VARCHAR
);
CREATE TABLE test2 (
  joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
  SELECT test1.id,
         test1.joincol AS t1charcol,
         test2.joincol AS t2charcol
  FROM test1, test2
  WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100 ms. As far as I understand the execution plan, the runtime is independent of the row count, since Postgres iterates over the rows one by one (starting at the highest id, using the index) until a row joins successfully, then returns immediately.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on row count), since the two tables are joined completely before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment test1 contains only a handful of rows (< 100), with unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
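For illustration, that matching index would be (a sketch; the index name is made up):
CREATE INDEX test1_id_desc_nulls_last_idx ON test1 (id DESC NULLS LAST);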
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly, regardless of index sort order; an index condition is added: Index Cond: (id IS NOT NULL).
ORDER BY ... LIMIT 1 gets the fast plan only with an exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from the actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has ~10M rows in test2 and fewer than 100 rows in test1, every value in test2.joincol has a match in test1.joincol, both are defined NOT NULL, and test1.joincol is unique. A classic one-to-many relationship. There should be a UNIQUE constraint on test1.joincol and a FK constraint test2.joincol --> test1.joincol.
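A sketch of those two constraints (using the test1/test2 naming from the setup above; the constraint names are made up):
ALTER TABLE test1 ADD CONSTRAINT test1_joincol_uni UNIQUE (joincol);
ALTER TABLE test2 ADD CONSTRAINT test2_joincol_fk
FOREIGN KEY (joincol) REFERENCES test1 (joincol);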
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem and a good test case.
I tested it on Postgres 9.3; perhaps 13 can do it faster.
I used Occam's razor and excluded some possibilities:
The view (the query is slow without the view too).
The JOIN filtering out rows (unfortunately it doesn't in your test data, though with longer md5 substrings of 5-6 characters it would).
Other basic equivalent SELECT statements (an inner query or EXISTS) don't solve your problem either.
I managed to get index-only access, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON test1 (id);
is useless, because of the PK!
If you change this
CREATE INDEX ON test1 (joincol);
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses only indexes.
After you run
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you can gain some additional performance, because you created the indexes before the inserts.
I think the reason lies in the database's two optimization goals.
The first goal is to optimize for just a few rows, so it runs a Nested Loop; you can force this with LIMIT x.
The second goal is to optimize for the whole table, so the query runs fast across all rows.
In this situation the Postgres optimizer didn't notice that a simple MAX can run with a Nested Loop. Or perhaps Postgres cannot use a limit inside an aggregate clause (the aggregate runs on the whole partial select that the query filters), and this is very expensive. The same applies to other aggregates you could write there, like SUM, MIN, AVG, etc.
Perhaps window functions could help you too.

Indexed views: query ignores the view and uses the table instead

My task is to optimize this query:
Declare @sumBalance NUMERIC = (select SUM(CURRENT_BALANCE) as Balance
from dbo.ACCOUNT_DETAILS)
select @sumBalance
I've read that the best solution for aggregate functions is to use indexed views instead of tables.
I've created the view with SCHEMABINDING:
CREATE VIEW dbo.CURRENT_BALANCE_VIEW
WITH SCHEMABINDING
AS
SELECT id,CURRENT_BALANCE
FROM dbo.ACCOUNT_DETAILS
After that I've created 2 indexes.
The first, on ID:
CREATE UNIQUE CLUSTERED INDEX index_ID_VIEW ON dbo.CURRENT_BALANCE_VIEW(ID);
The second, on CURRENT_BALANCE, my second column:
CREATE NONCLUSTERED INDEX index_CURRENT_BALANCE_VIEW
ON dbo.CURRENT_BALANCE_VIEW(CURRENT_BALANCE);
And here I ran into trouble with the new query:
Declare @sumBalance NUMERIC = (select SUM(CURRENT_BALANCE) as Balance
from dbo.CURRENT_BALANCE_VIEW)
select @sumBalance
The new query doesn't use the view:
http://i.stack.imgur.com/jlPEd.png
Somehow my indexes were added to the Statistics folder (screenshot link omitted).
I don't understand why I can see the index 'index_current_balance' there, since there is no such index on the table (screenshot link omitted).
P.S. I already tried creating the index on the table itself, and it helped: the query went from an estimated operator cost of 0.2 down to 0.009, but it still has to be faster.
P.P.S. Sorry for making you click on the links; my reputation doesn't allow me to paste images properly =\
P.P.P.S. Working with SQL Server 2014.
P.P.P.P.S. Just realized that I don't need to sum zeros; I excluded them from the function.
Thanks in advance.
If you use the Standard Edition of SQL Server, you have to use the NOEXPAND hint in order to use the index of a view.
For example:
SELECT *
FROM dbo.CURRENT_BALANCE_VIEW WITH (NOEXPAND);
This query:
Declare @sumBalance NUMERIC = (select SUM(CURRENT_BALANCE) as Balance
from dbo.ACCOUNT_DETAILS);
select @sumBalance;
is not easy to optimize. The only index that will help it is:
create index idx_account_details_current_balance on account_details(current_balance);
This is a covering index for the query, and can be used for the SUM(). However, the index still needs to be scanned to do the SUM(). Scanning the index should be faster than scanning the table because it is likely to be much smaller.
SQL Server 2012+ has a facility called columnstore indexes that would have the same effect.
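For example (a sketch; note that on SQL Server 2012/2014 a nonclustered columnstore index makes the table read-only until it is dropped or disabled, so it suits reporting tables):
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_account_details
ON dbo.ACCOUNT_DETAILS (CURRENT_BALANCE);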
The advice to use indexed views for aggregate functions doesn't seem like good advice in general. For instance, if the above query used MIN() or MAX() instead, then the above index should be the optimal index for the query, and it should run quite fast.
EDIT:
Your reference article is quite reasonable. If you want to create an indexed view for this purpose, then create it with aggregation.
CREATE VIEW dbo.CURRENT_BALANCE_VIEW
WITH SCHEMABINDING
AS
SELECT SUM(CURRENT_BALANCE) as bal, COUNT_BIG(CURRENT_BALANCE) as cnt
FROM dbo.ACCOUNT_DETAILS;
This is a little weird, because it returns one row. I think the following will work:
create index idx_account_details on current_balance_view(bal);
If not, you may need to introduce a dummy column for the index.
Then:
select *
from dbo.current_balance_view;
should have the precomputed value.

Performance Tuning SQL

I have the following SQL. When I check the execution plan of this query, I see an index scan. How do I replace it with an index seek? I have a nonclustered index on the IsDeleted column.
SELECT Cast(Format(Sum(COALESCE(InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
BP.BoundProjectId AS ProjectId
FROM BoundProducts BP
WHERE BP.IsDeleted=0 or BP.IsDeleted is null
GROUP BY BP.BoundProjectId
I tried it like this and got an index seek, but the result was wrong, because dropping the IS NULL condition loses the rows where IsDeleted is null.
SELECT Cast(Format(Sum(COALESCE(InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
BP.BoundProjectId AS ProjectId
FROM BoundProducts BP
WHERE BP.IsDeleted=0
GROUP BY BP.BoundProjectId
Can anyone kindly suggest how to get the right result set with an index seek?
I mean, how do I rewrite the (BP.IsDeleted=0 or BP.IsDeleted is null) condition so it can make use of an index seek?
Edit, added row counts from comments of one of the answers:
null: 254962 rows
0: 392002 rows
1: 50211 rows
You're not getting an index seek because you're fetching almost 93% of the rows in the table, and in that kind of scenario just scanning the whole index is faster and cheaper.
If you have performance issues, you should look into removing the FORMAT() function, especially if the query returns a lot of rows. Read more from this blog post.
Another option might be to create an indexed view and pre-aggregate your data. This of course adds overhead to update/insert operations, so consider it only if this query runs really often compared to how often the table is updated; a sketch follows.
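Such a pre-aggregating indexed view might look roughly like this (a sketch only: ISNULL is used because indexed views disallow SUM over a nullable expression, COUNT_BIG(*) is mandatory, and the ISNULL-based filter is one way to fold the zero-or-NULL test into a form an indexed view accepts):
CREATE VIEW dbo.ProjectTotals
WITH SCHEMABINDING
AS
SELECT BoundProjectId,
SUM(ISNULL(InstalledSubtotal, 0)) AS TotalSoldNet,
COUNT_BIG(*) AS CountBig
FROM dbo.BoundProducts
WHERE ISNULL(IsDeleted, 0) = 0
GROUP BY BoundProjectId;
GO
CREATE UNIQUE CLUSTERED INDEX cdx_ProjectTotals
ON dbo.ProjectTotals (BoundProjectId);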
Have you tried a UNION ALL?
select ...
Where isdeleted = 0
UNION ALL
select ...
Where isdeleted is null
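Spelled out against the query from the question, that might look like the sketch below. Note the outer re-aggregation: it merges projects that have rows in both branches, which a bare UNION ALL of the two grouped queries would return as two separate rows.
SELECT Cast(Format(Sum(Sub.PartialTotal), 'F') AS MONEY) AS TotalSoldNet,
Sub.BoundProjectId AS ProjectId
FROM (
SELECT Sum(COALESCE(InstalledSubtotal, 0)) AS PartialTotal, BoundProjectId
FROM BoundProducts
WHERE IsDeleted = 0
GROUP BY BoundProjectId
UNION ALL
SELECT Sum(COALESCE(InstalledSubtotal, 0)) AS PartialTotal, BoundProjectId
FROM BoundProducts
WHERE IsDeleted IS NULL
GROUP BY BoundProjectId
) AS Sub
GROUP BY Sub.BoundProjectId;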
Or you can add an index hint: WITH (INDEX(indexname)).
Also be aware that cardinality will determine whether SQL Server uses a seek or a scan. Seeks are quicker, but can require key lookups if the index doesn't cover all the required columns. A good rule of thumb is that if the query will return more than about 5% of the table, SQL Server will prefer a scan.

CREATING INDEX SQL Server 2008

Recently I was put onto database fine-tuning. I have some ideas about SQL Server and decided to create some indexes.
I referred to this: http://sqlserverplanet.com/ddl/create-index
But I don't understand how the other index options, like INCLUDE and WITH, will help. I tried Google too, but failed to find a simple description of when to use them.
CREATE NONCLUSTERED INDEX IX_NC_PresidentNumber
ON dbo.Presidents (PresidentNumber)
INCLUDE (President,YearsInOffice,RatingPoints)
WHERE ElectoralVotes IS NOT NULL
CREATE NONCLUSTERED INDEX IX_NC_PresidentNumber
ON dbo.Presidents (PresidentNumber)
WITH ( DATA_COMPRESSION = ROW )
CREATE NONCLUSTERED INDEX IX_NC_PresidentNumber
ON dbo.Presidents (PresidentNumber)
WITH ( DATA_COMPRESSION = PAGE )
In what scenarios should I use the above? Will they increase performance?
Data compression will help your query performance too: after compression, fewer pages/extents have to be loaded when you run a query, so I/O is reduced, and reducing I/O is always a good choice.
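If you want to gauge the benefit before committing, SQL Server ships a stored procedure that estimates the savings (a sketch; passing NULL for index_id and partition_number covers all indexes and partitions):
EXEC sp_estimate_data_compression_savings
@schema_name = 'dbo',
@object_name = 'Presidents',
@index_id = NULL,
@partition_number = NULL,
@data_compression = 'PAGE';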
I can't speak to the DATA_COMPRESSION option, but the INCLUDE option can definitely improve performance. If you select only PresidentNumber plus one or more of the President, YearsInOffice, or RatingPoints columns, filtering on ElectoralVotes IS NOT NULL, then your query will get its values from the index itself and never touch the underlying table. If your query references any additional column of the table, it will have to retrieve values from the table as well as the index.
Select top 20 PresidentNumber, President, YearsInOffice, RatingPoints
From Presidents
where ElectoralVotes IS NOT NULL
The above query will read only from IX_NC_PresidentNumber and won't have to pull data from the Presidents table, because all columns in the query are included in the index.
Select top 20 PresidentNumber, President, YearsInOffice, PoliticalParty
From Presidents
where ElectoralVotes IS NOT NULL
This query will use the index IX_NC_PresidentNumber and the Presidents table as well because the PoliticalParty column in the query is not included in the index.
Select PresidentNumber, President, YearsInOffice, RatingPoints
From Presidents
Where RatingPoints > 50
This query will most likely end up doing a table scan, because the WHERE clause in the query doesn't match the WHERE clause used in the index, and there is no limit on the row count.
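If that last query matters to you, a separate index keyed on its filter column would let it seek (a sketch; the index name is made up):
CREATE NONCLUSTERED INDEX IX_NC_RatingPoints
ON dbo.Presidents (RatingPoints)
INCLUDE (PresidentNumber, President, YearsInOffice)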

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an IN clause on an id column and then also select other columns, the IN is evaluated first, and then the Details column and the other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SQL Fiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (even though the IN clause eliminates 99% of the rows).
Otherwise the query plan shown in SQL Fiddle is identical; the only difference is that the RID Lookup for the second query in SQL Fiddle is a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to determine why it uses a table scan here?
How can I influence it to use a RID Lookup instead?
From looking at operator costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Details column, then the performance of both queries is very close in production. It is only after adding other columns like Details that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used a RID Lookup, I was surprised but slightly confused...
The table doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details, though. I can experiment with that more, but I would like to have an understanding of what is going on before I start shooting in the dark with random indexes.
What if you changed your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the Details column is included. This is an almost sure sign that either the Details column is not part of an index, or, if it is part of an index, the data in that column is mostly unique, such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over index access, as it can take advantage of things like block reads to access the table records faster than a potentially fragmented read of an index.
To influence the path chosen by the optimiser, you would need to look at the indexes that could be added or modified to make index access more efficient, but this should be done with care, as it can adversely affect other queries as well as degrade insert performance.
The other important thing you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency appropriate to the rate of change of the frequency distribution in the table data.
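As a minimal sketch, refreshing the statistics on the table in question looks like this (FULLSCAN is the most thorough option; omit it to let SQL Server sample):
UPDATE STATISTICS SupportContacts WITH FULLSCAN;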
If it's true that 99% of the rows would be omitted when the query is performed via the relevant index + RID lookup, then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignId IN (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics: you could use a table hint. If you know which index your plan should be using (one that contains the Id and ForeignId columns), stick it in your query as a hint to force the SQL optimiser to use that index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
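As a sketch, with a hypothetical index name (substitute whichever of your indexes contains ForeignId and Id):
SELECT [Id], [ForeignId], Details
FROM SupportContacts WITH (INDEX(IX_SupportContacts_ForeignId))
WHERE ForeignId In (1,2,3,5)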
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);