Speed up SQL query with aggregates on DateTime and GROUP BY

I have a large (> 100 million rows) table in my MS SQL database with the following columns:
Id int not null,
ObjectId int not null,
Timestamp datetime not null,
State int not null
Id is the primary key of the table (and has a clustered index on it). I added a non-clustered index on Timestamp and ObjectId (in this order). There are only around 2000 distinct values in ObjectId. I now want to perform the following query:
SELECT ObjectId, MAX(Timestamp) FROM Table GROUP BY ObjectId
It takes around four seconds, which is too slow for my application. The execution plan says that 97% of the runtime goes to an Index Scan of the non-clustered index.
On a copy of the table I created a clustered index on ObjectId and Timestamp. The resulting runtime is the same; the execution plan now shows an Index Scan of the clustered index.
Is there any other possibility to improve the runtime without splitting the table's data into multiple tables?

I can propose another approach: add a bit column Last, and before inserting a new row for an ObjectId, update that object's current Last = 1 row to 0. Create an index on (ObjectId, Last). The query then becomes very simple:
SELECT ObjectId, Timestamp FROM Table WHERE Last = 1
No more GROUP BY and full scan, but one extra UPDATE for each INSERT.
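A minimal sketch of the maintenance pattern in T-SQL (the procedure name and parameters are illustrative):
CREATE PROCEDURE dbo.InsertObjectState
    @ObjectId int,
    @Timestamp datetime,
    @State int
AS
BEGIN
    BEGIN TRANSACTION;

    -- flip the previous "last" row for this object
    UPDATE dbo.[Table]
    SET Last = 0
    WHERE ObjectId = @ObjectId AND Last = 1;

    -- insert the new row as the current one
    INSERT INTO dbo.[Table] (ObjectId, [Timestamp], State, Last)
    VALUES (@ObjectId, @Timestamp, @State, 1);

    COMMIT;
END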

Four seconds is not bad for that kind of work on a DB with more than 100M rows.
You can archive some data daily into another table to preserve history. You can also archive all data into another table and delete the old versions of the objects:
delete from Table where Id in (select t1.Id from Table t1, Table t2
where t1.ObjectId = t2.ObjectId and t1.Timestamp < t2.Timestamp)

For this particular query, an index on (ObjectId, Timestamp) will be optimal. And there is a chance that (ObjectId, Timestamp DESC) will perform even faster.
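For reference, here is that index (name illustrative):
CREATE NONCLUSTERED INDEX IX_Table_ObjectId_Timestamp
ON dbo.[Table] (ObjectId, [Timestamp] DESC);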

Related

SQL Server Update Indexing Design

I have a table, IDMAP, with the following DDL:
CREATE TABLE tempdb2.dbo.idmaptemp (
OldId varchar(20),
CV_ModStamp datetimeoffset,
NewId varchar(20),
RestoreComplete bit,
RestoreErrorMessage varchar(1000),
OperationType varchar(20)
)
As defined, it already contains a predefined set of rows (about 1 million). When the restore operation is complete, I have to update NewId, RestoreComplete, and RestoreErrorMessage in the table.
The statement is:
update tempdb2.dbo.IdMaptemp set NewId = 'xxx', RestoreComplete = 'false', RestoreErrorMessage = 'error' where OldId = 'ABC';
The Java application holds about a million values in memory and has to update the rows with the above statement. The database is set to autocommit off, and the updates are batched (batch size 500).
I have tried two options on Indexing with OldId field:
Clustered index - the execution plan lists a clustered index update (100% cost). This occurs because the leaves are the rows being updated, which triggers an index update. Am I right here?
Non-clustered index - the execution plan lists an update (75%) and a seek (25%).
Are there any other speed-ups that can be achieved for a mass update on a database table? The table cannot be cleared and re-inserted, as there are other rows that aren't affected by the updates. With the clustered index and batches of 500 rows, the update has taken around 7 hours.
Should I go for the Non-Clustered index option?
Changing a large table's clustered index is an expensive proposition. A table's clustered index is defined for the entire table and not for a subset of rows.
If you're leaving OldId as the clustered index and just want to improve the batching performance, consider allowing the db to participate in the batching process rather than the application/Java layer. Asking the db to update millions of rows one row at a time is an expensive proposition. Populating a temp table with a batch's worth of rows and then letting SQL update the entire batch at a time can be a good way of improving performance.
insert into #TempTable (OldId, NewId)
...
update T1
set T1.NewId = T2.NewId
from tempdb2.dbo.IdMaptemp T1
join #TempTable T2
on T1.OldId = T2.OldId
If you can compute the new id, consider another batching tactic.
update top (1000) tempdb2.dbo.IdMaptemp set NewId = 'xxx', RestoreComplete = 'false',
RestoreErrorMessage = 'error' where NewId is null;
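Repeated until no rows remain, that looks roughly like this (a sketch; it assumes unprocessed rows are identifiable by NewId IS NULL):
WHILE 1 = 1
BEGIN
    UPDATE TOP (1000) tempdb2.dbo.IdMaptemp
    SET NewId = 'xxx',
        RestoreComplete = 'false',
        RestoreErrorMessage = 'error'
    WHERE NewId IS NULL;

    IF @@ROWCOUNT = 0 BREAK; -- no unprocessed rows left
END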
If you really want to create a new table with NewId as the clustered index:
Create the new table as you like
insert into NewTable
select top (10000) O.*
from OldTable O
left join NewTable N
on O.OldId = N.OldId
where N.OldId is null
When done, drop the old table.
Note: Does your id need to be 20 bytes? Typically clustered indexes are either int - 4 bytes or bigint - 8 bytes.
If this is a one-time thing, then changing the clustered index on a large persistent table will be worth it. If, though, OldId will always be in the process of acquiring the NewId value, and that's just the workflow you have, I wouldn't bother changing the persistent table's clustered index. Just leave OldId as the clustered index. NewId sounds like a surrogate key.

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially less than the number of rows the index covers. In my case, about 1 in 30 days is distinct.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
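A minimal sketch of such a trigger (function and trigger names are illustrative; the NOT EXISTS guard can race under concurrent inserts, so the PRIMARY KEY on days.day remains the safety net):
CREATE OR REPLACE FUNCTION data_day_sync()
  RETURNS trigger AS
$$
BEGIN
   -- track the day only if it is not already known
   INSERT INTO days (day)
   SELECT NEW.day
   WHERE NOT EXISTS (SELECT 1 FROM days WHERE day = NEW.day);
   RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_day_sync
BEFORE INSERT ON data
FOR EACH ROW EXECUTE PROCEDURE data_day_sync();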
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT day FROM data ORDER BY 1 LIMIT 1
)
UNION ALL
SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
FROM cte c
WHERE c.day IS NOT NULL -- exit condition
)
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
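For example (index name illustrative):
CREATE INDEX data_day_idx ON data (day);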
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index

Correct SQL index for Partition + Order to remove SORT

I have a SQL statement which I am trying to optimise to remove the sort operator:
SELECT *,ROW_NUMBER() OVER (
PARTITION BY RuleInstanceId
ORDER BY [Timestamp] DESC
) AS rn
FROM RuleInstanceHistoricalMembership
Everything I have read (e.g. Optimizing SQL queries by removing Sort operator in Execution plan) suggests this is the correct index to add; however, it appears to have no effect at all.
CREATE NONCLUSTERED INDEX IX_MyIndex ON dbo.[RuleInstanceHistoricalMembership](RuleInstanceId, [Timestamp] DESC)
I must be missing something, as I have read heaps of articles which all seem to suggest that an index spanning both columns should solve this issue.
Technically the index you have added does allow you to avoid a sort.
However, the index you have created is non-covering, so SQL Server would then also need to perform 60 million key lookups back to the base table.
Simply scanning the clustered index and sorting it on the fly is costed as being considerably cheaper than that option.
In order to get the index to be used automatically, you would need to do one of the following:
Remove columns from the query SELECT list so the index covers it.
Add INCLUDE-d columns to the index.
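For example, a covering variant might look like this (a sketch; the INCLUDE list must name every remaining column the query selects, with Col2 and Col3 standing in for them):
CREATE NONCLUSTERED INDEX IX_MyIndex_Covering
ON dbo.RuleInstanceHistoricalMembership (RuleInstanceId, [Timestamp] DESC)
INCLUDE (Col2, Col3);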
BTW: For a table with 60 million rows you may well find that even if you were to force the issue with an index hint on the non-covering index, you still wouldn't get the desired result of avoiding a sort.
CREATE TABLE RuleInstanceHistoricalMembership
(
ID INT PRIMARY KEY,
Col2 INT,
Col3 INT,
RuleInstanceId INT,
[Timestamp] INT
)
CREATE NONCLUSTERED INDEX IX_MyIndex
ON dbo.[RuleInstanceHistoricalMembership](RuleInstanceId, [Timestamp] DESC)
/*Fake small table*/
UPDATE STATISTICS RuleInstanceHistoricalMembership
WITH ROWCOUNT = 600,
PAGECOUNT = 10
SELECT *,
ROW_NUMBER() OVER ( PARTITION BY RuleInstanceId
ORDER BY [Timestamp] DESC ) AS rn
FROM RuleInstanceHistoricalMembership WITH (INDEX = IX_MyIndex)
This gives a plan with no sort. Now bump up the row and page counts:
/*Fake large table*/
UPDATE STATISTICS RuleInstanceHistoricalMembership
WITH ROWCOUNT = 60000000,
PAGECOUNT = 10000000
Try again, and now the plan has two sorts!
The scan on the NCI is in (RuleInstanceId, Timestamp DESC) order, but SQL Server then reorders it into clustered index key order (Id ASC), per Optimizing I/O Performance by Sorting. This step is meant to reduce the expected massive cost of 60 million random lookups into the clustered index. Afterwards, it gets sorted back into the original (RuleInstanceId, Timestamp DESC) order in which the index delivered it.

SQLite: COUNT slow on big tables

I'm having a performance problem in SQLite with SELECT COUNT(*) on large tables.
As I hadn't yet received a usable answer and had done some further testing, I edited my question to incorporate my new findings.
I have 2 tables:
CREATE TABLE Table1 (
Key INTEGER NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL,
CONSTRAINT PK_Table1 PRIMARY KEY (Key ASC))
CREATE TABLE Table2 (
Key INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
... a few other fields ...,
CONSTRAINT PK_Table2 PRIMARY KEY (Key ASC, Key2 ASC))
Table1 has around 8 million records and Table2 has around 51 million records, and the database file is over 5 GB.
Table1 has 2 more indexes:
CREATE INDEX IDX_Table1_Status ON Table1 (Status ASC, Key ASC)
CREATE INDEX IDX_Table1_Selection ON Table1 (Selection ASC, Key ASC)
"Status" is a required field but has only 6 distinct values; "Selection" is not required and has only around 1.5 million non-null values, of which only around 600k are distinct.
I did some tests on both tables; you can see the timings below, and I added the "explain query plan" for each request (QP). I placed the database file on a USB memory stick so I could remove it after each test and get reliable results without interference from the disk cache. Some requests are faster on USB (I suppose due to the lack of seek time), but some are slower (table scans).
SELECT COUNT(*) FROM Table1
Time: 105 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
Time: 153 sec
QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Key = 5123456
Time: 5 ms
QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 16 sec
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
SELECT * FROM Table1 WHERE Selection = 'SomeValue' AND Key > 5123456 LIMIT 1
Time: 9 ms
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Selection (Selection=?) (~3 rows)
As you can see, the counts are very slow, but normal selects are fast (except for the second one, the Status query, which took 16 seconds).
The same goes for Table2:
SELECT COUNT(*) FROM Table2
Time: 528 sec
QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~1000000 rows)
SELECT COUNT(Key) FROM Table2
Time: 249 sec
QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table2 WHERE Key = 5123456 AND Key2 = 0
Time: 7 ms
QP: SEARCH TABLE Table2 USING INDEX sqlite_autoindex_Table2_1 (Key=? AND Key2=?) (~1 rows)
Why is SQLite not using the automatically created index on the primary key of Table1?
And why, when it uses the auto-index on Table2, does it still take a lot of time?
I created the same tables with the same content and indexes on SQL Server 2008 R2 and there the counts are nearly instantaneous.
One of the comments below suggested executing ANALYZE on the database. I did, and it took 11 minutes to complete.
After that, I ran some of the tests again:
SELECT COUNT(*) FROM Table1
Time: 104 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~7848023 rows)
SELECT COUNT(Key) FROM Table1
Time: 151 sec
QP: SCAN TABLE Table1 (~7848023 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 5 ms
QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid>?) (~196200 rows)
SELECT COUNT(*) FROM Table2
Time: 529 sec
QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~51152542 rows)
SELECT COUNT(Key) FROM Table2
Time: 249 sec
QP: SCAN TABLE Table2 (~51152542 rows)
As you can see, the queries took the same time (except that the query plan now shows the real number of rows); only the slower select is now also fast.
Next, I created an extra index on the Key field of Table1, which should correspond to the auto-index. I did this on the original database, without the ANALYZE data. It took over 23 minutes to create this index (remember, this is on a USB stick).
CREATE INDEX IDX_Table1_Key ON Table1 (Key ASC)
Then I ran the tests again:
SELECT COUNT(*) FROM Table1
Time: 4 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Key(~1000000 rows)
SELECT COUNT(Key) FROM Table1
Time: 167 sec
QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 17 sec
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
As you can see, the index helped with the count(*), but not with the count(Key).
Finally, I created the table using a column constraint instead of a table constraint:
CREATE TABLE Table1 (
Key INTEGER PRIMARY KEY ASC NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL)
Then I ran the tests again:
SELECT COUNT(*) FROM Table1
Time: 6 sec
QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
Time: 28 sec
QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
Time: 10 sec
QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
Although the query plans are the same, the times are a lot better. Why is this ?
The problem is that ALTER TABLE does not permit converting an existing table, and I have a lot of existing databases which I cannot convert to this form. Besides, using a column constraint instead of a table constraint won't work for Table2.
Does anyone have any idea what I am doing wrong and how to solve this problem?
I used System.Data.SQLite version 1.0.74.0 to create the tables and to run the tests I used SQLiteSpy 1.9.1.
Thanks,
Marc
If you haven't DELETEd any records, doing:
SELECT MAX(_ROWID_) FROM "table" LIMIT 1;
will avoid the full-table scan.
Note that _ROWID_ is a SQLite identifier.
From http://old.nabble.com/count(*)-slow-td869876.html
SQLite always does a full table scan for count(*). It
does not keep meta information on tables to speed this
process up.
Not keeping meta information is a deliberate design
decision. If each table stored a count (or better, each
node of the B-tree stored a count) then much more updating
would have to occur on every INSERT or DELETE. This
would slow down INSERT and DELETE, even in the common
case where count(*) speed is unimportant.
If you really need a fast COUNT, then you can create
a trigger on INSERT and DELETE that updates a running
count in a separate table then query that separate
table to find the latest count.
Of course, it's not worth keeping a full row count if you
need COUNTs dependent on WHERE clauses (e.g. WHERE field1 > 0 AND field2 < 1000000000).
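A minimal sketch of that trigger approach for Table1 (the count table and trigger names are illustrative):
CREATE TABLE row_counts (
    table_name TEXT PRIMARY KEY,
    cnt INTEGER NOT NULL
);

-- seed with the current count (one slow scan, once)
INSERT INTO row_counts VALUES ('Table1', (SELECT COUNT(*) FROM Table1));

CREATE TRIGGER Table1_count_ins AFTER INSERT ON Table1
BEGIN
    UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'Table1';
END;

CREATE TRIGGER Table1_count_del AFTER DELETE ON Table1
BEGIN
    UPDATE row_counts SET cnt = cnt - 1 WHERE table_name = 'Table1';
END;

-- the fast count thereafter:
SELECT cnt FROM row_counts WHERE table_name = 'Table1';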
This may not help much, but you can run the ANALYZE command to rebuild the statistics for your database. Try running "ANALYZE;" against the entire database, then run your query again and see if it is any faster.
Do not count the stars, count the records! Or in other words, never issue
SELECT COUNT(*) FROM tablename;
use
SELECT COUNT(ROWID) FROM tablename;
Call EXPLAIN QUERY PLAN for both to see the difference. Make sure you have an index in place containing all columns mentioned in the WHERE clause.
On the matter of the column constraint, SQLite maps columns that are declared to be INTEGER PRIMARY KEY to the internal row id (which in turn admits a number of internal optimizations). Theoretically, it could do the same for a separately-declared primary key constraint, but it appears not to do so in practice, at least with the version of SQLite in use. (System.Data.SQLite 1.0.74.0 corresponds to core SQLite 3.7.7.1. You might want to try re-checking your figures with 1.0.79.0; you shouldn't need to change your database to do that, just the library.)
The output for the fast queries all starts with the text "QP: SEARCH", whilst that for the slow queries starts with "QP: SCAN", which suggests that SQLite is performing a scan of the entire table in order to generate the count.
Googling for "sqlite table scan count" finds the following, which suggests that using a full table scan to retrieve a count is just the way sqlite works, and is therefore probably unavoidable.
As a workaround, and given that Status has only six distinct values, I wondered if you could get a count quickly using a query like the following,
select 1 from Table1 where status=1
union all
select 1 from Table1 where status=2
...
then count the rows in the result. This is clearly ugly, but it might work if it persuades SQLite to run the query as a search rather than a scan. The idea of returning "1" each time is to avoid the overhead of returning real data.
Here's a potential workaround to improve the query performance. From the context, it sounds like your query takes about a minute and a half to run.
Assuming you have a date_created column (or can add one), run a query in the background each day at midnight (say at 00:05am) and persist the value somewhere, along with the last_updated date at which it was calculated (I'll come back to that in a bit).
Then, running against your date_created column (with an index), you can avoid a full table scan by doing a query like SELECT COUNT(*) FROM TABLE WHERE date_created > '[TODAY] 00:00:05'.
Add the count value from that query to your persisted value, and you have a reasonably fast count that's generally accurate.
The only catch is that from 12:05am to 12:07am (the window during which your total count query is running) you have a race condition, for which you can check the last_updated value of your full-table-scan count. If it's > 24 hours old, your incremental count query needs to pull a full day's count plus the time elapsed today. If it's < 24 hours old, your incremental count query needs to pull a partial day's count (just the time elapsed today).
I had the same problem; in my situation the VACUUM command helped. After executing it on the database, COUNT(*) speed increased nearly 100 times. However, the command itself takes some minutes on my database (20 million records). I solved this by running VACUUM when my software exits, after the main window is destroyed, so the delay doesn't cause problems for the user.

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criterion > 1000 on a unique column is non-selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). For an index to be considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?
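A sketch of such a hint, assuming the clustered primary key index is named PK_WorkFlow as in the DDL above:
SELECT TOP 100 ID
FROM Workflow WITH (INDEX (PK_WorkFlow))
WHERE ID > 1000
ORDER BY ID;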