SQL order by in two steps - sql

I have a big table with two datetime columns.
[Timestamp] and [TimestampRounded]
The [Timestamp] column has the full timestamp including milliseconds and the table has no index for this column.
The [TimestampRounded] column has the timestamp with milliseconds, seconds, and minutes truncated (set to 0). The table has a clustered index for this column. That is, the table is effectively stored in the order of this column. Typically the newest row is on the top of the table. The index was created like this:
CREATE CLUSTERED INDEX cidx_time ON [dbo].[MyTable] ([TimestampRounded] DESC)
Now, I want to retrieve some data leveraging my clustered index, so I do the following select. My table has around 5 million rows.
Query 1:
SELECT TOP(100) * FROM [dbo].[MyTable] ORDER BY [TimestampRounded] DESC
This query returns immediately (less than 1 second). But the 100 returned rows are not ordered with respect to milliseconds, only by hour.
Then I learned if I also want to order by a second column I do:
Query 2:
SELECT TOP(100) * FROM [dbo].[MyTable] ORDER BY [TimestampRounded] DESC, [Timestamp] DESC
This query is very slow and takes around 23 seconds to return the 100 rows.
My immediate solution was to use the first query and then just order those returned 100 rows in my client frontend code. But with that approach I missed rows that should have been returned, so I would like to understand how I can fix/rewrite query 2 to return those 100 sorted rows as expected; by reasonable logic it should also take less than 1 second. Since the table is already stored by hour (clustered index), I do not understand why it should take longer.

I might be oversimplifying, but why not simply create an index on the column that stores the entire timestamp?
CREATE INDEX cidx_time2 ON [dbo].[MyTable] ([Timestamp] DESC)
Then, you can just do:
SELECT TOP(100) * FROM [dbo].[MyTable] ORDER BY [Timestamp] DESC
Or, if you need the two timestamps in the ORDER BY clause for some reason, then you want an index on both columns:
CREATE INDEX cidx_time3 ON [dbo].[MyTable] ([TimestampRounded] DESC, [Timestamp] DESC);
Then you can run your original query:
SELECT TOP(100) * FROM [dbo].[MyTable] ORDER BY [TimestampRounded] DESC, [Timestamp] DESC

Specify WITH TIES so SQL Server returns up to "several thousand" rows that all share the same rounded timestamp value, then order those several thousand rows by the precise timestamp to get your truly most recent 100. It is quicker to sort thousands of rows than millions.
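A minimal sketch of that idea in T-SQL, assuming the table and clustered index from the question. The derived-table alias recent is only an illustrative name, and whether the optimizer really reads just the newest rows from the clustered index should be checked against the actual execution plan.

```sql
-- Inner query: walk the clustered index from the newest hour and keep the first
-- 100 rows plus every row that ties with the 100th on [TimestampRounded].
-- Outer query: sort only that small set by the precise timestamp.
SELECT TOP (100) *
FROM (
    SELECT TOP (100) WITH TIES *
    FROM [dbo].[MyTable]
    ORDER BY [TimestampRounded] DESC
) AS recent
ORDER BY [Timestamp] DESC;
```

Because [TimestampRounded] is derived from [Timestamp] by truncation, ordering the outer query by [Timestamp] DESC alone yields the same order as ORDER BY [TimestampRounded] DESC, [Timestamp] DESC. The ties also pull in every row of the oldest hour touched, which should avoid the missed rows seen when sorting the plain TOP(100) result on the client.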

Related

SQL Server 1 million records: best way to get fastest last record of table?

Example: I have a table A with 1 million records. What is the way to get fastest last records?
I know: SELECT TOP 1 * FROM A ORDER BY ID DESC
But I don't think it's a good way for me.
The query in your question will perform very well if you have a clustered index (which may be the primary key index) on ID. There is no faster way to retrieve all columns from a single row of a table.
I'll add that a table is logically an unordered set of rows so ORDER BY is required to return a "last" or "first" row. The b-tree index on the ORDER BY column will locate the row efficiently.
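As a rough illustration of the setup described above, here is a sketch in T-SQL. The table definition is hypothetical (the question does not show it); the Payload column and the constraint name PK_A are placeholders.

```sql
-- Hypothetical definition of table A; Payload is only a placeholder column.
CREATE TABLE dbo.A
(
    ID      INT IDENTITY(1, 1) NOT NULL,
    Payload VARCHAR(100) NULL,
    CONSTRAINT PK_A PRIMARY KEY CLUSTERED (ID)
);

-- With the clustered index on ID, the ORDER BY can be satisfied by reading
-- a single row from the end of the b-tree instead of sorting the table.
SELECT TOP (1) *
FROM dbo.A
ORDER BY ID DESC;
```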
You have only one way: an index on the primary key and on the columns used in WHERE. ORDER BY has a small cost, but that is fine if you have an index on the ORDER BY column.
--ORDER BY 1 DESC orders by the first column in the SELECT list, descending (here assumed to be the primary key)
SELECT [Columns] FROM [TABLENAME] ORDER BY 1 DESC
--or you can use this if your first column is IDENTITY or A/A
SELECT [Columns] FROM [TABLENAME] ORDER BY [YOUR_COLUMN_WITHA/A ] DESC

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to quickly query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially less than the number of rows the index covers. In my case, about 1 in 30 days is distinct.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
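A sketch of how the lookup table, trigger, and foreign key could fit together. The function, trigger, and constraint names are made up for illustration, and ON CONFLICT assumes PostgreSQL 9.5 or later; on 9.3 or 9.4 the trigger body would need a different way to skip existing days.

```sql
CREATE TABLE days (
    day date PRIMARY KEY
);

-- Seed the lookup table from existing rows before adding the foreign key.
INSERT INTO days (day)
SELECT DISTINCT day FROM data WHERE day IS NOT NULL;

-- Hypothetical trigger function: record each new day once.
CREATE FUNCTION data_track_day() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF NEW.day IS NOT NULL THEN
        INSERT INTO days (day) VALUES (NEW.day)
        ON CONFLICT (day) DO NOTHING;  -- assumption: PostgreSQL 9.5+
    END IF;
    RETURN NEW;
END
$$;

-- BEFORE trigger, so the days row exists when the foreign key is checked.
CREATE TRIGGER data_track_day_trg
BEFORE INSERT OR UPDATE OF day ON data
FOR EACH ROW EXECUTE PROCEDURE data_track_day();

ALTER TABLE data
    ADD CONSTRAINT data_day_fkey FOREIGN KEY (day) REFERENCES days (day);
```

This sketch never removes a day from days when its last row in data is deleted; if that matters, a separate cleanup step would be needed.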
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT day FROM data ORDER BY 1 LIMIT 1
)
UNION ALL
SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
FROM cte c
WHERE c.day IS NOT NULL -- exit condition
)
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index

Correct SQL index for Partition + Order to remove SORT

I have a SQL statement which I am trying to optimise to remove the sort operator
SELECT *,ROW_NUMBER() OVER (
PARTITION BY RuleInstanceId
ORDER BY [Timestamp] DESC
) AS rn
FROM RuleInstanceHistoricalMembership
Everything I have read (eg. Optimizing SQL queries by removing Sort operator in Execution plan) suggests this is the correct index to add however it appears to have no effect at all.
CREATE NONCLUSTERED INDEX IX_MyIndex ON dbo.[RuleInstanceHistoricalMembership](RuleInstanceId, [Timestamp] DESC)
I must be missing something as I have read heaps of articles which all seem to suggest an index spanning both columns should solve this issue
Technically the index you have added does allow you to avoid a sort.
However the index you have created is non covering so SQL Server would then also need to perform 60 million key lookups back to the base table.
Simply scanning the clustered index and sorting it on the fly is costed as being considerably cheaper than that option.
In order to get the index to be used automatically you would need to do either of the following:
Remove columns from the query SELECT list so the index covers it.
Add INCLUDE-d columns to the index.
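For the second option, a covering index might look like the sketch below. It uses the columns of the demo table defined further down (Col2, Col3), not the real table, whose full column list is not shown in the question; the index name IX_MyIndex_Covering is hypothetical.

```sql
-- Key columns match the PARTITION BY / ORDER BY of the window function.
-- INCLUDE carries the remaining columns that SELECT * needs; the clustered
-- key (ID) is stored in every nonclustered index already.
CREATE NONCLUSTERED INDEX IX_MyIndex_Covering
ON dbo.[RuleInstanceHistoricalMembership] (RuleInstanceId, [Timestamp] DESC)
INCLUDE (Col2, Col3);
```

The trade-off is a wider index: it takes more space and adds write overhead, so it is worth including only the columns the query really returns.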
BTW: For a table with 60 million rows you may well find that even if you were to try and force the issue with an index hint on the non covering index you still don't get the desired results of avoiding a sort.
CREATE TABLE RuleInstanceHistoricalMembership
(
ID INT PRIMARY KEY,
Col2 INT,
Col3 INT,
RuleInstanceId INT,
[Timestamp] INT
)
CREATE NONCLUSTERED INDEX IX_MyIndex
ON dbo.[RuleInstanceHistoricalMembership](RuleInstanceId, [Timestamp] DESC)
/*Fake small table*/
UPDATE STATISTICS RuleInstanceHistoricalMembership
WITH ROWCOUNT = 600,
PAGECOUNT = 10
SELECT *,
ROW_NUMBER() OVER ( PARTITION BY RuleInstanceId
ORDER BY [Timestamp] DESC ) AS rn
FROM RuleInstanceHistoricalMembership WITH (INDEX = IX_MyIndex)
This gives a plan with no sort. Now increase the row and page count:
/*Fake large table*/
UPDATE STATISTICS RuleInstanceHistoricalMembership
WITH ROWCOUNT = 60000000,
PAGECOUNT = 10000000
Try again and the plan now has two sorts!
The scan on the NCI is in RuleInstanceId, Timestamp DESC order but then SQL Server reorders it into clustered index key order (Id ASC) per Optimizing I/O Performance by Sorting.
This step is to try and reduce the expected massive cost of 60 million random lookups into the clustered index. Then it gets sorted back into the original RuleInstanceId, Timestamp DESC order that the index delivered it in.

Getting a specific number of rows from Database using RowNumber; Inconsistent results

Here is my SQL query:
select * from TABLE T where ROWNUM<=100
If I execute this and then re-execute it, I don't get the same result. Why?
Also, on a Sybase system, if I execute
set rowcount 100
select * from TABLE
even on re-execution I get the same result?
Can someone explain why, and provide a possible solution for ROWNUM?
Thanks
If you don't use ORDER BY in your query you get the results in natural order.
Natural order is whatever is fastest for the database at the moment.
A possible solution is to ORDER BY your primary key, if it's an INT
SELECT TOP 100 START AT 1 * FROM TABLE
ORDER BY TABLE.ID;
If your primary key is not a sequentially incrementing integer and you don't have another column to order by (such as a timestamp), you may need to create an extra column SORT_ORDER INT and increment it automatically on insert using either an autoincrement column or a sequence and an insert trigger, depending on the database.
Make sure to create an index on that column to speed up the query.
You need to specify an ORDER BY. Queries without an explicit ORDER BY clause make no guarantee about the order in which the rows are returned. And from this result set you take the first 100 rows. As the order in which the rows are returned can be different every time, so can your first 100 rows.
You need to use ORDER BY first, followed by ROWNUM. You will get inconsistent results if you don't follow this order.
select * from
(
select * from TABLE T ORDER BY rowid
) where ROWNUM<=100

TSQL to select the last 10 rows from a table?

I have a table that contains 300 million rows, and a clustered index on the [DataDate] column.
How do I select the last 10 rows of this table (I want to find the most recent date in the table)?
Database: Microsoft SQL Server 2008 R2.
Update
The answers below work perfectly - but only if there is a clustered index on [DataDate]. The table is, after all, 300 million rows, and a naive query would end up taking hours to execute rather than seconds. The query plan is using the clustered index on [DataDate] to get results within a few tens of milliseconds.
TOP
SELECT TOP(10) [DataDate] FROM YourTable ORDER BY [DataDate] DESC
TOP (Transact-SQL) specifies that only the first set of rows will be returned from the query result. The set of rows can be either a number or a percent of the rows. The TOP expression can be used in SELECT, INSERT, UPDATE, MERGE, and DELETE statements.
SELECT TOP(10) *
FROM MyTable
ORDER BY DataDate DESC
Do a reverse sort using ORDER BY and use TOP.