I have a simple table with an index on a DATETIME column.
Can someone explain which of these two queries will use the index?
CREATE TABLE exams
(
name VARCHAR(50),
grade INT,
date DATETIME
);
CREATE INDEX date_idx ON exams(date);
SELECT *
FROM exams
WHERE date = '2018-01-01'; -- doesn't use index?
SELECT *
FROM exams
WHERE MONTH(date) = 1; -- uses index?
The SQL engine can resolve this in different ways. Let's insert some sample data into your table:
DROP TABLE if exists exams;
CREATE TABLE exams(
name varchar(50),
grade INT,
date datetime
);
INSERT exams
SELECT TOP (5000) CONCAT('name', row_number() over(order by t1.number))
,6
,'2019-07-01'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
INSERT exams
SELECT TOP (5) CONCAT('name', row_number() over(order by t1.number))
,6
,'2019-07-02'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CREATE INDEX date_idx ON exams(date);
As you can see, we have inserted:
5 000 rows for date 2019-07-01
5 rows for date 2019-07-02
Now let's execute the following queries:
SELECT * FROM exams WHERE date= '2019-07-01';
SELECT * FROM exams WHERE date= '2019-07-02';
SELECT * FROM exams WHERE MONTH(date)=1;
and check the execution plans:
In the first query, the engine knows (from the statistics) that almost all of the data is going to be read, so it performs a table scan on your heap.
In the second query, the engine sees that only a few records are going to be returned, so there is no need to read all the data: it uses the index and performs an index seek.
In the last case, the index can't be used, because the query is not sargable.
So the engine decides whether to use an index, and whether to perform a seek or a scan. The only things you can do are to make sure your indexes are covering, your statistics are up to date, and your queries are sargable.
Well, I believe the truth is the other way around:
The first query uses the index, while the second does NOT.
Why doesn't the second query use the index? Because the indexed column is wrapped in a function, which prevents SQL Server from using the index.
An index can be thought of as a way of storing records in sorted order. Applying a function to the indexed column may change that order, so the index ordering is no longer valid for the function's result.
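The effect of wrapping an indexed column in a function can be reproduced on a small scale. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the example is runnable anywhere); strftime('%m', date) plays the role of MONTH(date), and the plan wording differs between engines and versions. Note that the range rewrite only covers January of one specific year, whereas MONTH(date) = 1 matches January of every year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exams (name TEXT, grade INTEGER, date TEXT);
    CREATE INDEX date_idx ON exams(date);
""")


def plan(sql):
    """Return the detail text of EXPLAIN QUERY PLAN for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Bare column compared with a constant: the index can be searched.
plan_equality = plan("SELECT * FROM exams WHERE date = '2018-01-01'")

# Column wrapped in a function: the index on the raw column cannot be searched.
plan_function = plan("SELECT * FROM exams WHERE strftime('%m', date) = '01'")

# One specific month rewritten as a range on the bare column: sargable again.
plan_range = plan(
    "SELECT * FROM exams WHERE date >= '2018-01-01' AND date < '2018-02-01'"
)

for label, text in [
    ("equality", plan_equality),
    ("function", plan_function),
    ("range", plan_range),
]:
    print(f"{label:9s} {text}")
```

The general pattern is to keep the indexed column bare on one side of the comparison and move any computation to the constant side.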
I have a table named assets. Here is the ddl:
create table assets (
id bigint primary key,
name varchar(255) not null,
value double precision not null,
business_time timestamp with time zone,
insert_time timestamp with time zone default now() not null
);
create index idx_assets_name on assets (name);
I need to extract the newest (based on insert_time) value for each asset name. This is the query that I initially used:
SELECT DISTINCT
ON (a.name) *
FROM home.assets a
WHERE a.name IN (
'USD_RLS',
'EUR_RLS',
'SEKKEH_RLS',
'NIM_SEKKEH_RLS',
'ROB_SEKKEH_RLS',
'BAHAR_RLS',
'GOLD_18_RLS',
'GOLD_OUNCE_USD',
'SILVER_OUNCE_USD',
'PLATINUM_OUNCE_USD',
'GOLD_MESGHAL_RLS',
'GOLD_24_RLS',
'STOCK_IR',
'AED_RLS',
'GBP_RLS',
'CAD_RLS',
'CHF_RLS',
'TRY_RLS',
'AUD_RLS',
'JPY_RLS',
'CNY_RLS',
'RUB_RLS',
'BTC_USD'
)
ORDER BY a.name,
a.insert_time DESC;
I have around 300,000 rows in the assets table. On my VPS this query takes about 800 ms, which results in a total response time of about 1 second for a specific endpoint. This is a bit slow, and since the assets table is growing fast, this endpoint will be even slower in the near future. I also tried to avoid IN(...) using this query:
SELECT DISTINCT
ON (a.name) *
FROM home.assets a
ORDER BY a.name,
a.insert_time DESC;
But I didn't notice a significant difference. Any idea how I could optimize this query?
You may try adding the following index to your table:
CREATE INDEX idx ON assets (name, insert_time DESC);
If used, Postgres can simply scan this index to find the distinct record having the most recent insert_time for each name.
With more than a few rows per name in the table (which seems to be the case), I expect this query to be substantially faster still:
SELECT a.*
FROM unnest('{USD_RLS, EUR_RLS, SEKKEH_RLS, NIM_SEKKEH_RLS, ROB_SEKKEH_RLS
, BAHAR_RLS, GOLD_18_RLS, GOLD_OUNCE_USD, SILVER_OUNCE_USD
, PLATINUM_OUNCE_USD, GOLD_MESGHAL_RLS, GOLD_24_RLS, STOCK_IR
, AED_RLS, GBP_RLS, CAD_RLS, CHF_RLS
, TRY_RLS, AUD_RLS, JPY_RLS, CNY_RLS
, RUB_RLS, BTC_USD}'::text[]) AS n(name)
CROSS JOIN LATERAL (
SELECT *
FROM home.assets a
WHERE a.name = n.name
ORDER BY a.insert_time DESC
LIMIT 1
) a;
Pass your list as array, unnest, and then get each latest row in a LATERAL subquery. The CROSS JOIN eliminates names that are not found at all. (You might be interested in LEFT JOIN LATERAL ... ON true instead, to keep those in the result.)
You still need the multicolumn index that Tim mentioned.
CREATE INDEX ON assets (name, insert_time DESC);
Default ascending sort order would work, too, in this case. Postgres can scan backwards:
CREATE INDEX ON assets (name, insert_time);
See:
Postgres: getting latest rows for an array of keys
Optimize GROUP BY query to retrieve latest row per user - basically type 2a
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Not the number of rows in the table, but the number of rows per group (per name in your case) decides whether DISTINCT ON is the best choice. See this benchmark comparing relevant query styles:
Select first row in each GROUP BY group?
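The idea behind the LATERAL query, one indexed "newest row" lookup per requested name, can be sketched without Postgres. The example below is only an illustration under stated assumptions: it uses Python's sqlite3 module, which has neither DISTINCT ON nor LATERAL, so the per-name lookup is issued from a Python loop; the table is a trimmed version of assets with text timestamps, and the index name idx_assets_name_time is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assets (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        value REAL NOT NULL,
        insert_time TEXT NOT NULL
    );
    -- Multicolumn index: equality on name, then ordered by insert_time.
    CREATE INDEX idx_assets_name_time ON assets (name, insert_time);
""")
conn.executemany(
    "INSERT INTO assets (name, value, insert_time) VALUES (?, ?, ?)",
    [
        ("USD_RLS", 1.0, "2021-01-01 10:00:00"),
        ("USD_RLS", 2.0, "2021-01-02 10:00:00"),
        ("EUR_RLS", 3.0, "2021-01-01 09:00:00"),
        ("EUR_RLS", 4.0, "2021-01-03 09:00:00"),
        ("BTC_USD", 5.0, "2021-01-01 08:00:00"),
    ],
)

LATEST_SQL = """
    SELECT name, value, insert_time
    FROM assets
    WHERE name = ?
    ORDER BY insert_time DESC
    LIMIT 1
"""


def latest_per_name(names):
    """Return {name: (value, insert_time)} with one top-1 lookup per name."""
    result = {}
    for name in names:
        row = conn.execute(LATEST_SQL, (name,)).fetchone()
        if row is not None:  # names without rows are dropped, like CROSS JOIN LATERAL
            result[row[0]] = (row[1], row[2])
    return result


latest = latest_per_name(["USD_RLS", "EUR_RLS", "NOT_PRESENT"])
print(latest)
```

In Postgres the LATERAL form does the same work in a single statement and a single round trip, which is why it is preferable there; the loop only makes the access pattern visible.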
I want to apply pagination on a table with a huge amount of data. All I want to know is whether there is a better option than using OFFSET in SQL Server.
Here is my simple query:
SELECT *
FROM TableName
ORDER BY Id DESC
OFFSET 30000000 ROWS
FETCH NEXT 20 ROWS ONLY
You can use Keyset Pagination for this. It's far more efficient than using Rowset Pagination (paging by row number).
In Rowset Pagination, all previous rows must be read, before being able to read the next page. Whereas in Keyset Pagination, the server can jump immediately to the correct place in the index, so no extra rows are read that do not need to be.
For this to perform well, you need to have a unique index on that key, which includes any other columns you need to query.
In this type of pagination, you cannot jump to a specific page number. You jump to a specific key and read from there. So you need to save the unique ID of page you are on and skip to the next. Alternatively, you could calculate or estimate a starting point for each page up-front.
One big benefit, apart from the obvious efficiency gain, is avoiding the "missing row" problem when paginating, caused by rows being removed from previously read pages. This does not happen when paginating by key, because the key does not change.
Here is an example:
Let us assume you have a table called TableName with an index on Id, and you want to start at the latest Id value and work backwards.
You begin with:
SELECT TOP (@numRows)
*
FROM TableName
ORDER BY Id DESC;
Note the use of ORDER BY to ensure the order is correct
In some RDBMSs you need LIMIT instead of TOP
The client will hold the last received Id value (the lowest in this case). On the next request, you jump to that key and carry on:
SELECT TOP (@numRows)
*
FROM TableName
WHERE Id < @lastId
ORDER BY Id DESC;
Note the use of < not <=
In case you were wondering, in a typical B+-tree index, the row with the indicated Id is not read; reading starts at the row after it.
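The "missing row" behaviour mentioned earlier can be checked with a small experiment. This is a sketch only, assuming Python's sqlite3 module in place of SQL Server, so LIMIT stands in for TOP; the Payload column and the page size of 3 are invented for the demonstration.

```python
import sqlite3

PAGE_SIZE = 3


def make_table():
    """Create a fresh in-memory table with Ids 1..10."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TableName (Id INTEGER PRIMARY KEY, Payload TEXT)")
    conn.executemany(
        "INSERT INTO TableName (Id, Payload) VALUES (?, ?)",
        [(i, f"row {i}") for i in range(1, 11)],
    )
    return conn


def first_page(conn):
    sql = "SELECT Id FROM TableName ORDER BY Id DESC LIMIT ?"
    return [r[0] for r in conn.execute(sql, (PAGE_SIZE,))]


def keyset_page(conn, last_id):
    # Strictly less than the last Id the client has already seen.
    sql = "SELECT Id FROM TableName WHERE Id < ? ORDER BY Id DESC LIMIT ?"
    return [r[0] for r in conn.execute(sql, (last_id, PAGE_SIZE))]


def offset_page(conn, page_index):
    sql = "SELECT Id FROM TableName ORDER BY Id DESC LIMIT ? OFFSET ?"
    return [r[0] for r in conn.execute(sql, (PAGE_SIZE, page_index * PAGE_SIZE))]


# Keyset pagination: a row from page 1 is deleted before page 2 is requested.
conn = make_table()
page1 = first_page(conn)
conn.execute("DELETE FROM TableName WHERE Id = 9")
keyset_page2 = keyset_page(conn, page1[-1])

# Offset pagination under the same deletion.
conn = make_table()
offset_page1 = offset_page(conn, 0)
conn.execute("DELETE FROM TableName WHERE Id = 9")
offset_page2 = offset_page(conn, 1)

print("page 1:        ", page1)
print("keyset page 2: ", keyset_page2)   # continues right after Id 8
print("offset page 2: ", offset_page2)   # Id 7 is never shown to the client
```

With offset paging the deletion shifts every later row up by one position, so the row that slides into page 1 is skipped; the keyset query continues from the last key the client saw and is not affected.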
The key chosen must be unique, so if you are paging by a non-unique column then you must add a second column to both ORDER BY and WHERE. You would need an index on OtherColumn, Id for example, to support this type of query. Don't forget INCLUDE columns on the index.
SQL Server does not support row/tuple comparators, so you cannot do (OtherColumn, Id) < (@lastOther, @lastId) (this is, however, supported in PostgreSQL, MySQL, MariaDB and SQLite).
Instead you need the following:
SELECT TOP (@numRows)
*
FROM TableName
WHERE (
(OtherColumn = @lastOther AND Id < @lastId)
OR OtherColumn < @lastOther
)
ORDER BY
OtherColumn DESC,
Id DESC;
This is more efficient than it looks, as SQL Server can convert this into a proper < over both values.
The presence of NULLs complicates things further. You may want to query those rows separately.
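To check that the two-column predicate visits every row exactly once, here is a minimal sketch. It assumes Python's sqlite3 module instead of SQL Server (LIMIT instead of TOP), a made-up index name ix_other_id, and sample data in which OtherColumn repeats so that Id is needed as the tie-breaker. It deliberately uses the expanded OR form rather than a row-value comparison, to mirror the SQL Server query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableName (Id INTEGER PRIMARY KEY, OtherColumn INTEGER NOT NULL);
    CREATE INDEX ix_other_id ON TableName (OtherColumn, Id);
""")
# OtherColumn takes only the values 0, 1, 2, so it is not unique.
conn.executemany(
    "INSERT INTO TableName (Id, OtherColumn) VALUES (?, ?)",
    [(i, i % 3) for i in range(1, 11)],
)

FIRST_PAGE_SQL = """
    SELECT OtherColumn, Id
    FROM TableName
    ORDER BY OtherColumn DESC, Id DESC
    LIMIT ?
"""

NEXT_PAGE_SQL = """
    SELECT OtherColumn, Id
    FROM TableName
    WHERE (OtherColumn = ? AND Id < ?)
       OR OtherColumn < ?
    ORDER BY OtherColumn DESC, Id DESC
    LIMIT ?
"""


def all_pages(page_size):
    """Walk the whole table page by page, carrying only the last key pair."""
    pages = []
    page = conn.execute(FIRST_PAGE_SQL, (page_size,)).fetchall()
    while page:
        pages.append(page)
        last_other, last_id = page[-1]
        page = conn.execute(
            NEXT_PAGE_SQL, (last_other, last_id, last_other, page_size)
        ).fetchall()
    return pages


pages = all_pages(4)
flattened = [row for page in pages for row in page]
expected = conn.execute(
    "SELECT OtherColumn, Id FROM TableName ORDER BY OtherColumn DESC, Id DESC"
).fetchall()
print([len(p) for p in pages], flattened == expected)
```

The client only has to remember the (OtherColumn, Id) pair of the last row it received.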
On a very big e-commerce website, we use a technique that combines ids stored in a pseudo-temporary table with a join from that table to the rows of the product table.
Let me illustrate with a concrete example.
We have a table designed this way:
CREATE TABLE S_TEMP.T_PAGINATION_PGN
(PGN_ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
PGN_SESSION_GUID UNIQUEIDENTIFIER NOT NULL,
PGN_SESSION_DATE DATETIME2(0) NOT NULL,
PGN_PRODUCT_ID INT NOT NULL,
PGN_SESSION_ORDER INT NOT NULL);
CREATE INDEX X_PGN_SESSION_GUID_ORDER
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_GUID, PGN_SESSION_ORDER)
INCLUDE (PGN_PRODUCT_ID);
CREATE INDEX X_PGN_SESSION_DATE
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_DATE);
We have a very big product table called T_PRODUIT_PRD, and a customer has filtered it with many predicates. We INSERT the rows from the filtered SELECT into this table this way:
DECLARE @SESSION_ID UNIQUEIDENTIFIER = NEWID();
INSERT INTO S_TEMP.T_PAGINATION_PGN
SELECT @SESSION_ID, SYSUTCDATETIME(), PRD_ID,
ROW_NUMBER() OVER(ORDER BY ...) --> custom order by
FROM dbo.T_PRODUIT_PRD
WHERE ... --> custom filter
Then every time we need a given page of @N products, we add a join to this table:
...
JOIN S_TEMP.T_PAGINATION_PGN
ON PGN_SESSION_GUID = @SESSION_ID
AND 1 + ((PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER --> ROW_NUMBER() starts at 1
AND PGN_PRODUCT_ID = dbo.T_PRODUIT_PRD.PRD_ID
The indexes will do the job!
Of course, we have to purge this table regularly, which is why we have a scheduled job that deletes the rows whose sessions were generated more than 4 hours ago:
DELETE FROM S_TEMP.T_PAGINATION_PGN
WHERE PGN_SESSION_DATE < DATEADD(hour, -4, SYSUTCDATETIME());
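The whole flow (materialize the ordered ids once per session, then fetch pages by joining back) can be sketched as follows. This is an illustration under several assumptions, not the production design described above: it uses Python's sqlite3 module, simplified hypothetical table names (product, pagination), a rank computed in Python instead of ROW_NUMBER(), and a range predicate on the stored rank in place of the division expression so that the lookup stays a plain index range.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (prd_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE pagination (
        session_guid  TEXT    NOT NULL,
        session_order INTEGER NOT NULL,
        product_id    INTEGER NOT NULL,
        PRIMARY KEY (session_guid, session_order)
    );
""")
conn.executemany(
    "INSERT INTO product (prd_id, name, price) VALUES (?, ?, ?)",
    [(i, f"product {i}", 100.0 - i) for i in range(1, 51)],
)


def open_session():
    """Run the filtered, ordered query once and store ids with their rank."""
    session = str(uuid.uuid4())
    ids = conn.execute(
        "SELECT prd_id FROM product WHERE price < 90 ORDER BY price"
    ).fetchall()
    conn.executemany(
        "INSERT INTO pagination (session_guid, session_order, product_id) "
        "VALUES (?, ?, ?)",
        [(session, rank, prd_id) for rank, (prd_id,) in enumerate(ids, start=1)],
    )
    return session


def fetch_page(session, page_number, page_size):
    """Return one page (1-based) by joining the stored ranks to the products."""
    first = (page_number - 1) * page_size + 1
    last = page_number * page_size
    sql = """
        SELECT p.prd_id, p.name, p.price
        FROM pagination AS g
        JOIN product AS p ON p.prd_id = g.product_id
        WHERE g.session_guid = ?
          AND g.session_order BETWEEN ? AND ?
        ORDER BY g.session_order
    """
    return conn.execute(sql, (session, first, last)).fetchall()


session = open_session()
page_1_ids = [row[0] for row in fetch_page(session, 1, 10)]
page_4_ids = [row[0] for row in fetch_page(session, 4, 10)]
page_5_ids = [row[0] for row in fetch_page(session, 5, 10)]
print(page_1_ids, page_4_ids, page_5_ids)
```

The expensive filter and sort run once per session; every later page request is a short range read on (session_guid, session_order) plus primary-key lookups.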
In the same spirit as SQLPro's solution, I propose:
WITH CTE AS
(SELECT 30000000 AS N
UNION ALL SELECT N-1 FROM CTE
WHERE N > 30000000 +1 - 20)
SELECT T.* FROM CTE JOIN TableName T ON CTE.N=T.ID
ORDER BY CTE.N DESC
I tried it with 2 billion rows and it's instant!
It is easy to turn into a stored procedure.
Of course, this is only valid if the ids are consecutive, with no gaps.
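The caveat about consecutive ids is easy to see in a small experiment. The sketch below assumes Python's sqlite3 module rather than SQL Server (hence WITH RECURSIVE and named parameters), and a hypothetical 100-row table; it generates the id range with the same kind of recursive CTE and joins it to the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (Id INTEGER PRIMARY KEY, Payload TEXT)")
conn.executemany(
    "INSERT INTO TableName (Id, Payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 101)],
)

PAGE_SQL = """
    WITH RECURSIVE cte(n) AS (
        SELECT :start
        UNION ALL
        SELECT n - 1 FROM cte WHERE n > :start + 1 - :size
    )
    SELECT t.Id
    FROM cte
    JOIN TableName AS t ON t.Id = cte.n
    ORDER BY cte.n DESC
"""


def page_by_id_range(start, size):
    """Return the Ids found among the `size` consecutive values ending at `start`."""
    rows = conn.execute(PAGE_SQL, {"start": start, "size": size})
    return [r[0] for r in rows]


dense_page = page_by_id_range(50, 5)

# Introduce gaps: the same request now returns a short page.
conn.execute("DELETE FROM TableName WHERE Id IN (48, 47)")
gappy_page = page_by_id_range(50, 5)

print(dense_page)
print(gappy_page)
```

As soon as rows are deleted, or identity values are skipped, pages come back short, which is the limitation stated above.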
Consider the following example:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1(a INTEGER PRIMARY KEY, b) WITHOUT ROWID;
WITH RECURSIVE
cnt(x) AS (VALUES(1000) UNION ALL SELECT x+1 FROM cnt WHERE x<2000)
INSERT INTO t1(a,b) SELECT x, x FROM cnt;
CREATE INDEX t1b ON t1(b);
This script creates a table without a rowid column and inserts the values (x, x) for 1000 <= x <= 2000. To help the query planner, let's run ANALYZE.
ANALYZE;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE b BETWEEN 500 AND 2500;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE b BETWEEN 2900 AND 3000;
The output in both cases is:
0|0|0|SEARCH TABLE t1 USING COVERING INDEX t1b (b>? AND b<?)
However, it makes no sense to use the index for the first query, because we have to iterate through the whole table anyway, so an ordinary SCAN TABLE seems more efficient. This is exactly how tables with a rowid behave:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1(a, b);
WITH RECURSIVE
cnt(x) AS (VALUES(1000) UNION ALL SELECT x+1 FROM cnt WHERE x<2000)
INSERT INTO t1(a,b) SELECT x, x FROM cnt;
CREATE INDEX t1a ON t1(a);
ANALYZE;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE a BETWEEN 500 AND 2500;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE a BETWEEN 2900 AND 3000;
In this case the output will be:
0|0|0|SCAN TABLE t1
and
0|0|0|SEARCH TABLE t1 USING INDEX t1a (a>? AND a<?)
So, could anybody explain how the query planner optimizes queries for WITHOUT ROWID tables?
The output in both cases is:
0|0|0|SEARCH TABLE t1 USING COVERING INDEX t1b (b>? AND b<?)
However, there is no sense to use index (for
the first query) for the reason that anyway we have to iterate through
whole table, so ordinary SCAN TABLE seems to be more efficient.
You missed the COVERING INDEX part: it means the query is answered from the index only, without accessing the table at all.
You are right that a regular index access (without "COVERING") might be slower than a full table scan if all rows are needed, but this is not the case for an index-only scan.
Read more about index-only scans here: http://use-the-index-luke.com/sql/clustering/index-only-scan-covering-index
EDIT
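WITHOUT ROWID tables are SQLite's equivalent of what other databases call clustered indexes: they contain all table columns. A secondary index on such a table also stores the primary key, so here the index t1b holds both a and b. Therefore there is no need to visit the table, even if you select all columns (as in select *).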
Read more about clustered indexes here: http://use-the-index-luke.com/sql/clustering/index-organized-clustered-index
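The plans from the question can be inspected directly from Python. This is a small sketch using the standard sqlite3 module and the question's own schema; the exact plan wording (for example "SEARCH TABLE t1" versus "SEARCH t1") and even the planner's choice for the wide range depend on the SQLite version and build options, so the plans are printed rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(a INTEGER PRIMARY KEY, b) WITHOUT ROWID;
    WITH RECURSIVE
      cnt(x) AS (VALUES(1000) UNION ALL SELECT x+1 FROM cnt WHERE x<2000)
    INSERT INTO t1(a,b) SELECT x, x FROM cnt;
    CREATE INDEX t1b ON t1(b);
    ANALYZE;
""")


def plan(sql):
    """Return the detail text of EXPLAIN QUERY PLAN for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


WIDE = "SELECT * FROM t1 WHERE b BETWEEN 500 AND 2500"      # matches every row
NARROW = "SELECT * FROM t1 WHERE b BETWEEN 2900 AND 3000"   # matches no row

wide_plan = plan(WIDE)
narrow_plan = plan(NARROW)
wide_count = len(conn.execute(WIDE).fetchall())
narrow_count = len(conn.execute(NARROW).fetchall())

print("wide:  ", wide_count, "rows;", wide_plan)
print("narrow:", narrow_count, "rows;", narrow_plan)
```

If the plan mentions COVERING INDEX, every selected column is read from the index entries themselves, which is why using the index is not a penalty even when all rows match.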
I am using Oracle (Enterprise Edition 10g) and I have a query like this:
SELECT * FROM (
SELECT * FROM MyTable
ORDER BY MyColumn
) WHERE rownum <= 10;
MyColumn is indexed, however, Oracle is for some reason doing a full table scan before it cuts the first 10 rows. So for a table with 4 million records the above takes around 15 seconds.
Now consider this equivalent query:
SELECT MyTable.*
FROM
(SELECT rid
FROM
(SELECT rowid as rid
FROM MyTable
ORDER BY MyColumn
)
WHERE rownum <= 10
)
INNER JOIN MyTable
ON MyTable.rowid = rid
ORDER BY MyColumn;
Here Oracle scans the index and finds the top 10 rowids, and then uses nested loops to find the 10 records by rowid. This takes less than a second for a 4 million table.
My first question is: why does the optimizer make such an apparently bad decision for the first query above?
And my second and most important question is: is it possible to make the first query perform better? I have a specific need to use the first query as unmodified as possible. I am looking for something simpler than my second query above. Thank you!
Please note that for particular reasons I am unable to use the /*+ FIRST_ROWS(n) */ hint, or the ROW_NUMBER() OVER (ORDER BY column) construct.
If this is acceptable in your case, adding a WHERE ... IS NOT NULL clause will help the optimizer to use the index instead of doing a full table scan when using an ORDER BY clause:
SELECT * FROM (
SELECT * FROM MyTable
WHERE MyColumn IS NOT NULL
-- ^^^^^^^^^^^^^^^^^^^^
ORDER BY MyColumn
) WHERE rownum <= 10;
The rationale is that Oracle does not store NULL values in the index. As your query was originally written, the optimizer chose a full table scan because, if there were fewer than 10 non-NULL values, it would have to retrieve some "NULL rows" to fill in the remaining rows. Apparently it is not smart enough to check first whether the index contains enough rows...
With the added WHERE MyColumn IS NOT NULL, you tell the optimizer that you never want any row having NULL in MyColumn. So it can blindly use the index without worrying about hypothetical rows having NULL in MyColumn.
For the same reason, declaring the ORDER BY column as NOT NULL should keep the optimizer from doing a full table scan. So, if you can change the schema, a cleaner option would be:
ALTER TABLE MyTable MODIFY (MyColumn NOT NULL);
See http://sqlfiddle.com/#!4/e3616/1 for various comparisons (click on view execution plan)
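The reasoning about NULLs can be illustrated with a toy model. This is plain Python, not Oracle, and it is only a sketch of the argument: the "index" is a sorted list that leaves out rows whose key is NULL, and the sample rows are invented. It shows why an index-only walk cannot answer the original query in general, but can once NULLs are excluded.

```python
# Toy table: two rows have a NULL (None) in MyColumn.
rows = [
    {"id": 1, "MyColumn": 5},
    {"id": 2, "MyColumn": None},
    {"id": 3, "MyColumn": 2},
    {"id": 4, "MyColumn": None},
]

# Toy single-column index: sorted (key, row id) pairs, NULL keys not stored.
index = sorted(
    (r["MyColumn"], r["id"]) for r in rows if r["MyColumn"] is not None
)


def top_n_full_scan(n, exclude_nulls=False):
    """ORDER BY MyColumn ascending with NULLs last, then keep the first n ids."""
    candidates = [
        r for r in rows if not (exclude_nulls and r["MyColumn"] is None)
    ]
    ordered = sorted(
        candidates,
        key=lambda r: (
            r["MyColumn"] is None,                        # NULLs sort last
            r["MyColumn"] if r["MyColumn"] is not None else 0,
            r["id"],
        ),
    )
    return [r["id"] for r in ordered[:n]]


def top_n_index_only(n):
    """Walk the index in key order; rows with NULL keys are invisible here."""
    return [row_id for _, row_id in index[:n]]


correct_top3 = top_n_full_scan(3)                      # includes one NULL row
index_top3 = top_n_index_only(3)                       # only two rows available
not_null_top3 = top_n_full_scan(3, exclude_nulls=True)

print(correct_top3, index_top3, not_null_top3)
```

Without the IS NOT NULL predicate the index walk comes up short, so it is not a safe substitute for the scan; with the predicate (or a NOT NULL constraint) the two results coincide.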
I want to be able to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform that query into a query that counts how many rows would have been returned without the TOP(X) (i.e. SELECT COUNT(*)). Remember I am asking about an arbitrary query with any number of joins, where clauses, group by's etc.
Is there a way to do this?
edited to show syntax with Shannon's solution:
i.e.
`SELECT TOP(X) [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols] ORDER BY [cols]`
becomes
`SELECT COUNT(*) FROM
(SELECT [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols]) t`
Inline view:
select count(*)
from (...slightly transformed query...) t
... slightly transformed query ... is:
If the select clause contains any columns without names, such as select ... avg(x) ... then do one of 1) Alias the column, such as avg(x) as AvgX, 2) Remove the column, but make sure at least one column is left, or my favorite 3) Just make the select clause select 1 as C
Remove TOP from select clause.
Remove order by clause.
EDIT 1 Fixed by adding aliases for the inline view and dealing with unnamed columns in select clause.
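The transformation steps can be tried on a small example. The sketch below assumes Python's sqlite3 module instead of SQL Server, so LIMIT plays the role of TOP(X); the orders table and its data are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 10), ('alice', 20), ('bob', 5),
        ('carol', 7), ('carol', 8), ('dave', 1);
""")

# Original query: a row limit, an ORDER BY, and an unnamed aggregate column.
limited_sql = """
    SELECT customer, AVG(amount)
    FROM orders
    WHERE amount > 2
    GROUP BY customer
    ORDER BY customer
    LIMIT 2
"""

# Transformed query: limit removed, ORDER BY removed, select list replaced by
# a named constant, and the whole thing wrapped in an aliased inline view.
count_sql = """
    SELECT COUNT(*)
    FROM (
        SELECT 1 AS C
        FROM orders
        WHERE amount > 2
        GROUP BY customer
    ) t
"""

limited_rows = conn.execute(limited_sql).fetchall()
total = conn.execute(count_sql).fetchone()[0]

print(len(limited_rows), "rows returned by the limited query")
print(total, "rows would be returned without the limit")
```

The WHERE and GROUP BY clauses are kept unchanged, because they determine how many rows the query produces; only the parts that do not affect the row count are stripped.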
EDIT 2 But what about the performance? Doesn't this require the DB to run the big query that I wanted to avoid in the first place with TOP(X)?
Not necessarily. It may be the case for some queries that this count will do more work than the TOP(x) would. And it may be the case that, for a particular query, you could make the equivalent count faster by making additional changes to remove work that is not needed for the final count. But those simplifications cannot be included in a general method to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform that query into a query that counts how many rows would have been returned without the TOP(X).
And in some cases, the query optimizer may optimize things away so that the DB does not have to run the big query.
For example, here are a test table and data, using SQL Server 2005:
create table t (PK int identity(1, 1) primary key,
u int not null unique,
string VARCHAR(2000))
insert into t (u, string)
select top 100000 row_number() over (order by s1.id) , replace(space(2000), ' ', 'x')
from sysobjects s1,
sysobjects s2,
sysobjects s3,
sysobjects s4,
sysobjects s5,
sysobjects s6,
sysobjects s7
The non-clustered index on column u will be much smaller than the clustered index on column PK.
Then set up SSMS to show the actual execution plan for:
select PK, U, String from t
select count(*) from t
The first select does a clustered index scan, because it needs to return data from the leaves. The second query does an index scan on the smaller non-clustered index created for the unique constraint on U.
Applying the transform of the first query we get:
select count(*)
from (select PK, U, String from t) t
Running that and looking at the plan, the index on U is used again, exact same plan as select count(*) from t. The leaves are not visited to find the values for String on every row.