table index for DISTINCT values


In my stored procedure, I need the unique values of one of the columns. I am not sure whether I should index the table and, if so, what type of index would give better performance. More generally, the same question applies when I retrieve distinct values of multiple columns.
The column is of string (NVARCHAR) type.
e.g.
select DISTINCT Column1 FROM Table1;
OR
select DISTINCT Column1, Column2, Column3 FROM Table1;

An index on these specific columns could improve performance a bit, but only because SQL Server would have to scan less data (just these columns, nothing else). Other than that, a scan will always be done. Another option is to create an indexed view if you need distinct values from that table.
CREATE VIEW Test
WITH SCHEMABINDING
AS
SELECT Column1, COUNT_BIG(*) AS UselessColumn
FROM Table1
GROUP BY Column1;
GO
CREATE UNIQUE CLUSTERED INDEX PK_Test ON Test (Column1);
GO
And then you can query it like this:
SELECT *
FROM Test WITH (NOEXPAND);
NOEXPAND is a hint that tells SQL Server not to expand the view definition in the query and to treat the indexed view like a table. Note: this is needed on non-Enterprise editions of SQL Server only.
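The point that an index on just the DISTINCT column lets the engine read less data can be tried out in a small sandbox. The sketch below is an assumption-heavy stand-in: it uses SQLite through Python's sqlite3 module rather than SQL Server, and the Payload column and the index name ix_table1_column1 are made up for illustration. It only shows the plan changing from a table scan to an index scan; it says nothing about timings on a real SQL Server table.

```python
import sqlite3


def plan(conn, sql):
    """Return the detail strings of EXPLAIN QUERY PLAN for the given query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
# Payload is a hypothetical extra column standing in for "everything else".
conn.execute("CREATE TABLE Table1 (Column1 TEXT, Column2 TEXT, Payload TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [("a", "x", "p1"), ("a", "y", "p2"), ("b", "x", "p3"), ("a", "x", "p4")],
)

query = "SELECT DISTINCT Column1 FROM Table1"

# Without an index the whole table has to be read.
plan_without_index = plan(conn, query)

# With an index on Column1 the engine can read the narrower index instead.
conn.execute("CREATE INDEX ix_table1_column1 ON Table1 (Column1)")
plan_with_index = plan(conn, query)

distinct_values = sorted(row[0] for row in conn.execute(query))

print(plan_without_index)
print(plan_with_index)
print(distinct_values)
```

Either way every distinct value still has to be visited once, which matches the remark that a scan is always done; the index only makes what is scanned smaller.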

I recently had the same issue and found it could be overcome using a Columnstore index:
CREATE NONCLUSTERED COLUMNSTORE INDEX [CI_TABLE1_Column1] ON [TABLE1]
([Column1])
WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0)

Related

UNION ALL from each row of a results set

I have a view that runs quickly if I feed it a single parameter, i.e.:
SELECT * FROM v_myView WHERE myVal = 'thisValue';
If I ask it for multiple values IN, it evaluates the entire view's data, and picks the results from the data, in an operation taking around 20 seconds. So this is slow:
SELECT * FROM v_myView WHERE myVal IN (SELECT theseValues FROM myTable);
I have it in mind that, for a dataset I know is small, it would be quicker to take all the results from SELECT theseValues FROM myTable, query their matches individually from v_myView, and UNION ALL the results, so that I'm effectively generating the query:
SELECT * FROM v_myView WHERE myVal = 'thisValue1'
UNION ALL
SELECT * FROM v_myView WHERE myVal = 'thisValue2'
UNION ALL
etc...;
Is there any way to force this to happen in a "simple" query without using a stored procedure or dynamic sql, or am I just going to have to do this long-hand?

Seeing the execution plan and the structure of the underlying table and the view would have made it possible to answer the question better.
Of the 3 commonly used ways to achieve this, EXISTS usually gives the best performance:
EXISTS
INNER JOIN
IN
However the comparison depends on the structure and indexing of the underlying tables.
EXISTS in a way short-circuits the search as soon as it finds a match, which can make it better than the other 2 approaches.
Query:
SELECT *
FROM v_myView V
WHERE EXISTS (
SELECT 1
FROM myTable T
WHERE T.theseValues = V.myVal
);
But the query also needs supporting indexes to perform well:
The myVal column of the table underlying the v_myView view should have a nonclustered rowstore index (unless myVal is already the clustered index key of that table).
The theseValues column of myTable should have a nonclustered rowstore index (unless theseValues is already the clustered index key of that table).
Do you need to fetch all the columns of the view v_myView in your final result? I would suggest fetching only the required columns. The selected columns should be covered either by a nonclustered rowstore index with an INCLUDE clause or by a single nonclustered columnstore index per underlying table that covers those columns.
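To compare the three forms side by side, here is a minimal sketch. It assumes SQLite via Python's sqlite3 as a stand-in for SQL Server; the base table, its payload column, and the index names are hypothetical. It checks results only, not speed: EXISTS and IN return each matching view row once, while a plain INNER JOIN repeats rows when theseValues contains duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'base' and 'payload' are hypothetical; the real view is unknown.
    CREATE TABLE base (myVal TEXT, payload TEXT);
    CREATE INDEX ix_base_myVal ON base (myVal);
    CREATE VIEW v_myView AS SELECT myVal, payload FROM base;

    CREATE TABLE myTable (theseValues TEXT);
    CREATE INDEX ix_myTable_theseValues ON myTable (theseValues);

    INSERT INTO base VALUES ('a', 'p1'), ('b', 'p2'), ('c', 'p3'), ('a', 'p4');
    -- 'a' appears twice on purpose to show the difference with a join.
    INSERT INTO myTable VALUES ('a'), ('c'), ('a');
""")

exists_sql = """
    SELECT * FROM v_myView V
    WHERE EXISTS (SELECT 1 FROM myTable T WHERE T.theseValues = V.myVal)
"""
in_sql = "SELECT * FROM v_myView WHERE myVal IN (SELECT theseValues FROM myTable)"
join_sql = """
    SELECT V.* FROM v_myView V
    INNER JOIN myTable T ON T.theseValues = V.myVal
"""

rows_exists = sorted(conn.execute(exists_sql))
rows_in = sorted(conn.execute(in_sql))
rows_join = sorted(conn.execute(join_sql))

print(rows_exists)  # each matching view row once
print(rows_in)      # same rows as EXISTS
print(rows_join)    # rows for 'a' repeated, once per duplicate in myTable
```

Which form is fastest on the real view still depends on the execution plan, so it is worth checking the plan for each variant against the actual tables.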

Is it helpful to create a multi-column index if the columns are already indexed on their own?

Let's say your table has three columns:
time (integer)
name (varchar)
other_column (varchar)
and you have two indexes:
CREATE INDEX index_time ON my_table (time);
CREATE INDEX index_name ON my_table (name);
In this case, does it make any difference if I create a new index based on both time and name? i.e.:
CREATE INDEX index_name_and_time ON my_table (name,time);

With regard to overall performance, the three indexes may be overkill and have a detrimental effect when inserting, as there are then three indexes to maintain, plus the extra memory/space utilisation.
However, the first step would be to ascertain whether the indexes would actually be used, which depends upon what queries are to be run.
Here is some code from a brief experiment, which you could use as the basis to explore more fully (EXPLAIN QUERY PLAN your_query being the tool to use):
DROP TABLE IF EXISTS my_table;
DROP INDEX IF EXISTS index_time;
DROP INDEX IF EXISTS index_name;
DROP INDEX IF EXISTS index_name_and_time;
CREATE TABLE IF NOT EXISTS my_table (time INTEGER, name TEXT, other TEXT);
CREATE INDEX IF NOT EXISTS index_time ON my_table (time); -- INDEX 1
-- CREATE INDEX IF NOT EXISTS index_name ON my_table (name); -- INDEX 2
-- CREATE INDEX index_name_and_time ON my_table (name,time); -- INDEX 3
EXPLAIN QUERY PLAN
SELECT * FROM my_table; -- QUERY 1
-- EXPLAIN QUERY PLAN
-- SELECT time, name, other FROM my_table; -- QUERY 2
-- EXPLAIN QUERY PLAN
-- SELECT time, name, other FROM my_table ORDER BY time, name; -- QUERY 3
-- EXPLAIN QUERY PLAN
-- SELECT time, name, other FROM my_table ORDER BY name, time; -- QUERY 4
The following results can be obtained:
First two queries: no advantage, just disadvantage.
Having anywhere from no indexes to all 3 makes no difference to the first 2 queries (they are basically the same). Neither uses any of the indexes, whether 0, 1, 2 or 3 indexes are available. They use SCAN TABLE my_table.
The 3rd query:
Without any indexes: SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With just the 1st: SCAN TABLE my_table USING INDEX index_time and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With the 1st and 2nd: SCAN TABLE my_table USING INDEX index_time and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With just the 2nd: SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With all 3: SCAN TABLE my_table USING INDEX index_time and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With just the 3rd: SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
The 4th query:
Without any indexes: SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With just the 1st: SCAN TABLE my_table and USE TEMP B-TREE FOR ORDER BY
With the 1st and 2nd: SCAN TABLE my_table USING INDEX index_name and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With just the 2nd: SCAN TABLE my_table USING INDEX index_name and USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
With all 3: SCAN TABLE my_table USING INDEX index_name_and_time
With just the 3rd: SCAN TABLE my_table USING INDEX index_name_and_time
Of course this does not factor in timings, as the tables are empty. The code above could easily be adapted to include data and then have timings applied. Note that you may also want to consider effects other than running queries, such as insertions and deletions, which alter the indexes.
The answer: it depends.
So, at least from an index utilisation point of view, it's quite clear that whether an index is useful or not depends upon the queries used.
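Rather than commenting indexes in and out by hand, the same experiment can be scripted. This is a rough sketch using Python's sqlite3 module; the helper plan_for and the dictionary layout are my own invention, and the exact plan wording (for example SCAN TABLE my_table versus SCAN my_table) varies between SQLite versions.

```python
import itertools
import sqlite3

# The three indexes from the question, keyed by name.
INDEXES = {
    "index_time": "CREATE INDEX index_time ON my_table (time)",
    "index_name": "CREATE INDEX index_name ON my_table (name)",
    "index_name_and_time": "CREATE INDEX index_name_and_time ON my_table (name, time)",
}
QUERY_4 = "SELECT time, name, other FROM my_table ORDER BY name, time"


def plan_for(index_names, query):
    """Build a fresh empty table with the given indexes and return the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (time INTEGER, name TEXT, other TEXT)")
    for index_name in index_names:
        conn.execute(INDEXES[index_name])
    details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
    conn.close()
    return details


# Try every combination of 0, 1, 2 and 3 indexes against query 4.
results = {}
for size in range(len(INDEXES) + 1):
    for combo in itertools.combinations(INDEXES, size):
        results[combo] = plan_for(combo, QUERY_4)
        print(combo or "(no indexes)", "->", results[combo])
```

Swapping QUERY_4 for any of the other queries repeats the comparison for that query. As with the original script, the table is empty, so this shows index selection only, not timings.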

Once the third index, (name, time), exists, the index on (name) alone is redundant.
You should probably drop the (name) index and just include (name, time) and (time) -- if those are the indexes that you think you need.

How expensive is select distinct * query

In SQL Server 2012, I have a table with more than 25 million rows, including duplicates. The table doesn't have a unique index; it only has a non-clustered index. I want to eliminate the duplicates, so I'm thinking of the following:
select distinct * into #temp_table from primary_table
truncate table primary_table
insert into primary_table select * from #temp_table
I would like to know how expensive the select distinct * query is. If the procedure above is very expensive, is there an alternative way?

I don't know how expensive it is, but an alternative is to create another table with a primary key, insert all the data there, and silently reject the duplicates, as described here:
http://web.archive.org/web/20180404165346/http://sqlblog.com:80/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
Basically, it uses IGNORE_DUP_KEY.
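The idea of letting a key silently reject duplicates can be sketched without a SQL Server instance. The example below is not IGNORE_DUP_KEY itself: it swaps in SQLite's INSERT OR IGNORE, run through Python's sqlite3, as a loose analogue. The columns a and b and the deduped table are hypothetical, and it assumes the key columns contain no NULLs (SQLite treats NULLs in such a key as distinct, so those rows would not be collapsed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical two-column stand-in for the 25-million-row table.
    CREATE TABLE primary_table (a INTEGER, b TEXT);
    INSERT INTO primary_table VALUES
        (1, 'x'), (1, 'x'), (2, 'y'), (2, 'y'), (3, 'z');

    -- The key spans every column, so only fully identical rows collide.
    CREATE TABLE deduped (a INTEGER, b TEXT, PRIMARY KEY (a, b));

    -- OR IGNORE skips any row that would violate the key instead of failing.
    INSERT OR IGNORE INTO deduped SELECT a, b FROM primary_table;
""")

deduped_rows = sorted(conn.execute("SELECT a, b FROM deduped"))
distinct_rows = sorted(conn.execute("SELECT DISTINCT a, b FROM primary_table"))

print(deduped_rows)
print(distinct_rows)
```

Both approaches end up with the same set of rows here; which one is cheaper on a large SQL Server table is exactly what would need measuring.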

Less than or greater than operator issue for index in sql server

I found that if I query a table with a less-than or greater-than operator, SQL Server indexes do not work as I expect.
Say I have a simple table (TestTable) with only 3 columns like this:
Column name, column type, primary key, index
iID, int, yes, clustered index
iCount, int, no, nonclustered index
name, nvarchar(255), no, no index
Now, I query the table by this:
SELECT * FROM TestTable WHERE iCount = 10;
Very good, SQL Server will use the nonclustered index on column iCount to retrieve the result.
However, if I query the table by this:
SELECT * FROM TestTable WHERE iCount < 10;
SQL Server will do an index scan over the clustered index on iID to retrieve the result.
I am wondering why SQL Server is not able to use the proper index when I use a less-than or greater-than operator in the query.

If the table has very few rows, it is cheaper for SQL Server to scan the clustered index rather than using the non-clustered index and then doing a lookup for the rest of the columns in the clustered index. If that's the case, change the query to SELECT iCount FROM... and you should see the query plan change to using the index as you are expecting.
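The difference between SELECT * and SELECT iCount comes down to whether the index covers the query. The following sketch illustrates only that part, using SQLite through Python's sqlite3 as a stand-in; the index name ix_TestTable_iCount and the sample data are assumptions. SQLite does not make the same cost-based choice between a clustered scan and index lookups that SQL Server does, so do not read the plans as a prediction of SQL Server's behaviour.

```python
import sqlite3


def plan(conn, sql):
    """Return the detail strings of EXPLAIN QUERY PLAN for the given query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TestTable (iID INTEGER PRIMARY KEY, iCount INTEGER, name TEXT);
    CREATE INDEX ix_TestTable_iCount ON TestTable (iCount);
""")
conn.executemany(
    "INSERT INTO TestTable (iCount, name) VALUES (?, ?)",
    [(i % 20, "row%d" % i) for i in range(100)],
)

# SELECT * needs the name column, which the index does not contain,
# so the index alone can never answer this query.
wide_plan = plan(conn, "SELECT * FROM TestTable WHERE iCount < 10")

# SELECT iCount needs only the indexed column, so the index covers it.
narrow_plan = plan(conn, "SELECT iCount FROM TestTable WHERE iCount < 10")

print(wide_plan)
print(narrow_plan)
```

The extra trip from the index back to the table for the missing columns is the lookup cost that the answer refers to.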

Does SQLite multi column primary key need an additional index?

If I create a table like so:
CREATE TABLE something (column1, column2, PRIMARY KEY (column1, column2));
Neither column1 nor column2 are unique by themselves. However, I will do most of my queries on column1.
Does the multi column primary key create an index for both columns separately? I would think that if you specify a multi column primary key it would index them together, but I really don't know.
Would there be any performance benefit to adding a UNIQUE INDEX on column1?

There will probably not be a performance benefit, because queries against col1=xxx and col2=yyy would use the same index as queries like col1=zzz with no mention of col2. But my experience is only Oracle, SQL Server, Ingres, and MySQL. I don't know for sure.

You certainly don't want to add a unique index on column1, since you just stated:
Neither column1 nor column2 are unique by themselves.
If column1 comes first, it will be first in the multicolumn index in most databases, and thus the index is likely to be used. The second column is the one that might not use the index. I wouldn't add an index on the second column unless you see problems, and again, I would add a plain index, not a unique index, based on the comment you wrote above.
But SQLite must have some way of showing what it is using, like most other databases, right? Set the PK and see whether queries using just column1 are using it.

I stumbled across this question while researching this same question, so figured I'd share my findings. Note that all of the below is tested on SQLite 3.39.4. I make no guarantees about how it will hold up on old/future versions. That said, SQLite is not exactly known for radically changing behavior at random.
To give a concrete answer for SQLite specifically: an index on column1 would provide no benefits, but an index on column2 would.
Let's look at a simple SQL script:
CREATE TABLE tbl (
column1 TEXT NOT NULL,
column2 TEXT NOT NULL,
val INTEGER NOT NULL,
PRIMARY KEY (column1, column2)
);
-- Uncomment to make the final SELECT fast
-- CREATE INDEX column2_ix ON tbl (column2);
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column1 = 'column1' AND column2 = 'column2';
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column1 = 'column1';
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column2 = 'column2';
EXPLAIN QUERY PLAN is SQLite's method of allowing you to inspect what its query planner is actually going to do.
You can execute the script via something like:
$ sqlite3 :memory: < sample.sql
This gives the output
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=? AND column2=?)
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=?)
QUERY PLAN
`--SCAN tbl
So the first two queries, the ones which SELECT on (column1, column2) and (column1), will use the index to perform the search. Which should be nice and fast.
Note that the last query, the SELECT on (column2) has different output, though. It says it's going to SCAN the table -- that is, go through each row one by one. This will be significantly less performant.
What happens if we uncomment the CREATE INDEX in the above script? This will give the output
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=? AND column2=?)
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=?)
QUERY PLAN
`--SEARCH tbl USING INDEX column2_ix (column2=?)
Now the query on column2 will also use an index, and should be just as performant as the others.
