Order of columns in primary key in CrateDB

Does the order of columns in the primary key impact the performance of related queries depending on the order of columns given in the select statement?
Example:
primary key (col1, col2, col3);
select col2, col3 from table;
-> would this select use the pk index?
select col3,col1,col2 from table;
-> would this select use the pk index?

No, the order is not relevant.
However, the primary key index is only used if all primary key columns appear in the WHERE clause (like all indices).
select ... from table where col1 = ... and col2 = ... and col3 = ...;
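As a minimal sketch of the point above (the table name t, the column types, and the payload column are assumptions made for illustration, not taken from the question), the first query constrains all three primary key columns and the second does not:

```sql
-- Hypothetical table; names and types are illustrative only.
CREATE TABLE t (
    col1 INTEGER,
    col2 INTEGER,
    col3 INTEGER,
    payload TEXT,
    PRIMARY KEY (col1, col2, col3)
);

-- All primary key columns are constrained by equality, so per the answer
-- above this lookup can use the primary key. The order of the columns in
-- the select list does not matter.
SELECT col3, col1, col2, payload
FROM t
WHERE col1 = 1 AND col2 = 2 AND col3 = 3;

-- Only part of the primary key is constrained, so per the answer above
-- the primary key index is not used for this query.
SELECT col2, col3
FROM t
WHERE col1 = 1;
```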

Related

Why does the optimizer choose a keylookup instead of 2 separate queries?

I have a table that has a primary key/clustered index on an ID column and a nonclustered index on a system date column. If I query all the columns from the table using the system date column (a covering index wouldn't make sense here), the execution plan shows a key lookup, because for each record it finds it has to go to the ID to get all of the column data.
The weird thing is, if I write 2 queries with a temp table it performs much faster. I can query the system date to get a table of IDs and then use that table to search the ID column. This makes sense because you're no longer doing the slow key lookup for each record.
Why doesn't the optimizer do this for us?
--slow version with key lookup
--id primary key/clustered index
--systemdate nonclustered index
select ID, col1, col2, col3, col4, col5, SystemDate
from MyTable
where SystemDate > '2019-01-01'
--faster version
--id primary key/clustered index
--systemdate nonclustered index
select ID, SystemDate
into #myTempTable
from MyTable
where SystemDate > '2019-01-01'
select t1.ID, t1.col1, t1.col2, t1.col3, t1.col4, t1.col5, t1.SystemDate
from MyTable t1
inner join #myTempTable t2
on t1.ID = t2.ID
Well, in second case you're actually doing a key lookup yourself, aren't you? ; )
The plan chosen by the optimizer could perform slower due to outdated (or missing) statistics or a fragmented index.
To tell you why it's actually slower, it would be best if you pasted your execution plans here. That would make it much easier to explain what happens.
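A minimal sketch of how the two suspects named above could be checked before comparing plans. It assumes the table lives in the dbo schema, which the question does not state:

```sql
-- Refresh statistics on the table from the question (dbo schema is an assumption).
UPDATE STATISTICS dbo.MyTable;

-- Check fragmentation of every index on the table.
SELECT i.name, ps.avg_fragmentation_in_percent, ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;

-- Report logical reads and elapsed time for the statements that follow,
-- so both versions can be compared on more than wall-clock time.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

select ID, col1, col2, col3, col4, col5, SystemDate
from dbo.MyTable
where SystemDate > '2019-01-01';
```

Running the temp table version in the same session gives the matching numbers for the faster variant.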
The query optimizer chooses a key lookup because the query is not supported by a covering index. It has to fetch the missing columns from the table itself:
/*
--slow version with key lookup
--id primary key/clustered index
--systemdate nonclustered index
*/
select ID, col1, col2, col3, col4, col5, SystemDate
from MyTable
where SystemDate > '2019-01-01';
Adding a covering index should boost the performance:
CREATE INDEX my_idx ON MyTable(SystemDate) INCLUDE(col1, col2, col3, col4, col5);
For a query without a JOIN:
select ID, col1, col2, col3, col4, col5, SystemDate
from MyTable -- single table
where SystemDate > '2019-01-01';
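the execution plan still contains a join, because the key lookup is carried out as a nested loops join between the nonclustered index and the clustered index.
After introducing the covering index, there is no need for the additional key lookup.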

How to sort a table by columns without query

I want to end up with a sorted table, i.e. when I run select * from NewTable I should get the rows back sorted.
I've tried the following, but the table is not sorted the way I specify:
select column1,column2,column3,column4
into NewTable
from Table1,Table2
order by column1,column2
You only get result sets in a particular order when you use order by. Tables represent unordered sets, so they have no order except when being output as result sets.
However, you can use a trick in SQL Server to make that order by fast. The trick is to use the order by in the insert and have an identity primary key. Then ordering by the primary key should be very efficient. You could do this as:
create table NewTable (
NewTableId int identity(1, 1) not null primary key,
column1 . . .
. . .
);
insert into NewTable(column1, column2, column3, column4)
select column1, column2, column3, column4
from Table1 cross join Table2
order by column1, column2;
Now when you select from the table doing:
select column1, column2, column3, column4
from NewTable
order by NewTableId;
You are ordering by the primary key and no real sort is being done.
The clustered index of a table decides how the data is ordered; this example demonstrates it:
CREATE TABLE test (id int, value varchar)
INSERT INTO test VALUES(1, 'z')
INSERT INTO test VALUES(2, 'y')
INSERT INTO test VALUES(3, 'x')
SELECT * FROM test
CREATE CLUSTERED INDEX IX_test ON test (value ASC)
SELECT * FROM test
This is the result:
id value
----------- -----
1 z
2 y
3 x
id value
----------- -----
3 x
2 y
1 z
After creating the index, the result is reversed, since the index is sorting the value-column ascending.
However please note, as others have mentioned, that the only 100% guaranteed way to get a correctly ordered result is to use an ORDER BY clause.
The answer "The clustered index of a table decides how the data is ordered..." is incorrect.
Without an ORDER BY, the result set is returned in no guaranteed order.
The order depends on numerous factors, one of the more obvious being the index that is used.
Here's a simple example to show that the previous statement is wrong:
create table #t (id int identity (1,1) primary key clustered, col1 int)
INSERT INTO #t (col1)
values
(5),
(4),
(3),
(2),
(1),
(0)
SELECT col1 FROM #t
CREATE INDEX IX_t
ON #t (col1);
SELECT col1 FROM #t
Even though the clustered index is present, once a covering index with a different sort order exists, the data is more likely to be returned in the order of the index being used than in the order of the clustered index.
But if some pages are already in memory and others need to be loaded from disk, the result set might look different again.
To summarize: without ORDER BY, the sort order cannot be guaranteed.
The order of your result set is dictated by the execution plan. If the query uses an index and there is no ORDER BY, the order will be a result of that index. There is no guarantee on the order unless you issue an ORDER BY.

SQL: Deleting duplicate records in SQL Server

I have a SQL Server database that I pre-loaded with a ton of rows of data.
Unfortunately, there is no primary key on the table, and there is now duplicate information in it. I'm not concerned about there not being a primary key, but I am concerned about there being duplicates in the database...
Any thoughts? (Forgive me for being a SQL Server newb)
Well, this is one reason why you should have a primary key on the table. What version of SQL Server? For SQL Server 2005 and above:
;WITH r AS
(
SELECT col1, col2, col3, -- whatever columns make a "unique" row
rn = ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1)
FROM dbo.SomeTable
)
DELETE r WHERE rn > 1;
Then, so you don't have to do this again tomorrow, and the next day, and the day after that, declare a primary key on the table.
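A minimal sketch of what declaring that key could look like once the duplicates are gone. The names SomeTableId, PK_SomeTable, and UQ_SomeTable_col1_col2_col3 are hypothetical, and which option fits depends on the real table:

```sql
-- Option 1: add a surrogate identity column as the primary key.
-- This gives every row a unique identifier, but on its own it does not
-- stop duplicate (col1, col2, col3) combinations from being inserted.
ALTER TABLE dbo.SomeTable
    ADD SomeTableId int IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_SomeTable PRIMARY KEY CLUSTERED;

-- Option 2: enforce uniqueness on the columns that define a "unique" row,
-- so the duplicates cannot come back. This statement fails if duplicates
-- still exist, so run it after the DELETE above.
ALTER TABLE dbo.SomeTable
    ADD CONSTRAINT UQ_SomeTable_col1_col2_col3 UNIQUE (col1, col2, col3);
```

The two options can be combined: the surrogate key identifies rows, and the unique constraint protects the business columns.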
Let's say your table is unique by COL1 and COL2.
Here is a way to do it:
SELECT *
FROM (SELECT COL1, COL2, ROW_NUMBER() OVER (PARTITION BY COL1, COL2 ORDER BY COL1, COL2 ASC) AS ROWID
FROM TABLE_NAME )T
WHERE T.ROWID > 1
The ROWID > 1 will enable you to select only the duplicated rows.
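Before deleting anything, it can help to see how many copies each duplicated key has. A minimal sketch, reusing the placeholder names TABLE_NAME, COL1, and COL2 from the query above (the alias copies is an assumption):

```sql
-- One row per duplicated (COL1, COL2) combination, with the number of copies.
SELECT COL1, COL2, COUNT(*) AS copies
FROM TABLE_NAME
GROUP BY COL1, COL2
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```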

SQL Server indexes - ascending or descending, what difference does it make?

When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending. I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way? What difference does it make which order I choose?
This primarily matters when used with composite indexes:
CREATE INDEX ix_index ON mytable (col1, col2 DESC);
can be used for either:
SELECT *
FROM mytable
ORDER BY
col1, col2 DESC
or:
SELECT *
FROM mytable
ORDER BY
col1 DESC, col2
, but not for:
SELECT *
FROM mytable
ORDER BY
col1, col2
An index on a single column can be efficiently used for sorting in both ways.
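To cover the ordering that ix_index cannot serve, a second index with both key columns in the same direction would be needed. A minimal sketch, where the index name ix_index_asc is hypothetical; whether the optimizer actually picks the index for a SELECT * still depends on its cost estimates:

```sql
-- Hypothetical second index with both key columns ascending.
CREATE INDEX ix_index_asc ON mytable (col1, col2);

-- Can be read forward for this ordering ...
SELECT *
FROM mytable
ORDER BY
col1, col2;

-- ... and backward for the fully reversed one.
SELECT *
FROM mytable
ORDER BY
col1 DESC, col2 DESC;
```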
See the article in my blog for details:
Descending indexes
Update:
In fact, this can matter even for a single column index, though it's not so obvious.
Imagine an index on a column of a clustered table:
CREATE TABLE mytable (
pk INT NOT NULL PRIMARY KEY,
col1 INT NOT NULL
)
CREATE INDEX ix_mytable_col1 ON mytable (col1)
The index on col1 keeps ordered values of col1 along with the references to rows.
Since the table is clustered, the references to rows are actually the values of the pk. They are also ordered within each value of col1.
This means that the leaves of the index are actually ordered on (col1, pk), and this query:
SELECT col1, pk
FROM mytable
ORDER BY
col1, pk
needs no sorting.
If we create the index as follows:
CREATE INDEX ix_mytable_col1_desc ON mytable (col1 DESC)
, then the values of col1 will be sorted descending, but the values of pk within each value of col1 will be sorted ascending.
This means that the following query:
SELECT col1, pk
FROM mytable
ORDER BY
col1, pk DESC
can be served by ix_mytable_col1_desc but not by ix_mytable_col1.
In other words, the columns that constitute a CLUSTERED INDEX on any table are always the trailing columns of any other index on that table.
For a true single column index it makes little difference from the Query Optimiser's point of view.
For the table definition
CREATE TABLE T1( [ID] [int] IDENTITY NOT NULL,
[Filler] [char](8000) NULL,
PRIMARY KEY CLUSTERED ([ID] ASC))
The Query
SELECT TOP 10 *
FROM T1
ORDER BY ID DESC
Uses an ordered scan with scan direction BACKWARD as can be seen in the Execution Plan. There is a slight difference however in that currently only FORWARD scans can be parallelised.
However, it can make a big difference in terms of logical fragmentation. If the index is created with keys descending but new rows are appended with ascending key values, then you can end up with every page out of logical order. This can severely impact the size of the IO reads when the table is scanned while it is not in cache.
See the fragmentation results
name  page_count  avg_fragmentation_in_percent  fragment_count  avg_fragment_size_in_pages
----  ----------  ----------------------------  --------------  --------------------------
T1    1000        0.4                           5               200
T2    1000        99.9                          1000            1
for the script below
/*Uses T1 definition from above*/
SET NOCOUNT ON;
CREATE TABLE T2( [ID] [int] IDENTITY NOT NULL,
[Filler] [char](8000) NULL,
PRIMARY KEY CLUSTERED ([ID] DESC))
BEGIN TRAN
GO
INSERT INTO T1 DEFAULT VALUES
GO 1000
INSERT INTO T2 DEFAULT VALUES
GO 1000
COMMIT
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('T1'), 1, NULL, 'DETAILED')
WHERE index_level = 0
UNION ALL
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('T2'), 1, NULL, 'DETAILED')
WHERE index_level = 0
It's possible to use the spatial results tab to verify the supposition that this is because the later pages have ascending key values in both cases.
SELECT page_id,
[ID],
geometry::Point(page_id, [ID], 0).STBuffer(4)
FROM T1
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
UNION ALL
SELECT page_id,
[ID],
geometry::Point(page_id, [ID], 0).STBuffer(4)
FROM T2
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
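If the fragmentation reported for T2 above is a problem, rebuilding the index is one way to put the pages back into logical order. A minimal sketch, not part of the original answer; new ascending inserts would fragment the descending index again over time:

```sql
-- Rebuild every index on T2 (here only the clustered primary key).
ALTER INDEX ALL ON T2 REBUILD;

-- Re-run the fragmentation check for T2 to compare with the earlier figures.
SELECT object_name(object_id) AS name,
       page_count,
       avg_fragmentation_in_percent,
       fragment_count,
       avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('T2'), 1, NULL, 'DETAILED')
WHERE index_level = 0;
```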
The sort order matters when you want to retrieve lots of sorted data, not individual records.
Note that (as you are suggesting with your question) the sort order is typically far less significant than what columns you are indexing (the system can read the index in reverse if the order is opposite what it wants). I rarely give index sort order any thought, whereas I agonize over the columns covered by the index.
@Quassnoi provides a great example of when it does matter.

Add a sequential number on create / insert - Teradata

In Oracle we would use rownum in the select as we created this table. Now in Teradata, I can't seem to get it to work. There isn't a single column that I can sort on that has unique values (there is a lot of duplication) unless I use 3 columns together.
The old way would be something like,
create table temp1 as
select
rownum as insert_num,
col1,
col2,
col3
from tables a join b on a.id=b.id
;
This is how you can do it:
create table temp1 as
(
select
sum(1) over( rows unbounded preceding ) insert_num
,col1
,col2
,col3
from a join b on a.id=b.id
) with data ;
Teradata has had a concept of identity columns on its tables since around V2R6.x. These columns differ from Oracle's sequence concept in that the number assigned is not guaranteed to be sequential. The identity column in Teradata is simply used to guarantee row uniqueness.
Example:
CREATE MULTISET TABLE MyTable
(
ColA INTEGER GENERATED BY DEFAULT AS IDENTITY
(START WITH 1
INCREMENT BY 20),
ColB VARCHAR(20) NOT NULL
)
UNIQUE PRIMARY INDEX pidx (ColA);
Granted, ColA may not be the best primary index for data access or joins with other tables in the data model. It just shows that you could use it as the PI on the table.
This works too:
create table temp1 as
(
select
ROW_NUMBER() over( ORDER BY col1 ) insert_num
,col1
,col2
,col3
from a join b on a.id=b.id
) with data ;
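Since the question says rows are only unique across three columns together, ordering the window by all three should make the numbering repeatable rather than arbitrary among ties. A minimal sketch, assuming col1, col2, and col3 are those three columns (the question does not name them):

```sql
create table temp1 as
(
select
ROW_NUMBER() over( ORDER BY col1, col2, col3 ) insert_num
,col1
,col2
,col3
from a join b on a.id=b.id
) with data ;
```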