I have a table structure like below:
FeatureList
ID - BIGINT - Primary Key - Clustered Index
VIN - VARCHAR(50)
Text - VARCHAR(50)
Value - VARCHAR(50)
Most of the queries I execute on this table are like:
SELECT * FROM FeatureList WHERE VIN = 'ABCD' --- Will give multiple records
OR
DELETE FROM FeatureList WHERE VIN = 'ABCD'
I want to know: is the VIN column a good candidate for a nonclustered index? Or might it degrade the performance?
Not declaring an index on VIN will drastically degrade performance for these queries. With a nonclustered index on VIN you take a small performance hit on each insert, delete, or update involving that column, but reads (especially once you get into millions of records) will run orders of magnitude faster.
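A minimal sketch of the index in question (the index name is just an assumption):

CREATE NONCLUSTERED INDEX IX_FeatureList_VIN
    ON FeatureList (VIN);
-- the clustered key (ID) travels with every nonclustered index entry,
-- so both the SELECT and the DELETE by VIN can seek on this index to find their rows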
As for BIGINT versus INT, I generally go for BIGINT. Yes, it takes up a bit more disk space. Yes, it takes up a bit more memory. The plus side for me, though, is that I never, ever have to worry about migrating the table (and every other table that takes ID as a foreign key) to BIGINT. Been there. Done that. The extra space is worth it.
I created a table in Postgres that includes various columns based on an import from a CSV. Since the CSV can be re-uploaded with some changes, and I don't want the re-upload to create duplicates of rows that didn't change, I added a constraint and a unique index with the majority of the columns included in them. Are there any performance concerns with doing this? Is there a better way to solve this?
CREATE TABLE IF NOT EXISTS table_a (
    id SERIAL PRIMARY KEY,
    table_b_id INTEGER NOT NULL REFERENCES table_b(id) ON DELETE CASCADE,
    column_a INT,
    column_b INT,
    column_c BOOLEAN,
    created_time BIGINT NOT NULL,
    CONSTRAINT "table_a_key" UNIQUE ("table_b_id", "column_a", "column_b", "column_c")
);
Adding an index always imposes a performance penalty for INSERT, UPDATE and DELETE statements as the index needs to be maintained.
However, in most cases the added overhead doesn't really matter and may well be outweighed by the improved performance when SELECTing data. The performance overhead is more clearly visible during bulk loads than during single-row operations.
But if you want to prevent duplicates, you don't have a choice to begin with.
The only reliable way to prevent duplicates is to create a unique constraint (or index) as you did. It's also the only way you can make use of an "upsert" using insert ... on conflict ...
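As a rough sketch of what that upsert could look like against the table above (the values and the choice of column to update are just illustrative):

INSERT INTO table_a (table_b_id, column_a, column_b, column_c, created_time)
VALUES (1, 10, 20, true, 1700000000)
ON CONFLICT ON CONSTRAINT table_a_key
DO UPDATE SET created_time = EXCLUDED.created_time;
-- or DO NOTHING, if re-uploaded rows that already exist should simply be skipped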
So, yes there is a performance penalty to pay. And no, there is no better way to ensure uniqueness.
So I'm importing large JSON data and loading it into a SQLite database. I'm using transactions for the inserts, and I've tried tables with and without NULL/NOT NULL constraints to check the difference in performance.
When I had tables in SQLite that looked like this:
CREATE TABLE comments(
id TEXT,
author TEXT,
body TEXT,
score INTEGER,
created_utc TEXT
);
The import time was really slow, and searching in the table (e.g. select * from comments where author = 'blabla') was also slow.
When instead using a table with specified NULL or NOT NULL constraints, the import time and search time went much faster (from 2000 seconds to 600 seconds).
CREATE TABLE comments(
id TEXT PRIMARY KEY,
author TEXT NOT NULL,
body TEXT NULL,
score INTEGER NULL,
created_utc TEXT NULL
);
Does anyone know why this change in performance happened when using NULL or NOT NULL?
As per my comment, adding PRIMARY KEY is probably the major factor in the improvement for searches, although it may have a negative impact on inserts, as that index has to be maintained.
Coding NULL makes no difference as it just leaves the NOT NULL flag as 0, so that can be ignored.
Coding NOT NULL may result in fewer inserts, because rows that violate the constraint are rejected, and that could show up as a performance improvement.
Regarding the PRIMARY KEY: declaring it as anything other than INTEGER PRIMARY KEY or INTEGER PRIMARY KEY AUTOINCREMENT will result in a separate index being created.
That is, if a table is not defined WITHOUT ROWID, SQLite creates the "real" primary index on a normally hidden column named rowid, which uniquely identifies a row. (Try SELECT rowid FROM comments.)
As such, in both scenarios there is an index based upon the rowid. For all intents and purposes this will be the order in which the rows were inserted.
In the second scenario there are two indexes: the "real" primary index based upon the rowid, and the declared primary key index based upon the id column. There will be some impact on inserts due to the second index needing to be maintained.
So say you search the id column for a given id. In the first table this will be relatively slow, as SQLite has to search according to rowid order - that's all it has. With the index on id, however, the search becomes favourable, because that index (of the two available) is the one the search will most likely be based upon.
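As a quick, hedged illustration (assuming the second schema above), EXPLAIN QUERY PLAN shows which index a query would use; the index on author is not part of the original question, it just applies the same principle to the slow author search:

EXPLAIN QUERY PLAN SELECT * FROM comments WHERE id = 'abc';
-- expected: something like SEARCH comments USING INDEX sqlite_autoindex_comments_1 (id=?)

CREATE INDEX IF NOT EXISTS idx_comments_author ON comments(author);
EXPLAIN QUERY PLAN SELECT * FROM comments WHERE author = 'blabla';
-- expected: something like SEARCH comments USING INDEX idx_comments_author (author=?)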
Note that the above is a pretty simplistic overview; it doesn't consider the SQLite Query Planner, which may be of interest. The ANALYZE statement may also be of interest.
I have a table that tracks statuses that a particular file goes through as it is checked over by our system. It looks like this:
FileID int
Status tinyint
TouchedBy varchar(50)
TouchedWhen datetime
There is currently NO primary key on this table; however, there is a clustered index on Status and TouchedWhen.
As the table has continued to grow and query performance has decreased, one thought I've had is to add a primary key so that I get away from the heap lookups -- a primary key on FileID, Status and TouchedWhen.
The problem I'm running into is that TouchedWhen, due to its rounding issues, occasionally has two entries with the exact same datetime.
So then I started researching what it takes to convert that column to datetime2(7) and alter the rows that are duplicates at that point. My table would then look like:
FileID int
Status tinyint
TouchedBy varchar(50)
TouchedWhen datetime2(7)
And a primary key on FileID, Status and TouchedWhen
My question is this -- what is the best way to go through and add a millisecond to the existing rows where there are duplicates? And how can I do this to a table that needs to remain online?
In advance, thanks,
Brent
You shouldn't need to add a primary key to make queries faster - just adding an index on FileID, Status, TouchedWhen will have just as much of a performance impact as adding a primary key. The main benefit of defining a primary key is record identity and referential integrity, which could also be accomplished with an auto-increment primary key.
(I'm NOT saying you shouldn't have a primary key; I'm saying the performance impact of a primary key is in the index itself, not the fact that it's a primary key.)
On the other hand, changing your clustered index to include FileID would likely have a bigger impact as lookups using those columns would not need to search the index then look up the data - the data pages would be right there with the index values.
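A hedged sketch of both options (the question never names the table, so dbo.FileStatus and the index names are assumptions):

-- Option 1: a separate non-clustered index on the lookup columns
CREATE NONCLUSTERED INDEX IX_FileStatus_FileID_Status_TouchedWhen
    ON dbo.FileStatus (FileID, Status, TouchedWhen);

-- Option 2: rebuild the existing clustered index to lead with FileID
-- (the index name must match the existing clustered index; ONLINE needs an edition that supports it)
CREATE CLUSTERED INDEX CIX_FileStatus
    ON dbo.FileStatus (FileID, Status, TouchedWhen)
    WITH (DROP_EXISTING = ON, ONLINE = ON);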
Is it better if I use ID numbers instead of VARCHARs as foreign keys?
And is it better to use ID numbers instead of VARCHARs as primary keys?
By ID number I mean INT!
This is what I have now:
category table:
cat_id ( INT ) (PK)
cat_name (VARCHAR)
category options table:
option_id ( INT ) (PK)
cat_id ( INT ) (FK)
option_name ( VARCHAR )
I COULD HAVE THIS I THINK:
category table:
cat_name (VARCHAR) (PK)
category options table:
cat_name ( VARCHAR ) (FK)
option_name ( VARCHAR ) ( PK )
Or am I thinking completely wrong here?
The problem with VARCHAR being used for any KEY is that it can hold white space - ANY non-printing character, like spaces, tabs, carriage returns, etc. Using a VARCHAR as a key can make your life difficult when you start hunting down why tables aren't returning records whose keys have extra spaces at the end.
Sure, you CAN use VARCHAR, but you do have to be very careful with the input and output. VARCHARs also take up more space and are likely slower in queries.
Integer types have a small list of 10 characters that are valid, 0,1,2,3,4,5,6,7,8,9. They are a much better solution to use as keys.
You could always use an integer-based key and use VARCHAR as a UNIQUE value if you wanted to have the advantages of faster lookups.
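A small sketch of that approach, reusing the category tables from the question (column lengths are assumptions):

CREATE TABLE category (
    cat_id   INT NOT NULL PRIMARY KEY,       -- surrogate key referenced by other tables
    cat_name VARCHAR(100) NOT NULL UNIQUE    -- natural value kept unique for fast lookups
);

CREATE TABLE category_options (
    option_id   INT NOT NULL PRIMARY KEY,
    cat_id      INT NOT NULL REFERENCES category (cat_id),
    option_name VARCHAR(100) NOT NULL
);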
My 2 cents:
From a performance perspective, using CHAR or VARCHAR as primary key or index is a nightmare.
I've tested compound primary keys (INT + CHAR, INT + VARCHAR, INT + INT), and by far INT + INT gave the best performance (loading a data warehouse). Let's say about twice the performance if you keep only numeric primary keys/indexes.
When I'm doing design work I ask myself: have I got anything in this data that I can guarantee is going to be non-NULL, unique, and unchanging? If so that's a candidate to be the primary key. If not, I know I have to generate a key value to use. Assuming, then, that my candidate key happens to be a VARCHAR I then look at the data. Is it reasonably short in length (meaning, say, 20 characters or less)? Or is the VARCHAR field rather long? If it's short it's usable as a key - if it's long, perhaps it's better to not use it as a key (although if it's in consideration for being the primary key I'm probably going to have to index it anyways). At least part of my concern is that the primary key is going to have to be indexed and will perhaps be used as a foreign key from some other table. Comparisons of VARCHAR fields tend to be slower than the comparison of numeric fields (particularly binary numeric fields such as integers) so using a long VARCHAR field as a key may result in slow performance. YMMV.
With an INT you can store values up to about 2 billion in 4 bytes. With VARCHARs you cannot: you need 10 bytes or so of characters to store the same number, and VARCHARs also carry a 2-byte overhead.
So now add up the 6 extra bytes in every PK and FK, plus the 2-byte VARCHAR overhead - for example, a 10-digit key stored as a VARCHAR takes 10 + 2 = 12 bytes versus 4 for an INT, and that cost repeats in every index and foreign key that references it.
I would say it is fine to use VARCHAR as both PRIMARY and FOREIGN KEYS.
The only issue I could foresee is if you have a table, let's say Instruments (share instruments), and you create the PRIMARY/FOREIGN KEY as a VARCHAR, and it happens that the CODE changes.
This does happen on stock exchanges, and would require you to rename all references to this CODE, whereas an ID number would not require this from you.
So to conclude, I would say this depends on your intended use.
EDIT
When I say CODE, I mean the ticker code of, let's say, GOOG or any other share. It is possible for these codes to change over time - look at derivative/futures instruments, for example.
If you make the category name into the ID you will have a problem if you ever decide to rename a category.
There's nothing wrong with either approach, although this question might start the usual argument of which is better: natural or surrogate keys.
If you use CHAR or VARCHAR as a primary key you'll end up using it as a foreign key at some point. When it comes down to it, as @astander says, it depends on your data and how you are going to use it.
There is a requirement to use GUIDs as primary keys. Am I right in thinking that
ProductID UNIQUEIDENTIFIER NOT NULL
ROWGUIDCOL DEFAULT (NEWSEQUENTIALID()) PRIMARY KEY CLUSTERED
will give the fastest SELECT for the WHERE clause
productid in ( guid1 , guid2 ,..., guidn )
and doesn't degrade an independent SELECT that uses a non-clustered index on
natural_key like 'Something%'
The table is queried only by users and is created/recreated programmatically from scratch.
The fact that you're using GUIDs as a clustered index will most definitely negatively impact your performance. Even with NEWSEQUENTIALID(), the GUIDs aren't really sequential - they're only partially so. Their inherent randomness will definitely lead to higher index fragmentation and thus to less optimal seek times.
Additionally, if you have a 16-byte GUID as your clustered key, it will be added to every non-clustered index on that table. That might not sound so bad, but if you have 10 million rows and 10 non-clustered indexes, using a 16-byte GUID vs. a 4-byte INT will cost you 1.2 GB of wasted storage - and not just on disk (which is cheap), but also in your SQL Server's memory (since SQL Server always loads entire 8 KB pages into memory, no matter how full or empty they are).
I can see the appeal of using a GUID as a primary key - the near-100% guarantee of uniqueness is attractive to developers. BUT: as a clustered key, GUIDs are a nightmare for your database.
My best practice: if I really need a GUID as primary key, I add a 4-byte INT IDENTITY to the table which then serves as the clustered key - the results are way better that way!
If you have a non-clustered primary key, your queries using a list of GUIDs will be just as fast as if it were a clustered primary key, and by not using GUIDs for your clustered key, your table will perform even better in the end.
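A minimal sketch of that layout, assuming a simple product table (every name other than ProductID and natural_key is made up for illustration):

CREATE TABLE dbo.Product
(
    ProductKey  INT IDENTITY(1,1) NOT NULL,   -- narrow, ever-increasing clustered key
    ProductID   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL DEFAULT NEWSEQUENTIALID(),
    natural_key VARCHAR(50) NOT NULL,
    CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED (ProductID)
);

-- the IDENTITY column, not the GUID, becomes the clustered key
CREATE UNIQUE CLUSTERED INDEX CIX_Product ON dbo.Product (ProductKey);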
Read up more on the clustered key and why it's so important to pick the right one on Kimberly Tripp's blog - she is the Queen of Indexing and can explain these things much better than I can:
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
As well as the GUIDs being bad (see the answer from marc_s), you also have an IN clause. This breaks down to:
productid = guid1 OR productid = guid2 OR ... OR productid = guidn
...in practice, which is not optimal either.
Generally, natural_key like 'Something%' would most likely be better served by a clustered index on your natural key column.
A Clustered index is best suited to range searches, so it might satisfy your query:
productid in ( guid1 , guid2 ,..., guidn )
but it depends on what else you are selecting, grouping by, ordering by, etc. if the index is to be a covering index. Otherwise another non-clustered index might be picked by the optimiser, followed by a lookup into the clustered index. It also depends to some extent on the number of rows in that table.
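For the LIKE query specifically, one hedged option is a covering non-clustered index; the table name and INCLUDE list below are assumptions and would need to match what the query actually selects:

CREATE NONCLUSTERED INDEX IX_Product_natural_key
    ON dbo.Product (natural_key)
    INCLUDE (ProductID);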
Also, I think you might want to use NEWID() as opposed to NEWSEQUENTIALID().