I have a large number of records of type [crc INT, name VARCHAR]
A few of the crc values will be duplicates. I am interested in a fast way to select items that have a specific crc value.
Is it worth it (in terms of performance) to make the crc field an INTEGER PRIMARY KEY (which must be unique) and turn name into a compound value (doable, but ugly, I think), or should I just create an index on the crc field?
Would making the crc column the PRIMARY KEY add significant performance?
Can someone help?
1) To select items that have a specific crc value:
SELECT * FROM tablename WHERE crc=777;
2) Don't make the crc field an INTEGER PRIMARY KEY, because it may contain duplicates. Better to create an id field that is unique and mark it INTEGER PRIMARY KEY AUTOINCREMENT.
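A minimal sketch of that layout (the table name follows the query above; the column types are from the question):
CREATE TABLE tablename (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    crc  INTEGER,
    name VARCHAR
);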
Storing a list of names in a single value will create problems when you want to access individual names.
(But if you always look up or retrieve all those names at once, it will not be much of a problem.)
If there are duplicates, you cannot make that column the primary key.
Just creating a (non-unique) index is the simplest solution.
Using an INTEGER PRIMARY KEY index is a little bit more efficient than a separate index, but the difference is unlikely to be significant.
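The simple-index option would look like this (the index name is illustrative):
CREATE INDEX idx_crc ON tablename (crc);
-- a non-unique index: duplicate crc values stay legal, and
-- WHERE crc = 777 becomes an index lookup instead of a full table scan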
Related
I have this table that has a 'unique together' index:
CREATE TABLE IF NOT EXISTS entity_properties
(
entity_id INTEGER NOT NULL REFERENCES entities,
property_id UUID NOT NULL REFERENCES properties,
value VARCHAR,
value_label VARCHAR,
UNIQUE (entity_id, property_id)
);
I want to create an index on the 'value' column to minimize search time:
CREATE INDEX index_property_value ON entity_properties (value)
I get this error:
index row requires 8296 bytes, maximum size is 8191
As the error clearly says, creating this index fails because some values exceed the maximum size of an index entry.
You can see this answer.
But I really need the 'value' column to be indexed, for efficiency reasons. In my database this table holds the largest part of the data (millions of rows), and it gets updated very frequently. As far as I know, updating indexed columns affects performance, which is why I am concerned about performance.
How can I achieve this?
PS: my other thought is that I could add the 'value' column to the 'unique together' index:
CREATE TABLE IF NOT EXISTS entity_properties
(
entity_id INTEGER NOT NULL REFERENCES entities,
property_id UUID NOT NULL REFERENCES properties,
value VARCHAR,
value_label VARCHAR,
UNIQUE (entity_id, property_id, value)
);
Can this be a solution? If so, is it the best approach? If not, what is the best approach?
PostgreSQL has a built-in hash index type which doesn't suffer from this limitation, so you can just create one of those:
CREATE INDEX index_property_value ON entity_properties using hash (value)
This has the advantage (over using a functional index, as Laurenz suggests) that you don't need to write your query in an unnatural way.
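For example, an ordinary equality search can use the hash index directly (the literal here is a stand-in for one of your long values):
SELECT entity_id, property_id
FROM entity_properties
WHERE value = 'some long string';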
But, is it sensible that the "value" column can contain values this large? Maybe the best solution would be to investigate the large data and clean it up if it is not sensible.
Trying to add this as another column to the existing unique index would just make things worse: the index row would still need 8296 bytes, plus more for the other columns.
It is an unusual requirement to search for long texts. To avoid the error and get efficient index access, use a hash of the column:
CREATE INDEX ON entity_properties (hashtext(value));
This can be used with a query like
SELECT ...
FROM entity_properties
WHERE value = 'long string'
AND hashtext(value) = hashtext('long string');
The first condition is necessary to deal with hash collisions.
So I'm importing large JSON data and loading it into an SQLite database. I'm using transactions for the inserts, and I've tried tables that specify NULL or NOT NULL and tables that don't, to check the difference in performance.
When I had tables in SQLite that looked like this:
CREATE TABLE comments(
id TEXT,
author TEXT,
body TEXT,
score INTEGER,
created_utc TEXT
);
The import time was really slow, and searching in the table (e.g. select * from comments where author = 'blabla') was also slow.
When instead using a table with explicit NULL or NOT NULL constraints, both the import and the searches became much faster (from 2000 seconds to 600 seconds).
CREATE TABLE comments(
id TEXT PRIMARY KEY,
author TEXT NOT NULL,
body TEXT NULL,
score INTEGER NULL,
created_utc TEXT NULL
);
Does anyone know why this change in performance happened when using NULL or NOT NULL?
As per my comment, adding PRIMARY KEY may be the major factor in the improved searches, although it may have a negative impact on inserts, as that index has to be maintained.
Coding NULL makes no difference as it just leaves the NOT NULL flag as 0, so that can be ignored.
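You can see this flag directly; in the output below, the notnull column is 0 both when the constraint is omitted and when NULL is written out:
PRAGMA table_info(comments);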
Coding NOT NULL may result in fewer rows being inserted, because rows that violate the constraint are rejected, and could thus show up as a performance improvement.
Regarding the PRIMARY KEY: declaring it as anything other than INTEGER PRIMARY KEY or INTEGER PRIMARY KEY AUTOINCREMENT will result in a separate index being created.
That is, if a table is not defined WITHOUT ROWID, SQLite creates the "REAL" primary index on a normally hidden column named rowid, which uniquely identifies a row. (Try SELECT rowid FROM comments.)
As such, in both scenarios there is an index based upon the rowid. For all intents and purposes this will be the order in which the rows were inserted.
In the second scenario there will be 2 indexes: the "REAL" primary index based upon the rowid, and the defined primary index based upon the id column. There would be some impact on inserts due to the 2nd index needing to be maintained.
So say you search the id column for id x. In the first table this will be relatively slow, as the search has to work through the data in rowid order; that's all it has. Add the index on id, however, and the search becomes fast, because that index (of the 2 available) is the one the search would likely be based upon.
Note that the above is a pretty simplistic overview; it doesn't consider the SQLite query planner, which may be of interest. The ANALYZE statement may also be of interest.
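A quick way to see which index a query will use (the exact wording of the output varies between SQLite versions):
EXPLAIN QUERY PLAN SELECT * FROM comments WHERE id = 'x';
-- with id TEXT PRIMARY KEY this typically reports something like
--   SEARCH comments USING INDEX sqlite_autoindex_comments_1 (id=?)
-- without an index on id it falls back to a full scan in rowid order:
--   SCAN comments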
Just thinking about database design issues. Suppose i have a table like this:
CREATE TABLE LEGACYD.CV_PLSQL_COUNT
(
RUN_DTTM DATE NOT NULL,
TABLE_NAME VARCHAR (80) NOT NULL,
COUNT NUMBER(38) PRIMARY KEY
);
Is that a bad idea - to make COUNT a PRIMARY KEY? Generally, how should I make the decision of what is a primary key?
Candidate keys are based on functional dependencies. A primary key is one of the candidate keys. There's no formal way to look at a set of candidate keys and say, "This one must be the primary key." That decision is based on practical matters, not on formal logic.
Your table structure tells us these things.
COUNT is unique. No matter how many rows this table has, you'll never find two rows that have the same value for COUNT.
COUNT determines TABLE_NAME. That is, given a value for COUNT, we will forever find one and only one value for TABLE_NAME.
TABLE_NAME is not unique. We expect to find several rows that have the same value for TABLE_NAME.
COUNT determines RUN_DTTM. That is, given a value for COUNT, we will forever find one and only one value for RUN_DTTM.
RUN_DTTM is not unique. We expect to find several rows that have the same value for RUN_DTTM.
The combination of TABLE_NAME and RUN_DTTM is not unique. We expect to find several rows that share the same pair of values for TABLE_NAME and RUN_DTTM.
There are no other determinants. That means that given a value for TABLE_NAME, we'll find multiple unrelated values for COUNT and RUN_DTTM. Likewise, if we're given a value for RUN_DTTM, or values for the pair of columns {TABLE_NAME, RUN_DTTM}.
If all those things are true, then COUNT might be a good primary key. But I doubt that all those things are true.
Based only on the column names--a risky way to proceed--I think it's far more likely that the only candidate key is {TABLE_NAME, RUN_DTTM}. I think it's also likely that either RUN_DTTM is misnamed, or RUN_DTTM has the wrong data type. If it's a date, you probably meant to name it RUN_DT; if it's a timestamp, the data type should be TIMESTAMP.
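If that guess is right, the DDL would look more like this (a sketch that follows the reasoning above, with the rename applied):
CREATE TABLE LEGACYD.CV_PLSQL_COUNT
(
    RUN_DT     DATE NOT NULL,
    TABLE_NAME VARCHAR (80) NOT NULL,
    COUNT      NUMBER (38) NOT NULL,
    PRIMARY KEY (TABLE_NAME, RUN_DT)
);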
Can I use the ROWID in place of a timestamp in an SQLite table?
I need to get the 50 most recent items of an SQLite table. I was thinking about using a separate timestamp field, but then I figured a bigger ROWID means a newer item, and the ROWID is already there, so there's no need to change the schema.
Well, as the SQLite documentation on rowid tables says (with my emphasis):
The data for rowid tables is stored as a B-Tree structure containing one entry for each table row, using the rowid value as the key. This means that retrieving or sorting records by rowid is fast. Searching for a record with a specific rowid, or for all records with rowids within a specified range is around twice as fast as a similar search made by specifying any other PRIMARY KEY or indexed value.
However, it also says:
Rowid values may be modified using an UPDATE statement in the same way as any other column value can, either using one of the built-in aliases ("rowid", "oid" or "_rowid_") or by using an alias created by an integer primary key.
It would certainly be faster, but it sort of hurts my feeling for "open" design, in that you're relying on a feature of the implementation rather than making it explicit.
Having said that, the same link also says this:
With one exception noted below, if a rowid table has a primary key that consists of a single column and the declared type of that column is "INTEGER" in any mixture of upper and lower case, then the column becomes an alias for the rowid. Such a column is usually referred to as an "integer primary key". A PRIMARY KEY column only becomes an integer primary key if the declared type name is exactly "INTEGER". Other integer type names like "INT" or "BIGINT" or "SHORT INTEGER" or "UNSIGNED INTEGER" causes the primary key column to behave as an ordinary table column with integer affinity and a unique index, not as an alias for the rowid.
Which I think gives you the perfect answer:
Define an INTEGER primary key on your table and use that for selection. You'll get the speed of using the ROWID (because as it says above, it's just an alias) and you'll get visibility in the schema of what you're doing.
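A sketch of that advice (names are illustrative; note that without AUTOINCREMENT the rowid of a deleted row can be reused, so this assumes rows are never deleted):
CREATE TABLE items (
    id   INTEGER PRIMARY KEY,  -- an alias for the rowid
    data TEXT
);
-- the 50 most recent items, newest first
SELECT * FROM items ORDER BY id DESC LIMIT 50;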
Is it better if I use ID numbers instead of VARCHARs as foreign keys?
And is it better to use ID numbers instead of VARCHARs as primary keys?
By ID number I mean INT!
This is what I have now:
category table:
cat_id ( INT ) (PK)
cat_name (VARCHAR)
category options table:
option_id ( INT ) (PK)
cat_id ( INT ) (FK)
option_name ( VARCHAR )
I think I could have this instead:
category table:
cat_name (VARCHAR) (PK)
category options table:
cat_name ( VARCHAR ) (FK)
option_name ( VARCHAR ) ( PK )
Or am I thinking completely wrong here?
The problem with using VARCHAR for any key is that it can hold white space. White space means any non-printing character: spaces, tabs, carriage returns, etc. Using a VARCHAR as a key can make your life difficult when you start to hunt down why tables aren't returning records whose keys have extra spaces at the end.
Sure, you CAN use VARCHAR, but you do have to be very careful with the input and output. VARCHARs also take up more space and are likely slower to compare in queries.
Integer types have a small set of ten valid characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. They are a much better solution to use as keys.
You could always use an integer-based key and keep the VARCHAR as a UNIQUE column if you want the advantages of faster lookups.
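A sketch of that pattern, applied to the category table from the question (the length is illustrative):
CREATE TABLE category (
    cat_id   INT PRIMARY KEY,             -- surrogate key referenced by foreign keys
    cat_name VARCHAR(100) NOT NULL UNIQUE -- natural value, still enforced as unique
);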
My 2 cents:
From a performance perspective, using CHAR or VARCHAR as a primary key or index is a nightmare.
I've tested compound primary keys (INT + CHAR, INT + VARCHAR, INT + INT), and INT + INT had by far the best performance (loading a data warehouse). Let's say about twice the performance if you keep your primary keys/indexes purely numeric.
When I'm doing design work I ask myself: have I got anything in this data that I can guarantee is going to be non-NULL, unique, and unchanging? If so that's a candidate to be the primary key. If not, I know I have to generate a key value to use.
Assuming, then, that my candidate key happens to be a VARCHAR I then look at the data. Is it reasonably short in length (meaning, say, 20 characters or less)? Or is the VARCHAR field rather long? If it's short it's usable as a key - if it's long, perhaps it's better to not use it as a key (although if it's in consideration for being the primary key I'm probably going to have to index it anyways).
At least part of my concern is that the primary key is going to have to be indexed and will perhaps be used as a foreign key from some other table. Comparisons of VARCHAR fields tend to be slower than the comparison of numeric fields (particularly binary numeric fields such as integers) so using a long VARCHAR field as a key may result in slow performance. YMMV.
With an INT you can store values up to about 2 billion in 4 bytes. With VARCHARs you cannot: you need 10 bytes or so to store that (2147483647 is 10 characters), and VARCHARs also carry a 2-byte overhead.
So now you add up the 6 extra bytes in every PK and FK, plus the 2-byte VARCHAR overhead.
I would say it is fine to use VARCHAR as both PRIMARY and FOREIGN KEYS.
The only issue I could foresee is if you have a table, let's say Instruments (share instruments), and you create the PRIMARY/FOREIGN KEY as a VARCHAR, and it happens that the CODE changes.
This does happen on stock exchanges, and it would require you to rename all references to this CODE, whereas an ID number would not require this of you.
So to conclude, I would say this depends on your intended use.
EDIT
When I say CODE, I mean the ticker code for, let's say, GOOG or any other share. It is possible for these codes to change over time, say if you look at derivative/future instruments.
If you make the category name into the ID you will have a problem if you ever decide to rename a category.
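To make that concrete (using the first schema from the question; the id value is illustrative):
-- with a surrogate key, a rename touches one row in one table:
UPDATE category SET cat_name = 'New name' WHERE cat_id = 42;
-- with cat_name as the primary key, every referencing row in the
-- category options table would have to be updated as well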
There's nothing wrong with either approach, although this question might start the usual argument about which is better: natural or surrogate keys.
If you use CHAR or VARCHAR as a primary key, you'll end up using it as a foreign key at some point. When it comes down to it, as @astander says, it depends on your data and how you are going to use it.