NULL or NOT to NULL - Effects on performance

NULL or NOT to NULL - Effects on performance - sql

So I'm importing large JSON-data and translating it to a SQLite server. I'm using transactions for the inserts, and I've tried tables using NULL or not using NULL to check the difference in performance.
When I had tables in SQLite that looked like this:
CREATE TABLE comments(
id TEXT,
author TEXT,
body TEXT,
score INTEGER,
created_utc TEXT
);
The import time was really slow, and searching in the table (e.g. select * from comments where author = 'blabla') was also slow.
When instead using a table with specified NULL or NOT NULL constraints, the import time and search time went much faster (from 2000 seconds to 600 seconds).
CREATE TABLE comments(
id TEXT PRIMARY KEY,
author TEXT NOT NULL,
body TEXT NULL,
score INTEGER NULL,
created_utc TEXT NULL
);
Does anyone know why this change in performance happened when using NULL or NOT NULL?

As per my comment, adding PRIMARY KEY may be a major factor regarding improvements for searches. Although it may have a negative impact on inserts as the that index will have to be maintained.
Coding NULL makes no difference as it just leaves the NOT NULL flag as 0, so that can be ignored.
Coding NOT NULL may result in fewer inserts due to the constraint being met and could thus result in a performance improvement.
Regarding PRIMARY INDEX, coding this as anything other than INTEGER PRIMARY KEY or INTEGER PRIMARY KEY AUTOINCREMENT will result in a subsequent index being created.
That is, if a table is not defined with WITHOUT ROWID then SQLite creates the "REAL" primary index with a normally invisible column named rowid. This uniquely identifies a row. (Try SELECT rowid FROM comments)
As such, in both scenarios there is an index based upon the rowid. For all intents and purposes this will be the order in which the rows were inserted.
In the second scenario there will be 2 indexes the "REAL" primary index based upon the rowid and the defined primary index based upon the id column. There would be some impact on inserts due to the 2nd index needing to be maintained.
So say you search the id column for id x, in the first table it will be relatively slow as it's got to search according to rowid order, it's all it has. However, adding the index according to id and the search is going to be favourable because that index (of the 2 available) is the one the search would likely be based upon.
Note the above is a pretty simplistic overview it doesn't consider The SQLite Query Planner which may be of interest. The ANALYZE statement may also be of interest.

Related

What is the most efficient strategy for lookups on a large, static table which is already in sorted order (sqlite)?

I have a basic reverse lookup table in which the ids are already sorted in ascending numerical order:
id INT NOT NULL,
value INT NOT NULL
The ids are not unique; each id has from 5 to 25,000 associated values. Each id is independent, i.e., no relationships between the ids.
The table is static. Read only, no inserts or updates ever. The table has 100-200 million records. The database itself will be around 7-12gb. Sqlite.
I will do frequent lookups in this table and want the fastest response time for each query. Lookups are one-direction only, unordered, and always of the form:
SELECT value WHERE id IN (x,y,z)
What advantages does the pre-sorted order give me in terms of database efficiency? What should I do differently than I would with typical unordered tables? How do I tell sql that it's an ordered list?
What about indices: is it necessary or even helpful to create an index on id?
[Updated for clustered comment thanks to Gordon Linoff]. As far as I can tell, sqlite doesn't support clustered indices directly. The wiki says: "Are [clustered indices] supported? No, but if you use INTEGER PRIMARY KEY it acts as a clustered index." In my situation, the column id is not unique...

Assuming that space is not an issue, you should create an index on (id, value). This should be sufficient for your purposes.
However, if the table is static, then I would recommend that you create a clustered index when you create the table. The index would have the same keys, (id, value).

If the table happens to be sorted, the database does not know about this, so you'd still need an index.
It is a better idea to use a WITHOUT ROWID table (what other DBs call a clustered index):
CREATE TABLE MyLittleLookupTable (
id INTEGER,
value INTEGER,
PRIMARY KEY (id, value)
) WITHOUT ROWID;

RDBMS primary key design for row versioning

I want to design primary key for my table with row versioning. My table contains 2 main fields : ID and Timestamp, and bunch of other fields. For a unique "ID" , I want to store previous versions of a record. Hence I am creating primary key for the table to be combination of ID and timestamp fields.
Hence to see all the versions of a particular ID, I can give,
Select * from table_name where ID=<ID_value>
To return the most recent version of a ID, I can use
Select * from table_name where ID=<ID_value> ORDER BY timestamp desc
and get the first element.
My question here is, will this query be efficient and run in O(1) instead of scanning the entire table to get all entries matching same ID considering ID field was a part of primary key fields? Ideally to get a result in O(1), I should have provided the entire primary key. If it does need to do entire table scan, then how else can I design my primary key so that I get this request done in O(1)?
Thanks,
Sriram

The canonical reference on this subject is Effective Timestamping in Databases:
https://www.cs.arizona.edu/~rts/pubs/VLDBJ99.pdf
I usually design with a subset of this paper's recommendations, using a table containing a primary key only, with another referencing table that has that key as well change_user, valid_from and valid_until colums with appropriate defaults. This makes referential integrity easy, as well as future value insertion and history retention. Index as appropriate, and consider check constraints or triggers to prevent overlaps and gaps if you expose these fields to the application for direct modification. These have an obvious performance overhead.
We then make a "current values view" which is exposed to developers, and is also insertable via an "instead of" trigger.

It's far easier and better to use the History Table pattern for this.
create table foo (
foo_id int primary key,
name text
);
create table foo_history (
foo_id int,
version int,
name text,
operation char(1) check ( operation in ('u','d') ),
modified_at timestamp,
modified_by text
primary key (foo_id, version)
);
Create a trigger to copy a foo row to foo_history on update or delete.
https://wiki.postgresql.org/wiki/Audit_trigger_91plus for a full example with postgres

oracle query is not using indexes

I am brand new to Oracle and although I have used SQL Server fairly extensively I have not had the need to delve deeply into the details of database design ... specifically INDEXES. So I have spent a good deal of time sitting through tutorials on Indexes ... both in concept as well as Oracle specific.
In an effort to put my understanding into practice I set up a VERY simple table with some basic indexes.
CREATE TABLE "SYSTEM"."TBL_PERSON"
(
"PERSON_ID" NUMBER(10,0) NOT NULL ENABLE,
"FIRST_NAME" NVARCHAR2(120) NOT NULL ENABLE,
"MIDDLE_NAME" NVARCHAR2(120),
"LAST_NAME" NVARCHAR2(120) NOT NULL ENABLE,
"DOB" DATE NOT NULL ENABLE,
"IS_MALE" NCHAR(1) DEFAULT 'T' NOT NULL ENABLE,
CONSTRAINT "TBL_PERSON_PK" PRIMARY KEY ("PERSON_ID")
)
As you can see the PERSON_ID field contains the unique ROWID for each record in the table and is an auto-incrementing Primary Key.
(please don't get hung up on missing SQL unless it pertains to the issue of INDEXES not working. I tried to select only the relevant SQL from the DDL and may have missed some items. There was a ton of stuff there that I didn't think was relevant to this issue so I tried to trim it out)
I have created a couple of additional non-clustered indexes on the table.
CREATE INDEX "SYSTEM"."IDX_LAST_NAME" ON "SYSTEM"."TBL_PERSON" ("LAST_NAME")
CREATE INDEX "SYSTEM"."IDX_PERSON_NAME" ON "SYSTEM"."TBL_PERSON" ("FIRST_NAME", "LAST_NAME")
When I run an "Explain Plan" on the following SQL I get notified that the PK index was used as expected.
select * from TBL_PERSON where PERSON_ID = 21
However when I run a query to select someone with a particular LAST_NAME the LAST_NAME index seems to go ignored.
select * from TBL_PERSON where LAST_NAME = 'Stenstrom'
Why would it not use IDX_LAST_NAME? For what it's worth I have the same issue with the composite index IDX_PERSON_NAME.

The key to your question is the column "cardinality". You have only five rows estimated as being returned for the table.
Oracle has a choice between two execution plans:
Load the data page. Scan the five records on the data page and choose the one(s) that match the condition.
Load the index and scan it for the match. Then load the data page and lookup the matching record(s).
Oracle has concluded that for five records, the first approach is faster. If you load more data into the table, you should see the execution plan change. Alternatively, if you had select last_name instead of select *, then Oracle might very well choose the index.

Should I use a unique constraint in a table even though it isn't necessarily required?

In Microsoft SQL Server, when creating tables, are there any downsides to using a unique constraint on a column even though you don't really need it to be unique?
An example would be descriptions for say a role in a user management system:
CREATE TABLE Role
(
ID TINYINT PRIMARY KEY NOT NULL IDENTITY(0, 1),
Title CHARACTER VARYING(32) NOT NULL UNIQUE,
Description CHARACTER VARYING(MAX) NOT NULL UNIQUE
)
My fear is that validating this constraint when doing frequent insertions in other tables will be a very time consuming process. I am unsure as to how this constraint is validated, but I feel like it could be done in a very efficient way or as a linear comparison.

Your fear becomes true: UNIQUE constraint are implemented as indices, and this is time and space consuming.
So, whenever you insert a new row, the database have to update the table, and also one index for each unique constraint.
So, according to you:
using a unique constraint on a column even though you don't really need it to be unique
the answer is no, don't use it. there are time and space downsides.
Your sample table would need a clustered index for the Id, and 2 extra indices, one for each unique constraint. This takes up space, and time to update the 3 indices on the inserts.
This would only be justified if you made queries filtering by those fields.
BY THE WAY:
The original post sample table have several flaws:
that syntax is not SQL Server syntax (and you tagged this as SQL Server)
you cannot create an index in a varchar(max) column
If you correct the syntax and create this table:
CREATE TABLE Role
(
ID tinyint PRIMARY KEY NOT NULL IDENTITY(0, 1),
Title varchar(32) NOT NULL UNIQUE,
Description varchar(32) NOT NULL UNIQUE
)
You can then execute sp_help Role and you'll find the 3 indices.

The database creates an index which backs up the UNIQUE constraint, so it should be very low-cost to do the uniqueness check.
http://msdn.microsoft.com/en-us/library/ms177420.aspx
The Database Engine automatically creates a UNIQUE index to enforce the uniqueness requirement of the UNIQUE constraint. Therefore, if an attempt to insert a duplicate row is made, the Database Engine returns an error message that states the UNIQUE constraint has been violated and does not add the row to the table. Unless a clustered index is explicitly specified, a unique, nonclustered index is created by default to enforce the UNIQUE constraint.

Is it typically a good practice to constrain it if you know the data
will always be unique but it doesn't necessarily need to be unique for
the application to function correctly?
My question to you: would it make sense for two roles to have different titles but the same description? e.g.
INSERT INTO Role ( Title , Description )
VALUES ( 'CEO' , 'Senior manager' ),
( 'CTO' , 'Senior manager' );
To me it would seem to devalue the use of the description; if there were many duplications then it might make more sense to do something more like this:
INSERT INTO Role ( Title )
VALUES ( 'CEO' ),
( 'CTO' );
INSERT INTO SeniorManagers ( Title )
VALUES ( 'CEO' ),
( 'CTO' );
But then again you are not expecting duplicates.
I assume this is a low activity table. You say you fear validating this constraint when doing frequent insertions in other tables. Well, that will not happen (unless there is a trigger we cannot see that might update this table when another table is updated).
Personally, I would ask the designer (business analyst, whatever) to justify not applying a unique constraint. If they cannot then I would impose the unqiue constraint based on common sense. As is usual for such a text column, I would also apply CHECK constraints e.g. to disallow leading/trailing/double spaces, zero-length string, etc.

On SQL Server, the data type tinyint only gives you 256 distinct values. No matter what you do outside of the id column, you're not going to end up with a very big table. It will surely perform quickly even with a dozen indexed columns.
You usually need at least one unique constraint besides the surrogate key, though. If you don't have one, you're liable to end up with data like this.
1 First title First description
2 First title First description
3 First title First description
...
17 Third title Third description
18 First title First description
Tables that permit data like that are usually wrong. Any table that uses foreign key references to this table won't be able to report correctly, say, the number of "First title" used.
I'd argue that allowing multiple, identical titles for roles in a user management system is a design error. I'd probably argue that "title" is a really bad name for that column, too.

Performance - Int vs Char(3)

I have a table and am debating between 2 different ways to store information. It has a structure like so
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with the a abbreviated version of the table.
I am wondering if anyone knows if one will be more beneficial speed wise over the other or if there will be any major problems using the char(3)
Note: I will be creating an indexed view on each of the 6 different values for this field. This table will contain ~30k rows and will need to be joined with much larger tables

In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä va à)
I'd use char(3), say for currency code like CHF, GBP etc But if my natural key was "Swiss Franc", "British Pound" etc, I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows or be running a medium sized country before it mattered...

Have you considered using a TinyInt. Only takes one byte to store it's value. TinyInt has a range of values between 0 and 255.

Is the reason you need a single table that you want to ensure that when the six parent tables reference a given instance of a child row that is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers with multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2. If you knew that you would never have more than say six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.

Because your FK cannot be enforced (since it is a variant depending upon type) by database constraint, I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns, one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc) without having to add surrograte keys or stored the key in a varchar or similar workarounds which are prone to problems.

I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think performance wise the integer should beat the char(3).

First off, a 50 character Id that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas