I am brand new to Oracle, and although I have used SQL Server fairly extensively, I have not had the need to delve deeply into the details of database design, specifically indexes. So I have spent a good deal of time sitting through tutorials on indexes, both conceptual and Oracle-specific.
In an effort to put my understanding into practice I set up a VERY simple table with some basic indexes.
CREATE TABLE "SYSTEM"."TBL_PERSON"
(
"PERSON_ID" NUMBER(10,0) NOT NULL ENABLE,
"FIRST_NAME" NVARCHAR2(120) NOT NULL ENABLE,
"MIDDLE_NAME" NVARCHAR2(120),
"LAST_NAME" NVARCHAR2(120) NOT NULL ENABLE,
"DOB" DATE NOT NULL ENABLE,
"IS_MALE" NCHAR(1) DEFAULT 'T' NOT NULL ENABLE,
CONSTRAINT "TBL_PERSON_PK" PRIMARY KEY ("PERSON_ID")
)
As you can see, the PERSON_ID field contains a unique identifier for each record in the table and is an auto-incrementing primary key.
(Please don't get hung up on missing SQL unless it pertains to the issue of the indexes not being used. I tried to select only the relevant SQL from the DDL and may have missed some items; there was a ton of stuff there that didn't seem relevant to this issue, so I trimmed it out.)
I have created a couple of additional non-clustered indexes on the table.
CREATE INDEX "SYSTEM"."IDX_LAST_NAME" ON "SYSTEM"."TBL_PERSON" ("LAST_NAME")
CREATE INDEX "SYSTEM"."IDX_PERSON_NAME" ON "SYSTEM"."TBL_PERSON" ("FIRST_NAME", "LAST_NAME")
When I run an "Explain Plan" on the following SQL I get notified that the PK index was used as expected.
select * from TBL_PERSON where PERSON_ID = 21
However, when I run a query to select someone with a particular LAST_NAME, the LAST_NAME index seems to be ignored.
select * from TBL_PERSON where LAST_NAME = 'Stenstrom'
Why would it not use IDX_LAST_NAME? For what it's worth I have the same issue with the composite index IDX_PERSON_NAME.
The key to your question is the "Cardinality" column in the plan: only five rows are estimated as being returned from the table.
Oracle has a choice between two execution plans:
Load the data page. Scan the five records on the data page and choose the one(s) that match the condition.
Load the index and scan it for the match. Then load the data page and lookup the matching record(s).
Oracle has concluded that for five records, the first approach is faster. If you load more data into the table, you should see the execution plan change. Alternatively, if the query selected only last_name instead of *, Oracle might very well choose the index, since the result could then be read from the index alone.
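One way to see the plan change for yourself is to load a realistic volume of data, refresh the optimizer statistics, and re-run the explain plan. The sketch below is illustrative only: the row counts, generated values and schema name simply mirror the question.
-- PERSON_ID is supplied explicitly here; adjust if a sequence/identity populates it.
INSERT INTO TBL_PERSON (PERSON_ID, FIRST_NAME, LAST_NAME, DOB, IS_MALE)
SELECT LEVEL, 'First' || LEVEL, 'Last' || MOD(LEVEL, 1000), SYSDATE, 'T'
FROM DUAL
CONNECT BY LEVEL <= 100000;

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SYSTEM', tabname => 'TBL_PERSON');
END;
/

EXPLAIN PLAN FOR
  SELECT * FROM TBL_PERSON WHERE LAST_NAME = 'Stenstrom';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- With realistic cardinality and fresh statistics, the plan should now show IDX_LAST_NAME.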
So I'm importing a large amount of JSON data and loading it into an SQLite database. I'm using transactions for the inserts, and I've tried table definitions with and without NULL/NOT NULL constraints to check the difference in performance.
When I had tables in SQLite that looked like this:
CREATE TABLE comments(
id TEXT,
author TEXT,
body TEXT,
score INTEGER,
created_utc TEXT
);
The import time was really slow, and searching in the table (e.g. select * from comments where author = 'blabla') was also slow.
When instead using a table with specified NULL or NOT NULL constraints, the import time and search time went much faster (from 2000 seconds to 600 seconds).
CREATE TABLE comments(
id TEXT PRIMARY KEY,
author TEXT NOT NULL,
body TEXT NULL,
score INTEGER NULL,
created_utc TEXT NULL
);
Does anyone know why this change in performance happened when using NULL or NOT NULL?
As per my comment, adding PRIMARY KEY may be a major factor in the improved searches, although it may have a negative impact on inserts since that index has to be maintained.
Coding NULL makes no difference as it just leaves the NOT NULL flag as 0, so that can be ignored.
Coding NOT NULL may result in fewer inserts, because rows that violate the constraint are rejected, and could thus result in a performance improvement.
Regarding the PRIMARY KEY, coding it as anything other than INTEGER PRIMARY KEY or INTEGER PRIMARY KEY AUTOINCREMENT will result in a separate index being created.
That is, unless a table is defined with WITHOUT ROWID, SQLite creates the "REAL" primary index on a normally invisible column named rowid, which uniquely identifies a row. (Try SELECT rowid FROM comments.)
As such, in both scenarios there is an index based upon the rowid. For all intents and purposes this will be the order in which the rows were inserted.
In the second scenario there will be two indexes: the "REAL" primary index based upon the rowid, and the defined primary key index based upon the id column. There would be some impact on inserts due to the second index needing to be maintained.
So say you search the id column for a given id: in the first table it will be relatively slow, because all it has is rowid order to search through. With the index on id, however, the search is going to be favourable because that index (of the two available) is the one the search would likely be based upon.
Note that the above is a pretty simplistic overview; it doesn't consider The SQLite Query Planner, which may be of interest. The ANALYZE statement may also be of interest.
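A quick way to see which access path each definition gives you is EXPLAIN QUERY PLAN (a minimal sketch; the literal id value and the reported index name are just examples of typical output):
-- First definition (no PRIMARY KEY on id): only a full scan is available
EXPLAIN QUERY PLAN SELECT * FROM comments WHERE id = 'c1a2b3';
-- e.g. SCAN comments

-- Second definition (id TEXT PRIMARY KEY): the implicit unique index can be used
EXPLAIN QUERY PLAN SELECT * FROM comments WHERE id = 'c1a2b3';
-- e.g. SEARCH comments USING INDEX sqlite_autoindex_comments_1 (id=?)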
I want to design a primary key for my table with row versioning. My table contains two main fields, ID and Timestamp, plus a bunch of other fields. For a unique ID, I want to store previous versions of a record, so I am making the primary key of the table a combination of the ID and Timestamp fields.
Hence, to see all the versions of a particular ID, I can run:
Select * from table_name where ID=<ID_value>
To return the most recent version of an ID, I can use
Select * from table_name where ID=<ID_value> ORDER BY timestamp desc
and get the first element.
My question here is: will this query be efficient and run in O(1), instead of scanning the entire table to find all entries matching the same ID, considering that the ID field is part of the primary key? Ideally, to get a result in O(1) I should have provided the entire primary key. If it does need to do a full table scan, then how else can I design my primary key so that I get this request done in O(1)?
The canonical reference on this subject is Effective Timestamping in Databases:
https://www.cs.arizona.edu/~rts/pubs/VLDBJ99.pdf
I usually design with a subset of this paper's recommendations, using a table containing only a primary key, with another referencing table that has that key as well as change_user, valid_from and valid_until columns with appropriate defaults. This makes referential integrity easy, as well as future-value insertion and history retention. Index as appropriate, and consider check constraints or triggers to prevent overlaps and gaps if you expose these fields to the application for direct modification; these have an obvious performance overhead.
We then make a "current values view" which is exposed to developers, and is also insertable via an "instead of" trigger.
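A minimal sketch of that layout, assuming PostgreSQL and illustrative table and column names (the "instead of" trigger is omitted):
-- Anchor table: holds only the stable key
CREATE TABLE item (
    item_id integer PRIMARY KEY
);

-- Versioned attributes: one row per validity interval
CREATE TABLE item_version (
    item_id     integer   NOT NULL REFERENCES item (item_id),
    valid_from  timestamp NOT NULL DEFAULT now(),
    valid_until timestamp NOT NULL DEFAULT 'infinity',
    change_user text      NOT NULL DEFAULT current_user,
    name        text,
    PRIMARY KEY (item_id, valid_from)
);

-- "Current values" view exposed to developers
CREATE VIEW item_current AS
SELECT item_id, name
FROM item_version
WHERE now() >= valid_from
  AND now() <  valid_until;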
It's far easier and better to use the History Table pattern for this.
create table foo (
foo_id int primary key,
name text
);
create table foo_history (
foo_id int,
version int,
name text,
operation char(1) check ( operation in ('u','d') ),
modified_at timestamp,
modified_by text,
primary key (foo_id, version)
);
Create a trigger to copy a foo row to foo_history on update or delete.
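A hedged sketch of such a trigger (PostgreSQL 11+ syntax; the max(version)+1 numbering is just one possible scheme):
CREATE OR REPLACE FUNCTION foo_history_copy() RETURNS trigger AS $$
BEGIN
    INSERT INTO foo_history (foo_id, version, name, operation, modified_at, modified_by)
    VALUES (
        OLD.foo_id,
        COALESCE((SELECT max(version) FROM foo_history WHERE foo_id = OLD.foo_id), 0) + 1,
        OLD.name,
        CASE TG_OP WHEN 'UPDATE' THEN 'u' ELSE 'd' END,
        now(),
        current_user
    );
    RETURN OLD;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_history_trg
    AFTER UPDATE OR DELETE ON foo
    FOR EACH ROW EXECUTE FUNCTION foo_history_copy();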
https://wiki.postgresql.org/wiki/Audit_trigger_91plus for a full example with postgres
In Microsoft SQL Server, when creating tables, are there any downsides to using a unique constraint on a column even though you don't really need it to be unique?
An example would be descriptions for say a role in a user management system:
CREATE TABLE Role
(
ID TINYINT PRIMARY KEY NOT NULL IDENTITY(0, 1),
Title CHARACTER VARYING(32) NOT NULL UNIQUE,
Description CHARACTER VARYING(MAX) NOT NULL UNIQUE
)
My fear is that validating this constraint when doing frequent insertions in other tables will be a very time-consuming process. I am unsure how this constraint is validated, but I feel like it could be done either very efficiently or as a linear comparison.
Your fear is well founded: UNIQUE constraints are implemented as indices, and this is time- and space-consuming.
So, whenever you insert a new row, the database has to update the table, and also one index for each unique constraint.
So, according to you:
using a unique constraint on a column even though you don't really need it to be unique
the answer is no, don't use it. There are time and space downsides.
Your sample table would get a clustered index for the ID, plus two extra indices, one for each unique constraint. This takes up space, as well as time to update the three indices on every insert.
This would only be justified if you made queries filtering by those fields.
BY THE WAY:
The original post's sample table has several flaws:
that syntax is not SQL Server syntax (and you tagged this as SQL Server)
you cannot create an index on a varchar(max) column
If you correct the syntax and create this table:
CREATE TABLE Role
(
ID tinyint PRIMARY KEY NOT NULL IDENTITY(0, 1),
Title varchar(32) NOT NULL UNIQUE,
Description varchar(32) NOT NULL UNIQUE
)
You can then execute sp_help Role and you'll find the 3 indices.
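Another small check (assuming the corrected table lives in the dbo schema) is to query sys.indexes directly:
SELECT name, type_desc, is_unique, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.Role');
-- Expect three rows: the clustered PK index plus two unique nonclustered indexes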
The database creates an index which backs up the UNIQUE constraint, so it should be very low-cost to do the uniqueness check.
http://msdn.microsoft.com/en-us/library/ms177420.aspx
The Database Engine automatically creates a UNIQUE index to enforce the uniqueness requirement of the UNIQUE constraint. Therefore, if an attempt to insert a duplicate row is made, the Database Engine returns an error message that states the UNIQUE constraint has been violated and does not add the row to the table. Unless a clustered index is explicitly specified, a unique, nonclustered index is created by default to enforce the UNIQUE constraint.
Is it typically a good practice to constrain it if you know the data
will always be unique but it doesn't necessarily need to be unique for
the application to function correctly?
My question to you: would it make sense for two roles to have different titles but the same description? e.g.
INSERT INTO Role ( Title , Description )
VALUES ( 'CEO' , 'Senior manager' ),
( 'CTO' , 'Senior manager' );
To me it would seem to devalue the use of the description; if there were many duplications then it might make more sense to do something more like this:
INSERT INTO Role ( Title )
VALUES ( 'CEO' ),
( 'CTO' );
INSERT INTO SeniorManagers ( Title )
VALUES ( 'CEO' ),
( 'CTO' );
But then again you are not expecting duplicates.
I assume this is a low activity table. You say you fear validating this constraint when doing frequent insertions in other tables. Well, that will not happen (unless there is a trigger we cannot see that might update this table when another table is updated).
Personally, I would ask the designer (business analyst, whatever) to justify not applying a unique constraint. If they cannot, then I would impose the unique constraint based on common sense. As is usual for such a text column, I would also apply CHECK constraints, e.g. to disallow leading/trailing/double spaces, zero-length strings, etc.
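A hedged sketch of such CHECK constraints in T-SQL (constraint names are illustrative):
ALTER TABLE Role ADD CONSTRAINT CK_Role_Title_NotBlank
    CHECK (LEN(Title) > 0);                  -- no zero-length (or all-space) titles
ALTER TABLE Role ADD CONSTRAINT CK_Role_Title_Trimmed
    CHECK (Title = LTRIM(RTRIM(Title)));     -- no leading/trailing spaces
ALTER TABLE Role ADD CONSTRAINT CK_Role_Title_SingleSpaced
    CHECK (CHARINDEX('  ', Title) = 0);      -- no double spaces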
On SQL Server, the data type tinyint only gives you 256 distinct values. No matter what you do outside of the id column, you're not going to end up with a very big table. It will surely perform quickly even with a dozen indexed columns.
You usually need at least one unique constraint besides the surrogate key, though. If you don't have one, you're liable to end up with data like this.
1 First title First description
2 First title First description
3 First title First description
...
17 Third title Third description
18 First title First description
Tables that permit data like that are usually wrong. Any table that uses foreign key references to this table won't be able to report correctly on, say, how many times "First title" is used.
I'd argue that allowing multiple, identical titles for roles in a user management system is a design error. I'd probably argue that "title" is a really bad name for that column, too.
So, I have this funny requirement of creating an index on a table only on a certain set of rows.
This is what my table looks like:
USER: userid, friendid, created, blah0, blah1, ..., blahN
Now, I'd like to create an index on:
(userid, friendid, created)
but only on those rows where userid = friendid. The reason being that this index is only going to be used to satisfy queries where the WHERE clause contains "userid = friendid". There will be many rows where this is NOT the case, and I really don't want to waste all that extra space on the index.
Another option would be to create a table (query table) which is populated on insert/update of this table and create a trigger to do so, but again I am guessing an index on that table would mean that the data would be stored twice.
How does MySQL store primary keys? I mean, is the table ordered on the primary key, or is it ordered by insert order with the PK acting like a normal unique index?
I checked up on clustered indexes (http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html), but it seems only InnoDB supports them. I am using MyISAM (I mention this because then I could have created a clustered index on these 3 fields in the query table).
I am basically looking for something like this:
ALTER TABLE USERS ADD INDEX (userid, friendid, created) WHERE userid=friendid
Regarding the conditional index:
You can't do this. MySQL has no such thing.
Regarding the primary key:
It depends on the storage engine. MySQL does not define how data is stored or retrieved; that's left up to the storage engine.
MyISAM does not enforce any order on how rows are stored; they're appended to the end of the table but gaps from deleting can be reused and UPDATE queries can leave things out of order even without any DELETEs.
InnoDB stores rows in order of their primary keys.
It's hard to tell what you're actually trying to do here (why would a user need to be his own friend?), but it seems to me a simple rethinking of your database schema would resolve this problem.
table 1: USER: userid, created, blah0, blah1, ...,
table 2: userIsFriend (user1,user2,...)
and just do your indexing on table 2 (whose elements presumably have foreign key constraint on table 1)
BTW, you should probably be using InnoDB if you want to do anything semi-serious with MySQL anyway, IMHO.
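A hedged sketch of that split, assuming InnoDB and illustrative names (the users table stands in for the original USER table):
CREATE TABLE users (
    userid  INT NOT NULL PRIMARY KEY,
    created DATETIME NOT NULL
    -- blah0 ... blahN
) ENGINE=InnoDB;

CREATE TABLE userIsFriend (
    user1   INT NOT NULL,
    user2   INT NOT NULL,
    created DATETIME NOT NULL,
    PRIMARY KEY (user1, user2),
    CONSTRAINT fk_uif_user1 FOREIGN KEY (user1) REFERENCES users (userid),
    CONSTRAINT fk_uif_user2 FOREIGN KEY (user2) REFERENCES users (userid)
) ENGINE=InnoDB;
A "self-friendship" is then just a row where user1 = user2, and a query filtering on both columns can use the primary key directly.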
So, I have a subscriptions table:
id - int(11) (With Primary Key)
user_id - int(11)
group_id - int(11)
role - int(11)
pending - tinyint(1)
created_at - datetime
updated_at - datetime
I'm often doing queries to see if users have access rights similar to this:
SELECT * FROM `subscriptions` WHERE (group_id = 1 AND user_id = 2 AND pending = 0) LIMIT 1
I'm wondering if adding a unique index on subscriptions(group_id, user_id, pending) would help or hinder in this case? What are the best practices for indexing almost an entire table?
This will certainly help, especially if you replace * with 1 in your query (if you just want to check that the row exists).
It may have a little impact on DML though.
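For instance (a small sketch; the index name is illustrative, and the unique index will only build if no duplicate combinations already exist):
ALTER TABLE subscriptions
    ADD UNIQUE INDEX idx_group_user_pending (group_id, user_id, pending);

-- Existence check that can typically be answered from the index alone
SELECT 1
FROM subscriptions
WHERE group_id = 1 AND user_id = 2 AND pending = 0
LIMIT 1;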
Creating an index is in fact creating a B-Tree which has this structure:
indexed_col1
indexed_col2
...
indexed_colN
row_pointer
as the key, with row_pointer being a file offset (for MyISAM) or the value of the row's PRIMARY KEY (for InnoDB).
If your query uses no columns other than the indexed ones, all the information you need can be retrieved from the index alone, without even having to refer to the table itself.
If your data are intrinsically unique, it's always good to create a UNIQUE index on them. This is less the case for MySQL, but more advanced optimizers (SQL Server, for instance) can use the fact that the data are unique to build a more efficient plan.
See this article in my blog for an example:
Making an index UNIQUE