So, I have a subscriptions table:
id - int(11) (With Primary Key)
user_id - int(11)
group_id - int(11)
role - int(11)
pending - tinyint(1)
created_at - datetime
updated_at - datetime
I'm often running queries like this to check whether a user has access rights:
SELECT * FROM `subscriptions` WHERE (group_id = 1 AND user_id = 2 AND pending = 0) LIMIT 1
I'm wondering if adding a unique index on subscriptions(group_id, user_id, pending) would help or hinder in this case? What are the best practices for indexing almost an entire table?
This will certainly help, especially if you replace * with 1 in your query (if you just want to check that the row exists).
It may have a little impact on DML though.
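For instance, the index from your question plus the existence check might look like this (the index name is only an illustration):
-- Unique composite index covering all the columns the lookup needs
CREATE UNIQUE INDEX idx_subs_group_user_pending
ON subscriptions (group_id, user_id, pending);
-- Existence check that can be answered from the index alone
SELECT 1 FROM subscriptions
WHERE group_id = 1 AND user_id = 2 AND pending = 0
LIMIT 1;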
Creating an index is in fact creating a B-Tree which has this structure:
indexed_col1
indexed_col2
...
indexed_colN
row_pointer
as the key, with row_pointer being a file offset (for MyISAM) or the value of the row's PRIMARY KEY (for InnoDB).
If your query uses no columns other than the indexed ones, all the information you need can be retrieved from the index alone, without even having to touch the table itself.
If your data are intrinsically unique, it's always good to create a UNIQUE index on them. This matters less for MySQL, but more advanced optimizers (SQL Server, for instance) can use the fact that the data are unique to build a more efficient plan.
See this article in my blog for an example:
Making an index UNIQUE
I have a list of data that I want to relate to some ownerId, but the list can exceed row size limits, so I want to split this list across multiple rows. Each entry in this list has its own id, which is unique per owner. I was looking at composite keys (ownerId:entryId), but the main operation I need is to read this data in bulk (read all entries for ownerId). What is the best way to go about structuring this data?
Example:
ownerId | entryId | data
--------|---------|--------
OwnerA | 1 | aaaaa
OwnerA | 2 | bbbbb
OwnerB | 1 | ccccc
Note that ownerId here is a SQL generated id, and entryId is an externally set id.
If you know that consumers of your query will filter on ownerId rather than entryId (i.e. the vast majority of WHERE clauses against your table will filter on ownerId as opposed to entityId), then you could get significant mileage simply by creating a composite clustered key/index on (ownerId, entityId). I say this because relational indexes use the first column as the primary sort criterion, so as long as you're filtering on ownerId, under the hood rows can be retrieved with index seek/range scan operations as opposed to full table scans.
That being said, if you'll have to filter on ownerId and entityId independently (i.e. you'll have several queries in which the WHERE clause is of the form WHERE ownerId = {specific_owner_id}, and several other queries in which the WHERE clause is of the form WHERE entityId = {specific_entity_id}), you might want to consider having both a PRIMARY KEY/CLUSTERED INDEX on (ownerId, entityId) and a unique index on (entityId, ownerId):
CREATE TABLE t (
ownerId INT NOT NULL,
entityId INT NOT NULL,
/*
...all other values ...
*/
CONSTRAINT PK_t PRIMARY KEY (ownerId, entityId)
);
CREATE UNIQUE INDEX t_entity_owner ON t (entityId, ownerId);
If you do this, queries that filter on ownerId and queries that filter on entityId can both take advantage of index seek/scan operations.
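For instance (hypothetical queries against the sketch above; any other columns are omitted):
-- Served by the clustered primary key (ownerId, entityId)
SELECT ownerId, entityId FROM t WHERE ownerId = 42;
-- Served by the secondary unique index (entityId, ownerId)
SELECT ownerId, entityId FROM t WHERE entityId = 7;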
That being said, this type of configuration is most beneficial if table t is used more for READ operations than for WRITE operations. Should your table be more WRITE-heavy, the time taken to maintain each of the indexes could outweigh the benefit of more efficient reads.
You probably need a composite primary key, i.e.:
CREATE TABLE t (
...
PRIMARY KEY (ownerId, entryId)
);
and some separate index for ownerId; in Postgres, for example, a hash index might be a good fit.
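For example (Postgres syntax; this is only a sketch, and since the composite primary key's leading column is already ownerId, whether the extra hash index pays off is a judgment call):
CREATE INDEX t_owner_hash ON t USING HASH (ownerId);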
I have the following table in an SQLite database
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data` BLOB NOT NULL
) WITHOUT ROWID;
CREATE INDEX `time_index` ON `log`(`time`);
The index is created because the most frequent query is going to be
SELECT * FROM `log` WHERE `time` BETWEEN ? AND ?
Since the time is going to be always the current time when the new record is added, the index is not really required here. So I would like to "tell" the SQLite engine something like "The lines are going to be added with the 'time' column always having increasing value (similar to AUTO_INCREMENT), and if something goes wrong I will take all responsibility".
Is it possible at all?
You don't want a separate index. You want to declare the column to be the primary key:
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP PRIMARY KEY,
`data` BLOB NOT NULL
) WITHOUT ROWID;
This creates a single b-tree index for the log based on the primary key. In other databases, this structure would be called a "clustered index". You have probably already read the documentation but I'm referencing it anyway.
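A quick, hedged way to confirm that the range query uses the primary key (the exact wording of the plan output varies between SQLite versions):
EXPLAIN QUERY PLAN
SELECT * FROM log WHERE time BETWEEN 1700000000.0 AND 1700003600.0;
-- Typically reported as something like:
-- SEARCH log USING PRIMARY KEY (time>? AND time<?)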
You would have an issue (or not, depending upon how you view it), in that you cannot use :-
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data` BLOB NOT NULL
) WITHOUT ROWID;
because :-
Every WITHOUT ROWID table must have a PRIMARY KEY. An error is raised
if a CREATE TABLE statement with the WITHOUT ROWID clause lacks a
PRIMARY KEY.
Clustered Indexes and the WITHOUT ROWID Optimization
So you might as well make the time column the PRIMARY KEY.
but the problem is that the precision of REAL is not enough to handle
microsecond resolution, and thus two adjacent records may have the
same time value which would violate the PRIMARY KEY constraint.
Then you could use a composite PRIMARY KEY where the precision required is satisfied by multiple columns (a second column would likely more than suffice) perhaps along the lines of :-
CREATE TABLE log (
time_datepart INTEGER,
time_microsecondpart INTEGER,
data BLOB NOT NULL,
PRIMARY KEY (time_datepart,time_microsecondpart)
) WITHOUT ROWID;
The time_microsecondpart column needn't necessarily hold microseconds; it could be a counter derived from another table, similar to how the sqlite_sequence table is used when AUTOINCREMENT is specified (minus the column that holds the name of the table a row is attached to).
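A rough sketch of that counter idea (the log_sequence table and the wrapping transaction are assumptions, not built-in SQLite behaviour):
CREATE TABLE log_sequence (seq INTEGER NOT NULL);
INSERT INTO log_sequence (seq) VALUES (0);
-- On every insert, bump the counter and use it as the tie-breaker;
-- wrap both statements in one transaction so they stay consistent.
BEGIN;
UPDATE log_sequence SET seq = seq + 1;
INSERT INTO log (time_datepart, time_microsecondpart, data)
VALUES (strftime('%s','now'), (SELECT seq FROM log_sequence), x'0102');
COMMIT;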
So I'm importing large amounts of JSON data and loading it into an SQLite database. I'm using transactions for the inserts, and I've tried tables with and without NULL/NOT NULL constraints to check the difference in performance.
When I had tables in SQLite that looked like this:
CREATE TABLE comments(
id TEXT,
author TEXT,
body TEXT,
score INTEGER,
created_utc TEXT
);
The import time was really slow, and searching in the table (e.g. select * from comments where author = 'blabla') was also slow.
When instead using a table with specified NULL or NOT NULL constraints, the import time and search time went much faster (from 2000 seconds to 600 seconds).
CREATE TABLE comments(
id TEXT PRIMARY KEY,
author TEXT NOT NULL,
body TEXT NULL,
score INTEGER NULL,
created_utc TEXT NULL
);
Does anyone know why this change in performance happened when using NULL or NOT NULL?
As per my comment, adding PRIMARY KEY may be a major factor in the improvement for searches, although it may have a negative impact on inserts, as that index will have to be maintained.
Coding NULL makes no difference as it just leaves the NOT NULL flag as 0, so that can be ignored.
Coding NOT NULL may result in fewer rows being inserted, because rows that violate the constraint are rejected, and could thus show up as a performance improvement.
Regarding the PRIMARY KEY, coding it as anything other than INTEGER PRIMARY KEY or INTEGER PRIMARY KEY AUTOINCREMENT will result in a separate index being created.
That is, if a table is not defined with WITHOUT ROWID then SQLite creates the "REAL" primary index with a normally invisible column named rowid. This uniquely identifies a row. (Try SELECT rowid FROM comments)
As such, in both scenarios there is an index based upon the rowid. For all intents and purposes this will be the order in which the rows were inserted.
In the second scenario there will be 2 indexes the "REAL" primary index based upon the rowid and the defined primary index based upon the id column. There would be some impact on inserts due to the 2nd index needing to be maintained.
So say you search the id column for id x. With the first table it will be relatively slow, because all SQLite has is the rowid order to search by. With the index on id added, the search will likely be based on that index (of the two available), which is far more favourable.
Note the above is a pretty simplistic overview; it doesn't consider The SQLite Query Planner, which may be of interest. The ANALYZE statement may also be of interest.
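A hedged way to check which index a query actually uses (the id value and index name below are only illustrative):
-- With id declared as PRIMARY KEY, this should report a search on
-- the primary-key index rather than a full scan of the table.
EXPLAIN QUERY PLAN
SELECT * FROM comments WHERE id = 'abc';
-- The author query from above would still scan the table unless an
-- index on author is added, e.g.:
-- CREATE INDEX idx_comments_author ON comments(author);
EXPLAIN QUERY PLAN
SELECT * FROM comments WHERE author = 'blabla';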
I want to create a lookup table 'orderstatus' (example below). Just to clarify, this is to be used in a data warehouse. I will need to join through OrderStatus to retrieve the INT (if I create one) to be used elsewhere if need be; in a fact table, for example, I would store the int to link to the lookup table.
+------------------------+------------------+
| OrderStatus | ConnectionStatus |
+------------------------+------------------+
| CLOSED | APPROVE |
+------------------------+------------------+
| COMPLETED | APPROVE |
+------------------------+------------------+
| FULFILLED | APPROVE |
+------------------------+------------------+
| CANCELLED | CLOSED |
+------------------------+------------------+
| DECLINED | CLOSED |
+------------------------+------------------+
| AVS_CHECK_SYSTEM_ERROR | CLOSED |
+------------------------+------------------+
What is best practice in terms of primary key/unique key? Should I just create an OrderStatusKey INT as the primary key with identity, or create a unique constraint on OrderStatus? Thanks.
For this, I would suggest you create an Identity column, and make that the clustered primary key.
It is considered best practice for tables to have a primary key of some kind, but having a clustered index on a table like this is the fastest way to allow the table to be used in multi-table queries (with joins).
Here is a sample of how to add it:
ALTER TABLE dbo.orderstatus
ADD CONSTRAINT PK_orderstatus_OrderStatusID PRIMARY KEY CLUSTERED (OrderStatusID);
GO
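Note that the statement above assumes an OrderStatusID column already exists; if it does not, a hypothetical sketch of adding it as an identity column first:
ALTER TABLE dbo.orderstatus
ADD OrderStatusID INT IDENTITY(1,1) NOT NULL;
GO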
Article with more details: MSDN
And here is another resource explaining primary keys: Primary Key Primer
If OrderStatus is unique and the primary identifier AND you will be reusing this status code directly in related tables (and not a numeric pointer to this status code) then keep the columns as is and make OrderStatus the primary clustered index.
A little explanation:
A primary key is unique across the table; a clustered index ties all record data back to that index. It is not always necessary to have the primary key also be the clustered index on the table but usually this is the case.
If you are going to be linking to the order status using something other than the status code then create another column of type int as an IDENTITY and make that the primary clustered key. Also add a unique non-clustered index to OrderStatus to ensure that no duplicates could ever be added.
Either way you go, every table should have a primary key as well as a clustered index (again, usually they are the same index).
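A hedged sketch of the surrogate-key option (names and data types are illustrative):
CREATE TABLE dbo.OrderStatus (
OrderStatusKey INT IDENTITY(1,1) NOT NULL,
OrderStatus VARCHAR(50) NOT NULL,
ConnectionStatus VARCHAR(50) NOT NULL,
CONSTRAINT PK_OrderStatus PRIMARY KEY CLUSTERED (OrderStatusKey),
CONSTRAINT UQ_OrderStatus UNIQUE NONCLUSTERED (OrderStatus)
);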
Here are some things to consider:
PRIMARY KEY ensures that there are no NULL values or duplicates in the table
UNIQUE KEY can contain NULL and (by the ANSI standard) any number of NULLs (this behavior depends on SQL Server settings and possible index filters or NOT NULL constraints)
The CLUSTERED INDEX contains all the data related to a row on the leaves.
When the CLUSTERED INDEX is not unique, SQL Server adds a hidden uniquifier to the key columns of each row to distinguish the individual records
All other indexes locate rows using either the key column values of the clustered index or the row ID of a heap table
The query optimizer uses the index stats to find out the best way to execute a query
For small tables, the indexes are usually ignored, since doing an index scan and then a lookup for each value is more expensive than doing a full table scan (which will read only one or two pages when you have really small tables)
Status lookup tables are usually very small and can be stored on one page.
The referencing tables will store the PK value (or unique) in their structure (this is what you'll use to do a join too). You can have a slight performance benefit if you have an integer key to use as reference (aka IDENTITY in SQL Server).
If you usually don't want to list the ConnectionStatus, then using the actual display value (OrderStatus) can be beneficial, since you don't have to join the lookup table.
You can store both values in the referencing tables, but maintaining both columns has some overhead and leaves more room for errors.
The clustered/non-clustered question depends on the use cases of this table. If you usually use OrderStatus for filtering (using the textual form), a NONCLUSTERED identity PK and a CLUSTERED UNIQUE key on OrderStatus can be beneficial. However (as you can read above), in small tables the effect/performance gain is usually negligible.
If you are not familiar with the above things and it feels safer, then create an identity clustered PK (OrderKey or OrderID) and a unique nonclustered key on OrderStatus.
Use the PK as the referencing/referenced column in foreign keys.
One more thing: if this column will be referenced by only one table, you may want to consider creating an indexed view which contains both tables' data.
Also, I would suggest adding a dummy value which you can use if there is no status set (and use it as the default for all referencing columns). Because 'not set' is still a status, isn't it?
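A small, hypothetical illustration of that last point (the table and column names, and the identity value 1 for the dummy row, are assumptions):
-- Reserve a well-known row for the 'not set' status
INSERT INTO dbo.OrderStatus (OrderStatus, ConnectionStatus)
VALUES ('NOT SET', 'NOT SET');
-- Use it as the default in a referencing fact table
ALTER TABLE dbo.FactOrders
ADD CONSTRAINT DF_FactOrders_OrderStatusKey DEFAULT (1) FOR OrderStatusKey;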
I am brand new to Oracle and although I have used SQL Server fairly extensively I have not had the need to delve deeply into the details of database design ... specifically INDEXES. So I have spent a good deal of time sitting through tutorials on Indexes ... both in concept as well as Oracle specific.
In an effort to put my understanding into practice I set up a VERY simple table with some basic indexes.
CREATE TABLE "SYSTEM"."TBL_PERSON"
(
"PERSON_ID" NUMBER(10,0) NOT NULL ENABLE,
"FIRST_NAME" NVARCHAR2(120) NOT NULL ENABLE,
"MIDDLE_NAME" NVARCHAR2(120),
"LAST_NAME" NVARCHAR2(120) NOT NULL ENABLE,
"DOB" DATE NOT NULL ENABLE,
"IS_MALE" NCHAR(1) DEFAULT 'T' NOT NULL ENABLE,
CONSTRAINT "TBL_PERSON_PK" PRIMARY KEY ("PERSON_ID")
)
As you can see, the PERSON_ID field contains the unique ID for each record in the table and is an auto-incrementing primary key.
(please don't get hung up on missing SQL unless it pertains to the issue of INDEXES not working. I tried to select only the relevant SQL from the DDL and may have missed some items. There was a ton of stuff there that I didn't think was relevant to this issue so I tried to trim it out)
I have created a couple of additional non-clustered indexes on the table.
CREATE INDEX "SYSTEM"."IDX_LAST_NAME" ON "SYSTEM"."TBL_PERSON" ("LAST_NAME")
CREATE INDEX "SYSTEM"."IDX_PERSON_NAME" ON "SYSTEM"."TBL_PERSON" ("FIRST_NAME", "LAST_NAME")
When I run an "Explain Plan" on the following SQL I get notified that the PK index was used as expected.
select * from TBL_PERSON where PERSON_ID = 21
However, when I run a query to select someone with a particular LAST_NAME, the LAST_NAME index seems to be ignored.
select * from TBL_PERSON where LAST_NAME = 'Stenstrom'
Why would it not use IDX_LAST_NAME? For what it's worth I have the same issue with the composite index IDX_PERSON_NAME.
The key to your question is the column "cardinality". You have only five rows estimated as being returned for the table.
Oracle has a choice between two execution plans:
Load the data page. Scan the five records on the data page and choose the one(s) that match the condition.
Load the index and scan it for the match. Then load the data page and lookup the matching record(s).
Oracle has concluded that for five records, the first approach is faster. If you load more data into the table, you should see the execution plan change. Alternatively, if you selected only last_name instead of *, then Oracle might very well choose the index, because the query could then be answered from the index alone.
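A hedged way to watch the plan change (standard Oracle tooling; the schema and table names are taken from the question, and refreshing statistics matters after loading more rows):
-- Refresh optimizer statistics, then look at the plan again
EXEC DBMS_STATS.GATHER_TABLE_STATS('SYSTEM', 'TBL_PERSON');
EXPLAIN PLAN FOR
SELECT * FROM TBL_PERSON WHERE LAST_NAME = 'Stenstrom';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- With only a handful of rows this will likely still show a full
-- table scan; with more data, an access path via IDX_LAST_NAME
-- becomes more likely.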