I have a list of data that I want to relate to some ownerId, but the list can exceed row size limits, so I want to split this list across multiple rows. Each entry in this list has its own id, which is unique per owner. I was looking at composite keys (ownerId:entryId), but the main operation I need is to read this data in bulk (read all entries for ownerId). What is the best way to go about structuring this data?
Example:
ownerId | entryId | data
--------|---------|--------
OwnerA | 1 | aaaaa
OwnerA | 2 | bbbbb
OwnerB | 1 | ccccc
Note that ownerId here is a SQL generated id, and entryId is an externally set id.
If you know that consumers of your query will filter on ownerId rather than entryId (i.e. the vast majority of WHERE clauses against this table will filter on ownerId as opposed to entryId), then you could get significant mileage simply by creating a composite clustered key/index on (ownerId, entryId). I say this because relational indexes use the first column as the primary sort criterion, so as long as you're filtering on ownerId, under the hood rows can be retrieved with index seek (range scan) operations as opposed to full table scans.
That being said, if you'll have to filter on ownerId and entryId independently (i.e. several queries will have a WHERE clause of the form WHERE ownerId = {specific_owner_id}, and several other queries will have a WHERE clause of the form WHERE entryId = {specific_entry_id}), you might want to consider having both a PRIMARY KEY/CLUSTERED INDEX on (ownerId, entryId) and a unique index on (entryId, ownerId):
CREATE TABLE t (
    ownerId INT NOT NULL,
    entryId INT NOT NULL,
    /*
    ... all other columns ...
    */
    CONSTRAINT PK_t PRIMARY KEY (ownerId, entryId)
);
CREATE UNIQUE INDEX t_entry_owner ON t (entryId, ownerId);
If you do this, queries that filter on ownerId and queries that filter on entryId can both take advantage of index seek operations.
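To make that concrete, here is a sketch of the two query shapes and the index each one can use (the literal values 42 and 7 are placeholders):

```sql
-- Served by the clustered PK on (ownerId, entryId):
SELECT * FROM t WHERE ownerId = 42;

-- Served by the secondary unique index on (entryId, ownerId):
SELECT * FROM t WHERE entryId = 7;
```

Each query's filter column is the leading column of one of the two indexes, which is what allows a seek instead of a scan.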
That said, this type of configuration is most beneficial if table t is used more for READ operations than for WRITE operations. Should your table be more WRITE-heavy, the time taken to maintain each of the indexes could outweigh the benefit of more efficient reads.
You probably need a composite primary key, i.e.:
CREATE TABLE t (
...
PRIMARY KEY (ownerId, entryId)
);
and a separate index for ownerId; in Postgres, for example, a hash index might be a good fit.
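In Postgres that might look like the following sketch (the table name t is assumed from the snippet above; a hash index supports only equality comparisons, which matches the "fetch all entries for one owner" pattern):

```sql
-- Equality-only lookups on ownerId; hash indexes cannot serve range scans.
CREATE INDEX t_owner_hash ON t USING HASH (ownerId);
```

Note that the composite PK's B-tree already serves ownerId-prefix lookups, so it is worth measuring whether the extra index actually helps before keeping it.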
Related
I want to create a lookup table 'orderstatus' as shown below. Just to clarify, this is to be used in a data warehouse. I will need to join through OrderStatus to retrieve the INT key (if I create one) to be used elsewhere if need be. In a fact table, for example, I would store the int to link back to the lookup table.
+------------------------+------------------+
| OrderStatus | ConnectionStatus |
+------------------------+------------------+
| CLOSED | APPROVE |
+------------------------+------------------+
| COMPLETED | APPROVE |
+------------------------+------------------+
| FULFILLED | APPROVE |
+------------------------+------------------+
| CANCELLED | CLOSED |
+------------------------+------------------+
| DECLINED | CLOSED |
+------------------------+------------------+
| AVS_CHECK_SYSTEM_ERROR | CLOSED |
+------------------------+------------------+
What is best practice in terms of a primary key/unique key? Should I just create an OrderStatusKey INT as the primary key with IDENTITY? Or create a unique constraint on OrderStatus? Thanks.
For this, I would suggest you create an Identity column, and make that the clustered primary key.
It is considered best practice for tables to have a primary key of some kind, and having a clustered index on a table like this is the fastest way to support its use in multi-table queries (with joins).
Here is a sample as to how to add it:
ALTER TABLE dbo.orderstatus
ADD CONSTRAINT PK_orderstatus_OrderStatusID PRIMARY KEY CLUSTERED (OrderStatusID);
GO
There is an article with more details on MSDN, and another resource explaining primary keys: the Primary Key Primer.
If OrderStatus is unique and is the primary identifier, AND you will be reusing this status code directly in related tables (rather than a numeric pointer to it), then keep the columns as they are and make OrderStatus the primary clustered key.
A little explanation:
A primary key is unique across the table; a clustered index ties all record data back to that index. It is not always necessary to have the primary key also be the clustered index on the table but usually this is the case.
If you are going to link to the order status using something other than the status code, then create another column of type int as an IDENTITY and make that the primary clustered key. Also add a unique non-clustered index on OrderStatus to ensure that no duplicates can ever be added.
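A minimal sketch of that layout, assuming SQL Server syntax and the column names from the question (the table, constraint names, and column sizes are illustrative):

```sql
CREATE TABLE dbo.OrderStatusLookup (
    OrderStatusID    INT IDENTITY(1,1) NOT NULL,
    OrderStatus      VARCHAR(40) NOT NULL,
    ConnectionStatus VARCHAR(20) NOT NULL,
    -- surrogate key referenced by fact tables:
    CONSTRAINT PK_OrderStatusLookup PRIMARY KEY CLUSTERED (OrderStatusID),
    -- guarantees no duplicate status codes can ever be inserted:
    CONSTRAINT UQ_OrderStatusLookup_OrderStatus UNIQUE NONCLUSTERED (OrderStatus)
);
```

The identity column gives you a narrow integer key for joins, while the unique constraint preserves the natural-key guarantee on the status text.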
Either way you go every table should have a primary key as well as a clustered index (again, usually they are the same index).
Here are some things to consider:
A PRIMARY KEY ensures that there are no NULL values or duplicates in the key columns.
A UNIQUE KEY can contain NULL and (by the ANSI standard) any number of NULLs. (The actual behavior depends on the DBMS: SQL Server allows only a single NULL unless you use a filtered index or a NOT NULL constraint.)
The CLUSTERED INDEX contains all the data of a row at its leaf level.
When the CLUSTERED INDEX is not unique, SQL Server adds a hidden 4-byte "uniquifier" to rows with duplicate key values so that individual records can be distinguished.
All non-clustered indexes locate rows using either the key columns of the clustered index or the row id of a heap table.
The query optimizer uses the index stats to find out the best way to execute a query
For small tables, indexes are usually ignored, since doing an index scan and then a lookup for each value is more expensive than a full table scan (which reads only one or two pages when the table is really small).
Status lookup tables are usually very small and can be stored on one page.
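As an illustration of the non-unique clustered index point above (table and index names are hypothetical):

```sql
-- A clustered index on a non-unique column; SQL Server silently appends
-- a hidden 4-byte uniquifier to rows whose key values are duplicated:
CREATE CLUSTERED INDEX CIX_Orders_Status ON dbo.Orders (OrderStatusID);
```

The uniquifier widens every duplicate key, which is one more reason unique clustered keys are generally preferred.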
The referencing tables will store the PK value (or unique) in their structure (this is what you'll use to do a join too). You can have a slight performance benefit if you have an integer key to use as reference (aka IDENTITY in SQL Server).
If you usually don't need the ConnectionStatus, then storing the actual display value (OrderStatus) in the referencing tables can be beneficial, since you don't have to join to the lookup table.
You can store both values in the referencing tables, but maintaining both columns has some overhead and leaves more room for error.
The clustered/non-clustered question depends on the use cases for this table. If you usually filter on OrderStatus (in its textual form), a NON-CLUSTERED IDENTITY PK plus a CLUSTERED UNIQUE index on OrderStatus can be beneficial. However (as noted above), in small tables the performance gain is usually negligible.
If you are not familiar with the above and would feel safer, then create an identity clustered PK (OrderKey or OrderID) and a unique non-clustered key on OrderStatus.
Use the PK as referencing/referenced column in foreign keys.
One more thing: if this table will be referenced by only one table, you may want to consider creating an indexed view that contains both tables' data.
Also, I would suggest adding a dummy value that you can use when no status is set (and using it as the default for all referencing columns). Because "not set" is still a status, isn't it?
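For example (table and value names are hypothetical, matching the sketch style used in this answer):

```sql
-- Reserve an explicit row for "no status set":
INSERT INTO dbo.OrderStatusLookup (OrderStatus, ConnectionStatus)
VALUES ('NOT_SET', 'NOT_SET');
```

Referencing columns can then default to this row's key, so a join always finds a match instead of dealing with NULLs.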
I get the point that primary indices are unique for each record, so retrieving a record is faster using a primary index. What happens when we use a secondary index?
Here is what I can think of:
ID Name School
1 John XYZ
2 Roger XYZ
3 Ray ABC
4 Matt KJL
5 Roger ABC
If we have a secondary index on Name, it will help me retrieve records by name rather than by id. It would not restrict me to one record: if I query for Roger, I get the rows for both Rogers. Hence, if the table is queried extensively on the secondary-index column, the index should be used.
Am I right?
Apart from speeding up specific queries, perhaps the most common case for secondary indexes is to speed up checking of UNIQUE constraints. Consider e.g. a table
CREATE TABLE Person (
id int primary key,
fname text not null,
lname text not null,
date_of_birth date not null,
...
UNIQUE (fname, lname, date_of_birth)
)
Here we want to enforce the UNIQUE constraint to ensure the same person doesn't appear in the table multiple times under different ids. But at the same time we wouldn't want to make (fname, lname, date_of_birth) the primary key, because a person's name could potentially change, and because using 3 attributes as reference can be cumbersome.
Now, when inserting a new record into the table, the DBMS needs to check whether it already contains another tuple with the same (fname, lname, date_of_birth), and a secondary index on these attributes can help speed this check up.
Note that UNIQUE constraints automatically generate their indexes, so there is no need to create them explicitly.
Another common case where secondary indexes are required (and must be created explicitly) are foreign key constraints that target attributes that do not make up the primary key for the target table.
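A short sketch of that situation (hypothetical tables, PostgreSQL-flavored syntax):

```sql
CREATE TABLE Team (
    id   int PRIMARY KEY,
    code text NOT NULL UNIQUE          -- secondary unique index built here
);

CREATE TABLE Player (
    id        int PRIMARY KEY,
    -- The FK targets Team.code, which is UNIQUE but not the PK;
    -- without that unique index the FK could not be declared at all:
    team_code text NOT NULL REFERENCES Team (code)
);

-- Indexing the referencing column is usually worthwhile too,
-- for joins and for cascading deletes/updates on Team:
CREATE INDEX player_team_code_idx ON Player (team_code);
```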
I have a basic reverse lookup table in which the ids are already sorted in ascending numerical order:
id INT NOT NULL,
value INT NOT NULL
The ids are not unique; each id has from 5 to 25,000 associated values. Each id is independent, i.e., no relationships between the ids.
The table is static: read only, no inserts or updates ever. The table has 100-200 million records, and the database itself will be around 7-12 GB, in SQLite.
I will do frequent lookups in this table and want the fastest response time for each query. Lookups are one-direction only, unordered, and always of the form:
SELECT value WHERE id IN (x,y,z)
What advantages does the pre-sorted order give me in terms of database efficiency? What should I do differently than I would with typical unordered tables? How do I tell sql that it's an ordered list?
What about indices: is it necessary or even helpful to create an index on id?
[Updated for clustered comment thanks to Gordon Linoff]. As far as I can tell, sqlite doesn't support clustered indices directly. The wiki says: "Are [clustered indices] supported? No, but if you use INTEGER PRIMARY KEY it acts as a clustered index." In my situation, the column id is not unique...
Assuming that space is not an issue, you should create an index on (id, value). This should be sufficient for your purposes.
However, if the table is static, then I would recommend that you create a clustered index when you create the table. The index would have the same keys, (id, value).
If the table happens to be sorted, the database does not know about this, so you'd still need an index.
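A sketch of that covering index (the table name lookup is assumed for illustration, since the question doesn't give one):

```sql
-- The index holds both columns, so the query below can be answered
-- from the index alone (a covering, index-only lookup):
CREATE INDEX idx_lookup_id_value ON lookup (id, value);

SELECT value FROM lookup WHERE id IN (1, 2, 3);  -- x, y, z placeholders
```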
It is a better idea to use a WITHOUT ROWID table (what other DBs call a clustered index):
CREATE TABLE MyLittleLookupTable (
id INTEGER,
value INTEGER,
PRIMARY KEY (id, value)
) WITHOUT ROWID;
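With that layout, the lookup walks the primary-key B-tree directly; there is no separate rowid storage to visit (the ids shown are placeholders):

```sql
SELECT value
FROM MyLittleLookupTable
WHERE id IN (1, 2, 3);

-- EXPLAIN QUERY PLAN should report a SEARCH step using the PRIMARY KEY,
-- confirming the query never touches anything but the B-tree itself.
```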
I have a table where I store comments from users. I will have 100 million+ comments.
There are 2 ways I can create it:
Option 1: user name and comment id as the PK. That way all comments are stored physically clustered by user name and comment id.
CREATE TABLE [dbo].[Comments](
    [user] [varchar](20) NOT NULL,
    [com_id] [int] IDENTITY(1,1) NOT NULL,
    [com_posted_by] [varchar](20) NOT NULL,
    [com_posted_on] [smalldatetime] NOT NULL DEFAULT (getdate()),
    [com_text] [nvarchar](225) NOT NULL,
    CONSTRAINT [PK_channel_comments] PRIMARY KEY CLUSTERED
        ([channel] ASC, [com_id] ASC) WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
Pros: my query will be "get all (or the top 10) comments for a user, ordered by com_id DESC". With this key, that is a SEEK.
Option 2: I can make the comment id alone the PK. That stores the comments sorted by comment id, not by user name.
Cons: getting the latest top 10 comments of a given user is not a seek anymore, as the data is not stored (sorted) by user. So I would have to create another index to keep that query fast.
Which is the best way to proceed?
How about insertion and deletion? These operations are allowed, but reads are frequent.
Users can't modify their comments.
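A sketch of Option 2 with the extra index it requires (column names borrowed from Option 1; types and sizes illustrative):

```sql
CREATE TABLE dbo.Comments2 (
    com_id   INT IDENTITY(1,1) NOT NULL,
    [user]   VARCHAR(20)   NOT NULL,
    com_text NVARCHAR(225) NOT NULL,
    CONSTRAINT PK_Comments2 PRIMARY KEY CLUSTERED (com_id)
);

-- Needed so "top 10 comments for one user, newest first" is still a seek:
CREATE NONCLUSTERED INDEX IX_Comments2_user
    ON dbo.Comments2 ([user], com_id DESC);
```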
I tested both tables with 1.1M rows. Here is the result:
table_name  rows     reserved  data      index_size  unused
comments2   1079892  99488 KB  62824 KB  36576 KB    88 KB   (PK: com_id; second index on (user_name, com_id))
comments1   1079892  82376 KB  82040 KB  328 KB      8 KB    (PK: user_name; no other indexes)
----------------------------------------------------------------
diff        same rows  17112 KB  -19216 KB  36248 KB  80 KB
So the table with com_id as the PK uses 36 MB of extra disk space just for the two indexes.
The SELECT TOP query on both tables uses a SEEK, but the table with com_id as the PK is slower.
But insertion is slightly faster when com_id is the PK.
Any comments?
I would use the comment ID as the primary key for the table. If you are going to have a lot of queries that use the comment ID and the user name, it's probably simpler just to add an index on those fields.
I would not use User name in a PK as it may change, creating cascade update issues later.
Also, concatenating those two into the PK creates a large(r) PK that might have to be propagated to other tables as an FK. I try to keep PKs that appear as FKs as small as possible, unless I know I will want all the PKs of the contributing tables in one large key for query speed.
Comment id should be fine.
You may need to create an additional index for fast searching on comment id and user name.
Will you be doing more insertions/updates or more queries? If query-intensive, then the index is not an issue.
Are you sure that you have that CREATE TABLE statement correct? You're using [channel] in the PK definition, and I don't see that as a column. Did you mean [user]?
Do you have a user table someplace? If so, you might save a lot of overhead by keying that on an integer value and putting UserID into the comments table, rather than User.
I would PK on the CommentID and then add a non-clustered index on [UserID, CommentID]. That gives you immediate access to a comment by ID (for deleting, etc) without having to involve the UserID value in the WHERE clause; and it provides quick access to the user's comments. I do not, however, tend to work with table of the size you anticipate.
As a rule of thumb, always choose the narrowest PK. Then, to improve performance, you may want to use an integer-based user_id instead of a varchar, and add an index on both columns.
The best approach will depend on the number of users: if you have just a few users, the combined user_id/com_id PK could be better (additionally, partitioning by user would be an option); on the other hand, if the number of users is high, a combined PK will be less useful.
My initial approach would be to make CommentID alone the PK, maybe in descending order so you don't have to do any reordering on select. Then put an index on UserID.
If you use the concatenated key, consider switching CommentID to desc.
So, I have a subscriptions table:
id - int(11) (With Primary Key)
user_id - int(11)
group_id - int(11)
role - int(11)
pending - tinyint(1)
created_at - datetime
updated_at - datetime
I'm often running queries like the following to see whether a user has access rights:
SELECT * FROM `subscriptions` WHERE (group_id = 1 AND user_id = 2 AND pending = 0) LIMIT 1
I'm wondering if adding a unique index on subscriptions(group_id, user_id, pending) would help or hinder in this case? What are the best practices for indexing almost an entire table?
This will certainly help, especially if you replace * with 1 in your query (if you just want to check that the row exists).
It may have a small impact on DML, though.
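A sketch of the index and the narrowed query (names taken from the question):

```sql
CREATE UNIQUE INDEX idx_subscriptions_access
    ON subscriptions (group_id, user_id, pending);

-- Only indexed columns are referenced, so the check can be
-- satisfied from the index alone:
SELECT 1
FROM subscriptions
WHERE group_id = 1 AND user_id = 2 AND pending = 0
LIMIT 1;
```

One caveat: uniqueness over (group_id, user_id, pending) still permits both a pending and a non-pending row for the same user/group pair; if a user should have at most one subscription per group, the unique index belongs on (group_id, user_id) alone, with pending as a trailing non-unique column if needed.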
Creating an index is in fact creating a B-Tree which has this structure:
indexed_col1
indexed_col2
...
indexed_colN
row_pointer
as the key, with the row_pointer being a file offset (for MyISAM) or the value of the row's PRIMARY KEY (for InnoDB).
If your query uses no columns other than the indexed ones, all the information you need can be retrieved from the index alone, without even having to touch the table itself.
If your data is intrinsically unique, it's always good to create a UNIQUE index on it. This matters less for MySQL, but more advanced optimizers (SQL Server, for instance) can exploit the fact that the data is unique and build a more efficient plan.
See this article in my blog for an example:
Making an index UNIQUE