This is how my table looks:
CREATE TABLE pics(
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT,
page INTEGER,
w INTEGER,
h INTEGER,
FOREIGN KEY(page) REFERENCES pages(id) ON DELETE CASCADE,
UNIQUE(name, page)
);
CREATE INDEX "myidx" ON "pics"("page"); -- is this needed?
So UNIQUE(name, page) should create an index. But is that index enough to make queries fast that involve only the page field, like selecting a set of pics WHERE page = ?, or joining pages.id on pics.page? Or should I create another index (myidx) just for the page field?
As stated, you will need your other myidx index, because your UNIQUE index specifies name first. In other words, it can be used to query by:
name
name and page
But not by page alone.
Your other option is to reorder the UNIQUE index and place the page column first. Then it can be used for page-only queries, but it becomes incompatible with name-only queries.
Think of a composite index as a phone book. The phone book is sorted by last name, then first name. If you're given the name Bob Smith, you can quickly find the S section, then Sm, then all the Smiths, then eventually Bob. This is fast because you have both keys in the index. Since the book is organized by last name first, it is also just as trivial to find all the Smith entries.
Now imagine trying to find all the people named Bob in the entire phone book. Much harder, right?
This is analogous to how the index is organized on disk as well. Finding all the rows with a certain page value when the index is sorted in (name, page) order basically results in a sequential scan of all the rows, looking one by one for anything that has that page.
For more information on how indexes work, I recommend reading through Use the Index, Luke.
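You can verify this with SQLite itself via EXPLAIN QUERY PLAN. A minimal sketch using Python's built-in sqlite3 module (the exact plan wording varies between SQLite versions, but SEARCH vs SCAN is the part that matters):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pages(id INTEGER PRIMARY KEY);
CREATE TABLE pics(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    page INTEGER,
    w INTEGER,
    h INTEGER,
    FOREIGN KEY(page) REFERENCES pages(id) ON DELETE CASCADE,
    UNIQUE(name, page)
);
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " / ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# The implicit UNIQUE(name, page) index works when name leads the lookup:
by_name = plan("SELECT * FROM pics WHERE name = 'a'")        # SEARCH ... USING INDEX
# A page-only filter cannot use it and falls back to a full scan:
by_page = plan("SELECT * FROM pics WHERE page = 1")          # SCAN pics

con.execute('CREATE INDEX "myidx" ON "pics"("page")')
by_page_indexed = plan("SELECT * FROM pics WHERE page = 1")  # SEARCH ... USING INDEX myidx

print(by_name, by_page, by_page_indexed, sep="\n")
```

Dropping myidx flips the page-only query right back to a full scan.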
You have to analyse the queries that will use this table and determine which fields they filter and sort on. Index the fields that are used most often.
Related
Let's say I have a table of 200,000,000 users. For each user I have saved a certain attribute. Let it be their lastname.
I am unsure of which index type to use with MariaDB. The only queries made to the database will be in the form of SELECT lastname FROM table WHERE username='MYUSERNAME'.
Is it therefore best to just define the column username as a primary key? Or do I need to do anything else? Also, how long is it going to take until the index is built?
Sorry for this question, but this is my first database with more than 200,000 rows.
I would go with:
CREATE INDEX userindex on `table`(username);
This will index the usernames since this is what your query is searching on. This will speed up the results coming back as the username column will be indexed.
Try it, and if it reduces performance, just delete the index; nothing lost (although make sure you do have backups! :))
This article will help you out https://mariadb.com/kb/en/getting-started-with-indexes/
It says primary keys are best set at table creation, and as I guess yours already exists, that would mean either copying the table and creating a primary key, or just using an index.
I recently indexed a table with non-unique strings as an ID, and although it took a few minutes to index, the speed improvement was great; that table was 57m rows.
-EDIT- Just re-read: I thought it was 200,000 as mentioned at the end, but I see it is 200,000,000 in the title. That's a hella lotta rows.
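As a sketch of what the index changes (using SQLite via Python's sqlite3 as a stand-in for MariaDB; the users table name and row count here are made up, only userindex comes from the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical miniature of the 200M-row table, just to inspect the plan.
con.execute("CREATE TABLE users(username TEXT, lastname TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                ((f"user{i}", f"name{i}") for i in range(10_000)))

def plan(sql):
    return " / ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT lastname FROM users WHERE username = 'user123'"
before = plan(q)  # SCAN users: every row is examined
con.execute("CREATE INDEX userindex ON users(username)")
after = plan(q)   # SEARCH users USING INDEX userindex (username=?)
print(before, after, sep="\n")
```

The same lookup goes from touching every row to a direct index search.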
username sounds like something that is "unique" and not null. So, make it NOT NULL and have PRIMARY KEY(username), without an AUTO_INCREMENT surrogate PK.
If it is not unique, or cannot be NOT NULL, then INDEX(username) is very likely to be useful.
To design indexes, you must first know what queries you will be performing. (If you had called it simply "col1", I would not have been able to guess at the above advice.)
There are 3 index types:
BTree (actually B+Tree; see Wikipedia). This is the default and the most commonly used index type. It is efficient at finding a row given a specific value (WHERE user_name = 'joe'). It is also useful for a range of values (WHERE user_name LIKE 'Smith%').
FULLTEXT is useful for a TEXT column where you want to search for "words" inside it.
SPATIAL is useful for 2-dimensional data, such as geographical points on a map or other type of grid.
I am going to be creating a very large table (320k+ rows) that I am going to be doing many complicated operations on, so performance is very important. Each row will be a reference to a page/entity from an external site that already has unique IDs. In order to keep the data easy to read, and for consistency reasons, I would rather use those external IDs as my own row IDs. The problem is that the IDs are in the format XXX########, where the XXX part is always the same identical string prefix and the ######## part is a completely unique number. From what I know, using varchar IDs is measurably slower performance-wise, and looking only at the numerical part would give the same results.
What is the best way to do this? I still want to be able to do queries like WHERE ID = 'XXX########' and have the actual correct ids displayed in result sets rather than trimmed ones. Is there a way to define getters and setters for a column? Or is there a way to create an index that is a function on just the numerical part of the id?
Since your ID column (with format XXX########) is a primary key, there will already be an index on that column. If you wish to create an index based on the "completely unique number" portion of the ID, it is possible to create an expression index in Postgres:
CREATE INDEX pk_substr_idx ON mytable (substring(id,4));
This will create an index on the ######## portion of your column. However, bear in mind that the values stored in the index will be text, not numbers. Therefore, you might not see any real benefit from having this index around (i.e., you'll only be able to check for equality (=), not comparisons (>/</>=/<=)).
The other drawback of this approach is that for every row you insert, you'll be updating two indexes (the one for the PK, and the one for the substring).
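For illustration, SQLite also supports expression indexes, so the idea can be sketched with Python's sqlite3 (the index name follows the answer; note the planner only uses the index when the query repeats the exact same expression):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable(id TEXT PRIMARY KEY)")
con.execute("INSERT INTO mytable VALUES ('XXX00000042')")
# SQLite's counterpart of the Postgres expression index (substr is 1-based,
# so position 4 skips the three-character XXX prefix):
con.execute("CREATE INDEX pk_substr_idx ON mytable(substr(id, 4))")

# The query must use the identical expression for the index to apply:
detail = " / ".join(r[-1] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM mytable WHERE substr(id, 4) = '00000042'"))
row = con.execute(
    "SELECT id FROM mytable WHERE substr(id, 4) = '00000042'").fetchone()
print(detail)
print(row[0])  # XXX00000042
```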
Therefore, if at all possible, I would recommend splitting your ID into separate prefix (the XXX portion) and id_num (the ######## portion) columns. Since you stated that "the XXX part is always the same identical string prefix", you stand to reap a performance benefit by either 1) splitting the string into two columns, or 2) hard-coding the XXX portion into your app (since it's "always the same identical string prefix") and storing only the numeric portion in the database.
Another approach (if you are willing to split the string into separate prefix and id_num columns) is to create a composite index. The table definition would then look something like:
CREATE TABLE mytable (
prefix text,
id_num int,
<other columns>,
PRIMARY KEY (prefix, id_num)
);
This creates a primary key on the two columns, and you would be able to see your queries use the index if you write your application with two columns in mind. Again, you would need to split the ID up into text and number portions. I believe this is the only way to get the best performance out of your queries. Any value that mixes text and numbers will ultimately be stored and interpreted as text.
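A sketch of how the split-column design still lets you search by, and display, the original XXX######## form (SQLite via Python's sqlite3 as a stand-in for Postgres; the 8-digit zero padding is an assumption about the ID format):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Names mirror the answer's table definition.
con.execute("""
CREATE TABLE mytable (
    prefix text,
    id_num int,
    PRIMARY KEY (prefix, id_num)
)""")
con.executemany("INSERT INTO mytable VALUES (?, ?)",
                [("XXX", 1), ("XXX", 2), ("XXX", 42)])

# The composite primary-key index serves the two-column lookup:
detail = " / ".join(r[-1] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM mytable WHERE prefix = 'XXX' AND id_num = 42"))

# Reassemble the external XXX######## form in the result set (8-digit zero pad):
full_id = con.execute("""
    SELECT prefix || printf('%08d', id_num)
    FROM mytable
    WHERE prefix = 'XXX' AND id_num = 42
""").fetchone()[0]
print(full_id)  # XXX00000042
```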
Disclosure: I work for EnterpriseDB (EDB)
Use an IDENTITY-type column for the primary key, and load the external IDs as a separate column.
Say, I have a table ResidentInfo with a unique constraint on the column HomeAddress, which is VARCHAR type.
I plan to add an index on this column. The query will only have operation =. I'll use a B-TREE index since the Hash indexes are not recommended currently.
Question:
For efficiency of this B-TREE index, should I add a new column with numbers 1, 2, 3, ..., N corresponding to different home addresses, and index that number instead?
I ask this question because I don't know how indexes work.
For simple equality checks (=), a B-Tree index on a varchar or text column is simple and the best choice. It certainly helps performance a lot. And a UNIQUE constraint (like you mentioned) is already implemented with such an index, so you would not create another one.
Of course, a B-Tree index on a simple integer performs better. For starters, comparing simple integer values is a bit faster. But more importantly, performance is also a function of the size of the index. A bigger column means fewer rows per data page, which means more pages have to be read.
Since the HomeAddress is hardly unique anyway, it's not a good natural primary key. I would strongly suggest to use a surrogate primary key instead. A serial column or IDENTITY in Postgres 10+ is the obvious choice. Its only purpose is to have a simple, fast primary key to work with.
If you have other tables referencing said table, this becomes even more efficient. Instead of duplicating a lengthy string for the foreign key column, you only need the 4 bytes for an integer column. And you don't need to cascade updates so much, since an address is bound to change, while a surrogate PK can stay the same (but doesn't have to, of course).
Your table could look like this:
CREATE TABLE resident (
resident_id serial PRIMARY KEY
, address text NOT NULL
-- more columns
);
CREATE INDEX resident_adr_idx ON resident(address);
This results in two B-Tree indexes. A unique index on resident_id (implementing the PK) and a plain index on address.
Postgres offers a lot of options - but you don't need any more for this simple case. See:
The manual about indexes
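A sketch of the surrogate-key layout in SQLite (via Python's sqlite3, with INTEGER PRIMARY KEY standing in for serial/IDENTITY; the parcel table is a hypothetical referencing table added for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE resident (
    resident_id INTEGER PRIMARY KEY,  -- stand-in for serial / IDENTITY
    address     text NOT NULL
);
CREATE INDEX resident_adr_idx ON resident(address);

-- Hypothetical referencing table: it stores only the small integer surrogate,
-- never the lengthy address string.
CREATE TABLE parcel (
    parcel_id   INTEGER PRIMARY KEY,
    resident_id INTEGER REFERENCES resident(resident_id)
);
""")
rid = con.execute("INSERT INTO resident(address) VALUES ('1 Main St')").lastrowid
con.execute("INSERT INTO parcel(resident_id) VALUES (?)", (rid,))

# Equality lookup on the string column goes through the plain index:
detail = " / ".join(r[-1] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM resident WHERE address = '1 Main St'"))
print(detail)  # SEARCH resident USING ... resident_adr_idx (address=?)
```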
In Postgres, a unique constraint is enforced by maintaining a unique index on the field, so you're covered already.
In the event you decide the unique constraint on the address is bad (which, honestly, it is: what about a spouse creating a separate account? What about flatshares? etc.), you can create a plain index like so:
create index on ResidentInfo (HomeAddress);
I'm having a rather silly problem. I'll simplify the situation: I have a table in SQL Server 2008 R2 with a field ID (int, PK), a Name (nvarchar(50)) field, and a Description (text) field. The values in the Name field should be unique. When searching the table, the Name field will be used, so performance is key here.
I have been looking for 2 hours on the internet to completely understand the differences between Unique Key, Primary Key, Unique Index and so on, but it doesn't help me solve my problem about what key/constraint/index I should use.
I'm altering the tables in SQL Server Management Studio. My question for altering that Name - field is: should I use "Type = Index" with "Is Unique = Yes" or use "Type = Unique Key"?
Thanks in advance!
Kind regards,
Abbas
A unique key and a primary key are both logical constraints. They are both backed up by a unique index. Columns that participate in a primary key are not allowed to be NULL-able.
From the point of view of creating a Foreign Key the unique index is what is important so all three options will work.
Constraint based indexes have additional metadata stored that regular indexes don't (e.g. create_date in sys.objects). Creating a non constraint based unique index can allow greater flexibility in that it allows you to define included columns in the index definition for example (I think there might be a few other things also).
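Both forms behave identically on duplicates, which you can sketch in SQLite via Python's sqlite3 (table and index names here are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Both forms are backed by a unique index and reject duplicate names.
con.execute("CREATE TABLE t1(ID INTEGER PRIMARY KEY, Name TEXT UNIQUE)")  # unique constraint
con.execute("CREATE TABLE t2(ID INTEGER PRIMARY KEY, Name TEXT)")
con.execute("CREATE UNIQUE INDEX t2_name_ux ON t2(Name)")                 # unique index

duplicate_allowed = {}
for table in ("t1", "t2"):
    con.execute(f"INSERT INTO {table}(Name) VALUES ('Abbas')")
    try:
        con.execute(f"INSERT INTO {table}(Name) VALUES ('Abbas')")
        duplicate_allowed[table] = True
    except sqlite3.IntegrityError:
        duplicate_allowed[table] = False

print(duplicate_allowed)  # {'t1': False, 't2': False}
```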
A unique key cannot have the same value as any other row of a column in a table. A primary key is the column or set of columns that is a unique key and not null, and which is used as the main lookup mechanism (meaning every table should have a primary key, as either a column or a combination of columns that represents a unique entry).
I haven't really used indexes much, but I believe it follows the same logic.
See http://en.wikipedia.org/wiki/Unique_key for more information.
An index is a structure the DBMS uses to organize your table data efficiently. Usually you want to create an index on columns and groups of columns that you frequently search on. For example, if you have a column name and you are searching your table where name = ?, an index on that column will create separate storage that orders the table so that searching for a record by name is fast. Typically primary keys are automatically indexed.
Of course the above is a bit too general, and you should consider profiling queries before and after adding an index to ensure it's being used and actually speeding things up. There are quite a few subtleties to indexes that make their application specific. They take extra storage and time to build and maintain, so you always want to be judicious about adding them.
Hope this helps.
It sounds like a similar situation to what's asked here, but I'm not sure his details are the same as mine.
Basically I have a relational table, we'll call it User:
User
-----------
int Id
varchar(100) Name
int AddressId
varchar(max) Description
and it has the following indices:
PK_User_Id - Obviously the primary key.
IX_User_AddressId - which includes only the AddressId.
When I run the following query:
select Id, Name, AddressId, Description from User where AddressId > 200
The execution plan shows that a scan was done, and PK_User_Id was used.
If I run this query:
select AddressId from User where AddressId > 200
The execution plan shows that a scan was done and IX_User_AddressId was used.
If I include all of the columns in the IX_User_AddressId index, then my original query will use that index, but it still seems wrong that I'd have to do that.
So my SQL noob question is this: what in the world do I have to do to get my queries to use the fastest index? Be very specific, because I clearly can't figure this out on my own.
Your query looks like it has tipped: since your index does not cover all the fields you wanted, the optimizer has "tipped" (check out Kimberly Tripp's writing on the tipping point) and used the primary key index, which I would take a pretty good guess is your clustered index.
When your IX_User_AddressId index contains only the AddressId, SQL must perform bookmark lookups on the base table to retrieve your other columns (Id, Name, Description). If the table is small enough, SQL may decide it is more efficient to scan the entire table rather than using an alternate index in combination with bookmark lookups. When you add those other columns to your index, you create what is called a covering index, meaning that all of the columns necessary to satisfy your query are available in the index itself. This is a good thing as it will eliminate the bookmark lookups.
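The covering-index effect is easy to see in a query plan. A sketch using SQLite via Python's sqlite3 (SQLite's planner won't reproduce SQL Server's tipping point, but it does label covered queries explicitly; the wide index name is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE User(
    Id INTEGER PRIMARY KEY,
    Name TEXT,
    AddressId INTEGER,
    Description TEXT)""")
con.execute("CREATE INDEX IX_User_AddressId ON User(AddressId)")

def plan(sql):
    return " / ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# The single-column query is already covered: AddressId (plus the rowid)
# lives entirely in the index.
narrow = plan("SELECT AddressId FROM User WHERE AddressId > 200")

# The four-column query needs lookups back into the base table...
wide_before = plan("SELECT Id, Name, AddressId, Description FROM User WHERE AddressId > 200")

# ...until an index carries every referenced column (Id rides along as the rowid):
con.execute("CREATE INDEX IX_User_Addr_covering ON User(AddressId, Name, Description)")
wide_after = plan("SELECT Id, Name, AddressId, Description FROM User WHERE AddressId > 200")

print(narrow)       # ... USING COVERING INDEX IX_User_AddressId ...
print(wide_before)  # ... USING INDEX IX_User_AddressId (no COVERING)
print(wide_after)   # ... USING COVERING INDEX IX_User_Addr_covering ...
```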