SQL design for performance

I am new to SQL and I have a basic question about performance.
I want to create a users database which will store information about my users:
Id
Login name
Password
Real name
Later I want to perform a SELECT query on: Id, Login name and Real name.
What would be the best design for this database, what tables and what keys should I create?

If it's only about those 4 fields, it looks like just one table. Primary key on ID, unique index on LoginName. You may not want to store the password itself, but only a hash of it.
Depending on your queries, create different indexes. Furthermore, you may not need the ID field at all.
UPDATE:
Creating an index on certain column(s) enables the database to optimize its SQL statements. Given your user table:
CREATE TABLE USER
(
    USER_ID  BIGINT          NOT NULL,
    LOGIN_ID VARCHAR(<size>) NOT NULL,
    PASSWORD VARCHAR(<size>) NOT NULL,
    NAME     VARCHAR(<size>) NOT NULL,
    CONSTRAINT PK_USER PRIMARY KEY ( USER_ID )
)
The databases I know will automatically create an index on the primary key, which in fact means that the database maintains an optimized lookup table; see Wikipedia for further details.
Now say you want to query users by LOGIN_ID, which I guess is a fairly common use case. You can create another index like:
CREATE INDEX I_USER_1 ON USER ( LOGIN_ID asc )
The above index will optimize the select * from USER where LOGIN_ID='foo'. Furthermore, you can create a unique index instead, assuming that you do not want duplicate LOGIN_IDs:
CREATE UNIQUE INDEX UI_USER_1 ON USER ( LOGIN_ID asc )
That's the whole story, so if you want to optimize a query for the user's real name (NAME), you just create another index:
CREATE INDEX I_USER_2 ON USER ( NAME asc )

Just to add to the @homes answer: you should work out what sort of queries you will be running and then optimize for those sorts of queries. For example, if you are doing a lot of writes and not as many reads, having lots of indexes can cause performance issues. It's a bit like tuning an engine for a car: are you going to be going quickly down a drag strip, or are you tuning it for driving long distances?
Anyway, you also asked about the NAME column. If you are going to be matching on a varchar column, it might be worth investigating the use of FULLTEXT indexes.
http://msdn.microsoft.com/en-us/library/ms187317.aspx
This allows you to do optimized searches on names where you might be matching parts of a name and the like. As the @homes answer said, it really does depend on what your queries and your intent are when writing the query.
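A rough sketch of what that looks like (SQL Server syntax, reusing the USER table and PK_USER index from the earlier answer; it assumes a full-text catalog already exists):

CREATE FULLTEXT INDEX ON dbo.[USER] ( NAME )
    KEY INDEX PK_USER;

-- Prefix search on parts of a name:
SELECT USER_ID, NAME
FROM dbo.[USER]
WHERE CONTAINS(NAME, '"smi*"');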
It might be worth creating the table and running your queries with the query execution plan enabled in something like SQL Server Management Studio, to see what impact your indexes have on the number of rows and the sorts of lookups that are happening.
http://www.sql-server-performance.com/2006/query-execution-plan-analysis/
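As a quick sketch, SQL Server can also report I/O and timing statistics alongside the plan (again reusing the USER table from above):

SET STATISTICS IO ON;   -- logical reads per table
SET STATISTICS TIME ON; -- CPU and elapsed time

SELECT USER_ID, NAME
FROM [USER]
WHERE LOGIN_ID = 'foo';

Running this once with and once without the I_USER_1 index shows the difference in reads.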

Related

MariaDB Indexing

Let's say I have a table of 200,000,000 users. For each user I have saved a certain attribute. Let it be their lastname.
I am unsure of which index type to use with MariaDB. The only queries made to the database will be in the form of SELECT lastname FROM table WHERE username='MYUSERNAME'.
Is it therefore best to just define the column username as a primary key? Or do I need to do anything else? Also, how long is it going to take until the index is built?
Sorry for this question, but this is my first database with more than 200,000 rows.
I would go with:
CREATE INDEX userindex on `table`(username);
This will index the usernames, since that is what your query is searching on, and will speed up the results coming back.
Try it and if it reduces performance just delete the index, nothing lost (although make sure you do have backups! :))
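If it doesn't help, removing it is a one-liner (MariaDB/MySQL syntax):

DROP INDEX userindex ON `table`;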
This article will help you out https://mariadb.com/kb/en/getting-started-with-indexes/
It says primary keys are best set at table creation, and as I guess yours already exists, that would mean either copying the table and creating a primary key, or just using an index.
I recently indexed a 57m-row table with non-unique strings as an ID, and although it took a few minutes to build the index, the speed improvement was great.
-EDIT- Just re-read and thought it was 200,000 as mentioned at the end, but I see it is 200,000,000 in the title. That's a hella lotta rows.
username sounds like something that is "unique" and not null. So, make it NOT NULL and have PRIMARY KEY(username), without an AUTO_INCREMENT surrogate PK.
If it is not unique, or cannot be NOT NULL, then INDEX(username) is very likely to be useful.
To design indexes, you must first know what queries you will be performing. (If you had called it simply "col1", I would not have been able to guess at the above advice.)
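A minimal sketch of that layout (the VARCHAR lengths are placeholders, not recommendations):

CREATE TABLE users
(
    username VARCHAR(64) NOT NULL,
    lastname VARCHAR(64) NOT NULL,
    PRIMARY KEY (username)   -- no AUTO_INCREMENT surrogate
) ENGINE = InnoDB;

-- The stated query then resolves directly through the clustered primary key:
SELECT lastname FROM users WHERE username = 'MYUSERNAME';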
There are 3 index types:
BTree (actually B+Tree; see Wikipedia). This is the default and the most commonly used index type. It is efficient at finding a row given a specific value (WHERE user_name = 'joe'). It is also useful for a range of values (WHERE user_name LIKE 'Smith%').
FULLTEXT is useful for a TEXT column where you want to search for "words" inside it.
SPATIAL is useful for 2-dimensional data, such as geographical points on a map or other type of grid.

SQL Server 2008 seems to be picking the PK Index for every query, even if a better one seems to exist

It sounds like a similar situation to what's asked here, but I'm not sure his details are the same as mine.
Basically I have a relational table, we'll call it User:
User
-----------
int Id
varchar(100) Name
int AddressId
varchar(max) Description
and it has the following indices:
PK_User_Id - Obviously the primary key.
IX_User_AddressId - which includes only the AddressId.
When I run the following query:
select Id, Name, AddressId, Description from User where AddressId > 200
The execution plan shows that a scan was done, and PK_User_Id was used.
If I run this query:
select AddressId from User where AddressId > 200
The execution plan shows that a scan was done and IX_User_AddressId was used.
If I include all of the columns in the IX_User_AddressId index, then my original query will use the proper index, but it still seems wrong that I'd have to do that.
So my SQL noob question is this: What in the world do I have to do to get my queries to use the fastest index? Be very specific, because I can't figure this out on my own.
Your query looks like it has tipped. Since your index does not cover all the fields you wanted, I would say it tipped (check out Kimberly Tripp - Tipping Point) and used the primary key index, which I would take a pretty good guess as being your clustered index.
When your IX_User_AddressId index contains only the AddressId, SQL must perform bookmark lookups on the base table to retrieve your other columns (Id, Name, Description). If the table is small enough, SQL may decide it is more efficient to scan the entire table rather than using an alternate index in combination with bookmark lookups. When you add those other columns to your index, you create what is called a covering index, meaning that all of the columns necessary to satisfy your query are available in the index itself. This is a good thing as it will eliminate the bookmark lookups.
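A sketch of such a covering index for the original query (SQL Server syntax; the index name is made up):

CREATE NONCLUSTERED INDEX IX_User_AddressId_Covering
ON [User] ( AddressId )
INCLUDE ( Name, Description );
-- Id is the clustered key, so it is carried in every nonclustered index row anyway.

With that in place, the original select can be answered from the index alone, with no bookmark lookups.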

Strategy to reduce db indexes, selectively

I have an indexed field, users.username, which is only used in the admin interface. Because the table currently gets lots of writes, I'd like to remove that index. Of course I want to keep the column searchable for admins.
I could extract the whole column to move that index to another table, but it feels stupid because I'm already planning to move the write-heavy fields into another table (with just one index).
Throwing in a search engine would be overkill.
Any ideas for a simple solution?
[edit]
I've just realized that the need for the admins to search and sort lots of fields has an impact on many tables (which would actually need many more indexes). As a first step I'll ensure that the admins get a dedicated server+db to keep the slow sorts/searches off the live servers, and in the long run I'll investigate whether a search engine is suitable. Thanks all!
Maintaining an index only accessible by certain users is not supported in MySQL, and even if it were, it would be as expensive as maintaining a normal index.
Assuming the usernames are unique, you can create a separate index-like table like this:
CREATE TABLE shadow_username (username VARCHAR(100) NOT NULL PRIMARY KEY, userid INT NOT NULL, UNIQUE (userid))
and fill it in on a timely basis:

TRUNCATE TABLE shadow_username;

INSERT
INTO    shadow_username
SELECT  username, id
FROM    users;
and query it:
SELECT  u.*
FROM    (
        SELECT  id
        FROM    shadow_username
        WHERE   username = 'user'
        ) s
JOIN    users u
ON      u.id = s.id
UNION ALL
SELECT  u.*
FROM    users u
WHERE   u.id >
        (
        SELECT  MAX(id)
        FROM    shadow_username
        )
        AND u.username = 'user'
UNION ALL
SELECT  *
FROM    users
WHERE   username = 'user'
LIMIT 1
The first part does a normal search; the second part processes the usernames that were inserted in between the updates to shadow_username; the third part is a fallback method which does a normal search only if the previous two steps found nothing (that may happen if a user changed their username).
If the username never changes, you should omit the third step.
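The periodic refresh itself can be automated; a sketch using MySQL's event scheduler (it assumes event_scheduler is ON and that an hourly lag is acceptable):

CREATE EVENT refresh_shadow_username
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
    TRUNCATE TABLE shadow_username;
    INSERT INTO shadow_username SELECT username, id FROM users;
END
-- (in the mysql client, wrap this in DELIMITER commands to define the compound body)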
If I understand you correctly, you can't have an index for only a certain subset of $ways_to_access_data (i.e., admin interface vs. public interface).
Either the column is indexed, or it isn't.
I'm not sure where the actual problem is. Either the "username" field is written to, in which case updating the index is warranted (and whether to have it indexed or not is a trade-off between read performance and write performance), or it isn't written to (which I'd assume, as most users tend to change their names rather seldom), in which case your RDBMS should not be touching the index at all.
Looking into my crystal ball, I'd assume the "write heavy" fields in the "users" table are login sessions, which should live in a separate table anyway.
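If that guess is right, a sketch of such a split (all names here are made up):

CREATE TABLE user_sessions
(
    session_id BINARY(16) NOT NULL PRIMARY KEY,
    user_id    INT        NOT NULL,
    last_seen  DATETIME   NOT NULL,
    KEY idx_sessions_user (user_id)
);

That keeps the write-heavy traffic away from users and its username index entirely.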

designing index for a web-cms database

I have a question, would you please help me?
I have designed a database for a web CMS. In the User table, which includes UserID, Username, Password, FirstName, LastName, …, which is the best choice to create the index on: Username, or FirstName and LastName? Or both of them?
By default UserID is the clustered index of the User table, so the next index must be non-clustered. But I am not sure about UserID being the clustered index. As this is a web site and many users can register or remove their accounts every day, is it a good choice to create the clustered index on UserID?
I am using SQL Server 2008.
You should define clustered indexes on fields that are often requested sequentially, contain a large number of distinct values, or are used in queries to join tables. That usually means the primary key is a good candidate.
Non-clustered indexes are good for fields that are used in the WHERE clause of queries.
Deciding which fields you create indexes on is something that is very specific to your application. If you have very critical queries that use the first name and last name fields, then I would say yes; otherwise it may not be worth the effort.
In terms of users removing their accounts, I am sure that you do not intend to delete the row from the table. Usually you just mark these as inactive, because otherwise what happens to all the other related tables that may be affected by this user?
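A sketch of that layout (SQL Server 2008 syntax; the column types and sizes are assumptions):

CREATE TABLE [User]
(
    UserID    INT IDENTITY(1,1) NOT NULL,
    Username  NVARCHAR(50)      NOT NULL,
    Password  NVARCHAR(100)     NOT NULL,  -- ideally a hash, as noted in the first answer
    FirstName NVARCHAR(50)      NULL,
    LastName  NVARCHAR(50)      NULL,
    IsActive  BIT               NOT NULL DEFAULT 1,  -- mark inactive instead of deleting
    CONSTRAINT PK_User PRIMARY KEY CLUSTERED (UserID)
);

CREATE UNIQUE NONCLUSTERED INDEX IX_User_Username ON [User] (Username);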

Why most SQL databases allow defining the same index twice?

Why do most SQL databases allow defining the same index (or constraint) twice?
For example in MySQL I can do:
CREATE TABLE testkey(id VARCHAR(10) NOT NULL, PRIMARY KEY(id));
ALTER TABLE testkey ADD KEY (id);
ALTER TABLE testkey ADD KEY (id);
SHOW CREATE TABLE testkey;
CREATE TABLE `testkey` (
`id` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`),
KEY `id_2` (`id`)
)
I do not see any use case for having the same index or constraint twice. And I would like SQL databases not allowing me do so.
I also do not see the point of naming indexes or constraints, as I could reference them for deletion just as I created them.
Several reasons come to mind. In the case of a database product which supports multiple index types it is possible that you might want to have the same field or combination of fields indexed multiple times, with each index having a different type depending on intended usage. For example, some (perhaps most) database products have a tree-structured index which is good for both direct lookup (e.g KEY_FIELD = 1) and range scans (e.g. KEY_FIELD > 0 AND KEY_FIELD < 5). In addition, some (but definitely not all) database products also support a hashed index type, which is only useful for direct lookups but which is very fast (e.g. would work for a comparison such as KEY_FIELD = 1 but which could not be used for a range comparison). If you need to have very fast direct lookup times but still need to to provide for ranged comparisons it might be useful to create both a tree-structured index and a hashed index.
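A sketch of that situation in MySQL, where the MEMORY engine supports both index types on the same column (all names here are made up):

CREATE TABLE lookup_demo
(
    key_field INT NOT NULL,
    KEY bt_key (key_field) USING BTREE,  -- direct lookups and range scans
    KEY h_key  (key_field) USING HASH    -- direct lookups only, but very fast
) ENGINE = MEMORY;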
Some database products do prevent you from having multiple primary key constraints on a table. However, preventing all possible duplicates might require more effort on the part of the database vendor than they feel can be justified. In the case of an open source database the principal developers might take the view that if a given feature is a big enough deal to a given user it should be up to that user to send in a code patch to enable whatever feature it is that is wanted. Open source is not a euphemism for "I use your open-source product; therefore, you are now my slave and must implement every feature I might ever want!".
In the end I think it's fair to say that a product which is intended for use by software developers can take it as a given that the user should be expected to exercise reasonable care when using the product.
All programming languages allow you to write redundancies:
<?php
$foo = 'bar';
$foo = 'bar';
That's just an example; you could obviously have duplicate code, duplicate functions, or duplicate data structures that are much more wasteful.
It's up to you to write good code, and this depends on the situation. Maybe there's a good reason in some rare case to write something that seems redundant. In that case, you'd be just as put out if the technology didn't allow you to do it.
You might be interested in a tool called Maatkit, which is a collection of indispensable tools for MySQL users. One of its tools checks for duplicate keys:
http://www.maatkit.org/doc/mk-duplicate-key-checker.html
If you're a MySQL developer, novice or expert, you should download Maatkit right away and set aside a full day to read the docs, try out each tool in the set, and learn how to integrate them into your daily development tasks. You'll kick yourself for not doing it sooner.
As for naming indexes, it allows you to do this:
ALTER TABLE testkey DROP KEY `id`, DROP KEY `id_2`;
If indexes had no names at all, you'd have no way to drop individual indexes; you'd have to drop the whole table and recreate it without them. That is why the SHOW CREATE TABLE output above contains auto-generated names like id and id_2 even though none were supplied.
There are only two good reasons, that I can think of, for allowing the same index to be defined twice:
for compatibility with existing scripts that do define the same index twice;
because changing the implementation would require work that the vendor is neither willing to do nor pay for.
Some databases do prevent duplicate indexes. Oracle Database raises an error for them (see https://www.techonthenet.com/oracle/errors/ora01408.php), while other databases like MySQL and PostgreSQL do not have duplicate-index prevention.
You shouldn't be in a scenario where you have so many indexes on a table that you can't just quickly look and see if the index is there.
As for naming constraints and indexes, I only really ever name constraints. I will name a constraint FK_CurrentTable_ForeignKeyedColumn, just so things are more visible when quickly looking through lists of them.
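For example (the table and column names are made up):

ALTER TABLE Orders
    ADD CONSTRAINT FK_Orders_CustomerId
    FOREIGN KEY (CustomerId) REFERENCES Customers (Id);

The name then shows up verbatim in error messages and catalog views, which is the visibility being described.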
Because of databases that support composite (multi-column) indexes - Oracle, MySQL, SQL Server, PostgreSQL and others. A composite index covers two or more columns, and queries are matched against that column list left to right in order to use it.
So if I define a composite index on columns 1, 2 and 3, my queries need to use, at a minimum, column 1 to use the index. The next possible combination is columns 1 & 2, and finally 1, 2 and 3.
So what about my queries that only use column 3? Without the other two columns, the composite index can't be used. It's the same issue for queries that use only column 2. In either case, that's a situation where I would consider separate indexes on columns 2 and 3.
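A sketch of that situation (MySQL syntax; the table and column names are made up):

CREATE INDEX idx_c1_c2_c3 ON t (col1, col2, col3);

-- These can use idx_c1_c2_c3 (a leftmost prefix is present):
SELECT * FROM t WHERE col1 = 1;
SELECT * FROM t WHERE col1 = 1 AND col2 = 2;

-- This cannot (col1 is missing), hence the extra single-column indexes:
SELECT * FROM t WHERE col3 = 3;

CREATE INDEX idx_c2 ON t (col2);
CREATE INDEX idx_c3 ON t (col3);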