I'm working on a simple login page using OpenID: if the user has just registered for an OpenID, then I need to create a new entry in the database for the user; otherwise I just display their alias with a greeting. Every time somebody gets authenticated with their OpenID, I must find their alias by looking up which user has the given OpenID, and it seems that this might be fairly slow if the primary key is the UserID (and there are millions of users).
I'm using SQL Server 2008 and I have two tables in my database (Users and OpenIDs): I plan to check whether the OpenID exists in the OpenIDs table, then use the corresponding UserID to get the rest of the user information from the Users table.
The Users table is indexed by UserID and has the following columns:
UserID (pk)
EMail
Alias
OpenID (fk)
The OpenIDs table is indexed by OpenID and has the following columns:
OpenID (pk)
UserID (fk)
Alternatively, I can index the Users table by both UserID and OpenID (i.e. have two indexes) and drop the OpenIDs table completely.
What would be the recommended way to speed up looking up the user with a matching OpenID in this case: index the Users table with two keys, or use the OpenIDs table to find the matching UserID?
Maybe the answers to "What are some best practises and “rules of thumb” for creating database indexes?" can help you.
Without knowing what kind of queries you'll be running in detail, I would recommend indexing the two foreign key columns - Users.OpenID and OpenIDs.UserID.
Indexing the foreign keys is typically a good idea to help with JOIN conditions and other queries.
But quite honestly, if you use the OpenIDs table only to check for the existence of an OpenID, you'd be much better off just indexing that column in the Users table (possibly with a unique index?) and being done with it. The OpenIDs table as you have it now serves no real purpose at all; it just takes up space for redundant information.
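A minimal sketch of that recommendation, assuming the Users table from the question (the index name is illustrative):

    -- One unique index on the OpenID column; the OpenIDs table is dropped.
    -- UNIQUE also enforces that no two users share an OpenID.
    CREATE UNIQUE NONCLUSTERED INDEX UX_Users_OpenID
        ON dbo.Users (OpenID);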
Other than that: you need to observe how your application behaves, sample some usage data, and then see which kinds of queries run most often and take the longest, and start your performance tweaking there. Don't overdo the ahead-of-time performance optimizations; too many indexes can be worse than having none at all!
"Every time somebody gets authenticated with their Open ID, I must find their alias by looking up which user has the given OpenID and it seems that it might be fairly slow if the primary key is the UserID (and there are millions of users)."
Actually, quite the contrary! If you have a value that's unique amongst millions of rows, finding that single value is actually quite quick, even with millions of users. It will take only a handful (max. 5-6) of comparisons, and bang! you have your one user out of a million. If you have an index on that OpenID column, that lookup should be pretty fast indeed. Such a highly selective index (one value picks out 1 in a million) works very, very efficiently.
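To make that concrete, here is roughly what the login lookup looks like; with a unique index on OpenID, SQL Server resolves it with a single index seek (the variable and its value are illustrative):

    -- A B-tree index is shallow: with hundreds of keys per page,
    -- a million rows typically means only a handful of page reads.
    DECLARE @openid varchar(512) = 'https://openid.example/alice';
    SELECT UserID, Alias
    FROM dbo.Users
    WHERE OpenID = @openid;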
I am rebuilding a system and migrating all of the data from MySQL to Postgres (along with Redis as a cache layer).
The structure is as follows:
users table with under 5000 rows
items table with a few billion rows
chars table with under a million rows
accounts table with under 200k rows
each item row has a foreign key for chars(char_id)
each char row has a foreign key for accounts(acc_id)
each account row has a foreign key for users(user_id)
I have also added the users(user_id) foreign key to the items table. This is to avoid having to join items + chars + accounts + users every time I simply want to look up items by user_id.
I want to know the most efficient way to access the items rows. Most likely that table will eventually be sharded to a remote server when a single server can't handle the whole thing.
In some places I've read that partitioning is a good idea here; others say indexes/views are sufficient.
So my idea is to partition items by user_id, since queries will almost always be limited to one user's data. A typical query will SELECT ~100 or so items for a specific user that are on a char of the correct type, where that char is also on an account of the correct type (two joins).
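A sketch of that layout, assuming PostgreSQL's declarative hash partitioning (version 11 or newer); the partition count and the char_type/acc_type filter columns are assumptions:

    -- Partitioned parent; the partition key must be part of the PK.
    CREATE TABLE items (
        item_id bigint  NOT NULL,
        char_id bigint  NOT NULL,   -- FK to chars(char_id) in the real schema
        user_id integer NOT NULL,   -- the denormalized FK to users(user_id)
        PRIMARY KEY (user_id, item_id)
    ) PARTITION BY HASH (user_id);

    CREATE TABLE items_p0 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE items_p1 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE items_p2 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE items_p3 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 3);

    -- The typical query: ~100 items for one user, filtered through
    -- chars and accounts; only one partition is scanned.
    SELECT i.item_id
    FROM items i
    JOIN chars    c ON c.char_id = i.char_id AND c.char_type = 'mage'
    JOIN accounts a ON a.acc_id  = c.acc_id  AND a.acc_type  = 'premium'
    WHERE i.user_id = 42
    LIMIT 100;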
My first question is: are partitions by user good? And if so, what is the ideal layout? If not, what are the alternatives? Views?
A secondary question would be, should I put the char and account details on each item row? I don't mind wasting space if it means faster queries.
There's a lot to say and I hope everything makes sense. If it doesn't, please don't hesitate to ask for more details. Thanks in advance.
I notice a pattern which seems pretty obvious now.
Need to get your opinion on this.
Suppose we have a one-to-many relationship from table 1 to table 2 in a relational model.
For example, table 1 could be a User table and table 2 could be a Login table which logs all user logins. One user can log in multiple times.
Given a user we can find all logins by that user.
The first idea that comes to mind is to store the logins only in the Login table. This is design 1.
But if for some use cases we are interested in a particular login of the user (say the last login), it is "generally a good idea" to cache the last login time in the User table itself.
Is that right?
Design 2 is obviously redundant, as we can always find the last login time by performing a join and then discarding all but the most recent login.
For one user, either should be fine. But if you want to find the last login time for all users in a single SQL query, design 1 would involve a join and a subquery to filter out the unneeded rows.
But given our use case, it is a good idea to store the last login time in the User table itself, which saves us the join. Is that right?
Is that a generic pattern that you see when designing schemas?
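To make the comparison concrete, here is roughly what the two designs look like in SQL (table and column names are assumed):

    -- Design 1: derive the last login from the Login table (join + subquery).
    SELECT u.user_id, u.user_name, l.last_login
    FROM users u
    JOIN (SELECT user_id, MAX(login_time) AS last_login
          FROM logins
          GROUP BY user_id) AS l
      ON l.user_id = u.user_id;

    -- Design 2: the redundant column turns it into a single-table read.
    SELECT user_id, user_name, last_login
    FROM users;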
You are confusing the concepts of TABLE and RELATION, a common mistake. You have two RELATIONS in your conceptual model (Users & Logins), but in practice this will involve more than two TABLES in your physical model, as non-clustered indices are nothing more than additional TABLES that speed up the joining of multiple RELATIONS.
Once the INDEX (UserID, LoginTime) exists on Logins to support the FK relationship to Users, the query to find the most recent login for a user is covered by that non-clustered index. Only when a known, measurable, severe performance problem has been identified with this default model would one look to denormalize, as this (like all denormalizations) introduces a performance hit for EVERY OTHER READ AND WRITE operation on the denormalized table.
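A sketch of what this answer describes, in SQL Server terms (names are assumed; the index covers the query because both columns are in its key):

    CREATE NONCLUSTERED INDEX IX_Logins_UserID_LoginTime
        ON dbo.Logins (UserID, LoginTime DESC);

    -- Most recent login for one user: a single index seek; the base
    -- table is never touched.
    DECLARE @UserID int = 42;   -- illustrative value
    SELECT TOP (1) LoginTime
    FROM dbo.Logins
    WHERE UserID = @UserID
    ORDER BY LoginTime DESC;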
I have a large database, one of the tables is called Users.
There are two kinds of users in the database: basic users and advanced users. The Users table has 29 columns. However, only 12 of these are applicable to basic users; the other 17 columns are only used for advanced users, and for basic users they all just contain null.
Is this an OK setup? Would it be more efficient to, say, split the two kinds of users into two different tables, or put all the extra fields that advanced users have in a separate table?
It's better to have the right number of tables - this may be more or fewer, depending on your needs.
For your specific case: you should always start with third normal form and only revert to lesser forms when absolutely necessary (such as for performance), and only when you understand the consequences.
An attribute (column) belongs in a table if it is dependent on the key, the whole key and nothing but the key (so help me, Codd).
It's arguable whether your other 17 columns depend on the key in your user table, but I would separate them anyway, just for the space saving.
Have your basic user table with the twelve columns (including a unique key of some sort) and your advanced user table with the other columns, plus that same key so you can tie the rows from each together.
You could go even further and have a one-to-many relationship if your use case is that users can have any of the 17 attributes independently of each other, but that doesn't seem to be what you've described.
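A minimal sketch of that layout, with illustrative names and placeholder columns standing in for the real 12 and 17:

    -- The twelve "basic" columns live here.
    CREATE TABLE Users (
        UserID int          NOT NULL PRIMARY KEY,
        EMail  varchar(256) NOT NULL
        -- ...the rest of the basic columns
    );

    -- The 17 advanced-only columns move to a companion table
    -- sharing the same key (one row per advanced user).
    CREATE TABLE AdvancedUserDetails (
        UserID        int          NOT NULL PRIMARY KEY REFERENCES Users (UserID),
        AdvancedField varchar(256) NULL
        -- ...the rest of the advanced columns
    );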
It depends:
If the number of columns is large, then it will be more efficient to create two tables as you describe, as you will not be reserving space for 17 columns which end up holding null.
You can always tack a view on the front which combines both tables, so your application code could be unaffected.
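For instance, reusing the illustrative split from the sketch above, such a view might look like this:

    -- Outer join so basic users still appear, with NULLs in the
    -- advanced columns, just as in the original single table.
    CREATE VIEW AllUsers AS
    SELECT u.UserID, u.EMail, a.AdvancedField
    FROM Users u
    LEFT JOIN AdvancedUserDetails a ON a.UserID = u.UserID;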
Yes, it's better to split up this table, but not into two - it's better to split it into three tables:
User table -
Contains the properties common to both basic and advanced users:
UserID (PK)
UserName
Basic user -
Contains the basic-user properties, and uses the primary key of the User table as a foreign key:
UserID (FK) - from the User table
BasicUserDetail
Advanced user -
Contains the advanced-user properties, and uses the primary key of the User table as a foreign key:
UserID (FK) - from the User table
AdvancedUserDetail
In this case, it's valid and more efficient to use a 'single table per class hierarchy' in terms of data-retrieval speed, but if you insert a basic user, it will reserve 17 columns per tuple for nothing. This case is so frequent that it is supported by ORMs such as Hibernate. Using this approach you avoid a join between tables, which may be expensive depending on the case.
The bad thing is that if your design needs to scale in terms of types of users, you will need to add additional columns, many of which will be empty.
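For contrast, a minimal sketch of that single-table approach; the UserType discriminator column and the placeholder columns are assumptions:

    -- One table for the whole hierarchy; advanced-only columns are
    -- nullable and simply stay NULL for every basic user.
    CREATE TABLE Users (
        UserID        int          NOT NULL PRIMARY KEY,
        UserType      char(1)      NOT NULL,  -- 'B' = basic, 'A' = advanced
        EMail         varchar(256) NOT NULL,
        AdvancedField varchar(256) NULL       -- one of the 17 advanced columns
    );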
Usually it won't matter much, but if you have very many users and only a few of them are advanced users, it might be better to split. To my knowledge there are no exact rules for when to split and when not to.
I have a question, would you please help me?
I have designed a database for a web CMS. In the User table, which includes UserID, Username, Password, FirstName, LastName, etc., which is the best choice of column to create the index on: Username, or FirstName and LastName? Or both of them?
By default UserID is the clustered index of the User table, so the next index must be non-clustered. But I am not sure about UserID being the clustered index: as this is a web site and many users can register or remove their accounts every day, is it a good choice to create the clustered index on UserID?
I am using SQL Server 2008.
You should define clustered indexes on fields that are often requested sequentially, contain a large number of distinct values, or are used in queries to join tables. That usually means the primary key is a good candidate.
Non-clustered indexes are good for fields that are used in the WHERE clause of queries.
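A sketch of both in SQL Server terms (column names taken from the question, sizes assumed):

    -- PRIMARY KEY defaults to the clustered index: the table's rows
    -- are physically ordered by UserID.
    CREATE TABLE Users (
        UserID    int           NOT NULL PRIMARY KEY CLUSTERED,
        Username  nvarchar(100) NOT NULL,
        FirstName nvarchar(100) NULL,
        LastName  nvarchar(100) NULL
    );

    -- A non-clustered index to support WHERE Username = ... lookups.
    CREATE UNIQUE NONCLUSTERED INDEX IX_Users_Username
        ON Users (Username);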
Deciding which fields you create indexes on is something that is very specific to your application. If you have very critical queries that use the first name and last name fields, then I would say yes; otherwise it may not be worth the effort.
As for users removing their accounts, I am sure that you do not intend to delete the row from the table. Usually you just mark these as inactive, because what happens to all the other related tables that may be affected by this user?
I'm at the planning stage of a multi-user application where each user will only have access to their own data. There'll be a few tables that relate to each other, so I could use JOINs to ensure they're accessing only their own data, but should I include user_id in each table? Would this be faster? It would certainly make some of the queries easier in the long run.
Specifically, the question is about multiple tables containing the user_id field.
For example, each user can configure categories, items (in those categories), and sub-items against those items. There's a logical path from user, to sub-items through the other tables, but it would require 3 JOINs. Should I just include user_id in all the tables?
Thanks!
This is a design decision in multi-tenant databases. With "root" tables, obviously you have to have the user_id. But in the non-"root" tables, you do have a choice when you are using surrogate PKs.
Say you have users with projects and projects with actions. Projects obviously have to have a user_id, but if actions are tied to one and only one project, then the user_id is redundant and also violates normal form, since if an action were to move to another user's project (probably not likely in your use case), both the project FK and the user FK would have to be updated. Typically in multi-tenant scenarios this isn't really possible anyway, and so the primary key of every table is really a combination of the tenant and a unique primary key "within" the tenant (which may also happen to be globally unique).
If you use natural keys extensively in your design, then clearly tenant + natural key is necessary so that each tenant's natural keys can be used. It's only when using surrogates like IDENTITY or GUIDs or sequences that this becomes an issue, since it is tempting to make the IDENTITY the PK; after all, it is unique by definition.
Having the user_id in all tables does allow you to do certain things in views to enhance security (defense in depth), giving you a little bit of defensive programming (in SQL Server you can restrict all access to go through inline table-valued functions - essentially parameterized views - which require the app to specify user_id on every "table" access), and it also allows you to easily scale out to multiple databases by forklifting everything on shared keys.
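A rough sketch of that inline table-valued function pattern (table and column names are illustrative):

    -- Effectively a parameterized view: the app can be granted SELECT
    -- on the function while being denied direct access to the table.
    CREATE FUNCTION dbo.GetProjects (@user_id int)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT project_id, project_name
        FROM dbo.Projects
        WHERE user_id = @user_id
    );

    -- Every "table" access must name the tenant:
    -- SELECT * FROM dbo.GetProjects(@current_user_id);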
See this article for some interesting insights.
(In a massively parallel paradigm like Teradata, the PRIMARY INDEX determines the AMP on which the data lives, so I would think that this is a must to stop redistribution of rows to the other AMPs.)
In general, I would say you should have a tenant ID in each table; it should be the first column in the table and in most indexes, and should be part of the primary key in most cases, unless otherwise justified. Where possible, it should be a required parameter in most stored procedures.
Generally, you use foreign keys to relate data between tables. In many cases, this foreign key is the user id. For example:
users
    id
    name
phonenumbers
    user_id
    phonenumber
So yes, that'd make perfect sense.
If a category can only belong to one user, then yes, you need to include the user_id in the category table. If a category can belong to multiple people, then you would have a separate table that maps category IDs to user IDs. You can still do this if you have a one-to-one mapping between the two, but there is no real reason for it.
You don't need to include the user_id in further tables if you can guarantee that those child tables will always be accessed by joining to the category table. If there is a chance that you will access them independently of the category table, then you should also have the user_id on those tables.
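For the shared-category case, a sketch of that mapping table (names assumed):

    -- Many-to-many: one row per (category, user) pair.
    CREATE TABLE category_users (
        category_id int NOT NULL REFERENCES categories (category_id),
        user_id     int NOT NULL REFERENCES users (user_id),
        PRIMARY KEY (category_id, user_id)
    );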
The extent to which to normalize can be a difficult decision. One of the best StackOverflow answers on this topic (Database Development Mistakes Made by App Developers) warns against both (1) failing to normalize, and (2) over-normalizing.
You mention that it might be easier "in the long run" to repeat the same data in multiple tables (that is, not to normalize that data). Look at the "Not simplifying complex queries through views" topic in the previous link. If you use views effectively, you will only have to write the 3-join query once, when creating the view, and then you can use a query with no joins for most purposes.
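As a sketch, the three-join path from the question could be captured once like this (table and column names are assumed):

    -- Written once; afterwards "SELECT ... FROM user_sub_items
    -- WHERE user_id = ?" needs no joins in application code.
    CREATE VIEW user_sub_items AS
    SELECT c.user_id,
           c.category_id,
           i.item_id,
           s.sub_item_id
    FROM categories c
    JOIN items     i ON i.category_id = c.category_id
    JOIN sub_items s ON s.item_id     = i.item_id;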
Most developers tend to under-normalize because it seems simpler. Go ahead and normalize. Use views to simplify your daily queries. When your requirements get more complex or you decide to add features, you will be glad that you put time into a relational database design.
Alternatively, depending on your toolset, you may want to use a database abstraction layer that does the relational design under the covers while you manipulate higher-level data objects.
If it is Oracle, then you would probably set up a fine-grained security rule to do the joins and prevent certain activities based on the existence of the original user ID (SELECT, INSERT, UPDATE, DELETE, etc.).
You would need a map between the logged-in user and the user_id. You could use the UID, but then remember this number may change if the database is reconstructed after some disaster...
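A rough sketch of such a rule using Oracle's Virtual Private Database; all names here are assumptions, and the 'app_ctx' application context would need to be created and populated separately:

    -- Policy function: returns a predicate Oracle appends to every
    -- statement against the protected table.
    CREATE OR REPLACE FUNCTION user_rows_only (
        p_schema IN VARCHAR2,
        p_object IN VARCHAR2
    ) RETURN VARCHAR2 IS
    BEGIN
        RETURN 'user_id = SYS_CONTEXT(''app_ctx'', ''user_id'')';
    END;
    /

    -- Attach the rule to the table for all four statement types.
    BEGIN
        DBMS_RLS.ADD_POLICY(
            object_schema   => 'APP',
            object_name     => 'ITEMS',
            policy_name     => 'items_by_user',
            policy_function => 'USER_ROWS_ONLY',
            statement_types => 'SELECT, INSERT, UPDATE, DELETE');
    END;
    /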