Strategy to selectively reduce DB indexes - SQL

I have an indexed field, users.username, which is only used in the admin interface. Because the table currently gets lots of writes, I'd like to remove that index. Of course, I still want the column to be searchable for admins.
I could extract the whole column and move that index to another table, but that feels wasteful because I'm already planning to move the write-heavy fields into another table (with just one index).
Throwing in a search engine would be overkill.
Any ideas for a simple solution?
[edit]
I've just realized that the admins' need to search and sort lots of fields affects many tables (which would actually need many more indexes). As a first step I'll make sure the admins get a dedicated server+DB to keep the slow sorts/searches off the live servers, and in the long run I'll investigate whether a search engine is suitable. Thanks all!

Maintaining an index that is only accessible to certain users is not supported in MySQL, and even if it were, it would be as expensive to maintain as a regular index.
Assuming the usernames are unique, you can create a separate index-like table like this:
CREATE TABLE shadow_username (
    username VARCHAR(100) NOT NULL PRIMARY KEY,
    userid INT NOT NULL,
    UNIQUE (userid)
)
refill it periodically:
TRUNCATE shadow_username;

INSERT INTO shadow_username (username, userid)
SELECT username, id
FROM users;
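If you go this route, MySQL's event scheduler can automate the periodic refill. A sketch, assuming the scheduler is enabled and a 10-minute staleness window is acceptable (the event name and interval are illustrative):

```sql
-- Requires the event scheduler: SET GLOBAL event_scheduler = ON;
DELIMITER //
CREATE EVENT refresh_shadow_username
ON SCHEDULE EVERY 10 MINUTE
DO
BEGIN
    -- Rebuild the lookup table from scratch each run.
    TRUNCATE TABLE shadow_username;
    INSERT INTO shadow_username (username, userid)
    SELECT username, id FROM users;
END //
DELIMITER ;
```

Note there is a brief window between TRUNCATE and INSERT where the shadow table is empty; the fallback branches of the query below are what cover lookups during that gap.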
and query it (note the shadow table's id column is named userid, and the final LIMIT applies to the whole UNION):
SELECT u.*
FROM (
    SELECT userid
    FROM shadow_username
    WHERE username = 'user'
) s
JOIN users u
    ON u.id = s.userid
UNION ALL
SELECT u.*
FROM users u
WHERE u.id > (
    SELECT MAX(userid)
    FROM shadow_username
)
    AND u.username = 'user'
UNION ALL
SELECT *
FROM users
WHERE username = 'user'
LIMIT 1
The first part does a normal search; the second part covers usernames that were inserted between refreshes of shadow_username; the third part is a fallback that does a normal search only if the previous two steps found nothing (which can happen if a user changed their username).
If the username never changes, you should omit the third step.

If I understand you correctly, you can't have an index for only a certain subset of $ways_to_access_data (i.e., admin interface vs. public interface).
Either the column is indexed, or it isn't.

I'm not sure what the actual problem is. Either the "username" field is written to, in which case updating the index is warranted (and whether to index it is a trade-off between read performance and write performance), or it isn't written to (which I'd assume, as most users change their name rather seldom), in which case your RDBMS should not be touching the index at all.
Looking into my crystal ball, I'd assume the "write heavy" fields in the "users" table are login sessions, which should live in a separate table anyway.


Best way to mimic inheritance in postgresql?

For an application I am writing, there are two types of "users": those who have made accounts, and those who have not (virtual_users). These two types are nearly identical, except that account_users have a password, and their email is required and must be unique among all account_users, although it can be the same as that of any number of virtual_users. A large number of tables have a column that references users, which should include both, and 90% of app functionality treats them as interchangeable. What is the best way of handling this? Some options I have considered:
-Put both types of users in the same table and have a complicated constraint regarding uniqueness: basically, if password is not NULL, email must be unique among all users where password is not NULL. I have no idea how I would write this constraint. On the few occasions I only want account_users, I'd query for only users who have a password. This seems like the best solution if I can figure out how to write the constraint.
-Have Account_users inherit from Virtual_users, with Account_users adding a password column and a unique constraint on email. From here there are two potential options:
---Have a Users table with two columns, account_user_id and virtual_user_id, one of which is NULL and the other of which references the appropriate user. When other tables need to reference a user, they reference this table. Have all my server-side user queries query both tables and combine the results.
---When other tables need to reference a user, they reference either table. I don't think this is possible. Have all my server-side user queries query both tables and combine the results.
Any advice would be appreciated.
I assume the scenario is that you have a system in which some parts require the user to be signed into a registered account and others do not, but you'd still like to track users.
Postgres has table inheritance. You could use that, but I'd be concerned about the caveats.
You could put them all into one table and use some sort of flag, like Single Table Inheritance, but then you run into constraint issues. You would then enforce constraints in the model. This should be fine if you have a strong model.
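For what it's worth, the uniqueness rule the question struggles with can be written declaratively in Postgres as a partial unique index; a sketch, assuming the single-table layout with a nullable password column (the index name is illustrative):

```sql
-- Enforce: email must be unique among users that have a password
-- (i.e. account users); virtual users may repeat any email freely.
CREATE UNIQUE INDEX users_account_email_key
    ON users (email)
    WHERE password IS NOT NULL;
```

The WHERE clause restricts the index, and hence the uniqueness check, to rows matching the predicate, which is exactly "unique among all users where password is not NULL".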
You could have separate accounts and users. Rather than one being a special case of the other, the key is thinking of them as two conceptually different things. In OO terms, an account has a user.
-- Visitors to the site who haven't signed up.
create table users (
  id serial primary key,
  -- The unverified email they might have given you.
  email text,
  -- Any other common information, like a tracking token.
  token text
);
-- Users who have registered.
create table accounts (
  id serial primary key,
  user_id int references users(id),
  -- Their verified email.
  email text not null,
  -- Hashed, of course.
  password text not null
  -- Any additional information only for accounts.
);
I like this because there are no flags involved, users and accounts can have separate constraints, and the accounts table doesn't get bloated with users that showed up once and never came back.
You'd access complete account information with a simple join.
select accounts.*, users.token
from accounts
join users on accounts.user_id = users.id
where accounts.id = ?
If you want to delete an account you can do so without losing the underlying user information.
delete from accounts where accounts.id = ?
Systems which require an account use accounts. Systems which don't use users. Systems which behave differently for users and accounts can check if a user has an account easily.
select accounts.id
from accounts
where accounts.user_id = ?
User tracking is associated with the users table, so you have a consistent record of a user's behavior before and after they register.

SQL design for performance

I am new to SQL and I have a basic question about performance.
I want to create a users database which will store information about my users:
Id
Log in name
Password
Real name
Later I want to perform a SELECT query on: Id, Log in name and Real name.
What would be the best design for this database, what tables and what keys should I create?
If it's only about those 4 fields it looks like just one table. Primary key on ID, unique index on LoginName. You may not want to store the password, but only a hash.
Depending on your queries, create different indexes. Furthermore, you may not need the ID field at all.
UPDATE:
Creating an index on certain column(s) enables the database to optimize its SQL statements. Given your user table:
USER
    USER_ID   BIGINT NOT NULL
    LOGIN_ID  VARCHAR(<size>) NOT NULL
    PASSWORD  VARCHAR(<size>) NOT NULL
    NAME      VARCHAR(<size>) NOT NULL
    CONSTRAINT PK_USER PRIMARY KEY (USER_ID)
The databases I know will automatically create an index on the primary key, which in effect means the database maintains an optimized lookup structure; see Wikipedia for further details.
Now say you want to query users by LOGIN_ID, which I guess is a fairly common use case; you can create another index like:
CREATE INDEX I_USER_1 ON USER ( LOGIN_ID asc )
The above index will optimize the select * from USER where LOGIN_ID='foo'. Furthermore, you can create a unique index instead, assuming that you do not want duplicate LOGIN_IDs:
CREATE UNIQUE INDEX UI_USER_1 ON USER ( LOGIN_ID asc )
That's the whole story, so if you want to optimize a query for the user's real name (NAME), you just create another index:
CREATE INDEX I_USER_2 ON USER ( NAME asc )
Just to add to @homes' answer: you should work out what sort of queries you will be running and then optimize for those. For example, if you are doing a lot of writes and not as many reads, having lots of indexes can cause performance issues. It's a bit like tuning a car engine: are you going quickly down a drag strip, or are you tuning it for long-distance driving?
Anyway, you also asked about the NAME column. If you are going to be matching on a varchar column, it might be worth investigating the use of FULLTEXT indexes.
http://msdn.microsoft.com/en-us/library/ms187317.aspx
This allows you to do optimized searches on names where you might be matching parts of a name and the like. As @homes' answer said, it really depends on what your queries and intent are when writing the query.
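As a sketch of what that could look like in SQL Server, assuming the USER table from the earlier answer with its PK_USER primary key (the catalog name and search term are illustrative):

```sql
-- A full-text catalog to hold the index.
CREATE FULLTEXT CATALOG user_ftc;

-- Full-text index on NAME, keyed on the table's primary key index.
CREATE FULLTEXT INDEX ON [USER] (NAME)
    KEY INDEX PK_USER ON user_ftc;

-- Prefix search: any NAME containing a word starting with 'smi'.
SELECT * FROM [USER] WHERE CONTAINS(NAME, '"smi*"');
```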
It might be worth creating the table and using the query execution plan in something like SQL Server Management Studio against your queries to see what impact your indexes have on the number of rows and the sorts of lookups that are happening.
http://www.sql-server-performance.com/2006/query-execution-plan-analysis/
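For a quick numeric comparison without the graphical plan, SQL Server can report I/O per query; a minimal sketch, run once before and once after creating an index:

```sql
-- Compare logical reads for the same query with and without an index on NAME.
SET STATISTICS IO ON;
SELECT * FROM [USER] WHERE NAME = 'foo';
SET STATISTICS IO OFF;
```

A large drop in logical reads after adding the index is a good sign the index is being used for that query.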

How should I record in the database that an item/product is visible by 'all' groups?

A user can be in groups. And an item/product is assigned groups that can see the item. Users can see that item if they are in one of the assigned groups.
I want neither public (anonymous users in no groups) nor groupless users (logged in users not in any groups) to see the item. But I want the interface to allow assigning the item an 'all/any groups' attribute so that users that are in any group at all can see the item.
Where/How should I store this assignment?
p.s. I expect the technique to also be extended to other entities, for example I'd assign a file to a category, and groups are linked to categories. so when a file is marked as visible by the 'all/any category' then if the user (thru groups and group-categories) is linked to at least one category then the file is visible to them.
Decision:
It seemed the choice was whether to implement this as a row in an entity-groups table or as a field in the entity table. The chosen answer used the former.
And either managing the group membership in a table or adding JOIN conditions. The chosen answer used the former, but I'm going to use the latter. I'm putting an indirection between the query and its usage, so if (when) performance becomes a problem I should be able to switch to a managed table underneath (as suggested) without changing usage.
I have other special groups like 'admin', 'users', etc. which also fit this concept (the basis simply being a list of groups) more easily than special and variable field handling for each entity.
thanks all.
I'd put it in the items table as a boolean/bit column IsVisibleToAllGroups.
It does make the query to get all items for a user a bit less straightforward. The alternative would be to expand out "all groups" and add a permission row for each individual group, but that can lead to a huge expansion in the number of rows; you would still have to keep it up to date if a group is added later, and somehow distinguish between a permission granted explicitly to all (current and future) groups and one that just happened to be granted to every group currently in existence.
Edit: You don't mention which RDBMS you are using. One other approach you could take would be a hierarchy of groups.
GroupId  ParentGroupId  Name
-------  -------------  ----------
0        NULL           Base Group
1        0              Group 1
2        0              Group 2
You could then assign your "all" permissions to GroupId=0 and use (SQL Server approach below)
WITH GroupsForUser AS
(
    SELECT G.GroupId, G.ParentGroupId
    FROM UserGroups UG
    JOIN Groups G
        ON G.GroupId = UG.GroupId
    WHERE UG.UserId = @UserId
    UNION ALL
    SELECT G.GroupId, G.ParentGroupId
    FROM Groups G
    JOIN GroupsForUser GU
        ON G.GroupId = GU.ParentGroupId
)
SELECT IG.ItemId
FROM GroupsForUser GU
JOIN ItemGroups IG
    ON IG.GroupId = GU.GroupId
As mentioned by both Martin Smith and Mikael Eriksson, making this a property of the entity is a very tidy and straightforward approach. Purely in terms of data representation, it has a very nice feel to it.
I would, however, also consider the queries you are likely to run against the data. For example, based on your description, you seem most likely to have queries that start with a single user, find the groups they are a member of, and then find the entities those groups are associated with. Possibly something like this...
SELECT DISTINCT -- If both user and entity relate to multiple groups, de-dupe them
       entity.*
FROM user
INNER JOIN user_link_group
        ON user.id = user_link_group.user_id
INNER JOIN group_link_entity
        ON group_link_entity.group_id = user_link_group.group_id
INNER JOIN entity
        ON entity.id = group_link_entity.entity_id
WHERE user.id = @user_id
If you were to use this format along with the idea of a property in the entity table, you would need something much less elegant; I think the following UNION approach is possibly the most efficient...
<ORIGINAL QUERY>
UNION -- Not UNION ALL, as the next query may duplicate results from above
SELECT entity.*
FROM entity
WHERE EXISTS (SELECT * FROM user_link_group WHERE user_id = @user_id)
  AND isVisibleToAllGroups != 0
-- NOTE: This also implies the need for an additional index on [isVisibleToAllGroups]
Rather than creating the corner case in the "what entities can I see" query, another option is to create the corner case in the maintenance of the link tables...
Create a GLOBAL group
If an entity is visible to all groups, map it to the GLOBAL group
If a user is added to any group, ensure they are also linked to the GLOBAL group
If a user is removed from all groups, ensure they are also removed from the GLOBAL group
In this way, the original simple query works without modification. No UNION is needed, with its overhead of sorting and de-duplication, and neither is the index on isVisibleToAllGroups. Instead, the overhead is moved to maintaining which groups a user is linked to: a one-time cost per membership change rather than a cost on every read.
This assumes that the question "what entities can I see?" is asked more often than group memberships change. It also adds behaviour that is defined by the DATA and not by the SCHEMA, which necessitates good documentation and understanding. As such, I do see this as a powerful type of optimisation, but also as a trades-and-balances compromise that needs accounting for in the database design.
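The link-table maintenance described above could be automated in the database itself. A sketch of the "user added to a group" rule as a SQL Server trigger, assuming the user_link_group(user_id, group_id) table from the earlier queries and the hypothetical convention that GroupId 0 is the GLOBAL group:

```sql
CREATE TRIGGER trg_user_link_group_global
ON user_link_group
AFTER INSERT
AS
BEGIN
    -- Link each newly grouped user to the GLOBAL group (id 0),
    -- unless they are already linked to it.
    INSERT INTO user_link_group (user_id, group_id)
    SELECT DISTINCT i.user_id, 0
    FROM inserted i
    WHERE i.group_id <> 0
      AND NOT EXISTS (SELECT 1
                      FROM user_link_group g
                      WHERE g.user_id = i.user_id
                        AND g.group_id = 0);
END
```

The removal rule (unlink from GLOBAL when the last real membership goes away) would need a matching AFTER DELETE trigger.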
Instead of a boolean, which needs additional logic in every query, I'd add a column 'needs_group' containing the name (or number) of the group required to see the item. Whether a NULL field means 'nobody' or 'everybody' is just an (allow/deny) design decision. Creating one 'public' group and putting everybody in it is also a design decision. YMMV.
This concept should get you going:
The user can see the product if:
the corresponding row exists in USER_GROUP_PRODUCT
or PRODUCT.PUBLIC is TRUE (and user is in at least one group, if I understand your question correctly).
There are 2 key points to consider about this model:
Liberal usage of identifying relationships - primary keys of parents are "migrated" within primary keys of children, which enables "merging" of GROUP_ID at the bottom table, USER_GROUP_PRODUCT. This is what allows the DBMS to enforce the constraint that user and product must belong to the same group to be mutually visible. Using non-identifying relationships and surrogate keys would prevent the DBMS from enforcing that directly (you'd have to write custom triggers).
Usage of PRODUCT.PUBLIC - you'll have to treat this field as "magic" in your client code. The alternative is to simply fill USER_GROUP_PRODUCT with all the possible combinations, but that approach is fragile when a new user is added: they would not automatically see the product unless you also updated USER_GROUP_PRODUCT, and how would you know to update it unless you have a field such as PRODUCT.PUBLIC? So if you can't avoid PRODUCT.PUBLIC anyway, why not treat it specially and save some storage space in the database?
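Under that model, the visibility rule could be queried roughly like this. This is only a sketch: the exact table layout is assumed from the description (USER_GROUP for membership, USER_GROUP_PRODUCT carrying user_id and product_id among its migrated keys), so adapt the column list to the real schema:

```sql
-- A product is visible to @user_id if it is granted via USER_GROUP_PRODUCT,
-- or it is PUBLIC and the user belongs to at least one group.
SELECT DISTINCT p.product_id
FROM product p
LEFT JOIN user_group_product ugp
       ON ugp.product_id = p.product_id
      AND ugp.user_id = @user_id
WHERE ugp.user_id IS NOT NULL
   OR (p.public = 1
       AND EXISTS (SELECT 1 FROM user_group ug
                   WHERE ug.user_id = @user_id));
```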

How to properly index my database to increase query performance

I'm working on a simple login page using OpenID: if the user has just registered for an OpenID, I need to create a new entry in the database for them; otherwise I just display their alias with a greeting. Every time somebody is authenticated with their OpenID, I must find their alias by looking up which user has the given OpenID, and it seems that might be fairly slow if the primary key is the UserID (and there are millions of users).
I'm using SQL Server 2008 and I have two tables in my database (Users and OpenIDs): I plan to check whether the OpenID exists in the OpenIDs table, then use the corresponding UserID to get the rest of the user's information from the Users table.
The Users table is indexed by UserID and has the following columns:
UserID (pk)
EMail
Alias
OpenID (fk)
The OpenIDs table is indexed by OpenID and has the following columns:
OpenID (pk)
UserID (fk)
Alternatively, I can index the Users table by both UserID and OpenID (i.e. have 2 indexes) and drop the OpenIDs table completely.
What would be the recommended way to speed up the lookup of a user with a matching OpenID in this case: index the Users table with two keys, or use the OpenIDs table to find the matching UserID?
Maybe the answers to What are some best practises and “rules of thumb” for creating database indexes? can help you.
Without knowing what kind of queries you'll be running in detail, I would recommend indexing the two foreign key columns - Users.OpenID and OpenIDs.UserID.
Indexing the foreign keys is typically a good idea to help with JOIN conditions and other queries.
But quite honestly, if you use the OpenIDs table only to check the existence of an OpenID, you'd be much better off just indexing (possibly with a unique index) that column in the Users table and being done with it. The OpenIDs table as you have it now serves no real purpose at all; it just takes up space for redundant information.
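That could be as simple as the following, using the Users columns from the question (the index name is illustrative):

```sql
-- One unique index makes the OpenID -> user lookup a single index seek,
-- and also guarantees no two users share an OpenID.
CREATE UNIQUE INDEX UX_Users_OpenID ON Users (OpenID);

SELECT UserID, Alias
FROM Users
WHERE OpenID = @open_id;
```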
Other than that: you need to observe how your application behaves, sample some usage data, and then see which queries run most often and take the longest, and then start performance tweaking. Don't overdo the ahead-of-time performance optimizations; too many indices can be worse than having none at all!
Every time somebody gets authenticated with their Open ID, I must find their alias by looking up which user has the given OpenID and it seems that it might be fairly slow if the primary key is the UserID (and there are millions of users).
Actually, quite the contrary! If you have a value that's unique among millions of rows, finding that single value is quite quick, even with millions of users. It will take only a handful (at most 5-6) of comparisons, and bang! You have your one user out of a million. If you have an index on that OpenID column, the lookup should indeed be pretty fast. Such a highly selective index (one value picks out 1 in a million) works very efficiently.

What is the best way to add users to multiple groups in a database?

In an application where users can belong to multiple groups, I'm currently storing their groups in a column called groups as binary data. Every four bytes is a 32-bit integer, which is the GroupID. However, this means that to enumerate all the users in a group I have to programmatically select all users and manually check whether each one contains that group.
Another method was to use a Unicode string, where each character is the integer denoting a group. This makes searching easy, but is a bit of a fudge.
Another method is to create a separate table, linking users to groups. One column called UserID and another called GroupID.
Which of these ways would be the best to do it? Or is there a better way?
You have a many-to-many relationship between users and groups. This calls for a separate table to combine users with groups:
User: (UserId[PrimaryKey], UserName etc.)
Group: (GroupId[PrimaryKey], GroupName etc.)
UserInGroup: (UserId[ForeignKey], GroupId[ForeignKey])
To find all users in a given group, you just say:
select * from User join UserInGroup on User.UserId = UserInGroup.UserId where GroupId = <the GroupId you want>
Rule of thumb: If you feel like you need to encode multiple values in the same field, you probably need a foreign key to a separate table. Your tricks with byte-blocks or Unicode chars are just clever tricks to encode multiple values in one field. Database design should not use clever tricks - save that for application code ;-)
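A minimal sketch of the three tables in SQL Server flavour (names as above; the column types are assumptions):

```sql
CREATE TABLE [User] (
    UserId   INT IDENTITY PRIMARY KEY,
    UserName NVARCHAR(100) NOT NULL
);

CREATE TABLE [Group] (
    GroupId   INT IDENTITY PRIMARY KEY,
    GroupName NVARCHAR(100) NOT NULL
);

-- The join table: one row per membership.
CREATE TABLE UserInGroup (
    UserId  INT NOT NULL REFERENCES [User](UserId),
    GroupId INT NOT NULL REFERENCES [Group](GroupId),
    PRIMARY KEY (UserId, GroupId)  -- also prevents duplicate memberships
);
```

The composite primary key doubles as the index for "users per group" lookups in one direction; an extra index on (GroupId, UserId) covers the other direction.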
I'd definitely go for the separate table - certainly the best relational view of the data. With indexes on both UserID and GroupID you have a quick way of getting users per group and groups per user.
The more standard, usable and comprehensible way is the join table. It's easily supported by many ORMs, in addition to being reasonably performant for most cases. Only resort to "clever" approaches if you have a reason to, say a million users and having to answer that question every half second.
I would make 3 tables: users, groups and usersgroups, the last used as a cross-reference table to link users and groups. In the usersgroups table I would add userId and groupId columns and make them a composite primary key. BTW, what naming conventions are there for those xref tables?
It depends what you're trying to do, but if your database supports it, you might consider using roles. The advantage of this is that the database provides security around roles, and you don't have to create any tables.