SQL Table linking... is it better to have a linking table, or a delimited column? - sql

My database has two tables, one contains a list of users, the other a list of roles. Each user will belong to one or more roles, and of course each role will have multiple users in it.
I've come across two ways to link the information. The first is to add a third table which contains the IDs from both tables. A simple join will then return all the users that belong to a role, or all the roles to which a user belongs. However, as the database grows, the datasets returned by these simple queries will grow exponentially.
The second method is to add a column to the users table in which a delimited list of roles is stored. This will eliminate the need for the third linking table, which may have a positive effect on database growth. The downside is that SQL does not have the ability to use delimited lists. The only way I've found to process that information is to use a temporary table and a custom function.
In viewing my execution plans, the "table scan" event is the one that takes the most resources. It makes sense that eliminating a table from the equation would speed things up. The function takes up less than 1% of the resources.
These tests were done on a database with fewer than 20 records. As the size of the database grows, the table scans will take longer, so perhaps limiting them is the best choice.
If using the delimited list is a good way to go, why is nobody doing it?
Please tell me which is your preferred method (even if it's different from my two) and why.
Thank you.

If you have a delimited list, finding users with a given role is going to become very expensive: effectively, you need to do a FULL scan of that table, and look at all the values for that column in every row, trying to see if it contains a given role.
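To make that cost concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The table name users_delimited, its columns, and the sample data are invented for illustration; the asker's real schema and custom function are not shown. It suggests two problems: a naive substring match returns the wrong users, and the corrected pattern begins with a wildcard, so an ordinary index on the column cannot narrow the search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: role IDs stored as a comma-delimited string.
conn.execute(
    "CREATE TABLE users_delimited (user_id INTEGER PRIMARY KEY, name TEXT, roles TEXT)"
)
conn.executemany(
    "INSERT INTO users_delimited VALUES (?, ?, ?)",
    [(1, "alice", "1,12"), (2, "bob", "2"), (3, "carol", "12,21")],
)


def users_with_role_naive(role_id):
    # Plain substring match: searching for role 2 also hits "12" and "21".
    rows = conn.execute(
        "SELECT name FROM users_delimited "
        "WHERE roles LIKE '%' || ? || '%' ORDER BY name",
        (str(role_id),),
    )
    return [r[0] for r in rows]


def users_with_role_delimited(role_id):
    # Wrap both sides in commas so only whole IDs match. The leading
    # wildcard still means every row's list has to be inspected.
    rows = conn.execute(
        "SELECT name FROM users_delimited "
        "WHERE ',' || roles || ',' LIKE '%,' || ? || ',%' ORDER BY name",
        (str(role_id),),
    )
    return [r[0] for r in rows]


naive = users_with_role_naive(2)      # includes users who only have roles 12 or 21
exact = users_with_role_delimited(2)  # only users who really have role 2
```

Even the corrected version has to string-match against every row, which is the full scan described above.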
A separate table (normalized, many to many relation) is the way to go, and with proper indexing you will not have full scans happening.
eg:
User: UserId, Name, ....
Role: RoleId, Name, ....
UserRole: UserRoleId, UserId, RoleId
(UserRoleId is optional; you could alternatively have the PK be UserId+RoleId. I won't get into the discussion of surrogate vs compound keys here.)
You'll want an index on (UserId, RoleId) that is UNIQUE, to enforce no duplicates. This will also help with any queries where you're trying to see if a specific user has a specific role (WHERE UserId = x AND RoleId = y).
If you are looking up all the roles a user has, you'll want an index that leads with UserId. The (UserId, RoleId) index above already does, so a separate index on just UserId is usually unnecessary.
Conversely, if you are looking up all the users a given role has, an index on just RoleId will speed that up. If you don't run this query, or run it very rarely, then not having this index will speed up inserts and updates slightly, as it is one less thing to maintain. This is the careful balancing act that is database tuning.
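The layout described in this answer could be sketched roughly as follows, using Python's standard-library sqlite3 module. The snake_case table names, the sample rows, and the helper functions are assumptions made for the example, and it takes the compound-key option (no UserRoleId), so the primary key doubles as the unique (UserId, RoleId) index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE roles (role_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- The compound primary key acts as the UNIQUE (user_id, role_id) index.
CREATE TABLE user_roles (
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    role_id INTEGER NOT NULL REFERENCES roles(role_id),
    PRIMARY KEY (user_id, role_id)
);
-- Extra index for "all users in a given role" lookups.
CREATE INDEX idx_user_roles_role ON user_roles (role_id);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO roles VALUES (?, ?)", [(1, "admin"), (2, "editor")])
conn.executemany("INSERT INTO user_roles VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])


def roles_of(user_id):
    rows = conn.execute(
        "SELECT r.name FROM user_roles ur "
        "JOIN roles r ON r.role_id = ur.role_id "
        "WHERE ur.user_id = ? ORDER BY r.name",
        (user_id,),
    )
    return [r[0] for r in rows]


def users_in(role_id):
    rows = conn.execute(
        "SELECT u.name FROM user_roles ur "
        "JOIN users u ON u.user_id = ur.user_id "
        "WHERE ur.role_id = ? ORDER BY u.name",
        (role_id,),
    )
    return [r[0] for r in rows]


def has_role(user_id, role_id):
    row = conn.execute(
        "SELECT 1 FROM user_roles WHERE user_id = ? AND role_id = ?",
        (user_id, role_id),
    ).fetchone()
    return row is not None


def add_role(user_id, role_id):
    # Returns False when the pair already exists (rejected by the unique key).
    try:
        conn.execute("INSERT INTO user_roles VALUES (?, ?)", (user_id, role_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

Here the duplicate check lives in the schema rather than in application code, which is the main point of the unique index.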

A table scan means that you don't have any indexes, or your query doesn't allow them to be used. In a security database, you should rarely if ever have to download the entire list of users/roles, unless this is for an admin application. You need to address this in your design.
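As a rough illustration of the scan-versus-index point, the sketch below asks SQLite (via Python's sqlite3 module) for its query plan before and after an index is added. SQLite stands in for the asker's database here, the user_roles table and index name are made up, and the exact plan wording differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical linking table with no indexes at all.
conn.execute(
    "CREATE TABLE user_roles (user_id INTEGER NOT NULL, role_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO user_roles VALUES (?, ?)", [(u, u % 5) for u in range(100)]
)

QUERY = "SELECT user_id FROM user_roles WHERE role_id = ?"


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_without_index = plan(QUERY, (3,))  # reports a SCAN of the table
conn.execute("CREATE INDEX idx_user_roles_role ON user_roles (role_id)")
plan_with_index = plan(QUERY, (3,))     # reports a SEARCH using the index
```

Printing the two plan strings shows the same query moving from a scan to an index search once a suitable index exists.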
Delimited lists violate first-normal-form (1NF) and almost always cause problems in the long term. What happens if you want to retrieve all users in a particular role? How do you write that query? Don't go down this road. Normalize it.
If you're using correct column types (i.e. not a varchar(4000) or varchar(max) everywhere), disk space really shouldn't be an issue. Yes, it will grow "exponentially" - so what? Databases are good at this kind of scaling. Unless you're trying to run this on a 10 gig hard drive, it's not something to worry about. And if you are trying to run it on a 10 gig hard drive, you probably have bigger issues to worry about.
Short answer: Don't use a delimited list. Normalize.

The first option. It's called a many-to-many join table. This will perform fine if you create appropriate indexes.
Don't go with the second 'denormalised' option.

You could use a separate table or you could go back to cavemen with chisels. The choice is up to you.

A separate table is the way to go, otherwise you're trying to work around your database engine. A separate table is properly normalised - in general, as an application expands, the better it is normalised, the easier you'll find it to work with. What greg said above is also absolutely right.

Although I would highly recommend the normalized method that everyone is suggesting, I do believe that an enum-based role system would let you store a single number in the "roles" column and avoid having to create another table.

Related

Database read complexity

I have a relational table in SQL which relates Users with Permissions, so it sits in the middle of a many-to-many relationship.
The table has its proper indexes and foreign keys.
The users belong to Groups, so I'm wondering whether it would be much more efficient to keep the relation between Groups and Permissions instead of Users and Permissions. That way the table would have far fewer rows, and I'm wondering how performance would compare with the current design.
In the end my question is: what is the complexity of reading a table in a database?
I'm pretty sure it is "theoretically" O(1), but in practice something more like O(n log n) or O(n), because it has to index the data somehow.
By the way, I'm using SQL Server, but I'm pretty sure the answer applies to other SQL databases such as MySQL (excluding non-relational DBMSs like Mongo).
You can never know, even "theoretically", what the complexity of reading a table in a database is, since every table is different from the next.
One may have 2 columns and another 200; one may have 10 records and another 100 million; one may have indexes and another may not, and so on.
All of this affects the complexity, so each case has to be judged on its own.
In addition, yes: if your users are separated into groups (and two people in the same group have the same permissions), then it is more correct to have the relations
GROUPS > PERMISSIONS
USERS > GROUPS
than what you did. If you want to add or remove a group's permissions (and you probably will), your current design makes that complicated, requiring unnecessary joins and resources, instead of just changing the permissions of one group.
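The GROUPS > PERMISSIONS and USERS > GROUPS layout above might look something like this sketch in Python's sqlite3 module. It assumes each user belongs to exactly one group, which the question does not state; the table names (app_groups, app_users, and so on) and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_groups (group_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Assumption: one group per user, so the group is a plain column on the user.
CREATE TABLE app_users (
    user_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    group_id INTEGER NOT NULL REFERENCES app_groups(group_id)
);
CREATE TABLE permissions (permission_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE group_permissions (
    group_id INTEGER NOT NULL REFERENCES app_groups(group_id),
    permission_id INTEGER NOT NULL REFERENCES permissions(permission_id),
    PRIMARY KEY (group_id, permission_id)
);
INSERT INTO app_groups VALUES (1, 'staff'), (2, 'admins');
INSERT INTO app_users VALUES (1, 'alice', 1), (2, 'bob', 1), (3, 'carol', 2);
INSERT INTO permissions VALUES (1, 'read'), (2, 'write');
INSERT INTO group_permissions VALUES (1, 1), (2, 1), (2, 2);
""")


def permissions_of(user_id):
    # Resolve a user's permissions through their group.
    rows = conn.execute(
        "SELECT p.name FROM app_users u "
        "JOIN group_permissions gp ON gp.group_id = u.group_id "
        "JOIN permissions p ON p.permission_id = gp.permission_id "
        "WHERE u.user_id = ? ORDER BY p.name",
        (user_id,),
    )
    return [r[0] for r in rows]


def grant(group_id, permission_id):
    # One row changes the permissions of every user in the group.
    with conn:
        conn.execute(
            "INSERT INTO group_permissions VALUES (?, ?)",
            (group_id, permission_id),
        )
```

Calling grant(1, 2) gives 'write' to every member of the 'staff' group with a single insert, where a Users-Permissions table would need one row per user.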
From a performance perspective, again, as I said, I can't tell you.

User characteristics database schema

I'm really battling an issue where I have a Users table that has a growing number of user characteristics (religion, smoking preferences, etc). The strategy I've used thus far has been to add a column for each preference that keys off onto another table.
For example, if User XYZ has a ReligionId of 3, that could mean they're Christian. At runtime, if I need their religion, I join onto another table.
This strategy has worked so far. However, I'm getting concerned about the number of columns in the tables as the number of preferences is increasing. Also, this strategy leads to many joins if I need to get all values for a single user.
I'd like to find out the most normalized way of representing this data. Anybody have any ideas?
I'd like to find out the most normalized way of representing this data.
Well, from what you describe, you seem to have quite a normalized database.
What you are looking for if you want to reduce the number of joins is denormalization.
For instance, if you want to access a subset of those user preferences with a smaller number of joins, you might want to cache them in a UserDetails table, and link that in the User table with a UserDetailsId foreign key.
This might actually be feasible in case you have a subset of seldom-changing values (for instance one's religion does not often change).
The drawback is that in case one of these changes you might have to change the info in two places (depending on if you want to also keep the normalized version of that data or not).
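A small sketch of that trade-off, using Python's sqlite3 module, might look like the following. The schema is an assumption built from the names in this thread (a religions lookup table, a UserDetails cache, a UserDetailsId foreign key); only one characteristic is cached to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE religions (religion_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE user_details (
    user_details_id INTEGER PRIMARY KEY,
    religion_name TEXT              -- cached copy of religions.name
);
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    religion_id INTEGER REFERENCES religions(religion_id),
    user_details_id INTEGER REFERENCES user_details(user_details_id)
);
INSERT INTO religions VALUES (1, 'Buddhist'), (2, 'Hindu'), (3, 'Christian');
INSERT INTO user_details VALUES (10, 'Christian');
INSERT INTO users VALUES (1, 'xyz', 3, 10);
""")


def normalized_religion(user_id):
    # Normalized read: one join per characteristic.
    row = conn.execute(
        "SELECT r.name FROM users u "
        "JOIN religions r ON r.religion_id = u.religion_id "
        "WHERE u.user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


def cached_religion(user_id):
    # Denormalized read: a single join fetches every cached characteristic.
    row = conn.execute(
        "SELECT d.religion_name FROM users u "
        "JOIN user_details d ON d.user_details_id = u.user_details_id "
        "WHERE u.user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


def set_religion(user_id, religion_id):
    # The drawback: both copies must change together, so use one transaction.
    with conn:
        conn.execute(
            "UPDATE users SET religion_id = ? WHERE user_id = ?",
            (religion_id, user_id),
        )
        conn.execute(
            "UPDATE user_details SET religion_name = "
            "(SELECT name FROM religions WHERE religion_id = ?) "
            "WHERE user_details_id = "
            "(SELECT user_details_id FROM users WHERE user_id = ?)",
            (religion_id, user_id),
        )
```

With more cached columns the read side stays at one join, while every write path has to remember to update the cache as well.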
I hope this helps. Feel free to ask for additional clarification.

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me: I will not do this, because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. They use indexes to quickly find what you are after. Think of a phone book: how effective is the index? Would it be better to have a different book for each last name?
This will not give you anything performance-wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. If you split the table up, however, it becomes impossible (or really, really hard) to get any info that spans multiple users, like searching all users for a certain widget or counting all widgets of a certain type. Every query would need to be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about: 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted? Can you delete the rows at a later time, when there is less database activity? Is the delete slow because it is being blocked by other activity? Look into those before you break up the table.
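The "soft delete now, purge later" idea could be sketched as follows in Python's sqlite3 module. The widgets schema, the 'A' status value for active rows, and the batch-size parameter are assumptions for the example; a production version would depend on the real engine and its locking behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widgets (
    widget_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'A'   -- 'A' = active, 'D' = soft-deleted
);
CREATE INDEX idx_widgets_user ON widgets (user_id);
""")
conn.executemany(
    "INSERT INTO widgets (user_id) VALUES (?)", [(1,)] * 5 + [(2,)] * 3
)


def soft_delete_user(user_id):
    # Cheap UPDATE in place of an immediate DELETE.
    with conn:
        conn.execute(
            "UPDATE widgets SET status = 'D' WHERE user_id = ?", (user_id,)
        )


def active_count(user_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM widgets WHERE user_id = ? AND status = 'A'",
        (user_id,),
    ).fetchone()
    return row[0]


def purge_deleted(batch_size):
    # Physically remove soft-deleted rows in small batches, e.g. off-peak,
    # so that no single transaction touches a huge number of rows.
    total = 0
    while True:
        with conn:
            cur = conn.execute(
                "DELETE FROM widgets WHERE widget_id IN "
                "(SELECT widget_id FROM widgets WHERE status = 'D' LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Queries then need to filter on status, which is the usual price of a soft delete.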
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
Using an index on userID should result nearly in the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables dynamically as a means of partitioning?
No. (smile)

How does row design influence MySQL performance?

I have a users table and a forum where users can write. Every action on the forum uses the users table. A user can have a profile, which can be quite big (50KB). If I have such big data in each row, wouldn't it be faster to have a separate table with users' profiles and other data that isn't accessed very often?
In an online RPG game each character has a long list of abilities, for example: pistols experience, machine guns experience, throwing grenades experience, and 15 more. Is it better to store them in a string as numbers separated with semicolons, which would take more space than integers, or should I make an individual field for each ability? Or maybe binary? (I use C++)
If you don't need the data from specific columns, don't get it. Don't do SELECT * but SELECT a, b,...
If you need to do SQL-queries over certain columns e.g. ORDER BY pistols_experience, you should leave it in different columns. If you just display it all at once, you could serialize the different key-value-pairs into a text field via YAML, JSON etc.
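One way to picture that split is the sketch below, written with Python's sqlite3 and json modules rather than the asker's C++ and MySQL setup. The characters table, the choice of which abilities go where, and the sample values are all invented for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE characters (
    character_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    pistols_experience INTEGER NOT NULL,  -- sorted and filtered in SQL
    other_abilities TEXT NOT NULL         -- JSON text, only shown as a whole
)""")


def add_character(name, pistols, others):
    conn.execute(
        "INSERT INTO characters (name, pistols_experience, other_abilities) "
        "VALUES (?, ?, ?)",
        (name, pistols, json.dumps(others)),
    )


def top_pistol_users(limit):
    # Only possible efficiently because the value has its own column.
    rows = conn.execute(
        "SELECT name FROM characters "
        "ORDER BY pistols_experience DESC LIMIT ?",
        (limit,),
    )
    return [r[0] for r in rows]


def abilities_of(name):
    # Select just the column that is needed, then decode it in the application.
    row = conn.execute(
        "SELECT other_abilities FROM characters WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0])


add_character("ann", 40, {"machine_guns": 12, "grenades": 5})
add_character("ben", 75, {"machine_guns": 3, "grenades": 9})
add_character("cy", 10, {"machine_guns": 30, "grenades": 1})
```

Anything inside the serialized field is opaque to ordinary SQL, so a value moves into its own column as soon as a query needs to sort or filter on it.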
(1) Not in itself, no. As stefan says, you should be selecting only what you want, so having stuff you don't want in the table is no issue. A 50K TEXT blob is only a pointer in the row.
However, there can be an issue if you are using MyISAM tables. In MyISAM there is only table-level locking, so when you have one user update their row (eg. last visit time), it blocks all other users from accessing the table. In this case you might experience some improvement by breaking out heavily-updated columns into a separate table from the relatively static but heavily-selected ones.
But you don't want to be using MyISAM anyway: it's a bit crap. Use InnoDB, get row-level locking (and transactions, and foreign key constraints), and don't worry about it. The main remaining reason to use MyISAM tables was fulltext search, which InnoDB did not support before MySQL 5.6.
(2) You would normally separate every independent value into its own field. If you hit a real performance issue and you don't need to do database-level manipulation of the values on their own, you could consider denormalising it, but you'd be losing the power of the database.

Table with a lot of columns

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows with only the core info that can be read quickly, and an extra table containing all the info you rarely use that you can look up when needed.
Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of the members info table, do you really want to run the risk of having to re-test the preferences and account details elements of your app as well, to ensure there are no knock-on impacts?
Also, for audit trail purposes, you may want to track the last user ID/timestamp to change a member's data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to track updates more easily.
Not quite a SQL/performance answer, but maybe something to look at from a DB and app design point of view.
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
1-1 may be easier, if you have, say, Member_Info, Member_Pref and Member_Profile. Having too many columns in one table can cause problems if you want lots of varchar(255) columns, as you may go over the row-size limit, and it just makes the table too confusing.
Just make sure you have the correct foreign key constraints and so on, so there's always 1 row in each table with the same member_id.
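A 1-to-1 split along those lines could be sketched like this with Python's sqlite3 module (the asker is on SQL Server, so treat the syntax as illustrative). The column names email, newsletter and bio, and the create_member helper, are made up. Note that the foreign keys only prevent orphan rows; having exactly one row per member in each table also relies on inserting all three in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE member_info (member_id INTEGER PRIMARY KEY, email TEXT NOT NULL);
-- member_id is both primary key and foreign key: at most one row per member.
CREATE TABLE member_pref (
    member_id INTEGER PRIMARY KEY
        REFERENCES member_info(member_id) ON DELETE CASCADE,
    newsletter INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE member_profile (
    member_id INTEGER PRIMARY KEY
        REFERENCES member_info(member_id) ON DELETE CASCADE,
    bio TEXT NOT NULL DEFAULT ''
);
""")


def create_member(member_id, email):
    # Insert into all three tables in one transaction so they stay in step.
    with conn:
        conn.execute("INSERT INTO member_info VALUES (?, ?)", (member_id, email))
        conn.execute("INSERT INTO member_pref (member_id) VALUES (?)", (member_id,))
        conn.execute("INSERT INTO member_profile (member_id) VALUES (?)", (member_id,))


def row_counts(member_id):
    # Number of rows for this member in each of the three tables.
    return tuple(
        conn.execute(
            "SELECT COUNT(*) FROM " + table + " WHERE member_id = ?", (member_id,)
        ).fetchone()[0]
        for table in ("member_info", "member_pref", "member_profile")
    )


def orphan_insert_rejected():
    # A preference row for a member that does not exist should be refused.
    try:
        with conn:
            conn.execute("INSERT INTO member_pref (member_id) VALUES (999)")
        return False
    except sqlite3.IntegrityError:
        return True
```

Each category can then be read or updated on its own, with a join on member_id for the rarer cases that need everything.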