To share a table or not share? - sql

Right now on my (beta) site i have a table called user data which stores name, hash(password), ipaddr, sessionkey, email and message number. Now i would like the user to have a profile description, signature, location (optional) and maybe other things. Should i have this in a separate mysql table? or should i share the table? and why?

This is called vertical partitioning, and in general it will not make much difference whichever option you go for.
One vertical partitioning method is to split fields that are updated frequently from fields that are updated rarely. Imagine a Users table in Stack Overflow's database, having User_ID, DisplayName, Location, etc... and then UpVotes, DownVotes, LastSeen, NumberOfQuestionsAsked, etc. The first set of fields changes rarely (only on profile edits), while the latter set change frequently (on normal activity).
Another splitting method could be between data that is accessed frequently, from data that is accessed rarely.
This kind of partitioning, in some particular situations, can yield better performance.

Use one table.
Because there is no reason to separate.

IMHO, the authentication information and the profile info should be separate. This will secure your data. Of course, if the trust level is high, you can go for a merged table. But the information in the table might grow over time and at the end you will have to separate them out. So why to mess now?

If the tables you are thinking of will have a one-to-one relationship, there's usually no reason to do it.
Exceptions, both of which only apply when there are many, many columns:
if there are a few columns that will always be filled and the rest almost never -- in this case, there can be some benefit from omitting empty rows in the second table, and only fetching them when needed.
if there are a few columns that will constantly be retrieved/updated and the rest almost never.
But again, this is not an optimization you should do at the beginning. If you have your query code reasonably isolated, it's not hard to do this later on.
There are also some relevant comments on this elsewhere on StackOverflow

I think it depends on the your application nature or you can say requirement.
I prefer it should be in the different tables.
Consider example where I need users email, message number and store's name.
So when I find all the the user from the table and all my profile related data in the same table, I get all the unwanted columns in the result set. To overcome this, I can use the SELECT only columns I want but that makes my query very ugly.
Similarly when I need all profile data I have to use profile columns in select clause.
So always suggest to separate the tables wherever it is possible.

Related

User characteristics database schema

I'm really battling an issue where I have a Users table that has a growing number of user characteristics (regligion, smoking preferences, etc). The strategy I've used thus far has been to add a column for each preference that keys off onto another table.
For example, if User XYZ has a RelgionId of 3, that could mean they're Christian. At runtime, if I need their religion, I join onto another table.
This strategy has worked so far. However, I'm getting concerned about the number of columns in the tables as the number of preferences is increasing. Also, this strategy leads to many joins if I need to get all values for a single user.
I'd like to find out the most normalized way of representing this data. Anybody have any ideas?
I'd like to find out the most normalized way of representing this data.
Well, from what you describe, you seem to have quite a normalized database.
What you are looking for if you want to reduce the number of joins is denormalization.
For instance, if you want to access a subset of those user preferences with a smaller number of joins, you might want to cache them in a UserDetails table, and link that in the User table with a UserDetailsId foreign key.
This might actually be feasible in case you have a subset of seldom-changing values (for instance one's religion does not often change).
The drawback is that in case one of these changes you might have to change the info in two places (depending on if you want to also keep the normalized version of that data or not).
I hope this helps. Feel free to ask for additional clarification.

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me, I will not do this because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. they use indexes to quickly find what you are after. think phone book how effective is the index? would it be better to have a different book for each last name?
This will not give you anything performance wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. however if you split the table up, it becomes impossible/really really hard to get any info that spans multiple users, like search all users for a certain widget, count of all widgets of a certain type, etc. you need to have every query be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted. Can you delete the rows at a later time, with less database activity. is the delete slow because it is being blocked by other activity. look into those before you break up the table.
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
Using an index on userID should result nearly in the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables
dynamically as a means of partitioning?
No. (smile)

SQL Table linking... is it better to have a linking table, or a delimited column?

My database has two tables, one contains a list of users, the other a list of roles. Each user will belong to one or more roles, and of course each role will have multiple users in it.
I've come across two ways to link the information. The first is to add a third table which contains the ID's from both tables. A simple join will then return all the users that belong to a role, or all the roles to which a user belongs. However, as the database grows, the datasets returned by these simple queries will grow exponentially.
The second method is to add a column to the users table in which a delimited list of roles is stored. This will eliminate the need for the third linking table, which may have a positive effect on database growth. The downside is that SQL does not have the ability to use delimited lists. The only way I've found to process that information is to use a temporary table and a custom function.
Is viewing my execution plans, the "table scan" event is the one that takes the most resources. It makes sense that eliminating a table from the equation would speed things up. The function takes up less than 1% of the resources.
These tests were done on a database with less than 20 records. As the size of the database grows, the table scans will take longer, so perhaps limiting them is the best choice.
If using the delimited list is a good way to go, why is nobody doing it?
Please tell me which is your preferred method (even if it's different from my two) and why.
Thank you.
If you have a delimited list, finding users with a given role is going to become very expensive: effectively, you need to do a FULL scan of that table, and look at all the values for that column in every row, trying to see if it contains a given role.
A separate table (normalized, many to many relation) is the way to go, and with proper indexing you will not have full scans happening.
eg:
User: UserId, Name, ....
Role: RoleId, Name, ....
UserRole: UserRoleId, UserId, RoleId
(UserRoleId is optional, you could alternatively have the PK be UserId+RoleId, I won't get into the discussion here of surrogate vs compound keys here)
You'll want an index on (UserId, RoleId) that is UNIQUE, to enforce no duplicates. This will also help with any queries where you're trying to see if a specific user has a specific role (WHERE userId = x AND roleId = y)
If you are looking up all the roles a user has, you'll want an index on just UserId.
Conversely, if you are looking up all the users a given role has, an index on just roleId will speed that up. If you don't do this query, or do it very rarely, then not having this index will speed up performance slightly for insert/updates, as it is one less thing to do. This is the careful balancing act that is database tuning.
A table scan means that you don't have any indexes, or your query doesn't allow them to be used. In a security database, you should rarely if ever have to download the entire list of users/roles, unless this is for an admin application. You need to address this in your design.
Delimited lists violate first-normal-form (1NF) and almost always cause problems in the long term. What happens if you want to retrieve all users in a particular role? How do you write that query? Don't go down this road. Normalize it.
If you're using correct column types (i.e. not a varchar(4000) or varchar(max) everywhere), disk space really shouldn't be an issue. Yes, it will grow "exponentially" - so what? Databases are good at this kind of scaling. Unless you're trying to run this on a 10 gig hard drive, it's not something to worry about. And if you are trying to run it on a 10 gig hard drive, you probably have bigger issues to worry about.
Short answer: Don't use a delimited list. Normalize.
The first option. It's called a many-to-many join table. This will perform fine if you create appropriate indexes.
Don't go with the second 'denormalised' option.
You could use a separate table or you could go back to cavemen with chisels. The choice is up to you.
A separate table is the way to go, otherwise you're trying to work around your database engine. A separate table is properly normalised - in general, as an application expands, the better it is normalised, the easier you'll find it to work with. What greg said above is also absolutely right.
Although I would highly recommend the normalized method that everyone is suggesting. I do believe that having an enum based role system would allow you to have one digit for the "roles" column and allow you to avoid having to create another table.

Table with a lot of columns

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the less rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows with only the core info that can be read quickly, and an extra table containing all the info you rarely use that you can lookup when needed.
Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to say your profile details section of the members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well to ensure no knock on impacts.
Also for audit trail purposes if you want to track the last user ID/Timestamp to change a members data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/Performance answer but maybe something to look at from a DB & App design pov
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
1-1 may be easier, if you have say Member_Info; Member_Pref; Member_Profile. Having too many columns can make it run if you want lots of varchar(255) as you may go over the rowsize limit, and it just makes it too confusing.
Just make sure you have the correct forgein key constraints and suchwhat, so there's always 1 row in each table with the same member_id

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.
From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.