How does row design influence MySQL performance?

I have a users table and a forum where users can write; every action on the forum uses the users table. A user can have a profile, which can be quite big (50KB). If I have such big data in each row, wouldn't it be faster to have a separate table for users' profiles and other data that isn't accessed very often?
In an online RPG game, each character has a long list of abilities, for example: pistols experience, machine guns experience, grenade-throwing experience, and 15 more. Is it better to store them in a string as numbers separated by semicolons, which would take more space than integers, or should I make an individual field for each ability? Or maybe binary? (I use C++.)

If you don't need the data from specific columns, don't fetch it: don't do SELECT * but SELECT a, b, .... If you need to run SQL queries over certain columns, e.g. ORDER BY pistols_experience, you should keep them in separate columns. If you just display them all at once, you could serialize the key-value pairs into a text field via YAML, JSON, etc.
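For instance, a minimal sketch of the difference (the table and column names are made up for illustration):

    -- Separate columns: the database can select, filter and sort on them.
    SELECT id, name, pistols_experience
    FROM characters
    ORDER BY pistols_experience DESC
    LIMIT 10;

    -- Serialized alternative: one opaque text blob, e.g.
    -- abilities = '{"pistols": 12, "machine_guns": 3, "grenades": 7}'.
    -- The database cannot index, sort or filter on it without parsing it:
    SELECT id, abilities FROM characters WHERE id = 42;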

(1) Not in itself, no. As Stefan says, you should be selecting only what you want, so having columns you don't want in the table is not an issue. A 50KB TEXT value is only a pointer in the row.
However, there can be an issue if you are using MyISAM tables. MyISAM has only table-level locking, so when one user updates their row (e.g. last visit time), it blocks all other users from accessing the table. In this case you might see some improvement by breaking the heavily-updated columns out into a separate table from the relatively static but heavily-selected ones.
But you don't want to be using MyISAM anyway: it's a bit crap. Use InnoDB, get row-level locking (and transactions, and foreign key constraints), and don't worry about it. The only reason to use MyISAM tables today is fulltext search, which InnoDB doesn't support (at least not before MySQL 5.6, where InnoDB gained FULLTEXT indexes).
(2) You would normally separate every independent value into its own field. If you hit a real performance issue and you don't need to do database-level manipulation of the values on their own, you could consider denormalising it, but you'd be losing the power of the database.
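As for the split suggested under (1), a minimal sketch (all names invented): the small, hot columns stay in users, while the rarely-read profile moves behind the primary key:

    CREATE TABLE users (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(64)  NOT NULL,
        last_visit DATETIME     NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE user_profiles (
        user_id INT UNSIGNED NOT NULL PRIMARY KEY,
        profile MEDIUMTEXT   NOT NULL,
        FOREIGN KEY (user_id) REFERENCES users (id)
    ) ENGINE=InnoDB;

    -- Forum pages touch only the narrow table; the profile page
    -- joins in the big text only when it is actually displayed:
    SELECT u.name, p.profile
    FROM users u
    JOIN user_profiles p ON p.user_id = u.id
    WHERE u.id = 42;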

Related

Having all contact information in one table vs. using key-value tables

(NB. The question is not a duplicate of this, since I am dealing with an ORM system.)
I have a table in my database to store all contact information. Some of the columns for each contact are fixed (e.g. Id, InsertDate and UpdateDate). In my program I would like to give the user the option to add or remove properties for each contact.
Now there are of course two alternatives here:
The first is to save it all in one table and add or remove entire columns when the user needs to;
the second is to create a key-value table that saves each property alongside its type and connects the record to the user's id.
These alternatives are both doable, but I am wondering which one is better in terms of speed. In the program it will be a very common thing for the user to view the entire contact list to check for updates. Also, I am using an ORM framework (Microsoft's Entity Framework) to deal with database queries, so if the user adds and removes columns from a table all the time, mapping them in my program will be a difficult task. But again, if alternative (1) is a significantly better option than (2), then I can reconsider the key-value option.
I have actually done both of these.
Example #1
A large, wide table with columns holding names, phone numbers, addresses, and lots of small integer values tracking details of the clients.
Example #2
Many different tables separating out all of the character varying data fields, the small integer values, etc.
Example #1 was a lot faster to code for, but in terms of performance it got pretty slow once the table filled with records: 5,000 wasn't a problem, but by 50,000 there was a noticeable performance degradation.
Example #2 was built later in my coding experience, specifically to resolve the issues found in Example #1. While it took more work to get the records I was after (LEFT JOIN this and UNION that), it was MUCH faster, because you could pick and choose EXACTLY what the client was after without having to search a massive wide table full of data that was not all being requested.
I would recommend Example #2 to fit your #2 in the question.
And the USER-specified columns for each data set could be stored in a table of their own (depending on how many you have, I suppose), which would allow you to draw on the table specific to that USER and give you unlimited ability to add and remove columns to suit that particular setup.
You could then also have another table which keeps track of the custom columns in the custom-column table, giving you the ability to "recover" columns later, as in "Do you want to add this to your current column choices, or restore one of these columns you have deleted in the past?"
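A possible sketch of that bookkeeping table (hypothetical names; the deleted flag is what makes recovery possible):

    CREATE TABLE custom_columns (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id INT UNSIGNED NOT NULL,
        name    VARCHAR(64)  NOT NULL,
        type    VARCHAR(32)  NOT NULL,             -- e.g. 'VARCHAR(100)', 'INT'
        deleted TINYINT(1)   NOT NULL DEFAULT 0,   -- 1 = removed, but recoverable
        UNIQUE KEY (user_id, name)
    ) ENGINE=InnoDB;

    -- Offer back a column the user deleted earlier:
    SELECT name, type FROM custom_columns WHERE user_id = 7 AND deleted = 1;
    UPDATE custom_columns SET deleted = 0 WHERE user_id = 7 AND name = 'birthday';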

Building a MySQL database that can take an infinite number of fields

I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default NULLs mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always null.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table, since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but would be incredibly painful to work with from both a processing and a programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values become challenging: generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not ideal in all situations, they are appropriate in a situation such as the one you describe. You are adding attributes indefinitely, and you would quickly run over the maximum number of columns allowed in a single table (typically between 1,000 and 4,000, depending on the database). You can also keep track of each value in each column separately; if they are added at different times, for instance, you can keep a timestamp on when they go in.
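As a sketch (names are illustrative, based on the respondent_supplemental_demographic_info idea from the question), the EAV table plus the conditional aggregation that pivots it back into a wide view:

    CREATE TABLE respondent_attributes (
        response_id INT UNSIGNED NOT NULL,
        attribute   VARCHAR(64)  NOT NULL,
        value       VARCHAR(255) NOT NULL,   -- one value column: everything is a string
        PRIMARY KEY (response_id, attribute)
    ) ENGINE=InnoDB;

    -- One conditional aggregate per attribute of interest:
    SELECT response_id,
           MAX(CASE WHEN attribute = 'can_drive_manual_indicator' THEN value END) AS can_drive_manual,
           MAX(CASE WHEN attribute = 'age_bracket'                THEN value END) AS age_bracket
    FROM respondent_attributes
    GROUP BY response_id;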
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer, plus metadata tables (in a separate DB) describing those tables. With this metadata you can generate SQL, etc. That is definitely superior to having many null columns, or to copied, adapted scripts. It requires a bit of programming: an application uses the metadata to generate SQL, collect the data (without customer-specific semantic knowledge) and generate reports.
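A rough sketch of what such metadata tables might look like (purely hypothetical names):

    -- Describes each client's extra table and its fields; the application
    -- reads these rows to generate the per-client DDL and queries.
    CREATE TABLE client_tables (
        client_id  INT UNSIGNED NOT NULL PRIMARY KEY,
        table_name VARCHAR(64)  NOT NULL
    );

    CREATE TABLE client_fields (
        client_id  INT UNSIGNED NOT NULL,
        field_name VARCHAR(64)  NOT NULL,
        field_type VARCHAR(32)  NOT NULL,   -- e.g. 'TINYINT(1)', 'VARCHAR(100)'
        PRIMARY KEY (client_id, field_name)
    );

    -- From which the application might emit, for example:
    -- CREATE TABLE survey_acme_extra (response_id INT, can_drive_manual TINYINT(1));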

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me: I will not do this, because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. They use indexes to quickly find what you are after. Think of a phone book: how effective is its index? Would it be better to have a different book for each last name?
This will not gain you anything performance-wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. However, if you split the table up, it becomes impossible (or really, really hard) to get any info that spans multiple users, like searching all users for a certain widget or counting all widgets of a certain type, and every query has to be built dynamically.
If deleting rows is slow, look into that. How many rows at a time are we talking about: 10, 1,000, 100,000? What is the clustered index on this table? Could you use a "soft delete", where a status column is UPDATEd to 'D' to mark the row as deleted? Could you delete the rows at a later time, with less database activity? Is the delete slow because it is being blocked by other activity? Look into those before you break up the table.
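For example (a sketch, assuming the widgets table gains the status column suggested above):

    -- Make per-user lookups and deletes index-driven:
    CREATE INDEX idx_widgets_user ON widgets (userID);

    -- "Soft delete": instant from the application's point of view...
    UPDATE widgets SET status = 'D' WHERE userID = 42;

    -- ...then purge in small batches off-peak, repeating until 0 rows are
    -- affected, so no single long-running DELETE blocks other activity:
    DELETE FROM widgets WHERE status = 'D' LIMIT 1000;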
No, that would be a bad idea. However, some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which achieves the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is usually only done in very large databases.
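MySQL has supported this since 5.1 as well; a minimal sketch with hypothetical columns (note that MySQL requires the partitioning column to appear in every unique key, hence the composite primary key):

    CREATE TABLE widgets (
        id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        userID INT UNSIGNED    NOT NULL,
        name   VARCHAR(100),
        PRIMARY KEY (id, userID)   -- partition key must be part of every unique key
    ) ENGINE=InnoDB
    PARTITION BY HASH (userID)
    PARTITIONS 16;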
Using an index on userID should give nearly the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables dynamically as a means of partitioning?
No. (smile)

MySQL Table with TEXT column

I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mention that it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages in putting a TEXT column in a table of its own. What are the potential problems of having it with the rest of the fields? And potential problems of keeping it in a separated table?
This table (without the TEXT field) is supposed to be searched (SELECTed) rather frequently. Is "premature optimization considered evil" important here? (If there really is a penalty for TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed?)
Besides, are there any good links on this topic? (Perhaps Stack Overflow questions & answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions.)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second-guess what makes MySQL slow without real data, and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a simple, normalised schema, then when something proves too slow, add complexity only where and if needed.
As others have pointed out, the quote you mentioned is more applicable to query results than to the schema definition; in any case, your choice of storage engine affects the validity of the advice anyway.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately (beyond a short inline prefix, depending on the storage engine and row format). Whenever you SELECT * from a table with TEXT column(s), each row in the result set requires a lookup into the text storage area. This, coupled with the very real possibility of huge amounts of data, would be a big overhead for your system.
Moving the column to another table simply requires an additional lookup: one into the secondary table, plus the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to usually select all columns from tables. That is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
The concern is that a large text field (say, well over 8,192 bytes) will cause excessive paging and/or file I/O during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata, since it doesn't actually contain data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However, if you control the code 100%, then for simplicity leave the field on the table and only select it when you need it, to cut down on data transfer and reading time.
Now, I believe I've seen some place mention that it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
This indeed tells you that in MySQL you are discouraged from keeping TEXT data (and BLOB, as written elsewhere) in frequently searched tables.
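A hedged sketch of that advice (made-up names): keep the frequently searched columns in a narrow table and join in the text only when a single row is displayed:

    CREATE TABLE articles (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title   VARCHAR(200) NOT NULL,
        created DATETIME     NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE article_bodies (
        article_id INT UNSIGNED NOT NULL PRIMARY KEY,
        body       TEXT         NOT NULL,
        FOREIGN KEY (article_id) REFERENCES articles (id)
    ) ENGINE=InnoDB;

    -- Frequent searches scan only the narrow table:
    SELECT id, title FROM articles WHERE created > '2012-01-01';

    -- The body is joined in only for the one row being displayed:
    SELECT a.title, b.body
    FROM articles a
    JOIN article_bodies b ON b.article_id = a.id
    WHERE a.id = 42;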

SQL Table linking... is it better to have a linking table, or a delimited column?

My database has two tables, one contains a list of users, the other a list of roles. Each user will belong to one or more roles, and of course each role will have multiple users in it.
I've come across two ways to link the information. The first is to add a third table which contains the ID's from both tables. A simple join will then return all the users that belong to a role, or all the roles to which a user belongs. However, as the database grows, the datasets returned by these simple queries will grow exponentially.
The second method is to add a column to the users table in which a delimited list of roles is stored. This will eliminate the need for the third linking table, which may have a positive effect on database growth. The downside is that SQL does not have the ability to use delimited lists. The only way I've found to process that information is to use a temporary table and a custom function.
In viewing my execution plans, the "table scan" event is the one that takes the most resources. It makes sense that eliminating a table from the equation would speed things up. The function takes up less than 1% of the resources.
These tests were done on a database with less than 20 records. As the size of the database grows, the table scans will take longer, so perhaps limiting them is the best choice.
If using the delimited list is a good way to go, why is nobody doing it?
Please tell me which is your preferred method (even if it's different from my two) and why.
Thank you.
If you have a delimited list, finding users with a given role is going to become very expensive: effectively, you need to do a FULL scan of that table, and look at all the values for that column in every row, trying to see if it contains a given role.
A separate table (normalized, many to many relation) is the way to go, and with proper indexing you will not have full scans happening.
eg:
User: UserId, Name, ....
Role: RoleId, Name, ....
UserRole: UserRoleId, UserId, RoleId
(UserRoleId is optional; you could alternatively have the PK be UserId+RoleId. I won't get into the discussion of surrogate vs compound keys here.)
You'll want an index on (UserId, RoleId) that is UNIQUE, to enforce no duplicates. This will also help with any queries where you're trying to see if a specific user has a specific role (WHERE userId = x AND roleId = y)
If you are looking up all the roles a user has, you'll want an index on just UserId.
Conversely, if you are looking up all the users a given role has, an index on just roleId will speed that up. If you don't do this query, or do it very rarely, then not having this index will speed up performance slightly for insert/updates, as it is one less thing to do. This is the careful balancing act that is database tuning.
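As concrete DDL (a sketch; it assumes the User and Role tables above exist, keyed on UserId and RoleId):

    CREATE TABLE UserRole (
        UserId INT UNSIGNED NOT NULL,
        RoleId INT UNSIGNED NOT NULL,
        PRIMARY KEY (UserId, RoleId),   -- unique, and serves "all roles for a user"
        KEY idx_role (RoleId),          -- serves "all users in a role"
        FOREIGN KEY (UserId) REFERENCES User (UserId),
        FOREIGN KEY (RoleId) REFERENCES Role (RoleId)
    ) ENGINE=InnoDB;

    -- Does user 42 have role 3? (uses the primary key)
    SELECT 1 FROM UserRole WHERE UserId = 42 AND RoleId = 3;

    -- All users in role 3 (uses idx_role):
    SELECT u.Name
    FROM UserRole ur
    JOIN User u ON u.UserId = ur.UserId
    WHERE ur.RoleId = 3;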
A table scan means that you don't have any indexes, or your query doesn't allow them to be used. In a security database, you should rarely if ever have to download the entire list of users/roles, unless this is for an admin application. You need to address this in your design.
Delimited lists violate first-normal-form (1NF) and almost always cause problems in the long term. What happens if you want to retrieve all users in a particular role? How do you write that query? Don't go down this road. Normalize it.
If you're using correct column types (i.e. not a varchar(4000) or varchar(max) everywhere), disk space really shouldn't be an issue. Yes, it will grow "exponentially" - so what? Databases are good at this kind of scaling. Unless you're trying to run this on a 10 gig hard drive, it's not something to worry about. And if you are trying to run it on a 10 gig hard drive, you probably have bigger issues to worry about.
Short answer: Don't use a delimited list. Normalize.
The first option. It's called a many-to-many join table. This will perform fine if you create appropriate indexes.
Don't go with the second 'denormalised' option.
You could use a separate table or you could go back to cavemen with chisels. The choice is up to you.
A separate table is the way to go, otherwise you're trying to work around your database engine. A separate table is properly normalised - in general, as an application expands, the better it is normalised, the easier you'll find it to work with. What greg said above is also absolutely right.
Although I would highly recommend the normalized method that everyone is suggesting, I do believe that an enum-based role system would allow you to have a single value for the "roles" column and avoid having to create another table.
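If that means packing one bit flag per role into a single integer, a hedged sketch (the role values are invented):

    -- Hypothetical flags: ADMIN = 1, EDITOR = 2, MODERATOR = 4, ...
    UPDATE users SET roles = roles | 4 WHERE id = 42;   -- grant MODERATOR

    -- Find all moderators. Note the bitwise predicate cannot use an index,
    -- which is one reason the normalized join table is usually preferred:
    SELECT id, name FROM users WHERE roles & 4 <> 0;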