Table with a lot of columns - sql

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.

80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.

Re the performance question - it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows holding only the core info that can be read quickly, and an extra table containing all the info you rarely use, which you can look up when needed.

Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of your members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well, to ensure there are no knock-on impacts?
Also, for audit trail purposes, you may want to track the last user ID/timestamp to change a member's data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/performance answer, but maybe something to look at from a DB & app design point of view.

Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
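As a rough illustration of the child-table idea above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The `product` and `product_colour` tables and the `colours_for` helper are hypothetical names chosen for the example; only the Colour1/Colour2/Colour3 pattern comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- One row per colour instead of fixed Colour1, Colour2, Colour3 columns.
    CREATE TABLE product_colour (
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        colour     TEXT NOT NULL,
        PRIMARY KEY (product_id, colour)
    );
""")
conn.execute("INSERT INTO product VALUES (1, 'widget')")
# A fourth colour needs no schema change, just another row.
conn.executemany(
    "INSERT INTO product_colour VALUES (?, ?)",
    [(1, "red"), (1, "green"), (1, "blue"), (1, "black")],
)

def colours_for(product_id):
    """Return all colours stored for one product, sorted by name."""
    rows = conn.execute(
        "SELECT colour FROM product_colour WHERE product_id = ? ORDER BY colour",
        (product_id,),
    )
    return [colour for (colour,) in rows]

print(colours_for(1))
```

The composite primary key also stops the same colour being recorded twice for one product, which a set of numbered columns cannot do by itself.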
Rob.

1-1 may be easier, if you have say Member_Info; Member_Pref; Member_Profile. Having too many columns can cause trouble if you want lots of varchar(255) columns, as you may go over the row size limit, and it just makes the table too confusing.
Just make sure you have the correct foreign key constraints and suchlike, so there's always 1 row in each table with the same member_id.
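A minimal sketch of that 1-to-1 layout, using Python's sqlite3 module in place of SQL Server. The table names follow the answer; the columns (`email`, `newsletter`, `display_name`) and the `add_member` helper are assumptions made up for the example. Note that the foreign keys only prevent orphan rows; it is the single transaction in `add_member` that keeps one row per table for each member.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE member_info (
        member_id INTEGER PRIMARY KEY,
        email     TEXT NOT NULL
    );
    -- Sharing the primary key makes each relationship at most 1-to-1.
    CREATE TABLE member_pref (
        member_id  INTEGER PRIMARY KEY REFERENCES member_info(member_id),
        newsletter INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE member_profile (
        member_id    INTEGER PRIMARY KEY REFERENCES member_info(member_id),
        display_name TEXT NOT NULL
    );
""")

def add_member(member_id, email, display_name):
    # One transaction, so a member never exists without its companion rows.
    with conn:
        conn.execute("INSERT INTO member_info VALUES (?, ?)", (member_id, email))
        conn.execute("INSERT INTO member_pref (member_id) VALUES (?)", (member_id,))
        conn.execute(
            "INSERT INTO member_profile VALUES (?, ?)", (member_id, display_name)
        )

add_member(1, "ann@example.com", "Ann")

# Reading one category touches only its own narrow table.
prefs = conn.execute(
    "SELECT newsletter FROM member_pref WHERE member_id = 1"
).fetchone()

# A preferences row for a member that does not exist is rejected.
try:
    with conn:
        conn.execute("INSERT INTO member_pref (member_id) VALUES (99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```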

Related

Having all contact information in one table vs. using key-value tables

(NB. The question is not a duplicate for this, since I am dealing with an ORM system)
I have a table in my database to store all Contacts information. Some of the columns for each contact are fixed (e.g. Id, InsertDate and UpdateDate). In my program I would like to give the user the option to add or remove properties for each contact.
Now there are of course two alternatives here:
(1) Save it all in one table and add and remove entire columns when the user needs to;
(2) Create a key-value table to save each property alongside its type and connect the record to the user's id.
These alternatives are both doable. But I am wondering which one is better in terms of speed? In the program it will be a very common thing for the user to view the entire Contact list to check for updates. Plus, I am using an ORM framework (Microsoft's Entity Framework) to deal with database queries. So if the user is to add and remove columns from a table all the time, it will be a difficult task to map them to my program. But again, if alternative (1) is a significantly better option than (2), then I can reconsider the key-value option.
I have actually done both of these.
Example #1
Large, wide table with columns of data holding names, phone, address and lots of small integer values of information that tracked details of the clients.
Example #2
Many different tables separating out all of the Character Varying data fields, the small integer values etc.
Example #1 was a lot faster to code for but in terms of performance, it got pretty slow once the table filled with records. 5000 wasn't a problem. When it reached 50,000 there was a noticeable performance degradation.
Example #2 was built later in my coding experience and was built to resolve the issues found in Example #1. While it took more to get the records I was after (LEFT JOIN this and UNION that) it was MUCH faster as you could ultimately pick and choose EXACTLY what the client was after without having to search a massive wide table full of data that was not all being requested.
I would recommend Example #2 to fit your #2 in the question.
And your user-specified columns could be stored in a table of their own (depending on how many you have, I suppose), which would let you draw on the table specific to that user and would also give you unlimited ability to remove and add columns to suit that particular setup.
You could then also have another table which kept track of the custom columns in the custom column table, which would give you the ability to "recover" columns later, as in "Do you want to add this to your current column choices or to one of these columns you have deleted in the past".

Building a MySQL database that can take an infinite number of fields

I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default null's mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always null.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but would be incredibly painful to work with from both a processing and programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values become challenging. Generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not ideal in all situations, they are appropriate in a situation such as you describe. You are adding attributes indefinitely. You will quickly run over the maximum number of columns allowed in a single table (typically between 1,000 and 4,000 depending on the database). You can also keep track of each value in each column separately -- if they are added at different times, for instance, you can keep a time stamp on when they go in.
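To make the EAV trade-off concrete, here is a small sketch using Python's sqlite3 module rather than MySQL. The table and its three columns come from the question; the `vehicle_count` field and the sample rows are invented for illustration. It shows the aggregation needed to pivot attribute rows back into one row per respondent, and the cast needed because every value is stored as a string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE respondent_supplemental_demographic_info (
        response_id       TEXT NOT NULL,
        demographic_field TEXT NOT NULL,
        demographic_value TEXT,            -- everything is stored as a string
        PRIMARY KEY (response_id, demographic_field)
    );
""")
conn.executemany(
    "INSERT INTO respondent_supplemental_demographic_info VALUES (?, ?, ?)",
    [
        ("ID999", "can_drive_manual_indicator", "true"),
        ("ID999", "vehicle_count", "2"),       # hypothetical second attribute
        ("ID1000", "can_drive_manual_indicator", "false"),
    ],
)

# Pivot: one conditional aggregate per attribute turns rows back into columns.
pivot_sql = """
    SELECT response_id,
           MAX(CASE WHEN demographic_field = 'can_drive_manual_indicator'
                    THEN demographic_value END) AS can_drive_manual,
           MAX(CASE WHEN demographic_field = 'vehicle_count'
                    THEN CAST(demographic_value AS INTEGER) END) AS vehicle_count
    FROM respondent_supplemental_demographic_info
    GROUP BY response_id
    ORDER BY response_id
"""
rows = conn.execute(pivot_sql).fetchall()
for row in rows:
    print(row)
```

A respondent with no row for an attribute simply gets NULL in that column, which is how the layout avoids the wide table full of mostly-null columns.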
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer, plus metadata tables (in a separate DB) describing these tables. With this metadata you can generate SQL and so on. That is definitely superior to having many null columns, or to copied, adapted scripts. It requires a bit of programming, where an application uses the metadata to generate SQL, collect the data (without customer-specific semantic knowledge) and generate reports.

One large table or split into two smaller tables?

Is there any performance benefit to splitting a large table with roughly 100 columns into 2 separate tables? This would be in terms of inserting, deleting and selecting tasks? I'm using SQL Server 2008.
If one of the fields is a CLOB or BLOB and you anticipate it holding a huge amount of data and you won't need that field very often and the result set will be transmitted over a long pipe (like server to a web-based client), then I think putting that field in a separate table would be appropriate.
But just returning 100 regular fields probably won't tax your system so much as to justify a separate table and a join.
The only benefit you might see is if a number of columns are only occasionally populated. In which case putting those into their own table and only adding a row when there is data might make sense in terms of overall row overhead and, depending on the number of rows, overall page count for the table(s). That said, this is one of the reasons they introduced sparse columns in SQL Server 2008.
For the maintenance and other overhead of managing two tables instead of one (especially given that people can act on individual tables if they choose), it's unlikely it would be worth it.
Can you describe what type of entity needs to have over 100 columns? Perhaps the data model is just wrong in the first place.
I would say no as it would take more execution time to join the 2 tables whenever you wanted to do something.
It depends on whether you use these fields at the same time in your application.
These kinds of performance improvements are really bad: you make your source code impossible to understand. If you have performance trouble with this table, add something (like a table containing the 15 fields you'll use in a request, updated via a trigger); don't modify your clean solution.
If you don't have a performance problem, don't do anything; you'll see later!

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me: I will not do this, because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. They use indexes to quickly find what you are after. Think of a phone book: how effective is the index? Would it be better to have a different book for each last name?
This will not give you anything performance-wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. However, if you split the table up, it becomes impossible (or really, really hard) to get any info that spans multiple users, like searching all users for a certain widget, counting all widgets of a certain type, etc. Every query would need to be built dynamically.
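A quick way to see the effect of the index is to ask the database for its query plan. The sketch below uses Python's sqlite3 module as a stand-in (the question does not name its DBMS); the `widgets` columns, the index name and the generated sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widgets ("
    " widget_id INTEGER PRIMARY KEY,"
    " user_id   INTEGER NOT NULL,"
    " name      TEXT)"
)
# 10,000 rows spread evenly over 100 users.
conn.executemany(
    "INSERT INTO widgets (user_id, name) VALUES (?, ?)",
    [(i % 100, f"widget-{i}") for i in range(10_000)],
)
conn.execute("CREATE INDEX idx_widgets_user_id ON widgets (user_id)")

# The last column of each plan row is the human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT widget_id, name FROM widgets WHERE user_id = ?",
    (42,),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

# Cross-user questions stay simple because everything is in one table.
count_for_user = conn.execute(
    "SELECT COUNT(*) FROM widgets WHERE user_id = ?", (42,)
).fetchone()[0]
users_with_widgets = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM widgets"
).fetchone()[0]
```

The plan text should mention the index rather than a full table scan; the exact wording differs between SQLite versions and between database engines.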
If deleting rows is slow, look into that. How many rows at one time are we talking about: 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted? Can you delete the rows at a later time, with less database activity? Is the delete slow because it is being blocked by other activity? Look into those before you break up the table.
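The soft-delete suggestion can be sketched as follows, again with Python's sqlite3 module standing in for the real database. Only the status value "D" comes from the answer; the 'A' (active) value, the column names and the `soft_delete_user_widgets` and `purge_deleted` helpers are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE widgets (
        widget_id INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        status    TEXT NOT NULL DEFAULT 'A'   -- 'A' = active, 'D' = deleted
    )
""")
conn.executemany("INSERT INTO widgets (user_id) VALUES (?)", [(1,), (1,), (2,)])

def soft_delete_user_widgets(user_id):
    """Mark a user's rows as deleted without physically removing them."""
    with conn:
        conn.execute(
            "UPDATE widgets SET status = 'D' WHERE user_id = ?", (user_id,)
        )

def purge_deleted(batch_size=1000):
    """Physically remove up to batch_size marked rows, e.g. during quiet hours."""
    with conn:
        cur = conn.execute(
            "DELETE FROM widgets WHERE widget_id IN "
            "(SELECT widget_id FROM widgets WHERE status = 'D' LIMIT ?)",
            (batch_size,),
        )
    return cur.rowcount

soft_delete_user_widgets(1)
active = conn.execute(
    "SELECT COUNT(*) FROM widgets WHERE status = 'A'"
).fetchone()[0]
purged = purge_deleted()
remaining = conn.execute("SELECT COUNT(*) FROM widgets").fetchone()[0]
```

Normal queries then have to filter on `status = 'A'`, which is the price of deferring the expensive delete.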
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
Using an index on userID should result in nearly the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables dynamically as a means of partitioning?
No. (smile)

Is precalculation denormalization? If not, what is (in simple terms)?

I'm attempting to understand denormalization in databases, but almost all the articles Google has spat out are aimed at advanced DB administrators. I have a fair amount of knowledge about MySQL and MSSQL, but I can't really grasp this.
The only example I can think of when speed was an issue was when doing calculations on about 2,500,000 rows in two tables at a place I used to intern at. As you can guess, calculating that much on demand took forever and froze the dev server I was on for a few minutes. So right before I left, my supervisor wanted me to write a calculation table that would hold all the precalculated values and would be updated about every hour or so (this was an internal site that wasn't used often). However, I never got to finish it because I left.
Would this be an example of denormalization? If so, is this a good example of it or does it go much farther? If not, then what is it in simple terms?
Say you had an Excel file with 2 worksheets you want to use to store family contact details. On the first worksheet, you have names of your contacts with their cell phone numbers. On the second worksheet, you have mailing addresses for each family with their landline phone numbers.
Now you want to print Christmas card labels to all of your family contacts listing all of the names but only one label per mailing address.
You need a way to link the two normalized sets. All the data in the 2 sets you have is normalized. It's 'atomic,' representing one 'atom,' or piece of information that can't be broken down. None of it is repeated.
In a denormalized view of the 2 sets, you'd have one list of all contacts with the mailing addresses repeated multiple times (cousin Alan lives with Uncle Bob at the same address, so it's listed on both Alan and Bob's rows.)
At this point, you want to introduce a Household ID in both sets to link them. Each mailing address has one householdID, each contact has a householdID value that can be repeated (cousin Alan and Uncle Bob, living in the same household, have the same householdID.)
Now say we're at work and we need to track zillions of contacts and households. Keeping the data normalized is great for maintenance purposes, because we only want to store contact and household details in one place. When we update an address, we're updating it for all the related contacts. Unfortunately, for performance reasons, when we ask the server to join the two related sets, it takes forever.
Therefore, some developer comes along and creates one denormalized table with all the zillions of rows, one for each contact with the household details repeated. Performance improves, and space considerations are tossed right out the window, as we now need space for 3 zillion rows instead of just 2.
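The contact/household example above might look like this in SQL, run here through Python's sqlite3 module. The table and column names and the sample address are hypothetical; only Alan, Bob and the householdID idea come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each address is stored exactly once.
    CREATE TABLE household (
        household_id INTEGER PRIMARY KEY,
        address      TEXT NOT NULL
    );
    CREATE TABLE contact (
        contact_id   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        household_id INTEGER NOT NULL REFERENCES household(household_id)
    );
    INSERT INTO household VALUES (1, '12 Elm Street');
    INSERT INTO contact VALUES (1, 'Alan', 1), (2, 'Bob', 1);

    -- Denormalized copy: the join is done once, up front, and the
    -- address is repeated on every contact row.
    CREATE TABLE contact_denormalized AS
        SELECT c.contact_id, c.name, h.address
        FROM contact AS c
        JOIN household AS h ON h.household_id = c.household_id;
""")

# One label per mailing address comes straight from the normalized table.
labels = conn.execute("SELECT address FROM household").fetchall()

# Reads from the denormalized table need no join at all.
denorm = conn.execute(
    "SELECT name, address FROM contact_denormalized ORDER BY contact_id"
).fetchall()
```

The cost shows up on writes: changing the household's address now means updating every copy in `contact_denormalized` as well as the single row in `household`.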
Make sense?
I would call that aggregation, not denormalization (if it is quantity of orders, for example, SUM(Orders) per day...). This is what OLAP is used for. Denormalization would be, for example, instead of having a PhoneType table and the PhoneTypeID in the Contact table, just having the PhoneType in the Contact table, thus eliminating one join.
You could also of course use indexed/materialized views to hold the aggregated values... but now you will slow down your updates, deletes and inserts.
Triggers are another way to accomplish this.
In an overly simplified form I would describe de-normalisation as reducing the number of tables used to represent the same data.
Customers and addresses are often kept in different tables to allow the concept of one customer having multiple addresses. (Work, Home, Current Address, Previous Address, etc)
The same could be said to apply to surnames and other properties, but only the current surname may ever be of concern. As such, one might normalise all the way to having a Customer table and a Surname table, with foreign key relationships, etc., but then denormalise this by merging the two tables together.
The benefit of "normalise until it hurts" is that it forces one to consider a pure and (hopefully) complete representation of the data and possible behaviours and relationships.
The benefit of "de-normalise until it works" is to reduce certain maintenance and/or processing overheads, but sticking to the same basic model as derived by working out a normalised model.
In the "Surname" example, by denormalising one is able to add an index to the customers based on their Surname and Date of Birth. Without de-normalising the Surname and DoB are in different tables and the composite index is not possible.
Denormalizing can be beneficial, and the example you provided is an instance of this. Calculating these values dynamically is expensive, so you create a table that holds a functional id referencing the other table along with the calculated value.
The data is redundant as it can be derived from another table but due to production requirements this is a better design in the functional sense.
Curious to see what others have to say on this topic because I know my sql professor would cringe at the term denormalize but it has practicality.
Normal form would reject this table, as it is fully derivable from existing data. However, for performance reasons data of this type is commonly found. For example inventory counts are typically carried, but are derivable from the transactions that created them.
For smaller faster sets a view can be used to derive the aggregate. This provides the user the data they need (the aggregated value) rather than forcing them to aggregate it themselves. Oracle (and others?) have introduced materialized views to do what your manager was suggesting. This can be updated on various schedules.
If update volumes permit, triggers could be used to emulate a materialized view using a table. This may reduce the cost of maintaining the aggregated value. If not it would spread the overhead over a greater period of time. It does however, add the risk of creating a deadlock condition.
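As a sketch of the trigger approach, the example below keeps a carried inventory count in step with the transactions that produce it, using Python's sqlite3 module. All table, column and trigger names are assumptions, and only inserts are handled; a fuller version would also need triggers for updates and deletes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory_txn (
        txn_id   INTEGER PRIMARY KEY,
        item_id  INTEGER NOT NULL,
        quantity INTEGER NOT NULL      -- positive = received, negative = shipped
    );
    -- Redundant, derivable data kept for fast reads.
    CREATE TABLE inventory_count (
        item_id INTEGER PRIMARY KEY,
        on_hand INTEGER NOT NULL
    );
    CREATE TRIGGER inventory_txn_after_insert
    AFTER INSERT ON inventory_txn
    BEGIN
        -- Create the summary row on first sight of an item, then adjust it.
        INSERT OR IGNORE INTO inventory_count (item_id, on_hand)
            VALUES (NEW.item_id, 0);
        UPDATE inventory_count
           SET on_hand = on_hand + NEW.quantity
         WHERE item_id = NEW.item_id;
    END;
""")
conn.executemany(
    "INSERT INTO inventory_txn (item_id, quantity) VALUES (?, ?)",
    [(7, 10), (7, -3), (8, 5)],
)

# The carried value should always match what the transactions imply.
carried = dict(conn.execute("SELECT item_id, on_hand FROM inventory_count"))
derived = dict(
    conn.execute(
        "SELECT item_id, SUM(quantity) FROM inventory_txn GROUP BY item_id"
    )
)
```

Comparing `carried` with `derived` like this is also a cheap consistency check to run periodically, since the redundant table can drift if some write path bypasses the trigger.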
OLAP takes this simple case to more of an extreme interest in aggregates. Analysts are interested in aggregated values, not the details. However, if the aggregated value is interesting, they may look at the details. Starting from normal form is still good practice.