PostgreSQL -- put everything into one table? - sql

So I'm trying to learn some basic database design principles and decided to download a copy of the sr27 database provided by the USDA. The database is storing nutritional information on food, and statistical information on how these nutritional values were derived.
When I first started this project, my thoughts were: well, I want to be able to search for food names, and I will probably want to do some basic statistical modeling on your most common nutritional values like calories, proteins, fats, etc. So, the thought was simple, just make 3 tables that look like this:
One table for food names
One table for common nutritional values (1-1 relationship with names)
One table for other nutritional values (1-1 relationship with names)
However, it's not clear that this is even necessary. Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep that as one table for less overhead, and I like to data calculations on common nutritional values so let's keep that as another table. (Question 1) Or does proper indexing make this moot?
My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)

Do you gain anything from partitioning the columns (or values) based
on the idea: I like to do searches on names, so let's keep that as one
table for less overhead, and I like to data calculations on common
nutritional values so let's keep that as another table. (Question 1)
Or does proper indexing make this moot?
if you just had a list of items, and you want to summarize on just some of them, then indexing is the way to address performance, not splitting some into another table arbitrarily.
Also, do read up on Normalization.
My next question is then: Why in the world did the USDA decide to use
12 tables? Is this considered good database design practice, or would
they have been better off merging a lot of these tables? (this excerpt
is taken from the PDF provided in the USDA link above, pg 29)
Probably because the types of questions they want to ask are not exactly the same ones you are trying to ask.
They clearly have more info about each food - like groups, nutrients, weights, and they are also apparently tracking where the source data is coming from...

There are important rules related to design relational databases - Normal forms - that reduces some artefacts and reduce IO operations. This design is usual for OLTP databases - and I have a possibility to see terrible slow databases because the developers has zero knowledge about it. Analytical databases OLAP are little bit different - there are wide tables used and some modern OLAP databases with column store support it.
PostgreSQL is classic row store database - so all in one table is not common and it is not good strategy. You can use a view to create some typical and often used views on data - so the complex schema can be invisible (transparent) for you.

Related

Three dimensional database table

We have all been there - consider the following example - first, the client says "every user shall only have one profile picture", so we add a field for that to the users table - half a year later, requirements change and a user actually needs to have n profile pictures.
Now, this seems only possible if you add a new table such as user_pictures to handle the new cardinality 1:n instead of 1:1. Oftentimes this can get very complicated. Whenever I come across this problem, I wonder why we don't use all three dimensions that we can think in. A two dimensional table is limited in a way that it is somewhat incomplete - what if, referring to our problem with the profile picture again, the picture field in the users table had a depth, and that depth made the field an array that perfectly represented both cardinalities 1:1 and 1:n at the same time.
Table fields would simply become arrays and automatically support both cardinalities - wouldn't that be something? At least I would use it. Is there something like it out there already?
Oracle has support for arrays as well as nested tables. Either seem to fit your requirements. These days though people prefer to model everything as tables and relationships to keep things simple and consistent and so modern RDBMSes don't generally support this stuff and I don't believe it ever made it into standard SQL either.
The standard many-to-many approach, many users to many profile pictures, is easily covered by the three table approach:
Table: Users
Table: Pictures
Table: User_Pictures
However, if you move to a NoSQL approach, you can store a User document (usually in JSON format), that stores an array of profile pictures for that user in a single table.
#gordy +1 for the Oracle link. I wasn't sure if any RDBS supposed arrays.
You are describing a denormalization technique (multiple columns for instances of one field) and it usually leads to tears unless you thoroughly understand the consequences of violating basic relational principles.
A classic difficulty comes when you want to query on the field ("find the user who has this picture") and you discover that an SQL statement with "AND picture IN (pic1, pic2, pic3)" can't be indexed and your optimizer starts planning its revenge.

Three SQL tables or one?

I have a choice of creating three tables with identical structure but different content or one table with all of the data and one additional column that distinguishes the data. Each table will have about 10,000 rows in it, and it will be used exclusively for looking up data. The key design criteria is speed of lookup, so which is faster: three tables with 10K rows each or one table with 30K rows, or is there no substantive difference? Note: all columns that will be used as query parameters will have indices.
There should be no substantial difference between 10k or 30k rows in any modern RDBMS in terms of lookup time. In any case not enough difference to warrant the de-normalization. Indexed qualifier column is a common approach for such a design.
The only time you may consider de-normalizing if your update pattern affects a limited set of data that you can put in a "short" table (say, today's messages in social network) with few(er) indexes for fast inserts/updates and there is a background process transferring the stabilized updates to a large, fully indexed table. The case were you really win during write operations will be a dramatic one though, with very particular and unfortunate requirements. RDBMS engines are sophisticated enough to handle most of the simple scenarios in very efficient way. 30k or rows does not sound like a candidate.
If still in doubt, it is very easy to write a test to check on your particular database / system setup. I think if you post your findings here with real data, it will be a useful info for everyone in your steps.
Apart from the speed issue, which the other posters have covered and I agree with, you should also take into consideration the business model that your are replicating in your database, as this may affect the maintenance cost of your solution.
If is it possible that the 3 'things' may turn into 4, and you have chosen the separate table path, then you will have to add another table. Whereas if you choose the discriminator path then it is as simple as coming up with a new discriminator.
However, if you choose the discriminator path and then new requirements dictate that one of 'things' has more data to store then you are going to have to add extra columns to your table which have no relevance to the other 'things'.
I cannot say which is the right way to go, as only you know your business model.

When to split a Large Database Table?

I'm working on an 'Employee' database and the fields are beginning to add up (20 say). The database would be populated from different UI say:
Personal Information UI: populates fields of the 'Employee' table such as birthday, surname, gender etc
Employment Details UI: populates fields of the 'Employee' table such as employee number, date employed, grade level etc
Having all the fields populated from a single UI (as you would imagine) is messy and results in one very long form that you'd need to scroll.
I'm thinking of splitting this table into several smaller tables, such that each smaller table captures a related information of an employee (i.e. splitting the table logically according to the UI).
The tables will then be joined by the employee id. I understand that splitting tables with one-to-one relationship is generally not a good idea (multiple-database-tables), but could splitting the table logically help, such that the employee information is captured in several INSERT statements?
Thanks.
Your data model should not abide to any rules imposed by the UI, just for convenience. One way to reduce the column-set for a given UI component is to use views (in most databases, you can also INSERT / UPDATE / DELETE using simple views). Another is to avoid SELECT *. You can always select subsets of your table's columns
"could splitting the table logically help, such that the employee information is captured in several INSERT statements?"
No.
How could it help?
20 fields is a fairly small number of fields for a relational database. There was a question a while ago on SO where a developer expected to have around 3,000 fields for a single table (which was actually beyond the capability of the RDBMS in question) - under such circumstances, it does make sense to split up the table.
It could also make sense to split up the table if a subset of columns were only ever going to be populated for a small proportion of rows (for example, if there were attributes that were specific to company directors).
However, from the information supplied so far, there is no apparent reason to split the table.
Briefly put, you want to normalize your data model. Normalization is the systematic restructuring of data into tables that formed the theoretical foundation of the relational data model developed by EF Codd forty years ago. There are levels of normalization - non-normalized and then first, second, third etc normal forms.
Normalization is now barely an afterthought in many database shops, ostensibly because it is erroneously believed to slow database performance.
I found a terrific summary on an IBM site covering normalization which may help. http://publib.boulder.ibm.com/infocenter/idshelp/v10/topic/com.ibm.ddi.doc/ddi56.htm
The original Codd book later revised by CJ Date is not very accessible, unfortunately. I can recommend "Database Systems: Design, Implementation & Management; Authors: Rob, P. & Coronel, C.M". I TA'ed a class on database design that used this textbook and I've kept using it for reference.

Best pattern for storing (product) attributes in SQL Server

We are starting a new project where we need to store product and many product attributes in a database. The technology stack is MS SQL 2008 and Entity Framework 4.0 / LINQ for data access.
The products (and Products Table) are pretty straightforward (a SKU, manufacturer, price, etc..). However there are also many attributes to store with each product (think industrial widgets). These may range from color to certification(s) to pipe size. Every product may have different attributes, and some may have multiples of the same attribute (Ex: Certifications).
The current proposal is that we will basically have a name/value pair table with a FK back to the product ID in each row.
An example of the attributes Table may look like this:
ProdID AttributeName AttributeValue
123 Color Blue
123 FittingSize 1.25
123 Certification AS1111
123 Certification EE2212
123 Certification FM.3
456 Pipe 11
678 Color Red
999 Certification AE1111
...
Note: Attribute name would likely come from a lookup table or enum.
So the main question here is: Is this the best pattern for doing something like this? How will the performance be? Queries will be based on a JOIN of the product and attributes table, and generally need many WHEREs to filter on specific attributes - the most common search will be to find a product based on a set of known/desired attributes.
If anyone has any suggestions or a better pattern for this type of data, please let me know.
Thanks!
-Ed
You are about to re-invent the dreaded EAV model, Entity-Attribute-Value. This is notorious for having problems in real-life, for various reasons, many covered by Dave's answer.
Luckly the SQL Customer Advisory Team (SQLCAT) has a whitepaper on the topic,
Best Practices for Semantic Data Modeling for Performance and Scalability. I highly recommend this paper. Unfortunately, it does not offer a panacea, a cookie cutter solution, since the problem has no solution. Instead, you'll learn how to find the balance between a fixed queryable schema and a flexible EAV structure, a balance that works for your specific case:
Semantic data models can be very
complex and until semantic databases
are commonly available, the challenge
remains to find the optimal balance
between the pure object model and the
pure relational model for each
application. The key to success is to
understand the issues, make the
necessary mitigations for those
issues, and then test, test, and test.
Scalability testing is a critical
success factor if you are going to
find that optimal design.
This is going to be problematic for a couple of reasons:
Your entity queries will be much harder to write. Transforming the results of those queries into something resembling a ViewModel when it comes time for presentation is going to be painful because it will involve a pivot for each product.
Understanding what your datatypes will be is going to be tough when it comes time to read certain types of data. Are you planning on storing this as strings? For example, DateTimes hold more data than the default .ToString() implementation writes to the string. You're also going to have issues if you try to store floating-point values.
Your objects' data integrity is at risk. There will be a temptation to put properties which should be just attributes of your main product tables in this "bucket o' data". Maybe the design will be semi-sane to begin with, but I guarantee you that after a certain amount of time, folks will start to just throw properties in the bag. It'll then be very tough to keep your objects' integrity with such a loosely defined structure.
Your indexes will most likely be suboptimal. Again think of a property which should be on your product table. Instead of being able to index on just one column, you will now be forced to make a potentially very large composite index on your "type" table.
Since you're apparently planning to throw out proper datatypes and use strings, the performance of range queries for numeric data will likely be poor.
Your table will get big, slowing backups and queries. Instead of an integer being 4 bytes, you're going to have to store far more for an integer of any size.
Better to normalize the table in a more "traditional" way using "IS-A" relationships. For example, you might have Pipes, which are a type of Product, but have a couple more attributes. You might have Stoves, which are a type of product, but have a couple more attributes still.
If you really have a generic database and all sorts of other properties which aren't going to be subject to data integrity rules, you very well may want to consider storing data in an XML column. It's hard to tell you what the correct design choice is unless I know a lot more about your business.
IMO this is a design antipattern. The siren song of this idea has lured many a developer onto the rocks of of an unmaintainable application.
I know it is an old one - however there might be other readers...
I have seen the balance EAV to attribute modeled approach. Well - it is still EAV. "EAV's are like drugs" is pretty much true. So what about thinking it through once more - and let's be aggressive really:
I still liked the supertype apporach, where a lot of tables use the same primary key from a key generator. Let's reuse this one. So what about creating a new table for each set of attributes - all having the primary from the same key generator? Eg. you would have a table with the fields "color,pipe", another table "fittingsize,pipe", and so on. The requirement "volatility of attributes" screams for a carefully(automatically) maintained data dictionary anyway.
This approach is fully normalized and can be fully automated. You can support checks if specific attribute sets materialized already as table by hashing attribute name clusters, eg. crc32(lower('color~fittingsize~pipe')) where the atribute names need to be sorted alphabetically. Of course this requires to have the hash in the data dictionary. Based on the data dictionary each object can be searched (using 'UNION'), especially if the data dictionary itself is a table. Having the data dictionary as table also allows you to use its primary (surrogate) key as basis for unique tablenames, to end up with tables like 'attributes1','attributes2',... Most databases nowadays support some billion tables - so we are sort of save on that end as well. You could even have a product catalouge with very common attributes, that reference the extended attribute tables.
An open issue are 1:n data sets. I am afraid you need to sort them out in separate tables. However this very much depends on your data presentation and querying strategy. Should they always be presented as comma seperated string attached to the product or do you want to eg. be able to query for all products of a certain Certification?
Before you flame this approach please consider this: It is meant for use cases where you have a very high volatility of attributes - in quantity and quality - only. Also it was preset, that you cannot know most of the attributes at the point in time when the solution is created. So do not discuss this in a context where you can model your attributes upfront which would enable you to balance trade offs much better.
In short, you cannot go all one route. If you use an EAV like your example you will have a myriad of problems like those outlined by the other posters not the least of which will be performance and data integrity. Let me reiterate, that using an EAV as the core of your solution will fail when you get to reporting and analysis. However, as you have also stated, you might have hundreds of attributes that change regularly.
The solution, IMO, is a hybrid. For common attributes, use columns/standard schema. For additional, arbitrary attributes, use an EAV. However, the rule with the EAV data is that you can never, ever, under any circumstances, write a query that includes a sort or filter on an attribute. I.e., you can never write Where AttributeName = 'Foo'. The EAV portion of the schema represents a bag of data that is merely there for tracking purposes. In fact, I have seen many people implement this solution by using Xml for the EAV portion. The moment someone does want to search, filter, sort or place an EAV value in a specific spot on a report, that attribute must be elevated to a top level column in the products table.
The key to this hybrid approach is discipline. It will seem simple enough to add a filter, sort or put an attribute in a specific spot somewhere on a report especially when you get pressure from management. You must resist this temptation. Once you go down the dark path... If you do not think that you can maintain that level of discipline in your development team, then I would not use an EAV. As I've mentioned before, EAV's are like drugs: in small quantities and used under the right circumstances they can be beneficial. Too much will kill you.
Rather than have a name-value table, create the usual Product table structure containing all the common attributes, and add an XML column for the attributes that vary by product.
I have used this structure before and it worked quite well.
As #Dave Markle mentions, the name-value approach can lead to a world of pain.

which is faster, mysql database with one table or multiple tables?

On my website you can search 'ads' or 'classifieds'. There are different categories.
Would the searches be faster with multiple tables, one for each category, or wouldn't it matter?
We are talking about around 500 thousand ads.
If it won't slow down the search, please explain yourself so that I understand why it won't, because it seems like common sense that the more ads you have, the slower the search!
Thanks
Your question is a little unclear. I'm assuming this scenario:
table: ads
id category ad_text
-- -------- ---------------------------
1 pets sample text
2 family sample ad
If you are making one search of ads, then searching multiple tables on each search is slower than searching one table.
HOWEVER, if you're proposing to break "ads" into multiple tables according to "category", leaving you with table names like
pets-ads
family-ads
programmer-ads
And, programatically, you know you're looking for programmer-ads so you can just go search the programmer-ads table, then breaking them out is faster. Barely.
Breaking them out, though, has many drawbacks. You'll need:
some cute code to know which table to
search.
a new table each time you create a new category
to rename a table if you decide a category name is wrong
Given the limited info we have, I would strongly advise one table with a category column, then go ahead and normalize that out into its own table. Slap an index on that column. Databases are built to handle tons of rows of data organized correctly, so don't worry about that so much.
Obviously, it will be nominally faster to search a smaller table (one category) than a larger table. The larger table is probably still the correct design, however. Creating multiple identical tables will simply make the developer's and manager's lives miserable. Furthermore, certain kind of searches are more difficult if you segment the data (for instance, searches across two categories).
Properly indexed, the single-table approach will yield results almost as good as the segmented approach while providing the benefits of proper design.
(Of course, when you say "single table", I assume that you mean a single table to hold the core attributes of the Advertistment entities. Presumably there will be other tables as well.)
It depends.
If you've built a single denormalised table containing text, it'll get progressively slower for a number of reasons. Indexes help to a certain point.
If you have a normalised structure with multiple tables, primary and foreign keys, indexes, etc., it can be more robust and scalable.
A database is very well equipped to deal with 500k adds. Add an index on the category, and you should be fine.
If you add the table definition and the distribution of categories to your question, you'd probably get a better answer :)