I have a database, with one table containing a set of entities - let's say films. I want users on the front end to be able to create and save multiple orderings of these films.
How could I sensibly store these orderings? What should I be thinking about (for making the database schema nice, or for performance, etc.)?
In this particular case, I'm only expecting about 100 films in the database, and probably fewer than 10 saved orderings, so performance is unlikely to be a major issue, although in theory it could get much larger.
My ideas so far, without restricting for niceness, include:
1. Having a table for orderings, storing ordering id, ordering name, and a JSON.stringified version of the film ids.
2. Having a table storing ordering id and ordering name, and a separate table storing, somehow (a linked list?) several sets of actual orders - by storing as you might a single order but adding an order id. (This second table might then contain a record for every pair of film and order...)
Your ideas basically cover it, omitting some implementation-specific options such as Postgres' array field support, which for practical purposes is a different way to accomplish #1.
#1 is compact and easy to set up, but lacks referential integrity: if you delete a film, you have to ensure it is removed from all lists yourself. #2 involves more work and a slightly more complex structure, but the payoff is that the relationships are tracked by the database; taking the example of deleting a film, you have the option to prevent deletion of any film on a list, or of cascading the deletion to remove the entry from all lists.
Each has its advantages and disadvantages. Which is appropriate really comes down to your specific use case.
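As a minimal sketch of option #2, assuming SQLite and hypothetical table names (`film`, `film_ordering`, `ordering_film`) and sample titles that are not from the question, a junction table with a `position` column might look like this. The cascade shows the database removing a deleted film from every saved ordering on its own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs unless asked

# Hypothetical schema: one row per (ordering, film) pair, with its position.
conn.executescript("""
CREATE TABLE film (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE film_ordering (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ordering_film (
    ordering_id INTEGER NOT NULL REFERENCES film_ordering(id) ON DELETE CASCADE,
    film_id     INTEGER NOT NULL REFERENCES film(id) ON DELETE CASCADE,
    position    INTEGER NOT NULL,
    PRIMARY KEY (ordering_id, film_id)
);
""")

conn.executemany("INSERT INTO film (id, title) VALUES (?, ?)",
                 [(1, "Alien"), (2, "Brazil"), (3, "Casablanca")])
conn.execute("INSERT INTO film_ordering (id, name) VALUES (1, 'favourites')")
# (film_id, position) pairs for ordering 1
conn.executemany(
    "INSERT INTO ordering_film (ordering_id, film_id, position) VALUES (1, ?, ?)",
    [(3, 1), (1, 2), (2, 3)])


def films_in_order(ordering_id):
    """Return the film titles of one saved ordering, first to last."""
    rows = conn.execute(
        """SELECT f.title
           FROM ordering_film AS o
           JOIN film AS f ON f.id = o.film_id
           WHERE o.ordering_id = ?
           ORDER BY o.position""",
        (ordering_id,))
    return [title for (title,) in rows]


before = films_in_order(1)   # ['Casablanca', 'Alien', 'Brazil']
conn.execute("DELETE FROM film WHERE id = 1")
after = films_in_order(1)    # ['Casablanca', 'Brazil']
```

Replacing `ON DELETE CASCADE` with `ON DELETE RESTRICT` on `film_id` would give the other behaviour mentioned above: a film that appears in any ordering could not be deleted.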
I am used to seeing relational databases where distinct entities are stored in different tables (simple example: Country, State, City). Recently I have been seeing more cases where distinct but similar entities are bundled into the same table, combined with different views. I suppose this can economize on tables and data access programs (maybe at the expense of clarity and flexibility). Re-reading the definition of normalized databases, I don't think this breaks any rules, but it seems less intuitive and a throwback to the old mainframe "Miscellaneous" tables where you put anything that was forgotten in the design stage. See 2 examples below: Multi-table solution vs Single table solution. Is this phenomenon part of a data or programming design pattern, and does it have a name?
If you have small dedicated tables, then the database can easily cache the ones it needs in memory.
If you take what would otherwise be small tables and cram them together into one, the database doesn't know which entries are important to cache and which aren't.
More importantly, there is more opportunity for errors because you can inadvertently type in the wrong type code and end up joining to something irrelevant, with no RI or typechecking to warn you. If you use small dedicated tables then you can specify RI constraints.
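To illustrate the RI point, here is a small sketch assuming SQLite and hypothetical `country`, `currency` and `customer` tables (none of them are from the question). With dedicated lookup tables, a code from the wrong kind of lookup is rejected by the foreign key instead of silently joining to something irrelevant:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite

# Hypothetical dedicated lookup tables, one per kind of code.
conn.executescript("""
CREATE TABLE country  (code TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE currency (code TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customer (
    id           INTEGER PRIMARY KEY,
    country_code TEXT NOT NULL REFERENCES country(code)
);
INSERT INTO country  VALUES ('US', 'United States');
INSERT INTO currency VALUES ('USD', 'US dollar');
""")


def try_insert(country_code):
    """Insert a customer and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO customer (country_code) VALUES (?)",
                         (country_code,))
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


valid = try_insert("US")         # a real country code
wrong_kind = try_insert("USD")   # a currency code typed in by mistake
```

Here `valid` is `"ok"` and `wrong_kind` is `"rejected"`. In a single shared lookup table both codes would live side by side, and the same protection would need a composite key on type code plus lookup key.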
Thinking back to a place where I saw the single monster-lookup-table pattern done, I think the attraction was that developers can add more kinds of entries without needing DBA intervention to create more tables. There were a lot of developers and only a few DBAs and this was how the DBAs avoided getting sucked into having to create dedicated lookup tables every time a new type of lookup entry was introduced. (Apparently granting create table rights in dev was not acceptable for the DBAs there.)
This seems like a workaround for environments where database schema changes are hard to come by. But another consideration is it may be easier to internationalize if all your entries are in one table.
The pattern has an established name: the One True Lookup Table (OTLT). The linked article calls it out as an antipattern, and lists more downfalls of this technique. Here is the bulleted list from the article:
It makes the SQL look ugly.
Many statements will require multiple joins to the lookup table. The extra join columns make the statements look bigger and scarier. There will be the same number of joins when using separate lookup tables, but those joins will be simpler.
Multiple references to the same table can make it hard to determine what is happening in the execution plan, as you will see those repeated references there, and have to refer to the predicates to understand the context of table reference. If you were using separate lookup tables, it would be clear which table you were referring to at any point of the execution plan.
You can't foreign key to this type of table. Technically you can if you are willing to put both columns (lookup_type_code and lookup_key) in the table, but you won't because it is ugly. This means there is a good chance your data integrity will be compromised over time. It's really easy to foreign key to individual lookup tables, and therefore protect your data.
It's hard to control the contents of the table. It's a shared resource, so check constraints and triggers are problematic. If you need users to have different privileges, depending on which lookup they are dealing with, things are going to get messy. That would be really easy with separate lookup tables.
If you need to make a change for one reference type, like extending the size of the key or value, it affects all reference data. Using separate lookup tables isolates the change.
Over time, many reference tables take on additional data. To model that you would need to either split out that reference data from this shared lookup table, or start adding optional columns to cope with the "one-off" issues. A change like this is really simple for separate lookup tables.
Data types matter. You should always use the correct data type, as it will reduce the number of data type conversions needed. Implicit data type conversions are bugs waiting to happen!
Performance can be a problem with the OTLT approach as it's hard for the optimizer to make sound judgements about the data. The optimizer cares about cardinality, but it may be hard to make that decision if you are dealing with a large number of rows, most of which are irrelevant in any one specific context. The optimizer also cares about high/low values, but these are not relevant to any one lookup, but shared. We've also mentioned you probably won't foreign key to this data, which will reduce the amount of information the optimizer has when making its decision. You may have artificially made columns optional that are actually mandatory: a key must have a value, but in which column? I think you get the message.
I think, if you need a name dictionary only (for spellchecking or something like that), the second approach is good enough. Otherwise, if objects have some additional specific fields, the second approach is very bad.
(NB. The question is not a duplicate for this, since I am dealing with an ORM system)
I have a table in my database to store all Contacts information. Some of the columns for each contact are fixed (e.g. Id, InsertDate and UpdateDate). In my program I would like to give the user the option to add or remove properties for each contact.
Now there are of course two alternatives here:
1. Save it all in one table and add and remove entire columns when the user needs to;
2. Create a key-value table to save each property alongside its type and connect the record to the user's id.
These alternatives are both doable. But I am wondering which one is better in terms of speed? In the program it will be a very common thing for the user to view the entire Contact list to check for updates. Plus, I am using an ORM framework (Microsoft's Entity Framework) to deal with database queries. So if the user is to add and remove columns from a table all the time, it will be a difficult task to map them to my program. But again, if alternative (1) is a significantly better option than (2), then I can reconsider the key-value option.
I have actually done both of these.
Example #1
Large, wide table with columns of data holding names, phone, address and lots of small integer values of information that tracked details of the clients.
Example #2
Many different tables separating out all of the Character Varying data fields, the small integer values etc.
Example #1 was a lot faster to code for but in terms of performance, it got pretty slow once the table filled with records. 5000 wasn't a problem. When it reached 50,000 there was a noticeable performance degradation.
Example #2 was built later in my coding experience and was built to resolve the issues found in Example #1. While it took more to get the records I was after (LEFT JOIN this and UNION that) it was MUCH faster as you could ultimately pick and choose EXACTLY what the client was after without having to search a massive wide table full of data that was not all being requested.
I would recommend Example #2 to fit your #2 in the question.
And your USER specified columns for their data set could be stored in a table just to their own (depending on how many you have I suppose) which would allow you to draw on the table specific to that USER, which would also give you unlimited ability to remove and add columns to suit that particular setup.
You could then also have another table which kept track of the custom columns in the custom column table, which would give you the ability to "recover" columns later, as in "Do you want to add this to your current column choices or to one of these columns you have deleted in the past".
I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default null's mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always null.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but would be incredibly painful to work with from both a processing and programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values become challenging. Generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not ideal in all situations, they are appropriate in a situation such as you describe. You are adding attributes indefinitely. You will quickly run over the maximum number of columns allowed in a single table (typically between 1,000 and 4,000 depending on the database). You can also keep track of each value in each column separately -- if they are added at different times, for instance, you can keep a time stamp on when they go in.
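A minimal sketch of the "aggregations to get a single view out" step, assuming SQLite and the table and column names from the question. The `owns_truck_indicator` field and the sample rows are hypothetical, and every value is stored as a string, as described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE respondent_supplemental_demographic_info (
    response_id       TEXT NOT NULL,
    demographic_field TEXT NOT NULL,
    demographic_value TEXT,          -- everything is stored as a string
    PRIMARY KEY (response_id, demographic_field)
);
INSERT INTO respondent_supplemental_demographic_info VALUES
    ('ID999',  'can_drive_manual_indicator', 'true'),
    ('ID999',  'owns_truck_indicator',       'false'),
    ('ID1000', 'can_drive_manual_indicator', 'false');
""")

# Conditional aggregation turns the attribute rows back into columns:
# one CASE expression per attribute, one output row per response_id.
pivot_sql = """
SELECT response_id,
       MAX(CASE WHEN demographic_field = 'can_drive_manual_indicator'
                THEN demographic_value END) AS can_drive_manual,
       MAX(CASE WHEN demographic_field = 'owns_truck_indicator'
                THEN demographic_value END) AS owns_truck
FROM respondent_supplemental_demographic_info
GROUP BY response_id
ORDER BY response_id
"""
rows = conn.execute(pivot_sql).fetchall()
# [('ID1000', 'false', None), ('ID999', 'true', 'false')]
```

An attribute a respondent never answered comes back as NULL (`None` in Python), which is the same result the wide table with mostly-null columns would give, without the columns having to exist up front.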
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer, plus metadata tables (in a separate DB) describing these tables. With this metadata you can generate SQL and so on. That is definitely superior to having many null columns, or copied, adapted scripts. It requires a bit of programming, where an application uses the metadata to generate SQL, collect the data (without customer-specific semantic knowledge) and generate reports.
If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows with only the core info that can be read quickly, and an extra table containing all the info you rarely use that you can look up when needed.
Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of the members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well, to ensure there are no knock-on impacts?
The same applies for audit trail purposes, if you want to track the last user ID/timestamp to change a member's data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/Performance answer but maybe something to look at from a DB & App design pov
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
1-1 may be easier, if you have say Member_Info; Member_Pref; Member_Profile. Having too many columns can cause problems if you want lots of varchar(255) columns, as you may go over the row size limit, and it just makes the table too confusing.
Just make sure you have the correct foreign key constraints and so on, so there's always 1 row in each table with the same member_id.
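A rough sketch of such a 1-to-1 split, assuming SQLite and hypothetical columns (`email`, `newsletter`, `bio`). Making `member_id` both the primary key and a foreign key in each side table guarantees at most one row per member and no orphan rows. It does not by itself guarantee that the row exists, so the application (or a trigger) still has to insert it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite

conn.executescript("""
CREATE TABLE member_info (
    member_id INTEGER PRIMARY KEY,
    email     TEXT NOT NULL
);
-- member_id is both primary key and foreign key: at most one row per member.
CREATE TABLE member_pref (
    member_id  INTEGER PRIMARY KEY
               REFERENCES member_info(member_id) ON DELETE CASCADE,
    newsletter INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE member_profile (
    member_id INTEGER PRIMARY KEY
              REFERENCES member_info(member_id) ON DELETE CASCADE,
    bio       TEXT
);
""")


def attempt(sql):
    """Run one statement; return True if the database accepted it."""
    try:
        with conn:
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


attempt("INSERT INTO member_info (member_id, email) VALUES (1, 'a@example.com')")
first_pref = attempt("INSERT INTO member_pref (member_id, newsletter) VALUES (1, 1)")
duplicate_pref = attempt("INSERT INTO member_pref (member_id, newsletter) VALUES (1, 0)")
orphan_pref = attempt("INSERT INTO member_pref (member_id, newsletter) VALUES (99, 1)")

# Reading preferences touches only the narrow member_pref table.
prefs = conn.execute(
    "SELECT newsletter FROM member_pref WHERE member_id = 1").fetchone()
```

The first insert succeeds, the second is refused by the primary key, and the third is refused by the foreign key because member 99 does not exist.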
In a DB I'm designing, there's one fairly central table representing something that's been sold or is for sale. It distinguishes between personal sales (like eBay) and sales from a proper company. This means there are literally one or two fields which are not equally appropriate to both cases... for instance one field is only used in one case, another field is optional in one case but mandatory in the other.
If there were more specialty it would be sensible to have a core table and then two tables with the fields relevant to the specific cases. But here, creating two tables just to contain like one field plus the reference to the core table seems both aesthetically bad, and painful to the query designer and DB software.
What do you think? Is it ok to bend the rules slightly by having a single table with weakened constraints - meaning the DB cannot 100% prevent inconsistent data being added (in a very limited way) - or do I suck it up and create dumb-looking 1-field tables?
What you're describing with one table for common columns and dependent tables for subtype-specific columns is called Class Table Inheritance. It's a perfectly good thing to do.
What @Scott Ferguson seems to be describing (two distinct tables for the two types of sales) is called Concrete Table Inheritance. It can also be a good solution depending on your needs, but more often it just makes it harder to write queries across both subtypes.
If all you need is one or two columns that apply only to a given subtype, I agree it seems like overkill to create dependent tables. Remember that most brands of SQL database support CHECK constraints or triggers, so you can design data integrity rules into the metadata.
CREATE TABLE Sales (
  sale_id SERIAL,
  is_business INT NOT NULL, -- 1 for corporate, 0 for personal
  sku VARCHAR(20), -- only for corporate
  paypal_id VARCHAR(20), -- mandatory but only for personal
  -- a row is either corporate (no paypal_id) or personal (paypal_id, no sku)
  CHECK (
    (is_business = 1 AND paypal_id IS NULL)
    OR (is_business = 0 AND sku IS NULL AND paypal_id IS NOT NULL)
  )
);
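To see a rule of this kind in action, here is a small sketch adapted to SQLite (so `INTEGER PRIMARY KEY` and `TEXT` stand in for `SERIAL` and `VARCHAR`). The exact CHECK expression is one possible reading of the column comments (paypal_id mandatory for personal sales, sku only for corporate ones), and the sample values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Sales (
    sale_id     INTEGER PRIMARY KEY,
    is_business INTEGER NOT NULL CHECK (is_business IN (0, 1)),
    sku         TEXT,   -- only for corporate
    paypal_id   TEXT,   -- mandatory but only for personal
    CHECK (
        (is_business = 1 AND paypal_id IS NULL)
        OR (is_business = 0 AND sku IS NULL AND paypal_id IS NOT NULL)
    )
)""")


def accepts(is_business, sku, paypal_id):
    """Return True if the database accepts a row with these values."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Sales (is_business, sku, paypal_id) VALUES (?, ?, ?)",
                (is_business, sku, paypal_id))
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "corporate with sku": accepts(1, "SKU-1", None),
    "personal with paypal": accepts(0, None, "pp-42"),
    "personal without paypal": accepts(0, None, None),
    "personal with sku": accepts(0, "SKU-2", "pp-43"),
}
```

The first two rows are accepted and the last two are rejected, so the single table keeps the subtype rules without needing a dependent table for one or two columns.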
I think having these fields is not going to hurt you today, and it is the choice I would go for. Just remember that as your database evolves you may need to refactor into 2 separate tables (if you need more fields).
There are some who insist that inapplicable fields should never be allowed, but I think this is one of those rules that someone wrote in a book and now we're all supposed to follow it without questioning why. In the case you're describing, a single table sounds like the simple, intelligent solution.
I would certainly not create two tables. Then all the common fields would be duplicated, and all your queries would have to join or union two tables. So the real question is, One table or three. But you seem to realize that.
You didn't clarify what the additional fields are. If the presence or absence of one field implies the record type, then I sometimes use that fact as the record type indicator rather than creating a redundant type. Like, if the only difference between a "personal sale" and a "business sale" is that a business sale has a foreign key for a company filled in, then you could simply state that you define a business sale as one with a company filled in, and no ambiguity is possible. But if the situation gets even slightly more complicated, this can be a trap: I've seen applications that say if a is null and b=c d / 7 = then it's record type A, else if b is null and etc etc. If you can't do it with one test on one field, forget it and put in a record type field.
You can always enforce consistency with code or constraints.
I worry a lot more about redundant data creating consistency problems than inapplicable fields. Redundant data creates all sorts of problems. Data inapplicable to a record type? In the worst case, just ignore it. If it's a "personal sale" and somehow a company got filled in, ignore it or null it out on sight. Problem solved.
If there are two distinct entities, "Personal Sales" and "Company Sales", then perhaps you ought to have two tables to represent those entities?
News flash: the DB cannot prevent 100% of corrupt data no matter which way you cut it. So far you have only considered what I call level 1 corruption (level 0 corruption is essentially what would happen if you wrote garbage over your database with a hex editor).
I have yet to see a database that could prevent level 2 corruption (syntactically correct records but when taken as a whole mean something perverse).
The PRO for keeping all fields in one table is that you get rid of JOINs, which makes your queries faster.
The CONTRA is that your table grows larger which makes your queries slower.
Which one impacts you more, totally depends on your data distribution and which queries you issue most often.
In general, splitting is better for OLTP systems, joining is better for data analysis (that tends to scan the tables).
Let's imagine 2 scenarios:
Split fields. There are 1,000,000 rows, the average row size is 20 bytes, the split field is filled once per 50 rows (i. e. 20,000 records in the split table).
We want to query like this:
SELECT SUM(mainfield + COALESCE(splitfield, 0))
FROM maintable
LEFT JOIN
splittable
ON splitid = mainid
This will require scanning 20,000,000 bytes and nested loops (or hash lookups) to find 20,000 records.
Each hash lookup is roughly equivalent to scanning 10 rows, so the total time will be equivalent of scanning 20,000,000 + 10 * 20,000 * 20 = 24,000,000 bytes
Joined fields. There are 1,000,000 rows, the average row size is 24 bytes, so the query will scan 24,000,000 bytes.
As you can see, the times tie.
However, if either parameter changes (field is filled more often or more rarely, the row size is more or less, etc), one or another solution will become better.
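The back-of-the-envelope comparison above can be reproduced with a few lines of Python. This is only the answer's rough cost model (bytes scanned, with one lookup counted as scanning 10 rows), not a real optimizer estimate, and the function and parameter names are made up for illustration:

```python
ROWS = 1_000_000          # rows in the main table
SPLIT_ROW_BYTES = 20      # average row size when the rare field is split out
JOINED_ROW_BYTES = 24     # average row size when the field stays in the table
LOOKUP_COST_ROWS = 10     # one hash lookup ~ scanning 10 rows (rule of thumb above)


def split_cost(rows=ROWS, row_bytes=SPLIT_ROW_BYTES,
               fill_every=50, lookup_rows=LOOKUP_COST_ROWS):
    """Bytes-scanned equivalent for the split layout: full scan plus lookups."""
    filled = rows // fill_every          # rows present in the split table
    return rows * row_bytes + lookup_rows * filled * row_bytes


def joined_cost(rows=ROWS, row_bytes=JOINED_ROW_BYTES):
    """Bytes scanned for the single wide table."""
    return rows * row_bytes


print(split_cost())                 # 24000000: the tie described above
print(joined_cost())                # 24000000
print(split_cost(fill_every=10))    # 40000000: field filled more often, joined wins
print(split_cost(fill_every=500))   # 20400000: field filled more rarely, split wins
```

The last two lines keep the joined row size fixed at 24 bytes, which is a simplification; they are only meant to show how quickly the tie breaks once the fill rate moves.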