Structuring Relational databases - combining similar and related tables - sql

I am used to seeing relational databases where distinct entities are stored in different tables (simple example: Country, State, City). Recently I have been seeing more cases where distinct but similar entities are bundled into the same table and separated with different views. I suppose this can economize on tables and data-access programs (maybe at the expense of clarity and flexibility). Re-reading the definition of normalized databases, I don't think this breaks any rules, but it seems less intuitive and a throwback to the old mainframe "Miscellaneous" tables where you put anything that was forgotten in the design stage. See the two examples below: multi-table solution vs. single-table solution. Is this phenomenon part of a data or programming design pattern, and does it have a name?

If you have small dedicated tables, then the database can easily cache the ones it needs in memory.
If you take what would otherwise be small tables and cram them together into one, the database doesn't know which entries are important to cache and which aren't.
More importantly, there is more opportunity for errors, because you can inadvertently type in the wrong type code and end up joining to something irrelevant, with no RI (referential integrity) or type checking to warn you. If you use small dedicated tables, then you can specify RI constraints.
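As a minimal sketch of that point (the table and column names here are made up, not from the question):

    -- Dedicated lookup tables: the database can enforce that only valid codes are used.
    CREATE TABLE order_status (
        status_code  CHAR(3)      PRIMARY KEY,
        description  VARCHAR(100) NOT NULL
    );

    CREATE TABLE country (
        country_code CHAR(2)      PRIMARY KEY,
        country_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE customer_order (
        order_id     INT     PRIMARY KEY,
        status_code  CHAR(3) NOT NULL,
        country_code CHAR(2) NOT NULL,
        FOREIGN KEY (status_code)  REFERENCES order_status (status_code),
        FOREIGN KEY (country_code) REFERENCES country (country_code)
    );

    -- A typo in a code now fails with an RI violation instead of silently joining to nothing:
    INSERT INTO customer_order VALUES (1, 'XYZ', 'ZZ');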
Thinking back to a place where I saw the single monster-lookup-table pattern done, I think the attraction was that developers can add more kinds of entries without needing DBA intervention to create more tables. There were a lot of developers and only a few DBAs and this was how the DBAs avoided getting sucked into having to create dedicated lookup tables every time a new type of lookup entry was introduced. (Apparently granting create table rights in dev was not acceptable for the DBAs there.)
This seems like a workaround for environments where database schema changes are hard to come by. But another consideration is it may be easier to internationalize if all your entries are in one table.
And the pattern has an established name: it's called the One True Lookup Table. The linked article calls it out as an antipattern and lists more downfalls of this technique. Here is the bulleted list from the article:
It makes the SQL look ugly.
Many statements will require multiple joins to the lookup table. The extra join columns make the statements look bigger and scarier. There will be the same number of joins when using separate lookup tables, but those joins will be simpler.
Multiple references to the same table can make it hard to determine what is happening in the execution plan, as you will see those repeated references there and have to refer to the predicates to understand the context of each table reference. If you were using separate lookup tables, it would be clear which table you were referring to at any point of the execution plan.
You can't foreign key to this type of table. Technically you can, if you are willing to put both columns (lookup_type_code and lookup_key) in the referencing table, but you won't because it is ugly (see the sketch after this list). This means there is a good chance your data integrity will be compromised over time. It's really easy to foreign key to individual lookup tables, and therefore protect your data.
It's hard to control the contents of the table. It's a shared resource, so check constraints and triggers are problematic. If you need users to have different privileges, depending on which lookup they are dealing with, things are going to get messy. That would be really easy with separate lookup tables.
If you need to make a change for one reference type, like extending the size of the key or value, it affects all reference data. Using separate lookup tables isolates the change.
Over time, many reference tables take on additional data. To model that you would need to either split out that reference data from this shared lookup table, or start adding optional columns to cope with the "one-off" issues. A change like this is really simple for separate lookup tables.
Data types matter. You should always use the correct data type, as it will reduce the number of data type conversions needed. Implicit data type conversions are bugs waiting to happen!
Performance can be a problem with the OTLT approach, as it's hard for the optimizer to make sound judgements about the data. The optimizer cares about cardinality, but that is hard to estimate when you are dealing with a large number of rows, most of which are irrelevant in any one specific context. The optimizer also cares about high/low values, but these are not relevant to any one lookup; they are shared. We've also mentioned you probably won't foreign key to this data, which reduces the amount of information the optimizer has when making its decision. You may also have artificially made columns optional that are actually mandatory: a key must have a value, but which column? I think you get the message.
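As a rough sketch of the foreign-key point above (all table and column names are hypothetical), this is what the shared table and the composite-key workaround tend to look like:

    -- The One True Lookup Table: every kind of reference data in one place.
    CREATE TABLE lookup (
        lookup_type_code VARCHAR(30)  NOT NULL,
        lookup_key       VARCHAR(30)  NOT NULL,
        lookup_value     VARCHAR(200) NOT NULL,
        PRIMARY KEY (lookup_type_code, lookup_key)
    );

    INSERT INTO lookup VALUES ('ORDER_STATUS', 'SHP', 'Shipped');
    INSERT INTO lookup VALUES ('COUNTRY',      'US',  'United States');

    -- To get a real foreign key, every referencing table has to carry the type code too:
    CREATE TABLE shipment (
        shipment_id      INT         PRIMARY KEY,
        status_type_code VARCHAR(30) NOT NULL DEFAULT 'ORDER_STATUS',
        status_key       VARCHAR(30) NOT NULL,
        FOREIGN KEY (status_type_code, status_key)
            REFERENCES lookup (lookup_type_code, lookup_key),
        CHECK (status_type_code = 'ORDER_STATUS')
    );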

I think if you only need a name dictionary (for spellchecking or something like that), the second approach is good enough. Otherwise, if the objects have some additional specific fields, the second approach is very bad.

Related

Are one-to-one related tables good for distributed sql databases?

Suppose I have a User table, and other tables (e.g. UserSettings, UserStatistics) which have a one-to-one relationship with a user.
Since SQL databases don't save complex structs in table fields (some allow JSON fields with an undefined format), is it OK to just add said tables, allowing individual (complex) data to be stored for each user? Will it hurt performance by requiring more joins in my queries?
And in the distributed-database case, will those (connected) tables be saved randomly on different nodes, causing more redundant requests between nodes and decreasing efficiency?
1:1 joins can definitely add overhead, especially in a distributed database. Using a JSON or other schema-less column is one way to avoid that, but there are others.
The simplest approach is a "wide table": instead of creating a new table UserSettings with columns a,b,c, add columns setting_a, setting_b, setting_c to your User table. You can still treat them as separate objects when using an ORM; it'll just need a little extra code.
Some databases (like CockroachDB which you've tagged in your question) let you subdivide a wide table into "column families". This tends to let you get the best of both worlds: the database knows to store rows for the same user on the same node, but also to let them be updated independently.
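Here is a rough sketch of the column-family idea in CockroachDB syntax (the table and columns are made up for illustration):

    -- One logical row per user, split into families that are stored and updated separately.
    CREATE TABLE users (
        id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name          STRING NOT NULL,
        email         STRING,
        theme         STRING,
        notifications BOOL,
        FAMILY core     (id, name, email),         -- the "User" part of the row
        FAMILY settings (theme, notifications)     -- the "UserSettings" part of the row
    );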
The main downside of using JSON columns is that they're harder to query efficiently: if you want all users with a certain setting, or want to know just one setting for a user, you're going to take at least a minor performance hit if the database has to parse a JSON column to figure that out, or if you have to fetch the entire blob and do it in your app. If they're more convenient for other reasons, though, you can work around this by adding inverted indexes on your JSON columns, or expression indexes on the specific values you're interested in. Indexes can have a similar cost to 1:1 joins, but you can mitigate that in CockroachDB by using the STORING keyword to tell the DB to write a copy of all the user columns to the index.
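And a minimal sketch of the JSON route, assuming a recent CockroachDB version and a hypothetical settings column:

    -- JSON alternative to the wide table: one schemaless settings blob per user.
    CREATE TABLE users_json (
        id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name     STRING NOT NULL,
        settings JSONB
    );

    -- Inverted index: speeds up containment queries such as
    --   SELECT id FROM users_json WHERE settings @> '{"theme": "dark"}';
    CREATE INVERTED INDEX users_json_settings_idx ON users_json (settings);

    -- Expression index on one specific setting, STORING name so that query can be
    -- answered from the index alone, without a lookup back to the primary index:
    CREATE INDEX users_json_theme_idx ON users_json ((settings->>'theme')) STORING (name);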

Modeling a database: many small tables or not?

I have a database with some information that is repeated in several tables.
I want to know whether it's worthwhile to create a table for this information and keep only its id in the other tables.
It's attractive because with this method I don't have any redundancy. But I will have to do many joins between my tables in my queries, and I'm afraid my queries will be slower.
(I work with Symfony, if that changes anything.)
It sounds like the 'information' in question is data that makes up key values. If so, it sounds like the database designer likes to use natural keys and that you prefer to use surrogate keys.
First, these are both merely a question of style. If the natural key values are composite (i.e. involve more than one column) and are included in other tables for data integrity purposes, then they are not redundant.
Second, as you have observed, when it comes to the performance of surrogate keys you have to weigh the advantage of the more efficient data type (e.g. a single integer column) against the degraded performance of needing to write more JOINs. Note that using surrogates tends to make constraints more troublesome to write, e.g. when the required values for a rule are in another table and your SQL product doesn't support subqueries in CHECK constraints, you will need to use a trigger, which degrades performance in a high-activity environment.
Further consider that performance is not the only consideration, e.g. using natural key values will tend to make the data more readable and therefore make the schema easier to maintain, because the physical model will reflect the logical model more closely (surrogate keys do not appear in the logical model at all).
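A tiny sketch of the trade-off, with made-up tables (which style is "better" is exactly the judgement call described above):

    -- Natural key: the child row is readable on its own, and the FK enforces the rule directly.
    CREATE TABLE currency (
        currency_code CHAR(3)     PRIMARY KEY,   -- 'EUR', 'USD', ...
        currency_name VARCHAR(50) NOT NULL
    );
    CREATE TABLE invoice_natural (
        invoice_id    INT     PRIMARY KEY,
        currency_code CHAR(3) NOT NULL,
        FOREIGN KEY (currency_code) REFERENCES currency (currency_code)
    );

    -- Surrogate key: a compact integer join column, but reading an invoice now needs a JOIN.
    CREATE TABLE currency_surrogate (
        currency_id   INT         PRIMARY KEY,
        currency_code CHAR(3)     NOT NULL UNIQUE,
        currency_name VARCHAR(50) NOT NULL
    );
    CREATE TABLE invoice_surrogate (
        invoice_id  INT PRIMARY KEY,
        currency_id INT NOT NULL,
        FOREIGN KEY (currency_id) REFERENCES currency_surrogate (currency_id)
    );
    SELECT i.invoice_id, c.currency_code
    FROM invoice_surrogate i
    JOIN currency_surrogate c ON c.currency_id = i.currency_id;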
You're talking about Normalisation. As with so many design aspects it's a trade-off.
Having duplication within the database leads to many problems - for example how to keep those duplicates in step when updating data. So Inserts and Updates may well go more slowly because of the duplication. Hence we tend to normalise the database to avoid such duplication. That does lead to more complex queries and possibly some retrieval overhead.
Modern database products tend to do such queries really well if you take a bit of care to have the right indexes in place.
Hence my starting position would be to normalise your data and avoid duplication. Then, in special cases, perhaps denormalise just the pieces where it really becomes essential. For example, suppose some part of your database is large and mostly queried rather than updated (e.g. historic order information); then perhaps denormalise that bit.
It is not a question of style.
The answer is, as the seeker has already identified, removal of duplication; Normalisation. Pull them all into one table, and place a Foreign Key wherever they are used.
Now an Integer FK may be "tidy", but any good, short, fixed length key will do. Variable length keys are very bad for performance, as the key needs to be unpacked every time the index is searched.
The nature of a Normalised database is more, smaller tables, which is much faster than an Unnormalised data heap, with fewer, larger tables. Get used to it.
As long as you are Joining on keys, Joins do not cost anything in themselves; ten joins to construct a row do not cost more than five. The cost is in the table sizes; the indices used; the distribution; the datatypes of the index columns; etc. Relational dbms are heavily engineered for Normalised databases.
If you need to do lookups of lookups, then that is the way it is. Just ensure that the tables are Normalised.
If you don't normalise
How are you going to store values that could potentially be used?
How are you going to separate "Lookup value" from "Look up value" from "LookUpValue", etc.?
You'll be slower because you are storing the variable-length string "Lookup value" across many rows, rather than a nice tidy integer key.
These are the more practical points to add to the other two answers...

MySQL Table with TEXT column

I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages in putting a TEXT column in a table of its own. What are the potential problems of having it with the rest of the fields? And potential problems of keeping it in a separated table?
This table (without the TEXT field) is supposed to be searched (SELECTed) rather frequently. Is "premature optimization considered evil" important here? (If there really is a penalty for TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed?)
Besides, are there any good links on this topic? (Perhaps stackoverflow questions&answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
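In other words, with a hypothetical post table:

    -- List view: no reason to drag the TEXT column over the wire.
    SELECT id, title, published_at FROM post WHERE author_id = 7;

    -- Detail view: fetch the body only for the one row that is actually displayed.
    SELECT body FROM post WHERE id = 42;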
Best wishes,
Fabian
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second guess what makes MySQL slow without real data and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a normalised simple schema, then when something proves too slow add a complexity only where/if needed.
As others have pointed out, the quote you mentioned is more applicable to query results than the schema definition; in any case, your choice of storage engine would affect the validity of the advice.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result-set requires a lookup into the text storage area. This coupled with the very real possibility of huge amounts of data would be a big overhead to your system.
Moving the column to another table simply requires an additional lookup, one into the secondary table, and the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to usually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
The concern is that a large text field—like way over 8,192 bytes—will cause excessive paging and/or file i/o during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata since it doesn't actually contain data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
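For what it's worth, a rough sketch of that kind of split (table names are made up; whether it pays off depends on your engine and access patterns):

    -- Narrow, frequently searched table: only the columns the common queries need.
    CREATE TABLE article (
        article_id INT          PRIMARY KEY,
        title      VARCHAR(200) NOT NULL,
        created_at DATETIME     NOT NULL
    );

    -- The big field lives on its own, keyed by the same id.
    CREATE TABLE article_body (
        article_id INT  PRIMARY KEY,
        body       TEXT NOT NULL,
        FOREIGN KEY (article_id) REFERENCES article (article_id)
    );

    -- Frequent searches never touch the TEXT storage:
    SELECT article_id, title FROM article WHERE created_at >= '2024-01-01';

    -- The body is read only when a single article is actually displayed:
    SELECT body FROM article_body WHERE article_id = 42;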
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However, if you control the code 100%, for simplicity, leave the field on the table and only select it when you need it, to cut down on data transfer and reading time.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table(putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
Which is indeed telling you that, in MySQL, you are discouraged from keeping TEXT data (and BLOB, as written elsewhere) in frequently searched tables.

mysql - how many columns is too many?

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex SELECT queries) to disk. This is bad, as disk I/O can be a big bottleneck. This occurs if you have binary data (TEXT or BLOB) in the query.
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases you can get improvements from narrower tables.
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
That's not a problem as long as all the attributes belong to the same entity and do not depend on each other.
To make life easier you could have one text column with a JSON array stored in it (obviously, only if you don't mind getting all the attributes every time). However, this would entirely defeat the purpose of storing the data in an RDBMS and would greatly complicate every database transaction, so it's not a recommended approach to follow throughout the database.
Having too many columns in the same table can cause huge problems with replication as well. You should know that changes that happen on the master will replicate to the slave; for example, if you update one field in the table, with row-based replication the whole row image may have to be written to the log and sent to the slaves.

How do you know when an SQL database needs more normalization?

Is it when you're trying to get data and there is no apparent easy way of doing it?
When you find something should be a table on its own?
What are the laws?
Check out Wikipedia. The article talks about database normalization and the different forms (first, second, third, etc.). Most of the time you should be aiming for at least third normal form. There are times when you want to relax the rules a bit (it may be too expensive to join multiple tables together, so you might want to de-normalize a bit), but for the most part third normal form is good.
When you notice you have to repeat the same data, or when you start using single fields as arrays.
This is a somewhat snarky answer, but: when you discover that the data isn't sufficiently normalized. There are many resources on the web about the levels (or, more properly, "forms") of normalization, and they describe the forms more completely than I could here. First and second normal forms should be pretty much required. If you aren't at third (or, really, fourth) normal form, you need to have a strong justification as to why.
Check out the Wikipedia article on database normalization.
When you're starting to question whether an SQL database needs more normalization.
Whenever you have a relational database.... <grin/>
No, actually there are laws, check out this Wikipedia link.
They are called the five normal forms, or something like that. Originally from E. F. Codd, the guy who invented the relational model around 1970.
"The key, the whole key, and nothing but the key, so help me Codd."
This is a synopsis:
First normal form (1NF): the table faithfully represents a relation and has no repeating groups.
Second normal form (2NF): no non-prime attribute in the table is functionally dependent on a part (proper subset) of a candidate key.
Third normal form (3NF): every non-prime attribute in the table is non-transitively dependent on every key of the table.
Boyce-Codd normal form (BCNF): every non-trivial functional dependency in the table is a dependency on a superkey.
Fourth normal form (4NF): every non-trivial multivalued dependency in the table is a dependency on a superkey.
Fifth normal form (5NF): every non-trivial join dependency in the table is implied by the superkeys of the table.
Domain/key normal form (DKNF) (Ronald Fagin, 1981): every constraint on the table is a logical consequence of the table's domain constraints and key constraints.
Sixth normal form (6NF): the table features no non-trivial join dependencies at all (with reference to a generalized join operator).
Other people have pointed you to the formal rules for normalization. Here are some informal guidelines I use:
If you have columns in a table whose names differ only by a number (e.g. Phone1 and Phone2; see the sketch after this list).
If you have any columns in a table that should be filled in only when another column in the table is filled in.
If updating a "fact" in the database (such as a street address) requires more than one UPDATE.
If the same question could ever get two different answers depending on which table you get your information from.
If the answer to any non-trivial question can be gotten from the database without JOINing at least two tables.
If you have any quantity-based restrictions in the database other than "only 1 of something is allowed" (that is, "only one address is allowed" is okay, but "only two addresses are allowed" indicates a normalization problem).
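For the first guideline, here is a minimal sketch of what the fix usually looks like (names are made up):

    -- Before: a repeating group encoded in the column names.
    CREATE TABLE customer_flat (
        customer_id INT          PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        phone1      VARCHAR(20),
        phone2      VARCHAR(20)
    );

    -- After: phones move to their own table, one row per phone, as many as needed.
    CREATE TABLE customer (
        customer_id INT          PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );
    CREATE TABLE customer_phone (
        customer_id INT         NOT NULL,
        phone_type  VARCHAR(10) NOT NULL,   -- e.g. 'home', 'mobile'
        phone       VARCHAR(20) NOT NULL,
        PRIMARY KEY (customer_id, phone_type),
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
    );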
3NF is generally all you need and it follows three rules:
Every column in the table should be dependent on:
the key (1NF),
the whole key (2NF),
and nothing but the key (3NF) (so help me Codd is the way that quote usually ends).
You can often "downgrade" to 2NF for performance reasons, provided you understand the implications and only when you strike problems, but 3NF should be the initial goal for all your designs.
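To make "the whole key" concrete, here is a quick made-up sketch of a 2NF violation and its fix:

    -- Violates "the whole key": product_name depends only on product_id,
    -- which is just part of the (order_id, product_id) key.
    CREATE TABLE order_line_bad (
        order_id     INT,
        product_id   INT,
        product_name VARCHAR(100) NOT NULL,
        quantity     INT          NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    -- Fixed: the name lives with the product it actually depends on.
    CREATE TABLE product (
        product_id   INT          PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );
    CREATE TABLE order_line (
        order_id   INT,
        product_id INT,
        quantity   INT NOT NULL,
        PRIMARY KEY (order_id, product_id),
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    );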
As everyone else has said, you know when you start having (too many) duplicate columns in multiple tables.
That being said, it is sometimes useful to have redundant columns across multiple tables. This can reduce the number of JOINs you have to do in complicated queries. Just be careful to keep all the tables in sync, or you're just asking for trouble.
This is a pretty good article. Getting normal is a science, not an art. Now knowing when to DEnormalize... that's an art.
http://www.alvechurchdata.co.uk/hints-and-tips/softnorm.html
See Description of the database normalization basics
What level of normalization are you currently at? If you can't answer that, I assume your database is a nasty mess. I always hit third normal form on the initial design and de-normalize or normalize further if and when needed.
I assume you're talking about a transactional database supporting an interactive application, but for what it's worth...
OLAP databases used exclusively for reporting and only updated by ETL processes may benefit from a less normalized structure. In these applications you accept the cost of redundant data storage and duplication for the performance benefit of fewer joins and the increased ease of use for (sometimes less technical) data analysts and business analysts.
Transactional databases should always be normalized to the extent practical (at least 3NF) and then selectively denormalized only as needed. And the need to denormalize should ideally be based on actual performance testing results.
When you have to search through huge amounts of data just to extract some basic info, e.g. what kinds of product categories there are, or something like that.