I read something like the following in a description of 1NF (first normal form) in DBMS.
There was a sentence as follows:
"Every column should be atomic."
Can anyone please explain it to me thoroughly with an example?
Re "atomic"
In Codd's original 1969 and 1970 papers he defined relations as having a value for every attribute in a row. The value could be anything, including a relation. This used no notion of "atomic". He explained that "atomic" meant not relation-valued (ie not table-valued):
So far, we have discussed examples of relations which are defined on
simple domains--domains whose elements are atomic (nondecomposable)
values. Nonatomic values can be discussed within the relational
framework. Thus, some domains may have relations as elements.
He used "simple", "atomic" and "nondecomposable" as informal expository notions. He understood that a relation has rows of which each column has an associated name and value; attributes are by definition "single-valued"; the value is of any type. The only structural property that matters relationally is being a relation. It is also just a value, but you can query it relationally. Then he used "nonsimple" etc meaning relation-valued.
By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:
From a database perspective, data can be classified into two types:
atomic and compound. Atomic data cannot be decomposed into smaller
pieces by the DBMS (excluding certain special functions). Compound
data, consisting of structured combinations of atomic data, can be
decomposed by the DBMS.
In the relational model there is only one type of compound data: the
relation. The values in the domains on which each relation is defined
are required to be atomic with respect to the DBMS. A relational
database is a collection of relations of assorted degrees. All of the
query and manipulative operators are upon relations, and all of them
generate relations as results. Why focus on just one type of compound
data? The main reason is that any additional types of compound data
add complexity without adding power.
"In the relational model there is only one type of compound data: the relation."
Sadly, "atomic = non-relation" is not what you're going to hear. (Unfortunately Codd was not the clearest writer and his expository remarks get confused with his bottom line.) Virtually all presentations of the relational model get no further than what was for Codd merely a stepping stone. They promote an unhelpful confused fuzzy notion canonicalized/canonized as "atomic" determining "normalized". Sometimes they wrongly use it to define relation. Whereas Codd used everyday "nonatomic" to introduce defining relational "nonatomic" as relation-valued and defined "normalized" as free of relation-valued domains.
(Neither is "not a repeating group" helpful as "atomic", defining it as not something that is not even a relational notion. And sure enough in 1970 Codd says "terms attribute and repeating group in present database terminology are roughly analogous to simple domain and nonsimple domain, respectively".)
Eg: This misinterpretation was promoted for a long time from early on by Chris Date, honourable early relational explicator and proselytizer, primarily in his seminal still-current book An Introduction to Database Systems. Which now (2004 8th edition) thankfully presents the helpful relationally-oriented extended notion of distinguishing relation, row and "scalar" (non-relation non-row) domains:
This definition merely states that all [relation variables] are in 1NF
Eg: Maier's classic The Theory of Relational Databases (1983):
The definition of atomic is hazy; a value that is atomic in one application could be non-atomic in another. For a general guideline, a value is non-atomic if the application deals with only a part of the value.
Eg: The current Wikipedia article on First NF (Normal Form) section Atomicity actually quotes from the introductory parts above. And then ignores the precise meaning. (Then it says something unintelligible about when the nonatomic turtles should stop.):
Codd states that the "values in the domains on which each
relation is defined are required to be atomic with respect to the
DBMS." Codd defines an atomic value as one that "cannot be decomposed
into smaller pieces by the DBMS (excluding certain special functions)"
meaning a field should not be divided into parts with more than one
kind of data in it such that what one part means to the DBMS depends
on another part of the same field.
Re "normalized" and "1NF"
When Codd used "normalize" in 1970, he meant eliminate relation-valued ("non-simple") domains from a relational database:
For this reason (and others to be cited below) the possibility of
eliminating nonsimple domains appears worth investigating. There is,
in fact, a very simple elimination procedure, which we shall call
normalization.
Later the notion of "higher NFs" (involving FDs (functional dependencies) & then JDs (join dependencies)) arose and "normalize" took on a different meaning. Since Codd's original normalization paper, normalization theory has always given results relevant to all relations, not just those in Codd's 1NF. So one can "normalize" in the original sense of going from just relations to a "normalized" "1NF" without relation-valued columns. And one can "normalize" in the normalization-theory sense of going from a just-relations "1NF" to higher NFs while ignoring whether domains are relations. And "normalization" is commonly also used for the "hazy" notion of eliminating values with "parts". And "normalization" is also wrongly used for designing a relational version of a non-relational database (whether just relations and/or some other sense of "1NF").
Relational spirit is to eschew multiple columns with the same meaning or domains with interesting parts in favour of another base table. But we must always come to an informal ergonomic decision about when to stop representing parts and just treat a column as "atomic" (non-relation-valued) vs "nonatomic" (relation-valued).
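To make normalization in the original 1970 sense concrete, here is a minimal sketch of eliminating a relation-valued attribute by copying the parent key down into a new relation. The employee data, the attribute names and the helper `normalize` are all hypothetical, chosen only for illustration; this is in the spirit of the elimination procedure quoted above, not a transcription of it.

```python
# Hypothetical data: an "employee" relation whose "children" attribute is
# itself a relation (a list of rows), i.e. a nonsimple domain.
employees = [
    {"emp_no": 1, "name": "Ann", "children": [{"child_name": "Bo", "birth_year": 2001}]},
    {"emp_no": 2, "name": "Raj", "children": []},
    {"emp_no": 3, "name": "Lee", "children": [
        {"child_name": "Kim", "birth_year": 1998},
        {"child_name": "Sam", "birth_year": 2003},
    ]},
]


def normalize(rows, key, nested):
    """Split off the relation-valued attribute `nested`, copying the parent key down."""
    # The parent keeps every attribute except the relation-valued one.
    parent = [{k: v for k, v in row.items() if k != nested} for row in rows]
    # Each nested row becomes a row of a new relation, extended with the parent key.
    child = [{key: row[key], **sub} for row in rows for sub in row[nested]]
    return parent, child


employee, employee_child = normalize(employees, "emp_no", "children")
```

After the split, both `employee` and `employee_child` have only non-relation values in every column, which is what "normalized" meant in that paper.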
Normalization in database management system
Atomicity and 1NF... that is not about atomic transactions, but about definition and column content.
"Atomic" means "cannot be divided or split in smaller parts". Applied to 1NF this means that a column should not contain more than one value. It should not compose or combine values that have a meaning of their own.
This typically concerns two very common mistakes made by database designers:
1. multiple values in one column (list columns)
Columns that contain a list of values, typically space- or comma-separated, like this blog post table:
id | title     | date_posted | content | tags
---------------------------------------------------------
1  | new idea  | 2014-05-23  | ...     | tag1,tag2,tag3
2  | why this? | 2014-05-24  | ...     | tag2,tag5
3  | towel day | 2014-05-26  | ...     | tag42
or this contacts table:
id | room | phones
---------------------------------------
4  | 432  | 111-111-111 222-222-222
5  | 456  | 999-999-999
6  | 512  | 888-888-8888 333-3333-3333
This type of denormalization is rare, as most database designers see that it cannot be a good thing. But you do find tables like this. They usually come from modifications to the database, where it may seem simpler to widen a column and stuff multiple values into it than to add a normalized related table (which often breaks existing applications).
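For comparison, here is a minimal sketch of the normalized alternative to the tags column, run through Python's built-in sqlite3 module. The table names `post` and `post_tag` are my own choice, not from any particular application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    date_posted TEXT NOT NULL
);
-- One row per (post, tag) pair replaces the comma-separated tags column.
CREATE TABLE post_tag (
    post_id INTEGER NOT NULL REFERENCES post(id),
    tag     TEXT    NOT NULL,
    PRIMARY KEY (post_id, tag)
);
INSERT INTO post VALUES (1, 'new idea',  '2014-05-23'),
                        (2, 'why this?', '2014-05-24'),
                        (3, 'towel day', '2014-05-26');
INSERT INTO post_tag VALUES (1, 'tag1'), (1, 'tag2'), (1, 'tag3'),
                            (2, 'tag2'), (2, 'tag5'),
                            (3, 'tag42');
""")

# Finding posts by tag is an ordinary equality join, with no string splitting.
titles_with_tag2 = [row[0] for row in conn.execute(
    "SELECT p.title FROM post p JOIN post_tag t ON t.post_id = p.id "
    "WHERE t.tag = ? ORDER BY p.id", ("tag2",))]
```

With the list column, the same question would need a pattern match on the string, which is easy to get wrong (searching for tag4 would also match tag42).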
2. complex multi-part columns
In this case one column contains different bits of information and could maybe be designed as a set of separate columns.
Typical example are fullname and address columns:
id | fullname       | address
------------------------------------------------------------
1  | Mark Tomers    | 56 Tomato Road
2  | Fred Askalong  | 3277 Hadley Drive
3  | May Anne Brice | 225 Century Avenue - apartment 43/a
These types of denormalizations are very common, as it is quite difficult to draw the line between what is atomic and what is not. Depending on the application, a multi-part column could very well be the best solution in some cases. It is less structured, but simpler.
Structuring an address in many atomic columns may mean having more complex code to handle results for output. Another complexity comes from the structure not being adequate to fit all types of addresses. Using one single VARCHAR column does not pose this problem, but may pose others, typically about searching and sorting.
An extreme case of multi-part columns is dates and times. Most RDBMSs provide date and time data types and functions to handle date and time arithmetic and the extraction of the various parts (month, hour, etc.). Few people would consider it convenient to have separate year, month and day columns in a relational database. But I've seen it, and with good reason: the use case was birthdates for a justice department database. They had to handle many immigrants with few or no documents. Sometimes you just knew a person was born in a certain year, but you would not know the day or month of birth. You can't handle that type of information with a single date column.
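A minimal sketch of how such a partially known birthdate could be stored, again using sqlite3. The table and column names and the exact CHECK rules are assumptions of mine, not the schema of the database mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE person (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    birth_year  INTEGER,                                      -- NULL = not known
    birth_month INTEGER CHECK (birth_month BETWEEN 1 AND 12), -- NULL = not known
    birth_day   INTEGER CHECK (birth_day BETWEEN 1 AND 31),   -- NULL = not known
    -- assumed rule: a day without a month is not accepted
    CHECK (birth_day IS NULL OR birth_month IS NOT NULL)
)
""")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)", [
    (1, "A", 1980, 3, 14),        # fully known
    (2, "B", 1975, None, None),   # only the year is known
])

# Each part can still be searched on its own.
born_1975 = conn.execute(
    "SELECT name FROM person WHERE birth_year = 1975").fetchall()

# A day with no month violates the table-level CHECK and is rejected.
try:
    conn.execute("INSERT INTO person VALUES (3, 'C', 1990, NULL, 7)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```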
"Every column should be atomic."
Chris Date says, "Please note very carefully that it is not just simple things like the integer 3 that are legitimate values. On the contrary, values can be arbitrarily complex; for example, a value might be a geometric point, or a polygon, or an X ray, or an XML document, or a fingerprint, or an array, or a stack, or a list, or a relation (and so on)."[1]
He also says, "A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute."[2]
He generally discourages the use of the word atomic, because it has confusing connotations. Single value is probably a better term to use.
For example, a date like '2014-01-01' is a single value. It's not indivisible; on the contrary, it quite clearly is divisible. But the dbms does one of two things with single values that have parts. The dbms either returns those values as a whole, or the dbms provides functions to manipulate the parts. (Clients don't have to write code to manipulate the parts.)[3]
In the case of dates, SQL can
return dates as a whole (SELECT CURRENT_DATE),
return one or more parts of a date (EXTRACT(YEAR FROM CURRENT_DATE)),
add and subtract intervals (CURRENT_DATE + INTERVAL '1' DAY),
subtract one date from another (CURRENT_DATE - DATE '2014-01-01'),
and so on. In this (narrow) respect, SQL is quite relational.
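The same kinds of operations can be tried out from Python's sqlite3 module. Note that SQLite has no dedicated date type and spells these operations differently from the standard SQL shown above: it uses the functions date, strftime and julianday on ISO-8601 text. This is only a sketch with fixed example dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
next_day, year, days_between = conn.execute("""
    SELECT date('2014-01-01', '+1 day'),                      -- add an interval
           CAST(strftime('%Y', '2014-05-23') AS INTEGER),     -- extract a part
           julianday('2014-05-23') - julianday('2014-01-01')  -- subtract two dates
""").fetchone()
```

In each case the DBMS, not the client, takes the value apart.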
[1] An Introduction to Database Systems, 8th ed, p 113. Emphasis in the original.
[2] Ibid, p 358.
[3] In the case of a "user-defined" type, the "user" is presumed to be a database programmer, not a client of the database.
It means a column should not contain multiple values (like comma-separated values).
Please see the link below.
http://www.studytonight.com/dbms/database-normalization.php
Context
NULL is often conflated to mean both unknown data and the known absence of data
In aggregates (such as DATEs or worse, DATETIMEs), sparse data must be represented with magic values
Examples
The middle name of a person
A person who does not have a middle name compared to someone whose middle name is not known
Someone's birth date and time
Differentiating between the time not being known at all vs. the time not being known by your system
Knowing someone's month and day of birth but not the year
Other thoughts, context, and/or approaches
If this were a NoSQL context, one could have a "rule" that if a field is known to be absent, it's not stored at all and if it's unknown, stored as a null
This might make more sense with the rule flipped
Aggregates could be broken up and the rule could be applied to individual fields
I am admittedly ignorant in the NoSQL realm, but it seems like this would be easy to get wrong
For better or for worse, this doesn't apply to a SQL database; omission and NULL are the same
Any field that can be either unknown or absent could have an associated BOOLEAN field that states whether it is absent or not
This is the only approach that seems bulletproof to me
Could seemingly grow to all fields
Seems extremely tedious at the very least
Some "special value" (or values since there are various types) to represent the difference
For a text field, perhaps my-application/unknown and/or my-application/absent (or pick NULL for one)
Impossible to enforce without ambiguity (if one chose for example 42 for a Unix time, that is also 1970-01-01T00:00:42+00:00)
Question
What are the best practices for dealing with the difference between unknown data and known absence of data?
This is too long for a comment.
The "best practices" depend on the context and the overall data model. There are lots of different approaches:
Using a "magic" non-NULL value.
Using a separate boolean/tinyint column.
Using an EAV (entity-attribute-value) model.
Storing the values as JSON or XML (similar to your NoSQL approach).
(and no doubt other methods.)
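As an illustration of the separate boolean column approach from the list above, here is a minimal sketch using sqlite3. The column name `middle_name_is_absent` and the CHECK rules are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE person (
    id                    INTEGER PRIMARY KEY,
    first_name            TEXT NOT NULL,
    middle_name           TEXT,
    -- 1 = known to have no middle name, 0 = no such knowledge
    middle_name_is_absent INTEGER NOT NULL DEFAULT 0
        CHECK (middle_name_is_absent IN (0, 1)),
    -- a stored name and "known absent" cannot both hold
    CHECK (NOT (middle_name IS NOT NULL AND middle_name_is_absent = 1))
)
""")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "Ada", "Lee", 0),   # known value
    (2, "Bob", None, 1),    # known absence
    (3, "Cy",  None, 0),    # unknown
])

# The three states can be told apart without any magic value.
status = dict(conn.execute("""
    SELECT first_name,
           CASE WHEN middle_name IS NOT NULL   THEN 'known'
                WHEN middle_name_is_absent = 1 THEN 'absent'
                ELSE 'unknown' END
    FROM person
"""))
```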
For your example with the middle name, I find it hard to imagine a scenario where someone is known to have a middle name, but the name (or initial) is not known. So, NULL seems quite appropriate.
This applies to phone numbers, email addresses, preferred honorifics and so on.
In practice, I don't find that this is a major concern. NULL seems to work well for most columns. Admittedly, there are some cases, but those are rather rare.
I was wondering if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit of creating a table with columns defined like so
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
---------------
1 | Messy Joe's
2 | Bate's Motel
type_id | name
---------------
1 | hotel
2 | restaurant
object_id | type_id
---------------
1 | 2
2 | 1
field_id | name | field_type
---------------
1 | address | text
2 | opening_hours | date
3 | speciality | text
type_id | field_id
---------------
1 | 1
1 | 2
2 | 1
2 | 3
object_id | field_id | value
1 | 1 | 1st street....
1 | 3 | English Cuisine
Of course it can be more abstract if values are predefined (Example: specialties could have their own list).
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables, is Unnormalised. Fair enough, it may be an attempt at Third Normal Form, but in fact it is a flat file, Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal Form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls, it is a performance hog in many ways, and prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide it, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions
rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs, academics as well as vendors with their products with limitations, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is, 5NF is today, what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form
which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, it is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has value.
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
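A minimal sketch of the idea, assuming that "6NF" here means one table per optional attribute with every column declared NOT NULL, plus a wide View for users. The table names, the sample address and the placeholder text '(none recorded)' are mine, not part of any real catalogue; a missing value is simply a missing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE venue (venue_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One table per optional attribute; no column allows NULL.
CREATE TABLE venue_address (
    venue_id INTEGER PRIMARY KEY REFERENCES venue(venue_id),
    address  TEXT NOT NULL
);
CREATE TABLE venue_speciality (
    venue_id   INTEGER PRIMARY KEY REFERENCES venue(venue_id),
    speciality TEXT NOT NULL
);
-- A wide view for users; how a missing value is displayed is decided here.
CREATE VIEW venue_wide AS
SELECT v.venue_id, v.name,
       COALESCE(a.address,    '(none recorded)') AS address,
       COALESCE(s.speciality, '(none recorded)') AS speciality
FROM venue v
LEFT JOIN venue_address    a ON a.venue_id = v.venue_id
LEFT JOIN venue_speciality s ON s.venue_id = v.venue_id;

INSERT INTO venue VALUES (1, 'Messy Joe''s'), (2, 'Bate''s Motel');
INSERT INTO venue_address    VALUES (1, '12 Example Road');
INSERT INTO venue_speciality VALUES (1, 'English Cuisine');
""")

rows = conn.execute("SELECT * FROM venue_wide ORDER BY venue_id").fetchall()
```

Users query `venue_wide` and never see the per-attribute tables.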
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
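A minimal sketch of generating the SQL from a catalogue, using the object/field/value tables from the question. The helper `build_wide_select` and the table name `object_field` are hypothetical, and the field names in the catalogue are assumed to be trusted identifiers (they are pasted into the SQL text).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object (object_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE field  (field_id  INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE object_field (
    object_id INTEGER NOT NULL REFERENCES object(object_id),
    field_id  INTEGER NOT NULL REFERENCES field(field_id),
    value     TEXT    NOT NULL,
    PRIMARY KEY (object_id, field_id)
);
INSERT INTO object VALUES (1, 'Messy Joe''s'), (2, 'Bate''s Motel');
INSERT INTO field  VALUES (1, 'address'), (2, 'opening_hours'), (3, 'speciality');
INSERT INTO object_field VALUES (1, 1, '12 Example Road'), (1, 3, 'English Cuisine');
""")


def build_wide_select(conn):
    """Generate one SELECT with a LEFT JOIN per catalogue entry in `field`."""
    cols, joins = ["o.object_id", "o.name"], []
    query = "SELECT field_id, name FROM field ORDER BY field_id"
    for field_id, name in conn.execute(query).fetchall():
        alias = f"f{int(field_id)}"
        cols.append(f'{alias}.value AS "{name}"')
        joins.append(
            f"LEFT JOIN object_field {alias} "
            f"ON {alias}.object_id = o.object_id AND {alias}.field_id = {int(field_id)}"
        )
    return f"SELECT {', '.join(cols)} FROM object o {' '.join(joins)} ORDER BY o.object_id"


sql = build_wide_select(conn)
rows = conn.execute(sql).fetchall()
```

Adding a row to `field` changes the generated SELECT without anyone editing SQL by hand. Attributes with no stored row come back as None here; how to present them is a separate decision.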
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side [PK = parent FK + something]), is instantaneous.
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
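One way to check whether a particular join can use the indices described above is to ask the engine for its plan. A minimal sketch with SQLite's EXPLAIN QUERY PLAN follows; the tables are hypothetical, and because the wording of the plan differs between SQLite versions the code only looks for full-table SCAN steps rather than relying on exact text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (parent_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE child (
    parent_id INTEGER NOT NULL REFERENCES parent(parent_id),
    seq       INTEGER NOT NULL,
    payload   TEXT    NOT NULL,
    PRIMARY KEY (parent_id, seq)      -- child PK = parent FK + something
);
""")

# The last column of each plan row is the human-readable step description.
plan = [row[-1] for row in conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT p.name, c.payload
    FROM parent p JOIN child c ON c.parent_id = p.parent_id
    WHERE p.parent_id = 42
""")]

# Steps that read a whole table start with SCAN; indexed lookups start with SEARCH.
full_scans = [step for step in plan if step.startswith("SCAN")]
```

If `full_scans` is not empty for a query that matters, the leading columns of an index probably do not match the join, which is the slow case described above.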
I would be interested to know what are
the up and downsides of both methods.
I can just imagine for myself, but I
don't have the experience to confirm
this.
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most cases, because the change control in place is quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted), allows columns to be added without DDL changes. That is the single reason people choose it. (code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accommodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The downside is that at retrieval time, you have to do the data analysis that you skipped over before building the database, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accommodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
When you start to require a large number of different entities (or even before...), a NoSQL solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
  "id": 1,
  "type": "Restaurant",
  "name": "Messy Joe",
  "address": "1 Main St.",
  "tags": ["asian", "fusion", "casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restaurant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype,
with common attributes repeated in
each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intuitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance - but RDBMSs are well optimised for doing joins so it doesn't really cause performance problems in practice.
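A minimal sketch of option 3, using sqlite3. The table names `listing`, `restaurant` and `hotel`, the `star_rating` column and the sample addresses are assumptions for illustration: each subtype table shares its primary key with the supertype table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supertype: attributes common to every listing.
CREATE TABLE listing (
    listing_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    address    TEXT NOT NULL
);
-- Subtypes: the primary key is also a foreign key to the supertype.
CREATE TABLE restaurant (
    listing_id INTEGER PRIMARY KEY REFERENCES listing(listing_id),
    speciality TEXT NOT NULL
);
CREATE TABLE hotel (
    listing_id  INTEGER PRIMARY KEY REFERENCES listing(listing_id),
    star_rating INTEGER NOT NULL
);
INSERT INTO listing VALUES (1, 'Messy Joe''s', '12 Example Road'),
                           (2, 'Bate''s Motel', '34 Sample Lane');
INSERT INTO restaurant VALUES (1, 'English Cuisine');
INSERT INTO hotel      VALUES (2, 2);
""")

# Retrieving a complete restaurant costs exactly one join.
restaurants = conn.execute("""
    SELECT l.name, l.address, r.speciality
    FROM listing l JOIN restaurant r ON r.listing_id = l.listing_id
""").fetchall()
```

Other tables (reviews, photos and so on) can reference `listing` alone, whatever the subtype.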
Option 1 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 2 appears at first sight to be the most performant, although with 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficient than option 3 - since the query engine can do joins faster than it can search bloated sparse table spaces.
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth
Here's a simple example of an SQL table:
CREATE TABLE persons
(
    id INTEGER,
    name VARCHAR(255),
    height DOUBLE
);
Since I haven't used SQL very much, I haven't yet learned to think in its terms. Effectively, my brain translates the above into this:
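#include <string>
using std::string;

struct Person
{
    int id;
    string name;
    double height;
    // Default constructor, needed so that an array of Person can be declared.
    Person() : id(0), height(0.0) {}
    Person(int id_, const char* name_, double height_)
        : id(id_), name(name_), height(height_)
    {}
};
Person persons[64];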
Then, inserting some elements, in SQL:
INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125);
INSERT INTO persons (id, name, height) VALUES (5678, 'Jesse', 6.333);
...and how I'm thinking of it:
persons[0] = Person(1234, "Frank", 5.125);
persons[1] = Person(5678, "Jesse", 6.333);
I've read that SQL can be thought of as two major parts: data manipulation and data definition. I'm more concerned about organizing my data in the first place, as opposed to querying and modifying it. There, the distinctions of SQL are immediately obvious. To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic. Where does the array-of-structs analogy I'm automatically drawing for myself break down?
To give a concrete example, let's say that I want each entry in my persons table (or each of my Person objects) to contain a field denoting the names of that person's children (actual fruit-of-your-loins children, not hierarchical data structure children). In reality, these would probably be cross-table references (or pointers to objects), but let's keep things simple and make this field contain zero or more names. In my C++ example, I'd modify the declaration like so:
vector<string> namesOfChildren;
...and do something like this:
persons[0].namesOfChildren.push_back("John");
persons[0].namesOfChildren.push_back("Jane");
But, from what I can tell, the typical usage of SQL doesn't mirror this approach. If I'm wrong and there's a simple, straightforward solution, great. If not, I'm sure a SQL novice like myself could benefit greatly from a little cogitation on the subject of how databases of SQL tables are meant to be used in contrast to bare, generic data structures.
To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic.
It's called "data(base) modeling" and is somewhere between an engineering discipline and an art (like much of computer programming). If you are really interested in the topic, take a look at the ERwin Methods Guide.
Where does the array-of-structs analogy I'm automatically drawing for myself break down?
At persistency, concurrency, consistency and scalability.
Persistency: The table is automatically saved to the permanent storage. It'll stay there and survive reboots (not that a real database server will reboot much) until you explicitly delete it or there is a catastrophic hardware failure. DBMSes have well-oiled backup procedures that should help in the latter case.
Concurrency: Tables are meant to be accessed and (if need be) modified by many clients concurrently. Mechanisms such as locking and multi-version concurrency control are employed to ensure clients will not "step on each other's toes".
Consistency: You can define certain constraints (such as uniqueness, foreign keys or checks) and the DBMS will make sure they are never broken. Furthermore, this can often be done in a declarative manner, minimizing chance for errors. On top of that, everything you do in a database is transactional, so you reap the benefits of atomicity, consistency, isolation and durability (aka. "ACID"). In a nutshell, the database will defend itself from bad data.
Scalability: A well designed database schema can grow well beyond the confines of the available RAM, and still keep good performance, using techniques such as indexing, partitioning, clustering etc... Furthermore, SQL is declarative and set-based, which means that the DBMS has the latitude to pick the optimal "query execution plan" for the data at hand, auto-parallelize the query, cache the results in hope they will be reused etc... without changing the meaning of the query.
Your array-of-structs analogy is not bad as a starting point.
Beyond that, the differences begin with how the data is organized.
Database people love their "Normal Forms" laws. We do not have these laws in C++ or similar programming languages. Organizing data in tables according to these laws helps database engines do their magic (queries, joins) better, i.e. keep databases compact, crunch millions of rows in fractions of a second, and serve multiple requests concurrently. They are not absolute laws: 1NF (1st Normal Form) is followed in 99.9999% of cases, but the bigger the number (2NF, 3NF, ...) the more often DB planners allow themselves to deviate from them.
Description of normal forms can be found for example here.
I will try to illustrate differences on your example.
In your example, the fields of your struct correspond to the columns of the database table. Adding a vector of names as a new field of the struct would correspond to adding a comma-separated list of names in a new column of your table. This violates 1NF, which demands that each cell hold one value, not a list of values. To normalize your data you need two arrays: one of Person structs and a new one of Child structs. While in C++ we can use pointers to link each child to its parent, in SQL we must use keys. You already have an id field in the Person struct; now we add a ParentId field to the Child struct so that the database engine can find the parent. The ParentId column is called a foreign key. Another way to satisfy 1NF, instead of creating a new table/struct for children, is to switch to child-centric thinking and have just one table with one record per child, which includes all the information about that child's parent. The parent's information will obviously be repeated in as many records as the parent has children.
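To make the two-array idea concrete, here is a minimal C++ sketch. The Child struct, its parentId field and the childrenOf helper are names assumed for illustration; they are not part of the question's code, and the loop only approximates what a SELECT on the children table would do.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the "persons" table.
struct Person {
    int id;              // primary key
    std::string name;
    double height;
};

// One row of a hypothetical "children" table: one child name per row.
struct Child {
    int parentId;        // foreign key: refers to Person::id
    std::string name;
};

// Rough equivalent of: SELECT name FROM children WHERE parentId = ?
std::vector<std::string> childrenOf(const std::vector<Child>& children,
                                    int parentId) {
    std::vector<std::string> names;
    for (const Child& c : children) {
        if (c.parentId == parentId) names.push_back(c.name);
    }
    return names;
}

// Sample data matching the question: Frank (id 1234) has two children.
std::vector<Child> sampleChildren() {
    return { {1234, "John"}, {1234, "Jane"} };
}
```

Calling childrenOf(sampleChildren(), 1234) collects both names, while a person with no rows in the children array simply gets an empty result; no cell ever holds a list.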
Note (this is also considered part of 1NF) that while in an array of structs we always know the order of the elements, in a database it is up to the engine in what order to store the records. A table is just a mathematical unordered set of records; the engine can re-sort it in internal storage for various optimizations as it likes. When you retrieve records with a SELECT statement and you care about the order, you need to provide an ORDER BY clause.
Loosely speaking, 2NF is about removing repetitions from your records (formally, it forbids a non-key column that depends on only part of a composite key). Imagine you had workplace-related fields as part of your Person struct, say the name of the company and the company address. If many Persons in your dataset work at the same company, you would repeat the company's address in each of their records. We probably wouldn't do these repetitions in C++ either; extracting them into a separate table removes the redundancy. Strictly speaking, even if there are no repetitions and all your Persons work in different places, the workplace data should still be extracted into a separate table, because one table should represent one entity.
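As a rough illustration of moving the workplace data into its own table, consider the sketch below. The Company struct, the companyId field and the workAddress helper are assumptions made for this example only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One row of a hypothetical "companies" table, keyed by company id.
struct Company {
    std::string name;
    std::string address;
};

// Person now stores only a reference to the company, not its details.
struct Person {
    int id;
    std::string name;
    int companyId;       // foreign key into the companies table
};

// Follow the foreign key to find the employer's address.
// std::map::at throws std::out_of_range if the company id is unknown.
std::string workAddress(const Person& p,
                        const std::map<int, Company>& companies) {
    return companies.at(p.companyId).address;
}
```

Because the address lives in exactly one Company record, changing it there is seen by every Person who refers to that company; nothing has to be updated twice.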
3NF is about removing transitive dependency and is considered kind of optional, so I will not describe it here. See link above.
Another feature of databases quite different from conventional programming of data structures in C++ is the database index. Simplifying, an index is just a copy of a column (or columns), i.e. a vertical slice, in a separate table where the values are stored in their natural sort order and each index record retains a reference to the whole record. So, in your example, to create an index by height you would create another array of 64 elements of the new
struct HeightIndexElem
{
    double height;
    Person* pFullRecord;
};
and sort them by height in this array. This allows the DB engine to optimize certain queries automatically. The database engine itself decides when to use a given index. In C++ we usually create maps (Dictionaries in C#) to speed up finding an element by a certain characteristic, but we must use these maps ourselves; there is no automatic aspect there.
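The hand-built index described above could be sketched as follows. This is only an illustration: buildHeightIndex and rowsAtLeast are hypothetical helper names, the index stores a row position instead of a raw pointer, and a real engine uses more elaborate structures (typically B-trees) that it maintains on every insert and update.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Person {
    int id;
    std::string name;
    double height;
};

// One index entry: the indexed value plus a reference back to the full row
// (here the row's position in the table).
struct HeightIndexElem {
    double height;
    std::size_t row;
};

// Build the index: copy the height column, then sort the copies by height.
std::vector<HeightIndexElem> buildHeightIndex(const std::vector<Person>& table) {
    std::vector<HeightIndexElem> index;
    for (std::size_t i = 0; i < table.size(); ++i) {
        index.push_back({table[i].height, i});
    }
    std::sort(index.begin(), index.end(),
              [](const HeightIndexElem& a, const HeightIndexElem& b) {
                  return a.height < b.height;
              });
    return index;
}

// Roughly "WHERE height >= minHeight": binary search on the sorted index
// finds the first match, so the table itself is never scanned.
std::vector<std::size_t> rowsAtLeast(const std::vector<HeightIndexElem>& index,
                                     double minHeight) {
    auto first = std::lower_bound(
        index.begin(), index.end(), minHeight,
        [](const HeightIndexElem& e, double h) { return e.height < h; });
    std::vector<std::size_t> rows;
    for (auto it = first; it != index.end(); ++it) rows.push_back(it->row);
    return rows;
}
```

The part a database adds on top of this is the bookkeeping: here the index goes stale as soon as the table changes and has to be rebuilt by hand.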
There are major differences:
SQL tables are persistent -- (English translation: written to disk)
They are transactional -- (really written to disk)
They can be an arbitrary size -- (Tables of several hundred million rows are quite common)
They support relational algebra -- (Joins with other tables, filtering etc.)
Relational Algebra is provable -- For a given SELECT statement there is only one possible correct answer.
The biggest difference is that when you "UPDATE" and "COMMIT" you know your data is saved in the database and will be there until you decide to "DELETE" it. When you update a structure within an array, it's gone when the machine is switched off.
The other big difference is scale. The size of a modern DBMS is only limited by your hard disk budget.
[I really like farfareast's answer from an academic stand point, but I feel the need to add a more practically oriented answer too.]
SQL tables themselves are "bare, generic data structures" as you call C++'s structures. They are only different data structures: a table is always an array of (fixed size) structs and the only pointers you can use are foreign keys.
For example, when you are adding a vector<string> to your struct, you are already using pointers internally, as strings are only a "fancy" way of writing char*. This would already require a second table in SQL (using a secondary index column to keep the elements in order). Of course there are things like PostgreSQL's arrays that can help in this specific case, but those are "only" shortcuts for similar hand-writeable constructs.
The real difference in data structure and algorithms comes from the fact that you can easily add declarations of index structures. Say you know you need to always access Persons in the order of their height. In C++ you'd use some kind of tree or sorted list to keep them in order. There is an STL container for that. The downside is that when you need to access them in a different order (say by name), you'll have to add a second tree and duplicate the data or start using pointers to Persons. If you add a Person, you need to update all containers and so on. This becomes cumbersome and soon you'll be on the front page of The Daily WTF. SQL tables, on the other hand, can have attached indices which automatically keep up with new and changed data. Of course, their maintenance must also be paid for in performance, but managing them basically means deciding which are required by your access patterns (something needed in every case) and defining them. Compared with having to rewrite large parts of an application, this is a much more favorable situation.
I was wondering if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit of creating a table with columns defined like so
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
------------------------
1         | Messy Joe's
2         | Bate's Motel

type_id | name
--------------------
1       | hotel
2       | restaurant

object_id | type_id
-------------------
1         | 2
2         | 1

field_id | name          | field_type
-------------------------------------
1        | address       | text
2        | opening_hours | date
3        | speciality    | text

type_id | field_id
------------------
1       | 1
1       | 2
2       | 1
2       | 3

object_id | field_id | value
--------------------------------------
1         | 1        | 1st street....
1         | 3        | English Cuisine
Of course it can be more abstract if value's are predefined (Example: specialties could have their own list)
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables is Unnormalised. Fair enough, it may be an attempt at Third Normal Form, but in fact it is a flat file, Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal Form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls, it is a performance hog in many ways, and prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide it, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions: rather than saying 'don't "denormalise" unless you have to', I am saying 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs, academics as well as vendors with their products with limitations, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is, 5NF is today, what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form, which in simple definitive terms is: every non-key column in every table has a 1::1 relationship to the Primary Key of the table, and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed; they are required only for physicalised Record Filing Systems, as promoted by academics, which are quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, it is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has value
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side [PK=parent FK + something]), is instantaneous
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
I would be interested to know what are
the up and downsides of both methods.
I can just imagine for myself, but I
don't have the experience to confirm
this.
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most, because the change control in place is quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted), allows columns to be added without DDL changes. That is the single reason people choose it. (code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside. The upside is that the structure is so generic that it can accommodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The downside is that at retrieval time, you have to do the data analysis that you skipped over before building the database, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accommodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
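To give a feel for the retrieval-time work, here is a minimal C++ sketch of reassembling one object from rows shaped like the question's (object_id, field_id, value) table. The EavRow struct and the loadObject helper are assumed names, and the in-memory loop only stands in for the joins a real query would need.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One row of the (object_id, field_id, value) table.
struct EavRow {
    int objectId;
    int fieldId;
    std::string value;   // every value is stored as text, whatever its real type
};

// Reassemble one object as field name -> value. The names come from a
// separate field_id -> name table, so every lookup involves both tables.
std::map<std::string, std::string> loadObject(
        const std::vector<EavRow>& rows,
        const std::map<int, std::string>& fieldNames,
        int objectId) {
    std::map<std::string, std::string> object;
    for (const EavRow& r : rows) {
        if (r.objectId != objectId) continue;
        auto name = fieldNames.find(r.fieldId);
        if (name != fieldNames.end()) object[name->second] = r.value;
    }
    return object;
}
```

Note what the result is: a bag of strings whose meaning, type and completeness the caller still has to work out. That is the deferred data analysis.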
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
When you start to require a large number of different entities (or even before...), a NoSQL solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
"id": 1,
"type":"Restaurant",
"name":"Messy Joe",
"address":"1 Main St.",
"tags":["asian","fusion","casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restaurant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype, with common attributes repeated in each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intuitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance, but RDBMSs are well optimised for doing joins, so it doesn't really cause performance problems in practice.
Option 1 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 2 appears at first sight to be the most performant, but with two caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficient than option 3, since the query engine can do joins faster than it can search bloated sparse table spaces.
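As a rough picture of option 3 (a table for the supertype and one for each subtype), here is a C++ sketch in which both tables share the same key. The Listing, RestaurantDetails and Database structs and the loadRestaurant helper are assumptions for illustration, not a prescribed schema.

```cpp
#include <cassert>
#include <map>
#include <string>

// Supertype table: attributes every listing has.
struct Listing {
    std::string name;
    std::string address;
};

// Subtype table: attributes only restaurants have.
struct RestaurantDetails {
    std::string speciality;
};

// Both tables are keyed by the same listing id.
struct Database {
    std::map<int, Listing> listings;
    std::map<int, RestaurantDetails> restaurants;
};

// The "single join": look the id up in both tables and combine the halves.
// Returns false if the id is unknown or is not a restaurant.
bool loadRestaurant(const Database& db, int id,
                    Listing& base, RestaurantDetails& details) {
    auto l = db.listings.find(id);
    auto r = db.restaurants.find(id);
    if (l == db.listings.end() || r == db.restaurants.end()) return false;
    base = l->second;
    details = r->second;
    return true;
}
```

Other tables can refer to any listing through the one shared id, and adding a new subtype means adding a new table rather than altering an existing one.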
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth
Is it when you're trying to get data and there is no apparent easy way of doing it?
When you find something should be a table on its own?
What are the laws?
Check out Wikipedia. The article talks about database normalization and the different forms (first, second, third, etc.). Most times you should be aiming for at least third normal form. There are times when you want to relax the rules a bit (it may be too expensive to join multiple tables together so might want to de-normalize a bit) but for the most part third normal form is good.
When you notice you have to repeat the same data, or when you start using single fields as arrays.
While this is a somewhat snarky answer, when you discover that the data isn't sufficiently normalized. There are many resources on the web about the levels (or, more properly, "forms") of normalization, and they more completely describe the forms than I could here. First and second normal forms should be pretty much required. If you aren't at third (or, really, fourth) normal form, you need to have a strong justification as to why.
Check out the Wikipedia article on database normalization.
When you're starting to question whether an SQL database needs more normalization.
Whenever you have a relational database.... <grin/>
No, actually there are laws, check out this Wikipedia link.
They are called the five normal forms or something like that. They originate with E. F. Codd, who invented the relational model around 1970.
"The key the whole key and nothing but the Key, so help me Codd"
This is a synopsis:
First normal form (1NF): Table faithfully represents a relation and has no repeating groups
Second normal form (2NF): No non-prime attribute in the table is functionally dependent on a part (proper subset) of a candidate key
Third normal form (3NF): Every non-prime attribute is non-transitively dependent on every key of the table
Boyce-Codd normal form (BCNF): Every non-trivial functional dependency in the table is a dependency on a superkey
Fourth normal form (4NF): Every non-trivial multivalued dependency in the table is a dependency on a superkey
Fifth normal form (5NF): Every non-trivial join dependency in the table is implied by the superkeys of the table
Domain/key normal form (DKNF) (Ronald Fagin, 1981): Every constraint on the table is a logical consequence of the table's domain constraints and key constraints
Sixth normal form (6NF): Table features no non-trivial join dependencies at all (with reference to generalized join operator)
Other people have pointed you to the formal rules for normalization. Here are some informal signs I use that a database needs more normalization:
If you have columns in a table the names of which differ only by a number (e.g. Phone1 and Phone2).
If you have any columns in a table that should be filled in only when another column in the table is filled in.
If updating a "fact" in the database (such as a street address) requires more than one UPDATE.
If the same question could ever get two different answers depending on which table you get your information from.
If the answer to any non-trivial question can be gotten from the database without JOINing at least two tables.
If you have any quantity-based restrictions in the database other than "only 1 of something is allowed" (that is, "only one address is allowed" is okay, but "only two addresses are allowed" indicates a normalization problem).
3NF is generally all you need and it follows three rules:
Every column in the table should be dependent on:
the key (1NF),
the whole key (2NF),
and nothing but the key (3NF) (so help me Codd is the way that quote usually ends).
You can often "downgrade" to 2NF for performance reasons, provided you understand the implications and only when you strike problems, but 3NF should be the initial goal for all your designs.
As everyone else has said, you know when you start having (too many) duplicate columns in multiple tables.
That being said, it is sometimes useful to have redundant columns across multiple tables. This can reduce the number of JOINs you have to do in complicated queries. Just be careful to keep all the tables in sync, or you're just asking for trouble.
This is a pretty good article. Getting normal is a science, not an art. Now knowing when to DEnormalize... that's an art.
http://www.alvechurchdata.co.uk/hints-and-tips/softnorm.html
See Description of the database normalization basics
What level of normalization are you currently at? If you can't answer that I assume your database is a nasty mess. I always hit 3rd normal on initial design and de-normalize or normalize further if and when needed.
I assume you're talking about a transactional database supporting an interactive application, but for what it's worth...
OLAP databases used exclusively for reporting and only updated by ETL processes may benefit from a less normalized structure. In these applications you accept the cost of redundant data storage and duplication for the performance benefit of fewer joins and the increased ease of use for (sometimes less technical) data analysts and business analysts.
Transactional databases should always be normalized to the extent practical (at least 3NF) and then selectively denormalized only as needed. And the need to denormalize should ideally be based on actual performance testing results.
When you have to search through huge amounts of data just to extract some basic info, e.g. what kinds of product categories there are or something like that.