I am looking at quite a monstrosity of an application that uses several tables to represent a reporting hierarchy, but each of these tables is identical. The table at the bottom of the hierarchy has the most records, each of which has a ParentID pointing to the rollup table above it, eventually all adding up to a single total in the top rollup table.
I am plagued to insanity by gargantuan 'if' blocks of code with hard-coded joins and table names, and I am trying hard to find some sane reason for not using a single table with a LevelID in each row instead of one table per level, or at least several views on the same table. The latter because the database was designed to be used in MS Access, which doesn't allow aliased sub-queries, AFAIK.
Are the tables all the same in their semantics or just in their form? If the tables are identical in both form and semantics, then a single table solution is probably far superior, but I don't know enough about your case to say for sure.
Representing a hierarchy in an SQL table can be quite a challenge. Fortunately, reporting hierarchies are generally small enough and stable enough so that a variety of techniques will work.
The simplest technique goes by the name "adjacency list" model. In this model, there are two columns, one of which refers to the other. I'll call them MyTable(ID, ParentID). In a real case, MyTable will have other columns. ParentID references ID in a different row of the same table. This is easy to implement and easy to update. It can be a pain to do rollups.
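A minimal sketch of that shape (the Name column is a stand-in for the real payload columns, which I'm guessing at):

CREATE TABLE MyTable (
    ID       INT PRIMARY KEY,
    ParentID INT NULL REFERENCES MyTable (ID),  -- NULL only at the apex row
    Name     VARCHAR(100)                       -- placeholder for the real columns
);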
Another technique goes by the name "nested sets". Call it MyTable (ID, lft, rgt, Level). (Level is redundant, and often omitted.) Here we have two columns (lft and rgt) that show where the row fits into the hierarchy, because the lft and rgt of a node are nested inside the lft and rgt of all of that node's ancestors. This technique is hard to update. It's easy to do rollups, and to find subtrees, ancestor paths, and lots of other types of queries.
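As a sketch of why rollups are easy here, a subtree total per node, assuming a hypothetical Amount column to aggregate:

-- each node's subtree total; BETWEEN is inclusive, so a node counts itself
SELECT parent.ID, SUM(child.Amount) AS SubtreeTotal
FROM MyTable AS parent
JOIN MyTable AS child
    ON child.lft BETWEEN parent.lft AND parent.rgt
GROUP BY parent.ID;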
A third way is to "flatten the hierarchy". In this technique, each level of the hierarchy has a named column of its own, and each row displays its entire ancestry all the way back to the apex of the hierarchy. Here we have MyTable (ID, Division, Department, Group, Team). Division, Department, Group and Team are all levels of the hierarchy. This is ultimately easy for users who access the data via a point-and-click drill-down interface, because there's nothing for them to learn, if the column names are chosen well. It requires a name for each level. It does not adapt well to indefinite levels of hierarchy. It's got a lot of redundancy. In general, flattened hierarchies are generated automatically from a table that stores the hierarchy in adjacency list form or nested set form.
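Rollups then reduce to plain GROUP BY queries; a sketch, again assuming a hypothetical Amount column:

SELECT Division, Department, SUM(Amount) AS DepartmentTotal
FROM MyTable
GROUP BY Division, Department;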
Any one of these is a good alternative to a separate table for every level in the hierarchy.
Excellent solutions have been suggested here already. I would add one more: use a hierarchical database such as hypergraph.
If a database has a parent table and two child tables, is it better to use joins to get the children, or to add a flag to distinguish them?
For example, the parent table is Person[Person_Name, Person_ID]. The first child table is Employee[Person_ID, Employee_ID, Department] and the other child is Customer[Person_ID, Location, Rank].
So, is it a good thing to add a flag [isEmployee] or [isCustomer] to the parent table (Person) and save the effort of joining the tables on Person_ID?
Another case would be with one child, for example, the parent table would be Member[Member_Name, Member_ID] and a child table GoldenMember[Member_ID, Phone_Number, EMail].
Now in this case, if I want to show the info of a specific Member, I need to do a join between the tables to see whether it's a Golden Member or not; but if the flag "isGolden" were in the table (Member), it would save us a join.
So, which is better, and why?
Thanks in advance :)
There is no "better" unless you provide criteria for measurement of "goodness".
SQL's support for entity subtyping is inadequate. You can hack your way around any of the shortcomings that there are, but each hack will do no more than introduce new problems of its own.
Additional "Type" columns on the top level introduce the problem of database updating becoming more complex. Defective update procedures will corrupt the database's integrity.
Leaving out the additional "Type" columns at the top level will make the problem of formulating read queries more complex (more joins, notably). Many people would add here "and degrade performance", but it's unlikely that you will suffer noticeably from this.
Choose which difficulty is the easiest to live with in your particular use case.
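To make the trade-off concrete, here is a sketch of the plain subtype shape (column types are illustrative):

CREATE TABLE Person (
    Person_ID   INT PRIMARY KEY,
    Person_Name VARCHAR(100) NOT NULL
    -- an isEmployee/isCustomer flag here would spare reads a join,
    -- but every insert/delete in a child table must keep it in sync
);

CREATE TABLE Employee (
    Person_ID   INT PRIMARY KEY REFERENCES Person (Person_ID),
    Employee_ID INT NOT NULL,
    Department  VARCHAR(100)
);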
I am trying to design a table which contains sections; each section contains tasks, each task contains sub-tasks, and so on. I would like to do it in one table. Please let me know the best single-table approach that is scalable. I am pretty new to database design. Also, if a single table is not the best approach, please suggest what would be. I am using DB2.
Put quite simply, I would say use 1 table for tasks.
In addition to all its various other attributes, each task should have a primary identifier, and another column to optionally contain the identifier of its parent task.
If you are using DB2 for z/OS, then you will use a recursive query with a common table expression. Otherwise, you can use a hierarchical recursive query in DB2 for i, or possibly in DB2 for LUW (Linux, Unix, Windows).
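A sketch of such a recursive common table expression, assuming the Tasks table has the primary identifier and parent column described above:

WITH TaskTree (ID, ParentID, Lvl) AS (
    SELECT ID, ParentID, 1
    FROM Tasks
    WHERE ParentID IS NULL              -- start at the root tasks
    UNION ALL
    SELECT c.ID, c.ParentID, t.Lvl + 1
    FROM Tasks c
    JOIN TaskTree t ON c.ParentID = t.ID
    WHERE t.Lvl < 100                   -- depth guard against cycles
)
SELECT ID, ParentID, Lvl
FROM TaskTree;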
Other designs requiring more tables, each specializing in a certain part of the task:subtask relationship, may needlessly introduce issues or limitations.
There are a few ways to do this.
One idea is to use two tables: Sections and Tasks
There could be a one-to-many relationship between the two. The Task table could be designed as a tree with a TaskId and a ParentTaskId, which means you can have Tasks that go n levels deep (sub-tasks of sub-tasks of sub-tasks, etc.). Every Task except for the root task will have a parent.
I guess you can also solve this by using a single table where you just add a section column to the Task table I described above.
Putting everything into one table, although convenient, will be inefficient in the long run. You will be storing unnecessarily repeated groups of data in your database, which is not processor- or memory-friendly at all. It would in fact violate the normalization rules, specifically the 1st Normal Form, which says that there should be no repeating groups in your table. It would actually also violate the 3rd Normal Form, which requires that there be no transitive dependency of a non-primary key on another non-primary key.
To give you an illustration, I will put your design into one table. I will be guessing at the possible fields, but bear with it for the sake of discussion. Look at the sketch below:
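-- a single combined table; field names are my guesses
CREATE TABLE SectionTask (
    SectionID     INT,
    SectionName   VARCHAR(100),  -- repeated on every task row of its section
    TaskID        INT,
    TaskName      VARCHAR(100),  -- repeated again for every sub-task
    TaskInitiator VARCHAR(100),
    TaskStartDate DATE,
    TaskEndDate   DATE
);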
If you look at the sketch above, the SectionName, TaskName, TaskInitiator, TaskStartDate and TaskEndDate are unnecessarily repeated, which, as I mentioned earlier, is a violation of the 1st Normal Form.
Secondly, TaskName, TaskInitiator, TaskStartDate and TaskEndDate are functionally dependent on TaskID, which is not a primary key, instead of on SectionID, which in this case should be the primary key (TaskID would only be a key on a separate table). This is a violation of the 3rd Normal Form, which says that there should be no transitive dependence: no non-primary key should be dependent on another non-primary key.
Although there are instances where you have to de-normalize, I believe this one should be normalized. In my own estimation there should be three tables involved in your design, namely Sections, Tasks and SubTasks, related as follows and sketched below.
Section is related to Tasks, that is, a section could have many Tasks.
And Task is related to Sub-Tasks, that is, a Task could have many Sub-tasks.
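A sketch of that normalized shape (column names beyond the keys are guesses):

CREATE TABLE Sections (
    SectionID   INT PRIMARY KEY,
    SectionName VARCHAR(100)
);

CREATE TABLE Tasks (
    TaskID    INT PRIMARY KEY,
    SectionID INT NOT NULL REFERENCES Sections (SectionID),
    TaskName  VARCHAR(100)
);

CREATE TABLE SubTasks (
    SubTaskID   INT PRIMARY KEY,
    TaskID      INT NOT NULL REFERENCES Tasks (TaskID),
    SubTaskName VARCHAR(100)
);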
If I understand correctly, the original poster does not know how many levels of hierarchy will be needed (hence "and so on"). His problem is to create a design that can hold a structure of any depth.
IMHO that is a complex issue that does not have a single answer. When implementing such a design you need to weigh factors such as:
Will the structure be fairly constant? (How many writes?)
How often will this structure be read?
What operations will need to be possible? (Get all child objects of a given object? Get the parent object? Get the direct children?)
If the structure will be fairly constant, you could use the nested set model (http://en.wikipedia.org/wiki/Nested_set_model).
In this model the table has an 'lft' and an 'rgt' column (short for left and right, which are reserved words in SQL). The parent object's lft and rgt values encompass the values of all of its child objects.
In that way you can list all the children of an object using a query like this:
SELECT child.id
FROM MyTable AS parent
JOIN MyTable AS child
    ON  child.lft BETWEEN parent.lft AND parent.rgt
    AND child.rgt BETWEEN parent.lft AND parent.rgt
WHERE parent.id = #searchId
This design can be VERY fast to read, but is also EXTREMELY costly when the structure changes (for example, when adding a child to any object, you will have to update every object with an 'rgt' value higher than the inserted one).
If you need to be able to make changes to the structure in real time, you should probably use a design with two tables: one holding the objects, the second holding the structure (something like parentId, childId, differenceInHierarchyLevels).
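A sketch of that two-table design, using the names from the description above:

CREATE TABLE objects (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE structure (
    parentId                    INT NOT NULL REFERENCES objects (id),
    childId                     INT NOT NULL REFERENCES objects (id),
    differenceInHierarchyLevels INT NOT NULL,  -- 1 = direct child
    PRIMARY KEY (parentId, childId)
);

-- all descendants of an object, at any depth, with no recursion:
SELECT childId FROM structure WHERE parentId = #searchId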
Let's say you're modeling an entity that has many attributes (2400+), far greater than the physical limit on a given database engine (e.g. ~1,000 for SQL Server). Knowing nothing about the relative importance of these data points (which ones are hot/used most often) besides the domain/candidate keys, how would you implement it?
A) EAV. (boo... Native relational tools thrown out the window.)
B) Go straight across. The first table has a primary key and 1000 columns, right up to the limit. The next table is 1000, foreign keyed to the first. The last table is the remaining 400, also foreign keyed.
C) Stripe evenly across ceil(n / limit) tables. Each table has a roughly equal number of columns, foreign-keyed to the first table: 800, 800, 800.
D) Something else...
And why?
Edit: This is more of a philosophical/generic question, not tied to any specific limits or engines.
Edit^2: As many have pointed out, the data was probably not normalized. Per usual, business constraints at the time made deep research an impossibility.
My solution: investigate further. Specifically, establish whether the table is truly normalised (at 2400 columns this seems highly unlikely).
If not, restructure until it is fully normalised (at which point there are likely to be fewer than 1000 columns per table).
If it is already fully normalised, establish (as far as possible) approximate frequencies of population for each attribute. Place the most commonly occurring attributes on the "home" table for the entity, use 2 or 3 additional tables for the less frequently populated attributes. (Try to make frequency of occurrence the criteria for determining which fields should go on which tables.)
Only consider EAV for extremely sparsely populated attributes (preferably, not at all).
Use Sparse Columns for up to 30000 columns. The great advantage over EAV or XML is that you can use Filtered Indexes in conjunction with sparse columns, for very efficient searches over common attributes.
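A sketch in SQL Server syntax (table and column names are illustrative):

CREATE TABLE BigEntity (
    ID       INT PRIMARY KEY,
    Attr0001 INT SPARSE NULL,          -- NULLs in sparse columns take almost no space
    Attr0002 VARCHAR(50) SPARSE NULL
    -- ... and so on, up to the 30,000-column limit
);

-- a filtered index covers only the rows where the attribute is populated
CREATE INDEX IX_BigEntity_Attr0001
    ON BigEntity (Attr0001)
    WHERE Attr0001 IS NOT NULL;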
Without having much knowledge in this area, I think an entity with so many attributes really, really needs a redesign. By that I mean splitting the big thing into smaller parts that are logically connected.
The key item to me is this piece:
Knowing nothing about the relative importance of these data points (which ones are hot/used most often)
If you have an idea of which fields are more important, I would put those more important fields in the "native" table and let an EAV structure handle the rest.
The thing is, without this information you're really shooting blind anyway. Whether you have 2400 fields or just 24, you ought to have some kind of idea about the meaning (and therefore the relative importance, or at least logical groupings) of your data points.
I'd use a one to many attribute table with a foreign key to the entity.
E.g.
entities: id
attrs: id, entity_id, attr_name, value
ADDED
Or as Butler Lampson would say, "all problems in Computer Science can be solved by another level of indirection"
I would rotate the columns and make them rows. Rather than having a column containing the name of the attribute as a string (nvarchar), you could have it as a foreign key back to a lookup table which contains a list of all the possible attributes (see the sketch after the list below).
Rotating it in this way means you:
don't have masses of tables to record the details of just one item
don't have massively wide tables
you can store only the info you need due to the rotation (if you don't want to store a particular attribute, then just don't have that row)
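A sketch of the rotated shape with the lookup table (names are illustrative):

CREATE TABLE attribute_names (
    attr_id   INT PRIMARY KEY,
    attr_name VARCHAR(100) NOT NULL   -- the old column names become rows here
);

CREATE TABLE item_attributes (
    item_id INT NOT NULL,
    attr_id INT NOT NULL REFERENCES attribute_names (attr_id),
    value   VARCHAR(255),             -- one row per attribute actually present
    PRIMARY KEY (item_id, attr_id)
);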
I'd look at the data model a lot more carefully. Is it 3rd normal form? Are there groups of attributes that should be logically grouped together into their own tables?
Assuming it is normalized and the entity truly has 2400+ attributes, I wouldn't be so quick to boo an EAV model. IMHO, it's the best, most flexible solution for the situation you've described. It gives you built-in support for sparse data and gives you good searching speed, as the values for any given attribute can be found in a single index.
I would like to use a vertical approach (increase the number of rows) instead of a horizontal one (increase the number of columns).
You can try an approach like:
Table: id, property_name, property_value
The advantage of this approach is that there is no need to alter the table or create a new one when you introduce a new property/column.
I need to store info about county, municipality and city in Norway in a MySQL database. They are related in a hierarchical manner (a city belongs to a municipality, which again belongs to a county).
Is it best to store this as three different tables and reference by foreign key, or should I store them in one table and relate them with a parent_id field?
What are the pros and cons of either solution (both structurally and efficiency-wise)?
If you've really got a limit of these three levels (county, municipality, city), I think you'll be happiest with three separate tables with foreign keys reaching up one level each. This will make queries almost trivial to write.
Using a single table with a parent_id field referencing the same table allows you to represent arbitrary tree structures, but makes querying to extract the full path from node to root an iterative process best handled in your application code.
The separate table solution will be much easier to use.
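A sketch of the three tables and the (almost trivial) full-path query; column names are illustrative:

CREATE TABLE county (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE municipality (
    id        INT PRIMARY KEY,
    county_id INT NOT NULL REFERENCES county (id),
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE city (
    id              INT PRIMARY KEY,
    municipality_id INT NOT NULL REFERENCES municipality (id),
    name            VARCHAR(100) NOT NULL
);

-- full path for one city: two fixed joins, no iteration
SELECT ci.name, m.name, co.name
FROM city ci
JOIN municipality m ON m.id = ci.municipality_id
JOIN county co ON co.id = m.county_id
WHERE ci.id = #cityId;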
three different tables:
more efficient, if your application mostly accesses information about only one entity (county, municipality, city)
owner-member-relationship is a clear and elegant model ;)
County, Municipality, and City don't sound like they are the same kind of data; so I would use three different tables: one per data type.
And then I would indeed use foreign keys between those.
Efficiency-wise, I'm not sure it'll change much:
you'll do joins on 3 tables instead of joining 3 times on the same table; I suppose it's about the same.
it might make a little difference when you need to work on only one of those three types of data; but with the right indexes, the differences should be minimal.
But, structurally speaking, if those are three different kinds of entities, it makes sense to use three different tables.
I would recommend for using three different tables as they are three different entities.
I would use only one table in cases where you don't know the depth of the hierarchy, but that is not the case here.
I would put them in three different tables, just on the grounds that they are 3 different concepts. This will hamper speed and complicate your queries. However, given that MySQL does not have any special support for hierarchical queries (like Oracle's CONNECT BY statement), those queries would be complicated anyway.
Different tables: it's just "right". I doubt you'll see any performance gains/losses either way but this is one where modelling it properly up-front will probably save you lots of headaches later on. For one thing it'll make SQL SELECTs easier to write and read.
You'll get different opinions coming back to you on this but my personal preference would be to have separate tables because they are separate entities.
In reality you need to think about the queries you will be doing on this data, and usually your answer will come from that. With separate tables your queries will look much cleaner, and in the end you're not saving yourself anything, because you'll still be joining tables together, even if they are the same table.
I would use three separate tables, since you know exactly what categories of information you are working with, and won't need to dynamically alter the 'depth' of your hierarchy.
It'll also make the data simpler to manage, as you'll be able to tell if the data is for a city, municipality or a county just by knowing the table (and without having to discern the 'depth' of a record in the hierarchy first!).
Since you'd probably be doing self-joins anyway to get the hierarchy to work in a single table, I doubt there would be any benefit from having all the data in one table.
In data warehousing applications, adherents of the Kimball methodology might place these fields in the same attribute table:
create table city (
id int not null,
county varchar(50) not null,
municipality varchar(50),
city varchar(50),
primary key(id)
);
The idea being that attributes should never be more than 1 join away from the fact table.
I just state this as an alternative view. I would go with the 3 table design personally.
This is a case of ‘Database Normalization’, which is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. The purpose is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
Multiple tables will help if the task has been distributed among different developers, if users at different levels require different rights to view and change the data, or if the smaller tables help when you need this data for other purposes as well.
My vote would be for multiple tables - with data appropriately distributed.
We're building a RDBMS-based web site for a federal semantic network (RDF, Protege, etc). This is basically a large collection of nodes, each having a large and indefinite set of named relationships to (and from) other nodes.
My first thought is a single table for all the nodes (name, description, etc), plus one table per named relationship. Any better ideas out there?
On further reflection, two tables total might do: one for nodes (id, name, description), and another for relations (id, name, description, from, to), where from and to are ids in the nodes table (ints). Still on the right track?
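A sketch of those two tables (types are illustrative; from/to get an _id suffix to dodge the reserved word FROM):

CREATE TABLE nodes (
    id          INT PRIMARY KEY,
    name        VARCHAR(100),
    description VARCHAR(1000)
);

CREATE TABLE relations (
    id          INT PRIMARY KEY,
    name        VARCHAR(100),                     -- the relationship's name
    description VARCHAR(1000),
    from_id     INT NOT NULL REFERENCES nodes (id),
    to_id       INT NOT NULL REFERENCES nodes (id)
);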
You could optimize the performance by creating 2 rows per relation.
Let's say you have a table Items and a table Relations and that Person A has a relation with Person B. The Relations table has a left and right column, both referring to Items. Now, if you only have one row for this relation, and you want all relations for a certain Item, you would have a query looking like this:
SELECT * FROM Relations WHERE LeftItemId = #ItemId OR RightItemId = #ItemId
The OR in this query will ruin your performance! If you would duplicate the row and switch the relation (left becomes right and vice versa) the query looks like this:
SELECT * FROM Relations WHERE LeftItemId = #ItemId
With the right index this one will go blazingly fast.
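A sketch of the mirrored insert and the supporting index (the relation's other columns are elided):

-- store each relation twice, once per direction
INSERT INTO Relations (LeftItemId, RightItemId) VALUES (#itemA, #itemB);
INSERT INTO Relations (LeftItemId, RightItemId) VALUES (#itemB, #itemA);

-- the index that makes the single-column lookup fast
CREATE INDEX IX_Relations_Left ON Relations (LeftItemId);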
No, that should be fine. Pay attention to primary keys and indexes, so that the performance is good.
If you didn't have a single table for the nodes, you'd have to define a lot of relation tables. Each new node type would require a new relation table with every old node type. That could get out of hand quickly.
So a single table sounds best. You can always use a 1:1 relation to extend it, if you need additional fields for certain node types.
If you're using SQL Server 2008, you might want to consider the new hierarchyid data type to store your hierarchy in. It's optimized for storage.
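A sketch (the table, column, and 'Sales' row are illustrative):

CREATE TABLE OrgNode (
    node hierarchyid PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- all descendants of a given node (the node itself included):
DECLARE @parent hierarchyid = (SELECT node FROM OrgNode WHERE name = 'Sales');
SELECT name
FROM OrgNode
WHERE node.IsDescendantOf(@parent) = 1;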