A query to summarize data in sub-tree? - sql

My data fits a tree form naturally. Therefore, I have a simple SQL table to store the data: {id, parentid, data1, ..., dataN}
I want to be able to "zoom in" on the data and produce a report which summarizes the data found below the current branch.
That is, when standing in the root, I want to have the totals of all the data. When I have traveled down a certain branch of the tree, I want to only have the summation of the data found only for that node and its child nodes.
How do I write such a query in SQL?
Thanks in advance!
/John

Since SQLite does not support CONNECT BY, you will not be able to perform this calculation in a single query (at least on versions older than 3.8.3, which lack recursive common table expressions) unless you use nested sets or materialized paths for your data.
Alternatively, do it "the hard way" and traverse your tree recursively, one query for each child node starting at the parent-of-interest.
Also see:
Managing Hierarchical Data in MySQL
Recursive Hierarchies: The Relational Taboo!
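For readers on a newer SQLite (3.8.3 or later), a recursive common table expression can do the traversal inside one statement. The following is only a minimal sketch: the table name `node`, the single `data1` column, the sample rows, and the helper `subtree_total` are all assumptions made for illustration, not part of the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id       INTEGER PRIMARY KEY,
        parentid INTEGER REFERENCES node(id),  -- NULL for the root
        data1    INTEGER NOT NULL
    );
    -- Hypothetical sample tree: 1 -> (2, 3), 2 -> 4, 4 -> 5
    INSERT INTO node (id, parentid, data1) VALUES
        (1, NULL, 10),
        (2, 1, 20),
        (3, 1, 30),
        (4, 2, 40),
        (5, 4, 50);
""")

# Collect the chosen node plus everything below it, then sum data1 over that set.
SUBTREE_TOTAL = """
    WITH RECURSIVE subtree(id) AS (
        SELECT id FROM node WHERE id = ?
        UNION ALL
        SELECT n.id FROM node AS n JOIN subtree AS s ON n.parentid = s.id
    )
    SELECT COALESCE(SUM(n.data1), 0)
    FROM node AS n JOIN subtree AS s ON s.id = n.id
"""

def subtree_total(conn, node_id):
    """Return the sum of data1 for node_id and all of its descendants."""
    return conn.execute(SUBTREE_TOTAL, (node_id,)).fetchone()[0]

print(subtree_total(conn, 1))  # total for the whole tree
print(subtree_total(conn, 2))  # total for the branch rooted at node 2
```

"Zooming in" then amounts to calling the same query with a different starting id. An index on `parentid` would probably matter on a large table.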

Vlad's reference on nested sets looks pretty good. If you want something that covers trees and hierarchies in more detail then you can also check out Joe Celko's book.
The "ID, ParentID" adjacency list model is really an "old time" way of looking at hierarchies in a relational database model.

Related

What is the performance of a Postgres recursive query with a large depth on millions of rows? Should I use a graph database instead?

I am trying to decide whether I should use a graph database or a relational one. I am fairly new to SQL so I am not sure how to test this as I don't have the millions of rows of data yet but I figured this could become a problem down the line so I am looking for some anecdotal experiences from people if possible.
Does anyone know the performance of a recursive query for a Postgres database with millions of rows in each table? Is the performance usually good enough to be able to query data in the database through a web API in under 100-200 ms with a potentially large depth of, let's say, 30? The depth is unknown; it could be 1 or go all the way to something large like 30 or even more.
My data currently maps very well to the relational data model as it is relational in nature, which is why I am leaning towards Postgres, but I would like to execute a specific recursive query that is making me lean towards a graph database. The tables (let's say 5-6) will most likely be millions of rows each in production and the depth of the query is essentially unknown. An example of such a query would be:
Depth 1:
EntityA transacts with -> EntityB
Depth 2:
EntityB transacts with -> EntityC
EntityB transacts with -> EntityD
...
Depth n:
until the last entities (leaves) don't transact with any other entities. At each depth, the entities can transact with many entities and the depth is unknown.
Should I use a graph database instead just because of this 1 specific query or can Postgres handle this in a reasonable time using its RECURSIVE command?
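For reference, the shape of such a recursive query can be sketched as below. This is an illustration only and says nothing about performance at millions of rows: it runs against an in-memory SQLite database rather than Postgres (both accept `WITH RECURSIVE`, though details differ), and the `transacts` table, its `src`/`dst` columns, the sample edges, and the helper `reachable` are all hypothetical. The depth cap is there because transaction data may contain cycles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transacts (
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        PRIMARY KEY (src, dst)
    );
    -- Hypothetical edges; D -> A closes a cycle on purpose.
    INSERT INTO transacts (src, dst) VALUES
        ('A', 'B'), ('B', 'C'), ('B', 'D'), ('D', 'A');
""")

# Walk outgoing edges level by level, stopping at :max_depth so that
# cycles cannot make the recursion run forever.
REACHABLE = """
    WITH RECURSIVE reach(entity, depth) AS (
        SELECT dst, 1 FROM transacts WHERE src = :start
        UNION
        SELECT t.dst, r.depth + 1
        FROM transacts AS t JOIN reach AS r ON t.src = r.entity
        WHERE r.depth < :max_depth
    )
    SELECT entity, MIN(depth) AS first_seen
    FROM reach
    GROUP BY entity
    ORDER BY first_seen, entity
"""

def reachable(conn, start, max_depth=30):
    """Return (entity, shallowest depth) pairs reachable from start."""
    params = {"start": start, "max_depth": max_depth}
    return conn.execute(REACHABLE, params).fetchall()

print(reachable(conn, "A"))
print(reachable(conn, "A", max_depth=2))
```

Measuring a query like this against realistic generated data (with an index on the source column) is probably the only reliable way to answer the latency question.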
Whenever you're dealing with recursive JOINs in a SQL system that's a dead giveaway that you're actually dealing with graphs and trying to figure out how to navigate a graph with SQL.
Using a graph database is a better option in this case.
For a much more complete answer, see this other answer on a related question: Performance of arbitrary queries with Neo4j

Risks and benefits of a modified closure table for hierarchical data

I am attempting to store hierarchical data in SQL and have resolved to use an object table, where all of the main data will be, and a closure table, defining the relationships between the objects (read more on closure tables here [slides 40 to 68]).
After quite a bit of research, a closure table seemed to suit my needs well. One thing that I kept reading, however, is that if you want to query the direct ancestor / descendant of a particular node, then you can use a depth column in your closure table (see slide 68 from the above link). I have a need for this depth column to facilitate this exact type of query. This is all well and good, but one of the main attractions of the closure table in the first place was the ease with which one could both query and modify the data contained therein. And adding a depth column seems to completely destroy the ease with which one can modify data (imagine adding a new node and offsetting an entire branch of the tree).
So - I'm considering modifying my closure table to define relations only between a node and its immediate ancestor / descendant. This allows me to still easily traverse the tree. Querying data seems relatively easy. Modifying data is not as easy as the original closure table without the depth field, but significantly easier than the one with the depth field. It seems like a fair compromise (almost between a closure table and an adjacency list).
Am I overlooking something though? Am I losing one of the key advantages of the closure table by doing it this way? Does anyone see any inherent risks in doing it this way that may come to haunt me later?
I believe the key advantage you are losing is that if you want to know all of the descendants or ancestors of a node, you now have to do a lot more traversals.
For example, if you start with the following simple tree:
A->B->C->D
To get all descendants of A you have to go A->B then B->C then C->D. So, three queries, as opposed to a single query if following the normal pattern.
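To make the contrast concrete, here is a small sketch of the normal (full) closure table with a depth column, built for the A->B->C->D example. The `closure` table layout and the helpers `add_node`, `descendants` and `children` are assumptions for illustration; only appending a leaf is shown, and moving a whole branch would need more work than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE closure (
        ancestor   TEXT NOT NULL,
        descendant TEXT NOT NULL,
        depth      INTEGER NOT NULL,  -- 0 = the node itself, 1 = direct child
        PRIMARY KEY (ancestor, descendant)
    )
""")

def add_node(conn, node, parent=None):
    """Insert a leaf: one self row plus one row per ancestor of the parent."""
    conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node, node))
    if parent is not None:
        # Every ancestor of the parent (including the parent itself)
        # becomes an ancestor of the new node, one level further away.
        conn.execute(
            """
            INSERT INTO closure (ancestor, descendant, depth)
            SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?
            """,
            (node, parent),
        )

for node, parent in [("A", None), ("B", "A"), ("C", "B"), ("D", "C")]:
    add_node(conn, node, parent)

def descendants(conn, node):
    """All descendants in a single query, nearest first."""
    rows = conn.execute(
        "SELECT descendant FROM closure WHERE ancestor = ? AND depth > 0 ORDER BY depth",
        (node,),
    )
    return [r[0] for r in rows]

def children(conn, node):
    """Direct children only, using the depth column."""
    rows = conn.execute(
        "SELECT descendant FROM closure WHERE ancestor = ? AND depth = 1",
        (node,),
    )
    return [r[0] for r in rows]

print(descendants(conn, "A"))
print(children(conn, "A"))
```

With only immediate parent/child rows stored, `descendants` would instead need one lookup per level, or a recursive query.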

Is there an efficient hierarchy method for storing a large tree in a SQL table?

I have a table with well over 5 million rows that contains hierarchical data (~20 levels). The table is growing exponentially every year and the recursive method for CRUD operations on the table is becoming slow. The table receives a high volume of updates, reads and deletes. Does anyone know of any data models that would be suitable to replace the current Adjacency List Model, or what steps, if any, could speed up the table?
Have you looked at the hierarchyid data type, which is available in SQL Server 2008 onwards?
http://technet.microsoft.com/en-us/library/bb677290.aspx
There's a good section on its use in this free e-book from MS Press
http://blogs.msdn.com/b/microsoft_press/archive/2009/11/16/free-e-book-introducing-microsoft-sql-server-2008.aspx
Five million rows is nothing.
There is a difference between a well-designed Adjacency List model and a badly-designed one. If you post your DDL maybe we could improve it, rather than you throwing out the whole concept because the implementation is poor.
In any case, I would not implement a tree structure or a hierarchy in a Relational database using such a model. I have used the following (ignore the History) hundreds of times, and it is very fast. If you provide the DDL for the table and all indices, I can provide a model specifically for it.
Data Model
▶Tree Structure Data Model◀
Readers who are unfamiliar with the Relational Modelling Standard may find ▶IDEF1X Notation◀ useful.
Maybe a hierarchical or graph database would be a better choice. SQL isn't always the answer - that's why NoSQL is a viable niche.

When to use recursive table

I have a need to build a schema structure to support a table of contents (so the level of sections / sub-sections could change for each book or document I add)...one of my first thoughts was that I could use a recursive table to handle that. I want to make sure that my structure is normalized, so I was trying to stay away from denormalising the table of contents data into a single table (and then having to add columns when there are more sub-sections).
It doesn't seem right to build a recursive table and could be kind of ugly to populate.
Just wanted to get some thoughts on some alternate solutions or if a recursive table is ok.
Thanks,
S
It helps that SQL Server 2008 has both the recursive WITH clause and hierarchyid to make working with hierarchical data easier - I was pointing out to someone yesterday that MySQL doesn't have either, making things difficult...
The most important thing is to review your data - if you can normalize it to be within a single table, great. But don't shoehorn it in to fit a single table setup - if it needs more tables, then design it that way. The data & usage will show you the correct way to model things.
When in doubt, keep it simple. Where you've a collection of similar items, e.g. employees then a table that references itself makes sense. Whilst here you can argue (quite rightly) that each item within the table is a 'section' of some form or another, unless you're comfortable with modelling the data as sections and handling the different types of sections through relationships to these entities, I would avoid the complexity of a self-referencing table and stick with a normalized approach.

SQL Query for an Organization Chart?

I feel that this is likely a common problem, but from my google searching I can't find a solution quite as specific to my problem.
I have a list of Organizations (table) in my database and I need to be able to run queries based on their hierarchy. For example, if you query the highest Organization, I would want to return the Id's of all the Organizations listed under that Organization. Further, if I query an organization sort of mid-range, I want only the Organization Id's listed under that Organization.
What is the best way to a) set up the database schema and b) query? I want to only have to send the topmost Organization Id and then get the Id's under that Organization.
I think that makes sense, but I can clarify if necessary.
As promised in my comment, I dug up an article on how to store hierarchies in a database that allows constant-time retrieval of arbitrary subtrees. I think it will suit your needs much better than the answer currently marked as accepted, both in ease of use and speed of access. I could swear I saw this same concept on Wikipedia originally, but I can't find it now. It's apparently called a "modified preorder tree traversal". The gist of it is that you number each node in the tree twice while doing a depth-first traversal: once on the way down, and once on the way back up (i.e. when you're unrolling the stack, in a recursive implementation). This means that all the descendants of a given node have their numbers in between the two numbers of that node. Throw an index on those columns and you've got really fast lookups. I'm sure that's a terrible explanation, so read the article, which goes into more depth and includes pictures.
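The numbering described above can be sketched in a few lines. Everything named here is an assumption for illustration (the `org` table with `lft`/`rgt` columns, the sample hierarchy, and the helpers `number_tree` and `names_under`); the linked article may structure things differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE org (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        lft  INTEGER NOT NULL,  -- number assigned on the way down
        rgt  INTEGER NOT NULL   -- number assigned on the way back up
    )
""")
conn.execute("CREATE INDEX org_lft_rgt ON org (lft, rgt)")

# Hypothetical hierarchy, given as parent -> list of children.
CHILDREN = {
    "Company": ["Sales", "Engineering"],
    "Sales": ["Sales Europe", "Sales Asia"],
    "Sales Europe": ["Sales Europe North"],
}

def number_tree(name, counter=1):
    """Depth-first walk that stores lft/rgt; returns the next free number."""
    lft = counter
    counter += 1
    for child in CHILDREN.get(name, []):
        counter = number_tree(child, counter)
    rgt = counter
    conn.execute(
        "INSERT INTO org (name, lft, rgt) VALUES (?, ?, ?)", (name, lft, rgt)
    )
    return counter + 1

number_tree("Company")

def names_under(name):
    """Everything below `name`: rows whose numbers fall strictly inside its pair."""
    rows = conn.execute(
        """
        SELECT child.name
        FROM org AS parent
        JOIN org AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
        """,
        (name,),
    )
    return [r[0] for r in rows]

print(names_under("Company"))
print(names_under("Sales"))
```

The subtree lookup is a single range query with no recursion. The trade-off is that inserting or moving a node means renumbering part of the tree, which this sketch does not attempt.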
One simple way is to store the organization's parentage in a text field, like:
SALES-EUROPE-NORTH
To search for every sales organization, you can query on SALES-%. For each European sales org, query on SALES-EUROPE-%.
If you rename an organization, take care to update its child organizations as well.
This keeps it simple, without recursion, at the cost of some flexibility.
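A small sketch of this approach, including the rename caveat. The `org` table, its `path` column, the sample rows, and the helpers `orgs_under` and `rename` are assumptions for illustration. It also assumes organization names never contain the LIKE wildcards `%` or `_`, and that `-` is used only as the separator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE org (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL UNIQUE
    );
    INSERT INTO org (path) VALUES
        ('SALES'), ('SALES-EUROPE'), ('SALES-EUROPE-NORTH'),
        ('SALES-ASIA'), ('SALESFORCE'), ('HR');
""")

def orgs_under(conn, path):
    """Paths strictly below `path`; the '-' keeps SALESFORCE out of SALES."""
    rows = conn.execute(
        "SELECT path FROM org WHERE path LIKE ? || '-%' ORDER BY path",
        (path,),
    )
    return [r[0] for r in rows]

def rename(conn, old, new):
    """Rename an organization and rewrite the prefix of everything below it."""
    conn.execute(
        """
        UPDATE org
        SET path = ? || substr(path, ?)
        WHERE path = ? OR path LIKE ? || '-%'
        """,
        (new, len(old) + 1, old, old),
    )

print(orgs_under(conn, "SALES"))
rename(conn, "SALES-EUROPE", "SALES-EMEA")
print(orgs_under(conn, "SALES"))
```

Storing the separator as part of the search pattern is a design choice here: matching on `SALES%` alone would also pick up unrelated names that merely start with the same letters.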
The easy way is to have a ParentID column, which is a foreign key to the ID column in the same table, NULL for root nodes. But this method has some drawbacks.
Nested sets are an efficient way to store trees in a relational database.
You could have an Organization have an id PK and a parent FK reference to the id. Then for the query, use (if your database backend supports them) recursive queries, aka Common Table Expressions.