SQL Server Nested set vs Hierarchyid performance - sql

I have a hierarchical data. The most common queries will be "get parent branch for node" and "get subtree of node". Updates and inserts are not likely to occur often. I am choosing between nested sets and hierarchyid. As far as I am concerned, search on nested set should be pretty fast on indexed columns, however, I have no clue about internal implementation of hierarchyid. What should I use in order to achieve highest performance possible?

Having used HierarchyID and self-referencing tables in different projects, I'd say HierarchyId wins hands down in terms of ease of querying.
See Querying a Hierarchical Table Using Hierarchy Methods to see how easy it can be with the built-in query methods for HierarchyID.

Related

What is the performance of a Postgres recursive query with a large depth on millions of rows? Should I use a graph database instead?

I am trying to decide whether I should use a graph database or a relational one. I am fairly new to SQL so I am not sure how to test this as I don't have the millions of rows of data yet but I figured this could become a problem down the line so I am looking for some anecdotal experiences from people if possible.
Does anyone know the performance of a recursive query for a Postgres database with millions of rows in each table? Is the performance usually good enough to be able to query data in the database through a web api in sub 100-200ms with potentially a large depth of, lets say 30? The depth is unknown, it could be 1 or all the way to something large like 30 or even more.
My data currently maps very well to the relational data model as it is relational in nature, hence why I am leaning towards Postgres but I would like to execute a specific recursive query that is making me lean towards a graph database. The tables (lets say 5-6) will most likely be millions of rows each in production and the depth of the query is essentially unknown. An example of such a query would be:
Depth 1:
EntityA transacts with -> EntityB
Depth 2:
EntityB transacts with -> EntityC
EntityB transacts with -> EntityD
...
Depth n:
until the last entities (leaves) don't transact with any other entities. At each depth, the entities can transact with many entities and the depth is unknown.
Should I use a graph database instead just because of this 1 specific query or can Postgres handle this in a reasonable time using its RECURSIVE command?
Whenever you're dealing with recursive JOINs in a SQL system that's a dead giveaway that you're actually dealing with graphs and trying to figure out how to navigate a graph with SQL.
Using a graph database is a better option in this case.
For a much more complete answer -- see this other answer on a related question. Performance of arbitrary queries with Neo4j

"Canonical" approach for mapping custom queries to hierarchical entities with user-defined key/value pairs

In about every SQL-based database application I have worked on so far, sooner or later the following three-faceted requirement has popped up:
There is some entity, linked in a hierarchical fashion (i.e. the tuples form a tree structure).
Users must be able to define any number of custom attributes with values for the tuples, and these values are inherited/overridden towards the leaves of the tree structure. ("Dumb" attributes usually suffice. That is, no uniqueness constraints, no foreign keys, only one value per attribute, ...)
Users must be able to run arbitrary queries on this data (i.e. custom boolean expressions, based upon filters for the values of the user-defined attributes that are linked with AND/OR).
Storing the data, roughly matching the first two bullets above, is quite straightforward:
The hierarchy is built up by giving the respective table a parent column. This column will be null for root nodes, and a pointer to the ID of the parent node for all other nodes.
The user-defined attributes are stored according to the entity-attribute-value pattern.
While there are numerous resources that suggest to use a different approach especially in the latter point (e.g. answers here, here, or here), I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply assume the above as a given. Also, hardly ever could I rely on the specifics of a particular DBMS; the more usual case was systems that were supposed to work with MS SQL Server, Oracle, and possibly others as backends without requiring two significantly different product versions.
Solving the third item, however, is always problematic (even without considering the hierarchical inheritance of attribute values). The number of joins depends on the different number of attributes considered in the boolean expression. Alternatively, the number of joins can somewhat be reduced by determining the maximum number of distinct attributes considered in any case of the custom boolean expression, which may save joins, but makes the resulting queries and the code used to generate them even less intelligible and maintainable. For instance,
a = 5 or (b = 8 and c = 9)
could do with 2 joins to the attribute-value table.
I have always been able to do this "somehow", but as this appears to be a fairly ubiquitous situation, I am looking for the "canonical" way to generate SQL queries in this situation. Is there a "standard pattern" to follow here?
Careful not to fall prey to the inner platform effect. It is a complicated problem, and SQL itself is designed to handle the complexities. Generate DDL to add and remove columns as needed, and generate simple select statements for queries. Store each Tuple Type (distinct set of attributes) as a table.
With regards to inheritance, I recommend handling it in the application or DAL, and only storing the non-inherited values. On retrieval, read all parent rows to calculate the functional values. If you do need to access "functional" values from SQL, use an indexed view or triggers to maintain them separate from storage.
Hierarchies can be represented as you describe, but a simple "Parent" column can make it difficult to query beyond a single level. Look at hierarchyid on SQL Server or CONNECT BY on oracle.
Avoiding EAV stores allows you to:
Use indexes and statistics where needed
Keep efficient storage (ints stored as ints, money stored as money)
Keep understandable queries (SELECT * FROM vwProducts WHERE Color = 'RED' ORDER BY Price ASC)
If you want an EAV system because you have too many attributes (>1024 per type) or they are not somewhat statically defined (many changes per hour), I would avoid using a relational database in the first place. Use an EAV (NoSQL) database server instead.
tl;dr: If you have a schema, use DDL to tell the server about it. If you don't, use a more appropriate server.

Risks and benefits of a modified closure table for hierarchical data

I am attempting to store hierarchical data in SQL and have resolved to use
an object table, where all of the main data will be
and a closure table, defining the relationships between the objects (read more on closure tables here [slides 40 to 68]).
After quite a bit of research, a closure table seemed to suit my needs well. One thing that I kept reading, however, is that if you want to query the direct ancestor / descendant of a particular node - then you can use a depth column in your closure table (see slide 68 from the above link). I have a need for this depth column to facilitate this exact type of query. This is all well and good, but one of the main attractions to the closure table in the first place was the ease by which one could both query and modify data contained there in. And adding a depth column seems to complete destroy the ease by which one can modify data (imagine adding a new node and offsetting an entire branch of the tree).
So - I'm considering modifying my closure table to define relations only between a node and its immediate ancestor / descendant. This allows me to still easily traverse the tree. Querying data seems relatively easy. Modifying data is not as easy as the original closure table without the depth field, but significantly easier than the one with the depth field. It seems like a fair compromise (almost between a closure table and an adjacency list).
Am I overlooking something though? Am I loosing one of the key advantages of the closure table by doing it this way? Does anyone see any inherent risks in doing it this way that may come to haunt me later?
I believe the key advantage you are losing is that if you want to know all of the descendants or ancestors of a node, you now have to do a lot more traversals.
For example, if you start with the following simple tree:
A->B->C->D
To get all descendants of A you have to go A->B then B->C then C->D. So, three queries, as opposed to a single query if following the normal pattern.

Is there an effcient hierarchy method for storing a large tree in a SQL table?

I have a table with well over 5 millions rows, that contains hierarchical data (~20 levels). The table is growing exponetially every year and the recursive method for CRUD operations from the table is becoming slow. The table recieves a high volume of updates, reads and deletes. Does any one know of any data models that would be suitable to replace the current Adjacency List Model, or what steps if any to speed up the table?
Have you looked at the HierachyID data type which is available in SQL Server 2008 onwards.
http://technet.microsoft.com/en-us/library/bb677290.aspx
There's a good section on it's use in this free e-book from MS Press
http://blogs.msdn.com/b/microsoft_press/archive/2009/11/16/free-e-book-introducing-microsoft-sql-server-2008.aspx
Five million rows is nothing.
There is a difference between a well-designed Adjacency List model and a badly-designed one. If you post your DDL maybe we could improve it, rather than you throwing out the whole concept, because th eimplementation is poor.
In any case, I would not implement a tree structure or an hierarchy in a Relational database using such a model. I have use the following (ignore the History), hundreds of times, and it is very fast. If you provide the DDL for the table and all indices, I can provide a model specifically for it.
Data Model
▶Tree Structure Data Model◀
Readers who are unfamiliar with the Relational Modelling Standard may find ▶IDEF1X Notation◀ useful.
Maybe a hierarchical or graphical database would be better choices. SQL isn't always the answer - that's why NoSQL is a viable niche.

A query to summarize data in sub-tree?

My data fits a tree form naturally. Therefore, I have a simple SQL table to store the data: {id, parentid, data1, ..., dataN}
I want to be able to "zoom in" on the data and produce a report which summarizes the data found below the current branch.
That is, when standing in the root, I want to have the totals of all the data. When I have traveled down a certain branch of the tree, I want to only have the summation of the data found only for that node and its child nodes.
How do I write such a query in SQL?
Thanks in advance!
/John
Since sqlite does not support CONNECT BY, you will not be able to perform this calculation in a single query unless you use nested sets or materialized paths for your data.
Alternatively, do it "the hard way" and traverse your tree recursively, one query for each child node starting at the parent-of-interest.
Also see:
Managing Hierarchical Data in MySQL
Recursive Hierarchies: The Relational Taboo!
Vlad's reference on nested sets looks pretty good. If you want something that covers trees and hierarchies in more detail then you can also check out Joe Celko's book.
The "ID, ParentID" adjacency list model is really an "old time" way of looking at hierarchies in a relational database model.