Is hierarchyid suitable for large trees with frequent insertions of leaf nodes? - sql

We have a database that models a tree. This data can grow fairly huge, that is to say many, may million of rows. (The primary key is actually a bigint, so I guess potentially we would like to support billions of rows, although this is probably never going to occur).
A single node can have a very large amount of direct children, more likely the higher up in the hierarchy they are. We have no specified limit to the actual maximum depth of a leaf, i.e. how many nodes one would have to traverse to get to the root, but in practice this would probably normally not grow beyond a few hundred at the very most. Normally it would probably be below 20.
Insertions in this table is very frequent and needs to be high performing. Insertions nodes inserted are always leaf nodes, and always after the last sibling. Nodes are never moved. Deletions are always made as entire subtrees. Finding subtrees is the other operation made on this table. It does not have the same performance requirements, but of course we would like it as fast as possible.
Today this is modeled with the parent/child model, which is efficient for insertions, but is painfully slow for finding subtrees. When the table grows large, this becomes extremely slow and finding a subtree may take several minutes.
So I was thinking about converting this to perhaps using the new hierarchyid type in SQL Server. But I am having troubles figuring out whether this would be suitable. As I undestand it, for the operations we perform in this scenario, such a tree would be a good idea. (Please correct me if I'm wrong here).
But it also states that the maximum size for a hierarchyid is 892 bytes. However, I can not find any information about what this means in practice. How is the hierarchyid encoded? Will I run out of hierarchyids, and if so, when?

So I did some tests and came to somewhat of a conclusion regarding the limitations of hierarchyid:
If I run for example the following code:
DECLARE #i BIGINT = 1
DECLARE #h hierarchyId = '/'
WHILE 1=1
BEGIN
SET #h = #h.ToString() + '1/'
PRINT CONVERT(nvarchar(max), #i)
SET #i = #i+1
END
I will get to 1427 levels deep before I get an error. Since I am using the value 1 for each level, this ought to be the most compact tree from which I draw the conclusion that I will not ever be able to create a tree with more than 1427 levels.
However, if I use for example 99999999999999 for each level (eg. /99999999999999/99999999999999/99999999999999/..., the error occurs already at 118 levels deep. It also seems that 14 digits are the maximum for an id at each level, since it fails immediately if I use a 15 digit number.
So with this in mind, if I only use whole integer identifiers (i.e. don't insert nodes between other nodes etc.) I should be able to guarantee up to at least 100 levels deep in my scenario, and at no time will I be able to exceed much more than 1400 levels.

892 bytes does not sound like much, but the hierarchy id seems to be very efficient, space-wise. From http://technet.microsoft.com/en-us/library/bb677290.aspx:
The average number of bits that are required to represent a node in a tree with n nodes depends on the average fanout (the average number of children of a node). For small fanouts (0-7), the size is about 6*logAn bits, where A is the average fanout. A node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits. This is rounded up to 40 bits, or 5 bytes, for storage.
The calculation given says it's only for small fanouts (0-7) which makes it hard to reason about for bigger fanouts. You say 'up to a few hundred children at the most'. This (extreme) case does sound dangerous. I don't know about the spec of hierarchy_id, but the more nodes are at any one level, the less depth you should be able to have in the tree within those 892 bytes.
I do see a risk here, as do you (hence the question). Do some tests. Evaluate the goals. What are you moving from? Why are you moving? Simplicity or performance?
This problem is a bad fit for Sql. Maybe you should consider other options for this part of the program?

Related

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19 digit integer could theoretically range anywhere from 0000000000000000001 - 9999999999999999999. Now I know that the datasource in question does not actually have 99 quintillion records, so they are clearly sparsely populated and not sequential. Of the data I currently have on file the maximum ID is 4123120171499720404 and a minimum value of 1010320180036112531. (I can tell the ids a date based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't sus out the rest of the pattern.)
If I wanted to store which line items I've already downloaded would I need 2328306436 different redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54). I could probably work up a tiny algorithm determine given an 19 digit idea to divide by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you BF.CREATE your filter, you pass to BF.ADD an 'item' to be inserted. This item can be as long as you want. The filter uses hash functions and modulus to fit it to the filter size. When you want to check if the item was already checked, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires 1250 petabytes at least. This makes it impractical (atm) to store it in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids ids are not sequential and very spread, keep tracking of which one you processed using a bitmap is not the best option since it would waste lot of memory.
However, it is hard to point the best solution without knowing the how many distinct sub_ids your data set has. If you are talking about a few 10s of millions, a simple set in Redis may be enough.

Plone - ZODB catalog query sort_on multiple indexes?

I have a ZODB catalog query with a start and end date. I want to sort the result on end_date first and then start_date second.
Sorting on either end_date or start_date works fine.
I tried with a tuple (start_date,end_date), but with no luck.
Is there a way to achieve this or do one have to employ some custom logic afterwards?
The generalized answer ought to be post-hoc-sort of your entire result set of catalog brains, use zope.sequencesort (via PyPI, but already shipped with Plone) or similar.
The more complex answer is a rabbit-hole of optimizations that you should only go down if you know you need to and know what you are doing:
Make sure when you do sort the brains that your user gets a sticky session to the same instance, at least for cache-affinity to get the same catalog indexes and brains (metadata);
You might want to cache across requests (thread-global) a unique session id, and a sequence of catalog RID (integer) values for your entire sorted request, should you expect the user to come back and need in subsequent batches. Of course, RIDs need to be re-constituted into ZCatalog's lazy-sequences of brains, and this requires some know-how (or reading the source).
Finally, for large result (many thousands) sets, I would suggest that it is reasonable to make application-specific compromises that approximate correct by post-hoc sorting of the current batch through to the end of the n-batches after it, where n is inversely proportional to the len(site.portal_catalog.uniqueValuesFor(indexnamehere)). For a large set of results, the correctness of an approximated secondary-sort is high for high-variability, and low for low variability (many items with same secondary value, such that count is much larger than batch size can make this frustrating).
Do not optimize as such unless you are dealing with particularly large result sets.
It should go without saying: if you do optimize, you need to verify that you are actually getting a superior result (profile and benchmark). If you cannot justify investing the time to do this, you cannot justify optimizing.

Space Complexity of Ids Algorithm

Hello I try to calculate the space complexity of IDS (Iterative deepening depth-first search) algorithm but I cannot figure out how to do that. I can't really understand how the algorithm visits the nodes of a tree, so that I could calculate how space it needs. Can you help me?
As far as i have understood, IDS works like this: starting with a limit of 0, meaning the depth of the root node in a graph(or your starting point), it performs a Depth First Search until it exhausts the nodes it finds within the sub-graph defined by the limit. It then proceeds by increasing the limit by one, and performing a Depth First Search from the same starting point, but on the now bigger sub-graph defined by the larger limit. This way, IDS manages to combine benefits of DFS with those of BFS(breadth first search). I hope this clears up a few things.
from Wikipedia:
The space complexity of IDDFS is O(bd), where b is the branching factor and d is the depth of shallowest goal. Since iterative deepening visits states multiple times, it may seem wasteful, but it turns out to be not so costly, since in a tree most of the nodes are in the bottom level, so it does not matter much if the upper levels are visited multiple times.[2]

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (lets call it raw_table), runs a bunch statistical measurements on it in PHP and then comes up with exactly 140 datapoints that I then need to store in another table (lets call it results_table). All of these data points are very small ints ("40","2.23","-1024" are good examples of the types of data).
I know the maximum # of columns for mysql is quite high (4000+) but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be, if it is better, broken up into 20 rows of 7 data points all with the same 'experiment_id' if fewer columns is better. HOWEVER I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc) so I wouldn't think this would be better performance than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also, if the 140 columns have the same meaning or if it differs per experiment - properly modeling the data according to normalization rules - i.e. how is data related to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have a 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35 etc. It could be infinitesimally quicker for one shape of course but then have you considered the extra complexity in the PHP code to deal with a different shape.
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 millon rows, however, if you want to retrieve a single data point from lots of experiments, then it will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints doesn't fit in ram (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns; but it will use less disc space and perhaps be faster to populate, if you use lots of columns.

Improving scalability of the modified preorder tree traversal algorithm

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).
One property I dislike about the standard approach is that to insert a node you
have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).
The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.
This seems bad for concurrency and scaling. Imagine you have a tree rooted at the world containing user groups for every account in a large system, it's extremely large, to the point you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.
Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.
Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)
0:64
________|________
/ \
1:31 32:63
/ \ / \
2:14 15-30 33:47 48:62
Here, a node is added to the left half of the tree.
0:64
________|________
/ \
1:31 32:63
/ | \ / \
2:11 11:20 21:30 33:47 48:62
The alogorithm must be extended for the insert and removal process to recursively renumber to the left/right indexes for the subtree. Since querying for immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the sub tree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.
This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only decendenants of the parent of the inserted/removed node).
My question(s) are basically:
Has this idea been formalized or implemented?
Is this the same as nested intervals?
I have heard of people doing this before, for the same reasons, yes.
Note that you do lose at a couple of small advantages of the algorithm by doing this
normally, you can tell the number of descendants of a node by ((right - left + 1) div 2). This can occasionally be useful, if e.g. you'd displaying a count in a treeview which should include the number of children to be found further down in the tree
Flowing from the above, it's easy to select out all leaf nodes -- WHERE (right = left + 1).
These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.
That said, it does sound like materialized paths may be more useful to you, as suggested above.
I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.
You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).
The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.