Binary Search Tree balancing by rotation (on AVL tree)

I'm wondering, is there only ONE possible outcome for the final balanced Binary Search Tree? Sometimes I get a different final balanced tree depending on where I do the rotations.

No. There are several types of self-balancing trees; the most popular are AVL and Red-Black. If you put the same data into each of them, the resulting trees will generally be different, yet all balanced.
Speaking specifically about AVL trees, here is a simple example:
   2          4          3
  / \        / \        / \
 1   4      2   5      2   4
    / \    / \         /     \
   3   5  1   3       1       5
All of them are balanced AVL trees containing the same keys, created by different orders of operations.
But if you repeat the same sequence of operations on exactly the same data, the resulting trees will be identical, because there is nothing probabilistic in the AVL algorithm.


Best data structure for finding tags of nested locations

Somebody pointed out that my data structure architecture sucks.
The task
I have a locations table which stores the name of a location. Then I have a tags table which stores information about those locations. The locations form a hierarchy, which I want to use to collect all tags.
Example
Locations:
USA <- California <- San Francisco <- Mission St
Tags:
USA: English
California: Sunny
California: West coast
San Francisco: Sea side
Mission St: Cable car station
If somebody requests information about Mission St, I want to deliver all tags of it and its ancestors (["English", "Sunny", "West coast", "Sea side", "Cable car station"]). If I request all tags of California, the answer would be ["English", "Sunny", "West coast"].
I'm looking for the best read performance! I don't care about write performance; this data is not changed very often. And I don't care about table sizes either. If I need more or larger tables to solve this faster, so be it.
The tables
So currently I'm thinking about setting up these tables:
locations
id | name
---|--------------
1 | USA
2 | California
3 | San Francisco
4 | Mission St
tags
id | location_id | name
---|-------------|------------------
1 | 1 | English
2 | 2 | Sunny
3 | 2 | West coast
4 | 3 | Sea side
5 | 4 | Cable car station
ancestors
I added a position field to store how far up the hierarchy each ancestor is (1 = direct parent).
| id | location_id | ancestor_id | position |
|----|-------------|-------------|----------|
| 1 | 2 | 1 | 1 |
| 2 | 3 | 2 | 1 |
| 3 | 3 | 1 | 2 |
| 4 | 4 | 3 | 1 |
| 5 | 4 | 2 | 2 |
| 6 | 4 | 1 | 3 |
Question
Is this a good solution for the problem, or is there a better one? I want to select, as fast as possible, all tags of any given location, including the tags of its ancestors. I'm using a PostgreSQL database, but I think this is a pure SQL architecture problem.
Your problem seems to consist of two challenges. The most interesting is "how do I store hierarchies in a relational database". There are lots of answers to that - the one you've proposed is the most common.
There's an alternative called "nested sets" which is faster for reading (in your example, finding all locations within a particular hierarchy becomes a "between x and y" range check).
Postgres has dedicated support for hierarchies (e.g. the ltree extension and recursive CTEs); I'd assume this would also provide great performance.
The second part of your question is "given a path in my hierarchy, retrieve all matching tags". The easiest option is to join to the tags table as you suggest.
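For illustration, with the tables proposed in the question, a query along these lines returns a location's own tags together with those of all its ancestors (:location_id is just a placeholder for the requested location):
-- Tags of the location itself plus tags of every ancestor.
SELECT t.name
FROM tags t
WHERE t.location_id = :location_id
   OR t.location_id IN (SELECT a.ancestor_id
                        FROM ancestors a
                        WHERE a.location_id = :location_id);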
The final aspect is "should you denormalize/precalculate". I usually recommend building and optimizing the "normalized" solution and only denormalize when you need to.
If you want to deliver all tags for a particular location, then I would recommend replicating the data and storing the tags in a tags array on a row for each location.
You say that the locations don't change very much, so I would simply batch-create the entire table whenever any underlying data changes.
Modifying the data in situ is rather problematic. A single update could end up affecting a zillion different rows -- consider a tag change on USA. Recalculating the entire table is going to be more efficient.
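A rough sketch of such a batch rebuild, using the tables from the question (the name location_tags_flat and the tags-array layout are made up here):
-- Rebuild the denormalized read table from scratch.
DROP TABLE IF EXISTS location_tags_flat;

CREATE TABLE location_tags_flat AS
SELECT l.id AS location_id,
       array_agg(t.name) AS tags   -- own tags plus all ancestor tags
FROM locations l
JOIN tags t
  ON t.location_id = l.id
  OR t.location_id IN (SELECT a.ancestor_id
                       FROM ancestors a
                       WHERE a.location_id = l.id)
GROUP BY l.id;

CREATE INDEX ON location_tags_flat (location_id);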
If you need to search on the tags as well as return them, then I would go for a more traditional structure of a table with two important columns, location and tag. Then you can have indexes on both (location) and (tag) to facilitate searching in either direction.
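A minimal sketch of that structure (names are illustrative):
CREATE TABLE location_tag (
    location_id int  NOT NULL,
    tag         text NOT NULL
);

-- One index per lookup direction.
CREATE INDEX ON location_tag (location_id);
CREATE INDEX ON location_tag (tag);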
If write performance is not crucial, I would go for denormalization of the database. That means you use the structure above for your write operations and fill a separate table for your read operations via a trigger, or via some async job if you are afraid of triggers. Then the read performance is optimal, but you have to invest a bit more in the write logic.
Using the write-side structure directly for read operations is indeed not a smart solution, because you don't know how deep the tree can get.

How can Postgres represent a tree of row IDs?

I want to represent a file-and-folder hierarchy in a Postgres 10 database. A structure like
Photos/
|-- Dog.jpg
|-- Cat.jpg
|-- Places/
|   |-- Paris.jpg
|   |-- Berlin.jpg
Songs/
|-- Happy.mp3
would be represented as something like
| id | filename | parent_id |
|----|------------|-----------|
| 1 | Photos | null |
| 2 | Songs | null |
| 3 | Cat.jpg | 1 |
| 4 | Happy.mp3 | 2 |
| 5 | Places | 1 |
| 6 | Berlin.jpg | 5 |
| 7 | Dog.jpg | 1 |
| 8 | Paris.jpg | 5 |
The database would track multiple users, and each user would have their own file-folder hierarchy.
I've been reading up on Postgres's ltree extension, and it seems like the solution to my problem, but I don't know if it is, and it's difficult to test. The labels seem like arbitrary strings -- is it possible to tell Postgres that a label should be an ID field in the same table? Would I need to create one initial root node for each user, only let them attach children to that node (or to children of its children), and then issue a select * from nodes where the path is under that root node -- can you select descendants that way?
Or am I forcing Postgres to do something it was never intended to do, when I should be looking at other kinds of database?
To answer correctly: It depends on your use case.
If your trees are fairly static (sub-nodes do not change very often), then ltree is a really good choice. You can do very fast and convenient queries for sub-nodes and ordering. In that case I would use a single root reference for each user, as you mentioned.
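For instance, with an ltree path column a whole subtree comes back with a single indexable predicate; the table name nodes and the path below are only illustrative:
-- All nodes at or below the node whose path is 'r42.1.7'
-- ('r42' standing in for the per-user root label).
SELECT *
FROM nodes
WHERE path <@ 'r42.1.7'::ltree;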
On the other hand: moving a subtree in ltree can force a huge rewrite of the ltree data. E.g. if you have paths like 1.1.1, 1.1.2 and 1.2.1 and you want to change the order of the subtrees at the first level below the root, all three values have to change.
So, if your tree structure is very dynamic, I would try the structure you mentioned above: save the parent node in an adjacency list (and maybe an extra column for ordering) and query it with recursive CTEs; a sketch follows the links below.
https://www.postgresql.org/docs/current/static/queries-with.html
http://schinckel.net/2014/09/13/long-live-adjacency-lists/
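As a rough sketch of that approach (assuming a table named files with the (id, filename, parent_id) columns from the question; the name files is made up here), a recursive CTE can walk down from any folder:
-- Everything inside the folder with id = 1 (Photos), at any depth.
WITH RECURSIVE subtree AS (
    SELECT id, filename, parent_id
    FROM files
    WHERE id = 1
  UNION ALL
    SELECT f.id, f.filename, f.parent_id
    FROM files f
    JOIN subtree s ON f.parent_id = s.id
)
SELECT * FROM subtree;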
Last but not least you could try it with the "nested sets" structure (https://en.wikipedia.org/wiki/Nested_set_model).
Every approach has its very own benefits and drawbacks. You have to do a very detailed analysis and maybe create some prototypes to test which one is the best for you.
Further reading:
https://medium.com/notes-from-a-messy-desk/representing-trees-in-postgresql-cbcdae419022
https://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/

Extracting network/tree structure from table in PostgreSQL

I have a PostgreSQL table that contains several conditions and their individual successors. Some conditions have several successors, and these successors might have several successors too. The goal is to extract all possible chains of conditions, to obtain something like a tree diagram of the data.
The table looks like this:
id | con | succ
----|-----|-----
1 | a | b
2 | a | c
3 | a | d
4 | b | c
5 | b | f
6 | c | e
7 | c | g
8 | c | h
9 | d | h
10 | d | i
I still have no clear idea how to store the single chains in the end, but I need the starting point (a), the respective end point, and all nodes between the starting and end point.
I'm thankful for any kind of advice on how to store the chains and how to extract them.
UPDATE:
This is an extract of my data:
ID | parent_ID
----|----------
403 | 302
404 | 2xx
405 | 303
406 | 304
407 | 304
408 | 2xx
409 | 305
501 | 2xx
502 | 305
503 | 2xx
504 | 2xx
505 | 2xx
506 | 305
507 | 2xx
508 | 306
509 | 2xx
510 | 307
511 | 308
512 | 308
513 | 308
514 | 309
515 | 310
600 | 5xx
You can see that some parent IDs are not IDs themselves but groups of IDs ('all IDs beginning with 2'). Now the question is how to make the recursive query run, or rather how to make it handle the '2xx' values. The values are stored as characters. Instead of '2xx', another notation would be possible as well.
Querying tree- and graph-related data stored in a database efficiently is a rather vast topic.
In terms of storage, note that storing an (id, parent_id) pair will usually be the better (as in widely accepted) option.
The question is how to query it, and more importantly how to do so efficiently.
Your main options for trees include:
WITH queries: http://www.postgresql.org/docs/current/static/queries-with.html
Pros: Built-in, and works fine when dealing with small sets
Cons: Doesn't scale well for larger sets
MPTT, aka pre-ordered trees: http://en.wikipedia.org/wiki/Tree_traversal
Pros: Fastest reads for trees
Cons: Slow writes, hard to maintain unless you do rows one by one
Nested sets (or intervals) for trees: http://en.wikipedia.org/wiki/Nested_set_model
Pros: Fast reads for trees
Cons: Writes are faster than with MPTT, but still slow; not trivial to understand
The ltree type in Postgres contrib: http://www.postgresql.org/docs/current/static/ltree.html
Pros: Built-in, indexable
Cons: Not ORM friendly
I'd add a hybrid variation of MPTT to the list: if you implement MPTT using float indexes, you can get away with not updating anything when moving things around in your tree, which makes things plenty fast. It's a lot trickier to maintain however, because collisions can occur when the difference between two indexes is too small — you need to re-index a large enough subset of the tree when this happens.
For graphs, WITH queries work too. Variations of MPTT and nested sets exist as well; for instance the GRIPP index. It's an area where research and new indexing methods are still quite active.
Your best bet is to work with the ltree data type. See the documentation here. That does require that you rework your table structure a bit, though. If that is not an option, you should look at recursive WITH queries which can - at first sight - work with your current table structure, but the queries will produce data in a format that is not as easy to manipulate as ltree data.
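For completeness, a recursive CTE that runs against the (con, succ) table as it stands might look like the sketch below; it builds each chain as a plain text path rather than an ltree value (the table name tree matches the conversion shown next):
-- Chains starting at 'a', built as dot-separated text paths.
WITH RECURSIVE chains AS (
    SELECT id, con || '.' || succ AS chain, succ AS last_node
    FROM tree
    WHERE con = 'a'
  UNION ALL
    SELECT t.id, c.chain || '.' || t.succ, t.succ
    FROM tree t
    JOIN chains c ON t.con = c.last_node
)
SELECT id, chain FROM chains;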
Converting your current table to a ltree variant is best done using a recursive with-query. First you need to create a new table to hold the ltree column:
-- ltree comes from Postgres's ltree extension (CREATE EXTENSION ltree), in case it is not already installed.
CREATE TABLE tree_list (
    id    int,
    chain ltree
);
Then run the recursive query and insert the results into the new table:
WITH RECURSIVE build_tree(id, chain) AS (
    SELECT id, con::ltree || succ
    FROM tree
    WHERE con = 'a'
  UNION ALL
    SELECT tree.id, build_tree.chain || tree.succ
    FROM tree, build_tree
    WHERE build_tree.chain ~ ('*.' || tree.con)::lquery
)
INSERT INTO tree_list
SELECT * FROM build_tree;
You will note that the 10 rows of data you provide above will yield 13 chains because there are multiple paths from a to each of e, g and h. This query should work with trees of practically unlimited depth.
id | chain
----+---------
1 | a.b
2 | a.c
3 | a.d
4 | a.b.c
5 | a.b.f
6 | a.c.e
7 | a.c.g
8 | a.c.h
9 | a.d.h
10 | a.d.i
6 | a.b.c.e
7 | a.b.c.g
8 | a.b.c.h
(13 rows)
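Once the chains are stored, lquery patterns make it easy to pull out particular paths; for example (a usage sketch against the tree_list table built above):
-- All chains that end at node 'h'.
SELECT id, chain
FROM tree_list
WHERE chain ~ '*.h'::lquery;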

Data Modeling : Item With Dimensions (One-to-Many)

I need suggestions on how to properly model an item record along with all of its corresponding dimensions.
Consider the following:
| ITEM_ID | ITEM_DESCRIPTION | ITEM_PRICE | SIZE  | LENGTH | COLOR |
| SH01    | POLO SHIRT       | 22.95      | LARGE |        |       |
| PA02    | KHAKI PANTS      | 9.95       | 38    | 32     |       |
| BR22    | BRACELET         | 10.95      |       |        | GREEN |
All of the items have different dimensions that may/may not be used by other items. Shirts and pants have sizes and lengths. The bracelet, however, has only a color.
Also, new dimensions may be necessary as new items are added (weight, pattern, etc.).
I've looked at EAV (entity-attribute-value), but from what I understand, reporting would be a nightmare with such a model.
How can I manage the dimensions for each item? Any and all suggestions would be greatly appreciated.
By using the word 'Dimension' you imply you are building a star schema. The physical representation of these 'optional' attributes is mostly dependent on your query tool and desired performance.
IMHO, in dimensional modelling, you should not be afraid of very wide dimensions, particularly if they make querying easier.
If a user runs a query on all product sizes including watches and pants, does it make sense to bucket watches etc. into a N/A size?
EAV is in many ways the opposite of dimensional modelling. Dimensional modelling is about making querying as fast and as simple as possible by rearranging data in the ETL process.
Design is often easier if you find a pre-proven design approach and stick with it.
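For what it's worth, a wide dimension along the lines suggested above might be sketched like this (column names are illustrative; attributes that do not apply to an item are simply left NULL or bucketed into 'N/A'):
CREATE TABLE dim_item (
    item_id          varchar(10) PRIMARY KEY,
    item_description text        NOT NULL,
    item_price       numeric(9,2),
    size_label       text,   -- 'LARGE', '38', ...; NULL or 'N/A' when not applicable
    length_label     text,   -- '32', ...; NULL or 'N/A' when not applicable
    color            text    -- 'GREEN', ...; NULL or 'N/A' when not applicable
);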

Creating a flattened table/view of a hierarchically-defined set of data

I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy.
I really like the way the data is structured, but performance is dismal when I need to know if a record at level 8 is a child of a record at level 1.
I have PL/SQL stored functions which do these lookups for me, each having a select * from tbl start with ... connect by... statement. This works fine when I'm querying a handful of records, but I'm in a situation now where I need to query ~10k records at once and for each of them run this function. It's taking 2-3 minutes where I need it to run in just a few seconds.
Using some heuristics based on my knowledge of the current data, I can get rid of the lookup function and just do childrecord.key || '%' LIKE parentrecord.key but that's a really dirty hack and will not always work.
So now I'm thinking that for this hierarchically-defined table I need to have a separate parent-child table which will contain every relationship: for a single chain going from level 1 to level 8 that is 28 records, associating 1 with 2, 1 with 3, ..., 1 with 8, then 2 with 3, 2 with 4, ..., 2 with 8, and so forth.
My thought is that I would need to have an insert trigger where it will basically run the connect by query and for every match going up the hierarchy it will insert a record in the lookup table. And to deal with old data I'll just set up foreign keys to the main table with cascading deletes.
Are there better options than this? Am I missing another way that I could determine these distant ancestor/descendant relationships more quickly?
EDIT: This appears to be exactly what I'm thinking about: http://evolt.org/working_with_hierarchical_data_in_sql_using_ancestor_tables
So what you want is to materialize the transitive closures. That is, given this application table ...
ID | PARENT_ID
------+----------
1 |
2 | 1
3 | 2
4 | 2
5 | 4
... the graph table would look like this:
PARENT_ID | CHILD_ID
-----------+----------
1 | 2
1 | 3
1 | 4
1 | 5
2 | 3
2 | 4
2 | 5
4 | 5
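For illustration, once such a closure table exists, the expensive hierarchical lookup collapses into a single indexed probe (the table name graph is just shorthand for the table above):
-- Is record 5 a descendant of record 1?
SELECT COUNT(*)
FROM graph
WHERE parent_id = 1
  AND child_id  = 5;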
It is possible to maintain a table like this in Oracle, although you will need to roll your own framework for it. The question is whether it is worth the overhead. If the source table is volatile then keeping the graph data fresh may cost more cycles than you will save on the queries. Only you know your data's profile.
I don't think you can maintain such a graph table with CONNECT BY queries and cascading foreign keys. Too much indirect activity, too hard to get right. Also a materialized view is out, because we cannot write a SQL query which will zap the 1->5 record when we delete the source record for ID=4.
So I suggest you read a paper called Maintaining Transitive Closure of Graphs in SQL by Dong, Libkin, Su and Wong. It contains a lot of theory and some gnarly (Oracle) SQL, but it will give you the grounding to build the PL/SQL you need to maintain a graph table.
"can you expand on the part about it
being too difficult to maintain with
CONNECT BY/cascading FKs? If I control
access to the table and all
inserts/updates/deletes take place via
stored procedures, what kinds of
scenarios are there where this would
break down?"
Consider the record 1->5, which is a short-circuit of 1->2->4->5. Now what happens if, as I said before, we delete the source record for ID=4? Cascading foreign keys could delete the entries for 2->4 and 4->5. But that leaves 1->5 (and indeed 2->5) in the graph table, although they no longer represent a valid edge in the graph.
What might work (I think, I haven't done it) would be to use an additional synthetic key in the source table, like this.
ID | PARENT_ID | NEW_KEY
------+-----------+---------
1 | | AAA
2 | 1 | BBB
3 | 2 | CCC
4 | 2 | DDD
5 | 4 | EEE
Now the graph table would look like this:
PARENT_ID | CHILD_ID | NEW_KEY
-----------+----------+---------
1 | 2 | BBB
1 | 3 | CCC
1 | 4 | DDD
1 | 5 | DDD
2 | 3 | CCC
2 | 4 | DDD
2 | 5 | DDD
4 | 5 | DDD
So the graph table has a foreign key referencing the relationship in the source table which generated it, rather than linking to the ID. Then deleting the record for ID=4 would cascade deletes of all records in the graph table where NEW_KEY=DDD.
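A sketch of the cascade wiring described here (untested, as noted above; table names are illustrative):
-- NEW_KEY must be unique in the source table so it can be referenced.
ALTER TABLE source_tbl ADD CONSTRAINT source_new_key_uq UNIQUE (new_key);

-- Each graph row references the relationship that generated it;
-- deleting a source row then cascades to every derived closure row.
ALTER TABLE graph ADD CONSTRAINT graph_new_key_fk
    FOREIGN KEY (new_key) REFERENCES source_tbl (new_key)
    ON DELETE CASCADE;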
This would work if any given ID can only have zero or one parent IDs. But it won't work if it is permissible for this to happen:
ID | PARENT_ID
------+----------
5 | 2
5 | 4
In other words the edge 1->5 represents both 1->2->4->5 and 1->2->5. So, what might work depends on the complexity of your data.