What's the case for duplicates in a BST?

How do you handle duplicate keys in a binary search tree?

I am not really sure what you are asking. But that won't stop me from posting an answer.
Usually, duplicate keys are disallowed in a BST. That tends to make things a lot easier, and it is a condition that is easy to avoid.
If you do want to allow duplicates, then insertions are not a problem. You can just stick the duplicate in either the left subtree or the right subtree.
The problem is that you can't count on the duplicates being on a particular side if it is a self-balancing tree like an AVL tree or a red-black tree. It seems like this might be a problem for deletions, but I once implemented an AVL tree that made no special provisions for duplicates, and it had no problems at all.
Deleting a node from an AVL tree involves (1) finding the node, (2) replacing it with either the greatest key in its left subtree or the smallest key in its right subtree, and (3) recursively deleting the node that supplied the replacement key. If the node has no subtree, it is simply removed and nothing more needs to be done.
In practice, deleting a node with duplicates means that the node with the sought key nearest the root will be replaced with something, either a node with another key, or a node with the same key. Either way, the ordering constraints are not violated, and everything proceeds with no trouble.
I don't know about red-black trees or other sorts of BSTs.
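To make that deletion scheme concrete, here is a minimal Python sketch of a plain (unbalanced) BST that follows the steps above; the AVL rebalancing is omitted to keep it short, and all names are illustrative rather than taken from any particular implementation. Duplicates need no special handling, for exactly the reason described: the replacement key never violates the ordering constraint.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:                       # duplicates go to the right subtree here
        node.right = insert(node.right, key)
    return node

def delete(node, key):
    """Remove one node carrying `key` from the subtree rooted at `node`."""
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    else:
        # Found a node with the sought key (the one nearest the root).
        if node.left is None:
            return node.right
        if node.right is None:
            return node.left
        # Replace it with the greatest key in its left subtree, then delete
        # the donor recursively; the donor may even carry the same key.
        donor = node.left
        while donor.right is not None:
            donor = donor.right
        node.key = donor.key
        node.left = delete(node.left, donor.key)
    return node
```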

It's up to your comparison check: if "equal" is treated the same as "smaller", duplicates will be placed on the "smaller" side; otherwise they end up on the "larger" side. Besides this, there shouldn't be an issue with duplicates, unless you want to avoid them of course, in which case you need an extra equality check.
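For illustration, here is a hypothetical insert routine showing how that comparison choice decides where duplicates land; swap `<=` for `<` to send them to the other side.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(node, key):
    if node is None:
        return Node(key)
    if key <= node.key:   # "equal" treated like "smaller": duplicates go left
        node.left = insert(node.left, key)
    else:                 # with a strict "<" above, duplicates would go right
        node.right = insert(node.right, key)
    return node
```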

Related

Best vs average runtime on binary search trees

It is known that O(log n) is the average time complexity for search, insertion, and deletion in a binary search tree. My question is: is this also the best case? If not, what are the best cases?
The best case, as is the case with other data structures, is O(1).
Two examples:
1.) The node that you're searching for is the root, and it's the only element in the BST.
2.) In a left/right skewed tree, the node that you want to delete is at the root.

Using TTreeCache with TChain friends

I'm using three TChains friended together in my analysis code, with addresses set only for some branches on each chain. What is the best way for me to use TTreeCache in this scenario? Do I have to manually specify which branches to cache? I want to cache all the branches that have their addresses set, but only those.
On a separate note, I'd like to get entries only from the first tree at first, and fetch entries from the other friended trees only if I decide to analyze the event further.

Risks and benefits of a modified closure table for hierarchical data

I am attempting to store hierarchical data in SQL and have resolved to use an object table, which holds all of the main data, and a closure table, which defines the relationships between the objects (read more on closure tables here [slides 40 to 68]).
After quite a bit of research, a closure table seemed to suit my needs well. One thing that I kept reading, however, is that if you want to query the direct ancestor / descendant of a particular node, you can use a depth column in your closure table (see slide 68 from the above link). I need this depth column to facilitate exactly that type of query. This is all well and good, but one of the main attractions of the closure table in the first place was the ease with which one could both query and modify the data contained therein. And adding a depth column seems to completely destroy the ease with which one can modify data (imagine adding a new node and offsetting an entire branch of the tree).
So - I'm considering modifying my closure table to define relations only between a node and its immediate ancestor / descendant. This allows me to still easily traverse the tree. Querying data seems relatively easy. Modifying data is not as easy as the original closure table without the depth field, but significantly easier than the one with the depth field. It seems like a fair compromise (almost between a closure table and an adjacency list).
Am I overlooking something, though? Am I losing one of the key advantages of the closure table by doing it this way? Does anyone see any inherent risks in this approach that may come back to haunt me later?
I believe the key advantage you are losing is that if you want to know all of the descendants or ancestors of a node, you now have to do a lot more traversals.
For example, if you start with the following simple tree:
A->B->C->D
To get all descendants of A you have to go A->B then B->C then C->D. So, three queries, as opposed to a single query if following the normal pattern.
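As a rough illustration of that difference (using sqlite3 purely as a stand-in; the table and column names are made up), with a full closure table the descendants of A come back in one query, while the immediate-parent-only design has to walk the tree one level at a time:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Full closure table: one row per (ancestor, descendant) pair, including self-pairs.
db.execute("CREATE TABLE closure (ancestor TEXT, descendant TEXT, depth INTEGER)")
db.executemany(
    "INSERT INTO closure VALUES (?, ?, ?)",
    [("A", "A", 0), ("A", "B", 1), ("A", "C", 2), ("A", "D", 3),
     ("B", "B", 0), ("B", "C", 1), ("B", "D", 2),
     ("C", "C", 0), ("C", "D", 1),
     ("D", "D", 0)],
)
# All descendants of A in a single query:
print(db.execute(
    "SELECT descendant FROM closure WHERE ancestor = 'A' AND depth > 0").fetchall())

# Immediate-parent-only table: each row links a node to its direct parent.
db.execute("CREATE TABLE edges (parent TEXT, child TEXT)")
db.executemany("INSERT INTO edges VALUES (?, ?)",
               [("A", "B"), ("B", "C"), ("C", "D")])
# Descendants of A now require walking the tree, one hop per query.
frontier, found = ["A"], []
while frontier:
    rows = db.execute(
        "SELECT child FROM edges WHERE parent IN (%s)" %
        ",".join("?" * len(frontier)), frontier).fetchall()
    frontier = [r[0] for r in rows]
    found.extend(frontier)
print(found)   # ['B', 'C', 'D'], after three round trips
```

(On databases that support recursive CTEs you can collapse that loop into a single statement, but it is still a recursive traversal rather than a flat indexed lookup.)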

Is it better to have a boolean (important flag) as an attribute or in a separate table?

I have three cases, and I don't know what the best solution is for each one; all of them concern boolean attributes.
I have a table of links, and each link has attributes that indicate whether it is visited, broken, or filtered; the value of each one must be updated only once (except for rare cases of resetting them all).
The same links also have a visiting attribute that is updated constantly, but in a table with more than 1 million rows, at most 10,000 or 20,000 of them will be true at any time.
I have a table of pages and one attribute to indicate whether each page has been processed or not. In the end (after processing), all rows must be true.
I want to know the best solution for each of these cases.
I think it is: an attribute in the first case, a separate table in the second, and I don't know for the third.
Other solutions (an index, maybe) are welcome.
IMPORTANT: both tables (pages and links) can have more than a million rows.
I would say a column for the first case, a separate table for the second, and a column for the third.
Your main concern, depending on the scale of your database, might be to separate the often-updated data from the bulk of the rest. That's why I'd suggest a table for the second case. You could, however, make judicious use of the "HOT" feature of PostgreSQL, which means that updates do not cause table bloat if the columns being updated are not indexed. But it's probably still a good idea to keep the traffic away from large tables, because of potentially large seek times, keeping autovacuum happy, etc. If you're concerned, I would test this out.
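To make that separation concrete, here is a hedged schema sketch (sqlite3 only for illustration; the table and column names are invented, not from the question): the set-once flags stay as columns on the big table, while the constantly updated "visiting" state moves to a narrow table holding one row per link currently being visited, so those updates never touch the wide table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Bulk table: written rarely, read often.
CREATE TABLE links (
    id       INTEGER PRIMARY KEY,
    url      TEXT NOT NULL,
    visited  BOOLEAN NOT NULL DEFAULT 0,   -- case 1: set once, keep as a column
    broken   BOOLEAN NOT NULL DEFAULT 0,
    filtered BOOLEAN NOT NULL DEFAULT 0
);

-- Hot table for case 2: one row per link currently being visited
-- (at most ~10,000-20,000 rows at any time).
CREATE TABLE links_visiting (
    link_id INTEGER PRIMARY KEY REFERENCES links(id)
);
""")

# Setting and clearing the "visiting" state never touches the wide table:
db.execute("INSERT OR IGNORE INTO links_visiting (link_id) VALUES (?)", (42,))
db.execute("DELETE FROM links_visiting WHERE link_id = ?", (42,))
```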
There is no "best" way. The only way to know if your approach is adequately performant is to do it and see. One approach where there are constant updates will not perform the same where there are large numbers of reads and few updates.
I'd suggest just putting everything in the table, unless you have a reason not to, and giving that a whirl.
But most importantly: what DBMS?

SQL Query for an Organization Chart?

I feel that this is likely a common problem, but from my google searching I can't find a solution quite as specific to my problem.
I have a list of Organizations (table) in my database and I need to be able to run queries based on their hierarchy. For example, if you query the highest Organization, I would want to return the Id's of all the Organizations listed under that Organization. Further, if I query an organization sort of mid-range, I want only the Organization Id's listed under that Organization.
What is the best way to a) set up the database schema and b) query? I want to only have to send the topmost Organization Id and then get the Id's under that Organization.
I think that makes sense, but I can clarify if necessary.
As promised in my comment, I dug up an article on how to store hierarchies in a database that allows constant-time retrieval of arbitrary subtrees. I think it will suit your needs much better than the answer currently marked as accepted, both in ease of use and speed of access. I could swear I saw this same concept on Wikipedia originally, but I can't find it now. It's apparently called a "modified preorder tree traversal".

The gist of it is that you number each node in the tree twice while doing a depth-first traversal: once on the way down, and once on the way back up (i.e. when you're unrolling the stack, in a recursive implementation). This means that the children of a given node have all their numbers in between the two numbers of that node. Throw an index on those columns and you've got really fast lookups. I'm sure that's a terrible explanation, so read the article, which goes into more depth and includes pictures.
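As a rough sketch of that numbering (the tree and the names are made up for illustration), here is how the two numbers could be assigned during a depth-first walk, and how a subtree query then becomes a simple range check:

```python
def number_tree(tree, root):
    """tree maps a node to its list of children; returns {node: (left, right)}."""
    numbers, counter = {}, [0]

    def visit(node):
        counter[0] += 1
        left = counter[0]            # number assigned on the way down
        for child in tree.get(node, []):
            visit(child)
        counter[0] += 1              # number assigned on the way back up
        numbers[node] = (left, counter[0])

    visit(root)
    return numbers

tree = {"Company": ["Sales", "Engineering"],
        "Sales": ["Europe"],
        "Europe": ["North"]}
nums = number_tree(tree, "Company")

# Everything under "Sales": nodes whose left number lies strictly between
# Sales' two numbers (in SQL this is a single indexed BETWEEN lookup).
lo, hi = nums["Sales"]
print([n for n, (l, r) in nums.items() if lo < l < hi])   # ['North', 'Europe']
```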
One simple way is to store the organization's parentage in a text field, like:
SALES-EUROPE-NORTH
To search for every sales organization, you can query on SALES-%. For each European sales org, query on SALES-EUROPE-%.
If you rename an organization, take care to update its child organizations as well.
This keeps it simple, without recursion, at the cost of some flexibility.
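A tiny illustration of that path-string idea, using sqlite3 and an invented schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, path TEXT)")
db.executemany("INSERT INTO organizations (path) VALUES (?)",
               [("SALES",), ("SALES-EUROPE",), ("SALES-EUROPE-NORTH",),
                ("SALES-AMERICAS",), ("ENGINEERING",)])

# Every organization below the top-level sales org:
rows = db.execute(
    "SELECT path FROM organizations WHERE path LIKE 'SALES-%'").fetchall()
print(rows)   # the European and Americas sub-organizations, not ENGINEERING
```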
The easy way is to have a ParentID column, which is a foreign key to the ID column in the same table, NULL for root nodes. But this method has some drawbacks.
Nested sets are an efficient way to store trees in a relational database.
You could have an Organization have an id PK and a parent FK reference to the id. Then for the query, use (if your database backend supports them) recursive queries, aka Common Table Expressions.
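A small sketch of that parent-FK schema plus a recursive CTE, shown here with sqlite3 and invented names (the same WITH RECURSIVE syntax exists in PostgreSQL, SQL Server, and others):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE organization (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES organization(id),   -- NULL for the root
    name      TEXT)""")
db.executemany("INSERT INTO organization VALUES (?, ?, ?)",
               [(1, None, "Head Office"),
                (2, 1, "Sales"),
                (3, 2, "Sales Europe"),
                (4, 1, "Engineering")])

# All organization ids under a given top-most id (here: 1).
rows = db.execute("""
    WITH RECURSIVE subtree(id) AS (
        SELECT id FROM organization WHERE parent_id = ?
        UNION ALL
        SELECT o.id FROM organization o JOIN subtree s ON o.parent_id = s.id
    )
    SELECT id FROM subtree
""", (1,)).fetchall()
print([r[0] for r in rows])   # [2, 4, 3] -- every org below the head office
```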