efficient breadth first search using sql joins - sql

I'm dealing with a binary tree.
So I have a database table in my database where each node is a parent to up to 2 other nodes. I have a plan to efficiently find the top most node (under a given node) that is a parent to less than 2 other nodes. I'm looking for the top most open position to place a new node in other words. So I have this implemented as a breadth-first search. But the way I'm calling the database for each and every node is inefficient. I'm basically going down the tree, producing a running list of nodes on each level and checking each one if it is a parent to two other nodes.
Here's a diagram:
And here's the code if you'd like to see it:
# breadth-first search
def build_and_return_parent_id(breadth_list) do
[ {node_id} | tail ] = breadth_list
child_list = fetch_children_id(node_id)
bc_list = tail ++ child_list
case length(child_list) do
x when x > 2 ->
# recursion
build_and_return_parent_id(bc_list)
2 ->
# recursion
build_and_return_parent_id(bc_list)
_ -> node_id
end
end
def fetch_children_id(id) do
Repo.all( from n in Node,
where: n.parent_id == ^id,
order_by: [asc: n.inserted_at],
select: {n.id})
end
end
So instead of doing that so inefficiently - one db call per node - I was thinking, how about I produce a list of all the nodes that have less than two parents, then travel down the tree, for each level use one db call to get a list of all the nodes on that level, then simply compare the two lists. if there are matching IDs in both the lists I've found a node that has an available spot under it.
Here's a diagram:
The problem is I know almost nothing about sql queries. my guess is that this can be done with some kind of self join on the table.
node_id | parent_id
----------------------
1 | nil
2 | 1
3 | 1
4 | 2
5 | 2
6 | 3
7 | 4
8 | 5
9 | 6
10 | 3
So anyway I'm sure if this method works someone has done it before but I can't seem to find any information on the kinds of sql queries that would be used to generate the open list or the level list.
Now I suppose the 2nd query is pretty simple. since we have an open list we can just use a where-in-[list] clause. Byt the first one I think is the one I'm struggling with.
If you have anything you can point me to or help you can offer I'd really appreciate it.

You can add columns depth and child_count and create an index:
create index nodes_depth_1child_idx on nodes(depth) where child_count=1;
Then searching should be basically instant with:
select node_id from nodes where child_count=1 order by depth limit 1;
You should also create triggers that would maintain these values. This would slow down insert operations slightly, as the insert would have to read the parent node depth and update the parent node child_count.

Related

SQL Server 2014 equivalent to mysql's find_in_set()

I'm working with a database that has a locations table such as:
locationID | locationHierarchy
1 | 0
2 | 1
3 | 1,2
4 | 1
5 | 1,4
6 | 1,4,5
which makes a tree like this
1
--2
----3
--4
----5
------6
where locationHierarchy is a csv string of the locationIDs of all its ancesters (think of a hierarchy tree). This makes it easy to determine the hierarchy when working toward the top of the tree given a starting locationID.
Now I need to write code to start with an ancestor and recursively find all descendants. MySQL has a function called 'find_in_set' which easily parses a csv string to look for a value. It's nice because I can just say "find in set the value 4" which would give all locations that are descendants of locationID of 4 (including 4 itself).
Unfortunately this is being developed on SQL Server 2014 and it has no such function. The CSV string is a variable length (virtually unlimited levels allowed) and I need a way to find all ancestors of a location.
A lot of what I've found on the internet to mimic the find_in_set function into SQL Server assumes a fixed depth of hierarchy such as 4 levels maximum) which wouldn't work for me.
Does anyone have a stored procedure or anything that I could integrate into a query? I'd really rather not have to pull all records from this table to use code to individually parse the CSV string.
I would imagine searching the locationHierarchy string for locationID% or %,{locationid},% would work but be pretty slow.
I think you want like -- in either database. Something like this:
select l.*
from locations l
where l.locationHierarchy like #LocationHierarchy + ',%';
If you want the original location included, then one method is:
select l.*
from locations l
where l.locationHierarchy + ',' like #LocationHierarchy + ',%';
I should also note that SQL Server has proper support for recursive queries, so it has other options for hierarchies apart from hierarchy trees (which are still a very reasonable solution).
Finally It worked for me..
SELECT * FROM locations WHERE locationHierarchy like CONCAT(#param,',%%') OR
o.unitnumber like CONCAT('%%,',#param,',%%') OR
o.unitnumber like CONCAT('%%,',#param)

Select N unique rows randomly while excluding 1 specific row (in SQLite DB)

I need to select x unique rows randomly from a table with n rows, while excluding 1 specific row. (x is small, 3 for example) This can be done in several queries if needed and I can also compute anything in programming language (Java). The one important thing is that it must be done faster than O(n), consuming O(x) memory and indefinite looping (retrying) is also undesirable.
Probability of selection should be equal for all rows (except the one which is excluded, of course)
Example:
| id | some data |
|————|———————————|
| 1 | … |
| 2 | … |
| 3 | … |
| 4 | … |
| 5 | … |
The algorithm is ran with arguments (x = 3, exclude_id = 4), so:
it should select 3 different random rows from rows with id in 1,2,3,5.
(end of example)
I tried the following approach:
get row count (= n);
get the position of the excluded row by something like select count(*) where id < excluded_id, assuming id is monotonically increasing;
select the numbers from 0..n, obeying all the rules, by using some "clever" algorithms, it's something like O(x), in other words fast enough;
select these x rows one by one by using limit(index, 1) SQL clause.
However, it turned out that it's possible for rows to change positions (I'm not sure why), so the auto-generated ids are not monotonically increasing. And in this case the second step (get the position of the excluded row) produces wrong result and the algorithm fails to do its job correctly.
How can this be improved?
Also, if this is vastly easier with a different SQL-like database system, it would be interesting, too. (the DB is on a server and I can install any software there as long as it's compatible with Ubuntu 14.04 LTS)
(I'm sorry for a bit of confusion;) the algorithm I used is actually correct it the id is monotonically increasing, I just forgot that it was not itself auto-generated, it was taken from another table where it's auto-generated, and it was possible to add these rows in different order.
So I added another id for this table, which is auto-generated, and used it for row selection, and now it works as it should.

Summarizing leaf node values in a tree using SQL?

Given a column containing a set of strings that represent leaf nodes in a tree, along with some statistics:
leafnodes count
--------- -----
/a/b 1
/a/c 3
/d/e/f 2
/d/e/c 5
How can I generate the set of non-leaf nodes with summarized statistics? It would be nice to summarize both the immediate children and also recursively summarize all descendents.
non-leafnodes immediate-counts recursive-counts
--- ---------------- ----------------
/a 4 4
/d 0 7
/d/e 7 7
Generic SQL preferred, but Oracle-specific solutions are fine.
There is no generic SQL solution, except adding precalculated fields into table, for oracle you do use Hierarchical queries but then better to change structure anyway, as you will have to struggle with substrings

Simple Graph Search Algorithm in SQL (PostgreSQL)

I've implemented a graph of nodes in PostgreSQL (not a tree)
the structure of the table is in this format
id | node1 | node2
--------------------
1 | 1 | 2
2 | 1 | 3
3 | 4 | 1
4 | 5 | 1
5 | 1 | 6
This shows the relationships between the node 1 and the nodes it is connected to.
My Problem
...is that i need a function or method to find a particular node path in sql.
I want to call a function like SELECT getGraphPath(startnode,targetnode) and this will display the path in any form(rows, or strings)
e.g. SELECT getGraphPath(1,18) gives:
[1]-->[3]-->[17]-->[18]
[1]-->[4]-->[10]-->[18]
or even rows:
Result |
--------
1
3
17
18
I'd also like to know how to traverse the graph using breadth first search and depth first search.
Something like this:
with recursive graph_cte (node1, node2, start_id)
as
(
select node1, node2, id as start_id
from graphs
where node1 = 1 -- alternatively elect the starting element using where id = xyz
union all
select nxt.node1, nxt.node2, prv.start_id
from graphs nxt
join graph_cte prv on nxt.node1 = prv.node2
)
select start_id, node1, node2
from graph_cte
order by start_id;
(requires PostgreSQL 8.4 or higher)
SQL is not best suited to manipulating graphs and finding paths. You're probably better off loading the graph in to a procedural language and using the Floyd-Warshall algorithm or Johnson's algorithm to find a path between nodes.
However, if you really must use SQL then I suggest you pick up a copy of Joe Celko's SQL for Smarties which has an entire chapter devoted to graphs in SQL.
You can use graph database directly to solve your problem:
https://launchpad.net/oqgraph -> graph as mysql storage engine
This is not exactly memory-optimal, but works with cyclic graphs too:
with recursive graph_cte (node1, node2, path)
as
(
select node1, node2, ARRAY[node1] as path
from graph
where node1 = 4 -- start node
union all
select nxt.node1, nxt.node2, array_append(prv.path, nxt.node1)
from graph nxt, graph_cte prv
where nxt.node1 = prv.node2
and nxt.node1 != ALL(prv.path)
)
select array_append(path, node2)
from graph_cte
where node2 = 6 -- goal node
limit 1;
Its based on a previously accepted answer, but keeps track of the path. It should stop searching immediately once it finds the goal node.

cloning hierarchical data

let's assume i have a self referencing hierarchical table build the classical way like this one:
CREATE TABLE test
(name text,id serial primary key,parent_id integer
references test);
insert into test (name,id,parent_id) values
('root1',1,NULL),('root2',2,NULL),('root1sub1',3,1),('root1sub2',4,1),('root
2sub1',5,2),('root2sub2',6,2);
testdb=# select * from test;
name | id | parent_id
-----------+----+-----------
root1 | 1 |
root2 | 2 |
root1sub1 | 3 | 1
root1sub2 | 4 | 1
root2sub1 | 5 | 2
root2sub2 | 6 | 2
What i need now is a function (preferrably in plain sql) that would take the id of a test record and
clone all attached records (including the given one). The cloned records need to have new ids of course. The desired result
would like this for example:
Select * from cloningfunction(2);
name | id | parent_id
-----------+----+-----------
root2 | 7 |
root2sub1 | 8 | 7
root2sub2 | 9 | 7
Any pointers? Im using PostgreSQL 8.3.
Pulling this result in recursively is tricky (although possible). However, it's typically not very efficient and there is a much better way to solve this problem.
Basically, you augment the table with an extra column which traces the tree to the top - I'll call it the "Upchain". It's just a long string that looks something like this:
name | id | parent_id | upchain
root1 | 1 | NULL | 1:
root2 | 2 | NULL | 2:
root1sub1 | 3 | 1 | 1:3:
root1sub2 | 4 | 1 | 1:4:
root2sub1 | 5 | 2 | 2:5:
root2sub2 | 6 | 2 | 2:6:
root1sub1sub1 | 7 | 3 | 1:3:7:
It's very easy to keep this field updated by using a trigger on the table. (Apologies for terminology but I have always done this with SQL Server). Every time you add or delete a record, or update the parent_id field, you just need to update the upchain field on that part of the tree. That's a trivial job because you just take the upchain of the parent record and append the id of the current record. All child records are easily identified using LIKE to check for records with the starting string in their upchain.
What you're doing effectively is trading a bit of extra write activity for a big saving when you come to read the data.
When you want to select a complete branch in the tree it's trivial. Suppose you want the branch under node 1. Node 1 has an upchain '1:' so you know that any node in the branch of the tree under that node must have an upchain starting '1:...'. So you just do this:
SELECT *
FROM table
WHERE upchain LIKE '1:%'
This is extremely fast (index the upchain field of course). As a bonus it also makes a lot of activities extremely simple, such as finding partial trees, level within the tree, etc.
I've used this in applications that track large employee reporting hierarchies but you can use it for pretty much any tree structure (parts breakdown, etc.)
Notes (for anyone who's interested):
I haven't given a step-by-step of the SQL code but once you get the principle, it's pretty simple to implement. I'm not a great programmer so I'm speaking from experience.
If you already have data in the table you need to do a one time update to get the upchains synchronised initially. Again, this isn't difficult as the code is very similar to the UPDATE code in the triggers.
This technique is also a good way to identify circular references which can otherwise be tricky to spot.
The Joe Celko's method which is similar to the njreed's answer but is more generic can be found here:
Nested-Set Model of Trees (at the middle of the article)
Nested-Set Model of Trees, part 2
Trees in SQL -- Part III
#Maximilian: You are right, we forgot your actual requirement. How about a recursive stored procedure? I am not sure if this is possible in PostgreSQL, but here is a working SQL Server version:
CREATE PROCEDURE CloneNode
#to_clone_id int, #parent_id int
AS
SET NOCOUNT ON
DECLARE #new_node_id int, #child_id int
INSERT INTO test (name, parent_id)
SELECT name, #parent_id FROM test WHERE id = #to_clone_id
SET #new_node_id = ##IDENTITY
DECLARE #children_cursor CURSOR
SET #children_cursor = CURSOR FOR
SELECT id FROM test WHERE parent_id = #to_clone_id
OPEN #children_cursor
FETCH NEXT FROM #children_cursor INTO #child_id
WHILE ##FETCH_STATUS = 0
BEGIN
EXECUTE CloneNode #child_id, #new_node_id
FETCH NEXT FROM #children_cursor INTO #child_id
END
CLOSE #children_cursor
DEALLOCATE #children_cursor
Your example is accomplished by EXECUTE CloneNode 2, null (the second parameter is the new parent node).
This sounds like an exercise from "SQL For Smarties" by Joe Celko...
I don't have my copy handy, but I think it's a book that'll help you quite a bit if this is the kind of problems you need to solve.