Simple Graph Search Algorithm in SQL (PostgreSQL) - sql

I've implemented a graph of nodes in PostgreSQL (not a tree).
The structure of the table is in this format:
id | node1 | node2
--------------------
1 | 1 | 2
2 | 1 | 3
3 | 4 | 1
4 | 5 | 1
5 | 1 | 6
This shows the relationships between the node 1 and the nodes it is connected to.
My Problem
...is that I need a function or method to find a particular node path in SQL.
I want to call a function like SELECT getGraphPath(startnode, targetnode) that displays the path in any form (rows or strings),
e.g. SELECT getGraphPath(1,18) gives:
[1]-->[3]-->[17]-->[18]
[1]-->[4]-->[10]-->[18]
or even rows:
Result |
--------
1
3
17
18
I'd also like to know how to traverse the graph using breadth first search and depth first search.

Something like this:
with recursive graph_cte (node1, node2, start_id)
as
(
select node1, node2, id as start_id
from graphs
where node1 = 1 -- alternatively select the starting element using where id = xyz
union all
select nxt.node1, nxt.node2, prv.start_id
from graphs nxt
join graph_cte prv on nxt.node1 = prv.node2
)
select start_id, node1, node2
from graph_cte
order by start_id;
(requires PostgreSQL 8.4 or higher)

SQL is not best suited to manipulating graphs and finding paths. You're probably better off loading the graph into a procedural language and using the Floyd-Warshall algorithm or Johnson's algorithm to find a path between nodes.
However, if you really must use SQL then I suggest you pick up a copy of Joe Celko's SQL for Smarties which has an entire chapter devoted to graphs in SQL.
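As a rough illustration of the "load it into a procedural language" route, here is a minimal Python sketch. It swaps in plain breadth-first search (and a stack-based depth-first variant) rather than Floyd-Warshall or Johnson's algorithm, since the question asks for a single path and for both traversal orders. The in-memory SQLite table, the edge data, and the names load_adjacency and find_path are all assumptions made for the example; edges are treated as undirected.

```python
import sqlite3
from collections import deque


def load_adjacency(conn):
    """Read every edge once and build an in-memory adjacency list."""
    adjacency = {}
    for a, b in conn.execute("SELECT node1, node2 FROM graph ORDER BY id"):
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)  # assumption: edges are undirected
    return adjacency


def find_path(adjacency, start, goal, depth_first=False):
    """Return one path from start to goal as a list of nodes, or None."""
    parent = {start: None}  # doubles as the visited set
    frontier = deque([start])
    while frontier:
        # popleft() treats the frontier as a queue (breadth-first),
        # pop() treats it as a stack (depth-first-style order)
        node = frontier.pop() if depth_first else frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                frontier.append(neighbour)
    return None


def demo_connection():
    """In-memory table with made-up edges (not the asker's real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE graph (id INTEGER PRIMARY KEY, node1 INTEGER, node2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO graph (node1, node2) VALUES (?, ?)",
        [(1, 3), (3, 17), (17, 18), (4, 1), (4, 10), (10, 18)],
    )
    return conn


adjacency = load_adjacency(demo_connection())
print(find_path(adjacency, 1, 18))                    # breadth-first
print(find_path(adjacency, 1, 18, depth_first=True))  # depth-first-style
```

Breadth-first search returns a path with the fewest edges; the stack-based variant returns whichever path it reaches first.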

You can use a graph database directly to solve your problem:
https://launchpad.net/oqgraph (OQGraph, a graph storage engine for MySQL)

This is not exactly memory-optimal, but works with cyclic graphs too:
with recursive graph_cte (node1, node2, path)
as
(
select node1, node2, ARRAY[node1] as path
from graph
where node1 = 4 -- start node
union all
select nxt.node1, nxt.node2, array_append(prv.path, nxt.node1)
from graph nxt, graph_cte prv
where nxt.node1 = prv.node2
and nxt.node1 != ALL(prv.path)
)
select array_append(path, node2)
from graph_cte
where node2 = 6 -- goal node
limit 1;
It's based on a previously accepted answer, but keeps track of the path. It should stop searching immediately once it finds the goal node.
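The same path-tracking idea can be tried without a PostgreSQL server. The sketch below is an adaptation, not the query above: it runs on SQLite through Python, and because SQLite has no array type it records the path as a string and uses instr() for the "already visited" check. The table contents, the names PATHS_SQL and find_paths, and the output format are assumptions chosen to match the question's example.

```python
import sqlite3

# Recursive CTE that follows edges node1 -> node2 and records each path as text.
PATHS_SQL = """
WITH RECURSIVE walk(node, path) AS (
    SELECT :start, '[' || :start || ']'
    UNION ALL
    SELECT g.node2, w.path || '-->[' || g.node2 || ']'
    FROM graph g
    JOIN walk w ON g.node1 = w.node
    WHERE instr(w.path, '[' || g.node2 || ']') = 0   -- skip nodes already on this path
)
SELECT path FROM walk WHERE node = :goal ORDER BY path
"""


def find_paths(conn, start, goal):
    """Return every cycle-free path from start to goal as a formatted string."""
    rows = conn.execute(PATHS_SQL, {"start": start, "goal": goal})
    return [row[0] for row in rows]


def demo_connection():
    """Made-up directed edges, including one that closes a cycle (17 -> 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE graph (id INTEGER PRIMARY KEY, node1 INTEGER, node2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO graph (node1, node2) VALUES (?, ?)",
        [(1, 3), (3, 17), (17, 18), (1, 4), (4, 10), (10, 18), (17, 1)],
    )
    return conn


conn = demo_connection()
for path in find_paths(conn, 1, 18):
    print(path)
```

The brackets around each node id keep the substring test from confusing, say, node 1 with node 17.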

Related

efficient breadth first search using sql joins

I'm dealing with a binary tree.
So I have a database table in my database where each node is a parent to up to 2 other nodes. I have a plan to efficiently find the top most node (under a given node) that is a parent to less than 2 other nodes. I'm looking for the top most open position to place a new node in other words. So I have this implemented as a breadth-first search. But the way I'm calling the database for each and every node is inefficient. I'm basically going down the tree, producing a running list of nodes on each level and checking each one if it is a parent to two other nodes.
And here's the code if you'd like to see it:
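# breadth-first search
# (the module name is a placeholder; the original snippet omitted the module header)
defmodule NodeSearch do
  import Ecto.Query
  # breadth_list is the queue of {node_id} tuples still to be checked
  def build_and_return_parent_id(breadth_list) do
    [{node_id} | tail] = breadth_list
    child_list = fetch_children_id(node_id)
    bc_list = tail ++ child_list
    case length(child_list) do
      # node is full (two or more children): recurse on the rest of the queue
      n when n >= 2 -> build_and_return_parent_id(bc_list)
      # fewer than two children: this node has an open spot
      _ -> node_id
    end
  end
  # one DB call per node: returns the children as {id} tuples, oldest first
  def fetch_children_id(id) do
    Repo.all(
      from n in Node,
        where: n.parent_id == ^id,
        order_by: [asc: n.inserted_at],
        select: {n.id}
    )
  end
end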
So instead of doing that so inefficiently (one DB call per node), I was thinking: how about I produce a list of all the nodes that have fewer than two children, then travel down the tree, using one DB call per level to get a list of all the nodes on that level, and simply compare the two lists? If there are matching IDs in both lists, I've found a node that has an available spot under it.
The problem is I know almost nothing about SQL queries. My guess is that this can be done with some kind of self join on the table.
node_id | parent_id
----------------------
1 | nil
2 | 1
3 | 1
4 | 2
5 | 2
6 | 3
7 | 4
8 | 5
9 | 6
10 | 3
So anyway, I'm sure that if this method works someone has done it before, but I can't seem to find any information on the kinds of SQL queries that would be used to generate the open list or the level list.
Now I suppose the second query is pretty simple: since we have an open list, we can just use a where-in-[list] clause. But the first one is the one I'm struggling with.
If you have anything you can point me to, or any help you can offer, I'd really appreciate it.
You can add columns depth and child_count and create a partial index on the nodes that still have an open spot (fewer than two children, so leaf nodes are included):
create index nodes_depth_open_idx on nodes(depth) where child_count < 2;
Then searching should be basically instant with:
select node_id from nodes where child_count < 2 order by depth limit 1;
You should also create triggers that would maintain these values. This would slow down insert operations slightly, as the insert would have to read the parent node depth and update the parent node child_count.
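A minimal sketch of how such a trigger could look, run on SQLite through Python so it is self-contained. The trigger body, the index name, and the helper names are assumptions made for illustration; PostgreSQL would need the equivalent written as a trigger function. The sketch treats any node with fewer than two children (child_count < 2) as open, so leaf nodes qualify too, and it uses the sample rows from the question.

```python
import sqlite3

SCHEMA = """
CREATE TABLE nodes (
    node_id     INTEGER PRIMARY KEY,
    parent_id   INTEGER REFERENCES nodes(node_id),
    depth       INTEGER NOT NULL DEFAULT 0,
    child_count INTEGER NOT NULL DEFAULT 0
);
-- partial index covering only the nodes that still have an open spot
CREATE INDEX nodes_open_depth_idx ON nodes(depth) WHERE child_count < 2;

CREATE TRIGGER nodes_after_insert AFTER INSERT ON nodes
WHEN NEW.parent_id IS NOT NULL
BEGIN
    -- the new node sits one level below its parent
    UPDATE nodes
       SET depth = (SELECT p.depth + 1 FROM nodes AS p WHERE p.node_id = NEW.parent_id)
     WHERE node_id = NEW.node_id;
    -- the parent gained one child
    UPDATE nodes
       SET child_count = child_count + 1
     WHERE node_id = NEW.parent_id;
END;
"""


def build_demo():
    """Create the table and load the sample tree, parents before children."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [(1, None), (2, 1), (3, 1), (4, 2), (5, 2),
            (6, 3), (7, 4), (8, 5), (9, 6), (10, 3)]
    conn.executemany("INSERT INTO nodes (node_id, parent_id) VALUES (?, ?)", rows)
    return conn


def first_open_node(conn):
    """Shallowest node with fewer than two children (ties broken by node_id)."""
    row = conn.execute(
        "SELECT node_id FROM nodes WHERE child_count < 2 "
        "ORDER BY depth, node_id LIMIT 1"
    ).fetchone()
    return row[0] if row else None


conn = build_demo()
print(first_open_node(conn))
```

This searches the whole tree. Restricting the search to the subtree under a given node would need extra information, for example a stored path column, which is not shown here.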

Visiting a directed graph as if it were an undirected one, using a recursive query

I need help traversing a directed graph stored in a database.
Consider the following directed graph
1->2
2->1,3
3->1
A table stores those relations:
create database test;
\c test;
create table ownership (
parent bigint,
child bigint,
primary key (parent, child)
);
insert into ownership (parent, child) values (1, 2);
insert into ownership (parent, child) values (2, 1);
insert into ownership (parent, child) values (2, 3);
insert into ownership (parent, child) values (3, 1);
I'd like to extract all the semi-connected edges (i.e. the connected edges ignoring the direction) of the graph reachable from a node. I.e., if I start from parent=1, I'd like to have the following output
1,2
2,1
2,3
3,1
I'm using PostgreSQL.
I've modified the example on Postgres' manual which explains recursive queries, and I've adapted the join condition to go "up" and "down" (doing so I ignore the directions). My query is the following one:
\c test
WITH RECURSIVE graph(parent, child, path, depth, cycle) AS (
SELECT o.parent, o.child, ARRAY[ROW(o.parent, o.child)], 0, false
from ownership o
where o.parent = 1
UNION ALL
SELECT
o.parent, o.child,
path||ROW(o.parent, o.child),
depth+1,
ROW(o.parent, o.child) = ANY(path)
from
ownership o, graph g
where
(g.parent = o.child or g.child = o.parent)
and not cycle
)
select g.parent, g.child, g.path, g.cycle
from
graph g
Its output follows:
parent | child | path | cycle
--------+-------+-----------------------------------+-------
1 | 2 | {"(1,2)"} | f
2 | 1 | {"(1,2)","(2,1)"} | f
2 | 3 | {"(1,2)","(2,3)"} | f
3 | 1 | {"(1,2)","(3,1)"} | f
1 | 2 | {"(1,2)","(2,1)","(1,2)"} | t
1 | 2 | {"(1,2)","(2,3)","(1,2)"} | t
3 | 1 | {"(1,2)","(2,3)","(3,1)"} | f
1 | 2 | {"(1,2)","(3,1)","(1,2)"} | t
2 | 3 | {"(1,2)","(3,1)","(2,3)"} | f
1 | 2 | {"(1,2)","(2,3)","(3,1)","(1,2)"} | t
2 | 3 | {"(1,2)","(2,3)","(3,1)","(2,3)"} | t
1 | 2 | {"(1,2)","(3,1)","(2,3)","(1,2)"} | t
3 | 1 | {"(1,2)","(3,1)","(2,3)","(3,1)"} | t
(13 rows)
I have a problem: the query extracts the same edges many times, as they are reached through different paths, and I'd like to avoid this. If I modify the outer query into
select distinct g.parent, g.child from graph g
I have the desired result, but inefficiencies remain in the WITH query, as unneeded joins are done.
So, is there a solution to extract the reachable edges of a graph in a db, starting from a given one, without using distinct?
I also have another problem (this one is solved, look at the bottom): as you can see from the output, cycles stop only when a node is reached for the second time. I.e. I have (1,2) (2,3) (1,2).
I'd like to stop the cycle before cycling over that last node again, i.e. having (1,2) (2,3).
I've tried to modify the where condition as follows
where
(g.parent = o.child or g.child = o.parent)
and (ROW(o.parent, o.child) <> any(path))
and not cycle
to avoid visiting already visited edges, but it doesn't work and I cannot understand why ((ROW(o.parent, o.child) <> any(path)) should avoid cycling before going on the cycled edge again but doesn't work). How can I do to stop cycles one step before the node that closes the cycle?
Edit: as danihp suggested, to solve the second problem I used
where
(g.parent = o.child or g.child = o.parent)
and not (ROW(o.parent, o.child) = any(path))
and not cycle
and now the output contains no cycles. Rows went from 13 to 6, but I still have duplicates, so the main (first) problem of extracting all the edges without duplicates and without distinct remains. Current output with the "and not (ROW ...)" condition:
parent | child | path | cycle
--------+-------+---------------------------+-------
1 | 2 | {"(1,2)"} | f
2 | 1 | {"(1,2)","(2,1)"} | f
2 | 3 | {"(1,2)","(2,3)"} | f
3 | 1 | {"(1,2)","(3,1)"} | f
3 | 1 | {"(1,2)","(2,3)","(3,1)"} | f
2 | 3 | {"(1,2)","(3,1)","(2,3)"} | f
(6 rows)
Edit #2: following what Erwin Brandstetter suggested, I modified my query, but if I'm not wrong, the proposed query gives MORE rows than mine (the ROW comparison is still there as it seems clearer to me, even though I understand that string comparison would be more efficient).
Using the new query, I obtain 20 rows, while mine gives 6 rows:
WITH RECURSIVE graph(parent, child, path, depth) AS (
SELECT o.parent, o.child, ARRAY[ROW(o.parent, o.child)], 0
from ownership o
where 1 in (o.child, o.parent)
UNION ALL
SELECT
o.parent, o.child,
path||ROW(o.parent, o.child),
depth+1
from
ownership o, graph g
where
g.child in (o.parent, o.child)
and ROW(o.parent, o.child) <> ALL(path)
)
select g.parent, g.child from graph g
Edit 3: so, as Erwin Brandstetter pointed out, the last query was still wrong, while the right one can be found in his answer.
When I posted my first query, I hadn't understood that I was missing some joins, as it happens in the following case: if I start with the node 3, the db selects the rows (2,3) and (3,1). Then, the first inductive step of the query would select, joining from these rows, the rows (1,2), (2,3) and (3,1), missing the row (2,1) that should be included in the result as conceptually the algorithm would imply ( (2,1) is "near" (3,1) )
When I tried to adapt the example in the PostgreSQL manual, I was right to join on ownership's parent and child, but I was wrong not to save the value of graph that had to be joined in each step.
These types of queries seem to generate a different set of rows depending on the starting node (i.e. depending on the set of rows selected in the base step). So, I think it could be useful to select just one row containing the starting node in the base step, as you'll get any other "adjacent" node anyway.
Could work like this:
WITH RECURSIVE graph AS (
SELECT parent
,child
,',' || parent::text || ',' || child::text || ',' AS path
,0 AS depth
FROM ownership
WHERE parent = 1
UNION ALL
SELECT o.parent
,o.child
,g.path || o.child || ','
,g.depth + 1
FROM graph g
JOIN ownership o ON o.parent = g.child
WHERE g.path !~~ ('%,' || o.parent::text || ',' || o.child::text || ',%')
)
SELECT *
FROM graph
You mentioned performance, so I optimized in that direction.
Major points:
Traverse the graph only in the defined direction.
No need for a column cycle, make it an exclusion condition instead. One less step to go. That is also the direct answer to:
How can I do to stop cycles one step before the node that closes the
cycle?
Use a string to record the path. Smaller and faster than an array of rows. Still contains all necessary information. Might change with very big bigint numbers, though.
Check for cycles with the LIKE operator (~~), should be much faster.
If you don't expect more than 2147483647 rows over the course of time, use plain integer columns instead of bigint. Smaller and faster.
Be sure to have an index on parent. An index on child is irrelevant for my query (unlike your original, where you traverse edges in both directions).
For huge graphs I would switch to a plpgsql procedure, where you can maintain the path as temp table with one row per step and a matching index. A bit of an overhead, that will pay off with huge graphs, though.
Problems in your original query:
WHERE (g.parent = o.child or g.child = o.parent)
There is only one endpoint of your traversal at any point in the process. As you walk the directed graph in both directions, the endpoint can be either parent or child - but not both of them. You have to save the endpoint of every step, and then:
WHERE g.child IN (o.parent, o.child)
The violation of the direction also makes your starting condition questionable:
WHERE parent = 1
Would have to be
WHERE 1 IN (parent, child)
And the two rows (1,2) and (2,1) are effectively duplicates this way ...
Additional solution after comment
Ignore direction
Still walk any edge only once per path.
Use ARRAY for path
Save original direction in path, not actual direction.
Note, that this way (2,1) and (1,2) are effective duplicates, but both can be used in the same path.
I introduce the column leaf which saves the actual endpoint of every step.
WITH RECURSIVE graph AS (
SELECT CASE WHEN parent = 1 THEN child ELSE parent END AS leaf
,ARRAY[ROW(parent, child)] AS path
,0 AS depth
FROM ownership
WHERE 1 in (child, parent)
UNION ALL
SELECT CASE WHEN o.parent = g.leaf THEN o.child ELSE o.parent END -- AS leaf
,path || ROW(o.parent, o.child) -- AS path
,depth + 1 -- AS depth
FROM graph g
JOIN ownership o ON g.leaf in (o.parent, o.child)
AND ROW(o.parent, o.child) <> ALL(path)
)
SELECT *
FROM graph
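If only the set of reachable edges is needed, rather than every path, a different technique avoids both the duplicates and the path bookkeeping. The sketch below is not the query above: it collects the reachable nodes with UNION instead of UNION ALL, so each node enters the recursion once, and then lists the edges that start at one of those nodes. It runs on SQLite through Python; the names REACHABLE_EDGES_SQL and reachable_edges and the extra edge (7, 8) are assumptions added for the demonstration. PostgreSQL also accepts UNION in a recursive CTE, but this was only tried here on SQLite.

```python
import sqlite3

REACHABLE_EDGES_SQL = """
WITH RECURSIVE reach(node) AS (
    SELECT :start
    UNION                      -- UNION (not UNION ALL) drops nodes already seen
    SELECT CASE WHEN o.parent = r.node THEN o.child ELSE o.parent END
    FROM ownership o
    JOIN reach r ON r.node IN (o.parent, o.child)   -- ignore edge direction
)
SELECT o.parent, o.child
FROM ownership o
WHERE o.parent IN (SELECT node FROM reach)
ORDER BY o.parent, o.child
"""


def reachable_edges(conn, start):
    """Edges of the component containing start, each listed exactly once."""
    return list(conn.execute(REACHABLE_EDGES_SQL, {"start": start}))


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ownership (parent INTEGER, child INTEGER, "
        "PRIMARY KEY (parent, child))"
    )
    conn.executemany(
        "INSERT INTO ownership (parent, child) VALUES (?, ?)",
        # the four edges from the question plus one unrelated edge (7, 8)
        [(1, 2), (2, 1), (2, 3), (3, 1), (7, 8)],
    )
    return conn


conn = demo_connection()
print(reachable_edges(conn, 1))
```

Because every edge of a connected component has both endpoints in the reachable set, filtering on the parent column alone is enough.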

Creating a list tree with SQLite

I'm trying to make a hierarchical list with PHP and an SQLite table setup like this:
| itemid | parentid | name |
-----------------------------------------
| 1 | null | Item1 |
| 2 | null | Item2 |
| 3 | 1 | Item3 |
| 4 | 1 | Item4 |
| 5 | 2 | Item5 |
| 6 | 5 | Item6 |
The lists would be built with unordered lists and allow for this type of tree structure:
Item1
|_Item3
|_Item4
Item2
|_Item5
|_Item6
I've seen this done with directories and flat arrays, but I can't seem to make it work right with this structure and without a depth limit.
You're using a textbook design for storing hierarchical data in an SQL database. This design is called Adjacency List, i.e. each node in the hierarchy has a parentid foreign key to its immediate parent.
With this design, you can't generate a tree like you describe and support arbitrary depth for the tree. You've already figured this out.
Most other SQL databases (PostgreSQL, Microsoft SQL Server, Oracle, IBM DB2) support recursive queries, which solve this problem.
Update:
MySQL supports recursive CTE queries from version 8.0.1 (2017-04-10). See https://dev.mysql.com/doc/refman/8.0/en/with.html
SQLite supports recursive CTE queries from version 3.8.3 (2014-02-03). See https://www.sqlite.org/lang_with.html
If you use an older version, you need another solution to store the hierarchy. There are several solutions for this. See my presentation Models for Hierarchical Data with PHP and MySQL for descriptions and examples.
I usually prefer a design I call Closure Table, but each design has strength and weaknesses. Which one is best for your project depends on what kinds of queries you need to do efficiently with your data. So you should go study the solutions and choose one for yourself.
I know this was asked a long time ago, but with the current SQLite version it is trivial to do, and there is no need for a depth limit, contrary to what @Bill-Karwin says. So the correct answer should be reconsidered :)
My table has columns TMPLID and REF_TMPLID, and my structure's starting node is called Root.
CREATE TABLE MyStruct (
`TMPLID` text,
`REF_TMPLID` text
);
INSERT INTO MyStruct
(`TMPLID`, `REF_TMPLID`)
VALUES
('Root', NULL),
('Item1', 'Root'),
('Item2', 'Root'),
('Item3', 'Item1'),
('Item4', 'Item1'),
('Item5', 'Item2'),
('Item6', 'Item5');
And here is the main query, which builds the tree structure:
WITH RECURSIVE
under_root(name,level) AS (
VALUES('Root',0)
UNION ALL
SELECT tmpl.TMPLID, under_root.level+1
FROM MyStruct as tmpl JOIN under_root ON tmpl.REF_TMPLID=under_root.name
ORDER BY 2 DESC
)
SELECT substr('....................',1,level*3) || name as TreeStructure FROM under_root
And here is result
Root
...Item1
......Item3
......Item4
...Item2
......Item5
.........Item6
I'm sure this can be modified to work with the OP's table structure, so let this sample be a starting point.
Documentation and some samples https://www.sqlite.org/lang_with.html#rcex1
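For the table from the question (itemid, parentid, name), the adaptation might look like the following Python sketch. It departs from the query above in one respect: instead of relying on ORDER BY inside the recursive term, it builds a zero-padded sort key per row and sorts at the end, which makes the sibling order explicit. The table name items, the column sort_path, and the function render_tree are assumptions, and the padding assumes item ids below 100,000,000.

```python
import sqlite3

TREE_SQL = """
WITH RECURSIVE tree(itemid, name, level, sort_path) AS (
    -- top-level items have no parent
    SELECT itemid, name, 0, printf('%08d', itemid)
    FROM items
    WHERE parentid IS NULL
    UNION ALL
    -- each child extends its parent's sort key with its own padded id
    SELECT i.itemid, i.name, t.level + 1,
           t.sort_path || '/' || printf('%08d', i.itemid)
    FROM items i
    JOIN tree t ON i.parentid = t.itemid
)
SELECT level, name FROM tree ORDER BY sort_path
"""


def render_tree(conn):
    """Return the hierarchy as indented text lines, parents before children."""
    lines = []
    for level, name in conn.execute(TREE_SQL):
        prefix = "  " * level + "|_" if level else ""
        lines.append(prefix + name)
    return lines


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (itemid INTEGER PRIMARY KEY, parentid INTEGER, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (itemid, parentid, name) VALUES (?, ?, ?)",
        [(1, None, "Item1"), (2, None, "Item2"), (3, 1, "Item3"),
         (4, 1, "Item4"), (5, 2, "Item5"), (6, 5, "Item6")],
    )
    return conn


conn = demo_connection()
print("\n".join(render_tree(conn)))
```

The level column is what a PHP front end would use to open and close nested unordered lists while iterating over the rows.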

Recursion & MYSQL?

I got a really simple table structure like this:
Table Page Hits
id | title | parent | hits
---------------------------
1 | Root | | 23
2 | Child | 1 | 20
3 | ChildX | 1 | 30
4 | GChild | 2 | 40
As I don't want to have the recursion in my code, I would like to do it with recursive SQL.
Is there any SELECT statement to get the sum of Root (23+20+30+40) or Child ( 20 + 40 ) ?
You are organizing your hierarchical data using the adjacency list model. The fact that such recursive operations are difficult is in fact one major drawback of this model.
Some DBMSes, such as SQL Server 2005, Postgres 8.4 and Oracle 11g, support recursive queries using common table expressions with the WITH keyword.
As for MySQL, you may be interested in checking out the following article which describes an alternative model (the nested set model), which makes recursive operations easier (possible):
Mike Hillyer: Managing Hierarchical Data in MySQL
Not in 1 select statement, no.
If you knew the maximum depth of the relationship (i.e. parent->child->child or parent->child->child->child) you could write a select statement which would give you a bunch of numbers that you would then have to sum up separately (1 number per level of depth).
You could, however, do it with a mysql stored procedure which is recursive.
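On a database that does support recursive common table expressions, the subtree sum fits in one statement. The sketch below runs on SQLite through Python so that it is self-contained; the table name page_hits and the function subtree_hits are assumptions. The same WITH RECURSIVE form is expected to carry over to MySQL 8.0 and later, but that was not tried here.

```python
import sqlite3

SUBTREE_HITS_SQL = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM page_hits WHERE id = :root
    UNION ALL
    -- add the children of every page already in the subtree
    SELECT p.id FROM page_hits p JOIN subtree s ON p.parent = s.id
)
SELECT COALESCE(SUM(hits), 0)
FROM page_hits
WHERE id IN (SELECT id FROM subtree)
"""


def subtree_hits(conn, root_id):
    """Total hits of the page root_id plus all of its descendants."""
    return conn.execute(SUBTREE_HITS_SQL, {"root": root_id}).fetchone()[0]


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_hits (id INTEGER PRIMARY KEY, title TEXT, "
        "parent INTEGER, hits INTEGER)"
    )
    conn.executemany(
        "INSERT INTO page_hits (id, title, parent, hits) VALUES (?, ?, ?, ?)",
        [(1, "Root", None, 23), (2, "Child", 1, 20),
         (3, "ChildX", 1, 30), (4, "GChild", 2, 40)],
    )
    return conn


conn = demo_connection()
print(subtree_hits(conn, 1))  # Root: 23 + 20 + 30 + 40
print(subtree_hits(conn, 2))  # Child: 20 + 40
```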

cloning hierarchical data

Let's assume I have a self-referencing hierarchical table built the classical way, like this one:
CREATE TABLE test
(name text,id serial primary key,parent_id integer
references test);
insert into test (name,id,parent_id) values
('root1',1,NULL),('root2',2,NULL),('root1sub1',3,1),('root1sub2',4,1),
('root2sub1',5,2),('root2sub2',6,2);
testdb=# select * from test;
name | id | parent_id
-----------+----+-----------
root1 | 1 |
root2 | 2 |
root1sub1 | 3 | 1
root1sub2 | 4 | 1
root2sub1 | 5 | 2
root2sub2 | 6 | 2
What I need now is a function (preferably in plain SQL) that would take the id of a test record and clone all attached records (including the given one). The cloned records need to have new ids, of course. The desired result would look like this, for example:
Select * from cloningfunction(2);
name | id | parent_id
-----------+----+-----------
root2 | 7 |
root2sub1 | 8 | 7
root2sub2 | 9 | 7
Any pointers? I'm using PostgreSQL 8.3.
Pulling this result in recursively is tricky (although possible). However, it's typically not very efficient and there is a much better way to solve this problem.
Basically, you augment the table with an extra column which traces the tree to the top - I'll call it the "Upchain". It's just a long string that looks something like this:
name | id | parent_id | upchain
root1 | 1 | NULL | 1:
root2 | 2 | NULL | 2:
root1sub1 | 3 | 1 | 1:3:
root1sub2 | 4 | 1 | 1:4:
root2sub1 | 5 | 2 | 2:5:
root2sub2 | 6 | 2 | 2:6:
root1sub1sub1 | 7 | 3 | 1:3:7:
It's very easy to keep this field updated by using a trigger on the table. (Apologies for terminology but I have always done this with SQL Server). Every time you add or delete a record, or update the parent_id field, you just need to update the upchain field on that part of the tree. That's a trivial job because you just take the upchain of the parent record and append the id of the current record. All child records are easily identified using LIKE to check for records with the starting string in their upchain.
What you're doing effectively is trading a bit of extra write activity for a big saving when you come to read the data.
When you want to select a complete branch in the tree it's trivial. Suppose you want the branch under node 1. Node 1 has an upchain '1:' so you know that any node in the branch of the tree under that node must have an upchain starting '1:...'. So you just do this:
SELECT *
FROM table
WHERE upchain LIKE '1:%'
This is extremely fast (index the upchain field of course). As a bonus it also makes a lot of activities extremely simple, such as finding partial trees, level within the tree, etc.
I've used this in applications that track large employee reporting hierarchies but you can use it for pretty much any tree structure (parts breakdown, etc.)
Notes (for anyone who's interested):
I haven't given a step-by-step of the SQL code but once you get the principle, it's pretty simple to implement. I'm not a great programmer so I'm speaking from experience.
If you already have data in the table you need to do a one time update to get the upchains synchronised initially. Again, this isn't difficult as the code is very similar to the UPDATE code in the triggers.
This technique is also a good way to identify circular references which can otherwise be tricky to spot.
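To make the principle concrete, here is a small Python sketch on SQLite. It maintains the upchain in application code at insert time rather than in a trigger, which is a simplification of what is described above, and the helper names add_node and branch are made up for the example. Moving or deleting nodes, which would require rewriting the upchains of a whole branch, is not covered.

```python
import sqlite3


def add_node(conn, name, parent_id=None):
    """Insert a node and derive its upchain from the parent's upchain."""
    cur = conn.execute(
        "INSERT INTO test (name, parent_id) VALUES (?, ?)", (name, parent_id)
    )
    node_id = cur.lastrowid
    prefix = ""
    if parent_id is not None:
        prefix = conn.execute(
            "SELECT upchain FROM test WHERE id = ?", (parent_id,)
        ).fetchone()[0]
    # parent's chain plus own id, e.g. '1:3:' + '7:' -> '1:3:7:'
    conn.execute(
        "UPDATE test SET upchain = ? WHERE id = ?", (f"{prefix}{node_id}:", node_id)
    )
    return node_id


def branch(conn, node_id):
    """Names of the node and everything below it, found with one LIKE."""
    upchain = conn.execute(
        "SELECT upchain FROM test WHERE id = ?", (node_id,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT name FROM test WHERE upchain LIKE ? ORDER BY upchain",
        (upchain + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, "
    "parent_id INTEGER REFERENCES test(id), upchain TEXT)"
)
root1 = add_node(conn, "root1")
root2 = add_node(conn, "root2")
sub1 = add_node(conn, "root1sub1", root1)
add_node(conn, "root1sub2", root1)
add_node(conn, "root2sub1", root2)
add_node(conn, "root2sub2", root2)
add_node(conn, "root1sub1sub1", sub1)

print(branch(conn, root1))
```

The level of a node within the tree falls out of the same column: it is the number of colons in the upchain minus one.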
Joe Celko's method, which is similar to njreed's answer but more generic, can be found here:
Nested-Set Model of Trees (at the middle of the article)
Nested-Set Model of Trees, part 2
Trees in SQL -- Part III
@Maximilian: You are right, we forgot your actual requirement. How about a recursive stored procedure? I am not sure if this is possible in PostgreSQL, but here is a working SQL Server version:
CREATE PROCEDURE CloneNode
@to_clone_id int, @parent_id int
AS
SET NOCOUNT ON
DECLARE @new_node_id int, @child_id int
INSERT INTO test (name, parent_id)
SELECT name, @parent_id FROM test WHERE id = @to_clone_id
SET @new_node_id = @@IDENTITY
DECLARE @children_cursor CURSOR
SET @children_cursor = CURSOR FOR
SELECT id FROM test WHERE parent_id = @to_clone_id
OPEN @children_cursor
FETCH NEXT FROM @children_cursor INTO @child_id
WHILE @@FETCH_STATUS = 0
BEGIN
EXECUTE CloneNode @child_id, @new_node_id
FETCH NEXT FROM @children_cursor INTO @child_id
END
CLOSE @children_cursor
DEALLOCATE @children_cursor
Your example is accomplished by EXECUTE CloneNode 2, null (the second parameter is the new parent node).
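The same recursion can be sketched outside the database. The version below is Python on SQLite, not PostgreSQL 8.3, and the function name clone_node is an assumption that simply mirrors the procedure above. It reads the children before inserting the copy, so cloning a node directly under itself does not pick up the fresh copy as a child.

```python
import sqlite3


def clone_node(conn, to_clone_id, new_parent_id=None):
    """Copy a node and all of its descendants; return the id of the new copy."""
    row = conn.execute(
        "SELECT name FROM test WHERE id = ?", (to_clone_id,)
    ).fetchone()
    if row is None:
        raise ValueError(f"no node with id {to_clone_id}")
    # read the children first, before any new rows exist
    children = conn.execute(
        "SELECT id FROM test WHERE parent_id = ? ORDER BY id", (to_clone_id,)
    ).fetchall()
    cur = conn.execute(
        "INSERT INTO test (name, parent_id) VALUES (?, ?)", (row[0], new_parent_id)
    )
    new_id = cur.lastrowid
    for (child_id,) in children:
        clone_node(conn, child_id, new_id)
    return new_id


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test (name TEXT, id INTEGER PRIMARY KEY, "
        "parent_id INTEGER REFERENCES test(id))"
    )
    conn.executemany(
        "INSERT INTO test (name, id, parent_id) VALUES (?, ?, ?)",
        [("root1", 1, None), ("root2", 2, None), ("root1sub1", 3, 1),
         ("root1sub2", 4, 1), ("root2sub1", 5, 2), ("root2sub2", 6, 2)],
    )
    return conn


conn = demo_connection()
new_root = clone_node(conn, 2)  # same call shape as EXECUTE CloneNode 2, null
for record in conn.execute(
    "SELECT name, id, parent_id FROM test WHERE id >= ? ORDER BY id", (new_root,)
):
    print(record)
```

Each level costs one query plus one insert per node, so for very large branches a set-based approach would likely be faster.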
This sounds like an exercise from "SQL For Smarties" by Joe Celko...
I don't have my copy handy, but I think it's a book that'll help you quite a bit if this is the kind of problems you need to solve.