SQL - postgres - shortest path in graph - recursion - sql

I have a table which contains the edges from node x to node y in a graph.
n1 | n2
-------
a | a
a | b
a | c
b | b
b | d
b | c
d | e
I would like to create a (materialized) view which denotes the shortest number of nodes/hops a path contains to reach from x to node y:
n1 | n2 | c
-----------
a | a | 0
a | b | 1
a | c | 1
a | d | 2
a | e | 3
b | b | 0
b | d | 1
b | c | 1
b | e | 2
d | e | 1
How should I model my tables and views to facilitate this? I guess I need some kind of recursion, but I believe that is pretty difficult to accomplish in SQL. I would like to avoid that, for example, the clients need to fire 10 queries if the path happens to contain 10 nodes/hops.

This works for me, but it's kinda ugly:
WITH RECURSIVE paths (n1, n2, distance) AS (
SELECT
nodes.n1,
nodes.n2,
1
FROM
nodes
WHERE
nodes.n1 <> nodes.n2
UNION ALL
SELECT
paths.n1,
nodes.n2,
paths.distance + 1
FROM
paths
JOIN nodes
ON
paths.n2 = nodes.n1
WHERE
nodes.n1 <> nodes.n2
)
SELECT
paths.n1,
paths.n2,
min(distance)
FROM
paths
GROUP BY
1, 2
UNION
SELECT
nodes.n1,
nodes.n2,
0
FROM
nodes
WHERE
nodes.n1 = nodes.n2
Also, I am not sure how good it will perform against larger datasets. As suggested by Mark Mann, you may want to use a graph library instead, e.g. pygraph.
EDIT: here's a sample with pygraph
from pygraph.algorithms.minmax import shortest_path
from pygraph.classes.digraph import digraph
g = digraph()
g.add_node('a')
g.add_node('b')
g.add_node('c')
g.add_node('d')
g.add_node('e')
g.add_edge(('a', 'a'))
g.add_edge(('a', 'b'))
g.add_edge(('a', 'c'))
g.add_edge(('b', 'b'))
g.add_edge(('b', 'd'))
g.add_edge(('b', 'c'))
g.add_edge(('d', 'e'))
for source in g.nodes():
tree, distances = shortest_path(g, source)
for target, distance in distances.iteritems():
if distance == 0 and not g.has_edge((source, target)):
continue
print source, target, distance
Excluding the graph building time, this takes 0.3ms while the SQL version takes 0.5ms.

Expanding on Mark's answer, there are some very reasonable approaches to explore a graph in SQL as well. In fact, they'll be faster than the dedicated libraries in perl or python, in that DB indexes will spare you the need to explore the graph.
The most efficient of index (if the graph is not constantly changing) is a nested-tree variation called the GRIPP index. (The linked paper mentions other approaches.)
If your graph is constantly changing, you might want to adapt the nested intervals approach to graphs, in a similar manner that GRIPP extends nested sets, or to simply use floats instead of integers (don't forget to normalize them by casting to numeric and back to float if you do).

Rather than computing these values on the fly, why not create a real table with all interesting pairs along with the shortest path value. Then whenever data is inserted, deleted or updated in your data table, you can recalculate all of the shortest path information. (Perl's Graph module is particularly well-suited to this task, and Perl's DBI interface makes the code straightforward.)
By using an external process, you can also limit the number of recalculations. Using PostgreSQL triggers would cause recalculations to occur on every insert, update and delete, but if you knew you were going to be adding twenty pairs of points, you could wait until your inserts were completed before doing the calculations.

Related

PostgreSQL data transformation - Turn rows into columns

I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique, there are no keys, nothing. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to take into account the case where a certain key has no attribute named attr_name1 and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables etc, but there are quite a lot of rows in the table and I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This presumes that (k, p) is unique. If there are duplicate keys, this will clearly find the one v with the highest value (for each (k,p))
As a general rule this kind of pivoting makes the data harder to process in SQL. This is often done for display purposes because humans find this easier to read. However, from a software engineering perspective, such formatting should not be done in the data layer; be careful that by doing this you don't actually make your future life harder.

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
so far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, but it also does not provide the hierarchy
I can of course do a path query (MATCH path = MATCH...) but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as rec
CALL {
WITH rec
MATCH (rec)-[r:canGenerate]-(g:generator)
return r.GenCost as GenCost, g.name AS GenName
}
return rec.name, GenCost , GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get double entries in the list, but what I want is separate lists for each alternative possibility, so that I can generate the cost.
This is my pattern tree:
Awesome
input1
input2
input 3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: Two lists (or n, in the general case, where I might have n possible paths), one
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative set, so that I can calculate the costs and compare.
I have no idea how to do something like that. Any suggestions?

MS SQL Server: Logic to find out groups of all dependent records from table having source and destinations items

I am using MS SQL Server DB. I have a specific need to find out groups of interdependent items. Visualize below a scenario where we have two items in each row, one is source item and other is destination item. Any item can be source of any item, and same for destination as well. We have two column in the table, 'Source' and 'Destination'. Let's consider 10 values in the table as below:
Source | Destination
A | B
B | C
C | D
E | A
D | E
X | Y
Y | Z
Z | X
P | Q
R | S
My requirement is to get distinct groups of items with Source and destination. Meaning, my query should return below result with 4 rows (grouped items in comma separated form):
RowNum| Result
1| A,B,C,D,E
2| X,Y,Z
3| P,Q
4| R,S
Here, the level of hierarchy can be upto n number. In my example, I kept the first group of 5 items (A to B, B to C, C to D, D to E and E to A - Means 5 different items are involved in this group). But the data may have more items as well, in single group. Also, cyclical records are possible (X to , Y to Z and Z to X)
I can achieve this using nested WHILE Loops. But, as we have thousands of records, the nested WHILE Loop scripts takes too much time to execute. I am iterating one loop on each record of the table and then there is an inner loop which will take outer loop's record and will compare it with all other records.
Can anybody suggest a better way or algorithm to achieve this?
Any help on this would be appreciated.

Hive, pass different row into a function

I have two tables. In first table, each row represents a polygon. In second table, each row represents a point. I want to find if each point within one of the polygon. I try to use ST_Contains, ST_MultiPolygon, and ST_Point in hive. I think there is a way to feed all rows into ST_MultiPolygon, but not sure how to do that. The following is my test data.
pid | shape
1 | [2,0,3,0,3,1,2,1]
2 | [0,0,1,0,1,1,0,1]
This is Polygon table.
pid | x | y
1 | 0.5 | 0.5
2 | 2.1 | 0.5
3 | 1.5 | 0.5
This is Point table
I want to get the result like
pid | is_in
1 | true
2 | true
3 | false
Here is the way I think to solve this problem. what I wanna do is to determine if a point is in one of the Polygon that is stored in hive. Suppose I have 2 polygons [2,0,3,0,3,1,2,1] and [0,0,1,0,1,1,0,1], actually they really are [(2,0), (3,0), (3,1), (2,1)] and [(0,0), (1,0), (1,1), (0,1)]. The reason that I stored in this odd way is because ST_MultiPolygon takes this kind of format as a parameter, like ST_MultiPolygon(array(2,0,3,0,3,1,2,1)). Combined ST_MultiPolygon with ST_Contains and st_point, I can get a boolean result that indicates if a point is in a MultiPolygon. ST_MultiPolygon can even takes multiple arrays, such as ST_MultiPolygon(array(2,0,3,0,3,1,2,1), array(0,0,1,0,1,1,0,1)). By this way, if there is a way for me to feed all Polygons into MultiPolygons, then I know if a point in one of the Polygon.
Any comment will be appreciated.

Explain the Differential Evolution method

Can someone please explain the Differential Evolution method? The Wikipedia definition is extremely technical.
A dumbed-down explanation followed by a simple example would be appreciated :)
Here's a simplified description. DE is an optimisation technique which iteratively modifies a population of candidate solutions to make it converge to an optimum of your function.
You first initialise your candidate solutions randomly. Then at each iteration and for each candidate solution x you do the following:
you produce a trial vector: v = a + ( b - c ) / 2, where a, b, c are three distinct candidate solutions picked randomly among your population.
you randomly swap vector components between x and v to produce v'. At least one component from v must be swapped.
you replace x in your population with v' only if it is a better candidate (i.e. it better optimise your function).
(Note that the above algorithm is very simplified; don't code from it, find proper spec. elsewhere instead)
Unfortunately the Wikipedia article lacks illustrations. It is easier to understand with a graphical representation, you'll find some in these slides: http://www-personal.une.edu.au/~jvanderw/DE_1.pdf .
It is similar to genetic algorithm (GA) except that the candidate solutions are not considered as binary strings (chromosome) but (usually) as real vectors. One key aspect of DE is that the mutation step size (see step 1 for the mutation) is dynamic, that is, it adapts to the configuration of your population and will tend to zero when it converges. This makes DE less vulnerable to genetic drift than GA.
Answering my own question...
Overview
The principal difference between Genetic Algorithms and Differential Evolution (DE) is that Genetic Algorithms rely on crossover while evolutionary strategies use mutation as the primary search mechanism.
DE generates new candidates by adding a weighted difference between two population members to a third member (more on this below).
If the resulting candidate is superior to the candidate with which it was compared, it replaces it; otherwise, the original candidate remains unchanged.
Definitions
The population is made up of NP candidates.
Xi = A parent candidate at index i (indexes range from 0 to NP-1) from the current generation. Also known as the target vector.
Each candidate contains D parameters.
Xi(j) = The jth parameter in candidate Xi.
Xa, Xb, Xc = three random parent candidates.
Difference vector = (Xb - Xa)
F = A weight that determines the rate of the population's evolution.
Ideal values: [0.5, 1.0]
CR = The probability of crossover taking place.
Range: [0, 1]
Xc` = A mutant vector obtained through the differential mutation operation. Also known as the donor vector.
Xt = The child of Xi and Xc`. Also known as the trial vector.
Algorithm
For each candidate in the population
for (int i = 0; i<NP; ++i)
Choose three distinct parents at random (they must differ from each other and i)
do
{
a = random.nextInt(NP);
} while (a == i)
do
{
b = random.nextInt(NP);
} while (b == i || b == a);
do
{
c = random.nextInt(NP);
} while (c == i || c == b || c == a);
(Mutation step) Add a weighted difference vector between two population members to a third member
Xc` = Xc + F * (Xb - Xa)
(Crossover step) For every variable in Xi, apply uniform crossover with probability CR to inherit from Xc`; otherwise, inherit from Xi. At least one variable must be inherited from Xc`
int R = random.nextInt(D);
for (int j=0; j < D; ++j)
{
double probability = random.nextDouble();
if (probability < CR || j == R)
Xt[j] = Xc`[j]
else
Xt[j] = Xi[j]
}
(Selection step) If Xt is superior to Xi then Xt replaces Xi in the next generation. Otherwise, Xi is kept unmodified.
Resources
See this for an overview of the terminology
See Optimization Using Differential Evolution by Vasan Arunachalam for an explanation of the Differential Evolution algorithm
See Evolution: A Survey of the State-of-the-Art by Swagatam Das and Ponnuthurai Nagaratnam Suganthan for different variants of the Differential Evolution algorithm
See Differential Evolution Optimization from Scratch with Python for a detailed description of an implementation of a DE algorithm in python.
The working of DE algorithm is very simple.
Consider you need to optimize(minimize,for eg) ∑Xi^2 (sphere model) within a given range, say [-100,100]. We know that the minimum value is 0. Let's see how DE works.
DE is a population-based algorithm. And for each individual in the population, a fixed number of chromosomes will be there (imagine it as a set of human beings and chromosomes or genes in each of them).
Let me explain DE w.r.t above function
We need to fix the population size and the number of chromosomes or genes(named as parameters). For instance, let's consider a population of size 4 and each of the individual has 3 chromosomes(or genes or parameters). Let's call the individuals R1,R2,R3,R4.
Step 1 : Initialize the population
We need to randomly initialise the population within the range [-100,100]
G1 G2 G3 objective fn value
R1 -> |-90 | 2 | 1 | =>8105
R2 -> | 7 | 9 | -50 | =>2630
R3 -> | 4 | 2 | -9.2| =>104.64
R4 -> | 8.5 | 7 | 9 | =>202.25
objective function value is calculated using the given objective function.In this case, it's ∑Xi^2. So for R1, obj fn value will be -90^2+2^2+2^2 = 8105. Similarly it is found for all.
Step 2 : Mutation
Fix a target vector,say for eg R1 and then randomly select three other vectors(individuals)say for eg.R2,R3,R4 and performs mutation. Mutation is done as follows,
MutantVector = R2 + F(R3-R4)
(vectors can be chosen randomly, need not be in any order).F (scaling factor/mutation constant) within range [0,1] is one among the few control parameters DE is having.In simple words , it describes how different the mutated vector becomes. Let's keep F =0.5.
| 7 | 9 | -50 |
+
0.5 *
| 4 | 2 | -9.2|
+
| 8.5 | 7 | 9 |
Now performing Mutation will give the following Mutant Vector
MV = | 13.25 | 13.5 | -50.1 | =>2867.82
Step 3 : Crossover
Now that we have a target vector(R1) and a mutant vector MV formed from R2,R3 & R4 ,we need to do a crossover. Consider R1 and MV as two parents and we need a child from these two parents. Crossover is done to determine how much information is to be taken from both the parents. It is controlled by Crossover rate(CR). Every gene/chromosome of the child is determined as follows,
a random number between 0 & 1 is generated, if it is greater than CR , then inherit a gene from target(R1) else from mutant(MV).
Let's set CR = 0.9. Since we have 3 chromosomes for individuals, we need to generate 3 random numbers between 0 and 1. Say for eg, those numbers are 0.21,0.97,0.8 respectively. First and last are lesser than CR value, so those positions in the child's vector will be filled by values from MV and second position will be filled by gene taken from target(R1).
Target-> |-90 | 2 | 1 | Mutant-> | 13.25 | 13.5 | -50.1 |
random num - 0.21, => `Child -> |13.25| -- | -- |`
random num - 0.97, => `Child -> |13.25| 2 | -- |`
random num - 0.80, => `Child -> |13.25| 2 | -50.1 |`
Trial vector/child vector -> | 13.25 | 2 | -50.1 | =>2689.57
Step 4 : Selection
Now we have child and target. Compare the obj fn of both, see which is smaller(minimization problem). Select that individual out of the two for next generation
R1 -> |-90 | 2 | 1 | =>8105
Trial vector/child vector -> | 13.25 | 2 | -50.1 | =>2689.57
Clearly, the child is better so replace target(R1) with the child. So the new population will become
G1 G2 G3 objective fn value
R1 -> | 13.25 | 2 | -50.1 | =>2689.57
R2 -> | 7 | 9 | -50 | =>2500
R3 -> | 4 | 2 | -9.2 | =>104.64
R4 -> | -8.5 | 7 | 9 | =>202.25
This procedure will be continued either till the number of generations desired has reached or till we get our desired value. Hope this will give you some help.