Metropolis-Hastings Random Walk SQL Implementation

Is it possible and efficient to implement the MHRW algorithm in SQL?
I want to sample a large directed graph with over 1 million nodes, and this seems to be one of the best ways to do it. The algorithm is intended for undirected graphs, but I think it can work for directed ones too.
The algorithm:
v <- initial node
while stop criteria not met do
select node w uniformly at random from neighbors of v;
generate p uniformly at random with 0 <= p <= 1
if p <= (degree of v) / (degree of w)
then v <- w
else
stay at v
end if
end while
I take the initial node from table1, which contains all nodes and their properties. In table2 I have two columns that list all connections between nodes (and a way to get a node's degree). The stop criterion would be the size of the sample, i.e., stop once the sample reaches ~100,000 nodes.
Best regards.
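As a rough illustration of the walk itself (not a tuned solution), the per-step queries can be driven from application code. The Python + SQLite sketch below assumes table2 stores one row per edge in two columns I have called source and target; those names, and the degree/neighbor queries, are placeholders to adapt to the real schema. The acceptance step follows the pseudocode above.

import random
import sqlite3

def degree(cur, node):
    # Count edges touching the node (treating the graph as undirected for degrees).
    cur.execute("SELECT COUNT(*) FROM table2 WHERE source = ? OR target = ?",
                (node, node))
    return cur.fetchone()[0]

def random_neighbor(cur, node):
    # Pick one neighbor of the node uniformly at random.
    cur.execute("""SELECT CASE WHEN source = ? THEN target ELSE source END AS neighbor
                   FROM table2
                   WHERE source = ? OR target = ?
                   ORDER BY RANDOM() LIMIT 1""",
                (node, node, node))
    row = cur.fetchone()
    return row[0] if row else None

def mhrw_sample(db_path, start_node, sample_size=100_000):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    v = start_node
    sample = {v}
    while len(sample) < sample_size:
        w = random_neighbor(cur, v)
        if w is None:          # isolated node: nowhere to walk
            break
        p = random.random()
        # Metropolis-Hastings acceptance: move with probability min(1, deg(v)/deg(w)).
        if p <= degree(cur, v) / degree(cur, w):
            v = w
        sample.add(v)
    conn.close()
    return sample

A pure-SQL version (for example with a recursive CTE) is possible in principle, but each step needs its own random draw and degree lookup, so a loop in application code like this is usually simpler to get right.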

Related

Can Cypher return an incrementing chain of numbers?

For example, can I have a command that generates an incrementing number, like this?
MATCH (n)
RETURN n, number_increment
node A 1
node B 2
node C 3
node D 4
I want to assign an id to a group of nodes (not the id(n) one), and I need a chain of increasing numbers. Is this doable in Cypher, or do I need to use another language?
Looks like you want something like a row number. There isn't a direct way to do it in Cypher, but there are a number of different solutions. One way is to use the apoc.coll.zip function and manipulate the result into collections:
MATCH (n)
WITH collect(n) as nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) as pairs
UNWIND pairs as pair
RETURN pair[0] as n, pair[1] as rowNumber
(Be careful though: the above query selects all nodes in the store, so it may take a while if you have a huge number of nodes.)
This will work.
MATCH(n)
WITH RANGE(1, COUNT(n)) AS indexes, COLLECT(n) AS nodes
FOREACH(i IN indexes | SET (nodes[i-1]).myID = i)
WITH nodes UNWIND nodes AS node
RETURN node

Ranking Big O Functions By Complexity

I am trying to rank these functions — 2^n, n^100, (n + 1)^2, n·lg(n), 100n, n!, lg(n), and n^99 + n^98 — so that each function is big-O of the next function, but I do not know a method for determining whether one function is big-O of another. I'd really appreciate it if someone could explain how I would go about doing this.
Assuming you have some programming background, say you have the code below:
void SomeMethod(int x)
{
for (int i = 0; i < x; i++)
{
// Do Some Work
}
}
Notice that the loop runs for x iterations. Generalizing, we say that you will get the solution after N iterations (where N is the value of x, e.g. the number of items in the array/input).
So this type of implementation/algorithm is said to have a time complexity of order N, written as O(n).
Similarly, a nested for loop (2 loops) is O(n squared), i.e. O(n^2).
If you make binary decisions, reducing the possibilities into halves and picking only one half to continue with, then the complexity is O(log n).
Found this link to be interesting.
For: Himanshu
While the link explains very well how log2(N) complexity comes into the picture, let me put the same in my own words.
Suppose you have a pre-sorted list like:
1,2,3,4,5,6,7,8,9,10
Now, you have been asked to find whether 10 exists in the list. The first solution that comes to mind is to loop through the list and find it, which means O(n). Can it be made better?
Approach 1:
Since we know the list is already sorted in ascending order:
Break the list at the center (say at 5).
Compare the value of Center (5) with the Search Value (10).
If Center Value == Search Value => Item Found
If Center < Search Value => Do above steps for Right Half of the List
If Center > Search Value => Do above steps for Left Half of the List
For this simple example we will find 10 after 3 or 4 breaks (at 5, then 8, then 9), depending on how you implement it.
That means for N = 10 items, the search took 3 (or 4) steps. Putting some mathematics on it:
2^3 + 2 = 10; for simplicity's sake let's say
2^3 = 10 (nearly equal; this is just to keep the base-2 logarithm simple)
This can be re-written as:
log2(10) = 3 (again, nearly)
We know 10 was the number of items and 3 was the number of breaks/lookups we had to do to find the item. It becomes:
log N = K
That is the complexity of the algorithm above: O(log N).
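To make the log2(N) behaviour concrete, here is a small illustrative Python sketch (not part of the original answer) that counts how many comparisons the halving search makes:

def binary_search(sorted_list, target):
    lo, hi = 0, len(sorted_list) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_list[mid] == target:
            return mid, comparisons
        elif sorted_list[mid] < target:
            lo = mid + 1          # keep only the right half
        else:
            hi = mid - 1          # keep only the left half
    return -1, comparisons

print(binary_search(list(range(1, 11)), 10))   # (9, 4): 4 halvings for 10 items, roughly log2(10)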
Generally, when loops are nested, we multiply the bounds: O(outer loop max value * inner loop max value), and so on. E.g. for (i to n) { for (j to k) { } }: for i = 1, j runs from 1 to k; for i = 2, j again runs from 1 to k; so the complexity is O(max(i) * max(j)), which here means O(n*k). Further, if you want to find the order, recall some basic relationships: taking a logarithm shrinks the value, so O(log n) < O(n) < O(n + n) (addition) < O(n * n) (multiplication), and so on. In this way you can rank the other functions as well.
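As a tiny illustrative sketch of that multiplication rule, counting the iterations of a nested loop shows the n*k product directly:

n, k = 5, 3
steps = 0
for i in range(n):        # outer loop runs n times
    for j in range(k):    # inner loop runs k times per outer iteration
        steps += 1
print(steps)              # 15 == n * k, hence O(n*k)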
A better approach is to first generalise the expression for the time complexity before comparing. For example, n! = n*(n-1)*(n-2)*...*(n-(n-1)), and O(n^k) is the generalised form of a polynomial worst-case complexity; written this way you can compare, e.g. if k = 2 then O(n^k) = O(n*n).

BST Time Complexity

Assume that T is a binary search tree with n nodes and height h. Each node x of T stores a
real number x.Key. Give the worst-case time complexity of the following algorithm Func1(T.root). You
need to justify your answer.
Func1(x)
    if (x == NIL) return 0;
    s1 <- Func1(x.left());
    if (s1 < 100) then
        s2 <- Func1(x.right());
    else
        s2 <- 0;
    end if
    s <- s1 + s2 + x.key();
    return (s);
x.left() and x.right() return the left and right child of node x
x.key() returns the key stored at node x
For the worst-case run time, I was thinking that this would be O(height of tree), since this basically acts like the minimum() or maximum() binary search tree algorithms. However, it's recursive, so I'm slightly hesitant to actually write O(h) as the worst-case run time.
When I think about it, the worst case would be if the function executed the if(s1 < 100) statement for every x.left, which would mean that every node is visited, so would that make the run time O(n)?
You're correct that the worst-case runtime of this function is Θ(n), which happens if that if statement always executes. In that case, the recursion visits each node: it recursively visits the full left subtree, then recursively visits the full right subtree. (It also does O(1) work per node, which is why this sums up to O(n).)
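To see the worst case concretely, here is an illustrative Python transcription of the pseudocode (the Node class and the all-zero keys are my own choices, made only to force the s1 < 100 branch; the tree shape is what matters for the count):

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

visited = 0

def func1(x):
    global visited
    if x is None:
        return 0
    visited += 1
    s1 = func1(x.left)
    # Worst case: this test is always true, so the right subtree is explored
    # at every node and all n nodes end up being visited.
    s2 = func1(x.right) if s1 < 100 else 0
    return s1 + s2 + x.key

# A small tree with all keys 0: s1 is always 0 < 100, so all 7 nodes are visited.
root = Node(0, Node(0, Node(0), Node(0)), Node(0, Node(0), Node(0)))
func1(root)
print(visited)   # 7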

Segment tree - query complexity

I am having problems with understanding segment tree complexity. It is clear that if you have an update function which has to change only one node, its complexity will be O(log n).
But I have no idea why the complexity of query(a, b), where (a, b) is the interval that needs to be checked, is O(log n).
Can anyone provide me with intuitive / formal proof to understand this?
There are four cases when querying the interval (x, y):
FIND(R,x,y) //R is the node
% Case 1
if R.first = x and R.last = y
return {R}
% Case 2
if y <= R.middle
return FIND(R.leftChild, x, y)
% Case 3
if x >= R.middle + 1
return FIND(R.rightChild, x, y)
% Case 4
P = FIND(R.leftChild, x, R.middle)
Q = FIND(R.rightChild, R.middle + 1, y)
return P union Q.
Intuitively, each of the first three cases reduces the remaining tree height by 1. Since the tree has height log n, if only the first three cases happen, the running time is O(log n).
For the last case, FIND() divides the problem into two subproblems. However, we assert that this can only happen at most once. After we call FIND(R.leftChild, x, R.middle), we are querying R.leftChild for the interval [x, R.middle]. R.middle is the same as R.leftChild.last. If x > R.leftChild.middle, then it is Case 3; if x <= R.leftChild.middle, then we will call
FIND ( R.leftChild.leftChild, x, R.leftChild.middle );
FIND ( R.leftChild.rightChild, R.leftChild.middle + 1, R.leftChild.last );
However, the second FIND() returns R.leftChild.rightChild.sum and therefore takes constant time, so the problem is not really separated into two subproblems (strictly speaking, the problem is separated, though one subproblem takes O(1) time to solve).
Since the same analysis holds for the rightChild of R, we conclude that after Case 4 happens for the first time, the running time T(h) (where h is the remaining height of the tree) would be
T(h) <= T(h-1) + c (c is a constant)
T(1) = c
which yields:
T(h) <= c * h = O(h) = O(log n) (since h is the height of the tree)
Hence we end the proof.
This is my first time to contribute, hence if there are any problems, please kindly point them out and I would edit my answer.
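For readers who prefer running code, here is an illustrative Python sum segment tree (class and method names are my own, not from the answer) whose query method follows the four cases above:

class SegmentTree:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (4 * self.n)          # safe upper bound on node count
        self._build(1, 0, self.n - 1, values)

    def _build(self, node, first, last, values):
        if first == last:
            self.tree[node] = values[first]
            return
        middle = (first + last) // 2
        self._build(2 * node, first, middle, values)
        self._build(2 * node + 1, middle + 1, last, values)
        self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def query(self, x, y):
        return self._find(1, 0, self.n - 1, x, y)

    def _find(self, node, first, last, x, y):
        if first == x and last == y:            # Case 1: exact match, answered in O(1)
            return self.tree[node]
        middle = (first + last) // 2
        if y <= middle:                         # Case 2: query lies entirely in the left child
            return self._find(2 * node, first, middle, x, y)
        if x >= middle + 1:                     # Case 3: query lies entirely in the right child
            return self._find(2 * node + 1, middle + 1, last, x, y)
        # Case 4: split the query; as argued above, after the first split at least
        # one of the two halves always resolves immediately via Case 1.
        return (self._find(2 * node, first, middle, x, middle)
                + self._find(2 * node + 1, middle + 1, last, middle + 1, y))

st = SegmentTree([1, 2, 3, 4, 5, 6, 7, 8])
print(st.query(2, 6))   # 3 + 4 + 5 + 6 + 7 = 25 (0-based indices 2..6)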
A range query using a segment tree basically involves recursing from the root node. You can think of the entire recursion process as a traversal on the segment tree: any time a recursion is needed on a child node, you are visiting that child node in your traversal. So analyzing the complexity of a range query is equivalent to finding the upper bound for the total number of nodes that are visited.
It turns out that at any arbitrary level, there are at most 4 nodes that can be visited. Since the segment tree has a height of log(n) and at any level at most 4 nodes can be visited, the upper bound is 4*log(n). The time complexity is therefore O(log(n)).
Now we can prove this with induction. The base case is at the first level where the root node lies. Since the root node has at most two child nodes, we can only visit at most those two child nodes, which is at most 4 nodes.
Now suppose it is true that at an arbitrary level (say level i) we visit at most 4 nodes. We want to show that we will visit at most 4 nodes at the next level (level i+1) as well. If we had visited only 1 or 2 nodes at level i, it's trivial to show that at level i+1 we will visit at most 4 nodes because each node can have at most 2 child nodes.
So let's focus on the assumption that 3 or 4 nodes were visited at level i, and try to show that at level i+1 we can also have at most 4 visited nodes. Now since the range query is asking for a contiguous range, we know that the 3 or 4 nodes visited at level i can be categorized into 3 partitions of nodes: a leftmost single node whose segment range is only partially covered by the query range, a rightmost single node whose segment range is only partially covered by the query range, and 1 or 2 middle nodes whose segment range is fully covered by the query range. Since the middle nodes have their segment range(s) fully covered by the query range, there would be no recursion at the next level; we just use their precomputed sums. We are left with possible recursions on the leftmost node and the rightmost node at the next level, which is obviously at most 4.
This completes the proof by induction. We have proven that at any level at most 4 nodes are visited. The time complexity for a range query is therefore O(log(n)).
An interval of length n can be represented by k nodes where k <= log(n)
We can prove it based on how the binary system works.

Number of BST's given a linked list of numbers

Suppose I have a linked list of positive numbers; how many BST's can be generated from them, provided all nodes are required to form the tree?
Conversely, how many BST's can be generated, provided any number of the linked list nodes can exist in these trees?
Bonus: how many balanced BST's can be formed? Any help or guidance is greatly appreciated.
You can use dynamic programming to compute that.
Just note that it doesn't matter what the numbers are, just how many. In other words, for any n distinct integers there is the same number of different BSTs. Let's call this number f(n).
Then if you know f(k) for k < n, you can get f(n):
f(n) = Sum ( f(i) * f(n-1-i), i = 0,1,2,...,n-1 ), with f(0) = 1
Each summand represents the number of trees for which the (1+i)-th smallest number is at the root (so the left subtree contains i numbers and the right subtree contains n-1-i).
So DP solves this.
Now the total number of BSTs (with any nodes from the list) is just a sum:
Sum ( Binomial(n,k) * f(k), k=1,2,3,...,n )
This is because you can pick k of them in Binomial(n,k) ways and then you know that there are f(k) BSTs for them.
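An illustrative Python sketch of both counts (assuming the multiplicative recurrence above and the usual convention f(0) = 1 for the empty tree):

from math import comb

def bst_counts(n):
    # f[k] = number of distinct BSTs on k distinct keys (the Catalan numbers).
    f = [0] * (n + 1)
    f[0] = 1                                   # one empty tree
    for m in range(1, n + 1):
        f[m] = sum(f[i] * f[m - 1 - i] for i in range(m))
    return f

n = 5
f = bst_counts(n)
print(f[n])                                    # 42 BSTs using all 5 numbers
print(sum(comb(n, k) * f[k] for k in range(1, n + 1)))   # 187 BSTs over all non-empty subsets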