Segment tree query complexity

I am having problems understanding segment tree complexity. It is clear that if you have an update function that changes only one node, its complexity will be O(log n).
But I have no idea why the complexity of query(a,b), where (a,b) is the interval that needs to be checked, is O(log n).
Can anyone provide me with an intuitive / formal proof to understand this?

There are four cases when querying the interval (x, y):
FIND(R, x, y)  // R is the current node
    % Case 1: the node's range matches the query exactly
    if R.first = x and R.last = y
        return {R}
    % Case 2: the query lies entirely in the left child
    if y <= R.middle
        return FIND(R.leftChild, x, y)
    % Case 3: the query lies entirely in the right child
    if x >= R.middle + 1
        return FIND(R.rightChild, x, y)
    % Case 4: the query spans both children
    P = FIND(R.leftChild, x, R.middle)
    Q = FIND(R.rightChild, R.middle + 1, y)
    return P union Q
Intuitively, the first three cases reduce the height of the tree by 1; since the tree has height log n, if only the first three cases occur, the running time is O(log n).
For the last case, FIND() divides the problem into two subproblems. However, we assert that a genuine split can happen at most once. After we call FIND(R.leftChild, x, R.middle), we are querying R.leftChild for the interval [x, R.middle]. R.middle is the same as R.leftChild.last, so the right endpoint of the query is now flush with the node's boundary. If x = R.leftChild.first, it is Case 1; if x > R.leftChild.middle, it is Case 3 (and the right endpoint stays flush); if x <= R.leftChild.middle, it is Case 4 and we call
FIND(R.leftChild.leftChild, x, R.leftChild.middle);
FIND(R.leftChild.rightChild, R.leftChild.middle + 1, R.leftChild.last);
However, the second FIND() is exactly Case 1: it returns R.leftChild.rightChild.sum and therefore takes constant time, so the problem is not really separated into two subproblems (strictly speaking it is separated, but one subproblem takes O(1) time to solve).
Since the same analysis holds for the rightChild of R, we conclude that after Case 4 happens for the first time, the running time T(h) (where h is the remaining height of the tree) satisfies
T(h) <= T(h-1) + c   (c is a constant)
T(1) = c
which yields:
T(h) <= c * h = O(h) = O(log n)   (since h is the height of the tree)
This completes the proof.
This is my first time contributing, so if there are any problems, please kindly point them out and I will edit my answer.
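For reference, here is a minimal runnable Python version of the four-case FIND above (a sum segment tree built from an array; the class and helper names are my own, not from the original pseudocode):

class Node:
    def __init__(self, first, last):
        self.first, self.last = first, last
        self.middle = (first + last) // 2
        self.sum = 0
        self.leftChild = self.rightChild = None

def build(a, first, last):
    # Recursively build a sum segment tree over a[first..last].
    node = Node(first, last)
    if first == last:
        node.sum = a[first]
    else:
        node.leftChild = build(a, first, node.middle)
        node.rightChild = build(a, node.middle + 1, last)
        node.sum = node.leftChild.sum + node.rightChild.sum
    return node

def find(R, x, y):
    if R.first == x and R.last == y:   # Case 1: exact match, O(1)
        return R.sum
    if y <= R.middle:                  # Case 2: query entirely in the left child
        return find(R.leftChild, x, y)
    if x >= R.middle + 1:              # Case 3: query entirely in the right child
        return find(R.rightChild, x, y)
    # Case 4: query spans both children
    return find(R.leftChild, x, R.middle) + find(R.rightChild, R.middle + 1, y)

a = [5, 8, 6, 3, 2, 7, 2, 6]
root = build(a, 0, len(a) - 1)
print(find(root, 2, 5))  # 6 + 3 + 2 + 7 = 18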

A range query using a segment tree basically involves recursing from the root node. You can think of the entire recursion process as a traversal on the segment tree: any time a recursion is needed on a child node, you are visiting that child node in your traversal. So analyzing the complexity of a range query is equivalent to finding the upper bound for the total number of nodes that are visited.
It turns out that at any arbitrary level, there are at most 4 nodes that can be visited. Since the segment tree has a height of log(n) and at most 4 nodes can be visited at any level, the upper bound is actually 4*log(n). The time complexity is therefore O(log(n)).
Now we can prove this by induction. The base case is the first level, where the root node lies. Since the root node has at most two child nodes, we can only visit at most those two child nodes, which is at most 4 nodes.
Now suppose it is true that at an arbitrary level (say level i) we visit at most 4 nodes. We want to show that we will visit at most 4 nodes at the next level (level i+1) as well. If we visited only 1 or 2 nodes at level i, it's trivial to show that at level i+1 we will visit at most 4 nodes, because each node has at most 2 child nodes.
So let's focus on the assumption that 3 or 4 nodes were visited at level i, and try to show that at level i+1 we can also have at most 4 visited nodes. Now since the range query is asking for a contiguous range, we know that the 3 or 4 nodes visited at level i can be categorized into 3 partitions of nodes: a leftmost single node whose segment range is only partially covered by the query range, a rightmost single node whose segment range is only partially covered by the query range, and 1 or 2 middle nodes whose segment range is fully covered by the query range. Since the middle nodes have their segment range(s) fully covered by the query range, there would be no recursion at the next level; we just use their precomputed sums. We are left with possible recursions on the leftmost node and the rightmost node at the next level, which is obviously at most 4.
This completes the proof by induction. We have proven that at any level at most 4 nodes are visited. The time complexity for a range query is therefore O(log(n)).
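If you want to check this bound empirically, here is a small Python sketch (my own construction, not from the answer) that replays the query recursion on index ranges alone and records how many nodes are visited at each level; the per-level count never exceeds 4:

def count_visits(lo, hi, x, y, depth, counts):
    # Count this node as visited at its level.
    counts[depth] = counts.get(depth, 0) + 1
    if x <= lo and hi <= y:       # fully covered: no recursion below this node
        return
    mid = (lo + hi) // 2
    if x <= mid:                  # query overlaps the left child
        count_visits(lo, mid, x, y, depth + 1, counts)
    if y > mid:                   # query overlaps the right child
        count_visits(mid + 1, hi, x, y, depth + 1, counts)

n = 1024
worst = 0
for x in range(0, n, 37):
    for y in range(x, n, 53):
        counts = {}
        count_visits(0, n - 1, x, y, 0, counts)
        worst = max(worst, max(counts.values()))
print(worst)  # prints at most 4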

An interval over an array of length n can be represented by O(log n) canonical segment-tree nodes (at most about 2*log2(n) of them).
We can see why from how the binary system works: just as any number below n is a sum of at most log2(n) distinct powers of two, any query interval can be covered by a logarithmic number of power-of-two-sized segments aligned to the tree. For example, with n = 16, the query [3, 12] decomposes into the canonical segments [3,3], [4,7], [8,11], [12,12].

Related

Confused about time complexity of recursive function

For the following algorithm (this algorithm doesn't really do anything useful besides being an exercise in analyzing time complexity):
const dib = (n) => {
  if (n <= 1) return;
  dib(n - 1);
  dib(n - 1);
};
I'm watching a video where they say the time complexity is O(2^n). If I count the nodes I can see they're right (the tree has around 32 nodes); however, in my head I thought it would be O(n*2^n), since n is the height of the tree and each level has 2^n nodes. Can anyone point out the flaw in my thinking?
Each level i has 2^i nodes, not 2^n; only the deepest level comes close to 2^n.
So the levels contribute 1 + 2 + 4 + 8 + ... + 2^(n-1) nodes in total.
The deepest level is the decider in the complexity: the number of nodes in a subtree of height i satisfies f(i) = 1 + 2*f(i-1).
The whole sum is 2^n - 1, which is O(2^n), not O(n*2^n).
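To convince yourself, you can count the calls directly. A small Python sketch (my translation of the snippet, with a call counter added):

calls = 0

def dib(n):
    global calls
    calls += 1        # count every invocation
    if n <= 1:
        return
    dib(n - 1)
    dib(n - 1)

for n in range(1, 11):
    calls = 0
    dib(n)
    print(n, calls, 2**n - 1)  # calls == 2^n - 1 for every n >= 1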
Derek's answer is great and gives the intuition behind the estimate; if you want a formal proof, you can use the Master theorem for decreasing functions.
The master theorem is a formula for solving recurrences of the form
T(n) = aT(n - b) + f(n), where a ≥ 1, b > 0, and f(n) is
asymptotically positive. (Asymptotically positive means that the
function is positive for all sufficiently large n.)
The recurrence for the above algorithm is T(n) = 2*T(n-1) + O(1). Do you see why? You can see the solutions for the various cases (a=1, a>1, a<1) here: http://cs.uok.edu.in/Files/79755f07-9550-4aeb-bd6f-5d802d56b46d/Custom/Ten%20Master%20Method.pdf
In our case a>1, so T(n) = O(a^(n/b) * f(n)), i.e. O(a^(n/b) * n^k), which gives O(2^n).
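Unrolling the recurrence by hand shows the same bound (a worked expansion I added, writing c for the constant hidden in the O(1) term):

T(n) = 2T(n-1) + c
     = 4T(n-2) + 2c + c
     = 8T(n-3) + 4c + 2c + c
     ...
     = 2^(n-1) * T(1) + (2^(n-1) - 1) * c
     = O(2^n)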

Computational complexity of DFS over the worst case DAG

I got confused calculating the worst-case complexity of a naive DFS over a DAG, going from one node to another node.
For example, in the following DAG,
node A acts as the start node, and if we always pick the last node (here, node D) as the end node, we use DFS to enumerate all paths from A to D.
In this case, we have DFS going through Paths:
path 1st: A -> B -> C -> D
path 2nd: A -> B -> D
path 3rd: A -> C -> D
path 4th: A -> D
The computational cost was 8, because the first path takes 3 steps, the second 2, the third 2, and the last 1.
Now suppose we expand this graph by adding more nodes after it.
Assuming we have N nodes, what is the complexity as a function of N?
The way I calculate the total number of paths: every time we add a new node to an N-node DAG to replace node A as the new start node, we add N new edges, because we need edges from the new start node to every node that already existed.
If we write P(N) for the total number of paths, we have
P(N) = P(N-1) + P(N-2) + P(N-3) + ... + P(1)
and subtracting the same expression for P(N-1) gives
P(N) = 2 * P(N-1)
so
P(N) = O(2^N)
However, if we consider that not every path costs O(1) - for example, the longest path goes through all the nodes and costs O(N) by itself - the actual cost seems to be higher than O(2^N).
So what could it be then?
Current Algorithm and Time Complexity
As far as I understand, your current algorithm follows the steps below:
1) Start from a start node specified by us.
2) At each node, push the adjacent vertices onto a stack, then pop the top element from the stack and repeat step 2.
3) Terminate execution when the stack gets empty.
Our graph is a DAG, therefore there won't be any cycles, and the algorithm is guaranteed to terminate eventually.
For further examination, you mentioned expanding the graph in the same manner. This means that whenever we add the i-th node to the graph, we have to create an edge from it to each existing node, which means we insert i edges into the graph.
Let's say you start from the first node. In this case, the time complexity is:
T(1) = [1+T(2)] + [1+T(3)] + ... + [1+T(n-2)] + [1+T(n-1)] + [1+T(n)]
Expanding the last few terms recursively and taking T(n) = 0:
T(1) = [1+T(2)] + [1+T(3)] + ... + [1+(2+1)] + [1+1] + [1+0]
T(1) = [1+T(2)] + [1+T(3)] + ... + 4 + 2 + 1
Observe the pattern: the [1+T(n)] term contributes 2^0, the [1+T(n-1)] term contributes 2^1, and in general the [1+T(n-k)] term contributes 2^k, so the [1+T(2)] term contributes 2^(n-2). Hence
T(1) = 2^(n-2) + 2^(n-3) + ... + 2^0
T(1) = 2^(n-1) - 1
So the time complexity of this algorithm turns out to be O(2^N), which is pretty bad; that's because it is extraordinarily brute force. Let's come up with an optimized algorithm that stores information about visited vertices.
Note:
I spotted that there are two edges from A to B and from B to C, but not from C to D. I couldn't understand the pattern here, but if that's the case, it requires even more operations than 2^N. Naive DFS is a bad algorithm anyway; I strongly recommend implementing the optimized one below.
Optimized Algorithm and Time Complexity
To make your algorithm optimized, you can follow the steps below:
0) Create a boolean array to mark each node as visited. Initially assign every index to false.
1) Start from a start node specified by us.
2) At each node, push the adjacent vertices that are not marked as visited onto a stack and mark them as visited, then pop the top element from the stack and repeat step 2.
3) Terminate execution when the stack gets empty.
This time, by keeping an array that marks visited nodes, we avoid visiting the same node more than once. This algorithm, traditionally, has the time complexity O(N + E), N being the number of nodes and E being the number of edges. Your graph is essentially a complete graph (if it were undirected), therefore E ~ N^2, meaning that your time complexity will be O(N^2).
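For concreteness, a minimal iterative Python sketch of that optimized DFS (adjacency lists as a dict; the names and the 4-node example are my own):

def dfs(adj, start):
    visited = {start}
    stack = [start]
    order = []
    while stack:
        v = stack.pop()              # pop the top element
        order.append(v)
        for w in adj.get(v, []):
            if w not in visited:     # skip already-visited nodes
                visited.add(w)
                stack.append(w)
    return order

# The 4-node DAG from the question: A->B, A->C, A->D, B->C, B->D, C->D
adj = {'A': ['B', 'C', 'D'], 'B': ['C', 'D'], 'C': ['D'], 'D': []}
print(dfs(adj, 'A'))  # each node visited once: O(N + E)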
I think I found it.
So if we consider that each new node adds a new edge from itself to every node already in the DAG, and traversing each of those new edges costs 1,
then we have (writing C(N) for the computational cost on N nodes):
C(N) = 1 + C(N-1) + 1 + C(N-2) + 1 + C(N-3) + ... + 1 + C(1)
Subtracting the same expression for C(N-1), you get
C(N) = 2 * C(N-1) + 1
     = 2^2 * C(N-2) + 2 + 1
     = 2^3 * C(N-3) + 2^2 + 2 + 1
     = 2^(N-1) * C(1) + 2^(N-2) + ... + 2 + 1
At this point this becomes a geometric progression with ratio 2, so the highest-order term is 2^(N-1); with C(1) = 0 the sum is exactly 2^(N-1) - 1.
Thus O(2^N) is the answer.
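To sanity-check the count, here is a small Python sketch (my own) that runs the naive all-paths DFS on the complete N-node DAG and counts edge traversals; the count matches 2^(N-1) - 1. (It counts 7 for the four-node example rather than the 8 in the question, which summed full path lengths instead of edge traversals.)

def traversals(n):
    # Node i has edges to i+1 .. n-1 (node 0 is the start, n-1 the end).
    count = 0
    def dfs(v):
        nonlocal count
        for w in range(v + 1, n):
            count += 1      # traverse edge (v, w)
            dfs(w)
    dfs(0)
    return count

for n in range(2, 10):
    print(n, traversals(n), 2**(n - 1) - 1)  # the two counts agree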

If get inorder successor in BST takes O(h), why does iterative inorder traversal take O(n) when calling an O(h) function n times?

In a BST, it takes O(h) time to get the inorder successor of a given node, so given getNext(), which returns the inorder successor of the current node, you would need n calls to getNext() to traverse the tree, giving O(nh) time complexity.
However, iterative inorder traversal of BSTs is given in books as taking O(n) time. I'm confused about why there's a difference.
(Nodes have parent pointers.)
Getting the inorder successor is O(h), but not Θ(h). Half of the nodes have a successor just one link away. Let's think about the number of pointer dereferences it takes to traverse a given node:
the number of pointer dereferences to traverse the left subtree, plus
the number to ascend back from the left subtree’s rightmost node = the height of the left subtree (at most), plus
one for the node itself, plus
the number to descend to the right subtree's leftmost node = the height of the right subtree (at most), plus
the number for the right subtree.
So an upper bound is:
f(0) = 1
f(h) = f(h − 1) + (h − 1) + h + (h − 1) + f(h − 1)
     = 2f(h − 1) + 3h − 2
which solves to
f(h) = 5·2^h − 3h − 4
and h being ⌈log₂ n⌉, f is O(n).
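Here is a runnable Python sketch of this successor-based traversal with parent pointers (my own construction): repeated get_next() visits all n nodes, and each tree edge is walked at most twice overall, which is why the total is O(n):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def insert(root, key):
    # Plain BST insert that maintains parent pointers.
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key); cur.left.parent = cur; break
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key); cur.right.parent = cur; break
            cur = cur.right
    return root

def leftmost(x):
    while x.left:
        x = x.left
    return x

def get_next(x):
    # Inorder successor: O(h) worst case, O(1) amortized over a full scan.
    if x.right:
        return leftmost(x.right)
    while x.parent and x is x.parent.right:
        x = x.parent
    return x.parent

root = None
for k in [5, 2, 8, 1, 3, 7, 9]:
    root = insert(root, k)
x = leftmost(root)
while x:
    print(x.key, end=' ')   # prints 1 2 3 5 7 8 9
    x = get_next(x)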

BST Time Complexity

Assume that T is a binary search tree with n nodes and height h. Each node x of T stores a real number x.Key. Give the worst-case time complexity of the algorithm Func1(T.root) below. You need to justify your answer.
Func1(x)
    if (x == NIL) return 0;
    s1 <- Func1(x.left());
    if (s1 < 100) then
        s2 <- Func1(x.right());
    else
        s2 <- 0;
    end
    s <- s1 + s2 + x.Key();
    return (s);
x.left() & x.right() return the left and right child of node x
x.Key() returns the key stored at node x
For the worst-case running time, I was thinking this would be O(h), since it basically acts like the minimum() or maximum() BST algorithms. However, it's recursive, so I'm slightly hesitant to actually write O(h) as the worst case.
When I think about it, the worst case would be if the if (s1 < 100) test succeeded at every node, which would mean every node is visited, so would that make the running time O(n)?
You're correct that the worst-case runtime of this function is Θ(n), which happens if that if statement always executes. In that case, the recursion visits each node: at every node it recursively visits the full left subtree, then the full right subtree, doing O(1) work per node, which is why this sums to O(n).
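A quick Python transcription (mine, not from the exercise) makes the worst case easy to see: if every key is, say, 0, the test s1 < 100 succeeds at every node, so all n nodes are visited:

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def func1(x):
    if x is None:
        return 0
    s1 = func1(x.left)                       # always recurse into the left subtree
    s2 = func1(x.right) if s1 < 100 else 0   # prune the right subtree once s1 >= 100
    return s1 + s2 + x.key

# All keys zero: s1 < 100 at every node, so every node is visited -> Θ(n).
root = Node(0, Node(0, Node(0), Node(0)), Node(0, Node(0), Node(0)))
print(func1(root))  # 0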

Metropolis Hastings Random Walk SQL Implementation

Is it possible and efficient to implement the MHRW algorithm in SQL?
I want to sample a large directed graph with over 1 million nodes, and this seems to be one of the best ways to do it. The algorithm is designed for undirected graphs, but I think it can work for directed ones too.
The algorithm:
v <- initial node
while stop criteria not met do
    select node w uniformly at random from neighbors of v;
    generate uniformly at random 0 <= p <= 1;
    if p <= (degree of v) / (degree of w) then
        v <- w
    else
        stay at v
    end if
end while
I take the initial node from table1, which contains all nodes and their properties. In table2 I have two columns that list all connections between nodes (and a way to get a node's degree). The stop criterion would be the size of the sample, i.e., while the sample has fewer than ~100,000 nodes.
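For what it's worth, here is a minimal Python sketch of the sampling loop over an in-memory adjacency dict (my own illustration of the pseudocode above; an SQL version would replace the neighbor and degree lookups with queries against table2):

import random

def mhrw_sample(adj, start, target_size):
    # adj: dict mapping node -> list of neighbors (degree = length of that list).
    # Assumes every reachable node has at least one neighbor; in a directed
    # graph, sink nodes would need special handling.
    v = start
    sample = {v}
    while len(sample) < target_size:
        w = random.choice(adj[v])               # uniform random neighbor of v
        p = random.random()                     # uniform in [0, 1]
        if p <= len(adj[v]) / len(adj[w]):      # accept with min(1, deg(v)/deg(w))
            v = w                               # move to w
        sample.add(v)                           # otherwise stay at v
    return sample

adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
print(mhrw_sample(adj, 1, 3))

Note this collects the set of visited nodes; real MHRW analyses often keep the full walk, repeats included.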