Can Cypher return an incrementing chain of numbers? - cypher

For example, is there a command that generates an incrementing number for each returned row?
MATCH (n)
RETURN n, number_increment
node A 1
node B 2
node C 3
node D 4
I want to assign an id to a group of nodes (not the id(n) one), so I need a chain of increasing numbers. Is this doable in Cypher, or do I need to use another language?

Looks like you want something like a row number. There isn't a direct way to do it in Cypher, but there are a number of different solutions. One way is to use the apoc.coll.zip function to pair each node with an index and then unwind the pairs:
MATCH (n)
WITH collect(n) as nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes) - 1)) as pairs
UNWIND pairs as pair
RETURN pair[0] as n, pair[1] as rowNumber
(Be careful though: the above query selects all nodes in the store, so it may take a while if you have a huge number of nodes.)

This will work.
MATCH(n)
WITH RANGE(1, COUNT(n)) AS indexes, COLLECT(n) AS nodes
FOREACH(i IN indexes | SET (nodes[i-1]).myID = i)
WITH nodes UNWIND nodes AS node
RETURN node

Related

search any word inside a string in million rows

I have a set of 50k values, say X. I want to compare each value with a set of 10k values, say Y. X matches if it is present anywhere in the string Y.
So for each value in X, I want to check it against each value in Y and assign X if it matches.
What would be the best method to complete this task? It is required for a data mining project.
I loaded the data into an MS Access database.
Then, using a VBA program,
I take each X and update Y if it matches (Like '%X%'), but it is a never-ending process. The columns are indexed, but that has no effect.
Is there any algorithm or set of steps to break this into a step-by-step process and complete the mapping faster?
Please let me know if there are any other options available other than the answers given below. I'll explain the scenario a bit more.
Table1.Data
Sentence1
Sentence2
Sentence3
Sentence4
Sentence5
Sentence6
...
Sentence100k
Table2.Phrase (a phrase means multiple words)
Phrase1
Phrase2
Phrase3
Phrase4
Phrase5
...
Phrase100k
I want to check whether Phrase1 has any match in Sentence1 to Sentence100k (an exact match of the phrase, a match of the phrase anywhere in the sentence, the maximum number of words of Phrase1 matching in the sentence, etc.) and create a map based on the best match (ideally the exact phrase appearing anywhere in the sentence).
Table3 (Output)
Data         Best Possible Phrase    Second Best Phrase (Optional)
Sentence1    Phrase1000              Phrase50k
Sentence2    Phrase10                Phrase70k
Please let me know of any tool or logic to perform this. The logic I tried in SQL:
1.
Select A.Data, B.Phrase from Table1 A left join Table2 B on A.Data Like '%' + B.Phrase + '%'
2.
Check whether the words of the phrase appear in the sentence. I replaced all spaces with %, giving word1%word2%word3, and then ran the query as
A.Data Like '%' + B.Phrase + '%' which is
A.Data Like '%word1%word2%word3%'
But it takes days to complete the task for this much data.
Any readily usable tools, indexing methods, or queries would really help. The answers given below seem too technical for me to adapt. Please guide.
You can build a suffix tree in linear time (you can look up suffix trees online), out of the concatenation of all strings in X and Y, with special unique symbols that end each string.
Then for each string Xi in X, you look it up in the suffix tree (linear time in length of Xi) and assign Xi to each string in Y that is somewhere in the subtree rooted at the end of Xi.
This is linear time in the number of strings in Y that Xi is assigned to.
Thus you get an optimal O(N + k) time algorithm, where:
N is the total length of all the strings in X and Y,
and k is the total number of matches between query strings in X and target strings in Y.
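For readers who find a suffix tree hard to adapt, here is a minimal Python sketch of a different technique with the same goal: an Aho-Corasick automaton built over the phrases in X and then run once over each string in Y. This is a swapped-in method, not the suffix tree described above, and the function names `build_automaton` and `find_matches` are made up for illustration. It assumes non-empty phrases and case-sensitive matching.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # ids of patterns that end at each state
    for pid, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pid)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def find_matches(patterns, texts):
    """Return {text index: set of patterns occurring anywhere in that text}."""
    goto, fail, out = build_automaton(patterns)
    matches = {tid: set() for tid in range(len(texts))}
    for tid, text in enumerate(texts):
        state = 0
        for ch in text:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for pid in out[state]:
                matches[tid].add(patterns[pid])
    return matches

if __name__ == "__main__":
    phrases = ["red car", "car", "blue"]
    sentences = ["a red car parked", "blue sky", "nothing"]
    print(find_matches(phrases, sentences))
```

Each sentence is scanned once regardless of how many phrases there are, which is what avoids the phrase-by-sentence LIKE comparison.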

Segment tree - query complexity

I am having trouble understanding segment tree complexity. It is clear that if you have an update function which has to change only one leaf, its complexity will be log(n).
But I have no idea why the complexity of query(a,b), where (a,b) is the interval that needs to be checked, is log(n).
Can anyone provide me with intuitive / formal proof to understand this?
There are four cases when querying the interval (x,y):
FIND(R, x, y) // R is the node
    % Case 1
    if R.first = x and R.last = y
        return {R}
    % Case 2
    if y <= R.middle
        return FIND(R.leftChild, x, y)
    % Case 3
    if x >= R.middle + 1
        return FIND(R.rightChild, x, y)
    % Case 4
    P = FIND(R.leftChild, x, R.middle)
    Q = FIND(R.rightChild, R.middle + 1, y)
    return P union Q.
Intuitively, each of the first three cases either returns immediately or moves one level down the tree. Since the tree has height log n, if only the first three cases happen, the running time is O(log n).
For the last case, FIND() divides the problem into two subproblems. However, we assert that a genuine split can happen at most once. After we call FIND(R.leftChild, x, R.middle), we are querying R.leftChild for the interval [x, R.middle], and R.middle is the same as R.leftChild.last. If x = R.leftChild.first, it is Case 1; if x > R.leftChild.middle, it is Case 3 and only one recursive call is made; otherwise x <= R.leftChild.middle and we will call
FIND ( R.leftChild.leftChild, x, R.leftChild.middle );
FIND ( R.leftChild.rightChild, R.leftChild.middle + 1, R.leftChild.last );
However, the second FIND() hits Case 1 immediately: it returns R.leftChild.rightChild.sum and therefore takes constant time, so the problem is not really separated into two subproblems (strictly speaking, it is separated, but one subproblem takes O(1) time to solve).
Since the same analysis holds for the rightChild of R, we conclude that after Case 4 happens the first time, the running time T(h) (where h is the remaining height of the tree) satisfies
T(h) <= T(h-1) + c (c is a constant)
T(1) = c
which yields:
T(h) <= c * h = O(h) = O(log n) (since h is the height of the tree)
Hence we end the proof.
This is my first time to contribute, hence if there are any problems, please kindly point them out and I would edit my answer.
A range query using a segment tree basically involves recursing from the root node. You can think of the entire recursion process as a traversal on the segment tree: any time a recursion is needed on a child node, you are visiting that child node in your traversal. So analyzing the complexity of a range query is equivalent to finding the upper bound for the total number of nodes that are visited.
It turns out that at any level, at most 4 nodes can be visited. Since the segment tree has a height of log(n) and at most 4 nodes are visited at each level, the upper bound is 4*log(n). The time complexity is therefore O(log(n)).
Now we can prove this by induction. The base case is the first level, where only the root node lies, so exactly 1 node is visited there. Since the root node has at most two child nodes, at most 2 nodes are visited at the second level. Both counts are at most 4.
Now suppose it is true that at an arbitrary level (say level i) we visit at most 4 nodes. We want to show that we will visit at most 4 nodes at the next level (level i+1) as well. If we had visited only 1 or 2 nodes at level i, it's trivial to show that at level i+1 we will visit at most 4 nodes because each node can have at most 2 child nodes.
So let's focus on the assumption that 3 or 4 nodes were visited at level i, and try to show that at level i+1 we can also have at most 4 visited nodes. Now since the range query is asking for a contiguous range, we know that the 3 or 4 nodes visited at level i can be categorized into 3 partitions of nodes: a leftmost single node whose segment range is only partially covered by the query range, a rightmost single node whose segment range is only partially covered by the query range, and 1 or 2 middle nodes whose segment range is fully covered by the query range. Since the middle nodes have their segment range(s) fully covered by the query range, there would be no recursion at the next level; we just use their precomputed sums. We are left with possible recursions on the leftmost node and the rightmost node at the next level, which is obviously at most 4.
This completes the proof by induction. We have proven that at any level at most 4 nodes are visited. The time complexity for a range query is therefore O(log(n)).
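To make the bound concrete, here is a small Python sketch of a sum segment tree whose query follows the four FIND cases listed earlier and records how many nodes it visits at each level. The class name `SegmentTree` and the `visited` dictionary are assumptions made for this illustration, not part of either answer.

```python
class SegmentTree:
    """Sum segment tree over a fixed, non-empty list (illustrative sketch)."""

    def __init__(self, values):
        self.n = len(values)
        self.sums = [0] * (4 * self.n)
        self._build(1, 0, self.n - 1, values)

    def _build(self, node, first, last, values):
        if first == last:
            self.sums[node] = values[first]
            return
        middle = (first + last) // 2
        self._build(2 * node, first, middle, values)
        self._build(2 * node + 1, middle + 1, last, values)
        self.sums[node] = self.sums[2 * node] + self.sums[2 * node + 1]

    def query(self, x, y, visited=None):
        """Sum of values[x..y] inclusive; visited maps level -> nodes touched."""
        return self._find(1, 0, self.n - 1, x, y, 0, visited)

    def _find(self, node, first, last, x, y, level, visited):
        if visited is not None:
            visited[level] = visited.get(level, 0) + 1
        # Case 1: the node's segment is exactly the query interval.
        if first == x and last == y:
            return self.sums[node]
        middle = (first + last) // 2
        # Case 2: the query lies entirely in the left child.
        if y <= middle:
            return self._find(2 * node, first, middle, x, y, level + 1, visited)
        # Case 3: the query lies entirely in the right child.
        if x >= middle + 1:
            return self._find(2 * node + 1, middle + 1, last, x, y, level + 1, visited)
        # Case 4: the query straddles the middle, so split it.
        left = self._find(2 * node, first, middle, x, middle, level + 1, visited)
        right = self._find(2 * node + 1, middle + 1, last, middle + 1, y, level + 1, visited)
        return left + right


if __name__ == "__main__":
    values = list(range(1, 17))
    tree = SegmentTree(values)
    visited = {}
    print(tree.query(1, 14, visited), visited)
```

Running queries for every interval and checking the largest per-level count is a quick way to convince yourself that the number of visited nodes per level stays bounded by a constant.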
Any query interval over an array of n elements can be represented by k nodes of the tree, where k = O(log n) (at most two nodes per level).
We can prove it based on how the binary system works.

Metropolis Hastings Random Walk SQL Implementation

Is it possible and efficient to implement the MHRW algorithm in SQL?
I want to sample a large directed graph with more than 1 million nodes, and this seems to be one of the best ways to do it. The algorithm is meant for undirected graphs, but I think it can work for directed ones too.
The algorithm:
v <- initial node
while stop criteria not met do
    select node w uniformly at random from neighbors of v;
    generate p uniformly at random, 0 <= p <= 1
    if p <= (degree of v) / (degree of w)
        then v <- w
    else
        stay at v
    end if
end while
I take the initial node from table1, which contains all nodes and their properties. In table2 I have two columns that list all connections between nodes (and give a way to get a node's degree). The stop criterion would be the size of the sample, i.e., while sample <= ~100,000 nodes.
Best regards.
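As a reference point before attempting a SQL version, here is a minimal in-memory Python sketch of the loop above. It is not SQL and makes several assumptions: the graph is undirected, every node has at least one neighbor, and `neighbors` is a dict mapping each node to a list of adjacent nodes (which in practice would be loaded from table2). The function name `mhrw_sample` and the `max_steps` safety cap are hypothetical additions.

```python
import random

def mhrw_sample(neighbors, start, sample_size, rng=None, max_steps=1_000_000):
    """Collect distinct nodes visited by a Metropolis-Hastings random walk."""
    rng = rng or random.Random()
    v = start
    sample = {v}
    steps = 0
    # Stop criterion: enough distinct nodes, or the (assumed) step cap is hit.
    while len(sample) < sample_size and steps < max_steps:
        steps += 1
        w = rng.choice(neighbors[v])   # neighbor chosen uniformly at random
        p = rng.random()               # uniform in [0, 1)
        if p <= len(neighbors[v]) / len(neighbors[w]):
            v = w                      # accept the move
            sample.add(v)
        # otherwise stay at v
    return sample

if __name__ == "__main__":
    # Small ring of 10 nodes with two extra chords, purely for demonstration.
    graph = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
    graph[0].append(5)
    graph[5].append(0)
    graph[2].append(7)
    graph[7].append(2)
    print(mhrw_sample(graph, start=0, sample_size=5, rng=random.Random(42)))
```

Each step needs one random neighbor and two degree lookups, so a SQL implementation would issue a few indexed queries per step; whether that is efficient enough for a sample of ~100,000 nodes depends on the database and is not something this sketch answers.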

Number of BSTs given a linked list of numbers

Suppose I have a linked list of positive numbers. How many BSTs can be generated from them, provided all nodes are required to form the tree?
Conversely, how many BSTs can be generated, provided any number of the linked list nodes can exist in these trees?
Bonus: how many balanced BSTs can be formed? Any help or guidance is greatly appreciated.
You can use dynamic programming to compute that.
Just note that it doesn't matter what the numbers are, just how many there are. In other words, for any n distinct integers there is the same number of different BSTs. Let's call this number f(n), with f(0) = 1 for the empty tree.
Then if you know f(k) for all k < n, you can get f(n):
f(n) = Sum ( f(i) * f(n-1-i), i = 0,1,2,...,n-1 )
Each summand represents the number of trees for which the (1+i)-th smallest number is at the root (thus there are i numbers in the left subtree and n-1-i in the right subtree). The two counts are multiplied because the left and right subtrees can be chosen independently.
So DP solves this.
Now the total number of BSTs (with any nodes from the list) is just a sum:
Sum ( Binomial(n,k) * f(k), k=1,2,3,...,n )
This is because you can pick k of them in Binomial(n,k) ways and then you know that there are f(k) BSTs for them.
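A short Python sketch of this dynamic program, assuming the numbers are distinct. The recurrence multiplies the counts for the left and right subtrees, since the two sides are chosen independently, and uses f(0) = 1 for the empty subtree. The function names are made up for illustration.

```python
from math import comb

def count_bsts(n):
    """Return the list [f(0), f(1), ..., f(n)] of BST counts on k distinct keys."""
    f = [0] * (n + 1)
    f[0] = 1  # the empty tree
    for m in range(1, n + 1):
        # i keys go to the left subtree, m - 1 - i to the right one.
        f[m] = sum(f[i] * f[m - 1 - i] for i in range(m))
    return f

def count_bsts_any_subset(n):
    """Count BSTs built from any non-empty subset of n distinct keys."""
    f = count_bsts(n)
    return sum(comb(n, k) * f[k] for k in range(1, n + 1))

if __name__ == "__main__":
    print(count_bsts(5))             # [1, 1, 2, 5, 14, 42]
    print(count_bsts_any_subset(3))  # 3*1 + 3*2 + 1*5 = 14
```

The values f(n) are the Catalan numbers.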

VB.NET Array Intersection

This could be terribly trivial, but I'm having trouble finding an answer that executes in less than n^2 time. Let's say I have two string arrays and I want to know which strings exist in both arrays. How would I do that, efficiently, in VB.NET or is there a way to do this other than a double loop?
The simple way (assuming no .NET 3.5) is to dump the strings from one array in a hashtable, and then loop through the other array checking against the hashtable. That should be much faster than an n^2 search.
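The hashtable idea can be sketched as follows. The sketch is in Python rather than VB.NET, purely to show the shape of the approach, and the function name `intersect` is hypothetical.

```python
def intersect(first, second):
    """Return the strings present in both lists, in order of appearance in second."""
    lookup = set(first)          # hash-based container: average O(1) membership test
    already_reported = set()
    common = []
    for s in second:
        if s in lookup and s not in already_reported:
            already_reported.add(s)
            common.append(s)
    return common

if __name__ == "__main__":
    print(intersect(["D", "B", "M", "A", "I"], ["I", "A", "P", "N", "D", "G"]))
    # ['I', 'A', 'D']
```

Building the lookup costs one pass over the first array and checking costs one pass over the second, so the expected total is linear in the combined length.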
If you sort both arrays, you can then walk through them each once to find all the matching strings.
Pseudo-code:
while(index1 < list1.Length && index2 < list2.Length)
{
    if(list1[index1] == list2[index2])
    {
        // You've found a match
        index1++;
        index2++;
    } else if(list1[index1] < list2[index2]) {
        index1++;
    } else {
        index2++;
    }
}
Then you've reduced it to the time it takes to do the sorting.
If one of the arrays is sorted, you can do a binary search on it in the inner loop; this will decrease the time to O(n log n).
Sort both lists. Then you can know with certainty that if the next entry in list A is 'cobble' and the next entry in list B is 'definite', then 'cobble' is not in list B. Simply advance the pointer/counter on whichever list has the lower ranked result and ascend the rankings.
For example:
List 1: D,B,M,A,I
List 2: I,A,P,N,D,G
sorted:
List 1: A,B,D,I,M
List 2: A,D,G,I,N,P
A vs A --> match, store A, advance both
B vs D --> B<D, advance 1
D vs D --> match, store D, advance both
I vs G --> I>G, advance 2
I vs I --> match, store I, advance both
M vs N --> M<N, advance 1
List 1 has no more items, quit.
List of matches is A,D,I
Two list sorts at O(n log(n)) each, plus O(n) comparisons, make this O(n(log(n) + 1)), which is O(n log(n)).
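The walk-through above can be reproduced with a small Python sketch of the sort-then-merge approach (Python is used only for illustration, and the function name `sorted_intersection` is made up):

```python
def sorted_intersection(list1, list2):
    """Sort both lists, then walk them in step and collect equal entries."""
    a, b = sorted(list1), sorted(list2)
    i = j = 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches.append(a[i])   # match: store it and advance both
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                 # entry from list 1 is lower, advance list 1
        else:
            j += 1                 # entry from list 2 is lower, advance list 2
    return matches

if __name__ == "__main__":
    print(sorted_intersection(["D", "B", "M", "A", "I"],
                              ["I", "A", "P", "N", "D", "G"]))
    # ['A', 'D', 'I']
```

With the two example lists this returns A, D, I, the same matches found in the manual trace.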