Neo4j: efficient neighborhood query - optimization

What's an efficient way to find all nodes within N hops of a given node? My particular graph isn't highly connected, i.e. most nodes have only degree 2, so for example the following query returns only 27 nodes (as expected), but it takes about a minute of runtime and the CPU is pegged:
MATCH (a {id:"36380_A"})-[*1..20]-(b) RETURN a,b;
All the engine's time is spent in the traversal: if I just match that starting node by itself, the result returns instantly.
I really only want the set of unique nodes and relationships (for visualization), so I also tried adding DISTINCT to try to stop it from re-visiting nodes it's seen before, but I see no change in run time.

As you said, matching the start node alone is really fast, and even faster if the property is indexed.
What you are doing now, however, is matching the whole pattern against the graph.
Keep your fast starting point:
MATCH (a:Label {id:"1234-a"})
Once you have it, pass it on to the rest of the query with WITH:
WITH a
Then match the relationships from that fast starting point:
MATCH (a)-[:Rel*1..20]->(b)
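For completeness, here is a minimal sketch of how the combined query could be issued from application code using the official Neo4j Java driver. This is only an illustration, not part of the original answer: the label :Label, the relationship type :Rel, the node id, and the connection URI and credentials are placeholders you would replace with your own.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class NeighborhoodQuery {
    public static void main(String[] args) {
        // Placeholder connection details and credentials.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                 AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            Result result = session.run(
                // Anchor on the indexed start node first, then expand from it,
                // and use DISTINCT so each reachable node is returned only once.
                "MATCH (a:Label {id: $id}) "
                    + "WITH a "
                    + "MATCH (a)-[:Rel*1..20]-(b) "
                    + "RETURN DISTINCT b",
                Values.parameters("id", "36380_A"));
            while (result.hasNext()) {
                System.out.println(result.next().get("b"));
            }
        }
    }
}

The important part is the query string itself: an indexed, labelled anchor, a WITH to carry it forward, and only then the variable-length expansion.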

Cypher BFS with multiple Relations in Path

I'd like to model autonomous systems and their relationships in a graph database (memgraph-db).
There are two different kinds of relationships that can exist between nodes:
undirected peer2peer relationships (edges without arrows in the image)
directed provider2customer relationships (arrows pointing to the provider in the image)
The following image shows valid paths that I want to find with some query.
They can be described as
(s)-[:provider*0..n]->()-[:peer*0..n]-()<-[:provider*0..n]-(d)
or in other words
0-n c2p edges followed by 0-n p2p edges followed by 0-n p2c edges
I can fix the first and last node and would like to find a (shortest/cheapest) path. As I understand it, I can do BFS if there is only ONE relationship on the path.
Is there a way to query for paths of such form in Cypher?
As an alternative, I could do individual queries where I specify the length of each of the segments, and then run a query for every path length until a path is found.
i.e.
MATCH (s)-[]-(d) // all one-hop paths
MATCH (s)-[:provider]->()-[:peer]-(d)
MATCH (s)-[:provider]->()<-[:provider]-(d)
...
Since it's possible to have 7 different path sections, I don't see how 3 BFS patterns (... BFS*0..n) would yield a valid solution. It's impossible to have an empty path because the pattern contains some nodes between the segments (I have to double-check that).
Writing individual patterns is not great.
Some options are:
MATCH path=(s)-[:BFS*0..n]-(d) WHERE {{filter_expression}} -> The expression has to be quite complex in order to yield valid paths.
MATCH path=(s)-[:BFS*0..n]-(d) CALL module.filter_procedure(path) -> The module.filter_procedure(path) could be implemented in Python or C/C++ (please take a look here). I would recommend starting with Python since it's much easier, and it should be fine for a PoC. I would also recommend this option because I'm pretty confident the solution will work, and it's modular: the filter_procedure can easily be extended later, while the query stays the same.
Could you please provide a sample dataset in the form of a Cypher query (a couple of nodes and edges / a small graph)? I'd be glad to come up with a solution.

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2), and vice versa ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
I could work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there were, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
We could try "collecting" the stream first, but then we wouldn't be using Flink anymore.
The first approach seems like the most viable one, but it also seems redundant given that a stream should by definition be a sequential collection and, as such, its elements should have a sense of order (e.g. `I'm the 36th element because 35 elements already came before me.`).
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial if you want to read your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the runtimeContext's sub-task id and number of sub-tasks to calculate non-overlapping and monotonic indexes.
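A minimal sketch of what such a RichMapFunction could look like (my own illustration, not the answerer's code; the class name IndexAssigner and the i, i + p, i + 2p, ... numbering scheme are assumptions):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

// Sub-task i emits indexes i, i + p, i + 2p, ... where p is the parallelism.
// Indexes never overlap across sub-tasks and are monotonically increasing
// within each sub-task, but they are not gap-free globally.
public class IndexAssigner extends RichMapFunction<Double, Tuple2<Long, Double>> {

    private long nextIndex;
    private long stride;

    @Override
    public void open(Configuration parameters) {
        stride = getRuntimeContext().getNumberOfParallelSubtasks();
        nextIndex = getRuntimeContext().getIndexOfThisSubtask();
    }

    @Override
    public Tuple2<Long, Double> map(Double value) {
        Tuple2<Long, Double> indexed = Tuple2.of(nextIndex, value);
        nextIndex += stride;
        return indexed;
    }
}

Used as tempsJo.map(new IndexAssigner()) in place of zipWithIndex, this pairs up the n-th element seen by each sub-task; whether that matches the order the elements were created in still depends on how the sources are read.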

Negamax: what to do with "partial" results after canceling a search?

I'm implementing negamax with alpha/beta pruning and a transposition table, based on the pseudocode here, with roughly this algorithm:
NegaMax():
1. Transposition Table lookup
2. Loop through moves
2a. **Bail if I'm out of time**
2b. Make move, call -NegaMax, undo move
2c. Update bestvalue and alpha/beta if appropriate
3. Transposition table store/update
4. Return bestvalue
I'm also using iterative deepening, calling NegaMax with progressively higher depths.
My question is: when I determine I've run out of time (2a. in the beginning of move loop) what is the right thing to do? Do I bail immediately (not updating the transposition table) or do I just break the loop (saving whatever partial work I've done)?
Currently, I return null at that point, signifying that the search was canceled before "completing" that node (whether by trying every move or reaching an alpha/beta cut). The null gets propagated up the stack, and each node on the way up bails by returning, so step 3 never runs.
Essentially, I only store values in the TT if the node "completed". The scenario I keep seeing with the iterative deepening:
I get through depths 1-5 really quick, so the TT has a depth = 5, type = Exact entry.
The depth = 6 search is taking a long time, so I bail.
I ultimately return the best move in the transposition table, which is the move I found during the depth = 5 search. The problem is, if I start a new depth = 6 search, it feels like I'm starting it from scratch. However, if I save whatever partial results I found, I worry that I'll have corrupted my TT, potentially by overwriting the completed depth = 5 entry with an incomplete depth = 6 entry.
If the search wasn't completed, the score is inaccurate and should likely not be added to the TT. If you have a best move from the previous ply and it is still best and the score hasn't dropped significantly, you might play that.
On the other hand, if at depth 6 you discover that the opponent has a mate in 3 (oops!) or could win your queen, you might have to spend even more time to try to resolve that.
That would leave you with less time for the remaining moves (if any...), but it might be better to be slightly short on time than to get mated with plenty of time remaining. :-)
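To make the abort-and-store-nothing policy from the question concrete, here is a compact sketch. It is my own illustration, not the poster's or the answerer's code, and the Game and TranspositionTable interfaces (plus the simplified TT entry with no bound type) are hypothetical stand-ins for whatever the real engine uses.

import java.util.List;

// Hypothetical minimal interfaces; a real engine's types will differ.
interface Game {
    List<Integer> legalMoves();
    void make(int move);
    void undo(int move);
    int evaluate();   // static evaluation at the leaves
    long hash();
}

interface TranspositionTable {
    void store(long hash, int depth, int value);  // bound type omitted for brevity
}

class TimedNegamax {
    private final TranspositionTable tt;
    private final long deadlineNanos;

    TimedNegamax(TranspositionTable tt, long deadlineNanos) {
        this.tt = tt;
        this.deadlineNanos = deadlineNanos;
    }

    // Returns null if the clock ran out. The null propagates upward, and no TT
    // entry is written for any node whose move loop did not complete.
    Integer search(Game g, int depth, int alpha, int beta) {
        // 1. (transposition table lookup omitted for brevity)
        if (depth == 0) return g.evaluate();
        int best = Integer.MIN_VALUE;
        for (int move : g.legalMoves()) {                        // 2. loop through moves
            if (System.nanoTime() >= deadlineNanos) return null; // 2a. out of time: bail
            g.make(move);                                        // 2b. make, recurse, undo
            Integer child = search(g, depth - 1, -beta, -alpha);
            g.undo(move);
            if (child == null) return null;                      //     child aborted: abort too
            int score = -child;
            if (score > best) best = score;                      // 2c. update bestvalue/alpha
            if (score > alpha) alpha = score;
            if (alpha >= beta) break;                            //     beta cutoff
        }
        tt.store(g.hash(), depth, best);                         // 3. only for completed nodes
        return best;                                             // 4.
    }
}

With this shape, an aborted depth-6 pass writes nothing, so the depth-5 entries (and the best move found at depth 5) stay intact, which is exactly the behaviour described in the question.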

Neo4j Query Optimization - 15 diff labels

So I have 15 different labels in Neo4j which represent 15 real-world business objects. All these objects are related to each other, and each object (label) has thousands of nodes. What I'm trying to do is optional matches between all these labels to get the related data. The query runs extremely slowly with all 15 selected and works fine with 3-4 object types.
So typically this is the query with fewer object types, which works fine:
MATCH (incident:Incidents)
WHERE incident.incident_number IN ["INC000005590903","INC000005590903"]
MATCH (device:Devices)
WHERE device.deviceid_udr in ["RE221869491800Uh_pVAevJpYAhRcJ"]
MATCH (alarm:Alarms)
WHERE alarm.entryid_udr in ["ALM123000000110"]
MATCH incident-[a]-alarm
MATCH device-[b]-alarm
MATCH incident-[c]-device
RETURN incident.incident_number, device.deviceid,alarm.entryid_udr
When I do a query to find data related between 15 different object types, it runs extremely slowly. Do you have any suggestions on how I could approach this problem?
When you analyze what is happening in your query, it's not too hard to see why it is slow. Each of your initial matches is unrelated to the others, so the entire domain for each label is searched. If you use relationship matching up front, you can significantly reduce the search space and the time it takes to run the query.
Try this query and see how it goes in comparison with the one in your question.
MATCH (incident:Incidents {incident_number : "INC000005590903"})
WITH incident
MATCH (incident)--(device:Devices {deviceid_udr : "RE221869491800Uh_pVAevJpYAhRcJ"})
WITH incident, device
MATCH (device)--(alarm:Alarms {entryid_udr : "ALM123000000110"})
WITH incident, device, alarm
MATCH (incident)--(alarm)
RETURN incident.incident_number, device.deviceid_udr, alarm.entryid_udr
In this query, once you find the incident, the next match is a search of the relationships on the particular incident to find a matching device. Once that is found, matching the alarm is limited in the same fashion.
In addition, if you have a different relationship type for each of the relationships (incident to device, device to alarm, alarm to incident, etc), using the specifications of the relationship types in the matches will speed things up even further, since it will again reduce the number of items that must be searched. Without relationship typing, relationships to nodes with wrong labels will be tested.
I don't know if it is intentional or not, but you also are matching for a closed ring in your example. That's not a problem if you are careful not to allow match loops to occur, but the query won't succeed if there isn't a closed ring. Just thought I'd point it out.

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and PHP with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity, like users and companies, which holds the nodes with that property. So the users index holds 130K nodes, and the rest are in companies.
With Cypher we are querying like this:
START u=node:users('id:*')
RETURN count(u)
And the results are:
Returned 1 row. Query took 4080 ms
The server is configured with the defaults plus a few tweaks, but 4 seconds is too slow for our needs. Consider that the database will grow by 20K nodes in one month, so we need this query to perform very well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get the "approximate" number of rows.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you just need this single query most of the time, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and, for good measure, refresh it every night by running the query above?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are guarded by write locks:
Transaction tx = db.beginTx();
try {
    // ...
    // ...
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
        ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0
company1:COMPANY
The second would also allow you to automatically update your index in a separate background thread, by the way; IMO one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" in general takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)