RedisGraph: create an edge between nodes if one does not already exist, under high traffic

I have the following cypher query.
MATCH (a:ACTOR {id: 'Charlie'})
MATCH (m:MOVIE {id: 'TwoAndAHalfMen'})
OPTIONAL MATCH (a)-[e:ACTED {prop1: val1}]->(m)
WITH COUNT(e) AS c, a, m
WHERE c=0
CREATE (a)-[:ACTED {prop1: val1, prop2: '<cur-time>' }]->(m)
We have an actor node and a movie node. If there is no edge between the actor and the movie nodes, then a new edge ACTED should be created, with some edge properties.
The above query works fine in an idle database with not much traffic, often responding in less than 1 millisecond. However, when there is a lot of traffic, this particular query alone takes a long time, on the order of multiple seconds, while the other queries finish fine.
I suspect the slowness comes from the COUNT function trying to scan all the nodes and getting delayed by the constant writes. I may be wrong here; that is just an assumption.
I do not need to know the total number of edges; I am only interested in whether the edge exists or not. So I tried writing the query in a different way:
MATCH (a:ACTOR {id: 'Charlie'})
MATCH (m:MOVIE {id: 'TwoAndAHalfMen'})
WHERE NOT EXISTS(
MATCH (a)-[e:ACTED {prop1: val1}]->(m)
)
CREATE (a)-[:ACTED {prop1: val1, prop2: '<cur-time>' }]->(m)
But for this I get the following error from RedisGraph:
RedisGraph does not currently support map projection
Is there any other way to optimally create an edge between two nodes if one does not already exist? I cannot use a plain MERGE command, because prop2 must be set to a different value each time a new edge is created.
Please note that this has to work in RedisGraph; it is not enough if it works in Neo4j or similar.

Please see the RedisGraph MERGE docs, specifically ON MATCH and ON CREATE.
e.g.
MATCH (a:ACTOR {id: 'Charlie'}),
(m:MOVIE {id: 'TwoAndAHalfMen'})
MERGE (a)-[e:ACTED {prop1: val1}]->(m)
ON CREATE SET e.prop2 = 'cur-time'
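If your RedisGraph version supports the built-in timestamp() function (milliseconds since the Unix epoch) and query parameters, the current time no longer needs to be interpolated by the client. A minimal sketch, assuming prop1 arrives as a parameter named $val1:
MATCH (a:ACTOR {id: 'Charlie'}),
      (m:MOVIE {id: 'TwoAndAHalfMen'})
MERGE (a)-[e:ACTED {prop1: $val1}]->(m)
ON CREATE SET e.prop2 = timestamp()
Since MERGE only creates the edge when the pattern is missing, ON CREATE SET never touches prop2 on an existing edge, which addresses the concern about prop2 taking a different value on each creation.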

Related

Neo4J Optimisation For Creating Relationships Between Nodes

I am working with around 50000 tweets as nodes, each having data similar to that shown below.
{
"date": "2017-05-26T09:50:44.000Z",
"author_name": "djgoodlook",
"share_count": 0,
"mention_name": "firstpost",
"tweet_id": "868041705257402368",
"mention_id": "256495314",
"location": "pune india",
"retweet_id": "868039862774931456",
"type": "Retweet",
"author_id": "103535663",
"hashtag": "KamalHaasan"
}
I have tried to create relationships between tweets having the same location by using the following command:
MATCH (a:TweetData),(b:TweetData)
WHERE a.location = b.location AND NOT a.tweet_id = b.tweet_id
CREATE (a)-[r:SameLocation]->(b)
RETURN r
Using this command I wasn't able to create the relationships; it ran for more than 20 hours and still didn't produce results. For the hashtag relationship a similar command worked fine, taking around 5 minutes.
Is there any other method to create these relationships, or any way to optimise this query?
Yes. First, make sure you have an index on :TweetData(location); that's the most important change, since without it every single node lookup has to scan all 50k :TweetData nodes for a common location (that's 50k^2 comparisons).
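For example, on Neo4j 2.x/3.x the index would be created with (on 4.x and later the equivalent is CREATE INDEX FOR (t:TweetData) ON (t.location)):
CREATE INDEX ON :TweetData(location)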
Next, it's better to ensure one node's id is less than the other, otherwise you'll get the same pairs of nodes twice, with just the order reversed, resulting in two relationships for every pair, one in each direction, instead of just the single relationship you want.
Lastly, do you really need to return all relationships? That may kill your browser, maybe return just the count of relationships added.
MATCH (a:TweetData)
MATCH (b:TweetData)
WHERE a.location = b.location AND a.tweet_id < b.tweet_id
CREATE (a)-[r:SameLocation]->(b)
RETURN count(r)
One other thing to (strongly) consider is instead of tracking common locations this way, create a :Location node instead, and link all :TweetData nodes to it.
You will need an index or unique constraint on :Location(name), then:
MATCH (a:TweetData)
MERGE (l:Location {name:a.location})
CREATE (a)-[:LOCATION]->(l)
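With that model in place, finding co-located tweets becomes a two-hop traversal through the shared :Location node instead of a property join; a sketch:
MATCH (a:TweetData)-[:LOCATION]->(l:Location)<-[:LOCATION]-(b:TweetData)
WHERE a.tweet_id < b.tweet_id
RETURN a, b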
This approach also lends itself more easily to batching, if 50k nodes at once is too much: you can just use SKIP and LIMIT after your match on a, as sketched below.
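A minimal sketch of one such batch, assuming you advance the SKIP offset between runs (here the first 10000 nodes):
MATCH (a:TweetData)
WITH a SKIP 0 LIMIT 10000
MERGE (l:Location {name: a.location})
CREATE (a)-[:LOCATION]->(l)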

Neo4J: efficient neighborhood query

What's an efficient way to find all nodes within N hops of a given node? My particular graph isn't highly connected, i.e. most nodes have only degree 2, so for example the following query returns only 27 nodes (as expected), but it takes about a minute of runtime and the CPU is pegged:
MATCH (a {id:"36380_A"})-[*1..20]-(b) RETURN a,b;
All the engine's time is spent in traversals, because if I just find that starting node by itself, the result returns instantly.
I really only want the set of unique nodes and relationships (for visualization), so I also tried adding DISTINCT to try to stop it from re-visiting nodes it's seen before, but I see no change in run time.
As you said, matching the start node alone is really fast, and faster still if your property is indexed.
However what you are trying to do now is matching the whole pattern in the graph.
Keep your idea of your fast starting point:
MATCH (a:Label {id:"1234-a"})
once you've got it, pass it to the rest of the query with WITH
WITH a
then match the relationships from your fast starting point:
MATCH (a)-[:Rel*1..20]->(b)
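Put together, and returning only the distinct nodes you want for visualization (:Label and :Rel are placeholders for your actual label and relationship type):
MATCH (a:Label {id:"36380_A"})
WITH a
MATCH (a)-[:Rel*1..20]->(b)
RETURN DISTINCT a, b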

Neo4j Query Optimization - 15 diff labels

So I have 15 different labels in Neo4j which represent 15 real-world business objects. All these objects are related to each other, and each object (label) has thousands of nodes. What I'm trying to do is run optional matches between all these labels and get the related data. The query runs extremely slowly with all 15 selected, but works fine with 3-4 object types.
Typically, this is the query with fewer object types, which works fine:
MATCH (incident:Incidents)
WHERE incident.incident_number IN ["INC000005590903","INC000005590903"]
MATCH (device:Devices)
WHERE device.deviceid_udr in ["RE221869491800Uh_pVAevJpYAhRcJ"]
MATCH (alarm:Alarms)
WHERE alarm.entryid_udr in ["ALM123000000110"]
MATCH incident-[a]-alarm
MATCH device-[b]-alarm
MATCH incident-[c]-device
RETURN incident.incident_number, device.deviceid,alarm.entryid_udr
When I query for data related between 15 different object types it runs extremely slowly. Do you have any suggestions for how I could approach this problem?
When you analyze what is happening in your query, it's not too hard to see why it is slow. Each of your initial matches is unrelated to the others, so the entire domain for each label is searched. If you use relationship matching up front, you can significantly reduce the search space and the time it takes to run the query.
Try this query and see how it goes in comparison with the one in your question.
MATCH (incident:Incidents {incident_number : "INC000005590903"})
WITH incident
MATCH (incident)--(device:Devices {deviceid_udr : "RE221869491800Uh_pVAevJpYAhRcJ"})
WITH incident, device
MATCH (device)--(alarm:Alarms {entryid_udr : "ALM123000000110"})
WITH incident, device, alarm
MATCH (incident)--(alarm)
RETURN incident.incident_number, device.deviceid_udr, alarm.entryid_udr
In this query, once you find the incident, the next match is a search of the relationships on the particular incident to find a matching device. Once that is found, matching the alarm is limited in the same fashion.
In addition, if you have a different relationship type for each of the relationships (incident to device, device to alarm, alarm to incident, etc.), specifying those relationship types in the matches will speed things up even further, since it again reduces the number of items that must be searched. Without relationship typing, relationships to nodes with the wrong labels will also be tested.
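For illustration only, with invented relationship type names (:IMPACTS, :RAISED, and :TRIGGERED are placeholders for whatever your schema actually uses), the query would become:
MATCH (incident:Incidents {incident_number : "INC000005590903"})
MATCH (incident)-[:IMPACTS]->(device:Devices {deviceid_udr : "RE221869491800Uh_pVAevJpYAhRcJ"})
MATCH (device)-[:RAISED]->(alarm:Alarms {entryid_udr : "ALM123000000110"})
MATCH (incident)-[:TRIGGERED]->(alarm)
RETURN incident.incident_number, device.deviceid_udr, alarm.entryid_udr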
I don't know if it is intentional or not, but you also are matching for a closed ring in your example. That's not a problem if you are careful not to allow match loops to occur, but the query won't succeed if there isn't a closed ring. Just thought I'd point it out.

Edge records not showing up in OrientDB

I discovered OrientDB recently and have been playing with the tool for the past few weeks. However, I noticed today that something seemed to be wrong whenever I added an edge between two vertices: the edge record is not present if I run a query such as SELECT FROM E; this just returns an empty set. In spite of this, it is possible to see the relationship as a property in the nodes, and queries like SELECT IN() FROM V do work.
This poses an issue: if I can't access the edge record directly, I can't add more properties to it, and even if I could, I wouldn't be able to see the changes. I thought this could be a design decision for some reason, but the GratefulDeadConcerts example database doesn't seem to have this problem.
I'll illustrate my question with an example:
Let's create a graph database in OrientDB from scratch and name it "Test". We'll create a couple of vertices:
CREATE VERTEX SET TEST=123
CREATE VERTEX SET TEST=456
Let's assume the #rid of these nodes are #9:0 and #9:1 respectively, as we haven't changed anything from the default settings. Let's create an edge between them:
CREATE EDGE FROM #9:0 TO #9:1
Now, let's take a look at the output of the query SELECT FROM V:
orientdb {Test}> SELECT FROM V
----+----+----+----+----
# |#RID|TEST|out_|in_
----+----+----+----+----
0 |#9:0|123 |#9:1|null
1 |#9:1|456 |null|#9:0
----+----+----+----+----
2 item(s) found. Query executed in 0.005 sec(s).
Everything looks right so far. However, the output of the query SELECT FROM E is simply 0 item(s) found. Query executed in 0.016 sec(s). If we execute SELECT IN() FROM V we get the following:
orientdb {Test}> SELECT IN() FROM V
----+-----+----
# |#RID |IN
----+-----+----
0 |#-2:1|[0]
1 |#-2:2|[1]
----+-----+----
2 item(s) found. Query executed in 0.005 sec(s).
From this, I assume that the edges are created in cluster number -2, even though the default cluster for the class E is 10 and I haven't added any other clusters. I suspect this has something to do with the problem, but I'm not sure how to fix it. I have tried adding new clusters to the class E and creating the edges in the new cluster, but to no avail; I keep getting the exact same result.
So my question is, how do I make edges records show up in OrientDB?
I'm using OrientDB Community 1.7-RC2 and have tried this in two different machines, one Windows 7 and another one Debian Wheezy.
Extracted from https://github.com/orientechnologies/orientdb/wiki/Troubleshooting#why-i-cant-see-all-the-edges:
OrientDB, by default, manages edges as "lightweight" edges if they have no properties. This means that if an edge has no properties, it's not stored as a physical record. But don't worry, your edge is still there, just encoded in a separate data structure. For this reason, if you execute a SELECT FROM E, no edges (or fewer edges than expected) are returned. It's extremely rare to need the list of edges, but if this is your case you can disable this feature by issuing this command once (at the cost of a slowdown and a bigger database size):
alter database custom useLightweightEdges=false
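Alternatively, since lightweight edges are only used when an edge has no properties, setting any property at creation time forces a regular edge record that SELECT FROM E will return. A sketch (weight is just an illustrative property name):
CREATE EDGE FROM #9:0 TO #9:1 SET weight = 1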

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on Neo4j and PHP with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity, like users and companies, which holds the nodes with that property. So the users index holds 130K nodes, and the companies index the rest.
With Cypher we query like this:
START u=node:users('id:*')
RETURN count(u)
And the result is:
Returned 1 row. Query took 4080 ms
The server is configured with the defaults plus a few tweaks, but 4 seconds is too slow for our needs. Consider that the database will grow by 20K nodes in one month, so we need this query to perform very well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get the "approximate" row count.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you need just this single query most of the time, why not store the total count as an integer property on some node, update it together with the index insertions, and, for good measure, refresh it every night by running the query above?
You could instead keep a property on a specific node up to date with the number of such nodes, with updates guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
            ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or, if you are using 2.0:
company1:COMPANY
The second would also let you update your index automatically in a separate background thread, by the way, which is IMO one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" in general takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)