Neo4j: "ghost" node in label index throws error

I have a neo4j database with a set of nodes with label :EXAMPLE.
There are two operations: first I delete one node, and then I look for another one. They are executed separately through the Neo4j API.
MATCH (n:EXAMPLE {Name: { name1 }}) DELETE n;
and
MATCH (n:EXAMPLE {Name: { name2 }}) RETURN n;
Sometimes, when I execute the second query, it throws the error "Node with id 123". The node with id 123 is the same node that was deleted in the first query.
It happens when a lot of requests are coming to the database simultaneously.
I guess it could happen if the node was deleted but the EXAMPLE label index wasn't updated yet. Two facts support this theory:
1) The error is intermittent.
2) If I change the second query like this (removing the label), I don't get the error:
MATCH (n {Name: { name2 }}) RETURN n;
The Neo4j version is 2.1.5, Java is OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1), and the operating system is Debian. There are no indexes in the database other than the label index.
The question is: how can I fix this while still using labels?

What ends up happening is that (simplified) the operations interleave like so:
Q1: MATCH (n)
Q2: DELETE (n), COMMIT
Q1: RETURN n # Error, n no longer exists
For implementation reasons, this is much more likely to happen if Cypher is going via an index. The database will eventually handle this for you, but for now you'll need to wrap that read query in a retry block: if it fails with this type of error, you simply run it again.
On that note, there are other errors that are easily recoverable by retrying, such as deadlock errors, so wrapping your statements and/or transactions in retry blocks is a useful thing to do in general.
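As a rough illustration, a retry wrapper could look something like this (only a sketch: the exception handling should be narrowed to the transient errors you actually see, and the unit of work is whatever query-running code your application already uses):
import java.util.concurrent.Callable;

class RetryUtil {
    // Runs a unit of work (a read query or a whole transaction) and retries it a few
    // times if it throws, e.g. on the "node no longer exists" read or on deadlocks.
    static <T> T withRetry(Callable<T> work, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();          // attempt the query / transaction
            } catch (Exception e) {
                last = e;                    // remember the failure
                Thread.sleep(50L * attempt); // small back-off before the next attempt
            }
        }
        throw last;                          // still failing after maxAttempts: give up
    }
}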

This is a possible workaround:
Mark nodes as deleted instead of deleting them, ignore nodes that are marked as deleted in your read queries, and delete all marked nodes at once with a periodic garbage-collection job.
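In Cypher that could look roughly like this (a sketch; the deleted property name is just an example, and has() is the Neo4j 2.1 way to test for a property):
// mark instead of deleting
MATCH (n:EXAMPLE {Name: { name1 }}) SET n.deleted = true;
// reads skip nodes that are marked as deleted
MATCH (n:EXAMPLE {Name: { name2 }}) WHERE NOT has(n.deleted) RETURN n;
// garbage collection, run periodically
MATCH (n:EXAMPLE) WHERE has(n.deleted) DELETE n;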


Create subgraph query in Gremlin around single node with outgoing and incoming edges

I have a large JanusGraph database and I'd like to create a subgraph centered around one node type, including incoming and outgoing nodes of specific types.
In Cypher, the query would look like this:
MATCH (a:Journal)<-[:PublishedIn]-(b:Paper{paperTitle:'My Paper Title'})<-[:AuthorOf]-(c:Author)
RETURN a,b,c
This is what I tried in Gremlin:
sg = g.V().outE('PublishedIn').subgraph('j_p_a').has('Paper','paperTitle', 'My Paper Title')
.inE('AuthorOf').subgraph('j_p_a')
.cap('j_p_a').next()
But I get a syntax error. 'AuthorOf' and 'PublishedIn' are not the only edge types ending at 'Paper' nodes.
Can someone show me how to correctly execute this query in Gremlin?
As written, the outE step yields edges, so the has step will check properties on those edges; after that, the query processor will expect an inV, not another inE. Without your data model it is hard to know exactly what you need; however, looking at the Cypher, I think this is what you want:
sg = g.V().outE('PublishedIn').
       subgraph('j_p_a').
       inV().
       has('Paper', 'paperTitle', 'My Paper Title').
       inE('AuthorOf').
       subgraph('j_p_a').
       cap('j_p_a').
       next()
Edited to add:
As I do not have your data I used my air-routes graph. I modeled this query on yours and used some select steps to limit the data size processed. This seems to work in my testing. Hopefully you can see the changes I made and try those in your query.
sg = g.V().outE('route').as('a').
inV().
has('code','AUS').as('b').
select('a').
subgraph('sg').
select('b').
inE('contains').
subgraph('sg').
cap('sg').
next()

Using deepstream List for tens of thousands of unique values

I wonder if it's a good or bad idea to use deepstream record.getList for storing a lot of unique values, for example emails or any other unique identifiers. The main purpose is to be able to quickly answer whether we already have, say, a user with a given email (email in use), or another record identified by a specific unique field.
I made a few experiments today and ran into two problems:
1) When I tried to populate the list with a few thousand values, I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server went down. I was able to work around it by giving the server's Node process more memory with this flag:
--max-old-space-size=5120
It doesn't look right, but it allowed me to create a list with more than 5000 items.
2) That wasn't enough for my tests, so I pre-created a list with 50000 items by putting the data directly into the RethinkDB table, and got another issue when getting or modifying the list:
RangeError: Maximum call stack size exceeded
I was able to fix it with another flag:
--stack-size=20000
It helps, but I believe it's only a matter of time before one of those errors appears in production once the list grows large enough. I don't really know whether it's a Node.js, JavaScript, deepstream, or RethinkDB issue. All of this made me think that I'm using a deepstream List the wrong way. Please let me know. Thank you in advance!
Whilst you can use lists to store arrays of strings, they are actually intended as collections of record names: the actual data would be stored in the records themselves, and the list would only manage the order of the records.
Having said that, there are two open GitHub issues to improve performance for very long lists, by sending more efficient deltas and by introducing a pagination option.
Interesting results with regard to memory, though; that's definitely something that needs to be handled more gracefully. In the meantime you could drastically improve performance by combining updates into one:
var myList = ds.record.getList( 'super-long-list' );

// Sends 10,000 messages
for( var i = 0; i < 10000; i++ ) {
    myList.addEntry( 'something-' + i );
}

// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
    entries.push( 'something-' + i );
}
myList.setEntries( entries );
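To illustrate the record-name approach mentioned above, a sketch could look like this (the record and list names are made up for the example):
// store the actual data in a record...
var userRecord = ds.record.getRecord( 'users/juan' );
userRecord.set( { email: 'juan@example.com' } );
// ...and only keep the record name in the list
var userList = ds.record.getList( 'users' );
userList.addEntry( 'users/juan' );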

SQL simulator in Prolog

I need to build an SQL simulator in Prolog. I need to simulate functions like create_table, insert, delete, update_table, select_table, drop_table, show_table, etc. I am learning how to use assert and retract, but I'm getting errors in my first predicate, create_table(N,A), where N is the name of the table and A is a list of attributes.
An example is create_table("student",["name","lastname","age"]). This should create a table named "student" with the attributes ["name","lastname","age"].
I was reading about how to work with assert, and I see that I need to declare the predicate dynamic before asserting, so my code is:
len([],0).
len([_|T],N) :- len(T,X), N is X+1.
create_table(_, []):-!.
create_table(N, Atributos):- len(Atributos, L), dynamic N/L, assert(N(Atributos)).
But I get this error, ":7: Syntax error: Operator expected", on this line:
create_table(N, Atributos):- len(Atributos, L), dynamic N/L, assert(N(Atributos)).
What am I doing wrong? Excuse me, I don't speak English well.
From the error message, it seems you're using SWI-Prolog...
Your code could be as follows (note that len/2 can be replaced by the built-in length/2):
create_table(N, Atributos):-
    length(Atributos, L),
    dynamic(N/L),
    T =.. [N|Atributos], % this missing 'constructor' was the cause of the error message
    assert(T).           % this is problematic
There is an important 'design' problem: the CREATE TABLE SQL statement works at the metadata level.
If you do, for instance,
?- assertz(student('Juan', 'Figueira', 20)).
pretending that the predicate student/3 holds table data, you're overlapping data and metadata.
Using dynamic/1 and assert is a tricky, non-logical aspect of Prolog, and dynamically creating dynamic predicates is really unusual. Fundamentally, you cannot write a Prolog goal with the predicate name as a variable, e.g.
query(N,X) :- N=student, N(X).
My suggestion is to remove a layer of complexity and have one dynamic predicate, table/2, and assert your SQL tables as new table clauses, i.e.
:- dynamic table/2.
:- assertz(table(student,['Joe','Young',18])).
query(N,X) :- table(N,X).
:- query(student,[Name,Lastname,Age]).
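Building on that single table/2 predicate, the other operations you listed could be sketched as follows (the predicate names are just examples, and each table/2 clause holds one row):
:- dynamic table/2.

insert(Name, Row) :- assertz(table(Name, Row)).       % add one row
select_table(Name, Row) :- table(Name, Row).          % enumerate rows on backtracking
delete_row(Name, Row) :- retract(table(Name, Row)).   % remove one matching row
drop_table(Name) :- retractall(table(Name, _)).       % remove all rows of a table

% e.g. ?- insert(student, ['Juan','Figueira',20]), select_table(student, Row).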

OrientDB SQL: update edge?

I have been messing around with OrientDB SQL, and I was wondering if there is a way to update an edge of a vertex, together with some data on it.
assuming I have the following data:
Vertex: Person, Room
Edge: Inside (from Person to Room)
something like:
UPDATE Persons SET phone=000000, out_Inside=(
select #rid from Rooms where room_id=5) where person_id=8
Obviously, the above does not work. It throws an exception:
Error: java.lang.ClassCastException: com.orientechnologies.orient.core.id.ORecordId cannot be cast to com.orientechnologies.orient.core.db.record.ridbag.ORidBag
I tried looking at the sources on GitHub, searching for a syntax for a bag with one item, but couldn't find any (I found %, but that seems to be for serialization, not for SQL).
(1) Is there any way to do that, then? How do I update a connection? Is there even a way, or am I forced to create a new edge and delete the old one?
(2) While writing this, it occurred to me that perhaps edges are not the way to go in this case. Perhaps I should use a LINK instead. I have to say I'm not sure when to use which, or what the implications of using either are. I did find this, though:
https://groups.google.com/forum/#!topic/orient-database/xXlNNXHI1UE
comment 3 from the top, by Lvca, where he says:
"The suggested way is to always create an edge for relationships"
Also, even if I should use a link, please respond to (1). I would be happy to know the answer anyway.
P.S.
In my scenario, a person can only be in one room. This will most likely not change in the future. Obviously, the edge has the advantage that if I ever want to change this (however improbable that may be), it will be very easy.
Solution (partial)
(1) The solution was simply to remove the field selection. Thanks to Lvca for pointing it out!
(2) --Still not sure--
The CREATE EDGE and DELETE EDGE commands have this goal: to keep the user from having to fight with the underlying structure.
However, if you want to do it (it's a little "dirty"), try this one:
UPDATE Persons SET phone=000000, out_Inside=(
select from Rooms where room_id=5) where person_id=8
update EDGE Custom_Family_Of_Custom
set survey_status = '%s',
apply_source = '%s'
where #rid in (
select level1_e.#rid from (
MATCH {class: Custom, as: custom, where: (custom_uuid = '%s')}.bothE('Custom_Family_Of_Custom') {as: level1_e} .bothV('Custom') {as: level1_v, where: (custom_uuid = '%s')} return level1_e
)
)
It works well.

Neo4j 2.0.1 Cypher performance difference between using start and match with a predicate

Started using Cypher about a week ago (really like it). In the 'browser' interface I'm running two queries:
1) start n=node:Node(name="foo") match (n)-[r*..4]-(m) return n,m
2) match (n{name:"foo"})-[r*..4]-(m) return n,m
The first query returns almost immediately; the second has been running for more than an hour and counting. Naively I would think these would be equivalent; clearly they are not. I ran a 'smaller' version of both (path length up to 1) in the neo4j-shell so I could profile them.
profile start n=node:Node(name="foo") match (n)-[r*..1]-(m) return n,m;
ColumnFilter(symKeys=["m", "n", " UNNAMED51", "r"], returnItemNames=["n", "m"], _rows=4, _db_hits=0)
TraversalMatcher(start={"expr": "Literal(foo)", "identifiers": ["n"], "key": "Literal(name)",
"idxName": "Node", "producer": "NodeByIndex"}, trail="(n)-[*1..1]-(m)", _rows=4, _db_hits=5)
profile match (n{name:"foo"})-[r*..1]-(m) return n,m
ColumnFilter(symKeys=["n", "m", " UNNAMED33", "r"], returnItemNames=["n", "m"], _rows=4, _db_hits=0)
Filter(pred="Property(n,name(0)) == Literal(foo)", _rows=4, _db_hits=196870)
TraversalMatcher(start={"producer": "AllNodes", "identifiers": ["m"]},
trail="(m)-[*1..1]-(n)", _rows=196870, _db_hits=396980)
From other Stack Overflow questions I understand that _db_hits is a good thing to look at, so it looks like the second query has basically done a linear scan (my database has almost 400k nodes). This seems to be confirmed by the "producer" value of "AllNodes" instead of "NodeByIndex".
Obviously I need to specify the match predicate differently so that it hits the index. The index is called 'Node' and is on the 'name' property. My Googling and Stack Overflow searching are failing me... how do I specify the condition in the match so that it hits the index?
Update:
After some poking around, it appears I'm using a 'legacy' index and then trying to hit it with the new-style (no start clause) query (I'm kind of extrapolating here). So I can do the following:
create index ON :label(name)
and that would provide an index on the name property for a particular label, but I really want an index (I guess a non-legacy index) on ALL node names. I have use cases where that's important (the user may not know the label but does know the name).
Any suggestions or guidance is much appreciated.
Right now there is no global schema index, so you would probably want to introduce a generic label like Entity or Node on all your nodes and create an index on it like this:
create index on :Entity(name);
And add that Entity label to all your nodes:
match (n) set n:Entity;
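Once the nodes carry the Entity label and the index exists, your original query can be written against it (assuming the property is still name, as above):
match (n:Entity {name:"foo"})-[r*..4]-(m) return n,m;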