Changing index in Neo4j affects the search in cypher - lucene

I am using Neo4j 2.0.
I have a label PROD. All nodes with the label PROD have a property name. I created an index on name like this:
CREATE INDEX ON :PROD(name)
After creating the index, if I change the value of the property name from "old" to "new", the following query works fine on a small test dataset but not on our production data, which has 700 nodes with the label PROD (out of roughly a million nodes with other labels).
MATCH (n:PROD) WHERE n.name="new" RETURN n;
I have also created a legacy index on the same field, and after dropping and re-indexing the node on modification, it works perfectly on both the test and production datasets.
Is there a way to ensure the indexes are updated? What am I doing wrong? Why does the above query fail on the larger dataset?

You can use the :schema command in the Neo4j browser or schema in neo4j-shell. The output should indicate which indexes have been created and which of them are already online.
Additionally there's a programmatic way to wait until index population has finished.
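A minimal sketch of that programmatic wait, using the Neo4j 2.x embedded Java API:
import java.util.concurrent.TimeUnit;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

// Block until all schema indexes are online (throws if population fails
// or the timeout elapses), so queries can rely on them afterwards.
public static void waitForIndexes(GraphDatabaseService db) {
    try (Transaction tx = db.beginTx()) {
        db.schema().awaitIndexesOnline(10, TimeUnit.MINUTES);
        tx.success();
    }
}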
Consider upgrading to 2.1.3; indexing has improved since 2.0.

Related

Element Range Index & availability for search

MarkLogic 9.0.8.2
We have around 20M records in our database in XML format.
To support facets, we have created an element-range-index on the given element.
It is working fine, so no issue there.
The real problem is that we now want to deploy the same code on different environments like System Test (ST), UAT, and Production.
Before deploying the code, we have to make sure the given index exists, so we create it 1-2 days in advance.
We noticed that until indexing has fully completed, we can't deploy our code, or it starts throwing errors like this:
<error:code>XDMP-ELEMRIDXNOTFOUND</error:code>
<error:name/>
<error:xquery-version>1.0-ml</error:xquery-version>
<error:message>No element range index</error:message>
<error:format-string>XDMP-ELEMRIDXNOTFOUND: cts:element-reference(fn:QName("","tc"), ("type=string", "collation=http://marklogic.com/collation/")) -- No string element range index for tc collation=http://marklogic.com/collation/ </error:format-string>
Once indexing is finished, the same code runs as expected.
Especially in ST/UAT, we are fine with getting partial data from unfinished indexing.
Is there any way we can achieve this? Otherwise we are losing too much time just waiting for indexing to finish.
This happens every time we come up with a new feature that depends on a new index.
You can only use a range index if it exists and is available. It is not available until all matching records have been indexed.
You should create your indexes earlier and allow enough time for reindexing to finish before deploying code that uses them. Maybe make your code deployment depend upon the reindexing status and not allow it to be deployed until reindexing has completed.
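As a rough illustration of such a gate, the sketch below polls the MarkLogic Management API database-status view and blocks until reindexing looks finished. The JSON field checked here is an assumption (names vary by version), and authentication is omitted for brevity; verify against the Management API docs for your release.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReindexGate {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8002/manage/v2/databases/Documents?view=status&format=json"))
                .build();
        while (true) {
            String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            // Hypothetical check: keep waiting while the status still reports reindexing.
            if (!body.contains("\"reindexing\":true")) {
                System.out.println("Reindexing finished - safe to deploy.");
                return;
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }
}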
If the new versions of your applications can function without the indexes (a value query instead of a range query), or you are fine with queries returning inaccurate results, then you could enable/disable the section of code utilizing them with feature flags, or wrap it in try/catch; but you really should just create the indexes earlier in your deployment cycles.
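A minimal sketch of that feature-flag/fallback idea; the flag and the two query paths are hypothetical placeholders, not MarkLogic APIs:
import java.util.List;

public class FacetService {
    private final boolean useRangeIndex; // flip per environment (ST/UAT/Prod)

    public FacetService(boolean useRangeIndex) {
        this.useRangeIndex = useRangeIndex;
    }

    public List<String> facetValues(String element) {
        if (useRangeIndex) {
            try {
                // Fast path: requires the element range index.
                return rangeIndexLookup(element);
            } catch (RuntimeException e) {
                // e.g. XDMP-ELEMRIDXNOTFOUND while reindexing is still running:
                // fall through to the slower, index-free path.
            }
        }
        return valueQueryLookup(element); // works without the range index
    }

    // Placeholders for the actual MarkLogic client calls.
    private List<String> rangeIndexLookup(String element) { throw new UnsupportedOperationException(); }
    private List<String> valueQueryLookup(String element) { throw new UnsupportedOperationException(); }
}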
Otherwise, if you are performing tests without a complete and functioning environment, what are you really testing?

Couchbase performance issues when scanning bucket for certain documents - Getting timeout exception

We have a Couchbase server, Community Edition 5.1.1 (build 5723).
In our Cars bucket we have the Car Make and the Cars it manufactured.
The connection between the two is the Id of the Car Make that we save as another field in the Car document (like a foreign key in a MySQL table).
The bucket has only 330,000 documents.
Queries are taking a lot of time - dozens of seconds for a very simple query such as
select * from cars where model="Camry" <-- we expect to have about 50,000 results for that
We perform the queries in 2 ways:
The Couchbase UI
A Spring Boot app, which constantly gets a TimeoutException after 7.5 seconds
We thought the issue was a missing index on the bucket, so we added one:
CREATE INDEX cars_idx ON cars(makeName, modelName, makeId, _class) USING GSI;
We can see that index when running
SELECT * FROM system:indexes
What are we missing here? Are these reasonable response times for such queries in a NoSQL DB?
Try
CREATE INDEX model_idx ON cars(model);
Your index does not cover the model field.
You should also have an index for the Spring Data Couchbase "_class" property:
CREATE INDEX `type_idx` ON `cars`(`_class`)
So, this is how we solved the issue:
Using this link and @paralen's answer, we created several indexes that sped up the queries.
We altered our code to use pagination when we know the returned result set will be big, and came up with something like this:
int pageNumber = 0;
Slice<Car> slice;
do {
    Pageable pageable = PageRequest.of(pageNumber++, SLICE_SIZE, Sort.by("id"));
    slice = carsRepository.findAllByModelName("Camry", pageable);
    List<Car> cars = slice.getContent();
    // process the current page of cars here ...
} while (slice.hasNext());
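For reference, a sketch of the Spring Data Couchbase repository behind that loop (entity and field names are assumptions based on the question):
import org.springframework.data.couchbase.repository.CouchbaseRepository;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Slice;

// Derived query: Spring Data generates the N1QL for findAllByModelName,
// so modelName should be covered by one of the indexes created above.
public interface CarsRepository extends CouchbaseRepository<Car, String> {
    Slice<Car> findAllByModelName(String modelName, Pageable pageable);
}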

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph that share an address with other persons, where the number of persons at an address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I noticed this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it works.
I'm wondering how I can create an index to be able to run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true, as under the covers it turns on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution to your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5, as in the sketch below.
You can use indexes on the textual fields of your address nodes.
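For instance, once addresses are their own vertices, a traversal along these lines finds the shared ones (TinkerPop Java API; the address label and edge name are assumptions, and g is a GraphTraversalSource):
import static org.apache.tinkerpop.gremlin.process.traversal.P.inside;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.inE;
import java.util.List;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// inside(2, 6) is exclusive on both ends, so this matches in-degrees of 3, 4 and 5.
List<Vertex> sharedAddresses = g.V().hasLabel("address")
        .where(inE("has_address").count().is(inside(2, 6)))
        .toList();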

Neo4j schema indexing is very slow

I'm trying to implement indexes in Neo4j in order to speed up some queries.
One of the queries finds some people based on a string, using where people.key IN [<10 string values here>].
We have around 180k nodes with the label People in the database, and the query takes 41 s to return the result.
So I created a schema index on this property, ran the query again, and nothing changed. Out of curiosity, I decided to select by ID:
match (people:People)
where ID(people) IN [789806,908117,934851,934857,935125,935174,935177,935183,935581,935586,935587,935588,935634,935636,935637,935638,935639]
return ID(people)
It took 92 ms. Perfect! So I created a new property called test, indexed it, and set it to the same value as the node ID. Then I ran the following query:
match (people:People)
where people.test IN [789806,908117,934851,934857,935125,935174,935177,935183,935581,935586,935587,935588,935634,935636,935637,935638,935639]
return ID(people)
It took 41 s (41000 ms) again. Am I missing something? I really don't understand... Is it some performance configuration?
FYI, we use Neo4j 2.0.0 on Debian.
Thanks guys!
As far as I remember, the IN operator did not use an existing index in 2.0.x, so try upgrading to 2.1.3 and retrying.
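Until you can upgrade, a workaround sketch using the embedded Java API: in 2.0.x, findNodesByLabelAndProperty does use the schema index for equality lookups, so you can issue one lookup per value instead of a single IN clause.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// One indexed equality lookup per value from the IN list.
Label peopleLabel = DynamicLabel.label("People");
List<Node> matches = new ArrayList<>();
try (Transaction tx = db.beginTx()) {
    for (Object value : Arrays.asList(789806L, 908117L, 934851L)) { // etc.
        for (Node n : db.findNodesByLabelAndProperty(peopleLabel, "test", value)) {
            matches.add(n);
        }
    }
    tx.success();
}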

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on Neo4j and PHP with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity, like users and companies, which holds the nodes with that property. So the users index holds 130K nodes, and the companies index holds the rest.
With Cypher we query like this:
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row. Query took 4080 ms
The server is configured with the defaults plus a few tweaks, but 4 seconds is far too slow for our needs. The database will also grow by about 20K nodes a month, so this query needs to perform very well.
Is there any other way to do this, maybe with Gremlin or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX')
    .query(new org.neo4j.index.lucene.QueryContext('*'))
    .size()
This method uses the Lucene index to get an approximate row count.
Thanks again to all.
Mmh, this is really about the performance of that Lucene index. If you just need this single query most of the time, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and, for good measure, run an update with the query above every night?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    ...
    // Serialize concurrent updates to the counter node.
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
        ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or, if you are using 2.0:
company1:COMPANY
The second approach would also, by the way, let your index be updated automatically in a separate background thread, IMO one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" generally takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)