Neo4j: how do I use an index in a relationship between two nodes? - indexing

I'm debugging the code of an api and I found a cypher instruction that takes 6 minutes to return the data.
I ran the neo4j code in smaller chunks and found that this snippet is causing the problem: MATCH(copart:CopartOperadora) WHERE NOT (copart)-[:FROM_TO]->(:Coexistence)
I'm new to neo4j so I still haven't figured out how I can optimize this instruction.
Thanks to everyone who contributed.

Optimizations of this kind, usually depend on the schema, of your graph database, without that it's very hard to provide any insights. But you can try this:
MATCH (copart:CopartOperadora)-[:FROM_TO]->(:Coexistence)
WITH collect(id(copart)) AS connectedNodesIds
MATCH (copart:CopartOperadora) WHERE id(copart) NOT IN connectedNodesIds
We can't create any index as such, unfortunately. But if the relationship FROM_TO is only present from CopartOperadora to Coexistence nodes. Then you can remove the node label for Coexistence, all together, which will be optimal. Something like this:
MATCH(copart:CopartOperadora) WHERE NOT (copart)-[:FROM_TO]->()

Related

Using index in DSE graph

I'm trying to get the list of persons in a datastax graph that have the same address with other persons and the number of persons is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
At first run I've noticed this error messages:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working
I'm wondering how can I create an index to be able to run this query without enabling allow_scan=true ?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true as under the covers it is turning on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use index on textual fields of your address nodes.

How to traverse & query faster TitanGraph with Cassandra as a backend?

I have millions of nodes stored in Titan 1.0.0 with Cassandra 2.2.4. I want to retrieve graph from Cassandra and query or traverse it in fast way.
If I build index in a code,
mgmt.buildIndex("nameSearchIndex", Vertex.class).addKey(namep, Mapping.TEXT.asParameter()).buildMixedIndex("search");
mgmt.buildIndex("addressSearchIndex", Vertex.class).addKey(addressp, Mapping.TEXT.asParameter()).buildMixedIndex("search");
Still the querying seems to be slower.
When I use
g.traversal().V().count()
it still gives warning - please use indexes, when I have already build indexes in code. Is there any specific configuration to forcefully activate indexes? How to query graph with using indexes?
g.traversal().V().has("Name","Jason")
Does this query uses indexes? if not then how do I make use of indexes to query faster?
Can Spark be used for fast traversal? How to use SparkComputerGraph for the same? I am not able to find the configurations for CassnadraInputFormat with Spark.
Thanks.
There are lots of questions bundled up in this question.
The indexes you're making are mixed indexes, which are implemented by an external indexing system, such as Solr or ElasticSearch. These would help if you are looking for a vertex with a certain name, such as your .has("Name", "Jason) example.
To find out if an index is being used, I suggest looking into the profile() step in Gremlin. You can read about it here.
Spark is meant to be used for traversals that need to potentially load a graph that is bigger than one machine can hold. What use case is .V().count() important for?
This answer was cross-posted on the Titan mailing list.
Indexing is useful for doing fast traversals, but ultimately "fast queries" depends on many factors, including your graph model/volume/shape and the types of questions you are trying to answer.
Read Chapter 8 "Indexing for better Performance" in the Titan docs, and digest the differences between the different types: Composite, Mixed, and Vertex-centric.
Based on the example query you posted, and as Daniel noted, it looks to me like an exact match type of query, so I would start with a Composite index. You can cut and paste this to try it out in the Titan Console.
graph = TitanFactory.open('inmemory')
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
nameIndex = mgmt.buildIndex('nameIndex',Vertex.class).addKey(name).buildCompositeIndex()
mgmt.commit()
graph.addVertex('name','jason')
g = graph.traversal()
g.V().has('name','jason') // no warning should appear
If after reading the Composite vs Mixed Index section you decide that a Mixed index (backed by Elasticsearch, Solr, or Lucene) is what you really need, read Chapter 20 "Index Parameters and Full-Text Search", and digest the differences between the mappings TEXT, STRING, and TEXTSTRING.
Here's an example that uses a STRING mixed index
graph = TitanFactory.build().set('storage.backend','inmemory').set('index.search.backend','elasticsearch').open()
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
nameIndex = mgmt.buildIndex('nameIndex',Vertex.class).addKey(name, Mapping.STRING.asParameter()).buildMixedIndex("search")
mgmt.commit()
graph.addVertex('name','jason')
g = graph.traversal()
g.V().has('name','jason') // no warning should appear

Way to optimise a mapping on informatica

I would like to optimise a mapping developped by one of my colleague and where the "loading part" (in a flat file) is really really slow - 12 row per sec
Currently, to get to the point where I start writting in my file, I take about 2 hours, so I would like to know where I should start looking first otherwise, I will need at least 2 hours between each improvment - which is not really efficient.
Ok, so to describe simply what is done :
Oracle table (with big query inside - takes about 2 hours to get a result)
SQ
2 LKup on ref table (should not be heavy)
update strategy
1 transformer
2 Lk up (on big table - that should be one optimum point I guess : change them to joiner)
6 stored procedure (these also seem a bit heavy, what do you think ?)
another tranformer
load in the flat file
Can you confirm that either the LK up or the stored procedur part could be the reason why it is so slow ?
Do you think that I should look somewhere else to optimize ? I was thinking may be only 1 transformer.
First check the logs carefuly. Look at the timestamps. It should give you initial idea what part causes delay.
Lookups to big tables are not recommended. Joiners are a better way, but they still need to cache data. Can you limit the data for cache, perhaps? It'll be very hard to advise without seeing it.
Which leads us to the Stored Procedures: it's simply impossible to tell anything about them just like that.
So: first collect the stats and do log analysis. Next, read some tuning guides on the Net - there's plenty. Here's a more comprehensive one, but well... large - so you might like to try and look for some other ones.
Powercenter Performance Tuning Guide

Alphabetical index with millions of rows in redis

For my application, I need an alphabetical index on a set with millions of rows.
When I use a sorted set, and give all members the same score, the result looks perfect.
Performance is also great, with a test set of 2 million rows, the last third does not perform noticably less than the first third of the set.
However, I need to query those results. For example, get the first (max) 100 items that start with "goo". I played around with zscan and sort, but it does not give me a working and performant result.
Since redis is very fast when inserting a new member to the sorted set, it must be technically possible to immediately (well, very quickly) go to the right memory location. I suppose redis uses some kind of quicksort mechanism to accomplish this.
But.. I don't seem to get the result when I just want to query the data, and not write to it.
We use replicated slaves for read actions, and we prefer the (default) read-only config switch. So creating a dummy key and deleting it afterward (however unelegant) is not really an option.
I'm stuck a bit, and I'm thinking about writing a ZLEX command in redis-server itself. Which I could use like this:
HELP "ZLEX" -> (ZLEX set score startswith)
-- Query the lexicographical index of a sorted set, supplying a 'startswith' string.
127.0.0.1:12345> ZLEX myset 0 goo LIMIT 0 100
1) goo
2) goof
3) goons
4) goozer
What are your thoughts? Am I missing something in the standard redis commands?
We're using Redis 2.8.4 x64 on Debian.
Kind regards, TW
Edits:
Note:
Related issue: indexing-using-redis-sorted-sets -> At least the name I gave to ZLEX seems to conform with Antirez' (Salvatore's) standards. As of 24-1-2014, I'm working on implementing ZLEX. It seems to be the easiest and most straight-forward solution for this use case, and Antirez could merge it into the main branch for everyone's benefit.
I've implemented ZLEX.
Here are the full specs.
You can grab the new functionality from here: github tw-bert
I also posted a pull request to Antirez here.
Kind regards, TW
Have you had a look at this ?
It can be useful depending on the length of the field by which you sort, this method requires b*(a^2) keys, where a is the length of the field , and b is amount of rows for this field.

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and php with about 200k nodes, which every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity like users, companies which holds the nodes of that property. So inside users index resides 130K nodes, and the rest on companies.
With Cypher we quering like this.
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row.Query took 4080ms
The Server is configured as default with a little tweaks, but 4 sec is too for our needs. Think that the database will grow in 1 month 20K, so we need this query performs very very much.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if is possible to tweak this.
Thanks a lot and sorry for my poor english.
Finaly, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the lucene index to get "aproximate" rows.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you just need this single query most of the time, why not update an integer with the total count on some node somewhere, and maybe update that together with the index insertions, for good measure run an update with the query above every night on it?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are done guarded by write locks:
Transaction tx = db.beginTx();
try {
...
...
tx.acquireWriteLock( countingNode );
countingNode.setProperty( "user_count",
((Integer)countingNode.getProperty( "user_count" ))+1 );
tx.success();
} finally {
tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. In stead, do it like this :
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0
company1:COMPANY
The second would also allow you automatically update your index in a separate background thread by the way, imo one of the best new features of 2.0
The first method should also proof more efficient, since making a "hop" in general takes less time than reading a property from a node. It does however require you to create a separate index for the entities.
Your queries would look like this :
v2.0
MATCH company:COMPANY
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITIY]->entity
RETURN count(company)