How can I speed up a DataStax DSE Gremlin 'find shortest path' query between groups of nodes by adding a timestamp index on edges?

I would like to speed up this query by adding an index on the timeStamp property of my edges. However, according to this link:
http://www.datastax.com/2016/08/inside-dse-graph-what-powers-the-best-enterprise-graph-database
vertex-centric indexes are specific, but in my query I look for all edges between two groups of nodes, not specific ones.
So how can I speed up the following query using a timeStamp index on edges?
ids1 = [{'~label=person',member_id=54666,community_id=505443455},{...},{...}]
ids2 = [{'~label=person',member_id=52366,community_id=501423455},{...},{...}]
g.V(ids1).repeat(timeLimit(20000).bothE().otherV().dedup().simplePath()).times(5).emit().where(hasId(within(ids2))).path().range(0,5)
Please help
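For reference, a vertex-centric index in DSE Graph is declared on a specific edge label in the schema, roughly like this (a sketch; the 'knows' edge label is only a placeholder, substitute your own):

// vertex-centric index on the timeStamp property of 'knows' edges leaving person vertices
// ('knows' is a placeholder edge label, not taken from the question)
schema.vertexLabel('person').index('byTimeStamp').outE('knows').by('timeStamp').add()

For such an index to help, the traversal also has to name the edge label and filter on the property, e.g. bothE('knows').has('timeStamp', gt(someValue)) instead of a bare bothE(), so an unlabelled expansion over all edges cannot use it.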

Can you provide more context on this? How many ids are you passing in ids1? Do you have an index built on the person vertex label? Have you tried using the .profile() method to see where bottlenecks exist in the traversal?

Related

How to add an index to a label in DSE Graph?

I turned off auto scan for DSE Graph and added indexes for the properties of my vertices and edges. Now all my queries fail with the following error message:
g.V().hasLabel("PERMISSIONS").valueMap()
yields:
Could not find an index on vertices labelled 'PERMISSIONS' to answer the condition: '((label = PERMISSIONS))'. Current indexes are: byName(Secondary)->name. Alternatively if in development enable graph scan by using graph.allow_scan. Graph scan is NOT suitable for anything other than toy graphs.
How do I add an index to a label?
The docs have a section on indexing that will show you how to create an index: http://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/using/indexingTOC.html
It looks, though, like you are trying to return all data for the PERMISSIONS vertex label. That would cause a cluster scan. Are you able to filter your query by a property instead of trying to return all data?
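For example, given the byName secondary index that the error message shows already exists, a query filtered on that property can be answered without a scan (the value used here is made up):

// uses the existing secondary index byName -> name; 'some-permission' is a placeholder value
g.V().has('PERMISSIONS', 'name', 'some-permission').valueMap()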

Using an index in DSE Graph

I'm trying to get the list of persons in a DataStax graph who share the same address with other persons, where the number of persons at that address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I noticed this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working.
I'm wondering how I can create an index so that I can run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true, as under the covers it turns on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
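If you have already switched the option on and want to turn it back off from the Gremlin console, a schema config call along these lines should do it (a sketch based on the DSE Graph configuration API):

// disable full-graph scans again for this graph
schema.config().option('graph.allow_scan').set('false')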
I am not sure that indexing is the solution to your problem.
The best way to do this would be to reify addresses as nodes and look for address nodes with an in-degree between 3 and 5.
You can use an index on the textual fields of your address nodes.
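A sketch of that approach in Gremlin, assuming the model implied by the original query (person vertices with has_address edges pointing at address vertices):

// address vertices shared by 3 to 5 persons, together with those persons
// (assumes an 'address' vertex label and 'has_address' edges as in the question)
g.V().hasLabel('address').
  where(inE('has_address').count().is(gte(3)).is(lte(5))).
  project('address', 'persons').
    by(valueMap()).
    by(__.in('has_address').valueMap().fold()).
  limit(20)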

DSE Graph Index on Integer Interval

Currently, I have a graph stored through the DSE Graph Engine with 100K nodes. These nodes have label "customer" and a property called "age" which allows integer values. I have indexed this property with the following command:
schema.vertexLabel("customer").index("custByAge").secondary().by("age").add()
I would like to be able to use this index to answer queries that look for customers within a certain age range (e.g. "age" between 10 and 20). However, it doesn't seem like the index I created is actually being used when I query customers by an age interval.
When I submit the following query, a list of vertices is returned in about 40ms, which leads me to believe that the index is being used:
g.V().has('customer','age',15)
But when I submit the following query, the query times out after 30 sec (as I have specified in my configuration):
g.V().has('customer','age',inside(10,20))
Interruption of result iteration
Display stack trace? [yN]
This leads me to believe that the index is not being used for this query. Does that seem right? And if the index is not being used, does anyone have some advice for how I can speed up this query?
EDIT
As suggested by an answer below, I have run .profile() on each of the above queries, with the following results (only showing the relevant info):
gremlin> g.V().has('customer','age',21).profile()
==>Traversal Metrics
...
index-query 14.333ms
gremlin> g.V().has('customer','age',inside(21,23)).profile()
==>Traversal Metrics
...
index-query 115.055ms
index-query 132.144ms
index-query 132.842ms
>TOTAL 53042.171ms
This leaves me with a few questions:
Does the fact that .profile() returns index-query mean that indexes are being used for my second query?
Why does the second query have 3 index queries, as opposed to 1 for the first query?
All of the index queries for the second query combined total about 400ms. Why does the whole query take ~50000ms? The .profile() command shows nothing else that takes time except these index queries, so where is the extra 50000ms coming from?
Are you using DataStax Studio? If so, you can use the .profile() feature to understand how the index is being engaged.
Example .profile() use:
g.V().in().has('name','Julia Child').count().profile()
You want to use a search index for this case; it will be much, much faster.
For example, in KillRVideo:
schema.vertexLabel("movie").index("search").search().by("year").add()
g.V().hasLabel('movie').has('year', gt(2000)).has('year', lte(2017)).profile()
Then from Studio profile() we can see:
SELECT "community_id", "member_id" FROM "killrvideo"."movie_p" WHERE
"solr_query" = '{"q":"*:*", "fq":["year:{2000 TO *}","year:{* TO
2017]"]}' LIMIT ?; with params (java.lang.Integer) 50000
By default, the profiler doesn't show the trace of all operations, so the index-query list you see may be truncated. Modify "max_profile_events" according to this documentation: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/reference/schema/refSchemaConfig.html

Why can't Neo4j indexing find nodes that I know exist?

I have created and indexed my graph database through localhost:7474 in Neo4j (visually).
The nodes have three properties: name, priority, and link.
I created an index on the name property of the nodes through the 'add or remove indexes' tab of localhost:7474 (as shown in the picture),
but when I try to retrieve nodes based on their names, in the data browser, the console, or my Java application, the nodes cannot be found.
In the console or data browser, when I write this query for red (there is a node with the name red), for example:
start n=node:name(name="red")
return n;
I get 0 rows returned.
and when I type this query:
start n=node:node(name="red")
return n;
or this one:
start n=node:Node(name="red")
return n;
I get "Index node does not exist" and "Index Node does not exist" in the console or data browser.
My database files are in the same path where Neo4j's default.graphdb exists (I mean in "C:\Users\fereshteh\Documents\Neo4j"), and I first created the index and then the graph database.
I don't know what I am doing wrong. Please help me; I will be so thankful.
Neo4j version: 1.9.4
I believe your assumption about how to set up the indexing is incorrect. You can read the indexing documentation for more information, but basically there are three things needed to create or read from an index: the index name, the entry key, and the entry value.
What you have specified above in the web console is the index name, but in your Cypher query you are specifying the entry key. You either want to use the node auto-index, or to create and index the node in Cypher, but the latter isn't an option in 1.9.4.
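A sketch of the auto-index route: with automatic node indexing enabled for the name property in conf/neo4j.properties (node_auto_indexing=true and node_keys_indexable=name), nodes created afterwards can be looked up like this:

// legacy auto-index lookup (Neo4j 1.9.x); only nodes created or updated
// after auto-indexing was enabled end up in node_auto_index
START n = node:node_auto_index(name = "red")
RETURN n;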

How to return the newest nodes from Neo4j?

Is it possible to query Neo4j for the newest nodes? In this case, the indexed property "timestamp" records the time in milliseconds on every node.
All of the Cypher examples I have found concern graph-type queries: "start at node n and follow relationships". What is the general best approach for returning result sets sorted on one field? Is this even possible in a graph database such as Neo4j?
In the embedded Java API it is possible to add sorting using Lucene constructs.
http://docs.neo4j.org/chunked/milestone/indexing-lucene-extras.html#indexing-lucene-query-objects
http://blog.richeton.com/2009/05/12/lucene-sort-tips/
In server mode you can pass an ?order parameter to the Lucene lookup query.
http://docs.neo4j.org/chunked/milestone/rest-api-indexes.html#rest-api-find-node-by-query
Depending on how you indexed your data (not numerically, as there are issues with the Lucene query syntax parser and numeric searches :( ), in Cypher you can do:
start n=node:myindex('time: [1 TO 1000]') return n order by n.time asc
There are also more graphy ways of doing this, for example by linking the events with a NEXT relationship and returning the head and the next n elements from this list (see the sketch after the links below):
http://docs.neo4j.org/chunked/milestone/cypher-cookbook-newsfeed.html
or to create a tree structure for time:
http://docs.neo4j.org/chunked/milestone/cypher-cookbook-path-tree.html
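A sketch of the NEXT-relationship (linked-list) idea, assuming a head node indexed under the name events_head and NEXT relationships pointing from newer to older events (both names are placeholders; the newsfeed cookbook linked above shows the full pattern):

// newest 10 events in order, starting from the hypothetical head of the list
START head = node:node_auto_index(name = "events_head")
MATCH p = head-[:NEXT*1..10]->event
RETURN event
ORDER BY length(p);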
Yes, it is possible, and there are several different ways to do so.
You could either use a timestamp property and a classic index and sort your result set by that property, or you could create an in-graph time-based index, as described, for example, in Peter's blog post:
http://blog.neo4j.org/2012/02/modeling-multilevel-index-in-neoj4.html
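And a minimal sketch of the first option (timestamp property plus a classic index, sorted in Cypher); the index name 'events' and the range values are placeholders, and the numeric-range caveat from the previous answer applies:

// look up candidate nodes through a legacy index, then sort on the timestamp property
START n = node:events('timestamp:[0 TO 1000]')
RETURN n
ORDER BY n.timestamp DESC
LIMIT 10;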