How to add index to label in DSE graph?

I turned off auto scan for DSE Graph and added indexes on the properties of my vertices and edges. Now all my queries fail with the following error message. For example,
g.V().hasLabel("PERMISSIONS").valueMap()
yields,
Could not find an index on vertices labelled 'PERMISSIONS' to answer the condition: '((label = PERMISSIONS))'. Current indexes are: byName(Secondary)->name. Alternatively if in development enable graph scan by using graph.allow_scan. Graph scan is NOT suitable for anything other than toy graphs.
How do I add an index to a label?

The docs have a section on indexing which will show you how to create an index - http://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/using/indexingTOC.html
It looks like you are trying to return all data for the PERMISSIONS vertex label, though. That would cause a cluster scan. Are you able to filter your query by a property instead of trying to return all data?
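For example, the error message shows that a secondary index byName already exists on the name property, so a query that filters on name can be answered without a scan. This is just a sketch; the value 'admin' is a placeholder:

g.V().has('PERMISSIONS', 'name', 'admin').valueMap()

If you need to filter on a different property, create an index for it first, following the pattern in the docs (the property name someProperty is hypothetical):

schema.vertexLabel('PERMISSIONS').index('bySomeProperty').secondary().by('someProperty').add()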

Related

How to handle supernodes with DSE Graph

I have a simple vertex "url":
schema.vertexLabel('url').partitionKey('url_fingerprint', 'prop1').properties("url_complete").ifNotExists().create()
And an edgeLabel called "links" which connects one url to another.
schema.edgeLabel('links').properties("prop1", 'prop2').connection('url', 'url').ifNotExists().create()
It's possible that one url has millions of incoming links (e.g. the front page of ebay.com from all its subpages).
But that seems to result in really big partitions and crashes of DSE because of wide partitions (from the OpsCenter wide partitions report):
graphdbname.url_e (2284 mb)
How can I avoid that situation? How do I handle these "supernodes"? I've found a "partition" command for labels (see the article at [1]), but it is deprecated and will be removed in DSE 6.0, and the only hint in the release notes is to model the data in another way. But I have no idea how I can do that in this case.
I'm happy about any hint. Thanks!
[1] https://www.experoinc.com/post/dse-graph-partitioning-part-2-taming-your-supernodes
The current recommendation is to use the concept of "bucketing" that drives data model design in the C* world and apply that to the graph by creating an intermediary Vertex that represents groups of links.
2 vertex labels:
URL
URL_Group, with partition key ((url, group)), i.e. a composite primary key with 2 partition key components
2 edges:
URL -> URL_Group
URL_Group <-> URL_Group (replaces the existing self-referencing edge)
Store no more than ~100K url_fingerprints per group; create a new group once ~100K edges exist.
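A rough schema sketch of that model (a sketch only, not the canonical solution: the edge label names in_group and links and the property names are illustrative assumptions):

// property keys backing the partition keys
schema.propertyKey('url').Text().ifNotExists().create()
schema.propertyKey('group_id').Int().ifNotExists().create()
schema.vertexLabel('URL').partitionKey('url').ifNotExists().create()
// composite partition key (url, group_id): each bucket becomes its own partition
schema.vertexLabel('URL_Group').partitionKey('url', 'group_id').ifNotExists().create()
schema.edgeLabel('in_group').connection('URL', 'URL_Group').ifNotExists().create()
// replaces the old url -> url self-referencing edge
schema.edgeLabel('links').connection('URL_Group', 'URL_Group').ifNotExists().create()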
This solution requires bookkeeping to determine when a new group is needed.
This could be done through a simple C* table for fast, easy retrieval.
CREATE TABLE lookup (
    url_fingerprint text,
    group_id int,
    count counter,
    PRIMARY KEY (url_fingerprint, group_id)
) WITH CLUSTERING ORDER BY (group_id DESC);
The group column is named group_id here so it cannot collide with CQL's GROUP keyword, and the CLUSTERING ORDER BY clause keeps the groups in DESC order so the latest group is returned first; if it is not, add an ORDER BY clause to the query.
Prior to writing to the Graph, one would need to read the table to find the latest group.
SELECT url_fingerprint, group_id, count FROM lookup WHERE url_fingerprint = ? LIMIT 1;
If the counter is > ~100K, create a new group (increment group_id by 1). During or after writing a new row to the graph, one would need to increment the counter.
Traversing would require something similar to:
g.V().has(some url).out(URL).out(URL_Group).in(URL)
Where conceptually you would traverse the relationships like URL -> URL_Group -> URL_Group <- URL
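With the illustrative labels from the schema sketch above, that traversal could look roughly like this (someFingerprint is a placeholder for the url value you start from):

// from a given url, through its groups, over the links edges, back to the linked urls
g.V().has('URL', 'url', someFingerprint).out('in_group').out('links').in('in_group')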
The visual model of this type of traversal would look like the following diagram

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph that share the same address with other persons, where the number of persons per address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
At first run I've noticed this error messages:
Could not find an index to answer query clause and graph.allow_scan is disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working.
I'm wondering how I can create an index to be able to run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true as under the covers it is turning on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use an index on the textual fields of your address nodes.
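A sketch of that approach in Gremlin, assuming an address vertex label and the has_address edge from the question (between(3, 6) keeps addresses with an in-degree of 3, 4 or 5):

// addresses shared by 3 to 5 persons, then the persons living at those addresses
g.V().hasLabel('address').where(__.in('has_address').count().is(between(3, 6))).in('has_address').dedup()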

How can I speed up datastax dse gremlin query 'find shortest path' between groups of nodes by adding timestamp index on edge

I would like to speed up the query by adding a timeStamp property index on the edges. But according to the link:
http://www.datastax.com/2016/08/inside-dse-graph-what-powers-the-best-enterprise-graph-database
vertex-centric indexes are specific, but in my query I look for all edges between two groups of nodes, not specific ones.
So how can I speed up the following query using a timeStamp index on the edges?
ids1 = [{'~label=person',member_id=54666,community_id=505443455},{...},{...}]
ids2 = [{'~label=person',member_id=52366,community_id=501423455},{...},{...}]
g.V(ids1).repeat(timeLimit(20000).bothE().otherV().dedup().simplePath()).times(5).emit().where(hasId(within(ids2))).path().range(0,5)
Please help
Can you provide more context? How many ids are you passing in ids1? Do you have an index built on the person vertex label? Have you tried using the .profile() method to see where bottlenecks exist in the traversal?
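For reference, vertex-centric indexes in DSE Graph 5.x are declared per vertex label, edge label and direction, roughly along these lines (the edge label 'links' is a placeholder, since the question does not name its edge labels, and the exact syntax should be checked against the indexing docs):

schema.vertexLabel('person').index('byTimeStampOut').outE('links').by('timeStamp').add()
schema.vertexLabel('person').index('byTimeStampIn').inE('links').by('timeStamp').add()

Such an index only helps when a traversal filters edges by timeStamp while walking out from a known vertex; it does not replace an index on the person vertices themselves.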

DSE Graph Index on Integer Interval

Currently, I have a graph stored through the DSE Graph Engine with 100K nodes. These nodes have label "customer" and a property called "age" which allows integer values. I have indexed this property with the following command:
schema.vertexLabel("customer").index("custByAge").secondary().by("age").add()
I would like to be able to use this index to answer queries that look for customers within a certain age range (e.g. "age" between 10 and 20). However, it doesn't seem like the index I created is actually being used when I query customers by an age interval.
When I submit the following query, a list of vertices is returned in about 40ms, which leads me to believe that the index is being used:
g.V().has('customer','age',15)
But when I submit the following query, the query times out after 30 sec (as I have specified in my configuration):
g.V().has('customer','age',inside(10,20))
Interruption of result iteration
Display stack trace? [yN]
This leads me to believe that the index is not being used for this query. Does that seem right? And if the index is not being used, does anyone have some advice for how I can speed up this query?
EDIT
As suggested by an answer below, I have run .profile on each of the above queries, with the following results (only showing relevant info):
gremlin> g.V().has('customer','age',21).profile()
==>Traversal Metrics
...
index-query 14.333ms
gremlin> g.V().has('customer','age',inside(21,23)).profile()
==>Traversal Metrics
...
index-query 115.055ms
index-query 132.144ms
index-query 132.842ms
>TOTAL 53042.171ms
This leaves me with a few questions:
Does the fact that .profile() returns index-query mean that indexes are being used for my second query?
Why does the second query have 3 index queries, as opposed to 1 for the first query?
All of the index queries for the second query combined total about 400ms. Why is the whole query taking ~50000ms? The .profile() command shows nothing else that takes time except for these index-queries, so where is the extra 50000ms coming from?
Are you using DataStax Studio? If so, you can use the .profile() feature to understand how the index is being engaged.
example .profile() use:
g.V().in().has('name','Julia Child').count().profile()
You want to use a search index for this case; it will be much, much faster.
For example, in KillRVideo:
schema.vertexLabel("movie").index("search").search().by("year").add()
g.V().hasLabel('movie').has('year', gt(2000)).has('year', lte(2017)).profile()
Then from Studio profile() we can see:
SELECT "community_id", "member_id" FROM "killrvideo"."movie_p" WHERE
"solr_query" = '{"q":"*:*", "fq":["year:{2000 TO *}","year:{* TO
2017]"]}' LIMIT ?; with params (java.lang.Integer) 50000
By default, the profiler doesn't show the trace of all operations, so the index-query list you see may be truncated. Modify "max_profile_events" according to this documentation: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/reference/schema/refSchemaConfig.html

How to build a simple inverted index?

I want to build a simple indexing function for a search engine without using any API such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency.
Now, I have several questions:
What kind of data structure is often used for building an inverted index? A multidimensional list?
After building the index, how do I write it to files? What kind of format should the file have? Like a table? Like drawing an index table on paper?
You can see a very simple implementation of an inverted index and search in TinySearchEngine.
For your first question, if you want to build a simple (in-memory) inverted index, the straightforward data structure is a hash map like this:
val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]
or, Java-esque:
HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Posting>>();
The hash map associates each term/word/token with a list of Postings. A Posting is just an object that represents an occurrence of a word inside a document:
case class Posting(docId:Int, var termFrequency:Int)
Indexing a new document is just a matter of tokenizing it (separating it into tokens/words) and, for each token, inserting a new Posting into the correct list of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency. There are other ways of doing this. For in-memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert Postings once with the correct termFrequency instead of updating it every time.
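A minimal sketch of that indexing loop, here in Groovy for compactness (the structure mirrors the hash map and Posting shown above; the tokenizer is deliberately naive):

class Posting {
    int docId
    int termFrequency
}

def invertedIndex = [:].withDefault { [] }   // term -> list of Postings

def indexDocument = { int docId, String text ->
    text.toLowerCase().split(/\W+/).findAll { it }.each { token ->
        def posting = invertedIndex[token].find { it.docId == docId }
        if (posting) {
            posting.termFrequency++                                   // term seen again in this doc
        } else {
            invertedIndex[token] << new Posting(docId: docId, termFrequency: 1)
        }
    }
}

indexDocument(1, 'the quick brown fox')
indexDocument(2, 'the lazy dog jumps over the quick dog')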
Regarding your second question, there are normally two cases:
(1) you have an (almost) immutable index. You index all your data once, and if you have new data you can just reindex. There is no need for real-time indexing or indexing many times an hour, for example.
(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.
For case (1), you can have at least 2 files:
1 - The Inverted Index file. It lists for each term all Postings (docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data.
Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
...
TermN<docId5,termFreq><docId7,termFreq>
2 - The offset file. It stores, for each term, the offset at which its inverted list starts in the inverted index file. Here I'm representing the offset in characters, but you'll normally store binary data, so the offset will be in bytes. This file can be loaded into memory at startup time. When you need to look up a term's inverted list, you look up its offset and read the inverted list from the file.
Term1 -> 0
Term2 -> 126
Term3 -> 222
....
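A small sketch of using the offset file at query time, in Groovy and assuming the plain-text layout shown above with one term's inverted list per line (the file names are placeholders for the two files described):

// load the term -> offset map once at startup
def offsets = [:]
new File('index.offsets').eachLine { line ->
    def (term, offset) = line.split('->').collect { it.trim() }
    offsets[term] = offset as Long
}

// to answer a query for a term, seek straight to its inverted list
def postingsFile = new RandomAccessFile('inverted.index', 'r')
postingsFile.seek(offsets['Term2'])
def invertedList = postingsFile.readLine()   // e.g. Term2<docId3,termFreq>...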
Along with these 2 files, you can (and generally will) have file(s) to store each term's IDF and each document's norm.
For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it.
The file format can be the same as explained above. The main difference is that when you index new documents, systems like Lucene don't rebuild the index from scratch; they just create a new index containing only the new documents. So every time you have to index something, you do it in a new, separate index.
To perform a query over these split indexes, you can run the query against each index (in parallel) and merge the results together before returning them to the user.
Lucene calls these "little" indexes segments.
The obvious concern here is that you'll get a lot of little segments very quickly. To avoid this, you'll need a policy for merging segments into larger ones. For example, if you have more than N segments, you can decide to merge all segments smaller than 10 KB together.
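A toy version of such a merge policy might look like this (Groovy; the segments directory and the mergeSegments helper are hypothetical):

def MAX_SEGMENTS = 10
def SMALL_SEGMENT_BYTES = 10 * 1024

def segments = new File('segments').listFiles().toList()
if (segments.size() > MAX_SEGMENTS) {
    def small = segments.findAll { it.length() < SMALL_SEGMENT_BYTES }
    mergeSegments(small)   // hypothetical helper: merge the small segments' posting lists into one new segment
}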