DSE Graph Index on Integer Interval

Currently, I have a graph stored through the DSE Graph Engine with 100K nodes. These nodes have label "customer" and a property called "age" which allows integer values. I have indexed this property with the following command:
schema.vertexLabel("customer").index("custByAge").secondary().by("age").add()
I would like to be able to use this index to answer queries that look for customers within a certain age range (e.g. "age" between 10 and 20). However, it doesn't seem like the index I created is actually being used when I query customers by an age interval.
When I submit the following query, a list of vertices is returned in about 40ms, which leads me to believe that the index is being used:
g.V().has('customer','age',15)
But when I submit the following query, the query times out after 30 sec (as I have specified in my configuration):
g.V().has('customer','age',inside(10,20))
Interruption of result iteration
Display stack trace? [yN]
This leads me to believe that the index is not being used for this query. Does that seem right? And if the index is not being used, does anyone have some advice for how I can speed up this query?
EDIT
As suggested by an answer below, I have run .profile on each of the above queries, with the following results (only showing relevant info):
gremlin> g.V().has('customer','age',21).profile()
==>Traversal Metrics
...
index-query 14.333ms
gremlin> g.V().has('customer','age',inside(21,23)).profile()
==>Traversal Metrics
...
index-query 115.055ms
index-query 132.144ms
index-query 132.842ms
>TOTAL 53042.171ms
This leaves me with a few questions:
Does the fact that .profile() returns index-query mean that indexes are being used for my second query?
Why does the second query have 3 index queries, as opposed to 1 for the first query?
All of the index queries combined, for the second query, total roughly 400ms. Why does the whole query take ~50000ms? The .profile() output shows nothing else that takes time except these index-queries, so where is the extra 50000ms coming from?

Are you using DataStax Studio? If so, you can use the .profile() feature to understand how the index is being engaged.
Example .profile() use:
g.V().in().has('name','Julia Child').count().profile()

You want to use a search index for this case; it will be much, much faster.
For example, in KillRVideo:
schema.vertexLabel("movie").index("search").search().by("year").add()
g.V().hasLabel('movie').has('year', gt(2000)).has('year', lte(2017)).profile()
Then from Studio profile() we can see:
SELECT "community_id", "member_id" FROM "killrvideo"."movie_p" WHERE "solr_query" = '{"q":"*:*", "fq":["year:{2000 TO *}","year:{* TO 2017]"]}' LIMIT ?; with params (java.lang.Integer) 50000
By default, the profiler doesn't show the trace of all operations, so the index-query list you see may be truncated. Modify "max_profile_events" according to this documentation: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/reference/schema/refSchemaConfig.html

Related

ST_INTERSECTS issue BigQuery Waze

I'm using the following code to query a dataset based on a polygon:
SELECT *
FROM `waze-public-dataset.partner_name.view_jams_clustered`
WHERE ST_INTERSECTS(geo, ST_GEOGFROMTEXT("POLYGON((-99.54913355822276 27.60526592074579,-99.52673174853038 27.60526592074579,-99.52673174853038 27.590813604291416,-99.54913355822276 27.590813604291416,-99.54913355822276 27.60526592074579))")) IS TRUE
The validation message says that "This query will process 1 TB when run".
It seems like there's no problem. However, when I remove the ST_INTERSECTS filter, the validation message says exactly the same thing: "This query will process 1 TB when run", the same 1 TB, so I'm guessing that the ST_INTERSECTS function is not working.
When you actually run this query, the amount charged is usually much less, as expected for a spatially clustered table. I ran a select count(*) ... query against one partner dataset, and while the editor UI announced 9 TB before running the query, the query reported around 150 MB processed after running.
The savings come from the clustered table, but the specific clusters that intersect the polygon used in the filter depend on the actual data in the table and how the clusters were created. The clusters, and thus the cost of the query, can only be determined when the query runs. The editor UI in this case shows the maximum possible cost of the query.

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph who share the same address with other persons, where the number of persons at the address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
At first run I've noticed this error messages:
Could not find an index to answer query clause and graph.allow_scan is disabled: ((label = person))
I've enabled graph.allow_scan=true and now it works.
I'm wondering how I can create an index so that I can run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true, because under the covers it turns on ALLOW FILTERING on the CQL queries. This causes a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable it in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use an index on the textual fields of your address nodes.
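The reification idea can be sketched outside the database. The sketch below is illustrative only (person and address identifiers are made up); it counts, for each address node, the number of distinct persons pointing at it, then keeps addresses whose indegree falls in an inclusive 3..5 band:

```python
from collections import defaultdict

# Hypothetical person -> address edges; names and data are illustrative only.
has_address = [
    ("p1", "a1"), ("p2", "a1"), ("p3", "a1"),            # a1 shared by 3
    ("p4", "a2"), ("p5", "a2"),                          # a2 shared by 2
    ("p6", "a3"), ("p7", "a3"), ("p8", "a3"),
    ("p9", "a3"), ("p10", "a3"), ("p11", "a3"),          # a3 shared by 6
]

def shared_addresses(edges, lo=3, hi=5):
    """Return {address: [persons]} for addresses with lo..hi distinct persons."""
    by_addr = defaultdict(set)
    for person, addr in edges:
        by_addr[addr].add(person)
    return {a: sorted(ps) for a, ps in by_addr.items() if lo <= len(ps) <= hi}

print(shared_addresses(has_address))  # a1 qualifies; a2 has too few, a3 too many
```

Note the bounds here are inclusive; the Gremlin inside(3,5) predicate in the question is exclusive, so adjust accordingly.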

How can I speed up datastax dse gremlin query 'find shortest path' between groups of nodes by adding timestamp index on edge

I would like to speed up the query by adding a timeStamp property index on the edges. But according to this link:
http://www.datastax.com/2016/08/inside-dse-graph-what-powers-the-best-enterprise-graph-database
vertex-centric indexes are specific, whereas in my query I look for all edges between two groups of nodes, not specific ones.
So, how can I speed up the following query using a timeStamp index on the edges?
ids1 = [{'~label=person',member_id=54666,community_id=505443455},{...},{...}]
ids2 = [{'~label=person',member_id=52366,community_id=501423455},{...},{...}]
g.V(ids1).repeat(timeLimit(20000).bothE().otherV().dedup().simplePath()).times(5).emit().where(hasId(within(ids2))).path().range(0,5)
Please help
Can you provide more context? How many ids are you passing in ids1? Do you have an index built on the person vertex label? Have you tried using the .profile() method to see where bottlenecks exist in the traversal?

What index would speed up my XQuery in X-Hive / Documentum xDB?

I have approx 2500 documents in my test database, and searching the XPath /path/to/@attribute takes approximately 2.4 seconds. Doing distinct-values(/path/to/@attribute) takes 3.0 seconds.
I've been able to speed up queries on /path/to[@attribute='value'] to hundreds or tens of milliseconds by adding a Path value index on /path/to[@attribute<STRING>], but no index I can think of gets picked up for the more general query.
Anybody know what indexes I should be using?
The index you propose is the correct one (/path/to[@attribute]), but unfortunately the xDB optimizer currently doesn't recognize this specific case, since the 'target node' stored in the index is always an element and not an attribute. If /path/to/@attribute has few results, then you can optimize this by slightly modifying your query to: distinct-values(/path/to[@attribute]/@attribute). With this query the optimizer recognizes that there is an index it can use to get to the 'to' element, but it still has to access the target document to retrieve the attribute for the @attribute step. This is precisely why it will only benefit cases where there are few hits: each hit will likely access a different data page.
What you also can do is access the keys in the index directly through the API: XhiveIndexIf.getKeys(). This will be very fast, but clearly this is not very user friendly (and should be done by the optimizer instead).
Clearly the optimizer could handle this. I will add it to the bug tracker.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, partial results cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can share concrete data and the SQL query?
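The point about aggregates can be demonstrated outside MySQL: a blocking operator such as a sort (or a GROUP BY hash aggregate) must consume its whole input before it can emit its first row, whereas a streaming filter can emit rows as it reads. A minimal Python sketch (the simulated table is made up):

```python
def rows():
    """Simulated table scan that counts how many rows have been read."""
    rows.read = 0
    for v in [5, 3, 9, 1, 7]:
        rows.read += 1
        yield v

# A streaming filter emits its first row after reading only part of the input.
scan = rows()
first_match = next(v for v in scan if v > 2)
reads_for_filter = rows.read          # stopped early

# A blocking sort (like ORDER BY / GROUP BY) must read everything first.
scan = rows()
first_sorted = sorted(scan)[0]
reads_for_sort = rows.read            # consumed all 5 rows

print(reads_for_filter, reads_for_sort)
```

The filter touched one row before producing output; the sort had to read all five, which is why a long-running aggregate query shows nothing until it finishes.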
One thing you may consider is sampling your tables down. This is good practice in data analysis in general, to speed up your iteration when you're writing code.
For example, suppose you have table-creation privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out, and you can inner join your sample_table against the other tables to improve speed of testing / query results. Thanks to the sampling your query results should be roughly representative of what you will get. Note, the number you're modding with has to be prime otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (.0987% to be exact).
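A quick way to sanity-check the mod trick, using SQLite as a stand-in (table and column names follow the example above; the row count is made up). SQLite spells the operator % rather than mod():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (unique_id INTEGER PRIMARY KEY, data_value TEXT)")
conn.executemany(
    "INSERT INTO X (unique_id, data_value) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 10131)],   # 10130 rows
)

# Same predicate as above: keep rows whose id mod a prime equals 1.
conn.execute("""
    CREATE TABLE sample_table AS
    SELECT unique_id, data_value
    FROM X
    WHERE unique_id % 1013 = 1
""")

total = conn.execute("SELECT COUNT(*) FROM X").fetchone()[0]
sample = conn.execute("SELECT COUNT(*) FROM sample_table").fetchone()[0]
print(sample, total, sample / total)  # roughly 1/1013, i.e. about 0.0987%
```

With sequential ids the predicate keeps every 1013th row exactly, so the sample size is predictable; with arbitrary numeric keys it behaves like the rough 0.1% sample described above.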
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result as an attribute of the database handle rather than the default mysql_store_result. This is true for the Perl and Java interfaces; I think in the C interface you have to use an unbuffered version of the function that executes the query.
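The buffering difference can be imitated with any DB-API driver: instead of fetchall() (which buffers the whole result client-side, analogous to mysql_store_result), pull batches with fetchmany() and act on each batch as it arrives (analogous to mysql_use_result). A sketch with SQLite as a stand-in, with made-up table contents:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10000)])

cur = conn.execute("SELECT id FROM t")   # sqlite3 cursors are lazy by default

seen = 0
while True:
    batch = cur.fetchmany(1000)          # pull one batch at a time
    if not batch:
        break
    seen += len(batch)
    # ...inspect partial results here and bail out early if they suffice...
    if seen >= 3000:                     # e.g. the first few batches were enough
        break

print(seen)  # stopped after 3000 rows instead of buffering all 10000
```

This only helps for queries that stream (plain SELECTs); as noted above, a GROUP BY / ORDER BY query still produces nothing until the server has finished.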