Difficulties using JanusGraph indexes with Gremlin UNION - indexing

I have an issue using the union() step: in this first example the query works but gives no result:
g.V().has('name','Barack Obama').union(has('name','Michelle Obama'))
In this second example, however, the Gremlin compiler replies that it cannot use indexes:
g.V().union(has('name','Barack Obama'), g.V().has('name','Michelle Obama'))
Could not find a suitable index to answer graph query and graph scans are disabled: [()]:VERTEX
Am I writing this type of query incorrectly, or does JanusGraph have some limitations?

Not sure about the error message; it's probably related to the fact that you are trying to start a new traversal (g.V()) inside the union() step.
I think this is the query you are trying to run:
g.V().union(has('name','Barack Obama'), has('name','Michelle Obama'))
or even better:
g.V().has('name', within('Barack Obama', 'Michelle Obama'))
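A quick way to confirm that the rewritten traversal actually hits the index is TinkerPop's profile() step (a hedged suggestion; the exact metrics and index annotations in the output vary by JanusGraph version):

// profile() replaces the normal output with per-step execution metrics,
// where an index-backed lookup shows up instead of a full graph scan.
g.V().has('name', within('Barack Obama', 'Michelle Obama')).profile()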

Related

Need help converting a Neo4j Cypher script to Gremlin

I can't figure out how to rewrite my Cypher script in Gremlin.
First we used the .NET Neo4j client to connect to our Neo4j database and run Cypher queries on it. Then we decided to add an abstraction layer and connect to a Gremlin server instead (which, for now, hosts the same Neo4j database). So now I need to translate our queries from Cypher to Gremlin, and I am finding it rather difficult.
Here's one of them:
MATCH (pc:ProductCategory)-[:HasRootCategory]->(r:RootCategory)
WHERE NOT (:ProductCategory)-[]->(pc)
AND pc.Id = r.RootId
RETURN pc;
One of my failed attempts:
g.V().match(as("pc").out("HasRootCategory").as("r"),as("pc").in().has('label', 'ProductCategory').count().is(0))).select("pc", "r").where("pc.Id", eq("r.RootId")).select("pc")
I found an example on Stack Overflow using this 'match(as' construct, but it must be deprecated or something, because I'm getting an error. Also, I'm not sure how to compare properties with different names on nodes with different labels (I'm sure the 'where' is wrong...).
Any help would be appreciated.
The following traversal should be equivalent:
g.V().hasLabel("ProductCategory").as("pc").
not(__.in().hasLabel("ProductCategory")).
out("HasRootCategory").as("r").
where("pc", eq("r")).
by("Id").
by("RootId").
select("pc")
Since you don't really need the r label, the query can be tweaked a bit:
g.V().hasLabel("ProductCategory").as("pc").
not(__.in().hasLabel("ProductCategory")).
filter(out("HasRootCategory").
where(eq("pc")).
by("Id").
by("RootId"))
One last thing to mention: if a ProductCategory vertex can only be connected to another ProductCategory vertex by one (or more) specific edge labels that lead nowhere else, it would be better to do:
g.V().hasLabel("ProductCategory").as("pc").
not(inE("KnownLabelBetweenCategories")).
filter(out("HasRootCategory").
where(eq("pc")).
by("Id").
by("RootId"))
On a different note, match() is not deprecated. I guess you tried to run your traversal in Groovy and it failed because you didn't use __.as() (as, among others, is a reserved keyword in Groovy).
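For completeness, here is a match()-based version of the original attempt that should at least parse in the Gremlin console (a hedged sketch, untested against your data; note the __.as() prefix):

g.V().hasLabel("ProductCategory").
  match(__.as("pc").out("HasRootCategory").as("r"),
        __.as("pc").not(__.in().hasLabel("ProductCategory"))).
  where("pc", eq("r")).by("Id").by("RootId").
  select("pc")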

How to search SQL when words contain errors

I would like to execute a SQL command. However, the search value may contain typos. For example, the correct command should be
select id from my_table where name = 'Tommy'
It would return 1.
However, if someone executes the following incorrect command:
select id from my_table where name = 'Tomyy'
How can I change the command so that it still returns 1?
Thanks a lot.
There are many ways to tackle this, but please keep in mind it isn't the easiest of tasks. What you're looking for is a fuzzy search algorithm.
This should get you started: Fuzzy searches in SQL Server (Redgate)
Code project also has some interesting options here: Implementing phonetic name searches
If you're looking for an easier but more barebones solution, you should look into using SOUNDEX or DIFFERENCE (assuming your DBMS is SQL Server). I've been playing a bit with DIFFERENCE and it's pretty cool what it can do out of the box.
Try this
select id from my_table where SOUNDEX(name) = SOUNDEX('Tomyy')
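If SOUNDEX alone is too strict, DIFFERENCE (also SQL Server) rates the similarity of the two values' SOUNDEX codes on a scale from 0 (no match) to 4 (strong match), which lets you tune a threshold. A minimal sketch, assuming the same table as above:

-- DIFFERENCE returns 0-4; >= 3 is a reasonable starting threshold
select id from my_table where DIFFERENCE(name, 'Tomyy') >= 3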

Why are or() queries not using the index in DataStax DSE 5.0.x Graph?

I have created an index on the uuid property of the User vertex label.
If I do:
schema.vertexLabel("User").describe()
I get:
schema.vertexLabel("User").index("byUuid").materialized().by("uuid").add()
When I am running:
g.V().hasLabel("User").has("uuid","oneUuid")
The index is picked up properly.
but when I do the following:
g.V().or(__.hasLabel("User").has("uuid","oneUuid"), __.hasLabel("User").has("uuid","anotherUUID"))
I am getting:
org.apache.tinkerpop.gremlin.driver.exception.ResponseException: Could not find an index to answer query clause and graph.allow_scan is disabled:
Thanks!
or() is not easily optimizable, as you can do much more complicated things like:
g.V().or(
hasLabel("User").has("uuid","oneUuid"),
hasLabel("User").has("name","bob"))
...where one or more conditions could not be answered by an index lookup. It is doable and will likely be done in future versions, but AFAIK none of the currently available graph databases tries to optimize the OrStep.
Anyway, your sample query can easily be rewritten so that it actually uses the index:
g.V().hasLabel("User").has("uuid", within("oneUuid", "anotherUUID"))

Using OrientDB 2.0.2, a LUCENE query does not seem to respect the "LIMIT n" operator

Using LUCENE inside of OrientDB seems to work fine, but there are very many LUCENE-specific query parameters that I would ordinarily pass directly to LUCENE (normally through Solr). The first one I need to pass is the result limiter such as SELECT * FROM V WHERE field LUCENE "Value" LIMIT 10.
If I use a value that only returns a few rows, I get the performance I expect, but if it matches a lot of rows, I need the limiter to get the result to return quickly. Otherwise I get a message in the console stating that "The query would return more than 50000 records. Please consider using an index."
How do I pass additional LUCENE query parameters?
There's a known issue with the query parser which is in the process of being fixed; until then, the following workaround should help:
SELECT FROM (
SELECT * FROM V WHERE Field LUCENE 'Value'
) LIMIT 10
Alternatively, depending on which client libraries you're using you may be able to set a limit using the out-of-band query settings.
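For instance, with the OrientDB Java document API the limit can be passed out of band as the second constructor argument of OSQLSynchQuery, so it never has to go through the Lucene clause at all. A hedged sketch (connection URL and credentials are placeholders):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;
import java.util.List;

// The second constructor argument (10) limits the result set
// outside of the Lucene query itself.
ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/mydb").open("admin", "admin");
List<ODocument> results = db.query(
    new OSQLSynchQuery<ODocument>("SELECT * FROM V WHERE Field LUCENE 'Value'", 10));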

Performance of SQL comparison using substring vs like with wildcard

I am working on a join condition between 2 tables where one of the columns to match on is a concatenation of values. I need to join columnA from tableA to the first 2 characters of columnB from tableB.
I have developed 2 different statements to handle this and I have tried to analyze the performance of each method.
Method 1:
ON tB.columnB like tA.columnA || '%'
Method 2:
ON substr(tB.columnB,1,2) = tA.columnA
The query execution plan has a lot less steps using Method 1 compared to Method 2, however, it looks like Method 2 executes much faster. Also, the execution plan shows a recommended index for Method 2 that could improve its performance.
I am running this on an IBM iSeries, though I would be interested in answers in a general sense to learn more about SQL query optimization.
Does it make sense that Method 2 would execute faster?
This SO question is similar, but it looks like no one provided any concrete answers on the performance difference between these approaches: T-SQL speed comparison between LEFT() vs. LIKE operator.
PS: The table design that requires this type of join is not something that I can get changed at this time. I realize it would be preferable to have the fields that hold different types of data separated.
I ran the following in the SQL Advisor in IBM Data Studio on one of the tables in my DB2 LUW 10.1 database:
SELECT *
FROM PDM.DB30
WHERE DB30_SYSTEM_ID = 'XXX'
AND DB30_VERSION_ID = 'YYY'
AND SUBSTR(DB30_REL_TABLE_NM, 1, 4) = 'ZZZZ'
and
SELECT *
FROM PDM.DB30
WHERE DB30_SYSTEM_ID = 'XXX'
AND DB30_VERSION_ID = 'YYY'
AND DB30_REL_TABLE_NM LIKE 'ZZZZ%'
They both had the exact same access path utilizing the same index, the same estimated IO cost and the same estimated cardinality, the only difference being the estimated total CPU cost for the LIKE was 178,343.75 while the SUBSTR was 197,518.48 (~10% difference).
The cumulative total cost for both were the same though, so this difference is negligible as per the advisor.
Yes, Method 2 would be faster. LIKE is not as efficient a function.
To compare the performance of various techniques, try using Visual Explain. You will find it buried in System i Navigator: under your system connection, expand Databases, then click on your RDB name. In the lower right pane you can then click on the option to Run an SQL Script. Enter your SELECT statement and choose the menu option for Visual Explain or Run and Explain. Visual Explain will break down the execution plan for your statement and show you the cost for each part as estimated on your tables with the indexes available.
You can actually test this with real examples in your database.
LIKE always performed better in my runs:
select count(*) from u_log where log_text like 'AUT%';
1 row(s) returned : 90ms taken
select count(*) from u_log where substr(log_text,1,3)='AUT';
1 row(s) returned : 493ms taken
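The gap is consistent with sargability: a LIKE pattern with no leading wildcard can be answered by an index range scan, while wrapping the column in substr() generally forces the DBMS to evaluate the function row by row. A hedged sketch (index name is made up; details vary by DBMS):

-- With this index, the optimizer can rewrite
--   log_text LIKE 'AUT%'
-- into a range scan, roughly 'AUT' <= log_text < 'AUU'
create index idx_u_log_text on u_log (log_text)

-- substr(log_text,1,3) = 'AUT' hides the column inside a function,
-- so a plain index on log_text typically cannot be used for it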
I found this reference in an IBM Redbook related to SQL performance. It sounds like the SUBSTR scalar function can be handled in an optimized manner by an iSeries. (SQE and CQE below are the SQL Query Engine and the older Classic Query Engine in DB2 for i.)
"If you search for the first character and want to use the SQE instead of the CQE, you can use the scalar function substring on the left side of the equal sign. If you have to search for additional characters in the string, you can additionally use the scalar function POSSTR. By splitting the LIKE predicate into several scalar functions, you can affect the query optimizer to use the SQE."
http://publib-b.boulder.ibm.com/abstracts/sg246654.html?Open
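To illustrate the Redbook's suggestion, here is a hedged sketch (table and column names are made up) of splitting a LIKE predicate into SUBSTR and POSSTR:

-- Original predicate:
--   where col like 'AB%CD%'
-- Split into scalar functions so the optimizer can route the query to the SQE:
select *
from my_table
where substr(col, 1, 2) = 'AB'
and posstr(col, 'CD') > 0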