Neo4j Match Node Property OR Relationship Property

I'm trying to write a query that will return nodes that either match a node property or a relationship property.
For instance, I want all nodes where the name property is George OR if the relationship property status is "good". I have two queries that will get the nodes for each of these:
MATCH (n) where n.name = 'George' return n
MATCH (n)-[r]-() where r.status = 'good' return n
Is there a single query I could write to get these combined results? I thought I could use this optional query (below), but I seem to have misunderstood the OPTIONAL MATCH clause because I'm only getting nodes from the first query.
MATCH (n) where n.name = 'George'
OPTIONAL MATCH (n)-[r]-() where r.status = 'good' return distinct n

By the time the OPTIONAL MATCH happens, the only n nodes still around to make the optional match from are the ones that already matched the first criterion. You can do
MATCH (n)
WHERE n.name = 'George' OR (n)-[{ status:"good" }]->()
RETURN n
but for larger graphs remember that this will not make efficient use of indices.
Another way would be
MATCH (n {name:"George"})
RETURN n
UNION
MATCH (n)-[{status:"good"}]->()
RETURN n
This should do better with indices for the first match, assuming you use a label and have the relevant index set up (but the second part would still potentially be very inefficient).
Edit
Re the comment: relationship indexing would indeed make that part faster, but to my mind it is better to say the query is slow because the pattern is underdetermined. The second match pattern does something like
bind every node in the graph to (n)
get all outgoing relationships (regardless of type) from (n)
check relationship for status="good"
You could improve performance with relationship indexing, but since a relationship exists only between the two nodes it relates, you can think of it instead as indexed by those nodes. That is, fix the first bullet point by excluding nodes whose relationships are not relevant. The two match clauses could look like
MATCH (n:Person {name:"George"})
// add label to use index
MATCH (n:Person)-[{status:"good"}]->()
// add label to limit (n) - no indexing, but better than unlimited (n)
MATCH (n:Person {name:"Curious"})-[{status:"good"}]->()
// add label to use index - now the relationships are sort-of-indexed
and/or type the relationship
MATCH (n)-[:REL {status:"good"}]->() // add type to speed up relationship retrieval
In fact, with anonymous relationships and a single relationship property, it would probably make sense (bullet point three) to make the property the type, so
MATCH (n)-[:GOOD]->() // absent type, it would make sense to use the property as type instead
Your actual queries may look very different, and your question wasn't really about query performance at all :) oh well.

OrientDB graph query that match specific relationship

I am developing an application using OrientDB as a database. The database is already filled, and now I need to make some queries to obtain specific information.
I have 3 classes and 3 edges to be concerned with. What I need to do is query the database to see if some specific relationship exists. The relationship is like this:
ParlamentarVertex --Realiza> TransacaoVertex --FornecidaPor> EmpresaFornecedoraVertex AND ParlamentarVertex --SocioDe> EmpresaFornecedoraVertex
The names ending in Vertex are vertices, of course, and the arrows are the edges between the two vertices.
I've tried to do this:
SELECT TxNomeParlamentar, SgPartido, SgUF FROM Parlamentar where ...
SELECT EXPAND( out('RealizaTransacao').out('FornecidaPor') ) FROM Parlamentar
But I do not know how to specify the relationships after the where clause.
I've also tried to use match
MATCH {class: Parlamentar, as: p} -Realiza-> {as:realiza}
But I am not sure how to specify the AND clause that is really important for my query.
Does anyone have some tip, so I can go in the right direction?
Thanks in advance!
EDIT 1
I've managed to use the query below:
SELECT EXPAND( out('RealizaTransacao').out('FornecidaPor').in('SocioDe') ) FROM Parlamentar
It almost works, but returns some relationships incorrectly. It looks like a join where I did not bind the PK and FK.
The easiest thing here is to use a MATCH as follows:
MATCH
{class:ParlamentarVertex, as:p} -Realiza-> {class:TransacaoVertex, as:t}
-FornecidaPor-> {class:EmpresaFornecedoraVertex, as:e},
{as:p} -SocioDe-> {as:e}
RETURN p, p.TxNomeParlamentar, p.SgPartido, p.SgUF, t, e
(or RETURN whatever you need)
As you can see, the AND is represented as the addition of multiple patterns, separated by a comma.

Schema index for words in a string

I have a large number of nodes which have the property text containing a string.
I want to find all nodes whose text contains a given string (exact match). This can be done using the CONTAINS operator.
MATCH (n)
WHERE n.text CONTAINS 'keyword'
RETURN n
Edit: I am looking for all nodes n where n.text contains the substring 'keyword'. E.g. n.text = 'This is a keyword'.
To speed this up I want to create an index for each word. Is this possible using the new Schema Indexes?
(Alternatively this could be done using a legacy index and adding each node to this index but I would prefer using a schema index)
Absolutely. Given that you are looking for an exact match you can use a schema index. Judging from your question you probably know this, but to create the index you will need to assign your nodes a label and then create the index on that label.
CREATE INDEX ON :MyLabel(text)
Then at query time the Cypher execution engine will automatically use this index with the following query
MATCH (n:MyLabel { text : 'keyword' })
RETURN n
This will use the schema index to look up the node with label MyLabel and property text with value keyword. Note that this is an exact match of the complete value of the property.
To force Neo4j to use a particular index you can use index hints
MATCH (n:MyLabel)
USING INDEX n:MyLabel(text)
WHERE n.text = 'keyword'
RETURN n
EDIT
On re-reading your question, I am thinking you are not actually looking for a full exact match but rather an exact match on the keyword parameter within the text field. If so, then... no, you cannot yet use schema indexes. Quoting Use index with STARTS WITH in the Neo4j manual:
The similar operators ENDS WITH and CONTAINS cannot currently be solved using indexes.
If I understand your question correctly, a legacy index would accomplish exactly what you're looking to do. If you don't want to have to maintain the index for each node you create/delete/update, you can use auto indexing (http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/).
If you're looking to only use schema indexing, another approach would be to store each keyword as a separate node. Then you could use a schema index for finding relevant keyword nodes which then map to the node they exist on. Just a thought.
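A sketch of that keyword-as-node idea in Cypher (the :Keyword label, word property, and HAS_KEYWORD relationship type are hypothetical names, not anything from the question):

```
// one (:Keyword) node per distinct word, linked to every node whose text contains it
CREATE INDEX ON :Keyword(word)

// exact-match schema index seek on the keyword, then traverse to the text nodes
MATCH (k:Keyword {word: 'keyword'})<-[:HAS_KEYWORD]-(n)
RETURN n
```

The index lookup replaces the CONTAINS scan, at the cost of keeping the keyword nodes in sync whenever a text property changes.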

SQL Query Optimisation (Direction of Condition Evaluation)

Let's say I have a dictionary of 26000 words, 1000 words per letter of the alphabet.
If I want to find all the words that have an 'e' in them, I write:
SELECT *
FROM dict
WHERE word LIKE '%e%';
If I wanted to reduce that to only the words beginning with 'a' I could change the like condition or I could do this:
SELECT *
FROM dict
WHERE word LIKE '%e%'
AND id < 1000;
Lots of words have the letter 'e' in them and so would return true, only to fail the second requirement if the conditions are evaluated left to right; I would expect faster results if the conditions were evaluated from right to left.
My question is: would it be better to have the id < 1000 as the first or second condition, or does this depend on the type of database?
The location of the condition is irrelevant, the same number of scans (if applicable) will be required. They are not parsed in order -- the optimizer determines what is applied, and when, based on table statistics and indexes (if any exist). Those statistics change, and can become out of date (which is why maintenance is important).
It would be bad to assume id < 1000 to be the equivalent of
SELECT * FROM dict WHERE word LIKE 'a%'.
If you designed your database this way it would violate First Normal Form (1NF), specifically: there's no top-to-bottom ordering to the rows.
Technically there isn't a way to ensure this ordering is valid, especially if you wanted to add a word starting with 'A' after you set up your initial state.
One of the key design principles of modern relational database management systems is that you, the user, have no true control or say over how the data is actually being stored on the hard drive by the RDBMS. This means that you cannot assume that the data is (a) stored in alphabetical order on the drive, or (b) that when you retrieve the data, it will be retrieved in alphabetical order. The only way to be absolutely 100% sure that you are getting the data you want is to spell out the way you want it, and anything else is an assumption that some day may blow up in your face.
Why does this matter? Because your query assumes that the data you'll be getting will be in alphabetical order, starting with "A" and going up. (And that assumes consistent case--what about "A" vs "a"? Anything with leading spaces or numbers? Different systems handle different data differently...) Fixing this is simple enough, add an ORDER BY clause, such as:
SELECT * FROM dict WHERE word LIKE '%e%' AND id < 1000 ORDER BY word;
Of course, if you have more than 1000 words beginning with "A" and containing "e", you're in trouble... and if you have fewer than 1000, you end up with a bunch of "B" words. Try something like:
SELECT * FROM dict WHERE LEFT(word, 1) = 'A' AND word LIKE '%e%';
Depending on your RDBMS and any indexing you have on the table, the system could first identify all "A" words, and then run the "contains e" check on only them.
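To make that concrete, here is a small sketch using Python's built-in sqlite3 (the table and words are made up, and SQLite has no LEFT() function, so substr(word, 1, 1) stands in for it):

```python
import sqlite3

# toy stand-in for the 26,000-word dict table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT)")
con.executemany(
    "INSERT INTO dict (word) VALUES (?)",
    [("apple",), ("angle",), ("acorn",), ("berry",), ("eagle",)],
)

# restrict to 'a' words by prefix, then check for a contained 'e'
rows = con.execute(
    "SELECT word FROM dict "
    "WHERE substr(word, 1, 1) = 'a' AND word LIKE '%e%'"
).fetchall()
print([w for (w,) in rows])  # ['apple', 'angle']
```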
Try switching your where clause conditions around and then compare the execution plans.
This will show you the difference, if any (I would guess they will be identical, in this case)
The bottom line is, most of the time it makes no difference.
However it can change the execution plan.
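For instance, with SQLite (used here only as an example; your RDBMS may behave differently) you can fetch both plans from Python and see that swapping the conditions changes nothing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT)")

# the same two predicates, in both orders
q1 = "SELECT * FROM dict WHERE word LIKE '%e%' AND id < 1000"
q2 = "SELECT * FROM dict WHERE id < 1000 AND word LIKE '%e%'"

plan1 = con.execute("EXPLAIN QUERY PLAN " + q1).fetchall()
plan2 = con.execute("EXPLAIN QUERY PLAN " + q2).fetchall()
print(plan1 == plan2)  # True: the optimizer chooses the same plan either way
```

Here the optimizer uses the integer primary key for id < 1000 in both cases, regardless of where that condition appears in the WHERE clause.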

Django query for large number of relationships

I have Django models setup in the following manner:
model A has a one-to-many relationship to model B
each record in A has between 3,000 to 15,000 records in B
What is the best way to construct a query that will retrieve the newest (greatest pk) record in B that corresponds to a record in A for each record in A? Is this something that I must use SQL for in lieu of the Django ORM?
Create a helper function for safely extracting the 'top' item from any queryset. I use this all over the place in my own Django apps.
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    # Extracts a single element collection w/ top item
    result = queryset[0:1]
    # Return that element or None if there weren't any matches
    return result[0] if result else None
This uses a bit of a trick w/ the slice operator to add a limit clause onto your SQL.
Now use this function anywhere you need to get the 'top' item of a query set. In this case, you want to get the top B item for a given A where the B's are sorted by descending pk, as such:
latest = top_or_none(B.objects.filter(a=my_a).order_by('-pk'))
There's also the recently added 'Max' function in Django Aggregation which could help you get the max pk, but I don't like that solution in this case since it adds complexity.
P.S. I don't really like relying on the 'pk' field for this type of query as some RDBMSs don't guarantee that sequential pks is the same as logical creation order. If I have a table that I know I will need to query in this fashion, I usually have my own 'creation' datetime column that I can use to order by instead of pk.
Edit based on comment:
If you'd rather use queryset[0], you can modify the 'top_or_none' function thusly:
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    try:
        return queryset[0]
    except IndexError:
        return None
I didn't propose this initially because I was under the impression that queryset[0] would pull back the entire result set, then take the 0th item. Apparently Django adds a 'LIMIT 1' in this scenario too, so it's a safe alternative to my slicing version.
Edit 2
Of course you can also take advantage of Django's related manager construct here and build the queryset through your 'A' object, depending on your preference:
latest = top_or_none(my_a.b_set.order_by('-pk'))
I don't think the Django ORM can do this (but I've been pleasantly surprised before...). If there's a reasonable number of A records (or if you're paging), I'd just add a method to the A model that would return this 'newest' B record. If you want to get a lot of A records, each with its own newest B, I'd drop to SQL.
Remember that no matter which route you take, you'll need a suitable composite index on the B table, maybe adding ordering = ('a_fk', '-id') to the Meta subclass.
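If you do want all A records in one query, the Max aggregation mentioned above groups B rows by their A and keeps the greatest pk per group (in the ORM, roughly B.objects.values('a').annotate(Max('pk'))). The grouping logic itself, sketched in plain Python over made-up (a_id, b_pk) pairs:

```python
# hypothetical (a_id, b_pk) pairs standing in for B rows and their A foreign key
b_rows = [(1, 10), (1, 42), (2, 7), (2, 99), (1, 5)]

# newest (greatest pk) B per A, in one linear pass
newest = {}
for a_id, b_pk in b_rows:
    if a_id not in newest or b_pk > newest[a_id]:
        newest[a_id] = b_pk

print(newest)  # {1: 42, 2: 99}
```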

Lucene query permutation

I have a question regarding performing a lucene query involving permutation.
Say I have two fields: "name" and "keyword" and the user searches for "joes pizza restaurant". I want some part of that search to match the full contents of the "name" field and some part to match the full contents of the "keyword" field. It should match all the supplied terms and should match the entire contents of the fields. For example it could match:
1) name:"joes restaurant" keyword:"pizza"
2) name:"joes pizza" keyword:"restaurant"
3) name:"pizza restaurant" keyword:"joes"
4) name:"pizza" keyword:"joes restaurant"
5) name:"pizza joes" keyword:"restaurant"
but it would not match
6) name:"big joes restaurant" keyword:"pizza" - because it's not a match on the full field
7) name:"joes pizza restaurant" keyword:"nomatch" - because at least one of the terms should match to the keyword field
I've thought about possible ways to implement this by calculating all the permutations of the fields and using boolean queries however this doesn't scale very well as the number of terms increases. Anyone have any clues how to implement this sort of query efficiently?
Lucene docs recommend using separate field which is concatenation of 'name' and 'keyword' fields for queries spanning multiple fields. Do the search on this field.
Let's divide your query into three parts:
Both the 'name' field and the 'keyword' field should contain part of the query.
Both matches should be to the full field.
The union of the matches should cover the query completely.
I would implement it this way:
Create a boolean query composed of the tokens in the original query: for each field, a disjunction of that field's tokens, with the two disjunctions combined as 'MUST' clauses, e.g. in the example something like:
(name:joes OR name:restaurant OR name:pizza) AND (keyword:joes OR keyword:restaurant OR keyword:pizza)
Any document matching this query has a part of the original query in each field.
(This could be a ConstantScoreQuery to save time).
Take the set of matches from the first query. Extract the field contents as tokens, and store them in String sets. Keep only the matches where the union of the sets equals the string set from your original query, and the sets have an empty intersection. (This handles the covering - item 3 above). For your first example, we will have the sets {"joes", "restaurant"} and {"pizza"} fulfilling both conditions.
Take the set sizes from the matches left, and compare them to the field lengths. For your first example we will have set sizes of 2 and 1 which should correspond to field lengths of 2 and 1 respectively.
Note that my items 2 and 3 are not part of the regular Lucene scoring but rather external Java code.
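The external checks in items 2 and 3 can be sketched in Python over token sets (the function name is mine; the three calls reproduce examples 1, 6, and 7 from the question):

```python
def full_cover_match(query, name_field, keyword_field):
    """Items 2-3 above: both fields contribute, their union covers the
    query exactly, they don't overlap, and each field's token set
    accounts for the whole field (set size == field length)."""
    q = set(query.split())
    n_tokens, k_tokens = name_field.split(), keyword_field.split()
    n, k = set(n_tokens), set(k_tokens)
    return (
        bool(n) and bool(k)              # both fields must contribute
        and (n | k) == q                 # union covers the query completely
        and not (n & k)                  # empty intersection
        and len(n) == len(n_tokens)      # no duplicate tokens hiding in a field
        and len(k) == len(k_tokens)
    )

print(full_cover_match("joes pizza restaurant", "joes restaurant", "pizza"))          # True, example 1
print(full_cover_match("joes pizza restaurant", "big joes restaurant", "pizza"))      # False, example 6
print(full_cover_match("joes pizza restaurant", "joes pizza restaurant", "nomatch"))  # False, example 7
```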