Neo4j 2.0.1 Cypher performance difference between using start and match with a predicate - indexing

Started using Cypher about a week ago (really like it). In the 'browser' interface I'm running two queries:
1) start n=node:Node(name="foo") match (n)-[r*..4]-(m) return n,m
2) match (n{name:"foo"})-[r*..4]-(m) return n,m
The first query returns almost immediately; the second has been running for more than an hour and counting. Naively I would think these were equivalent, but clearly they are not. I ran a 'smaller' version of both (path length up to 1) in the neo-shell so I could profile them.
profile start n=node:Node(name="foo") match (n)-[r*..1]-(m) return n,m;
ColumnFilter(symKeys=["m", "n", " UNNAMED51", "r"], returnItemNames=["n", "m"], _rows=4, _db_hits=0)
TraversalMatcher(start={"expr": "Literal(foo)", "identifiers": ["n"], "key": "Literal(name)",
"idxName": "Node", "producer": "NodeByIndex"}, trail="(n)-[*1..1]-(m)", _rows=4, _db_hits=5)
.
profile match (n{name:"foo"})-[r*..1]-(m) return n,m
ColumnFilter(symKeys=["n", "m", " UNNAMED33", "r"], returnItemNames=["n", "m"], _rows=4, _db_hits=0)
Filter(pred="Property(n,name(0)) == Literal(foo)", _rows=4, _db_hits=196870)
TraversalMatcher(start={"producer": "AllNodes", "identifiers": ["m"]},
trail="(m)-[*1..1]-(n)", _rows=196870, _db_hits=396980)
From other stackoverflow questions I understand db_hits is good to look at, so it looks like the second query has basically done a linear scan (my db is almost 400k nodes). This seems to be confirmed by the "producer" value of "AllNodes" instead of "NodeByIndex".
Obviously I need to specify the match (predicate) differently so that it hits the index. The index is called 'Node', on the property 'name'. My googling and Stack Overflow searching are failing me: how do I specify the conditional in the match so that it hits the index?
Update:
After some poking around, it appears I'm using a 'legacy' index and then trying to hit it with a 'new style' (don't use start) query (kinda extrapolating here). So I can do the following:
create index ON :label(name)
and that would provide an index for a particular label on the name property, but I really want an index (I guess non-legacy index) on ALL the node names. I have use cases where that's important (user may not know the label but does know the name).
Any suggestions or guidance is much appreciated.

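Right now there is no global schema index, so you would probably want to use a generic label like Entity or Node and create an index on it like this:
create index on :Entity(name);
Then add that Entity label to all your nodes:
match (n) set n:Entity;
Once the label is in place, include it in the match so the planner can use the schema index:
match (n:Entity{name:"foo"})-[r*..4]-(m) return n,m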

Amazon CloudSearch returns false results

I have a DB of articles, and I would like to search for all the articles that:
1. contain the word 'RIO' in either the title or the excerpt
2. contain the word 'BRAZIL' in the parent_post_content
3. fall within a certain time range
The (structured) query I searched with was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason I get results that contain 'RIO' in the title but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because I tried conditioning only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairly new to CloudSearch, so I very likely have syntax errors, but I can't seem to find them. Any help?
You're using the phrase operator but not actually searching for a phrase; it would be best to use the term operator (or no operator) instead. I can't see why it should matter but using something outside of how it was intended to be used can invite unintended consequences.
Here is how I'd re-write your queries:
Using term (mainly just used if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
If that fails, can you post the complete query? I'd like to check that, for example, you're using the structured query parser since you mentioned you're new to CloudSearch.
Here are some relevant docs from Amazon:
Compound queries for more on the various operators
Searching text for specifics on the phrase operator
Apparently the problem was not with the query but with the displayed content. I foolishly trusted that the content displayed on the CloudSearch site was complete, and so concluded that it did not contain Brazil. But alas, it was not the full content, and when I checked the full content, Brazil was there.
Sorry for the foolery.

Lucene query with filter "without property"

I need to write a Lucene query/filter to get objects without a specific property.
I tried with ... ISNULL:"cm:param_name" but it didn't work.
Edit: I have added a new property to an aspect, but objects that haven't been updated yet don't have it among their listed properties (checked with the node browser).
With a query like "cm:*", you should only receive documents that have the field "cm" plus content. Note that you have to allow leading wildcard queries by the query parser with setAllowLeadingWildcard(true).
Also check out this post, which deals with a reversed version of your problem:
Find all Lucene documents having a certain field
Can you please be clearer about what "without property" means? Do you mean that you do not want to specify the field like "field:value" and instead set the filter to just "value"?
EDIT
Are you generating these field names dynamically, or is this the only field name that can have its value missing? If there is only one field that may or may not appear in your document, then you could just populate it with a default value when it's missing and then search for that. Otherwise, you could try a negated range query like so: NOT foo:[* TO *]. This should match all documents without a value in the foo field. For performance purposes, in the second case the field should be indexed as a string field (not analyzed).
I managed to get this done with .. AND NOT (#namespace\:property:"")
In Java with Lucene 3.6.2, the "FieldValueFilter" with negation enabled can be used (although that was not the question):
import org.apache.lucene.search.FieldValueFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
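// getIndexSearcher() is a placeholder for wherever your searcher comes from
final IndexSearcher indexSearcher = getIndexSearcher();
// negate=true accepts only documents that have no value in the "cm" field
final TopDocs topdocs = indexSearcher.search(new MatchAllDocsQuery(), new FieldValueFilter("cm", true), Integer.MAX_VALUE);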
You can use ISUNSET and/or ISNULL for this scenario.
ISUNSET:"cm:title"
ISNULL:"cm:title"

Apache solr - more like this score

I have a small index with ~1000 documents with only two fields:
- id (string)
- content (text_general)
I noticed that when I do an MLT search by id for similar content, the original document (the one whose id I searched for) has a score of 5.241327.
There is a 1:1 duplicate of that document, and for the duplicate it returns score = 1.5258181. Why? Why is it not 5.241327 when it is a 100% duplicate?
Another question: is there any way to get similar documents by content, by passing some text in the query?
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check if there is similar content uploaded and the check must be performed at new content upload time.
It might be worth trying some different parameters. I also use MLT on only one field, with the following parameters:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think with a small index mindf might be important and I see the default mintf (term frequency) is 2, so I assume an ID is only one term, so this is probably ignored!
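As a rough sketch, the parameters listed above could be assembled into a request URL like this. The host, core name, and the `mlt_url` helper are assumptions made up for illustration; only the parameter names and values come from the answer. Note that the query text has to be URL-encoded, which `URI.encode_www_form` takes care of.

```ruby
require 'uri'

# Hypothetical Solr location; adjust host and core to your installation.
SOLR_BASE = 'http://localhost:8983/solr/collection1'

# Build a MoreLikeThis request URL from the parameters in the answer.
# encode_www_form escapes spaces, colons and other reserved characters.
def mlt_url(query, field, rows = '10')
  params = {
    'q'         => query,
    'mlt.fl'    => field,
    'mlt.boost' => 'true',
    'mlt.maxqt' => '1000',
    'mlt.mindf' => '0',
    'mlt.mintf' => '0',
    'rows'      => rows
  }
  "#{SOLR_BASE}/mlt?#{URI.encode_www_form(params)}"
end

puts mlt_url('id:123', 'content')
puts mlt_url('content:Some encoded long text', 'content')
```

Whether the /mlt handler is registered under that path depends on your solrconfig.xml, so treat the URL shape as a starting point rather than a given.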
First, how does Solr More-Like-This work?
A regular Solr query is run first (e.g. "?q=content:Some encoded long text&...").
For each document returned by that query, More-Like-This then runs its own "more like this" query.
So the first result set, "response", is just like any other Solr query result set.
The More-Like-This results appear below it and start with something like this (JSON format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation of the More Like This algorithm, see:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you haven't solved the problem yet, please let me know and I will guide you through it.

How to quick match an entry with the beginning of a long string?

I have a table articles (a Rails model, but I think the issue is more SQL related) which has a column named permalink. For instance, some of my permalinks:
title-of-article
great-article
great-article-about-obama
obama-stuff-about-him
I want to match a request like great-article-about-obama-random-stuff to great-article-about-obama. Is it possible to do this without killing performance?
Thanks to all,
ps : We use Rails 3 and Postgresql (or Sqlite not decided yet for production)
EDIT
We can do something like this, but the main downside is that we have to fetch every single permalink from the articles table:
permalinks = ['title-of-article','great-article','great-article-about-obama','obama-stuff-about-him']
string_to_match = 'great-article-about-obama-random-stuf'
result = permalinks.inject('') do |matched,permalink|
  (string_to_match.include? permalink and permalink.size > matched.size) ? permalink : matched
end
# result => 'great-article-about-obama'
I'd love to find a way to do it directly in SQL, for obvious performance reasons.
Unless you use a text-search based technology (with Postgres: http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html + http://tenderlovemaking.com/2009/10/17/full-text-search-on-heroku/ or Solr, IndexTank), you can do it with:
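request = "chien-qui-aboie"
article = nil
until article
  # Look for a permalink that starts with the current, progressively shortened request
  article = Article.where("permalink like ?", request + "%").select(:id).first
  # Otherwise drop the last "-segment"; sub! returns nil when nothing is left to strip
  break if article.nil? && request.sub!(/-[^-]*$/, '').nil?
end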
This will first look for chien-qui-aboie%, then chien-qui%, then chien%.
This will also match "chien_qui_mange" if there is an article "chien_qui_mange" but none about "chien qui aboie".
That's not optimal because of the number of requests, but that's not that heavy if it's just a look up, and not the normal way of accessing a record.
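One way to cut down the number of requests, sketched here under stated assumptions: build every hyphen-delimited prefix of the requested slug up front, then look for the longest one that exists. The helper names `candidate_permalinks` and `best_match` are made up for this example, and the array stands in for the articles table so the snippet runs without a database.

```ruby
# Build every hyphen-delimited prefix of a request slug, longest first.
def candidate_permalinks(request)
  segments = request.split('-')
  segments.length.downto(1).map { |n| segments.first(n).join('-') }
end

# In-memory stand-in for the articles table (values taken from the question).
PERMALINKS = %w[title-of-article great-article great-article-about-obama obama-stuff-about-him]

# Return the longest stored permalink that is a whole-segment prefix of the request,
# or nil when none of the candidates exists.
def best_match(request, permalinks = PERMALINKS)
  candidate_permalinks(request).find { |candidate| permalinks.include?(candidate) }
end

puts best_match('great-article-about-obama-random-stuff')
```

With ActiveRecord the same candidates could presumably go into a single query, along the lines of Article.where(:permalink => candidate_permalinks(request)).order("length(permalink) desc").first, which is an exact-match lookup and so can use an ordinary index on permalink. That line is untested here and only meant to show the idea.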

nutch field problem

I was using something like:
Field notdirectory = new Field("notdirectory","1", Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "notdirectory:1" can be processed quite well all the time.
But recently I've started using the same "Field.Store.NO, Field.Index.UN_TOKENIZED" settings to index a non-numeric string:
Field stateField = new Field("state","irn_" + state, Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "state:irn_CA" never fetch any results, even though I can see in the Hadoop logs that "irn_CA" is in fact added to the "state" field.
So I suspect that for fields using "Field.Store.NO, Field.Index.UN_TOKENIZED", only numeric values are searchable, but I couldn't find any documentation about that.
So what's the true reason for this?
I think you are using StandardAnalyzer for parsing the input query, which will tokenize your input query "irn_CA" into two tokens, "irn" and "CA". Since the index has "irn_CA" as a single token, it won't match.
Try using KeywordAnalyzer when searching. It will generate a single token for the query string and match the indexed token correctly.
I think the searcher bean forces everything to lowercase, so make sure the state is in lower case when adding it to the index:
Field stateField = new Field("state","irn_" + state.toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED);
and when you query: 'state:irn_ca' instead of 'state:irn_CA'.
I also note you prefixed with 'irn_' - good call, otherwise the highlighter flags up the query.
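As a toy illustration of why the case has to agree on both sides: an untokenized field is matched as one exact term, so whatever normalization happens at query time must also be applied at index time. The sketch below is plain Ruby, not Lucene or Nutch code; the `normalize_state` helper and the set standing in for the index are hypothetical.

```ruby
require 'set'

# Hypothetical helper: build the exact term stored in the "state" field.
def normalize_state(state)
  "irn_#{state.downcase}"
end

# Stand-in for an untokenized index field: a set of exact terms.
INDEX_TERMS = Set.new(%w[CA NY].map { |s| normalize_state(s) })

# An exact-term lookup succeeds only if the query is normalized the same way.
puts INDEX_TERMS.include?(normalize_state('CA'))  # query term lowercased like the index
puts INDEX_TERMS.include?('irn_CA')               # raw mixed-case term
```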