How do I use native Lucene Query Syntax? - lucene

I read that Lucene has an internal query language where one specifies : and you make combinations of these using boolean operators.
I read all about it on their website and it works just fine in LUKE, I can do things like
field1:value1 AND field2:value2
and it will return seemingly correct results.
My problem is how do I pass this who Lucene query into the API? I've seen QueryParser, but I have to specifiy a field. Does this mean I still have to manually parse my input string, fields, values, parenthesis, etc or is there a way to feed the whole thing in and let lucene do it's thing?
I'm using Lucene.NET but since it's a method by method port of the orignal java, any advice is appreciated.

Are you asking whether you need to force your user to enter the field? If so, the query parser has a default field. Here's a little more info. As long as you have a default field that will do the job, they don't need to specify fields.
If you're asking how to get a Query object from the String, you need the parse method. It understands about fields, and the default field, etc. mentioned earlier. You just need to make sure that the query parser and the index builder are both using the same analysis.

Related

SOLR indexed item has extra word which is not available in query parameter - how to identify those cases?

We have a scenario where we are trying to perform accurate name matching of Items using SOLR.
Query Parameter: Apple
SOLR Indexed Word: Apple-D
In our business case, "Apple" and "Apple-D" are totally different items and therefore SOLR shouldn't return the match.
Is there an option to achieve the same?
You need to change the fieldType used for the field. Use the String fieldType for the your field.
This String fieldType will make sure that the words will be stored as it is by solr.
It won't apply any analysis on the word. Or it won't create any tokes of it.
With the String type applied to it . The Apple and Apple-D are stored/indexed different token. As there won't be any tokenizing on the same. This will help you to achieve the exact match.
Once you change the fieldType. Re-index the same.
You can use the solr analysis tool to check how it is indexing and querying .
Note : Make sure whenever you ask question on it, Share your schema.xml

Sparql Query Results without Namespace

I want to get results from sparql query and the results contain no namespace.
ex: there is result in triple format like:
"http://www.xyz.com#Raxit" "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" "http://www.xyz.com#Name"
So i want to get only following:
Raxit type Name
I want to get this results directly from sparql query. I am using virtuoso.
Is it possible to get this from sparql?
Please share your thoughts regarding this.
Thanks in Advance.
If your data is regular, and you know that the sub-string you want always occurs after a # character, then you can use the strafter function from SPARQL 1.1. I do not know whether this is available in Virtuoso's implementation or not.
However this is, in general, a very risky strategy. Not all URI's are formatted with a local name part after a # character. In fact, in general, a URI may not have a legal or useful localname at all. So you should ask yourself: why do you think you need this? Generally speaking, a semantic web application uses the whole URI as an indivisible identifier. If your need is actually for something human-friendly to display in a UI, have your query also look for rdfs:label or skos:label properties. Worst case, try to abbreviate the URI to q-name form (i.e. prefix:name), using the prefixes from the model or a service like prefix.cc
The simplest way to achieve this is to not bother with adapting your query, but to just post-process the result yourself. Depending on which client library you use to communicate with Virtuoso, you will typically find it has API support to parse the result, get back values, and for each value then get only local name (I suggest you look for a URI.getLocalName() method or something similar).

Why are my Lucene Document results empty?

I'm running a simple test--trying to index something and then search for it. I index a simple document, but then when a search for a string in it, I get back what looks to be an empty document (it has no fields). Lucene seems to be doing something, because if I search for a word that's not in the document, it returns 0 results.
Any reason why Lucene would reliably return a document when it finds one that matches the given query, and yet that document has nothing in it?
More details:
I'm actually running Lucandra (Lucene + Cassandra). That certainly may be a relevant detail, but not sure.
The fields are set to Field.Store/YES and Field.Index/ANALYZED
Interestingly, I'm able to get this to work just fine on my local machine, but when we put it on our main server (which is a multi-node cassandra setup), I get the behavior described above. So this seems like probably the relevant detail, but unfortunately, I see no error message to clue me in to what specifically is causing it.
Unsure if this will work with Lucandra, but you have tried opening the index using Luke? Viewing the index contents with Luke might help
It's hard to tell what the problem is since you only provide a very abstract description. However, it sounds a bit like you are not storing the field value in the index. There are different modes for indexing a field. One option determines whether the original value is stored in the index to retrieve it later:
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/document/Field.Store.html
See also the description of the enclosing class Field
Read: http://anismiles.wordpress.com/2010/05/27/lucandra-an-inside-story/

determine which value produced a hit in SOLR multivalued field type

If I have a multiValued field type of text, and I put values [cat,dog,green,blue] in it. Is there a way to tell when I execute a query against that field for dog, that it was in the 1st element position for that multiValued field?
Assumption: client does not have any pre-knowledge of what the field type of the field being queried is. (i.e. Solr must provide the answer and the client can't post process the return doc to figure it out because it would not know how SOLR matched the query to the result).
Disclosure: I posted to solr-user list and am getting no traction so I post here now.
Currently, there's no out-of-the-box functionality provided in Solr which tells you the position of a value in a multiValue field.
Hopefully I understand your question correctly.
If you want to get field index or value there is an ugly workaround:
You could add the index directly in the value e.g. store "1; car", "2; test" and so on. Then use highlighting. When reading the returned fields simply skip the text before the semicolon.
But if you want to query only one type:
You can avoid the multivalue approach and simply store it as item_i and query via item_1. To query against all items regardless the type you need to use the copyField directive in the schema.xml
The Lucene API allows for this, but I'm not sure if Solr does out of the box. In Lucene you can use the IndexReader.termPositions(Term term) method.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net and on the found resources they suggest to increase the BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
the paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
By the time you query, the number of characters in your term decide which field to search on. This means that you will perform a prefix query only for terms with more of 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyBooleanClausesException. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article