Searching "AND" in lucene index - lucene

I have Lucene indexes built with StandardAnalyzer. One of the indexed field values is "AND".
When I try to search for the field value AND using MultiFieldQueryParser, the search results in an error.
E.g.: field1:* AND field2:AND
field1:* AND field2:"AND"
I have tried escaping, but that only escapes the field value. I have also tried double quotes ("AND"), but could not get the correct results.
Any advice in this regard would be helpful.
Thanks in advance.

I suspect that there are probably two issues in play here:
Query syntax: I think you'll get further by putting the "and" in lower case. Boolean operators in the standard query parser must be in upper case, so a lower-case "and" is parsed as an ordinary term. Since one of the steps of the standard analyser is to lower-case terms, changing the case shouldn't affect what you match.
Stop words: I suspect that "and" is excluded from the set of analysed terms by the standard analyser's stop word list. You could get around this by using the standard analyser with a different stop word list that doesn't exclude "and" as a term.
Good luck,
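
To make the stop-word point concrete, here is a rough sketch of what an analysis chain of this kind does to the value "AND". It is a plain-Java toy model, not Lucene code: the class name, the analyze helper and the short stop-word set are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {
    // Hypothetical, shortened stop-word list; "and" is the entry that matters here.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "or", "not", "the");

    // Toy analysis chain: split on non-alphanumerics, lower-case, drop stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // With "and" in the stop list the value yields no terms at all.
        System.out.println(analyze("AND", STOP_WORDS));
        // With an empty stop list the term "and" survives and can be searched.
        System.out.println(analyze("AND", Set.of()));
    }
}
```

If the real analyser behaves like this model, the value never reaches the index as a term, so no amount of escaping on the query side can match it. The stop list has to change at index time as well as at query time.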

Related

Cloudant Search: Why are my wildcards not working?

I have a Cloudant database with a search index. In the search index I index the titles of my documents. For instance, search for 'rijkspersoneel':
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspersoneel
Returns 48 rows.
However, when I replace the 'o' with a ? wildcard:
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspers?neel
I get 0 results. Why is that? The Cloudant docs say that this should match 'rijkspersoneel' as well!
My previous answer was definitely mistaken. Internal wildcards do appear to be supported. Try:
title:rijkspe*on*
title:rijksper?on*
I'm fairly sure this is an analysis issue and that you are using a stemming analyzer. I'm not really all that familiar with Cloudant and their implementation of this, but in Lucene, wildcard queries are not subject to the same analysis as term queries. I'm guessing that your analysis of this field includes a stemmer, in which case "rijkspersoneel" is actually indexed as "rijkspersone".
So, when you search for
rijkspersonee*
or
rijksper?oneel
you find no matches, since the "el" is missing from the end of the indexed form. When you just search for rijkspersoneel, though, the query does get analyzed, so you search for the stemmed form of the word and find matches.
Stemming and wildcards just don't get along.
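
The mismatch can be reproduced with a small sketch. It assumes, as guessed in the answer, that the stemmer stored "rijkspersone"; the actual indexed form was not verified. The WildcardSketch class and its helpers are hypothetical and only mimic wildcard matching with java.util.regex, they are not Lucene or Cloudant code.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    // Translate wildcards into a regex: * matches any run, ? matches exactly one character.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The wildcard is compared against the indexed term as-is; no stemming is applied to it.
    static boolean matches(String wildcard, String indexedTerm) {
        return toRegex(wildcard).matcher(indexedTerm).matches();
    }

    public static void main(String[] args) {
        String indexed = "rijkspersone"; // assumed stemmed form, per the answer above
        System.out.println(matches("rijkspers?neel", indexed)); // pattern expects the unstemmed ending
        System.out.println(matches("rijkspe*on*", indexed));    // pattern stops before the stemmed ending
    }
}
```

Under that assumption, any pattern that spells out the "eel" ending cannot match, while patterns that end in a wildcard before the stemmed suffix can.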

Using regular expressions for syntax highlight

I have a piece of code that assigns attributes to an NSAttributedString depending on whether certain keywords are present in the string or not. In other words, syntax highlight.
To find if a certain string has those keywords I am currently using regular expressions to find the location of those words with "\\bKEYWORD\\b". The problem is, obviously, performance.
I first tried with NSRegularExpression but performance was so slow that scrolling my textview was nearly impossible. I then tried Oniguruma and things improved but it's still noticeably slow. I may try PCRE but I don't think I'll be adding much.
So, my question is: how can I speed up regular expression searches? Maybe caching the compiled expression?
It sounds like you're searching for each word individually. I would create an array of search words, then join them together with the regex alternation symbol |.
Given search words like: alpha, bravo, charlie, delta, echo
Resulting compiled regex: \b(?:alpha|bravo|charlie|delta|echo)\b
The non-capturing group construct (?:...) is a bit faster than the capturing syntax (...).
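
A minimal sketch of that approach follows. It is written in Java purely for illustration (the question concerns NSRegularExpression), and the class and method names are made up; the idea of compiling one alternation once and scanning the text in a single pass should carry over to other regex engines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class KeywordMatcherSketch {
    // Build \b(?:kw1|kw2|...)\b once; quoting keeps any regex metacharacters in keywords literal.
    static Pattern compileKeywords(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // One pass over the text; each result is a {start, end} pair with end exclusive.
    static List<int[]> findRanges(Pattern pattern, String text) {
        List<int[]> ranges = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            ranges.add(new int[] {matcher.start(), matcher.end()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        Pattern keywords = compileKeywords(List.of("alpha", "bravo", "charlie", "delta", "echo"));
        for (int[] range : findRanges(keywords, "alpha beta echo alphabet")) {
            System.out.println(range[0] + ".." + range[1]);
        }
    }
}
```

Keeping the compiled pattern in a field and reusing it for every highlighting pass also covers the caching idea raised in the question.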

Lucene search and underscores

When I use Luke to search my Lucene index using a standard analyzer, I can see the field I am searching for contains values of the form MY_VALUE.
When I search for field:"MY_VALUE" however, the query is parsed as field:"my value"
Is there a simple way to escape the underscore (_) character so that it will search for it?
EDIT:
4/1/2010 11:08AM PST
I think there is a bug in the tokenizer for Lucene 2.9.1 and it was probably there before.
Load up Luke and try to search for "BB_HHH_FFFF5_SSSS". When there is a number, the following tokens are returned:
"bb hhh_ffff5_ssss"
After some testing, I've found that this is because of the number. If I input
"BB_HHH_FFFF_SSSS", I get
"bb hhh ffff ssss"
At this point, I'm leaning towards a tokenizer bug unless the presence of the number is supposed to have this behavior but I fail to see why.
Can anyone confirm this?
It doesn't look like you used the StandardAnalyzer to index that field. In Luke you'll need to select the analyzer that you used to index that field in order to match MY_VALUE correctly.
Incidentally, you might be able to match MY_VALUE by using the KeywordAnalyzer.
I don't think you'll be able to use the standard analyser for this use case.
Judging what I think your requirements are, the keyword analyser should work fine for little effort (the whole field becomes a single term).
I think some of the confusion arises when looking at the field with Luke. The stored value is not what queries use; what you need to look at are the terms. I suspect that when you look at the terms stored for your field, they'll be "my" and "value".
Hope this helps,

Sql Server 2005 Fulltext case sensitivity problem

I seem to have a weird bug in Microsoft SQL Server 2005 where FREETEXT() searches are somewhat case-sensitive despite the collation being case-insensitive (Latin1_General_CI_AS).
First off, LIKE queries are perfectly case-insensitive, so
WHERE column LIKE '%word%'
and
WHERE column LIKE '%Word%'
return the same results.
Also, FREETEXT queries are in fact case-insensitive to some extent, for instance
WHERE FREETEXT(column, 'Word')
will return results with different cases.
BUT
WHERE FREETEXT(column, 'word')
while still returning case-insensitive matches for word, gives a different resultset.
Or, as I found out after some investigation, searching for word gives all matches for different cases of word but searching for Word gives the same PLUS inflectional results.
Or to use one of the actual cases I found, searching for marketingleader returns all results containing that word, independent of the case, whereas searching for Marketingleader would return those, but also results that just contain leader that don't show up when searching for the lower case.
Has anyone got any idea what is causing this, and how I could turn on inflectional/fuzzy searching for lower-case words as well?
Any help would be appreciated.
Use CONTAINS, the alternative to FREETEXT, where inflectional results are optional:
CONTAINS (Transact-SQL)
Oops, I just saw that you mention CONTAINS in your question. Does it behave the same way as FREETEXT in the examples provided?

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net, and the resources I found suggest increasing the limit with BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. To what value should I raise it? How can I be sure that this new magic number will be sufficient for my query? How far can I increase it before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem.
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
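
As a rough illustration of why the limit is hit, the sketch below models the expansion with a sorted set of terms. It is a toy model rather than Lucene's implementation: the class, the expandPrefix helper and the exception type are assumptions, and only the 1024 figure comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class PrefixExpansionSketch {
    static final int MAX_CLAUSES = 1024; // limit mentioned in the answer

    // Collect every indexed term starting with the prefix, one clause per term.
    static List<String> expandPrefix(SortedSet<String> indexedTerms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, starting at the prefix itself.
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the prefix range
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("too many clauses: more than " + maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 1500; i++) {
            terms.add("a" + i); // 1500 hypothetical paint codes starting with "a"
        }
        terms.add("b1");
        System.out.println(expandPrefix(terms, "b", MAX_CLAUSES).size());
        try {
            expandPrefix(terms, "a", MAX_CLAUSES);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The number of clauses depends on how many distinct terms share the prefix, not on how many documents match, which is why raising the limit to a fixed number gives no guarantee as the index grows.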
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
    break; // stop expanding and accept partial results
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search on. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the number of expanded terms and prevents the infamous TooManyClauses exception. Apparently this also speeds up searching.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article linked above.
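
The naming scheme from the example could be automated along these lines. This is a sketch under assumptions: the field suffix "1n", "2n", "3n" and the cut-off of 3 follow the example above, while the class and method names are hypothetical and no Lucene API is involved; the resulting map would still have to be turned into document fields by your indexing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixFieldSketch {
    static final int MAX_PREFIX_FIELDS = 3; // cut-off taken from the example above

    // Index time: the full value plus one extra field per prefix length 1..3.
    static Map<String, String> fieldsFor(String fieldName, String value) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(fieldName, value);
        for (int n = 1; n <= MAX_PREFIX_FIELDS && n <= value.length(); n++) {
            fields.put(fieldName + n + "n", value.substring(0, n));
        }
        return fields;
    }

    // Query time: short prefixes become exact lookups on a prefix field;
    // longer ones fall back to a prefix query on the full field.
    static String fieldForPrefix(String fieldName, String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        return prefix.length() <= MAX_PREFIX_FIELDS
                ? fieldName + prefix.length() + "n"
                : fieldName;
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("paintCode", "a4c2d3"));
        System.out.println(fieldForPrefix("paintCode", "a4"));
        System.out.println(fieldForPrefix("paintCode", "a4c2"));
    }
}
```

With this layout a search for codes starting with "a" becomes a single-term lookup on paintCode1n, so no expansion happens for the short prefixes that would otherwise match the most terms.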