Lucene mutli-language analyzer/index approach - indexing

I have a working Lucene index supporting a suggestion service. When a user types into a search box it queries the index by the SUGGESTION_FIELD. Each entry in SUGGESTION_FIELD can be one of many supported languages and each is stored using an appropriate language specific analyzer. In order to know what analyzer was used there is second field per entry which stores the LOCALE. So during a query I can say something like the code below to do a language specific query using appropriate analyzer
QueryParser parser = new QueryParser(Version.LUCENE_33, SUGGESTION_FIELD, getLangaugeAnalyzer(locale));
return searcher.search(parser.parse("SUGGESTION_FIELD:" + queryString + " AND LOCALE:"
+ locale), 100);
The works.... But now the client wants to be able to search using multiple languages at once.
My Question: What would be the fastest querying solution bearing in mind that a suggestion service needs to be very fast?...
Sol. #1. The simplest solution would seem to be; do the query multiple times. Once for each locale, thereby applying the corresponding language analyser each time. Finally append the results from each query in some sensible fashion
Sol. #2. Alternatively I could re-index using a column for each locale such that:
SUGGESTION_FIELD_en, SUGGESTION_FIELD_fr, SUGGESTION_FIELD_es etc..
using a different analyzer for each field (using PerFieldAnalyzerWrapper) and then query using a more complex query string such that:
"SUGGESTION_FIELD_en:" + queryString + " AND SUGGESTION_FIELD_fr:" + queryString + " AND SUGGESTION_FIELD_es:" + queryString
Please help if you think you :)

Your query is going to be something like this: (sugField:queryString1 AND locale:loc1) OR (sugField:queryString2 AND locale:loc2) OR .... This is a top-level BooleanQuery with subordinate BooleanQueries added with occurs=SHOULD, where each subordinate query has its terms with occurs=MUST. The queryString1, queryString2, etc. are the outputs from different language analyzers having the same input, the string the user entered.
Each subordinate query involves mandatory terms (from your query string) that are rare in the index and Lucene knows this at the outset (it knows the total doc count for each Term in the index) so it will first constrain the result by the queryString and then additionally intersect that with the locale terms. This will be VERY efficient no matter how large your index.
As for the different analyzers, I suggest you don't use the QueryParser, but create the entire query programmatically. This is a good general advice whenever you don't enter the query by hand and in your case it is the only way to gain control of the analyzing aspect. Run your query string through each of the language-specific analyzers and add their output tokens as TermQueries to the subordinate BooleanQueries.

Related

SQL Full-Text Search (CONTAINSTABLE) - Better String Parser?

I'm currently using a RadSearchBox (Telerik) for AutoComplete purposes to collect the search string from a user and pass it to a stored procedure that queries against a MS SQL full-text index (400k+ rows) fairly fast and returns the results in a RadGrid. The customer wants (as they always do) for the search string parser to be more advanced. I currently have it so if it detects any non-alphanumeric characters ("/","-",etc) it will search for the "exact/phrase", as full-text doesn't do well with " * exact/phrase * ".
But, before I completely re-create the wheel, I was wondering if anyone out there has something more intelligent that will insert 'AND','OR','NEAR', search for "exact/phrase" and " * exact * " OR " * phrase * " and return the results back racked and stacked based on the CONTAINSTABLE ranking. Having FORMSOF(INFLECTION,exact) would be icing on the cake.
I know this is a lot to ask, just putting a "line out" to see if anyone has anything decent they would be willing to share. I would look into using Lucene, but the web app has a Fluent Data Model that isn't easily compatible (that I know of) with Lucene.
I've done my fair share of googling, but can't seem to find anything that looks clean (would prefer not to dive into and out of numerous functions/stored procedures to accomplish) and also something that was written within the last several years.
Thanks in advance...

Basic Lucene Beginners q: Index and Autocomplete

I am using Lucene.NET and have a basic question:
Do I need to make an additional Index for Autocompletion?
I created an index based on two different tables from a database.
Here are two Docs:
stored,indexed,tokenized,termVector<URL:/Service/Zahlungsmethoden/Teilzahlung>
stored,indexed,tokenized,termVector<Website:Body:The Text of the first Page>
stored,indexed,tokenized,termVector<Website:ID:19>
stored,indexed,tokenized,termVector<Website:Title:Teilzahlung>
stored,indexed,tokenized,termVector<URL:/Service/Kundenservice/Kinderbetreeung>
stored,indexed,tokenized,termVector<Website:Body:The text of the second Page>
stored,indexed,tokenized,termVector<Website:ID:13>
stored,indexed,tokenized,termVector<Website:Title:Kinderbetreuung>
I need to create a dropdown for a search with suggestions:
eg: term "Pag" should suggest "Page"
so I assume that for every word (token) in every doc, I need a list like:
p
pa
pag
page
is this correct?
Where do I store these?
In an additional Index?
Or how would I re-arrange the existing structure of my index to hold the autocompletion-suggestions?
Thank you!
1) Like femtoRgon said above, look at the Lucene Suggest API.
2) That being said, one cheap way to perform auto-suggest is to look for all words that start with the string you've typed so far, like 'pa' returning 'pa' + 'pag' + 'page'. A wildcard query would return those results -- in Lucene query syntax, a query like 'pa*'. (You might want to restrict the suggestions to only those strings of length 2+.)
Mark Leighton Fisher has the right approach for a cheap way but performing a wildcard query just returns you the documents but not the words. It's better to look at the imlpementation of the WildcardQuery maybe. You need to use the Terms object retrieved from the IndexReader and iterate through the terms in the index.

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (ie, "firstterm secondterm"), I want the resulting query to be an OR query: Bring me back results where fullNodeText is firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query is looking for precisely that phrase, with the two terms appearing consecutively, and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder using in PhraseQuery.toString() to represent a removed stop word (See #Lucene-1396). You are still performing a phrasequery, but now it is expecting a three word phrase firstterm, followed by a removed stop word, followed by secondterm
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
Will search for any document with either of those terms (with higher score given to documents with both).

Lucene term query

In my application we have deals and each deal has a target user group which may include several fields like gender, age and city. For the gender part a deal's target could be MALE FEMALE or BOTH. I wanted to find deals which are either for males or both.I created the following query but it doesn't work...
TermQuery maleQuery = new TermQuery(new Term("gender","MALE"));
TermQuery bothQuery = new TermQuery(new Term("gender","BOTH"));
BooleanQuery query = new BooleanQuery();
query.add(maleQuery,BooleanClause.Occur.SHOULD);
query.add(bothQuery,BooleanClause.Occur.SHOULD);
Please suggest if I am making some mistake. Somehow it seems to spit out only MALE deals,and not BOTH.
I am using version 4.2.1 and Standard Analyzer as the analyzer.
Several solutions possible are:
Use a QueryParser to construct the query instead of using TermQuery using the same analyser used at indexing time.For eg in my case it would be:
Query query = new QueryParser(version,"gender",new StandardAnalyzer()).parse("MALE BOTH");
Use a different analyzer for indexing that does a case insensitive indexing.
(this one applies for StandardAnalyzer, for other analyzers solutions may be different) LOWERCASE your search terms before searching.
Explaination
A brief explaination to the situation would be:
I used a StandardAnalyzer for indexing which lower cases input tokens so that a case insensitive search could be materialised.
Then I used a QueryParser configured with the same analyzer to construct a query instance for searching the user's query at front end. The search worked because the parser working in accordance with standard analyzer lower cased the user's terms and made a case sensitive search.
Then I needed to filter search results for which I wrote TermQuerys instead of using parsers, in which I used capitalized text which was not indexed that way, so the search failed.
Your query seems perfectly valid, I'd look for something else that might be wrong (e.g. are you not using LowerCaseFilter for MALE/BOTH/FEMALE terms by a chance when indexing?).
You might want to read this article on how to combine various queries into a single BooleanQuery.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net and on the found resources they suggest to increase the BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
the paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
By the time you query, the number of characters in your term decide which field to search on. This means that you will perform a prefix query only for terms with more of 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyBooleanClausesException. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article