Lucene search ton of names

Lucene search ton of names - lucene

I was trying to Search ton of names(10000+) against lucene index, the names were loaded from a text file.
This is snippet of my code:
Analyzer analyzer = new StandardAnalyzer();
MultiFieldQueryParser mParser = new MultiFieldQueryParser(arrSearchFields,
analyzer);
Query keyWordsQuery = mParser.parse(names);
- First I get the error: too many boolean clauses
at
org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
as search on the internet, I can fix the by
BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
But the search is slow and use a lot of memory.
Any suggestions for this case?
Appreciate it.
James

I know this is old but I hope someone finds this useful.
With Lucene 8.6.1, you can avoid "Too Many Clauses" exception by using the below query
(field1: term1 OR term2 OR .... OR term1024) OR (field1: term1025 OR trem1026 OR .... OR term2000)
instead of
field1: term1 OR term2 OR term3 OR .... OR term2000
Basically, just break up your query to consist no more than 1024 terms for each field at a time.
I'm not sure about memory usage or how fast this going to be but my intention was to get it working without throwing an error.
You may also want to look at this, this and this.

Related

lucene wildcard query with space

I have Lucene index which has city names.
Consider I want to search for 'New Delhi'. I have string 'New Del' which I want to pass to Lucene searcher and I am expecting output as 'New Delhi'.
If I generate query like Name:New Del* It will give me all cities with 'New and Del'in it.
Is there any way by which I can create Lucene query wildcard query with spaces in it?
I referred and tried few solutions given # http://www.gossamer-threads.com/lists/lucene/java-user/5487

It sounds like you have indexed your city names with analysis. That will tend to make this more difficult. With analysis, "new" and "delhi" are separate terms, and must be treated as such. Searching over multiple terms with wildcards like this tends to be a bit more difficult.
The easiest solution would be to index your city names without tokenization (lowercasing might not be a bad idea though). Then you would be able to search with the query parser simply by escaping the space:
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("cityname:new\\ del*");
Or you could use a simple WildcardQuery:
Query query = new WildcardQuery(new Term("cityname", "new del*"));
With the field analyzed by standard analyzer:
You will need to rely on SpanQueries, something like this:
SpanQuery queryPart1 = new SpanTermQuery(new Term("cityname", "new"));
SpanQuery queryPart2 = new SpanMultiTermQueryWrapper(new WildcardQuery(new Term("cityname", "del*")));
Query query = new SpanNearQuery(new SpanQuery[] {query1, query2}, 0, true);
Or, you can use the surround query parser (which provides query syntax intended to provide more robust support of span queries), using a query like W(new, del*):
org.apache.lucene.queryparser.surround.parser.QueryParser surroundparser = new org.apache.lucene.queryparser.surround.parser.QueryParser();
SrndQuery srndquery = surroundparser.parse("W(new, del*)");
query = srndquery.makeLuceneQueryField("cityname", new BasicQueryFactory());

As I learnt from the thread mentioned by you (http://www.gossamer-threads.com/lists/lucene/java-user/5487), you can either do an exact match with space or treat either parts w/ wild card.
So something like this should work - [New* Del*]

Lucene phrase query with wildcards

I come up with solution to programmaticlly create query to search for phrase with wildcards using this code:
public static Query createPhraseQuery(String[] phraseWords, String field) {
SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
for (int i = 0; i < phraseWords.length; i++) {
WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
}
return new SpanNearQuery(queryParts, //words
0, //max distance
true //exact order
);
}
Example creation and call toString() method will output:
String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());
outputs:
spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)
Which works great, and fast enough for most cases. For instance, if I create such query and search with it, It will output desired results, for example:
Sentence with foo bar.
Foolies beer drinkers.
...
And not something like:
Bar fooes.
Foo has bar.
I have mentioned that query work fast enough in most cases. Currently I have an index with size of aprox. 200GB and on average searching time is between 0.1 to 3 seconds. Depending on many factors like: cache, size of subsets of documents matching single word in phrase since lucene will perform set intersections between founded terms.
Example:
Let supose I want to query phrase "an* karenjin*" (which I will split into ["an*", "karenjin*"] and than create query using createPhraseQuery method) and I want that it matches sentences containing: "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different cases due croatian grammar).
This query is very slow that I haven't waited long enough to get results (over 1h) and sometimes causes GC overhead limit exceeded exception.
This behaviour is somewhat expected since "an*" itself matches a huge number of documents. I am aware of that I could query "an? karanjin*" which giver results in 30-40sec (faster but still slow).
This is where I am confused.
If I query just "karenjin*" it gives results in 1 sec. Therefore I have tried to query "an* karenjin*" and using a Filter "karenjin*" using WildcardQuery and QueryWrapperFilter. And it is still unacceptable slow (I killed process before it returned anythong).
Documentation says that Filter reduces search space of Query. So I tried to use filter:
Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karanjin*")));
And query:
Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");
Than search, (after several warm-up queries):
Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);
OK, what is my question?
How come is quering:
Query query = new WildcardQuery(new Term("text", "karanjin*"));
is fast, but using Filter described above is still slow?

Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly so. Hard to say for sure why that is occuring, but for an attempt.
I'll assume:
Query query = new WildcardQuery(new Term("text", "an*"));
On it's own, is performing very badly, as described. Since the wildcards you are looking for are both prefix style queries, it's a better idea to use a PrefixQuery instead.
Query query = new PrefixQuery(new Term("text", "an"));
Though I don't think that will make much of a difference if any at all. What might just make a different is changing you rewrite method. You could try limiting the number of Terms the query is rewritten into:
Query query = new PrefixQuery(new Term("text", "an"));
//or
//Query query = new WildcardQuery(new Term("text", "an*"));
query.setRewriteMethod(new MultiTermQuery.RewriteMethod.TopTermsRewrite(10));

Lucene search using StopWords in StandardAnalyzer

I have the following issue using Lucene.NET 3.0.3.
My project analyze Documents using StandardAnalyzer with StopWord-List (combined german and english words).
While searching I create my searchterm by hand and parse it using MultiFieldQueryParser. The Parser is initialized with the same analyzer as indexing documents.
The parsed search query initialized a BooleanQuery. The BooleanQuery and a TopScoreDocCollector search in the Lucene index with IndexSearcher.
My code looks like:
using (StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30, roxConnectionTools.getServiceInstance<ISearchIndexService>().GetStopWordList()))
{
...
MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchFields, analyzer);
parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
parser.AllowLeadingWildcard = true;
...
Query searchQuery = parser.Parse(searchStringBuilder.ToString().Trim);
...
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.Add(searchQuery, Occur.MUST);
...
TopScoreDocCollector scoreCollector = TopScoreDocCollector.Create(SearchServiceTools.MAX_SCORE_COLLECTOR_SIZE, true);
...
searcher.Search(boolQuery, scoreCollector);
ScoreDoc[] scoreDocs = scoreCollector.TopDocs().ScoreDocs;
}
If I index a document field with value "Test- und Produktivumgebung" I can´t find this document by searching this term.
I get results if I correct the search term to "Test- Produktivumgebung".
The word "und" is in my StopWord-List.
My search query looks like the following:
Manually generated search query: (+*Test* +*und* +*Produktivumgebung*)
Parsed search query: +(title:*Test*) +(title:*und*) +(title:*Produktivumgebung*)
Why I can´t find the document searching for "Test- und Produktivumgebung"?

Wildcard Queries are not analyzed (See this question, for an example). Since you are (if I understand correctly), interpreting the query "Test- und Produktivumgebung" to (+*Test* +*und* +*Produktivumgebung*), the analyzer is not used for any of those wildcard queries, and stop words will not be eliminated.
If you eliminate the step that performs that translation, the query "Test- und Produktivumgebung" should be parsed to a phrase query and analyzed, and should work just fine. Another reason to eliminate that step, is that applying a leading wildcard to every term will cause your performance to become very poor. That's why leading wildcards must be manually enabled, because it is generally a bad idea to use them.

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the most simple way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the most simple solution.

Lucene's fuzzy queries and based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.

Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It is a out of box, and very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default, it is using the LevensteinDistance, but you could provide your own customized Edit Distance.

Lucene analyzer for substrings

I have a database table with about 40,000 records containing code fields, such as
FLEFSU25B-25M
EMG1090-5S
I need to be able to very quickly select all codes that contain a given substring. For example "109" matches EMG1090-5S.
My current approach is to store the codes in Lucene and have Lucene filter by substring - such as 109
But that is not very efficient if I just store the codes, because than Lucene has to search through all the tokens.
To overcome this, I'm thinking of creating a new analyzer that will split each code into tokens, like this:
EMG1090-5S
MG1090-5S
G1090-5S
1090-5S
...
Then to find all codes with substring 109, I can search on 109* which is much more efficient (I understand Lucene stores tokens alphabetically, just like SQL Server indexes).
Does this make sense?
Does such an analyzer already exist? I'm using .Net/C#.

A token filter to accomplish this does indeed already exist! Take a look at EdgeNGramTokenFilter. An Analyzer using it might look something like:
Analyzer analyzer = new Analyzer() {
#Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
KeywordTokenizer source = new KeywordTokenizer(reader);
LowercaseFilter filter = new LowercaseFilter(source);
filter = new EdgeNGramTokenFilter(filter, EdgeNGramTokenFilter.Side.BACK, 2, 50);
return new TokenStreamComponents(source, filter);
}
};
For your consideration, WordDelimiterTokenizer might also prove useful to you. It has a number of configuartion options, and can be used to separate at punctuation and at transitions from letter to number, etc. So with it, you could get the from your input: "EMG1090-5S"
You could get the tokens:
EMG
1090
5
S
Which might work well for your case, but would not be particularly helpful in finding something like: "MG1"

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Lucene search ton of names - lucene

Related

lucene wildcard query with space

Lucene phrase query with wildcards

Lucene search using StopWords in StandardAnalyzer

Find typo with Lucene

Lucene analyzer for substrings

Categories

Resources