Can I boost different fields in MultiFieldQueryParser with different factors?
Also, what is the maximum boost factor value I can assign to a field?
Thanks a ton!
Ed
MultiFieldQueryParser has a [constructor][1] that accepts a map of boosts. You use it with something like this:
String[] fields = new String[] { "title", "keywords", "text" };
HashMap<String,Float> boosts = new HashMap<String,Float>();
boosts.put("title", 10);
boosts.put("keywords", 5);
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
fields,
new StandardAnalyzer(),
boosts
);
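Parsing a query then expands it across the listed fields with the boosts applied; printing the result should show something like this (approximate form):

Query query = queryParser.parse("lucene");
System.out.println(query);
// roughly: (title:lucene^10.0) (keywords:lucene^5.0) (text:lucene)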
As for the maximum boost, I'm not sure, but you shouldn't think about boosts in absolute terms anyway. Just use a ratio of boosts that makes sense.
[1]: https://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html#MultiFieldQueryParser(org.apache.lucene.util.Version, java.lang.String[], org.apache.lucene.analysis.Analyzer, java.util.Map)
What is the best way to check whether String a is part of String b in Lucene? For example: a = "capital" and b = "Berlin is a capital of Germany". In this case b contains a and meets the requirement.
I think your problem can be treated as checking whether a given field contains a certain term.
The basic TermQuery should be enough to solve it. With most analyzers, "Berlin is a capital of Germany" will be analyzed into the terms "berlin", "capital", "germany" (assuming the basic English stop words).
// code in Scala
val query = new TermQuery(new Term("contents", "capital"))
You can also use PhraseQuery, though your problem is not the most suitable scenario for it:
val query = new PhraseQuery();
query.add(new Term("contents", "capital"))
Lucene in Action, 2nd edition, section 3.4 ("Lucene's diverse queries") introduces all the kinds of Query used in Lucene. I suggest you give it a read; it might help.
I am using Lucene 4.3.0 and want to tokenize documents containing both English and Japanese characters.
An example is "LEICA S2 カタログ (新品)".
The StandardAnalyzer produces "[leica] [s2] [カタログ] [新] [品]".
The JapaneseAnalyzer produces "[leica] [s] [2] [カタログ] [新品]".
In my project, the StandardAnalyzer is better on the English parts, e.g. [s2] is better than [s] [2], while the JapaneseAnalyzer is better on the Japanese parts, e.g. [新品] rather than [新] [品]. In addition, the JapaneseAnalyzer has a useful feature that converts fullwidth characters such as "２" to halfwidth "2".
I want the tokens to be [leica] [s2] [カタログ] [新品], which means:
1) English words and numbers are tokenized by the StandardAnalyzer: [leica] [s2]
2) Japanese is tokenized by the JapaneseAnalyzer: [カタログ] [新品]
3) Fullwidth characters are converted to halfwidth by a filter: [ｓ２] => [s2]
How do I implement this custom analyzer?
The first thing I would try is messing with the arguments passed to the JapaneseAnalyzer, particularly the JapaneseTokenizer.Mode (I know precisely nothing about the structure of the Japanese language, so no help from me on the intent of those options).
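For example, a sketch of constructing the analyzer with an explicit mode, against the Lucene 4.3 API (EXTENDED is just one of the options; compare the output of each mode on your data):

Analyzer analyzer = new JapaneseAnalyzer(
    Version.LUCENE_43,
    null, // no user dictionary
    JapaneseTokenizer.Mode.EXTENDED, // try NORMAL, SEARCH, and EXTENDED
    JapaneseAnalyzer.getDefaultStopSet(),
    JapaneseAnalyzer.getDefaultStopTags());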
Barring that
You'll need to create your own Analyzer for this. Unless you are willing to write your own Tokenizer, the end result may be a best effort. Creating an analyzer is pretty simple, creating a tokenizer will mean defining your own grammar, which will not be so simple.
Take a look at the code for JapaneseAnalyzer and StandardAnalyzer, particularly the call to createComponents, which is all you need to implement to create a custom analyzer.
Say you come to the conclusion that the StandardTokenizer is right for you, but otherwise you want to use mostly the Japanese filter set; it might look something like:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // For your Tokenizer, you might consider StandardTokenizer, JapaneseTokenizer, or CharTokenizer
    Tokenizer tokenizer = new StandardTokenizer(matchVersion, reader);
    TokenStream stream = new StandardFilter(matchVersion, tokenizer);
    stream = new JapaneseBaseFormFilter(stream);
    stream = new LowerCaseFilter(matchVersion, stream); // In JapaneseAnalyzer, a LowerCaseFilter comes at the end, further proving I don't know Japanese.
    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    stream = new CJKWidthFilter(stream); // Note this width filter! I believe this does the char-width transform you are looking for.
    stream = new StopFilter(matchVersion, stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new PorterStemFilter(stream); // Nothing stopping you using a second stemmer, really.
    return new TokenStreamComponents(tokenizer, stream);
}
That's a completely random implementation, from someone who doesn't understand the concerns of analyzing Japanese, but hopefully it points the way toward implementing a more meaningful Analyzer. The order in which you apply the filters in that chain is important, so be careful there (i.e. in English, the LowerCaseFilter is usually applied early, so that things like stemmers don't have to worry about case).
I am currently trying to get all documents from a Lucene index (v. 4) in a RAMDirectory.
On index creation, the following addDocument function is used:
public void addDocument(int id, String[] values, String[] fields) throws IOException{
Document doc = new Document();
doc.add(new IntField("getAll", 1, IntField.TYPE_STORED));
doc.add(new IntField("ID", id, IntField.TYPE_STORED));
for(int i = 0; i < fields.length; i++){
doc.add(new TextField(fields[i], values[i], Field.Store.NO));
}
writer.addDocument(doc);
}
After calling this for all documents, the writer is closed.
As you can see from the first field added to the document, I added an additional field "getAll" to make it easy to retrieve all documents. If I understood it right, the query "getAll:1" should return all documents in the index. But that's not the case.
I am using the following function for that:
public List<Integer> getDocIds(int noOfDocs) throws IOException, ParseException{
List<Integer> result = new ArrayList<Integer>(noOfDocs);
Query query = parser.parse("getAll:1");
ScoreDoc[] docs = searcher.search(query, noOfDocs).scoreDocs;
for(ScoreDoc doc : docs){
result.add(doc.doc);
}
return result;
}
noOfDocs is the number of documents that were indexed. Of course I used the same RAMDirectory when creating the IndexSearcher.
Substituting the parsed query with a manually created TermQuery didn't help either.
The query returns no results.
Hope someone can help to find my error.
Thanks
I believe you are having trouble searching because you are using an IntField, rather than a StringField or TextField, for instance. IntField, and other numeric fields, are designed for numeric range querying, and are not indexed in their raw form. You may use a NumericRangeQuery to search for them.
Really, though, IntField should only be used, to my mind, for numeric values, and not for a string of digits, which is what you appear to have. IDs should be keyword or text fields, generally.
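For instance, a hypothetical reworking of the ID field from your addDocument, indexed verbatim as a keyword (StringField is not tokenized), which a plain TermQuery can then match:

doc.add(new StringField("ID", String.valueOf(id), Field.Store.YES));
Query byId = new TermQuery(new Term("ID", String.valueOf(id)));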
As far as pulling all records, you don't need to add a field to do that. Simply use a MatchAllDocsQuery.
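For example, using the names from your question (a sketch, assuming Lucene 4.x):

// Exact lookup of the numeric IntField value, via a degenerate range:
Query byId = NumericRangeQuery.newIntRange("ID", id, id, true, true);

// All documents, no marker field needed:
ScoreDoc[] docs = searcher.search(new MatchAllDocsQuery(), noOfDocs).scoreDocs;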
I think first you should run Luke to verify the contents of the index.
Also, if you allow * as the first character of a query with queryParser.setAllowLeadingWildcard(true), then a query like ID:* would retrieve all documents without having to include the getAll field.
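For example (a sketch against the classic Lucene 4.x QueryParser, reusing whatever analyzer you indexed with):

QueryParser queryParser = new QueryParser(Version.LUCENE_40, "ID", analyzer);
queryParser.setAllowLeadingWildcard(true);
Query query = queryParser.parse("ID:*");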
With Lucene's query parser, it's possible to boost terms (causing them to weight higher in the search results) by appending "^n", e.g. "apple^5 pie" would assign five times more importance to the term "apple". Is it possible to do this when constructing a query using the API? Note that I'm not wanting to boost fields or documents, but individual terms within a field.
You simply need to use the setBoost(float) method of the Query class.
For example:
TermQuery tq1 = new TermQuery(new Term("text", "term1"));
tq1.setBoost(5f);
TermQuery tq2 = new TermQuery(new Term("text", "term2"));
tq2.setBoost(0.8f);
BooleanQuery query = new BooleanQuery();
query.add(tq1, Occur.SHOULD);
query.add(tq2, Occur.SHOULD);
This is equivalent to parsing the query text:term1^5 text:term2^0.8.
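If you want to sanity-check the equivalence, you can parse the text form and print both queries (a sketch assuming Lucene 4.x and the classic QueryParser; adjust the Version constant to your release):

Query parsed = new QueryParser(Version.LUCENE_40, "text",
    new StandardAnalyzer(Version.LUCENE_40)).parse("term1^5 term2^0.8");
System.out.println(parsed); // text:term1^5.0 text:term2^0.8
System.out.println(query);  // the hand-built BooleanQuery prints the same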
I use the Lucene Java QueryParser with a KeywordAnalyzer. A query like topic:(hello world) is broken up into multiple parts by the KeywordTokenizer, so the resulting Query object looks like this: topic:(hello) topic:(world), i.e. instead of one, I now have two key-value pairs. I would like the QueryParser to interpret hello world as one value, without using double quotes. What is the best way to do so?
Parsing topic:("hello world") results in a single key value combination but, using double quotes is not an option.
I am not using the Lucene search engine; I am using Lucene's QueryParser just for parsing the query, not for searching. The text Hello World is entered at runtime by the user, so it can change. I would like the KeywordTokenizer to treat Hello World as one token instead of splitting it into two tokens.
You'll need to use a BooleanQuery. Here's a code snippet using the .NET port of Lucene. This should work with both the KeywordAnalyzer and the StandardAnalyzer.
var luceneAnalyzer = new KeywordAnalyzer();
var query1 = new QueryParser("Topic", luceneAnalyzer).Parse("hello");
var query2 = new QueryParser("Topic", luceneAnalyzer).Parse("world");
BooleanQuery filterQuery = new BooleanQuery();
filterQuery.Add(query1, BooleanClause.Occur.MUST);
filterQuery.Add(query2, BooleanClause.Occur.MUST);
TopDocs results = searcher.Search(filterQuery);
You can construct the query directly as follows. This preserves the space.
Query query = new TermQuery(new Term("field", "value has space"));
If you print query as System.out.println(query); you will see the following result.
field:value has space