Lucene 'minimumNumberShouldMatch' fails parsing

I create a simple Boolean query with org.apache.lucene.search.BooleanQuery.Builder.
I also want to use minimumNumberShouldMatch there, to specify a minimum number of the optional BooleanClauses which must be satisfied:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("field", "value1")), BooleanClause.Occur.SHOULD);
builder.add(new TermQuery(new Term("field", "value2")), BooleanClause.Occur.SHOULD);
builder.add(new TermQuery(new Term("field", "value3")), BooleanClause.Occur.SHOULD);
builder.setMinimumNumberShouldMatch(2);
String queryString = builder.build().toString();
System.out.println(queryString);
As a result, I get this query string:
(field:value1 field:value2 field:value3)~2
I want this query to return documents if at least two clauses are satisfied.
But I run into a problem when parsing this query:
new QueryParser(Version.LUCENE_7_7_1.toString(), new ClassicAnalyzer()).parse(queryString);
This throws the following exception:
Exception in thread "main" org.apache.lucene.queryparser.classic.ParseException: Cannot parse '(field:value1 field:value2 field:value3)~2': Encountered " <FUZZY_SLOP> "~2 "" at line 1, column 40.
Was expecting one of:
<EOF>
<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
<BAREOPER> ...
"(" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
<REGEXPTERM> ...
"[" ...
"{" ...
<NUMBER> ...
at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:114)
at ....lucene.common.BaseLuceneConnection.main(BaseLuceneConnection.java:101)
Caused by: org.apache.lucene.queryparser.classic.ParseException: Encountered " <FUZZY_SLOP> "~2 "" at line 1, column 40.
Was expecting one of:
<EOF>
<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
<BAREOPER> ...
"(" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
<REGEXPTERM> ...
"[" ...
"{" ...
<NUMBER> ...
at org.apache.lucene.queryparser.classic.QueryParser.generateParseException(QueryParser.java:931)
at org.apache.lucene.queryparser.classic.QueryParser.jj_consume_token(QueryParser.java:813)
at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:216)
at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:109)
... 1 more
I also tried to run this query with Luke, but I get the same error there.
Why can't this query be parsed, even though it was built with the appropriate tool?

Query.toString() is not serialization; there is no guarantee that the string it returns will be parseable by the QueryParser. It is intended to produce something reasonably human-readable for debugging purposes.
I'm not sure what you are trying to accomplish here, since you have already built a perfectly acceptable BooleanQuery, but you should never do something like this: QueryParser.parse(query.toString())
QueryParser does not support minimumNumberShouldMatch, so the ~2 suffix that toString() appends to the group cannot be parsed back. Search with your BooleanQuery directly.
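To make the semantics concrete, here is a minimal sketch in plain Java of what a minimum-should-match constraint means. It deliberately uses no Lucene classes: the class name MinShouldMatchSketch and the method matches are hypothetical, and scoring and analysis are ignored.

```java
import java.util.List;
import java.util.Set;

// Plain-Java model of a minimum-should-match constraint; no Lucene classes are used.
class MinShouldMatchSketch {

    // Counts how many optional (SHOULD) values occur among the document's terms
    // and accepts the document only if that count reaches the required minimum.
    static boolean matches(Set<String> docTerms, List<String> shouldValues, int minimumShouldMatch) {
        long satisfied = shouldValues.stream().filter(docTerms::contains).count();
        return satisfied >= minimumShouldMatch;
    }

    public static void main(String[] args) {
        List<String> should = List.of("value1", "value2", "value3");
        // Two of the three optional clauses are satisfied
        System.out.println(matches(Set.of("value1", "value3"), should, 2));
        // Only one optional clause is satisfied
        System.out.println(matches(Set.of("value2"), should, 2));
    }
}
```

With Lucene itself, the equivalent check happens inside the BooleanQuery, which is why the built query object should be handed to the searcher as-is instead of being turned into a string first.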

Related

Java Parser comment statement

I am trying to comment out a particular statement.
My first approach is to return a comment when the statement is an 'Expression Statement' and its expression is a particular 'Method Call Expression'.
new ModifierVisitor<Object>() {
    @Override
    public Visitable visit(ExpressionStmt expStmt, Object arg) {
        Expression exp = expStmt.getExpression();
        if (exp.isMethodCallExpr()) {
            // My other logic goes here
            return new LineComment(expStmt.toString());
        }
        // Leave all other expression statements untouched
        return super.visit(expStmt, arg);
    }
}
But it fails when the compilation unit is printed back to a string:
java.lang.ClassCastException: com.github.javaparser.ast.comments.LineComment cannot be cast to com.github.javaparser.ast.stmt.Statement
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:1329)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:163)
at com.github.javaparser.ast.stmt.BlockStmt.accept(BlockStmt.java:76)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:1220)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:163)
at com.github.javaparser.ast.body.MethodDeclaration.accept(MethodDeclaration.java:104)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.printMembers(DefaultPrettyPrinterVisitor.java:190)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:419)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:163)
at com.github.javaparser.ast.body.ClassOrInterfaceDeclaration.accept(ClassOrInterfaceDeclaration.java:98)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:325)
at com.github.javaparser.printer.DefaultPrettyPrinterVisitor.visit(DefaultPrettyPrinterVisitor.java:163)
at com.github.javaparser.ast.CompilationUnit.accept(CompilationUnit.java:133)
at com.github.javaparser.printer.DefaultPrettyPrinter.print(DefaultPrettyPrinter.java:104)
at com.github.javaparser.ast.Node.toString(Node.java:320)
The exception suggests that a 'Statement' can only be replaced with another statement, so next I tried replacing the statement with an 'Empty Statement'. That more or less worked, but the output does not look good because it leaves an extra ';' after the commented line.
Third, I tried to go deeper: instead of replacing the statement, I tried to replace the expression with a comment. That failed too, as described in the SO question "Javaparser comment expression".
Any idea how to fix this?
I tried a workaround which does not feel like a good solution, but gives me the expected result for now:
BlockStmt blockStmt = (BlockStmt) expStmt.getParentNode().get();
blockStmt.getStatement(blockStmt.getStatements().indexOf(expStmt) + 1).setLineComment(expStmt.toString());
return null;

Select FROM clause in Jena returning no results

We are having trouble reliably issuing SPARQL queries across multiple graphs using the SPARQL FROM clause within a Jena dataset.
Here is an example of the issue:
final String subject = "http://example.com/ont/breakfast#espresso";
final String graph1 = "http://example.com/ont/breakfast/graph#espresso_definition";
final String graph2 = "http://example.com/ont/breakfast/graph#espresso_decoration";
// Add some triples to graphs within the dataset
Dataset dataset = DatasetFactory.create();
Model modelG1 = dataset.getNamedModel(graph1);
Resource espressoTypeG1 = modelG1.createResource(subject)
.addProperty(RDF.type, OWL.Class);
Resource espressoLabelG1 = modelG1.createResource(subject)
.addProperty(RDFS.label, "Espresso");
Model modelG2 = dataset.getNamedModel(graph2);
Resource espressoLabelG2 = modelG2.createResource(subject)
.addProperty(RDFS.label, "Black Gold");
// The query to execute - returns no results
String queryString = "select * FROM <" + graph1 + "> FROM <" + graph2 + "> " +
"{ <" + subject + "> ?p ?o }";
// This, however, works:
// String queryString = "select * { graph ?g { <" + subject + "> ?p ?o } }";
// Run the query
Query query = QueryFactory.create(queryString);
try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
    ResultSet results = qe.execSelect();
    while (results.hasNext()) {
        QuerySolution result = results.next();
        System.out.println(result);
    }
}
A combination of a VALUES clause and the GRAPH keyword has helped us through most of the scenarios where we need to process multiple graphs in the same query. For some queries, however, this gets quite unwieldy or downright inefficient.
What can we do to correctly issue a query across a union of models within a single dataset?
Note that the queries are not known at compile time, so we cannot rely on manually creating unions of models in Java code. Furthermore, the data is generally added using a combination of loading from files, SPARQL Update and calls to dataset.asDatasetGraph().add(...).
Handling of FROM and FROM NAMED depends on whether the Dataset implementation in use supports it; the default in-memory implementations do not.
To enforce the dataset described by the query, you can use the DynamicDatasets and DatasetDescription helper classes to resolve the query-specified dataset, e.g.:
Dataset resolvedDataset =
    DynamicDatasets.dynamicDataset(DatasetDescription.create(query), dataset, false);
try (QueryExecution qe = QueryExecutionFactory.create(query, resolvedDataset)) {
    // Normal result processing logic goes here...
}
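For orientation, the SPARQL specification defines the default graph of a query with several FROM clauses as the merge of the graphs they name. The following plain-Java sketch models only that merge, without Jena. FromClauseSketch, sample and defaultGraphFor are hypothetical names, the graph keys are shortened stand-ins for the IRIs in the question, and triples are simplified to three-element string lists.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of how FROM clauses select a query's default graph; Jena is not used.
class FromClauseSketch {

    // Hypothetical stand-in for the dataset in the question: graph name -> triples.
    static Map<String, Set<List<String>>> sample() {
        Map<String, Set<List<String>>> graphs = new HashMap<>();
        graphs.put("graph#espresso_definition", Set.of(
                List.of("espresso", "rdf:type", "owl:Class"),
                List.of("espresso", "rdfs:label", "Espresso")));
        graphs.put("graph#espresso_decoration", Set.of(
                List.of("espresso", "rdfs:label", "Black Gold")));
        return graphs;
    }

    // Merges the triples of every graph named in a FROM clause into one default graph.
    static Set<List<String>> defaultGraphFor(Map<String, Set<List<String>>> namedGraphs,
                                             List<String> fromGraphs) {
        Set<List<String>> merged = new LinkedHashSet<>();
        for (String name : fromGraphs) {
            merged.addAll(namedGraphs.getOrDefault(name, Set.of()));
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<List<String>> merged = defaultGraphFor(sample(),
                List.of("graph#espresso_definition", "graph#espresso_decoration"));
        merged.forEach(System.out::println);
    }
}
```

A pattern such as { <subject> ?p ?o } is then evaluated against the merged set, which is the behaviour the resolved dataset above is meant to provide.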

Colon in column name causes java.sql.SQLException: Invalid column name

I've been getting a "java.sql.SQLException: Invalid column name" error when I try to select each row with an attribute name that contains a colon (for example "item:one").
In my code the following line is called:
if (!(matGroup == null || matGroup.equals(''))) {
    sql.eachRow("select distinct da.item_name from defined_attribute da where da.attribute2=$matGroup") {
        selectItems << it.item_name
    }
}
If I run the same select command directly against the database (select distinct da.item_name from defined_attribute da where da.attribute2='item:one'), I receive no errors. I have also tried the code with item names containing no colon, and those seem to be retrieved correctly.
I've found here: http://groovy.329449.n5.nabble.com/Sql-is-it-possible-to-escape-the-colon-td5155389.html that adding "sql.enableNamedParameters = false" might solve the problem. However, I'm constrained to use Groovy 2.1.6, where that is not supported.
I've also tried other things to work around the colon, but none have been successful. Things I've tried include:
Defining the parameter in a collection
def params = [matGroup]
sql.eachRow("select distinct da.item_name from defined_attribute da where da.attribute2=?", params)
Escaping the colon before passing it to eachRow()
def params = [matGroup.replaceAll(":", "\\:")]
sql.eachRow("select distinct da.item_name from defined_attribute da where da.attribute2=?", params)
Defining the command as a string before passing it to eachRow()
def command = "select distinct da.item_name from defined_attribute da where da.attribute2='" + matGroup + "\'"
sql.eachRow(command.toString())
Is there a way to make this work for Groovy 2.1.6?

Lucene MultiFieldQuery with WildcardQuery

Currently I have an issue with the Lucene search (version 2.9).
I have a search term and I need to use it on several fields. Therefore, I have to use MultiFieldQueryParser. On the other hand, I have to use WildcardQuery, because our customer wants to search for a term inside a longer value (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried to replace the slashes ('/') with stars ('*') and use a BooleanQuery with enclosed stars for the term.
Unfortunately without any success.
Does anyone have any idea?
Yes. If the field shown is indexed as a single token, you need to enable leading wildcards with setAllowLeadingWildcard(true), like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
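As a rough illustration of why the leading wildcard matters here, the following plain-Java sketch models wildcard matching against one indexed term. It does not use Lucene: WildcardSketch and its methods are hypothetical names, and it assumes the field value was indexed verbatim as a single, un-analyzed token.

```java
import java.util.regex.Pattern;

// Plain-Java model of wildcard matching against a single indexed term; Lucene is not used.
class WildcardSketch {

    // Translates a wildcard pattern into a regex:
    // '*' matches any run of characters, '?' matches exactly one character,
    // everything else is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The pattern must cover the whole term, not just a part of it.
    static boolean matches(String wildcard, String indexedTerm) {
        return toRegex(wildcard).matcher(indexedTerm).matches();
    }

    public static void main(String[] args) {
        String term = "KRC250/CMH/830/T/H"; // assumed to be stored as one token
        System.out.println(matches("*CMH*", term)); // leading and trailing wildcard
        System.out.println(matches("CMH*", term));  // trailing wildcard only
    }
}
```

Because the pattern has to cover the whole term, a trailing wildcard alone cannot reach text in the middle of the token, which is why the leading wildcard has to be allowed.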
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowerCaseFilter), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // Assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
//Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars. If you can include code to elucidate that, it might help.
Sorry, maybe I described it a little bit wrong.
I tried something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
    foreach (string tok in tokArr)
    {
        bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
    }
}
return bq;
but unfortunately it did not work.
I have modified it like this
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)

Using ShingleFilter to build customized analyzer in PyLucene

I am pretty new to Lucene and PyLucene. I ran into a problem while using PyLucene to write a customized analyzer that tokenizes text into bigrams.
The code for analyzer class is:
class BiGramShingleAnalyzer(PythonAnalyzer):
    def __init__(self, outputUnigrams=False):
        PythonAnalyzer.__init__(self)
        self.outputUnigrams = outputUnigrams

    def tokenStream(self, field, reader):
        result = ShingleFilter(LowerCaseTokenizer(Version.LUCENE_35, reader))
        result.setOutputUnigrams(self.outputUnigrams)
        #print 'result is', result
        return result
I used ShingleFilter on the TokenStream produced by LowerCaseTokenizer. When I call the tokenStream function directly, it works just fine:
str = 'divide this sentence'
bi = BiGramShingleAnalyzer(False)
sf = bi.tokenStream('f', StringReader(str))
while sf.incrementToken():
    print sf
(divide this,startOffset=0,endOffset=11,positionIncrement=1,type=shingle)
(this sentence,startOffset=7,endOffset=20,positionIncrement=1,type=shingle)
But when I tried to build a query parser using this analyzer, a problem occurred:
parser = QueryParser(Version.LUCENE_35, 'f', bi)
query = parser.parse(str)
The resulting query is empty.
After adding a print statement to the tokenStream function, I found that when I call parser.parse(str), the print statement in tokenStream actually gets called 3 times (there are 3 words in my str variable). It seems to me that the parser pre-processes the str I pass to it and calls the tokenStream function on the result of that pre-processing.
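The effect described above can be reproduced outside Lucene. The sketch below is plain Java with hypothetical names (ShingleSketch, tokenize, bigrams). It assumes, as the three calls suggest, that the parser splits the query text on whitespace and hands each word to the analyzer separately.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a bigram-only shingle filter; neither Lucene nor PyLucene is used.
class ShingleSketch {

    // Lowercases the text and splits it on whitespace, dropping empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Joins each pair of adjacent tokens; single tokens are never emitted,
    // which mirrors a shingle filter configured with outputUnigrams=false.
    static List<String> bigrams(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return shingles;
    }

    public static void main(String[] args) {
        String text = "divide this sentence";
        // The analyzer sees the whole string at once
        System.out.println(bigrams(tokenize(text)));
        // The analyzer sees one whitespace-separated word at a time
        for (String word : text.split("\\s+")) {
            System.out.println(word + " -> " + bigrams(tokenize(word)));
        }
    }
}
```

Under that assumption, every per-word call has only one token to work with, so a bigram filter that emits no unigrams produces nothing and the query ends up empty.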
Any thoughts on how I should make the analyzer work, so that when I pass it to the query parser, the parser can parse a string into bigrams?
Thanks in advance!