I am calling Lucene using the following code (PyLucene, to be precise):
analyzer = StandardAnalyzer(Version.LUCENE_30)
queryparser = QueryParser(Version.LUCENE_30, "text", analyzer)
query = queryparser.parse(queryparser.escape(querytext))
But consider if this is the content of querytext:
querytext = "THE FOOD WAS HONESTLY NOT WORTH THE PRICE. MUCH TOO PRICY WOULD NOT GO BACK AND OR RECOMMEND IT"
In that case, the "AND OR" trips up the queryparser, even though I am using queryparser.escape. How do I avoid the following error message?
Java stacktrace:
org.apache.lucene.queryParser.ParseException: Cannot parse 'THE FOOD WAS HONESTLY NOT WORTH THE PRICE. MUCH TOO PRICY WOULD NOT GO BACK AND OR RECOMMEND IT': Encountered " <OR> "OR "" at line 1, column 80.
Was expecting one of:
<NOT> ...
"+" ...
"-" ...
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
<TERM> ...
"*" ...
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:187)
....
at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1759)
at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:1641)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1268)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1207)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1167)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:182)
It's not just OR, it's AND OR.
I use the following workaround:
query = queryparser.parse(queryparser.escape(querytext.replace("AND OR", "AND or")))
queryparser.escape only escapes special characters (as shown in this page) and leaves "AND OR" unchanged, so it does not help in your case. Since you presumably also used StandardAnalyzer to index your text, the terms in your index are already lowercase. So you can convert the whole query string to lowercase before giving it to the queryparser. Lowercase "and" and "or" are not treated as operators, so "and or" will not trip the queryparser.
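A minimal sketch of a narrower variant of that idea, in plain Python: instead of lowercasing the whole string, it lowercases only the standalone words AND, OR and NOT. The helper name neutralize_operators is made up for this illustration; calling querytext.lower() is the simpler route if the case of the other words does not matter.

```python
import re

# Uppercase words that the classic Lucene query syntax reads as boolean operators.
_OPERATORS = re.compile(r"\b(AND|OR|NOT)\b")


def neutralize_operators(querytext):
    """Lowercase standalone AND/OR/NOT so they are parsed as ordinary terms.

    Hypothetical helper, not part of PyLucene.
    """
    return _OPERATORS.sub(lambda m: m.group(0).lower(), querytext)


cleaned = neutralize_operators("WOULD NOT GO BACK AND OR RECOMMEND IT")
print(cleaned)  # WOULD not GO BACK and or RECOMMEND IT

# With PyLucene the result would then be passed on as in the question:
# query = queryparser.parse(queryparser.escape(cleaned))
```

The word boundaries keep words such as ANDROID or ORDER untouched.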
I realise I'm rather late to the party here, but putting quotes round the search string is a better option:
querytext = "\"THE FOOD WAS ... \""
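A rough sketch of how the quoting could be done safely when the text itself may contain quotes or backslashes. The helper name as_phrase is invented for this example. Keep in mind that a quoted string is parsed as a phrase query, so it matches the terms in sequence, which is stricter than searching for the individual words.

```python
def as_phrase(querytext):
    """Wrap text in double quotes so the parser sees one quoted phrase.

    Hypothetical helper. Backslashes and embedded double quotes are
    backslash-escaped so they cannot end the phrase early.
    """
    escaped = querytext.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


phrase = as_phrase("WOULD NOT GO BACK AND OR RECOMMEND IT")
print(phrase)  # "WOULD NOT GO BACK AND OR RECOMMEND IT"

# With PyLucene, pass the phrase to parse() directly. Running it through
# queryparser.escape() as well would escape the surrounding quotes again:
# query = queryparser.parse(phrase)
```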
I've cobbled together a basic PowerShell script to query W10's Windows Desktop Search (WDS) index. Here are the relevant bits:
$query = "
SELECT System.DateModified, System.ItemPathDisplay
FROM SystemIndex
WHERE CONTAINS(System.Search.Contents, '$($text)')
"
$objConnection = New-Object -ComObject adodb.connection
$objrecordset = New-Object -ComObject adodb.recordset
$objrecordset.CursorLocation = 3
$objconnection.open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';")
$objrecordset.open($query, $objConnection, $adOpenStatic)
Until now my tests used single words and everything worked. But as soon as I search for two words, it falls apart with the following error:
Searching for 'and then'...
SELECT System.DateModified, System.ItemPathDisplay
FROM SystemIndex
WHERE CONTAINS(System.Search.Contents, 'and then')
Exception from HRESULT: 0x80040E14
At D:\searchSystemIndex.ps1:72 char:1
+ $objrecordset.open($query, $objConnection, $adOpenStatic)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (:) [], COMException
+ FullyQualifiedErrorId : System.Runtime.InteropServices.COMException
Using Explorer to query the index using content:"and then" works fine.
Any ideas?
According to the documentation for Windows Search SQL Syntax and the examples in the CONTAINS predicate, if you want to search for a literal phrase with "multiple words or included spaces" you need to quote the phrase inside the query:
Type: Phrase
Description: Multiple words or included spaces.
Examples
...WHERE CONTAINS('"computer software"')
So in your example you probably want:
$text = "and then"
$query = "
SELECT System.DateModified, System.ItemPathDisplay
FROM SystemIndex
WHERE CONTAINS(System.Search.Contents, '`"$($text)`"')
"
# ^^ ^^
# quoted search phrase
(note the quotes are prefixed with a backtick as the quote would otherwise terminate your entire query string.)
If you're not looking for the exact phrase "and then", and you just want results that contain both "and" and "then", it looks like you need to do something like this:
Type: Boolean
Description: Words, phrases, and wildcard strings combined by using the Boolean operators AND, OR, or NOT. Enclose the Boolean terms in double quotation marks.
Example:
...WHERE CONTAINS('"computer monitor" AND "software program" AND "install component"')
...WHERE CONTAINS(' "computer" AND "software" AND "install" ' )
$query = "
SELECT System.DateModified, System.ItemPathDisplay
FROM SystemIndex
WHERE CONTAINS(System.Search.Contents, '`"and`" AND `"then`"')
"
# ^^^^^^^^^^^^^^^^^^^^^^
# multiple independent words
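The string-building step can also be sketched independently of PowerShell. The following Python illustration builds the Boolean search condition from a list of words. The helper name contains_all is invented here, and because the escaping rules for quote characters are not covered above, the sketch simply rejects input that contains them.

```python
def contains_all(words):
    """Build a CONTAINS search condition that requires every word.

    Hypothetical helper for illustration only. Each word is wrapped in
    double quotes and the words are combined with AND.
    """
    for word in words:
        if '"' in word or "'" in word:
            raise ValueError("quote characters are not handled in this sketch")
    return " AND ".join('"{}"'.format(word) for word in words)


condition = contains_all("and then".split())
query = (
    "SELECT System.DateModified, System.ItemPathDisplay "
    "FROM SystemIndex "
    "WHERE CONTAINS(System.Search.Contents, '{}')".format(condition)
)
print(query)
```

The same concatenation carries over to the PowerShell here-string, where the inner double quotes need the backtick prefix.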
I am trying to understand this answer: https://stackoverflow.com/a/44180583/481061 and particularly this part:
if the first line of the statement is a valid statement, it won't work:
val text = "This " + "is " + "a "
+ "long " + "long " + "line" // syntax error
This does not seem to be the case for the dot operator:
val text = obj
.getString()
How does this work? I'm looking at the grammar (https://kotlinlang.org/docs/reference/grammar.html) but am not sure what to look for to understand the difference. Is it built into the language outside of the grammar rules, or is it a grammar rule?
It is a grammar rule, but I was looking at an incomplete grammar.
In the full grammar https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinParser.g4 it's made clear in the rules for memberAccessOperator and identifier.
The DOT can always be preceded by NL* while the other operators cannot, except in parenthesized contexts which are defined separately.
Say I have a string: "Test         me".
How do I convert it to: "Test me"?
I've tried using:
string?.replace("\\s+", " ")
but it appears that \\s is an illegal escape in Kotlin.
The replace function in Kotlin has overloads for both literal String and Regex patterns.
"Test         me".replace("\\s+", " ")
This treats the pattern as the literal string \s+, which is the problem.
"Test         me".replace("\\s+".toRegex(), " ")
This line replaces each run of whitespace with a single space.
Note the explicit toRegex() call, which makes a Regex from a String and thus selects the overload that takes a Regex as the pattern.
There's also an overload which allows you to produce the replacement from the matches. For example, to replace them with the first whitespace encountered, use this:
"Test\n\n me".replace("\\s+".toRegex()) { it.value[0].toString() }
By the way, if the operation is repeated, consider moving the pattern construction out of the repeated code for better efficiency:
val pattern = "\\s+".toRegex()
for (s in strings)
result.add(s.replace(pattern, " "))
I got the below error in my project:
org.apache.lucene.queryParser.ParseException: Cannot parse 'AMERICAN EXP PROPTY CASLTY INS AND': Encountered "<EOF>" at line 1, column 34.
Was expecting one of:
<NOT> ...
"+" ...
"-" ...
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
<TERM> ...
"*" ...
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:211)
at org.elasticsearch.index.query.xcontent.QueryStringQueryParser.parse(QueryStringQueryParser.java:196)
... 15 more
Please help me resolve this. Whenever I add an AND at the end of any string,
it gives me the above error.
Thanks
When you use a QueryString query or specify your query as the q parameter, Elasticsearch uses Lucene to parse it. As a result, it expects your query to follow Lucene query syntax and returns an error when the query contains a syntax error (a dangling AND at the end, in your case). If you want your query string to be interpreted as text and not parsed as a query, consider using a Text Query instead.
That's funny.
Lucene is waiting for another term, because in Lucene you can build queries like "termA AND termB" or "+termA +termB".
Can you try to lowercase your query and see if it works?
Use the correct package name and classpath. Note that the "p" in "queryparser" is lowercase:
org.apache.lucene.queryparser.classic.ParseException
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>4.3.0</version>
</dependency>
I have a string which contains words with parentheses. I need to remove the whole word from the string.
For example: for the input, "car wheels_(four) klaxon" the result should be, "car klaxon".
Can someone give me an example that would accomplish this?
You can do this with regular expressions. The regular expression you need is:
"\s?\S*[()]\S*\s?"
This matches any word containing ( or ) or both, together with the whitespace around it. Replace each match with a single space.
In C# the regular expression could be used like this:
string s = "car wheels_(four) klaxon";
s = Regex.Replace(s, @"\s?\S*[()]\S*\s?", " ");
I'm not entirely sure of the VB translation for this, but hopefully you can figure it out.
Slightly different:
sed "s/\s\+\S*(.\+)\S*\s\+/ /g" yourfile
It works like this:
yourfile:
car wheels_(four) klaxon
ciao (wheel) hey
foo bar (baz) qux
stack overflow_(rulez)_the world
transformed in:
car klaxon
ciao hey
foo bar qux
stack world
If speed isn't an issue and you want to avoid overcomplicated regular expressions, you can use String.Split on " " to create an array of "words", iterate through the words, drop any for which String.Contains("(") is true, then use String.Join with a separator of " " to get your result.
Sorry can't send the codez, don't have a VB.NET compiler on hand.
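The split-and-join approach is easy to sketch in a language-neutral way. Here is a short Python illustration, where str.split and str.join stand in for String.Split and String.Join. The function name drop_parenthesized_words is made up, and the sketch also checks for ")" so that a word with only a closing parenthesis is removed too.

```python
def drop_parenthesized_words(text):
    """Remove every space-separated word that contains a parenthesis.

    Hypothetical helper illustrating the split / filter / join idea.
    """
    words = text.split(" ")
    # Keep only the words without "(" or ")".
    kept = [word for word in words if "(" not in word and ")" not in word]
    return " ".join(kept)


print(drop_parenthesized_words("car wheels_(four) klaxon"))  # car klaxon
```

Dropping the words instead of replacing them with empty strings avoids doubled spaces in the joined result.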