Boolean query with multiple OR:s in a proximity search using Lucene syntax - lucene

I'm trying to create a search engine that uses Lucene-syntax boolean queries to search against various third-party APIs. It all seems to work fine: the user enters the query in a text field and I parse it using https://www.npmjs.com/package/lucene-query-parser to check that it's actually a valid Lucene query.
However, I have now received a request to support a query with multiple ORs in a proximity search. Basically, they want to search for A, B or C, where any of those should be close to Z, X or Y.
The normal Lucene proximity search looks like this:
"a b"~20
This gives me all hits where a is within 20 words of b.
I have been looking all over the place for the syntax of doing this multiple OR proximity search with Lucene for a couple of days now but haven't found anything. Is it even possible?
I have tried stuff like this:
"a OR b, x OR z"~20
"a b OR x z"~20
"a, b OR x, z"~20
But none of them work.
Thanks in advance!
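One workaround that is sometimes suggested, assuming the parser accepts no OR inside a quoted phrase, is to expand the two groups into every pair and OR the resulting proximity phrases together. The sketch below only builds the query string; the helper name expand_proximity is hypothetical, and whether each third-party API accepts the expanded form has not been checked here.

```python
from itertools import product


def expand_proximity(left_terms, right_terms, slop):
    """Build an OR of pairwise proximity phrases (hypothetical helper).

    Every term on the left is paired with every term on the right, so
    the result grows as len(left_terms) * len(right_terms).
    """
    clauses = [f'"{a} {b}"~{slop}' for a, b in product(left_terms, right_terms)]
    return "(" + " OR ".join(clauses) + ")"


query = expand_proximity(["a", "b"], ["x", "z"], 20)
print(query)
```

For the groups (a, b) and (x, z) this prints ("a x"~20 OR "a z"~20 OR "b x"~20 OR "b z"~20), which uses only the plain proximity syntax shown above.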

Related

Getting search results from wikidata website, but not API

I'm trying out the wikidata API but have some trouble with the search query "Jas 39 C Gripen". It returns results on the wikidata website, but not if I use the API.
On the Wikidata website I get two search results for the query:
https://www.wikidata.org/w/index.php?search=Jas+39+C+Gripen&title=Special:Search&fulltext=1
The same query using the API does not return a result:
https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json&language=en&type=item&continue=0&search=Jas%2039%20C%20Gripen
Am I missing some parameters or using the wrong parameters? For many other queries I get results from the API.
It seems that the native Wikidata search applies some "fuzzy logic" when interpreting the user's entries. In your case, it shows two results, although the character C is missing in the first one.
Coming back to the API and the action you have chosen, you could use Jas 39 Gripen as search term (which will show one result) as well as Jas 39C Gripen (which will also show one result). But it seems that you can't use Jas 39 C Gripen (note the space character between 9 and C).
In other words,
https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json&language=en&type=item&continue=0&search=Jas%2039%20Gripen
https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json&language=en&type=item&continue=0&search=Jas%2039C%20Gripen
both work, but
https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json&language=en&type=item&continue=0&search=Jas%2039%20C%20Gripen
does not.
I have investigated this issue further and finally found the solution. Try this:
https://www.wikidata.org/w/api.php?action=query&list=search&format=json&srsearch=Jas+39+C+Gripen
The action query allows some "fuzziness" in the search term. Please refer to the API documentation for further details. In short, this action performs a full text search (which you obviously want) and allows for a nearmatch search type.
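The two request styles discussed here can be compared side by side. The following is a minimal sketch that only builds the URLs, using the parameters already shown in the links above; the function names are made up for illustration and no request is sent.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def entity_search_url(term):
    # Label/alias lookup, as in the original wbsearchentities request.
    params = {
        "action": "wbsearchentities",
        "format": "json",
        "language": "en",
        "type": "item",
        "continue": 0,
        "search": term,
    }
    return API + "?" + urlencode(params)


def fulltext_search_url(term):
    # Full-text search through action=query with list=search.
    params = {
        "action": "query",
        "list": "search",
        "format": "json",
        "srsearch": term,
    }
    return API + "?" + urlencode(params)


print(fulltext_search_url("Jas 39 C Gripen"))
print(entity_search_url("Jas 39C Gripen"))
```

urlencode writes spaces as + rather than %20; both forms decode to the same search term.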
The reason seems to be that the English label is JAS 39C Gripen. By removing one space from your query, you will get the result you are looking for:
https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json&language=en&type=item&continue=0&search=Jas%2039C%20Gripen

Search multiple keywords in DocumentDB collection

I have an Azure DocumentDB collection with 100 documents. I have tokenized an array of search terms in each document for performing a search based on keywords.
I was able to search on just one keyword using the SQL query below for DocumentDB:
SELECT VALUE c FROM root c JOIN word IN c.tags WHERE
CONTAINS(LOWER(word), LOWER('keyword'))
However, this only allows a search on a single keyword. I want to be able to search on multiple keywords. For this, I tried the query below:
SELECT * FROM c WHERE ARRAY_CONTAINS(c.tags, "Food") OR
ARRAY_CONTAINS(c.tags, "Dessert") OR ARRAY_CONTAINS(c.tags, "Spicy")
This works, but it is case-sensitive. How do I make it case-insensitive? I tried using the scalar function LOWER like this:
LOWER(c.tags), LOWER("Dessert")
but this doesn't seem to work with ARRAY_CONTAINS.
Any idea how I can perform a case-insensitive search on multiple keywords using a SQL query for DocumentDB?
Thanks,
AB
The best way to deal with the case sensitivity is to store the tags in the tags array in all lower case (or upper case) and then just do LOWER(<user-input-tag>) at query time.
As for your desire to search on multiple user input tags, your approach of building a series of OR clauses is probably the best approach.
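A minimal sketch of that approach, assuming the tags are already stored in lower case: it lowercases the user input on the client and builds one ARRAY_CONTAINS clause per tag. The helper name build_tag_query and the @tag0, @tag1, ... parameter names are illustrative; the name/value parameter list is an assumption about how the query would be passed to the client SDK.

```python
def build_tag_query(user_tags):
    """Return (query_text, parameters) matching documents with any of the tags.

    Assumes c.tags holds lower-cased strings, so only the user input
    needs to be normalised here.
    """
    if not user_tags:
        raise ValueError("at least one tag is required")
    clauses = []
    parameters = []
    for i, tag in enumerate(user_tags):
        name = f"@tag{i}"
        clauses.append(f"ARRAY_CONTAINS(c.tags, {name})")
        parameters.append({"name": name, "value": tag.lower()})
    return "SELECT * FROM c WHERE " + " OR ".join(clauses), parameters


query_text, parameters = build_tag_query(["Food", "Dessert", "Spicy"])
print(query_text)
print(parameters)
```

Passing the values as parameters rather than pasting them into the query text also avoids quoting problems when a tag contains a quote character.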

SQL Server full text search and spaces

I have a column with product names. Some names look like ‘ab-cd’ or ‘ab cd’.
Is it possible to use full text search to get these names when the user types ‘abc’ (without spaces)? The LIKE operator is working for me, but I’d like to know if it’s possible to use full text search.
If you want to use FTS to find terms that are adjacent to each other, like words separated by a space, you should use a proximity term.
You can define a proximity term by using the NEAR keyword or the ~ operator in the search expression, as documented here.
So if you want to find ab followed immediately by cd you could use the expression,
'NEAR((ab,cd), 0)'
searching for the word ab followed by the word cd with 0 terms in-between.
No, unfortunately you cannot do such a search with full-text search. In that case you can only use LIKE, for example LIKE ('ab%c%').
EDIT1:
You can create a view (WITH SCHEMABINDING!) with some id and column name in which you want to search:
CREATE VIEW dbo.ftview WITH SCHEMABINDING
AS
SELECT id,
REPLACE(columnname,' ','') as search_string
FROM YourTable
Then create an index:
CREATE UNIQUE CLUSTERED INDEX UCI_ftview ON dbo.ftview (id ASC)
Then create a full-text search index on the search_string field.
After that you can run CONTAINS query with "abc*" search and it will find what you need.
EDIT2:
But it won't help if search_string does not start with your search term.
For example:
ab c d -> abcd and you search cd
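The limitation can be illustrated with a toy model in Python. This is not SQL Server's full-text engine; it only mimics the REPLACE from the view and treats a search such as "abc*" as a prefix test on the space-free string. Both function names are made up for the illustration.

```python
def search_string(name):
    # Mirrors REPLACE(columnname, ' ', '') from the view above.
    return name.replace(" ", "")


def prefix_match(name, term):
    # Toy stand-in for a prefix search such as "abc*" on search_string.
    return search_string(name).startswith(term)


print(search_string("ab c d"))
print(prefix_match("ab c d", "abc"))
print(prefix_match("ab c d", "cd"))
```

The name "ab c d" collapses to "abcd", so the prefix "abc" is found while "cd" is not, because "cd" sits in the middle of the stored string.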
No. Full Text Search is based on WORDS and phrases. It does not store the original text. In fact, depending on configuration, it will not even store all words: there are so-called stop words that never go into the index. Example: in English the word "in" is not selective enough to be considered worth storing.
Some names look like ‘ab-cd’ ‘ab cd’
Those likely do not get stored at all. At least the 2nd example is actually 2 extremely short words - quite likely they get totally ignored.
So, no - full text search is not suitable for this.

How to use MultiFieldQueryParser along with boost factor?

I have indexed documents in Lucene based on three fields: title, address, city. Now I want to build my query say, C A B so that I can retrieve the documents as follows:
C must be present in the title field of the documents and either A or B must be present in either of address and city fields of the matched documents. The documents that have A present in either of those fields should get higher score or higher boost. Here A, B, C may be single terms or phrases.
I am new to Lucene. I do not have any experience of framing such complex queries. In this context I have read the post Boost factor in MultiFieldQueryParser
But this post does not answer my question. If anyone could help me solve this, I would be really grateful.
title:C AND (address:A^2 OR city:A^2 OR address:B OR city:B)
Don't get caught up on reading about MultiFieldQueryParser; that isn't really what you need for this. Standard QueryParser syntax will serve your purposes just fine.
See the Lucene QueryParser syntax documentation
A query like:
+title:C +((address:A city:A)^2 address:B city:B)
Should do nicely.
To explain a bit:
+title:C - require a match on title:C. No results will be returned that don't match this condition.
+(....) - require a match on the subquery contained inside. This is satisfied as long as any one of the optional queries within the parentheses matches.
(address:A city:A)^2 - you prefer a match on A, so these two queries are boosted more heavily with ^2.
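Since A, B and C may be phrases, the query string is probably easiest to assemble in code. Below is a minimal sketch that reproduces the query above for arbitrary inputs; the helper names term and build_query are hypothetical, multi-word input is simply wrapped in double quotes, and Lucene special characters in the input are not escaped.

```python
def term(field, text):
    # Quote multi-word input so the parser treats it as a phrase.
    value = f'"{text}"' if " " in text else text
    return f"{field}:{value}"


def build_query(required, preferred, optional):
    """Require `required` in title; match preferred/optional in address or city."""
    boosted = f"({term('address', preferred)} {term('city', preferred)})^2"
    rest = f"{term('address', optional)} {term('city', optional)}"
    return f"+{term('title', required)} +({boosted} {rest})"


print(build_query("C", "A", "B"))
print(build_query("grand hotel", "new york", "boston"))
```

The first call prints the same string as the query above; the second shows phrase inputs being quoted per field.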

How to write Lucene queries for website search engine

I plan to implement my website's search engine using Apache Solr. I have a search index built, and one of its documents is:
Virtua Fighter 2
Performing a search of:
Virtua*
returns all records starting with "Virtua", as expected.
A search of "Virtua Fighter 2" returns an exact match.
I would like a search of "Virtua Fighter" to return Virtua Fighter 2 in its result set. But a phrase search of Virtua Fighter omits Virtua Fighter 2 from its result sets. And I'm unable to use a wildcard in a phrase search: "Virtua Fighter*" does not return any results.
What type of query needs to be written to support this? Or what types of Lucene queries are used for simple website search engines?
I'm guessing you're using a Keyword analyzer for the titles? (Or another analyzer that doesn't split the text into tokens.)
You should just use a Standard Analyzer; then phrase queries will work fine.
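Why the analyzer matters can be shown with a toy model in Python. This is not Lucene or Solr code: keyword_tokens keeps the whole title as a single token, standard_like_tokens is a rough stand-in that lowercases and splits on whitespace, and phrase_matches checks for the query tokens as a consecutive run. All three names are invented for the illustration.

```python
def keyword_tokens(text):
    # Keyword-style analysis: the entire value is one token.
    return [text]


def standard_like_tokens(text):
    # Rough stand-in for standard analysis: lowercase, split on whitespace.
    return text.lower().split()


def phrase_matches(doc_tokens, query_tokens):
    # A phrase matches when the query tokens appear consecutively.
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


title = "Virtua Fighter 2"
phrase = "Virtua Fighter"

print(phrase_matches(keyword_tokens(title), keyword_tokens(phrase)))
print(phrase_matches(standard_like_tokens(title), standard_like_tokens(phrase)))
```

With a single-token title the phrase only matches the exact full title, which fits the behaviour described in the question; once the title is split into words, the two-word phrase is found inside it.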