Lucene query permutation

I have a question regarding performing a Lucene query involving permutations.
Say I have two fields, "name" and "keyword", and the user searches for "joes pizza restaurant". I want some part of that search to match the full contents of the "name" field and some part to match the full contents of the "keyword" field. It should match all the supplied terms and should match the entire contents of the fields. For example, it could match:
1) name:"joes restaurant" keyword:"pizza"
2) name:"joes pizza" keyword:"restaurant"
3) name:"pizza restaurant" keyword:"joes"
4) name:"pizza" keyword:"joes restaurant"
5) name:"pizza joes" keyword:"restaurant"
but it would not match
6) name:"big joes restaurant" keyword:"pizza" - because it's not a match on the full field
7) name:"joes pizza restaurant" keyword:"nomatch" - because at least one of the terms should match the keyword field
I've thought about possible ways to implement this by calculating all the permutations of the fields and using boolean queries; however, this doesn't scale very well as the number of terms increases. Does anyone have any clues on how to implement this sort of query efficiently?

The Lucene docs recommend using a separate field that is a concatenation of the 'name' and 'keyword' fields for queries spanning multiple fields. Do the search on this field.
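A minimal index-time sketch of what that combined field could look like, assuming the Lucene (Java) document API; the field name name_and_keyword is just an example, not something from the original question:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class CatchAllFieldExample {
    // Index the usual fields plus a combined field that concatenates
    // "name" and "keyword", so a single query can span the contents of both.
    static Document buildDoc(String name, String keyword) {
        Document doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new TextField("keyword", keyword, Field.Store.YES));
        doc.add(new TextField("name_and_keyword", name + " " + keyword, Field.Store.NO));
        return doc;
    }
}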

Let's divide your query into three parts:
1. Both the 'name' field and the 'keyword' field should contain part of the query.
2. Both matches should be to the full field.
3. The union of the matches should cover the query completely.
I would implement it this way:
1. Create a boolean query composed of the tokens in the original query: for each field, OR the tokens together, then combine the two per-field disjunctions as MUST clauses, e.g. in the example something like:
(name:joes OR name:restaurant OR name:pizza) AND (keyword:joes OR keyword:restaurant OR keyword:pizza)
Any document matching this query has a part of the original query in each field.
(This could be a ConstantScoreQuery to save time).
2. Take the set of matches from the first query. Extract the field contents as tokens, and store them in string sets. Keep only the matches where the union of the sets equals the string set from your original query, and the sets have an empty intersection. (This handles the covering - part 3 above.) For your first example, we will have the sets {"joes", "restaurant"} and {"pizza"}, fulfilling both conditions.
3. Take the set sizes from the matches that are left, and compare them to the field lengths. For your first example we will have set sizes of 2 and 1, which should correspond to field lengths of 2 and 1 respectively.
Note that my steps 2 and 3 are not part of the regular Lucene scoring but rather external Java code.
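Here is a rough sketch of what that could look like, assuming the current BooleanQuery.Builder Java API; the field names follow the example, and accept() is the external Java filtering for steps 2 and 3:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PermutationSearchSketch {

    // Step 1: each field must contain at least one of the query tokens.
    static Query candidateQuery(List<String> tokens) {
        BooleanQuery.Builder name = new BooleanQuery.Builder();
        BooleanQuery.Builder keyword = new BooleanQuery.Builder();
        for (String t : tokens) {
            name.add(new TermQuery(new Term("name", t)), Occur.SHOULD);
            keyword.add(new TermQuery(new Term("keyword", t)), Occur.SHOULD);
        }
        BooleanQuery.Builder both = new BooleanQuery.Builder();
        both.add(name.build(), Occur.MUST);
        both.add(keyword.build(), Occur.MUST);
        return new ConstantScoreQuery(both.build()); // scoring not needed here
    }

    // Steps 2 and 3: run on each candidate hit's field tokens, outside Lucene.
    static boolean accept(List<String> queryTokens,
                          List<String> nameTokens, List<String> keywordTokens) {
        Set<String> query = new HashSet<>(queryTokens);
        Set<String> name = new HashSet<>(nameTokens);
        Set<String> keyword = new HashSet<>(keywordTokens);

        // Step 2: the two field sets must cover the query exactly,
        // with no token appearing in both fields.
        Set<String> union = new HashSet<>(name);
        union.addAll(keyword);
        Set<String> overlap = new HashSet<>(name);
        overlap.retainAll(keyword);
        if (!union.equals(query) || !overlap.isEmpty()) return false;

        // Step 3: the set size must equal the field length (token count),
        // so the whole field is accounted for.
        return name.size() == nameTokens.size() && keyword.size() == keywordTokens.size();
    }
}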


Why does manipulating the filtered column affect index efficiency?

I'm reading "T-SQL Fundamentals" by Itzik Ben-Gan.
The author briefly mentioned that we shouldn't manipulate the filtered column if we want to use an index efficiently, but he didn't really go into detail as to why this is the case.
So could someone please kindly explain the reason behind it?
"The author briefly mentioned that we shouldn't manipulate the filtered column if we want to use an index efficiently."
What the author is describing is called sargability ("SARG" = Search ARGument).
Consider this statement:
select * from t1 where name='abc'
If you have an index on the filtered column above, the query is sargable.
But the one below is not:
select * from t1 where len(name)=3
When SQL Server is presented with the above query, the only way it can filter the data is to scan the whole table and then apply the predicate to each row.
Think of an index as being like a telephone directory (hopefully that's still a familiar enough concept) where everyone is listed by their surnames followed by their addresses.
This index is useful if you want to locate someone's phone number and you know their surname (and maybe their address).
But what if you want to locate everyone who (to steal TheGameiswar's example) has a 3 letter surname - is the index useful to you? It may be slightly more useful than having to go and visit every house in town [1], but it's not nearly so efficient as being able to just jump to the appropriate surnames. You have to search the entire book.
Similarly, if you want to locate everyone who lives on a particular street, the index isn't so useful [2] - you have to search through the entire book to make sure you've found everyone. Or to locate everyone whose surname ends with Son, etc.
[1] This is the analogy for when a database may choose to perform an index scan to satisfy a query simply because the index is smaller and so is easier than a full table scan.
[2] This is the analogy for a query that isn't attempting to filter on the left-most column in the index.
The WHERE clause in a SQL query uses predicates to filter rows. A predicate is an expression that determines whether a condition applied to a database object is true or false, for example "Salary > 5000".
Relational models use predicates as a core element in filtering data. These predicates should be written in a certain form, known as "search arguments", in order for the query optimizer to use the indexes on the attributes in the WHERE clause effectively.
A predicate in the form "column - operator - value" or "value - operator - column" is considered an appropriate search argument, for example Salary = 1000 or Salary > 5000. As you can see, the column name should appear ALONE on one side of the expression and the constant or calculated value on the other side to form a valid search argument. The moment a built-in function such as MAX, MIN, DATEADD or DATEDIFF is applied to the column name, the expression is no longer treated as a search argument and the query optimizer won't use the indexes on that column.
I hope this is clear.

SQL: LIKE and Contains — Different results

I am using the MS SQL Express full-text function CONTAINS to select data. However, when I selected data with the LIKE operator, I realised that CONTAINS misses a few rows.
I rebuilt the indexes, but it didn't help.
SQL: brs.SearchText like '%aprilis%' and CONTAINS(brs.SearchText, '*aprilis*')
The CONTAINS function missed rows like:
22-28.aprīlis
[1.aprīlis]
Sīraprīlis
PS. If I search directly CONTAINS(brs.SearchText, '*22-28.aprīlis*'), then it finds them
CONTAINS is functionality based on the full-text index. It supports words, phrases, and prefix matches on words, but not suffix matches. So you can match words that start with 'aprilis', but not words that end with it or that contain it arbitrarily in the middle. You might be able to take advantage of a thesaurus for these terms.
This is explained in more detail in the documentation.

Lucene.NET - do an AND search multiple words on multiple fields

I define a Document object for my product entity which has several fields: Title, Brand, Category, Size, Color, Material.
Now I want to support users doing an AND search on multiple fields: any document whose fields (one, two or more of them) together contain all the search words should be returned.
For example, when a user enters "gucci shirt red" I want to return all documents whose fields match all 3 tokens "gucci", "shirt" AND "red". So all the documents below should be returned:
1. Documents whose Title contains all 3 words, for example Title = "Gucci Modern Shirt Red" or "Gucci blue shirt"...
2. Documents with Title = "Gucci classical shirt" AND Color = "red"
3. Documents with Category = "mens shirt" AND Brand = "gucci" AND Color = "red"
4. etc.
I know that Lucene supports the + operator, which makes a term a MUST in the search query. For example, I can translate the above keywords into the query "+gucci +shirt +red", and then I'm sure documents like example (1) above will definitely be returned. But does it work for cases (2) and (3) above?
When doing these types of queries I like to create a master BooleanQuery and add several sub-queries that work together to give the best result:
TermQuery: (exact match), someone types in the exact match of the title
PhraseQuery: (use slop), so if you have "Gucci Modern Shirt Red" and someone types in "Gucci Shirt" (notice one word gap) it would match
FuzzyQuery: (slow on large(> 50 million records)/non-memory indexes) to account for potential misspellings
Boolean SubQuery: with all of the terms separated and OR'ed. Queries matching 1 out of 4 words will have a low score; queries matching 3 out of 4 words will have a higher score.
QueryParser (as mentioned above, with potential field boosts)
Other: i.e. Synonym search on phrases etc.
I would OR all of these types and then filter them out using a Collector minimum score.
The reason I like the master BooleanQuery approach is that you can have a setting where a user chooses the "type" of query, maybe as simple -> advanced, and it is easy to add/remove query types quickly on the fly; the query can be built pretty easily, giving predictable results. With boosting records/similarity you are working inside Lucene's internal scoring algorithm, and the results are sometimes not clear.
Performance: I have done queries like this using Lucene 3.0.x on indexes with > 100M records NOT IN MEMORY and it works pretty quickly giving sub-second responses. Fuzzy Query does slow things down, but as stated before that can be made into an advanced search option (or "Search again with...")
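As an illustration of the master BooleanQuery approach described above, here is roughly what it could look like against the Lucene Java API (Lucene.NET exposes near-identical classes); the field names "title"/"title_exact" and the slop value are assumptions, not part of the original question:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MasterQueryExample {
    static Query build(String[] tokens, String rawQuery) {
        BooleanQuery.Builder master = new BooleanQuery.Builder();

        // 1. Exact match against an untokenized copy of the title (assumed field).
        master.add(new TermQuery(new Term("title_exact", rawQuery)), Occur.SHOULD);

        // 2. Phrase with slop, so "gucci shirt" still matches "gucci modern shirt".
        PhraseQuery.Builder phrase = new PhraseQuery.Builder();
        for (String t : tokens) {
            phrase.add(new Term("title", t));
        }
        phrase.setSlop(2);
        master.add(phrase.build(), Occur.SHOULD);

        // 3. Fuzzy terms to absorb misspellings (slow on very large indexes).
        BooleanQuery.Builder fuzzy = new BooleanQuery.Builder();
        for (String t : tokens) {
            fuzzy.add(new FuzzyQuery(new Term("title", t)), Occur.SHOULD);
        }
        master.add(fuzzy.build(), Occur.SHOULD);

        // 4. Plain OR of the terms: the more terms match, the higher the score.
        BooleanQuery.Builder ored = new BooleanQuery.Builder();
        for (String t : tokens) {
            ored.add(new TermQuery(new Term("title", t)), Occur.SHOULD);
        }
        master.add(ored.build(), Occur.SHOULD);

        return master.build();
    }
}

Because everything is OR'ed, exact and near-exact matches simply score higher than fuzzy or partial ones, and individual sub-queries can be switched on or off per "query type".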
No, when not given a field to search explicitly in the query, it will go to the default field, which would appear to be "title" in your case. You would need a query more like:
+shirt +color:red +brand:gucci
for instance.
Or, one common usage is to set up a catch all field, in which all (or a large subset) of searchable data is mashed together, allowing you to search everything in a very loose fashion, on that field, in which case you would just use something like:
all:(+shirt +gucci +red)
Or, if you made that field your default field instead:
+shirt +gucci +red
As you indicated.
You could use MultiFieldQueryParser. Add Title, Color, Brand etc. to it.
If you search for "gucci shirt red", then the above parser (with AND as the default operator) would produce a query like:
+(Title:gucci Color:gucci Brand:gucci) +(Title:shirt Color:shirt Brand:shirt) +(Title:red Color:red Brand:red)
This should solve the problem.
Also, if, say, you want products of the Gucci brand to be shown first for the above query, you could apply a boost to that field.
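For example (Java Lucene shown here; Lucene.NET's MultiFieldQueryParser is analogous), setting the default operator to AND makes every token mandatory across the listed fields:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class MultiFieldExample {
    public static void main(String[] args) throws Exception {
        String[] fields = {"Title", "Color", "Brand"};
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());
        // AND operator: every token must match in at least one field.
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query query = parser.parse("gucci shirt red");
        System.out.println(query);
        // Prints something like:
        // +(Title:gucci Color:gucci Brand:gucci) +(Title:shirt ...) +(Title:red ...)
    }
}

MultiFieldQueryParser also has a constructor that takes a per-field boost map, which is one way to implement the Brand boost mentioned above.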

MySQL MATCH...AGAINST sometimes finds answer, sometimes doesn't

The following two queries return the same (expected) result when I query my database:
SELECT * FROM articles
WHERE content LIKE '%Euskaldunak%'
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"Euskaldunak"' IN BOOLEAN MODE)
The text in the content field that it's searching looks like this: "...These Euskaldunak, or newcomers..."
However, the following query on the same table returns the expected single result:
SELECT * FROM articles
WHERE content LIKE '%PCC%'
And the following query returns an empty result:
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"PCC"' IN BOOLEAN MODE)
The text in the content field that matches this result looks like this: "...Portland Community College (PCC) is the largest..."
I can't figure out why searching for "Euskaldunak" works with that MATCH...AGAINST syntax but "PCC" doesn't. Does anyone see something that I'm not seeing?
(Also: "PCC" is not a common phrase in this field - no other rows contain the word, so the natural language search shouldn't be excluding it.)
Your fulltext minimum word length is probably set too high. I think the default is 4, which would explain what you are seeing. Set it to 1 if you want all words indexed regardless of length.
Run this query:
show variables like 'ft_min_word_len';
If the value is greater than 3 and you want to get hits on words shorter than that, edit your /etc/my.cnf and add or update this line in the [mysqld] section using a value appropriate for your application:
ft_min_word_len = 1
Then restart MySQL and rebuild your fulltext indexes and you should be all set.
There are two things I can think of right away. The first is that your ft_min_word_len value is set to more than 3 characters; any "word" shorter than ft_min_word_len will not get indexed.
The second is that more than 50% of your records contain the 'PCC' string. A full-text search that matches more than 50% of the records is considered irrelevant and returns nothing.
Full-text indexes have different rules than regular string indexes. For example, there is a stop word list, so certain common words like "to", "the", and "and" don't get indexed.

Make lucene treat all terms in a field as a single term

In my Lucene documents I have a field "company" where the company name is tokenized.
I need the tokenization for a certain part of my application.
But for this query, I need to be able to create a PrefixQuery over the whole company field.
Example:
"My Brand" is indexed as the terms: my, brand
"brahmin farm" is indexed as the terms: brahmin, farm
Regularly querying for "bra" would return both documents because they both have a term starting with bra.
The result I want, though, would only return the last entry, because its first term starts with bra.
Any suggestions?
Create another indexed field, where the company name is not tokenized. When necessary, search on that field rather than the tokenized company name field.
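A sketch of what that could look like with the Lucene Java API; the field name company_exact and the lowercasing are assumptions here, and anything works as long as indexing and querying agree:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class UntokenizedCompanyField {
    // Index the company name twice: tokenized for normal search and as a
    // single untokenized (lowercased) term for whole-field prefix queries.
    static Document buildDoc(String company) {
        Document doc = new Document();
        doc.add(new TextField("company", company, Field.Store.YES));
        doc.add(new StringField("company_exact", company.toLowerCase(), Field.Store.NO));
        return doc;
    }

    // "bra" matches "brahmin farm" but not "My Brand".
    static Query prefixQuery(String prefix) {
        return new PrefixQuery(new Term("company_exact", prefix.toLowerCase()));
    }
}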
If you want fast searches, you need to have index entries that point directly at the records of interest. There might be something that you can do with the proximity data to filter records, but it will be slow. I see the problem as: how can a "contains" query over a complete field be performed efficiently?
You might be able to minimize the increase in index size by creating (for each current field) a "first term" field and "remaining terms" field. This would eliminate duplication of the first term in two fields. For "normal" queries, you look for query terms in either of these fields. For "startswith" queries, you search only the "first term" field. But this seems like more trouble than it's worth.
Use a SpanQuery to only search the first term position. A PrefixQuery wrapped by SpanMultiTermQueryWrapper wrapped by SpanPositionRangeQuery:
<SpanPositionRangeQuery: spanPosRange(SpanMultiTermQueryWrapper(company:bra*), 0, 1)>
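Built programmatically, that query looks roughly like this with the Lucene Java API (note that the span classes moved to the org.apache.lucene.queries.spans package in newer releases):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanPositionRangeQuery;
import org.apache.lucene.search.spans.SpanQuery;

public class FirstTermPrefixQuery {
    // Match documents whose *first* token in the "company" field starts with
    // the given prefix: PrefixQuery -> span wrapper -> position range 0..1.
    static SpanQuery build(String prefix) {
        SpanQuery prefixAsSpan =
                new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("company", prefix)));
        return new SpanPositionRangeQuery(prefixAsSpan, 0, 1);
    }
}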