How to treat the whole sentence as one token in Azure Search - indexing

We are facing a problem performing exact matching with case-insensitivity in Azure Search.
For example, we have a field called Description that can hold a short or long sentence (for example: "Welcome to Azure Search"). We are trying to treat the whole sentence as one token, so that when a user searches "Welcome to" no result is returned; only searching the full "Welcome to Azure Search" should produce an exact match. Another requirement is case-insensitive search, so that searching "welcome TO Azure SEARCH" will return the result.
I have used the Keyword Analyzer to treat the whole field as a single token, but this prevents case-insensitive search from working.
I have also tried defining a custom analyzer with the keyword_v2 tokenizer and the lowercase token filter. It looks like this would solve my problem; however, there is a 300-character maximum token length limitation, and in some cases the Description field is a sentence longer than 300 characters.
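For reference, this is roughly what that custom analyzer attempt looks like as an index definition. The sketch below sends the JSON with the plain Java HttpClient (Java 11+, text blocks need 15+); the service name, admin key, index name, and api-version are placeholders, not values from the question:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateIndex {
    public static void main(String[] args) throws Exception {
        // keyword_v2 emits the whole field as a single token (up to the
        // 300-character limit); the lowercase filter then makes matching
        // case-insensitive.
        String indexJson = """
            {
              "name": "demo-index",
              "fields": [
                { "name": "Id", "type": "Edm.String", "key": true },
                { "name": "Description", "type": "Edm.String",
                  "searchable": true, "analyzer": "keyword_lowercase" }
              ],
              "analyzers": [
                { "name": "keyword_lowercase",
                  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
                  "tokenizer": "keyword_v2",
                  "tokenFilters": [ "lowercase" ] }
              ]
            }
            """;
        // SERVICE and ADMIN_KEY are placeholders for a real service.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://SERVICE.search.windows.net/indexes/demo-index?api-version=2020-06-30"))
            .header("Content-Type", "application/json")
            .header("api-key", "ADMIN_KEY")
            .PUT(HttpRequest.BodyPublishers.ofString(indexJson))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}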
I also thought about duplicating the field in lowercase and using the OData syntax $filter=Description eq 'welcome to azure search'. For example, there would be two fields, Description and DescriptionLowerCase: searches run against DescriptionLowerCase, while results return Description. But this doubles the index storage size.
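Queried that way, the duplicated-field workaround would look roughly like this (field names as in the question; note the search term has to be lowercased client-side before it goes into the filter):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FilterQuery {
    public static void main(String[] args) {
        // Filter on the lowercased copy, return only the original-cased field.
        String filter = URLEncoder.encode(
            "DescriptionLowerCase eq 'welcome to azure search'",
            StandardCharsets.UTF_8);
        // SERVICE and the index name are placeholders.
        String url = "https://SERVICE.search.windows.net/indexes/demo-index/docs"
            + "?api-version=2020-06-30"
            + "&$filter=" + filter
            + "&$select=Description";
        System.out.println(url); // issue this GET with an api-key header
    }
}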
Is there a better way to solve my problem?

You have pretty much covered all the available options. At the moment there is no workaround for the token length limitation; without it, search performance would suffer. Why you would need an exact match on a string longer than 300 characters is beyond me, though. Have you tried using quotes around your search?

Related

Pentaho Data Integration (Spoon) Value Mapper Wildcard

Is there a wildcard character for the Value Mapper transformation in Pentaho Spoon? I've done some digging and only found wildcard solutions for uploading files and documents. I need to be able to map any and all potential values that contain a specific word, yet I don't have a way of identifying all possible variations of the phrase that contains that word.
Example: Map website values to a category.
Value -> Mapped Category
facebook.com -> Facebook
m.facebook.com -> Facebook
google.com -> Google
google.ca -> Google
I'd prefer to use a wildcard character (let's call it % for example) so that one mapping captures all cases for a given category (e.g. %facebook% -> Facebook) in my Value Mapper. Another benefit is that the wildcard would correctly map any future site traffic value that comes along. (e.g. A hypothetical l.facebook.com would be correctly mapped if it ever entered my data)
I've tried various characters as wildcards (+ \ * %) and none have worked.
Please and thank you!
You can use the Replace in String step with regular expressions to do this.
If you still need the original field, create a copy first using the Calculator step. You can then put a number of mappings into the Replace step; they run in sequence, and when a regex matches, the contents of the field are replaced with your chosen mapping.
The performance may not be great, but it gives you the full flexibility of regexes. Keep in mind that this approach gives you the first match; see the sketch below for what can go wrong.
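Pentaho configures this in the Spoon GUI rather than in code, but the run-in-sequence behaviour can be sketched like this (the patterns and categories are invented for the example):

import java.util.LinkedHashMap;
import java.util.Map;

public class CategoryMapper {
    // Ordered regex -> category rules, mimicking the rows of a
    // "Replace in String" step (patterns invented for the example).
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(".*facebook\\.com.*", "Facebook");
        RULES.put(".*google\\..*", "Google");
    }

    static String map(String value) {
        // Rules run in sequence, each rewriting the current value. If two
        // patterns overlap, a later rule can overwrite an earlier mapping
        // (this is the "what can go wrong" part).
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            value = value.replaceAll(rule.getKey(), rule.getValue());
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(map("m.facebook.com")); // Facebook
        System.out.println(map("google.ca"));      // Google
        System.out.println(map("example.org"));    // unchanged
    }
}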

Issue with Solr Indexing, Solr Indexing Chain is not complete

In my Solr setup, I get this result after running analysis for indexing. I have a number of documents containing the phrase Machine Learning, but it seems like something broke and the indexing chain didn't complete. Can I find a work-around for this?
The field type for the value being searched is: <field name="Skills" type="text_general" indexed="true" stored="true"/>
EDIT 1:
Analysis with Query:
I'm guessing that the "SF" is a Stemming filter - the filter will remove common endings to allow 'machine' to match 'machines', storing 'machin' as the common term in the index. As long as stemming is performed both when indexing and when querying, you should get the result you're looking for.
The EdgeNGramFilter stores a token for each extra letter in the token, so you get a token (that will match a query token) for each additional letter (where your filter seems to be configured for 3 as the minimum ngram size).
If you're not performing stemming when searching as well, the query machine will not find any terms matching, since the token after indexing has been stored as machin.
Use both the "query" and "index" sections on the analysis page to see how each part is parsed and processed, and why they don't end up with the same terms on both sides (the final tokens on both sides are compared, and if they're the same, there's a match - this is shown with a slightly darker background in the interface, IIRC).
I am not sure what your first image stands for, but your two images show a different token filter order.
As a side note on the stem filter: the KStem token filter is a high-performance filter for English. All terms must already be lowercased (use the lowercase filter) for this filter to work correctly.
Your first image shows LCF (LowercaseFilter) as the first token filter, but your second image shows the stem filter running first and the LCF (LowercaseFilter) after it; that is not going to work.
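You can reproduce the order problem directly with Lucene, outside Solr. A small sketch (PorterStemFilter is used here because the indexed term 'machin' suggests a Porter-style stemmer; exact package names shift a little between Lucene versions):

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FilterOrderDemo {
    static void print(TokenStream stream) throws Exception {
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print(term + " ");
        }
        stream.end();
        stream.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // Lowercase before stemming: "Machine Learning" -> machin learn
        // (Porter-style stemmer assumed from the 'machin' term above.)
        Tokenizer first = new StandardTokenizer();
        first.setReader(new StringReader("Machine Learning"));
        print(new PorterStemFilter(new LowerCaseFilter(first)));

        // Stem before lowercasing: the stemmer sees capitalized input,
        // which it is documented not to handle, so the indexed terms
        // won't line up with lowercased, stemmed query terms.
        Tokenizer second = new StandardTokenizer();
        second.setReader(new StringReader("Machine Learning"));
        print(new LowerCaseFilter(new PorterStemFilter(second)));
    }
}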

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
User can enter queries against multiple fields, e.g. +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and *, e.g. +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy terms, e.g. +NAME:Jahn~0.5
Now, I want to find which field(s) contain my search term(s). As I am using wildcards and fuzzy matching, I cannot just do a string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.
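A sketch of the explain-based approach with a recent Lucene API: explain each clause of the top-level BooleanQuery separately, and the clauses that report a match tell you which field was responsible (the index path and query are stand-ins for the question's setup):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.FSDirectory;

public class MatchedFields {
    public static void main(String[] args) throws Exception {
        // "index" is a placeholder path to an existing index.
        IndexSearcher searcher = new IndexSearcher(
            DirectoryReader.open(FSDirectory.open(Paths.get("index"))));

        // Same shape as the queries in the question.
        BooleanQuery query = new BooleanQuery.Builder()
            .add(new WildcardQuery(new Term("NAME", "j?hn")), BooleanClause.Occur.MUST)
            .add(new PrefixQuery(new Term("SURNAME", "do")), BooleanClause.Occur.MUST)
            .build();

        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            // Explain each clause on its own; a matching explanation
            // names the field (and expanded term) that matched.
            for (BooleanClause clause : query.clauses()) {
                Explanation explanation = searcher.explain(clause.getQuery(), hit.doc);
                if (explanation.isMatch()) {
                    System.out.println("doc " + hit.doc + " matched " + clause.getQuery());
                }
            }
        }
    }
}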

Lucene: Accessing payloads of results of a query

When I search for a query in Lucene, I receive a list of documents as the result. But how can I get the hits within those documents? I want to access the payloads of the words that were found by the query.
If your query contains only one term, you can simply use TermPositions to access the payload of that term. But if you have a more complex query with phrase search, proximity search, etc., you can't just look up the individual terms in TermPositions.
I would like to receive a List<Token>, a TokenStream or something similar that contains all the tokens found by the query. Then I could iterate over the list and access the payload of each token.
I solved my problem by using SpanQueries. Nearly every query can be expressed as a SpanQuery. A SpanQuery gives access to the spans where the hit occurs within a document. Because the normal QueryParser doesn't produce SpanQueries, I had to write my own parser that only creates SpanQueries. Another option would be the SurroundParser from Lucene-Contrib, which also creates SpanQueries.
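For illustration, payload access over spans looked roughly like this on the Lucene 3.x API (the same era as the TermPositions mentioned above; the Spans API was reworked in later versions):

import java.util.Collection;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class PayloadDump {
    // Walk every span the query matches and print its payloads
    // (Lucene 3.x Spans API).
    static void dumpPayloads(IndexReader reader, SpanQuery query) throws Exception {
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            if (spans.isPayloadAvailable()) {
                Collection<byte[]> payloads = spans.getPayload();
                for (byte[] payload : payloads) {
                    System.out.println("doc " + spans.doc()
                        + ", positions " + spans.start() + "-" + spans.end()
                        + ", payload of " + payload.length + " bytes");
                }
            }
        }
    }
}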
I think you'll want to start by looking at the Lucene Highlighter, as it highlights the matching terms in the document.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a TooManyClauses exception when trying to get the hits from a boolean query that consisted of a term query and a wildcard query.
I searched around the net, and the resources I found suggest increasing BooleanQuery.setMaxClauseCount().
That sounds fishy to me. What should I raise it to? How can I be sure the new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem.
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
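A sketch of that indexing scheme with a newer Lucene API (the original question predates StringField, but the idea is the same; field names follow the answer):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PaintCodeIndexing {
    // Index a derived first-letter field alongside the full code.
    static Document build(String companyName, String paintCode) {
        Document doc = new Document();
        doc.add(new StringField("companyName", companyName, Field.Store.YES));
        doc.add(new StringField("paintCode", paintCode, Field.Store.YES));
        doc.add(new StringField("paintCodeFirstLetter",
            paintCode.substring(0, 1), Field.Store.NO));
        return doc;
    }

    // "All paint codes starting with a" becomes a cheap single-term
    // lookup instead of a prefix query expanded into 1000+ clauses.
    static Query byFirstLetter(String letter) {
        return new TermQuery(new Term("paintCodeFirstLetter", letter));
    }
}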
ADDED
If you're desperate and willing to accept partial results, you can build your own Lucene version from source. You need to change the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
    break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is essentially a keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields of increasing length until you are fairly sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in the term decides which field to search on. This means that you will perform a prefix query only for terms of more than 3 characters, which greatly decreases the internal result count and prevents the infamous TooManyClauses exception. Apparently this also speeds up the search process.
You can easily automate a process that breaks down the terms and fills the document fields according to the naming scheme during indexing; a sketch follows below.
Some issues may arise if you have multiple tokens in each field. You can find more details in the linked article.
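Here is a sketch of that automation with a recent Lucene API, following the field-naming scheme from the example (paintCode1n .. paintCode3n):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PrefixFields {
    static final int MAX_PREFIX = 3; // longest dedicated prefix field

    // Indexing: fill paintCode1n..paintCode3n with the prefixes.
    static Document build(String paintCode) {
        Document doc = new Document();
        doc.add(new StringField("paintCode", paintCode, Field.Store.YES));
        for (int n = 1; n <= MAX_PREFIX && n <= paintCode.length(); n++) {
            doc.add(new StringField("paintCode" + n + "n",
                paintCode.substring(0, n), Field.Store.NO));
        }
        return doc;
    }

    // Query time: short prefixes become exact lookups; only longer
    // prefixes fall back to a PrefixQuery, whose expansion is then
    // naturally much smaller.
    static Query forPrefix(String prefix) {
        if (prefix.length() <= MAX_PREFIX) {
            return new TermQuery(
                new Term("paintCode" + prefix.length() + "n", prefix));
        }
        return new PrefixQuery(new Term("paintCode", prefix));
    }
}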