Issue with Solr Indexing, Solr Indexing Chain is not complete - indexing

In my solr, i get this result after running analysis for Indexing. I have a number of documents containing the word Machine Learning but seems like something broke and indexing chain didn't complete. Can i find a work-around for this?
Field type is for the value being searched is: <field name="Skills" type="text_general" indexed="true" stored="true"/>
EDIT 1:
Analysis with Query:

I'm guessing that the "SF" is a Stemming filter - the filter will remove common endings to allow 'machine' to match 'machines', storing 'machin' as the common term in the index. As long as stemming is performed both when indexing and when querying, you should get the result you're looking for.
The EdgeNGramFilter stores a token for each extra letter in the token, so you get a token (that will match a query token) for each additional letter (where your filter seems to be configured for 3 as the minimum ngram size).
If you're not performing stemming when searching as well, the query machine will not find any terms matching, since the token after indexing has been stored as machin.
Use both the "query" and "index" section on the analysis page to see how each part is parsed and processed, and see why they don't end up with the same terms on both sides (the end tokens on both sides are compared, and if they're the same, there's a match - this is shown with a slightly darked background in the interface IIRC).

I am not sure what's your first image stands for, but your two image shows different token filter order.
As a side note of the Stem filter, The kstem token filter is a high performance filter for english. All terms must already be lowercased (use lowercase filter) for this filter to work correctly.
Your first image shows you have LCF (LowercaseFilter) as the first token filter. But your second image shows you have stem filter run first, then do the LCF (LowercaseFilter), it is not going to work

Related

Boosting CloudSearch results which match facet

I'm using AWS CloudSearch for a search index, and the user can currently search over it for records which match the name field and a few others. However, we have users in different languages and I would like to give a boost to results which match their local language. Every record has a locale field, which could be a facet. However, I don't want to simply exclude results which don't match, and nor do I want to simply sort it so that everything in their language always comes first regardless of 'relevance' - I simply want to give a 'boost' to any result where locale=<my locale>.
In other words, I would like highly relevant matches in a different locale to still beat barely relevant matches in my own language, but relevant matches in my own language should definitely rank higher than matches in a different language.
Is there a way to do this when I query CloudSearch or should I just do the reordering client side once I have fetched all of the results?
So this was painfully convoluted but I did manage to get it to work in the end. If you are doing a structured search (q.parser=structured) then you can perform a query that receives a boost if the locale field matches a given value.
Sadly, where it gets a bit cumbersome, is that it doesn't really seem to be designed for boosting things that match something other than your main query, so by default it filters out anything that doesn't match the locale and excludes them entirely from the results. So you have to combine two versions of the query with an (or) - one with the boost and one without.
So my basic query (which in my case was already (or 'un' (prefix 'un'))) now becomes (or (and <OLD_QUERY> (term field=locale boost=2.5 'en')) <OLD_QUERY>)
In other words: EITHER a match for the original query where locale='en', boosted by 2.5, OR a match for that same query in any locale without a boost.
Painful, but it works!

How to treat the whole sentence as one token in Azure Search

We are facing a problem of performing exact matching and ignore case-sensitivity by using Azure Search.
For example, we have a field called Description and it can be a small sentence or long sentence (For example: Welcome to Azure Search). We are trying to treat the whole sentence as one token such that when user search "Welcome to" it won't return the result back because we have to search "Welcome to Azure Search" to do a exactly matching. Another requirement is that we want the capability to search case-insensitive such that searching "welcome TO Azure SEARCH" will return the result.
I have used Keyword Analyzer to treat the whole field as a single token but this will prevent search case-insensitive from working.
I am also trying to define custom analyzer with keyword_v2 tokenizer and lowercase token filter. Looks like this will solve my problem however there is a 300 maximum token length limitation. In some of the cases, the Description field will be a long sentence more than 300 characters.
I also thought about duplicating an index field to be lowercase and using OData syntax $filter=Description eq 'welcome to azure search'. For example, there will be two fields: "Description" and "DescriptionLowerCase", when do the searching, search on "DescriptionLowerCase" and when returning result, returning "Description". But this will double the size of index storage.
Is there a better way to solve my problem?
you have pretty much covered all the options available options. At the moment there is no workaround the size limitation as without that the search will suffer performance. Now exactly why would you need exact match on the whole string more than 300 characters is beyond me. Have you tried using quotes around your search?

Is there a way to properly experiment with Solr field-types?

I'm working with Solr for a basic search engine, and I've created a couple different fieldTypes that include various filters and tokenizers in their analyzer chains.
However, I'm finding it very difficult to assess how these components of the chain interact and when I query in the Solr Admin, I consistently get different results than I expect-- with no clue as to why.
Is there a way to see what a phrase like education:"x university" is being transformed into when I type it in the q section of the Admin?
Also, when the phrase goes through the chain can it be transformed into multiple things that are all searched or is it just a single modified phrase?
Thanks for any help!
Use Analysis in Solr Admin to check how each field and its type process the tokens both while querying and indexing.
Analyse Fieldname / FieldType:
from the drop down option select field/type that you want to analyse and clieck on Analyse values.
ex: what tokenizer used, which all filter classes applied to token and how token is transformed after passing each filter class.
if
Verbose Output is checked, it shows more details about each filter class used for the selected field/type.

solr unable to search with exact value

I am using Solr 4.1.0 and I'm facing a strange issue. If I give a value to search for a field, even be it exact or involving a wildcard, it gives me 0 search results. On the other hand if I just give the field name and a * in place of value, I get all the results.
Also, if I search in the text field, i.e where I have copied values of all my fields, it gives me correct output. text is by default, my catch-all for all fields. feature is a field which has value Butter.
So now, what is happening here is that if I try to find in the actual field with the exact value or even with starting alphabet and a *, it doesn't give me a value while if I search in the text field, which is a catch-all field, I'm able to retrieve the value. Although if I try to find in the feature field using *, it gives me complete result list correctly.
You can view the logs for text:Butter here, logs for feature:Butter here, logs for feature:B* here and logs for feature:* here
I'm facing this issue with this particular field only. Any pointers to what could be the reason behind this strange problem?
If you search without the field name, Solr is going to search in the default search field.
So make sure you are marking the fields you want to search on as default.
If you are using dismax query handler, you can add them to the qf parameter.
Also, for Wildcard Queries check [Analyzers][1]
On wildcard and fuzzy searches, no text analysis is performed on the search word.
As no analysis is done at query time for wilcard searches and hence the lower casing, stemming would not be applied during query time but just the index time.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net and on the found resources they suggest to increase the BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
the paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
By the time you query, the number of characters in your term decide which field to search on. This means that you will perform a prefix query only for terms with more of 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyBooleanClausesException. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article