Is there a way to properly experiment with Solr field-types?

I'm working with Solr for a basic search engine, and I've created a couple different fieldTypes that include various filters and tokenizers in their analyzer chains.
However, I'm finding it very difficult to assess how the components of the chain interact, and when I query in the Solr Admin I consistently get different results than I expect, with no clue as to why.
Is there a way to see what a phrase like education:"x university" is being transformed into when I type it in the q section of the Admin?
Also, when the phrase goes through the chain can it be transformed into multiple things that are all searched or is it just a single modified phrase?
Thanks for any help!

Use the Analysis screen in the Solr Admin UI to check how each field and its field type process tokens, both while querying and while indexing.
Analyse Fieldname / FieldType:
From the drop-down, select the field or type you want to analyse and click Analyse Values. The output shows, for example, which tokenizer was used, which filter classes were applied to the token, and how the token is transformed after passing through each filter class.
If Verbose Output is checked, it shows more detail about each filter class used for the selected field/type.
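
If you want to run the same kind of check programmatically, here is a minimal sketch (assuming Lucene's analyzers-common jar is on the classpath; package names and constructors vary slightly between Lucene versions, and the chain below is just an example) that feeds a phrase through a tokenizer/filter chain and prints each resulting token, much like the Analysis screen does:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws IOException {
        // Example chain: standard tokenizer -> lowercase -> Porter stemming.
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream chain = new PorterStemFilter(new LowerCaseFilter(source));
                return new TokenStreamComponents(source, chain);
            }
        };
        try (TokenStream ts = analyzer.tokenStream("education", "X University")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints "x", then "univers"
            }
            ts.end();
        }
    }
}

This also answers the second part of the question: a single input can come out of the chain as several tokens (and some filters, such as synonym or ngram filters, emit several tokens per position), and all of them are used for matching.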

Related

Issue with Solr Indexing, Solr Indexing Chain is not complete

In my Solr, I get this result after running analysis for indexing. I have a number of documents containing the words "Machine Learning", but it seems like something broke and the indexing chain didn't complete. Can I find a workaround for this?
The field type for the value being searched is: <field name="Skills" type="text_general" indexed="true" stored="true"/>
EDIT 1:
Analysis with Query:
I'm guessing that the "SF" is a Stemming filter - the filter will remove common endings to allow 'machine' to match 'machines', storing 'machin' as the common term in the index. As long as stemming is performed both when indexing and when querying, you should get the result you're looking for.
The EdgeNGramFilter stores a token for each extra letter in the token, so you get a token (that will match a query token) for each additional letter (where your filter seems to be configured for 3 as the minimum ngram size).
If you're not performing stemming when searching as well, the query machine will not find any terms matching, since the token after indexing has been stored as machin.
Use both the "query" and "index" sections on the analysis page to see how each part is parsed and processed, and why they don't end up with the same terms on both sides (the final tokens on both sides are compared, and if they're the same, there's a match - this is shown with a slightly darker background in the interface, IIRC).
I am not sure what your first image stands for, but your two images show different token filter orders.
As a side note on the stem filter: the KStem token filter is a high-performance filter for English. All terms must already be lowercased (use the lowercase filter) for this filter to work correctly.
Your first image shows LCF (LowerCaseFilter) as the first token filter, but your second image shows the stem filter running first and the LCF (LowerCaseFilter) afterwards; that order is not going to work.
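
For reference, here is a minimal sketch of a schema.xml fieldType with the filters in a working order (the type name is made up; the factory classes are standard Solr ones). Lowercasing runs before stemming, so the stemmer only ever sees lowercased tokens:

<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercase first, so KStem sees only lowercased tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>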

Endeca search query on multiple fields

How do you create an Endeca query on a combination of multiple fields (just like a WHERE clause in an SQL query)? Suppose we have three fields indexed:
empId
empName
empGender
Now, I need a query like "where empName like 's%' AND empGender=male"
Thanks.
First, check out Record Filters in the Advanced Development Guide.
If you are trying to use a Record Filter on a property, you will need to enable it explicitly in Developer Studio for that property, while your Dimensions will automatically have the ability to apply a Record Filter. This will help when you have explicit values to filter on, for example empGender.
Your Record Filter can then look as follow:
Nr=AND(empGender:male)
You can further use the Ntk parameter to specify fields to search on. Assuming your empName field is enabled for wildcard searching (configure this in Developer Studio), searching this field will look as follows:
Ntk=empName&Ntt=s*
So assuming your properties have been configured correctly, your example above will probably end up looking as follows:
Nr=AND(empGender:male)&Ntk=empName&Ntt=s*
To take this one step further, you can specify multiple Search Filters (i.e. Ntk + Ntt parameters) together. I haven't tried this with wildcards, so you'll need to confirm it yourself, but to combine Search Filters you delimit them with |:
Ntk=empName|empId&Ntt=s*|1234*
I suggest you manually build up queries in the Reference Application to confirm you get your expected results and then start to code this up in your application.
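
As a hedged illustration (plain Java, no Endeca API involved; the parameter values come from the example above), this is roughly how you would assemble and URL-encode such a combined query string before sending it:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EndecaQueryStringDemo {
    public static void main(String[] args) {
        // Record filter on the explicit value, search filter on the wildcard.
        String nr = URLEncoder.encode("AND(empGender:male)", StandardCharsets.UTF_8);
        String ntk = URLEncoder.encode("empName", StandardCharsets.UTF_8);
        String ntt = URLEncoder.encode("s*", StandardCharsets.UTF_8);
        System.out.println("Nr=" + nr + "&Ntk=" + ntk + "&Ntt=" + ntt);
    }
}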
radimbe, the problem with record filters for this use case is that they need to be precise. This means you don't get spelling correction, thesaurus expansion, case insensitivity or stemming. It's very unlikely that a user will input precise information like this.
Saraubh, you can use boolean search for OR text search queries. You can also use the Endeca Query Language to specify a complex set of boolean logic that goes beyond boolean search and would incorporate spelling correction, stemming, etc.
In general though, I think for an application like this, you should move away from searching specific individual fields simultaneously and make use of the faceting capabilities of dimensions to guide the user. Additionally, a search box that searches many fields in combination simultaneously in order of importance is really the way to go for a simplified user interface for this sort of application.

RESTful API Design OR Predicates

I'm designing a RESTful API and I'm trying to work out how I could represent a predicate with an OR operator when querying for a resource.
For example if I had a resource Foo with a property Name, how would you search for all Foo resources with a name matching "Name1" OR "Name2"?
This is straightforward with an AND operator, as I could do the following:
http://www.website.com/Foo?Name=Name1&Age=19
The other approach I've seen is to post the search in the body.
You will need to pick your own approach, but I can name a few that seem pretty logical (although not without disadvantages):
Option 1.: Using | operator:
http://www.website.com/Foo?Name=Name1|Name2
Option 2.: Using a modified query param that allows selecting any one of the values from a comma-separated list:
http://www.website.com/Foo?Name_in=Name1,Name2
Option 3.: Using PHP-like array notation to provide a list instead of a single string:
http://www.website.com/Foo?Name[]=Name1&Name[]=Name2
All of the above mentioned options have one huge advantage: they do not interfere with other query params.
But as I mentioned, pick your own approach and be consistent about it across your API.
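
As a hedged sketch of how a server might handle Option 2 (the parameter name Name_in follows the option above; the helper itself is hypothetical), the comma-separated value is simply split into the values to OR together:

import java.util.Arrays;
import java.util.List;

public class OrParamDemo {
    // Split a comma-separated query param value into the values to OR together.
    static List<String> parseOrValues(String raw) {
        return (raw == null || raw.isEmpty()) ? List.of() : Arrays.asList(raw.split(","));
    }

    public static void main(String[] args) {
        // e.g. the raw value of ?Name_in=Name1,Name2
        System.out.println(parseOrValues("Name1,Name2")); // [Name1, Name2]
    }
}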
Well, one quick way of fixing that is to add an additional parameter that identifies the relationship between your parameters, whether they're an AND or an OR, for example:
http://www.website.com/Foo?Name=Name1&Age=19&or=true
Or, for much more complex queries, keep just a single parameter containing the whole query, written in a little query language you make up yourself; on the server side you parse that string and extract the information from the statement.

Lucene: Accessing payloads of results of a query

When I'm searching for a query in Lucene, I receive a list of documents as the result. But how can I get the hits within those documents? I want to access the payloads of the words that were found by the query.
If your query contains only one term, you can simply use TermPositions to access the payload of this term. But if you have a more complex query with phrase search, proximity search, and so on, you can't just look up the individual terms in TermPositions.
I would like to receive a List<Token>, a TokenStream, or something similar which contains all the tokens that were found by the query. Then I can iterate over the list and access the payload of each Token.
I solved my problem by using SpanQueries. Nearly every Query can be expressed as a SpanQuery. A SpanQuery gives access to the spans where the hits within a document are. Because the normal QueryParser doesn't produce SpanQueries, I had to write my own parser which only creates SpanQueries. Another option would be the SurroundParser from Lucene-Contrib, which also creates SpanQueries.
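
A minimal sketch of that approach, using the Lucene 3.x-era Spans API (method names changed in later versions; the field and term here are made up):

import java.util.Collection;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PayloadDemo {
    // Iterate over the spans (hits) of a SpanQuery and read their payloads.
    static void printPayloads(IndexReader reader) throws Exception {
        SpanTermQuery query = new SpanTermQuery(new Term("text", "machine"));
        Spans spans = query.getSpans(reader);
        while (spans.next()) { // advance to the next hit
            if (spans.isPayloadAvailable()) {
                Collection<byte[]> payloads = spans.getPayload();
                for (byte[] payload : payloads) {
                    System.out.println(new String(payload));
                }
            }
        }
    }
}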
I think you'll want to start by looking at the Lucene Highlighter, as it highlights the matching terms in the document.

Fulltext Solr statistical search

Suppose I have a number of documents indexed with Solr 4.0. Each has two fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. What kind of analyzers/parsers should I use, and how do I build a statistical query to get a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the TermsComponent and the StatsComponent.
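For example, assuming the stock /terms handler from the example solrconfig.xml and the default collection1 core, a request for the ten most frequent terms in DATA would look something like:
http://localhost:8983/solr/collection1/terms?terms.fl=DATA&terms.sort=count&terms.limit=10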
Besides the answers mentioned here, you can use the HighFreqTerms class; it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command-line application which lets you see the top terms for any field, either by document frequency or by total term frequency (the -t option).
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For the ID field, use an analyzer based on the keyword tokenizer: it will take the entire content of the field as a single token.
For the DATA field, use a language-specific analyzer. Note that it's possible to auto-detect the language of the text (there's a patch for it).
I'm not sure if it's possible to find the most frequent words with Solr alone, but if you can use Lucene itself, pay attention to this question. My own suggestion is to use the HighFreqTerms class from the Luke project.