Solr queryparser for lucene indices? - lucene

I've created an index (using Lucene 2.9) that stores text messages. (The documents also contain some other metadata, which is stored but not indexed.) I use the StandardAnalyzer for parsing these messages. I'm trying to run some tests on this index using Solr (I replaced the example app index with my index) to see what kind of results I get from various queries.
When I tried the following query, I got 0 results:
"text:happiness"
However, changing that to "text:happiness*" gives me some results, all of which contain text like "happiness," or "happiness.". So I thought it was a tokenization issue during index creation. However, when I used Luke (a Lucene index debugging tool) to run the same query (text:happiness), I got exactly the same results that I get for happiness* from Solr, which led me to believe that the problem is not in the indexing but in the way I'm specifying my Solr query. I looked at solrconfig.xml and noticed that it has the following line (commented out). I tried uncommenting it and then added "defType=lucene" to the original query, but got the same results.
<queryParser name="lucene" class="org.apache.solr.search.LuceneQParserPlugin"/>
I have very little experience with Solr, so any help is greatly appreciated :)

The field that I was querying was defined as type "text" in the Solr schema.xml (not solrconfig.xml, as I incorrectly mentioned in my earlier comment). Here's the relevant snippet from schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
I replaced it with the following,
<fieldType name = "text" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>
Which gives me the required behavior.
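The mismatch can be illustrated with a toy model. This is only a sketch, not Lucene or Solr code: index_analyzer, query_analyzer and toy_stem are hypothetical stand-ins, and it assumes the stock "text" field type also stems at query time, which the truncated snippet above does not show.

```python
import re

def index_analyzer(text):
    """Toy stand-in for StandardAnalyzer: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def toy_stem(token):
    """Hypothetical, deliberately crude stemmer (not the real Porter algorithm)."""
    return token[:-4] if token.endswith("ness") else token

def query_analyzer(text):
    """Toy stand-in for a whitespace tokenizer followed by lowercasing and stemming."""
    return [toy_stem(t.lower()) for t in text.split()]

docs = ["Money can't buy happiness.", "Happiness, they say, is a choice."]
indexed_terms = set()
for doc in docs:
    indexed_terms.update(index_analyzer(doc))

# A plain term query is analyzed at query time, so the stemmed form is looked up.
term_query = query_analyzer("happiness")[0]
term_hit = term_query in indexed_terms

# A wildcard prefix is not analyzed, so the raw prefix is compared with the index.
prefix_hits = sorted(t for t in indexed_terms if t.startswith("happiness"))

print(term_query, term_hit, prefix_hits)
```

Under these assumptions the term query looks up a stem that was never indexed, while the unanalyzed prefix still matches. Using the same analyzer on both sides, as in the replacement field type, removes the discrepancy.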

Related

Solr field name rules?

Sorry for the newbie question, I'm new to Solr. In managed-schema, I see that there are many fields with identical types but different names. How does Solr identify which field to store the tokens in, given that the types are all the same and only the names are different? For instance,
<field name="content_type" type="text_general">
<field name="content_type_hint" type="text_general">
<field name="blitz" type="text_general">
They all have the same type (same analyzer). How does Solr store different content in all these text_general fields? Does it match the tag names against the actual content, and if they are not identical, move on to dynamic fields? I searched the web and it seems no one has explained in detail whether the name plays a part in indexing.
Name and type are two different things.
<field name="content_type" type="text_general">
In the above case the name of the field is "content_type", and that name is what you use to search it.
For example, if you want to find a document with content_type="xml", you would query something like this:
q=content_type:xml
The type, however, defines the analysis that occurs on a field when documents are indexed or queries are sent to the index.
So somewhere in your schema the field type text_general will be defined, something like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
You can read more about it here
https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html
Solr won't store everything under the type. The type just tells it what analysis to run on the field while indexing or querying.
Every field will have its own index.
Edit:
I think you are confused about how data is indexed. Let's take an example.
Let's say I have a document like this:
{
"content_type" : "text/html",
"content_type_hint" : "some_hint",
"blitz" : "some_text"
}
When you index the document, you tell Solr which value goes into which field.
In this case you are saying that the field content_type has the value "text/html" and blitz has the value "some_text".
Solr then runs the analysis defined by that field's type and puts the resulting terms into the respective field's index.
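The separation of name and type can be sketched with a toy inverted index. This is an illustration only, not how Solr is implemented: the text_general function, the schema dict, and the add_document and search helpers are all hypothetical names.

```python
from collections import defaultdict

def text_general(value):
    """Toy analyzer standing in for a text_general type: lowercase, split on whitespace."""
    return value.lower().split()

# Toy schema: the field name maps to its type, which here is just an analysis function.
schema = {
    "content_type": text_general,
    "content_type_hint": text_general,
    "blitz": text_general,
}

# One inverted index per field name: field -> term -> set of document ids.
index = defaultdict(lambda: defaultdict(set))

def add_document(doc_id, doc):
    """The document itself says which value belongs to which field."""
    for field, value in doc.items():
        analyze = schema[field]          # the type decides how the value is analyzed
        for term in analyze(value):
            index[field][term].add(doc_id)   # the name decides where the terms go

def search(field, term):
    return sorted(index[field].get(term, set()))

add_document(1, {
    "content_type": "text/html",
    "content_type_hint": "some_hint",
    "blitz": "some_text",
})

print(search("blitz", "some_text"))         # the term was indexed under blitz
print(search("content_type", "some_text"))  # same type, different field
```

All three fields share one analyzer, yet a term indexed under blitz is not found when searching content_type, because the lookup is keyed by field name.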

Solr query a particular field ONLY using qf: parameter

I'm trying to search for a particular set of keywords (keyword1, keyword2 or keyword3) in a particular field. I'm doing it using the query:
http://localhost:8983/solr/gettingstarted_shard2_replica2/browse?q=keyword1 keyword2 keyword3&qf=field1
However, when I run this it finds keyword2 in another field, field2, and returns that row as well! As far as I understand, the qf=field1 parameter limits the search for all the keywords to just field1, right?
Where am I incorrect? Is it because of the schema that I have defined?
My schema config is:
<field name="field1" type="text_general" indexed="true"/>
<field name="field2" type="strings" indexed="false"/>
Disclaimer: I'm the author of the Solr Query Debugger Google Chrome plugin.
I suggest using this debugger to see what is executed and to explain why your query behaves so strangely.
Just execute the Solr query in your browser and then start the Solr Query Debugger plugin.
On the plugin page you'll see Debug and Echo tabs, which explain what Solr executed. In the Explain tab you'll see the score explanations structured as a tree.
Are you using the standard (default) query parser or the eDisMax one? If standard (most likely), then you need to use the df parameter.
The qf parameter is used with eDisMax, but then you also need defType=edismax.
Enabling the debug flag will show you which fields the search is actually issued against.
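As a rough sketch, the two variants of the request could be built like this. The build_select_url helper is hypothetical, and the /select handler path is an assumption (the question used /browse); defType, qf, df and debugQuery are standard Solr request parameters.

```python
from urllib.parse import urlencode

def build_select_url(base, keywords, field, use_edismax=True):
    """Build a query URL restricted to one field (hypothetical helper)."""
    params = {"q": " ".join(keywords), "debugQuery": "true"}
    if use_edismax:
        # qf is only honoured by the (e)DisMax parsers, so select one explicitly.
        params["defType"] = "edismax"
        params["qf"] = field
    else:
        # The standard parser uses df as the default field for unqualified terms.
        params["df"] = field
    return base + "?" + urlencode(params)

# Assumed handler path; the core name is taken from the question.
base = "http://localhost:8983/solr/gettingstarted_shard2_replica2/select"
keywords = ["keyword1", "keyword2", "keyword3"]

edismax_url = build_select_url(base, keywords, "field1")
standard_url = build_select_url(base, keywords, "field1", use_edismax=False)
print(edismax_url)
print(standard_url)
```

With debugQuery=true the response includes the parsed query, which shows the fields each keyword was actually searched in.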

How do I get Solr to not index common words in a query?

I'm new to working with Solr and I have an instance running properly on my server.
My problem is:
When I query Solr with some terms it doesn't return results, even though there are items indexed with those terms. I talked with a developer who used to work with this Solr instance, and he remembers something about a "blacklist", or "empty list", or something similar, that acts as a filter for queries: a list of common words that return poor-quality results, words like:
"a", "the", "for", ...
I want to know how to manage that list so I can remove a term from it (or add one, edit one, etc.).
It sounds like you're talking about the stopwords filter. If you have stopword filtering active, you should see something similar to this in your field analysis within schema.xml
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
This references the file stopwords.txt, which is a standard name for this file, but a different file name may be used, so it could be different on your server. This file will contain a list of words that should be disregarded during search. You should find this file in the conf directory for your index (the same place as the schema.xml and solrconfig.xml). You can edit this file, though for best results, you should re-index your records after you do so.
Alternately, if you would prefer not to filter common words from your search, you can remove the reference to the StopFilterFactory from your field analysis entirely. Again, you should plan to reindex your records after doing so.
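A toy model of what the filter does may help explain why some queries return nothing. This is a sketch, not Solr code: load_stopwords and analyze are hypothetical helpers, and treating lines starting with "#" as comments is an assumption about the file format.

```python
def load_stopwords(lines):
    """Parse stopword lines, skipping blank lines and '#' comments (assumed format)."""
    words = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words

def analyze(text, stopwords):
    """Toy analysis chain: lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]

# Stand-in for the contents of stopwords.txt.
stopwords = load_stopwords(["# sample list", "a", "the", "for", ""])

print(analyze("The house for sale", stopwords))  # stopwords are dropped
print(analyze("the for", stopwords))             # nothing is left to search for
```

Because the same filter usually runs at index time, removing a word from the list only affects documents indexed afterwards, which is why reindexing is recommended.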

Solr's SnowballPorterFilterFactory and Wildcard parameters

I'm having an issue querying Solr using the following field type:
<fieldType name="text_ci" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
</analyzer>
</fieldType>
As you can see, it applies the "SnowballPorterFilterFactory" when indexing and querying. If I index something like
Mouse stuff and fun
It gets indexed as:
As you can see, the word "Mouse" is turned into "Mous" by the "SnowballPorterFilterFactory", which is what we want. However, when we search for
Mouse*
It doesn't seem to apply the "SnowballPorterFilterFactory" in the same way, I guess because of the * at the end.
My question is: is there a way to make the "SnowballPorterFilterFactory" aware of wildcards, so that when I query for
Mouse*
I don't get 0 results.
Interestingly, if I query for
mous*
The record does come back.
Or can someone offer a better way to query/index this type of field?
Thanks Dave
From the FAQ:
Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. These queries are case-insensitive anyway because QueryParser makes them lowercase. This behavior can be changed using the setLowercaseExpandedTerms(boolean) method
If you're fine with changing your Solr source, SOLR-757 has a patch attached to it which you might find useful. I don't know of a way to change this other than diving into the source though.
What might be a simpler idea: just have a copy field which is not stemmed. The user can search both of these fields, and then mouse* will match in the non-stemmed field.
(EDIT: actually, looking at that patch, I'm not sure it will do what you want. But basically you just need to change your query handler to stem first.)
Last time I checked, the query analyzer is not used when you use wildcards. So, since you are using a LowerCaseFilterFactory, your terms are indexed in lower case and searching for Mous* won't return anything.
I think the only thing to do when you are using wildcards is to adapt your query to the way your terms are indexed (similar to what your query analyzer would do).
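Both suggestions above (an unstemmed copy of the field, and adapting the wildcard term by hand) can be sketched with a toy model. Everything here is hypothetical: toy_stem is a crude stand-in for the Snowball stemmer, and the two analyze functions stand in for a stemmed field and an unstemmed copy of it.

```python
STOPWORDS = {"and"}

def toy_stem(token):
    """Hypothetical crude stemmer: drops a trailing 'e' (not the real Snowball stemmer)."""
    return token[:-1] if token.endswith("e") else token

def analyze_stemmed(text):
    """Stand-in for the text_ci chain: lowercase, drop stopwords, stem."""
    return [toy_stem(t) for t in text.lower().split() if t not in STOPWORDS]

def analyze_unstemmed(text):
    """Stand-in for an unstemmed copy of the field: lowercase and drop stopwords only."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def prefix_search(terms, raw_prefix, lowercase=False):
    """Wildcard terms skip the analyzer, so the prefix is compared as typed."""
    prefix = raw_prefix.lower() if lowercase else raw_prefix
    return sorted(t for t in terms if t.startswith(prefix))

doc = "Mouse stuff and fun"
stemmed_terms = set(analyze_stemmed(doc))
unstemmed_terms = set(analyze_unstemmed(doc))

print(prefix_search(stemmed_terms, "Mouse"))                    # no match
print(prefix_search(stemmed_terms, "mous"))                     # matches the stem
print(prefix_search(unstemmed_terms, "Mouse", lowercase=True))  # matches the copy
```

In this model, Mouse* fails against the stemmed terms for two reasons: the capital M and the missing stem. Lowercasing the prefix on the client and searching the unstemmed copy avoids both.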

Field having multiple distinct values

I am building a "Book search" API using Lucene.
I need to index the Book Name, Author, and Book Category fields in a Lucene index.
A single book can fall under multiple distinct book categories, for example:
BookName1 -- fiction, humour, philosophy
BookName2 -- fiction, science
BookName3 -- humour, business
BookName4 -- humour
and so on.
The user should be able to search for all the books under a particular category, say "humour".
Given this situation, how do I index the above fields and build the query in Lucene?
A field can occur multiple times in a Lucene document. Create the document, add the values for the name and author, then add one category field per category:
create new lucene document
add name field and value
add author field and value
for each category:
    add category field and value
add document to index
When you search the index for a category, it will return all documents that have a category field with the value you're after. The category should be a 'Keyword' field.
I've written it in English because the specific code is slightly different per Lucene version.
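The steps above can be mimicked with a small toy index. This is not Lucene code, just a sketch of the idea: add_book, books_in and the author names are hypothetical, and exact-match lookup stands in for a 'Keyword' (untokenized) field.

```python
from collections import defaultdict

documents = {}                       # book name -> stored fields
category_index = defaultdict(set)    # exact category value -> book names

def add_book(name, author, categories):
    """Add one document; the category field occurs once per category."""
    documents[name] = {"name": name, "author": author, "category": list(categories)}
    for category in categories:
        # Keyword-style field: the whole value is indexed as a single term.
        category_index[category].add(name)

def books_in(category):
    """Return every book that has a category field with exactly this value."""
    return sorted(category_index.get(category, set()))

# Author names are made up for the example.
add_book("BookName1", "Author A", ["fiction", "humour", "philosophy"])
add_book("BookName4", "Author B", ["humour"])

print(books_in("humour"))
print(books_in("philosophy"))
```

Each category value becomes its own term pointing back at the book, so one document is reachable through any of its categories.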
You can create a simple "category" field, where you list all categories for a book separated by spaces.
Then you can search something like:
stock market AND category:(+"business")
Or, if you want to search in more than one category:
stock market AND category:(+"business" +"philosophy")
I would use Solr instead. It's built on Lucene and managed by the ASF, but is much, much easier to use than Lucene, especially for newcomers.
It offers pretty much all the mainline features of Lucene (certainly everything you'll need for the project you describe), plus extra things like snapshotting, replication, schemas, ...
In Solr, you would simply define the fields you want to index, something like this in schema.xml:
<field name="book_id" type="string" indexed="true" stored="true" required="true" multiValued='false'/>
<field name="book_name" type="text" indexed="true" stored="true" required="true" multiValued='false' />
<field name="book_authors" type="text" indexed="true" stored="true" required="true" multiValued='true' />
<field name="book_categories" type="textTight" indexed="true" stored="true" required="true" multiValued='true' />
Note that the multiValued='true' attribute lets you effectively pass an array or list to this field, which gets split and indexed nicely by Solr.
Once you have this, start up Solr and you can run queries like "book_authors:Hemingway" or "book_categories:Romance book_categories:Mills".
There are several query handlers pre-written and configured for you to do things like parse complex queries (fuzzy matches, boolean operations, scoring boosts, ...), and as Solr's API is exposed over HTTP, all this is wrapped by a number of client libraries, so you don't need to handle the low-level details of crafting queries yourself.
There is lots of great documentation on their website to get you started.