I am building a "Book search" API using Lucene.
I need to index the book name, author, and book category fields in a Lucene index.
A single book can fall under multiple distinct book categories, for example:
BookName1 -- fiction, humour, philosophy
BookName1 -- fiction, science
BookName1 -- humour, business
BookName4 -- humour
and so on.
The user should be able to search for all the books under a particular category, say "humour".
Given this situation, how do I index the above fields and build the query in Lucene?
You can have a field for a Lucene document occur multiple times. Create the document, add the values for the name and author, then add one category field per category:
create new lucene document
add name field and value
add author field and value
for each category:
    add category field and value
add document to index
When you search the index for a category, it will return all documents that have a category field with the value you're after. The category should be a 'Keyword' field, that is, indexed as a single untokenized term (StringField in recent Lucene versions).
I've written it in English because the specific code differs slightly between Lucene versions.
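To make the steps above concrete, here is a toy in-memory model in Python of what happens when a document carries the same field several times. It is only a sketch of the idea, not Lucene's API: the `add_document`, `search` and `book` helpers, the author names and "BookName2" are made up for illustration, and every field is treated as a keyword (exact match, no tokenizing).

```python
from collections import defaultdict

# Toy inverted index: (field name, exact value) -> set of document ids.
# Models keyword-style (untokenized) fields only; this is NOT Lucene's API.
index = defaultdict(set)
documents = []


def add_document(fields):
    """Index a document given as a list of (name, value) pairs; names may repeat."""
    doc_id = len(documents)
    documents.append(fields)
    for name, value in fields:
        index[(name, value)].add(doc_id)
    return doc_id


def search(field, value):
    """Return every document that has at least one `field` equal to `value`."""
    return [documents[i] for i in sorted(index[(field, value)])]


def book(name, author, categories):
    """Build the field list: one name, one author, one category field per category."""
    fields = [("name", name), ("author", author)]
    fields += [("category", category) for category in categories]
    return fields


# Hypothetical sample data, loosely based on the question.
add_document(book("BookName1", "Author A", ["fiction", "humour", "philosophy"]))
add_document(book("BookName2", "Author B", ["fiction", "science"]))
add_document(book("BookName4", "Author C", ["humour"]))

humour_titles = [dict(doc)["name"] for doc in search("category", "humour")]
print(humour_titles)
```

A book matches as soon as any one of its category values equals the search term, which is the behaviour you get from repeating the field.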
You can create a simple "category" field, where you list all categories for a book separated by spaces.
Then you can search something like:
stock market AND category:(+"business")
Or, if the book has to be in more than one category:
stock market AND category:(+"business" +"philosophy")
I would use Solr instead. It's built on Lucene and managed by the ASF, but is much, much easier to use than Lucene, especially for newcomers.
It offers pretty much all the mainline features of Lucene (certainly everything you'll need for the project you describe), plus extras like snapshotting, replication, schemas and more.
In Solr, you would simply define the fields you want to index in schema.xml, something like this:
<field name="book_id" type="string" indexed="true" stored="true" required="true" multiValued='false'/>
<field name="book_name" type="text" indexed="true" stored="true" required="true" multiValued='false' />
<field name="book_authors" type="text" indexed="true" stored="true" required="true" multiValued='true' />
<field name="book_categories" type="textTight" indexed="true" stored="true" required="true" multiValued='true' />
Note that the multiValued='true' attribute lets you effectively pass an array or list of values to this field; Solr indexes each value separately.
Once you have this, start up Solr and you can ask queries like "book_authors:Hemingway" or "book_categories:Romance book_categories:Mills".
There are several query handlers pre-written and configured for you to do things like parse complex queries (fuzzy matches, boolean operations, scoring boosts, ...), and as Solr's API is exposed over HTTP, all this is wrapped by a number of client libraries, so you don't need to handle the low-level details of crafting queries yourself.
There is a lot of great documentation on their website to get you started.
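Since the API is plain HTTP, a category search is just a URL. Below is a small Python sketch that builds one with the standard library. The collection name "books" and the `solr_select_url` helper are assumptions for illustration; `q`, `rows` and `wt` are ordinary Solr request parameters, and the field name comes from the schema above.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10):
    """Build a URL for Solr's /select handler (hypothetical helper)."""
    params = {"q": query, "rows": rows, "wt": "json"}
    # urlencode percent-encodes the ':' in the field:value query.
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# "books" is an assumed collection name; adjust it to your own setup.
url = solr_select_url("http://localhost:8983/solr", "books", "book_categories:humour")
print(url)
```

In practice you would probably let one of the client libraries do this for you, but it shows there is nothing magic behind a query.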
Related
Sorry for the newbie question, I'm new to Solr. In managed-schema, I see that there are many fields with identical types but different names. How does Solr identify which field to store the tokens in, given that the types are all the same and only the names are different? For instance,
<field name="content_type" type="text_general">
<field name="content_type_hint" type="text_general">
<field name="blitz" type="text_general">
They all have the same type (same analyzer). How does Solr store different content into all these text_general fields? Does it check the names of the tags against the actual content, and if they are not identical, move on to dynamic fields? I searched the web and it seems no one has explained in detail whether the name helps in the process of indexing.
The name and the type are two different things.
<field name="content_type" type="text_general">
In the above case the name of the field is "content_type", which is what you use to search it.
E.g. if you want to search for a document with content_type="xml", you will query something like this:
q=content_type:xml
The type, however, defines the analysis that will occur on a field when documents are indexed or queries are sent to the index.
So somewhere in your schema you will have defined the field type text_general, something like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
You can read more about it here:
https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html
Solr won't store everything into the type. The type just tells it what analysis to run on the field while indexing or querying.
Each field's terms are indexed under that field's name, so every field effectively has its own index.
Edit:
I think you are confused about how data is indexed. Let's take an example.
Let's say I have a document like this:
{
  "content_type" : "text/html",
  "content_type_hint" : "some_hint",
  "blitz" : "some_text"
}
So when you index the document, you tell Solr which value goes into which field.
In this case you are saying the field content_type has the value "text/html" and blitz has the value "some_text".
Solr then runs the analysis defined by that field's type and puts the result into the respective field's index.
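Here is a toy Python model of that flow, in case it helps: the field name picks the type, the type picks the analysis, and the resulting terms are filed under the field name. Everything in it (`SCHEMA`, `ANALYZERS`, `index_document`) is a made-up illustration, and the "text_general" analyzer here just lowercases and splits on whitespace, which is much simpler than what Solr's real analysis chain does.

```python
from collections import defaultdict

# Field type -> analysis function. Deliberately simplistic stand-ins,
# NOT the behaviour of Solr's real text_general analyzer.
ANALYZERS = {
    "text_general": lambda value: value.lower().split(),
    "string": lambda value: [value],
}

# Field name -> field type, mirroring the <field name=... type=...> entries.
SCHEMA = {
    "content_type": "text_general",
    "content_type_hint": "text_general",
    "blitz": "text_general",
}


def index_document(doc_id, doc, postings):
    """File each analyzed term under (field name, term)."""
    for field_name, value in doc.items():
        analyze = ANALYZERS[SCHEMA[field_name]]  # the type only selects the analysis
        for term in analyze(value):
            postings[(field_name, term)].add(doc_id)


postings = defaultdict(set)
document = {
    "content_type": "text/html",
    "content_type_hint": "some_hint",
    "blitz": "some_text",
}
index_document(1, document, postings)

for (field_name, term), doc_ids in sorted(postings.items()):
    print(field_name, term, sorted(doc_ids))
```

All three fields share one analyzer, yet their terms never mix, because the field name is part of the key.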
I'm trying to search for a particular set of keywords (keyword1, keyword2 or keyword3) in a particular field. I'm doing it with this query:
http://localhost:8983/solr/gettingstarted_shard2_replica2/browse?q=keyword1 keyword 2 keyword 3&qf=field1
However, when I run this it finds keyword2 in another field, field2, and returns that row as well! As far as I understand, the qf=field1 parameter limits the search for all the keywords to just field1, right?
Where am I incorrect? Is it because of the schema that I have defined?
My schema config is:
<field name="field1" type="text_general" indexed="true"/>
<field name="field2" type="strings" indexed="false"/>
Disclaimer: I'm the author of the Solr Query Debugger Google Chrome plugin.
I suggest using this debugger to see what is executed and to explain why your query behaves so strangely.
Just execute the Solr query in your browser and then start the Solr Query Debugger plugin.
In the plugin page you'll see Debug and Echo tabs that explain what is executed by Solr. In the Explain tab you'll see the score explanations structured as a tree.
Are you using the standard (default) query parser or eDisMax? If standard (most likely), then you need to use the df parameter.
The qf parameter is used with eDisMax, but then you also need defType=edismax.
Enabling the debug flag will show you which fields the search is actually issued against.
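As a rough sketch, the two variants look like this when built in Python. The core name is taken from the question; the /select path and the helper names are assumptions (the question used /browse). `df`, `defType`, `qf` and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Core name from the question; the /select handler path is an assumption.
BASE = "http://localhost:8983/solr/gettingstarted_shard2_replica2/select"


def standard_parser_url(keywords, field):
    """Standard query parser: df sets the default field for bare terms."""
    params = {"q": " ".join(keywords), "df": field, "debugQuery": "true"}
    return f"{BASE}?{urlencode(params)}"


def edismax_url(keywords, field):
    """eDisMax: qf is only honoured once defType=edismax is set."""
    params = {
        "q": " ".join(keywords),
        "defType": "edismax",
        "qf": field,
        "debugQuery": "true",
    }
    return f"{BASE}?{urlencode(params)}"


keywords = ["keyword1", "keyword2", "keyword3"]
standard_url = standard_parser_url(keywords, "field1")
dismax_url = edismax_url(keywords, "field1")
print(standard_url)
print(dismax_url)
```

With debugQuery=true the response includes the parsed query, so you can check that every term is prefixed with field1.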
I'm new to working with Solr and I have an instance running properly on my server.
My problem is:
When I query Solr with some terms it doesn't return results, even though there are items indexed with those terms. I talked with a developer who used to work with this Solr instance and he remembers something about a "blacklist", or "empty list", or something similar, that acts as a filter for queries. It's like a list of common words that return poor quality results for a query, words like:
"a", "the", "for", ...
I want to know how to manage that list to remove a term from it (or add one, edit one, etc.).
It sounds like you're talking about the stopwords filter. If you have stopword filtering active, you should see something similar to this in your field analysis within schema.xml:
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
This references the file stopwords.txt, which is a standard name for this file, but a different file name may be used, so it could be different on your server. This file will contain a list of words that should be disregarded during search. You should find this file in the conf directory for your index (the same place as the schema.xml and solrconfig.xml). You can edit this file, though for best results, you should re-index your records after you do so.
Alternately, if you would prefer not to filter common words from your search, you can remove the reference to the StopFilterFactory from your field analysis entirely. Again, you should plan to reindex your records after doing so.
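To illustrate why such queries come back empty, here is a toy Python version of a stop filter. It is only a model of the idea, not Solr's implementation: `load_stopwords` and `stop_filter` are made-up names, and the file format (one word per line, blank lines and lines starting with '#' skipped) is my assumption about a typical stopwords.txt.

```python
def load_stopwords(text):
    """Parse stopwords.txt-style content (assumed format: one word per line)."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def stop_filter(tokens, stopwords, ignore_case=True):
    """Drop every token that appears in the stopword list."""
    if ignore_case:
        lowered = {word.lower() for word in stopwords}
        return [token for token in tokens if token.lower() not in lowered]
    return [token for token in tokens if token not in stopwords]


stopwords = load_stopwords("# common words\na\nthe\nfor\n")

indexed_terms = stop_filter("The quest for a ring".split(), stopwords)
query_terms = stop_filter(["the"], stopwords)
print(indexed_terms)
print(query_terms)
```

A query made up only of stopwords ends up with no terms at all, so nothing can match. This is also why you should re-index after editing the list: terms dropped at index time are simply not in the index until the documents are processed again.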
I am trying to index a legacy MySQL database (as in terrible normalization). It contains a releases table with a subject column that holds a comma-delimited list of subject ids, and a lookup table matching those ids to actual subject names. What I'm trying to do is retrieve both the id and the name for each subject into Solr, but it does not seem to be working.
Here is the relevant portion of my data-config.xml:
<entity name="release"
        query="SELECT
               distinct(rkey), subject
               FROM releases_info"
        transformer="RegexTransformer">
  <field column="rkey" name="id" />
  <field name="s" column="subject" splitBy="," />
  <entity name="subject_names" query="SELECT name FROM subjects WHERE id = '${release.subject}'">
    <field name="subjects" column="name" />
  </entity>
</entity>
While the splitBy is working fine and replacing the values of release.subject with the appropriate split array, I've tried various permutations for the second select query but it either fails or retrieves nothing. Any idea where I may have gone wrong?
I am implementing Solr search in my project.
I have a question about how to search dynamic fields that are created in a Solr index.
E.g. this is the tag that is formed in the index
And I am trying to search Solr using this query: Employee_* = 172
Please help me with this if my way of searching is incorrect.
In your queries, you need to define exactly what concrete fields you want to search, e.g. Employee_a, Employee_b (or whatever dynamic fields you've used). You can't search in all dynamic fields by using wildcards in a field name in queries.
Here's a workaround:
create a (static) copyField
copy the dynamic field into the (static) copyField
query the copyField
Your schema.xml could look like this:
<dynamicField name="Employee_*" type="string" indexed="true" stored="true"/>
<field name="emp_static" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="Employee_*" dest="emp_static"/>
Now you can query Solr via:
select?q=emp_static:"172"
You can even tweak it so that the dynamic field itself is not stored or indexed (as you might never query on it directly).
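If it helps to picture what the copyField rule does at index time, here is a toy Python model. It is not Solr code: `COPY_FIELDS` and `apply_copy_fields` are made-up names, the value 305 is invented sample data, and the glob matching is done with Python's `fnmatchcase` as a stand-in for Solr's own pattern handling.

```python
from fnmatch import fnmatchcase

# (source pattern, destination field), mirroring the <copyField> line above.
COPY_FIELDS = [("Employee_*", "emp_static")]


def apply_copy_fields(doc, copy_fields=COPY_FIELDS):
    """Return a copy of doc with matching source values appended to each dest field."""
    result = dict(doc)
    for source_pattern, dest in copy_fields:
        for name, value in doc.items():
            if fnmatchcase(name, source_pattern):
                # The destination is multi-valued: collect every matching value.
                result.setdefault(dest, []).append(value)
    return result


# Hypothetical document with two dynamic fields.
indexed = apply_copy_fields({"id": "1", "Employee_a": "172", "Employee_b": "305"})
print(indexed["emp_static"])

# Querying emp_static:"172" then amounts to a membership test on the copied values.
matches = "172" in indexed["emp_static"]
print(matches)
```

This is also why emp_static needs multiValued="true": one document can contribute several Employee_* values to it.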