I want to implement a spellchecker component in my search application using Solr. What configuration changes are required for it?
Add the following section to your solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<!--
Optional, it is required when more than one spellchecker is configured.
Select non-default name with spellcheck.dictionary in request handler.
-->
<str name="name">default</str>
<!-- The classname is optional, defaults to IndexBasedSpellChecker -->
<str name="classname">solr.IndexBasedSpellChecker</str>
<!--
Load tokens from the following field for spell checking,
analyzer for the field's type as defined in schema.xml are used
-->
<str name="field">spell</str>
<!-- Optional, by default use in-memory index (RAMDirectory) -->
<str name="spellcheckIndexDir">./spellchecker</str>
<!-- Set the accuracy (float) to be used for the suggestions. Default is 0.5 -->
<str name="accuracy">0.7</str>
<!-- Require terms to occur in 1/100th of 1% of documents in order to be included in the dictionary -->
<float name="thresholdTokenFrequency">.0001</float>
</lst>
<!-- Example of using different distance measure -->
<lst name="spellchecker">
<str name="name">jarowinkler</str>
<str name="field">lowerfilt</str>
<!-- Use a different Distance Measure -->
<str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
<!-- This field type's analyzer is used by the QueryConverter to tokenize the value for "q" parameter -->
<str name="queryAnalyzerFieldType">textSpell</str>
</searchComponent>
<!--
  The SpellingQueryConverter converts raw (CommonParams.Q) queries into tokens. It uses a simple regular expression
  to strip off field markup, boosts, ranges, etc., but it is not guaranteed to match an exact parse from the query parser.
  Optional; defaults to solr.SpellingQueryConverter
-->
<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>
<!-- Add to a RequestHandler
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
<lst name="defaults">
<!-- Optional, must match spell checker's name as defined above, defaults to "default" -->
<str name="spellcheck.dictionary">default</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<!-- Add to a RequestHandler
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REPEAT NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
This config sample is from the Solr Wiki.
After adding it, you can send a request that builds the spellchecker index:
http://localhost:8983/solr/spell?q=some query&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
(The path has to match your request handler name, so with the /spellCheckCompRH handler defined above it would be /solr/spellCheckCompRH rather than /solr/spell.)
Do not include the spellcheck.build=true part in every request, because it would rebuild the spelling index on every call; after the first request, the query simply becomes
http://localhost:8983/solr/spell?q=some query&spellcheck=true&spellcheck.collate=true
In the XML section above, don't forget to replace the field spell with the field you want to build your spellchecker against.
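If you don't already have such a field, here is a minimal sketch of what it might look like in schema.xml; the textSpell type matches the queryAnalyzerFieldType above, and the source field name title is only an illustration:
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spell"/>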
And now you can feel the power of spellchecking
Related
I have successfully indexed PDFs using the POST command as described in the following link: http://makble.com/how-to-extract-text-from-pdf-and-post-into-solr
Terms stored within an indexed PDF file can be queried and can be found using general queries or the text field.
However, I do not see a "content" field being generated, as I do for the other PDF-related fields. I tried editing the managed-schema file to add the fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<copyField source="content" dest="text"/>
I get the following error when I attempt to reload the core:
<str name="msg">Error handling 'reload' action</str>
<str name="trace">
org.apache.solr.common.SolrException: Error handling 'reload' action
  at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$2(CoreAdminOperation.java:110)
  at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:370)
  at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
My solrconfig.xml has this:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
I would like to have the "content" field available so I can search only the text located within the indexed PDF files.
1) Do not manually edit the schema file. Instead use the Schema API.
2) fmap.content maps the content field to the _text_ field in your case.
If you have a content field already defined, then just removing this particular parameter from the ExtractingRequestHandler definition should do the job.
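For instance, a field such as content can be added through the Schema API rather than by hand-editing managed-schema. A sketch, assuming a core named your_core; the field properties simply mirror the definition from the question (use indexed: true if you want to query the field directly):
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "content",
    "type": "text_general",
    "indexed": false,
    "stored": true,
    "multiValued": true
  }
}' http://localhost:8983/solr/your_core/schema
After the field exists, reloading the core should no longer fail on that definition.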
I have a field "text", which I need to copy to text_en or text_es based on the language of "text".
Below is my managed_schema.xml:
<updateRequestProcessorChain name="langid">
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<bool name="langid">true</bool>
<str name="langid.fl">text</str>
<str name="langid.langField">tweet_lang</str>
<str name="langid.whitelist">es,en</str>
<bool name="langid.map">true</bool>
<!--bool name="langid.map.individual">true</bool-->
<str name="langid.map.individual.fl">text</str>
<bool name="langid.map.keepOrig">true</bool>
<str name="langid.fallback">ko</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
I created copyFields for text_en and text_es. When I post data in Spanish, the data is copied from text to both text_en and text_es!
How do I solve this?
Thanks!
By creating copyFields from text to text_en and text_es you get the incoming data into both fields regardless of the language detection; that is what copyField is supposed to do.
The update request processor will actually make a copy (rather than a move) because you set <bool name="langid.map.keepOrig">true</bool>.
Other than that, the processor's config looks fine; just remove these copyFields and make sure the mapped fields text_en and text_es are properly defined in your schema.
Thanks for the heads-up!
The issue was solved by removing the copyFields and creating dynamic fields *_es and *_en in schema.xml.
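For reference, a sketch of what those dynamic field declarations might look like; the text_en and text_es field types are the language types shipped with Solr's default configs, and the attribute values are illustrative:
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_es" type="text_es" indexed="true" stored="true"/>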
I'm trying to enable hierarchical clustering (sub-cluster generation) in Apache Solr. For this I'm using the Solr Clustering Component and setting the "outputSubClusters" parameter to true.
However, when I look at the JSON output, the object I receive from the clustering process does not show any subclusters, which makes me wonder... what am I missing here?
Here is my clustering component in solrconfig.xml:
<searchComponent name="clustering"
enable="${solr.clustering.enabled:false}"
class="solr.clustering.ClusteringComponent" >
<lst name="engine">
<str name="name">lingo</str>
<!-- Class name of a clustering algorithm compatible with the Carrot2 framework.
Currently available open source algorithms are:
* org.carrot2.clustering.lingo.LingoClusteringAlgorithm
* org.carrot2.clustering.stc.STCClusteringAlgorithm
* org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm
See http://project.carrot2.org/algorithms.html for more information.
A commercial algorithm Lingo3G (needs to be installed separately) is defined as:
* com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm
-->
<str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
<!-- Override location of the clustering algorithm's resources
(attribute definitions and lexical resources).
A directory from which to load algorithm-specific stop words,
stop labels and attribute definition XMLs.
For an overview of Carrot2 lexical resources, see:
http://download.carrot2.org/head/manual/#chapter.lexical-resources
For an overview of Lingo3G lexical resources, see:
http://download.carrotsearch.com/lingo3g/manual/#chapter.lexical-resources
-->
<str name="carrot.resourcesDir">clustering/carrot2</str>
</lst>
<!-- An example definition for the STC clustering algorithm. -->
<lst name="engine">
<str name="name">stc</str>
<str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
</lst>
<!-- An example definition for the bisecting kmeans clustering algorithm. -->
<lst name="engine">
<str name="name">kmeans</str>
<str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
</lst>
</searchComponent>
And the request handler:
<requestHandler name="/clustering_en" startup="lazy" enable="${solr.clustering.enabled:true}" class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<bool name="clustering.results">true</bool>
<!-- Field name with the logical "title" of a each document (optional) -->
<str name="carrot.title">id</str>
<!-- Field name with the logical "URL" of a each document (optional)
<str name="carrot.url">id</str>-->
<!-- Field name with the logical "content" of a each document (optional) -->
<str name="carrot.snippet">answer_en</str>
<!-- Apply highlighter to the title/ content and use this for clustering. -->
<bool name="carrot.produceSummary">true</bool>
<!-- the maximum number of labels per cluster -->
<!--<int name="carrot.numDescriptions">5</int>-->
<!-- produce sub clusters -->
<bool name="carrot.outputSubClusters">true</bool>
<!-- Configure the remaining request handler parameters. -->
<str name="defType">edismax</str>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="q.alt">*:*</str>
<str name="rows">100</str>
<str name="fl">*,score</str>
</lst>
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
I'm really clueless, and I thank you in advance for your support.
The open source algorithms available in Carrot2 (which ships as part of Solr) can only generate flat clusterings. A commercially available clustering algorithm can be plugged in to produce hierarchical clusters (sub-clusters): Lingo3G, the com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm mentioned in the comments of your searchComponent. That is why carrot.outputSubClusters has no visible effect with the Lingo, STC, or bisecting k-means engines.
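If Lingo3G is installed, a sketch of the additional engine definition might look like this (the engine name lingo3g is illustrative; the class name is the one from the comment in your searchComponent):
<lst name="engine">
  <str name="name">lingo3g</str>
  <str name="carrot.algorithm">com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm</str>
</lst>
You would then select it per request with clustering.engine=lingo3g, or set <str name="clustering.engine">lingo3g</str> in the request handler defaults.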
I have a schema that contains a fairly large text field.
I've gzipped it and enabled lazy loading, but it will still be fetched unless every client using Solr explicitly sets the field list (fl) parameter.
How can I configure Solr to omit the large gzipped text field from the results when querying without a field list parameter?
The simplest way to do this is to add the field list to the requestHandler. Assuming that you are using the default /select request handler, you will need to modify your solrconfig.xml adding the fl option to the list of defaults for the /select requestHandler. See below for an example.
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="fl">field1,field2,field3</str>
</lst>
....
</requestHandler>
So in this example, I am setting the fl parameter so the query will return field1, field2 and field3 by default. These will be the fields returned when querying, unless the request specifies its own fl parameter, in which case whatever fields it asks for will be returned.
These defaults can be set per requestHandler, so if you are using a different requestHandler, just modify your configuration as needed.
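To illustrate, two example requests (the URLs assume a core named mycore; field1, field2 and field3 are the placeholders from the config above):
http://localhost:8983/solr/mycore/select?q=*:*
returns only field1, field2 and field3, so the large field stays out of the response by default.
http://localhost:8983/solr/mycore/select?q=*:*&fl=*
explicitly overrides the default and returns every stored field, including the large one.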
Hope this helps.
I have a database with vendors' information: name and address (address, city, zip and country fields). I need to search this database and return matching vendors. In the search box, the user could type anything: the vendor's name, part of the address, the city, the zip,... And if I can't find any results, I need to implement a Google-like "Did you mean" feature to give the user a suggestion.
I thought about using Solr/Lucene to do it. I've installed Solr, exported the information I need to a CSV file and created the indexes based on this file. Now I am able to get suggestions from a single Solr field using solr.SpellCheckComponent. The thing is, my suggestions are based on a single field, and I need them to draw on the address, city, zip, country and name fields.
In my Solr config file I have something like this:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="spellcheckIndexDir">spellchecker</str>
</lst>
</searchComponent>
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="spellcheck.onlyMorePopular">false</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count>1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
I can run queries like:
http://localhost:8983/solr/spell?q=some_company_name&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
Does anyone know how to change my config file in order to have suggestions from multiple fields?
Thanks!!!
In order to configure Solr spellcheck to use words from several fields you should (a consolidated sketch follows these steps):
1) Declare a new field. The new field declaration should use the properties type="textSpell" and multiValued="true". For example: <field name="didYouMean" type="textSpell" indexed="true" multiValued="true"/>.
2) Copy every field whose words should be part of the spellcheck index into the new field. For example:
<copyField source="field1" dest="didYouMean"/>
<copyField source="field2" dest="didYouMean"/>
3) Configure Solr to use the new field by setting the spellchecker's field to the new field name. For example: <str name="field">didYouMean</str>.
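Putting the three steps together, the relevant pieces might look like this (a sketch only; didYouMean follows the example above, and the source fields are the ones from the question):
In schema.xml:
<field name="didYouMean" type="textSpell" indexed="true" multiValued="true"/>
<copyField source="name" dest="didYouMean"/>
<copyField source="address" dest="didYouMean"/>
<copyField source="city" dest="didYouMean"/>
<copyField source="zip" dest="didYouMean"/>
<copyField source="country" dest="didYouMean"/>
In solrconfig.xml, inside the spellcheck searchComponent:
<str name="field">didYouMean</str>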
For more detailed information, see "Solr spellcheck compound from several fields".
You can use copyField for this in schema.xml: <copyField source="*" dest="contentSpell"/> will copy all the fields to contentSpell.
Then change <str name="field">name</str> to <str name="field">contentSpell</str> and you will get suggestions from all fields.