Apache Solr topTerms (LukeRequestHandler) not giving correct token count

I am using the Solr 4 trunk build, a couple days old.
According to the Wiki page for the LukeRequestHandler (first example output), we're supposed to get a count of the tokens for each field (or for any specified field). I want to use this to count the number of times each word appears across all my documents. For example, if the word 'is' appears in two MS Word documents, twice in the first and three times in the second, I would expect output like this:
<lst name="text">
<str name="type">text</str>
<str name="schema">IT-M---------</str>
<str name="index">(unstored field)</str>
<int name="docs">2</int>
<int name="distinct">42</int>
<lst name="topTerms">
<int name="is">5</int>
That's because the word "is" occurs a total of five times across the two documents. However, what I actually get is <int name="is">2</int>; I presume this is because the count is per document, and "is" occurs in two distinct documents.
But again, according to the Wiki, we're supposed to get a total count, summed across all the documents, which is what I actually want.
How can I get a total number of times each and every word in all indexed documents appears?
Reference:
http://wiki.apache.org/solr/LukeRequestHandler
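A typical LukeRequestHandler request for per-field top terms looks something like this (the field name and numTerms value here are only examples):
http://localhost:8983/solr/admin/luke?fl=text&numTerms=10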

Doc frequencies returned by TermsComponent are the number of unique documents that match the term, including any documents that have been marked for deletion but not yet removed from the index.
TermVectorComponent returns the per-document term information that is stored when the termVector attribute is set on a field.
TVC can return the term vector, the term frequency, inverse document frequency, and position and offset information.
tv.tf - Return document term frequency info per term in the document.
<lst name="termVectors">
<lst name="doc-5">
<str name="uniqueKey">MA147LL/A</str>
<lst name="includes">
<lst name="cable">
<int name="tf">1</int>
</lst>
<lst name="earbud">
<int name="tf">5</int>
</lst>
<lst name="headphones">
<int name="tf">1</int>
</lst>
<lst name="usb">
<int name="tf">1</int>
</lst>
</lst>
</lst>
...............
</lst>
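For example, assuming the stock example /tvrh request handler (which has the TermVectorComponent in its last-components) and a field named includes with termVectors="true" (both assumptions taken from the sample config), a request of roughly this form returns per-document term frequencies:
http://localhost:8983/solr/tvrh?q=*:*&tv=true&tv.tf=true&tv.fl=includes
Summing the tf values for a term across all matching documents gives the total occurrence count asked for above.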

Related

solr hierarchical clustering

I'm trying to enable hierarchical clustering (sub-cluster generation) in Apache Solr. For this I'm using the Solr Clustering Component, setting the "outputSubClusters" parameter to true.
However, when I show the output in JSON, the object I receive from the clustering process does not show any subclusters, which makes me wonder...what am I missing here?
Here is my clustering component in solrconfig.xml:
<searchComponent name="clustering"
enable="${solr.clustering.enabled:false}"
class="solr.clustering.ClusteringComponent" >
<lst name="engine">
<str name="name">lingo</str>
<!-- Class name of a clustering algorithm compatible with the Carrot2 framework.
Currently available open source algorithms are:
* org.carrot2.clustering.lingo.LingoClusteringAlgorithm
* org.carrot2.clustering.stc.STCClusteringAlgorithm
* org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm
See http://project.carrot2.org/algorithms.html for more information.
A commercial algorithm Lingo3G (needs to be installed separately) is defined as:
* com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm
-->
<str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
<!-- Override location of the clustering algorithm's resources
(attribute definitions and lexical resources).
A directory from which to load algorithm-specific stop words,
stop labels and attribute definition XMLs.
For an overview of Carrot2 lexical resources, see:
http://download.carrot2.org/head/manual/#chapter.lexical-resources
For an overview of Lingo3G lexical resources, see:
http://download.carrotsearch.com/lingo3g/manual/#chapter.lexical-resources
-->
<str name="carrot.resourcesDir">clustering/carrot2</str>
</lst>
<!-- An example definition for the STC clustering algorithm. -->
<lst name="engine">
<str name="name">stc</str>
<str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
</lst>
<!-- An example definition for the bisecting kmeans clustering algorithm. -->
<lst name="engine">
<str name="name">kmeans</str>
<str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
</lst>
</searchComponent>
And the request handler:
<requestHandler name="/clustering_en" startup="lazy" enable="${solr.clustering.enabled:true}" class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<bool name="clustering.results">true</bool>
<!-- Field name with the logical "title" of a each document (optional) -->
<str name="carrot.title">id</str>
<!-- Field name with the logical "URL" of a each document (optional)
<str name="carrot.url">id</str>-->
<!-- Field name with the logical "content" of a each document (optional) -->
<str name="carrot.snippet">answer_en</str>
<!-- Apply highlighter to the title/ content and use this for clustering. -->
<bool name="carrot.produceSummary">true</bool>
<!-- the maximum number of labels per cluster -->
<!--<int name="carrot.numDescriptions">5</int>-->
<!-- produce sub clusters -->
<bool name="carrot.outputSubClusters">true</bool>
<!-- Configure the remaining request handler parameters. -->
<str name="defType">edismax</str>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="q.alt">*:*</str>
<str name="rows">100</str>
<str name="fl">*,score</str>
</lst>
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
I'm really clueless, and I thank you in advance for your support.
The open source algorithms available in Carrot2 (which ships as part of Solr) can only generate flat clusterings. A commercially available clustering algorithm (Lingo3G, mentioned in the config comments above) can be plugged in to produce hierarchical clusters with subclusters.
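If Lingo3G is installed, its engine definition would follow the same pattern as the open-source ones above; a sketch only, with the class name taken from the comment in the config and an arbitrary engine name:
<lst name="engine">
  <str name="name">lingo3g</str>
  <str name="carrot.algorithm">com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm</str>
</lst>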

Apache Solr search is not working when I give the criteria q=value to search

Apache Solr search is not working when I give the criteria q='value to search'. It works fine when I give q=*:*, which fetches all the results.
I am using Apache Solr version 4.7.0.
The question needs more information.
Still, the lack of results could come down to the following:
Did you keep the default query field df=text, or did you edit it in solrconfig.xml?
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
If the default is the text field, did you populate data into the field named "text" in schema.xml?
If the default field is something else, did you populate that field?
With the above clues you should be able to work it out.
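A quick way to check is to query an explicit field and inspect how the query was parsed; a sketch only, where the field name text and the core path collection1 are assumptions to be adjusted to your setup:
http://localhost:8983/solr/collection1/select?q=text:"value to search"&debugQuery=true
debugQuery=true shows which field the terms were actually searched against.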

Solr/Lucene spellcheck suggestions based on multiple fields

I have a database with Vendor's information: name and address (address, city, zip and country fields). I need to search this database and return some vendors. On the search box, the user could type anything: name of the vendor, part of the address, city, zip,... And, if I can't find any results, I need to implement a google like "Did you mean" feature to give a suggestion to the user.
I thought about using Solr/Lucene to do it. I've installed Solr, exported the information I need to a CSV file and created the indexes based on this file. Now I am able to get suggestions from a Solr field using solr.SpellCheckComponent. The thing is, my suggestions are based on a single field, and I need them to draw on the address, city, zip, country and name fields.
On solr config file I have something like this:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="spellcheckIndexDir">spellchecker</str>
</lst>
</searchComponent>
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="spellcheck.onlyMorePopular">false</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count>1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
I can run queries like:
http://localhost:8983/solr/spell?q=some_company_name&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
Does anyone know how to change my config file in order to have suggestions from multiple fields?
Thanks!!!
In order to configure Solr spellcheck to use words from several fields you should:
Declare a new field. The new field declaration should use the properties type="textSpell" and multiValued="true". For example: <field name="didYouMean" type="textSpell" indexed="true" multiValued="true"/>.
Copy all the fields whose words should be part of the spellcheck index into the new field. For example: <copyField source="field1" dest="didYouMean"/>
<copyField source="field2" dest="didYouMean"/>.
Configure Solr to use the new field by setting the spellchecker's field name to it, e.g. <str name="field">didYouMean</str> (see the sketch below).
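Putting the steps together, a minimal sketch; the source field names are taken from the question and should be adjusted to your schema:
<!-- schema.xml: spellcheck source field built from the searchable fields -->
<field name="didYouMean" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="didYouMean"/>
<copyField source="address" dest="didYouMean"/>
<copyField source="city" dest="didYouMean"/>
<copyField source="zip" dest="didYouMean"/>
<copyField source="country" dest="didYouMean"/>
<!-- solrconfig.xml: point the spellchecker at the combined field -->
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">didYouMean</str>
  <str name="spellcheckIndexDir">spellchecker</str>
</lst>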
For more detailed information, visit Solr spellcheck compound from several fields.
You use copyField for this in schema.xml. <copyField source="*" dest="contentSpell"/> will copy all the fields to contentSpell.
Then change <str name="field">name</str> to <str name="field">contentSpell</str> and you will get suggestions from all fields.

How do I implement a solr spell checker?

I want to implement a spellcheck component in my search application using Solr. What configuration changes are required?
Add the following section to your solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<!--
Optional, it is required when more than one spellchecker is configured.
Select non-default name with spellcheck.dictionary in request handler.
-->
<str name="name">default</str>
<!-- The classname is optional, defaults to IndexBasedSpellChecker -->
<str name="classname">solr.IndexBasedSpellChecker</str>
<!--
Load tokens from the following field for spell checking,
analyzer for the field's type as defined in schema.xml are used
-->
<str name="field">spell</str>
<!-- Optional, by default use in-memory index (RAMDirectory) -->
<str name="spellcheckIndexDir">./spellchecker</str>
<!-- Set the accuracy (float) to be used for the suggestions. Default is 0.5 -->
<str name="accuracy">0.7</str>
<!-- Require terms to occur in 1/100th of 1% of documents in order to be included in the dictionary -->
<float name="thresholdTokenFrequency">.0001</float>
</lst>
<!-- Example of using different distance measure -->
<lst name="spellchecker">
<str name="name">jarowinkler</str>
<str name="field">lowerfilt</str>
<!-- Use a different Distance Measure -->
<str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
<!-- This field type's analyzer is used by the QueryConverter to tokenize the value for "q" parameter -->
<str name="queryAnalyzerFieldType">textSpell</str>
</searchComponent>
<!--
  The SpellingQueryConverter converts raw (CommonParams.Q) queries into tokens. Uses a simple regular expression
  to strip off field markup, boosts, ranges, etc. but it is not guaranteed to match an exact parse from the query parser.
  Optional, defaults to solr.SpellingQueryConverter
-->
<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>
<!-- Add to a RequestHandler.
     NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
     THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
-->
<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Optional, must match the spell checker's name as defined above, defaults to "default" -->
    <str name="spellcheck.dictionary">default</str>
    <!-- omp = Only More Popular -->
    <str name="spellcheck.onlyMorePopular">false</str>
    <!-- exr = Extended Results -->
    <str name="spellcheck.extendedResults">false</str>
    <!-- The number of suggestions to return -->
    <str name="spellcheck.count">1</str>
  </lst>
  <!-- REPEAT NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
       THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
  -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
This config sample is from the Solr Wiki.
After adding it, you can build the spellchecker index with a request like:
http://localhost:8983/solr/spell?q=some query&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
Note: do not include the last parameter (spellcheck.build=true) in every request, because it rebuilds the spelling index every time. So after the first request, the URL becomes:
http://localhost:8983/solr/spell?q=some query&spellcheck=true&spellcheck.collate=true
In the XML section above, don't forget to replace the field spell with the field you want to build your spellchecker against.
And now you can feel the power of spellchecking.
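As the notes in the sample config point out, in practice you would usually attach the spellcheck component to your existing /select handler rather than keep a separate one; a rough sketch, with the handler's other defaults omitted:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>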

Faceting with Solr using "string" fields, "text" fields and "copy" fields

I have a problem with Solr and faceting, and I'm wondering if anyone knows the fix. I have a workaround for it at the moment, but I really want to work out why my query isn't working.
Here is my Schema, simplified to make it easier to follow:
<fields>
  <field name="uniqueid" type="string" indexed="true" required="true"/>
  <!-- Indexed and Stored Field -->
  <field name="recordtype" type="text" indexed="true" stored="true"/>
  <!-- Facet Version of fields -->
  <field name="frecordtype" type="string" indexed="true" stored="false"/>
</fields>
<!-- Copy fields for facet searches -->
<copyField source="recordtype" dest="frecordtype"/>
As you can see, I have a case-insensitive field called recordtype, and it's copied to a case-sensitive field frecordtype which does not tokenize the text. This is because Solr returns the indexed value rather than the stored value in the faceting results.
When I try the following query:
http://localhost:8080
/solr
/select
?version=2.2
&facet.field=%7b!ex%3dfrecordtype%7dfrecordtype
&facet=on
&fq=%7b!tag%3dfrecordtype%7dfrecordtype%3aLarge%20Record
&fl=*%2cscore
&rows=20
&start=0
&qt=standard
&q=text%3a%25
I don't get any results; however, the faceting still shows there is 1 record.
<result name="response" numFound="0" start="0" />
<lst name="facet_counts">
<lst name="facet_queries" />
<lst name="facet_fields">
<lst name="frecordtype">
<int name="Large Record">1</int>
<int name="Small Record">12</int>
<int name="Other">1</int>
</lst>
</lst>
<lst name="facet_dates" />
</lst>
However, if I change the filter query (line 7 of the URL, the fq parameter) to be on "recordtype" instead of frecordtype:
http://localhost:8080
/solr
/select
?version=2.2
&facet.field=%7b!ex%3dfrecordtype%7dfrecordtype
&facet=on
&fq=%7b!tag%3dfrecordtype%7drecordtype%3aLarge%20Record
&fl=*%2cscore
&rows=20
&start=0
&qt=standard
&q=text%3a%25
I get the 1 result back that I want.
<result name="response" numFound="1" start="0" />
<lst name="facet_counts">
<lst name="facet_queries" />
<lst name="facet_fields">
<lst name="frecordtype">
<int name="Large Record">1</int>
<int name="Small Record">12</int>
<int name="Other">1</int>
</lst>
</lst>
<lst name="facet_dates" />
</lst>
So my question is: is there something I need to do in order to get the first version of the query to return the results I want? Perhaps it's something to do with URL encoding? I'd be very grateful for any hints from Solr gurus or otherwise.
NOTE: This isn't necessarily a faceting question, as the faceting is actually working. It's more a query question, in that I can't perform a query on a "string" field even though the case and spacing exactly match the indexed value.
EDIT: For more information on faceting you can check out these blog post's on it:
http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
http://wiki.apache.org/solr/SimpleFacetParameters#facet.limit
Thanks
Dave
You need quotes around the value.
E.g.
frecordtype:"Large Record"
works, whereas
frecordtype:Large Record
does not: it searches for Large in the frecordtype field (which brings back nothing, since the string field holds the entire value) and then for Record across the default field in Solr.
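Applied to the original query, line 7 (the fq parameter) would need URL-encoded quotes around the value, roughly like this (a sketch; %22 is the encoded double quote):
&fq=%7b!tag%3dfrecordtype%7dfrecordtype%3a%22Large%20Record%22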