No "content" field created when indexing PDF with Solr

I have successfully indexed PDFs using the POST command as described in the following link: http://makble.com/how-to-extract-text-from-pdf-and-post-into-solr
Terms stored within an indexed PDF file can be found using general queries or the text field.
However, I do not see a "content" field in the results, although the other PDF-related fields are generated. I tried editing the managed-schema file to add the following:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<copyField source="content" dest="text"/>
I get the following error when I attempt to reload the core:
<str name="msg">Error handling 'reload' action</str>
<str name="trace">
org.apache.solr.common.SolrException: Error handling 'reload' action at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$2(CoreAdminOperation.java:110) at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:370) at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
My solrconfig.xml has this:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
I would like to have the "content" field available so that I can search only the text located within the indexed PDF files.

1) Do not manually edit the schema file. Use the Schema API instead.
2) In your configuration, fmap.content maps the extracted content to the _text_ field, so a separate content field is never populated.
If you already have a content field defined, removing this parameter from the ExtractingRequestHandler definition should do the job.
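As a minimal sketch of point 1, the field could be added with the Schema API's add-field command instead of editing managed-schema by hand. The host, the core name mycore, and the helper build_add_field_request are assumptions for illustration. The field is marked indexed here on the assumption that it should be searchable; the definition in the question used indexed="false", which would store the text without making it searchable.

```python
import json
import urllib.request


def build_add_field_request(base_url, core, field):
    """Build (but do not send) a Schema API request that adds one field."""
    body = json.dumps({"add-field": field}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{core}/schema",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed field definition, adapted from the question.
content_field = {
    "name": "content",
    "type": "text_general",
    "indexed": True,   # assumption: the field should be searchable
    "stored": True,
    "multiValued": True,
}

# "mycore" and the host are hypothetical; substitute your own.
request = build_add_field_request("http://localhost:8983/solr", "mycore", content_field)
print(request.full_url)
print(request.data.decode("utf-8"))
# To apply it against a running Solr: urllib.request.urlopen(request)
```

Nothing is sent until urlopen is called, so the payload can be inspected first.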

Related

Solr Language Detection

I have a field "text", which I need to copy to text_en or text_es based on the language of "text".
Below is my managed_schema.xml:
<updateRequestProcessorChain name="langid">
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<bool name="langid">true</bool>
<str name="langid.fl">text</str>
<str name="langid.langField">tweet_lang</str>
<str name="langid.whitelist">es,en</str>
<bool name="langid.map">true</bool>
<!--bool name="langid.map.individual">true</bool-->
<str name="langid.map.individual.fl">text</str>
<bool name="langid.map.keepOrig">true</bool>
<str name="langid.fallback">ko</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
I created copy fields text_en and text_es. When I post data in Spanish, it is copied from text to both text_en and text_es!
How do I solve this?
Thanks!
By creating copyFields from text to text_en and text_es, you get incoming data in both fields regardless of the language detection. That is what copyField is supposed to do.
The updateRequestProcessor will actually make a copy (rather than a move) because you set <bool name="langid.map.keepOrig">true</bool>.
Other than that, the processor's config looks fine. Just remove these copyFields and ensure the mapped fields text_en and text_es are properly defined in your schema.
Thanks for the heads-up!
The issue was solved by removing the copy fields and creating the dynamic fields *_es and *_en in schema.xml.
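The difference between the two mechanisms can be illustrated with a toy model. This is not Solr code: apply_copy_fields and apply_langid_map are hypothetical helpers, and the detected language is passed in rather than computed. It only shows that copyField is unconditional while language mapping fills a single language-specific field.

```python
def apply_copy_fields(doc, copy_fields):
    """copyField semantics: each source is copied to each destination, unconditionally."""
    out = dict(doc)
    for source, dest in copy_fields:
        if source in doc:
            out[dest] = doc[source]
    return out


def apply_langid_map(doc, field, detected_lang, keep_orig=True):
    """Simplified langid.map semantics: only <field>_<lang> is populated."""
    out = dict(doc)
    out[f"{field}_{detected_lang}"] = doc[field]
    if not keep_orig:
        # Without keepOrig the value is moved rather than copied.
        del out[field]
    return out


doc = {"text": "hola mundo"}

# Both language fields are filled, whatever the language is.
copied = apply_copy_fields(doc, [("text", "text_en"), ("text", "text_es")])

# Only the field for the detected language ("es", assumed here) is filled.
mapped = apply_langid_map(doc, "text", "es", keep_orig=True)

print(sorted(copied))
print(sorted(mapped))
```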

Facet query gives wrong output on a dynamicField in Solr

I have a dynamicField named 'pa_mydynamicfieldname' in Solr 4.0.
I index the value after encoding it with System.Web.HttpUtility.UrlEncode(pa_mydynamicfieldname), so the stored value looks like this: 2.2+GHz+Intel+Pentium+Dual-Core+E2200
When I apply a facet query, the output is:
<lst name="facet_fields">
<lst name="pa_mydynamicfieldname">
<int name="2.2">1</int>
<int name="2.5">1</int>
<int name="core">1</int>
<int name="dual">1</int>
<int name="e2200">1</int>
<int name="ghz">1</int>
<int name="intel">1</int>
<int name="pentium">1</int>
</lst>
Instead, I want this output:
<lst name="facet_fields">
<lst name="pa_mydynamicfieldname">
<int name="2.2+GHz+Intel+Pentium+Dual-Core+E2200">1</int>
</lst>
How can I do this in Solr when applying a facet query?
Updated on 15-May-13
In the schema, the text field type is defined as:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
And dynamic field is defined as:
<dynamicField name="pa_*" type="text" indexed="true" stored="true" multiValued="true" required="false" />
We need it as a multi-valued field, because a document may have multiple values defined for each product.
Please help me.
Thanks
To get the behavior you want, you will need to change the fieldType for the dynamic field in your schema.xml. Currently, your pa_mydynamicfieldname is probably defined with type="text_general" and multiValued="true". So your field value is being split into tokens, and these tokens are then indexed as separate terms. This produces the behavior you show, with individual words/tokens being returned as facet values.
Since you want to keep the original value as you submit it, please change your fieldType to a plain string and not multiValued:
<dynamicField name="*_mydynamicfieldname" type="string"
indexed="true" stored="true"/>
Alternatively, you can take advantage of the predefined string-based dynamic field defined in the example schema.xml:
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
You will need to reindex your data after making this change to your schema.xml for new field types to be stored properly and reflected in the search results.
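The effect can be reproduced with a rough simulation. The toy_tokenize function below is only a stand-in for a text analyzer (lowercase, then split on anything other than letters, digits and dots); it is not Solr's actual tokenizer, and facet_counts is a hypothetical helper that counts documents per indexed term.

```python
import re
from collections import Counter


def toy_tokenize(value):
    """Rough stand-in for a text analyzer: lowercase, split on non-alphanumerics except dots."""
    return re.findall(r"[a-z0-9.]+", value.lower())


def facet_counts(values, tokenized):
    """Count how many values contain each indexed term."""
    counts = Counter()
    for value in values:
        # A tokenized field indexes each token; a string field indexes the whole value.
        terms = set(toy_tokenize(value)) if tokenized else {value}
        counts.update(terms)
    return counts


values = ["2.2+GHz+Intel+Pentium+Dual-Core+E2200"]

text_facets = facet_counts(values, tokenized=True)
string_facets = facet_counts(values, tokenized=False)

print(sorted(text_facets))
print(dict(string_facets))
```

With the tokenized variant each word becomes its own facet value, while the string variant keeps the submitted value as a single facet entry.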

Solr/Lucene spellcheck suggestions based on multiple fields

I have a database with vendor information: name and address (address, city, zip and country fields). I need to search this database and return some vendors. In the search box, the user could type anything: the name of the vendor, part of the address, city, zip, and so on. If I can't find any results, I need to implement a Google-like "Did you mean" feature to give the user a suggestion.
I thought about using Solr/Lucene to do it. I've installed Solr, exported the information I need to a CSV file, and created the indexes from this file. Now I am able to get suggestions from a Solr field using solr.SpellCheckComponent. The problem is that my suggestions are based on a single field, and I need them to draw on the address, city, zip, country and name fields.
In the Solr config file I have something like this:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="spellcheckIndexDir">spellchecker</str>
</lst>
</searchComponent>
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="spellcheck.onlyMorePopular">false</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
I can run queries like:
http://localhost:8983/solr/spell?q=some_company_name&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
Does anyone know how to change my config file in order to have suggestions from multiple fields?
Thanks!!!
In order to configure Solr spellcheck to use words from several fields you should:
1) Declare a new field. The new field declaration should use the properties type="textSpell" and multiValued="true". For example: <field name="didYouMean" type="textSpell" indexed="true" multiValued="true"/>
2) Copy all the fields whose words should be part of the spellcheck index into the new field. For example: <copyField source="field1" dest="didYouMean"/> <copyField source="field2" dest="didYouMean"/>
3) Configure Solr to use the new field by setting the spellchecker's field name to your spellcheck field name. For example: <str name="field">didYouMean</str>
For more detailed information, see "Solr spellcheck compound from several fields".
You can use copyField for this in schema.xml. <copyField source="*" dest="contentSpell"/> will copy all the fields to contentSpell.
Then change <str name="field">name</str> to <str name="field">contentSpell</str> and you will get suggestions from all fields.
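For the vendor fields in the question, the copyField declarations could be generated rather than typed by hand. This is a small sketch: copy_field_lines is a hypothetical helper, and the destination name didYouMean is taken from the first answer.

```python
from xml.sax.saxutils import quoteattr


def copy_field_lines(sources, dest):
    """Return one <copyField/> declaration per source field."""
    return [
        f"<copyField source={quoteattr(source)} dest={quoteattr(dest)}/>"
        for source in sources
    ]


# Field names taken from the question; adjust to your schema.
vendor_fields = ["name", "address", "city", "zip", "country"]
lines = copy_field_lines(vendor_fields, "didYouMean")
print("\n".join(lines))
```

Listing the sources explicitly, rather than using source="*", keeps unrelated fields such as IDs out of the spellcheck dictionary.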

How do I implement a solr spell checker?

I want to implement a spellchecker component in my search application using Solr. What configuration changes are required?
Add the following section to your solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<!--
Optional, it is required when more than one spellchecker is configured.
Select non-default name with spellcheck.dictionary in request handler.
-->
<str name="name">default</str>
<!-- The classname is optional, defaults to IndexBasedSpellChecker -->
<str name="classname">solr.IndexBasedSpellChecker</str>
<!--
Load tokens from the following field for spell checking,
analyzer for the field's type as defined in schema.xml are used
-->
<str name="field">spell</str>
<!-- Optional, by default use in-memory index (RAMDirectory) -->
<str name="spellcheckIndexDir">./spellchecker</str>
<!-- Set the accuracy (float) to be used for the suggestions. Default is 0.5 -->
<str name="accuracy">0.7</str>
<!-- Require terms to occur in 1/100th of 1% of documents in order to be included in the dictionary -->
<float name="thresholdTokenFrequency">.0001</float>
</lst>
<!-- Example of using different distance measure -->
<lst name="spellchecker">
<str name="name">jarowinkler</str>
<str name="field">lowerfilt</str>
<!-- Use a different Distance Measure -->
<str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
<!-- This field type's analyzer is used by the QueryConverter to tokenize the value for "q" parameter -->
<str name="queryAnalyzerFieldType">textSpell</str>
</searchComponent>
<!--
The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens. Uses a simple regular expression
to strip off field markup, boosts, ranges, etc. but it is not guaranteed to match an exact parse from the query parser.
Optional, defaults to solr.SpellingQueryConverter
-->
<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>
<!-- Add to a RequestHandler
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
<lst name="defaults">
<!-- Optional, must match spell checker's name as defined above, defaults to "default" -->
<str name="spellcheck.dictionary">default</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<!-- Add to a RequestHandler
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REPEAT NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
This config sample is from the Solr Wiki.
After adding it, you can request a build of the spellchecker index:
http://localhost:8983/solr/spell?q=some query&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
Do not include the last parameter (spellcheck.build=true) in every request, because it rebuilds the spelling index each time it is sent. After the first request, the URL becomes:
http://localhost:8983/solr/spell?q=some query&spellcheck=true&spellcheck.collate=true
In the XML section above, don't forget to replace the field spell with the field you want to build your spellchecker against.
And now you can feel the power of spellchecking.
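The two request forms can be built with a small helper that also takes care of URL-encoding the space in the query. This is a sketch only: spellcheck_url is a hypothetical function, and the handler URL is an assumption that must match the request handler name in your own solrconfig.xml.

```python
from urllib.parse import urlencode


def spellcheck_url(handler_url, query, build=False):
    """Build a spellcheck request URL; pass build=True only when the index must be (re)built."""
    params = {"q": query, "spellcheck": "true", "spellcheck.collate": "true"}
    if build:
        params["spellcheck.build"] = "true"
    return f"{handler_url}?{urlencode(params)}"


# Assumed handler URL; adjust the path to your request handler.
handler = "http://localhost:8983/solr/spell"

first = spellcheck_url(handler, "some query", build=True)  # first request builds the index
later = spellcheck_url(handler, "some query")              # later requests reuse it

print(first)
print(later)
```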

Faceting with Solr using "string" fields, "text" fields and "copy" fields

I have a problem with Solr and faceting and am wondering if anyone knows of a fix. I have a workaround for it at the moment, but I really want to work out why my query isn't working.
Here is my Schema, simplified to make it easier to follow:
<fields>
<field name="uniqueid" type="string" indexed="true" required="true"/>
<!-- Indexed and Stored Field -->
<field name="recordtype" type="text" indexed="true" stored="true"/>
<!-- Facet Version of fields -->
<field name="frecordtype" type="string" indexed="true" stored="false"/>
</fields>
<!-- Copy fields for facet searches -->
<copyField source="recordtype" dest="frecordtype"/>
As you can see, I have a case-insensitive field called recordtype, and it's copied to a case-sensitive field, frecordtype, which does not tokenize the text. This is because Solr returns the indexed value rather than the stored value in the faceting results.
When I try the following query:
http://localhost:8080
/solr
/select
?version=2.2
&facet.field=%7b!ex%3dfrecordtype%7dfrecordtype
&facet=on
&fq=%7b!tag%3dfrecordtype%7dfrecordtype%3aLarge%20Record
&f1=*%2cscore
&rows=20
&start=0
&qt=standard
&q=text%3a%25
I don't get any results; however, the faceting still shows there is 1 record.
<result name="response" numFound="0" start="0" />
<lst name="facet_counts">
<lst name="facet_queries" />
<lst name="facet_fields">
<lst name="frecordtype">
<int name="Large Record">1</int>
<int name="Small Record">12</int>
<int name="Other">1</int>
</lst>
</lst>
<lst name="facet_dates" />
</lst>
However, if I change the filter query (line 7 only) to be on "recordtype" instead of frecordtype:
http://localhost:8080
/solr
/select
?version=2.2
&facet.field=%7b!ex%3dfrecordtype%7dfrecordtype
&facet=on
&fq=%7b!tag%3dfrecordtype%7drecordtype%3aLarge%20Record
&f1=*%2cscore
&rows=20
&start=0
&qt=standard
&q=text%3a%25
I get the 1 result back that I want.
<result name="response" numFound="1" start="0" />
<lst name="facet_counts">
<lst name="facet_queries" />
<lst name="facet_fields">
<lst name="frecordtype">
<int name="Large Record">1</int>
<int name="Small Record">12</int>
<int name="Other">1</int>
</lst>
</lst>
<lst name="facet_dates" />
</lst>
So my question is: is there something I need to do in order to get the first version of the query to return the results I want? Perhaps it has something to do with URL encoding? Any hints from Solr gurus or anyone else would be greatly appreciated.
NOTE: This isn't necessarily a faceting question, as the faceting is actually working. It's more of a query question, in that I can't perform a query on a "string" field even though the case and spacing are exactly the same as in the indexed version.
EDIT: For more information on faceting you can check out these blog posts on it:
http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
http://wiki.apache.org/solr/SimpleFacetParameters#facet.limit
Thanks
Dave
You need quotes around the value, e.g.
frecordtype:"Large Record"
works, whereas
frecordtype:Large Record
does not.
The unquoted form searches for Large in frecordtype, which brings back nothing, and then for Record in Solr's default field.
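A short sketch of building the quoted filter query and percent-encoding it for the URL, in the same style as the query in the question. The helper tagged_phrase_filter is hypothetical; it only wraps the value in double quotes (escaping any embedded quotes or backslashes) and prepends the {!tag=...} local parameter.

```python
from urllib.parse import quote


def tagged_phrase_filter(tag, field, value):
    """Return a tagged filter query that matches the whole value as one phrase."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{{!tag={tag}}}{field}:"{escaped}"'


fq = tagged_phrase_filter("frecordtype", "frecordtype", "Large Record")
encoded = quote(fq, safe="")  # percent-encode everything, including the quotes

print(fq)
print("&fq=" + encoded)
```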