Solr file indexed, but zero results while querying - indexing

While trying to index a PDF, Solr returns this:
D:\Solr\solr-8.11.2\bin>java -DDauto -Dc=profiles_Index -Drecursive -jar D:/Solr/solr-8.11.2/example/exampledocs/post.jar "D:\LCMS\portalDocs\RESUME\Naukri\Automobile-ElectricVehicle(EV)\Research&Development\0\Naukri_LokeshSampath[10y_0m].pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/profiles_Index/update using content-type application/xml...
Entering recursive mode, max depth=999, delay=0s
POSTing file Naukri_LokeshSampath[10y_0m].pdf to [base]
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/profiles_Index/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">2</int>
</lst>
<lst name="error">
<lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">java.io.CharConversionException</str>
</lst>
<str name="msg">Invalid UTF-8 start byte 0xb5 (at char #12, byte #-1)</str>
<int name="code">400</int>
</lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/profiles_Index/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/profiles_Index/update...
Time spent: 0:00:00.066
I am trying to index a PDF file. Can anyone help me? I am new to indexing.
Thanks in advance.
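For reference, a hedged reading of the log above: post.jar reports that it is posting with content-type application/xml (second line of the output), so Solr tries to parse the raw PDF bytes as XML and fails with the UTF-8 error. Note that the command uses -DDauto rather than -Dauto, so SimplePostTool never enters auto mode. A sketch of the same command with auto mode enabled, so the .pdf ending is detected and the file is routed to /update/extract (paths and collection name taken from the question; untested):
java -Dauto -Dc=profiles_Index -Drecursive -jar D:/Solr/solr-8.11.2/example/exampledocs/post.jar "D:\LCMS\portalDocs\RESUME\Naukri\Automobile-ElectricVehicle(EV)\Research&Development\0\Naukri_LokeshSampath[10y_0m].pdf"
With auto mode on, the output should report "Entering auto mode" and a content type matching the file instead of application/xml.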

Related

In Solr, how can I index a plain text file that contains special characters?
In the case above, I tried it in the Windows environment.
And in the Linux environment, I tried it with the example documents.
But that failed too.
Thanks MatsLindh.
I succeeded in indexing PDF and TXT files on Linux.
But it still fails on Windows.
My configuration for the ExtractingRequestHandler is the same in both environments.
This is my solrconfig.xml file:
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
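For reference, a hedged usage sketch: with this handler in place, a file can also be posted directly to /update/extract with curl, bypassing post.jar's content-type guessing entirely (the collection name test and the file path are taken from the command further down; untested):
curl "http://localhost:8983/solr/test/update/extract?literal.id=doc1&commit=true" -F "myfile=@./example/exampledocs/solr-word.pdf"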
And this is my command that failed on Windows:
E:\work\private\JAVA\solr8>java -Dc=test -Dparams="literal.id=doc1" -jar ./bin/post.jar "./example/exampledocs/solr-word.pdf"
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/test/update?literal.id=doc1 using content-type application/xml...
POSTing file solr-word.pdf to [base]
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/test/update?literal.id=doc1
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">0</int>
</lst>
<lst name="error">
<lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">java.io.CharConversionException</str>
</lst>
<str name="msg">Invalid UTF-8 middle byte 0xe5 (at char #10, byte #-1)</str>
<int name="code">400</int>
</lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/test/update?literal.id=doc1
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/test/update?literal.id=doc1...
Time spent: 0:00:00.064
Why does this not work on Windows?
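A hedged guess based on the log above: the PDF is posted to /update as application/xml (second line of the output), so Solr tries to parse the PDF bytes as XML and fails with the UTF-8 error. On Linux the bin/post wrapper script enables auto mode on its own, which may explain why the same files worked there. A sketch of the Windows command with auto mode enabled explicitly (untested):
java -Dauto -Dc=test -Dparams="literal.id=doc1" -jar ./bin/post.jar "./example/exampledocs/solr-word.pdf"
With -Dauto, SimplePostTool should detect the .pdf ending and send the file to /update/extract instead.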

When indexing documents in Solr 7, I am receiving a response that I don't understand

I am indexing a series of documents and occasionally receive the following error. I have searched around and am not able to understand what it means:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">0</int>
</lst>
<lst name="error">
<lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">com.ctc.wstx.exc.WstxEOFException</str>
</lst>
<str name="msg">Unexpected EOF in CDATA section
at [row,col {unknown-source}]: [2,8191]</str>
<int name="code">400</int>
</lst>
</response>
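Not an answer from the original thread, but a hedged observation: column 8191 is exactly an 8 KiB boundary, which usually points to the XML request body being cut off mid-stream (a dropped connection or an upstream buffer limit) rather than to a malformed source file. One way to rule out a bad source file before posting is to validate it locally (docs.xml is a hypothetical file name):
xmllint --noout docs.xml
If the file validates, the truncation is happening in transit, and the client timeout or proxy settings are the place to look.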

write.lock issue in Apache Solr using AnalyzingInfixLookupFactory

I am using AnalyzingInfixLookupFactory for the auto-suggest feature in our application. But when I try to use the auto-suggest feature and search for terms in the text box, after some time it throws a write.lock error.
Below is my configuration in the solrconfig.xml file for the suggester (suggest component and suggest request handler):
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">text</str>
<str name="weightField">price</str>
<str name="payloadField">prod_id</str>
<str name="contextField">ancestors</str>
<str name="suggestAnalyzerFieldType">text_general</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Any idea or solution for how I can circumvent this?
Thanks.
I had the same issue with AnalyzingInfixLookupFactory; switching to AnalyzingLookupFactory fixed it for me.
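For reference, a hedged sketch of that change: only the lookupImpl line in the suggester config above needs to change (untested):
<str name="lookupImpl">AnalyzingLookupFactory</str>
An alternative that is sometimes enough: AnalyzingInfixLookupFactory builds a sidecar Lucene index on disk, and the write.lock error typically appears when two suggesters or cores end up sharing that directory. Giving the suggester a dedicated indexPath may avoid the contention (the directory name below is an assumption):
<str name="indexPath">suggester_infix_index</str>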

Apache UIMA + Apache Solr Integration for Noun Phrase annotator

I am working on Apache UIMA + Apache Solr integration. First I integrated Apache UIMA with Eclipse. I implemented a noun phrase annotator in Eclipse and ran a few examples with it.
It worked fine and gave accurate results, finding the nouns in a sentence.
Now I am trying to integrate UIMA with Solr. I followed this link:
https://wiki.apache.org/solr/SolrUIMA
I exported the working JAR file of the Eclipse project into the Apache Solr lib directory and included the other necessary JAR files.
Here are my solrconfig.xml changes:
<lib dir="../../../contrib/uima/lib" />
<lib dir="../../../contrib/uima/lucene-libs" />
<lib dir="../../../dist/" regex="solr-uima-\d.*\.jar" />
<lib dir="C:\apache-uima\lib" />
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="uima" default="true">
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
</lst>
<str name="analysisEngine">/desc/NounPhraseAnnotator.xml</str>
<bool name="ignoreErrors">false</bool>
<str name="logField">id</str>
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>text</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">org.apache.uima.tutorial.NounPhraseAnnotation</str>
<lst name="mapping">
<str name="feature">nounText</str>
<str name="field">uimanounphrase</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml changes:
<field name="uimanounphrase" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
Then I indexed some documents and ran the Solr instance. But when I execute a query, no nouns appear in the uimanounphrase field; null values show up instead.
You have to generate the PEAR file first and install it. Once you do that, you can add an AE.xml descriptor to your Solr config to make it work.
Step 1: Generate a PEAR file from your annotator implementation. You can use Eclipse to do this if you have the UIMA plugin for Eclipse.
Step 2: Install the PEAR file. You can use the scripts provided in the apache-uima package (runPearInstaller.bat). You can also test whether your PEAR file is working by running cvd.bat.
Step 3: Create an annotator engine XML file (e.g. OpenNLP_AE.xml), which you can integrate with solrconfig.xml.
Reference: https://uima.apache.org/doc-uima-pears.html. This link has pointers on how to perform the steps above.
Hope this helps.
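For reference, a hedged sketch of Steps 2 and 3 on Windows, assuming UIMA is installed at C:\apache-uima as in the <lib> entry above (the .pear file name is an assumption; untested):
cd C:\apache-uima\bin
runPearInstaller.bat
REM select the exported NounPhraseAnnotator.pear in the installer UI and choose an install directory
cvd.bat
REM load the installed descriptor, run it on sample text, and confirm NounPhraseAnnotation is produced
After installing, point the analysisEngine entry in the uimaConfig above at the descriptor XML generated by the installer.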

Issue parsing PDF with Apache Nutch - extractor plugin

I am trying to index web pages AND pdf documents from a website. I am using Nutch 1.9.
I downloaded the nutch-custom-search plugin from https://github.com/BayanGroup/nutch-custom-search. The plugin is awesome and indeed lets me match selected divs to Solr fields.
The problem I am having is that my site also contains numerous PDF files. I can see that they are fetched but never parsed; there are no PDFs when I query Solr, just web pages. I am trying to use Tika to parse the PDFs (I hope I have the right idea).
If, on Cygwin, I run parsechecker (see below), it seems to parse OK:
$ bin/nutch parsechecker -dumptext -forceAs application/pdf http://www.immunisationscotland.org.uk/uploads/documents/18304-Tuberculosis.pdf
I am not too sure what to do next (see below for my config).
extractor.xml
<config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd" omitNonMatching="true">
<fields>
<field name="pageTitleChris" />
<field name="contentChris" />
</fields>
<documents>
<document url="^.*\.(?!pdf$)[^.]+$" engine="css">
<extract-to field="pageTitleChris">
<text>
<expr value="head > title" />
</text>
</extract-to>
<extract-to field="contentChris">
<text>
<expr value="#primary-content" />
</text>
</extract-to>
</document>
</documents>
</config>
Inside my parse-plugins.xml I added:
<mimeType name="application/pdf">
<plugin id="parse-tika" />
</mimeType>
nutch-site.xml
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|extractor|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.content.limit</name>
<value>65536666</value>
<description></description>
</property>
<property>
<name>extractor.file</name>
<value>extractor.xml</value>
</property>
Help would be much appreciated,
Thanks
Chris
I think the problem relates to omitNonMatching="true" in your extractor.xml file.
omitNonMatching="true" means "don't index pages that don't match any extract-to rules in extractor.xml". The default value is false.
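A hedged sketch of the fix that reading suggests: since the PDFs never match the CSS extract-to rules, letting non-matching documents through should allow the Tika-parsed PDFs to reach Solr. Only the attribute on the root element of extractor.xml changes (untested):
<config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd" omitNonMatching="false">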