Indexing PDF documents using Cloudera Search

Indexing PDF documents using Cloudera Search - indexing

I've been trying to index pdf documents using Cloudera Search aka Apache Solr. First I was able to index twitter tweets. Later I tried to index PDF files. I've created the corresponding collection using solrctl with default schema. The morphline file that I used is (I've masked the IP address of zkHost here)...
solrLocator : {
# Name of solr collection
#collection : collection1
collection : pdfs
# ZooKeeper ensemble
#zkHost : "127.0.0.1:2181/solr"
zkHost : "xxx.xxx.xxx.xxx:2181,xxx.xxx.xxx.xxx:2181/solr"
# The maximum number of documents to send to Solr per network batch (throughput knob)
# batchSize : 100
}
morphlines : [
{
id : morphlinepdfs
importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
commands : [
{ detectMimeType { includeDefaultMimeTypes : true } }
{
solrCell {
solrLocator : ${solrLocator}
captureAttr : true
lowernames : true
capture : [id, title, author, content, content_type, subject, description, keywords, category, resourcename, url, last_modified, links]
parsers : [ { parser : org.apache.tika.parser.pdf.PDFParser } ]
}
}
{ generateUUID { field : id } }
{ sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
{ loadSolr: { solrLocator : ${solrLocator} } }
]
}
]
The PDF metadata fields are present in schema.xml file such as...
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
But in the solr /select query output, I'm getting only content and content-type fields. How can I get all the metadata in solr frontend query? Do I need to modify the schema.xml or the corresponding morphline file? Also can I index the fields inside the PDF content?
The command I used to index pdf file is:
hadoop --config /etc/hadoop/conf.cloudera.yarn jar /usr/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.8.2-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search-1.0.0+cdh5.8.2+0/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search-1.0.0+cdh5.8.2+0/examples/solr-nrt/test-morphlines/solrPDF.conf --output-dir hdfs://xxxxxx:8020/user/root/outdir --verbose --go-live --zk-host xxxxx:2181/solr --collection pdfs hdfs://xxxxxx:8020/user/root/indir
Thanks in advance.

I've found the problem. In fact, the PDF file that I was using don't have any metadata. I've tried with other PDF files and got the results. Hope it helps others.

Related

Parse image as well as text from pdf using tika in solr(Request Handler)

I am trying to index pdf file using solr 6 and want to extract and save image(if have) to some location. I am using below configuration but not able to extract the image. I have successfully index the pdf text contents.
schema.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="Breast-Cancer_PDFSchema" version="1.6">
<uniqueKey>id</uniqueKey>
<field name="id" type="strings" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="date" type="tdates" indexed="true" stored="true"/>
<field name="pdf_pdfversion" type="strings" indexed="true" stored="true"/>
<field name="stream_content_type" type="strings" indexed="true" stored="true"/>
<field name="access_permission_modify_annotations" type="strings" indexed="true" stored="true"/>
<field name="access_permission_can_print_degraded" type="strings" indexed="true" stored="true"/>
<field name="dcterms_created" type="strings" indexed="true" stored="true"/>
<field name="last_modified" type="strings" indexed="true" stored="true"/>
<field name="dcterms_modified" type="strings" indexed="true" stored="true"/>
<field name="dc_format" type="strings" indexed="true" stored="true"/>
<field name="last_save_date" type="strings" indexed="true" stored="true"/>
<field name="access_permission_fill_in_form" type="strings" indexed="true" stored="true"/>
<field name="pdf_docinfo_modified" type="strings" indexed="true" stored="true"/>
<field name="stream_name" type="strings" indexed="true" stored="true"/>
<field name="meta_save_date" type="strings" indexed="true" stored="true"/>
<field name="pdf_encrypted" type="strings" indexed="true" stored="true"/>
<field name="modified" type="strings" indexed="true" stored="true"/>
<field name="content_type" type="strings" indexed="true" stored="true"/>
<field name="stream_size" type="strings" indexed="true" stored="true"/>
<field name="x_parsed_by" type="strings" indexed="true" stored="true"/>
<field name="meta_creation_date" type="strings" indexed="true" stored="true"/>
<field name="stream_source_info" type="strings" indexed="true" stored="true"/>
<field name="created" type="strings" indexed="true" stored="true"/>
<field name="access_permission_extract_for_accessibility" type="strings" indexed="true" stored="true"/>
<field name="access_permission_assemble_document" type="strings" indexed="true" stored="true"/>
<field name="xmptpg_npages" type="strings" indexed="true" stored="true"/>
<field name="creation_date" type="strings" indexed="true" stored="true"/>
<field name="access_permission_extract_content" type="strings" indexed="true" stored="true"/>
<field name="access_permission_can_print" type="strings" indexed="true" stored="true"/>
<field name="producer" type="strings" indexed="true" stored="true"/>
<field name="subject" type="strings" indexed="true" stored="true"/>
<field name="dc_creator" type="strings" indexed="true" stored="true"/>
<field name="aapl_keywords" type="strings" indexed="true" stored="true"/>
<field name="pdf_docinfo_producer" type="strings" indexed="true" stored="true"/>
<field name="resourcename" type="strings" indexed="true" stored="true"/>
<field name="access_permission_can_modify" type="strings" indexed="true" stored="true"/>
<field name="pdf_docinfo_created" type="strings" indexed="true" stored="true"/>
<field name="_text_" type="strings" indexed="true" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="strings" class="solr.TextField" sortMissingLast="true" multiValued="true" />
<fieldType name="long" class="solr.TrieLongField" positionIncrementGap="0" precisionStep="0"/>
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
<fieldType name="tdates" class="solr.TrieDateField" positionIncrementGap="0" multiValued="true" precisionStep="6"/>
<fieldType name="tlongs" class="solr.TrieLongField" positionIncrementGap="0" multiValued="true" precisionStep="8"/>
<fieldType name="tdoubles" class="solr.TrieDoubleField" positionIncrementGap="0" multiValued="true" precisionStep="8"/>
solr-config.xml
<requestHandler name="/update/extract" startup="lazy" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >
<entries>
<entry class="org.apache.tika.parser.pdf.AutoDetectParser"> </entry>
</entries>
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
<str name="fmap.id">id</str>
</lst>
</requestHandler>
I have followed the apache solr offical documentation but made changes to solr-config.xml according to them still have the same issue.

After reading your question and if I am not wrong, you are posting PDF file (which contains some text and images) to solr. The solr will only index that document it will not extract the images and store them to other location.
Solr internally uses Tika libraries for parsing the documents, but it cannot be used in your requirement
To achieve your requirement,
parse your pdf and extract all the images and other contents and store them
index all the extracted contents and pdf in solr.

How to reduce Qtime (query time) in Solr

I need to reduce qtime in Solr from 24ms to 10ms. I have tried with the following parameters:
responseHeader: {
status: 0,
QTime: 24,
params: {
json.wrf: "jQuery111205813984868582338_1454571708911",
sort: "id desc",
indent: "on",
hl.simple.pre: "<b>",
hl.fl: "title keywords model_name",
wt: "json",
hl: "true",
rows: "100",
hl.highlightMultiTerm: "true",
hl.snippets: "1",
start: "0",
q: "{!q.op=AND df=title}pl~1",
_: "1454571708913",
hl.simple.post: "</b>",
qt: "spellchecker",
hl.usePhraseHighlighter: "true"
}
}
Solr URL:
http://www.example.com:8983/solr/mycoll/select?q={!q.op=AND%20df=title}pl~1&indent=on&sort=id%20desc&qt=spellchecker&wt=json&hl=true&hl.fl=title+keywords+model_name&hl.simple.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&hl.usePhraseHighlighter=true&hl.highlightMultiTerm=true&hl.snippets=1&start=0&rows=100&json.wrf=jQuery111205813984868582338_1454571708911&_=1454571708913
schema.xml
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="subject" type="text_general" indexed="true" stored="true" />
<field name="description" type="text_general" indexed="true" stored="true" />
<field name="comments" type="text_general" indexed="true" stored="true" />
<field name="author" type="text_general" indexed="true" stored="true" />
<field name="keywords" type="text_general" indexed="true" stored="true" />
<field name="category" type="text_general" indexed="true" stored="true" />
<field name="resourcename" type="text_general" indexed="true" stored="true" />
<field name="url" type="text_general" indexed="true" stored="true" />
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true" />
<field name="last_modified" type="date" indexed="true" stored="true" />
<field name="links" type="string" indexed="true" stored="true" multiValued="true" />
Please see these reference links:
http://wiki.apache.org/solr/SolrPerformanceFactors
https://issues.apache.org/jira/browse/SOLR-2218

Solr 4 Lucene - Error loading class 'Solr.UUIDField'

Trying to create a UUID field in my schema.xml, I just get this error when starting Solr:
Plugin init failure for [schema.xml] fieldType "uuid": Error loading class 'Solr.UUIDField'
My schema looks like:
<fields>
<field name="uuid" type="uuid" indexed="true" stored="true" />
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">uuid</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="address" type="text_general" indexed="true" stored="true"/>
<field name="city" type="text_general" indexed="true" stored="true" />
<field name="county" type="string" indexed="true" stored="true" />
<field name="lat" type="text_general" indexed="true" stored="true" />
<field name="lng" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="price" type="float" indexed="true" stored="true"/>
<field name="bedrooms" type="float" indexed="true" stored="true" />
<field name="image" type="string" indexed="true" stored="true"/>
<field name="region" type="location_rpt" indexed="true" stored="true" />
<defaultSearchField>address</defaultSearchField>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
And then in
<fieldType name="uuid" class="Solr.UUIDField" indexed="true" />
From the docs
I'm confused as the the location on the <updateRequestProcessorChain/> section. I feel it shouldn't go in the field declaration part.

The field class is case sensitive probably, try will lower case solr solr.UUIDField :-
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />

SolrException: Document [null] missing required field: id

I have changed the schema.xml file and added few fields in that like this
<field name="url" type="string" indexed="true" stored="true" />
<field name="content_type" type="text" indexed="true" stored="true" />
<field name="title" type="text" indexed="true" stored="true" />
<field name="keywords" type="text" indexed="true" stored="true" multiValued="true" />
<field name="text" type="text" indexed="true" stored="true" />
<field name="timestamp" type="text" indexed="true" stored="true" />
<field name="public" type="text" indexed="true" stored="true" multiValued="true" />
<field name="groups" type="text" indexed="true" stored="true" multiValued="true" />
<field name="sitename" type="text" indexed="true" stored="true" />
<field name="context" type="text" indexed="true" stored="true" />
<field name="modified_date" type="text" indexed="true" stored="true" />
so corresponding to these fields I have created one xml file and added some dummy data into that like this.
<add><doc>
<field name="url">http://www.host.com/</field>
<field name="content_type">text/html</field>
<field name="title">Testing Data</field>
<field name="keywords">software</field>
<field name="keywords">software_cycle</field>
<field name="text">search</field>
<field name="timestamp">2006-02-13T15:26:37Z</field>
<field name="public">Optimized</field>
<field name="public">Optimized_data</field>
<field name="groups">Standards</field>
<field name="groups">Standards_data</field>
<field name="sitename">GoInfo</field>
<field name="context">Scalability</field>
<field name="modified_date">2010-11-13T15:26:37Z</field>
</doc></add>
And when I tried to reindex the data into solr like this:-
C:\apache-solr-3.2.0\example\exampledocs>java -Durl=http://localhost:7788/solr/u
pdate -jar post.jar *.xml
SimplePostTool: version 1.3
SimplePostTool: POSTing files to http://localhost:7788/solr/update..
SimplePostTool: POSTing file 30-example.xml
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file other.xml
SimplePostTool: FATAL: Solr returned an error #400 Bad Request
I always get an error after text.xml file and If I remove this text.xml file then I don't get any error.. This is the below error I am getting if I include the text.xml file. Any help will be appreciated.
SEVERE: org.apache.solr.common.SolrException: Document [null] missing required field: id
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:336)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:864)
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1665)
at java.lang.Thread.run(Thread.java:662)

You say you added a few fields (supposedly to the sample schema), but you don't mention what happened to the fields that were already there. I'm guessing you left the preexistent fields there, which means that id is still a required field (see here in the sample schema), therefore the error you see.

Make a
"Primary Key" id. It is really required.

Indexing PDF documents in Solr with no UniqueKey

I want to index PDF (and other rich) documents. I am using the DataImportHandler.
Here is how my schema.xml looks:
.........
.........
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
........
........
<uniqueKey>link</uniqueKey>
As you can see I have set link as the unique key so that when the indexing happens documents are not duplicated again. Now I have the file paths stored in a database and I have set the DataImportHandler to get a list of all the file paths and index each document. To test it I used the tutorial.pdf file that comes with example docs in Solr. The problem is of course this pdf document won't have a field 'link'. I am thinking of way how I can manually set the file path as link when indexing these documents. I tried the data-config settings as below,
<entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
<entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
<field column="title" name="title" meta="true"/>
<field column="Creation-Date" name="date_published" meta="true"/>
<entity name="filePath" dataSource="dbSource" query="SELECT path FROM file_paths as link where path = '${fileItems.path}'">
<field column="link" name="link"/>
</entity>
</entity>
</entity>
where I create a sub-entity which queries for the path name and makes it return the results in a column titled 'link'. But I still see this error:
WARNING: Error creating document : SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z}, title=title(1.0)={Solr tutorial}}]
org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: link
Is there anyway for me to create a field called link for the pdf documents?
This was already asked here before but the solution provided uses ExtractRequestHandler but I want to use it through the DataImportHandler.

Try this:
<entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
<field column="path" name="link"/>
<entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
<field column="title" name="title" meta="true"/>
<field column="Creation-Date" name="date_published" meta="true"/>
</entity>
</entity>

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Indexing PDF documents using Cloudera Search - indexing

I've found the problem. In fact, the PDF file that I was using don't have any metadata. I've tried with other PDF files and got the results. Hope it helps others.

Related

Parse image as well as text from pdf using tika in solr(Request Handler)

How to reduce Qtime (query time) in Solr

Solr 4 Lucene - Error loading class 'Solr.UUIDField'

SolrException: Document [null] missing required field: id

Indexing PDF documents in Solr with no UniqueKey

Categories

Resources