Indexing data from pdf

I am trying to index data from pdf now, and I am getting the following response from Solr:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Time Elapsed">0:0:1.236</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-05-11 15:45:01</str>
<str name="">Indexing failed. Rolled back all changes.</str>
<str name="Rolledback">2012-05-11 15:45:01</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
The log files show this:
org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:759)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
... 5 more
Caused by: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:388)
at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:1100)
at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:912)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:635)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.AutoDetectParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 13 more
The configuration file looks like:
data-config.xml:
<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
<dataSource type="BinFileDataSource" name="binary" />
<document>
<entity name="f" dataSource="binary" rootEntity="false" processor="FileListEntityProcessor" baseDir="C:\solr\solr\docu" fileName=".*pdf" recursive="true">
<entity name="tika" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
<field column="id" name="id" meta="true" />
<field column="fake_id" name="fake_id" />
<field column="model" name="model" meta="true" />
<field column="text" name="biog" />
</entity>
</entity>
</document>
</dataConfig>
schema.xml:
<fields>
<field name="id" type="string" indexed="true" stored="true" />
<field name="fake_id" type="string" indexed="true" stored="true" />
<field name="model" type="text_en" indexed="true" stored="true" />
<field name="firstname" type="text_en" indexed="true" stored="true"/>
<field name="lastname" type="text_en" indexed="true" stored="true"/>
<field name="title" type="text_en" indexed="true" stored="true"/>
<field name="biog" type="text_en" indexed="true" stored="true"/>
</fields>
<uniqueKey>fake_id</uniqueKey>
<defaultSearchField>biog</defaultSearchField>
Finally the “Tika” jars that I have are:
tika-core-1.0.jar and tika-parsers-1.0.jar
What is going wrong?
Thanks

The problem could be in your data-config.xml: you specified the binary file dataSource on the entity named "f" (the one with the FileListEntityProcessor) instead of on the entity with the TikaEntityProcessor.
I think you could try this configuration:
<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
<dataSource type="BinFileDataSource" name="binary" />
<document>
<entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="C:\solr\solr\docu" fileName=".*pdf" recursive="true">
<entity name="tika" dataSource="binary" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
<field column="id" name="id" meta="true" />
<field column="fake_id" name="fake_id" />
<field column="model" name="model" meta="true" />
<field column="text" name="biog" />
</entity>
</entity>
</document>
</dataConfig>
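Note also that the log's NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser usually means the Tika jars are not on Solr's classpath at all. If fixing the dataSource alone doesn't help, you could check that solrconfig.xml loads them with `<lib>` directives along these lines (the dir values below are assumptions; point them at wherever your tika-core-1.0.jar and tika-parsers-1.0.jar actually live):

```xml
<!-- Assumed paths: adjust dir to the folder that actually holds your Tika jars -->
<lib dir="../../contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
<!-- TikaEntityProcessor itself ships in the dataimporthandler-extras jar -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-extras-\d.*\.jar" />
```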

Related

DataImportHandler setup with PhpMyAdmin-MySQL

I am trying to index a MySQL database, administered through phpMyAdmin, into Solr.
SOLVED BY #MatsLindh
I have tried to find the necessary information, but none of the tutorials I have found deal with this setup.
MY DATABASE:
My MySQL db is locally hosted and accessed through phpMyAdmin. Here is the admin page.
As you can see, I have a db titled solrtest with a table solr that has the fields id, date, Problem, and Solution.
Now, to link my db, the tutorials online were a bit inconsistent. The most consistent parts told me I would need to use Solr's DataImportHandler and the mysql-connector-java. Another also mentioned a JDBC plugin. I have installed them and put the .jar files in my solr/dist directory.
In some tutorials they also have these in the contrib folder, but I have left them in /dist.
MY FILES:
I have created a core titled solrhelp and made the following changes in the solrhelp/conf files.
solrconfig.xml
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="solr-dataimporthandler-extras-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="mysql-connector-java-8.0.13.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="sqljdbc41.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="sqljdbc42.jar" />
<requestHandler name=" /dataimport" class=" org.apache.solr.handler.dataimport.DataImportHandler">
<lst name=" defaults">
<str name=" config">data-config.xml</str>
</lst>
</requestHandler>
<requestHandler name " /dataimport" class=org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="name">solrhelp</str>
<str name="driver">jdbc:mysql.jdbc.Driver</str>
<str name="url">jdbc:mysql://localhost:8983/solrtest</str>
<str name="user">root</str>
<str name="password"></str>
</lst>
</requestHandler>
the created data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:8983/solrtest"
user="root"
password=""/>
<document>
<entity name="solr"
pk="id"
query="select id, date, Problem, Solution from solr"
>
<field column="id" name="id"/>
<field column="date" name="date"/>
<field column="Problem" name="Problem"/>
<field column="Solution" name="Solution"/>
</entity>
</document>
</dataConfig>
and the managed-schema.xml
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="pdate" type="date" indexed="true" stored="true" multiValued="false" />
<field name="Problem" type="text_general" indexed="true" stored="true" />
<field name="Solution" type="text_general" indexed="true" stored="true" />
My question to the community is rather broad, and I apologize. I want to know what I am missing before I attempt to post this db. I don't think I have edited my files correctly, and I don't really know of a way to test them before I attempt to post.
It should be noted that in the dist folder I have two versions of the JDBC driver and have both referenced in my solrconfig.xml file.
Any direction to better tutorials or documentation would be appreciated.
UPDATED FILES
solrconfig
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-extras-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="mysql-connector-java-8.0.13.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc41.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc42.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
data-config
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:8983/solrtest/solr"
user="root"
password=""/>
<document>
<entity name="solr"
pk="id"
query="select * from solr"
>
<field column="id" name="id"/>
<field column="date" name="date"/>
<field column="Problem" name="Problem"/>
<field column="Solution" name="Solution"/>
</entity>
</document>
</dataConfig>
You don't import "through phpMyAdmin". You use the connection information for your MySQL server; there is no HTTP involved. jdbc:mysql://localhost:3306/dbname would be the string in your case, assuming that MySQL runs on the same computer as Solr.
Pay close attention to the port number (3306) and the dbname in the connection string: they should refer to the values of your MySQL server, not your Solr server.
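To make the connection-string point concrete, here is a tiny hypothetical helper (the function name is invented for illustration; this is not a Solr or MySQL API) that assembles a JDBC MySQL URL and flags the common mistake of reusing Solr's port:

```python
# Hypothetical helper to illustrate the JDBC URL layout; not part of any library.
def build_jdbc_mysql_url(host, dbname, port=3306):
    """Build a jdbc:mysql URL. Port 3306 is MySQL's default; 8983 is Solr's."""
    if port == 8983:
        raise ValueError("8983 is Solr's port, not MySQL's; use the MySQL port (default 3306)")
    return "jdbc:mysql://{}:{}/{}".format(host, port, dbname)

print(build_jdbc_mysql_url("localhost", "solrtest"))  # jdbc:mysql://localhost:3306/solrtest
```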
THIS PROBLEM WAS SOLVED BY #MatsLindh
My issue was configuration syntax. Below are the corrected data-config.xml and solrconfig.xml
data-config
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/solrtest"
user="root"
password=""/>
<document>
<entity name="solr"
pk="id"
query="select * from solr"
>
<field column="id" name="id"/>
<field column="date" name="date"/>
<field column="Problem" name="Problem"/>
<field column="Solution" name="Solution"/>
</entity>
</document>
</dataConfig>
solrconfig
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-extras-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="mysql-connector-java-8.0.13.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc41.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc42.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>

Solr - How to make several searches on a single query?

I'm trying to make a query that:
First, tries to match strings with just lowercase processing, and generates a score with boost x.
Second, tries the 'text_en' processing and generates a score with boost y.
Third, uses 'text_en' with synonyms, and generates a score with boost z.
Is this possible?
I made the following configuration, but I don't think it's doing what I described... It works, but I don't trust the results.
solrconfig.xml:
<requestHandler name="/querystems" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="qf">
lowercase^2 title_without_synonym description_without_synonym title_stems^1.5 description_stems^1
</str>
<str name="q.alt">*:*</str>
<str name="rows">100</str>
<str name="fl">id,title,description,reuseCount,score,title_stems,description_stems</str>
<str name="defType">edismax</str>
</lst>
</requestHandler>
schema.xml:
<field name="title" type="text_en_test" indexed="true" stored="true"/>
<field name="title_without_synonym" type="text_en_without_synonym" indexed="true" stored="true"/>
<field name="title_stems" type="text_en_stems" indexed="true" stored="true"/>
<copyField source="title" dest="title_without_synonym" maxChars="30000" />
<copyField source="title" dest="title_stems" maxChars="30000" />
<field name="description" type="text_en_test" indexed="true" stored="true"/>
<field name="description_without_synonym" type="text_en_without_synonym" indexed="true" stored="true"/>
<field name="description_stems" type="text_en_stems" indexed="true" stored="true"/>
<copyField source="description" dest="description_without_synonym" maxChars="30000" />
<copyField source="description" dest="description_stems" maxChars="30000" />
I don't even know why that 'lowercase' field works in the query configuration, as it's a 'fieldType', not a 'field'... Can someone help me understand and better configure this scenario?
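As background on how such a qf line scores: (e)dismax builds one subquery per qf entry and combines the per-field scores by taking the maximum, plus the `tie` parameter times the sum of the remaining field scores. A rough sketch of that combination in plain Python (the scores and boosts below are invented; this is not Solr code):

```python
# Rough sketch of how (e)dismax combines per-field scores for one query.
# All numbers are made up for illustration.
def dismax_combine(field_scores, boosts, tie=0.0):
    """Return max(boosted score) + tie * sum(other boosted scores)."""
    weighted = [field_scores[f] * boosts.get(f, 1.0) for f in field_scores]
    best = max(weighted)
    return best + tie * (sum(weighted) - best)

scores = {"lowercase": 1.0, "title_stems": 0.8, "description_stems": 0.3}
boosts = {"lowercase": 2.0, "title_stems": 1.5, "description_stems": 1.0}
print(dismax_combine(scores, boosts))           # tie=0: only the best field counts
print(dismax_combine(scores, boosts, tie=0.1))  # tie>0: other fields contribute a little
```

With tie=0 (the default) the result is purely the best-matching field, which is often what people actually want from a "try these analyses with different boosts" setup.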

index all files inside a folder in solr

I am having trouble indexing a folder in Solr.
example-data-config.xml:
<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="files"
dataSource="null"
rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Temp\" fileName=".*"
recursive="true"
onError="skip">
<field column="fileAbsolutePath" name="id" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<entity
name="documentImport"
processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}"
format="text">
<field column="file" name="fileName"/>
<field column="Author" name="author" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
then I create the schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="plong" indexed="true" stored="true" />
<field name="lastModified" type="pdate" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
finally I modify the file solrConfig.xml adding the requesthandler and the dataImportHandler and dataImportHandler-extra jars:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">example-data-config.xml</str>
</lst>
</requestHandler>
I run the import, but the files are not indexed.
Inside that folder there are around 20,000 files in different formats (.py, .java, .wsdl, etc.).
Any suggestion will be appreciated. Thanks :)
Check your Solr logs. The answer to what the root cause is will definitely be there. I faced the same situation once and found through the Solr logs that my DataImportHandler was throwing exceptions because of encrypted documents present in the folder. Your reasons may be different, but first analyze your Solr logs, execute your entity again in the DataImport section, and then check the immediate logs for errors in the Logging section of the admin page. If you are getting errors other than the ones I mentioned, post them here so they can be understood and deciphered.
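As a triage aid, a small script (a sketch, not part of Solr) can walk the baseDir and count files per extension, so you know up front which formats the import will hit and can correlate failures in the logs with specific file types:

```python
import os
from collections import Counter

def count_extensions(base_dir):
    """Walk base_dir recursively and count files per (lowercased) extension."""
    counts = Counter()
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lower() or "<none>"
            counts[ext] += 1
    return counts

if __name__ == "__main__":
    # Point this at the baseDir from your data-config.xml
    for ext, n in count_extensions(r"C:\Temp").most_common():
        print(ext, n)
```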

Solr 4 - missing required field: uuid

I'm having issues generating a UUID using the DataImportHandler in Solr 4. I'm trying to import from an existing MySQL database.
My schema.xml contains:
<fields>
<field name="uuid" type="uuid" indexed="true" stored="true" required="true" />
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="address" type="text_general" indexed="true" stored="true"/>
<field name="city" type="text_general" indexed="true" stored="true" />
<field name="county" type="string" indexed="true" stored="true" />
<field name="lat" type="text_general" indexed="true" stored="true" />
<field name="lng" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="price" type="float" indexed="true" stored="true"/>
<field name="bedrooms" type="float" indexed="true" stored="true" />
<field name="image" type="string" indexed="true" stored="true"/>
<field name="region" type="location_rpt" indexed="true" stored="true" />
<defaultSearchField>address</defaultSearchField>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>
<uniqueKey>uuid</uniqueKey>
and then in <types>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
My Solrconfig.xml contains:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">  
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">uuid</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
Whenever I run the update, some docs are inserted OK, but many fail with:
org.apache.solr.common.SolrException: [doc=204] missing required field: uuid
Going by the example at the link, it should be:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
.........
<lst name="defaults">
<str name="config">data-config.xml</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">uuid</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
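For reference, UUIDUpdateProcessorFactory fills the named field with a random (version 4) UUID only when the incoming document lacks one, which is why wiring the chain in via update.chain matters. The equivalent behavior in plain Python (a sketch, not Solr's implementation):

```python
import uuid

def assign_uuid(doc, field_name="uuid"):
    """Mimic UUIDUpdateProcessorFactory: add a random UUID only if the field is absent."""
    if field_name not in doc:
        doc[field_name] = str(uuid.uuid4())
    return doc

doc = assign_uuid({"id": "204"})
print(doc["uuid"])  # a random value such as '3f2504e0-...' (different each run)
```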

Indexing binary files from database issue (no errors)

I am trying to index binary files stored in a database (MySQL), with no success. I have Solr configured as below:
Solr file structure
+solr
+bookledger(core0)
-conf
+lib(all necessary libraries)
+contrib
+dist
+data
+bookledger
-index
-spellchecker
+ktimatologio
-index
-spellchecker
+ktimatologio(core1)
-conf
+lib(all necessary libraries)
+contrib
+dist
As you can see, the configuration concerns a multicore Solr setup. On bookledger (core0) I have successfully indexed binary files stored in a database. On the second core, when I run a full-import I see no errors! Then, when I try to query the binary content, the output looks like: [B@660b1b14. What am I missing here?
Thank you in advance,
Tom
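(A note on that output: a value of the form [B@<hex> is Java's default toString() for a byte array, which means the stored binary content is being returned raw rather than as extracted text. If the goal is to store the raw bytes in a solr.BinaryField, they are sent and retrieved as Base64-encoded strings; if the goal is searchable text, the content needs to go through Tika during import. A quick Base64 round-trip in plain Python, just to illustrate the encoding, not Solr code:)

```python
import base64

# solr.BinaryField content travels as Base64-encoded strings.
raw = b"%PDF-1.4 example binary content"

encoded = base64.b64encode(raw).decode("ascii")  # the form you would send/store
decoded = base64.b64decode(encoded)              # the form you get back

print(encoded)
assert decoded == raw  # the round-trip is lossless
```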
The solr.xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="false">
<cores adminPath="/admin/cores">
<core name="ktimatologio" instanceDir="ktimatologio" dataDir="../data/ktimatologio"/>
<core name="bookledger" instanceDir="bookledger" dataDir="../data/bookledger"/>
</cores>
</solr>
The solrconfig.xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
<luceneMatchVersion>LUCENE_36</luceneMatchVersion>
<lib dir="lib/dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="lib/dist/" regex="apache-solr-clustering-\d.*\.jar" />
<lib dir="lib/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
<lib dir="lib/dist/" regex="apache-solr-langid-\d.*\.jar" />
<lib dir="lib/dist/" regex="apache-solr-velocity-\d.*\.jar" />
<lib dir="lib/dist/" regex="apache-solr-dataimporthandler-extras-\d.*\.jar" />
<lib dir="lib/contrib/extraction/lib/" regex=".*\.jar" />
<lib dir="lib/contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="lib/contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="lib/contrib/langid/lib/" regex=".*\.jar" />
<lib dir="lib/contrib/velocity/lib/" regex=".*\.jar" />
<lib dir="lib/contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
<lib dir="lib/contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
<dataDir>${solr.data.dir:}</dataDir>
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
<indexConfig>
</indexConfig>
<jmx />
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
</updateHandler>
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
<filterCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
</arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">static firstSearcher warming in solrconfig.xml</str>
</lst>
</arr>
</listener>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
<requestDispatcher>
<requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048000" />
</requestDispatcher>
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">100</int>
</lst>
</requestHandler>
<requestHandler name="/browse" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- VelocityResponseWriter settings -->
<str name="wt">velocity</str>
<str name="v.template">browse</str>
<str name="v.layout">layout</str>
<str name="title">Solritas</str>
<str name="df">text</str>
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
<str name="mlt.qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="mlt.fl">text,features,name,sku,id,manu,cat</str>
<int name="mlt.count">3</int>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="facet">on</str>
<str name="facet.field">cat</str>
<str name="facet.field">manu_exact</str>
<str name="facet.query">ipod</str>
<str name="facet.query">GB</str>
<str name="facet.mincount">1</str>
<str name="facet.pivot">cat,inStock</str>
<str name="facet.range.other">after</str>
<str name="facet.range">price</str>
<int name="f.price.facet.range.start">0</int>
<int name="f.price.facet.range.end">600</int>
<int name="f.price.facet.range.gap">50</int>
<str name="facet.range">popularity</str>
<int name="f.popularity.facet.range.start">0</int>
<int name="f.popularity.facet.range.end">10</int>
<int name="f.popularity.facet.range.gap">3</int>
<str name="facet.range">manufacturedate_dt</str>
<str name="f.manufacturedate_dt.facet.range.start">NOW/YEAR-10YEARS</str>
<str name="f.manufacturedate_dt.facet.range.end">NOW</str>
<str name="f.manufacturedate_dt.facet.range.gap">+1YEAR</str>
<str name="f.manufacturedate_dt.facet.range.other">before</str>
<str name="f.manufacturedate_dt.facet.range.other">after</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">text features name</str>
<str name="f.name.hl.fragsize">0</str>
<str name="f.name.hl.alternateField">name</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
<requestHandler name="/update"
class="solr.XmlUpdateRequestHandler">
</requestHandler>
<requestHandler name="/update/javabin"
class="solr.BinaryUpdateRequestHandler" />
<requestHandler name="/update/csv"
class="solr.CSVRequestHandler"
startup="lazy" />
<requestHandler name="/update/json"
class="solr.JsonUpdateRequestHandler"
startup="lazy" />
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
<requestHandler name="/update/xslt"
startup="lazy"
class="solr.XsltUpdateRequestHandler"/>
<requestHandler name="/analysis/field"
startup="lazy"
class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/analysis/document"
class="solr.DocumentAnalysisRequestHandler"
startup="lazy" />
<!-- Admin Handlers
Admin Handlers - This will register all the standard admin
RequestHandlers.
-->
<requestHandler name="/admin/"
class="solr.admin.AdminHandlers" />
<!-- ping/healthcheck -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">solrpingquery</str>
</lst>
<lst name="defaults">
<str name="echoParams">all</str>
</lst>
</requestHandler>
<!-- Echo the request contents back to the client -->
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" >
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="echoHandler">true</str>
</lst>
</requestHandler>
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="spellcheckIndexDir">spellchecker</str>
</lst>
</searchComponent>
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="df">text</str>
<str name="spellcheck.onlyMorePopular">false</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="df">text</str>
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
<searchComponent name="clustering"
enable="${solr.clustering.enabled:false}"
class="solr.clustering.ClusteringComponent" >
<!-- Declare an engine -->
<lst name="engine">
<!-- The name, only one can be named "default" -->
<str name="name">default</str>
<str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
<str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
<str name="carrot.lexicalResourcesDir">clustering/carrot2</str>
<str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
</lst>
<lst name="engine">
<str name="name">stc</str>
<str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
</lst>
</searchComponent>
<requestHandler name="/clustering"
startup="lazy"
enable="${solr.clustering.enabled:false}"
class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<str name="clustering.engine">default</str>
<bool name="clustering.results">true</bool>
<!-- The title field -->
<str name="carrot.title">name</str>
<str name="carrot.url">id</str>
<!-- The field to cluster on -->
<str name="carrot.snippet">features</str>
<!-- produce summaries -->
<bool name="carrot.produceSummary">true</bool>
<!-- the maximum number of labels per cluster -->
<!--<int name="carrot.numDescriptions">5</int>-->
<!-- produce sub clusters -->
<bool name="carrot.outputSubClusters">false</bool>
<str name="df">text</str>
<str name="defType">edismax</str>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
</lst>
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
<searchComponent name="terms" class="solr.TermsComponent"/>
<!-- A request handler for demonstrating the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<!-- pick a fieldType to analyze queries -->
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
<!-- A request handler for demonstrating the elevator component -->
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="df">text</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>
<!-- Highlighting Component
http://wiki.apache.org/solr/HighlightingParameters
-->
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<!-- Configure the standard fragmenter -->
<!-- This could most likely be commented out in the "default" case -->
<fragmenter name="gap"
default="true"
class="solr.highlight.GapFragmenter">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<!-- A regular-expression-based fragmenter
(for sentence extraction)
-->
<fragmenter name="regex"
class="solr.highlight.RegexFragmenter">
<lst name="defaults">
<!-- slightly smaller fragsizes work better because of slop -->
<int name="hl.fragsize">70</int>
<!-- allow 50% slop on fragment sizes -->
<float name="hl.regex.slop">0.5</float>
<!-- a basic sentence pattern -->
<str name="hl.regex.pattern">[-\w ,/\n\"&apos;]{20,200}</str>
</lst>
</fragmenter>
<!-- Configure the standard formatter -->
<formatter name="html"
default="true"
class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
<!-- Configure the standard encoder -->
<encoder name="html"
class="solr.highlight.HtmlEncoder" />
<!-- Configure the standard fragListBuilder -->
<fragListBuilder name="simple"
default="true"
class="solr.highlight.SimpleFragListBuilder"/>
<!-- Configure the single fragListBuilder -->
<fragListBuilder name="single"
class="solr.highlight.SingleFragListBuilder"/>
<!-- default tag FragmentsBuilder -->
<fragmentsBuilder name="default"
default="true"
class="solr.highlight.ScoreOrderFragmentsBuilder">
<!--
<lst name="defaults">
<str name="hl.multiValuedSeparatorChar">/</str>
</lst>
-->
</fragmentsBuilder>
<!-- multi-colored tag FragmentsBuilder -->
<fragmentsBuilder name="colored"
class="solr.highlight.ScoreOrderFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[
<b style="background:yellow">,<b style="background:lawgreen">,
<b style="background:aquamarine">,<b style="background:magenta">,
<b style="background:palegreen">,<b style="background:coral">,
<b style="background:wheat">,<b style="background:khaki">,
<b style="background:lime">,<b style="background:deepskyblue">]]></str>
<str name="hl.tag.post"><![CDATA[</b>]]></str>
</lst>
</fragmentsBuilder>
<boundaryScanner name="default"
default="true"
class="solr.highlight.SimpleBoundaryScanner">
<lst name="defaults">
<str name="hl.bs.maxScan">10</str>
<str name="hl.bs.chars">.,!?
</str>
</lst>
</boundaryScanner>
<boundaryScanner name="breakIterator"
class="solr.highlight.BreakIteratorBoundaryScanner">
<lst name="defaults">
<str name="hl.bs.type">WORD</str>
<str name="hl.bs.language">en</str>
<str name="hl.bs.country">US</str>
</lst>
</boundaryScanner>
</highlighting>
</searchComponent>
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
<str name="content-type">text/plain; charset=UTF-8</str>
</queryResponseWriter>
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy"/>
<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
<admin>
<defaultQuery>*:*</defaultQuery>
</admin>
</config>
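As an aside, the java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser in the log means the Tika and DataImportHandler-extras jars are not on Solr's classpath. A minimal sketch of the <lib> directives to add near the top of solrconfig.xml (the dir paths assume the stock Solr example layout and may need adjusting for your installation):

```xml
<!-- Sketch: load the DIH jars (including the "extras" jar that pulls in
     TikaEntityProcessor) and the Tika/extraction jars, so that
     org.apache.tika.parser.AutoDetectParser is on the classpath.
     Paths are assumptions based on the stock distribution layout. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
```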
The schema.xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="ktimatologio" version="1.5">
<types>
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<!-- Binary data type. The data should be sent/retrieved as Base64-encoded strings -->
<fieldtype name="binary" class="solr.BinaryField"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<!-- A Trie based date field for faster date range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="pint" class="solr.IntField"/>
<fieldType name="plong" class="solr.LongField"/>
<fieldType name="pfloat" class="solr.FloatField"/>
<fieldType name="pdouble" class="solr.DoubleField"/>
<fieldType name="pdate" class="solr.DateField" sortMissingLast="true"/>
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="random" class="solr.RandomSortField" indexed="true" />
<!-- Greek -->
<fieldType name="text_el" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- greek specific lowercase for sigma -->
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="false" words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
<filter class="solr.GreekStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ktimatologio" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<fieldType name="point" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldtype name="geohash" class="solr.GeoHashField"/>
<fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml" />
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="solr_id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="title" type="text_ktimatologio" indexed="true" stored="true"/>
<field name="model" type="text_ktimatologio" indexed="true" stored="true" multiValued="false"/>
<field name="type" type="text_ktimatologio" indexed="true" stored="true"/>
<field name="url" type="text_ktimatologio" indexed="true" stored="true"/>
<field name="content" type="text_ktimatologio" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>solr_id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="title" dest="content" />
</schema>
The data-config.xml file:
<dataConfig>
<dataSource name="db" type="JdbcDataSource"
autoCommit="true" batchSize="-1"
convertType="false"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://127.0.0.1:3306/ktimatologio"
user="root"
password="1a2b3c4d"/>
<dataSource name="fieldReader" type="FieldStreamDataSource" />
<document>
<entity name="aitiologikes_ektheseis"
dataSource="db"
transformer="HTMLStripTransformer"
query="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, body AS content from aitiologikes_ektheseis where type = 'text'"
deltaImportQuery="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, body AS content from aitiologikes_ektheseis where type = 'text' and id='${dataimporter.delta.id}'"
deltaQuery="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, body AS content from aitiologikes_ektheseis where type = 'text' and last_modified > '${dataimporter.last_index_time}'">
<field column="id" name="id" />
<field column="solr_id" name="solr_id" />
<field column="title" name="title" stripHTML="true" />
<field column="model" name="model" stripHTML="true" />
<field column="type" name="type" stripHTML="true" />
<field column="url" name="url" stripHTML="true" />
<field column="last_modified" name="last_modified" stripHTML="true" />
<field column="content" name="content" stripHTML="true" />
</entity>
<entity name="aitiologikes_ektheseis_bin"
query="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, bin_con AS content from aitiologikes_ektheseis where type = 'bin'"
deltaImportQuery="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, bin_con AS content from aitiologikes_ektheseis where type = 'bin' and id='${dataimporter.delta.id}'"
deltaQuery="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, bin_con AS content from aitiologikes_ektheseis where type = 'bin' and last_modified > '${dataimporter.last_index_time}'"
transformer="TemplateTransformer"
dataSource="db">
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="aitiologikes_ektheseis_bin.content" format="text">
<field column="id" name="id" />
<field column="solr_id" name="solr_id" />
<field column="title" name="title" stripHTML="true" />
<field column="model" name="model" stripHTML="true" />
<field column="type" name="type" stripHTML="true" />
<field column="url" name="url" stripHTML="true" />
<field column="last_modified" name="last_modified" stripHTML="true" />
<field column="content" name="content" stripHTML="true" />
</entity>
</entity>
</document>
</dataConfig>
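For completeness, the data-config.xml above is wired to Solr through a DataImportHandler registration in solrconfig.xml; the initArgs block in the status response at the top reflects exactly this. A minimal sketch:

```xml
<!-- Sketch: register the /dataimport endpoint and point it at the
     data-config.xml shown above. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```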
I finally found the solution. Notice the entity queries and the column definition in the data-config.xml above: the binary entity selects bin_con AS content and points the TikaEntityProcessor at dataField="aitiologikes_ektheseis_bin.content".
For Tika to "see" the content and extract it, I had to change "content" to "text".
One more thing. The correct syntax is:
<entity name="aitiologikes_ektheseis_bin"
query="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, bin_con AS text from aitiologikes_ektheseis where type = 'bin'"
deltaImportQuery="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, bin_con AS text from aitiologikes_ektheseis where type = 'bin' and id='${dataimporter.delta.id}'"
deltaQuery="select id, title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, bin_con AS text from aitiologikes_ektheseis where type = 'bin' and last_modified > '${dataimporter.last_index_time}'"
transformer="TemplateTransformer"
dataSource="db">
<field column="id" name="id" />
<field column="solr_id" name="solr_id" />
<field column="title" name="title" />
<field column="model" name="model" />
<field column="type" name="type" />
<field column="url" name="url" />
<field column="last_modified" name="last_modified" />
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="aitiologikes_ektheseis_bin.text" format="text">
<field column="text" name="content" />
</entity>
</entity>
</document>
</dataConfig>
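As a side note, whether an import succeeded has to be read out of the DIH status response (like the one at the top of this question), since the HTTP status is 0/OK even on rollback. A small sketch for checking that programmatically, using an abridged copy of the failed response above:

```python
import xml.etree.ElementTree as ET

# Abridged DIH status response, copied from the failed import above.
STATUS_XML = """<response>
  <lst name="statusMessages">
    <str name="Total Rows Fetched">1</str>
    <str name="Total Documents Processed">0</str>
    <str name="">Indexing failed. Rolled back all changes.</str>
    <str name="Rolledback">2012-05-11 15:45:01</str>
  </lst>
</response>"""

def import_failed(xml_text):
    """Return True if the DIH status response reports a rollback."""
    root = ET.fromstring(xml_text)
    # A failed import carries a <str name="Rolledback"> entry in
    # the statusMessages list.
    for msg in root.iter("str"):
        if msg.get("name") == "Rolledback":
            return True
    return False

print(import_failed(STATUS_XML))  # True for the failed import above
```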
I hope this helps someone.
Be well,
Tom