DataImportHandler setup with PhpMyAdmin-MySQL - apache

I am trying to index a mysql database on phpmyadmin into solr.
SOLVED BY #MatsLindh
I have tried to find information necessary but no tutorials I have found deal with this setup.
MY DATABASE:
My mysql db is locally hosted and accessed through phpmyadmin. Here is the admin page.
As you can see I have a db titled solrtest with table solr having fields id, date, Problem, and Solution.
Now to link my db, the tutorials online were a bit inconsistent. The most consistent parts told me I would need to use solrs DataImportHandler and the mysql-connector-java. Another also mentioned a jdbc plug in. I have installed and put the .jar files here in my solr/dist directory.
In some tutorials they have these also in the contrib folder but I have left in /dist.
MY FILES:
I have created a core titled solrhelp and made the following changes in the solhelp/conf files.
solrconfig.xml
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="solr-dataimporthandler-extras-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="mysql-connector-java-8.0.13.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="sqljdbc41.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\" regex="sqljdbc42.jar" />
<requestHandler name=" /dataimport" class=" org.apache.solr.handler.dataimport.DataImportHandler">
<lst name=" defaults">
<str name=" config">data-config.xml</str>
</lst>
</requestHandler>
<requestHandler name " /dataimport" class=org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="name">solrhelp</str>
<str name="driver">jdbc:mysql.jdbc.Driver</str>
<str name="url">jdbc:mysql://localhost:8983/solrtest</str>
<str name="user">root</str>
<str name="password"></str>
</lst>
</requestHandler>
the created data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:8983/solrtest"
user="root"
password=""/>
<document>
<entity name="solr"
pk="id"
query="select id, date, Problem, Solution from solr"
>
<field column="id" name="id"/>
<field column="date" name="date"/>
<field column="Problem" name="Problem"/>
<field column="Solution" name="Solution"/>
</entity>
</document>
</dataConfig>
and the managed-schema.xml
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="pdate" type="date" indexed="true" stored="true" multiValued="false" />
<field name="Problem" type="text_general" indexed="true" stored="true" />
<field name="Solution" type="text_general" indexed="true" stored="true" />
My question to the community is rather broad and I apologize. I want to know what all I am missing before I attempt to post this db. I dont think I have edited my files correctly and I dont really know of a way to test them before I attempt to post.
It should be noted that in the dist folder I have two verions of the jdbc and have both in my solrconfig.xml file.
Any direction to better tutorials or documentation would be appreciated.
UPDATED FILES
solrconfig
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-extras-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="mysql-connector-java-8.0.13.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc41.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc42.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
data-config
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:8983/solrtest/solr"
user="root"
password=""/>
<document>
<entity name="solr"
pk="id"
query="select * from solr"
>
<field column="id" name="id"/>
<field column="date" name="date"/>
<field column="Problem" name="Problem"/>
<field column="Solution" name="Solution"/>
</entity>
</document>
</dataConfig>

You don't import "through phpmyadmin". You use the connection information to your MySQL server. There is no http involved. jdbc:mysql://localhost:3306/dbname would be the string in your case, assuming that MySQL runs on the same computer as you're running Solr on.
Pay close attention to the port number (3306) in the connection string and the dbname. It should refer to the values of your MySQL server and not your Solr server.

THIS PROBLEM WAS SOLVED BY #MatsLindh
My issue was configuration syntax. Below are the corrected data-config.xml and solrconfig.xml
data-config
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/solrtest"
user="root"
password=""/>
<document>
<entity name="solr"
pk="id"
query="select * from solr"
>
<field column="id" name="id"/>
<field column="date" name="date"/>
<field column="Problem" name="Problem"/>
<field column="Solution" name="Solution"/>
</entity>
</document>
</dataConfig>
solrconfig
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\dist\"
regex="solr-dataimporthandler-extras-7.5.0.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="mysql-connector-java-8.0.13.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc41.jar" />
<lib dir="C:\Program Files\Solr\solr-7.5.0\contrib\dataimporthandler\lib"
regex="sqljdbc42.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>

Related

solr 6.2.1 uniqueField not work

i install solr 6.2.1 and in schema define a uniqueField:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="ps_product" version="1.5">
<fieldType name="int" class="solr.TrieIntField" positionIncrementGap="0" precisionStep="0"/>
<fieldType name="long" class="solr.TrieLongField" positionIncrementGap="0" precisionStep="0"/>
<fieldType name="string" class="solr.TextField" omitNorms="true" sortMissingLast="true"/>
<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="_version_" type="long" multiValued="false" indexed="true" stored="true"/>
<field name="id_product" type="uuid" default="NEW" indexed="true" stored="true"/>
<uniqueKey>id_product</uniqueKey>
<field name="name" type="string" indexed="true" stored="true"/>
<field name="title" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
</schema>
and my data-config like bellow:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/pressdb-local" user="sa" password="" />
<document>
<entity name="item" query="select * from ps_product as p inner join ps_product_lang as pl on pl.id_product=p.id_product where pl.id_lang=2"
deltaQuery="select id from ps_product where date_upd > '${dataimporter.last_index_time}'">
<field name="name" column="name"/>
<field name="id_product" column="id_product"/>
<entity name="comment"
query="select title from ps_product_comment where id_product='${item.id_product}'"
deltaQuery="select id_product_comment from ps_product_comment where date_add > '${dataimporter.last_index_time}'"
parentDeltaQuery="select id_product from ps_product where id_product=${comment.id_product}">
<field name="title" column="title" />
</entity>
</entity>
</document>
</dataConfig>
but when i want to define a core in solr, give me error:
Error CREATEing SolrCore 'product': Unable to create core [product] Caused by: QueryElevationComponent requires the schema to have a uniqueKeyField.
please help me to solve this problem.
Since Solr 4 and to support SolrCloud the uniqueKey field can no longer be populated using default=... you should remove it from the feld definition in schema.xml :
<field name="id_product" type="uuid" indexed="true" stored="true"/>
Update: As pointed out by MatsLindh, it seems you are using Solr in schemaless mode. Schema updates in this mode must be done via the Schema API, you should not edit the managed schema (<!-- Solr managed schema - automatically generated - DO NOT EDIT -->). To define id_product and uniqueKey field, use the API or revert to the classic schema mode.
To generate a uniqueKey to any document being added that does not already have a value in the specified field you can use UUIDUpdateProcessorFactory (cf. Update Request Processor). You will need to define an update processor chain in solrconfig.xml :
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id_product</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Then specify the use of the processor chain via the request param update.chain in your request handler definition.

index all files inside a folder in solr

I am having troubles indexing a folder in solr
example-data-config.xml:
<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="files"
dataSource="null"
rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Temp\" fileName=".*"
recursive="true"
onError="skip">
<field column="fileAbsolutePath" name="id" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<entity
name="documentImport"
processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}"
format="text">
<field column="file" name="fileName"/>
<field column="Author" name="author" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
then I create the schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="plong" indexed="true" stored="true" />
<field name="lastModified" type="pdate" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
finally I modify the file solrConfig.xml adding the requesthandler and the dataImportHandler and dataImportHandler-extra jars:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">example-data-config.xml</str>
</lst>
</requestHandler>
I run it and the result is:
Inside that folder there are like 20.000 files in diferent formats (.py,.java,.wsdl, etc)
Any suggestion will be appreciated. Thanks :)
Check your Solr logs . Answer for what is the Root Cause will definitely be there . I also faced same situation once and found through solr logs that my DataImportHandler was throwing exceptions because of encrypted documents present in the folder . Your reasons may be different, but first analyze your solr logs, execute your entity again in DataImport section, and then check the immediate logs for errors by going on the logging section on admin page . If you are getting errors other than I what I mentioned , post them here , so they can be understood and deciphered .

DeltaImport not happening by default

I'm having issues with deltaquery where it's doesn't work automatically. Below is the data-config I have
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://WTL-sql-1.com;databaseName=eng_metrics"
user="metrics"
password="metrics"/>
<document name="content">
<entity name="id"
query="select defect_id,headline,description,modify_date,issue_type,category,product,state FROM defects WHERE state not like 'Duplicate'"
deltaImportQuery="select defect_id,headline,description,modify_date,issue_type,category,product,state FROM defects WHERE defect_id = '${dataimporter.delta.defect_id}' and state not like 'Duplicate'"
deltaQuery="select defect_id FROM defects WHERE modify_date > '${dataimporter.last_index_time}'">
<field column="defect_id" name="defect_id" />
<field column="headline" name="headline" />
<field column="description" name="description" />
<field column="modify_date" name="modify_date" />
<field column="issue_type" name="issue_type" />
<field column="category" name="category" />
<field column="product" name="product" />
<field column="state" name="state" />
</entity>
</document>
</dataConfig>
But what I see that no matter the modify_date changes in the DB, I don't see any update happening unless I try doing a delta import explicitly.
Can someone provide me some thoughts on whether I need to change some config or some query to make that happen automatically?
Actually, DataImportHandler will not do it automatically. You have to trigger it by call delta import 's url.
You may want something like this:
http://wiki.apache.org/solr/DataImportHandler#Scheduling
or you can implement similar one by youself.
But I've this data-config which works fine in some cases
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://127.0.0.1\SQLEXPRESS;databaseName=sustaining_trends"
user="sa"
password="metrics"/>
<document name="content">
<entity name="id"
query="select id,createtime,lastmodified,modifiedby,title,keywords,general,symptom,diagnosis,resolution FROM trends"
deltaImportQuery="select id,createtime,lastmodified,modifiedby,title,keywords,general,symptom,diagnosis,resolution FROM trends WHERE id = ${dataimporter.delta.id}"
deltaQuery="select id FROM trends WHERE lastmodified > '${dataimporter.last_index_time}' or createtime > '${dataimporter.last_index_time}'">
<field column="id" name="trendid" />
<field column="lastmodified" name="lastmodified" />
<field column="modifiedby" name="modifiedby" />
<field column="title" name="title" />
<field column="keywords" name="keywords" />
<field column="general" name="general" />
<field column="symptom" name="symptom" />
<field column="diagnosis" name="diagnosis" />
<field column="resolution" name="resolution" />
</entity>
</document>
</dataConfig>
Here if the item is modified immediately that gets updated without any interference but if a new data is created that doesn't get updated until either I do a manual delta import or else some entry gets modified.
How does this work automatically incase of modification and not work automatically for creation?

Solr 4 - missing required field: uuid

I'm having issues generating a UUID using the dataImportHandler in Solr4. Im trying to import from an existing MySQL database.
My schema.xml contains:
<fields>
<field name="uuid" type="uuid" indexed="true" stored="true" required="true" />
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="address" type="text_general" indexed="true" stored="true"/>
<field name="city" type="text_general" indexed="true" stored="true" />
<field name="county" type="string" indexed="true" stored="true" />
<field name="lat" type="text_general" indexed="true" stored="true" />
<field name="lng" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="price" type="float" indexed="true" stored="true"/>
<field name="bedrooms" type="float" indexed="true" stored="true" />
<field name="image" type="string" indexed="true" stored="true"/>
<field name="region" type="location_rpt" indexed="true" stored="true" />
<defaultSearchField>address</defaultSearchField>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>
<uniqueKey>uuid</uniqueKey>
and then in <types>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
My Solrconfig.xml contains:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">  
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">uuid</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
Whenever I run the update, some docs are inserted ok , buy many return with:
org.apache.solr.common.SolrException: [doc=204] missing required field: uuid
Going by the example at link it should be
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
.........
<lst name="defaults">
<str name="config">data-config.xml</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">uuid</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Indexing data from pdf

I am trying to index data from pdf now, and I am getting the following response from Solr:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Time Elapsed">0:0:1.236</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-05-11 15:45:01</str>
<str name="">Indexing failed. Rolled back all changes.</str>
<str name="Rolledback">2012-05-11 15:45:01</str></lst><str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
The log files showing this:
org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:759)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
... 5 more
Caused by: java.lang.NoClassDefFoundError: org/apache/tika/parser/AutoDetectParser
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:388)
at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:1100)
at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:912)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:635)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.AutoDetectParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 13 more
The configuration file looks like:
data-config.xml:
<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
<dataSource type="BinFileDataSource" name="binary" />
<document>
<entity name="f" dataSource="binary" rootEntity="false" processor="FileListEntityProcessor" baseDir="C:\solr\solr\docu" fileName=".*pdf" recursive="true">
<entity name="tika" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
<field column="id" name="id" meta="true" />
<field column="fake_id" name="fake_id" />
<field column="model" name="model" meta="true" />
<field column="text" name="biog" />
</entity>
</entity>
</document>
</dataConfig>
schema.xml:
<fields>
<field name="id" type="string" indexed="true" stored="true" />
<field name="fake_id" type="string" indexed="true" stored="true" />
<field name="model" type="text_en" indexed="true" stored="true" />
<field name="firstname" type="text_en" indexed="true" stored="true"/>
<field name="lastname" type="text_en" indexed="true" stored="true"/>
<field name="title" type="text_en" indexed="true" stored="true"/>
<field name="biog" type="text_en" indexed="true" stored="true"/>
</fields>
<uniqueKey>fake_id</uniqueKey>
<defaultSearchField>biog</defaultSearchField>
Finally the “Tika” jars that I have are:
tika-core-1.0.jar and tika-parsers-1.0.jar
What is going wrong?
Thanks
The problem could be in your data-config.xml, you specified the bin file dataSource on entity named "f" (with FileListEntityProcessor processor) insttead of entity with TikaEntityProcessor processor.
I think you could try with this code:
<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
<dataSource type="BinFileDataSource" name="binary" />
<document>
<entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="C:\solr\solr\docu" fileName=".*pdf" recursive="true">
<entity name="tika" dataSource="binary" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
<field column="id" name="id" meta="true" />
<field column="fake_id" name="fake_id" />
<field column="model" name="model" meta="true" />
<field column="text" name="biog" />
</entity>
</entity>
</document>
</dataConfig>