SOLR LineEntityProcessor - fetched x records, but processed/indexed zero records - dataimporthandler

I am trying to fetch all the hyperlinks from an HTML page and add them as documents to Solr.
Here is my DIH config XML:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="FileDataSource" name="fds" />
  <dataSource type="FieldReaderDataSource" name="frds" />
  <document>
    <entity name="lines" processor="LineEntityProcessor"
            acceptLineRegex="&lt;a\s+(?:[^>]*?\s+)?href=([&quot;'])(.*?)\1"
            url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
            dataSource="fds" transformer="RegexTransformer">
      <field column="line" />
    </entity>
  </document>
</dataConfig>
managed-schema file contents:
<schema name="example-data-driven-schema" version="1.6">
  <uniqueKey>id</uniqueKey>
  <!-- ... -->
  <field name="id" type="string" indexed="true" required="true" stored="true"/>
  <field name="line" type="text_general" indexed="true" stored="true"/>
</schema>
When I run full-import, the status says
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents. (Duration: 01s)
Requests: 0 , Fetched: 4 4/s, Skipped: 0 , Processed: 0
Am I missing something? Please help me out here.
Thanks,
Naveen

The id field is defined as required="true" and is also the uniqueKey, but your entity never populates it, so the documents most likely fail validation and are skipped. That could be the issue. Can you switch it off and try again?
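Alternatively, if you want to keep the uniqueKey, you could generate an id inside the entity. A minimal sketch (untested; it assumes LineEntityProcessor exposes each line in a rawLine column and uses TemplateTransformer to copy it into id):
<entity name="lines" processor="LineEntityProcessor"
        acceptLineRegex="..."
        url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
        dataSource="fds"
        transformer="RegexTransformer,TemplateTransformer">
  <!-- assumption: the raw line doubles as a (crude) unique id -->
  <field column="id" template="${lines.rawLine}" />
  <field column="line" template="${lines.rawLine}" />
</entity>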

Related

DateFormatTransformer not working with FileListEntityProcessor in Data Import Handler

While indexing data from a local folder on my system, I am using the configuration given below. However, the lastmodified attribute is getting indexed in the format "Wed 23 May 09:48:08 UTC", which is not the standard format Solr uses for filter queries.
So, I am trying to format the lastmodified attribute in data-config.xml as given below.
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="D://FileBank"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
            recursive="true" transformer="DateFormatTransformer">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" dateTimeFormat="YYYY-MM-DDTHH:MM:SS.000Z" locale="en"/>
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <!--<field column="text" />-->
      </entity>
    </entity>
  </document>
</dataConfig>
But the transformer has no effect, and the same format is indexed again. Has anyone had success with this? Is the above configuration right, or am I missing something?
The dateTimeFormat you provided does not seem to be correct. Upper- and lower-case pattern letters have different meanings (the pattern follows Java's SimpleDateFormat). Also, the format you showed does not match the date text you are trying to parse, so the value probably stays unparsed.
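For illustration, a pattern matching the text shown in the question would look roughly like this (a sketch, untested; note that "Wed 23 May 09:48:08 UTC" contains no year, so check what the source value really looks like before relying on it):
<field column="fileLastModified" name="lastmodified"
       dateTimeFormat="EEE dd MMM HH:mm:ss z" locale="en"/>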
Also, if you have several different date formats, you could parse them after DIH runs by creating a custom UpdateRequestProcessor chain. The schemaless example handles several date formats as part of its auto-mapping, but you can do the same thing for a specific field explicitly.
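As a sketch of that approach (the chain name and the format list are assumptions, modeled on the schemaless configset), you would declare something like this in solrconfig.xml and send your updates through it:
<updateRequestProcessorChain name="parse-dates">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="defaultTimeZone">UTC</str>
    <arr name="format">
      <str>EEE dd MMM HH:mm:ss z</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>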

Solr: Indexing nested Documents via DIH

I want to index my documents from MySQL to Solr via DIH. I have a table structure like this:
Table User
id | name
1  | Jay
2  | Chakra
3  | Rabbit

Table Address
id | number     | email
1  | 1111111111 | test@email.com
2  | 2222222222 | test123@test.co
3  | 3333333333 | unique@email.com
and other associations.
I want to index this in a nested document structure but am unable to find any resource showing how it can be done using DIH.
Resources referred to:
http://yonik.com/solr-nested-objects/
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
Please suggest a way to index it through DIH.
This feature was implemented in SOLR-5147 and should be available in Solr 5.1+.
Here is a sample configuration taken from the original Jira ticket.
<dataConfig>
  <dataSource type="JdbcDataSource" />
  <document>
    <entity name="PARENT" query="select * from PARENT">
      <field column="id" />
      <field column="desc" />
      <field column="type_s" />
      <entity child="true" name="CHILD" query="select * from CHILD where parent_id='${PARENT.id}'">
        <field column="id" />
        <field column="desc" />
        <field column="type_s" />
      </entity>
    </entity>
  </document>
</dataConfig>
Note that child="true" is required for child entities.
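Adapted to the User/Address tables in the question, it would look roughly like this (a sketch; the JDBC connection details and the user_id foreign key on Address are assumptions, since the question does not show how the tables are linked):
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb" user="root" password="secret"/>
  <document>
    <entity name="user" query="select id, name from User">
      <field column="id" />
      <field column="name" />
      <!-- assumption: Address has a user_id column pointing back to User -->
      <entity child="true" name="address"
              query="select id, number, email from Address where user_id='${user.id}'">
        <field column="id" />
        <field column="number" />
        <field column="email" />
      </entity>
    </entity>
  </document>
</dataConfig>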

Solr + schema.xml creating custom FieldType Object

This is only an example, but it will help me get further.
I have an object "person" with fields [Age, Name].
My schema.xml
<field name="age" type="string" indexed="true" stored="false"/>
<field name="name" type="string" indexed="true" stored="false"/>
Everything is OK, but I want to add one more field, "relation" (or parents, children, etc.).
Person[age, name, relation] -> Relation also has [age, name]
How can I add a field type "Relation" to my schema.xml?
<field name="age" type="string" indexed="true" stored="false"/>
<field name="name" type="string" indexed="true" stored="false"/>
<field name="relation" type="???" indexed="true" stored="false"/>
I want to add a field which takes all existing fields, like above:
<field name="field1" type="string">
<field name="field2" type="string">
<field name="field3" type="string">
<field name="field4" type="field1,field2,field3">
Solr doesn't really support what you want, so you'd probably index it either with a multivalued field that contains ids pointing to the other documents, such as the following (any reason why the age field is a string and not an int?):
<field name="id" type="int" indexed="true" stored="false"/>
<field name="age" type="string" indexed="true" stored="false"/>
<field name="name" type="string" indexed="true" stored="false"/>
<field name="relation" type="int" multiValued="true" indexed="true" stored="false" />
.. and then query all documents with a given relation when displaying a document (making two queries to Solr).
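In practice that means two round trips, along these lines (hypothetical queries against the schema above):
q=name:jay          -> first query: fetch the person and read its relation ids (e.g. 2 and 3)
q=id:(2 OR 3)       -> second query: fetch the related persons by id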
You can also use nested child documents, but it requires a bit more handling (since everything is contained in one document, you'll have to update everything together).
Solr prefers everything to be denormalized, and multi-valued fields point in that direction. But as @MatsLindh said, it involves two queries, because most of the time the child entities are more than just a single field (arrays of strings vs. arrays of entities).
(Parent and child in your case are Person and "relation".)
Nested child documents are more of an object relation, just like what we have in other frameworks. You have parent documents, you have child documents, Solr maintains the relationship, and you should have a field which identifies a parent from a child. The good parts about this are:
You can get a parent document by matching on a child field
You can get all child documents for a matching parent field
Finally, it is only one query
Nested documents are a relatively recent addition. We are using Lucidworks to interact with Solr, and they suggested not using nested documents, so we ended up with multi-valued fields. But if your infrastructure allows it, and Solr itself has the feature, I think there is nothing wrong with using it.
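For reference, the single-query lookups listed above use the block join query parsers, roughly like this (the isParent marker field and the example values are assumptions):
q={!parent which="isParent:true"}age:12      -> parents that have a child matching age:12
q={!child of="isParent:true"}name:Jay        -> children of parents matching name:Jay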

Solr, multiple indexes

I want to index 2 different entities (2 tables in SQL in this case) into my Lucene index. One table containing products, another containing news items.
To be able to use the same search method (query) to search both products and news items, I understand they must be in the same index, so a multi-core Solr setup wouldn't work - right?
In data-config.xml I have defined 2 document types with corresponding entities.
In schema.xml I have defined fields for products and news items as well.
In my database design (tables), my product table's unique key is called "ProductID", whereas my news item's unique key is called "Id" (this is made by the CMS I'm using).
In data-config.xml, should I just map both of my unique ids to the same "name"? Would that be all it takes to make this work?
Am I following the right approach here?
Example of what I'm thinking;
data-config.xml
<!-- Products -->
<document name="products">
  <entity name="product" dataSource="sqlServer" pk="ProductID" query="SELECT
      ProductID,
      ProductNumber,
      ProductName
    FROM EcomProducts">
    <field column="ProductID" name="Id"/>
    <field column="ProductNumber" name="ProductNumber"/>
    <field column="ProductName" name="ProductName"/>
  </entity>
</document>
<!-- News items -->
<document name="newsitems">
  <entity name="newsitems" dataSource="sqlServer" pk="id" query="SELECT
      Id,
      NewsItemTitle,
      NewsItemContent
    FROM ItemType_NewsItems">
    <field column="Id" name="Id"/>
    <field column="NewsItemTitle" name="NewsItemTitle"/>
    <field column="NewsItemContent" name="NewsItemContent"/>
  </entity>
</document>
schema.xml
<!-- Products -->
<field name="Id" type="text_general" indexed="true" stored="true" required="true" />
<field name="ProductNumber" type="text_general" indexed="true" stored="true" required="false" />
<field name="ProductName" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<!-- Tips and tricks -->
<field name="Id" type="text_general" indexed="true" stored="true" required="true" />
<field name="NewsItemTitle" type="text_general" indexed="true" stored="true" required="false" />
<field name="NewsItemContent" type="text_general" indexed="true" stored="true" required="false" />
If you want to search them together, I would map the metadata to a common schema. Perhaps both ProductName and NewsItemTitle would go in a "title" field. Some metadata will be unique to each type. Or you can index the info twice, once as ProductName and once as title.
Unless you can be sure the IDs will always, always be unique across the two sources, you may want to prefix them. It is also very handy to have a field that marks the type of each document. That allows searching only one type and, in DIH, you can use it to delete only one type.
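For example (the "type" field name here is just an illustration, matching the SQL below), you can restrict a search with a filter query and scope DIH deletes with preImportDeleteQuery on the root entity:
fq=type:product
<entity name="product" pk="ProductID" preImportDeleteQuery="type:product" query="...">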
In your SQL, you can add columns like this:
concat('product-',cast(ProductId as char)) as id,
'product' as type,
That is MySQL syntax; it might need tweaking for SQL Server.
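For SQL Server the equivalent would be roughly (a sketch, untested):
'product-' + CAST(ProductID AS varchar(20)) AS id,
'product' AS type,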

Solr return whether member is in multivalued field

Is there any way to return in the fields list whether a value exists as one of the values of a multivalued field?
E.g., if your schema is
<schema>
...
<field name="user_name" type="text" indexed="true" stored="true" required="true" />
<field name="follower" type="integer" indexed="true" stored="true" multiValued="true" />
...
</schema>
A sample document might look like:
<doc>
<field name="user_name">tester blah</field>
<field name="follower">1</field>
<field name="follower">62</field>
<field name="follower">63</field>
<field name="follower">64</field>
</doc>
I would like to be able to query for, say, "tester" and follower:62 and have it match "tester blah" and have some indication of whether 62 is a follower or not in the results.
If you query for something AND follower:62, you can be sure 62 will be a follower of any result you get :)
Now if follower:62 comes as an optional clause in an OR, for instance, I guess you can use the highlighting facility to achieve your requirement:
hl.fl=...,follower,...
hl.requireFieldMatch=true
You'll get something in the highlighting part of the response for your document if it matches your follower:62.
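Putting it together, a request along these lines (the core name and handler path are assumptions) would show the match in the highlighting section of the response:
http://localhost:8983/solr/users/select?q=user_name:tester OR follower:62&hl=true&hl.fl=follower&hl.requireFieldMatch=true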