Solr: Indexing nested Documents via DIH - indexing

I want to index my documents from MySQL into Solr via DIH. My table structure looks like this:

Table User
id | name
1  | Jay
2  | Chakra
3  | Rabbit

Table Address
id | number     | email
1  | 1111111111 | test@email.com
2  | 2222222222 | test123@test.co
3  | 3333333333 | unique@email.com

and other associations.
I want to index this in a nested document structure, but I am unable to find any resource explaining how it can be done using DIH.
Resources referred to:
http://yonik.com/solr-nested-objects/
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
Please suggest a way to index it through DIH.

This feature was implemented in SOLR-5147 and is available in Solr 5.1+.
Here is a sample configuration taken from the original JIRA ticket:
<dataConfig>
  <dataSource type="JdbcDataSource" />
  <document>
    <entity name="PARENT" query="select * from PARENT">
      <field column="id" />
      <field column="desc" />
      <field column="type_s" />
      <entity child="true" name="CHILD" query="select * from CHILD where parent_id='${PARENT.id}'">
        <field column="id" />
        <field column="desc" />
        <field column="type_s" />
      </entity>
    </entity>
  </document>
</dataConfig>
Note that child="true" is required for child entities.
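Once the nested documents are indexed, they can be retrieved with a block-join parent query. A sketch, assuming the type_s field holds values like PARENT and CHILD that distinguish the two levels (the field values and the desc search term are illustrative assumptions):

```
q={!parent which="type_s:PARENT"}desc:someword
fl=*,[child parentFilter=type_s:PARENT]
```

The `which` filter must match all parent documents and only parents; the `[child]` document transformer in `fl` attaches the matching children to each returned parent.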

Related

SOLR LineEntityProcessor - fetched x records, but processed/indexed zero records

I am trying to fetch all the hyperlinks from an HTML page and add them as documents to Solr.
Here is my DIH config xml
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="FileDataSource" name="fds" />
  <dataSource type="FieldReaderDataSource" name="frds" />
  <document>
    <entity name="lines" processor="LineEntityProcessor"
            acceptLineRegex="&lt;a\s+(?:[^>]*?\s+)?href=([&quot;'])(.*?)\1"
            url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
            dataSource="fds" transformer="RegexTransformer">
      <field column="line" />
    </entity>
  </document>
</dataConfig>
managed-schema file contents
<schema name="example-data-driven-schema" version="1.6">
  <uniqueKey>id</uniqueKey>
  <field name="id" type="string" indexed="true" required="true" stored="true"/>
  <field name="line" type="text_general" indexed="true" stored="true"/>
</schema>
When I run full-import, the status says:
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents. (Duration: 01s)
Requests: 0 , Fetched: 4 4/s, Skipped: 0 , Processed: 0
Am I missing something? Please help me out here.
Thanks,
Naveen
The id field is defined with required="true" and is also the uniqueKey, but your entity only produces the line column, so no id value is ever generated — that is likely why all fetched rows are skipped. Can you switch it off and try again?
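Alternatively, if the lines have no natural key, an update processor chain in solrconfig.xml can generate an id automatically. A sketch (the chain name uuid is arbitrary, and the chain must be wired to the /dataimport request handler via its update.chain parameter):

```xml
<updateRequestProcessorChain name="uuid">
  <!-- Fill the id field with a generated UUID when it is missing -->
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The id field type must accept UUID strings; the existing string type works.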

DateFormatTransformer not working with FileListEntityProcessor in Data Import Handler

While indexing data from a local folder on my system, I am using the configuration given below. However, the lastmodified attribute is getting indexed in the format "Wed 23 May 09:48:08 UTC", which is not the standard format Solr uses for filter queries.
So I am trying to format the lastmodified attribute in data-config.xml as shown below.
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="D://FileBank"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
            recursive="true" transformer="DateFormatTransformer">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" dateTimeFormat="YYYY-MM-DDTHH:MM:SS.000Z" locale="en"/>
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <!--<field column="text" />-->
      </entity>
    </entity>
  </document>
</dataConfig>
But the transformer has no effect, and the same format is indexed again. Has anyone had success with this? Is the above configuration right, or am I missing something?
The dateTimeFormat you provided does not look correct. DateFormatTransformer uses Java SimpleDateFormat patterns, where upper- and lower-case letters have different meanings. Also, the pattern you wrote does not match the date text you are trying to parse, so the value probably stays unchanged.
Also, if you have several different date formats, you could parse them after DIH runs by creating a custom UpdateRequestProcessor chain. See the schemaless example, where several date formats are handled as part of auto-mapping; you could do the same thing explicitly for a specific field.
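A sketch of a corrected field definition, assuming the raw value really is a string like "Wed 23 May 09:48:08 UTC" (in SimpleDateFormat syntax, EEE is the day name, MMM the month name, zzz the time zone — the exact pattern depends on what the raw string actually contains):

```xml
<field column="fileLastModified" name="lastmodified"
       dateTimeFormat="EEE dd MMM HH:mm:ss zzz" locale="en"/>
```

Note that if FileListEntityProcessor already supplies the value as a date object rather than a string, no transformer is needed at all.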

Solr + schema.xml creating custom FieldType Object

This is only an example, but it will help me get further.
I have an object "person" with fields [Age, Name].
My schema.xml:
<field name="age" type="string" indexed="true" stored="false"/>
<field name="name" type="string" indexed="true" stored="false"/>
Everything is OK, but I want to add one more field, "relation" (or parents, children, etc.):
Person[age, name, relation] -> a Relation also has [age, name]
How can I insert a FieldType "Relation" into my schema.xml?
<field name="age" type="string" indexed="true" stored="false"/>
<field name="name" type="string" indexed="true" stored="false"/>
<field name="relation" type="???" indexed="true" stored="false"/>
I want to add a field which takes all existing fields, like this:
<field name="field1" type="string"/>
<field name="field2" type="string"/>
<field name="field3" type="string"/>
<field name="field4" type="field1,field2,field3"/>
Solr doesn't really support what you want, so you'd probably index it with a multivalued field that contains ids pointing to the other documents, such as (any reason why the age field is a string and not an int?):
<field name="id" type="int" indexed="true" stored="false"/>
<field name="age" type="string" indexed="true" stored="false"/>
<field name="name" type="string" indexed="true" stored="false"/>
<field name="relation" type="int" multiValued="true" indexed="true" stored="false" />
.. and then query all documents with a given relation when displaying a document (making two queries to Solr).
You can also use nested child documents, but it requires a bit more handling (since everything is contained in one document, you'll have to update everything together).
Solr prefers everything to be denormalized, and a multivalued field is a step in that direction. But as @MatsLindh said, it involves two queries, because most of the time the child entities are more than a single field (arrays of strings vs. arrays of entities).
(Parent and child in your case are Person and "relation".)
Nested child documents are more of an object relation, just like in other frameworks: you have parent documents, you have child documents, Solr maintains the relationship, and you need a field that identifies parents from children. The good parts about this are:
- You can get a parent document by matching on a child field
- You can get all child documents by matching on a parent field
- It takes only one query
Nested documents are a recent addition. We use Lucidworks to interact with Solr, and they suggested not to use nested documents, so we ended up with multivalued fields. But if your infrastructure allows it, and Solr itself has the feature, I think there is nothing wrong with using it.
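As a sketch of those query patterns, assuming a field content_type marks parents with the value person (the field name, value, and search terms here are illustrative assumptions):

```
# Parents whose children match a child field
q={!parent which="content_type:person"}name:Jay

# Children whose parent matches a parent field
q={!child of="content_type:person"}name:Jay
```

In both parsers the filter argument must match all parent documents and only parents.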

Solr, multiple indexes

I want to index 2 different entities (2 tables in SQL in this case) into my Lucene index. One table containing products, another containing news items.
To be able to use the same search method (query) for both products and news items, I understand they must be in the same index, so a multi-core Solr setup wouldn't work - right?
In data-config.xml I have defined 2 document types with corresponding entities.
In schema.xml I have defined fields for products and news items as well.
In my database design, my product table's unique key is called "ProductID", whereas my news item's unique key is called "Id" (this is made by the CMS I'm using).
In data-config.xml, should I just map both of my unique ids to the same name? Would that be all it takes to make this work?
Am I following the right approach here?
Example of what I'm thinking;
data-config.xml
<!-- Products -->
<document name="products">
  <entity name="product" dataSource="sqlServer" pk="ProductID" query="SELECT
      ProductID,
      ProductNumber,
      ProductName
      FROM EcomProducts">
    <field column="ProductID" name="Id"/>
    <field column="ProductNumber" name="ProductNumber"/>
    <field column="ProductName" name="ProductName"/>
  </entity>
</document>
<!-- News items -->
<document name="newsitems">
  <entity name="newsitems" dataSource="sqlServer" pk="id" query="SELECT
      Id,
      NewsItemTitle,
      NewsItemContent
      FROM ItemType_NewsItems">
    <field column="Id" name="Id"/>
    <field column="NewsItemTitle" name="NewsItemTitle"/>
    <field column="NewsItemContent" name="NewsItemContent"/>
  </entity>
</document>
schema.xml
<!-- Products -->
<field name="Id" type="text_general" indexed="true" stored="true" required="true" />
<field name="ProductNumber" type="text_general" indexed="true" stored="true" required="false" />
<field name="ProductName" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<!-- Tips and tricks -->
<field name="Id" type="text_general" indexed="true" stored="true" required="true" />
<field name="NewsItemTitle" type="text_general" indexed="true" stored="true" required="false" />
<field name="NewsItemContent" type="text_general" indexed="true" stored="true" required="false" />
If you want to search them together, I would map the metadata to a common schema. Perhaps both ProductName and NewsItemTitle would go in a "title" field. Some metadata will be unique to each type. Or you can index the info twice, once as ProductName and once as title.
Unless you can be sure the IDs will always, always be unique across the two sources, you may want to prefix them. It is also very handy to have a field that marks the type of each document. That allows searching only one type and, in DIH, you can use it to delete only one type.
In your SQL, you can add columns like this:
concat('product-', cast(ProductId as char)) as id,
'product' as type,
That is MySQL syntax; it might need tweaking for SQL Server.
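For SQL Server, a sketch of the equivalent (CONCAT is available from SQL Server 2012 onward; the column and table names are the asker's):

```sql
SELECT
    CONCAT('product-', CAST(ProductID AS varchar(20))) AS id,
    'product' AS type,
    ProductNumber,
    ProductName
FROM EcomProducts
```

On older SQL Server versions, string concatenation with + and explicit CAST of every operand achieves the same thing.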

Mapping one-to-many entities in SOLR

I'm trying to map a few entities from an existing database to Solr.
The tables are:
Hotel:
  hotel_id
  hotel_name
HotelToCategory:
  hotel_id
  category_id
  rate
Category:
  category_id
  name
  value
How can I use DataImportHandler to produce documents like this:
{
  hotel_name: 'name',
  hotel_id: 1,
  categories: [
    {
      category_name: 'cname',
      value: 'val',
      rate: 3
    }
  ]
}
Any help will be greatly appreciated!
Relationships are indexed using stacked entities in DIH. Have a look at the DIH page in the Solr wiki.
There are also a few basic examples of this included in Solr distributions; have a look in example/example-DIH.
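A sketch of stacked entities for the tables above, flattening each hotel's categories into multivalued fields (the column aliases and target field names are assumptions):

```xml
<document>
  <entity name="hotel" query="select hotel_id, hotel_name from Hotel">
    <field column="hotel_id" name="hotel_id" />
    <field column="hotel_name" name="hotel_name" />
    <!-- One row per category; each column lands in a multivalued field -->
    <entity name="category"
            query="select c.name as category_name, c.value, hc.rate
                   from HotelToCategory hc
                   join Category c on c.category_id = hc.category_id
                   where hc.hotel_id = '${hotel.hotel_id}'">
      <field column="category_name" name="category_name" />
      <field column="value" name="category_value" />
      <field column="rate" name="category_rate" />
    </entity>
  </entity>
</document>
```

Note that this produces parallel arrays: the per-category grouping of name, value, and rate is not preserved inside a single document.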
There is a limitation here, though: Solr does not (currently) support relationships between indexed documents, so you will have to find a workaround for indexing this. For instance, by just storing display data in a non-indexed field (which might require very frequent reindexing):
<document>
  <entity name="hotel" query="select * from hotel">
    <field column="id" name="hotel_id" />
    <field column="hotel_name" name="hotel_name" />
    <entity name="hotel_category_display"
            query="SELECT STATEMENT THAT RETURNS JSON REPRESENTATION">
      <field column="category" name="category" />
    </entity>
  </entity>
</document>
Or by storing just the category ID and doing lookups (either against the database, or by indexing categories separately and looking them up against Solr) at search time:
<entity name="hotel_category_ids"
        query="select category_id from HotelToCategory where hotel_id='${hotel.hotel_id}'">
  <field column="category_id" name="category_id" />
</entity>