Mapping one-to-many entities in SOLR - lucene

I'm trying to map a few entities from an existing database to SOLR.
The tables are:
Hotel:
hotel_id
hotel_name
HotelToCategory:
hotel_id
category_id
rate
Category:
category_id
name
value
How can I use DataImportHandler to produce documents like this:
{
hotel_name: 'name',
hotel_id: 1,
categories: [
{ category_name: 'cname',
value: 'val',
rate: 3,
}
]
}
Any help will be greatly appreciated!

Relationships are indexed using stacked entities in DIH. Have a look in the DIH page in the Solr wiki.
There's also a few basic examples of this included in Solr distributions, have a look in examples/example-DIH.
There is a limitation here though, solr does not (currently) support relationships between index documents, so you will have to find a workaround for indexing this. For instance by just storing display data in a non-indexed field (which might require very frequent reindexing):
<document>
<entity name="hotel" query="select * from hotel">
<field column="id" name="hotel_id" />
<field column="hotel_name" name="hotel_name" />
<entity name="hotel_category_display"
query="SELECT STATEMENT THAT RETURNS JSON REPRESENTATION">
<field column="category" name="category" />
</entity>
</document>
Or by storing just category ID and do lookups (either against the database, or index categories separately and lookup against Solr) at search time:
<entity name="hotel_category_display"
query="SELECT STATEMENT THAT RETURNS JSON REPRESENTATION">
<field column="category" name="category" />
</entity>

Related

DateFormatTransformer not working with FileListEntityProcessor in Data Import Handler

While indexing data from a local folder on my system, i am using given below configuration.However the lastmodified attribute is getting indexed in the format "Wed 23 May 09:48:08 UTC" , which is not the standard format used by solr for filter queries .
So, I am trying to format the lastmodified attribute in the data-config.xml as given below .
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="D://FileBank"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
recursive="true" transformer="DateFormatTransformer">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" dateTimeFormat="YYYY-MM-DDTHH:MM:SS.000Z" locale="en"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<!--<field column="text" />-->
</entity>
</entity>
</document>
</dataConfig>
But there is no effect of transformer, and same format is indexed again . Has anyone got success with this ? Is the above configuration right , or am i missing something ?
Your dateTimeFormat provided does not seem to be correct. The upper and lower case characters have different meaning. Also the format you showed does not match the date text you are trying to parse. So, it probably keeps it as unmatched.
Also, if you have several different date formats, you could parse them after DIH runs by creating a custom UpdateRequestProcessor chain. You can see schemaless example where there is several date formats as part of auto-mapping, but you could also do the same thing for a specific field explicitly.

Solr: Indexing nested Documents via DIH

I want to index my document from MySql to Solr via DIH. I have a table structure like this
Table User
id
1
2
3
name
Jay
Chakra
Rabbit
Address
id
1
2
3
number
1111111111
2222222222
3333333333
email
test#email.com
test123#test.co
unique#email.com
and other associations.
I want to index this in a nested document structure but unable to find any resource via which it can be done using DIH.
Resources refered:
http://yonik.com/solr-nested-objects/
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
Please suggest a way to index it through DIH
This feature has been implemented by SOLR-5147 and should be available for Solr 5.1+
Here is a sample configuration taken from the original Jira ticket.
<dataConfig>
<dataSource type="JdbcDataSource" />
<document>
<entity name="PARENT" query="select * from PARENT">
<field column="id" />
<field column="desc" />
<field column="type_s" />
<entity child="true" name="CHILD" query="select * from CHILD where parent_id='${PARENT.id}'">
<field column="id" />
<field column="desc" />
<field column="type_s" />
</entity>
</entity>
</document>
</dataConfig>
note the child="true" is required for child entities.

Solr query search for multiple instances for single keyword

I'm stuck on this one issue. What i want to do is to query on a Multivalued and see if a value comes up at least try. For example the field must be "FREE","FREE" and not just "FREE" or "FREE","IN_USE".
Field
<field name="point_statusses" type="string" indexed="true" stored="true" multiValued="true" />
Type
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
SQL
GROUP_CONCAT(cp.status) as point_statusses
Clarification:
I have an object that has multiple plugs and those all have a status of FREE, IN_USE or ERROR. What i want to do is filter on ones that have two plugs with status FREE and I can't change the structure of the schema.xml. How do i query to for this?
Unfortunately, it cannot be done without applying any changes to schema, because solr.StrField does not preserve term frequency information.
Quote from schema.xml:
...
1.2: omitTermFreqAndPositions attribute introduced, true by default
except for text fields.
...
However, if you can apply some changes, then the following will work (tested on the Solr 4.5.1):
1) Make one of the following changes to schema:
Change field to text_general (or any solr.TextField field);
<field name="point_statusses" type="text_general" indexed="true" stored="true" multiValued="true" />
OR add omitTermFreqAndPositions="false" to point_statusses definition:
<field name="point_statusses" type="string" indexed="true" stored="true" multiValued="true" omitTermFreqAndPositions="false"/>
2) Filter by term frequency.
Examples:
Search documents having exactly 2 'FREE' point_statusses:
{!frange l=2 u=2}termfreq(point_statusses,'FREE')
Or from 2 to 3 'FREE' point_statusses:
{!frange l=2 u=3}termfreq(point_statusses,'FREE')
The final solr query may look like this:
http://localhost:8983/solr/stack20746538/select?q=*:*&fq={!frange l=2 u=3}termfreq(point_statusses,'FREE')

Solr, multiple indexes

I want to index 2 different entities (2 tables in SQL in this case) into my Lucene index. One table containing products, another containing news items.
To be able to use the same search method (query) to search for both products and news items, I understand they must be in the same index, so a several core setup of Solr wouldn't work - right?
In data-config.xml I have defined 2 document types with corresponding entities.
In schema.xml I have defined fields for products and news items as well.
In my databse design (tables) my product table's unique key is called "ProductID", where as my news item's unique key is called "Id" (this is made by the CMS I'm using).
In data-config.xml should I just map both my unique id's to the same "name". Would that be all to make this work?
Am I following the right approach here?
Example of what I'm thinking;
data-config.xml
<!-- Products -->
<document name="products">
<entity name="product" dataSource="sqlServer" pk="ProductID" query="SELECT
ProductID,
ProductNumber,
ProductName,
FROM EcomProducts">
<field column="ProductID" name="**Id**"/>
<field column="ProductNumber" name="ProductNumber"/>
<field column="ProductName" name="ProductName"/>
</entity>
</document>
<!-- News items --->
<document name="newsitems">
<entity name="newsitems" dataSource="sqlServer" pk="id" query="SELECT
Id,
NewsItemTitle,
NewsItemContent,
FROM ItemType_NewsItems">
<field column="Id" name="**Id**"/>
<field column="NewsItemTitle" name="NewsItemTitle"/>
<field column="NewsItemContent" name="NewsItemContent"/>
</entity>
</document>
schema.xml
<!-- Products --->
<field name="**Id**" type="text_general" indexed="true" stored="true" required="true" />
<field name="ProductNumber" type="text_general" indexed="true" stored="true" required="false" />
<field name="ProductName" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<!-- Tips og fif --->
<field name="**Id**" type="text_general" indexed="true" stored="true" required="true" />
<field name="NewsItemTitle" type="text_general" indexed="true" stored="true" required="false" />
<field name="NewsItemContent" type="text_general" indexed="true" stored="true" required="false" />
Id
If you want to search them together, I would map the metadata to a common schema. Perhaps both ProductName and NewItemTitle would go in a "title" field. Some metadata will be unique to each type. Or you can index the info twice, once as ProductName and once as title.
Unless you can be sure the IDs will always, always be unique across the two sources, you may want to prefix them. It is also very handy to have a field that marks the type of each document. That allows searching only one type and, in DIH, you can use it to delete only one type.
In your SQL, you can add columns like this:
concat('product-',cast(ProductId as char)) as id,
'product' as type,
That is MySQL syntax, it might need tweaking for SQLServer.

What is the use of "multiValued" field type in Solr?

I'm new to Apache Solr. Even after reading the documentation part, I'm finding it difficult to clearly understand the functionality and use of the multiValued field type property.
What internally Solr does/treats/handles a field that is marked as multiValued?
What is the difference in indexing in Solr between a field that is multiValued and those that are not?
Can somebody explain with some good example?
Doc says:
multiValued=true|false
True if this
field may contain multiple values per
document, i.e. if it can appear
multiple times in a document
A multivalued field is useful when there are more than one value present for the field. An easy example would be tags, there can be multiple tags that need to be indexed. so if we have tags field as multivalued then solr response will return a list instead of a string value. One point to note is that you need to submit multiple lines for each value of the tags like:
<field name="tags">tag1</tags>
<field name="tags">tag2</tags>
...
<field name="tags">tagn</tags>
Once you have all the values index you can search or filter results by any value, e,g. you can find all documents with tag1 using query like
q=tags:tag1
or use the tags to filter out results like
q=query&fq=tags:tag1
multiValued defined in the schema whether the field is allowed to have more than one value.
For instance:
if I have a fieldType called ID which is multiValued=false indexing a document such as this:
doc {
id : [ 1, 2]
...
}
would cause an exception to be thrown in the indexing thread and the document will not be indexed (schema validation will fail).
On the other hand if I do have multiple values for a field I would want to set multiValued=true in order to guarantee that indexing is done correctly, for example:
doc {
id : 1
keywords: [ hello, world ]
...
}
In this case you would define "keywords" as a multiValued field.
I use multiple value fields only with copyfields, so think this way, say all fields will be single valued unless it's a copyfield, for example I have following fields:
<field name="id" type="string" indexed="true" stored="true"/>
<field name="name" type="string" indexed="true" stored="true"/>
<field name="subject" type="string" indexed="true" stored="true"/>
<field name="location" type="string" indexed="true" stored="true"/>
I want to query one field only and possibly to search all 4 fields above, then we need to use copyfield. first to create a new field call 'all', then copy everything into 'all'
<field name="all" type="text" indexed="true" stored="true" multiValued="true"/>
<copyField source="*" dest="all"/>
Now field 'all' need to be multi-valued.