DateFormatTransformer not working with FileListEntityProcessor in Data Import Handler - indexing

While indexing data from a local folder on my system, I am using the configuration given below. However, the lastmodified attribute is being indexed in the format "Wed 23 May 09:48:08 UTC", which is not the standard format Solr uses for filter queries.
So I am trying to format the lastmodified attribute in data-config.xml as shown below:
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="D://FileBank"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
recursive="true" transformer="DateFormatTransformer">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" dateTimeFormat="YYYY-MM-DDTHH:MM:SS.000Z" locale="en"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<!--<field column="text" />-->
</entity>
</entity>
</document>
</dataConfig>
But the transformer has no effect, and the same format is indexed again. Has anyone had success with this? Is the above configuration right, or am I missing something?

The dateTimeFormat you provided does not seem to be correct. Upper- and lowercase pattern letters have different meanings (for example, MM is the month while mm is the minute). Also, the format you specified does not match the date text you are trying to parse, so the value is probably left unparsed and passed through unchanged.
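As a rough illustration of the casing point, here is how an ISO 8601 style pattern is usually written with java.text.SimpleDateFormat letters, which is the syntax dateTimeFormat uses. This is only a sketch: it assumes the incoming value is text in that shape, which may not be the case for fileLastModified, and the pattern describes the input being parsed rather than the output format.

```xml
<!-- Sketch only: the pattern describes the INPUT text to be parsed.
     yyyy = year, MM = month, dd = day of month,
     HH = hour (0-23), mm = minute, ss = second, SSS = millisecond.
     Literal characters such as T and Z are wrapped in single quotes. -->
<field column="fileLastModified" name="lastmodified"
       dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" locale="en"/>
```

If the incoming text looks different, the pattern has to be changed to mirror that text letter by letter.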
Also, if you have several different date formats, you could parse them after DIH runs by creating a custom UpdateRequestProcessor chain. See the schemaless example, where several date formats are configured as part of auto-mapping; you could do the same thing explicitly for a specific field.
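A minimal sketch of such a chain in solrconfig.xml might look like the following. The chain name parse-lastmodified is made up for illustration, the format list is an assumption that must be adjusted to the formats actually present in your data, and the accepted pattern syntax can differ between Solr versions.

```xml
<!-- Sketch only: chain name and format list are assumptions -->
<updateRequestProcessorChain name="parse-lastmodified">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <!-- Restrict date parsing to the one field of interest -->
    <str name="fieldName">lastmodified</str>
    <!-- Formats are tried in order until one matches -->
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss</str>
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then need to be selected for the import, for example through the update.chain request parameter.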

Related

Speedata and XPath

Unfortunately I can't get the XPath syntax right in the Layout.xml of speedata Publisher.
I've been programming XSL for years, so maybe I'm a bit biased.
The XML I'm trying to evaluate has the following structure:
<export>
<object>
<fields>
<field key="demo1:DISPLAY_NAME" lang="de_DE" origin="default" ftype="string">Anwendungsbild</field>
<field key="demo1:DISPLAY_NAME" lang="en_UK" origin="inherit" ftype="string">application picture</field>
<field key="demo1:DISPLAY_NAME" lang="es_ES" origin="self" ftype="string">imagen de aplicación</field>
</fields>
</object>
</export>
The attempt to output the element node with the following XPath fails.
export/object/fields/field[@key='demo1:DISPLAY_NAME' and @lang='de_DE' and @origin='default']
How do I formulate the query in Speedata Publisher, please?
Thank you for your help.
The speedata software only supports a small subset of XPath. You have two options:
1. Preprocess the data with the provided Saxon XSLT processor.
2. Iterate through the data yourself:
<Layout xmlns="urn:speedata.de:2009/publisher/en"
xmlns:sd="urn:speedata:2009/publisher/functions/en">
<Record element="export">
<ForAll select="object/fields/field">
<Switch>
<Case test="@key='demo1:DISPLAY_NAME' and @lang='de_DE' and @origin='default'">
<SetVariable variable="whatever" select="."/>
</Case>
</Switch>
</ForAll>
<Message select="$whatever"></Message>
</Record>
</Layout>
(given your input file as data.xml)
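For the preprocessing option, a small stylesheet along the following lines could reduce the export to just the wanted field before the layout runs. It is a sketch under the assumption that the input has exactly the structure shown in the question; how the stylesheet is wired into the speedata run is not shown here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: keeps the single matching field and drops everything else -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <export>
      <!-- Full XPath predicates are available here because Saxon evaluates them -->
      <xsl:copy-of select="export/object/fields/field[@key='demo1:DISPLAY_NAME'
                           and @lang='de_DE' and @origin='default']"/>
    </export>
  </xsl:template>
</xsl:stylesheet>
```

The layout can then read the single remaining field element without needing any predicate.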

SOLR index on pdate field included in search results

I am migrating from SOLR 4.10.2 to 8.1.1. For some reason, in the 8.1.1 core, a pdate index named IDX_ExpirationDate is appearing as a field in the search results documents.
I have several other indexes that are defined and (correctly) do not appear in the results. But the index I am having trouble with is the only one based on a pdate.
Here is a sample 8.1.1 response that demonstrates the issue:
"response":{"numFound":58871,"start":0,"docs":[
{
"id":"11111",
"ExpirationDate":"2018-01-26T00:00:00Z",
"_version_":1641033044033798170,
"IDX_ExpirationDate":["2018-01-26T00:00:00Z"]},
{
"id":"22222",
"ExpirationDate":"2018-02-20T00:00:00Z",
"_version_":1641032965380112384,
"IDX_ExpirationDate":["2018-02-20T00:00:00Z"]},
ExpirationDate is supposed to be there, but IDX_ExpirationDate should not. I know that I can probably keep using date, but it is deprecated, and part of the reason for upgrading to 8.1.1 is to use the latest non-deprecated stuff ;-)
I have an index named IDX_ExpirationDate based on a field called ExpirationDate that was a date field in 4.10.2:
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<field name="IDX_ExpirationDate" type="date" indexed="true" stored="false" multiValued="true" />
<field name="ExpirationDate" type = "date" indexed = "true" stored = "true" />
<copyField source="ExpirationDate" dest="IDX_ExpirationDate"/>
In the 8.1.1 core, I have this configured as a pdate:
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<field name="IDX_ExpirationDate" type="pdate" indexed="true" stored="false" multiValued="true" />
<field name="ExpirationDate" type = "pdate" indexed = "true" stored = "true" />
<copyField source="ExpirationDate" dest="IDX_ExpirationDate"/>
Fixed.
According to Shawn Heisey on the solr-user mailing list, the pdate type defaults to docValues="true" and useDocValuesAsStored="true", which makes the field appear in results even though it is not stored.
So I changed the IDX_ExpirationDate field definition by adding useDocValuesAsStored="false", reloaded the index, and it no longer appears in the results:
<field name="IDX_ExpirationDate" type="pdate" indexed="true" stored="false" multiValued="true" useDocValuesAsStored="false"/>

SOLR LineEntityProcessor - fetched x records, but processed/indexed zero records

I am trying to fetch all the hyperlinks from an HTML page and add them as documents to Solr.
Here is my DIH config XML:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource type="FileDataSource" name="fds" />
<dataSource type="FieldReaderDataSource" name="frds" />
<document>
<entity name="lines" processor="LineEntityProcessor"
acceptLineRegex="<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1"
url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
dataSource="fds" transformer="RegexTransformer">
<field column="line" />
</entity>
</document>
</dataConfig>
Contents of the merged schema XML file:
<schema name="example-data-driven-schema" version="1.6">
<uniqueKey>id</uniqueKey>
<!-- ... -->
<field name="id" type="string" indexed="true" required="true" stored="true"/>
<field name="line" type="text_general" indexed="true" stored="true"/>
</schema>
When I run full-import, the status says
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents. (Duration: 01s)
Requests: 0 , Fetched: 4 4/s, Skipped: 0 , Processed: 0
Am I missing something? Please help me out here.
Thanks,
Naveen
The id field is defined with required="true" and is also the uniqueKey, but your DIH config never populates it. That could be the issue. Can you switch off required and try again?
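Another way to test this is to give every document an id from within DIH. The sketch below assumes that the href value is acceptable as a unique id, which is only a guess about your data. It relies on LineEntityProcessor putting each line into a column named rawLine, and it leaves out acceptLineRegex for brevity. Note that quotes inside attribute values have to be XML-escaped.

```xml
<!-- Sketch only: using the href value as id is an assumption -->
<entity name="lines" processor="LineEntityProcessor"
        url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
        dataSource="fds" transformer="RegexTransformer">
  <!-- Copy the whole raw line into the schema field "line" -->
  <field column="line" sourceColName="rawLine" regex="^(.*)$"/>
  <!-- Extract the link target into the required uniqueKey field -->
  <field column="id" sourceColName="rawLine"
         regex="href=[&quot;']([^&quot;']+)"/>
</entity>
```

If the Processed count rises above zero with this in place, the missing id was the cause.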

Solr: Indexing nested Documents via DIH

I want to index my documents from MySQL into Solr via DIH. I have a table structure like this:
Table User
id  name
1   Jay
2   Chakra
3   Rabbit
Table Address
id  number      email
1   1111111111  test@email.com
2   2222222222  test123@test.co
3   3333333333  unique@email.com
and other associations.
I want to index this as a nested document structure, but I am unable to find any resource that explains how to do it with DIH.
Resources referred to:
http://yonik.com/solr-nested-objects/
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
Please suggest a way to index it through DIH.
This feature was implemented in SOLR-5147 and should be available in Solr 5.1+.
Here is a sample configuration taken from the original Jira ticket.
<dataConfig>
<dataSource type="JdbcDataSource" />
<document>
<entity name="PARENT" query="select * from PARENT">
<field column="id" />
<field column="desc" />
<field column="type_s" />
<entity child="true" name="CHILD" query="select * from CHILD where parent_id='${PARENT.id}'">
<field column="id" />
<field column="desc" />
<field column="type_s" />
</entity>
</entity>
</document>
</dataConfig>
Note that child="true" is required on child entities.
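Adapted to the tables in the question, the configuration might look roughly like this. Several things here are assumptions: the connection settings are placeholders, the question does not show a join column so Address.user_id is hypothetical, and the child id is prefixed because both tables use the ids 1 to 3, which would clash if id is the uniqueKey.

```xml
<dataConfig>
  <!-- Placeholder connection settings, replace with your own -->
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="solr" password="secret"/>
  <document>
    <entity name="user" query="select id, name from User">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <!-- user_id is a hypothetical foreign key column;
           concat() keeps child ids distinct from parent ids -->
      <entity child="true" name="address"
              query="select concat('address-', id) as id, number, email
                     from Address where user_id='${user.id}'">
        <field column="id" name="id"/>
        <field column="number" name="number"/>
        <field column="email" name="email"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Each User row then becomes a parent document with its Address rows attached as child documents.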

Avoiding Boxing/Unboxing on unknown input

I am creating an application that parses an XML file and retrieves some data. Each XML node specifies the data (const), a recordset's column name to get the data from (var), a subset of possible data values depending on some condition (enum), and others. It may also specify, alongside the data, the format in which the data must be shown to the user.
The thing is that for each node type I need to process the values differently and perform some actions, so for each node I need to store the return value in a temp variable in order to format it later. I know I could format it right there and return it, but that would mean repeating myself, and I hate doing so.
So, the question: How can I store the value to return, in a temp variable, while avoiding boxing/unboxing when the type is unknown and I can't use generics?
P.S.: I'm designing the parser, the XML Schema and the view that will fill the recordset so changes to all are plausible.
Update
I cannot post the code nor the XML values but this is the XML structure and actual tags.
<?xml version='1.0' encoding='utf-8'?>
<root>
<entity>
<header>
<field type="const">C1</field>
<field type="const">C2</field>
<field type="count" />
<field type="sum" precision="2">some_recordset_field</field>
<field type="const">C3</field>
<field type="const">C4</field>
<field type="const">C5</field>
</header>
<detail>
<field type="enum" fieldName="some_recordset_field">
<match value="0">M1</match>
<match value="1">M2</match>
</field>
<field type="const">C6</field>
<field type="const">C7</field>
<field type="const">C8</field>
<field type="var" format="0000000000">some_recordset_field</field>
<field type="var" format="MMddyyyy">some_recordset_field</field>
<field type="var" format="0000000000" precision="2">some_recordset_field</field>
<field type="var" format="0000000000">some_recordset_field</field>
<field type="enum" fieldName ="some_recordset_field">
<match value="0">M3</match>
<match value="1">M4</match>
</field>
<field type="const">C9</field>
</detail>
</entity>
</root>
Have you tried using the var type? That way you don't need to know the type of each node. Also, a small sample of your scenario would be useful.