MarkLogic - Error with loading pipeline for content processing - marklogic-9

MarkLogic version - 9.0-6.2 (on windows)
I am following the guide (https://docs.marklogic.com/guide/cpf/quickStart) to work through the sample exercise. After installing CPF on data-hub-FINAL (with data-hub-TRIGGERS as the triggers database), I created a pipeline XML document (as given in the example) on my C drive in the directory C:\copyright. Then, in the Admin Console, I navigated to Databases --> data-hub-FINAL --> Content Processing --> Pipelines --> Load and provided the values below.
directory : C:\copyright
filter : *.xml
source : (file system)
However, when I click 'Ok', I am getting error message 'Invalid input: No readable XML files found:'
I verified that the pipeline xml is present and valid in the directory C:\copyright.
Any inputs appreciated!

MarkLogic could not read the XML document because it contained non-UTF-8 content, as shown below.
<state-transition>
<annotation>
When a document containing ‘book' as a root element is created,
add a ‘copyright' statement.
</annotation>
For now, I removed the annotation from the xml document and successfully loaded the pipeline.
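As an alternative to deleting the annotation, the offending bytes can be located and the file re-encoded before loading. The following is only a minimal sketch: it assumes the pipeline file was saved in Windows-1252 (where curly quotes are single bytes such as 0x91), which the post does not confirm, and the function names are hypothetical.

```python
def find_invalid_utf8(data):
    """Return (offset, bad_bytes) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]


def reencode_cp1252_to_utf8(data):
    # Assumption: the source bytes are Windows-1252; other encodings need a different codec.
    return data.decode("cp1252").encode("utf-8")


# Simulate the annotation text saved by a Windows editor with a curly opening quote.
sample = "add a \u2018copyright' statement.".encode("cp1252")
print(find_invalid_utf8(sample))
print(find_invalid_utf8(reencode_cp1252_to_utf8(sample)))
```

For a real file, read it with open(path, "rb").read(), run the check, and write the re-encoded bytes back only after confirming the original encoding.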

When passing a path as flowfile attribute XMLValidator doesn't work, but when passing the exact same path in the schema directly it does

I'm fairly new to working with NiFi. We're trying to validate an XML file, except we need to use a different XSD depending on a value passed in the file. Extracting and routing on the name wasn't an issue, and we stored the desired file path in an attribute (xsdFile).
However, when we try to use that attribute in the XMLValidation processor, it changes the path and gives an error. When I copy the path from the attributes and paste it into the Schema File property, it works, so the path itself isn't wrong.
Attribute passed in flowfile:
xsdFile:
C:\Users\MYNAME\Documents\NiFi\FLOW_RESOURCES\input\validatexml\camt.053.001.02_CvW_2.xsd
XMLValidation processor properties:
Schema File: ${xsdFile}
Error:
Failed to properly initialize Processor. If still scheduled to run, NiFi will attempt to initialize and run the Processor again after the 'Administrative Yield Duration' has elapsed. Failure is due to java.io.FileNotFoundException:
Schema file not found at specified location: C:\Users\MYNAME\DOCUME~1\NiFi\NIFI-1~1.0: java.io.FileNotFoundException:
Schema file not found at specified location: C:\Users\MYNAME\DOCUME~1\NiFi\NIFI-1~1.0
java.io.FileNotFoundException: Schema file not found at specified location: C:\Users\MYNAME\DOCUME~1\NiFi\NIFI-1~1.0
Why does this not work? Is there another way to do this, or do we need to route to different XMLValidators?
Check documentation for this processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ValidateXml/index.html
Schema File:
The path to the Schema file that is to be used for validation
Supports Expression Language: true
(will be evaluated using variable registry only)
So a flowfile attribute can't be used for this property; only variables from the variable registry are evaluated.

Deltek Vision 7.6 - Column: does not exist when UpdateProject

I'm currently working on an integration with Deltek Vision 7.6 using the SOAP API. It exposes all actions, and I'm currently creating and updating records.
The problem is that after adding a new field to the database table and to Deltek Vision, the same call returns an error like this:
<?xml version="1.0" encoding="UTF-8"?>
<DLTKVisionMessage>
<ReturnCode>ErrSave</ReturnCode>
<ReturnDesc>An unexpected error has occured while saving</ReturnDesc>
<ChangesNotProcessed>
<InsertErrors>
<Error rowNum="1">
<ErrorCode>InsertError</ErrorCode>
<Message>Column: does not exist.</Message>
<Table>Projects_MilestoneCompletionLog</Table>
<ROW new="1" mod="1" del="0">
<WBS1>100434</WBS1>
<WBS2>1014</WBS2>
<WBS3>SD</WBS3>
<Seq>a0D0m000000cf9NEAQ</Seq>
<CustMilestoneNumber>MS01</CustMilestoneNumber>
<CustMilestoneName>DM91 - Data Maintenance SAQ</CustMilestoneName>
<CustAmount>1150.0</CustAmount>
<CustSiteTrackerDate>2018-07-06T10:01:50</CustSiteTrackerDate>
</ROW>
</Error>
</InsertErrors>
</ChangesNotProcessed>
<Detail>Column: does not exist.</Detail>
<CallStack>UpdateProject.SendDataToDeltekVision</CallStack>
</DLTKVisionMessage>
The problematic field is CustSiteTrackerDate. If I remove it from Vision and the database, the update call succeeds.
Does anyone know whether, after creating a new custom field in Deltek, there is anything special we need to do to allow update calls through the API?
Thanks
I have been working with the Deltek Soap API as well and found this in some of the documentation:
XML Schema for Vision Web Services/APIs The data that you are adding
or updating in the Vision database must be sent in XML format. The
format of the XML data must comply with the schema. The order of the
fields in your XML file must match the order of the fields that is
defined by the schema. If your XML file does not match the required
schema and the order of the fields, you will receive an error when you
use web services to update the Vision database. Each applicable Info
Center in Vision has an XML schema defined. Examples of the schema for
each Info Center are included in schema files that are located on the
Vision Web/app server in \Vision\Web\Xsd directory
( is the directory where Deltek Vision is installed). The
names of the schema files start with the generic Info-Center-name
followed by ‘_Schema.xsd.’ For example, the name of the XML schema
file used for Employee Info Center would be ‘Employee_Schema.Xsd.’
It may be that you need to add the new field to the Info Center XML schema. Go to the server hosting your Vision/Web/App, find the Info Center XML that this new field should exist in, and make sure it is there.
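Because the quoted documentation says the field order in the XML must match the schema, it can help to reorder the elements of a row before sending it. This is a rough sketch using Python's standard library; the field order shown is hypothetical and would in practice be read from the relevant schema file on the Vision server.

```python
import xml.etree.ElementTree as ET


def reorder_children(row, schema_order):
    """Sort the children of `row` in place to follow schema_order.

    Elements not named in schema_order keep their relative order and go last.
    """
    rank = {name: i for i, name in enumerate(schema_order)}
    children = list(row)
    children.sort(key=lambda el: rank.get(el.tag, len(rank)))  # stable sort
    row[:] = children


# Hypothetical order; the real one comes from the Info Center .xsd file.
order = ["WBS1", "WBS2", "WBS3", "CustAmount", "CustSiteTrackerDate"]
row = ET.fromstring(
    "<ROW><CustSiteTrackerDate>2018-07-06T10:01:50</CustSiteTrackerDate>"
    "<WBS2>1014</WBS2><CustAmount>1150.0</CustAmount><WBS1>100434</WBS1></ROW>"
)
reorder_children(row, order)
print(ET.tostring(row, encoding="unicode"))
```

This only addresses ordering; if the new field is missing from the schema itself, the schema still has to be updated on the server.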

Solr pdf index bad request

I'd like to have a simple setup of solr where I can index and search large folders of pdf/docx files. I mostly need just full text search, no need to have fields separated and the original documents do not seem to have well defined structure anyway. I follow https://lucene.apache.org/solr/quickstart.html which is straightforward, however, when I try to index my own folder with some pdf files, some files return error like:
POSTing file G1504225.pdf (application/pdf) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for
url: http://localhost:8983/solr/gettingstarted/update/extract?
resource.name=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf&literal.id=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int
name="QTime">263</int></lst><lst name="error"><lst name="metadata"><str
name="error-class">org.apache.solr.common.SolrException</str><str
name="root-error-class">java.lang.NumberFormatException</str><str
name="error-class">org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException</str><str name="root-error-class">org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException</str></lst><str name="msg">Async exception during distributed update: Error from server at http://127.0.1.1:8983/solr/gettingstarted_shard2_replica1: Bad Request
request:
http://127.0.1.1:8983/solr/gettingstarted_shard2_replica1/update?update.chain=add-unknown-fields-to-the-schema&update.distrib=TOLEADER&distrib.from=http%3A%2F%2F127.0.1.1%3A8983%2Fsolr%2Fgettingstarted_shard1_replica1%2F&wt=javabin&version=2
Remote error message: ERROR: [doc=/home/solr/solr-6.5.1/../train_data/G1504225.pdf] Error adding field 'title'='United Nations' msg=For input string: "United Nations"</str><int name="code">400</int></lst>
</response>
SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 400 for URL:
http://localhost:8983/solr/gettingstarted/update/extract?
resource.name=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf&literal.id=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf
Most of the files are fine and I can search them. Any ideas?
Solr uses Tika to extract the text from those files. Some file types, PDF especially, are hard to parse, as it is a proprietary format and Tika is always trying to catch up with edge cases. So it is normal that some files will throw errors; you have to expect that.
See how many instances of NumberFormatException/pdfbox are found (pdfbox is the library Tika uses for PDF files).
If you really want to get the text from every PDF, even the ones that error out, you can put those in a separate folder and process them again, extracting the text yourself with another library. Different libraries produce different results for the same PDF, so you can use the superset of the text that several libraries produce. You will have to write some glue code for this, unless Tika allows you to plug in specific libraries for specific file types (not sure if it does now; it didn't before).
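A first piece of that glue code could simply collect the names of the files that failed, so they can be moved to a separate folder and reprocessed. The sketch below assumes the post tool's output looks like the log in the question ("POSTing file ..." followed by a "Solr returned an error" warning); other Solr versions may format it differently, and the function name is hypothetical.

```python
import re

# Patterns modelled on the log lines quoted in the question (an assumption).
POSTING = re.compile(r"^POSTing file (.+?) \(")
ERROR = re.compile(r"WARNING: Solr returned an error")


def failed_files(log_text):
    """Return the files whose POST was followed by a Solr error warning."""
    failed = []
    current = None
    for line in log_text.splitlines():
        match = POSTING.match(line)
        if match:
            current = match.group(1)  # remember the file being posted
        elif ERROR.search(line) and current and current not in failed:
            failed.append(current)
    return failed


log = (
    "POSTing file ok.pdf (application/pdf) to [base]/extract\n"
    "POSTing file G1504225.pdf (application/pdf) to [base]/extract\n"
    "SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for\n"
)
print(failed_files(log))
```

The resulting list can then be fed to shutil.move or to whichever alternative text extractor you choose.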

How to handle file inputs with changing schemas in Talend

Questions: How do I continue to process files that differ substantially from a base schema and that trigger tSchemaComplianceCheck errors?
Background
Suppose I have a folder with Customer xls files called file1, file2, ..., file1000. Assume I have imported the file schema into the Talend repository and called it 6Columns, and I have the Talend job configured to iterate through each of the files and process them:
1-tFileInput ->2-tSchemaCompliance-6Columns -> 3-tMap ->4-FurtherProcessing
Read each excel file
Compare it to the schema 6Columns
Format the output (rename columns)
Take the collection of Customer data and process it more
While processing, I notice that the schema compliance check is generating errors (errorCode 16), which point to a number of files (200) with a different schema, 13Columns, but there isn't a way to identify the files in advance to filter them into a subjob.
How do I amend my processing to correctly integrate the files with the 13Columns schema into the process (what's the recommended way of handling this), and how do I design it in case other schema changes occur?
1-tFileInput ->2-tSchemaCompliance-6Columns -> 3-tMap ->4-FurtherProcessing
|
|Reject Flow (ErrorCode 16)
|Schema-13Columns
|
|-> ??
Current Thinking When ErrorCode 16 detected
Option 1 Parallel. Take the file path for the current file and process it against 13Columns using a new FileInput before merging the 2 flows back into 1
Option 2 Serial. Collect the list of files that triggered the error and process them after I've finished with the compliance files?
You could try something like the following:
tFileList - Read your input repository
tFileInput "schema6" - tSchemaComplianceCheck : read files as 6-columns schema
tMap_1 : further processing
In the reject part :
tMap after reject link : add a new column containing the filepath that has been rejected
tFlowToIterate : used to get an iterate link, acceptable input for tFileInputDelimited that follows.
tFileInput : read data as 13-columns schema. Following components are the same as in part 1.
After that, you can push your data to tHashOutput, in order to read them further in another subjob.
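The control flow of that design (read with the 6-column schema, and on rejection re-read the same file with the 13-column schema) can be sketched outside Talend as follows. This is plain Python rather than Talend code, it uses CSV text instead of xls for brevity, and the column names are placeholders.

```python
import csv
import io

# Placeholder column names standing in for the 6Columns and 13Columns schemas.
SCHEMA_6 = ["c%d" % i for i in range(1, 7)]
SCHEMA_13 = ["c%d" % i for i in range(1, 14)]


def parse(text, columns):
    """Parse CSV text, rejecting it if any row does not match the schema width."""
    rows = list(csv.reader(io.StringIO(text)))
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row does not comply with schema")
    return [dict(zip(columns, row)) for row in rows]


def parse_with_fallback(text, primary, fallback):
    """Try the primary schema first; on rejection, re-read with the fallback."""
    try:
        return "primary", parse(text, primary)
    except ValueError:
        return "fallback", parse(text, fallback)


six = "a,b,c,d,e,f\n"
thirteen = ",".join(str(i) for i in range(13)) + "\n"
print(parse_with_fallback(six, SCHEMA_6, SCHEMA_13)[0])
print(parse_with_fallback(thirteen, SCHEMA_6, SCHEMA_13)[0])
```

If further schema variants appear, the fallback could become a list of schemas tried in turn, with files that match none of them collected for manual review.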

How to fix source is empty error in XML source while using Foreach loop container in SSIS 2012?

I have an issue with a very simple task in SSIS 2012.
I have a for-each container that runs in FOR-EACH-FILE Enumerator mode. I want to read a target folder with XML files. The path to the folder is correctly configured. The files field is set to *.xml
The variable mapping is defined with the following variable: User::FileVar, Index 0.
Now I add a simple data flow task inside the container. The dataflow task only has a XML-Data Source task, that's it. For the XML Data source task, the XSD location is set. When I click choose columns, I can see the columns from the XSD schema.
BUT: When I save the XML task , I always get the error message: The Property XMLDataVariable is empty. I tried both data Access modes, XML file from variable and XML data from variable. The error message remains, I cannot run the package.
I don't use any expressions, neither at the foreach loop container nor at the data flow task.
I don't know what's wrong here; I followed the steps exactly as shown in some tutorials for older versions of SSIS.
Do you have any ideas?
The issue is that the XML Source is trying to validate the existence of the given file during the design time. However, you will know the file name only during runtime when the Foreach loop container executes and loops through every XML file available in a given folder.
I recreated an SSIS 2012 package using my answer to another SO question:
SSIS reading multiple xml files from folder
I was able to reproduce the error: The property "XMLDataVariable" on the XML Source was empty.
On the XML source, I set the property ValidateExternalMetadata to False. Setting this to false will force the package not to verify the existence of the xml file path during design time.
I was successfully able to execute the package.
Hope that helps.