"Get XML Data" step of pentaho is not able to read same xml file sometimes - pentaho

I am using the Pentaho Kettle tool for an ETL job. In the job, one of the steps (Get XML Data) is sometimes unable to read or parse an XML file. The same XML file sometimes parses without any exception and sometimes throws one. The errors are listed below:
1) Error on line 1 of document file:///D:/softwares/pdi-ce- : The element type "Confidence" must be terminated by the matching end-tag
2) org.dom4j.DocumentException: Error on line -1 of document : Premature end of file. Nested exception: Premature end of file.
However, I can't find any problem in the XML file. Could anyone help with this?

I didn't find the root cause, but I found a workaround. The XML file being parsed by the step was inside a zip file, and a Java step unzipped the archive before the XML file was parsed. Instead of unzipping the archive, I now parse the XML file directly inside the zip. That resolved the issue, and no error has been reported since.
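
As an illustration of the same idea outside Kettle, here is a minimal Python sketch that parses an XML member straight from a zip archive without extracting it to disk first. The function name parse_xml_from_zip and the member name used in the usage note are hypothetical; this is not how the Get XML Data step is implemented, only a sketch of "read inside the archive instead of unzipping".

```python
import zipfile
import xml.etree.ElementTree as ET


def parse_xml_from_zip(zip_source, member_name):
    """Parse one XML member of a zip archive without extracting it to disk.

    zip_source may be a path or a binary file-like object.
    member_name is the (hypothetical) name of the XML file inside the archive.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # archive.open() returns a readable stream over the compressed member.
        with archive.open(member_name) as member:
            return ET.parse(member).getroot()
```

A call such as parse_xml_from_zip("input.zip", "data.xml") would return the root element, so no intermediate unzipped file exists that another step could read too early or too late.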


What is the wildcard for the File connector file path field in Anypoint Studio and Mule

I am using Anypoint Studio 7 and Mule 4.1.
A product file in CSV format, with a filename that includes the current timestamp, will be added to a directory daily and needs to be processed. To do this we are creating a Mule workflow using the File connector and want to configure the file path field to read only CSV files, regardless of name.
At the moment, the only way I can get it to work is by specifying the filename in the file path field which looks like this:
when I would like to specify some kind of wildcard in the file path similar to this:
but the above does not work.
What is the correct wildcard syntax? Also, is there a way to specify a relative file path instead of an absolute one? When I try a relative path I get an error too.
Error message in logs:
Message : Illegal char <*> at index 108: C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv.
Element : product-files-v1/processors/1 # product-files-v1:product-files-v1.xml:16 (Read File)
Element XML : <file:read doc:name="Read File" doc:id="fdbbf477-e831-4e7c-827c-71efd1d2e538" config-ref="File_Config" path="C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv" outputMimeType="application/csv" outputEncoding="UTF-8"></file:read>
Error type : MULE:UNKNOWN
Root Exception stack trace:
java.nio.file.InvalidPathException: Illegal char <*> at index 108: C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv
Thanks for any help
I assume you need to use a <file:matcher> when you want to filter or read only certain types of files from a directory.
an example would be

ADLA AUs assigned for JSON files

I have a custom Extractor with AtomicFileProcessing set to false. It extracts a large number of JSON files (each line in a file is a JSON document) and outputs two files, one with successful and one with failed requests, both of which contain the JSON rows (more than one AU was allocated for this extraction). The problem is that when I use the same extractor on the files output by the first step, with more than one AU, it fails with the error: Unexpected character encountered while parsing value: e. Path '', line 0, position 0.
If I assign 1 AU on Azure, or run this locally with the AU count set to more than 1, it processes the data successfully. Is this behavior caused by multiple AUs being assigned to a single JSON file, which cannot be parallelized because the format is non-splittable?
You can solve this problem by converting your JSON file to JSON Lines.
Then read the file with a text extractor and use the JsonFunctions available in Microsoft.Analytics.Samples.Formats
to read the JSON.
That transformation makes your file splittable, so its processing can be parallelized.
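
To make the JSON Lines idea concrete, here is a small Python sketch (not U-SQL, and not the extractor from the question). It writes each document on its own line and parses each line independently, which is the property that lets a file be split at line boundaries. The function names to_json_lines and parse_json_lines are hypothetical.

```python
import json


def to_json_lines(documents):
    """Serialize each document on its own line, with no embedded newlines."""
    # json.dumps escapes newlines inside strings, so one document == one line.
    lines = (json.dumps(doc, separators=(",", ":")) for doc in documents)
    return "\n".join(lines) + "\n"


def parse_json_lines(text):
    """Parse each non-empty line independently of the others."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Because every line is a complete document, a reader that starts at any line boundary can parse its portion of the file without seeing the rest.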

SAP DS: Reading an input XML file results in an error

I am using SAP Data Services v. 4.2.
I am trying to read an XML file as input.
I created a new XML Schema from an .xsd file.
When I launch the job I get this error:
2076818752FIL-0522267/25/2017 2:56:35 PM|Data flow DF_FE_XXXX
2076818752FIL-0522267/25/2017 2:56:35 PM<XML file reader->READ MESSAGE XX_INPUT_FILE OUTPUT(XX_INPUT_FILE)> cannot find file location object <%1> in repository.
24736 20092 RUN-050304 7/26/2017 9:18:39 AM Function call <raise_exception ( Error 52226 gestito in Error_handling ) > failed, due to error <50316>
What am I doing wrong?
The problem is in how you identify the file location in the Data File(s) section of your format: BODS assumes you are providing a File Location object and cannot find one in the repository.
See the documentation on "File Locations" for more information.

In Kettle, Text File Input fails to read a CSV file from a tar.gz file. What might be wrong?

I have a CSV file that is tarred and gzipped, so I have test.tar.gz.
I would like to read the CSV file through the Text File Input step.
I tried tar:gz:file://C:/test/test.tar.gz!/test.tar! with a wildcard like ".*\.csv".
But it sometimes fails to read the file.
It throws an exception:
Could not list the contents of
because it is not a folder.
I use Windows 8.1 and PDI 5.2.
What might be wrong?
When reading a CSV from a compressed file, the "Text File Input" step in Pentaho Kettle only supports the first file inside the archive (either Zip or GZip). Check the compression section of the Pentaho Wiki.
For your issue, try removing the wildcard entry, since only the first file inside the zip/gzip file will be read (as explained above).
I have posted sample code that reads both zip and gzip files. Check it here.
Hope it helps :)
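
For comparison outside Kettle, the following Python sketch reads the first CSV member of a gzip-compressed tar archive using the standard tarfile module. It mirrors the "first file only" behaviour described above but says nothing about how Text File Input works internally; the function name read_first_csv is hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
import io
import tarfile


def read_first_csv(tar_gz_stream):
    """Return the rows of the first .csv member in a gzip-compressed tar archive.

    tar_gz_stream is a binary file-like object positioned at the archive start.
    """
    with tarfile.open(fileobj=tar_gz_stream, mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".csv"):
                raw = archive.extractfile(member)
                # Decode as UTF-8 (an assumption) and let csv handle newlines.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                return list(csv.reader(text))
    return []  # no CSV member found
```

With a file on disk, this could be called as read_first_csv(open("test.tar.gz", "rb")).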

Is it possible to handle a NO FILE error in Pig?

I'm trying to load a simple file:
log = load 'file_1.gz' using TextLoader AS (line:chararray);
dump log;
And I get an error:
2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
Is it possible to handle this situation before the error appears?
Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
The error means the input file doesn't exist at the given HDFS path.
log = load 'file_1.gz' using TextLoader AS (line:chararray);
Since you haven't given the absolute path of file_1.gz, it is resolved against the HDFS home directory of the user running your Pig script.
Unfortunately, in the current version of Pig (0.15.0) it is impossible to handle these errors without using UDFs.
I suggest creating a Java or Python script using try and catch to take care of this.
Here's a good website that might be of some use to you: https://wiki.apache.org/pig/PigErrorHandlingInScripts
Good luck learning Pig!
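
A pre-check of the kind suggested above might look like the following Python sketch. It decides whether to run or skip the load depending on whether the input pattern matches anything. Note the assumptions: glob here checks the local filesystem only as a stand-in, so on a real cluster the check would have to query HDFS instead, and the names has_input and choose_action are hypothetical.

```python
import glob


def has_input(pattern):
    """Return True if at least one path matches the glob pattern.

    Assumption: this checks the local filesystem; an HDFS path would need
    an HDFS-aware check in place of glob.
    """
    return len(glob.glob(pattern)) > 0


def choose_action(pattern):
    """Decide whether to run the load or skip it, instead of letting it fail."""
    return "run" if has_input(pattern) else "skip"
```

A wrapper script could call choose_action with the same pattern the Pig script loads and launch Pig only when the result is "run".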
I'm facing this issue as well. My load command is:
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (...);
I want to load all files from the data subfolders, but the data folder was empty and I got the same error as you. What I did, in my particular case, was to create an empty folder in the data directory. The LOAD then returns an empty dataset and the script does not fail.
By the way, I'm using an Oozie workflow to run the scripts, and I create the empty folders in the prepare section.