ADLA AUs assigned for JSON files - azure-data-lake

I have a custom Extractor with AtomicFileProcessing set to false. It extracts a large no of JSON files (each line in the file is a JSON document) and output two files with successful and failed requests, both of them contains the json rows (AUs allocated more than 1 to extract the files). Problem is when I use the same extractor to extract the outputted files in first step with more than one AU, it fails with the error, Unexpected character encountered while parsing value: e. Path '', line 0, position 0.
If I assign 1 AU on Azure or run this locally with AU set to more than 1, it successfully processes the data. Is this behavior because of more AU provided to process a single JSON file and since the file is in non-splittable format, it can't be parallelized?

you can solve this problem converting your json file to Jsonlines.
http://jsonlines.org/examples/
Then you need to read the file using text extractor and use JsonFunctions available on Microsoft.Analytics.Samples.Formats
to read the json.
That transformation will make your file splittable and you can parallelized it!

Related

Aws s3 batch operation error: Task target couldn't be URL decoded

I need to restore a lot of object from aws s3 glacier deep archive. So i try to use a s3 batch jobs. For that i use a python code to create a manifest as a csv with to columns Bucket,Key.
But my first issue : some Key contain a comma so the job failed.
To solve (partialy) this issue i just cut the csv file to keep only the first two columns hoping that there are not many files involved.
But now i have another issue:
ErrorMessage: Task target couldn't be URL decoded
Any Idea ?
As mentioned on https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the manifest CSV file must be URL encoded. The , character in a key name gets converted to %2C with URL encoding so the resulting file will be valid CSV even with commas in the key name

Load compressed data from Amazon S3 to Postgres using datastage

I am trying to load data which is stored in .gz format in S3 to PostgreSQL server using Datastage. I am using the ODBC connector on the target (database) side. I am able to load uncompressed data from S3 to PostgreSQL but no luck with compressed data so far. I have tried the Expand Stage but it's not helping or I am not doing the right thing. Without the "Expand" the data is coming but it is trying to read the compressed data, while doing so it fails and throws an error:
Amazon_S3_0,1: com.ascential.e2.common.CC_Exception: Failed to initialize the parser: The row delimiter was not found within the first 132 bytes of the file. Ensure that the Row delimiter property matches the row delimiter of the file.
at com.ibm.iis.cc.cloud.CloudLogger.createCCException (CloudLogger.java: 196)
at com.ibm.iis.cc.cloud.CloudStage.processReadAndParse (CloudStage.java: 1591)
at com.ibm.iis.cc.cloud.CloudStage.process (CloudStage.java: 680)
at com.ibm.is.cc.javastage.connector.CC_JavaAdapter.run (CC_JavaAdapter.java: 443)
Amazon_S3_0,1: Failed to initialize the parser: The row delimiter was not found within the first 132 bytes of the file. Ensure that the Row delimiter property matches the row delimiter of the file. (com.ibm.iis.cc.cloud.CloudLogger::createCCException, file CloudLogger.java, line 196)
If someone has come across this, please share your valuable inputs.

Split CSV file in records and save as a csv file format - Apache NIFI

What I want to do is the following...
I want to divide the input file into registers, convert each record into a
file and leave all the files in a directory.
My .csv file has the following structure:
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
ERP,FRANK,DIETSCH,5064 E METAIRIE AVE.,BRANDSVILLA,MO,65687,252-5592,1176 E THAYER ST.,COLUMBIA,MO,65215,557,291-9571,217-38-5525,129-10-0407,1/13/35,M,
As you can see it doesn't have Header row.
Here is my flow.
My problem is that when the Split Proccessor divides my csv into flows with 400 lines, it isn't save in my output directory.
It's first time using NIFI, sorry.
Make sure your RecordReader controller service is configured correctly(delimiter..etc) to read the incoming flowfile.
Records per split value as 1
You need to use UpdateAttribute processor before PutFile processor to change the filename to unique value (like UUID) unless if you are configured PutFile processor Conflict Resolution strategy as Ignore
The reason behind changing filename is SplitRecord processor is going to have same filename for all the splitted flowfiles.
Flow:
I tried your case and flow worked as expected, Use this template for your reference and upload to your NiFi instance, Make changes as per your requirements.

CSV to CSV datamapper in mule

I am trying to transform one csv file to another one using mule.
But how I want is for example I have 4 header in the source csv file,
heade1, header2, header3, header4
And client may pass only first 3 header and its value in the csv file. I am getting error if mule datamapper does not find all the header in source csv.
Parsing error: Unexpected end of file in record 1, field 2 ("test2"),
metadata "headertest"; value: '<Raw record data is not available,
please turn on verbose mode.>'
How can I set the datamapper to work if source file does not contains all the header/values
I couldn't find a clean way to do that yet, but you could add a pre process step that adds a field separator at the end of each line in the input csv (i.e. add a comma at the end of each line).
This way the last field will be assumed empty.
HTH,
Marcos

Open a .cfile from rtl_sdr after convert with GNU Radio

I have a binary file (capture.bin) from the rtl_sdr tool. I convert it to a .cfile with this manual http://sdr.osmocom.org/trac/wiki/rtl-sdr#Usingthedata
Where can I get the data in this file? The goal is to get a numerical format output from the the source. Is this possible?
That actually is covered by a GNU Radio FAQ entry.
What is the file format of a file_sink? How can I read files produced by a file sink?
All files are in pure binary format. Just bits. That’s it. A floating point data stream is saved as 32 bits in the file, one after the other. A complex signal has 32 bits for the real part and 32 bits for the imaginary part. Reading back a complex number means reading in 32 bits, saving that to the real part of a complex data structure, and then reading in the next 32 bits as the imaginary part of the data structure. And just keep reading the data.
Take a look at the Octave and Python files in gr-utils for reading in data using Octave and Python’s Scipy module.
The exception to the format is when using the metadata file format. These files are produced by the File Meta Sink: http://gnuradio.org/doc/doxygen/classgr_1_1blocks_1_1file__meta__sink.html block and read by the File Meta Source block. >See the manual page on the metadata file format for more information about how to deal with these files.
A one-line Python command to read the entire file into a numpy array is:
f = scipy.fromfile(open("filename"), dtype=scipy.uint8)
Replace the dtype with scipy.int16, scipy.int32, scipy.float32, scipy.complex64 or >whatever type you were using.
Update
scipy.fromfile will be deprecated in v2.0 so instead use numpy library
f = numpy.fromfile(open("filename"), dtype=numpy.uint8)