ND-JSON Split in SFTP - mule

I have a large ND-JSON file in SFTP (~20K lines). Is there a way to generate sub-files from it (~500 lines each) and place them in another folder in SFTP?
Does Mule 4 have the capability to split a large file and write it to SFTP, or is a Java component needed?
Please advise.

If the input file is parsed as NDJSON, you can use the DataWeave function divideBy() to separate the array read from the file into subarrays of n elements.
Example:
%dw 2.0
output application/java
import * from dw::core::Arrays
---
payload divideBy 500
Then you should be able to use a For Each scope to process each segment and write an NDJSON file inside it, as sketched below.
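A minimal sketch of that flow, assuming the file was read with the application/x-ndjson MIME type (so the payload is an array); the config name, output path, and segment size here are illustrative:
<foreach doc:name="For Each" collection="#[dw::core::Arrays::divideBy(payload, 500)]">
    <sftp:write config-ref="SFTP_Config" path="#['split/part-' ++ (vars.counter as String) ++ '.ndjson']">
        <!-- payload here is one 500-element segment; write it out as NDJSON -->
        <sftp:content><![CDATA[#[output application/x-ndjson --- payload]]]></sftp:content>
    </sftp:write>
</foreach>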

Related

Find the size of a multipart/form-data file in mule4

I have an HTTP Listener which will read the file in multipart/form-data format. I want to find the size of that file and check whether it is less than or greater than 10 MB. Please help me with this.
You can use the following:
%dw 2.0
output application/java
---
sizeOf(payload.parts.file.content.^raw)
Where file is the key (the part name) of what is being sent.
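For the 10 MB check itself you can compare that size directly; a small sketch, assuming file is the part name and a 1024-based megabyte:
%dw 2.0
output application/java
---
sizeOf(payload.parts.file.content.^raw) <= 10 * 1024 * 1024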

How do I use sftp:content to write an sftp message from memory

I am trying to create an SFTP file from memory using sftp:write and sftp:content. My DataWeave code is:
<sftp:write doc:name="sftp from memory" doc:id="01bee2a1-69ad-4194-8ec8-c12852521e87" config-ref="SFTP_Config" path="#[vars.sftpFileName]" createParentDirectories="false">
<sftp:content><![CDATA[%dw 2.0
output application/csv
---
payload.toplevel.secondlevel.bottomlevel[0],
payload.more.andmore.andstillmore[0]
]]>
</sftp:content>
</sftp:write>
It does create a file in the correct directory, but the contents are not the payload values. Instead, they are the actual DataWeave code. The file contents are:
%dw 2.0
output application/csv
---
payload.toplevel.secondlevel.bottomlevel[0]
payload.more.andmore.andstillmore[0]
I am using version 4.2.2 of the Mule Server and 1.3.2 of the SFTP component.
You aren't actually passing DataWeave; you're passing a string. Press the fx button on a field when you're going to use DataWeave in it. The XML will look like the example below; notice the extra #[, which indicates the content is DataWeave. Your DataWeave is also invalid: you must output an object or an array of objects. To make your output an object, wrap it in { .. } just like JSON and use key-value pairs. When writing this out as CSV, the keys are used as a header row unless you include header=false in the output line: https://docs.mulesoft.com/mule-runtime/4.3/dataweave-formats-csv#writer_properties
<sftp:write doc:name="sftp from memory" doc:id="01bee2a1-69ad-4194-8ec8-c12852521e87" config-ref="SFTP_Config" path="#[vars.sftpFileName]" createParentDirectories="false">
<sftp:content><![CDATA[#[%dw 2.0
output application/csv
---
{
someKeyName: payload.toplevel.secondlevel.bottomlevel[0],
someOtherKeyName: payload.more.andmore.andstillmore[0]
}]]]>
</sftp:content>
</sftp:write>
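If you don't want that header row in the generated file, the output directive inside the same expression would read (per the writer properties page linked above):
output application/csv header=false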

pandas.read_csv of a gzip file within a zipped directory

I would like to use pandas.read_csv to open a gzip file (.asc.gz) within a zipped directory (.zip). Is there an easy way to do this?
This code doesn't work:
csv = pd.read_csv(r'C:\folder.zip\file.asc.gz')  # can't find the file
This code does work (however, it requires me to unzip the folder, which I want to avoid because my dataset currently contains thousands of zipped folders):
csv = pd.read_csv(r'C:\folder\file.asc.gz')
Is there an easy way to do this? I have tried using a combination of zipfile.ZipFile and read_csv, but have been unsuccessful (I think partly due to the fact that this is an ASCII file as well).
Maybe the following will help.
import pandas as pd

df = pd.read_csv('filename.gz', compression='gzip')
OR
import gzip

with gzip.open('filename.gz', 'rb') as f:
    content = f.read()
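Neither snippet reads inside the outer .zip, though. A minimal sketch for that case, assuming the inner member's name is known (C:\folder.zip and file.asc.gz are just the paths from the question):
import io
import zipfile

import pandas as pd

# Open the outer .zip without extracting it, pull the inner .asc.gz member into
# memory, and let pandas handle the gzip decompression explicitly.
with zipfile.ZipFile(r'C:\folder.zip') as zf:
    with zf.open('file.asc.gz') as member:
        csv = pd.read_csv(io.BytesIO(member.read()), compression='gzip')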

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (e.g. 142.csv, 278.csv, etc.). Not all the numbers between 100 and 400 are associated with a file; for example, there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
You can use glob to list all the csv files in the directory and pick from that list.
import glob

import numpy as np
import pandas as pd

files = glob.glob('*.csv')
random_files = np.random.choice(files, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
This chooses 5 random files from the directory (replace=False keeps them distinct) and reads each one separately.
Hope this answers your question.

Avoiding multiple headers in pig output files

We use Pig to load files from directories containing thousands of files, transform them, and then output files that are a consolidation of the input.
We've noticed that the output files contain the header record of every file processed, i.e. the header appears multiple times in each file.
Is there any way to have the header only once per output file?
raw_data = LOAD '$INPUT'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
-- ... some transforms ...
STORE data INTO '$OUTPUT'
USING org.apache.pig.piggybank.storage.CSVExcelStorage('|');
Did you try the SKIP_INPUT_HEADER option of CSVExcelStorage?
See https://github.com/apache/pig/blob/31278ce56a18f821e9c98c800bef5e11e5396a69/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java#L85
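A sketch of how it could be passed, assuming the four-argument CSVExcelStorage constructor (delimiter, multiline treatment, EOL treatment, header treatment); check the argument values against the linked source:
raw_data = LOAD '$INPUT'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
-- ... transforms ...
STORE data INTO '$OUTPUT'
USING org.apache.pig.piggybank.storage.CSVExcelStorage('|', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
With SKIP_INPUT_HEADER the header line of every input file is dropped, and WRITE_OUTPUT_HEADER writes a single header per output part file.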