How to Zip files on Storage with ADF

How to Zip files on Storage with ADF - azure-storage

Would there be a way to Zip/copy files on a blob-storage into another storage/folder but with a zipped result ?
You can deflate with ADF, but can you enflate ?
We receive +100k files a day, the next day I would like to zip them all into a archive folder.
Couldn't find a way ...
So any help/.hints appreciated.

ADF supports zipdeflate format, which can archive multiple files into a single zip file. Refer to this doc: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support.

Related

Create a ADF Dataset to load multiple csv files (same format) from the Blob

I try to create a dataset containing multiple csv files from the Blob. In the file path of dataset setting: I create a parameter - #dataset().FolderName and add FolderName in the Parameters.
I leave file (from File Path) empty as I want to grab all files in the folder. However, there is no data when I preview data. Is there anything missing? Thank you

I have tested it on my side and it can work fine.
add FolderName parameter
preview data
If you want to merge all csv files in Data Flow, you can do this:
1.output to single file
2.set Single partition

Pandas to_csv with ZIP compresses whole directory

df.to_csv("/path/to/destination.zip", compression="zip")
The above line will generate a file called destination.zip in the directory /path/to/.
Decompressing the ZIP file, will result in a directory structure path/to/destination.zip where destination.zip is the CSV file.
Why is the path/to/ folder structure included in the compressed file? Is there any way to avoid this?
Was blown away by this, currently writing the ZIP locally (destination.zip) and using os.rename to move it to the desired location.. Is this a bug ?

pandas.read_csv of a gzip file within a zipped directory

I would like to use pandas.read_csv to open a gzip file (.asc.gz) within a zipped directory (.zip). Is there an easy way to do this?
This code doesn't work:
csv = pd.read_csv(r'C:\folder.zip\file.asc.gz') // can't find the file
This code does work (however, it requires me to unzip the folder, which I want to avoid because my dataset currently contains thousands of zipped folders):
csv = pd.read_csv(r'C:\folder\file.asc.gz')
Is there an easy way to do this? I have tried using a combination of zipfile.Zipfile and read_csv, but have been unsuccessful (I think partly due to the fact that this is an ascii file as well)

Maybe the followings might help.
df = pd.read_csv('filename.gz', compression='gzip')
OR
import gzip
file=gzip.open('filename.gz','rb')
content=file.read()

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging is there a way to remove the input files? I'd like to check that the file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so that ideally I'd remove only the ones in the Files variable.

With Pig it is not possible i guess. Instead what you can do is use -tagsource with the LOAD statement and get the filename and stored it somewhere. Then use HDFS FileSystem API and read from the stored file to remove those files which are merged by pig.
A = LOAD '/path/' using PigStorage('delimiter','-tagsource');

You should be able to use hadoop commands in your Pig script
Move input files to a new folder
Merge input files to output folder
Remove input files from the new folder
distcp 'hdfs/input/path' 'hdfs/input/new_path'
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
rmdir 'hdfs/input/new_path'

In kettle use text file input read csv file from a tar.gz file but it didn't worked. Where it might be wrong?

I have a csv file that is tared and zipped. So I have test.tar.gz.
I would like, through text file input, read csv file.
I try this tar:gz:file://C:/test/test.tar.gz!/test.tar! use wildcard like ".*\.csv".
But it sometime can't read success.
It throws Exception
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I use windows8.1, pdi 5.2
Where it might be wrong?

For a compressed file csv reading, "Text File Input" step in Pentaho Kettle only supports the first files inside the compressed folder(either in Zip/GZip file). Check the Pentaho Wiki in the compression section.
Now for your issue, try removing the wildcard entry since only the first file inside the zip/gzip file will be read. (as explained above)
I have placed a sample code containing both reading zip and gzip files. Check it here.
Hope it helps :)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to Zip files on Storage with ADF - azure-storage

ADF supports zipdeflate format, which can archive multiple files into a single zip file. Refer to this doc: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support.

Related

Create a ADF Dataset to load multiple csv files (same format) from the Blob

Pandas to_csv with ZIP compresses whole directory

pandas.read_csv of a gzip file within a zipped directory

Remove files with Pig script after merging them

In kettle use text file input read csv file from a tar.gz file but it didn't worked. Where it might be wrong?

Categories

Resources