I've got a large number of files inside a .tar.gz archive.
Is it (programmatically) possible to extract a file by its filename, without the overhead of decompressing other files?
I'll split the reply into two parts.
Is it (programmatically) possible to extract a file by its filename
Yes, it is possible to extract a file by its filename:
tar xzf tarfile.tar.gz filename
without the overhead of decompressing other files?
In order to extract a file from a compressed tar file, the tar program has to find the file you want. If that is the first file in the tarfile, it only has to uncompress that one. If the file isn't the first in the tarfile, the tar program needs to scan through the tarfile until it finds the file you want. To do that it MUST uncompress the preceding files in the tarfile. That doesn't mean it has to extract them to disk or buffer those files in memory: it streams the decompression, so the memory overhead isn't significant.
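The same can be done programmatically with Python's standard tarfile module; a minimal sketch, assuming the archive and member names from the command above:

import tarfile

# Open the gzip-compressed archive and pull out a single member by name.
# Nothing is extracted to disk; the member's bytes are read from the stream,
# and locating the member still means reading through the preceding entries.
with tarfile.open("tarfile.tar.gz", "r:gz") as tar:
    member = tar.getmember("filename")      # raises KeyError if it isn't there
    with tar.extractfile(member) as f:
        data = f.read()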
Related
df.to_csv("/path/to/destination.zip", compression="zip")
The above line will generate a file called destination.zip in the directory /path/to/.
Decompressing the ZIP file results in a directory structure path/to/destination.zip, where destination.zip is the CSV file.
Why is the path/to/ folder structure included in the compressed file? Is there any way to avoid this?
I was blown away by this; currently I'm writing the ZIP locally (destination.zip) and using os.rename to move it to the desired location. Is this a bug?
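For what it's worth, a minimal sketch of that workaround (the data and paths are just examples): write the archive with a bare file name so no directory prefix ends up inside it, then move it into place.

import os
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})                        # placeholder data
df.to_csv("destination.zip", compression="zip")            # bare name: no path inside the archive
os.rename("destination.zip", "/path/to/destination.zip")   # then move it to the target directory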
I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python (especially into pandas and Dask dataframes)? Should I change the compression format to something else?
I believe you should be able to open the file using
import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)
This is, strictly speaking, meant for the xz file format, but may work for 7z also. If not, you will need to use libarchive.
For use with Dask, you can do the above for each file with dask.delayed.
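A minimal sketch of that per-file delayed approach, carrying over the lzma assumption from above (the file names are just examples):

import lzma
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def load_one(path):
    # assumes the contents are xz/lzma-readable, as discussed above
    with lzma.open(path, "rb") as f:
        return pd.read_csv(f)

files = ["myfiles.1.7z", "myfiles.2.7z"]              # example paths
df = dd.from_delayed([load_one(p) for p in files])    # one partition per file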
dd.read_csv directly also allows you to specify storage_options={'compression': 'xz'}; however, random access within a file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
df = dd.read_csv('myfiles.*.7z', storage_options={'compression': 'xz'},
                 blocksize=None)
I would like to know if there is a way to calculate MD5 hashes of files contained in a zip archive.
For example, I have a zip file that contains three files: Prizes.dat, Promotions.dat and OutOfDate.dat, and I would like to calculate the MD5 of each of the three files to compare it with a given string. Since I need to do this on a very large number of zip archives, I'm wondering if there's a way to do this directly, without decompressing the files.
Thanks in advance!
Stumbled upon this need and discovered a way to check the hash of a file contained in a tarball without writing the uncompressed data to disk (unzipping).
BSD example below, hence the md5:
tar xOfz archive.tgz foo.txt | md5
tar xOfj archive.bz2 foo.txt | md5
Or use tar xOfz archive.tgz foo.txt | md5sum for Linux.
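For the ZIP case the question actually asks about, here is a minimal Python sketch that streams a member through hashlib without writing anything to disk (the archive and member names are just examples):

import hashlib
import zipfile

def member_md5(zip_path, member, chunk_size=65536):
    md5 = hashlib.md5()
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:            # decompressed stream, never touches disk
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
    return md5.hexdigest()

print(member_md5("archive.zip", "Prizes.dat"))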
I think the simplest solution is to calculate the MD5 hash of each file and store it in the zip archive alongside the file. If you are generating these archives yourself, you can just hash each file before you zip it. If you are receiving the ZIP files from somewhere else, write a script that automates going through all the archives and adding the hashes. Then, whenever you need to check a hash in your program, you can just pull the precomputed hash from the ZIP file.
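A minimal sketch of that precompute-and-store idea (the helper and the .md5 naming are my own, not a standard):

import hashlib
import zipfile

def add_with_hash(zip_path, file_path):
    # hash the original file before it goes into the archive
    with open(file_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with zipfile.ZipFile(zip_path, "a") as zf:
        zf.write(file_path)
        zf.writestr(file_path + ".md5", digest)  # sidecar entry holding the hash

Later, read the .md5 entry back instead of rehashing the data.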
I would like the output file to overwrite the input file under the same name, owing to the limited disk space I have on my system. Is this possible? I know this is not recommended, but I already have the input files backed up. I will run the cut command in a shell loop:
#!/bin/bash
for i in {1..1000}
do
    cut --delimiter=' ' --fields=1,3-7 input$i.txt > input$i.txt
done
You could always redirect to a temporary file and then, once you're sure everything went fine, rename it over the original file.
Some GNU utilities (such as sed) have a -i option that lets you change a file in place. Most of the file filtering and editing that cut does can also be done with sed.
The shell parses the command and handles the redirections first. When it sees "> afile" it truncates "afile" and opens it for writing. Your data is now destroyed. Then the shell hands the filename to cut, which now has nothing to read.
This is how I learned:
some | pipeline < my_file > my_file.tmp
ln my_file my_file.bak # this is a hard link
mv my_file.tmp my_file
That keeps the original data in place for as long as possible.
If you're having disk space issues, you will have to read the input file into memory entirely.
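If you go that route from Python rather than in the shell, a minimal sketch of the read-everything-first approach, mirroring the cut --fields=1,3-7 selection above (the file name is an example):

path = "input1.txt"                       # example name

with open(path) as f:
    lines = f.readlines()                 # the whole file now lives in memory

with open(path, "w") as f:                # only now is the original truncated
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        kept = [fields[0]] + fields[2:7]  # fields 1 and 3-7, 1-based like cut
        f.write(" ".join(kept) + "\n")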
In case of very limited disk space (a disk quota), you could try placing a compressed copy of the source file in RAM (/dev/shm) and using that as the source, uncompressing it to stdout and piping that into your script.
I have a program that takes one argument for the source file and then parses it. I have several gzipped files that I would like to parse, but since the program only takes one input, I'm wondering if there is a way to create one huge gzip file and pipe it into that single input.
Use zcat: you can provide it with multiple input files, and it will de-gzip them and concatenate them just like cat would. If your parser supports piped input on stdin, you can pipe zcat's output directly into it; otherwise, redirect the output to a file and then invoke your parser on that file.
If the program actually expects a gzip'd file, then just pipe the output from zcat to gzip to recompress the combined file into a single gzip'd archive.
http://www.mkssoftware.com/docs/man1/zcat.1.asp
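If you'd rather do the combining step from Python than the shell, a minimal sketch that mirrors zcat ... | gzip, streaming each file so nothing is held in memory (the file names are just examples):

import gzip
import shutil

inputs = ["part1.gz", "part2.gz", "part3.gz"]     # example inputs

with gzip.open("combined.gz", "wb") as out:
    for name in inputs:
        with gzip.open(name, "rb") as f:
            shutil.copyfileobj(f, out)            # decompress and recompress as one stream

The resulting combined.gz can then be handed to the parser as its single input.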