Utilizing multiple cores for tar+gzip/bzip2 compression/decompression

I normally compress using tar zcvf and decompress using tar zxvf (using gzip out of habit).
I recently got a quad-core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores sit idle during compression/decompression.
Is there any way I can put the unused cores to work to make this faster?

You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.
For example use:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip

You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:
tar cf - paths-to-archive | pigz > archive.tar.gz
By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
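Decompression works the same way in reverse. Note that pigz's decompression itself runs on a single thread (the extra threads only handle reading, writing, and checksum calculation), so the speedup is smaller than for compression:
pigz -dc archive.tar.gz | tar xf -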

Common approach
There is an option for the tar program:
-I, --use-compress-program PROG
filter through PROG (must accept -d)
You can use a multithreaded version of an archiver or compressor utility.
The most popular multithreaded compressors are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:
$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive
The compressor must accept -d. If your replacement utility lacks this option and/or you need to specify additional parameters, then use pipes (add parameters if necessary):
$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz
The input and output of the single-threaded and multithreaded versions are compatible: you can compress with the multithreaded version and decompress with the single-threaded version, and vice versa.
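For the pipe-based variants, decompression looks like this. One caveat: pbzip2 can only decompress in parallel files that pbzip2 itself (or another multi-stream tool) produced; archives made by plain bzip2 fall back to a single core:
$ pbzip2 -dc OUTPUT_FILE.tar.bz2 | tar xf -
$ pigz -dc OUTPUT_FILE.tar.gz | tar xf -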
p7zip
For compression with p7zip you need a small shell script like the following:
#!/bin/sh
# tar invokes this helper with -d for decompression and no arguments for compression
case $1 in
  -d) 7za -txz -si -so e;;   # extract: decode the xz stream from stdin to stdout
   *) 7za -txz -si -so a .;; # add: compress stdin to an xz stream on stdout
esac 2>/dev/null
Save it as 7zhelper.sh and make it executable (chmod +x 7zhelper.sh). Here is an example of usage:
$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z
xz
Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can utilize multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").
This is a fragment of the man page for version 5.1.0alpha:
Multithreaded compression and decompression are not implemented yet, so this
option has no effect for now.
However, this will not work for decompression of files that haven't also
been compressed with threading enabled. From the man page for version 5.2.2:
Threaded decompression hasn't been implemented yet. It will only work
on files that contain multiple blocks with size information in
block headers. All files compressed in multi-threaded mode meet this
condition, but files compressed in single-threaded mode don't even if
--block-size=size is used.
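Putting this together, a sketch of multithreaded xz with tar (assuming XZ Utils 5.2.0+; -T 0 means one thread per core):
$ XZ_DEFAULTS="-T 0" tar -cJf OUTPUT_FILE.tar.xz paths_to_archive
$ tar -I 'xz -T0' -cf OUTPUT_FILE.tar.xz paths_to_archive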
Recompiling with replacement
If you build tar from source, you can recompile with the parameters
--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip
After recompiling tar with these options you can check the output of tar's help:
$ tar --help | grep "lbzip2\|plzip\|pigz"
-j, --bzip2 filter the archive through lbzip2
--lzip filter the archive through plzip
-z, --gzip, --gunzip, --ungzip filter the archive through pigz

You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:
tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/

If you want to have more flexibility with filenames and compression options, you can use:
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s#/my/path/##g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz
Step 1: find
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec
This command looks for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. Add as many -o -name "pattern" clauses as you want inside the escaped parentheses. The parentheses matter: without them, -o splits the expression and -exec would apply only to the *.log files.
-exec will execute the next command using the results of find: tar
Step 2: tar
tar -P --transform='s#/my/path/##g' -cf - {} +
--transform is a simple string-replacement parameter. It strips the path of the files from the archive, so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory, as you'd lose the benefit of find: all files of the directory would be included.
-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.
-cf - tells tar to write the archive to standard output; we redirect it into the final file after pigz.
{} + substitutes all the files that find found, passing as many as possible to each tar invocation.
Step 3: pigz
pigz -9 -p 4
Use as many parameters as you want.
In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression.
If you run this on a heavy loaded webserver, you probably don't want to use all available cores.
Step 4: archive name
> myarchive.tar.gz
Finally, the shell redirects pigz's compressed output into the archive file.
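Since --transform stripped the leading path, the archive unpacks into whatever directory you extract from; a minimal extraction sketch (the directory name is just an example):
mkdir extracted && cd extracted
pigz -dc ../myarchive.tar.gz | tar -xf -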

A relatively new (de)compression tool you might want to consider is Zstandard (zstd). It does an excellent job of utilizing spare cores, and it makes great trade-offs between compression ratio and (de)compression speed. It is also highly tunable depending on your compression-ratio needs.

Here is an example for tar with the modern zstd compressor, since good examples for this one were hard to find:
apt command to install the zstd and pv utilities on Ubuntu
Compress multiple files and folders (zstd command alone can only do single files)
Display progress using pv - shows the total bytes compressed and compression speed GB/sec real-time
Use all physical cores with -T0
Set compression level higher than the default with -8
Display the resulting wall clock and CPU time used after the operation is finished using time
apt install zstd pv
DATA_DIR=/path/to/my/folder/to/compress
TARGET=/path/to/my/archive.tar.zst
time (cd "$DATA_DIR" && tar -cf - * | pv | zstd -T0 -8 -o "$TARGET")
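Decompression is symmetric; a sketch along the same lines (RESTORE_DIR is a placeholder path):
RESTORE_DIR=/path/to/restore/into
mkdir -p "$RESTORE_DIR"
time (zstd -dc "$TARGET" | pv | tar -xf - -C "$RESTORE_DIR")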

Related

How do I transfer small files quickly over the network with zstd?

As the question states, I want to back up many small files and send them via ssh to a destination. Does rsync speed things up significantly vs tar?
This works quite well, significantly faster than gzip.
Push (Upload)
tar -c --zstd src_dir | ssh user@dest_addr "cd dest_dir && tar -x --zstd"
This does the following
Creates a tar file using Zstd and outputs it via STDOUT
Connects via ssh, piping STDOUT over the network
Reads data from STDIN, and extracts it
Custom zstd flags
This uses maximum compression (default level is 3) and multithreading.
tar -c -I "zstd -19 -T0" src_dir | ssh user#dest_addr "cd dest_dir && tar -x --zstd"
With progress
tar -c --zstd src_dir | pv --timer --rate | ssh user@dst_addr "cd dest_dir && tar -x --zstd"
Pull (Download)
ssh user@dest_addr "tar --zstd -cf - src_dir" | tar -x --zstd --directory dest_dir

GitLab tee: collapsed multi-line command & unrecognized option: append

Inside my GitLab CI file I have a snippet copied from the "Publish npm packages" instructions:
before_script:
  - |
    {
      echo "#${CI_PROJECT_ROOT_NAMESPACE}:registry=${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/npm/"
      echo "${CI_API_V4_URL#https?}/projects/${CI_PROJECT_ID}/packages/npm/:_authToken=\${CI_JOB_TOKEN}"
    } | tee --append .npmrc
When I try to run this on Alpine Linux, I get:
$ { # collapsed multi-line command
tee: unrecognized option: append
BusyBox v1.31.1 () multi-call binary.
Usage: tee [-ai] [FILE]...
Copy stdin to each FILE, and also to stdout
-a Append to the given FILEs, don't overwrite
-i Ignore interrupt signals (SIGINT)
The reason is simple: there are two implementations of tee.
BusyBox, which Alpine uses, has tee, but it doesn't provide --append. It does provide an -a option (short for append), defined as "append to the given FILEs, do not overwrite".
GNU coreutils provides tee too; it has the --append option you're using here, likewise defined as "append to the given FILEs, do not overwrite". As a shorthand, GNU tee also provides the alias -a.
So in short: if you want something compatible with Alpine/BusyBox as well as distros that ship GNU tee, use -a (supported by both) instead of --append (supported only by GNU tee).
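Applied to the snippet above, the portable version simply swaps the flag:
before_script:
  - |
    {
      echo "#${CI_PROJECT_ROOT_NAMESPACE}:registry=${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/npm/"
      echo "${CI_API_V4_URL#https?}/projects/${CI_PROJECT_ID}/packages/npm/:_authToken=\${CI_JOB_TOKEN}"
    } | tee -a .npmrc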

How to launch openbios from Qemu

Good day,
So I am following this coreboot v3 + OpenBIOS tutorial here.
In the instructions I have the following...
mkdir foo
cd foo
wget http://www.coreboot.org/images/9/9d/Qemu_coreboot_openbios.zip
wget http://www.coreboot.org/images/0/0d/Vgabios-cirrus.zip
unzip Qemu_coreboot_openbios.zip
unzip Vgabios-cirrus.zip
mv qemu_coreboot_openbios.bin bios.bin
cd ..
qemu -L foo -hda /dev/zero -serial stdio
I noticed that qemu has been replaced by the qemu-system-* binaries.
The command I am running:
qemu-x86_64 -L foo -hda /dev/zero -serial stdio
When I run the command, I see qemu boot as usual and fail to find a disk (which I expect, since the disk switch points to /dev/zero), but none of the payloads run as I would expect from the tutorial.
What am I doing incorrectly?
Should I use a different version of qemu?
Should I create a dummy disk for this?
Qemu seems to be ignoring the files in the foo directory.
The examples are not up to date, as you have noticed by the renaming of qemu to qemu-system-x86_64.
I managed to get the examples to work using only the cirrus video card, and by renaming the extracted BIOS .bin file to bios-256k.bin. I did this because the -L option specifies the BIOS location, and qemu looks for a file called bios-256k.bin as the BIOS. The command to run the BIOS with cirrus (all done while in the foo directory) was
qemu-system-x86_64 -L . -vga cirrus -serial stdio
Both machine types pc and q35 worked.
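Putting those steps together, the working sequence looks like this (a sketch, assuming the zip extracts a file named qemu_coreboot_openbios.bin as in the tutorial):
cd foo
mv qemu_coreboot_openbios.bin bios-256k.bin   # qemu searches the -L directory for this exact name
qemu-system-x86_64 -L . -vga cirrus -serial stdio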

parallel : how to pass options to commands

For parallelizing gzip compression:
parallel gzip ::: myfile_*
does the job but how to pass gzip options such as -r or -9
I tried parallel gzip -r -9 ::: myfile_* and parallel gzip ::: 9 r myfile_*,
but neither works.
when I tried parallel "gzip -9 -r" ::: myfile_*
I get this error message :
gzip: compressed data not written to a terminal. Use -f to force compression
Also the -r switch for recursively adding directories is not working.
....
Similarly for other commands: how do you pass options when using parallel?
You have the correct syntax:
parallel gzip -r -9 ::: myfile_*
So something else is wrong. What is the output of
parallel --version
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
(I don't think this question belongs here. Maybe superuser.com?)
parallel gzip -r -9 ::: * worked fine for me, going into directories and all. I am using parallel version 20130622.
Note that with this approach each directory becomes a single task. You may instead want to pipe the output of find into parallel so that each file is compressed as a separate job, as shown below.
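For example, a sketch that compresses every matching file as its own job (using -print0 and -0 so unusual filenames are handled safely):
find . -type f -name 'myfile_*' -print0 | parallel -0 gzip -9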
Have you tried the --gnu flag for parallel?
parallel -j+0 --gnu "command" ...
On some systems (like Ubuntu) GNU mode is disabled by default.

is it possible to take a large number of files & tar/gzip and stream them on-the-fly?

I have a large number of files which I need to back up; the problem is there isn't enough disk space to create a tar file of them and then upload it offsite. Is there a way of using python, php or perl to tar up a set of files and upload them on-the-fly without making a tar file on disk? They are also way too large to store in memory.
I always do this just via ssh:
tar czf - FILES/* | ssh me@someplace "tar xzf -"
This way, the files end up all unpacked on the other machine. Alternatively
tar czf - FILES/* | ssh me@someplace "cat > foo.tgz"
Puts them in an archive on the other machine, which is what you actually wanted.
You can pipe the output of tar over ssh:
tar zcvf - testdir/ | ssh user@domain.com "cat > testdir.tar.gz"
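To tie this back to the multi-core theme of this thread, you can swap gzip for pigz in the pipeline so the streaming compression uses all cores (a sketch, assuming pigz is installed on the sending machine):
tar cf - testdir/ | pigz | ssh user@domain.com "cat > testdir.tar.gz"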