parallel : how to pass options to commands - gzip

For parallelizing gzip compression:
parallel gzip ::: myfile_*
does the job, but how do I pass gzip options such as -r or -9?
I tried parallel gzip -r -9 ::: myfile_* and parallel gzip ::: 9 r myfile_*,
but neither works.
When I tried parallel "gzip -9 -r" ::: myfile_*
I got this error message:
gzip: compressed data not written to a terminal. Use -f to force compression
Also the -r switch for recursively adding directories is not working.
....
Similarly for other commands: how do I pass options while using parallel?

You have the correct syntax:
parallel gzip -r -9 ::: myfile_*
So something else is wrong. What is the output of
parallel --version
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

(I don't think this question belongs here. Maybe superuser.com?)
parallel gzip -r -9 ::: * worked fine for me, going into directories and all. I am using parallel version 20130622.
Note that with this approach, each directory will be a single task. You may instead want to pipe the output of find to parallel so that each file becomes its own job, as sketched below.
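If you go the find route, a minimal sketch (assuming GNU find and GNU Parallel's -0 option so file names with spaces survive intact):
find . -type f -name 'myfile_*' -print0 | parallel -0 gzip -9
Each file then gets its own gzip job instead of one job per directory.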

Have you tried the --gnu flag for parallel?
parallel -j+0 --gnu "command" ...
On some systems (like Ubuntu) it is disabled by default.

Related

How to gzip tee output while running in gnu parallel?

Suppose I am running tee inside a command executed by parallel.
I would like to gzip the output from tee:
... | tee --gzip the_file | and_continue
Bash process substitution is useful for cases like this. Something like:
... | tee >(gzip -c > the_file) | and_continue
If you're choosing different files in a parallel run and need to format the name differently each time, take a look at GNU Parallel argument placeholder in bash process substitution for how this has to change: the process substitution must be deferred so that it is set up once per parallel job.
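As a hedged sketch of that idea (the grep filter, the logs/*.txt inputs and the .filtered.gz suffix are all made up here; it assumes parallel was started from bash, so each job's command line is itself run by bash): quoting the whole command defers the process substitution so it is evaluated separately for every job:
parallel 'grep -v DEBUG {} | tee >(gzip -c > {}.filtered.gz) | wc -l' ::: logs/*.txt
Because the command is single-quoted, {} is filled in by parallel first, and the >(...) is then created fresh for each file.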

why is gzip trying to compress itself

Attempting to run gzip from a command prompt to compress any file returns
gzip: /usr/bin/gzip is not a directory or a regular file - ignored
as the first line of output.
Here's what I can share that may shed some light:
oslevel
7.1.0.0
echo $SHELL
/usr/bin/ksh
gzip -V
gzip 1.2.4 (18 Aug 93)
Compilation options:
DIRENT UTIME STDC_HEADERS HAVE_UNISTD_H
To produce the error, all I have to do is try to compress any file with gzip (i.e. gzip test.out). The error occurs when run from the command prompt as well as when run from cron.
Any thoughts as to why this is happening?
Additional requested information:
gzip -h
gzip 1.2.4 (18 Aug 93)
usage: gzip [-cdfhlLnNrtvV19] [-S suffix] [file ...]
-c --stdout write on standard output, keep original files unchanged
-d --decompress decompress
-f --force force overwrite of output file and compress links
-h --help give this help
-l --list list compressed file contents
-L --license display software license
-n --no-name do not save or restore the original name and time stamp
-N --name save or restore the original name and time stamp
-q --quiet suppress all warnings
-r --recursive operate recursively on directories
-S .suf --suffix .suf use suffix .suf on compressed files
-t --test test compressed file integrity
-v --verbose verbose mode
-V --version display version number
-1 --fast compress faster
-9 --best compress better
file... files to (de)compress. If none given, use standard input.
file /usr/bin/gzip
/usr/bin/gzip: executable (RISC System/6000) or object module
gzip *.out
gzip: /usr/bin/gzip is not a directory or a regular file - ignored
gzip -d *.gz
gzip: /usr/bin/gzip is not a directory or a regular file - ignored
Found the issue:
We have an environment file where we load common environment variables. One line in the file is
export GZIP="/usr/bin/gzip"
According to the gzip documentation, the GZIP environment variable is used to specify default options. So it's probably taking the variable's value as part of the command line; and since /usr/bin/gzip isn't an option, it's interpreted as a file name, which gzip refuses to compress because it's actually a symbolic link. Unsetting the variable makes the error go away.
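A minimal sketch of the fix, assuming the offending line comes from an environment file sourced by ksh (the commented alternative shows what the variable is actually meant to hold: options, not a path):
unset GZIP          # stop gzip from reading $GZIP as extra command-line arguments
# export GZIP="-9"  # legitimate (if old-fashioned) use: default options only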

Redirect stderr to stdout in C shell

When I run the following command in csh, I get nothing, but it works in bash.
Is there any equivalent in csh which can redirect the standard error to standard out?
somecommand 2>&1
The csh shell has never been known for its extensive ability to manipulate file handles in the redirection process.
You can redirect both standard output and error to a file with:
xxx >& filename
but that's not quite what you were after, redirecting standard error to the current standard output.
However, if your underlying operating system exposes the standard output of a process in the file system (as Linux does with /dev/stdout), you can use that method as follows:
xxx >& /dev/stdout
This will force both standard output and standard error to go to the same place as the current standard output, effectively what you have with the bash redirection, 2>&1.
Just keep in mind this isn't a csh feature. If you run on an operating system that doesn't expose standard output as a file, you can't use this method.
However, there is another method. You can combine the two streams into one if you send it to a pipeline with |&, then all you need to do is find a pipeline component that writes its standard input to its standard output. In case you're unaware of such a thing, that's exactly what cat does if you don't give it any arguments. Hence, you can achieve your ends in this specific case with:
xxx |& cat
Of course, there's also nothing stopping you from running bash (assuming it's on the system somewhere) within a csh script to give you the added capabilities. Then you can use the rich redirections of that shell for the more complex cases where csh may struggle.
Let's explore this in more detail. First, create an executable echo_err that will write a string to stderr:
#include <stdio.h>
int main (int argc, char *argv[]) {
fprintf (stderr, "stderr (%s)\n", (argc > 1) ? argv[1] : "?");
return 0;
}
Then a control script test.csh which will show it in action:
#!/usr/bin/csh
ps -ef ; echo ; echo $$ ; echo
echo 'stdout (csh)'
./echo_err csh
bash -c "( echo 'stdout (bash)' ; ./echo_err bash ) 2>&1"
The echo of the PID and ps are simply so you can ensure it's csh running this script. When you run this script with:
./test.csh >test.out 2>test.err
(the initial redirection is set up by bash before csh starts running the script), and examine the out/err files, you see:
test.out:
UID PID PPID TTY STIME COMMAND
pax 5708 5364 cons0 11:31:14 /usr/bin/ps
pax 5364 7364 cons0 11:31:13 /usr/bin/tcsh
pax 7364 1 cons0 10:44:30 /usr/bin/bash
5364
stdout (csh)
stdout (bash)
stderr (bash)
test.err:
stderr (csh)
You can see there that the test.csh process is running in the C shell, and that calling bash from within there gives you the full bash power of redirection.
The 2>&1 in the bash command quite easily lets you redirect standard error to the current standard output (as desired) without prior knowledge of where standard output is currently going.
I object to the above answer and provide my own. csh DOES have this capability, and here is how it's done:
xxx |& some_exec # will pipe merged output to your some_exec
or
xxx |& cat > filename
or if you just want it to merge streams (to stdout) and not redirect to a file or some_exec:
xxx |& tee /dev/null
As paxdiablo said, you can use >& to redirect both stdout and stderr. However, if you want them separated, you can use the following:
(command > stdoutfile) >& stderrfile
...as indicated, the above will redirect stdout to stdoutfile and stderr to stderrfile.
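A related sketch in the same spirit (the grep filter is only an example): redirect stdout somewhere else inside the parentheses and then use |&, so that only stderr reaches the pipeline:
(command > /dev/null) |& grep -i error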
xxx >& filename
Or do this to see everything on the screen and have it go to your file:
xxx |& tee ./logfile
What about just
xxx >& /dev/stdout
I think this is the correct answer for csh.
xxx >/dev/stderr
Note most csh are really tcsh in modern environments:
rmockler> ls -latr /usr/bin/csh
lrwxrwxrwx 1 root root 9 2011-05-03 13:40 /usr/bin/csh -> /bin/tcsh
Using a backtick-embedded statement to demonstrate this:
echo "`echo 'standard out1'` `echo 'error out1' >/dev/stderr` `echo 'standard out2'`" | tee -a /tmp/test.txt ; cat /tmp/test.txt
The other suggestions don't work in my csh environment.

Utilizing multi core for tar+gzip/bzip compression/decompression

I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).
I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.
Is there any way I can utilize the unused cores to make it faster?
You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.
For example use:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip
You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:
tar cf - paths-to-archive | pigz > archive.tar.gz
By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
Common approach
There is an option for the tar program:
-I, --use-compress-program PROG
filter through PROG (must accept -d)
You can use a multithreaded version of the archiver or compressor utility.
The most popular multithreaded archivers are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:
$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive
The archiver must accept -d. If your replacement utility lacks this parameter and/or you need to specify additional parameters, then use pipes (add parameters as necessary):
$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz
The input and output of the single-threaded and multithreaded versions are compatible: you can compress with the multithreaded version and decompress with the single-threaded one, and vice versa.
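For example (a sketch reusing the hypothetical names above), an archive written by pigz can be unpacked with plain gzip, or tar can be told to use the multithreaded tool on extraction too:
$ gzip -dc OUTPUT_FILE.tar.gz | tar xf -
$ tar -I pigz -xf OUTPUT_FILE.tar.gz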
p7zip
For p7zip for compression you need a small shell script like the following:
#!/bin/sh
# Wrapper so tar -I can call 7za: decompress when tar passes -d, compress otherwise.
case $1 in
  -d) 7za -txz -si -so e;;   # extract: archive on stdin, data on stdout
   *) 7za -txz -si -so a .;; # add: data on stdin, archive on stdout
esac 2>/dev/null
Save it as 7zhelper.sh. Here the example of usage:
$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z
xz
Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can utilize multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").
This is a fragment of the man page for the 5.1.0alpha version:
Multithreaded compression and decompression are not implemented yet, so this
option has no effect for now.
However, this will not work for decompression of files that haven't also
been compressed with threading enabled. From the man page for version 5.2.2:
Threaded decompression hasn't been implemented yet. It will only work
on files that contain multiple blocks with size information in
block headers. All files compressed in multi-threaded mode meet this
condition, but files compressed in single-threaded mode don't even if
--block-size=size is used.
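For example, a minimal sketch assuming GNU tar (where -J selects xz) and XZ Utils 5.2.0 or newer:
$ XZ_DEFAULTS="-T 0" tar -cJf OUTPUT_FILE.tar.xz paths_to_archive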
Recompiling with replacement
If you build tar from sources, then you can recompile with parameters
--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip
After recompiling tar with these options you can check the output of tar's help:
$ tar --help | grep "lbzip2\|plzip\|pigz"
-j, --bzip2 filter the archive through lbzip2
--lzip filter the archive through plzip
-z, --gzip, --gunzip, --ungzip filter the archive through pigz
You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:
tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/
If you want to have more flexibility with filenames and compression options, you can use:
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s#/my/path/##g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz
Step 1: find
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec
This command will look for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. The parentheses group the -name tests so that -exec applies to every matching pattern; add as many -o -name "pattern" clauses inside them as you want.
-exec will execute the next command using the results of find: tar
Step 2: tar
tar -P --transform='s#/my/path/##g' -cf - {} +
--transform is a simple string-replacement parameter. It strips the path prefix from the files in the archive, so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory, as you'd lose the benefit of find: all files in the directory would be included.
-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.
-cf - tells tar to create the archive and write it to standard output; the actual file name is supplied later by the shell redirection
{} + appends every file that find found as arguments to the command
Step 3: pigz
pigz -9 -p 4
Use as many parameters as you want.
In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression.
If you run this on a heavily loaded webserver, you probably don't want to use all available cores.
Step 4: archive name
> myarchive.tar.gz
Finally.
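To unpack the result (a sketch; /restore/path is a made-up target), extract into whatever directory the files should land in, since the leading path was stripped by --transform:
tar -I pigz -xf myarchive.tar.gz -C /restore/path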
A relatively newer (de)compression tool you might want to consider is zstandard. It does an excellent job of utilizing spare cores, and it has made some great trade-offs when it comes to compression ratio vs. (de)compression time. It is also highly tweak-able depending on your compression ratio needs.
Here is an example of tar with the modern zstd compressor, since good examples of this one were difficult to find:
apt command to install the zstd and pv utilities on Ubuntu
Compress multiple files and folders (the zstd command alone can only do single files)
Display progress using pv - shows the total bytes compressed and the compression speed (GB/sec) in real time
Use all physical cores with -T0
Set a compression level higher than the default with -8
Display the resulting wall clock and CPU time used after the operation is finished using time
apt install zstd pv
DATA_DIR=/path/to/my/folder/to/compress
TARGET=/path/to/my/archive.tar.zst
time (cd $DATA_DIR && tar -cf - * | pv | zstd -T0 -8 -o $TARGET)
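To reverse it, a sketch along the same lines (RESTORE_DIR is a made-up path; pv again just shows throughput):
RESTORE_DIR=/path/to/restore/into
time (zstd -dc $TARGET | pv | tar -xf - -C $RESTORE_DIR)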

How to limit the background processes launched by C shell?

I process 100 files in a directory with a command called process, and I want to parallelize this as much as possible. So I issue the following commands in a C shell, and it works great:
foreach F (dir/file*.data)
process $F > $F.processed &
echo $F
end
All 100 processes launch at once in the background, maximizing the usage of all my cores.
Now I want to use only a half of my cores (2 out of 4) at once. Is there an elegant way to do this?
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
parallel -j 50% 'process {} > {}.processed; echo {}' ::: dir/file*.data
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1