How to get around 'Argument list too long' error when concatenating multiple gzip files? - gzip

I am trying to concatenate around 21,000 gzip files that are all located in a local directory, so I can unzip one large gzip file and then convert the unzipped file into a CSV. Unfortunately this exceeds the maximum length of the argument list the shell will pass to a command. I have tried cat *.gz > final.gz as well as ls *.gz | xargs cat, but both gave me the error 'Argument list too long'. How might I work around this error to concatenate all the gzipped files?

You can try something like:
find . -name \*.gz -print0 | xargs -0 cat > final.gz
If there are .gz files in subdirectories of the current directory, and you only want the ones in the current directory, then add -maxdepth 1 to the find options.
If you want to impose a particular order on the files, you can pipe through an appropriate sort between the find and the xargs.
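For example, sorting by filename can be sketched like this (a small self-contained demo; sort -z and the NUL-separated plumbing are GNU extensions, and final.gz is excluded from the find because the shell creates it for the redirect before the pipeline finishes):

```shell
# scratch directory with two .gz parts whose names decide the order
dir=$(mktemp -d)
cd "$dir"
printf 'b\n' | gzip > 2.gz
printf 'a\n' | gzip > 1.gz

# sort -z orders the NUL-separated names before xargs cats them,
# so the members land in final.gz in filename order
find . -maxdepth 1 -name '*.gz' ! -name final.gz -print0 \
  | sort -z | xargs -0 cat > final.gz

# the members of a concatenated gzip stream decompress back-to-back
gzip -dc final.gz
```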

Related

Extract huge tar.gz archives from S3 without copying archives to a local system

I'm looking for a way to extract a huge dataset (18 TB+, found here: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations). With this in mind, I need the process to be fast (i.e. I don't want to spend twice the time by first copying and then extracting the files). I also don't want the archives to take up extra space, not even one 20 GB+ archive.
Any thoughts on how one can achieve that?
If you can arrange to pipe the data straight into tar, it can uncompress and extract it without needing a temporary file.
Here is an example. First, create a tar file to play with:
$ echo abc >one
$ echo def >two
$ tar cvf test.tar one two
one
two
$ gzip test.tar
Remove the test files
$ rm one two
$ ls one two
ls: cannot access one: No such file or directory
ls: cannot access two: No such file or directory
Now extract the contents by piping the compressed tar file into the tar command.
$ cat test.tar.gz | tar xzvf -
one
two
$ ls one two
one two
The only part missing now is how to download the data and pipe it into tar. Assuming you can access the URL with wget, you can have it send the data to stdout. So you end up with this:
wget -qO- https://youtdata | tar xzvf -
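The same pipe works with anything that writes the archive to stdout. A minimal local sketch of the principle (here cat stands in for the download; for an object in S3, `aws s3 cp s3://bucket/key -` streams it to stdout, assuming a configured AWS CLI):

```shell
dir=$(mktemp -d)
cd "$dir"
# build a small .tar.gz to stand in for the remote archive
echo abc > one
echo def > two
tar czf test.tar.gz one two
rm one two

# stream the compressed archive straight into tar: no temporary
# extracted copy is made. In the real case, replace `cat test.tar.gz`
# with `wget -qO- "$URL"` or `aws s3 cp s3://bucket/key -`
cat test.tar.gz | tar xzf -
```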

How to unzip many .gz files to one same file?

I have a folder which contains a lot of .gz files, each of which contains logs as text.
It's too troublesome to unzip and look through them one by one, so I'm wondering whether there is any command to unzip the content of multiple .gz files into one file?
Thanks
This is probably the command you want:
cat *.gz | gzip -dc - | grep los
First, cat *.gz sends all the zipped files to stdout.
Then gzip decompresses (-d) and writes to stdout (-c); the "-" tells it to read from stdin rather than from files.
That output can then be piped to whatever program you want (grep los here is just an example filter).
If you want to know specifically which files have matches, you can do this too:
for f in *.gz
do echo "$f:" ;
gzip -dc "$f" | grep los ;
done
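To answer the literal question (all the decompressed content collected into one file rather than grepped), a single gzip -dc over all the archives can be redirected into one output file. A small sketch with made-up files:

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'line one\n' | gzip > a.gz
printf 'line two\n' | gzip > b.gz

# gzip -dc decompresses every argument to stdout as one stream,
# which the redirect collects into a single text file
gzip -dc *.gz > combined.txt
cat combined.txt
```

Note this still expands the glob on the command line, so with tens of thousands of files the find/xargs approach from the first question applies here too.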

s3cmd copy files preserving path

Is there a way to use copy files to an S3 bucket by preserving the file path?
This is the example:
1. I produce a list of files that are different between bucket1 and bucket2 using s3cmd sync --dry-run
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list to upload to a new location in the bucket (e.g. s3://bucket/diff/) only the files in the list BUT with the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because the paths in S3 buckets are not actually directories/folders; an S3 bucket has no such concept of structure, even if various tools present it this way (including s3cmd, which is really confusing...).
So the "path" is actually a key prefix (although s3cmd sync to local knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. Create a file listing all the paths from a s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed:
sed 's/^\(s3.*\)PROD/\1DIFF/' file1 > file2
3. Merge the files so that line 1 in file1 is paired with line 1 in file2, and so on:
paste file1 file2 > final.txt
4. Read final.txt line by line in a loop and use each line as a set of two parameters to a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be in quotes; if it is, the sync command will complain that it received only one parameter... of course!
2. the [[ -n "$line" ]] is used here so that read will not fail if the last line has no newline character
Unfortunately boto could not help more, so if you need something similar in Python you would do it pretty much the same way....
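The four steps can be exercised locally, with echo standing in for the actual s3cmd call and a hand-written listing (the bucket names and keys below are illustrative only):

```shell
dir=$(mktemp -d)
cd "$dir"

# step 1: stand-in for the `s3cmd sync --dry-run` listing => file1
cat > file1 <<'EOF'
s3://BUCKET/20150831/PROD/a/.content.xml
s3://BUCKET/20150831/PROD/b/nodetypes.cnd
EOF

# step 2: rewrite PROD -> DIFF to build the destination list => file2
sed 's/^\(s3.*\)PROD/\1DIFF/' file1 > file2

# step 3: pair each source with its destination, tab-separated
paste file1 file2 > final.txt

# step 4: read each pair; echo stands in for
#   s3cmd -c s3cfg sync "$src" "$dst"
while IFS="$(printf '\t')" read -r src dst; do
  echo "$src -> $dst"
done < final.txt
```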

AWK to process compressed files and printing original (compressed) file names

I would like to process multiple .gz files with gawk.
I was thinking of decompressing and passing it to gawk on the fly
but I have an additional requirement to also store/print the original file name in the output.
The thing is there's 100s of .gz files with rather large size to process.
Looking for anomalies (~0.001% rows) and want to print out the list of found inconsistencies ALONG with the file name and row number that contained it.
If I could have all the files decompressed I would simply use FILENAME variable to get this.
Because of large quantity and size of those files I can't decompress them upfront.
Any ideas how to pass filename (in addition to the gzip stdout) to gawk to produce required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
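A concrete sketch of that loop with made-up data: the CSV field check is a placeholder for whatever anomaly test is actually needed, and FNR supplies the row number within each file:

```shell
dir=$(mktemp -d)
cd "$dir"
printf '1,ok\n2,bad\n' | gzip > x.gz
printf '3,ok\n'        | gzip > y.gz

# decompress each archive into awk, passing the archive name in so
# matches can be reported as file:row
for file in *.gz; do
  gunzip -c "$file" | awk -F, -v origname="$file" \
    '$2 == "bad" { print origname ":" FNR ": " $0 }'
done
```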
Edit: To use a list of filenames from some source other than a direct glob something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
Using xargs instead of the above loop would, I believe, require the body of the command to be in a pre-written script file, which xargs can then call with each filename.
this is using a combination of xargs and sh (to be able to use a pipe between two commands, gzip and awk):
find . -maxdepth 1 -name '*.gz' -print0 | xargs -0 -I fname sh -c 'gzip -dc "fname" | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…

Bash copying specific files

How can I get tar/cp to copy only files that don't end in .jar, and only in the root and /plugins directories?
So, I'm making a Minecraft server backup script. One of the options I wish to have is a backup of configuration files only. Here's the scenario:
There are many folders with massive amounts of data in.
Configuration files mainly use the following extensions, but some may use a different one:
.yml
.json
.properties
.loc
.dat
.ini
.txt
Configuration files mainly appear in the /plugins folder
There are a few configuration files in the root directory, but none in any others except /plugins
The only other files in these two directories are .jar files - to an extent. These do not need to be backed up. That's the job of the currently-working plugins flag.
The code uses a mix of tar and cp depending on which flags the user started the process with.
The process is started with a command, then paths are added via a concatenated variable, such as $paths = plugins world_nether mysql/hawk where arguments can be added one at a time.
How can I selectively back up these configuration files with tar and cp? Due to the nature of the configuration process, we needn't have the same flags to add to both commands - it can be separate arguments for either command.
Here are the two snippets of code in concern:
Configure paths:
# My first, unsuccessful attempt.
if $BKP_CFG; then
# Tell user they are backing up config
echo " +CONFIG $confType - NOT CURRENTLY WORKING"
# Main directory, and everything in plugin directory only
# Jars are not allowed to be backed up
#paths="$paths --no-recursion * --recursion plugins$suffix --exclude *.jar"
fi
---More Pro Stuff----
# Set commands
if $ARCHIVE; then
command="tar -cpv"
if $COMPRESSION; then
command=$command"z"
fi
# Paths starts with a space </protip>
command=$command"C $SERVER_PATH -f $BACKUP_PATH/$bkpName$paths"
prep=""
else
prep="mkdir $BACKUP_PATH/$bkpName"
# Make each path an absolute path. Currently, they are all relative
for path in $paths; do
path=$SERVER_PATH/$path
done
command="cp -av$paths $BACKUP_PATH/$bkpName"
fi
I can provide more code/explanation where necessary.
find /actual/path -maxdepth 1 -type f ! -iname '*.jar' -exec cp \{\} /where/to/copy/ \;
find /actual/path/plugins -maxdepth 1 -type f ! -iname '*.jar' -exec cp \{\} /where/to/copy/ \;
Might help.
Final code:
if $BKP_CFG; then
# Tell user what's being backed up
echo " +CONFIG $confType"
# Main directory, and everything in plugin directory only
# Jars are not allowed to be backed up
# Find matches within the directory cd'd to earlier, strip leading ./
paths="$paths $(find . -maxdepth 1 -type f ! -iname '*.jar' | sed -e 's/\.\///')"
paths="$paths $(find ./plugins -type f ! -iname '*.jar' | sed -e 's/\.\///')"
fi
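On a mocked-up server layout (file names here are assumed for illustration), those two find expressions feed tar exactly the non-jar files. Note this relies on none of the paths containing whitespace, since $paths is expanded unquoted:

```shell
dir=$(mktemp -d)
cd "$dir"
mkdir -p plugins/Foo
echo cfg > server.properties
echo jar > server.jar
echo cfg > plugins/Foo/config.yml
echo jar > plugins/Foo.jar

# top-level non-jar files, plus everything non-jar under plugins/,
# with the leading ./ stripped as in the final code above
paths="$(find . -maxdepth 1 -type f ! -iname '*.jar' | sed 's/\.\///')"
paths="$paths $(find ./plugins -type f ! -iname '*.jar' | sed 's/\.\///')"

tar -cpf backup.tar $paths
tar -tf backup.tar
```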