Comparing checksums of tarball archive with original directory - gzip

I'm wondering how to verify the checksum of a tarball backup with the original directory after creation.
Is it possible to do so without extracting it, for example if it's a large 20 GB backup?
Example: a directory with two files:
mkdir test &&
echo "one" > test/one.txt &&
echo "two" > test/two.txt
Get checksum of directory:
find test/ -type f -print0 | sort -z | xargs -0 shasum | shasum
Resulting checksum of directory content:
d191c793cacc4bec1f070eb96fa68524cca566f8 -
Create tarball:
tar -czf test.tar.gz test/
The checksum of the directory content stays constant.
But when I create the archive and take the checksum of the archive itself, I notice that the result varies between runs. Why is that?
How would I go about getting the checksum of the tarball content to compare to the directory content checksum?
Or what's a better solution to check that the archive contains all the necessary content from the original directory (without extracting it if it's large)?

Your directory checksum is calculating the SHA-1 of each file's contents. You would need to read and decompress the entire tar archive to do the same calculation. That doesn't mean you'd need to save the contents of the archive anywhere. You'd just need to read it sequentially into memory, and do the calculation there.
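As an aside, the checksum of test.tar.gz itself varies between runs mainly because gzip embeds a timestamp in its header, so hashing the archive file is not a useful comparison. A minimal sketch of the sequential-read idea, mirroring the per-file shasum scheme of the directory command above (it relies on tar's -O/--to-stdout option and re-reads the archive once per file, so it is simple rather than fast for a 20 GB backup):
# hash each regular file stored in the archive, then hash the sorted list
tar -tzf test.tar.gz | grep -v '/$' | sort | while IFS= read -r f; do
  printf '%s  %s\n' "$(tar -xzOf test.tar.gz "$f" | shasum | cut -d ' ' -f 1)" "$f"
done | shasum
If the two digests differ, drop the final shasum on both sides and diff the intermediate "hash  path" lines to see which file changed.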

Related

Extract huge tar.gz archives from S3 without copying archives to a local system

I'm looking for a way to extract a huge dataset (18 TB+, found here: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations). With this in mind, I need the process to be fast (i.e. I don't want to spend twice the time by first copying and then extracting the files), and I don't want the archives to take up extra space, not even one 20 GB+ archive.
Any thoughts on how one can achieve that?
If you can arrange to pipe the data straight into tar, it can uncompress and extract it without needing a temporary file.
Here is an example. First, create a tar file to play with:
$ echo abc >one
$ echo def >two
$ tar cvf test.tar one two
one
two
$ gzip test.tar
Remove the test files
$ rm one two
$ ls one two
ls: cannot access one: No such file or directory
ls: cannot access two: No such file or directory
Now extract the contents by piping the compressed tar file into the tar command.
$ cat test.tar.gz | tar xzvf -
one
two
$ ls one two
one two
The only part missing now is how to download the data and pipe it into tar. Assuming you can access the URL with wget, you can get it to send the data to stdout. So you end up with this:
wget -qO- https://yourdata | tar xzvf -
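Since the archives in the question live in S3, a variant of the same trick (assuming the AWS CLI is installed and configured; the bucket and key below are placeholders) is to let aws s3 cp stream the object to stdout by giving - as the destination, so nothing touches local disk:
aws s3 cp s3://your-bucket/your-archive.tar.gz - | tar xzvf -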

How to gzip a folder under a symlink

I'm trying to gzip all subdirectories and files of a folder. The peculiarity is that the path I compress is a symbolic link to the latest release of my site:
filename=$(date '+%Y%m%d')
cd /home/site
tar -zcvf $filename.tar.gz current/
scp $filename.tar.gz server:~/backups/production
rm $filename.tar.gz
When the operation has finished and I open the archive, I see only the symlink to the folder, not its content. What am I doing wrong?
This is expected behavior. You need to specify the -h flag when creating the archive if you want to dereference symlinks. From the tar manual:
Normally, when tar archives a symbolic link, it writes a block to the
archive naming the target of the link. In that way, the tar archive is
a faithful record of the file system contents. When --dereference
(-h) is used with --create (-c), tar archives the files symbolic
links point to, instead of the links themselves.
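Applied to the backup script above, only the tar line needs to change; something like:
tar -zcvhf $filename.tar.gz current/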

Bazaar: How to export just changed file of some specific revision?

I'm wondering if there is any way to export just the files that have changed in a specific revision.
E.g., I have a branch with three files:
file.php
file.js
file.css
Only file.js has changed in the last commit.
How can I use the export command to export just the changed file (file.js) and not export the others?
Is there any plugin or external third-party tool?
Using bzr export you can specify a single directory to export, but not individual files.
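For example, exporting a single directory of the branch at revision REV would look roughly like this (a sketch only: the destination comes first, and lib is a hypothetical subdirectory of the branch):
bzr export -r REV /tmp/lib-export lib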
As an alternative, you can get the contents of a file at some past revision like this:
bzr cat -r REV path/to/file > file.rREV
You can get the list of changed files at some past revision with the one-liner:
bzr diff -c REV | grep ^===
To wrap it up, here's a complete one-liner that does just what you asked for: export just the modified files of some specific revision REV into a directory called EX:
bzr diff -cREV | grep '^=== modified file ' | sed -e "s/[^']*//" -e "s/'//g" |\
while read fname; do echo $fname; mkdir -p EX/"$(dirname "$fname")";\
bzr cat -rREV "$fname" > EX/"$fname"; done
It loops over the modified files in revision REV, prepares the export directory EX with all parent directories needed to save the file preserving the path, and finally gets the file with bzr cat and writes it at the correct relative path inside EX.

How to ignore certain files when branching / checking out?

I'd like to compare a few files from the bazaar branch lp:ubuntu/nvidia-graphics-drivers. I'm mainly interested in the debian subdirectory inside that branch, but due to the binary blob in http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-graphics-drivers/oneiric/files, it takes ages to get just the text files. I've already downloaded 555MB and it's still counting.
Is it possible to retrieve a bazaar branch, including or excluding certain files by one of the following properties:
file size
file extension
file name (include only debian/ for example)
I do not need to push back any changes, nor do I need to view the history of a file. I just want to compare two files in the debian/ directory, files with the .in extension and files without.
As far as I'm aware, no. You're downloading the branch history, not just the individual files. And each file is an integral part of the branch's history.
On the bright side, you only have to check it out once. Unless those binary files change, they'll be skipped the next time you pull from Launchpad.
Depending on the branch's history, you may be able to cut down on the download size if you use a lightweight checkout (bzr checkout --lightweight). But of course, that may come back and bite you later, as it means you won't get a local copy of the branch, only the checked-out files. So it'll work much like SVN, where every operation has to go through the server. And as long as you don't need to look at the branch history, or commit your changes, that should serve you just fine, I believe.
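For reference, a lightweight checkout of the branch from the question would look roughly like this (it avoids pulling the full history, though it still fetches the checked-out files themselves, blobs included):
bzr checkout --lightweight lp:ubuntu/nvidia-graphics-drivers nvidia-graphics-drivers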
I ended up doing some dirty grepping of the HTTP responses, since bzr info "$branch" and bzr ls -d "$branch" "$directory" did not give me enough information.
The Bash script below relies on the workings of Launchpad's front end, Loggerhead. It recursively downloads from a given URL. Currently, it ignores *.run files. Save it as bzrdl in a directory on your $PATH and run it with bzrdl http://launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-graphics-drivers/oneiric/files/head:/debian/. All files will be saved in the current directory; be sure it's empty to avoid conflicts.
#!/bin/bash
max_retries=5
rooturl="$1"
if ! [[ $rooturl =~ /$ ]]; then
    echo "Usage: ${0##*/} URL"
    echo "URL must end with a slash. Example URL:"
    echo "http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-graphics-drivers/oneiric/files/head:/"
    exit 1
fi
tmpdir="$(mktemp -d)"
target="$(pwd)"
# used for holding HTTP response before extracting data
tmp="$(mktemp)"
# url_filter reads download URLs from stdin (piped)
url_filter() {
    grep -v '\.run$'
}
get_files_from_dir() {
    local slash=/
    local dir="$1"
    # to avoid name collision: a/b/c/ -> a.d/b.d/c.d/
    local storedir="${dir//$slash/.d${slash}}"
    mkdir -p "$tmpdir/$storedir" "$target/$dir"
    local i subdir
    for ((i=0; i<$max_retries; i++ )); do
        if wget -O "$tmp" "$rooturl$dir"; then
            # store file list
            grep -F -B 1 '<img src="/static/images/ico_file_download.gif" alt="Download File" />' "$tmp" |\
                grep '^<a' | cut -d '"' -f 2 | url_filter \
                > "$tmpdir/$storedir/files"
            IFS=$'\n'
            for subdir in $(grep -F -B 1 '<img src="/static/images/ico_folder.gif" ' "$tmp" | \
                    grep -F '<a ' | rev | cut -d / -f 2 | rev); do
                IFS=$' \t\n'
                get_files_from_dir "$dir$subdir/"
            done
            return
        fi
    done
    echo "Failed to download directory listing of: $dir" >> "$tmpdir/errors"
}
download_files() {
    local slash=/
    local dir="$1"
    # to avoid name collision: a/b/c/ -> a.d/b.d/c.d/
    local storedir="${dir//$slash/.d${slash}}"
    local done=false
    local subdir
    cd "$tmpdir/$storedir"
    for ((i=0; i<$max_retries; i++)); do
        if wget -B "$rooturl$dir" -nc -i files -P "$target/$dir"; then
            done=true
            break
        fi
    done
    $done || echo "Failed to download all files from $dir" >> "$tmpdir/errors"
    for subdir in *.d; do
        download_files "$dir${subdir%%.d}/"
    done
}
get_files_from_dir ''
# make *.d expand to nothing if no directories are found
shopt -s nullglob
download_files ''
echo "TMP dir: $tmpdir"
echo "Errors : $(wc -l "$tmpdir/errors" 2>/dev/null | cut -d ' ' -f 2 || echo 0)"
The temporary directory and file are not removed afterwards; that must be done manually. Any errors (failures to download) are written to $tmpdir/errors.
It's confirmed to work with:
bzrdl http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-settings/oneiric/files/head:/debian/
Feel free to correct any mistakes or add improvements.
There is no way to selectively check out a specific directory from a Bazaar branch at the moment, although we do have plans to add such support in the future.
There is definitely too much traffic for the clone you are doing, considering the size of the branch. It's probably a bug in the client implementation.
Here on bzr 2.4 it is still quite slow but not too bad (60s):
localhost:/tmp% bzr branch http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-settings/oneiric
Most recent Ubuntu Oneiric version: 275.09.07-0ubuntu1
Packaging branch status: CURRENT
Branched 37 revision(s).
From the log:
[11866] 2011-07-31 00:56:57.007 INFO: Branched 37 revision(s).
56.786 Transferred: 5335kB (95.8kB/s r:5314kB w:21kB)

Is it possible to take a large number of files, tar/gzip them, and stream them on-the-fly?

I have a large number of files which I need to back up; the problem is that there isn't enough disk space to create a tar file of them and then upload it offsite. Is there a way of using Python, PHP or Perl to tar up a set of files and upload them on-the-fly without making a tar file on disk? They are also way too large to store in memory.
I always do this just via ssh:
tar czf - FILES/* | ssh me@someplace "tar xzf -"
This way, the files end up all unpacked on the other machine. Alternatively,
tar czf - FILES/* | ssh me@someplace "cat > foo.tgz"
puts them in an archive on the other machine, which is what you actually wanted.
You can pipe the output of tar over ssh:
tar zcvf - testdir/ | ssh user@domain.com "cat > testdir.tar.gz"
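If you also want to confirm the stream arrived intact, a small sketch (bash process substitution assumed; backup.tgz and backup.sha1 are placeholder names) is to checksum the data locally while it is being uploaded, then compare that against a checksum taken on the remote side (compare just the hash fields; the file-name columns will differ):
tar zcf - testdir/ | tee >(shasum > backup.sha1) | ssh user@domain.com "cat > backup.tgz"
# afterwards, compare backup.sha1 against: ssh user@domain.com "shasum backup.tgz"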