tar gzip of directory leaves old directory there and tar.gz is much larger - gzip

New to zipping files here. I used the following command to gzip a bunch of large files within a single directory:
tar -cvzf archive-RAW-MAFs.tar.gz RAW_MAFS/
When this was done, I noticed that it left the old directory tree where it was, and that the tar.gz was much larger. I'm not sure what the original size of the directory was as I didn't check it beforehand, but I think it was much larger than stated here...
-rw-r----- 1 xxx xxxx 21218045403 May 8 21:39 archive-RAW-MAFs.tar.gz
drwxr-s--- 34 xxx xxxx 4096 May 8 20:21 RAW_MAFS
I can also traverse through the original RAW_MAFs directory and open files. Ideally, I would like only the zipped file, because I don't need to touch this data again for a while and want to save as much as I can.

I'll take the second question first.
The original files are still there because you haven't told tar to delete them. Add the --remove-files option to the command line to get tar to do what you want:
tar -cvzf archive-RAW-MAFs.tar.gz RAW_MAFS/ --remove-files
Regarding the size of the RAW_MAFS directory tree: if it hasn't been deleted yet, can you not check its size?
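For example, a quick check while the directory is still in place:
du -sh RAW_MAFS/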
If the original files in RAW_MAFS are already compressed, then compressing again when you put them in your tar file will increase the size. Can you provide more details on what you are storing in the tar file?
If you are storing compressed files in the tar, try running without the z option.
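For example, dropping the z option from your original command (add --remove-files as above if you also want the originals deleted):
tar -cvf archive-RAW-MAFs.tar RAW_MAFS/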

wget not downloading pdfs in directories

I have the following problem: I'm trying to download a directory that contains PDFs; it downloads the file structure and some of the PDFs, but doesn't go deeper than the 2nd directory to download PDFs.
Details (theoretical)
So I have folder1/folder2/folder3(/folder4/folder5)
folder1 contains no PDFs; the file structure contained in it is downloaded.
folder2 contains another folder and some PDFs; the folders are created and the PDFs are downloaded.
folder3 sometimes contains more folders, which are created, but the PDFs contained in it and in its subfolders are not downloaded.
Here is what I'm using to try to download all of it:
wget -r -l inf --no-remove-listing -np -c -w 3 --no-check-certificate -R "index.html*" -P "target directory" "https://etc./"
What am I doing wrong?
Solved it: -e robots=off was the solution. Which is odd, since the site actually suggested a wget command that I disagreed with; I tried it anyway and got even fewer results than with my own command. In any case, -e robots=off was not mentioned in their original command, so I figured I didn't need it, but I did.
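For reference, a sketch of the working invocation, i.e. the original command from above plus -e robots=off (the URL and target directory are placeholders):
wget -r -l inf -e robots=off --no-remove-listing -np -c -w 3 --no-check-certificate -R "index.html*" -P "target directory" "https://etc./"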

Sync clients' files with server - Electron/node.js

My goal is to make an Electron application which synchronizes a client's folder with a server. To explain it more clearly:
If the client doesn't have the files present on the host server, the application downloads all of the files from the server to the client.
If the client has the files, but some files have been updated on the server, the application deletes ONLY the outdated files (leaving the unmodified ones) and downloads the updated files.
If a file has been removed from the host server but is present in the client's folder, the application deletes the file.
Simply put, the application has to make sure that the client has an EXACT copy of the host server's folder.
So far I did this via wget -m; however, wget frequently did not recognize that some files had changed, and left clients with outdated files.
Recently I've heard of zsync-windows and the webtorrent npm package, but I am not sure which approach is right or how to actually accomplish my goal. Thanks for any help.
rsync is a good approach, but you will need to access it via Node.js.
An npm package like this may help you:
https://github.com/mattijs/node-rsync
But things will get slightly more difficult on Windows systems:
How to get rsync command on windows?
If you have SSH access to the server, an approach could be using rsync through a Node.js package.
There's a good article here on how to implement this.
You can use rsync which is widely used for backups and mirroring and as an improved copy command for everyday use. It offers a large number of options that control every aspect of its behaviour and permit very flexible specification of the set of files to be copied.
It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination.
For your use case:
If the client doesn't have the files present on the host server, the application downloads all of the files from the server to the client. This can be achieved with a simple rsync.
If the client has the files, but some files have been updated on the server, the application deletes ONLY the outdated files (leaving the unmodified ones) and downloads the updated files. Use --remove-source-files or --delete, depending on whether you want to delete the outdated files from the source or from the destination.
If a file has been removed from the host server but is present in the client's folder, the application deletes the file. Use the --delete option of rsync:
rsync -a --delete source destination
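For a server reachable over SSH, a sketch might look like this (the host and paths are placeholders):
rsync -az --delete user@example.com:/srv/data/ /path/to/local/copy/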
Given that it's a folder list (and therefore has simple filenames without spaces, etc.), you can pick out the filenames with the code below:
# Get last item from each line of FILELIST
awk '{print $NF}' FILELIST | sort >weblist
# Generate a list of your files
find -type f -print | sort >mylist
# Compare results
comm -23 mylist weblist >diffs
# Remove old files
xargs -r echo rm -fv <diffs
You'll need to remove the echo to let rm actually do its work.
Next time you want to update your mirror, you can modify the comm line (by swapping the two file arguments) to find the set of files you don't have, and feed those to wget.
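For example, a sketch of that update step, reusing the weblist and mylist files from above (and assuming the names in weblist are full URLs or paths that wget can fetch):
# Files present on the server but missing locally
comm -23 weblist mylist >missing
# Fetch them
wget -i missing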
or
rsync -av --delete rsync://mirror.abcd.org/xyz/xyz-folder/ my-client-xyz-directory/

How to untar and gzip the extracted files in one operation?

I have a huge (500GB) gzipped tar file, and I want to extract all the files in it. The tar file is gzipped, but the files in it are not. The problem is that if I extract them all like this
tar xzf huge.tgz
then I run out of space.
Is there a way to simultaneously extract and gzip the files? I could write a script to do
tar tzf huge.tgz
and then extract each file and gzip it, one after the other. But I was hoping there might be a more efficient solution.
You would have to write a program that uses, for example, libarchive and zlib to extract entries and run them through gzip compression.
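Short of writing such a program, here is a rough shell sketch of the one-file-at-a-time approach you describe (slow, since the archive is re-read for every member, and assuming member names contain no newlines), which never needs more free space than one uncompressed file:
# List members, then extract and recompress them one by one
tar tzf huge.tgz | while IFS= read -r f; do
    case "$f" in */) continue ;; esac   # skip directory entries
    tar xzf huge.tgz "$f" && gzip "$f"
done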

gsutil finding the uncompressed file size

Is there a way to get the uncompressed file size using gsutil? I've looked into du and ls -l. Both return the compressed size. I would like to avoid having to download the files to see their size.
gsutil provides only some basic commands like copying, listing files in a directory, etc. I would suggest you write a Python script which tells you the original size of the gzipped files.
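As a rough sketch of that idea without a full download (assuming a hypothetical object path, a single-member gzip file under 4 GiB, and a little-endian machine), you can fetch just the gzip trailer, whose last 4 bytes store the uncompressed size:
# Read only the last 4 bytes of the object and decode them as a 32-bit integer
gsutil cat -r -4 gs://my-bucket/path/file.gz | od -An -tu4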

how to merge gz files into a tar.gz without decompression?

I have a program that only consumes uncompressed files. I have a couple of .gz files and my goal is to feed the concatenation of them to that program. If I had a tar.gz file, I could mount the archive with the archivemount command.
I know I can concatenate the gz files:
cat a.gz b.gz > c.gz
But there is no way that I am aware of to mount a gz file. I don't have enough disk space to uncompress all of the files, and the tar command does not accept stdin as input, so I cannot do this:
zcat *.gz | tar - | gzip > file.tar.gz
It is not clear what operations you need to perform on the tar.gz archive. But from what I can discern, tar.gz is not the format for this application. The entire archive stream is compressed by gzip, so you can't pull out or change a file without having to re-compress everything after it. The tar.gz stream can be specially prepared to keep the compression of each file independent, but then you might as well use the .zip format, which is better suited for random access and manipulation of individual files in the archive.
To address one of your comments, tar can in fact accept stdin as input. See pipe tar extract into tar create for some examples, where both GNU tar and BSD tar (with different syntax) can take in a tar file from stdin, delete entries, and write a new tar file to stdout.
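As a minimal sketch of that (GNU tar syntax; the archive and member names here are placeholders):
cat archive.tar | tar -tf -            # list members read from stdin
cat archive.tar | tar -xf - -C outdir  # extract members read from stdin, into outdir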