gsutil rsync only files matching a pattern - gsutil

I need to rsync files from a bucket to a local machine every day, and the bucket contains 20k files. I need to download only the changed files that end with *some_naming_convention.csv.
What's the best way to do that? Using a wildcard in the download source gave me an error.

I don't think you can do that with rsync. As Christopher told you, you can skip files by using the "-x" flag, but not sync only the ones you want [1]. I created a public Feature Request on your behalf [2] for you to follow updates there.
As I say in the FR, IMHO this doesn't follow the purpose of rsync, which is to keep folders/buckets synchronized; synchronizing only some of their files doesn't fall within that purpose.
There is a possible "workaround": use gsutil cp to copy files and -n to skip the ones that already exist. The whole command for your case would be:
gsutil -m cp -n <bucket>/*some_naming_convention.csv <directory>
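Note that the wildcard should be single-quoted so your local shell doesn't expand it before gsutil sees it; a sketch with placeholder bucket and directory names:
gsutil -m cp -n 'gs://my-bucket/*some_naming_convention.csv' /local/dir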
Another option, maybe a bit more far-fetched, is to copy/move those files to a separate folder and then rsync that folder.
I hope this works for you ;)

Original Answer
From here, you can do something like gsutil rsync -r -x '^(?!.*\.json$).*' gs://mybucket mydir to rsync all JSON files. The key is the (?!...) negative look-ahead wrapped around the pattern you actually want.
Edit
The -x flag excludes a pattern. The pattern ^(?!.*\.json$).* uses a negative look-ahead to match names that do not end in .json. Since everything matching it is excluded, the gsutil rsync call transfers only the files that end in .json.
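The same idea works for the pattern in the question; a minimal sketch, with placeholder bucket and directory names:
gsutil -m rsync -r -x '^(?!.*some_naming_convention\.csv$).*' gs://my-bucket /local/dir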

Rsync lets you include and exclude files matching patterns.
For each file, rsync applies the first pattern that matches, so if you want to sync only selected files you need to include those and then exclude everything else.
Add the following to your rsync options:
--include='*some_naming_convention.csv' --exclude='*'
That's enough if all your files are in one directory. If you also want to match files in subfolders, you need a little bit more:
--include='*/' --include='*some_naming_convention.csv' --exclude='*'
This will replicate the whole directory tree, but copy only the files you want. If that leaves empty directories you don't want, add --prune-empty-dirs.
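Putting it together: these --include/--exclude flags belong to stock rsync rather than gsutil's rsync command, so a complete invocation might look like the following (host and paths are placeholders):
rsync -av --prune-empty-dirs --include='*/' --include='*some_naming_convention.csv' --exclude='*' user@host:/remote/dir/ /local/dir/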

Related

Sync clients' files with server - Electron/node.js

My goal is to make an Electron application which synchronizes a client's folder with the server. To explain it more clearly:
If the client doesn't have the files present on the host server, the application downloads all of the files from the server to the client.
If the client has the files, but some files have been updated on the server, the application deletes ONLY the outdated files (leaving the unmodified ones) and downloads the updated files.
If a file has been removed from the host server but is present in the client's folder, the application deletes the file.
Simply put, the application has to make sure that the client has an EXACT copy of the host server's folder.
So far I did this via wget -m; however, wget frequently did not recognize that some files had changed, and left clients with outdated files.
Recently I've heard of zsync-windows and the webtorrent npm package, but I am not sure which approach is right and how to actually accomplish my goal. Thanks for any help.
rsync is a good approach, but you will need to access it from Node.js.
An npm package like this may help you:
https://github.com/mattijs/node-rsync
But things will get slightly more difficult on Windows systems:
How to get rsync command on windows?
If you have SSH access to the server, an approach could be to use rsync through a Node.js package.
There's a good article here on how to implement this.
You can use rsync, which is widely used for backups and mirroring, and as an improved copy command for everyday use. It offers a large number of options that control every aspect of its behaviour and permit very flexible specification of the set of files to be copied.
It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination.
For your use case:
If the client doesn't have the files present on the host server, the application downloads all of the files from the server to the client. This can be achieved by a plain rsync.
If the client has the files, but some files have been updated on the server, the application deletes ONLY the outdated files (leaving the unmodified ones) and downloads the updated files. Use --remove-source-files or --delete, depending on whether you want to delete the outdated files from the source or the destination.
If a file has been removed from the host server but is present in the client's folder, the application deletes the file. Use the --delete option of rsync.
rsync -a --delete source destination
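For the use case above, a minimal sketch over ssh (host and paths are placeholders): the trailing slashes mirror the contents of the server folder into the client folder, and --delete removes client files that no longer exist on the server.
rsync -az --delete user@server:/srv/shared/ /home/client/shared/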
Given it's a folder list (and therefore has simple filenames without spaces, etc.), you can pick the filenames out with the code below:
# Get last item from each line of FILELIST
awk '{print $NF}' FILELIST | sort >weblist
# Generate a list of your files
find . -type f -print | sort >mylist
# Compare results
comm -23 mylist weblist >diffs
# Remove old files
xargs -r echo rm -fv <diffs
You'll need to remove the echo to let rm actually run.
Next time you want to update your mirror, you can modify the comm line (by swapping the two file arguments) to find the set of files you don't have, and feed those to wget.
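For example, with the arguments swapped (BASE_URL is a hypothetical mirror root, not something from the original post):
# Files on the server that are missing locally
comm -23 weblist mylist >missing
# Fetch each one from the mirror; BASE_URL is a placeholder
xargs -r -I{} wget "$BASE_URL/{}" <missing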
or
rsync -av --delete rsync://mirror.abcd.org/xyz/xyz-folder/ my-client-xyz-directory/
(Note that rsync can't fetch over HTTP/HTTPS, so this requires the mirror to expose an rsync:// endpoint.)

Backup file folder in correct way

My situation is that I only have execute permission from a certain folder:
Let's say I would like to back up an entire folder and exclude some folders and files with exclude.txt.
Here is the path I would like to back up:
/pdf/data/pdfnew/2014
And I only have permission to execute from this folder (main):
/pdf/data/pdfnew/2014/public/main
I put exclude.txt in the same folder from which I can execute the command (main).
I execute this command (in the main folder):
tar -cjvf -X exclude.txt 2014.tar.bz2 /pdf/data/pdfnew/2014
The result is that it still includes the folders that I don't want to back up.
Is there a correct way of doing this?
Do you have a user/home directory on that server? You should, so just place exclude.txt in your user/home directory on that server and run it like this from that directory:
tar -cjv -X ~/exclude.txt -f ~/2014.tar.bz2 /pdf/data/pdfnew/2014
The ~/ is shorthand for your user/home directory, so in this case it explicitly states, "Read exclude.txt from the user/home directory and write 2014.tar.bz2 to the user/home directory." Note that -X has to come before -f: in your original command the bundled -cjvf made tar treat the next argument as the archive name, so the exclude file was never read.
But you also ask this:
Is there a correct way doing this?
There is never one canonical best way of doing something like this. It is all based on your final/end goal. Nothing more. Nothing less. That said, if I were you I would do it like this instead, using the -C option:
tar -cjv -X ~/exclude.txt -f ~/2014.tar.bz2 -C /pdf/data/pdfnew/ 2014
The uppercase -C option tells tar to internally change the working directory to /pdf/data/pdfnew/, so you can then create an archive of 2014 without retaining the whole directory tree in the backup. I find this easier to work with, because many times I want to back up the contents of a directory but have no use for retaining the parent structure. That way the archive is more like a traditional ZIP archive, which I find easier to understand and work with.
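For reference, exclude.txt takes one pattern per line, matched against the archive member names (so with the -C form above, members start with 2014/); a purely hypothetical example:
2014/public/tmp
*.log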

excluding a directory from accurev using pop command

I have 10 directories in an AccuRev depot and don't want to populate one of them when using the "accurev pop" command. Is there any way? .acignore doesn't suit my requirements because I need that folder in another Jenkins build. I just want to save time by avoiding unnecessarily populating directories.
Any idea?
Thanks,
Sanjiv
I would create a stream off this stream and exclude the directories you don't want. Then you can pop this stream and get only the directories you want.
When you run the AccuRev populate command you can specify which directories to populate by specifying the directory name:
accurev pop -O -R thisDirectory
will cause the contents of thisDirectory to match the backing stream from the time of the last AccuRev update in that workspace.
The -O is for overwrite and the -R is for recurse. If you have active work in progress, the -O will cause that work to be overwritten/destroyed.
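If you only want to skip one directory, a simple shell loop around the same command could populate the rest (the directory names here are placeholders):
# Populate each wanted directory, leaving out the one to skip
for d in dir1 dir2 dir3; do
  accurev pop -O -R "$d"
done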
The .acignore is only for (external) files and not those that are being managed by AccuRev.
David Howland

copying the whole directory with some exceptions using scp

I have to retrieve a directory with all its subdirectories from a server. However, I want to exclude files with a specific extension (they are heavy and useless to me).
scp -r myname@servername:foldertocpy .
copies the whole directory, but I don't know how to exclude files with the .abc extension, let's say.
I would like to use scp because it already automatically handles my passwords.
This isn't possible with scp alone, as scp doesn't have an exclude flag. I assume you want to use the key auth you've already set up with ssh/scp. If so, I would do rsync over ssh; it will then use your existing key authentication.
Something like this would work:
rsync --exclude '*.abc' -avz -e ssh myname@servername:foldertocpy .
Have a look at man rsync for an explanation of the flags.
Hope this helps,
Will

bzr ignore for all executables files under Linux

Is there a way to make Bazaar ignore all executable files under Linux? They don't have a particular extension, so I'm not able to accomplish this with a regexp.
Thank you very much.
If all your executables were under a certain directory, you could ignore the directory's contents (e.g. bzr ignore "mybindir/*"). I realize this isn't exactly what you want, but other than bialix's workaround I don't think there is a better answer at the moment. It might be possible in future to add a keyword like EXECUTABLE: to the .bzrignore file to indicate what you need. Even better would be the ability to chain them, e.g. EXECUTABLE:RE:someprefix.+ .
According to bzr ignore -h there is no pattern to select executable files.
But you can ignore them one by one.
find . -type f -perm /111 -print0 | xargs -0 bzr ignore
You can ignore all files without an extension using the following regex: RE:\.?[^.]+ but it will also ignore all directories and symlinks that don't have an "extension", i.e. anything after a dot.
Sometimes that's undesirable, so if you don't have a lot of executable files you'd be better off ignoring them by name.
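If you do go name by name, the regex from above is passed like any other ignore pattern; a minimal sketch (remember it also catches extension-less directories and symlinks):
bzr ignore 'RE:\.?[^.]+'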