Is there a way to make gsutil rsync remove synced files?
As far as I know, normally it is done by passing --remove-source-files, but it does not seem to be an option with gsutil rsync (documentation).
Context:
I have a script that produces a large number of CSV files (100 GB+). I want those files to be transferred to Cloud Storage (and, once transferred, removed from my HDD).
Ended up using gcsfuse.
Per documentation:
Local storage: Objects that are new or modified will be stored in their entirety in a local temporary file until they are closed or synced.
One workaround for small buckets is to delete all bucket contents and re-sync periodically.
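A rough sketch of that workaround, with placeholder bucket and path names:
gsutil -m rm -r gs://mybucket/csvs                  # clear the existing bucket contents
gsutil -m rsync -r /local/csvs gs://mybucket/csvs   # re-sync everything from scratch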
Related
I need to "move" all the content of a folder, including its subfolders to a bucket in Google Cloud Storage.
The closest way is to use gsutil rsync, but it clones all the data without moving the files.
I need to move all the data and keep it only in GCP, not in local storage. My local storage is being used only as a pass-through server (because I only have a few GB of space for data locally).
How can I achieve this?
Is there any way with gsutil?
Thanks!
To move the data to a bucket and reclaim the space on your local disk, you need to use the mv command, for example:
gsutil mv -r mylocalfolder gs://mybucketname
The mv command copies the files to the bucket and deletes them after the upload.
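If you are moving a lot of files, gsutil's -m flag runs the operation in parallel and may speed this up considerably; the folder and bucket names below are the same placeholders as above:
gsutil -m mv -r mylocalfolder gs://mybucketname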
I have an application that needs to create a compressed file from different objects that are saved on S3. The issue I am facing is that I would like to compress objects on the fly, without downloading the files into a container and doing the compression there. The reason is that the files can be quite big, so I can easily run out of disk space, and of course there is the extra round-trip time of downloading files to disk, compressing them, and uploading the compressed file to S3 again.
It is worth mentioning that I would like to place the files in different directories within the output compressed file, so that when a user decompresses the file they can see it is stored in different folders.
Since S3 does not have the concept of a physical folder structure, I am not sure if this is possible and whether there is a better way than downloading/uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how I can export files from S3 without downloading objects to a local disk, then create a zip file and upload it to S3. I would like to simply zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case as yours. I researched it for about 2 months and tried multiple approaches, but in the end I had to use ECS (EC2) for my use case, because the zip file can be huge, like 100 GB.
Currently AWS doesn't support a native way to perform compression. I have talked to them and they are considering it as a feature, but no timeline has been given yet.
If your files are around 3 GB in size, you can think of Lambda to achieve your requirement.
If your files are more than 4 GB, I believe it is safer to do it with ECS or EC2 and attach more volume if it requires more space/memory for compression.
Thanks,
Yes, there are at least two ways: either using AWS-Lambda or AWS-EC2
EC2
Since the aws-cli cp command supports streaming to and from standard input/output, you can pipe an S3 file through any archiver using a Unix pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
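Extending the same idea to several objects, a minimal sketch that streams each one through gzip without touching local disk (the bucket name and keys are the placeholders from the example above); note that this produces one .gz per object rather than a single zip with a custom directory layout:
for key in big-file1 big-file2 big-file3; do
  aws s3 cp "s3://yours-bucket/$key" - | gzip | aws s3 cp - "s3://yours-bucket/compressed/$key.gz"
done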
AWS-Lambda
Since maintaining and using an EC2 instance just for compression may be too expensive, you can use Lambda for one-off compressions.
But keep in mind that Lambda has a lifetime limit of 15 minutes. So, if your files are really huge, try this sequence:
To make sure the file gets compressed within that limit, use Lambda to compress it in parts.
The compressed parts can then be merged on S3 into one file using Upload Part - Copy, as sketched below.
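A rough sketch of the merge step using the S3 multipart API from the CLI; the bucket name, keys, and parts.json (a list of each part's PartNumber and ETag) are placeholders, and every part except the last must be at least 5 MB:
UPLOAD_ID=$(aws s3api create-multipart-upload --bucket yours-bucket --key merged-file.gz --query UploadId --output text)
aws s3api upload-part-copy --bucket yours-bucket --key merged-file.gz --copy-source yours-bucket/part-1.gz --part-number 1 --upload-id "$UPLOAD_ID"
aws s3api upload-part-copy --bucket yours-bucket --key merged-file.gz --copy-source yours-bucket/part-2.gz --part-number 2 --upload-id "$UPLOAD_ID"
aws s3api complete-multipart-upload --bucket yours-bucket --key merged-file.gz --upload-id "$UPLOAD_ID" --multipart-upload file://parts.json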
I am syncing a directory to an S3 bucket and only want it to check for files that were created/updated in the last 24 hours.
With GNU/Linux's rsync, you might do this by piping the output of 'find -mtime' to rsync; I'm wondering if anything like this is possible with aws s3 sync?
Edited to show the final goal: I'm running a script that constantly syncs files to S3 from a web server. It runs every minute, first checking whether a sync process is already running (and exiting if it is), then running the aws sync command. The sync command takes about 5 minutes to run and usually picks up 3-5 new files. This causes a slight load on the system, and I think that if it only checked for files from the last 24 hours it would be much, much faster.
No, the AWS Command-Line Interface (CLI) aws s3 sync command does not have an option to only include files created within a defined time period.
See: aws s3 sync documentation
It sounds like most of your time is being consumed by the check of whether files need to be updated. Some options:
If you don't need all the files locally, you could delete them after some time (48 hours?). This means fewer files will need to be compared. By default, aws s3 sync will not delete destination files that do not match a local file (but this can be configured via a flag).
You could copy recent files (past 24 hours?) into a different directory and run aws s3 sync from that directory, then clear out those files after a successful sync run (see the sketch after this list).
If you have flexibility over the filenames, you could include the date in the filename (eg 2018-03-13-foo.txt) and then use the --include and --exclude parameters to only copy files with the desired prefixes.
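Two minimal sketches of the last two options; the local paths and bucket name are placeholders:
# Stage the past 24 hours of files, sync, then clean up
mkdir -p /tmp/recent-sync
find /var/www/uploads -type f -mtime -1 -exec cp --parents {} /tmp/recent-sync \;
aws s3 sync /tmp/recent-sync/var/www/uploads s3://my-bucket/uploads
rm -rf /tmp/recent-sync
# Date-prefixed filenames with include/exclude filters
TODAY=$(date +%F)   # e.g. 2018-03-13
aws s3 sync /var/www/uploads s3://my-bucket/uploads --exclude "*" --include "${TODAY}-*"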
Can CSV files from an AWS S3 bucket be configured to go straight into MarkLogic, or do the files need to land somewhere so that the CSV files can then be ingested using MLCP?
Assuming you have CSV files in the S3 bucket and that one row in a CSV file is to be inserted as a single XML record... that wasn't clear in your question, but it is the most common use case. If your plan is to just pull the files in and persist them as CSV files, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files in that way. Anyway, the MLCP documentation is very helpful in understanding this very versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 bucket to a local filesystem on EC2 so the files become accessible to MLCP without an explicit download step. Google's your friend if that's important. I personally haven't seen a production-stable method for that, but it's been a long time since I've tried.
Regardless, you need to make those files available on a supported source, most likely a filesystem location in this case, where MLCP can run and reach the files. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode. The documentation is very good for understanding all the options.
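Once the CSV files are on a filesystem MLCP can reach, a minimal import sketch might look like this (the host, port, credentials, and input path are placeholders):
mlcp.sh import -host localhost -port 8000 -username admin -password admin -input_file_path /data/csv -input_file_type delimited_text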
I'm trying to figure out right now how to backup some data to S3.
We have a local backup system implemented using rsnapshot, and that works perfectly. We're trying to use s3cmd's sync command to mimic rsync and transfer the files.
The problem we're having is that symlinks aren't created as symlinks; they seem to be resolved to the physical file, which is uploaded instead. Does anyone have any suggestions as to why this would happen?
Am I missing something obvious? Or is it that S3 just isn't suited to this sort of operation? I could setup an EC2 instance and attach some EBS, but it'd be preferable to use S3.
Amazon S3 itself doesn't have the concept of symlinks, which is why I suspect s3cmd uploads the physical file. It's a limitation of S3, not s3cmd.
I'm assuming that you need the symlink itself copied, though? If that's the case, can you gzip/tar your directory with the symlinks and upload that?
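For example, a minimal sketch of that approach (the paths and bucket name are placeholders); tar stores symlinks as symlinks by default:
tar -czf backup.tar.gz /path/to/backup-dir
s3cmd put backup.tar.gz s3://my-bucket/backups/backup.tar.gz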
There are no symlinks available on S3, but what you can use is s3fs (hosted on Google Code), which creates a FUSE-based file system on top of S3. More information here:
https://code.google.com/p/s3fs/wiki/FuseOverAmazon
and here:
http://tjstein.com/articles/mounting-s3-buckets-using-fuse/
I hope it helps
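A minimal sketch of mounting a bucket with s3fs (the bucket name, mount point, and credentials file are placeholders):
s3fs my-bucket /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs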
Try using the -F, --follow-symlinks option when using sync. This worked for me.
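For example (the path and bucket name are placeholders):
s3cmd sync --follow-symlinks /path/to/backup-dir s3://my-bucket/backups/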