backup distributed cache data to cloud storage - redis

I want to backup the REDIS data on google storage bucket as flat file, is there any existing utility to do that?
Although, I do not fully agree to idea of backing up of cache data on cloud. I was wondering if there is any existing utility rather than reinventing the wheel.

If you are using Cloud Memorystore for Redis you can simply refer to the following documentation. Notice that you can simply use the following gcloud command:
gcloud redis instances export gs://[BUCKET_NAME]/[FILE_NAME].rdb [INSTANCE_ID] --region=[REGION] --project=[PROJECT_ID]
or use the Export operation from the Cloud Console.
If you manage your own instance (e.g. you have the Redis instance hosted on a Compute Engine Instance) you could simply use the SAVE or BGSAVE (preferred) commands to take a snapshot of the instance and then upload the .rdb file to Google Cloud Storage using any of the available methods, from which I think the most convenient one would be gsutil (notice that it will require the following installation procedure) in a similar fashion to:
gsutil cp path/to/your-file.rdb gs://[DESTINATION_BUCKET_NAME]/

Related

How do I access files generated during Cloud Run function execution?

I'm running a very simple program getting screenshots of a page using Selenium in Cloud Run. I know that Cloud Run is stateless and I cannot access the screenshot that is generated after the program finishes executing, but I wanted to know where/how can I access these files right after the screenshot is taken and read them, so I can store a reference to them in my Cloud Storage bucket too
You have several solution:
Store the screenshot locally, and then upload them to Cloud Storage (you can create a script for that, use client libraries,...). A good evolution is to make a tar (optionally a gzip also) to upload only 1 file, it's faster.
Use Cloud Run execution runtime 2nd generation, and mount a bucket with GCSFuse into your Cloud Run instance. Like that, a file directly written in the mounted directory will be written on Cloud Storage. For that solution, and despite the good tutorial, it requires good skills in container.

What is the best approach to sync data from AWS 3 bucket to Azure Data Lake Gen 2

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, I would like to have my AWS profile credentials persist?
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command. So, developers like #raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use Azure Data Science Virtual Machine to: a) install the AWS CLI, 2) create my AWS profile to store the access credentials, and 3) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please do also look into Azcopy option . https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool which helps in syncing the data , more or less all will do the copy , I think you will have to implement that . Couple of quick thoughts .
#3 ) You can run this from a batch service . You can initate that from Azure data factory . Also since are talking about Python , you can also run that from Azure data bricks .
#4) ADF does not have any sync logic for the files to be deleted. We can implement that using the getMetadat activity . https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplciate is another option - especially for very large containers https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/

Reading file inside S3 from EC2 instance

I would like to use AWS Data Pipeline to start an EC2 instance and then run a python script that is stored in S3.
Is it possible? I would like to make a single ETL step using a python script.
Is it the best way?
Yes, it is possible and relatively straight forward using Shell Command Activity.
I believe from the details you have provided so far, it seems to be the best way - as DataPipeline provisions the EC2 instance for you ondemand and shuts it down afterwards.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
There is also a tutorial that you can follow to get acclimated to ShellCommndActivity of Data Pipeline.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-gettingstartedshell.html
yes, you can direct upload and backup your data in s3
http://awssolution.blogspot.in/2015/10/how-to-backup-share-and-organize-data.html

getting large datasets onto amazon elastic map reduce

There are some large datasets (25gb+, downloadable on the Internet) that I want to play around with using Amazon EMR. Instead of downloading the datasets onto my own computer, and then re-uploading them onto Amazon, what's the best way to get the datasets onto Amazon?
Do I fire up an EC2 instance, download the datasets (using wget) into S3 from within the instance, and then access S3 when I run my EMR jobs? (I haven't used Amazon's cloud infrastructure before, so not sure if what I just said makes any sense.)
I recommend the following...
fire up your EMR cluster
elastic-mapreduce --create --alive --other-options-here
log on to the master node and download the data from there
wget http://blah/data
copy into HDFS
hadoop fs -copyFromLocal data /data
There's no real reason to put the original dataset through S3. If you want to keep the results you can move them into S3 before shutting down your cluster.
If the dataset is represented by multiple files you can use the cluster to download it in parallel across the machines. Let me know if this is the case and I'll walk you through it.
Mat
If you're just getting started and experimenting with EMR, I'm guessing you want these on s3 so you don't have to start an interactive Hadoop session (and instead use the EMR wizards via the AWS console).
The best way would be to start a micro instance in the same region as your S3 bucket, download to that machine using wget and then use something like s3cmd (which you'll probably need to install on the instance). On Ubuntu:
wget http://example.com/mydataset dataset
sudo apt-get install s3cmd
s3cmd --configure
s3cmd put dataset s3://mybucket/
The reason you'll want your instance and s3 bucket in the same region is to avoid extra data transfer charges. Although you'll be charged in bound bandwidth to the instance for the wget, the xfer to S3 will be free.
I'm not sure about it, but to me it seems like hadoop should be able to download files directly from your sources.
just enter http://blah/data as your input, and hadoop should do the rest. It certainly works with s3, why should it not work with http?

Fastest / best way copy data between S3 to EC2?

I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.
I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.
Unfortunately, Adam's suggestion won't work as his understanding of EBS is wrong (although I wish he was right and often thought myself it should work that way)... as EBS has nothing to do with S3, but it will only give you an "external drive" for EC2 instances that are separate, but connectable to the instances. You still have to do copying between S3 and EC2, even though there are no data transfer costs between the two.
You didn't mention an operating system of your instance, so I cannot give tailored information. A popular command line tool I use is http://s3tools.org/s3cmd ... it is based on Python and therefore, according to info on its website it should work on Win as well as Linux, although I use it ALL the time on Linux. You could easily whip up a quick script that uses its built in "sync" command that works similar to rsync, and have it triggered every time you're done processing your data. You could also use the recursive put and get commands to get and put data only when needed.
There are graphical tools like Cloudberry Pro that have some command line options for Windows too that you can setup schedule commands. http://s3tools.org/s3cmd is probably the easiest.
By now, there is a sync command in the AWS Command line tools, that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
On startup:
aws s3 sync s3://mybucket /mylocalfolder
before shutdown:
aws s3 sync /mylocalfolder s3://mybucket
Of course, the details are always fun to work out eg. how can parallel it is (and can you make it more parallel and is that any faster goven the virtual nature of the whole setup)
Btw hope you're still working on this... or somebody is. ;)
I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.
http://aws.amazon.com/ebs/
Install s3cmd Package as
yum install s3cmd
or
sudo apt-get install s3cmd
depending on your OS
then copy data with this
s3cmd get s3://tecadmin/file.txt
also ls can list the files.
for more detils see this
For me the best form is:
wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext
from PuTTy