I would like to use AWS Data Pipeline to start an EC2 instance and then run a Python script that is stored in S3.
Is this possible? I would like to implement a single ETL step using a Python script.
Is this the best way to do it?
Yes, it is possible and relatively straightforward using the ShellCommandActivity.
Based on the details you have provided so far, it seems to be the best way: Data Pipeline provisions the EC2 instance for you on demand and shuts it down afterwards.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
There is also a tutorial that you can follow to get acclimated to the ShellCommandActivity of Data Pipeline.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-gettingstartedshell.html
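As a rough illustration, the activity's command could fetch and run your script, assuming the instance image ships with the AWS CLI and Python; the bucket and script names below are placeholders:
aws s3 cp s3://my-etl-bucket/scripts/etl_step.py /tmp/etl_step.py   # download the script from S3
python /tmp/etl_step.py                                             # run the ETL step
ShellCommandActivity also has a scriptUri field that can point directly at a script stored in S3.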
Yes, you can directly upload and back up your data in S3:
http://awssolution.blogspot.in/2015/10/how-to-backup-share-and-organize-data.html
I want to back up Redis data to a Google Cloud Storage bucket as a flat file. Is there any existing utility to do that?
I don't fully agree with the idea of backing up cache data to the cloud, but I was wondering if there is an existing utility rather than reinventing the wheel.
If you are using Cloud Memorystore for Redis, you can refer to the Cloud Memorystore export documentation. Notice that you can simply use the following gcloud command:
gcloud redis instances export gs://[BUCKET_NAME]/[FILE_NAME].rdb [INSTANCE_ID] --region=[REGION] --project=[PROJECT_ID]
or use the Export operation from the Cloud Console.
If you manage your own instance (e.g. you have Redis hosted on a Compute Engine instance), you can use the SAVE or BGSAVE (preferred) command to take a snapshot of the instance, and then upload the .rdb file to Google Cloud Storage using any of the available methods. I think the most convenient one is gsutil (notice that it requires an installation procedure), used in a similar fashion to:
gsutil cp path/to/your-file.rdb gs://[DESTINATION_BUCKET_NAME]/
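Putting that together, a minimal sketch for a self-managed instance (the dump path and bucket name are assumptions; the location of dump.rdb varies by distribution and redis.conf):
redis-cli BGSAVE                      # take a snapshot in the background
# wait until the save completes (e.g. poll redis-cli LASTSAVE), then:
gsutil cp /var/lib/redis/dump.rdb gs://my-backup-bucket/redis-backup.rdb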
I want to automatically sync my local folder with an S3 bucket. I mean, when a file changes in S3, it should automatically be updated in the local folder.
I tried a scheduled task with the AWS CLI, but I think there is a better way to do it.
Do you know of an app or a better solution?
Hope you can help me.
@mgg, you can mount the S3 bucket on the local server using s3fs; this way you can sync your local changes to the S3 bucket.
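A minimal sketch, assuming s3fs-fuse is installed and your credentials are stored in ~/.passwd-s3fs (bucket name and mount point are placeholders):
s3fs mybucket /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs   # mount the bucket as a local filesystem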
You could execute code (a Lambda function) that responds to events in a given bucket (such as a file being created, changed, or deleted). For example, a simple HTTP service could receive a POST or GET request from that Lambda and update your local data accordingly; see the sketch after the links below.
Read more:
Tutorial, Using AWS Lambda with Amazon S3
Working with Lambda Functions
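As a hedged sketch, the bucket events can be wired to your Lambda with the CLI; the bucket name and function ARN below are placeholders, and the function must already permit S3 to invoke it (see aws lambda add-permission):
aws s3api put-bucket-notification-configuration --bucket my-bucket --notification-configuration '{"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:sync-notifier", "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]}]}'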
The other approach (I don't recommend it) is to have some code polling for changes in the bucket and then reflecting those changes locally. At first glance it looks easier to implement, but it gets complicated when you try to handle anything beyond creation events.
And on each cycle of your polling component you have to check all elements in your local directory against all elements in the bucket, which is a performance-killing approach!
Creating AMIs from EBS-backed instances is exceedingly easy, but doing the same from an instance-store-based instance seems like it can only be done manually using the CLI.
So far I've been able to bootstrap the creation of an 'instance-store' based server off of an HVM Amazon Linux AMI with Ansible, but I'm getting lost on the steps that follow... I'm trying to follow this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-instance-store-ami.html#amazon_linux_instructions
Apparently I need to store my x.509 cert and key on the instance, but which key is that? Is that...
one I have to generate on the instance with openssl,
one that I generate/convert from AWS,
one I generate with Putty, or
one that already exists in my AWS account?
After that, I can't find any reference to ec2-bundle-vol in Ansible. So I'm left wondering if the only way to do this is with Ansible's command module.
Basically, what I'm hoping to find out is: is there a way to easily create instance-store-backed AMIs using Ansible, and if not, can anyone outline the steps necessary to automate this? Thanks!
Generally speaking, Ansible AWS modules are meant to manage AWS resources by interacting with the AWS HTTP API (i.e. actions you could otherwise do in the AWS Management Console).
They are not intended to run AWS-specific system tools on EC2 instances.
ec2-bundle-vol and ec2-upload-bundle must be run on the EC2 instance itself; they are not callable via the HTTP API.
I'm afraid you need to write a custom playbook / role to automate the process.
On the other hand, aws ec2 register-image is an AWS API call and corresponds to the ec2_ami Ansible module.
Unfortunately, this module doesn't seem to support image registering from an S3 bucket.
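For reference, here is a hedged sketch of the commands such a playbook would wrap (e.g. via Ansible's command module); the account ID, key paths, bucket, and AMI name are placeholders:
# on the instance:
ec2-bundle-vol -k /tmp/pk.pem -c /tmp/cert.pem -u 123456789012 -r x86_64 -d /mnt/bundle
ec2-upload-bundle -b my-bucket/bundles/my-ami -m /mnt/bundle/image.manifest.xml -a $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY
# from any machine with the AWS CLI:
aws ec2 register-image --image-location my-bucket/bundles/my-ami/image.manifest.xml --name my-instance-store-ami --virtualization-type hvm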
I have an S3 bucket with about 100 GB of small files (in folders).
I have been asked to back this up to a local NAS on a weekly basis.
I have access to an EC2 instance that is attached to the S3 storage.
My NAS allows me to run an SFTP server.
I also have access to a local server on which I can run a cron job to pull the backup if need be.
How can I best go about this? If possible, I would like to only download the files that have been added or changed, or compress them on the server end and then push the compressed file to the SFTP server on the NAS.
The end goal is to have a complete backup of the S3 bucket on my NAS with the lowest amount of transfer each week.
Any suggestions are welcome!
Thanks for your help!
Ryan
I think the most scalable method for you to achieve this is using AWS Elastic MapReduce and Data Pipeline.
The architecture is as follows:
You will use Data Pipeline to configure S3 as an input data node, then an EC2 instance with Pig/Hive scripts to do the required processing and send the data to SFTP. Pig can be extended with a custom UDF (user-defined function) to send data to SFTP. You can then set up this pipeline to run at a periodic interval. Having said this, it requires quite some reading to achieve all of it, but it is a good skill to have if you foresee future data-transformation needs.
Start reading from here:
http://aws.typepad.com/aws/2012/11/the-new-amazon-data-pipeline.html
A similar method can be used for taking periodic backups of DynamoDB to S3, reading files from FTP servers, and processing and moving data to, say, S3 or RDS.
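As a rough sketch of the mechanics only (the pipeline name and ID are placeholders; pipeline.json would define the S3 input data node, the EC2/EMR resource, and the activity):
aws datapipeline create-pipeline --name s3-to-sftp --unique-id s3-to-sftp-1
aws datapipeline put-pipeline-definition --pipeline-id df-EXAMPLE --pipeline-definition file://pipeline.json
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE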
I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.
I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.
Unfortunately, Adam's suggestion won't work, as his understanding of EBS is wrong (although I wish he were right, and I have often thought myself it should work that way). EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances that is separate from, but attachable to, the instances. You still have to copy data between S3 and EC2, even though there are no data transfer costs between the two.
You didn't mention the operating system of your instance, so I cannot give tailored information. A popular command-line tool I use is http://s3tools.org/s3cmd. It is based on Python and therefore, according to the info on its website, it should work on Windows as well as Linux, although I use it all the time on Linux. You could easily whip up a quick script that uses its built-in sync command, which works similarly to rsync, and have it triggered every time you're done processing your data. You could also use the recursive put and get commands to transfer data only when needed.
There are also graphical tools like Cloudberry Pro that have some command-line options for Windows and let you set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.
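For example, a minimal sync sketch (bucket name and local path are placeholders):
s3cmd sync s3://mybucket/ /local/data/   # pull new and changed files from S3
s3cmd sync /local/data/ s3://mybucket/   # push results back to S3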
By now there is a sync command in the AWS command-line tools that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
On startup:
aws s3 sync s3://mybucket /mylocalfolder
Before shutdown:
aws s3 sync /mylocalfolder s3://mybucket
Of course, the details are always fun to work out, e.g. how parallel it is (and whether you can make it more parallel, and whether that is any faster given the virtualized nature of the whole setup).
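If you want to experiment with the parallelism, the AWS CLI exposes its S3 transfer concurrency as a setting; a hedged example (20 is just an illustration, the default is 10):
aws configure set default.s3.max_concurrent_requests 20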
Btw hope you're still working on this... or somebody is. ;)
I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.
http://aws.amazon.com/ebs/
Install the s3cmd package:
yum install s3cmd
or
sudo apt-get install s3cmd
depending on your OS
Then copy data with:
s3cmd get s3://tecadmin/file.txt
You can also list files with the ls command, as shown below. For more details, see the s3cmd documentation (http://s3tools.org/s3cmd).
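For example, using the same bucket as above:
s3cmd ls s3://tecadmin/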
For me, the simplest way is:
wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext
run from PuTTY. (Note that this only works if the object is publicly readable; otherwise you need a pre-signed URL.)