I'm storing copies of database backups on Amazon S3 using the Python Boto library. But I worry that if my web server was hacked, those backups could be deleted using the credentials I need to do the upload.
OK, so I know you can grant permissions to another Amazon account's email address, so I can imagine doing that after an upload and then removing the original user's write access. BUT in this scenario I now end up with two accounts and two sets of invoices to give to accounting every month.
Is there a solution to this that doesn't require multiple invoices, yet keeps my backups completely independent of my web server? What's the best practice here?
I've just seen that Amazon announced Consolidated Billing to solve this problem.
Are there any other/better solutions?
Also, if you are really worried, there is 'MFA Delete' (MFA == Multi-Factor Authentication).
With MFA Delete turned on, which requires versioning, no one can delete files from S3 unless they have a physical key fob thingy with a constantly changing number on it; that number has to be entered before the file can be deleted. Kinda 'secret agent man'-like.
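To see what that looks like in code, here is a minimal sketch using boto3 (the question mentions the older boto library; the bucket name, MFA device ARN and token below are placeholders). Note that MFA Delete can only be enabled over the API by the bucket owner's root credentials, supplying a current MFA code:

import boto3

s3 = boto3.client("s3")
# Enable versioning and MFA Delete in one call; the MFA value is the device
# serial/ARN followed by the code currently shown on the device.
s3.put_bucket_versioning(
    Bucket="my-backup-bucket",  # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
    MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",
)

After this, requests to delete object versions are refused unless they also carry a valid MFA code.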
I wanted to ask whether you have any proven ways of dealing with removing untracked attachments from S3.
Architecture diagram
I am using S3 in a project to hold files in such a way that the Frontend fetches a pre-signed URL from the Backend (1). The Frontend then uploads the attachment to S3 (2) and finally adds the URL to some resource (3).
The challenge is that there may be a difference between the attachments on the server and the URLs in the resource, e.g. the Frontend gets the URL from the Backend and sends the file to S3, but never adds it to the Backend resource.
Do any of you have a way to deal with such a problem?
Fetching all the resources from the database once a day and comparing their attachments with those on the server sounds unsatisfactory. :/
Distributed consistency is hard sometimes... Orphaned records in non-transactional systems are hard and expensive to track down.
You have to decide what your tolerance for data-consistency drift is. If you need it to be absolute, then you need to put in place things like rollbacks when transactions fail (e.g. when the write to the backend fails, roll back the write to S3). Alternatively, you can tag files in S3 with a tag that says something like "backend:pending" when creating them and update it to "backend:done" once the backend write succeeds, so that you can easily identify the S3 objects that need cleanup.
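As a rough illustration of that tagging idea with boto3 (bucket, key and file names are made up): the object is created with a backend=pending tag, and the tag is flipped to done once the backend write succeeds, so a cleanup job can safely delete anything still pending after some grace period.

import boto3

s3 = boto3.client("s3")

# 1. Upload the attachment tagged as pending (placeholder bucket/key).
with open("local-file.png", "rb") as f:
    s3.put_object(Bucket="attachments", Key="uploads/123.png", Body=f,
                  Tagging="backend=pending")

# 2. After the backend has persisted the URL, mark the object as done.
s3.put_object_tagging(
    Bucket="attachments",
    Key="uploads/123.png",
    Tagging={"TagSet": [{"Key": "backend", "Value": "done"}]},
)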
A more scalable option would be to use something like SQS to manage your workflow, ensuring that both changes are handled, and creating dead-letter ('hospital') queues for when things go sideways.
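A very small sketch of that, assuming an SQS queue named attachment-events exists and a separate worker consumes it to reconcile S3 with the backend:

import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="attachment-events")["QueueUrl"]

# Emit one event per upload; a worker later checks that the backend actually
# recorded the URL, and deletes the S3 object if it did not.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"bucket": "attachments", "key": "uploads/123.png"}),
)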
My team is planning on building a data processing pipeline that will involve S3 integration with Snowflake. This article from Snowflake shows that an AWS IAM role must be created in order for Snowflake to access S3's data.
However, in our pipeline, we need to ensure multi-tenancy and data isolation between users. For example, let's assume that Alice and Bob have files in S3 under "s3://bucket-alice/file_a.csv" and "s3://bucket-bob/file_b.csv" respectively. Then we want to make sure that, when staging Alice's data onto Snowflake, Alice can only access "s3://bucket-alice" and nothing under "s3://bucket-bob". This means that individual AWS IAM roles would have to be created for each user.
I do realize that Snowflake has its own access control system, but my team wants to make sure that data isolation is fully achieved in the S3-to-Snowflake stage of the pipeline, rather than relying only on Snowflake's access control.
We are worried that this will not be scalable, as AWS sets a limit of 5000 IAM users, and that will not be enough as we scale our product. Is this the only way of ensuring data multi-tenancy, and does anyone have a real-world application example of something like this?
Have you explored leveraging Snowflake's internal stages instead? By default, every user gets their own internal stage that only they have permission to use from within Snowflake, and that has NO access from outside of Snowflake. Snowflake offers the ability to move data in and out of that internal stage using just about every driver/connector Snowflake has available. That said, any pipeline/workflow leveraged by 5000+ users would be able to use these connectors to load data into the Snowflake internal stage (S3) without the need for any additional AWS IAM users. Would that be a sufficient solution for your situation?
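For what it's worth, a minimal sketch of loading a file into a user's internal stage through the Python connector might look like this (connection parameters, file path and table name are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    user="alice", password="secret", account="xy12345",  # placeholders
)
cur = conn.cursor()
# PUT uploads to the user's own internal stage (@~); no AWS credentials involved.
cur.execute("PUT file:///tmp/file_a.csv @~/staged AUTO_COMPRESS=TRUE")
cur.execute("COPY INTO my_table FROM @~/staged/file_a.csv.gz FILE_FORMAT=(TYPE=CSV)")
conn.close()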
I am working on an application in which I am using AWS Cognito to store user data, and I am trying to understand how to manage the backup and disaster recovery scenarios for Cognito!
Following are the main questions I have:
What is the availability of this stored user data?
What are the possible scenarios with Cognito that I need to take care of before we go to production?
AWS does not have any published SLA for AWS Cognito, so there is no official guarantee for your data stored in Cognito. As to how safe your data is, AWS Cognito is built on other AWS services (DynamoDB, I think), and data in those services is replicated across Availability Zones.
I guess you are asking about Disaster Recovery scenarios. There is not much you can do on your end. If you use User Pools, there is no feature to export user data as of now. Although you can do so by writing a custom script, a built-in backup feature would be much more efficient and reliable. If you use Federated Identities, there is no way to export and re-use identities. If you use datasets provided by Cognito Sync, you can use Cognito Streams to capture dataset changes. Not exactly a stellar way to back up your data.
In short, there is no official word on availability and no official backup or DR feature. I have heard that there are feature requests for these, but who knows when they will be released, and there is not much you can do by writing custom code or following best practices either. The only thing I can think of is to periodically back up your user pool's user data with a custom script using the AdminGetUser API. But again, there are rate limits on how many times you can call this API, so a backup using this method can take a long time.
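As a sketch of such a custom script (this one uses boto3's ListUsers paginator for a bulk export rather than per-user AdminGetUser calls; the pool id is a placeholder), still subject to the Cognito API rate limits mentioned above:

import json
import boto3

cognito = boto3.client("cognito-idp")
users = []
# ListUsers is paginated and rate-limited, so exporting a large pool takes time.
for page in cognito.get_paginator("list_users").paginate(UserPoolId="us-east-1_EXAMPLE"):
    users.extend(page["Users"])

with open("userpool-backup.json", "w") as f:
    json.dump(users, f, default=str)  # attribute timestamps are datetime objects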
AWS now offers an SLA for Cognito. In the event they are unable to meet their availability target (99.9% at the time of writing), you will receive service credits.
Even though there are a couple of third-party solutions available, when restoring a user pool the users are created using the admin flow (users are not restored so much as re-created by an admin), and they end up in the "Force Change Password" status. So users will be forced to change their password using the temporary password, and that has to be facilitated from the front end of the application.
More info: https://docs.amazonaws.cn/en_us/cognito/latest/developerguide/signing-up-users-in-your-app.html
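To make the "created by an admin" part concrete, a restore script typically re-creates each backed-up user roughly like this (boto3; pool id, username and temporary password are placeholders). The new account lands in FORCE_CHANGE_PASSWORD status, which is exactly the behaviour described above:

import boto3

cognito = boto3.client("cognito-idp")
cognito.admin_create_user(
    UserPoolId="us-east-1_EXAMPLE",                      # placeholder
    Username="alice@example.com",
    UserAttributes=[
        {"Name": "email", "Value": "alice@example.com"},
        {"Name": "email_verified", "Value": "true"},
    ],
    TemporaryPassword="Temp#Pass123",   # must be replaced at first sign-in
    DesiredDeliveryMediums=["EMAIL"],
)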
Tools available.
https://www.npmjs.com/package/cognito-backup
https://github.com/mifi/cognito-backup
https://github.com/rahulpsd18/cognito-backup-restore
https://github.com/serverless-projects/cognito-tool
Please bear in mind that some of these tools are outdated and cannot be used. I have tested "cognito-backup-restore" and it works as expected.
Also, you have to think about how to secure the user information output by these tools. Usually they create a JSON file containing all the user information (except passwords, since passwords cannot be backed up), and this file is not encrypted.
The best solution so far is to prevent accidental deletion of user pools with AWS SCPs (service control policies).
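A sketch of such an SCP, created here via boto3 and AWS Organizations (the policy name is made up, and the policy must still be attached to the relevant OU or account):

import json
import boto3

org = boto3.client("organizations")
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["cognito-idp:DeleteUserPool", "cognito-idp:DeleteUserPoolDomain"],
        "Resource": "*",
    }],
}
# Deny user pool deletion in every account the SCP is attached to.
org.create_policy(
    Name="deny-cognito-userpool-deletion",               # hypothetical name
    Description="Prevent accidental deletion of Cognito user pools",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)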
I am working on a project in which we are planning to use EdgeCast to store our data. I am concerned about it, because the client wants to upload the image to our server first, and then use curl to upload it to EdgeCast. In this case our servers will be "proxying" the request, doubling the time needed for uploads.
What would you suggest? And is direct uploading risky?
PS: the reason I mentioned S3 is its similarity to EdgeCast, hence I assume the same principles will apply.
Yep, Martin's right: it's usually a good idea to have a proxy when letting users have direct access to storage. EdgeCast supports rsync, which will automatically sync content between your server and the EdgeCast storage account. Or you can use our "customer origin" reverse-proxy feature, which lets our network pull content automatically from your servers as it is requested by the public. Feel free to contact us at sales@edgecast.com with questions.
Having your server in between the end user and the storage is probably a good idea. Whenever I give users direct access to storage, whether over FTP or SSH, it tends to get really messy: a place where you can upload files that become accessible from the web gets used for all sorts of things.
With your server in between, you can organise the uploaded files into some rational structure, a folder per date for instance, and perhaps also enforce strict naming of the files themselves, avoiding URL-encoding problems and the like.
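For example, a small helper along these lines (names made up) keeps user-supplied filenames out of the final URL and gives you a date-based layout:

import uuid
from datetime import datetime, timezone

def make_key(original_name: str) -> str:
    # One folder per date plus a random, URL-safe file name.
    ext = original_name.rsplit(".", 1)[-1].lower() if "." in original_name else "bin"
    today = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"uploads/{today}/{uuid.uuid4().hex}.{ext}"

print(make_key("holiday photo (1).JPG"))  # e.g. uploads/2024/05/01/3f2c9a....jpg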
There is no reason to be concerned about EdgeCast. In my opinion, it always does its best to serve its customers well, so that their websites are as fast as possible as well as secure and well-optimized. For a full comparison of EdgeCast vs Amazon, look at http://jodihost.com/2014_edgecast_vs_amazon.php
What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.
One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).
Cool, I may look into writing a script to host on EC2. The main purpose of the backup is to guard against human error on our side -- if a user accidentally deletes a bucket or something like that.
If you're worried about deletion, you should probably look at S3's new Versioning feature.
I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job of backing up the data, so I don't think you necessarily need to worry about data loss, unless your backup is for archival purposes or you want to safeguard against accidentally deleting files.
You can build an application or service that is responsible for creating two instances of AmazonS3Client, one for the source and one for the destination; the source AmazonS3Client loops over the source bucket and streams objects in, and the destination AmazonS3Client streams them out to the destination bucket.
Note: this doesn't work for cross-account syncing, but it does work cross-region on the same account.
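The same idea in Python with boto3, as a rough cross-region sketch (bucket names and regions are placeholders); each object is streamed through the machine running the script rather than copied server-side:

import boto3

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="eu-west-1")

# Walk the source bucket and stream every object into the destination bucket.
for page in src.get_paginator("list_objects_v2").paginate(Bucket="source-bucket"):
    for obj in page.get("Contents", []):
        body = src.get_object(Bucket="source-bucket", Key=obj["Key"])["Body"]
        dst.upload_fileobj(body, "destination-bucket", obj["Key"])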
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
There are a variety of ways to run this nightly. One example is the AWS Instance Scheduler (personally unverified): https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html
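Another option, offered only as a hedged sketch, is a scheduled EventBridge rule that triggers whatever runs the sync, for example a Lambda function; all names and ARNs below are placeholders:

import boto3

events = boto3.client("events")
# Fire every night at 02:00 UTC.
events.put_rule(
    Name="nightly-s3-backup",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)
# The target Lambda also needs a resource policy allowing events.amazonaws.com to invoke it.
events.put_targets(
    Rule="nightly-s3-backup",
    Targets=[{"Id": "backup-target",
              "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:s3-sync"}],
)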