S3 Integration with Snowflake: Best way to implement multi-tenancy? - amazon-s3

My team is planning to build a data processing pipeline that will involve S3 integration with Snowflake. This article from Snowflake shows that an AWS IAM role must be created in order for Snowflake to access data in S3.
However, in our pipeline we need to ensure multi-tenancy and data isolation between users. For example, assume that Alice and Bob have files in S3 under "s3://bucket-alice/file_a.csv" and "s3://bucket-bob/file_b.csv" respectively. We want to make sure that, when staging Alice's data into Snowflake, Alice can only access "s3://bucket-alice" and nothing under "s3://bucket-bob". This means that an individual AWS IAM role must be created for each user.
I do realize that Snowflake has its own access control system, but my team wants to make sure that data isolation is fully achieved in the S3-to-Snowflake stage of the pipeline, rather than relying solely on Snowflake's access control.
We are worried that this will not be scalable: AWS sets a limit of 5,000 IAM users per account, and that will not be enough as we scale our product. Is this the only way of ensuring data multi-tenancy, and does anyone have a real-world example of an application like this?
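For concreteness, the per-tenant isolation described above amounts to generating a tenant-scoped S3 policy for each role. A minimal sketch (bucket names taken from the question; the actual role creation via boto3 or Terraform is omitted):

```python
import json

def tenant_s3_policy(tenant_bucket):
    """Build an IAM policy document that restricts a tenant's Snowflake
    integration role to a single bucket (read-only, for staging)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object reads are limited to this tenant's bucket only
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"arn:aws:s3:::{tenant_bucket}/*",
            },
            {
                # Listing is likewise limited to the tenant's bucket
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{tenant_bucket}",
            },
        ],
    }

policy = tenant_s3_policy("bucket-alice")
print(json.dumps(policy, indent=2))
```

A role carrying this policy can stage from "s3://bucket-alice" but has no statement granting anything under "s3://bucket-bob", which is the isolation property the question asks for.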

Have you explored leveraging Snowflake's internal stage instead? By default, every user gets their own internal stage that only they have permission to use from within Snowflake, with no access from outside of Snowflake. Snowflake can move data in and out of that internal stage using just about every driver/connector it has available. That said, any pipeline/workflow used by 5,000+ users would be able to use these connectors to load data into the Snowflake internal stage (which is backed by S3) without the need for any additional AWS IAM users. Would that be a sufficient solution for your situation?
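As a sketch of that workflow with the snowflake-connector-python driver: `@~` denotes the current user's internal stage, and a `PUT` statement uploads a local file into it. Connection parameters and table names below are placeholders; only the statement-building helper runs without a Snowflake account.

```python
def put_command(local_path, stage_path="@~"):
    """Build the PUT statement that uploads a local file to an internal
    stage (Snowflake compresses and encrypts the file on upload)."""
    return f"PUT file://{local_path} {stage_path} AUTO_COMPRESS=TRUE"

# With a real connection this would be executed roughly as:
#   import snowflake.connector
#   conn = snowflake.connector.connect(user=..., account=..., password=...)
#   conn.cursor().execute(put_command("/tmp/file_a.csv"))
#   conn.cursor().execute("COPY INTO my_table FROM @~ FILE_FORMAT=(TYPE=CSV)")

print(put_command("/tmp/file_a.csv"))
```

Because each user's `@~` stage is private to them inside Snowflake, this path gives per-user isolation without any per-user IAM resources on the AWS side.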

Related

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the two following points, please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> The COPY INTO doc says that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, but slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage, so you don't have to define it on every COPY INTO statement. You can also tie a connection object containing all of your security information to the stage, making that transparent to your users. Lastly, if you have a lot of code that references the stage and you later move your bucket, you won't need to update any of that code. This is nice for dev-to-prod migrations as well.
Snowflake can load from any S3 bucket regardless of region. It might be a little slower, but not any slower than it would be for you to copy the data to a different bucket first and then load it into Snowflake. Just be aware that you may incur egress charges from AWS for moving data across regions.
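To illustrate the advantage described above, here is a sketch of the two statements involved. The stage, integration, and table names are made up; note how the file format lives on the stage, so every COPY INTO stays short and the bucket URL appears in exactly one place:

```python
# Stage definition: file format and storage integration are bound here once.
create_stage = """
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
"""

# Loads then reference only the stage name; if the bucket moves,
# only the CREATE STAGE statement changes.
copy_into = "COPY INTO my_table FROM @my_s3_stage;"

for stmt in (create_stage, copy_into):
    print(stmt.strip())
```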

Using AWS Glue to Create a Table and move the dataset

I've never used AWS Glue, however I believe it will deliver what I want, and I'm after some advice. I have a monthly CSV data upload that I push to S3, with a staging Athena table (all strings) associated to it. I want Glue to perform a Create Table As (with all necessary convert/cast) against this dataset in Parquet format, and then move that dataset from one S3 bucket to another S3 bucket, so the primary Athena table can access the data.
As stated, never used Glue before, and want a starter for 10, so I don't go down rabbit holes.
I currently perform all these steps manually, so want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, Glue doesn't perform CREATE TABLE AS SELECT queries; instead it does this with Spark-based ETL jobs. Here is a GitHub repo that describes such a process in quite detailed fashion, and here is more official AWS documentation on ETL programming with the AWS Glue service. After the initial setup, you can define trigger events/scheduling to run your Glue ETL jobs automatically.
However, one thing to remember is the cost of the AWS Glue service. Since it is billed by execution time, it is sometimes not trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different S3 bucket. In that case you would know the exact price, since Athena pricing depends on the size of the data scanned. You can then use the AWS API to manipulate the metadata catalog so that the new information is accessible and in one place.
Since you are new to AWS Glue ETL jobs, I would suggest sticking with CTAS queries for simple tasks (although you can come up with quite complicated queries) and looking into the open-source project Apache Airflow for automation/scheduling and orchestration. This is the approach I am using for tasks similar to yours. Airflow is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all scheduling and retry logic. It even has hooks to interact with AWS services, and provides a dedicated operator for sending queries to Athena. I wrote a little more about this approach here.
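A sketch of the CTAS route suggested above: one Athena query rewrites the all-strings staging table as Parquet under a new S3 location. Table, column, database, and bucket names are illustrative; only the query-building helper runs without AWS credentials.

```python
def ctas_query(source_table, target_table, output_bucket):
    """Build an Athena CREATE TABLE AS query that casts staging columns
    and writes the result as Parquet to a different bucket."""
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', "
        f"external_location = 's3://{output_bucket}/{target_table}/') "
        f"AS SELECT CAST(id AS INTEGER) AS id, CAST(amount AS DOUBLE) AS amount "
        f"FROM {source_table}"
    )

# With boto3 this would be submitted roughly as:
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=ctas_query("staging_csv", "monthly_parquet", "my-curated-bucket"),
#       QueryExecutionContext={"Database": "analytics"},
#       ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
#   )

print(ctas_query("staging_csv", "monthly_parquet", "my-curated-bucket"))
```

Scheduling that call from an Airflow DAG (or its dedicated Athena operator) replaces the manual monthly steps.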

Are calls to sensitive data in S3 from AWS Athena tracked in behavior analytics in AWS Macie?

I'm looking to understand whether calls from Athena made to sensitive data in S3 identified by Macie would be included in the behavior analytics performed by Macie? For example, if someone gets query results using Athena in a way that would trigger an anomaly alert from Macie, would that level of visibility be available?
Macie works by analyzing CloudTrail events, and treats the S3 operations made by Athena on behalf of an IAM user the same as any other S3 operations made by that user – so yes, queries run through Athena should trigger the same anomaly alerts as direct S3 usage.

How to work with AWS Cognito in production environment?

I am working on an application in which I am using AWS Cognito to store user data, and I am trying to understand how to manage backup and disaster recovery scenarios for Cognito.
Following are the main queries I have:
I wanted to know: what is the availability of this stored user data?
What are the possible scenarios with Cognito that I need to take care of before we go to production?
AWS does not have any published SLA for AWS Cognito, so there is no official guarantee for your data stored in Cognito. As to how safe your data is: AWS Cognito is built on other AWS services (DynamoDB, I think), and data in those services is replicated across Availability Zones.
I guess you are asking about disaster recovery scenarios. There is not much you can do on your end. If you use User Pools, there is no feature to export user data as of now. You can do so by writing a custom script, but a built-in backup feature would be much more efficient and reliable. If you use Federated Identities, there is no way to export and re-use identities. If you use datasets provided by Cognito Sync, you can use Cognito Streams to capture dataset changes. Not exactly a stellar way to back up your data.
In short, there is no official word on availability and no official backup or DR feature. I have heard there are feature requests for these, but who knows when they will be released. And there is not much you can do beyond writing custom code and following best practices. The only thing I can think of is to periodically back up your User Pool's user data with a custom script using the AdminGetUser API. But again, there are rate limits on how many times you can call this API, so a backup using this method can take a long time.
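A sketch of such a custom backup script, using the bulk ListUsers paginator rather than one AdminGetUser call per user (both are real Cognito APIs and both are rate-limited; ListUsers pages through the pool in batches). Pool ID and file name are placeholders, and note that passwords are never exposed by any Cognito API:

```python
import json

def export_user_pool(cognito_client, user_pool_id):
    """Collect every user in a Cognito User Pool via the ListUsers
    paginator and return them as a JSON-serialisable list."""
    users = []
    paginator = cognito_client.get_paginator("list_users")
    for page in paginator.paginate(UserPoolId=user_pool_id):
        users.extend(page["Users"])  # attributes only; no passwords
    return users

# Usage with real AWS credentials (not run here):
#   import boto3
#   users = export_user_pool(boto3.client("cognito-idp"), "us-east-1_example")
#   with open("backup.json", "w") as f:
#       json.dump(users, f)
```

As noted below for the third-party tools, the resulting JSON file holds all user attributes unencrypted, so it needs the same protection as the pool itself.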
AWS now offers an SLA for Cognito. In the event they are unable to meet the availability target (99.9% at the time of writing), you will receive service credits.
Even though there are a couple of third-party solutions available, when restoring a user pool the users will be created using the admin flow (users are not restored; rather, they are re-created by an admin), and they will end up with "Force Change Password" status. The users will therefore be forced to change their password using the temporary password, and that has to be facilitated from the front end of the application.
More info : https://docs.amazonaws.cn/en_us/cognito/latest/developerguide/signing-up-users-in-your-app.html
Tools available:
https://www.npmjs.com/package/cognito-backup
https://github.com/mifi/cognito-backup
https://github.com/rahulpsd18/cognito-backup-restore
https://github.com/serverless-projects/cognito-tool
Please bear in mind that some of these tools are outdated and can no longer be used. I have tested "cognito-backup-restore" and it works as expected.
Also, you have to think about how to secure the user information output by these tools. They usually create a JSON file containing all the user information (except passwords, which cannot be backed up), and this file is not encrypted.
The best solution so far is to prevent accidental deletion of user pools with AWS SCPs.
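A sketch of that SCP: an organization-level deny on user pool deletion, expressed here as a Python dict for clarity (it would be attached to the account or OU as a Service Control Policy JSON document). The statement ID is illustrative.

```python
import json

# Deny DeleteUserPool everywhere in the account, regardless of the
# caller's IAM permissions -- SCP denies cannot be overridden.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUserPoolDeletion",
            "Effect": "Deny",
            "Action": ["cognito-idp:DeleteUserPool"],
            "Resource": "*",
        }
    ],
}
print(json.dumps(scp, indent=2))
```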

What is the minimum permission set for a service account to export a table from a specific dataset on BigQuery to a specific bucket on CloudStorage?

Currently I have a service account that can export anything from BigQuery to anywhere in Cloud Storage, but I feel it has more permissions than the principle of least privilege calls for.
What is the minimum permission set to run a BigQuery query as a job, save that query to a temporary table, and export that to Cloud Storage?
I suspect the answer will involve some permissions from the following groups.
bigquery.tables
bigquery.jobs
bigquery.datasets
resourcemanager.projects # Seems to be replacement for storage.buckets and storage.objects ?
You could run some tests on a service account key for this one.
As you can see in the BigQuery IAM documentation, adding the roles dataEditor and jobUser might be quite enough for what you want. You could also test custom roles, but this is still an alpha feature.
For Cloud Storage IAM, roles/storage.objectCreator may already be enough.
If you end up playing around with these roles and find which combination works best, I'd also recommend answering your own question with your findings, so people arriving here via Google can learn how you did it.
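Summarising the suggestions above as data: one job-running role at the project level, one data role scoped to the single dataset, and object creation on the single target bucket. Project, dataset, and bucket names are illustrative placeholders, and whether dataEditor can be narrowed further (e.g. to a viewer role plus `bigquery.tables.export`) is exactly what the question's testing would confirm.

```python
# Hypothetical least-privilege bindings for the export workflow:
# run jobs on the project, read/write only the one dataset,
# create objects only in the one export bucket.
bindings = {
    "projects/my-project": ["roles/bigquery.jobUser"],
    "projects/my-project/datasets/my_dataset": ["roles/bigquery.dataEditor"],
    "buckets/my-export-bucket": ["roles/storage.objectCreator"],
}

for resource, roles in bindings.items():
    print(resource, "->", ", ".join(roles))
```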