How do I specify a bucket name using an S3 connection in Airflow?

We have an s3 bucket that Airflow uses as the source of all of our DAG data pipelines. We have a bucket for dev, test and production. Let's say the bucket in dev is called dev-data-bucket, in test it's called test-data-bucket etc.
I don't want to hard-code the bucket name in our DAG code because this code gets migrated between environments. If I specify dev-data-bucket in our dev environment and this DAG code goes to our test environment, the bucket name would need to change to test-data-bucket, and then to prod-data-bucket for prod.
I understand that the usual way to do this would be to create an Airflow connection in each environment with the same name, like data-bucket. However, I don't see where to specify the bucket name in the Airflow connection screen the way I would for a database connection.
How do I create an Airflow s3 connection with the same name in each environment but which specifies a different bucket name for each environment?

The way to do this is to specify the bucket name in the Schema field of the Airflow connection screen. In dev, for example, you would create a connection named data-bucket with dev-data-bucket in the Schema field.
Then, when you use one of the provided S3 operators in Airflow, you don't need to specify a bucket name, because the S3 hook is set up to fall back to the bucket name stored in the connection if you haven't specified one.
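For example, here is a minimal sketch of reading from the bucket via the S3 hook, assuming a connection named data-bucket whose Schema field holds the bucket name (note that newer versions of the Amazon provider may expect the bucket name in the connection extras instead):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# The connection "data-bucket" carries the environment-specific bucket name.
hook = S3Hook(aws_conn_id="data-bucket")

# bucket_name is omitted, so the hook falls back to the bucket stored
# in the connection rather than anything hard-coded in the DAG.
keys = hook.list_keys(prefix="input/")

Because the DAG never names the bucket, the same code resolves to dev-data-bucket, test-data-bucket, or prod-data-bucket depending on which environment's connection it runs against.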

Related

Could not find an active AWS Glue VPC interface endpoint. Could not find an active NAT

I'm trying to create an AWS Glue DataBrew job that pulls data from an S3 folder into an AWS RDS SQL Server table and receive the following:
"AWS Glue VPC interface endpoint validation failed for SubnetId: subnet-xxx9574. VPC: vpc-xxxdd2. Reason: Could not find an active AWS Glue VPC interface endpoint. Could not find an active NAT."
My DataBrew output is a "Data catalog RDS Tables" target, for which I have set up a crawler and connection, and everything is green.
I followed the path below, which helped solve a different issue, but this error is slightly different.
https://aws.amazon.com/premiumsupport/knowledge-center/glue-s3-endpoint-validation-failed/
I tried to create another endpoint with type of 'Interface' instead of 'Gateway' but not really sure if that's the correct path.
Any guidance? I'm trying to avoid custom scripts, as this should be something AWS can handle in their interfaces... IMO.
Thanks!

What's the correct URI format when creating scheduled backups from CockroachDB to a Linode S3 bucket?

CockroachDB's documentation gives the example
CREATE SCHEDULE core_schedule_label
FOR BACKUP INTO 's3://test/schedule-test-core?AWS_ACCESS_KEY_ID=x&AWS_SECRET_ACCESS_KEY=x'
How can I modify this to use an S3-compatible service like Linode rather than AWS?
The format is very similar; you just need to override the endpoint with your actual Linode endpoint. A Linode S3 URI can look like:
CREATE SCHEDULE my_own_backup_schedule
FOR BACKUP INTO 's3://test/schedule-test-core?AWS_ACCESS_KEY_ID=accesskeyid&AWS_SECRET_ACCESS_KEY=secret&AWS_REGION=us-east-1&AWS_ENDPOINT=https://us-east-1.linodeobjects.com'
Note that AWS_ENDPOINT is just the host, not the full endpoint with the bucket name. On older versions of CockroachDB, providing the bucket name in AWS_ENDPOINT (like AWS_ENDPOINT=https://us-east-1.linodeobjects.com/test/schedule-test-core) worked, but on 22.1+ backups created that way may fail with the error "failed to list s3 bucket". You can fix this by creating a new backup schedule formatted as above and adding WITH SCHEDULE OPTIONS ignore_existing_backups. Otherwise, the validations in current code would try to use the older URI and raise an error like "unexpected error occurred when checking for existing backups in s3".
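As a concrete illustration, here is a minimal sketch that creates such a schedule from Python with psycopg2; the connection string, the @daily recurrence, and the placeholder credentials are assumptions for the example (a full CREATE SCHEDULE statement requires a RECURRING clause, which the docs snippet above elides):

import psycopg2

# Placeholder connection string for a local CockroachDB node.
conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb?sslmode=disable")
conn.autocommit = True

# AWS_ENDPOINT is just the Linode host; the bucket lives in the s3:// path.
backup_uri = (
    "s3://test/schedule-test-core"
    "?AWS_ACCESS_KEY_ID=accesskeyid&AWS_SECRET_ACCESS_KEY=secret"
    "&AWS_REGION=us-east-1"
    "&AWS_ENDPOINT=https://us-east-1.linodeobjects.com"
)

with conn.cursor() as cur:
    cur.execute(
        "CREATE SCHEDULE my_own_backup_schedule "
        f"FOR BACKUP INTO '{backup_uri}' "
        "RECURRING '@daily' "
        "WITH SCHEDULE OPTIONS ignore_existing_backups"
    )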

Move S3 files to Snowflake stage using Airflow PUT command

I am trying to find a solution to move files from an S3 bucket to a Snowflake internal stage (not directly to a table) with Airflow, but it seems the PUT command is not supported by the current Snowflake operator.
I know there are other options like Snowpipe but I want to showcase Airflow's capabilities.
COPY INTO is also an alternative solution but I want to load DDL statements from files, not run them manually in Snowflake.
This is the closest I could find, but it uses COPY INTO a table:
https://artemiorimando.com/2019/05/01/data-engineering-using-python-airflow/
Also: How to call snowsql client from python
Is there any way to move files from S3 bucket to Snowflake internal stage through Airflow+Python+Snowsql?
Thanks!
I recommend you execute the COPY INTO command from within Airflow to load the files directly from S3 instead. There isn't a great way to get files into an internal stage from S3 without hopping them through another machine (such as the Airflow worker): you would first download the files from S3 to local storage, then PUT them from local to the internal stage. The only way to execute a PUT to an internal stage is through a client such as SnowSQL.
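If you do need the two-hop approach, here is a minimal sketch of what it could look like from a Python task (for example inside a PythonOperator); the bucket, key, stage, and SnowSQL connection names are placeholders:

import subprocess
import boto3

# Hop 1: download the file from S3 to the local filesystem.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "data/my_file.csv", "/tmp/my_file.csv")

# Hop 2: PUT the local file to a named internal stage via SnowSQL.
# "my_connection" refers to a profile defined in ~/.snowsql/config.
subprocess.run(
    [
        "snowsql",
        "-c", "my_connection",
        "-q", "PUT file:///tmp/my_file.csv @my_internal_stage AUTO_COMPRESS=TRUE",
    ],
    check=True,
)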

Terraform Shared State

Terraform 0.9.5.
I am in the process of putting together a group of modules that our infrastructure team and automation team will use to create resources in a standard fashion and in turn create stacks to provision different envs. All working well.
Like all teams using Terraform, shared state becomes a concern. I have configured Terraform to use an S3 backend that is versioned and encrypted, and added a lock via a DynamoDB table. Perfect. All works with local accounts... Okay, the problem...
We have multiple AWS accounts: one for IAM, one for billing, one for production, one for non-production, one for shared services, etc. ... you get where I am going. My problem is as follows.
I authenticate as a user in our IAM account and assume the required role. This has been working like a dream until I introduced the Terraform backend configuration to utilise S3 for shared state. It looks like the backend config within Terraform requires default credentials to be set within ~/.aws/credentials. It also looks like these have to belong to a user local to the account where the S3 bucket was created.
Is there a way to set up the backend configuration so that it will use the creds and role configured within the provider? Is there a better way to configure shared state and locking? Any suggestions welcome :)
Update: Got this working. I created a new user within the account where the S3 bucket lives, and a policy allowing that user only s3:DeleteObject, s3:GetObject, s3:PutObject, s3:ListBucket and dynamodb:* on the specific S3 bucket and DynamoDB table. I then created a custom credentials file with a default profile holding that user's access and secret keys, and used a backend config similar to:
terraform {
  required_version = ">= 0.9.5"

  backend "s3" {
    bucket                  = "remote_state"
    key                     = "/NAME_OF_STACK/terraform.tfstate"
    region                  = "us-east-1"
    encrypt                 = "true"
    shared_credentials_file = "PATH_TO_CUSTOM_CREDENTIALS_FILE"
    lock_table              = "MY_LOCK_TABLE"
  }
}
It works, but there is some initial configuration that needs to happen in your credentials profile to get it working. If anybody knows of a better setup, or can identify problems with my backend config, please let me know.
Terraform expects backend configuration to be static; it cannot include interpolated variables, as is allowed elsewhere in the config, because the backend must be initialized before any other work can be done.
Due to this, applying the same config multiple times using different AWS accounts can be tricky, but is possible in one of two ways.
The lowest-friction way is to create a single S3 bucket and DynamoDB table dedicated to state storage across all environments, and use S3 permissions and/or IAM policies to impose granular access controls.
Organizations adopting this strategy will sometimes create the S3 bucket in a separate "administrative" AWS account, and then grant restrictive access to the individual state objects in the bucket to the specific roles that will run Terraform in each of the other accounts.
This solution has the advantage that once it has been set up correctly in S3, Terraform can be used routinely without any unusual workflow: configure the single S3 bucket in the backend, and provide appropriate credentials via environment variables to allow them to vary. Once the backend is initialized, use workspaces (known as "state environments" prior to Terraform 0.10) to create a separate state for each of the target environments of a single configuration.
The disadvantage is the need to manage a more-complicated access configuration around S3, rather than simply relying on coarse access control with whole AWS accounts. It is also more challenging with DynamoDB in the mix, since the access controls on DynamoDB are not as flexible.
There is a more complete description of this option in the Terraform S3 backend documentation, under Multi-account AWS Architecture.
If a complex S3 configuration is undesirable, the complexity can instead be shifted into the Terraform workflow by using partial configuration. In this mode, only a subset of the backend settings are provided in config and additional settings are provided on the command line when running terraform init.
This allows options to vary between runs, but since it requires extra arguments to be provided most organizations adopting this approach will use a wrapper script to configure Terraform appropriately based on local conventions. This can be just a simple shell script that runs terraform init with suitable arguments.
This then allows you to vary, for example, the custom credentials file by providing it on the command line. In this case, state environments are not used, and instead switching between environments requires re-initializing the working directory against a new backend configuration.
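As a minimal sketch of such a wrapper, here is one possible shape in Python; the bucket, credentials file, and lock table naming conventions are assumptions for the example:

import subprocess
import sys

def init_backend(env: str) -> None:
    # Feed per-environment settings to the partial backend configuration.
    subprocess.run(
        [
            "terraform", "init",
            "-backend-config", f"bucket=remote-state-{env}",
            "-backend-config", f"shared_credentials_file=./credentials-{env}",
            "-backend-config", f"lock_table=terraform-lock-{env}",
        ],
        check=True,
    )

if __name__ == "__main__":
    init_backend(sys.argv[1])  # e.g. python init_backend.py dev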
The advantage of this solution is that it does not impose any particular restrictions on the use of S3 and DynamoDB, as long as the differences can be represented as CLI options.
The disadvantage is the need for unusual workflow or wrapper scripts to configure Terraform.

Access files in s3n://elasticmapreduce/samples/wordcount/input

How can I access the files sitting in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input
The files in s3n://elasticmapreduce/samples/wordcount/input are public, and made available as input by Amazon to the sample word count Hadoop program. The best way to fetch them is to
Start a new Amazon Elastic MapReduce Job Flow (it doesn't matter which one) from the Amazon Web Services console, and make sure that you keep the job alive with the Keep Alive option
Once the EC2 machines have started, find the instances on EC2 from the Amazon Web Services console
ssh into one of the running EC2 instances, using the hadoop user, for example
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
sftp the files to your local system
You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
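If you prefer to script the download, here is a minimal sketch using boto3 with anonymous (unsigned) access, assuming the bucket is still public:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# UNSIGNED skips credential signing, which is what public buckets allow.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file("elasticmapreduce", "samples/wordcount/input/0001", "0001")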
The owner of the folder (more precisely, of the files in the folder) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... is translated to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
where x is the name of the bucket and y... is the path within the bucket.
If you want to make sure a file exists, e.g. if you suspect the name was misspelled, you can open
http://s3.amazonaws.com/x
in your browser, and you'll see XML describing the "files", that is, the S3 objects, available.
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried this, and it seems that the path you want is not public.
The AWS EMR documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so the input might be available to EMR, but not to HTTP access.
In Amazon S3 there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001