How can I tell airflow to authenticate on an external server? - amazon-s3

I have the following task in airflow, which works like a charm:
t = SparkSubmitOperator(
    task_id='some_id',
    application='/path/to/app.py',
    name='airflow-spark',
    conf={
        'spark.hadoop.fs.s3a.endpoint': 'https://some.url.com/',
        'spark.hadoop.fs.s3a.access.key': 'myuser',
        'spark.hadoop.fs.s3a.secret.key': 'my_super_secret_password',
    },
    dag=dag,
)
As you can guess, my Spark job needs to authenticate against an S3 server to retrieve data. While this works, I don't want to put my password in cleartext in the DAG. How can I authenticate with the S3 server without the password appearing in cleartext? I tried setting up a connection in Airflow, which seems to be meant for exactly this use case, but when I set conn_id = 'my_connection' on the task, it tries to run the Spark job on that server instead.

If you are running Airflow on AWS infrastructure, you can use the IAM permissions granted to the container/VM instead of explicit keys.
If you can update the config regularly, you could issue session tokens on your desktop and update the spec. You'll need a Hadoop version that supports session credentials (2.8+).
You can also use JCEKS files to store the credentials: you'd then get that file onto all VMs doing the work and set the Hadoop/Spark config to load it. See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Storing_secrets_with_Hadoop_Credential_Providers. The file does need to be on a shared store (e.g. HDFS or a mounted EBS volume), as each Spark worker resolves the JCEKS path locally to load the secrets.
The simplest option is IAM permissions. If you are running on a shared cluster, JCEKS files are better.
Finally, Hadoop 3.3+ allows the S3A connector to dynamically generate session/role tokens from a user with full credentials and pass these along with the Spark job. If you play with that, you can keep the full credentials on your desktop, or in a JCEKS file only Airflow can read, and have session/role credentials generated from those. Useful, but trickier to set up.
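As a rough sketch of the options above, the Spark configuration could be assembled from values kept outside the DAG file. The helper name build_s3a_conf and the environment variable names (S3_ACCESS_KEY, S3_SECRET_KEY, S3_SESSION_TOKEN) are assumptions made for illustration; the fs.s3a.* and hadoop.security.credential.provider.path keys are the Hadoop options described in the linked documentation.

```python
import os
from typing import Mapping, Optional


def build_s3a_conf(
    endpoint: str,
    env: Mapping[str, str] = os.environ,
    jceks_path: Optional[str] = None,
) -> dict:
    """Return a Spark conf dict for S3A that contains no hard-coded secrets."""
    conf = {"spark.hadoop.fs.s3a.endpoint": endpoint}

    if jceks_path:
        # JCEKS option: Spark only receives the keystore path; the secrets
        # themselves stay inside the credential file.
        conf["spark.hadoop.hadoop.security.credential.provider.path"] = jceks_path
        return conf

    # Session-token option: short-lived credentials read from the environment.
    # These variable names are hypothetical; use whatever your secret store
    # or token-issuing script actually exports.
    conf["spark.hadoop.fs.s3a.access.key"] = env["S3_ACCESS_KEY"]
    conf["spark.hadoop.fs.s3a.secret.key"] = env["S3_SECRET_KEY"]
    token = env.get("S3_SESSION_TOKEN")
    if token:
        conf["spark.hadoop.fs.s3a.session.token"] = token
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
        )
    return conf
```

The result would be passed as conf=build_s3a_conf('https://some.url.com/') to the SparkSubmitOperator from the question. Values passed this way may still show up in the spark-submit command line, which is one reason the JCEKS route tends to keep them better hidden.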

Related

Pulumi automation backend

I am new to Pulumi and have run into an issue. When I run pulumi login against a GCP backend, I get this error:
stderr: error: getting secrets manager: passphrase must be set with
PULUMI_CONFIG_PASSPHRASE or PULUMI_CONFIG_PASSPHRASE_FILE environment
variables
When I run pulumi logout, the deployment works (I am using the Pulumi Automation API). Does anyone have an idea how to fix this?
I tried setting pulumi_config_passphrase.
When using a self-managed backend for Pulumi, you need to provide a passphrase to encrypt secret values.
This can be done by setting an environment variable; how you do that depends on your operating system. In Unix-like environments (e.g. macOS or Linux) you can run:
export PULUMI_CONFIG_PASSPHRASE='<a password you can remember>'
In PowerShell on Windows this can be done using:
$env:PULUMI_CONFIG_PASSPHRASE = '<a password you can remember>'
If you don't wish to use a passphrase, you can use the Pulumi service as your state store, or configure a cloud secrets provider.
This is done when initializing your stack; more information can be found in the Pulumi documentation on secrets providers.
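Since the question drives Pulumi from a program rather than an interactive shell, the variable can also be set only for the Pulumi process. The following is a minimal sketch using the standard library; the helper name pulumi_env is hypothetical, and only the two variable names come from the error message above.

```python
import os
from typing import Mapping, Optional


def pulumi_env(
    passphrase: Optional[str] = None,
    passphrase_file: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return an environment for a Pulumi process with exactly one passphrase source."""
    if (passphrase is None) == (passphrase_file is None):
        raise ValueError("set exactly one of passphrase or passphrase_file")
    env = dict(os.environ if base is None else base)
    # Drop stale values so the two variables never disagree.
    env.pop("PULUMI_CONFIG_PASSPHRASE", None)
    env.pop("PULUMI_CONFIG_PASSPHRASE_FILE", None)
    if passphrase is not None:
        env["PULUMI_CONFIG_PASSPHRASE"] = passphrase
    else:
        env["PULUMI_CONFIG_PASSPHRASE_FILE"] = passphrase_file
    return env
```

The returned mapping could be passed as env= to subprocess.run when invoking the pulumi CLI, or (as far as I know) as the env_vars option of the Automation API's LocalWorkspaceOptions. Note that the variable name is upper-case; on Unix-like systems a lower-case pulumi_config_passphrase is a different variable and would not be picked up.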

What is the correct Cloud SQL connection string syntax for dotnetcore app with Cloud Run?

I want to set up a .NET Core web application on Cloud Run with a Google Cloud SQL database. I easily deployed the database, which has a public IP, on Cloud SQL, and my web application as a Docker container on Cloud Run. I can access the database with SQL Server Management Studio without any difficulty, and the web app is up and running as expected. The only missing piece is the link that allows them to connect.
In my web app, I have a connection string in this format:
Data Source=***;Initial Catalog=***;User ID=***;Password=***;Pooling=true;Trusted_Connection=false;Connection Timeout=60;Integrated Security=false;Persist Security Info={0};Encrypt=true;TrustServerCertificate=true;MultipleActiveResultSets=true;
Now that I have the public IP and the connection name from Cloud SQL, what precisely should the connection string be, and/or what are the next steps?
Furthermore, in the Connections tab of the Cloud Run service, I added the Cloud SQL connection. This is supposed to configure a Cloud SQL Proxy for me.
To connect to Cloud SQL from Cloud Run, follow the official guide on connecting from Cloud Run.
You have already made some configurations in the Connections tab, as described in the Configuring Cloud Run section. Since you configured your instance with a public IP, check the Public IP part of the guide to be sure that all steps were followed.
Briefly, the steps are:
Configure the service account for your service. Make sure that the service account has the appropriate Cloud SQL roles and permissions to connect to Cloud SQL.
The service account for your service needs one of the following IAM roles: Cloud SQL Client (preferred) or Cloud SQL Admin.
If the authorizing service account belongs to a different project than the Cloud SQL instance, the Cloud SQL Admin API and IAM permissions will need to be added for both projects.
Like any configuration change, setting a new configuration for the Cloud SQL connection leads to the creation of a new Cloud Run revision. Subsequent revisions will also automatically get this Cloud SQL connection, unless you make explicit updates to change it.
1. Go to Cloud Run.
2. Configure the service. If you are adding Cloud SQL connections to an existing service:
   - Click on the service name.
   - Click on the Connections tab.
   - Click Deploy.
3. Enable connecting to a Cloud SQL instance:
   - Click Advanced Settings.
   - Click on the Connections tab.
   - If you are adding a connection to a Cloud SQL instance in your project, select the desired Cloud SQL instance from the dropdown menu.
   - If you are deleting a connection, hover your cursor to the right of the connection to display the Trash icon, and click it.
4. Click Create or Deploy.
After you've double checked the steps above, you could continue with the section Connecting to Cloud SQL. You can follow the steps on the Public IP tab.
Connect with Unix sockets
Once correctly configured, you can connect your service to your Cloud SQL instance's Unix domain socket accessed on the environment's filesystem at the following path: /cloudsql/INSTANCE_CONNECTION_NAME.
The INSTANCE_CONNECTION_NAME can be found on the Overview page for your instance in the Google Cloud Console or by running the following command:
gcloud sql instances describe [INSTANCE_NAME]
These connections are automatically encrypted without any additional configuration.
The code samples shown below are extracts from more complete examples on the GitHub site. To see this snippet in the context of a web application, view the README on GitHub.
// Equivalent connection string:
// "Server=<dbSocketDir>/<INSTANCE_CONNECTION_NAME>;Uid=<DB_USER>;Pwd=<DB_PASS>;Database=<DB_NAME>;Protocol=unix"
String dbSocketDir = Environment.GetEnvironmentVariable("DB_SOCKET_PATH") ?? "/cloudsql";
String instanceConnectionName = Environment.GetEnvironmentVariable("INSTANCE_CONNECTION_NAME");
var connectionString = new MySqlConnectionStringBuilder()
{
    // The Cloud SQL proxy provides encryption between the proxy and instance.
    SslMode = MySqlSslMode.None,
    // Remember - storing secrets in plain text is potentially unsafe. Consider using
    // something like https://cloud.google.com/secret-manager/docs/overview to help keep
    // secrets secret.
    Server = String.Format("{0}/{1}", dbSocketDir, instanceConnectionName),
    UserID = Environment.GetEnvironmentVariable("DB_USER"),   // e.g. 'my-db-user'
    Password = Environment.GetEnvironmentVariable("DB_PASS"), // e.g. 'my-db-password'
    Database = Environment.GetEnvironmentVariable("DB_NAME"), // e.g. 'my-database'
    ConnectionProtocol = MySqlConnectionProtocol.UnixSocket
};
connectionString.Pooling = true;
// Specify additional properties here.
return connectionString;
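To make the socket path and the "equivalent connection string" from the comment in the sample concrete, here is a small language-neutral sketch in Python. It only assembles the string; the function name is hypothetical, and the environment variable names are the ones used in the sample.

```python
import os
from typing import Mapping


def unix_socket_connection_string(env: Mapping[str, str] = os.environ) -> str:
    """Build the Unix-socket connection string described in the sample's comment."""
    socket_dir = env.get("DB_SOCKET_PATH", "/cloudsql")
    instance = env["INSTANCE_CONNECTION_NAME"]
    # Connection names usually look like PROJECT:REGION:INSTANCE, so a value
    # with fewer than two colons is most likely a plain instance name.
    if instance.count(":") < 2:
        raise ValueError(f"unexpected instance connection name: {instance!r}")
    parts = {
        "Server": f"{socket_dir}/{instance}",
        "Uid": env["DB_USER"],
        "Pwd": env["DB_PASS"],
        "Database": env["DB_NAME"],
        "Protocol": "unix",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())
```

Checking the shape of the connection name early is a design choice of this sketch: passing the short instance name instead of the full connection name would otherwise only surface as a connection failure at runtime.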
Google recommends that you use Secret Manager to store sensitive information such as SQL credentials. You can pass secrets as environment variables or mount as a volume with Cloud Run.
After creating a secret in Secret Manager, update an existing service with the following command:
gcloud run services update SERVICE_NAME \
  --add-cloudsql-instances=INSTANCE_CONNECTION_NAME \
  --update-env-vars=INSTANCE_CONNECTION_NAME=INSTANCE_CONNECTION_NAME_SECRET \
  --update-secrets=DB_USER=DB_USER_SECRET:latest \
  --update-secrets=DB_PASS=DB_PASS_SECRET:latest \
  --update-secrets=DB_NAME=DB_NAME_SECRET:latest
See also:
GoogleCloudPlatform/dotnet-docs-samples on GitHub

ADLS to Azure Storage Sync Using AzCopy

I'm looking for help resolving some errors. Here is the scenario: I'm trying to sync an ADLS Gen2 container to Azure Blob Storage. I have AzCopy 10.4.3 and am using azcopy sync with the command below:
azcopy sync 'https://ADLSGen2.blob.core.windows.net/testsamplefiles/SAMPLE' 'https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE' --recursive
When I run this command I get the error below:
REQUEST/RESPONSE (Try=1/71.0063ms, OpTime=110.9373ms) -- RESPONSE SUCCESSFULLY RECEIVED
PUT https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet?blockid=ZDQ0ODlkYzItN2N2QzOWJm&comp=block&timeout=901
X-Ms-Request-Id: [378ca837-d01e-0031-4f48-34cfc2000000]
ERR: [P#0-T#0] COPYFAILED: https://ADLSGen2.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet: 404 : 404 The specified resource does not exist.. When Staging block from URL. X-Ms-Request-Id: [378ca837-d01e-0031-4f48-34cfc2000000]
Dst: https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet
REQUEST/RESPONSE (Try=1/22.9854ms, OpTime=22.9854ms) -- RESPONSE SUCCESSFULLY RECEIVED
GET https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet?blocklisttype=all&comp=blocklist&timeout=31
X-Ms-Request-Id: [378ca84e-d01e-0031-6148-34cfc2000000]
So far I have checked and confirmed the following:
I logged into the correct tenant when logging into AzCopy.
The Storage Blob Data Contributor role was granted to my AD credentials.
I'm not sure what else I'm missing, as the file exists in the source and I keep getting the same error. I tried with SAS, but got a different error. I cannot proceed with SAS due to vendor policy, so I need to get this working with OAuth. Any input is really appreciated.
For the 404 error, check whether there is a typo in the command and whether the path /testsamplefiles/SAMPLE exists in both the source and destination accounts. Also note the following from the tips:
Use single quotes in all command shells except for the Windows Command
Shell (cmd.exe). If you're using a Windows Command Shell (cmd.exe),
enclose path arguments with double quotes ("") instead of single
quotes ('').
From the azcopy sync supported scenarios:
Azure Blob <-> Azure Blob (Source must include a SAS or is publicly
accessible; either SAS or OAuth authentication can be used for
destination)
So the source must include a SAS token. I tried the command below, with a SAS on the source and AD authentication for the destination:
azcopy sync "https://[account].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[account].blob.core.windows.net/[container]/[path/to/blob]"
but got the same 400 error as in the GitHub issue.
Thus, after validating on my side, you can use the following command to sync an ADLS Gen2 container to Azure Blob Storage without running azcopy login. If you are already logged in, run azcopy logout first.
azcopy sync "https://nancydl.blob.core.windows.net/container1/sample?sv=xxx" "https://nancytestdiag244.blob.core.windows.net/container1/sample?sv=xxx" --recursive --s2s-preserve-access-tier=false
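When the command is launched from a script, the quoting pitfalls from the tips can be sidestepped by passing an argument list instead of a shell string. This is a minimal sketch under that assumption; the helper names with_sas and azcopy_sync_args are hypothetical, and the flags are the ones from the command above.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def with_sas(url: str, sas_token: str) -> str:
    """Attach a SAS token as the query string of a blob URL."""
    parts = urlsplit(url)
    if parts.query:
        raise ValueError("URL already has a query string")
    # The token is used verbatim (minus a leading '?'), so any percent-encoded
    # signature characters are left untouched.
    return urlunsplit(parts._replace(query=sas_token.lstrip("?")))


def azcopy_sync_args(source: str, dest: str, source_sas: str, dest_sas: str) -> List[str]:
    """Build the argument list for a blob-to-blob azcopy sync with SAS on both sides."""
    return [
        "azcopy", "sync",
        with_sas(source, source_sas),
        with_sas(dest, dest_sas),
        "--recursive",
        "--s2s-preserve-access-tier=false",
    ]
```

The list could then be run with subprocess.run(args, check=True); because no shell is involved, the '&' characters inside the SAS tokens need no quoting at all.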

Can't create bucket without authentication

We updated our Couchbase from 4.6 Community edition to 5.0.0-2873 Enterprise Edition for testing purposes and our software using the java-client started throwing InvalidPasswordException when trying to open a bucket.
As I've found, every newly created bucket has authType='sasl' and a randomly generated saslPassword.
I've tried creating a bucket using the CLI instead of the GUI:
couchbase-cli bucket-create -c localhost:8091 -u Administrator -p password --bucket=general --bucket-ramsize=1300 --bucket-type=couchbase --bucket-password=
I got the following error:
ERROR: unrecognized arguments: --bucket-password=password
I also tried the bucket-edit function with the same result.
According to the documentation the argument should be valid.
I also tried using the REST API to change the bucket authentication (and similarly the password), but even though this didn't throw any errors, the authType and the password remained the same.
curl -X POST -u Administrator:password -d 'authType=none' http://<host>:8091/pools/default/buckets/general
Again, according to the documentation this should work.
If I query the bucket information for the sasl password and provide that for the openBucket function then the connection works, however we really don't want to use this feature in our system.
So, any other ideas how it would be possible to remove the bucket authentication in our 5.0EE Couchbase setup?
In Couchbase 5.0 we no longer support bucket passwords and have moved to role-based access control (RBAC) when connecting to buckets. This means that in 5.0 the standard (pre-production) way to connect to a bucket is with the Administrator user and password that you created when setting up the cluster. In case you're unsure what the Administrator user is, it is the user you create when you first go through the Couchbase setup wizard, or the username and password you specify on the command line when running the couchbase-cli cluster-init command.
Note that using the Administrator user/password is the standard pre-production workflow. When you go into production, I would recommend creating separate users for your application that only have access to the cluster resources they need. You can do this by going to the Users tab in the Administration Console, creating a new user, and giving it the Full Bucket Access role, which is the standard role that applications should have.
You might now be saying to yourself that this all sounds great, but you still have issues when using the Administrator user/password. If so, the reason is that you have Couchbase 5.0 but your SDK is not new enough to handle the new RBAC authentication mechanism. The workaround is to create a user in the Users tab with the same name as the bucket and give that user the Full Bucket Access role. You can then use this user to authenticate.
One last thing to mention is that during an upgrade from a pre-5.0 cluster to a 5.0 cluster, Couchbase automatically creates a user for each bucket. Each user has the same name as one of the buckets, and the password for that user corresponds to the bucket password. This is done mainly to ensure that there is no application downtime during an upgrade. After upgrading the cluster, the next step should ideally be to upgrade the Couchbase client library so that it starts using RBAC authentication.
If you need to stay with the old approach and no password, you can use couchbase-cli with --rbac-username and --rbac-password, specifying the password as "", e.g.
./couchbase-cli user-manage -c localhost:8091 -u Admin -p password --set --rbac-username <UserForBucket> --roles bucket_full_access[<BucketName>] --rbac-password "" --auth-domain local
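If the per-bucket users have to be created for several buckets, the same command can be generated from a script. The sketch below only builds the argument list; the function name is hypothetical, and the flags are copied from the command above.

```python
from typing import List


def user_manage_args(
    cluster: str,
    admin_user: str,
    admin_password: str,
    bucket: str,
    rbac_password: str = "",
) -> List[str]:
    """Build couchbase-cli arguments that create a local user named after a bucket."""
    return [
        "couchbase-cli", "user-manage",
        "-c", cluster,
        "-u", admin_user,
        "-p", admin_password,
        "--set",
        "--rbac-username", bucket,          # same name as the bucket, for older SDKs
        "--rbac-password", rbac_password,   # empty string mirrors the old no-password setup
        "--roles", f"bucket_full_access[{bucket}]",
        "--auth-domain", "local",
    ]
```

Passing the list to subprocess.run avoids a shell, so the square brackets in the role are not subject to glob expansion and the empty password survives as a real (empty) argument.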

Google cloud dataproc failing to create new cluster with initialization scripts

I am using the below command to create data proc cluster:
gcloud dataproc clusters create informetis-dev \
  --initialization-actions "gs://dataproc-initialization-actions/jupyter/jupyter.sh,gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh,gs://dataproc-initialization-actions/hue/hue.sh,gs://dataproc-initialization-actions/ipython-notebook/ipython.sh,gs://dataproc-initialization-actions/tez/tez.sh,gs://dataproc-initialization-actions/oozie/oozie.sh,gs://dataproc-initialization-actions/zeppelin/zeppelin.sh,gs://dataproc-initialization-actions/user-environment/user-environment.sh,gs://dataproc-initialization-actions/list-consistency-cache/shared-list-consistency-cache.sh,gs://dataproc-initialization-actions/kafka/kafka.sh,gs://dataproc-initialization-actions/ganglia/ganglia.sh,gs://dataproc-initialization-actions/flink/flink.sh" \
  --image-version 1.1 --master-boot-disk-size 100GB --master-machine-type n1-standard-1 --metadata "hive-metastore-instance=g-test-1022:asia-east1:db_instance" \
  --num-preemptible-workers 2 --num-workers 2 --preemptible-worker-boot-disk-size 1TB --properties hive:hive.metastore.warehouse.dir=gs://informetis-dev/hive-warehouse \
  --worker-machine-type n1-standard-2 --zone asia-east1-b --bucket info-dev
But Dataproc failed to create the cluster, with the following errors in the failure file:
cat
+ mysql -u hive -phive-password -e ''
ERROR 2003 (HY000): Can't connect to MySQL server on 'localhost' (111)
+ mysql -e 'CREATE USER '\''hive'\'' IDENTIFIED BY '\''hive-password'\'';'
ERROR 2003 (HY000): Can't connect to MySQL server on 'localhost' (111)
Does anyone have any idea what is behind this failure?
It looks like you're missing the --scopes sql-admin flag as described in the initialization action's documentation, which will prevent the CloudSQL proxy from being able to authorize its tunnel into your CloudSQL instance.
Additionally, aside from just the scopes, you need to make sure the default Compute Engine service account has the right project-level permissions in whichever project holds your CloudSQL instance. Normally the default service account is a project editor in the GCE project, so that should be sufficient when combined with the sql-admin scopes to access a CloudSQL instance in the same project, but if you're accessing a CloudSQL instance in a separate project, you'll also have to add that service account as a project editor in the project which owns the CloudSQL instance.
You can find the email address of your default compute service account on the IAM page of the project deploying Dataproc clusters, under the name "Compute Engine default service account"; it should look something like <number>@project.gserviceaccount.com.
I am assuming that you already created the Cloud SQL instance with something like this, correct?
gcloud sql instances create g-test-1022 \
--tier db-n1-standard-1 \
--activation-policy=ALWAYS
If so, then it looks like the error is in how the argument for the metadata is formatted. You have this:
--metadata "hive-metastore-instance=g-test-1022:asia-east1:db_instance"
Unfortunately, the zone looks to be incomplete (asia-east1 instead of asia-east1-b).
Additionally, when running that many initialization actions, you'll want to provide a fairly generous initialization action timeout so the cluster does not assume something has failed while your actions take a while to install. You can do that by specifying:
--initialization-action-timeout 30m
That will allow the cluster to give the initialization actions 30 minutes to bootstrap.
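Pulling the suggestions from the answers together, the create command could be generated so that the scope and timeout are never forgotten. This is only a sketch: the function name and its defaults are assumptions, while the flags (--scopes, --initialization-action-timeout, --initialization-actions, --metadata) are the ones named in this thread.

```python
from typing import Dict, List, Sequence


def dataproc_create_args(
    cluster: str,
    init_actions: Sequence[str],
    metadata: Dict[str, str],
    timeout: str = "30m",
) -> List[str]:
    """Build a gcloud argument list for a cluster that uses the Cloud SQL proxy."""
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        # Lets the Cloud SQL proxy authorize its tunnel to the instance.
        "--scopes", "sql-admin",
        # Extra headroom for many initialization actions.
        "--initialization-action-timeout", timeout,
    ]
    if init_actions:
        args += ["--initialization-actions", ",".join(init_actions)]
    if metadata:
        pairs = ",".join(f"{key}={value}" for key, value in metadata.items())
        args += ["--metadata", pairs]
    return args
```

Because the arguments are passed as a list (for example to subprocess.run), no shell quoting is involved, which also rules out problems caused by typographic quotes pasted into a terminal.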
Around the time you reported this, an issue was detected with the Cloud SQL proxy initialization action, and it most probably affected you.
Nowadays, it shouldn't be an issue.