GCSToBigQueryOperator not working in composer-2.1.0-airflow-2.3.4 - google-bigquery

After a recent upgrade to composer-2.1.0-airflow-2.3.4 the GCSToBigQueryOperator is no longer able to find data in buckets to upload to BigQuery.
All other aspects of the DAGs still work.
The usage is as follows:
gcs_to_bq = GCSToBigQueryOperator(
    task_id=f"transfer_{data_type}_to_bq_task",
    bucket=os.environ["GCS_BUCKET"],
    source_objects=file_names,
    destination_project_dataset_table=os.environ["GCP_PROJECT"] + f".creditsafe.{data_type}",
    schema_object=f"dags/schema/creditsafe/{data_type}.json",
    source_format="CSV",
    field_delimiter='|',
    quote_character="",
    max_bad_records=0,
    create_disposition="CREATE_IF_NEEDED",
    ignore_unknown_values=True,
    allow_quoted_newlines=True,
    allow_jagged_rows=True,
    write_disposition="WRITE_TRUNCATE",
    gcp_conn_id='google_cloud_default',
    skip_leading_rows=1,
    dag=dag,
)
The error from the API is
google.api_core.exceptions.NotFound: 404 GET
{ "error": { "code": 400, "message": "Unknown output format: media:", "errors": [ { "message": "Unknown output format: media:", "domain": "global", "reason": "invalidAltValue", "locationType": "parameter", "location": "alt" } ] } }
The error delivered by Cloud Composer is
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/download/storage/v1/b/[BUCKET_HIDDEN]/o/data%2Fcreditsafe%2FCD01%2Ftxt%2F%2A.txt?alt=media: No such object: [BUCKET_HIDDEN]/data/creditsafe/CD01/txt/*.txt: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
I can't see the cause of the error. The reference to the GCS location has not changed and appears correct, and the gcp_conn_id is sufficient for all other tasks. From the URL in the 404, the wildcard appears to be treated as a literal object name (.../o/data%2Fcreditsafe%2FCD01%2Ftxt%2F%2A.txt) rather than being expanded. I'm at a loss.

A fix for the issue above has now been merged:
https://github.com/apache/airflow/pull/28444
It is unclear how long it will take for this to be integrated into the Cloud Composer libraries.

GCSToBigQueryOperator does not support the wildcard *.csv. For your requirement, you can try the steps below.
You can attach to a pod in the Composer environment by running these commands:
gcloud container clusters get-credentials --region __GCP_REGION__ __GKE_CLUSTER_NAME__
kubectl get pods -n [Namespace]
kubectl exec -it [Worker] -n [Namespace] -- bash
You can run the command below to identify the installed Google provider package:
pip list | grep -i goo | grep provider
If the output of the above command shows a version other than 8.3.0, pin the package to apache-airflow-providers-google==8.3.0, as sketched below.
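A minimal sketch of that downgrade from inside the worker pod (hedged: package changes made by hand inside a pod do not persist across pod restarts, so prefer Composer's PyPI packages mechanism described in the next answer):
pip install apache-airflow-providers-google==8.3.0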

Release 8.5.0 of apache-airflow-providers-google, which comes with Airflow (>2.3.4 and <2.5.1), introduced several critical regressions, notably:
the GCSToBigQuery operator is broken, as it ignores its options: https://github.com/apache/airflow/pull/27961
This means that all custom settings specified in the operator (delimiter, formatting, null values, wildcards on files) are no longer sent to BigQuery, leading to unexpected results.
Until Google releases a Composer version based on Airflow 2.5.1, the workaround is to upgrade the apache-airflow-providers-google library (or to use a Composer version based on Airflow <=2.2.5).
No need to connect via gcloud/kubectl to change the apache-airflow-providers-google version; you can change it directly in the Composer UI, via the PyPI Packages page (or via the Terraform provider).
I can confirm that on the latest (as of today) Composer, composer-1.20.4-airflow-2.4.3, pinning apache-airflow-providers-google==8.8.0 (latest) solves those issues for me.
But as mentioned previously this is only a workaround and your mileage may vary...
See the Cloud Composer documentation on configuring custom PyPI packages.
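If you prefer the command line, a hedged sketch using gcloud (the environment name and region are placeholders for your setup):
gcloud composer environments update __ENVIRONMENT_NAME__ \
    --location __GCP_REGION__ \
    --update-pypi-package apache-airflow-providers-google==8.8.0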

Related

Can't get the subnet id of azure using python

I want to get the resource ID of a subnet in a virtual network in Azure using Python. The command I have used is this line: subnets=network_client.subnets.get(resource_group,'XXX','XXX')
But what I get is an error: HttpResponseError: (InvalidApiVersionParameter) The api-version '2021-02-01' is invalid. The supported versions are '2021-04-01,2021-01-01,2020-10-01,2020-09-01,2020-08-01,2020-07-01,2020-06-01,2020-05-01,2020-01-01,2019-11-01,2019-10-01,2019-09-01,2019-08-01,2019-07-01,2019-06-01,2019-05-10,2019-05-01,2019-03-01,2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'.
I have tried different API versions but I keep getting errors. Any ideas, please?
The version of azure-mgmt-network I used is 19.0.0
Please make sure that you have the two packages below installed before executing the script:
pip install azure-mgmt-network
pip install azure-identity
Then use the script below to get the subnet ID of a specific subnet present in your subscription:
from azure.identity import AzureCliCredential
from azure.mgmt.network import NetworkManagementClient

# Reuses the credentials of the logged-in Azure CLI session
credential = AzureCliCredential()
subscription_id = "948d4068-xxxx-xxxx-xxxx-e00a844e059b"
network_client = NetworkManagementClient(credential, subscription_id)

resource_group_name = "ansumantest"
location = "West US 2"  # not needed for the lookup itself
virtual_network_name = "ansuman-vnet"
subnet_name = "acisubnet"

# Fetch the subnet and print its full resource ID
subnet = network_client.subnets.get(resource_group_name, virtual_network_name, subnet_name)
print(subnet.id)
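Since AzureCliCredential picks up the Azure CLI's logged-in account, sign in first if you have not already:
az login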
Output: the full resource ID of the subnet is printed.
Note: I am using pip 21.2.4 and Python 3.9, and the same azure-mgmt-network version as you (19.0.0). If you are still facing the issue, try installing the newer release, i.e. 19.1.0.

Debugging terragrunt dependency block resulting in s3 permission error

I'm trying to use a dependency block for the first time, but I get AWS S3 list-object permission-denied errors and am having trouble debugging the issue.
The setup is as follows, using an S3 backend for storing Terraform state:
A Git repo containing the Terraform modules:
archive
s3_inventory
Instantiations of the above:
prod/eu/archive/terragrunt.hcl:
terraform {
  source = "git::ssh://git@my_server//archive?ref=v1.0.0"
}
include {
  path = find_in_parent_folders()
}
dependency "s3-inventory" {
  config_path = "../s3-inventory/"
}
prod/eu/s3_inventory/terragrunt.hcl:
terraform {
  source = "git::ssh://git@my_server//s3_inventory?ref=v1.0.0"
}
include {
  path = find_in_parent_folders()
}
Running terragrunt apply in prod/eu/archive works just fine when I remove the dependency block from the hcl file. It fails when I add the dependency block in.
Running terragrunt output -json in prod/eu/s3-inventory also works just fine.
With debugging flags on I still don't seem to get enough info as to why it's failing.
terragrunt apply --terragrunt-log-level debug --terragrunt-debug in prod/eu/archive results in something like this:
...<omitted>...
DEBU[0000] Detected module /Users/tim.kersten/prod/eu/s3-inventory/terragrunt.hcl is already init-ed. Retrieving outputs directly from working directory. prefix=[/Users/tim.kersten/prod/eu/s3-inventory]
DEBU[0000] Running command: terraform output -json prefix=[/Users/tim.kersten/prod/eu/s3-inventory]
Failed to load state: AccessDenied: Access Denied
status code: 403, request id: ABC123DEF456GHI, host id: WW91J3JlIHRlcnJpYmx5IG5vc2UgZm9yIHRyeWluZyB0byBsb29rIGF0IG15IGhvc3QK
ERRO[0003] exit status 1
Something is clearly different, but the debugging options I set on terragrunt don't seem to give me enough info to understand what's different.
Anyone understand what's going on here?
Edit:
terragrunt version: 0.28.6
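An aside on narrowing this down (a hedged suggestion; the output name used here is a placeholder): dependency blocks accept mock outputs, which let terragrunt skip running terraform output -json in the dependency while you debug the access issue:
dependency "s3-inventory" {
  config_path = "../s3-inventory/"

  # Placeholder values returned instead of the dependency's real outputs
  mock_outputs = {
    bucket_arn = "arn:aws:s3:::mock-bucket"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
If plan succeeds with mocks, the failure is specific to reading the dependency's remote state, i.e. the credentials terragrunt uses when it runs terraform output -json in ../s3-inventory.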

Vue CLI 3 service worker fails to register out of box

I built my app using Vue CLI 3 with PWA. When I build for production, the service worker fails to register.
I then decided to check whether it was something I did or Vue CLI 3 out of the box. I created a brand-new app, built it, and deployed it to AWS S3 with CloudFront. Even the brand-new app, without any changes, fails to register the service worker, with the errors: "The script has an unsupported MIME type ('text/plain')." and "Error during service worker registration: DOMException"
I've tried quite a few things beyond those listed below that Google search results suggested, but I end up with the same error.
I tried using vue.config.js to load a custom service worker into which I just copied the contents of the one that Vue produces in a build.
// vue.config.js
module.exports = {
  pwa: {
    workboxPluginMode: 'InjectManifest',
    workboxOptions: {
      swSrc: 'public/service-worker.js'
    },
    themeColor: '#ffffff'
  }
}
I have tried loading it from index.html as well.
If I host it locally, it registers without any issues.
The file does get created and it's accessible from the console, but for some reason unknown to me it does not want to register at all.
Has anyone had this problem before, and how did you resolve it?
Hosted on AWS S3 & CloudFront with HTTPS enabled, using the default AWS certificates for testing.
$ vue --version
3.9.3
$ node --version
v12.7.0
$ npm --version
6.10.0
UPDATE
I found that when I upload to S3 using aws cli sync, it changes the Content-Type of all .js files.
Once I resolve this I will update my question again.
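A hedged sketch of one way to fix the Content-Type after a build (assuming the build output is in dist/ and a placeholder bucket name; verify the flags against your AWS CLI version):
# Upload everything, then re-upload the .js files with an explicit MIME type
aws s3 sync dist/ s3://__MY_BUCKET__/ --delete
aws s3 cp dist/ s3://__MY_BUCKET__/ --recursive \
    --exclude "*" --include "*.js" \
    --content-type application/javascript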

Programmatically determine latest Gradle version?

Is there a way to find, via an API or CLI, the latest available version of Gradle?
I'm working on a tool to check versions of dependencies and compare them to the latest versions, and I need a way to determine the latest version of Gradle that is available.
To be clear, I am not looking for the version of Gradle I already have. I know how to get that any number of ways. I'm just looking for some officially maintained endpoint I can call to determine the latest version available.
Gradle has an API to retrieve all sorts of information:
https://services.gradle.org/
For the current version:
GET https://services.gradle.org/versions/current
{
"version" : "6.8.1",
"buildTime" : "20210122132008+0000",
"current" : true,
"snapshot" : false,
"nightly" : false,
"releaseNightly" : false,
"activeRc" : false,
"rcFor" : "",
"milestoneFor" : "",
"broken" : false,
"downloadUrl" : "https://services.gradle.org/distributions/gradle-6.8.1-bin.zip",
"checksumUrl" : "https://services.gradle.org/distributions/gradle-6.8.1-bin.zip.sha256",
"wrapperChecksumUrl" : "https://services.gradle.org/distributions/gradle-6.8.1-wrapper.jar.sha256"
}
You can get the data using curl and then use jq to extract the version key.
Node.js has built-in JSON support, so this will be even easier; see the one-liner after the shell example below.
CURRENT_GRADLE_VERSION="$(curl -s https://services.gradle.org/versions/current | jq -r '.version')"
echo "${CURRENT_GRADLE_VERSION}" # prints 6.8.1

Kubernetes 1.2.2: api-server fails: can't find mounted certs for TLS on etcd

I've been struggling to get api-server 1.2.2 to run with etcd secured with TLS.
I am upgrading from 1.1.2 to 1.2.2.
In 1.1.2 I was using the --etcd-config flag and had a file that looked like:
{
  "cluster": {
    "machines": [
      "https://XXX.XXX.XXX.XXX:2379",
      "https://XXX.XXX.XXX.XXY:2379",
      "https://XXX.XXX.XXX.XXZ:2379"
    ]
  },
  "config": {
    "certFile": "/etc/ssl/etcd/etcd-peer.cert.pem",
    "keyFile": "/etc/ssl/etcd/private/etcd-peer.key.pem",
    "caCertFiles": [
      "/etc/ssl/etcd/ca-chain.cert.pem"
    ],
    "consistency": "STRONG_CONSISTENCY"
  }
}
Now this is no longer supported, so I switched to using the flags:
--etcd-cafile="/etc/ssl/etcd/ca-chain.cert.pem"
--etcd-certfile="/etc/ssl/etcd/etcd-peer.cert.pem"
--etcd-keyfile="/etc/ssl/etcd/private/etcd-peer.key.pem"
--etcd-servers="https://XXX.XXX.XXX.XXX:2379, https://XXX.XXX.XXX.XXY:2379,https://XXX.XXX.XXX.XXZ:2379"
Now I am getting this error:
F0421 00:54:40.133777 1 server.go:291] Invalid storage version or misconfigured etcd: open "/etc/ssl/etcd/etcd-peer<nodeIP>.cert.pem": no such file or directory
So, it seems like it cannot find the cert file.
The file paths and names are the same as before, and they are mounted with hostPath the exact same way as with v1.1.2, so I don't understand why api-server would not find them.
I have been trying to figure what is going on with the file paths by simply switching the command in the pod from
- /hyperkube
- api-server
...
to
- /bin/sleep
- 60
but kubelet won't start this pod, for some reason I don't understand.
Does it have to do with the YAML file name or something?
I don't understand why kubelet won't run this command.
Any help with this would be greatly appreciated.
Thanks
UPDATE
I was able to get into the running container after replacing the command with /hyperkube scheduler.
I can cat the files that apiserver is complaining about, so I don't understand why they're not found.
Well, the culprit was as simple as the quotes ("").
--etcd-cafile="/etc/ssl/etcd/ca-chain.cert.pem"
--etcd-certfile="/etc/ssl/etcd/etcd-peer.cert.pem"
--etcd-keyfile="/etc/ssl/etcd/private/etcd-peer.key.pem"
--etcd-servers="https://XXX.XXX.XXX.XXX:2379, https://XXX.XXX.XXX.XXY:2379,https://XXX.XXX.XXX.XXZ:2379"
is WRONG
but this works:
--etcd-cafile=/etc/ssl/etcd/ca-chain.cert.pem
--etcd-certfile=/etc/ssl/etcd/etcd-peer.cert.pem
--etcd-keyfile=/etc/ssl/etcd/private/etcd-peer.key.pem
--etcd-servers=https://XXX.XXX.XXX.XXX:2379,https://XXX.XXX.XXX.XXY:2379,https://XXX.XXX.XXX.XXZ:2379
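Why the quotes break it (a hedged explanation inferred from the error message, not spelled out in the original answer): when the flags are written as entries in a pod manifest's command/args list, there is no shell to strip the quotes, so they become literal characters in the flag values, and the apiserver tries to open a path that literally contains " characters, consistent with the quoted path in the error above. A minimal sketch of the difference in the manifest:
# WRONG: the quotes end up inside the file path
- --etcd-cafile="/etc/ssl/etcd/ca-chain.cert.pem"
# RIGHT: no quotes inside the value
- --etcd-cafile=/etc/ssl/etcd/ca-chain.cert.pem
Note the corrected flags also drop the space after the comma in --etcd-servers; the value is not re-tokenized, so a stray space would become part of the next URL.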