Airflow BigQueryInsertJobOperator configuration - google-bigquery

I'm having some issues converting from the deprecated BigQueryOperator to BigQueryInsertJobOperator. I have the task below:
bq_extract = BigQueryInsertJobOperator(
    dag="big_query_task",
    task_id='bq_query',
    gcp_conn_id='google_cloud_default',
    params={'data': Utils().querycontext},
    configuration={
        "query": {
            "query": "{% include 'sql/bigquery.sql' %}",
            "useLegacySql": False,
            "writeDisposition": "WRITE_TRUNCATE",
            "destinationTable": {"datasetId": bq_dataset}
        }
    })
This line in my bigquery_extract.sql query is the one throwing the error:
{% for field in data.bq_fields %}
I want to use 'data' from params, which calls a method that reads from a .json file:
import json
from os import path

from airflow.configuration import conf
from airflow.models import Variable


class Utils():
    bucket = Variable.get('s3_bucket')
    _qcontext = None

    @property
    def querycontext(self):
        if self._qcontext is None:
            self.load_querycontext()
        return self._qcontext

    def load_querycontext(self):
        with open(path.join(conf.get("core", "dags"), 'traffic/bq_query.json')) as f:
            self._qcontext = json.load(f)
The bq_query.json has this format, and I need to use the nested bq_fields list values:
{
  "bq_fields": [
    { "name": "CONCAT(ID, '-', CAST(ID AS STRING))", "alias": "new_id" },
    { "name": "TIMESTAMP(CAST(visitStartTime * 1000 AS INT64))", "alias": "new_timestamp" },
    { "name": "TO_JSON_STRING(hits.experiment)", "alias": "hit_experiment" }
  ]
}
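For illustration, the same data renders fine with plain Jinja outside Airflow; the select-list template below is only a made-up stand-in for my real sql/bigquery.sql:
import json
from jinja2 import Template

# Illustrative template only -- the real sql/bigquery.sql is similar in spirit.
template = Template(
    "SELECT\n"
    "{% for field in data.bq_fields %}"
    "  {{ field.name }} AS {{ field.alias }}{{ ',' if not loop.last }}\n"
    "{% endfor %}"
)

with open('traffic/bq_query.json') as f:
    print(template.render(data=json.load(f)))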
This file has a list which I want to use in the above-mentioned query line, but it's throwing this error:
jinja2.exceptions.UndefinedError: 'data' is undefined

There are two issues with your code.
First "params" is not a supported field in BigQueryInsertJobOperator. See this post where I post how to pass params to sql file when using BigQueryInsertJobOperator. How do you pass variables with BigQueryInsertJobOperator in Airflow
Second, if you happen to get an error that your file cannot be found, make sure you set the full path of your file. I have had to do this when migrating from local testing to the cloud, even though the file is in the same directory. You can set the path in the DAG config as in the example below (replace the path with your own):
with DAG(
    ...
    template_searchpath='/opt/airflow/dags',
    ...
) as dag:
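Putting both points together, a minimal sketch might look like the following. It reuses the Utils class from your question, omits scheduling arguments, and the user_defined_macros part is just one possible way to make 'data' resolvable inside the Jinja template; it is not necessarily the approach from the linked post:
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

bq_dataset = "my_dataset"  # placeholder for the dataset variable in the question

with DAG(
    dag_id="big_query_task",
    template_searchpath="/opt/airflow/dags",  # full path so 'sql/bigquery.sql' is found
    user_defined_macros={"data": Utils().querycontext},  # exposes 'data' to templates
) as dag:
    bq_extract = BigQueryInsertJobOperator(
        task_id="bq_query",
        gcp_conn_id="google_cloud_default",
        configuration={
            "query": {
                "query": "{% include 'sql/bigquery.sql' %}",
                "useLegacySql": False,
                "writeDisposition": "WRITE_TRUNCATE",
                "destinationTable": {"datasetId": bq_dataset},
            }
        },
    )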

Related

Generate "Instances" definition programmatically to create EMR cluster in StepFunctions

I have a case where I want to dynamically create an EMR cluster based on a user-defined configuration and execute a sequence of steps on it using AWS Step Functions.
For this, I am planning to provide the instance configuration as an input to the step functions workflow.
Based on the StepFunctions-EMR Integration Documentation, the definition is the same as that of the RunJobFlow API.
However, when I try to generate the definition by serializing an instance of JobFlowInstancesConfig to JSON and pass it to the StateMachine as an input, it throws an error saying:
The field 'Instances.KeepJobFlowAliveWhenNoSteps' is required but was missing
Here is the JSON generated post serialization:
{
  "instanceFleets": [
    {
      "instanceFleetType": "MAIN",
      "targetOnDemandCapacity": 1,
      "instanceTypeConfigs": [
        {
          "instanceType": "m5.xlarge"
        }
      ]
    },
    {
      "instanceFleetType": "CORE",
      "targetOnDemandCapacity": 1,
      "instanceTypeConfigs": [
        {
          "instanceType": "c5.2xlarge"
        }
      ]
    }
  ],
  "keepJobFlowAliveWhenNoSteps": true
}
I am passing this in the input and accessing it in my Step Functions definition in the task below (where I expect the above definition to replace $.jobFlowInstancesConfig):
...
"GetCluster": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
"Parameters": {
"Name.$": "$.clusterName",
"VisibleToAllUsers": true,
"ReleaseLabel": "emr-5.30.0",
"Applications": [
{
"Name": "Spark"
}
],
"ServiceRole": "EMR_DefaultRole",
"JobFlowRole": "EMR_EC2_DefaultRole",
"LogUri": "s3://my-aws-logs/elasticmapreduce/",
"Instances.$": "$.jobFlowInstancesConfig"
}
}
...
My suspicion is that this is failing because Step Functions expects the field names to start with an upper-case letter.
Question: How do I programmatically generate the appropriate definition without having to play around with strings to build the JSON? Is there a straightforward way to serialize the above definition into one that will work with Step Functions?
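One possible workaround, sketched below without any particular SDK in mind, is to recursively re-key the serialized JSON so every field name starts with an upper-case letter before passing it as input to the state machine (the input file name is a placeholder):
import json

def pascal_case_keys(value):
    """Recursively turn camelCase keys (instanceFleets) into the
    PascalCase names the RunJobFlow API expects (InstanceFleets)."""
    if isinstance(value, dict):
        return {k[:1].upper() + k[1:]: pascal_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [pascal_case_keys(v) for v in value]
    return value

# Placeholder path for the serialized JobFlowInstancesConfig shown above.
with open("job_flow_instances_config.json") as f:
    camel_cased = json.load(f)

print(json.dumps(pascal_case_keys(camel_cased), indent=2))
# "keepJobFlowAliveWhenNoSteps" -> "KeepJobFlowAliveWhenNoSteps", etc.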

GCP Bigquery: Can't query stackdriver access logs exported in cloudstorage because invalid json field "#type"

I store the access log of a pixel image in a Cloud Storage bucket dev-access-log-bucket using the standard "sink",
so the files look like this: requests/2019/05/08/15:00:00_15:59:59_S1.json
and one line looks like this (I formatted the JSON, but it's on one line normally):
{
  "httpRequest": {
    "cacheLookup": true,
    "remoteIp": "93.24.25.190",
    "requestMethod": "GET",
    "requestSize": "224",
    "requestUrl": "https://dev-snowplow.legalstart.fr/one_pixel_image.png?user_id=0&action=purchase&product_id=0&money=10",
    "responseSize": "779",
    "status": 200,
    "userAgent": "python-requests/2.21.0"
  },
  "insertId": "w6wyz1g2jckjn6",
  "jsonPayload": {
    "#type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "statusDetails": "response_sent_by_backend"
  },
  "logName": "projects/tracking-pixel-239909/logs/requests",
  "receiveTimestamp": "2019-05-08T15:34:24.126095758Z",
  "resource": {
    "labels": {
      "backend_service_name": "",
      "forwarding_rule_name": "dev-yolaw-pixel-forwarding-rule",
      "project_id": "tracking-pixel-239909",
      "target_proxy_name": "dev-yolaw-pixel-proxy",
      "url_map_name": "dev-urlmap",
      "zone": "global"
    },
    "type": "http_load_balancer"
  },
  "severity": "INFO",
  "spanId": "7d8823509c2dc94f",
  "timestamp": "2019-05-08T15:34:23.140747307Z",
  "trace": "projects/tracking-pixel-239909/traces/bb55577eedd5797db2867931f8de9162"
}
All of these, once again, are standard GCP things; I did not customize anything here.
So now I want to run some queries on it from BigQuery. I created a dataset and an external table configured like this:
External Data Configuration
Source URI(s) gs://dev-access-log-bucket/requests/*
Auto-detect schema true (note: I don't know why it shows true even though I've manually defined the schema)
Ignore unknown values true
Source format NEWLINE_DELIMITED_JSON
Max bad records 0
and the following manual schema:
timestamp DATETIME REQUIRED
httpRequest RECORD REQUIRED
httpRequest.requestUrl STRING REQUIRED
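For reference, roughly the same external table definition expressed with the google-cloud-bigquery Python client would look like the sketch below (the dataset and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="tracking-pixel-239909")

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://dev-access-log-bucket/requests/*"]
external_config.ignore_unknown_values = True
external_config.max_bad_records = 0

schema = [
    bigquery.SchemaField("timestamp", "DATETIME", mode="REQUIRED"),
    bigquery.SchemaField(
        "httpRequest",
        "RECORD",
        mode="REQUIRED",
        fields=[bigquery.SchemaField("requestUrl", "STRING", mode="REQUIRED")],
    ),
]

# "access_logs.requests" is a placeholder dataset/table name.
table = bigquery.Table("tracking-pixel-239909.access_logs.requests", schema=schema)
table.external_data_configuration = external_config
client.create_table(table)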
and when I run a query
SELECT
timestamp
FROM
`path.to.my.table`
LIMIT
1000
I got
Invalid field name "#type". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
How can I work around this without needing to pre-process the logs to remove the "#type" field?

CloudWatch metric Filter Pattern doesn't match the JSON logs

I am trying to configure a custom metric for my service that will monitor its uptime. I want to get the value of the status code and raise an alarm if the value is 4* or 5*. I am trying to match the statusCode using a JSON filter pattern but am not able to get it right.
{
  "res": {
    "statusCode": 500
  },
  "req": {
    "url": "",
    "headers": {
      "host": "",
      "connection": "close",
      "user-agent": "",
      "accept-encoding": "gzip, compressed"
    },
    "method": "GET",
    "httpVersion": "1.1",
    "originalUrl": "",
    "query": {}
  },
  "responseTime": 1,
  "requestId": "",
  "level": "info",
  "message": "",
  "timestamp": ""
}
My filter patterns are below; I tried both of these but neither worked:
{$.res.statusCode = 500}
{($.res.statusCode = 2*)||($.res.statusCode = 5*)}
I am trying to follow https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
It seems to be an issue with the JSON: when we copy-paste the JSON it doesn't work,
but if the log data is selected from the "Log Data to Test" dropdown, the filter worked for me in this case.
The answer that worked for me is the one in AWS Cloudwatch Json Metric Filter Pattern,
which quotes:
When using the Test Metric Filter feature in the AWS Console, each log
event has to be in a separate line. You can still run the same test,
but you have to remove all new lines from the sample data.
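In practice that means collapsing the sample event to a single line before pasting it into the Test Metric Filter box. A small helper sketch (the file name is a placeholder):
import json

# Re-serialize the pretty-printed sample event without newlines so the
# Test Metric Filter console treats it as one log event.
with open("sample_event.json") as f:
    event = json.load(f)

print(json.dumps(event, separators=(",", ":")))
# Paste the one-line output into the console and test, e.g.,
# { $.res.statusCode = 500 }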

How to use 'Ignore or Validate' in JSON file (karate framework)?

I have to verify that each response is in the correct format. I've put this in my feature:
And match each response contains { id: '#string', name: '#string', phone: '#number' etc..}
But I would like to put this in a JSON file, because I need it several times in different features.
When I use 'Ignore or Validate' tags in a JSON file, it doesn't work. Is it possible to do this?
Yes, why not. First, place the following in a file called item-schema.json:
{ "id": "#string", "name": "#string", "phone": "#number" }
Now all you need to do is:
And match each response contains read('item-schema.json')
Please go through the documentation of Reading Files carefully.

Dropbox API V2 list_file_members/batch empty results

I'm currently trying to work with the Dropbox list_file_members API endpoint, as it appears to me to be the only place to find out who owns a file
(see the following example result, taken from the documentation page):
{
  "users": [
    {
      "access_type": {
        ".tag": "owner"
      },
      "user": {
        "account_id": "dbid:AAH4f99T0taONIb-OurWxbNQ6ywGRopQngc",
        "same_team": true,
        "team_member_id": "dbmid:abcd1234"
      },
      "permissions": [],
      "is_inherited": false
    }
  ],
  "groups": [...]
  ...
}
However, when I call the API on a single file I get the following:
{
  "users": [],
  "groups": [
    {
      "access_type": {
        ".tag": "editor"
      },
      "permissions": [],
      "is_inherited": true,
      "group": {
        "group_name": "Everyone at TEAM_NAME_HERE",
        "group_id": "g:GROUP_ID_HERE",
        "member_count": 6,
        "group_management_type": {
          ".tag": "company_managed"
        },
        "group_type": {
          ".tag": "team"
        },
        "is_owner": false,
        "same_team": true
      }
    }
  ],
  "invitees": []
}
This result contains no owner information, so I'm assuming this is because everyone has the same access levels?
The problem worsens when I try to call files in batches using the sharing_list_file_members/batch endpoint; I get the following result:
[
  {
    "file": "id:THIS_IS_MY_FILE_ID",
    "result": {
      ".tag": "result",
      "members": {
        "users": [],
        "groups": [],
        "invitees": []
      },
      "member_count": 0
    }
  }
]
Obviously this is even less helpful. This is the same when I access the API via my own PHP as well as the API Explorer. Could anyone tell me where I'm going wrong, and why I'm getting no results for users, and even groups, when done in batches?
The /2/sharing/list_file_members endpoint is documented as:
Use to obtain the members who have been invited to a file, both inherited and uninherited members.
The /2/sharing/list_file_members/batch endpoint is documented as:
Get members of multiple files at once. The arguments to this route are more limited, and the limit on query result size per file is more strict. To customize the results more, use the individual file endpoint.
Inherited users are not included in the result, and permissions are not returned for this endpoint.
It sounds like the file for your example is in a team folder, and so the group listed for your non-batch example is the team group, i.e., an inherited group. The documentation indicates that this group isn't expected when using the batch endpoint.
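So, to see inherited members (including the owning team group) for each file, you would call the individual endpoint per file rather than the batch one. A rough sketch over plain HTTPS, with a placeholder token and file ID, assuming include_inherited behaves as documented:
import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"    # placeholder
FILE_ID = "id:THIS_IS_MY_FILE_ID"     # placeholder, as in the question

# /2/sharing/list_file_members includes inherited members,
# unlike the /batch variant.
response = requests.post(
    "https://api.dropboxapi.com/2/sharing/list_file_members",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"file": FILE_ID, "include_inherited": True},
)
response.raise_for_status()
for group in response.json().get("groups", []):
    print(group["group"]["group_name"], group["access_type"][".tag"])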