Saving output from parsing a JSON file and passing it to BigQueryInsertJobOperator - google-bigquery

I need some advice on meeting this requirement for auditing purposes. I am using Airflow on Composer with a Dataflow Java operator job, which writes a JSON file (into the Airflow data folder) after job completion, containing the status and error message details. I want to extract the status and error message from the JSON file via some operator and then pass them to the next pipeline task, a BigQueryInsertJobOperator, which calls a stored procedure with the status and error message as input parameters, so that they are finally written to a table in a BQ dataset.
Thanks

You need to use XCom and Jinja templating. When you return metadata from an operator, the data is stored in XCom, and you can retrieve it using Jinja templating or Python code in a PythonOperator (or Python code in your own custom operator).
Here are two very good articles from Marc Lamberti (who also has really nice courses on Airflow). This one describes how templating and Jinja can be leveraged in Airflow: https://marclamberti.com/blog/templates-macros-apache-airflow/ and this one describes XCom: https://marclamberti.com/blog/airflow-xcom/
By combining the two you can get what you want.
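As a minimal sketch of how the two pieces could fit together (not a tested DAG): the function below could serve as the python_callable of a PythonOperator, whose return value Airflow stores in XCom, and the templated query could go into the configuration of the BigQueryInsertJobOperator. The JSON key names ("status", "error_message"), the task id "parse_report" and the procedure name my_dataset.log_job_status are all assumptions made for illustration.

```python
import json


def extract_audit_fields(path):
    """Read the job's JSON report and return the two fields to audit.

    When used as the python_callable of a PythonOperator, the returned
    dict is stored in XCom for downstream tasks to pull.
    """
    with open(path) as handle:
        report = json.load(handle)
    # "status" and "error_message" are assumed key names in the report.
    return {
        "status": report.get("status", "UNKNOWN"),
        "error_message": report.get("error_message", ""),
    }


# Jinja-templated query; Airflow renders the {{ ... }} expressions
# just before the task runs. "parse_report" is a hypothetical task id
# and my_dataset.log_job_status a hypothetical stored procedure.
AUDIT_QUERY = """
CALL my_dataset.log_job_status(
  '{{ ti.xcom_pull(task_ids="parse_report")["status"] }}',
  '{{ ti.xcom_pull(task_ids="parse_report")["error_message"] }}'
)
"""

# Would be passed as configuration=... to BigQueryInsertJobOperator.
AUDIT_JOB_CONFIGURATION = {"query": {"query": AUDIT_QUERY, "useLegacySql": False}}
```

Note that an error message containing a single quote would break the rendered SQL, so some escaping or the use of query parameters is probably needed in practice.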

Related

Azure Data Factory Script Activity does not like the keyword GO

If I create a script, e.g.
print 'hello'
GO
print 'cats'
GO
Then the script errors when I try to run my ADF pipeline:
Operation on target GreetCatsActivity failed: Incorrect syntax near 'GO'.
Is GO not allowed in scripts? The issue is I need it to run a gigantic script that is autogenerated and has tons of GO statements in it. Part of the script might reference things created earlier in the script so I suspect the GO statements are important to ensure items are created to be used later on.
Could I be doing something wrong or is there another way to handle this?
Just thought of sharing this, as I think it may be helpful.
You can use a Mapping Data Flow to remove the "GO" lines.
What I tried: on the source side I added the scripts (files with the .sql extension; I am assuming you have something similar), containing "GO" as shown above. I used a Filter to get rid of all the GOs, and on the sink side I wrote the scripts (without GO) back to a blob.
Next, I wanted to automate the execution of the commands using the Script activity in ADF.
To execute the queries, use a Lookup activity that reads the file created in the previous step, followed by a ForEach loop with a Script activity inside it.
My lookup output looks something like
So when you are setting up the Script activity, pass a dynamic value as I did.
HTH
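For anyone who would rather pre-process the script outside ADF, here is a rough Python sketch of the same idea: split the script at lines that contain only GO, so each batch can be run on its own and in order. It is an illustration rather than part of the ADF setup above, and it assumes GO never appears alone on a line inside a string literal or comment and that the "GO <count>" form is not used.

```python
import re

# A line containing only GO (any case, optional surrounding spaces or tabs).
GO_LINE = re.compile(r"^[ \t]*GO[ \t]*\r?$", re.IGNORECASE | re.MULTILINE)


def split_batches(script):
    """Split a SQL script into batches at GO separator lines."""
    parts = GO_LINE.split(script)
    # Drop empty pieces, e.g. the remainder after a trailing GO.
    return [part.strip() for part in parts if part.strip()]


script = "print 'hello'\nGO\nprint 'cats'\nGO\n"
for batch in split_batches(script):
    print(batch)
```

Running the batches one after another keeps the ordering the GO statements were there to guarantee.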

Google Dataflow SQL | Creating Branches | Error Handling

Trying to use Dataflow SQL for Stream ingestion:
We have a Pubsub topic (source) and BigQuery Table (sink).
To achieve that, we need to follow these steps:
From the BigQuery UI, add a schema for the topic manually.
Question: Can we automate this process using command-line options?
Write SQL for the transformation and execute it using the gcloud dataflow query command (this helps us with dynamic queries and automation).
Question: Suppose a key is missing from some Pubsub messages, and the pipeline marks those messages as errors in Stackdriver. Can we add some capability such as: if schema validation fails, move the message to table y, else to table x? Or something like: if we get message type y, move it to table y, else to table x?
You can use gcloud to add a schema to a topic. This was actually the only way to do it, at first: https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations#gcloud
For saving messages that cannot be parsed into SQL rows, the functionality is often called "dead letter queue". It is available in Beam SQL DDL for Pubsub but is not yet available when using Dataflow SQL through the BigQuery UI. See https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#pubsub

BigQuery autodetect doesn't work with inconsistent json?

I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?
I found two tools that can help:
bigquery-schema-generator 0.5.1, which uses all the data to derive the schema instead of the 100 sample rows that BigQuery uses.
Spark SQL; you would need to set up your dev environment, or at least install Spark and invoke the spark-shell tool.
However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not sure that such tools will work for a JSON file that is intended to fail.
Last but not least, I got the JSON imported after I manually removed the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.
Yes, see documentation here:
https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not conform to the initial rows, you should not use autodetect and need to provide an explicit schema.
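To see which fields an explicit schema would have to cover, one option is to scan every row rather than a sample. The sketch below is only an illustration of that idea: it assumes newline-delimited JSON and collects the union of dotted field paths with the Python type names seen for each. It does not produce a BigQuery schema file, and the function name is made up.

```python
import json


def collect_field_types(lines):
    """Map each dotted field path to the set of value types seen in any row."""
    seen = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Repeated values share the path of their parent field.
            for item in value:
                walk(item, path)
        else:
            seen.setdefault(path, set()).add(type(value).__name__)

    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            walk(json.loads(line), "")
    return seen


rows = [
    '{"mc": [{"id": "1"}]}',
    '{"mc": [{"id": "2", "marketDefinition": {"settledTime": "2020-03-01T02:55:47.000Z"}}]}',
]
for path, types in sorted(collect_field_types(rows).items()):
    print(path, sorted(types))
```

A field such as mc.marketDefinition.settledTime that first appears late in the file then shows up in the result, no matter which rows a sample would have picked.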
Autodetect may not work well, since it looks at only a sample of up to 100 rows to detect the schema. Using schema detection for JSON could be a costly endeavor.
How about using BqTail with the AllowFieldAddition option, which allows you to expand the schema cost-effectively?
You could simply use the following ingestion workflow with the CLI or serverless:
bqtail -r=rule.yaml -s=sourceURL
#rule.yaml
When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false
Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp
Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete
See JSON with allow field addition e2e test case

Nifi - How to put data into Hive database?

I am building a NiFi flow to get JSON elements from a Kafka topic and write them into a Hive table.
However, there is little to no documentation about the processors and how to use them.
What I plan to do is the following:
kafka consume --> ReplaceText --> PutHiveQL
Consuming the Kafka topic works fine. I receive a JSON string.
I would like to extract the JSON data (with ReplaceText) and put it into the Hive table (PutHiveQL).
However, I have absolutely no idea how to do this. The documentation is not helping, and there is no precise example of processor usage (or I could not find one).
Is my theoretical solution valid?
How do I extract the JSON data, build an HQL query, and send it to my local Hive database?
Basically, you want to transform your record from Kafka into an HQL request and then send the request to the PutHiveQL processor.
I am not sure that the transformation Kafka record -> PutHiveQL can be done by replacing text (it seems a little bit hard/tricky). In general, I use a custom Groovy script processor to do this.
Edit
Global overview :
EvaluateJsonPath
This extracts the timestamp and uuid properties from my JSON flowfile and puts them as attributes of the flowfile.
ReplaceText
This sets the flowfile content to an empty string and replaces it with the Replacement Value property, in which I build the query.
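Outside NiFi, the logic of these two steps can be sketched in Python as below: pull timestamp and uuid out of the JSON record, then build the statement that PutHiveQL would receive. This is only an illustration of the transformation, not NiFi configuration. The table name "events" and both helper names are hypothetical, and the backslash escaping of quotes is an assumption about how the values should be quoted.

```python
import json


def quote_hive_string(value):
    """Wrap a value in single quotes, backslash-escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def build_insert(flowfile_content, table="events"):
    """Build an INSERT statement from the timestamp and uuid of one JSON record."""
    record = json.loads(flowfile_content)
    # Only the two extracted properties are used; other keys are ignored.
    values = ", ".join(quote_hive_string(record[key]) for key in ("timestamp", "uuid"))
    return f"INSERT INTO TABLE {table} VALUES ({values})"


print(build_insert('{"timestamp": "2020-01-01 00:00:00", "uuid": "abc", "other": 1}'))
```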
You can directly inject the streaming data using the PutHiveStreaming processor.
Create an ORC table with a structure matching the flow and pass the flow to the PutHive3Streaming processor; it works.

Cannot load jdbc driver class org.apache.hive.jdbc.hivedriver in Kylo

I am trying to create a Data Ingest feed, but all the jobs are failing. I checked NiFi and there are error marks saying that "org.apache.hive.jdbc.hivedriver" was not found. I checked the NiFi logs and found the following error:
So where exactly do I need to put the Hive driver jar?
Based on the comments, this seems to be the solution, as mentioned by @Greg Hart:
Have you tried using a Data Transformation feed? The Data Ingest
template is for loading data into Hive, but it looks like you're using
it to move data from one Hive table into another.