Using schema update option in beam.io.WriteToBigQuery - google-bigquery

I am loading a bunch of log files into BigQuery using an Apache Beam Dataflow pipeline. The file format can change over time as new columns are added to the files. I see a schema update option ALLOW_FIELD_ADDITION.
Anyone know how to use it? This is how my WriteToBQ step looks:
| 'write to bigquery' >> beam.io.WriteToBigQuery('project:datasetId.tableId', write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

I haven't actually tried this yet, but digging into the documentation, it seems you can pass whatever configuration you like to the BigQuery load job using additional_bq_parameters. In this case it might look something like:
| 'write to bigquery' >> beam.io.WriteToBigQuery(
    'project:datasetId.tableId',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    additional_bq_parameters={
        'schemaUpdateOptions': [
            'ALLOW_FIELD_ADDITION',
            'ALLOW_FIELD_RELAXATION',
        ]
    })
Weirdly, this is actually in the Java SDK but doesn't seem to have made its way to the Python SDK.
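As a side note, for a load job to actually add new columns, the schema passed to the sink generally has to list the new fields as well (or schema auto-detection has to be enabled). A minimal, untested sketch of how that could fit together; the column names here are made up:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'create rows' >> beam.Create([{'existing_col': 'a', 'new_col': 'b'}])  # hypothetical rows
     | 'write to bigquery' >> beam.io.WriteToBigQuery(
         'project:datasetId.tableId',
         # the schema already describes the newly added column
         schema='existing_col:STRING,new_col:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         additional_bq_parameters={
             'schemaUpdateOptions': ['ALLOW_FIELD_ADDITION'],
         }))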

Related

Deploy SQL workflow with DBX

I am developing a deployment via DBX to Azure Databricks. As part of this I need a data job written in SQL to run every day. The job is located in the file data.sql. I know how to do this with a Python file; in that case I would do the following:
build:
  python: "pip"
environments:
  default:
    workflows:
      - name: "workflow-name"
        schedule:
          quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
          timezone_id: "Europe"
        format: MULTI_TASK
        job_clusters:
          - job_cluster_key: "basic-job-cluster"
            <<: *base-job-cluster
        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            spark_python_task:
              python_file: "file://filename.py"
But how can I change it so it runs a SQL job instead? I imagine it is the last two lines (spark_python_task: and python_file: "file://filename.py") which need to be changed.
There are various ways to do that.
(1) One of the simplest is to add a SQL query in the Databricks SQL lens, and then reference this query via sql_task as described here.
(2) If you want a Python project that re-uses SQL statements from a static file, you can add this file to your Python package and then call it from your package (a short sketch follows this list), e.g.:
sql_statement = ... # code to read from the file
spark.sql(sql_statement)
(3) A third option is to use the DBT framework with Databricks. In this case you probably would like to use dbt_task as described here.
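For option (2), a minimal, untested sketch of reading a packaged SQL file and running it with Spark; the package and resource names here are hypothetical:

from importlib.resources import read_text
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# data.sql is assumed to be shipped as package data inside my_package/resources
sql_statement = read_text("my_package.resources", "data.sql")
spark.sql(sql_statement)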
I found a simple workaround (although probably not the prettiest): simply change data.sql to a Python file and run the queries with spark.sql. That way I could keep using the same spark_python_task.

Add file name and timestamp into each record in BigQuery using Dataflow

I have a few .txt files with data in JSON to be loaded into a Google BigQuery table. Along with the columns in the text files, I need to insert the filename and the current timestamp for each row. This runs on GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the filepath and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the filepath to ReadFromText, and call a FileNameReadFunction ParDo.
(p
 | "read from file" >> ReadFromText(known_args.input)
 | "parse" >> beam.Map(json.loads)
 | "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
 | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
     known_args.output,
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in "Dataflow/apache beam - how to access current filename when passing in pattern?" but I can't quite make it work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename,line) tuples.
To include the file and timestamp in your output json record, you could change your "parse" line to
| "parse" >> beam.map(lambda (file, line): {
**json.loads(line),
"filename": file,
"timestamp": datetime.now()})

How to get information on latest successful pod deployment in OpenShift 3.6

I am currently working on a CI/CD script to deploy a complex environment into another environment. We have multiple technologies involved, and I want to optimize this script because it takes too much time to fetch information on each environment.
In the OpenShift 3.6 section, I need to get the last successful deployment for each application in a specific project. I have tried to find a quick way to do so, but so far this is the only solution I have found:
oc rollout history dc -n <Project_name>
This gives me the following output:
deploymentconfigs "<Application_name>"
REVISION STATUS CAUSE
1 Complete config change
2 Complete config change
3 Failed manual change
4 Running config change
deploymentconfigs "<Application_name2>"
REVISION STATUS CAUSE
18 Complete config change
19 Complete config change
20 Complete manual change
21 Failed config change
....
I then take this output and parse each line to find the latest revision that has the status "Complete" (a small parsing sketch is shown after the example below).
In the above example, I would get this list:
<Application_name> : 2
<Application_name2> : 20
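For reference, a small Python sketch of that parsing step, assuming the oc rollout history output format shown above (untested):

import re
import subprocess

output = subprocess.run(
    ["oc", "rollout", "history", "dc", "-n", "<Project_name>"],
    capture_output=True, text=True, check=True).stdout

latest_complete = {}
current_app = None
for line in output.splitlines():
    header = re.match(r'deploymentconfigs "(.+)"', line)
    if header:
        current_app = header.group(1)
        continue
    fields = line.split()
    # keep the highest revision number whose status is "Complete"
    if current_app and len(fields) >= 2 and fields[0].isdigit() and fields[1] == "Complete":
        latest_complete[current_app] = max(latest_complete.get(current_app, 0), int(fields[0]))

print(latest_complete)  # e.g. {'<Application_name>': 2, '<Application_name2>': 20}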
Then for each application and each revision I do :
oc rollout history dc/<Application_name> -n <Project_name> --revision=<Latest_Revision>
In the above example, the Latest_Revision for Application_name is 2, the latest revision that is Complete (not Running and not Failed).
This gives me the output with the information I need: the version of the EAR and the version of the configuration that was used to build the image for this successful deployment.
But since I have multiple applications, this process can take up to 2 minutes per environment.
Would anybody have a better way of fetching the information I need?
Unless I am mistaken, it looks like there is no one-liner that can get this information for the currently running and accessible applications.
Thanks
Assuming that the currently active deployment is the latest successful one, you may try the following:
oc get dc -a --no-headers | awk '{print "oc rollout history dc "$1" --revision="$2}' | . /dev/stdin
It gets a list of deployments, feeds it to awk to extract the name ($1) and revision ($2), builds your command to extract the details, and finally sources it from standard input to execute it. It may be frowned upon for not using xargs or the like, but I found it easier for debugging (just drop the last part and see the commands printed out).
UPDATE:
On second thoughts, you might actually like this one better:
oc get dc -a -o jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.spec.template.spec.containers[0].env}{"\n\t"}{.spec.template.spec.containers[0].image}{"\n-------\n"}{end}'
The example output:
daily-checks
[map[name:SQL_QUERIES_DIR value:daily-checks/]]
docker-registry.default.svc:5000/ptrk-testing/daily-checks#sha256:b299434622b5f9e9958ae753b7211f1928318e57848e992bbf33a6e9ee0f6d94
-------
jboss-webserver31-tomcat
registry.access.redhat.com/jboss-webserver-3/webserver31-tomcat7-openshift#sha256:b5fac47d43939b82ce1e7ef864a7c2ee79db7920df5764b631f2783c4b73f044
-------
jtask
172.30.31.183:5000/ptrk-testing/app-txeq:build
-------
lifebicycle
docker-registry.default.svc:5000/ptrk-testing/lifebicycle#sha256:a93cfaf9efd9b806b0d4d3f0c087b369a9963ea05404c2c7445cc01f07344a35
You get the idea, with expressions like .spec.template.spec.containers[0].env you can reach for specific variables, labels, etc. Unfortunately the jsonpath output is not available with oc rollout history.
UPDATE 2:
You could also use post-deployment hooks to collect the data, if you can set up a listener for the hooks. Hopefully the information you need is inherited by the PODs. More info here: https://docs.openshift.com/container-platform/3.10/dev_guide/deployments/deployment_strategies.html#lifecycle-hooks

Is it possible to query data from Whisper (Graphite DB) from console?

I have configured Graphite to monitor my application metrics, and I configured Zabbix to monitor my servers' CPU and other metrics.
Now I want to pass some critical Graphite metrics to Zabbix to add triggers for them.
So I want to do something like
$ whisper get prefix1.prefix2.metricName
> 155
Is it possible?
P.S. I know about the Graphite-API project; I don't want to install an extra app.
You can use the whisper-fetch program, which is provided in the whisper installation package.
Use it like this:
whisper-fetch /path/to/dot.wsp
Or to get e.g. data from the last 5 minutes:
whisper-fetch --from=$(date +%s -d "-5 min") /path/to/dot.wsp
Defaults will result in output like this:
1482318960 21.187000
1482319020 None
1482319080 21.187000
1482319140 None
1482319200 21.187000
You can change it to json using the --json option.
OK! I found it myself: http://graphite.readthedocs.io/en/latest/render_api.html?highlight=rawJson (I can use curl against the render API and get CSV or JSON back).
The answer was found here: custom querying in graphite
Also see: https://github.com/graphite-project/graphite-web/blob/master/docs/render_api.rst
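For completeness, a quick sketch of hitting the render API from Python instead of curl; the host name is a placeholder, and datapoints come back as [value, timestamp] pairs:

import requests

resp = requests.get(
    "http://graphite.example.com/render",
    params={
        "target": "prefix1.prefix2.metricName",
        "from": "-5min",
        "format": "json",
    },
)
resp.raise_for_status()

for series in resp.json():
    for value, timestamp in series["datapoints"]:
        print(timestamp, value)  # value is None where no data was recorded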

JProfiler jpexport csv output missing headings

I'm evaluating JProfiler to see if it can be used in an automated fashion. I'd like to run the tool in offline mode, save a snapshot, export the data, and parse it to see if there is performance degradation since the last run.
I was able to use the jpexport command to export the TelemetryHeap view from a snapshot file to a CSV file (sample below). When I look at the CSV file I see 10 to 13 columns of data but only 4 headings. Is there documentation that explains the output more fully?
Sample output:
"Time [s]","Committed size","Free size","Used size"
0.0,450,880,000,371,600,000,79,280,000
1.0,450,880,000,371,600,000,79,280,000
2.0,450,880,000,371,600,000,79,280,000
3.0,450,880,000,371,600,000,79,280,000
4.0,450,880,000,371,600,000,79,280,000
5.0,450,880,000,371,600,000,79,280,000
6.0,450,880,000,371,600,000,79,280,000
7.0,450,880,000,355,932,992,94,947,000
8.0,450,880,000,355,932,992,94,947,000
9.58,969,216,000,634,564,992,334,651,008
11.05,1,419,456,000,743,606,016,675,849,984
12.05,1,609,792,000,377,251,008,1,232,541,056
17.33,2,524,032,000,1,115,268,992,1,408,763,008
19.43,2,588,224,000,953,451,008,1,634,772,992
26.08,3,711,936,000,1,547,981,056,2,163,954,944
39.75,3,711,936,000,1,145,185,024,2,566,750,976
40.75,3,711,936,000,1,137,052,032,2,574,884,096
41.75,3,711,936,000,1,137,052,032,2,574,884,096
42.75,3,711,936,000,1,137,052,032,2,574,884,096
43.75,3,711,936,000,1,137,051,008,2,574,885,120
44.75,3,711,936,000,1,137,051,008,2,574,885,120
45.75,3,711,936,000,1,137,051,008,2,574,885,120
46.75,3,711,936,000,1,137,051,008,2,574,885,120
47.75,3,711,936,000,1,137,051,008,2,574,885,120
48.75,3,711,936,000,1,137,051,008,2,574,885,120
49.75,3,711,936,000,1,137,051,008,2,574,885,120
50.75,3,711,936,000,1,137,051,008,2,574,885,120
51.75,3,711,936,000,1,137,051,008,2,574,885,120
52.75,3,711,936,000,1,137,051,008,2,574,885,120
53.75,3,711,936,000,1,137,051,008,2,574,885,120
54.75,3,711,936,000,1,137,051,008,2,574,885,120
55.75,3,711,936,000,1,137,051,008,2,574,885,120
56.75,3,711,936,000,1,137,051,008,2,574,885,120
57.75,3,711,936,000,1,137,051,008,2,574,885,120
58.75,3,711,936,000,1,137,051,008,2,574,885,120
60.96,3,711,936,000,1,137,051,008,2,574,885,120
68.73,3,711,936,000,1,137,051,008,2,574,885,120
74.39,3,711,936,000,1,137,051,008,2,574,885,120