Using schema update option in beam.io.WriteToBigQuery - google-bigquery
I am loading a bunch of log files into BigQuery using Apache Beam on Dataflow. The file format can change over time as new columns are added to the files. I see there is a schema update option ALLOW_FIELD_ADDITION.
Does anyone know how to use it? This is how my WriteToBQ step looks:
| 'write to bigquery' >> beam.io.WriteToBigQuery('project:datasetId.tableId', write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
I haven't actually tried this yet, but digging into the documentation, it seems you are able to pass whatever configuration you like to the BigQuery load job using additional_bq_parameters. In this case it might look something like:
| 'write to bigquery' >> beam.io.WriteToBigQuery(
      'project:datasetId.tableId',
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
      additional_bq_parameters={
          'schemaUpdateOptions': [
              'ALLOW_FIELD_ADDITION',
              'ALLOW_FIELD_RELAXATION',
          ]
      }
  )
Weirdly, this is actually in the Java SDK but doesn't seem to have made its way to the Python SDK.
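I haven't run this on Dataflow myself, but a minimal end-to-end sketch of how the option could sit in a pipeline is below. The input pattern and table spec are placeholders, the destination table is assumed to already exist, and forcing the file-loads insert method is my own addition (schemaUpdateOptions are a property of BigQuery load jobs):

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner etc. on the command line
    with beam.Pipeline(options=options) as p:
        (p
         | 'read logs' >> beam.io.ReadFromText('gs://my-bucket/logs/*.json')  # placeholder path
         | 'parse json' >> beam.Map(json.loads)
         | 'write to bigquery' >> beam.io.WriteToBigQuery(
               'project:datasetId.tableId',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # schema updates apply to load jobs
               additional_bq_parameters={
                   'schemaUpdateOptions': [
                       'ALLOW_FIELD_ADDITION',
                       'ALLOW_FIELD_RELAXATION',
                   ]
               }))


if __name__ == '__main__':
    run()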
Related
Deploy SQL workflow with DBX
I am developing deployment via DBX to Azure Databricks. In this regard, I need a data job written in SQL to run every day. The job is located in the file data.sql. I know how to do it with a Python file. There I would do the following:

build:
  python: "pip"

environments:
  default:
    workflows:
      - name: "workflow-name"
        # schedule:
        #   quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
        #   timezone_id: "Europe"
        format: MULTI_TASK

        job_clusters:
          - job_cluster_key: "basic-job-cluster"
            <<: *base-job-cluster

        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            spark_python_task:
              python_file: "file://filename.py"

But how can I change it so that I can run a SQL job instead? I imagine it is the last two lines (spark_python_task: and python_file: "file://filename.py") that need to be changed.
There are various ways to do that.

(1) One of the simplest is to add a SQL query in the Databricks SQL lens, and then reference this query via sql_task as described here.

(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python package and then call it from your package (a fuller sketch follows this answer), e.g.:

sql_statement = ...  # code to read from the file
spark.sql(sql_statement)

(3) A third option is to use the DBT framework with Databricks. In this case you would probably want to use dbt_task as described here.
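A minimal sketch of option (2), assuming the SQL file ships inside your Python package next to the entry-point module and that the script runs on a Databricks cluster where a SparkSession already exists; the file location and the naive split on semicolons are assumptions for illustration:

from pathlib import Path

from pyspark.sql import SparkSession

# Assumed location: data.sql is packaged alongside this module under sql/.
SQL_FILE = Path(__file__).parent / "sql" / "data.sql"


def main():
    # On Databricks the session already exists; getOrCreate simply returns it.
    spark = SparkSession.builder.getOrCreate()

    # spark.sql() runs one statement at a time, so naively split the file
    # on semicolons and execute each non-empty statement in order.
    for statement in SQL_FILE.read_text().split(";"):
        statement = statement.strip()
        if statement:
            spark.sql(statement)


if __name__ == "__main__":
    main()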
I found a simple workaround (although it might not be the prettiest): simply change data.sql to a Python file and run the queries using Spark. This way I could use the same spark_python_task.
Add file name and timestamp into each record in BigQuery using Dataflow
I have a few .txt files with data in JSON to be loaded into a Google BigQuery table. Along with the columns in the text files, I need to insert the file name and the current timestamp for each row. This is in GCP Dataflow with Python 3.7. I accessed the file metadata containing the file path and size using GCSFileSystem.match and metadata_list. I believe I need to get the pipeline code to run in a loop, pass the file path to ReadFromText, and call a FileNameReadFunction ParDo.

(p
 | "read from file" >> ReadFromText(known_args.input)
 | "parse" >> beam.Map(json.loads)
 | "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
 | "WriteToBigQuery" >> beam.io.WriteToBigQuery(known_args.output,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)

I followed the steps in Dataflow/apache beam - how to access current filename when passing in pattern? but I can't quite make it work. Any help is appreciated.
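The AddFilenamesFn referenced in the snippet above isn't shown in the question; a minimal sketch of what such a DoFn could look like is below. The field name 'filename' and the way the path is passed in as a side argument are assumptions for illustration.

import apache_beam as beam


class AddFilenamesFn(beam.DoFn):
    """Attach the source file path to every parsed record (illustrative sketch)."""

    def process(self, element, file_path):
        # element is the dict produced by json.loads in the previous step;
        # file_path is the extra positional argument given to beam.ParDo.
        element['filename'] = file_path
        yield element

It would be applied once per file, as in the question's loop, e.g. beam.ParDo(AddFilenamesFn(), GCSFilePath), with GCSFilePath taken from the GCSFileSystem.match metadata.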
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename, line) tuples. To include the file and timestamp in your output JSON record, you could change your "parse" line to:

| "parse" >> beam.Map(lambda file_line: {
      **json.loads(file_line[1]),
      "filename": file_line[0],
      "timestamp": datetime.now()})
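Putting the pieces together, a self-contained sketch of that approach might look like the following. The bucket path and output table are placeholders, and emitting the timestamp as an ISO-8601 string (so it serialises cleanly into BigQuery) is an assumption on my part:

import json
from datetime import datetime, timezone

import apache_beam as beam
from apache_beam.io.textio import ReadFromTextWithFilename
from apache_beam.options.pipeline_options import PipelineOptions


def add_file_and_timestamp(file_and_line):
    """Merge the parsed JSON record with its source file name and a load timestamp."""
    file_name, line = file_and_line
    record = json.loads(line)
    record['filename'] = file_name
    record['timestamp'] = datetime.now(timezone.utc).isoformat()
    return record


def run():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'read from file' >> ReadFromTextWithFilename('gs://my-bucket/input/*.txt')  # placeholder path
         | 'parse' >> beam.Map(add_file_and_timestamp)
         | 'write to bigquery' >> beam.io.WriteToBigQuery(
               'project:dataset.table',  # placeholder table
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()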
How to get information on latest successful pod deployment in OpenShift 3.6
I am currently working on a CICD script to deploy a complex environment into another environment. We have multiple technologies involved, and I want to optimize this script because it is taking too much time to fetch information on each environment. In the OpenShift 3.6 section, I need to get the last successful deployment for each application of a specific project. I tried to find a quick way to do so, but so far I have only found this solution:

oc rollout history dc -n <Project_name>

This gives me the following output:

deploymentconfigs "<Application_name>"
REVISION   STATUS     CAUSE
1          Complete   config change
2          Complete   config change
3          Failed     manual change
4          Running    config change

deploymentconfigs "<Application_name2>"
REVISION   STATUS     CAUSE
18         Complete   config change
19         Complete   config change
20         Complete   manual change
21         Failed     config change
....

I then take this output and parse each line to find the latest revision with the status "Complete" (a rough sketch of that parsing is below). In the above example, I would get this list:

<Application_name> : 2
<Application_name2> : 20

Then, for each application and its latest revision, I run:

oc rollout history dc/<Application_name> -n <Project_name> --revision=<Latest_Revision>

In the above example the Latest_Revision for Application_name is 2, which is the latest complete revision that is neither building nor failed. This gives me the output with the information I need, namely the version of the ear and the version of the configuration used to build the image behind this successful deployment. But since I have multiple applications, this process can take up to 2 minutes per environment. Would anybody have a better way of fetching the information I need? Unless I am mistaken, it looks like there is no one-liner that returns this information for the currently running and accessible application. Thanks
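To illustrate the parsing step described above, a rough sketch (not the actual script; the function name and the use of subprocess are made up for illustration, while the command and output format are the ones shown above):

import subprocess


def latest_complete_revisions(project):
    """Parse `oc rollout history dc -n <project>` into {dc_name: latest Complete revision}."""
    output = subprocess.run(
        ["oc", "rollout", "history", "dc", "-n", project],
        check=True, capture_output=True, text=True).stdout

    latest = {}
    current_dc = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith('deploymentconfigs "'):
            current_dc = line.split('"')[1]
        elif current_dc and line and line[0].isdigit():
            revision, status = line.split()[:2]
            if status == "Complete":
                latest[current_dc] = max(latest.get(current_dc, 0), int(revision))
    return latest


# Example: print the latest successful revision per deployment config.
for name, rev in latest_complete_revisions("<Project_name>").items():
    print(f"{name} : {rev}")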
Assuming that the currently active deployment is the latest successful one, you may try the following:

oc get dc -a --no-headers | awk '{print "oc rollout history dc "$1" --revision="$2}' | . /dev/stdin

It gets a list of deployments, feeds it to awk to extract the name ($1) and revision ($2), then compiles your command to extract the details, and finally sends it to standard input for execution. It may be frowned upon for not using xargs or the like, but I found it easier for debugging (just drop the last part and see the commands printed out).

UPDATE: On second thoughts, you might actually like this one better:

oc get dc -a -o jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.spec.template.spec.containers[0].env}{"\n\t"}{.spec.template.spec.containers[0].image}{"\n-------\n"}{end}'

Example output:

daily-checks
    [map[name:SQL_QUERIES_DIR value:daily-checks/]]
    docker-registry.default.svc:5000/ptrk-testing/daily-checks#sha256:b299434622b5f9e9958ae753b7211f1928318e57848e992bbf33a6e9ee0f6d94
-------
jboss-webserver31-tomcat
    registry.access.redhat.com/jboss-webserver-3/webserver31-tomcat7-openshift#sha256:b5fac47d43939b82ce1e7ef864a7c2ee79db7920df5764b631f2783c4b73f044
-------
jtask
    172.30.31.183:5000/ptrk-testing/app-txeq:build
-------
lifebicycle
    docker-registry.default.svc:5000/ptrk-testing/lifebicycle#sha256:a93cfaf9efd9b806b0d4d3f0c087b369a9963ea05404c2c7445cc01f07344a35

You get the idea: with expressions like .spec.template.spec.containers[0].env you can reach for specific variables, labels, etc. Unfortunately the jsonpath output is not available with oc rollout history.

UPDATE 2: You could also use post-deployment hooks to collect the data, if you can set up a listener for the hooks. Hopefully the information you need is inherited by the PODs. More info here: https://docs.openshift.com/container-platform/3.10/dev_guide/deployments/deployment_strategies.html#lifecycle-hooks
Is it possible to query data from Whisper (Graphite DB) from the console?
I have configured Graphite to monitor my application metrics, and I have configured Zabbix to monitor my servers' CPU and other metrics. Now I want to pass some critical Graphite metrics to Zabbix so I can add triggers for them. So I want to do something like:

$ whisper get prefix1.prefix2.metricName
> 155

Is it possible?
P.S. I know about the Graphite-API project; I don't want to install an extra app.
You can use the whisper-fetch program which is provided in the whisper installation package. Use it like this:

whisper-fetch /path/to/dot.wsp

Or, to get e.g. data from the last 5 minutes:

whisper-fetch --from=$(date +%s -d "-5 min") /path/to/dot.wsp

Defaults will result in output like this:

1482318960  21.187000
1482319020  None
1482319080  21.187000
1482319140  None
1482319200  21.187000

You can change it to JSON using the --json option.
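Since the goal is to feed single values into Zabbix, a small wrapper around whisper-fetch could help. This is only a sketch, assuming whisper-fetch is on the PATH, relying on the plain two-column output shown above, and using a hypothetical .wsp path:

import subprocess
import time


def latest_value(wsp_path, lookback_seconds=300):
    """Return the most recent non-None datapoint from a Whisper file, or None."""
    from_ts = int(time.time()) - lookback_seconds
    output = subprocess.run(
        ["whisper-fetch", f"--from={from_ts}", wsp_path],
        check=True, capture_output=True, text=True).stdout

    value = None
    for line in output.splitlines():
        _, raw = line.split()          # each line is "<timestamp> <value or None>"
        if raw != "None":
            value = float(raw)         # keep the last (most recent) non-None value
    return value


# Example: the path below is hypothetical; the printed value could then be
# forwarded to Zabbix, e.g. with zabbix_sender.
print(latest_value("/opt/graphite/storage/whisper/prefix1/prefix2/metricName.wsp"))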
OK! I found it myself: http://graphite.readthedocs.io/en/latest/render_api.html?highlight=rawJson (I can use curl and get CSV or JSON back). The answer was found here: custom querying in graphite
Also see: https://github.com/graphite-project/graphite-web/blob/master/docs/render_api.rst
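For completeness, hitting the render API from Python instead of curl might look like this; the Graphite host is a placeholder, and the response layout follows the render API's json format (a list of series, each with [value, timestamp] datapoints):

import requests

GRAPHITE_RENDER_URL = "http://graphite.example.com/render"  # placeholder host

resp = requests.get(GRAPHITE_RENDER_URL, params={
    "target": "prefix1.prefix2.metricName",
    "from": "-5min",
    "format": "json",
})
resp.raise_for_status()

for series in resp.json():
    points = [value for value, _ in series["datapoints"] if value is not None]
    if points:
        print(series["target"], points[-1])  # most recent non-null value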
JProfile jpexport csv output missing headings
I'm evaluating JProfiler to see if it can be used in an automated fashion. I'd like to run the tool in offline mode, save a snapshot, export the data, and parse it to see if there is performance degradation since the last run. I was able to use the jpexport command to export the TelemetryHeap view from a snapshot file to a CSV file (sample below). When I look at the CSV file I see 10 to 13 columns of data but only 4 headings. Is there documentation that explains the output more fully?
Sample output:

"Time [s]","Committed size","Free size","Used size"
0.0,450,880,000,371,600,000,79,280,000
1.0,450,880,000,371,600,000,79,280,000
2.0,450,880,000,371,600,000,79,280,000
3.0,450,880,000,371,600,000,79,280,000
4.0,450,880,000,371,600,000,79,280,000
5.0,450,880,000,371,600,000,79,280,000
6.0,450,880,000,371,600,000,79,280,000
7.0,450,880,000,355,932,992,94,947,000
8.0,450,880,000,355,932,992,94,947,000
9.58,969,216,000,634,564,992,334,651,008
11.05,1,419,456,000,743,606,016,675,849,984
12.05,1,609,792,000,377,251,008,1,232,541,056
17.33,2,524,032,000,1,115,268,992,1,408,763,008
19.43,2,588,224,000,953,451,008,1,634,772,992
26.08,3,711,936,000,1,547,981,056,2,163,954,944
39.75,3,711,936,000,1,145,185,024,2,566,750,976
40.75,3,711,936,000,1,137,052,032,2,574,884,096
41.75,3,711,936,000,1,137,052,032,2,574,884,096
42.75,3,711,936,000,1,137,052,032,2,574,884,096
43.75,3,711,936,000,1,137,051,008,2,574,885,120
44.75,3,711,936,000,1,137,051,008,2,574,885,120
45.75,3,711,936,000,1,137,051,008,2,574,885,120
46.75,3,711,936,000,1,137,051,008,2,574,885,120
47.75,3,711,936,000,1,137,051,008,2,574,885,120
48.75,3,711,936,000,1,137,051,008,2,574,885,120
49.75,3,711,936,000,1,137,051,008,2,574,885,120
50.75,3,711,936,000,1,137,051,008,2,574,885,120
51.75,3,711,936,000,1,137,051,008,2,574,885,120
52.75,3,711,936,000,1,137,051,008,2,574,885,120
53.75,3,711,936,000,1,137,051,008,2,574,885,120
54.75,3,711,936,000,1,137,051,008,2,574,885,120
55.75,3,711,936,000,1,137,051,008,2,574,885,120
56.75,3,711,936,000,1,137,051,008,2,574,885,120
57.75,3,711,936,000,1,137,051,008,2,574,885,120
58.75,3,711,936,000,1,137,051,008,2,574,885,120
60.96,3,711,936,000,1,137,051,008,2,574,885,120
68.73,3,711,936,000,1,137,051,008,2,574,885,120
74.39,3,711,936,000,1,137,051,008,2,574,885,120