Deploy SQL workflow with DBX

I am developing deployment via DBX to Azure Databricks. I need a data job written in SQL to run every day. The job is located in the file data.sql. I know how to do this with a Python file; there I would do the following:
build:
  python: "pip"
environments:
  default:
    workflows:
      - name: "workflow-name"
        schedule:
          quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
          timezone_id: "Europe"
        format: MULTI_TASK
        job_clusters:
          - job_cluster_key: "basic-job-cluster"
            <<: *base-job-cluster
        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            spark_python_task:
              python_file: "file://filename.py"
But how can I change it so that I run a SQL job instead? I imagine it is the last two lines (spark_python_task: and python_file: "file://filename.py") that need to be changed.

There are various ways to do that.
(1) One of the simplest is to add a SQL query in the Databricks SQL workspace and then reference this query via sql_task as described here.
(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python package and then read it from your package, e.g.:
from importlib.resources import read_text
sql_statement = read_text("my_package", "data.sql")  # placeholders: your package name and the SQL file inside it
spark.sql(sql_statement)
(3) A third option is to use the DBT framework with Databricks. In this case you probably would like to use dbt_task as described here.

I found a simple workaround (although it might not be the prettiest): simply change data.sql into a Python file and run the queries using Spark. This way I could keep using the same spark_python_task.
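For illustration, here is a minimal sketch of what such a converted file could look like (the table names below are placeholders, not the contents of the original data.sql):

from pyspark.sql import SparkSession

# data.sql rewritten as a Python file so it can still run via spark_python_task
spark = SparkSession.builder.getOrCreate()

statements = [
    # placeholder statements standing in for whatever data.sql actually contains
    "CREATE TABLE IF NOT EXISTS my_schema.daily_table AS SELECT * FROM my_schema.source_table",
    "OPTIMIZE my_schema.daily_table",
]

for statement in statements:
    spark.sql(statement)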

Related

Using schema update option in beam.io.WriteToBigQuery

I am loading a bunch of log files into BigQuery using Apache Beam Dataflow. The file format can change over time as new columns are added to the files. I see the schema update option ALLOW_FIELD_ADDITION.
Anyone know how to use it? This is how my WriteToBQ step looks:
| 'write to bigquery' >> beam.io.WriteToBigQuery('project:datasetId.tableId', write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
I haven't actually tried this yet but digging into the documentation, it seems you are able to pass whatever configuration you like to the BigQuery Load Job using additional_bq_parameters. In this case it might look something like:
| 'write to bigquery' >> beam.io.WriteToBigQuery(
    'project:datasetId.tableId',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    additional_bq_parameters={
        'schemaUpdateOptions': [
            'ALLOW_FIELD_ADDITION',
            'ALLOW_FIELD_RELAXATION',
        ]
    }
)
Weirdly, this is actually in the Java SDK but doesn't seem to have made its way to the Python SDK.

Is it possible to query data from Whisper (Graphite DB) from console?

I have configured Graphite to monitor my application metrics, and I configured Zabbix to monitor my servers' CPU and other metrics.
Now I want to pass some critical Graphite metrics to Zabbix to add triggers for them.
So I want to do something like
$ whisper get prefix1.prefix2.metricName
> 155
Is it possible?
P.S. I know about the Graphite-API project; I don't want to install an extra app.
You can use the whisper-fetch program, which is provided in the whisper installation package.
Use it like this:
whisper-fetch /path/to/dot.wsp
Or to get e.g. data from the last 5 minutes:
whisper-fetch --from=$(date +%s -d "-5 min") /path/to/dot.wsp
Defaults will result in output like this:
1482318960 21.187000
1482319020 None
1482319080 21.187000
1482319140 None
1482319200 21.187000
You can change the output to JSON using the --json option.
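If the goal is to hand a single value to Zabbix, here is a small sketch (assuming whisper-fetch is on the PATH and re-using the example .wsp path from above) that parses the default "timestamp value" output and keeps the most recent non-None datapoint:

import subprocess
import time

# fetch the last 5 minutes and parse whisper-fetch's "timestamp value" lines
output = subprocess.run(
    ["whisper-fetch", f"--from={int(time.time()) - 300}", "/path/to/dot.wsp"],
    capture_output=True, text=True, check=True,
).stdout

latest = None
for line in output.splitlines():
    timestamp, value = line.split()
    if value != "None":
        latest = float(value)  # lines are oldest first, so this ends up holding the newest real value

print(latest)  # e.g. 21.187, ready to push to Zabbix (zabbix_sender or a user parameter)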
OK, I found it myself: http://graphite.readthedocs.io/en/latest/render_api.html?highlight=rawJson (I can use curl and get CSV or JSON back).
The answer was found here: custom querying in graphite
Also see: https://github.com/graphite-project/graphite-web/blob/master/docs/render_api.rst
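For example, a small Python sketch of the same idea over HTTP (the host is a placeholder; target, from, and format are standard render API parameters):

import requests

# ask the render API for the last 5 minutes of one metric as JSON
response = requests.get(
    "http://graphite.example.com/render",  # placeholder host
    params={"target": "prefix1.prefix2.metricName", "from": "-5min", "format": "json"},
)
response.raise_for_status()

# the response is a list with one series per target; datapoints are [value, timestamp] pairs
series = response.json()[0]
latest = next((v for v, ts in reversed(series["datapoints"]) if v is not None), None)
print(latest)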

How to generate executable TPC-DS queries?

I have downloaded the DSGEN tool from the TPC-DS web site and already generated the tables and loaded the data into Oracle XE.
I am using the following command to generate the SQL statements:
dsqgen -input ..\query_templates\templates.lst -directory ..\query_templates -dialect oracle -scale 1
However, no matter how I adjust the command, I always get this error message:
ERROR: A query template list must be supplied using the INPUT option
Can anybody help?
Apparently you need to use / rather than - for the flags for the Windows executable:
dsqgen /input ..\query_templates\templates.lst /directory ..\query_templates
/dialect oracle /scale 1

Unable to resolve ERROR 2017: Internal error creating job configuration on EMR when running PIG

I have been trying to run a very simple task with Pig on Amazon EMR. When I run the commands in the interactive shell, everything works fine. But when I run the same thing as a batch job, I get
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal
error creating job configuration.
and the script fails.
Here's my 7-line script. It just computes averages over tuples of Google bigrams; mc is match count and vc is volume count.
bigrams = LOAD 's3n://<<bucket-name>>/gb-bigrams/*' AS (bigram:chararray, year:int, mc:int, vc:int);
grouped_bigrams = group bigrams by bigram;
answer1 = foreach grouped_bigrams generate group, ((DOUBLE) SUM(bigrams.mc))/COUNT(bigrams) AS avg_mc;
sort_answer1 = ORDER answer1 BY avg_mc desc;
answer2 = LIMIT sort_answer1 5;
STORE answer1 INTO 's3n://<bucket-name>/output/bigram/20130409/answer1';
STORE answer2 INTO 's3n://<bucket-name>/output/bigram/20130409/answer2';
I was guessing the error has something to do with STORE and the S3 path. So I have tried various combinations like using $OUTPUT, backslashes, etc., but I keep getting the same error.
Any help would be highly appreciated.
Have you tried using the S3 Block File System instead of the native file system?
e.g.
s3://<<bucket-name>>/gb-bigrams/*
s3://<bucket-name>/output/bigram/20130409/answer1

Generating TPC-DS database for sql server

How do I populate the Transaction Processing Performance Council's TPC-DS database for SQL Server? I have downloaded the TPC-DS tool but there are few tutorials about how to use it.
In case you are using Windows, you need Visual Studio 2005 or later. Unzip dsgen; in the tools folder there is a dsgen2.sln file. Open it with Visual Studio and build the project; it will generate the tables for you. I've tried that and loaded the tables into SQL Server manually.
I've just succeeded in generating these queries.
Here are some tips; they may not be the best, but they worked for me:
cp ${...}/query_templates/* ${...}/tools/
add define _END = ""; to each query.tpl (a small script for this is sketched below)
${...}/tools/dsqgen -INPUT templates.lst -OUTPUT_DIR /home/query99/
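For the second tip, here is a throwaway sketch that appends the define to every template (assuming you run it inside the directory holding the .tpl files):

from pathlib import Path

# append the _END define that the stock query templates expect
for tpl in Path(".").glob("query*.tpl"):
    text = tpl.read_text()
    if 'define _END = "";' not in text:
        tpl.write_text(text + '\ndefine _END = "";\n')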
Let's describe the base steps:
Before going to the next steps, double-check that the required TPC-DS kit has not already been prepared for your DB
Download TPC-DS Tools
Build Tools as described in 'v2.11.0rc2\tools\How_To_Guide-DS-V2.0.0.docx' (I used VS2015)
Create DB
Take the DB schema described in tpcds.sql and tpcds_ri.sql (they are located in the 'v2.11.0rc2\tools\' folder) and adapt it to your DB if required.
Generate the data to be stored in the database
# Windows
dsdgen.exe /scale 1 /dir .\tmp /suffix _001.dat
# Linux
dsdgen -scale 1 -dir /tmp -suffix _001.dat
Upload data to DB
# example for ClickHouse
database_name=tpcds
ch_password=12345
for file_fullpath in /tmp/tpc-ds/*.dat; do
    filename=$(echo ${file_fullpath##*/})
    tablename=$(echo ${filename%_*})
    echo " - $(date +"%T"): start processing $file_fullpath (table: $tablename)"
    query="INSERT INTO $database_name.$tablename FORMAT CSV"
    cat $file_fullpath | clickhouse-client --format_csv_delimiter="|" --query="$query" --password $ch_password
done
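Since the question is about SQL Server, a comparable sketch (assuming pyodbc, the hypothetical credentials shown, .dat files in C:\tmp readable by the SQL Server instance, and table names matching the file prefixes as in the shell script above) might be:

import glob
import os
import pyodbc

# hypothetical connection details; any SQL Server client library works the same way
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=tpcds;UID=sa;PWD=YourPassword"
)
conn.autocommit = True
cursor = conn.cursor()

for path in glob.glob(r"C:\tmp\*.dat"):
    # mirror the shell script above: strip the trailing suffix to get the table name
    table = os.path.basename(path).rsplit("_", 1)[0]
    print(f"loading {path} into {table}")
    # BULK INSERT runs server-side, so the path must be readable by SQL Server itself;
    # ROWTERMINATOR '|\n' copes with the trailing '|' that dsdgen writes on every row
    cursor.execute(
        f"BULK INSERT dbo.{table} FROM '{path}' "
        "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '|\\n', TABLOCK)"
    )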
Generate queries
# Windows
set tmpl_lst_path="..\query_templates\templates.lst"
set tmpl_dir="..\query_templates"
set dialect_path="..\..\clickhouse-dialect"
set result_dir="..\queries"
set tmpl_name="query1.tpl"
dsqgen /input %tmpl_lst_path% /directory %tmpl_dir% /dialect %dialect_path% /output_dir %result_dir% /scale 1 /verbose y /template %tmpl_name%
# Linux
# see for example https://github.com/pingcap/tidb-bench/blob/master/tpcds/genquery.sh
To fix the error 'Substitution .. is used before being initialized' follow this fix.