I have a PySpark SQL script that needs to run daily, and I would like to pass the required parameters to the script and use them in the SQL queries inside it.
For example, below are the parameters I would like to pass to the script and use inside it:
my_st_dt='2019-02-04'
my_end_dt='2019-02-10'
mth_yyyymm='201902'
my_partition_dt='20190204'
my_table_name='table_1'
my_path='hdfs:///abcd/efgh/ijkl/mnop'
my_query1='''
SELECT * FROM parquet.`{my_file_path}/{my_table}/data`
WHERE {my_partition_name} = {partition} AND (my_date >= '{partition_st_dt}' AND
my_date <= '{partition_end_dt}')
'''.format(my_file_path=my_path, my_table=my_table_name, my_partition_name='my_yyyymm',
           partition=mth_yyyymm, partition_st_dt=my_st_dt, partition_end_dt=my_end_dt)
I have several queries like the one above in my script. Can someone please show me an efficient way of writing this code so that I don't have to edit the script every time I run it? If there is any option other than Python's ".format", please let me know as well. Thanks a lot in advance.
There are two ways we can do that:
Method 1: Use argparse and pass all the required variables while executing the script.
import argparse

parser = argparse.ArgumentParser(
    description="Usage: spark2-submit ens_load_pb_sdps.py <sdp_name>")
parser.add_argument(
    "-start_date", "--start_date",
    type=str,
    help='start date for the query (YYYY-MM-DD)')
parser.add_argument(
    "-end_date", "--end_date",
    type=str,
    help='end date for the query (YYYY-MM-DD)')
parser.add_argument(
    "-table_name", "--table_name",
    type=str,
    help='name of the table to query')
parser.add_argument(
    "-file_path", "--file_path",
    type=str,
    help='base path of the parquet data')
parser.add_argument(
    "-partition_name", "--partition_name",
    type=str,
    help='name of the partition column')
parser.add_argument(
    "-partition_value", "--partition_value",
    type=str,
    help='value of the partition to read')
args = parser.parse_args()
start_date, end_date, table_name, file_path, partition_name, partition_value = (
    args.start_date, args.end_date, args.table_name, args.file_path,
    args.partition_name, args.partition_value)
query_stmt = """
SELECT * FROM parquet.`{file_path}/{table_name}/data`
WHERE {partition_name} = {partition_value} AND (my_date >= '{start_date}' AND
my_date <= '{end_date}')
""".format(file_path=file_path, table_name=table_name, partition_name=partition_name,
           partition_value=partition_value, start_date=start_date, end_date=end_date)
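For completeness, here is a rough sketch of how the script might then be invoked and how the formatted query could be used; the script name and SparkSession setup are assumptions, not part of the original script:
from pyspark.sql import SparkSession

# Assumed invocation (script name is illustrative):
#   spark2-submit my_daily_script.py --start_date 2019-02-04 --end_date 2019-02-10 \
#       --table_name table_1 --file_path hdfs:///abcd/efgh/ijkl/mnop \
#       --partition_name my_yyyymm --partition_value 201902
spark = SparkSession.builder.appName("daily_load").getOrCreate()
df = spark.sql(query_stmt)  # query_stmt is built above from the parsed arguments
df.show()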
Method 2: Use a conf file
Instead of passing the values while executing the script, we can keep them in a JSON conf file, load that file into a Python dict, and use it when forming the query.
Inside Json File - conf.json
{
"start_date": "2019-01-01",
"end_date": "2020-01-01",
"partition_name": "name",
"partition_value": "John Doe",
"file_path": "C://files/dummyPathh/",
"table_name": "dummyTable"
}
Python code to load the conf file -
import glob
import json

conf_path = "C://path/to/config/conf.json"  # path to the conf file
json_file = glob.glob(conf_path)[0]         # glob returns a list; take the first match
with open(json_file, 'r', encoding="utf8") as file:
    conf_object = json.load(file)
start_date, end_date, table_name, file_path, partition_name, partition_value = (
    conf_object["start_date"], conf_object["end_date"], conf_object["table_name"],
    conf_object["file_path"], conf_object["partition_name"], conf_object["partition_value"])
query_stmt = """
SELECT * FROM parquet.`{file_path}/{table_name}/data`
WHERE {partition_name} = {partition_value} AND (my_date >= '{start_date}' AND
my_date <= '{end_date}')
""".format(file_path=file_path, table_name=table_name, partition_name=partition_name,
           partition_value=partition_value, start_date=start_date, end_date=end_date)
You can change the values in the conf file to run different queries.
To take it one step further, you can make the conf file a nested JSON dict keyed by job or query name, so that you don't have to edit the conf file each time you run a query; see the sketch below.
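A minimal sketch of that idea; the job names, structure, and values below are illustrative assumptions:
import json

# Hypothetical nested conf: one entry per job/query, selected at runtime
nested_conf = json.loads("""
{
  "daily_load": {
    "start_date": "2019-02-04", "end_date": "2019-02-10",
    "partition_name": "my_yyyymm", "partition_value": "201902",
    "file_path": "hdfs:///abcd/efgh/ijkl/mnop", "table_name": "table_1"
  },
  "monthly_load": {
    "start_date": "2019-02-01", "end_date": "2019-02-28",
    "partition_name": "my_yyyymm", "partition_value": "201902",
    "file_path": "hdfs:///abcd/efgh/ijkl/mnop", "table_name": "table_2"
  }
}
""")

job_conf = nested_conf["daily_load"]  # pick the job to run (e.g. from an argparse argument)
query_stmt = """
SELECT * FROM parquet.`{file_path}/{table_name}/data`
WHERE {partition_name} = {partition_value} AND (my_date >= '{start_date}' AND
my_date <= '{end_date}')
""".format(**job_conf)  # unpack the selected job's values into the query template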
I'm running a DAG that runs multiple stored procedures in BigQuery in each DAG run. Currently, my code is the following:
sp_names = [
'sp_airflow_test_1',
'sp_airflow_test_2'
]
# Define DAG
with DAG(
        dag_id,
        default_args=default_args) as dag:

    i = 0
    task_array = []
    for sp in sp_names:
        i = i + 1
        task_array.append(
            BigQueryOperator(
                task_id='run_{}'.format(sp),
                sql="""CALL `[project].[dataset].{}`();""".format(sp),
                use_legacy_sql=False
            )
        )
        # chain each task after the previous one
        if i > 1:
            task_array[i - 2] >> task_array[i - 1]
I'd like my list sp_names to be the result of a query against a one-column table stored in my BQ dataset, instead of being hardcoded like it is right now.
How can I do this?
Thanks in advance.
To execute multiple BigQuery stored procedures with a similar SQL structure, create the BigQueryOperator tasks dynamically with a create_dynamic_task function.
# function to create a task
def create_dynamic_task(sp):
task = BigQueryOperator(
task_id='run_{}'.format(sp),
sql="""CALL `[project].[dataset].{}`();""".format(sp),
use_legacy_sql=False
)
return task
# dynamically create task
task_list = []
for sp in sp_names:
task_list.append(create_dynamic_task(sp))
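To make sp_names come from the one-column table mentioned in the question, one option is to query it when the DAG file is parsed. This is only a sketch: the project, dataset, table, and column names are placeholders, and a query issued at parse time runs on every scheduler parse, so cache the result or accept that cost.
from google.cloud import bigquery

def get_sp_names():
    # Read stored-procedure names from a one-column table (names are placeholders)
    client = bigquery.Client()
    rows = client.query("SELECT sp_name FROM `my_project.my_dataset.sp_names`").result()
    return [row.sp_name for row in rows]

sp_names = get_sp_names()
task_list = [create_dynamic_task(sp) for sp in sp_names]

# chain the tasks sequentially, as in the original DAG
for upstream, downstream in zip(task_list, task_list[1:]):
    upstream >> downstream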
We need to monitor table sizes in different environments.
I want to use the Google metadata API to get this information for a given project/environment.
I need to create a view which will provide:
1. What datasets exist
2. Which tables are in each dataset
3. Each table's size
4. Each dataset's size
BigQuery already has such views built in: INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, tables, and views.
For example, the query below returns metadata for all datasets in the default project:
SELECT * FROM INFORMATION_SCHEMA.SCHEMATA
or, for a specific project (my_project):
SELECT * FROM my_project.INFORMATION_SCHEMA.SCHEMATA
There are other such views for tables as well.
In addition, there are meta tables that can be used to get more info about the tables in a given dataset: __TABLES__SUMMARY and __TABLES__
SELECT * FROM `project.dataset.__TABLES__`
For example:
SELECT table_id,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date,
row_count,
size_bytes,
CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type,
TIMESTAMP_MILLIS(creation_time) AS creation_time,
TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
dataset_id,
project_id
FROM `project.dataset.__TABLES__`
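For the dataset-level size asked about above, the same meta table can be aggregated. A minimal sketch in Python (project and dataset names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
# Sum the per-table sizes into a dataset total; `project.dataset` is a placeholder
query = """
    SELECT dataset_id,
           COUNT(1) AS table_count,
           SUM(size_bytes) AS dataset_size_bytes
    FROM `project.dataset.__TABLES__`
    GROUP BY dataset_id
"""
for row in client.query(query).result():
    print(row.dataset_id, row.table_count, row.dataset_size_bytes)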
In order to automate the query to check every dataset in the project, instead of adding them manually with UNION ALL, you can follow the advice given by @ZinkyZinky and create a query that generates the UNION ALL calls for every dataset's __TABLES__. I have not managed to make this solution fully automatic within BigQuery, because I don't see a way to execute a command that was generated as a string (which is what string_agg creates). However, I have managed to develop the solution in Python, adding the generated string to the next query. You can find the code below. It also creates a new table and stores the results there:
from google.cloud import bigquery
client = bigquery.Client()
project_id = "wave27-sellbytel-bobeda"
# Construct a full Dataset object to send to the API.
dataset_id = "project_info"
dataset = bigquery.Dataset(".".join([project_id, dataset_id]))
dataset.location = "US"
# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.
dataset = client.create_dataset(dataset) # API request
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
schema = [
bigquery.SchemaField("dataset_id", "STRING", mode="REQUIRED"),
bigquery.SchemaField("table_id", "STRING", mode="REQUIRED"),
bigquery.SchemaField("size_bytes", "INTEGER", mode="REQUIRED"),
]
table_id = "table_info"
table = bigquery.Table(".".join([project_id, dataset_id, table_id]), schema=schema)
table = client.create_table(table) # API request
print(
"Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table(table_id)
job_config.destination = table_ref
# QUERIES
# 1. Creating the UNION ALL list with the table information of each dataset
query = (
r"SELECT string_agg(concat('SELECT * from `', schema_name, '.__TABLES__` '), 'union all \n') "
r"from INFORMATION_SCHEMA.SCHEMATA"
)
query_job = client.query(query, location="US") # API request - starts the query
select_tables_from_all_datasets = ""
for row in query_job:
select_tables_from_all_datasets += row[0]
# 2. Using the before mentioned list to create a table.
query = (
"WITH ALL__TABLES__ AS ({})"
"SELECT dataset_id, table_id, size_bytes FROM ALL__TABLES__;".format(select_tables_from_all_datasets)
)
query_job = client.query(query, location="US", job_config=job_config) # job_config configures in which table the results will be stored.
for row in query_job:
print(row)
print('Query results loaded to table {}'.format(table_ref.path))
I'm trying to run a BigQueryOperator with a dynamic parameter based on a previous task, using XCom (I managed to push the value using a BashOperator with xcom_push=True).
I thought using the following would do the trick:
def get_next_run_date(**context):
last_date = context['task_instance'].xcom_pull(task_ids=['get_autoplay_last_run_date'])[0].rstrip()
last_date = datetime.strptime(last_date, "%Y%m%d").date()
return last_date + timedelta(days=1)
t3 = BigQueryOperator(
task_id='autoplay_calc',
bql='autoplay_calc.sql',
params={
"env" : deployment
,"region" : region
,"partition_start_date" : get_next_run_date()
},
bigquery_conn_id='gcp_conn',
use_legacy_sql=False,
write_disposition='WRITE_APPEND',
allow_large_results=True,
#provide_context=True,
destination_dataset_table=reporting_project + '.pa_reporting_public_batch.autoplay_calc',
dag=dag
)
but using the above gives me a Broken DAG error referring to 'task_instance'.
Have you tried using context['ti'].xcom_pull()?
You are using it in the wrong way.
You cannot use XCom in params. You need to use it in the templated bql/sql parameter. Your SQL file, autoplay_calc.sql, can contain something like:
select * from XYZ where date = "{{ task_instance.xcom_pull(task_ids=['get_autoplay_last_run_date'])[0].rstrip() }}"
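For illustration, here is a minimal sketch of the same idea with the templated SQL written inline; the source table and date column are placeholders, and dag and reporting_project are assumed to be defined as in the question's DAG file:
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

t3 = BigQueryOperator(
    task_id='autoplay_calc',
    # The Jinja template is rendered at run time, when the XCom value exists
    bql="""
        SELECT *
        FROM `project.dataset.autoplay_source`
        WHERE my_date >= "{{ task_instance.xcom_pull(task_ids='get_autoplay_last_run_date').rstrip() }}"
    """,
    bigquery_conn_id='gcp_conn',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    destination_dataset_table=reporting_project + '.pa_reporting_public_batch.autoplay_calc',
    dag=dag,
)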
I have a simple DAG
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator
with DAG(dag_id='my_dags.my_dag') as dag:
start = DummyOperator(task_id='start')
end = DummyOperator(task_id='end')
sql = """
SELECT *
FROM `another_dataset.another_table`
"""
bq_query = BigQueryOperator(bql=sql,
destination_dataset_table='my_dataset.my_table20180524',
task_id='bq_query',
bigquery_conn_id='my_bq_connection',
use_legacy_sql=False,
write_disposition='WRITE_TRUNCATE',
create_disposition='CREATE_IF_NEEDED',
query_params={})
start >> bq_query >> end
When executing the bq_query task, the SQL query result gets saved in a sharded table. I want it to get saved in a daily partitioned table. In order to do so, I only changed destination_dataset_table to my_dataset.my_table$20180524. I got the error below when executing the bq_query task:
Partitioning specification must be provided in order to create partitioned table
How can I tell BigQuery to save the query result to a daily partitioned table? My first guess was to use query_params in BigQueryOperator,
but I didn't find any example of how to use that parameter.
EDIT:
I'm using google-cloud==0.27.0 python client ... and it's the one used in Prod :(
You first need to create an empty partitioned destination table; follow the BigQuery documentation on creating an empty partitioned table,
and then run the Airflow pipeline below again.
You can try code:
import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator
today_date = datetime.datetime.now().strftime("%Y%m%d")
table_name = 'my_dataset.my_table' + '$' + today_date
with DAG(dag_id='my_dags.my_dag') as dag:
start = DummyOperator(task_id='start')
end = DummyOperator(task_id='end')
sql = """
SELECT *
FROM `another_dataset.another_table`
"""
bq_query = BigQueryOperator(bql=sql,
                            destination_dataset_table='{{ params.t_name }}',  # templated field
                            task_id='bq_query',
                            bigquery_conn_id='my_bq_connection',
                            use_legacy_sql=False,
                            write_disposition='WRITE_TRUNCATE',
                            create_disposition='CREATE_IF_NEEDED',
                            params={'t_name': table_name},  # rendered into the template above
                            dag=dag
                            )
start >> bq_query >> end
So what I did is that I created a dynamic table name variable and passed it to the BQ operator.
The main issue here is that I don't have access to the new version of the google-cloud Python API; prod is using version 0.27.0.
So, to get the job done, I did something bad and dirty:
saved the query result in a sharded table, let it be table_sharded
got table_sharded's schema, let it be table_schema
saved " SELECT * FROM dataset.table_sharded" query to a partitioned table providing table_schema
All of this is abstracted into one single operator that uses a hook. The hook is responsible for creating/deleting tables/partitions, getting table schemas and running queries on BigQuery.
Have a look at the code. If there is any other solution, please let me know.
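For readers who do have a current google-cloud-bigquery client (not the 0.27.0 version the author was limited to), the three steps above correspond roughly to the following sketch; the table names are placeholders:
from google.cloud import bigquery

client = bigquery.Client()
sharded = "project.dataset.table_sharded"          # placeholder
partitioned = "project.dataset.table_partitioned"  # placeholder

# 1) + 2) reuse the sharded table's schema for a new partitioned table
table_schema = client.get_table(sharded).schema
dest = bigquery.Table(partitioned, schema=table_schema)
dest.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
client.create_table(dest, exists_ok=True)

# 3) copy the rows across with a query job writing into the partitioned table
job_config = bigquery.QueryJobConfig(
    destination=partitioned,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query("SELECT * FROM `{}`".format(sharded), job_config=job_config).result()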
Using BigQueryOperator, you can pass the time_partitioning parameter, which will create an ingestion-time partitioned table:
bq_cmd = BigQueryOperator(
    task_id="task_id",
    sql=[query],
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    time_partitioning={'type': 'DAY'},  # keys follow the BigQuery timePartitioning spec
    allow_large_results=True,
    trigger_rule='all_success',
    query_params=query_params,
    dag=dag
)
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator
DEFAULT_DAG_ARGS = {
'owner': 'airflow',
'depends_on_past': False,
'retries': 2,
'retry_delay': timedelta(minutes=10),
'project_id': Variable.get('gcp_project'),
'zone': Variable.get('gce_zone'),
'region': Variable.get('gce_region'),
'location': Variable.get('gce_zone'),
}
with DAG(
'test',
start_date=datetime(2019, 1, 1),
schedule_interval=None,
catchup=False,
default_args=DEFAULT_DAG_ARGS) as dag:
    bq_query = BigQueryOperator(
        task_id='create-partition',
        bql="""SELECT
                 *
               FROM
                 `dataset.table_name`""",  # table from which you want to pull data
        # auto-partitioned table in BQ: append today's partition decorator to the table name
        destination_dataset_table='project.dataset.table_name' + '$' + datetime.now().strftime('%Y%m%d'),
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        use_legacy_sql=False,
    )
I recommend using Airflow Variables to define all of these fields and referencing them in the DAG.
With the code above, a partition for today's date will be added to the BigQuery table.
I am writing a Perl script which copies data from a table in one DB to the same table in another DB. I am using DBI to obtain connections to the DBs.
I noticed that when I copy the data, the dates are not copied properly.
In the source table the date looks like this: '04/22/1996 13:51:15 PM'.
In the destination table it appears like this: '22 APR 1996'.
Can anyone help me copy the exact date?
Thanks in advance
#!/usr/bin/env perl
use strict;
use warnings;
use DateTime::Format::Strptime;

my $datetimestr = '04/22/1996 13:51:15 PM';

# 1) parse the datetime string
my $parser = DateTime::Format::Strptime->new(
    pattern  => '%m/%d/%Y %H:%M:%S %p',
    on_error => 'croak',
);
my $dt = $parser->parse_datetime($datetimestr);

# 2) format the datetime for your database's format
warn $dt->strftime('%Y-%m-%d %H:%M:%S');