Google Composer - Creation of partitioned External tables - google-bigquery

I am trying to create an External table on top of a Google Cloud Storage bucket through a Composer DAG. My upstream provides Parquet files partitioned by country, so I would like to create an external table with Source Data Partitioning enabled.
Sample GCS Path - gs://my-gs-bucket/folder1/subfolder/country=US/ (has multiple parquet files)
When creating the external table through the console, the following options are provided:
Source:
Create table from -> Google Cloud Storage
Select file from GCS bucket or use a URI pattern -> gs://my-gs-bucket/folder1/subfolder/country*
File Format -> Parquet
Source Data Partitioning - Enabled
Select Source URI Prefix - gs://my-gs-bucket/folder1/subfolder
I would like to do this through a Composer DAG. I was able to create an external table on the GCS files using BigQueryCreateExternalTableOperator, but without Source Data Partitioning. Any idea how to enable Source Data Partitioning and set the Source URI Prefix through the DAG?

There is no parameter to create partitioned External tables in BigQueryCreateExternalTableOperator.
For your requirement to create a partitioned table, you can consider the sample DAG below, which uses BigQueryCreateEmptyDatasetOperator and BigQueryCreateEmptyTableOperator to create a dataset and a partitioned table with the time_partitioning parameter in BigQuery.
import os

from airflow import models
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
    BigQueryCreateEmptyTableOperator,
)
from airflow.utils.dates import days_ago

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
DATASET_NAME = os.environ.get("GCP_BIGQUERY_DATASET_NAME", "testdataset")
TABLE_NAME = "partitioned_table"
SCHEMA = [
    {"name": "value", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "ds", "type": "DATE", "mode": "NULLABLE"},
]

dag_id = "example_bigquery"

with models.DAG(
    dag_id,
    schedule_interval="@hourly",  # Override to match your needs
    start_date=days_ago(1),
    tags=["example"],
    user_defined_macros={"DATASET": DATASET_NAME, "TABLE": TABLE_NAME},
    default_args={"project_id": PROJECT_ID},
) as dag_with_locations:
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create-dataset", dataset_id=DATASET_NAME, project_id=PROJECT_ID
    )
    create_table = BigQueryCreateEmptyTableOperator(
        task_id="create_table",
        dataset_id=DATASET_NAME,
        table_id=TABLE_NAME,
        schema_fields=SCHEMA,
        # Partition the table by day on the "ds" column.
        time_partitioning={
            "type": "DAY",
            "field": "ds",
        },
    )

    create_dataset >> create_table
You can also use a schema stored in GCS. Example (with the schema JSON in GCS):
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    gcs_schema_object='gs://schema-bucket/employee_schema.json',
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
You can refer to this Stack Overflow link for creating an external table over Parquet files using BigQueryCreateExternalTableOperator.
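For reference, here is a minimal sketch of creating a (non-partitioned) external table over the Parquet files with BigQueryCreateExternalTableOperator. It assumes the bucket and path from the question, placeholder dataset/table names, and the operator's legacy bucket/source_objects parameters (newer provider versions also accept a table_resource dict instead):

from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)

# Bucket and object pattern are taken from the question; the dataset and
# table names below are placeholders to adjust for your project.
create_external_table = BigQueryCreateExternalTableOperator(
    task_id="create_external_table",
    bucket="my-gs-bucket",
    source_objects=["folder1/subfolder/country=*/*.parquet"],
    destination_project_dataset_table="your-project-id.testdataset.external_table",
    source_format="PARQUET",
)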

Related

BigQuery scheduled queries - Create table and add date suffix to its name in a different project

According to the documentation (https://cloud.google.com/bigquery/docs/scheduling-queries#destination_table), the same project must be used when defining a destination for a scheduled query.
However, I'd like to schedule a query with the capability to write tables during the query steps to other projects (e.g., CREATE TABLE xxx.dataset.name_{run_date}) and to preserve the {run_date} as a suffix. Is it possible to do that in BQ?
This is a limitation of the BigQuery UI. A possible workaround is to use the BigQuery Data Transfer Service Python client library, as shown in the code below:
from google.cloud import bigquery_datatransfer
from datetime import date

today = date.today()  # used to replicate the @run_date parameter
str_today = str(today).replace("-", "")

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# The project where the query job runs is the same as the project
# containing the destination dataset.
project_id = "your-project-id"
dataset_id = "your-source-dataset-id"

# This service account will be used to execute the scheduled queries. Omit
# this request parameter to run the query as the user with the credentials
# associated with this client.
service_account_name = "your-service-account"

# Use standard SQL syntax for the query.
query_string = f"CREATE TABLE `destination-project.destination-dataset.new_table_{str_today}` AS (SELECT column FROM `source-project.source-dataset.source-table`);"

parent = "projects/your-project-id/locations/us-central1"  # change location accordingly

transfer_config = bigquery_datatransfer.TransferConfig(
    name="projects/your-project-id/locations/us-central1/transferConfigs",  # change location accordingly
    display_name="Test Schedule",
    data_source_id="scheduled_query",
    params={
        "query": query_string,
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    bigquery_datatransfer.CreateTransferConfigRequest(
        parent=parent,
        transfer_config=transfer_config,
        service_account_name=service_account_name,
    )
)

print("Created scheduled query '{}'".format(transfer_config.name))
My sample output table:

dask read parquet and specify schema

Is there a dask equivalent of spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow?
I have a bunch of parquet files in a bucket but some of the fields have slightly inconsistent names. I could create a custom delayed function to handle these cases after reading them, but I'm hoping I could specify the schema when opening them via globbing. Maybe not, though, as I guess opening them via globbing is going to try and concatenate them. This currently fails because of the inconsistent field names.
Create a parquet file:
import dask.dataframe as dd

df = dd.demo.make_timeseries(
    start="2000-01-01",
    end="2000-01-03",
    dtypes={"id": int, "z": int},
    freq="1h",
    partition_freq="24h",
)
df.to_parquet("df.parquet", engine="pyarrow", overwrite=True)
Read it in via dask and specify the schema after reading:
df = dd.read_parquet("df.parquet", engine="pyarrow")
df["z"] = df["z"].astype("float")
df = df.rename(columns={"z": "a"})
Read it in via spark and specify the schema:
from pyspark.sql import SparkSession
import pyspark.sql.types as T

spark = SparkSession.builder.appName('App').getOrCreate()

schema = T.StructType(
    [
        T.StructField("id", T.IntegerType()),
        T.StructField("a", T.FloatType()),
        T.StructField("timestamp", T.TimestampType()),
    ]
)

df = spark.read.format("parquet").schema(schema).load("df.parquet")
Some of the options are:
Specify dtypes after loading (requires consistent column names):
custom_dtypes = {"a": float, "id": int, "timestamp": "datetime64[ns]"}
df = dd.read_parquet("df.parquet", engine="pyarrow").astype(custom_dtypes)
This currently fails because of the inconsistent field names.
If the column names are not the same across files, you might want to use a custom delayed function before loading:
import glob

import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def custom_load(path):
    df = pd.read_parquet(path)
    # some logic to ensure consistent columns
    # for example:
    if "z" in df.columns:
        df = df.rename(columns={"z": "a"}).astype(custom_dtypes)
    return df

dask_df = dd.from_delayed([custom_load(path) for path in glob.glob("some_path/*parquet")])

Dynamically Append datetime to filename during copy activity or when specifying name in blob dataset

I am saving a file to blob storage in Data Factory V2. When I specify the location to save to, I am calling the file (for example) file1 and it saves in blob as file1, no problem. But can I use the dynamic content feature to append the datetime to the filename so it's something like file1_01-07-2019_14-30-00 (7th Jan 14:30:00, just in case it's awkward to read)? Alternatively, can I output the result (the filename) of the webhook activity to the next activity (the function)?
Thank you.
I couldn't get this to work without editing the copy pipeline JSON file directly (late 2018 - may not be needed anymore). You need dynamic code in the copy pipeline JSON and settings defined in the dataset for setting filename parameters.
In the dataset define 'Parameters' for folder path and/or filename (click '+ New' and give them any name you like) e.g. sourceFolderPath, sourceFileName.
Then in dataset under 'Connection' include the following in the 'File path' definition:
@dataset().sourceFolderPath and @dataset().sourceFileName either side of the '/'
(see screenshot below)
In the copy pipeline, click on 'Code' in the upper right corner of the pipeline window and look for the following code under the 'blob' object you want defined by a dynamic filename. If the 'parameters' code isn't included, add it to the JSON and click the 'Finish' button. This code may be needed in 'inputs', 'outputs' or both, depending on the dynamic files you are referencing in your flow. Below is an example where the output includes the date parameter in both the folder path and the file name (the date is set by a Trigger parameter):
"inputs": [
{
"referenceName": "tmpDataForImportParticipants",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "StgParticipants",
"type": "DatasetReference",
"parameters": {
"sourceFolderPath": {
"value": <derived value of folder path>,
"type": "Expression"
},
"sourceFileName": {
"value": <derived file name>,
"type": "Expression"
}
}
}
]
Derived value of folder path may be something like the following - this results in a folder path of yyyy/mm/dd within specified blobContainer:
"blobContainer/#{formatDateTime(pipeline().parameters.windowStart,'yyyy')}/#{formatDateTime(pipeline().parameters.windowStart,'MM')}/#{formatDateTime(pipeline().parameters.windowStart,'dd')}"
or it could be hardcoded e.g. "blobContainer/directoryPath" - don't include '/' at start or end of definition
Derived file name could be something like the following:
"#concat(string(pipeline().parameters.'_',formatDateTime(dataset().WindowStartTime, 'MM-dd-yyyy_hh-mm-ss'))>,'.txt')"
You can include any parameter set by the Trigger, e.g. an ID value, account name, etc., by referencing it as pipeline().parameters.<parameter name>.
Dynamic Dataset Parameters example
Dynamic Dataset Connection example
Once you set up the copy activity and select your blob dataset as the sink, you need to put in a value for WindowStartTime. This can either be just a timestamp, e.g. 1900-01-01T13:00:00Z, or you can put a pipeline parameter into this.
Having a parameter would maybe be more helpful if you're setting up a schedule trigger, as you will be able to set this WindowStartTime timestamp to when the trigger runs. For this you would use @trigger().scheduledTime as the value for the trigger parameter WindowStartTime.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers#trigger-type-comparison
You can add a dataset parameter such as WindowStartTime, which is in the format 2019-01-10T13:50:04.279Z. Then you would have something like the below for the dynamic filename:
@concat('file1_', formatDateTime(dataset().WindowStartTime, 'MM-dd-yyyy_hh-mm-ss')).
To use in the copy activity you will also need to add a pipeline parameter.

Stream Analytics doesn't produce output to SQL table when reference data is used

I have been working with ASA lately and I am trying to insert an ASA stream directly into a SQL table using reference data. I based my development on this MS article: https://msdn.microsoft.com/en-us/azure/stream-analytics/reference/reference-data-join-azure-stream-analytics.
Overview of data flow - telemetry:
I've got many devices of different types (Heat Pumps, Batteries, Water Pumps, AirCon...). Each of these device types has a different JSON schema for its telemetry data. I can distinguish the JSONs by an attribute in the message (e.g.: "DeviceType":"HeatPump" or "DeviceType":"AirCon"...)
All of these devices are sending their telemetry to a single Event Hub
Behind the Event Hub, there is a single Stream Analytics job where I redirect streams to different outputs based on the DeviceType attribute. For example, I redirect telemetry from Heat Pumps with the query SELECT * INTO output-sql-table FROM input-event-hub WHERE DeviceType = 'HeatPump'
I would like to use some reference data to "enrich" the ASA stream with some IDKeys before inserting the stream into the SQL table.
What I've already done:
Successfully inserted the ASA stream directly into the SQL table using the ASA query SELECT * INTO [sql-table] FROM Input WHERE DeviceType = 'HeatPump', where [sql-table] has the same schema as the JSON message + standard columns (EventProcessedUtcTime, PartitionID, EventEnqueueUtcTime)
Successfully inserted the ASA stream directly into the SQL table using the ASA query SELECT Column1, Column2, Column3... INTO [sql-table] FROM Input WHERE DeviceType = 'HeatPump' - basically the same query as above, only this time with named columns in the select statement
Generated a JSON file of reference data and put it into BLOB storage
Created new static (not using {date} and {time} placeholders) reference data input in ASA pointing to the file in BLOB storage.
Then I joined reference data to the data stream in ASA query using the same statement with named columns
Result: no output rows in the SQL table.
When debugging the problem, I used the Test functionality in the ASA query editor:
I sample data from Event Hub - stream data.
I upload sample data from file - reference data.
After sampling data from the Event Hub finished, I tested the query -> the output produced some rows -> so it's not a problem with the query
Yet... if I run the ASA job, no output rows are inserted into the SQL table.
Some other ideas I tried:
Used TRY_CAST function to cast fields from reference data to appropriate data types before I joined them with fields in stream data
Used TRY_CAST function to cast fields in SELECT before I inserted them into SQL table
I really don't know what to do now. Any suggestions?
EDIT: added data stream JSON, reference data JSON, ASA query, ASA input configuration, BLOB storage configuration and ASA test output result
Data Stream JSON - single message
[
    {
        "Activation": 0,
        "AvailablePowerNegative": 6.0,
        "AvailablePowerPositive": 1.91,
        "DeviceID": 99999,
        "DeviceIsAvailable": true,
        "DeviceOn": true,
        "Entity": "HeatPumpTelemetry",
        "HeatPumpMode": 3,
        "Power": 1.91,
        "PowerCompressor": 1.91,
        "PowerElHeater": 0.0,
        "Source": "<omitted>",
        "StatusToPowerOff": 1,
        "StatusToPowerOn": 9,
        "Timestamp": "2018-08-29T13:34:26.0Z",
        "TimestampDevice": "2018-08-29T13:34:09.0Z"
    }
]
Reference data JSON - single message
[
    {
        "SourceID": 1,
        "Source": "<omitted>",
        "DeviceID": 10,
        "DeviceSourceCode": 99999,
        "DeviceName": "NULL",
        "DeviceType": "Heat Pump",
        "DeviceTypeID": 1
    }
]
ASA Query
WITH HeatPumpTelemetry AS
(
    SELECT
        *
    FROM
        [input-eh]
    WHERE
        source = '<omitted>'
        AND entity = 'HeatPumpTelemetry'
)
SELECT
    e.Activation,
    e.AvailablePowerNegative,
    e.AvailablePowerPositive,
    e.DeviceID,
    e.DeviceIsAvailable,
    e.DeviceOn,
    e.Entity,
    e.HeatPumpMode,
    e.Power,
    e.PowerCompressor,
    e.PowerElHeater,
    e.Source,
    e.StatusToPowerOff,
    e.StatusToPowerOn,
    e.Timestamp,
    e.TimestampDevice,
    e.EventProcessedUtcTime,
    e.PartitionId,
    e.EventEnqueuedUtcTime
INTO
    [out-SQL-HeatPumpTelemetry]
FROM
    HeatPumpTelemetry e
    LEFT JOIN [input-json-devices] d ON
        TRY_CAST(d.DeviceSourceCode AS BIGINT) = TRY_CAST(e.DeviceID AS BIGINT)
ASA Reference Data Input configuration (screenshot)
BLOB storage directory tree (screenshot)
ASA test query output (screenshot)
Hi matejp, I couldn't reproduce your issue; you could refer to my steps below.
reference data in blob storage:
{
    "a": "aaa",
    "reference": "www.bing.com"
}
stream data in blob storage
[
    {
        "id": "1",
        "name": "DeIdentified 1",
        "DeviceType": "aaa"
    },
    {
        "id": "2",
        "name": "DeIdentified 2",
        "DeviceType": "No"
    }
]
query statement:
SELECT
    inputSteam.*, inputRefer.*
INTO
    sqloutput
FROM
    inputSteam
JOIN inputRefer ON inputSteam.DeviceType = inputRefer.a
Output:
Hope it helps you. If you have any concerns, let me know.
I think I found the error. In the past days I tested nearly every possible combination when configuring inputs in Azure Stream Analytics.
I've started with this example as baseline: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-build-an-iot-solution-using-stream-analytics
I've tried the solution without any changes to be sure that the example with reference data input works -> it worked
Then I've changed ASA output from CosmosDB to SQL table without changing anything -> it worked
Then I've changed my initial ASA job to be as close as possible to the ASA job in the example (writing into a SQL table) -> it worked
Then I've started playing with BLOB directory names -> here I've found the error.
I think the problem I encountered is due to using the character "-" in the folder name.
In my case I've created a folder named "reference-data" and uploaded a file named "devices.json" (folder structure "/reference-data/devices.json") -> ASA output to the SQL table didn't work
As soon as I've changed the folder name to "refdata" (folder structure "/referencedata/devices.json") -> ASA output to the SQL table worked.
I tried 3 times switching the reference data input between a folder name containing "-" and one not containing it => every time, ASA output to SQL Server stopped working when "-" was in the folder name.
To recap:
I recommend not to use "-" in BLOB folder names for static reference data input in ASA Jobs.

Add column description to BigQuery table?

I need to add descriptions to each column of a BigQuery table. It seems I can do it manually; how do I do it programmatically?
BigQuery now supports the ALTER COLUMN SET OPTIONS statement, which can be used to update the description of a column.
Example:
ALTER TABLE mydataset.mytable
ALTER COLUMN price
SET OPTIONS (
description="Price per unit"
)
Documentation:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_column_set_options_statement
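If you want to run this DDL from code rather than the console, here is a minimal sketch using the google-cloud-bigquery Python client, assuming the same mydataset.mytable and price column as the example above:

from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Submit the ALTER COLUMN SET OPTIONS statement above as a query job.
ddl = """
    ALTER TABLE mydataset.mytable
    ALTER COLUMN price
    SET OPTIONS (description="Price per unit")
"""
client.query(ddl).result()  # wait for the DDL job to complete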
As Adam mentioned, you can use the table PATCH method on the API to update the schema columns. The other method is to use bq.
You can first get the schema by doing the following:
1: Get the JSON schema:
TABLE=publicdata:samples.shakespeare
bq show --format=prettyjson ${TABLE} > table.txt
Then copy the schema from table.txt to schema.txt ... it will look something like:
[
    {
        "description": "A single unique word (where whitespace is the delimiter) extracted from a corpus.",
        "mode": "REQUIRED",
        "name": "word",
        "type": "STRING"
    },
    {
        "description": "The number of times this word appears in this corpus.",
        "mode": "REQUIRED",
        "name": "word_count",
        "type": "INTEGER"
    },
    ....
]
2: Set the description field to whatever you want (if it is not there, add it).
3: Tell BigQuery to update the schema with the added columns. Note that schema.txt must contain the complete schema.
bq update --schema schema.txt -t ${TABLE}
You can use the REST API to create or update a table, and specify a field description (schema.fields[].description) in your schema.
https://cloud.google.com/bigquery/docs/reference/v2/tables#methods
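Along the same lines, here is a minimal sketch of updating a column description with the google-cloud-bigquery Python client (the table and column names are placeholders); update_table issues the tables.patch call for you:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project.your_dataset.your_table")  # placeholder table

# Rebuild the schema, setting a description on the column we care about.
new_schema = []
for field in table.schema:
    if field.name == "price":  # placeholder column name
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            description="Price per unit",
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # sends a PATCH for the schema only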