BigQuery join and UDF - sql

How can I join two tables in a SELECT statement in which I also use a UDF? I stored the SQL query and the UDF in two files that I call via the bq command line. However, when I run it, I get the following error:
BigQuery error in query operation: Error processing job
'[projectID]:bqjob_[error_number]':
Table name cannot be resolved: dataset name is missing.
Note that I'm logged in to the correct project via gcloud auth.
My SQL statement:
SELECT
substr(date,1,6) as date,
device,
channelGroup,
COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
COUNT(DISTINCT fullVisitorId) AS users,
FROM
defaultChannelGroup(
SELECT
a.date,
a.device.deviceCategory AS device,
b.hits.page.pagePath AS page,
a.fullVisitorId,
a.visitId,
a.trafficSource.source AS trafficSourceSource,
a.trafficSource.medium AS trafficSourceMedium,
a.trafficSource.campaign AS trafficSourceCampaign
FROM FLATTEN(
SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
,hits) as a
LEFT JOIN FLATTEN(
SELECT hits.page.pagePath,hits.time,visitID,fullVisitorId
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
WHERE
hits.time = 0
and trafficSource.medium = 'organic'
,hits) as b
ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
)
GROUP BY
date,
device,
channelGroup
ORDER BY sessions DESC
(I replaced datasetname with my actual dataset name, of course.)
And here is part of the UDF (which works with another query):
function defaultChannelGroup(row, emit)
{
function output(channelGroup) {
emit({channelGroup:channelGroup,
fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
device: row.device,
date: row.date
});
}
computeDefaultChannelGroup(row, output);
}
bigquery.defineFunction(
'defaultChannelGroup',
['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
//['device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
[{'name': 'channelGroup', 'type': 'string'},
{'name': 'fullVisitorId', 'type': 'string'},
{'name': 'visitId', 'type': 'integer'},
{'name': 'device', 'type': 'string'},
{'name': 'date', 'type': 'string'}
],
defaultChannelGroup
);

The SELECT statements within the FLATTEN function needed to be wrapped in an extra pair of brackets.
Ran the bq command in the shell:
bq query --udf_resource=udf.js "$(cat query.sql)"
query.sql contains the following query:
SELECT
substr(date,1,6) as date,
device,
channelGroup,
COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
COUNT(DISTINCT fullVisitorId) AS users,
COUNT(DISTINCT transactionId) as orders,
CAST(SUM(transactionRevenue)/1000000 AS INTEGER) as sales
FROM
defaultChannelGroup(
SELECT
a.date as date,
a.device.deviceCategory AS device,
b.hits.page.pagePath AS page,
a.fullVisitorId as fullVisitorId,
a.visitId as visitId,
a.trafficSource.source AS trafficSourceSource,
a.trafficSource.medium AS trafficSourceMedium,
a.trafficSource.campaign AS trafficSourceCampaign,
a.hits.transaction.transactionRevenue as transactionRevenue,
a.hits.transaction.transactionID as transactionId
FROM FLATTEN((
SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID,
hits.transaction.transactionID, hits.transaction.transactionRevenue
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
),hits) as a
LEFT JOIN FLATTEN((
SELECT hits.page.pagePath,hits.time,trafficSource.medium,visitID,fullVisitorId
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
WHERE
hits.time = 0
and trafficSource.medium = 'organic'
),hits) as b
ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
)
GROUP BY
date,
device,
channelGroup
ORDER BY sessions DESC
and udf.js contains the following function (the 'computeDefaultChannelGroup' function is not included):
function defaultChannelGroup(row, emit)
{
function output(channelGroup) {
emit({channelGroup:channelGroup,
date: row.date,
fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
device: row.device,
transactionId: row.transactionId,
transactionRevenue: row.transactionRevenue,
});
}
computeDefaultChannelGroup(row, output);
}
bigquery.defineFunction(
'defaultChannelGroup',
['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId', 'transactionId', 'transactionRevenue'],
[{'name': 'channelGroup', 'type': 'string'},
{'name': 'date', 'type': 'string'},
{'name': 'fullVisitorId', 'type': 'string'},
{'name': 'visitId', 'type': 'integer'},
{'name': 'device', 'type': 'string'},
{'name': 'transactionId', 'type': 'string'},
{'name': 'transactionRevenue', 'type': 'integer'}
],
defaultChannelGroup
);
This ran without error and the results matched the data in Google Analytics.
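For readers skimming the fix: the decisive change is just the extra pair of parentheses around each subselect passed to FLATTEN. A minimal sketch of the working pattern in legacy SQL, reduced to a single FLATTEN and without the UDF (datasetname is a placeholder as in the queries above):
SELECT fullVisitorId, visitId, hits.page.pagePath
FROM FLATTEN((
  SELECT fullVisitorId, visitId, hits.page.pagePath
  FROM TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
), hits)
LIMIT 10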

Related

Can you partition on unix time (INT) in BigQuery? If so, how?

Currently I'm working on a task in Airflow that loads CSV files to BigQuery where the time column is unix time (e.g., 1658371030).
The Airflow operator I'm using is GCSToBigQueryOperator where one of the params passed is schema_fields. If I define the time field in schema_fields value to be:
schema_fields = [
{"name": "UTCTimestamp", "type": "TIMESTAMP", "mode": "NULLABLE"},
....,
{"name": "OtherValue", "type": "STRING", "mode": "NULLABLE"}
]
Will BigQuery automatically detect that the unix time is an INT and cast it to a UTC timestamp?
If it can't, how can we partition on a unix time (INT) column in BigQuery?
I have tried creating a partitioned table using Airflow. Can you try adding this parameter to your code (looking at your post, UTCTimestamp is the only field applicable for partitioning):
time_partitioning={'type': 'MONTH', 'field': 'UTCTimestamp'}
For your reference: type specifies the type of time partitioning to perform and is a required parameter for time partitioning, and field is the name of the field to partition on.
Below is the DAG file I used to test creating a partitioned table.
My full code:
import os
from airflow import models
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.utils.dates import days_ago
from datetime import datetime

dag_id = "TimeStampTry"
DATASET_NAME = os.environ.get("GCP_DATASET_NAME", '<yourDataSetName>')
TABLE_NAME = os.environ.get("GCP_TABLE_NAME", '<yourTableNameHere>')

with models.DAG(
    dag_id,
    schedule_interval=None,
    start_date=days_ago(1),
    tags=["SampleReplicate"],
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id='gcs_to_bigquery_example2',
        bucket='<yourBucketNameHere>',
        source_objects=['timestampsamp.csv'],
        destination_project_dataset_table=f"{DATASET_NAME}.{TABLE_NAME}",
        schema_fields=[
            {'name': 'Name', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'date', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'},
            {'name': 'Device', 'type': 'STRING', 'mode': 'NULLABLE'},
        ],
        time_partitioning={'type': 'MONTH', 'field': 'date'},
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )
timestampsamp.csv content:
Screenshot of the table created in BQ:
As you can see in the screenshot, the table type is set to Partitioned.
Also, please see the BigQuery REST reference for more details about these parameters and their descriptions.
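As a side note on the original question: if the time column does land in BigQuery as an INTEGER of epoch seconds rather than as a TIMESTAMP, one option is to convert it in SQL and write the result to a partitioned table. A minimal sketch, assuming epoch seconds and hypothetical project, dataset and table names (the column names follow the question's schema_fields; TIMESTAMP_SECONDS and TIMESTAMP_TRUNC are standard SQL functions):
-- Hypothetical names; converts epoch seconds to a UTC TIMESTAMP and partitions by month.
CREATE OR REPLACE TABLE `my_project.my_dataset.events_partitioned`
PARTITION BY TIMESTAMP_TRUNC(UTCTimestamp, MONTH) AS
SELECT
  TIMESTAMP_SECONDS(UTCTimestamp) AS UTCTimestamp,
  OtherValue
FROM `my_project.my_dataset.events_raw`;
If the values were epoch milliseconds instead of seconds, TIMESTAMP_MILLIS would be the conversion to use.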

Best practices in Airflow to create BigQuery tables from another table

I am new to BigQuery and come from an AWS background.
I have a bucket with no structure, just files named YYYY-MM-DD-<SOME_ID>.csv.gzip.
The goal is to import these into BigQuery, then create another dataset with a subset table of the imported data. It should contain only last month's data, exclude some rows with a WHERE clause, and exclude some columns.
There seem to be many alternatives using different operators. What would be the best practice?
BigQueryCreateEmptyDatasetOperator(...)
BigQueryCreateEmptyTableOperator(...)
BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
I also found
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
GCSToBigQueryOperator,
)
GCSToBigQueryOperator(...)
When is this preferred?
This is my current code:
create_new_dataset_A = BigQueryCreateEmptyDatasetOperator(
    dataset_id=DATASET_NAME_A,
    project_id=PROJECT_ID,
    gcp_conn_id='_my_gcp_conn_',
    task_id='create_new_dataset_A')

load_csv = GCSToBigQueryOperator(
    bucket='cloud-samples-data',
    compression="GZIP",
    create_disposition="CREATE_IF_NEEDED",
    destination_project_dataset_table=f"{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}",
    source_format="CSV",
    source_objects=['202*'],
    task_id='load_csv',
    write_disposition='WRITE_APPEND',
    schema_fields=[
        {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
)

create_new_dataset_B = BigQueryCreateEmptyDatasetOperator(
    dataset_id=DATASET_NAME_B,
    project_id=PROJECT_ID,
    gcp_conn_id='_my_gcp_conn_',
    task_id='create_new_dataset_B')
populate_new_dataset_B = BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
Alternatives below:
populate_new_dataset_B = BigQueryExecuteQueryOperator(
    task_id='load_from_table_a_to_table_b',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    sql=f'''
        INSERT `{PROJECT_ID}.{DATASET_NAME_A}.D_EXCHANGE_RATE`
        SELECT col_x, col_y  # skip some columns from table_a
        FROM `{PROJECT_ID}.{DATASET_NAME_A}.S_EXCHANGE_RATE`
        WHERE col_x is not null
    ''',
)
Does it keep track of which rows it has already loaded, because of write_disposition='WRITE_APPEND'?
Does GCSToBigQueryOperator keep track of any metadata, or will it load duplicates?
populate_new_dataset_B = BigQueryInsertJobOperator(
    task_id="load_from_table_a_to_table_b",
    configuration={
        "query": {
            "query": "{% include 'sql-file.sql' %}",
            "use_legacy_sql": False,
        }
    },
    dag=dag,
)
Is this more for scheduled ETL jobs? Example: https://github.com/simonbreton/Capstone-project/blob/a6563576fa63b248a24d4a1bba70af10f527f6b4/airflow/dags/sql/fact_query.sql.
Here they do not use write_disposition='WRITE_APPEND'; they use a WHERE statement instead. Why? When is that preferred?
I don't get the last operator; when would you use it?
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html#howto-operator-bigqueryupserttableoperator
Which operator to use for populate_new_dataset_B?
Appreciate all help.
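Regarding the "last month only, some rows and some columns excluded" subset described at the top of this question: whichever populate operator you pick, the SQL it runs could look roughly like the sketch below. All project, dataset, table and column names here are placeholders, and it assumes a DATE column exists to filter on.
-- Hypothetical sketch of the subset query for populate_new_dataset_B.
INSERT INTO `my_project.dataset_b.subset_table` (col_x, col_y)
SELECT col_x, col_y  -- only the columns you keep
FROM `my_project.dataset_a.source_table`
WHERE col_x IS NOT NULL  -- row filter, as in the question
  AND load_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
  AND load_date < DATE_TRUNC(CURRENT_DATE(), MONTH)  -- last calendar month only
If re-runs must not create duplicates, a MERGE statement (or a WHERE clause that only picks up the new slice, as in the linked Capstone example) is another option besides relying on write_disposition alone.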

Converting SAS EG Data Step to Spark SQL

I'm trying to convert the following DATA step from SAS EG to Spark SQL:
data work.Test;
set WORK.PROGRAM3;
by Year Month Day;
if first.Month then HLProfit=0;
HLProfit+HighLevelProfit;
if first.Month then UnearnedRev=0;
UnearnedRev + UnearnedRevenue_Total;
run;
I'm getting the following error when trying to run the DATA step code through Spark SQL.
ParseException:
mismatched input 'data' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}
I'd appreciate it if any of you could give some direction on this, as I am new to it.
The expected output is that the first day of the month will have HLProfit = 0 and UnearnedRev = 0, and these will then gradually add up as the number of days increases.
Thanks :)
I managed to get the output I want using the code below:
PROGRAM3 = spark.sql("""
    SELECT *,
        SUM(HighLevelProfit) OVER (PARTITION BY Year, Month ORDER BY Day, Month, Year) AS HLProfit,
        SUM(UnearnedRevenue_Total) OVER (PARTITION BY Year, Month ORDER BY Day, Month, Year) AS UnearnedRev
    FROM TEST
""")
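For what it's worth, with an ORDER BY the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives tied Day values the same running total. A sketch with an explicit ROWS frame (same table and column names as above) mirrors the row-by-row accumulation of the SAS step more closely if several rows can share the same Day:
PROGRAM3 = spark.sql("""
    SELECT *,
        SUM(HighLevelProfit) OVER (
            PARTITION BY Year, Month
            ORDER BY Day
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS HLProfit,
        SUM(UnearnedRevenue_Total) OVER (
            PARTITION BY Year, Month
            ORDER BY Day
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS UnearnedRev
    FROM TEST
""")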

Elasticsearch SQL filtering by #timestamp not working

I am trying to create an Elasticsearch query for a Kibana Canvas element and need to be able to filter by the #timestamp field.
This is the query I have tried that I thought would work from other answers to this problem:
SELECT count(DISTINCT app) as counter FROM "snapshot*" where \"#timestamp\" >= '2020-11-01' and \"#timestamp\" <= '2021-11-01'
But I instead get this error back:
[essql] > Couldn't parse Elasticsearch SQL query. You may need to add
double quotes to names containing special characters. Check your query
and try again. Error: [parsing_exception] line 1:76: extraneous input
'' expecting {'(', 'ANALYZE', 'ANALYZED', 'CASE', 'CAST', 'CATALOGS',
'COLUMNS', 'CONVERT', 'CURRENT_DATE', 'CURRENT_TIME',
'CURRENT_TIMESTAMP', 'DAY', 'DEBUG', 'EXECUTABLE', 'EXISTS',
'EXPLAIN', 'EXTRACT', 'FALSE', 'FIRST', 'FORMAT', 'FULL', 'FUNCTIONS',
'GRAPHVIZ', 'HOUR', 'INTERVAL', 'LAST', 'LEFT', 'LIMIT', 'MAPPED',
'MATCH', 'MINUTE', 'MONTH', 'NOT', 'NULL', 'OPTIMIZED', 'PARSED',
'PHYSICAL', 'PLAN', 'RIGHT', 'RLIKE', 'QUERY', 'SCHEMAS', 'SECOND',
'SHOW', 'SYS', 'TABLES', 'TEXT', 'TRUE', 'TYPE', 'TYPES', 'VERIFY',
'YEAR', '{FN', '{D', '{T', '{TS', '{GUID', '+', '-', '*', '?', STRING,
INTEGER_VALUE, DECIMAL_VALUE, IDENTIFIER, DIGIT_IDENTIFIER,
QUOTED_IDENTIFIER, BACKQUOTED_IDENTIFIER}
I'm not sure what else I can try, as the only accepted answer I've seen is to write #timestamp as "#timestamp", and that doesn't work for me.
Any help would be appreciated.
Try it as 'timestamp', with just single quotes, instead of "timestamp".

Get Most Recent Column Value With Nested And Repeated Fields

I have a table with nested and repeated address fields, with the following data in it:
[
{
"addresses": [
{
"city": "New York"
},
{
"city": "San Francisco"
}
],
"age": "26.0",
"name": "Foo Bar",
"createdAt": "2016-02-01 15:54:25 UTC"
},
{
"addresses": [
{
"city": "New York"
},
{
"city": "San Francisco"
}
],
"age": "26.0",
"name": "Foo Bar",
"createdAt": "2016-02-01 15:54:16 UTC"
}
]
What I'd like to do is recreate the same table (same structure) but with only the latest version of each row. In this example, let's say I'd like to group everything by name and take the row with the most recent createdAt.
I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value, but I couldn't get it to work with RECORD and REPEATED fields.
I really hoped someone from the Google team would provide an answer to this question, as it is a very frequent topic/problem asked here on SO. BigQuery is definitely not friendly enough about writing nested/repeated results back to a BigQuery table from a BigQuery query.
So, I will provide the workaround I found a relatively long time ago. I DO NOT like it, but (and that is why I hoped for an answer from the Google team) it works. I hope you will be able to adapt it to your particular scenario.
So, based on your example, assume you have a table as below,
and you expect to get the most recent records based on the createdAt column, so the result will look like this:
The code below does this:
SELECT name, age, createdAt, addresses.city
FROM JS(
( // input table
SELECT name, age, createdAt, NEST(city) AS addresses
FROM (
SELECT name, age, createdAt, addresses.city
FROM (
SELECT
name, age, createdAt, addresses.city,
MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
FROM yourTable
)
WHERE createdAt = lastAt
)
GROUP BY name, age, createdAt
),
name, age, createdAt, addresses, // input columns
"[ // output schema
{'name': 'name', 'type': 'STRING'},
{'name': 'age', 'type': 'INTEGER'},
{'name': 'createdAt', 'type': 'INTEGER'},
{'name': 'addresses', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'city', 'type': 'STRING'}
]
}
]",
"function(row, emit) { // function
var c = [];
for (var i = 0; i < row.addresses.length; i++) {
c.push({city:row.addresses[i]});
};
emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
}"
)
The way the above code works is: it implicitly flattens the original records; finds the rows that belong to the most recent record (partitioned by name and age); and assembles those rows back into their respective records. The final step is processing with the JS UDF to build a proper schema that can actually be written back to a BigQuery table as nested/repeated rather than flattened.
That last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
Please note: in this example there is only one leaf field inside the addresses record, so the NEST() function worked directly. In scenarios where you have more than one field inside, the above approach still works, but you need to concatenate those fields before putting them inside NEST(), and then do extra splitting of those fields inside the JS function, etc.
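For illustration only, here is a hedged sketch of what that concatenation trick could look like if the addresses record had two leaf fields, city and a made-up zip (treated as a string here), following the same pattern as the query above. The '|' separator and the zip field are assumptions, and the sketch is untested:
SELECT name, age, createdAt, addresses.city, addresses.zip
FROM JS(
( // input table: concatenate the leaf fields before NEST()
SELECT name, age, createdAt, NEST(CONCAT(city, '|', zip)) AS addresses
FROM (
SELECT name, age, createdAt, addresses.city, addresses.zip
FROM (
SELECT
name, age, createdAt, addresses.city, addresses.zip,
MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
FROM yourTable
)
WHERE createdAt = lastAt
)
GROUP BY name, age, createdAt
),
name, age, createdAt, addresses, // input columns
"[ // output schema
{'name': 'name', 'type': 'STRING'},
{'name': 'age', 'type': 'INTEGER'},
{'name': 'createdAt', 'type': 'INTEGER'},
{'name': 'addresses', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'city', 'type': 'STRING'},
{'name': 'zip', 'type': 'STRING'}
]
}
]",
"function(row, emit) { // function: split the concatenated values back out
var c = [];
for (var i = 0; i < row.addresses.length; i++) {
var parts = row.addresses[i].split('|');
c.push({city: parts[0], zip: parts[1]});
}
emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
}"
)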
You can see examples in the answers below:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
I hope this is a good foundation for you to experiment with and make your case work!