Best practices in Airflow to create BigQuery tables from another table - google-bigquery

I am new to BigQuery and come from an AWS background.
I have a bucket with no folder structure, just files named YYYY-MM-DD-<SOME_ID>.csv.gzip.
The goal is to import these files into BigQuery and then create, in another dataset, a subset table of the imported data: only last month's data, with some rows excluded by a WHERE clause and some columns left out.
There seem to be many alternatives using different operators. What would be the best practice?
BigQueryCreateEmptyDatasetOperator(...)
BigQueryCreateEmptyTableOperator(...)
BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
I also found
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

GCSToBigQueryOperator(...)
When is this preferred?
This is my current code:
create_new_dataset_A = BigQueryCreateEmptyDatasetOperator(
    dataset_id=DATASET_NAME_A,
    project_id=PROJECT_ID,
    gcp_conn_id='_my_gcp_conn_',
    task_id='create_new_dataset_A')
load_csv = GCSToBigQueryOperator(
    bucket='cloud-samples-data',
    compression="GZIP",
    create_disposition="CREATE_IF_NEEDED",
    destination_project_dataset_table=f"{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}",
    source_format="CSV",
    source_objects=['202*'],
    task_id='load_csv',
    write_disposition='WRITE_APPEND',
    schema_fields=[
        {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
)
create_new_dataset_B = BigQueryCreateEmptyDatasetOperator(
    dataset_id=DATASET_NAME_B,
    project_id=PROJECT_ID,
    gcp_conn_id='_my_gcp_conn_',
    task_id='create_new_dataset_B')
populate_new_dataset_B = BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
Alternatives below:
populate_new_dataset_B = BigQueryExecuteQueryOperator(
    task_id='load_from_table_a_to_table_b',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    sql=f'''
        INSERT `{PROJECT_ID}.{DATASET_NAME_A}.D_EXCHANGE_RATE`
        SELECT col_x, col_y  # skip some columns from table_a
        FROM `{PROJECT_ID}.{DATASET_NAME_A}.S_EXCHANGE_RATE`
        WHERE col_x IS NOT NULL
    ''',
)
Does it keep track of which rows it has already loaded, given write_disposition='WRITE_APPEND'?
Does GCSToBigQueryOperator keep any metadata about this, or will it load duplicates?
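From what I can tell, WRITE_APPEND does not track anything; every run appends whatever source_objects matches, so a glob like '202*' would reload the same files again. A sketch of what I am considering instead, assuming source_objects is a templated field and that the files really are named YYYY-MM-DD-<SOME_ID>.csv.gzip, so the {{ ds }} macro can scope each daily run to its own files:
load_csv_for_run_date = GCSToBigQueryOperator(
    task_id='load_csv_for_run_date',
    bucket='cloud-samples-data',
    # {{ ds }} renders to the run date as YYYY-MM-DD, so only that day's files match
    source_objects=['{{ ds }}-*'],
    destination_project_dataset_table=f"{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}",
    source_format='CSV',
    compression='GZIP',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_fields=[
        {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
)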
populate_new_dataset_B = BigQueryInsertJobOperator(
    task_id="load_from_table_a_to_table_b",
    configuration={
        "query": {
            "query": "{% include 'sql-file.sql' %}",
            "useLegacySql": False,
        }
    },
    dag=dag,
)
Is this more for scheduled ETL jobs? Example: https://github.com/simonbreton/Capstone-project/blob/a6563576fa63b248a24d4a1bba70af10f527f6b4/airflow/dags/sql/fact_query.sql.
Here they do not use write_disposition='WRITE_APPEND'; they use a WHERE statement instead. Why, and when is that preferable?
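If I read the linked example correctly, with BigQueryInsertJobOperator the write behaviour is part of the job configuration itself (and BigQueryExecuteQueryOperator is deprecated in recent versions of the Google provider in favour of BigQueryInsertJobOperator). A sketch of what I have in mind for populating dataset B with only last month's data; the column names (col_x, col_y) and the date column used in the WHERE clause are placeholders:
populate_new_dataset_B = BigQueryInsertJobOperator(
    task_id='load_from_table_a_to_table_b',
    configuration={
        'query': {
            'query': f'''
                SELECT col_x, col_y  # keep only the columns needed in dataset B
                FROM `{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}`
                WHERE col_x IS NOT NULL
                  AND some_date_col >= DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 MONTH)
                  AND some_date_col < DATE_TRUNC(CURRENT_DATE(), MONTH)
            ''',
            'useLegacySql': False,
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_NAME_B,
                'tableId': TABLE_NAME,
            },
            # rebuild the subset on every run instead of appending to it
            'writeDisposition': 'WRITE_TRUNCATE',
        }
    },
)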
The last operator I don't get: when should it be used?
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html#howto-operator-bigqueryupserttableoperator
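From the linked how-to, BigQueryUpsertTableOperator operates on the table definition (metadata), not on rows: it creates the table if it does not exist and otherwise patches the given table resource, so it does not move data between tables at all. A sketch adapted from that page (the tableId and expiration values are just examples):
import time
from airflow.providers.google.cloud.operators.bigquery import BigQueryUpsertTableOperator

# Creates my_table if it is missing, otherwise only updates its expiration time.
upsert_table = BigQueryUpsertTableOperator(
    task_id='upsert_table',
    project_id=PROJECT_ID,
    dataset_id=DATASET_NAME_A,
    table_resource={
        'tableReference': {'tableId': 'my_table'},
        'expirationTime': (int(time.time()) + 300) * 1000,  # epoch milliseconds
    },
)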
Which operator should I use for populate_new_dataset_B?
Appreciate all help.

Related

Pymongo using hint on find_one gets AttributeError

I need to perform a MongoDB find_one query with pymongo but get AttributeError: 'NoneType' object has no attribute 'hint', since there are no results matching the filter.
db.collection_name.find_one(
    filter=filter_query,
    projection={'_id': False, 'date': True},
    sort=[('date', pymongo.DESCENDING)],
).hint('some_index')
also tried
db.collection_name.find_one(
    filter=filter_query,
    projection={'_id': False, 'date': True},
    sort=[('date', pymongo.DESCENDING)],
    hint='some_index',
)
I know I can do it with find() but is there a way to do it with find_one?
The first approach definitely won't work. The second requires the hint parameter to be passed in the same format used to create the index, e.g. [('field', ASCENDING)].
https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find
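For example (a sketch; filter_query and the index spec [('date', pymongo.DESCENDING)] are assumptions, not from the question), the hint keyword is simply forwarded to find():
import pymongo

doc = db.collection_name.find_one(
    filter=filter_query,
    projection={'_id': False, 'date': True},
    sort=[('date', pymongo.DESCENDING)],
    # same key/direction pairs that were used to create the index
    hint=[('date', pymongo.DESCENDING)],
)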

Can you partition on unix time (INT) in BigQuery? If so, How?

Currently I'm working on a task in Airflow that loads CSV files to BigQuery where the time column is unix time (e.g., 1658371030).
The Airflow operator I'm using is GCSToBigQueryOperator, where one of the params passed is schema_fields. If I define the time field in schema_fields as:
schema_fields = [
    {"name": "UTCTimestamp", "type": "TIMESTAMP", "mode": "NULLABLE"},
    ....,
    {"name": "OtherValue", "type": "STRING", "mode": "NULLABLE"}
]
Will BigQuery automatically detect that the unix time is an INT and cast it to a UTC timestamp?
If it can't, how can we partition on a unix time (INT) in BigQuery?
I have tried creating a partitioned table using Airflow. Can you try adding this parameter to your code (looking at your post, UTCTimestamp is the only field applicable for partitioning):
time_partitioning={'type': 'MONTH', 'field': 'UTCTimestamp'}
For your reference, type specifies the kind of time partitioning to perform and is a required parameter for time partitioning, and field is the name of the field to partition on.
Below is the DAG file I used to test creating a partitioned table.
My full code:
import os
from airflow import models
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.utils.dates import days_ago

dag_id = "TimeStampTry"
DATASET_NAME = os.environ.get("GCP_DATASET_NAME", '<yourDataSetName>')
TABLE_NAME = os.environ.get("GCP_TABLE_NAME", '<yourTableNameHere>')

with models.DAG(
    dag_id,
    schedule_interval=None,
    start_date=days_ago(1),
    tags=["SampleReplicate"],
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id='gcs_to_bigquery_example2',
        bucket='<yourBucketNameHere>',
        source_objects=['timestampsamp.csv'],
        destination_project_dataset_table=f"{DATASET_NAME}.{TABLE_NAME}",
        schema_fields=[
            {'name': 'Name', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'date', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'},
            {'name': 'Device', 'type': 'STRING', 'mode': 'NULLABLE'},
        ],
        time_partitioning={'type': 'MONTH', 'field': 'date'},
        write_disposition='WRITE_TRUNCATE',
    )
The timestampsamp.csv content and a screenshot of the table created in BQ are not reproduced here; in the BigQuery console the created table's type shows as Partitioned.
Also see the BigQuery REST reference for more details about these parameters and their descriptions.
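If you would rather not rely on the load job interpreting the epoch value, an alternative sketch (not tested here; the staging and destination table names and the utc_ts alias are placeholders) is to load the column as INT64 into a staging table and then build the partitioned table from a query, converting with TIMESTAMP_SECONDS:
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Converts the INT64 epoch-seconds column to TIMESTAMP and writes the result
# to a table partitioned by month on that timestamp.
convert_and_partition = BigQueryInsertJobOperator(
    task_id='convert_and_partition',
    configuration={
        'query': {
            'query': f'''
                CREATE OR REPLACE TABLE `{DATASET_NAME}.partitioned_table`
                PARTITION BY TIMESTAMP_TRUNC(utc_ts, MONTH)
                AS
                SELECT TIMESTAMP_SECONDS(UTCTimestamp) AS utc_ts,
                       * EXCEPT (UTCTimestamp)
                FROM `{DATASET_NAME}.staging_table`
            ''',
            'useLegacySql': False,
        }
    },
)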

Random Dataframe Column ordering

Strange question here, but I was trying to create an empty DataFrame with the following code. I want the columns to be in the order I wrote them, but in the output they appear in a different order. Is there an intuitive reason why this happens?
import pandas as pd

user_df = pd.DataFrame(columns={'NAME',
                                'AGE',
                                'EMAIL',
                                'PASSWORD',
                                'FAVORITE_TEAM'
                                })
user_df
Output:
PASSWORD EMAIL AGE NAME FAVORITE_TEAM
The reason is that you are passing a set ({}), which has no defined order.
Docs:
A set object is an unordered collection of distinct hashable objects.
With a set, the column order is arbitrary:
user_df = pd.DataFrame(columns={'NAME',
                                'AGE',
                                'EMAIL',
                                'PASSWORD',
                                'FAVORITE_TEAM'
                                })
print (user_df)
Empty DataFrame
Columns: [AGE, FAVORITE_TEAM, EMAIL, NAME, PASSWORD]
Index: []
If you use a list ([]) instead, everything works as expected:
user_df = pd.DataFrame(columns=['NAME',
                                'AGE',
                                'EMAIL',
                                'PASSWORD',
                                'FAVORITE_TEAM'
                                ])
print (user_df)
Empty DataFrame
Columns: [NAME, AGE, EMAIL, PASSWORD, FAVORITE_TEAM]
Index: []

How to avoid uploading duplicated row into BigQuery table with Google App Script

I'm uploading some data into BigQuery from a Google Sheet using Google Apps Script. Is there a way to upload this data without uploading duplicated rows?
Here is the JobSpec I'm currently using:
var jobSpec = {
  configuration: {
    load: {
      destinationTable: {
        projectId: projectId,
        datasetId: 'ClientAccount',
        tableId: tableId
      },
      allowJaggedRows: true,
      writeDisposition: 'WRITE_APPEND',
      schema: {
        fields: [
          {name: 'date', type: 'STRING'},
          {name: 'Impressions', type: 'INTEGER'},
          {name: 'Clicks', type: 'INTEGER'},
        ]
      }
    }
  }
};
So I'm looking for something like allowDuplicates: true... I think you get the idea. How can I do this?
BigQuery load jobs do not have any concept of deduplication, but you can effectively achieve this by loading all the data into an initial table and then querying that table with a deduplication query into another table.
WITH t AS (SELECT 1 AS field, [1, 3, 4, 4] AS dupe)
SELECT ANY_VALUE(field), dupe
FROM t, t.dupe
GROUP BY dupe;
You can deduplicate your data with Apps Script directly in Google Sheets before loading to BQ.
Or, as Victor said, you can deduplicate your data in BQ with something like:
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY Field_to_deduplicate ORDER BY key) AS RowNr
  FROM YourDataset.YourTable
) AS X
WHERE X.RowNr = 1

Get Most Recent Column Value With Nested And Repeated Fields

I have a table with the following structure (schema screenshot not reproduced here) and the following data in it:
[
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:25 UTC"
  },
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:16 UTC"
  }
]
What I'd like to do is recreate the same table (same structure) but with only the latest version of a row. In this example let's say that I'd like to group by everything by name and take the row with the most recent createdAt.
I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value but I couldn't get it to work with record and repeated fields.
I really hoped someone from the Google team would answer this question, as it is a very frequent topic/problem here on SO. BigQuery is definitely not friendly enough when it comes to writing nested/repeated results back to a BigQuery table from a query.
So I will share the workaround I found a relatively long time ago. I DO NOT like it, but (and that is why I hoped for an answer from the Google team) it works. I hope you will be able to adapt it to your particular scenario.
So, based on your example, assume you have a table as below (screenshot omitted),
and you expect to get the most recent records based on the createdAt column, so the result will look like (screenshot omitted):
The code below does this:
SELECT name, age, createdAt, addresses.city
FROM JS(
( // input table
SELECT name, age, createdAt, NEST(city) AS addresses
FROM (
SELECT name, age, createdAt, addresses.city
FROM (
SELECT
name, age, createdAt, addresses.city,
MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
FROM yourTable
)
WHERE createdAt = lastAt
)
GROUP BY name, age, createdAt
),
name, age, createdAt, addresses, // input columns
"[ // output schema
{'name': 'name', 'type': 'STRING'},
{'name': 'age', 'type': 'INTEGER'},
{'name': 'createdAt', 'type': 'INTEGER'},
{'name': 'addresses', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'city', 'type': 'STRING'}
]
}
]",
"function(row, emit) { // function
var c = [];
for (var i = 0; i < row.addresses.length; i++) {
c.push({city:row.addresses[i]});
};
emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
}"
)
The way the above code works: it implicitly flattens the original records; finds the rows that belong to the most recent records (partitioned by name and age); and assembles those rows back into their respective records. The final step is processing with a JS UDF to build the proper schema so the result can actually be written back to a BigQuery table as nested/repeated rather than flattened.
The last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
Please note: in this example there is only one nested field inside the addresses record, so the NEST() function worked. In scenarios where you have more than one field inside, the approach above still works, but you need to concatenate those fields to put them inside NEST(), and then split them apart again inside the JS function, etc.
You can see examples in the answers below:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
I hope this is good foundation for you to experiment with and make your case work!