I'm currently working on an Airflow task that loads CSV files into BigQuery, where the time column is unix time (e.g., 1658371030).
The Airflow operator I'm using is GCSToBigQueryOperator, and one of the params I pass is schema_fields. If I define the time field in schema_fields as:
schema_fields = [
{"name": "UTCTimestamp", "type": "TIMESTAMP", "mode": "NULLABLE"},
....,
{"name": "OtherValue", "type": "STRING", "mode": "NULLABLE"}
]
Will BigQuery automatically detect that the unix time value is an integer and cast it to a UTC TIMESTAMP?
If it can't, how can we partition on a unix time (INT) in BigQuery?
I have tried creating a partitioned table using Airflow. Can you try adding this parameter to your code? (Looking at your post, UTCTimestamp is the only field applicable for partitioning.)
time_partitioning={'type': 'MONTH', 'field': 'UTCTimestamp'}
For your reference: type specifies the type of time partitioning to perform and is a required parameter for time partitioning, and field is the name of the field to partition on.
Below is the full DAG file I used to test creating a partitioned table:
import os
from airflow import models
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.utils.dates import days_ago
from datetime import datetime
dag_id = "TimeStampTry"
DATASET_NAME = os.environ.get("GCP_DATASET_NAME", '<yourDataSetName>')
TABLE_NAME = os.environ.get("GCP_TABLE_NAME", '<yourTableNameHere>')
with models.DAG(
dag_id,
schedule_interval=None,
start_date=days_ago(1),
tags=["SampleReplicate"],
) as dag:
load_csv = GCSToBigQueryOperator(
task_id='gcs_to_bigquery_example2',
bucket='<yourBucketNameHere>',
source_objects=['timestampsamp.csv'],
destination_project_dataset_table=f"{DATASET_NAME}.{TABLE_NAME}",
schema_fields=[
{'name': 'Name', 'type': 'STRING', 'mode': 'NULLABLE'},
{'name': 'date', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'},
{'name': 'Device', 'type': 'STRING', 'mode': 'NULLABLE'},
],
time_partitioning={'type': 'MONTH', 'field': 'date'},
write_disposition='WRITE_TRUNCATE',
dag=dag,
)
timestampsamp.csv content:
Screenshot of the table created in BQ:
As you can see, the table type is set to partitioned.
Also, please see the BigQuery REST reference for more details about these parameters and their descriptions.
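On the original question of whether BigQuery will cast the unix-time integer to a TIMESTAMP during the CSV load: I am not certain it will, so here is a hedged sketch (my own assumption, not something tested above) of an alternative: load the column as INTEGER into a staging table, then build the month-partitioned table with TIMESTAMP_SECONDS(). The project, dataset, table, and column names below are placeholders, and the task would sit in the same DAG downstream of the load task.
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical follow-up task: turn the unix-epoch INTEGER column into a TIMESTAMP
# and write a month-partitioned table from the staging table.
convert_and_partition = BigQueryInsertJobOperator(
    task_id="convert_unix_to_partitioned_table",
    configuration={
        "query": {
            "query": """
                CREATE OR REPLACE TABLE `your_project.your_dataset.events_partitioned`
                PARTITION BY TIMESTAMP_TRUNC(UTCTimestamp, MONTH) AS
                SELECT
                  TIMESTAMP_SECONDS(UTCTimestampUnix) AS UTCTimestamp,  -- unix seconds to TIMESTAMP
                  * EXCEPT (UTCTimestampUnix)
                FROM `your_project.your_dataset.events_staging`
            """,
            "useLegacySql": False,
        }
    },
)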
I am getting a JSON response from an API.
I want to get 5 columns from it, of which 4 are normal, but 1 column is of RECORD REPEATED type.
I want to load that data into a BigQuery table.
Below is my code, in which the schema is defined.
import requests
from requests.auth import HTTPBasicAuth
import json
from google.cloud import bigquery
import pandas
import pandas_gbq
URL='<API>'
auth = HTTPBasicAuth('username', 'password')
# sending get request and saving the response as response object
r = requests.get(url=URL ,auth=auth)
data = r.json()
---------------------- JSON response ----------------
{
"data": {
"id": "jfp695q8",
"origin": "taste",
"title": "Christmas pudding martini recipe",
"subtitle": null,
"customTitles": [{
"name": "editorial",
"value": "Christmas pudding martini"
}]
}
}
id=data['data']['id']
origin=data['data']['origin']
title=data['data']['title']
subtitle=data['data']['subtitle']
customTitles=json.dumps(data['data']['customTitles'])
# print(customTitles)
df = pandas.DataFrame(
{
'id':id,
'origin':origin,
'title':title,
'subtitle':'subtitle',
'customTitles':customTitles
},index=[0]
)
# df.head()
client = bigquery.Client(project='ncau-data-newsquery-sit')
table_id = 'sdm_adpoint.testfapi'
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("id", "STRING"),
bigquery.SchemaField("origin", "STRING"),
bigquery.SchemaField("title", "STRING"),
bigquery.SchemaField("subtitle", "STRING"),
bigquery.SchemaField(
"customTitles",
"RECORD",
mode="REPEATED",
fields=[
bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
])
],
autodetect=False
)
df.head()
job = client.load_table_from_dataframe(
df, table_id, job_config=job_config
)
job.result()
customTitles is a RECORD REPEATED field with two keys, name and value, so I have defined the schema accordingly.
Below is my table schema.
Below is the output of df.head():
jfp695q8 taste Christmas pudding martini recipe subtitle [{"name": "editorial", "value": "Christmas pudding martini"}]
Up to here it's fine.
But when I try to load the data into the table, it throws the error below.
ArrowTypeError: Could not convert '[' with type str: was expecting tuple of (key, value) pair
Can anyone tell me what's wrong here?
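Not an authoritative answer, but a hedged reading of the error: load_table_from_dataframe hands the DataFrame to pyarrow, and for a REPEATED RECORD column pyarrow expects each cell to hold a list of dicts; json.dumps() turns customTitles into a plain string, which is what the "Could not convert '[' with type str" message is complaining about. A minimal sketch of building the DataFrame that way (same names as above, and passing the subtitle variable rather than the literal string 'subtitle'; exact behavior can depend on your pandas/pyarrow/google-cloud-bigquery versions):
# Keep customTitles as Python objects (a list of dicts) so pyarrow can map it to the
# REPEATED RECORD field in the schema, instead of serializing it with json.dumps().
df = pandas.DataFrame(
    {
        'id': id,
        'origin': origin,
        'title': title,
        'subtitle': subtitle,
        'customTitles': [data['data']['customTitles']],  # one list-of-dicts value for the single row
    },
    index=[0],
)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()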
I am new to BigQuery and come from an AWS background.
I have a bucket with no structure, just files named YYYY-MM-DD-<SOME_ID>.csv.gzip.
The goal is to import this into BigQuery, then create another dataset with a subset table of the imported data. It should contain last month's data, exclude some rows with a WHERE statement, and exclude some columns.
There seem to be many alternatives using different operators. What would be the best practice to do it?
BigQueryCreateEmptyDatasetOperator(...)
BigQueryCreateEmptyTableOperator(...)
BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
I also found
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
GCSToBigQueryOperator,
)
GCSToBigQueryOperator(...)
When is this preferred?
This is my current code:
create_new_dataset_A = BigQueryCreateEmptyDatasetOperator(
dataset_id=DATASET_NAME_A,
project_id=PROJECT_ID,
gcp_conn_id='_my_gcp_conn_',
task_id='create_new_dataset_A')
load_csv = GCSToBigQueryOperator(
bucket='cloud-samples-data',
compression="GZIP",
create_disposition="CREATE_IF_NEEDED",
destination_project_dataset_table=f"{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}",
source_format="CSV",
source_objects=['202*'],
task_id='load_csv',
write_disposition='WRITE_APPEND',
schema_fields=[
{'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
{'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
],
)
create_new_dataset_B = BigQueryCreateEmptyDatasetOperator(
dataset_id=DATASET_NAME_B,
project_id=PROJECT_ID,
gcp_conn_id='_my_gcp_conn_',
task_id='create_new_dataset_B')
populate_new_dataset_B = BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
Alternatives below:
populate_new_dataset_B = BigQueryExecuteQueryOperator(
task_id='load_from_table_a_to_table_b',
use_legacy_sql=False,
write_disposition='WRITE_APPEND',
sql=f'''
INSERT `{PROJECT_ID}.{DATASET_NAME_A}.D_EXCHANGE_RATE`
SELECT col_x, col_y # skip some columns from table_a
FROM
`{PROJECT_ID}.{DATASET_NAME_A}.S_EXCHANGE_RATE`
WHERE col_x is not null
'''
)
Does it keep track of the rows it has already loaded, given write_disposition='WRITE_APPEND'?
Does GCSToBigQueryOperator keep any metadata, or will it load duplicates?
populate_new_dataset_B = BigQueryInsertJobOperator(
task_id="load_from_table_a_to_table_b",
configuration={
"query": {
"query": "{% include 'sql-file.sql' %}",
"use_legacy_sql": False,
}
},
dag=dag,
)
Is this more for scheduled ETL jobs? Example: https://github.com/simonbreton/Capstone-project/blob/a6563576fa63b248a24d4a1bba70af10f527f6b4/airflow/dags/sql/fact_query.sql.
Here they do not use write_disposition='WRITE_APPEND'; they use a WHERE statement instead. Why? When is that preferable?
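For what it's worth, here is a hedged sketch (my own assumption, not taken from the linked project) of how the two ideas can be combined for the "last month's data, some rows and columns excluded" goal: the WHERE clause bounds each run to the previous calendar month, and WRITE_TRUNCATE on the destination keeps re-runs idempotent. The load_date column and the subset_table name are hypothetical.
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical incremental load into dataset B: only last month's rows, only the needed columns.
populate_new_dataset_B = BigQueryInsertJobOperator(
    task_id="load_from_table_a_to_table_b",
    configuration={
        "query": {
            "query": f"""
                SELECT name, post_abbr  -- keep only the columns you need
                FROM `{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}`
                WHERE load_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
                  AND load_date < DATE_TRUNC(CURRENT_DATE(), MONTH)
                  AND name IS NOT NULL  -- row filter
            """,
            "useLegacySql": False,
            "destinationTable": {
                "projectId": PROJECT_ID,
                "datasetId": DATASET_NAME_B,
                "tableId": "subset_table",
            },
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
)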
The last operator I don't get; when should it be used?
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html#howto-operator-bigqueryupserttableoperator
Which operator to use for populate_new_dataset_B?
Appreciate all help.
I'm loading data from Google Cloud Storage to BigQuery using GoogleCloudStorageToBigQueryOperator.
The JSON file may have more columns than what I defined. In that case I want the load job to continue and simply ignore the unrecognized columns.
I tried to use the ignore_unknown_values argument but it didn't make any difference.
My operator:
def dc():
return [
{
"name": "id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "storeId",
"type": "INTEGER",
"mode": "NULLABLE"
},
...
]
gcs_to_bigquery_st = GoogleCloudStorageToBigQueryOperator(
dag=dag,
task_id='load_to_BigQuery_stage',
bucket=GCS_BUCKET_ID,
destination_project_dataset_table=table_name_template_st,
source_format='NEWLINE_DELIMITED_JSON',
source_objects=[gcs_export_uri_template],
ignore_unknown_values = True,
schema_fields=dc(),
create_disposition='CREATE_IF_NEEDED',
write_disposition='WRITE_APPEND',
skip_leading_rows = 1,
google_cloud_storage_conn_id=CONNECTION_ID,
bigquery_conn_id=CONNECTION_ID
)
The error:
u'Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: shippingService.'
which is true: shippingService doesn't exist in my schema, and it won't be added to the table.
How can I fix this?
Edit:
I removed schema_fields=dc() from the operator:
gcs_to_bigquery_st = GoogleCloudStorageToBigQueryOperator(
dag=dag,
task_id='load_to_BigQuery_stage',
bucket=GCS_BUCKET_ID,
destination_project_dataset_table=table_name_template_st,
source_format='NEWLINE_DELIMITED_JSON',
source_objects=[gcs_export_uri_template],
ignore_unknown_values = True,
create_disposition='CREATE_IF_NEEDED',
write_disposition='WRITE_APPEND',
skip_leading_rows = 1,
google_cloud_storage_conn_id=CONNECTION_ID,
bigquery_conn_id=CONNECTION_ID
)
It still gives the same error.
This doesn't make sense... the operator is told to ignore unknown values :(
The only reason I can think of is that you are probably using Airflow 1.9; this feature was added in Airflow 1.10.
However, you can use it in Airflow 1.9 by adding src_fmt_configs={'ignoreUnknownValues': True} as follows:
gcs_to_bigquery_st = GoogleCloudStorageToBigQueryOperator(
dag=dag,
task_id='load_to_BigQuery_stage',
bucket=GCS_BUCKET_ID,
destination_project_dataset_table=table_name_template_st,
source_format='NEWLINE_DELIMITED_JSON',
source_objects=[gcs_export_uri_template],
src_fmt_configs={'ignoreUnknownValues': True},
create_disposition='CREATE_IF_NEEDED',
write_disposition='WRITE_APPEND',
skip_leading_rows = 1,
google_cloud_storage_conn_id=CONNECTION_ID,
bigquery_conn_id=CONNECTION_ID
)
I have a table with the following structure:
and the following data in it:
[
{
"addresses": [
{
"city": "New York"
},
{
"city": "San Francisco"
}
],
"age": "26.0",
"name": "Foo Bar",
"createdAt": "2016-02-01 15:54:25 UTC"
},
{
"addresses": [
{
"city": "New York"
},
{
"city": "San Francisco"
}
],
"age": "26.0",
"name": "Foo Bar",
"createdAt": "2016-02-01 15:54:16 UTC"
}
]
What I'd like to do is recreate the same table (same structure) but with only the latest version of each row. In this example, let's say I'd like to group everything by name and take the row with the most recent createdAt.
I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value, but I couldn't get it to work with RECORD and REPEATED fields.
I really hoped someone from the Google team would provide an answer to this question, as it is a very frequent topic/problem asked here on SO. BigQuery is definitely not friendly enough when it comes to writing nested/repeated results back to BQ from a BQ query.
So I will provide the workaround I found a relatively long time ago. I DO NOT like it, but (and that is why I hoped for an answer from the Google team) it works. I hope you will be able to adapt it to your particular scenario.
So, based on your example, assume you have a table as below
and you expect to get the most recent records based on the createdAt column, so the result will look like:
The code below does this:
SELECT name, age, createdAt, addresses.city
FROM JS(
( // input table
SELECT name, age, createdAt, NEST(city) AS addresses
FROM (
SELECT name, age, createdAt, addresses.city
FROM (
SELECT
name, age, createdAt, addresses.city,
MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
FROM yourTable
)
WHERE createdAt = lastAt
)
GROUP BY name, age, createdAt
),
name, age, createdAt, addresses, // input columns
"[ // output schema
{'name': 'name', 'type': 'STRING'},
{'name': 'age', 'type': 'INTEGER'},
{'name': 'createdAt', 'type': 'INTEGER'},
{'name': 'addresses', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'city', 'type': 'STRING'}
]
}
]",
"function(row, emit) { // function
var c = [];
for (var i = 0; i < row.addresses.length; i++) {
c.push({city:row.addresses[i]});
};
emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
}"
)
The way the above code works is: it implicitly flattens the original records; finds the rows that belong to the most recent records (partitioned by name and age); and assembles those rows back into their respective records. The final step is processing with a JS UDF to build a proper schema that can actually be written back to a BigQuery table as nested/repeated rather than flattened.
The last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
Please note that in this example there is only one nested field inside the addresses record, so the NEST() function worked. In scenarios where you have more than one field inside, the above approach still works, but you need to concatenate those fields to put them inside NEST() and then do the extra splitting of those fields inside the JS function, etc.
You can see examples in the answers below:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
I hope this is a good foundation for you to experiment with and make your case work!
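As a hedged aside (not part of the workaround above): in today's standard SQL the same "latest row per name, nested addresses kept intact" result can usually be obtained without a JS UDF, using ARRAY_AGG(... ORDER BY ... LIMIT 1). The table reference below is a placeholder:
from google.cloud import bigquery

client = bigquery.Client()

# ARRAY_AGG over the whole row preserves the RECORD/REPEATED structure;
# ORDER BY createdAt DESC LIMIT 1 picks the latest version within each name group.
query = """
    SELECT AS VALUE ARRAY_AGG(t ORDER BY t.createdAt DESC LIMIT 1)[OFFSET(0)]
    FROM `your_project.your_dataset.yourTable` AS t
    GROUP BY name
"""
rows = client.query(query).result()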
When I pass an input field of repeated RECORD type into a BigQuery UDF, it keeps saying that the input field is not found.
These are my 2 rows of data:
{"name":"cynthia", "Persons":[ { "name":"john","age":1},{"name":"jane","age":2} ]}
{"name":"jim","Persons":[ { "name":"mary","age":1},{"name":"joe","age":2} ]}
This is the schema of the data:
[
{"name":"name","type":"string"},
{"name":"Persons","mode":"repeated","type":"RECORD",
"fields":
[
{"name": "name","type": "STRING"},
{"name": "age","type": "INTEGER"}
]
}
]
And this is the query:
SELECT
name,maxts
FROM
js
(
//input table
[dw_test.clokTest_bag],
//input columns
name, Persons,
//output schema
"[
{name: 'name', type:'string'},
{name: 'maxts', type:'string'}
]",
//function
"function(r, emit)
{
emit({name: r.name, maxts: '2'});
}"
)
LIMIT 10
Error I got when trying to run the query:
Error: 5.3 - 15.6: Undefined input field Persons
Job ID: ord2-us-dc:job_IPGQQEOo6NHGUsoVvhqLZ8pVLMQ
Would someone please help?
Thank you.
In your list of input columns, list the leaf fields directly:
//input columns
name, Persons.name, Persons.age,
They'll still appear in their proper structure when you get the records in your UDF.