I have multiple JSON files. The files have two nested fields. The files are generated daily, so I need to perform daily insert and update operations on the BigQuery table. I have shared the table schema in the image.
How do I perform update operations on nested fields?
A little late, but in case someone else is searching.
If you can use Standard SQL:
INSERT INTO your_table (optout_time, clicks, profile_id, opens, ... )
VALUES (
1552297347,
[
STRUCT(1539245347 as ts, 'url1' as url),
STRUCT(1539245341 as ts, 'url2' as url)
],
'whatever',
[
STRUCT(1539245347 as ts),
STRUCT(1539245341 as ts)
],
...
)
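For the update part of the question: nested and repeated fields can be rewritten with standard SQL DML by rebuilding the array with ARRAY(SELECT AS STRUCT ...). A rough, untested sketch run through the Python client; the project, dataset, table and value names here are placeholders based on the schema above, adjust them to yours:
from google.cloud import bigquery

client = bigquery.Client()

# Rewrite the `clicks` repeated record for one profile: keep each element's
# timestamp but replace the url (placeholder names, adapt to your schema).
dml = """
UPDATE `your_project.your_dataset.your_table`
SET clicks = ARRAY(
      SELECT AS STRUCT c.ts, 'new_url' AS url
      FROM UNNEST(clicks) AS c
    )
WHERE profile_id = 'whatever'
"""
client.query(dml).result()  # blocks until the DML job finishes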
The BigQuery UI only lets you import JSON files to create new tables. So, to stream the content of the files into already existing BigQuery tables, you can write a small program in your favorite programming language using the client library.
I am going to assume you have your data as line-delimited JSONs looking like this:
{"optout_time": 1552297349, "clicks": {"ts": 1539245349, "url": "www.google.com"}, "profile_id": "foo", ...}
{"optout_time": 1532242949, "clicks": {"ts": 1530247349, "url": "www.duckduckgo.com"}, "profile_id": "bar", ...}
A Python script to do the job would look like this. It takes the JSON file names as command-line arguments:
import json
import sys
from google.cloud import bigquery
dataset_id = "<DATASET-ID>" # the ID of your dataset
table_id = "<TABLE-ID>" # the ID of your table
client = bigquery.Client()
table_ref = client.dataset(dataset_id).table(table_id)
table = client.get_table(table_ref)
for f in sys.argv[1:]:
    with open(f) as fh:
        data = [json.loads(x) for x in fh]
    client.insert_rows_json(table, data)
The nesting is taken care of automatically.
For pointers on how this sort of operation looks in other languages, you can take a look at this documentation.
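One addition worth making: insert_rows_json does not raise on bad rows, it returns a list of per-row errors (empty on success), so it pays to check that return value. A small, hedged extension of the loop above:
for f in sys.argv[1:]:
    with open(f) as fh:
        data = [json.loads(x) for x in fh]
    errors = client.insert_rows_json(table, data)  # [] means every row was accepted
    if errors:
        # each entry names the failing row index and the reason it was rejected
        print("Errors while inserting rows from {}: {}".format(f, errors))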
Here is my sample code where I create a file in an S3 bucket using AWS Athena. The file is in CSV format by default. Is there a way to change it to a pipe delimiter?
import json
import boto3
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client = boto3.client('athena')
    # Start Query Execution
    response = client.start_query_execution(
        QueryString="""
            select * from srvgrp
            where category_code = 'ACOMNCDU'
        """,
        QueryExecutionContext={
            'Database': 'tmp_db'
        },
        ResultConfiguration={
            'OutputLocation': 's3://tmp-results/athena/'
        }
    )
    queryId = response['QueryExecutionId']
    print('Query id is :' + str(queryId))
There is a way to do that with a CTAS query.
BUT:
This is a hacky way and not what CTAS queries are supposed to be used for, since it will also create a new table definition in AWS Glue Data Catalog.
I'm not sure about the performance.
CREATE TABLE "UNIQU_PREFIX__new_table"
WITH (
format = 'TEXTFILE',
external_location = 's3://tmp-results/athena/__SOMETHING_UNIQUE__',
field_delimiter = '|',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1
) AS
SELECT *
FROM srvgrp
WHERE category_code = 'ACOMNCDU'
Note:
It is important to set bucket_count = 1, otherwise Athena will create multiple files.
The name of the table in CREATE TABLE ... should also be unique, e.g. use a timestamp prefix/suffix, which you can inject at Python runtime.
The external location should be unique as well, e.g. use a timestamp prefix/suffix injected at Python runtime; I would advise embedding the table name in the S3 path (see the sketch after these notes).
In bucketed_by you need to include only one of the columns from the SELECT.
At some point you will need to clean up the AWS Glue Data Catalog of all the table definitions that were created this way.
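To make the "inject at Python runtime" part concrete, here is a rough sketch of how the Lambda above could build a unique table name and external location for the CTAS query, and later drop the Glue table definition again. The bucket, database and column names are placeholders:
import time
import boto3

athena = boto3.client('athena')
glue = boto3.client('glue')

suffix = str(int(time.time()))                     # unique per run
table_name = "tmp_pipe_export_" + suffix           # unique CTAS table name
external_location = "s3://tmp-results/athena/{}/".format(table_name)

ctas = """
CREATE TABLE {table}
WITH (
  format = 'TEXTFILE',
  external_location = '{location}',
  field_delimiter = '|',
  bucketed_by = ARRAY['category_code'],
  bucket_count = 1
) AS
SELECT * FROM srvgrp WHERE category_code = 'ACOMNCDU'
""".format(table=table_name, location=external_location)

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'tmp_db'},
    ResultConfiguration={'OutputLocation': 's3://tmp-results/athena/'}
)

# later, once the query has finished: drop only the Glue catalog entry;
# the pipe-delimited files stay in S3 under external_location
glue.delete_table(DatabaseName='tmp_db', Name=table_name)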
I'm using Databricks Notebooks to read avro files stored in an Azure Data Lake Gen2. The avro files are created by an Event Hub Capture, and present a specific schema. From these files I have to extract only the Body field, where the data which I'm interested in is actually stored.
I already implemented this in Python and it works as expected:
path = 'abfss://file_system#storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path) # 1
df1 = df0.select(df0.Body.cast('string')) # 2
rdd1 = df1.rdd.map(lambda x: x[0]) # 3
data = spark.read.json(rdd1) # 4
Now I need to translate this to raw SQL in order to filter the data directly in the SQL query. Considering the 4 steps above, steps 1 and 2 with SQL are as follows:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system#storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro")
WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro)
SELECT * FROM body_array
With this partial query I get the same as df1 above (step 2 with Python):
Body
[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},
{"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"},
{"id":"a123","group":"0","value":2.3,"timestamp":"2020-01-01T00:02:00.0000000"},
{"id":"a123","group":"0","value":1.8,"timestamp":"2020-01-01T00:03:00.0000000"}]
[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"},
{"id":"b123","group":"0","value":1.2,"timestamp":"2020-01-01T00:01:01.0000000"},
{"id":"b123","group":"0","value":2.1,"timestamp":"2020-01-01T00:02:01.0000000"},
{"id":"b123","group":"0","value":1.7,"timestamp":"2020-01-01T00:03:01.0000000"}]
...
I need to know how to introduce steps 3 and 4 into the SQL query, to parse the strings into JSON objects and finally get the desired dataframe with columns id, group, value and timestamp. Thanks.
One way I found to do this with raw SQL is as follows, using the from_json Spark SQL built-in function and the schema of the Body field:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system#storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro")
WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro),
data1 AS (SELECT from_json(Body, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') FROM body_array),
data2 AS (SELECT explode(*) FROM data1),
data3 AS (SELECT col.* FROM data2)
SELECT * FROM data3 WHERE id = "a123" --FILTERING BY CHANNEL ID
It performs faster than the Python code I posted in the question, surely because of the use of from_json and the schema of Body to extract the data inside it. My version of this approach in PySpark looks as follows:
path = 'abfss://file_system#storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path)
df1 = df0.selectExpr("cast(Body as string) as json_data")
df2 = df1.selectExpr("from_json(json_data, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') as parsed_json")
data = df2.selectExpr("explode(parsed_json) as json").select("json.*")
I'm using the Python client's create_table() function, which calls the underlying tables.insert API. There is an exists_ok parameter, but this causes the function to simply ignore the create if the table already exists. The problem with this is that when creating a view, I would like to overwrite the existing view SQL if it's already there. What I'm currently doing to get around this is:
if overwrite:
    bq_client.delete_table(view, not_found_ok=True)
view = bq_client.create_table(view)
What I don't like about this is there are potentially several seconds during which the view no longer exists. And if the code dies for whatever reason after the delete but before the create then the view is effectively gone.
My question: is there a way to create a table (view) such that it overwrites any existing object? Or perhaps I have to detect this situation and run some kind of update_table() (patch)?
If you want to overwrite an existing table, you can use the google.cloud.bigquery.job.WriteDisposition class; please refer to the official documentation.
You have three possibilities here: WRITE_APPEND, WRITE_EMPTY and WRITE_TRUNCATE. What you should use is WRITE_TRUNCATE, which overwrites the table data.
You can see the following example:
from google.cloud import bigquery
import pandas
client = bigquery.Client()
table_id = "<YOUR_PROJECT>.<YOUR_DATASET>.<YOUR_TABLE_NAME>"
records = [
    {"artist": u"Michael Jackson", "birth_year": 1958},
    {"artist": u"Madonna", "birth_year": 1958},
    {"artist": u"Shakira", "birth_year": 1977},
    {"artist": u"Taylor Swift", "birth_year": 1989},
]
dataframe = pandas.DataFrame(
    records,
    columns=["artist", "birth_year"],
    index=pandas.Index(
        [u"Q2831", u"Q1744", u"Q34424", u"Q26876"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("artist", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
job.result()
table = client.get_table(table_id)
Let me know if it suits your need. I hope it helps.
UPDATED:
You can use the following Python code to update a view using the client library:
from google.cloud import bigquery

client = bigquery.Client(project="projectName")
table_ref = client.dataset('datasetName').table('tableViewName')
table = client.get_table(table_ref)
table.view_query = "SELECT * FROM `projectName.dataset.sourceTableName`"
table = client.update_table(table, ['view_query'])
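If the aim is to avoid the delete/create gap described in the question altogether, the same two calls can be combined: try to create the view and fall back to update_table when it already exists. A rough sketch along those lines; the project, dataset and view names are placeholders:
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client(project="projectName")

view = bigquery.Table("projectName.datasetName.tableViewName")
view.view_query = "SELECT * FROM `projectName.dataset.sourceTableName`"

try:
    client.create_table(view)                  # only succeeds if the view does not exist yet
except Conflict:
    client.update_table(view, ["view_query"])  # view exists: replace just its SQL in place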
You can do it this way.
Hope this helps!
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

clientBQ = bigquery.Client()

def tableExists(tableID, client=clientBQ):
    """
    Check if a table already exists using the tableID.
    return : (Boolean)
    """
    try:
        client.get_table(tableID)
        return True
    except NotFound:
        return False

if tableExists(viewID, client=clientBQ):  # viewID: fully qualified "project.dataset.view" string
    print("View already exists, deleting the view ... ")
    clientBQ.delete_table(viewID)

view = bigquery.Table(viewID)
view.view_query = "SELECT * FROM `PROJECT_ID.DATASET_NAME.TABLE_NAME`"
clientBQ.create_table(view)
We need to monitor table sizes in different environments.
Use the Google metadata API to get the information for a given project/environment.
We need to create a view which will provide:
1. All the datasets
2. The tables in each dataset
3. Table sizes
4. Dataset size
BigQuery already has such views built in for you: INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, tables, and views.
For example, the query below returns metadata for all datasets in the default project:
SELECT * FROM INFORMATION_SCHEMA.SCHEMATA
or
for my_project
SELECT * FROM my_project.INFORMATION_SCHEMA.SCHEMATA
There are other such views for tables as well.
In addition, there are meta tables that can be used to get more info about the tables in a given dataset: __TABLES__SUMMARY and __TABLES__
SELECT * FROM `project.dataset.__TABLES__`
For example:
SELECT table_id,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date,
row_count,
size_bytes,
CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type,
TIMESTAMP_MILLIS(creation_time) AS creation_time,
TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
dataset_id,
project_id
FROM `project.dataset.__TABLES__`
To automate the query so that it checks every dataset in the project, instead of adding them manually with UNION ALL, you can follow the advice given by #ZinkyZinky here and create a query that generates the UNION ALL calls for every dataset.__TABLES__. I have not managed to make this fully automatic within BigQuery, because I do not see a way to execute a query that has been generated as a string (which is what string_agg creates). However, I have managed to develop the solution in Python, feeding the generated string into the next query. You can find the code below; it also creates a new table and stores the results there:
from google.cloud import bigquery
client = bigquery.Client()
project_id = "wave27-sellbytel-bobeda"
# Construct a full Dataset object to send to the API.
dataset_id = "project_info"
dataset = bigquery.Dataset(".".join([project_id, dataset_id]))
dataset.location = "US"
# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.
dataset = client.create_dataset(dataset) # API request
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
schema = [
    bigquery.SchemaField("dataset_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("table_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("size_bytes", "INTEGER", mode="REQUIRED"),
]
table_id = "table_info"
table = bigquery.Table(".".join([project_id, dataset_id, table_id]), schema=schema)
table = client.create_table(table)  # API request
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table(table_id)
job_config.destination = table_ref
# QUERIES
# 1. Creating the UNION ALL list with the table information of each dataset
query = (
    r"SELECT string_agg(concat('SELECT * from `', schema_name, '.__TABLES__` '), 'union all \n') "
    r"from INFORMATION_SCHEMA.SCHEMATA"
)
query_job = client.query(query, location="US")  # API request - starts the query

select_tables_from_all_datasets = ""
for row in query_job:
    select_tables_from_all_datasets += row[0]

# 2. Using the list generated above to create a table.
query = (
    "WITH ALL__TABLES__ AS ({}) "
    "SELECT dataset_id, table_id, size_bytes FROM ALL__TABLES__;".format(select_tables_from_all_datasets)
)
query_job = client.query(query, location="US", job_config=job_config)  # job_config sets the table in which the results are stored
for row in query_job:
    print(row)

print('Query results loaded to table {}'.format(table_ref.path))
I have a CSV (tab separated) in s3 that needs to be queried on a JSON field.
uid\tname\taddress
1\tmoorthi\t{"rno":123,"code":400111}
2\tkiranp\t{"rno":124,"street":"kemp road"}
How can I query this data in Amazon Athena?
I should be able to query like:
select uid
from table1
where address['street']="kemp road";
You could try using the json_extract() command.
From Extracting Data from JSON - Amazon Athena:
You may have source data containing JSON-encoded strings that you do not necessarily want to deserialize into a table in Athena. In this case, you can still run SQL operations on this data, using the JSON functions available in Presto.
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset
This example shows how json_extract() can be used to extract fields from JSON. For comparing against a plain string, json_extract_scalar() is more convenient, since it returns a varchar rather than a JSON value (and note that string literals in Athena use single quotes). So you might be able to do something like:
select uid
from table1
where json_extract_scalar(address, '$.street') = 'kemp road';
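For this to work, table1 has to be declared over the tab-separated file with address kept as a plain string column, so the JSON functions can parse it at query time. A rough sketch of that setup driven from boto3; the bucket, database and paths are made up:
import boto3

athena = boto3.client('athena')

# Hypothetical table definition over the tab-separated file; `address`
# stays a plain string so json_extract_scalar() can parse it at query time.
ddl = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
  uid string,
  name string,
  address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-bucket/path-to-tsv/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={'Database': 'your_db'},
    ResultConfiguration={'OutputLocation': 's3://your-bucket/athena-results/'}
)
# once the DDL has completed, the json_extract_scalar() query above can be
# submitted the same way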