Can BigQuery API overwrite existing table/view with create_table() (tables insert)? - google-bigquery

I'm using the Python client create_table() function which calls the underlying tables insert API. There is an exists_ok parameter but this causes the function to simply ignore the create if the table already exists. The problem with this is that when creating a view, I would like to overwrite the existing view SQL if it's already there. What I'm currently doing to get around this is:
if overwrite:
    bq_client.delete_table(view, not_found_ok=True)
view = bq_client.create_table(view)
What I don't like about this is that there is a window of potentially several seconds during which the view no longer exists. And if the code dies for whatever reason after the delete but before the create, then the view is effectively gone.
My question: is there a way to create a table (view) such that it overwrites any existing object? Or perhaps I have to detect this situation and run some kind of update_table() (patch)?

If you want to overwrite an existing table, you can use the google.cloud.bigquery.job.WriteDisposition class; please refer to the official documentation.
You have three possibilities here: WRITE_APPEND, WRITE_EMPTY and WRITE_TRUNCATE. What you should use is WRITE_TRUNCATE, which overwrites the table data.
You can see the following example here:
from google.cloud import bigquery
import pandas

client = bigquery.Client()
table_id = "<YOUR_PROJECT>.<YOUR_DATASET>.<YOUR_TABLE_NAME>"

records = [
    {"artist": "Michael Jackson", "birth_year": 1958},
    {"artist": "Madonna", "birth_year": 1958},
    {"artist": "Shakira", "birth_year": 1977},
    {"artist": "Taylor Swift", "birth_year": 1989},
]
dataframe = pandas.DataFrame(
    records,
    columns=["artist", "birth_year"],
    index=pandas.Index(
        ["Q2831", "Q1744", "Q34424", "Q26876"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("artist", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    write_disposition="WRITE_TRUNCATE",  # overwrite the table data on each load
)
job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
job.result()  # wait for the load job to complete
table = client.get_table(table_id)
Let me know if it suits your need. I hope it helps.
UPDATED:
You can use the following Python code to update a view's SQL in place using the client library:
client = bigquery.Client(project="projectName")
table_ref = client.dataset('datasetName').table('tableViewName')
table = client.get_table(table_ref)  # fetch the existing view
table.view_query = "SELECT * FROM `projectName.dataset.sourceTableName`"
table = client.update_table(table, ['view_query'])  # patch only the view SQL
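If you want to avoid the delete/create window entirely, one option is to try the create first and fall back to update_table() when the view already exists. This is a minimal sketch (the project/dataset/view names are the placeholder names from the snippets above), so the view is either created or patched, never missing:
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client()

view = bigquery.Table("projectName.datasetName.tableViewName")  # placeholder view ID
view.view_query = "SELECT * FROM `projectName.dataset.sourceTableName`"

try:
    # Create the view if it does not exist yet.
    client.create_table(view)
except Conflict:
    # The view already exists: patch its SQL in place instead of deleting it,
    # so there is no window during which the view is gone.
    client.update_table(view, ["view_query"])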

You can do it this way.
Hope this helps!
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

clientBQ = bigquery.Client()

def tableExists(tableID, client=clientBQ):
    """
    Check if a table already exists using the tableID.
    return : (Boolean)
    """
    try:
        client.get_table(tableID)
        return True
    except NotFound:
        return False

if tableExists(viewID, client=clientBQ):
    print("View already exists, deleting the view ... ")
    clientBQ.delete_table(viewID)

view = bigquery.Table(viewID)
view.view_query = "SELECT * FROM `PROJECT_ID.DATASET_NAME.TABLE_NAME`"
clientBQ.create_table(view)

Related

Schema update options should only be specified with WRITE_APPEND disposition, or with WRITE_TRUNCATE disposition on a table partition

I have a Cloud Function that loads a dataframe to BigQuery and I want to allow field additions.
My job_config is as follows, but I always get an error saying that ALLOW_FIELD_ADDITION
is only available with WRITE_APPEND, or with WRITE_TRUNCATE on a table partition. The destination table is already partitioned by date, so I don't really know how to get this to work.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("fecha", bigquery.enums.SqlTypeNames.DATE)],
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    time_partitioning=bigquery.table.TimePartitioning(field="fecha"),
    schema_update_options='ALLOW_FIELD_ADDITION'
)
There is no need to pass schema_update_options='ALLOW_FIELD_ADDITION', since write_disposition="WRITE_TRUNCATE" will handle the schema changes.
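For reference, a minimal sketch of the same config with that option removed (based on the snippet above):
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("fecha", bigquery.enums.SqlTypeNames.DATE)],
    # Per the answer above, WRITE_TRUNCATE handles the schema changes,
    # so no schema_update_options are passed.
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    time_partitioning=bigquery.table.TimePartitioning(field="fecha"),
)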
Thank you all!

Need to add a query tag to Snowflake queries while I fetch data from Python (using ThreadPool, code is provided)

I am using SQLAlchemy (via create_engine) to connect Python to Snowflake for fetching data; a snippet of how I am doing it is below. Before you suggest the Snowflake connector: I have tried that, and it supports a query tag, but I need to pull queries with the thread pool method and couldn't find a way to add a query tag that way.
I have also tried ALTER SESSION SET QUERY_TAG, but since the queries run in parallel it doesn't apply the query tag.
Code:
from multiprocessing.pool import ThreadPool

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

vendor_class_query = 'select * from table'
query_list1 = [vendor_class_query]
pool = ThreadPool(8)

def query(x):
    engine = create_engine(
        'snowflake://{user}:{password}@{account}/{database_name}/{schema_name}?'
        'warehouse={warehouse}&role={role}&paramstyle={paramstyle}'.format(
            user=---------,
            password=----------,
            account=----------,
            database_name=----------,
            schema_name=----------,
            warehouse=----------,
            role=----------,
            paramstyle='pyformat'
        ),
        poolclass=NullPool
    )
    try:
        connection = engine.connect()
        for df in pd.read_sql_query(x, engine, chunksize=1000000000):
            df.columns = map(str.upper, df.columns)
            return df
    finally:
        connection.close()
        engine.dispose()

results1 = pool.map(query, query_list1)
vendor_class = results1[0]
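One possibility worth trying (a sketch, assuming snowflake-sqlalchemy forwards connect_args to the underlying Snowflake connector, whose session_parameters option accepts QUERY_TAG) is to set the tag at connect time, so every session opened by a worker thread is tagged without a separate ALTER SESSION call. Here snowflake_url and the 'threadpool_fetch' tag are placeholders:
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# snowflake_url is the same connection URL built in the snippet above.
engine = create_engine(
    snowflake_url,
    poolclass=NullPool,
    # Assumption: connect_args is passed through to snowflake.connector.connect(),
    # which accepts a session_parameters dict; QUERY_TAG then tags every query
    # run on connections opened by this engine.
    connect_args={'session_parameters': {'QUERY_TAG': 'threadpool_fetch'}},
)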

Is it possible to change the delimiter of AWS athena output file

Here is my sample code where I create a file in an S3 bucket using AWS Athena. The file is in CSV format by default. Is there a way to change it to a pipe delimiter?
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client = boto3.client('athena')
    # Start Query Execution
    response = client.start_query_execution(
        QueryString="""
            select * from srvgrp
            where category_code = 'ACOMNCDU'
        """,
        QueryExecutionContext={
            'Database': 'tmp_db'
        },
        ResultConfiguration={
            'OutputLocation': 's3://tmp-results/athena/'
        }
    )
    queryId = response['QueryExecutionId']
    print('Query id is :' + str(queryId))
There is a way to do that with a CTAS query.
BUT:
This is a hacky way and not what CTAS queries are meant to be used for, since it will also create a new table definition in the AWS Glue Data Catalog.
I'm not sure about the performance.
CREATE TABLE "UNIQU_PREFIX__new_table"
WITH (
format = 'TEXTFILE',
external_location = 's3://tmp-results/athena/__SOMETHING_UNIQUE__',
field_delimiter = '|',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1
) AS
SELECT *
FROM srvgrp
WHERE category_code = 'ACOMNCDU'
Note:
It is important to set bucket_count = 1, otherwise Athena will create multiple files.
The table name in CREATE TABLE ... should also be unique, e.g. use a timestamp prefix/suffix that you can inject at Python runtime.
The external location should be unique as well, e.g. use a timestamp prefix/suffix that you can inject at Python runtime. I would advise embedding the table name into the S3 path, as in the sketch below.
You need to include in bucketed_by only one of the columns from the SELECT.
At some point you will need to clean up the AWS Glue Data Catalog of all the table definitions that were created this way.
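Putting those notes together, a minimal sketch of injecting the unique name and location at runtime could look like this (the table name prefix and S3 paths are placeholders; it reuses the boto3 start_query_execution call from the question):
import time
import boto3

athena = boto3.client('athena')

# Inject a unique suffix at runtime so both the table name and the
# external location are unique per run.
suffix = time.strftime('%Y%m%d_%H%M%S')
table_name = 'tmp_pipe_export_{}'.format(suffix)                 # placeholder temp table name
external_location = 's3://tmp-results/athena/{}/'.format(table_name)  # table name embedded in the S3 path

ctas = """
    CREATE TABLE "{table_name}"
    WITH (
        format = 'TEXTFILE',
        external_location = '{external_location}',
        field_delimiter = '|',
        bucketed_by = ARRAY['category_code'],
        bucket_count = 1
    ) AS
    SELECT *
    FROM srvgrp
    WHERE category_code = 'ACOMNCDU'
""".format(table_name=table_name, external_location=external_location)

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'tmp_db'},
    ResultConfiguration={'OutputLocation': 's3://tmp-results/athena/'}
)
print(response['QueryExecutionId'])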

Querying DynamoDB table using Global Secondary Indexes

I was trying to query a DynamoDB table using a Lambda function.
My table's partition key is id. I am trying to query it on another attribute named dipl_idpp, and I understood that this is not possible directly.
I found a solution here: I need to create a Global Secondary Index pointing at the column I want to query on (in my case dipl_idpp).
I did that in DynamoDB, but when I execute my function I still get the same problem:
An error occurred (ValidationException) when calling the Query operation: Query condition missed key schema element: id', 'occurred at index 0')"
This is the code I use:
import boto3
import pandas as pd
from boto3.dynamodb.conditions import Key

def query_dipl_dynamo(key_table, valeur_query, name_table):
    dynamoDBResource = boto3.resource('dynamodb')
    table = dynamoDBResource.Table(name_table)
    response = table.query(
        KeyConditionExpression=Key(key_table).eq(valeur_query))
    df_fr = pd.DataFrame([response['Items']])
    if len(df_fr.columns) > 0:
        print("hellooo1")
        df = pd.DataFrame([response['Items'][0]])
        return valeur_query, df["dipl_libelle"].iloc[0]

# ...

df9_tmp["dipl_idpp"] = df8_tmp.apply(
    lambda x: query_dipl_dynamo("dipl_idpp", x["num_auto"], "ddb-dev-PS_LibreAcces_Dipl_AutExerc")[0],
    axis=1)
Should I change something else besides creating the index? There is too little documentation available.
Thank you!
I just found the solution. When we use indexes, we must provide
an argument named IndexName, which takes the name of the index in DynamoDB.
I had to change my code to:
def query_dipl_dynamo(key_table, valeur_query, name_table):
    dynamoDBResource = boto3.resource('dynamodb')
    table = dynamoDBResource.Table(name_table)
    response = table.query(
        IndexName="NameOfTheIndexInDynamoDB",
        KeyConditionExpression=Key(key_table).eq(valeur_query))
    df_fr = pd.DataFrame([response['Items']])
    if len(df_fr.columns) > 0:
        df = pd.DataFrame([response['Items'][0]])
        return valeur_query, df["dipl_libelle"].iloc[0]

# ...

df9_tmp["dipl_idpp"] = df8_tmp.apply(
    lambda x: query_dipl_dynamo("dipl_idpp", x["num_auto"], "ddb-dev-PS_LibreAcces_Dipl_AutExerc")[0],
    axis=1)
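Beyond passing IndexName, the index itself does have to exist on the table. If you ever need to create it from code rather than the console, here is a rough sketch with boto3 (the index name reuses the placeholder above, and on-demand capacity is assumed, so no ProvisionedThroughput is passed):
import boto3

dynamodb_client = boto3.client('dynamodb')

# Sketch: add a GSI on dipl_idpp to an existing table.
dynamodb_client.update_table(
    TableName="ddb-dev-PS_LibreAcces_Dipl_AutExerc",
    AttributeDefinitions=[
        {'AttributeName': 'dipl_idpp', 'AttributeType': 'S'},  # assumes a string attribute
    ],
    GlobalSecondaryIndexUpdates=[
        {
            'Create': {
                'IndexName': 'NameOfTheIndexInDynamoDB',
                'KeySchema': [
                    {'AttributeName': 'dipl_idpp', 'KeyType': 'HASH'},
                ],
                'Projection': {'ProjectionType': 'ALL'},
            }
        }
    ],
)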

Scio saveAsTypedBigQuery write to a partition for SCollection of Typed Big Query case class

I'm trying to write an SCollection to a partition in BigQuery using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
  tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
  writeDisposition = WriteDisposition.WRITE_EMPTY,
  createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used."
How can I write to a partition? I don't see any option to specify partitions via the saveAsTypedBigQuery method, so I was trying the legacy SQL table decorators.
See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to create the table manually; BigQuery IO cannot create the table and partition it for you.
Additionally, the "table decorators cannot be used" part was a red herring. It was the alphanumeric requirement I was missing: ISO_LOCAL_DATE formats the date with dashes (2017-06-21), while BASIC_ISO_DATE produces 20170621, which is what the $ partition decorator expects.
col.saveAsTypedBigQuery(
  tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
  writeDisposition = WriteDisposition.WRITE_APPEND,
  createDisposition = CreateDisposition.CREATE_NEVER)