BigQuery scheduled queries - Create table and add date suffix to its name in a different project

According to the documentation (https://cloud.google.com/bigquery/docs/scheduling-queries#destination_table), the same project must be used when defining a destination table for a scheduled query.
However, I'd like to schedule a query that can write tables to other projects during its steps (e.g., CREATE TABLE xxx.dataset.name_{run_date}) and preserve {run_date} as a suffix. Is it possible to do that in BigQuery?

This is a limitation of the BigQuery UI. A possible workaround is to use the BigQuery Python client library, as shown in the code below:
from google.cloud import bigquery_datatransfer
from datetime import date

today = date.today()  # used to replicate the run_date parameter
str_today = str(today).replace("-", "")

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# The project where the query job runs is the same as the project
# containing the destination dataset.
project_id = "your-project-id"
dataset_id = "your-source-dataset-id"

# This service account will be used to execute the scheduled queries. Omit
# this request parameter to run the query as the user with the credentials
# associated with this client.
service_account_name = "your-service-account"

# Use standard SQL syntax for the query.
query_string = (
    f"CREATE TABLE `destination-project.destination-dataset.new_table_{str_today}` AS "
    "(SELECT column FROM `source-project.source-dataset.source-table`);"
)

parent = "projects/your-project-id/locations/us-central1"  # change location accordingly

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="Test Schedule",
    data_source_id="scheduled_query",
    params={
        "query": query_string,
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    bigquery_datatransfer.CreateTransferConfigRequest(
        parent=parent,
        transfer_config=transfer_config,
        service_account_name=service_account_name,
    )
)

print("Created scheduled query '{}'".format(transfer_config.name))

Related

How can I pass a date to a sql file when executing a query in Airflow

I am using BigQueryExecuteQueryOperator in Airflow. I have my SQL in a separate file, and I am passing params to that file, specifically folder names and dates. When I pass a date, though, I get an error, because the date is passed into the SQL file without any quotes.
Error
400 No matching signature for function DATE for argument types
From the Airflow logs, the param is sent to the SQL file with no quotes around it:
DATE(2020-01-01)
Ways I have tried to define the date
partitionFilter = datetime.today().strftime('%Y-%m-%d')
partitionFilter = str(datetime.today()).split()[0]
partitionFilter = '2020-01-01'
partitionFilter = datetime.now().strftime('%Y-%m-%d')
Fix that seems to work
In the SQL file, if I wrap the passed param in quotes, SQL is able to read it as a date. This seems a little hacky, and I am wondering whether there is a better way. What I am concerned about is that as this application grows, the date may be passed a different way and include quotes, which would not work since the quotes are hard-coded in the SQL file.
This works in the SQL file if I wrap the param in quotes:
DATE('{{params.partitionFilter}}')
This does not work in the SQL file, presumably because the unquoted value 2020-01-01 is evaluated as integer arithmetic rather than a date literal:
DATE( {{params.partitionFilter}} )
Excerpt from the DAG:
partitionFilter = datetime.today().strftime('%Y-%m-%d')

with DAG(
    dag_id,
    schedule_interval='@daily',
    start_date=days_ago(1),
    catchup=False,
    user_defined_macros={"DATASET": DATASET_NAME},
) as dag:

    query_one = BigQueryExecuteQueryOperator(
        task_id='query_one',
        sql='/sql/my-file.sql',
        use_legacy_sql=False,
        params={'partitionFilter': partitionFilter},
    )
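One possible alternative, not from the original post, is to avoid quoting in the SQL file altogether by sending the date as a typed BigQuery query parameter through the operator's query_params argument and referencing it as @partitionFilter in the SQL. A rough sketch, assuming your Google provider version supports query_params (it requires use_legacy_sql=False):

from datetime import datetime

from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

partitionFilter = datetime.today().strftime('%Y-%m-%d')

query_one = BigQueryExecuteQueryOperator(
    task_id='query_one',
    sql='/sql/my-file.sql',   # the file would contain e.g. WHERE some_date_col = @partitionFilter
    use_legacy_sql=False,     # query parameters only work with standard SQL
    query_params=[{
        'name': 'partitionFilter',
        'parameterType': {'type': 'DATE'},
        'parameterValue': {'value': partitionFilter},
    }],
)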

Query from dynamic project+dataset+table names in Google BigQuery

I need to execute a single query over all my projects in BigQuery. The list of projects may increase every day, so I need to do this job dynamically. All tables I need to query share the same schema, but each table is in a different project with different dataset names.
I thought about creating a table that stores every project.dataset.table I need to query, and then running a query whose FROM clause takes its locations from that table.
But I don't know how to do that, or whether there is another solution I could implement...
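A minimal sketch of the idea described above, assuming a single set of credentials can read all of the listed projects and that a control table holds the fully qualified table names (all names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Read the control table that stores the `project.dataset.table` strings.
control_query = "SELECT table_path FROM `admin-project.admin_dataset.tables_to_query`"
table_paths = [row.table_path for row in client.query(control_query).result()]

# Build a single UNION ALL query over all tables (they all share the same schema).
union_sql = "\nUNION ALL\n".join(f"SELECT * FROM `{path}`" for path in table_paths)

for row in client.query(union_sql).result():
    print(row)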
If you are running queries on multiple accounts, you have to somehow be explicit about specifying those accounts and their credentials in some centralized location.
Assuming you can create independent Service Account JSONs for each of those accounts, then you can simply have a local script that can do the job for you. In general, all that script really needs to do is to go over accounts and reset the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the specific account before it runs the query.
For instance, if you use Python, then something roughly along these lines:
import os
from google.cloud import bigquery

accounts = [
    {
        "account_name": "xyz",
        "credentials_json": "/path/to/xyz/credentials.json",
        "dataset_name": "dataset",
        "table_name": "table_name"
    },
    {
        "account_name": "xyz",
        "credentials_json": "/path/to/xyz/credentials.json",
        "dataset_name": "dataset",
        "table_name": "table_name"
    }
]

generic_query = '''
select * from `{dataset_name}.{table_name}` where 1=1;
'''

def worker(account_info):
    '''
    Worker function that takes an account_info dict and runs the query.
    '''
    # Set the credentials file env variable based on the account info.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = account_info.get("credentials_json")
    client = bigquery.Client()
    query = generic_query.format(
        dataset_name=account_info.get("dataset_name"),
        table_name=account_info.get("table_name"),
    )
    query_job = client.query(query)
    rows = query_job.result()
    for row in rows:
        print(account_info.get("account_name"), row)
    return

if __name__ == "__main__":
    # Run through the accounts and submit each one to the worker.
    while accounts:
        account_info = accounts.pop(0)
        worker(account_info)
Hope it helps.
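A small variation on the same idea (my assumption, not part of the original answer) is to hand each service-account key to the client explicitly instead of mutating the GOOGLE_APPLICATION_CREDENTIALS environment variable:

from google.cloud import bigquery
from google.oauth2 import service_account

def make_client(credentials_json):
    # Build a client bound to one account's service-account key file.
    credentials = service_account.Credentials.from_service_account_file(credentials_json)
    return bigquery.Client(credentials=credentials, project=credentials.project_id)

worker() could then call make_client(account_info.get("credentials_json")) instead of touching os.environ.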

BigQuery - loading JSON with more than 10K columns

I would like to use the BQ API to load 10K columns out of a JSON file that has more than 10K columns (the BQ limit) in it.
Can I use the BQ code to extract the first 10K columns? This is the code that I found online that uses autodetect schema, but I couldn't find anything to select columns.
Any advice to achieve this goal is appreciated.
Thanks,
eilalan
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json'
load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request
assert load_job.job_type == 'load'
load_job.result() # Waits for table load to complete.
assert load_job.state == 'DONE'
assert client.get_table(dataset_ref.table('us_states')).num_rows == 50
Load jobs do not support selecting specific columns. Instead, you can load your file into a table with just one column of type STRING and then use a query to extract the needed columns and write them into a final table.
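A rough sketch of that workaround; loading the JSON lines through the CSV format with an unlikely delimiter is my assumption, not something stated in the answer, and all paths and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Load each JSON line as a single STRING value.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=u'\u00ff',  # a delimiter that should never appear in the data
    quote_character='',         # keep the raw JSON line intact
    schema=[bigquery.SchemaField('raw', 'STRING')],
)
client.load_table_from_uri(
    'gs://your-bucket/wide_file.json',
    'your-project.your_dataset.raw_json',
    job_config=job_config,
).result()

# Then pull out only the columns you need with JSON functions.
client.query("""
CREATE OR REPLACE TABLE your_dataset.final_table AS
SELECT
  JSON_EXTRACT_SCALAR(raw, '$.col_1') AS col_1,
  JSON_EXTRACT_SCALAR(raw, '$.col_2') AS col_2
FROM your_dataset.raw_json
""").result()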
You cannot select specific columns if you use auto-detect, but if you can provide the schema yourself, you can use the ignoreUnknownValues option to let BigQuery ignore columns not in the schema, which basically means loading the specified columns only.
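A minimal sketch of that second approach, with placeholder field names standing in for the ~10K columns you actually want:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField('col_1', 'STRING'),
        bigquery.SchemaField('col_2', 'INT64'),
        # ... list only the columns you want to keep
    ],
    ignore_unknown_values=True,  # JSON keys not in the schema are silently dropped
)
client.load_table_from_uri(
    'gs://your-bucket/wide_file.json',
    'your-project.your_dataset.selected_columns',
    job_config=job_config,
).result()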

Google API to get data from BigQuery Table

I am trying to get data from a BigQuery table with Python. I am aware that the BigQuery Connector is available and I can export the table using that. However, I don't want to involve GCS (Google Cloud Storage), and that's where things get tricky.
I can see that there are a few API calls through which I can get the whole table's data.
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list
And another way is I can query the BigQuery Table.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query
However, I am not able to understand how exactly I need to call those APIs using Python or Java.
How do I create a client, and how do I authenticate?
As mentioned by @GrahamPolley, you can follow the documentation, which explains how to:
Authenticate:
To run the client library, you must first set up authentication by
creating a service account and setting an environment variable.
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/auth/[FILE_NAME].json"
Create a client (this snippet from the docs is C#):
BigQueryClient client = BigQueryClient.Create(projectId);
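Since the question asks about Python, the rough Python equivalent (assuming the environment variable above is set, or that you point the client at a key file directly) would be:

from google.cloud import bigquery

# Picks up credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client()

# Or load a service-account key file explicitly.
client = bigquery.Client.from_service_account_json('/path/to/auth/key.json')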
And for browsing data in a selected table, you can use this example from the official library documentation:
# from google.cloud import bigquery
# client = bigquery.Client()
dataset_ref = client.dataset('samples', project='bigquery-public-data')
table_ref = dataset_ref.table('shakespeare')
table = client.get_table(table_ref) # API call
# Load all rows from a table
rows = client.list_rows(table)
assert len(list(rows)) == table.num_rows
# Load the first 10 rows
rows = client.list_rows(table, max_results=10)
assert len(list(rows)) == 10
# Specify selected fields to limit the results to certain columns
fields = table.schema[:2] # first two columns
rows = client.list_rows(table, selected_fields=fields, max_results=10)
assert len(rows.schema) == 2
assert len(list(rows)) == 10
# Use the start index to load an arbitrary portion of the table
rows = client.list_rows(table, start_index=10, max_results=10)
# Print row data in tabular format
format_string = '{!s:<16} ' * len(rows.schema)
field_names = [field.name for field in rows.schema]
print(format_string.format(*field_names)) # prints column headers
for row in rows:
    print(format_string.format(*row))  # prints row data

How do I set the destination table for a query in the BigQuery Ruby API?

Is there a method in the BigQuery API that allows you to set the destination table for a query? I found one in the REST API but not for programming languages like Ruby.
If there is an example for other languages, maybe I can try to do the same in Ruby.
You need to set the destination table via the API. Either of these example snippets should be easy to port to the Ruby client and should be enough to get you going:
Java
QueryJobConfiguration jobConfiguration =
    QueryJobConfiguration.newBuilder("select * from ...")
        .setAllowLargeResults(true)
        .setUseLegacySql(false)
        .setDryRun(dryRun)
        .setDestinationTable(TableId.of("projectId", "dataset", "table"))
        .setCreateDisposition(CREATE_IF_NEEDED)
        .setWriteDisposition(WRITE_TRUNCATE)
        .setPriority(BATCH)
        .build();
Python
from google.cloud import bigquery
client = bigquery.Client()
query = """\
SELECT firstname + ' ' + last_name AS full_name,
FLOOR(DATEDIFF(CURRENT_DATE(), birth_date) / 365) AS age
FROM dataset_name.persons
"""
dataset = client.dataset('dataset_name')
table = dataset.table(name='person_ages')
job = client.run_async_query('fullname-age-query-job', query)
job.destination = table
job.write_disposition= 'truncate'
job.begin()
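Note that the Python snippet above uses an older version of the client library (run_async_query was later removed). With current google-cloud-bigquery releases, the same thing looks roughly like this (project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination='your-project.dataset_name.person_ages',
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
)
query_job = client.query('SELECT * FROM dataset_name.persons', job_config=job_config)
query_job.result()  # wait for the query to finish writing the destination table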
didn't know if this is exactly what you were asking - but looks like it is :o)
See the Ruby API reference documentation for the Google BigQuery API client library.
You can see all supported clients in BigQuery Client Libraries.
You can query into a destination table with a single command:
bigquery = Google::Cloud::Bigquery.new(...)
dataset = bigquery.dataset('my_dataset')
job = bigquery.query_job("SELECT * FROM source_table",
                         table: dataset.table('destination_table'),
                         write: 'truncate',
                         create: 'needed')
job.wait_until_done!
From http://www.rubydoc.info/github/GoogleCloudPlatform/gcloud-ruby/Gcloud%2FBigquery%2FProject%3Aquery_job
@mikhailberlyant found it: the BigQuery documentation.