Bigquery LoadJobConfig Delete Source Files After Transfer - google-bigquery

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.

For me works this snippet (replace YOUR-... with your data):
from google.cloud import bigquery_datatransfer
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"
transfer_config = bigquery_datatransfer.TransferConfig(
destination_dataset_id=destination_dataset_id,
display_name="YOUR-TRANSFER-NAME",
data_source_id="google_cloud_storage",
params={
"data_path_template":"gs://PATH-TO-YOUR-DATA/*.csv",
"destination_table_name_template":"YOUR-TABLE-NAME",
"file_format":"CSV",
"skip_leading_rows":"1",
"delete_source_files": True
},
)
transfer_config = transfer_client.create_transfer_config(
parent=transfer_client.common_project_path(destination_project_id),
transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
In this example, table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will crash with error "Not found: Table YOUR-TABLE-NAME".
I used this packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the attribute delete_source_files in params. From docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.

Related

Is there a way to automate this Python script in GCP?

I am a complete beginner in using GCP functions/products.
I have written the following code below, that takes a list of cities from a local folder, and call in weather data for each city in that list, eventually uploading those weather values into a table in BigQuery. I don't need to change the code anymore, as it creates new tables when a new week begins, now I would want to "deploy" (I am not even sure if this is called deploying a code) in the cloud for it to automatically run there. I tried using App Engine and Cloud Functions but faced issues in both places.
import requests, json, sqlite3, os, csv, datetime, re
from google.cloud import bigquery
#from google.cloud import storage
list_city = []
with open("list_of_cities.txt", "r") as pointer:
for line in pointer:
list_city.append(line.strip())
API_key = "PLACEHOLDER"
Base_URL = "http://api.weatherapi.com/v1/history.json?key="
yday = datetime.date.today() - datetime.timedelta(days = 1)
Date = yday.strftime("%Y-%m-%d")
table_id = f"sonic-cat-315013.weather_data.Historical_Weather_{yday.isocalendar()[0]}_{yday.isocalendar()[1]}"
credentials_path = r"PATH_TO_JSON_FILE"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
client = bigquery.Client()
try:
schema = [
bigquery.SchemaField("city", "STRING", mode="REQUIRED"),
bigquery.SchemaField("Date", "Date", mode="REQUIRED"),
bigquery.SchemaField("Hour", "INTEGER", mode="REQUIRED"),
bigquery.SchemaField("Temperature", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Humidity", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Condition", "STRING", mode="REQUIRED"),
bigquery.SchemaField("Chance_of_rain", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Precipitation_mm", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Cloud_coverage", "INTEGER", mode="REQUIRED"),
bigquery.SchemaField("Visibility_km", "FLOAT", mode="REQUIRED")
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
type_=bigquery.TimePartitioningType.DAY,
field="Date", # name of column to use for partitioning
)
table = client.create_table(table) # Make an API request.
print(
"Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
except:
print("Table {}_{} already exists".format(yday.isocalendar()[0], yday.isocalendar()[1]))
def get_weather():
try:
x["location"]
except:
print(f"API could not call city {city_name}")
global day, time, dailytemp, dailyhum, dailycond, chance_rain, Precipitation, Cloud_coverage, Visibility_km
day = []
time = []
dailytemp = []
dailyhum = []
dailycond = []
chance_rain = []
Precipitation = []
Cloud_coverage = []
Visibility_km = []
for i in range(24):
dayval = re.search("^\S*\s" ,x["forecast"]["forecastday"][0]["hour"][i]["time"])
timeval = re.search("\s(.*)" ,x["forecast"]["forecastday"][0]["hour"][i]["time"])
day.append(dayval.group()[:-1])
time.append(timeval.group()[1:])
dailytemp.append(x["forecast"]["forecastday"][0]["hour"][i]["temp_c"])
dailyhum.append(x["forecast"]["forecastday"][0]["hour"][i]["humidity"])
dailycond.append(x["forecast"]["forecastday"][0]["hour"][i]["condition"]["text"])
chance_rain.append(x["forecast"]["forecastday"][0]["hour"][i]["chance_of_rain"])
Precipitation.append(x["forecast"]["forecastday"][0]["hour"][i]["precip_mm"])
Cloud_coverage.append(x["forecast"]["forecastday"][0]["hour"][i]["cloud"])
Visibility_km.append(x["forecast"]["forecastday"][0]["hour"][i]["vis_km"])
for i in range(len(time)):
time[i] = int(time[i][:2])
def main():
i = 0
while i < len(list_city):
try:
global city_name
city_name = list_city[i]
complete_URL = Base_URL + API_key + "&q=" + city_name + "&dt=" + Date
response = requests.get(complete_URL, timeout = 10)
global x
x = response.json()
get_weather()
table = client.get_table(table_id)
varlist = []
for j in range(24):
variables = city_name, day[j], time[j], dailytemp[j], dailyhum[j], dailycond[j], chance_rain[j], Precipitation[j], Cloud_coverage[j], Visibility_km[j]
varlist.append(variables)
client.insert_rows(table, varlist)
print(f"City {city_name}, ({i+1} out of {len(list_city)}) successfully inserted")
i += 1
except Exception as e:
print(e)
continue
In the code, there is direct reference to two files that is located locally, one is the list of cities and the other is the JSON file containing the credentials to access my project in GCP. I believed that uploading these files in Cloud Storage and referencing them there won't be an issue, but then I realised that I can't actually access my Buckets in Cloud Storage without using the credential files.
This leads me to being unsure whether the entire process would be possible at all, how do I authenticate in the first place from the cloud, if I need to reference that first locally? Seems like an endless circle, where I'd authenticate from the file in Cloud Storage, but I'd need authentication first to access that file.
I'd really appreciate some help here, I have no idea where to go from this, and I also don't have great knowledge in SE/CS, I only know Python R and SQL.
For Cloud Functions, the deployed function will run with the project service account credentials by default, without needing a separate credentials file. Just make sure this service account is granted access to whatever resources it will be trying to access.
You can read more info about this approach here (along with options for using a different service account if you desire): https://cloud.google.com/functions/docs/securing/function-identity
This approach is very easy, and keeps you from having to deal with a credentials file at all on the server. Note that you should remove the os.environ line, as it's unneeded. The BigQuery client will use the default credentials as noted above.
If you want the code to run the same whether on your local machine or deployed to the cloud, simply set a "GOOGLE_APPLICATION_CREDENTIALS" environment variable permanently in the OS on your machine. This is similar to what you're doing in the code you posted; however, you're temporarily setting it every time using os.environ rather than permanently setting the environment variable on your machine. The os.environ call only sets that environment variable for that one process execution.
If for some reason you don't want to use the default service account approach outlined above, you can instead directly reference it when you instantiate the bigquery.Client()
https://cloud.google.com/bigquery/docs/authentication/service-account-file
You just need to package the credential file with your code (i.e. in the same folder as your main.py file), and deploy it alongside so it's in the execution environment. In that case, it is referenceable/loadable from your script without needing any special permissions or credentials. Just provide the relative path to the file (i.e. assuming you have it in the same directory as your python script, just reference only the filename)
There may be different flavors and options to deploy your application and these will depend on your application semantics and execution constraints.
It will be too hard to cover all of them and the official Google Cloud Platform documentation cover all of them in great details:
Google Compute Engine
Google Kubernetes Engine
Google App Engine
Google Cloud Functions
Google Cloud Run
Based on my understanding of your application design, the most suitable ones would be:
Google App Engine
Google Cloud Functions
Google Cloud Run: Check these criteria to see if you application is a good fit for this deployment style
I would suggest using Cloud Functions as you deployment option in which case your application will default to using the project App Engine service account to authenticate itself and perform allowed actions. Hence, you should only check if the default account PROJECT_ID#appspot.gserviceaccount.com under the IAM configuration section has proper access to needed APIs (BigQuery in your case).
In such a setup, you want need to push your service account key to Cloud Storage which I would recommend to avoid in either cases, and you want need to pull it either as the runtime will handle authentication the function for you.

How to invoke an on-demand bigquery Data transfer service?

I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema sitting to be loaded into BQ. It would have been awesome to just setup DTS schedule that picked up GCS files that match a pattern and load the into BQ. I like the built in option to delete source files after copy and email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10 min delay perhaps.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking create a cronjob that calls it on demand every 10 mins. But I can’t figure out through the docs how to call it.
Also, what is my second best most reliable and cheapest way of moving GCS files (no ETL needed) into bq tables that match the exact schema. Should I use Cloud Scheduler, Cloud Functions, DataFlow, Cloud Run etc.
If I use Cloud Function, how can I submit all files in my GCS at time of invocation as one bq load job?
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking create a cronjob that calls it on demand every 10 mins. But I can’t figure out through the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the depencencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
destination_dataset_id: "DATASET_NAME"
schedule_time {
seconds: 1579358571
nanos: 922599371
}
...
data_source_id: "google_cloud_storage"
state: PENDING
params {
...
}
run_time {
seconds: 1579358581
}
user_id: 28...65
}
and the transfer is triggered as expected (nevermind the error):
Also, what is my second best most reliable and cheapest way of moving GCS files (no ETL needed) into bq tables that match the exact schema. Should I use Cloud Scheduler, Cloud Functions, DataFlow, Cloud Run etc.
With this you can set a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution and here come into play your follow-up questions. I think these might be too broad to address in a single StackOverflow question but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, set the watch interval and the load job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline as opposed to the previous approach but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use Cloud Function, how can I submit all files in my GCS at time of invocation as one bq load job?
You can use wildcards to match a file pattern when you create the transfer:
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
There is already a Feature Request here. Feel free to star it to show your interest and receive updates
Now your can easy manual run transfer Bigquery data use RESTApi:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
About this part > {parent=projects//locations//transferConfigs/*}, check on CONFIGURATION of your Transfer then notice part like image bellow.
Here
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
following the Guillem's answer and the API updates, this is my new code:
import time
from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
{ "parent": parent, "requested_run_time": start_time }
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery Scheduled queries, to get a specific ID. You can do it like that :
# Put your projetID here
PROJECT_ID = 'PROJECT_ID'
from google.cloud import bigquery_datatransfer_v1
bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)
# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
# Print Display Name for each Scheduled Query
print(f'[Schedule Query Name]:\t{element.display_name}')
# Print name of all elements (it contains the ID)
print(f'[Name]:\t\t{element.name}')
# Extract the IDs:
TRANSFER_CONFIG_ID= element.name.split('/')[-1]
print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
# You can print the entire element for debug purposes
print(element)

Denormalize a GCS file before uploading to BigQuery

I have written a Cloud Run API in .Net Core that reads files from a GCS location and then is supposed to denormalize (i.e. add more information for each row to include textual descriptions) and then write that to a BigQuery table. I have two options:
My cloud run API could create denormalized CSV files and write them to another GCS location. Then another cloud run API could pick up those denormalized CSV files and write them straight to BigQuery.
My cloud run API could read the original CSV file, denormalize them in memory (filestream) and then somehow write from the in memory filestream straight to the BigQuery table.
What is the best way to write to BigQuery in this scenario if performance (speed) and cost (monetary) is my goal. These files are roughly 10KB each before denormalizing. Each row is roughly 1000 characters. After denormalizing it is about three times as much. I do not need to keep denormalized files after they are successfully loaded in BigQuery. I am concerned about performance, as well as any specific BigQuery daily quotas around inserts/writes. I don't think there are any unless you are doing DML statements but correct me if I'm wrong.
I would use Cloud Functions that are triggered when you upload a file to a bucket.
It is so common that Google has a repo a tutorial just for this for JSON files Streaming data from Cloud Storage into BigQuery using Cloud Functions.
Then, I would modify the example main.py file from:
def streaming(data, context):
'''This function is executed whenever a file is added to Cloud Storage'''
bucket_name = data['bucket']
file_name = data['name']
db_ref = DB.document(u'streaming_files/%s' % file_name)
if _was_already_ingested(db_ref):
_handle_duplication(db_ref)
else:
try:
_insert_into_bigquery(bucket_name, file_name)
_handle_success(db_ref)
except Exception:
_handle_error(db_ref)
To this that accepts CSV files:
import json
import csv
import logging
import os
import traceback
from datetime import datetime
from google.api_core import retry
from google.cloud import bigquery
from google.cloud import storage
import pytz
PROJECT_ID = os.getenv('GCP_PROJECT')
BQ_DATASET = 'fromCloudFunction'
BQ_TABLE = 'mytable'
CS = storage.Client()
BQ = bigquery.Client()
def streaming(data, context):
'''This function is executed whenever a file is added to Cloud Storage'''
bucket_name = data['bucket']
file_name = data['name']
newRows = postProcessing(bucket_name, file_name)
# It is recommended that you save
# what you process for debugging reasons.
destination_bucket = 'post-processed' # gs://post-processed/
destination_name = file_name
# saveRowsToBucket(newRows,destination_bucket,destination_name)
rowsInsertIntoBigquery(newRows)
class BigQueryError(Exception):
'''Exception raised whenever a BigQuery error happened'''
def __init__(self, errors):
super().__init__(self._format(errors))
self.errors = errors
def _format(self, errors):
err = []
for error in errors:
err.extend(error['errors'])
return json.dumps(err)
def postProcessing(bucket_name, file_name):
blob = CS.get_bucket(bucket_name).blob(file_name)
my_str = blob.download_as_string().decode('utf-8')
csv_reader = csv.DictReader(my_str.split('\n'))
newRows = []
for row in csv_reader:
modified_row = row # Add your logic
newRows.append(modified_row)
return newRows
def rowsInsertIntoBigquery(rows):
table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
errors = BQ.insert_rows_json(table,rows)
if errors != []:
raise BigQueryError(errors)
It would be still necesssary to define your map(row->newRow) and the function saveRowsToBucket if you needed it.

Google Cloud Storage Joining multiple csv files

I exported a dataset from Google BigQuery to Google Cloud Storage, given the size of the file BigQuery exported the file as 99 csv files.
However now I want to connect to my GCP Bucket and perform some analysis with Spark, yet I need to join all 99 files into a single large csv file to run my analysis.
How can this be achieved?
BigQuery splits the data exported into several files if it is larger than 1GB. But you can merge these files with the gsutil tool, check this official documentation to know how to perform object composition with gsutil.
As BigQuery export the files with the same prefix, you can use a wildcard * to merge them into one composite object:
gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
The downside of this option is that the header row of each .csv file will be added in the composite object. But you can avoid this by modifiyng the jobConfig to set the print_header parameter to False.
Here is a Python sample code, but you can use any other BigQuery Client library:
from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'yourBucket'
project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'
destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.job.ExtractJobConfig(print_header=False)
extract_job = client.extract_table(
table_ref,
destination_uri,
# Location must match that of the source table.
location='US',
job_config=job_config) # API request
extract_job.result() # Waits for job to complete.
print('Exported {}:{}.{} to {}'.format(
project, dataset_id, table_id, destination_uri))
Finally, remember to compose an empty .csv with just the headers row.
I got tired kind tired of doing multiple recursive compose operations, stripping headers, etc... Especially when dealing with 3500 split gzipped csv files.
Therefore wrote a CSV Merge (Sorry windows only though) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download latest release, unzip and use.
Also wrote an article with a use case and usage example for it:
https://medium.com/#TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
p.s. Recommend tab delimited over CSV as it tends to have less data issues.

Renaming an Amazon CloudWatch Alarm

I'm trying to organize a large number of CloudWatch alarms for maintainability, and the web console grays out the name field on an edit. Is there another method (preferably something scriptable) for updating the name of CloudWatch alarms? I would prefer a solution that does not require any programming beyond simple executable scripts.
Here's a script we use to do this for the time being:
import sys
import boto
def rename_alarm(alarm_name, new_alarm_name):
conn = boto.connect_cloudwatch()
def get_alarm():
alarms = conn.describe_alarms(alarm_names=[alarm_name])
if not alarms:
raise Exception("Alarm '%s' not found" % alarm_name)
return alarms[0]
alarm = get_alarm()
# work around boto comparison serialization issue
# https://github.com/boto/boto/issues/1311
alarm.comparison = alarm._cmp_map.get(alarm.comparison)
alarm.name = new_alarm_name
conn.update_alarm(alarm)
# update actually creates a new alarm because the name has changed, so
# we have to manually delete the old one
get_alarm().delete()
if __name__ == '__main__':
alarm_name, new_alarm_name = sys.argv[1:3]
rename_alarm(alarm_name, new_alarm_name)
It assumes you're either on an ec2 instance with a role that allows this, or you've got a ~/.boto file with your credentials. It's easy enough to manually add yours.
Unfortunately it looks like this is not currently possible.
I looked around for the same solution but it seems neither console nor cloudwatch API provides that feature.
Note:
But we can copy the existing alram with the same parameter and can save on new name
.