Beam Job Creates BigQuery Table but Does Not Insert - google-bigquery

I am writing a beam job that is a simple 1:1 ETL from a binary protobuf file stored in GCS into BigQuery. The table schema is quite large, and generated automatically from a representative protobuf.
I am encountering behavior where the BigQuery table is created successfully, but no records are inserted. I have confirmed that records are being generated by the earlier stage, and when I use a normal file sink I can confirm that records are written.
Does anyone know why this is happening?
Logs:
WARNING:root:Inferring Schema...
WARNING:root:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Defining Beam Pipeline...
<PATH REDACTED>/venv/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py:1145: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
experiments = p.options.view_as(DebugOptions).experiments or []
WARNING:root:Running Beam Pipeline...
WARNING:root:extracted {'counters': [MetricResult(key=MetricKey(step=extract_games, metric=MetricName(namespace=__main__.ExtractGameProtobuf, name=extracted_games), labels={}), committed=8, attempted=8)], 'distributions': [], 'gauges': []} games
Pipeline Source:
def main(args):
    DEFAULT_REPLAY_IDS_PATH = "./replay_ids.txt"
    DEFAULT_BQ_TABLE_OUT = "<PROJECT REDACTED>:<DATASET REDACTED>.games"
    # configure logging
    logging.basicConfig(level=logging.WARNING)
    # set up replay source
    replay_source = ETLReplayRemoteSource.default()
    # TODO: load the example replay and parse schema
    logging.warning("Inferring Schema...")
    sample_replay = replay_source.load_replay(DEFAULT_REPLAY_IDS[0])
    game_schema = ProtobufToBigQuerySchemaGenerator(
        sample_replay.analysis.DESCRIPTOR).schema()
    # print("GAME SCHEMA:\n{}".format(game_schema)) # DEBUG
    # submit beam job that reads replays into bigquery
    def count_ones(word_ones):
        (word, ones) = word_ones
        return (word, sum(ones))
    with beam.Pipeline(options=PipelineOptions()) as p:
        logging.warning("Defining Beam Pipeline...")
        # replay_ids = p | "create_replay_ids" >> beam.Create(DEFAULT_REPLAY_IDS)
        (p | "read_replay_ids" >> beam.io.ReadFromText(DEFAULT_REPLAY_IDS_PATH)
           | "extract_games" >> beam.ParDo(ExtractGameProtobuf())
           | "write_out_bq" >> WriteToBigQuery(
               DEFAULT_BQ_TABLE_OUT,
               schema=game_schema,
               write_disposition=BigQueryDisposition.WRITE_APPEND,
               create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)
        )
        logging.warning("Running Beam Pipeline...")
        result = p.run()
        result.wait_until_finish()
        n_extracted = result.metrics().query(
            MetricsFilter().with_name('extracted_games'))
        logging.warning("extracted {} games".format(n_extracted))

Related

Python BigQuery Storage Write retry strategy when writing to default stream

I'm testing python-bigquery-storage to insert multiple items into a table using the _default stream.
I used the example shown in the official docs as a basis, and modified it to use the default stream.
Here is a minimal example that's similar to what I'm trying to do:
customer_record.proto
syntax = "proto2";
message CustomerRecord {
  optional string customer_name = 1;
  optional int64 row_num = 2;
}
append_rows_default.py
from itertools import islice
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2
import customer_record_pb2
import logging
logging.basicConfig(level=logging.DEBUG)
CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.
def chunks(l, n):
    """Yield successive `n`-sized chunks from `l`."""
    _it = iter(l)
    while True:
        chunk = [*islice(_it, 0, n)]
        if chunk:
            yield chunk
        else:
            break
def create_stream_manager(project_id, dataset_id, table_id, write_client):
    # Use the default stream
    # The stream name is:
    # projects/{project}/datasets/{dataset}/tables/{table}/_default
    parent = write_client.table_path(project_id, dataset_id, table_id)
    stream_name = f'{parent}/_default'
    # Create a template with fields needed for the first request.
    request_template = types.AppendRowsRequest()
    # The initial request must contain the stream name.
    request_template.write_stream = stream_name
    # So that BigQuery knows how to parse the serialized_rows, generate a
    # protocol buffer representation of our message descriptor.
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data
    # Create an AppendRowsStream using the request template created above.
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)
    return append_rows_stream
def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):
    append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)
    response_futures = []
    row_count = 0
    # Send the rows in chunks, to limit memory usage.
    for chunk in chunks(rows, CHUNK_SIZE):
        proto_rows = types.ProtoRows()
        for row in chunk:
            row_count += 1
            proto_rows.serialized_rows.append(row.SerializeToString())
        # Create an append row request containing the rows
        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data
        future = append_rows_stream.send(request)
        response_futures.append(future)
    # Wait for all the append row requests to finish.
    for f in response_futures:
        f.result()
    # Shutdown background threads and close the streaming connection.
    append_rows_stream.close()
    return row_count
def create_row(row_num: int, name: str):
    row = customer_record_pb2.CustomerRecord()
    row.row_num = row_num
    row.customer_name = name
    return row
def main():
    write_client = bigquery_storage_v1.BigQueryWriteClient()
    rows = [ create_row(i, f"Test{i}") for i in range(0,20) ]
    send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)
if __name__ == '__main__':
    main()
Note:
In the above, CHUNK_SIZE is 2 just for this minimal example, but, in a real situation, I used a chunk size of 5000.
In real usage, I have several separate streams of data that need to be processed in parallel, so I make several calls to send_rows_to_bq, one for each stream of data, using a thread pool (one thread per stream of data). (I'm assuming here that AppendRowsStream is not meant to be shared by multiple threads, but I might be wrong).
It mostly works, but I often get a mix of intermittent errors in the call to append_rows_stream's send method:
google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.
I think I just need to retry on these errors, but I'm not sure how to best implement a retry strategy here. My impression is that I need to use the following strategy to retry errors when calling send:
If the error is a StreamClosedError, the append_rows_stream stream manager can't be used anymore, and so I need to call close on it and then call my create_stream_manager again to create a new one, then try to call send on the new stream manager.
Otherwise, on any google.api_core.exceptions.ServerError error, retry the call to send on the same stream manager.
Am I approaching this correctly?
Thank you.
The best solution to this problem is to upgrade to a newer release of the library.
The problem happens (or used to happen) in older versions because a Write API connection hangs once it reaches 10 MB.
If upgrading the library does not help, you can try these options:
Limit the connection to < 10MB.
Disconnect and connect again to the API.
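A minimal sketch of the "disconnect and connect again" option, combined with the retry strategy described in the question. It is not taken from the library: ReconnectingSender, make_manager and the two exception classes are hypothetical stand-ins, so that the sketch runs without Google Cloud credentials. In real code you would catch google.cloud.bigquery_storage_v1.exceptions.StreamClosedError and google.api_core.exceptions.ServerError instead, and make_manager would wrap create_stream_manager from the question. It only covers errors raised by send itself, not errors that surface later from future.result().

```python
import time


class StreamClosedError(Exception):
    """Stand-in for bigquery_storage_v1.exceptions.StreamClosedError (assumption)."""


class TransientServerError(Exception):
    """Stand-in for google.api_core.exceptions.ServerError (assumption)."""


class ReconnectingSender:
    """Hypothetical wrapper that retries send() and rebuilds a closed stream manager."""

    def __init__(self, make_manager, max_attempts=5, base_delay=0.5, sleep=time.sleep):
        self._make_manager = make_manager  # zero-argument factory for a stream manager
        self._manager = make_manager()
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep

    def _reconnect(self):
        # A closed manager cannot be reused: close it (best effort) and build a new one.
        try:
            self._manager.close()
        except Exception:
            pass
        self._manager = self._make_manager()

    def send(self, request):
        for attempt in range(1, self._max_attempts + 1):
            try:
                return self._manager.send(request)
            except StreamClosedError:
                if attempt == self._max_attempts:
                    raise
                self._reconnect()
            except TransientServerError:
                # Same manager, just try again after the backoff below.
                if attempt == self._max_attempts:
                    raise
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            self._sleep(self._base_delay * 2 ** (attempt - 1))

    def close(self):
        self._manager.close()
```

Each thread would own one ReconnectingSender, created for example with make_manager=lambda: create_stream_manager(project_id, dataset_id, table_id, write_client). As far as I know, writes to the default stream are at-least-once, so a retried request may insert duplicate rows; treat that as something to verify for your use case.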

Copy records from one table to another using spark-sql-jdbc

I am trying to do a POC in PySpark for a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but for now I am trying to do it for only one table and will later extend it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. The main table has 1 million records, so I expected the copy to finish in a few seconds, but it never completes.
Spark UI: [screenshot not preserved]
Could you please suggest how I should handle this?
Host: Local Machine
Spark version: 3.0.0
Database: Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser
#read configuration file
config = ConfigParser()
config.read('config.ini')
#setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)
# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url",url)\
        .option("user",dbUsr)\
        .option("dbtable",pushdown_query)\
        .option("password",dbPwd)\
        .option("driver",dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url",url)\
        .option("user",dbUsr)\
        .option("dbtable","backup_tbl")\
        .option("password",dbPwd)\
        .option("driver",dbDrvr)\
        .mode("overwrite").save()
if __name__ =="__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Can you check the executor and driver details in the Spark UI and try increasing the resources? Also, please share the error it fails with. You can get this from the same UI, or use the CLI to fetch the logs: "yarn logs -applicationId "
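One thing that may be worth checking alongside the resources: in the question's code, numPartitions is set without partitionColumn, lowerBound and upperBound, and as far as I understand the Spark JDBC source, the read then runs as a single query on one task. The sketch below only builds the option dictionary for a partitioned read. The helper name, the column name "id", the bounds and the fetch size are all assumptions for illustration; the option keys themselves are the documented Spark JDBC options.

```python
def partitioned_read_options(url, user, password, driver, table,
                             partition_column, lower_bound, upper_bound,
                             num_partitions=8, fetch_size=10000):
    """Build options for a parallel JDBC read (hypothetical helper).

    partition_column should be a numeric, date or timestamp column; Spark
    splits [lower_bound, upper_bound] into num_partitions ranges and issues
    one query per range.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "driver": driver,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip; the Oracle driver's default is small.
        "fetchsize": str(fetch_size),
    }


# Usage with the variables from the question (not executed here):
#   opts = partitioned_read_options(url, dbUsr, dbPwd, dbDrvr,
#                                   "(SELECT * FROM main_table) main_tbl",
#                                   "id", 1, 1000000)
#   prprDF = spark.read.format("jdbc").options(**opts).load()
#   prprDF.write.format("jdbc").option("batchsize", 10000) ...
```

The bounds only decide how the ranges are split; rows outside them are still read, they just land in the first or last partition.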

Dataflow job fails and tries to create temp_dataset on Bigquery

I'm running a simple dataflow job to read data from a table and write back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset, though. It looks like it is trying to create a temp dataset because the job fails, but I don't get any information about the real error behind the scenes.
Reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, so my question is more about how to get the real error rather than this one.
Any idea how to troubleshoot this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv
options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'
with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using the BigQuerySource the SDK creates a temporary dataset and stores the output of the query into a temporary table. It then issues an export from that temporary table to read the results from.
So it is expected behavior for it to create this temp_dataset. This means that it is probably not hiding an error.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use. That way the process doesn't create a temp dataset.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
.fromQuery("selectQuery").withQueryTempDataset("existingDataset")
.usingStandardSql().withMethod(TypedRead.Method.DEFAULT);

What is the best way to communicate among multiple processes in ubuntu

I have three different machine learning models in Python. To improve performance, I run them in parallel in different terminals. They communicate and share data with one another through files: each model creates batches of files for the others to consume. All the processes run in parallel, but each depends on data prepared by another. Once process A has prepared a batch of data, it creates a file to signal that the data is ready; process B then starts processing it while also watching for the next batch. How can this large amount of data be passed to the next process without creating files? Is there a better way for these processes to communicate in Python without creating and deleting temporary files?
Thanks
You could consider running up a small Redis instance... a very fast, in-memory data structure server.
It allows you to share strings, lists, queues, hashes, atomic integers, sets, ordered sets between processes very simply.
As it is networked, you can share all these data structures not only within a single machine, but across multiple machines.
As it has bindings for C/C++, Python, bash, Ruby, Perl and so on, it also means you can use the shell, for example, to quickly inject commands/data into your app to change its behaviour, or get debugging insight by looking at how variables are set.
Here's an example of how to do multiprocessing in Python3. Instead of storing results in a file the results are stored in a dictionary (see output)
from multiprocessing import Pool, cpu_count
def multi_processor(function_name):
    file_list = []
    # Test, put 6 strings in the list so your_function should run six times
    # with 6 processors in parallel, (assuming your CPU has enough cores)
    file_list.append("test1")
    file_list.append("test2")
    file_list.append("test3")
    file_list.append("test4")
    file_list.append("test5")
    file_list.append("test6")
    # Use max number of system processors - 1
    pool = Pool(processes=cpu_count()-1)
    pool.daemon = True
    results = {}
    # for every item in the file_list, start a new process
    for aud_file in file_list:
        results[aud_file] = pool.apply_async(your_function, args=("arg1", "arg2"))
    # Wait for all processes to finish before proceeding
    pool.close()
    pool.join()
    # Results and any errors are returned
    return {your_function: result.get() for your_function, result in results.items()}
def your_function(arg1, arg2):
    try:
        print("put your stuff in this function")
        your_results = ""
        return your_results
    except Exception as e:
        return str(e)
if __name__ == "__main__":
    some_results = multi_processor("your_function")
    print(some_results)
The output is
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
{'test1': '', 'test2': '', 'test3': '', 'test4': '', 'test5': '', 'test6': ''}
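The Pool example above collects results once every task has finished. For the situation in the question, where process B should start on each batch as soon as process A has prepared it, a queue between two long-running processes may be a closer fit. This is only a sketch using the standard library: the function names and the sum() placeholder are made up for illustration, and the "fork" start method assumes Linux (the question mentions Ubuntu).

```python
import multiprocessing as mp

_DONE = None  # sentinel that marks the end of the stream


def producer(batches, work_q):
    """Process A: push each prepared batch onto the queue, then signal the end."""
    for batch in batches:
        work_q.put(batch)
    work_q.put(_DONE)


def consumer(work_q, result_q):
    """Process B: handle batches as they arrive instead of polling for files."""
    while True:
        batch = work_q.get()  # blocks until a batch is available
        if batch is _DONE:
            break
        result_q.put(sum(batch))  # placeholder for the real model work
    result_q.put(_DONE)


def run_pipeline(batches):
    ctx = mp.get_context("fork")  # assumption: Linux
    work_q = ctx.Queue(maxsize=4)  # bounded, so A cannot run far ahead of B
    result_q = ctx.Queue()
    procs = [
        ctx.Process(target=producer, args=(batches, work_q)),
        ctx.Process(target=consumer, args=(work_q, result_q)),
    ]
    for proc in procs:
        proc.start()
    # Drain the results before joining, so the children never block on a full pipe.
    results = []
    while True:
        item = result_q.get()
        if item is _DONE:
            break
        results.append(item)
    for proc in procs:
        proc.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3], [4, 5, 6]]))
```

Items put on a queue are pickled and copied through a pipe, so for very large arrays it may be worth looking at multiprocessing.shared_memory (Python 3.8+) and passing only the block name through the queue.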
Try using a sqlite database to share files.
I made this for this exact purpose:
https://pypi.org/project/keyvalue-sqlite/
You can use it like this:
from keyvalue_sqlite import KeyValueSqlite
DB_PATH = '/path/to/db.sqlite'
db = KeyValueSqlite(DB_PATH, 'table-name')
# Now use standard dictionary operators
db.set_default('0', '1')
actual_value = db.get('0')
assert '1' == actual_value
db.set_default('0', '2')
assert '1' == db.get('0')

How to best handle data stored in different locations in Google BigQuery?

My current workflow in BigQuery is as follows:
(1) query data in a public repository (stored in the US), (2) write it to a table in my repository, (3) export a CSV to a cloud bucket, (4) download the CSV to the server I work on, and (5) work with it on the server.
The problem I have now is that the server I work on is located in the EU, so I have to pay considerable fees for transferring data between my US bucket and my EU server. I could locate my bucket in the EU, but then I would still be transferring data from the US (BigQuery) to the EU (bucket). I could also set my BigQuery dataset to be located in the EU, but then I can't run any queries any longer, because the data in the public repository is located in the US and queries across different locations are not allowed.
Does anyone have an idea of how to approach this?
One way to copy a BigQuery dataset from one region to another is to take advantage of the Storage Transfer Service. It doesn't get around the fact that you still have to pay for bucket-to-bucket network traffic, but it might save you some CPU time on copying data to a server in the EU.
The flow would be to:
Extract all the BigQuery tables into a bucket in the same region as the tables. (Recommend Avro format for best fidelity in data types and fastest loading speed.)
Run a storage transfer job to copy the extracted files from the starting location bucket to a bucket in the destination location.
Load all the files into a BigQuery dataset located in the destination location.
Python example:
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import sys
import time
import googleapiclient.discovery
from google.cloud import bigquery
import json
import pytz
PROJECT_ID = 'swast-scratch' # TODO: set this to your project name
FROM_LOCATION = 'US' # TODO: set this to the BigQuery location
FROM_DATASET = 'workflow_test_us' # TODO: set to BQ dataset name
FROM_BUCKET = 'swast-scratch-us' # TODO: set to bucket name in same location
TO_LOCATION = 'EU' # TODO: set this to the destination BigQuery location
TO_DATASET = 'workflow_test_eu' # TODO: set to destination dataset name
TO_BUCKET = 'swast-scratch-eu' # TODO: set to bucket name in destination loc
# Construct API clients.
bq_client = bigquery.Client(project=PROJECT_ID)
transfer_client = googleapiclient.discovery.build('storagetransfer', 'v1')
def extract_tables():
    # Extract all tables in a dataset to a Cloud Storage bucket.
    print('Extracting {}:{} to bucket {}'.format(
        PROJECT_ID, FROM_DATASET, FROM_BUCKET))
    tables = list(bq_client.list_tables(bq_client.dataset(FROM_DATASET)))
    extract_jobs = []
    for table in tables:
        job_config = bigquery.ExtractJobConfig()
        job_config.destination_format = bigquery.DestinationFormat.AVRO
        extract_job = bq_client.extract_table(
            table.reference,
            ['gs://{}/{}.avro'.format(FROM_BUCKET, table.table_id)],
            location=FROM_LOCATION, # Available in 0.32.0 library.
            job_config=job_config) # Starts the extract job.
        extract_jobs.append(extract_job)
    for job in extract_jobs:
        job.result()
    return tables
def transfer_buckets():
    # Transfer files from one region to another using storage transfer service.
    print('Transferring bucket {} to {}'.format(FROM_BUCKET, TO_BUCKET))
    now = datetime.datetime.now(pytz.utc)
    transfer_job = {
        'description': '{}-{}-{}_once'.format(
            PROJECT_ID, FROM_BUCKET, TO_BUCKET),
        'status': 'ENABLED',
        'projectId': PROJECT_ID,
        'transferSpec': {
            'transferOptions': {
                'overwriteObjectsAlreadyExistingInSink': True,
            },
            'gcsDataSource': {
                'bucketName': FROM_BUCKET,
            },
            'gcsDataSink': {
                'bucketName': TO_BUCKET,
            },
        },
        # Set start and end date to today (UTC) without a time part to start
        # the job immediately.
        'schedule': {
            'scheduleStartDate': {
                'year': now.year,
                'month': now.month,
                'day': now.day,
            },
            'scheduleEndDate': {
                'year': now.year,
                'month': now.month,
                'day': now.day,
            },
        },
    }
    transfer_job = transfer_client.transferJobs().create(
        body=transfer_job).execute()
    print('Returned transferJob: {}'.format(
        json.dumps(transfer_job, indent=4)))
    # Find the operation created for the job.
    job_filter = {
        'project_id': PROJECT_ID,
        'job_names': [transfer_job['name']],
    }
    # Wait until the operation has started.
    response = {}
    while ('operations' not in response) or (not response['operations']):
        time.sleep(1)
        response = transfer_client.transferOperations().list(
            name='transferOperations', filter=json.dumps(job_filter)).execute()
    operation = response['operations'][0]
    print('Returned transferOperation: {}'.format(
        json.dumps(operation, indent=4)))
    # Wait for the transfer to complete.
    print('Waiting ', end='')
    while operation['metadata']['status'] == 'IN_PROGRESS':
        print('.', end='')
        sys.stdout.flush()
        time.sleep(5)
        operation = transfer_client.transferOperations().get(
            name=operation['name']).execute()
    print()
    print('Finished transferOperation: {}'.format(
        json.dumps(operation, indent=4)))
def load_tables(tables):
    # Load all tables into the new dataset.
    print('Loading tables from bucket {} to {}:{}'.format(
        TO_BUCKET, PROJECT_ID, TO_DATASET))
    load_jobs = []
    for table in tables:
        dest_table = bq_client.dataset(TO_DATASET).table(table.table_id)
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.AVRO
        load_job = bq_client.load_table_from_uri(
            ['gs://{}/{}.avro'.format(TO_BUCKET, table.table_id)],
            dest_table,
            location=TO_LOCATION, # Available in 0.32.0 library.
            job_config=job_config) # Starts the load job.
        load_jobs.append(load_job)
    for job in load_jobs:
        job.result()
# Actually run the script.
tables = extract_tables()
transfer_buckets()
load_tables(tables)
The preceding sample uses the google-cloud-bigquery library for the BigQuery API and google-api-python-client for the Storage Transfer Service API.
Note that this sample does not account for partitioned tables.
No matter what, you have data in the US that you need in the EU, so I think you have two options:
You could continue to pay many smaller fees to move your reduced datasets from the US to the EU as you're doing today.
You could pay the one-off fee to transfer the original public BQ dataset from the US to your own dataset in the EU. From then on, all queries you run stay in the same region, and you have no more trans-continental transfers.
It really depends on how many queries you plan to do. If it's not a lot, then the way you're doing things today seems like it'd be the most efficient. If it's a lot, then moving the data once (paying the up-front fee) might work out cheaper.
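To put a rough number on "how many", here is a back-of-the-envelope sketch. It assumes both kinds of transfer are billed at the same per-GB rate, so the price cancels out, and it ignores the storage cost of keeping a copy in the EU and any later refreshes of the public data. The function name and the sizes are made up for illustration.

```python
def break_even_exports(full_dataset_gb, per_export_gb):
    """Number of per-query exports whose combined size equals the one-off copy.

    Beyond this many exports, moving the whole dataset once transfers less
    data across regions than continuing to export query results one by one.
    """
    if per_export_gb <= 0:
        raise ValueError("per_export_gb must be positive")
    return full_dataset_gb / per_export_gb


# Hypothetical sizes: a 500 GB public dataset and 20 GB per exported result.
print(break_even_exports(500, 20))  # 25.0
```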
Maybe Google has some magical way to make this better, but as far as I can tell, you're dealing with lots of data on one side of the Atlantic that you need on the other side, and moving it across that wire costs money.