Sensor file from GCS to BigQuery. If success, load it to a GCS bucket, else load it to another bucket - google-bigquery

I want to build a DAG like this:
Description:
I have a file named population_YYYYMMDD.csv on my local machine. I load it into a GCS bucket (folder A) and check for it with GCSObjectExistenceSensor => Done
Then I transform it using DataflowTemplatedJobStartOperator. The transform changes things like column names, data types, ...
Based on whether the population_YYYYMMDD transform succeeded or failed:
If it succeeded, I want to load it into BigQuery - dataset A, table named population_YYYYMMDD - and move the CSV file to another folder, a Success folder (same or different bucket is also ok).
If it failed, the CSV file should move to a Failure folder.

You can make use of the BranchPythonOperator in Airflow to implement the conditional block in your case. This operator calls a Python callable which returns the task_id of the next task to be executed.
Since you've not shared any code, here is a simple example DAG you can follow. Please feel free to replace the DummyOperator tasks with your requirement-specific operators.
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.dummy import DummyOperator
from datetime import datetime
from airflow.models.taskinstance import TaskInstance

default_args = {
    'start_date': datetime(2020, 1, 1)
}

def _choose_best_model(**kwargs):
    # Look up the state of the Dataflow task for this run and return the
    # task_id of the branch that should execute next.
    dag_instance = kwargs['dag']
    execution_date = kwargs['execution_date']
    operator_instance = dag_instance.get_task("Dataflow-operator")
    task_status = TaskInstance(operator_instance, execution_date).current_state()
    if task_status == 'success':
        return 'upload-to-BQ'
    else:
        return 'upload-to-GCS'

with DAG('branching', schedule_interval=None, default_args=default_args, catchup=False) as dag:
    task1 = DummyOperator(task_id='Local-to-GCS')
    task2 = DummyOperator(task_id='Dataflow-operator')
    branching = BranchPythonOperator(
        task_id='conditional-task',
        python_callable=_choose_best_model,
        provide_context=True
    )
    accurate = DummyOperator(task_id='upload-to-BQ')
    inaccurate = DummyOperator(task_id='upload-to-GCS')

    task1 >> task2
    branching >> [accurate, inaccurate]
Here the _choose_best_model function is called by the BranchPythonOperator. It checks the state of task2 (which in your case would be the DataflowTemplatedJobStartOperator) and, if that task succeeded, returns the task_id of the BigQuery upload task, so the other branch is skipped.
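As a rough sketch of how the DummyOperator placeholders might map onto concrete operators for your use case (the bucket, project, dataset and folder names below are placeholders, not taken from your setup):

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

# Success branch: load folder A's CSV into BigQuery, then archive the file.
upload_to_bq = GCSToBigQueryOperator(
    task_id='upload-to-BQ',
    bucket='my-bucket',  # placeholder
    source_objects=['folderA/population_{{ ds_nodash }}.csv'],
    destination_project_dataset_table='my-project.datasetA.population_{{ ds_nodash }}',
    source_format='CSV',
    skip_leading_rows=1,
    write_disposition='WRITE_TRUNCATE',
)

move_to_success = GCSToGCSOperator(
    task_id='move-to-success-folder',
    source_bucket='my-bucket',
    source_object='folderA/population_{{ ds_nodash }}.csv',
    destination_bucket='my-bucket',
    destination_object='success/population_{{ ds_nodash }}.csv',
    move_object=True,  # delete the source object after copying
)

# Failure branch: move the CSV to the failure folder instead.
move_to_failure = GCSToGCSOperator(
    task_id='upload-to-GCS',
    source_bucket='my-bucket',
    source_object='folderA/population_{{ ds_nodash }}.csv',
    destination_bucket='my-bucket',
    destination_object='failure/population_{{ ds_nodash }}.csv',
    move_object=True,
)

upload_to_bq >> move_to_success

With this layout the branch callable still returns 'upload-to-BQ' or 'upload-to-GCS', and the {{ ds_nodash }} macro fills in the YYYYMMDD part of the file name from the run's logical date.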

Related

How to avoid needing to restart the Spark session after overwriting an external table

I have an Azure data lake external table, and want to remove all rows from it. I know that the 'truncate' command doesn't work for external tables, and BTW I don't really want to re-create the table (but might consider that option for certain flows). Anyway, the best I've gotten to work so far is to create an empty data frame (with a defined schema) and overwrite the folder containing the data, e.g.:
from pyspark.sql.types import *

data = []
schema = StructType(
    [
        StructField('Field1', IntegerType(), True),
        StructField('Field2', StringType(), True),
        StructField('Field3', DecimalType(18, 8), True)
    ]
)
sdf = spark.createDataFrame(data, schema)
#sdf.printSchema()
#display(sdf)
sdf.write.format("csv").option('header', True).mode("overwrite").save("/db1/table1")
This mostly works, except that if I go to select from the table, it will fail with the below error:
Error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13) (vm-9cb62393 executor 2): java.io.FileNotFoundException: Operation failed: "The specified path does not exist."
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried running 'refresh' on the table but the error persisted. Restarting the Spark session fixes it, but that's not ideal. Is there a correct way for me to be doing this?
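For reference, invalidating the cache the way the error message suggests looks roughly like this (the table name here is just illustrative):

spark.sql("REFRESH TABLE db1.table1")
# or, equivalently, through the catalog API:
spark.catalog.refreshTable("db1.table1")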
UPDATE: I don't have it working yet, but at least I now have a function that dynamically clears the table:
from pyspark.sql.types import *
from pyspark.sql.types import _parse_datatype_string

def empty_table(database_name, table_name):
    data = []
    schema = StructType()
    for column in spark.catalog.listColumns(table_name, database_name):
        datatype_string = _parse_datatype_string(column.dataType)
        schema.add(column.name, datatype_string, True)
    sdf = spark.createDataFrame(data, schema)
    path = "/{}/{}".format(database_name, table_name)
    sdf.write.format("csv").mode("overwrite").save(path)

Copy records from one table to another using spark-sql-jdbc

I am trying to do a POC in PySpark for a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but at first I am trying to do it only for one table and later enhance it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just never completes.
Spark UI: (screenshot omitted)
Could you please suggest how I should handle this?
Host: local machine
Spark version: 3.0.0
Database: Oracle
Code:
from pyspark.sql import SparkSession
from configparser import ConfigParser

# read configuration file
config = ConfigParser()
config.read('config.ini')

# setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)

# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Can you check the executor and driver details in the Spark UI and try increasing the resources? Also share the error it is failing with; you can get it from the same UI or via the CLI: yarn logs -applicationId <application_id>
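A common way to actually parallelise a JDBC read like this (a sketch only; the partition column "ID" and the bounds below are assumptions about main_table, not details from the question) is to give Spark a numeric partition column together with its bounds; without them, numPartitions has no effect on the read and the whole table comes through a single connection:

prprDF = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("user", dbUsr)
    .option("password", dbPwd)
    .option("driver", dbDrvr)
    .option("dbtable", "(SELECT * FROM main_table) main_tbl")
    .option("partitionColumn", "ID")   # assumed numeric column to split on
    .option("lowerBound", "1")         # min value of the partition column
    .option("upperBound", "1000000")   # max value of the partition column
    .option("numPartitions", 8)        # parallel JDBC connections
    .option("fetchsize", "10000")      # rows fetched per round trip
    .load()
)

A larger fetchsize on the read (and batchsize on the write) also cuts down the number of round trips to Oracle, which is usually what makes a single-threaded copy of a million rows crawl.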

Dataflow job fails and tries to create temp_dataset on BigQuery

I'm running a simple Dataflow job to read data from a table and write it back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset though; it's basically trying to create a temp_dataset because the job fails. But I don't get any information about the real error behind the scenes.
The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one.
Any idea how to deal with this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using BigQuerySource, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export from that temporary table to read the results from.
So it is expected behavior for it to create this temp dataset, which means the permission error is most likely the real error rather than something masking it.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use. That way the process doesn't create a temp dataset.
Example (Java SDK):
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery")
    .withQueryTempDataset("existingDataset")
    .usingStandardSql()
    .withMethod(TypedRead.Method.DEFAULT);
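In the Python SDK a similar knob exists on beam.io.ReadFromBigQuery (a sketch, assuming a reasonably recent Beam release that supports the temp_dataset argument; the project and dataset names are placeholders):

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery as beam_bq

# Reuse an existing dataset for the query's temporary table instead of
# letting the SDK create "_dataflow_temp_dataset_...".
temp_dataset = beam_bq.DatasetReference(projectId="prj",
                                        datasetId="existing_temp_dataset")

bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
    query=query,
    use_standard_sql=True,
    temp_dataset=temp_dataset,
)

The service account running the job then only needs access to that existing dataset rather than bigquery.datasets.create on the whole project.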

Beam Job Creates BigQuery Table but Does Not Insert

I am writing a Beam job that is a simple 1:1 ETL from a binary protobuf file stored in GCS into BigQuery. The table schema is quite large and generated automatically from a representative protobuf.
I am encountering behavior where the BigQuery table is created successfully, but no records are inserted. I have confirmed that records are being generated by the earlier stage, and when I use a normal file sink I can confirm that records are written.
Does anyone know why this is happening?
Logs:
WARNING:root:Inferring Schema...
WARNING:root:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Defining Beam Pipeline...
<PATH REDACTED>/venv/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py:1145: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
experiments = p.options.view_as(DebugOptions).experiments or []
WARNING:root:Running Beam Pipeline...
WARNING:root:extracted {'counters': [MetricResult(key=MetricKey(step=extract_games, metric=MetricName(namespace=__main__.ExtractGameProtobuf, name=extracted_games), labels={}), committed=8, attempted=8)], 'distributions': [], 'gauges': []} games
Pipeline Source:
def main(args):
    DEFAULT_REPLAY_IDS_PATH = "./replay_ids.txt"
    DEFAULT_BQ_TABLE_OUT = "<PROJECT REDACTED>:<DATASET REDACTED>.games"

    # configure logging
    logging.basicConfig(level=logging.WARNING)

    # set up replay source
    replay_source = ETLReplayRemoteSource.default()

    # TODO: load the example replay and parse schema
    logging.warning("Inferring Schema...")
    sample_replay = replay_source.load_replay(DEFAULT_REPLAY_IDS[0])
    game_schema = ProtobufToBigQuerySchemaGenerator(
        sample_replay.analysis.DESCRIPTOR).schema()
    # print("GAME SCHEMA:\n{}".format(game_schema))  # DEBUG

    # submit beam job that reads replays into bigquery
    def count_ones(word_ones):
        (word, ones) = word_ones
        return (word, sum(ones))

    with beam.Pipeline(options=PipelineOptions()) as p:
        logging.warning("Defining Beam Pipeline...")
        # replay_ids = p | "create_replay_ids" >> beam.Create(DEFAULT_REPLAY_IDS)
        (p | "read_replay_ids" >> beam.io.ReadFromText(DEFAULT_REPLAY_IDS_PATH)
           | "extract_games" >> beam.ParDo(ExtractGameProtobuf())
           | "write_out_bq" >> WriteToBigQuery(
               DEFAULT_BQ_TABLE_OUT,
               schema=game_schema,
               write_disposition=BigQueryDisposition.WRITE_APPEND,
               create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)
        )

        logging.warning("Running Beam Pipeline...")
        result = p.run()
        result.wait_until_finish()
        n_extracted = result.metrics().query(
            MetricsFilter().with_name('extracted_games'))
        logging.warning("extracted {} games".format(n_extracted))

How to best handle data stored in different locations in Google BigQuery?

My current workflow in BigQuery is as follows:
(1) query data in a public repository (stored in the US), (2) write it to a table in my repository, (3) export a CSV to a cloud bucket, (4) download the CSV to the server I work on, and (5) work with it on the server.
The problem I have now is that the server I work on is located in the EU, so I have to pay quite some fees for transferring data between my US bucket and my EU server. I could locate my bucket in the EU, but then I would still transfer data from the US (BigQuery) to the EU (bucket). I could also set my dataset in BQ to be located in the EU, but then I can't run any queries any longer, because the data in the public repository is located in the US, and queries between different locations are not allowed.
Does anyone have an idea of how to approach this?
One way to copy a BigQuery dataset from one region to another is to take advantage of the Storage Data Transfer Service. It doesn't get around the fact that you still have to pay for bucket-to-bucket network traffic, but might save you some CPU time on copying data to a server in the EU.
The flow would be to:
1. Extract all the BigQuery tables into a bucket in the same region as the tables. (Recommend Avro format for best fidelity in data types and fastest loading speed.)
2. Run a storage transfer job to copy the extracted files from the starting location bucket to a bucket in the destination location.
3. Load all the files into a BigQuery dataset located in the destination location.
Python example:
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import datetime
import sys
import time

import googleapiclient.discovery
from google.cloud import bigquery
import json
import pytz

PROJECT_ID = 'swast-scratch'  # TODO: set this to your project name
FROM_LOCATION = 'US'  # TODO: set this to the BigQuery location
FROM_DATASET = 'workflow_test_us'  # TODO: set to BQ dataset name
FROM_BUCKET = 'swast-scratch-us'  # TODO: set to bucket name in same location
TO_LOCATION = 'EU'  # TODO: set this to the destination BigQuery location
TO_DATASET = 'workflow_test_eu'  # TODO: set to destination dataset name
TO_BUCKET = 'swast-scratch-eu'  # TODO: set to bucket name in destination loc

# Construct API clients.
bq_client = bigquery.Client(project=PROJECT_ID)
transfer_client = googleapiclient.discovery.build('storagetransfer', 'v1')


def extract_tables():
    # Extract all tables in a dataset to a Cloud Storage bucket.
    print('Extracting {}:{} to bucket {}'.format(
        PROJECT_ID, FROM_DATASET, FROM_BUCKET))

    tables = list(bq_client.list_tables(bq_client.dataset(FROM_DATASET)))
    extract_jobs = []
    for table in tables:
        job_config = bigquery.ExtractJobConfig()
        job_config.destination_format = bigquery.DestinationFormat.AVRO
        extract_job = bq_client.extract_table(
            table.reference,
            ['gs://{}/{}.avro'.format(FROM_BUCKET, table.table_id)],
            location=FROM_LOCATION,  # Available in 0.32.0 library.
            job_config=job_config)  # Starts the extract job.
        extract_jobs.append(extract_job)

    for job in extract_jobs:
        job.result()

    return tables


def transfer_buckets():
    # Transfer files from one region to another using storage transfer service.
    print('Transferring bucket {} to {}'.format(FROM_BUCKET, TO_BUCKET))
    now = datetime.datetime.now(pytz.utc)
    transfer_job = {
        'description': '{}-{}-{}_once'.format(
            PROJECT_ID, FROM_BUCKET, TO_BUCKET),
        'status': 'ENABLED',
        'projectId': PROJECT_ID,
        'transferSpec': {
            'transferOptions': {
                'overwriteObjectsAlreadyExistingInSink': True,
            },
            'gcsDataSource': {
                'bucketName': FROM_BUCKET,
            },
            'gcsDataSink': {
                'bucketName': TO_BUCKET,
            },
        },
        # Set start and end date to today (UTC) without a time part to start
        # the job immediately.
        'schedule': {
            'scheduleStartDate': {
                'year': now.year,
                'month': now.month,
                'day': now.day,
            },
            'scheduleEndDate': {
                'year': now.year,
                'month': now.month,
                'day': now.day,
            },
        },
    }
    transfer_job = transfer_client.transferJobs().create(
        body=transfer_job).execute()
    print('Returned transferJob: {}'.format(
        json.dumps(transfer_job, indent=4)))

    # Find the operation created for the job.
    job_filter = {
        'project_id': PROJECT_ID,
        'job_names': [transfer_job['name']],
    }

    # Wait until the operation has started.
    response = {}
    while ('operations' not in response) or (not response['operations']):
        time.sleep(1)
        response = transfer_client.transferOperations().list(
            name='transferOperations', filter=json.dumps(job_filter)).execute()

    operation = response['operations'][0]
    print('Returned transferOperation: {}'.format(
        json.dumps(operation, indent=4)))

    # Wait for the transfer to complete.
    print('Waiting ', end='')
    while operation['metadata']['status'] == 'IN_PROGRESS':
        print('.', end='')
        sys.stdout.flush()
        time.sleep(5)
        operation = transfer_client.transferOperations().get(
            name=operation['name']).execute()
    print()
    print('Finished transferOperation: {}'.format(
        json.dumps(operation, indent=4)))


def load_tables(tables):
    # Load all tables into the new dataset.
    print('Loading tables from bucket {} to {}:{}'.format(
        TO_BUCKET, PROJECT_ID, TO_DATASET))

    load_jobs = []
    for table in tables:
        dest_table = bq_client.dataset(TO_DATASET).table(table.table_id)
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.AVRO
        load_job = bq_client.load_table_from_uri(
            ['gs://{}/{}.avro'.format(TO_BUCKET, table.table_id)],
            dest_table,
            location=TO_LOCATION,  # Available in 0.32.0 library.
            job_config=job_config)  # Starts the load job.
        load_jobs.append(load_job)

    for job in load_jobs:
        job.result()


# Actually run the script.
tables = extract_tables()
transfer_buckets()
load_tables(tables)
The preceding sample uses the google-cloud-bigquery library for the BigQuery API and google-api-python-client for the Storage Data Transfer API.
Note that this sample does not account for partitioned tables.
No matter what, you have data in the US that you need in the EU, so I think you have two options:
You could continue to pay many smaller fees to move your reduced datasets from the US to the EU as you're doing today.
You could pay the one-off fee to transfer the original public BQ dataset from the US to your own dataset in the EU. From then on, all queries you run stay in the same region, and you have no more trans-continental transfers.
It really depends on how many queries you plan to do. If it's not a lot, then the way you're doing things today seems like it'd be the most efficient. If it's a lot, then moving the data once (paying the up-front fee) might work out cheaper.
Maybe Google has some magical way to make this better, but as far as I can tell, you're dealing with lots of data on one side of the Atlantic that you need on the other side, and moving it across that wire costs money.