Unable to load data into a BigQuery table with BigQueryOperator in Composer - google-bigquery

I have created a DAG to load data from one BigQuery table into another BigQuery table using the BigQueryOperator in Composer, but it is not working as expected. I'm not able to find the error; can anyone please help me resolve this issue?
I also manually created an empty destination table, but the data is still not loading into it. Please find the code below and let me know if I missed anything.
from typing import Any
from datetime import datetime, timedelta

import airflow
from airflow import models
from airflow.operators import bash_operator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

sql = """SELECT * FROM `project_id.dataset_name.source_table`"""

DEFAULT_ARGUMENTS = {
    "owner": "Airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 8, 7),
    "schedule_interval": '0 6 * * *',
    "retries": 10
}

dag = models.DAG(
    dag_id='Bq_to_bq',
    default_args=DEFAULT_ARGUMENTS
)

LOAD_TABLE_TRUNC = BigQueryOperator(
    task_id='load_bq_table_truncate',
    dag=dag,
    bql=sql,
    destination_proect_dataset_table='project-id.dataset-name.table_name',
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results='true',
    use_legacy_sql=False,
)

LOAD_TABLE_APPEND = BigQueryOperator(
    task_id='load_bq_table_append',
    dag=dag,
    bql=sql,
    destination_proect_dataset_table='project-id.dataset-name.table_name',
    write_disposition='WRITE_APPEND',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results='true',
    use_legacy_sql=False,
)

LOAD_TABLE_TRUNC.set_downstream(LOAD_TABLE_APPEND)

This is how to find the error specific to the DAG's failure.
You can find the error in two ways:
Web interface:
Go to the DAG and select the Graph view.
Select the task and click on View Log.
Stackdriver logging:
Go to this URL: https://console.cloud.google.com/logs/viewer?project=project_id
Select 'Cloud Composer Environment' from the first dropdown, followed by the location and the DAG name.
Select Error from the Log level dropdown.

As Josh mentioned in his comment on your post, the value for allow_large_results should be True without the quotes. Additionally, I see that you have a typo in your spelling of destination_proect_dataset_table. You're missing a 'j': destination_project_dataset_table
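With those two fixes applied, the truncate task would look roughly like this (a minimal sketch based on the code in your post; the project, dataset, and table names are the same placeholders you used):
LOAD_TABLE_TRUNC = BigQueryOperator(
    task_id='load_bq_table_truncate',
    dag=dag,
    bql=sql,
    # 'project' spelled correctly in the keyword name
    destination_project_dataset_table='project-id.dataset-name.table_name',
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    # a Python boolean, not the string 'true'
    allow_large_results=True,
    use_legacy_sql=False,
)
The same two changes apply to the append task.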
BVSKanth's recommendations for finding DAG errors are also spot-on and worth considering for future debugging.

Related

Sensor a file from GCS to BigQuery. If success, load it to a GCS bucket; else, load it to another bucket

I want to build a DAG like this:
Description:
I have a file named population_YYYYMMDD.csv locally. I load it into a GCS bucket - folder A - using GCSObjectExistenceSensor => Done
Then I transform it using DataflowTemplatedJobStartOperator (transforming things like column names, data types, ...).
Based on whether processing the population_YYYYMMDD file succeeded or failed:
If it succeeded, I want to load it into BigQuery - dataset A, table named population_YYYYMMDD - and move the csv file to another folder, a Success folder (same or different bucket is also ok).
If it failed, the csv file should move to a Failure folder.
You can make use of the BranchPythonOperator in Airflow to implement the conditional block in your case. This operator calls a Python callable which in turn returns the task_id of the next task to be executed.
Since you've not shared any code, here is a simple example DAG you can follow. Please feel free to replace the DummyOperator with your requirement-specific operators.
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.dummy import DummyOperator
from datetime import datetime
from airflow.models.taskinstance import TaskInstance

default_args = {
    'start_date': datetime(2020, 1, 1)
}

def _choose_best_model(**kwargs):
    # Look up the state of the Dataflow task in the current DAG run
    dag_instance = kwargs['dag']
    execution_date = kwargs['execution_date']
    operator_instance = dag_instance.get_task("Dataflow-operator")
    task_status = TaskInstance(operator_instance, execution_date).current_state()
    if task_status == 'success':
        return 'upload-to-BQ'
    else:
        return 'upload-to-GCS'

with DAG('branching', schedule_interval=None, default_args=default_args, catchup=False) as dag:
    task1 = DummyOperator(task_id='Local-to-GCS')
    task2 = DummyOperator(task_id='Dataflow-operator')
    branching = BranchPythonOperator(
        task_id='conditional-task',
        python_callable=_choose_best_model,
        provide_context=True,
        # run the branch only after the Dataflow task has finished, whatever its outcome
        trigger_rule='all_done'
    )
    accurate = DummyOperator(task_id='upload-to-BQ')
    inaccurate = DummyOperator(task_id='upload-to-GCS')

    task1 >> task2 >> branching
    branching >> [accurate, inaccurate]
Here the _choose_best_model function is called by the BranchPythonOperator. It determines the state of task2 (which in your case would be the DataflowTemplatedJobStartOperator) and, if that task was successful, returns the task_id of one downstream task while skipping the other.

How to avoid needing to restart the Spark session after overwriting an external table

I have an Azure Data Lake external table and want to remove all rows from it. I know that the TRUNCATE command doesn't work for external tables, and I don't really want to re-create the table (but might consider that option for certain flows). Anyway, the best I've gotten to work so far is to create an empty data frame (with a defined schema) and overwrite the folder containing the data, e.g.:
from pyspark.sql.types import *

data = []
schema = StructType(
    [
        StructField('Field1', IntegerType(), True),
        StructField('Field2', StringType(), True),
        StructField('Field3', DecimalType(18, 8), True)
    ]
)
sdf = spark.createDataFrame(data, schema)
#sdf.printSchema()
#display(sdf)
sdf.write.format("csv").option('header', True).mode("overwrite").save("/db1/table1")
This mostly works, except that if I go to select from the table, it will fail with the below error:
Error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13) (vm-9cb62393 executor 2): java.io.FileNotFoundException: Operation failed: "The specified path does not exist."
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried running 'refresh' on the table but the error persisted. Restarting the spark session fixes it, but that's not ideal. Is there a correct way for me to be doing this?
UPDATE: I don't have it working yet, but at least I now have a function that dynamically clears the table:
from pyspark.sql.types import *
from pyspark.sql.types import _parse_datatype_string

def empty_table(database_name, table_name):
    data = []
    schema = StructType()
    # Rebuild the table's schema from the catalog so the empty overwrite keeps the same columns
    for column in spark.catalog.listColumns(table_name, database_name):
        datatype_string = _parse_datatype_string(column.dataType)
        schema.add(column.name, datatype_string, True)
    sdf = spark.createDataFrame(data, schema)
    path = "/{}/{}".format(database_name, table_name)
    sdf.write.format("csv").mode("overwrite").save(path)

NullPointerException on loading data into Grakn

I have created a backup of Grakn with the exporter tool like this:
./grakn server export 'old_test' backup.grakn
$x isa export,
has status "completed",
has progress (100.0%),
has count (105 / 105);
I then wanted to import this into a new keyspace with
./grakn server import 'new_test' backup.grakn
But I got this error below:
An error has occurred during boot-up. Please run 'grakn server status' or check the logs located under the 'logs' directory.
io.grpc.StatusRuntimeException: INTERNAL: java.lang.NullPointerException
You need to import your schema into the new keyspace first; this error occurs because the server cannot find a schema label in your dataset. The steps for migrating the schema are described in the docs: https://dev.grakn.ai/docs/management/migration-and-backup

Failed to connect to BigQuery with Python - ServiceUnavailable

Querying data from BigQuery had been working for me. Then I updated my Google packages (e.g. google-cloud-bigquery) and suddenly I could no longer download data. Unfortunately, I don't know which old version of the package I was using anymore. Now I'm using version 1.26.1 of google-cloud-bigquery.
Here is my code which was running:
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd

KEY_FILE_LOCATION = "path_to_json"
PROJECT_ID = 'bigquery-123454'

credentials = service_account.Credentials.from_service_account_file(KEY_FILE_LOCATION)
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)

query_job = client.query("""
    SELECT
        x,
        y
    FROM
        `bigquery-123454.624526435.ga_sessions_*`
    WHERE
        _TABLE_SUFFIX BETWEEN '20200501' AND '20200502'
""")

results = query_job.result()
df = results.to_dataframe()
Except for the last line, df = results.to_dataframe(), the code works perfectly. Now I get a weird error which consists of three parts:
Part 1:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1596627109.629000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1596627109.629000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
Part 2:
ServiceUnavailable: 503 failed to connect to all addresses
Part 3:
RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x0000000010BD3C80>, table_reference {
project_id: "bigquery-123454"
dataset_id: "_a0003e6c1ab4h23rfaf0d9cf49ac0e90083ca349e"
table_id: "anon2d0jth_f891_40f5_8c63_76e21ab5b6f5"
}
requested_streams: 1
read_options {
}
format: ARROW
parent: "projects/bigquery-123454"
, metadata=[('x-goog-request-params', 'table_reference.project_id=bigquery-123454&table_reference.dataset_id=_a0003e6c1abanaw4egacf0d9cf49ac0e90083ca349e'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.30.0 gax/1.22.0 gapic/1.0.0')]), last exception: 503 failed to connect to all addresses
I don't have an explanation for this error, and I don't think it has anything to do with me updating the packages.
I once had problems with the proxy, but those caused a different error.
My colleague said that the project "bigquery-123454" is still available in BigQuery.
Any ideas?
Thanks for your help in advance!
A 503 error occurs when there is a network issue. Try again after some time or retry the job.
You can read more about this error in the Google Cloud documentation.
I found the answer:
After downgrading the package "google-cloud-bigquery" from version 1.26.1 to 1.18.1 the code worked again! So the new package caused the errors.
I downgraded the package using pip install google-cloud-bigquery==1.18.1 --force-reinstall
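If downgrading is not an option, one thing worth trying - a hedged suggestion, on the assumption that it is the gRPC-based BigQuery Storage API used by newer versions of to_dataframe() that cannot connect (for example behind a proxy) - is to disable that client so the rows are fetched over plain REST instead:
results = query_job.result()

# create_bqstorage_client=False skips the BigQuery Storage API (gRPC) download path
# and falls back to the slower REST-based one, which avoids blocked gRPC traffic.
df = results.to_dataframe(create_bqstorage_client=False)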

Dataflow job fails and tries to create a temp_dataset on BigQuery

I'm running a simple dataflow job to read data from a table and write back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset, though; it's basically trying to create a temp_dataset because the job fails, and I don't get any information on the real error behind the scenes.
The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, and my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."

    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)

    table_schema = ...

    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using BigQuerySource, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export from that temporary table to read the results.
So it is expected behavior for it to create this temp_dataset, which means the permission message is probably not hiding another error.
This is not very well documented, but it can be seen in the implementation of BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use; that way the process doesn't create a temp dataset.
Example (from the Java SDK):
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery")
    .withQueryTempDataset("existingDataset")
    .usingStandardSql()
    .withMethod(TypedRead.Method.DEFAULT);
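In the Python SDK, a roughly equivalent knob exists on ReadFromBigQuery in recent Beam releases. The sketch below is hedged: the temp_dataset argument and the DatasetReference message are assumptions about your Beam version, so check its documentation before relying on them; the dataset name existing_dataset is a placeholder for a dataset your service account can already write to.
from apache_beam.io.gcp.internal.clients import bigquery as bq_messages

# Assumed API: temp_dataset tells the source to stage query results in an existing
# dataset instead of creating a "_dataflow_temp_dataset_..." one.
bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
    query=query,
    use_standard_sql=True,
    temp_dataset=bq_messages.DatasetReference(
        projectId="prj",
        datasetId="existing_dataset"
    )
)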