Cloud Composer - DAG error: java.lang.ClassNotFoundException: Failed to find data source: bigquery - apache-spark-sql

I'm trying to execute a DAG which creates a Dataproc cluster in Cloud Composer, but it fails when trying to save to BigQuery. I suppose a jar file is missing (--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar), but I don't know how to add it to my code.
Code:
submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID)
If I call this job on the cluster directly, it works:
gcloud dataproc jobs submit pyspark --cluster cluster-bc4b --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar --region us-central1 ~/examen/ETL/loadBQ.py
But I don't know how to replicate that in Airflow.
The PySpark code:
df.write \
    .format("bigquery") \
    .mode("append") \
    .option("temporaryGcsBucket", "ds1-dataproc/temp") \
    .save("test-opi-330322.test.Base3")

In your example:
submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID)
The jars should be part of PYSPARK_JOB, like:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}
See this doc.
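Putting the two pieces together, a minimal sketch of the task definition (PROJECT_ID, REGION, CLUSTER_NAME and PYSPARK_URI are assumed to be defined elsewhere in your DAG):
# Minimal sketch: jar_file_uris mirrors the --jars flag of the gcloud command.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}

submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID)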

Related

Trying to run a Dataflow job in the us-central1 region, but source and target are in asia-south1

I wanted to check on a similar error that is also mentioned in the post https://stackoverflow.com/questions/37298504/google-dataflow-job-and-bigquery-failing-on-different-regions?rq=1
I am facing a similar issue in my Dataflow job, where I get the error below:
2021-03-10T06:02:26.115216545ZWorkflow failed. Causes: S01:Read File from GCS/Read+String To BigQuery Row+Write to BigQuery/NativeWrite failed., BigQuery import job "dataflow_job_15712075439082970546-B" failed., BigQuery job "dataflow_job_15712075439082970546-B" in project "whr-asia-datalake-prod" finished with error(s): errorResult: Cannot read and write in different locations: source: US, destination: asia-south1, error: Cannot read and write in different locations: source: US, destination: asia-south1
This is the error I get when I try to run the job from a Cloud Function trigger. Please find the Cloud Function code below. Both my source data and my target BigQuery dataset reside in asia-south1.
"""
Google cloud funtion used for executing dataflow jobs.
"""
from googleapiclient.discovery import build
import time
def df_load_function(file, context):
filesnames = [
'5667788_OPTOUT_',
'WHR_AD_EMAIL_CNSNT_RESP_'
]
# Check the uploaded file and run related dataflow jobs.
for i in filesnames:
if 'inbound/{}'.format(i) in file['name']:
print("Processing file: {filename}".format(filename=file['name']))
project = '<my project>'
inputfile = 'gs://<my bucket>/inbound/' + file['name']
job = 'df_load_wave1_{}'.format(i)
template = 'gs://<my bucket>/template/df_load_wave1_{}'.format(i)
location = 'us-central1'
dataflow = build('dataflow', 'v1b3', cache_discovery=False)
request = dataflow.projects().locations().templates().launch(
projectId=project,
gcsPath=template,
location=location,
body={
'jobName': job,
"environment": {
"workerZone": "us-central1-a"
}
}
)
# Execute the dataflowjob
response = request.execute()
job_id = response["job"]["id"]
I have kept both the location and workerZone as us-central1 and us-central1-a respectively. I need to run my Dataflow job in us-central1 due to some resource issues, but read and write data in asia-south1. What else do I need to add to the Cloud Function so that the region and zone are both us-central1 but data is read from and written to asia-south1?
However, when I run my job manually from Cloud Shell using the commands below, it works fine and the data is loaded. Here, too, both the region and the zone are us-central1:
python -m <python script where the data is read from bucket and load big query> \
--project <my_project> \
--region us-central1 \
--runner DataflowRunner \
--staging_location gs://<bucket_name>/staging \
--temp_location gs://<bucket_name>/temp \
--subnetwork https://www.googleapis.com/compute/v1/projects/whr-ios-network/regions/us-central1/subnetworks/<subnetwork> \
--network projects/whr-ios-network/global/networks/<name> \
--zone us-central1-a \
--save_main_session
Please help, anyone; I have been struggling with this issue.
I was able to fix the error below:
"2021-03-10T06:02:26.115216545ZWorkflow failed. Causes: S01:Read File from GCS/Read+String To BigQuery Row+Write to BigQuery/NativeWrite failed., BigQuery import job "dataflow_job_15712075439082970546-B" failed., BigQuery job "dataflow_job_15712075439082970546-B" in project "whr-asia-datalake-prod" finished with error(s): errorResult: Cannot read and write in different locations: source: US, destination: asia-south1, error: Cannot read and write in different locations: source: US, destination: asia-south1"
I just changed my Cloud Function to add a temp location in my asia-south1 bucket. Even though I had provided an asia-south1 temp location while creating the template, the BigQuery IO in my Dataflow job was trying to use a temp location in us-central1 rather than asia-south1, hence the error above.
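In the Cloud Function above, that roughly corresponds to passing a tempLocation in the template launch environment. A minimal sketch, reusing the variables from the function above; the asia-south1 bucket path is a hypothetical placeholder:
# Sketch: keep the job running in us-central1 but stage temp files in asia-south1,
# so the BigQuery load happens in the same location as the target dataset.
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    gcsPath=template,
    location=location,  # us-central1, where the job runs
    body={
        'jobName': job,
        'environment': {
            'workerZone': 'us-central1-a',
            'tempLocation': 'gs://<asia-south1-bucket>/temp'
        }
    }
)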

Flink on EMR - no output, either to console or to file

I'm trying to deploy my Flink job on AWS EMR (version 5.15 with Flink 1.4.2). However, I cannot get any output from my stream.
I tried to create a simple job:
import org.apache.flink.api.common.io.FilePathFilter
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._

object StreamingJob1 {
  def main(args: Array[String]) {
    val path = args(0)
    val file_input_format = new TextInputFormat(
      new org.apache.flink.core.fs.Path(path))
    file_input_format.setFilesFilter(FilePathFilter.createDefaultFilter())
    file_input_format.setNestedFileEnumeration(true)

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val myStream: DataStream[String] =
      env.readFile(file_input_format,
        path,
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        1000L)
        .map(s => s.split(",").toString)

    myStream.print()

    // execute program
    env.execute("Flink Streaming Scala")
  }
}
And I executed it using the following command:
HADOOP_CONF_DIR=/etc/hadoop/conf; flink run -m yarn-cluster -yn 4 -c my.pkg.StreamingJob1 /home/hadoop/flink-test-0.1.jar hdfs:///user/hadoop/data/
There was no error, but no output on the screen except Flink's INFO logs.
I tried to output to a Kinesis stream, or to an S3 file. Nothing was recorded.
myStream.addSink(new BucketingSink[String](output_path))
I also tried to write to an HDFS file. In this case, a file was created, but with size = 0.
I am sure that the input file was processed, using a simple check:
myStream.map(s => {"abc".toInt})
which generated an exception.
What am I missing here?
It looks like stream.print() doesn't work on EMR.
Output to file: HDFS is used, and sometimes (or most of the time) I need to wait for the file to be updated.
Output to Kinesis: I had a typo in my stream name. I don't know why I didn't get an exception for the non-existent stream, but after correcting the name I got the messages I expected.

How to export from BigQuery to Datastore?

I have tables in BigQuery which I want to export and import in Datastore.
How to achieve that?
A table from BigQuery can be exported and imported into your Datastore.
Download the jar file from https://github.com/yu-iskw/bigquery-to-datastore/releases
Then run the command
java -cp bigquery-to-datastore-bundled-0.5.1.jar com.github.yuiskw.beam.BigQuery2Datastore \
    --project=yourprojectId \
    --runner=DataflowRunner \
    --inputBigQueryDataset=datastore \
    --inputBigQueryTable=metainfo_internal_2 \
    --outputDatastoreNamespace=default \
    --outputDatastoreKind=meta_internal \
    --keyColumn=key \
    --indexedColumns=column1,column2 \
    --tempLocation=gs://gsheetbackup_live/temp \
    --gcpTempLocation=gs://gsheetlogfile_live/temp
--tempLocation and --gcpTempLocation are valid Cloud Storage bucket URLs.
--keyColumn=key - the key here is the unique field in your BigQuery table.
2020 answer:
Use GoogleCloudPlatform/DataflowTemplates, specifically BigQueryToDatastore.
# Builds the Java project and uploads an artifact to GCS
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.BigQueryToDatastore \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<project-id> \
--region=<region-name> \
--stagingLocation=gs://<bucket-name>/staging \
--tempLocation=gs://<bucket-name>/temp \
--templateLocation=gs://<bucket-name>/templates/<template-name>.json \
--runner=DataflowRunner"
# Uses the GCS artifact to run the transfer job
gcloud dataflow jobs run <job-name> \
--gcs-location=<template-location> \
--zone=<zone> \
--parameters "\
readQuery=SELECT * FROM <dataset>.<table>,readIdColumn=<id>,\
invalidOutputPath=gs://your-bucket/path/to/error.txt,\
datastoreWriteProjectId=<project-id>,\
datastoreWriteNamespace=<namespace>,\
datastoreWriteEntityKind=<kind>,\
errorWritePath=gs://your-bucket/path/to/errors.txt"
I hope this will get a proper user interface in the GCP Console one day! (as is already possible for Pub/Sub to BigQuery using Dataflow SQL)
You may export BigQuery data to CSV, then import the CSV into Datastore. The first step is easy and well documented: https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in_bigquery. For the second step, there are many resources that can help you, for example:
https://groups.google.com/forum/#!topic/google-appengine/L64wByP7GAY
Import CSV into google cloud datastore
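For the import step, a minimal sketch using the google-cloud-datastore client library (the file name, kind, and key column below are hypothetical placeholders):
# Sketch: load rows from a CSV exported by BigQuery into Datastore.
# Assumes a unique "id" column to use as the entity key; kind and project are placeholders.
import csv
from google.cloud import datastore

client = datastore.Client(project='<project-id>')

with open('export.csv', newline='') as f:
    for row in csv.DictReader(f):
        key = client.key('MyKind', row['id'])   # key column from the BigQuery table
        entity = datastore.Entity(key=key)
        entity.update(row)                      # remaining columns become entity properties
        client.put(entity)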

PySpark - Spark clusters EC2 - unable to save to S3

I have set up a Spark cluster with a master and 2 slaves (I'm using Spark Standalone). The cluster works well with some of the examples, but not with my application. My application's workflow is: read the csv -> extract each line of the csv along with the header -> convert to JSON -> save to S3. Here is my code:
from pyspark.sql import SparkSession

def upload_func(row):
    f = row.toJSON()
    f.saveAsTextFile("s3n://spark_data/" + row.name + ".json")
    print(f)
    print(row.name)

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL data source example") \
        .getOrCreate()
    df = spark.read.csv("sample.csv", header=True, mode="DROPMALFORMED")
    df.rdd.map(upload_func)
I have also exported the AWS_Key_ID and AWS_Secret_Key into the EC2 environment. However, with the above code my application does not work. Below are the issues:
The JSON files are not saved in S3. I have tried running the application a few times and reloading the S3 page, but there is no data. The application completed without any error in the log. Also, print(f) and print(row.name) are not printed in the log. What do I need to fix to get the JSON saved to S3, and is there any way for me to print to the log for debugging purposes?
Currently I need to put the csv file on the worker node so the application can read it. How can I put the file somewhere else, say on the master node, and have the application split the csv across all the worker nodes when it runs, so they can do the upload in parallel as a distributed system?
Help is really appreciated. Thanks in advance for your help.
UPDATED
After adding a logger for debugging, I identified that the map function upload_func() is not being called, or the application could not get inside this function (the logger printed messages before and after the function call). Please help if you know why.
You need to force the map to be evaluated; Spark only executes work on demand.
df.rdd.map(upload_func).count() should do it.
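As a side note, if the goal is simply to land the rows as JSON files on S3, the DataFrame writer can do that directly without a per-row map. A minimal sketch, assuming the s3a filesystem and AWS credentials are already configured and the bucket path is a placeholder:
# Sketch: write the whole DataFrame as line-delimited JSON to S3 in one action.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-json-s3").getOrCreate()
df = spark.read.csv("sample.csv", header=True, mode="DROPMALFORMED")
df.write.mode("overwrite").json("s3a://spark_data/output/")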

Hive on Spark doesn't work in Hue

I am trying to trigger Hive on Spark using the Hue interface. The job works perfectly when run from the command line, but when I try to run it from Hue it throws exceptions. In Hue, I tried mainly two things:
1) When I give all the properties in the .hql file using set commands:
set spark.home=/usr/lib/spark;
set hive.execution.engine=spark;
set spark.eventLog.enabled=true;
add jar /usr/lib/spark/assembly/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar;
set spark.eventLog.dir=hdfs://10.11.50.81:8020/tmp/;
set spark.executor.memory=2899102923;
I get an error
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Unsupported execution engine: Spark. Please set hive.execution.engine=mr)'
org.apache.hadoop.hive.ql.metadata.HiveException: Unsupported execution engine: Spark. Please set hive.execution.engine=mr
2) When I give the properties in the Hue properties, it only works with the mr engine, not with the spark execution engine.
Any help would be appreciated
I have solved this issue by using a shell action in Oozie.
The shell action invokes a PySpark job carrying my SQL.
Even though the job shows as MR in the job tracker, the Spark history server recognizes it as a Spark action and the output is produced.
shell file:
#!/bin/bash
export PYTHONPATH=`pwd`
spark-submit --master local testabc.py
python file:
from pyspark.sql import HiveContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = HiveContext(sc)
result = sqlContext.sql("insert into table testing_oozie.table2 select * from testing_oozie.table1")
result.show()