Copy records from one table to another using spark-sql-jdbc - apache-spark-sql

I am trying to do a POC in PySpark for a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but at first I am trying to do it for only one table and later extend it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just never completes.
Spark UI: (screenshot omitted)
Could you please suggest how I should handle this?
Host: local machine
Spark version: 3.0.0
Database: Oracle
Code:
from pyspark.sql import SparkSession
from configparser import ConfigParser

# read configuration file
config = ConfigParser()
config.read('config.ini')

# setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)

# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()

It looks like you are using only one thread (executor) to process the data over the JDBC connection. Check the executor and driver details in the Spark UI and try increasing the resources. Also share the error with which it is failing; you can get it from the same UI, or from the CLI with "yarn logs -applicationId <application ID>".
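For reference, the Spark JDBC source reads through a single connection (one task) unless it is also given a partition column and bounds; numPartitions on its own does not split the read. Below is a sketch of a partitioned read, assuming main_table has a numeric key column; the column name "id", the bounds and the fetch size are placeholders for illustration:
# Sketch: partitioned JDBC read. "id", the bounds and fetchsize are placeholders;
# use a real numeric column and the actual min/max of that column.
prprDF = spark.read.format("jdbc")\
    .option("url", url)\
    .option("user", dbUsr)\
    .option("password", dbPwd)\
    .option("driver", dbDrvr)\
    .option("dbtable", "(SELECT * FROM main_table) main_tbl")\
    .option("partitionColumn", "id")\
    .option("lowerBound", "1")\
    .option("upperBound", "1000000")\
    .option("numPartitions", 4)\
    .option("fetchsize", "10000")\
    .load()
Each of the numPartitions tasks then reads its own slice of the id range, and fetchsize controls how many rows the Oracle driver pulls per round trip (its default is quite small).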

Related

how to avoid needing to restart spark session after overwriting external table

I have an Azure data lake external table, and want to remove all rows from it. I know that the 'truncate' command doesn't work for external tables, and BTW I don't really want to re-create the table (but might consider that option for certain flows). Anyway, the best I've gotten to work so far is to create an empty data frame (with a defined schema) and overwrite the folder containing the data, e.g.:
from pyspark.sql.types import *

data = []
schema = StructType(
    [
        StructField('Field1', IntegerType(), True),
        StructField('Field2', StringType(), True),
        StructField('Field3', DecimalType(18, 8), True)
    ]
)
sdf = spark.createDataFrame(data, schema)
#sdf.printSchema()
#display(sdf)
sdf.write.format("csv").option('header', True).mode("overwrite").save("/db1/table1")
This mostly works, except that if I go to select from the table, it will fail with the below error:
Error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13) (vm-9cb62393 executor 2): java.io.FileNotFoundException: Operation failed: "The specified path does not exist."
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried running 'refresh' on the table but the error persisted. Restarting the spark session fixes it, but that's not ideal. Is there a correct way for me to be doing this?
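For completeness, the explicit invalidation the error message refers to can be issued from the session like this (db1.table1 is a placeholder name):
# Placeholder database/table name -- this is the call the error message suggests.
spark.sql("REFRESH TABLE db1.table1")
# or, equivalently, through the catalog API:
spark.catalog.refreshTable("db1.table1")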
UPDATE: I don't have it working yet, but at least I now have a function that dynamically clears the table:
from pyspark.sql.types import *
from pyspark.sql.types import _parse_datatype_string

def empty_table(database_name, table_name):
    data = []
    schema = StructType()
    for column in spark.catalog.listColumns(table_name, database_name):
        datatype_string = _parse_datatype_string(column.dataType)
        schema.add(column.name, datatype_string, True)
    sdf = spark.createDataFrame(data, schema)
    path = "/{}/{}".format(database_name, table_name)
    sdf.write.format("csv").mode("overwrite").save(path)

PyHive unable to fetch logs from HiveServer2 when running in async mode

I am running into a strange issue with PyHive when running a Hive query in async mode. Internally, PyHive uses a Thrift client to execute the query and to fetch logs (along with the execution status). I am unable to fetch the logs of the Hive query (map/reduce tasks, etc.); cursor.fetch_logs() returns an empty data structure.
Here is the code snippet
from pyhive import hive  # or import trino
from TCLIService.ttypes import TOperationState

def run():
    cursor = hive.connect(host="10.x.y.z", port='10003', username='xyz', password='xyz', auth='LDAP').cursor()
    cursor.execute("select count(*) from schema1.table1 where date = '2021-03-13' ", async_=True)
    status = cursor.poll(True).operationState
    print(status)
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        logs = cursor.fetch_logs()
        for message in logs:
            print("running ")
            print(message)
        # If needed, an asynchronous query can be cancelled at any time with:
        # cursor.cancel()
        print("running ")
        status = cursor.poll().operationState
    cursor.fetchall()
The cursor is able to get the operationState correctly, but it is unable to fetch the logs. Is there anything on the HiveServer2 side that needs to be configured?
Thanks in advance
Closing the loop here in case someone else runs into the same or a similar issue with Hive.
In my case the problem was the HiveServer2 configuration: HiveServer2 won't stream the logs if operation logging is not enabled. The following are the properties I configured:
hive.server2.logging.operation.enabled = true
hive.server2.logging.operation.level = EXECUTION (basic logging; other values increase the logging level)
hive.async.log.enabled = false
hive.server2.logging.operation.log.location
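If you want to confirm what HiveServer2 is actually running with, the same PyHive cursor can read these properties back; a small sketch, assuming "SET <property>" is allowed on your server:
# Sketch: print the effective operation-logging settings via the existing cursor.
# Each query should return a single row like "property=value".
for prop in ("hive.server2.logging.operation.enabled",
             "hive.server2.logging.operation.level",
             "hive.async.log.enabled"):
    cursor.execute("SET " + prop)
    print(cursor.fetchall())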

Dataflow job fails and tries to create temp_dataset on Bigquery

I'm running a simple Dataflow job to read data from a table and write it back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset though; it's basically trying to create a temp_dataset because the job fails, but I don't get any information on the real error behind the scenes.
The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using BigQuerySource with a query, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export from that temporary table to read the results.
So it is expected behavior for it to create this temp_dataset. This means that it is probably not hiding an error.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use. That way the process doesn't create a temp dataset.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery")
    .withQueryTempDataset("existingDataset")
    .usingStandardSql()
    .withMethod(TypedRead.Method.DEFAULT);
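The snippet above is the Java SDK API. For the Python SDK used in the question, newer Beam releases expose a similar temp_dataset argument on ReadFromBigQuery; the sketch below assumes your Beam version supports it (check the release notes for your version), and "existing_dataset" is a placeholder:
# Sketch, assuming a Beam Python version that accepts temp_dataset on
# ReadFromBigQuery; "existing_dataset" must already exist in project "prj".
from apache_beam.io.gcp.internal.clients import bigquery as bq_messages

temp_ds = bq_messages.DatasetReference(projectId="prj", datasetId="existing_dataset")
bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
    query=query,
    use_standard_sql=True,
    temp_dataset=temp_ds,
)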

Getting duplicate AMI name client error, even though it is creating for the first time

I am using Jenkins to trigger a Lambda, which creates an AMI image from an EC2 instance, then creates a launch configuration and updates the Auto Scaling group with the new launch configuration, letting the Auto Scaling group create instances from the latest launch configuration and terminate the older instances.
My code sometimes runs properly, but sometimes it gives me "ClientError: An error occurred (InvalidAMIName.Duplicate) when calling the CreateImage operation: AMI name "API_AMI_200220_1629" is already in use by AMI ami-033a3681473f9acbd".
My AMI names are created dynamically, like "API_AMI_$(date +%d%m%y_%H%M)", so technically no duplicate AMI name should ever be created. But I am getting this error and the AMI stays in a pending state. Does anyone have a suggestion or a solution for why this happens only sometimes and not every time? Please check the code below.
import json
import boto3
import time

def lambda_handler(event, context):
    flag_image = 1
    instance_id = event['instanceId']
    ami_name = event['amiName']
    launch_config = event['launchConfig']
    autoscaling_name = event['autoscalingName']
    ec2_client = boto3.client('ec2', region_name='us-east-1')
    autoscaling_client = boto3.client('autoscaling', region_name='us-east-1')
    print("##############Creating Image########################")
    response = ec2_client.create_image(InstanceId=instance_id, Name=ami_name)
    print(response)
    imageId = response['ImageId']
    print(imageId)
    describe_image = ec2_client.describe_images(ImageIds=[imageId])
    while flag_image == 1:
        for i in describe_image['Images']:
            time.sleep(5)
            print(i['State'])
            if i['State'] == 'available':
                flag_image = 0
        describe_image = ec2_client.describe_images(ImageIds=[imageId])
Thanks in advance.
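As an aside on the polling loop above: boto3 ships a built-in waiter that can replace the manual describe_images loop. A sketch, reusing the names from the handler (the Delay/MaxAttempts values are arbitrary):
# Sketch: wait for the AMI to become available using boto3's built-in waiter.
waiter = ec2_client.get_waiter('image_available')
waiter.wait(
    ImageIds=[imageId],
    WaiterConfig={'Delay': 15, 'MaxAttempts': 40}  # poll every 15 s, up to 40 times
)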

How join in SparkSQL data from mysql and Oracle?

Is it possible in SparkSQL to join data from MySQL and Oracle databases? I tried to join them, but I have some trouble setting multiple jars (the JDBC drivers for MySQL and Oracle) in SPARK_CLASSPATH.
Here is my code:
import os
import sys

os.environ['SPARK_HOME'] = "/home/x/spark-1.5.2"
sys.path.append("/home/x/spark-1.5.2/python/")

try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    MYSQL_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/mysql-connector-java-5.1.38-bin.jar"
    MYSQL_CONNECTION_URL = "jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw"
    ORACLE_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/ojdbc6.jar"
    Oracle_CONNECTION_URL = "jdbc:oracle:thin:user/pasw#192.111.333.999:1521:xe"

    # Define Spark configuration
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("MySQL_Oracle_imp_exp")

    # Initialize a SparkContext and SQLContext
    sc = SparkContext(conf=conf)
    #sc.addJar(MYSQL_DRIVER_PATH)
    sqlContext = SQLContext(sc)

    ora_tmp = sqlContext.read.format('jdbc').options(
        url=Oracle_CONNECTION_URL,
        dbtable="TABLE1",
        driver="oracle.jdbc.OracleDriver"
    ).load()
    ora_tmp.show()

    tmp2 = sqlContext.load(
        source="jdbc",
        path=MYSQL_DRIVER_PATH,
        url=MYSQL_CONNECTION_URL,
        dbtable="(select city,zip from TABLE2 limit 10) as tmp2",
        driver="com.mysql.jdbc.Driver")
    c_rows = tmp2.collect()
    ....
except Exception as e:
    print e
    sys.exit(1)
Could someone please help me to solve this problem?
Thanks in advance :)
Here are the steps you need to follow:
1. Register SPARK_CLASSPATH with the jar of one of the databases, say MySQL, using:
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"
2. Run the query against the MySQL database and assign the result to an RDD/DataFrame.
3. Register SPARK_CLASSPATH with the jar of the second database by changing the path in the command above.
4. Run the query against the second database.
If you have issues with lazy evaluation, make sure you first write the first data set out to files and then proceed further; a sketch of these steps follows below.
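A minimal sketch of that workflow, reusing the driver paths, URLs and table names from the question (the /tmp path and the join key are made-up placeholders, and whether a SPARK_CLASSPATH change is picked up after the SparkContext exists depends on your setup, so treat this as an outline rather than a verified recipe):
# Steps 1-2: read from Oracle and materialize the data to files before the
# classpath changes (names reused from the question; the /tmp path is a placeholder).
os.environ['SPARK_CLASSPATH'] = ORACLE_DRIVER_PATH
ora_df = sqlContext.read.format('jdbc').options(
    url=Oracle_CONNECTION_URL,
    dbtable="TABLE1",
    driver="oracle.jdbc.OracleDriver").load()
ora_df.write.parquet("/tmp/oracle_table1")

# Steps 3-4: switch the classpath to the MySQL driver and read the second table.
os.environ['SPARK_CLASSPATH'] = MYSQL_DRIVER_PATH
mysql_df = sqlContext.read.format('jdbc').options(
    url=MYSQL_CONNECTION_URL,
    dbtable="(select city, zip from TABLE2 limit 10) as tmp2",
    driver="com.mysql.jdbc.Driver").load()

# Join the materialized Oracle data with the MySQL data; the join key "zip"
# on both sides is hypothetical.
ora_back = sqlContext.read.parquet("/tmp/oracle_table1")
joined = ora_back.join(mysql_df, ora_back.zip == mysql_df.zip)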