Submitting Spark Job to Livy (in EMR) from Airflow (using airflow Livy operator) - amazon-emr

I am trying to schedule a job in EMR using the Airflow Livy operator. Here is the example code I followed. The issue is that the Livy connection string (host name & port) is not specified anywhere. How do I provide the Livy server host name & port to the operator?
Also, the operator has a parameter livy_conn_id, which in the example is set to a value of livy_conn_default. Is that the right value, or do I have to set some other value?

You should have livy_conn_default under Connections in the Admin tab of your Airflow dashboard. If that's set up correctly then yes, you can use it. Otherwise, you can change it or create another connection ID and use that in livy_conn_id.
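If the connection does not exist yet, one way to create it is with the Airflow CLI (a minimal sketch, assuming Airflow 2.x with the Livy provider installed; the host below is a placeholder for the EMR master node where Livy listens, default port 8998):
airflow connections add 'livy_conn_default' \
    --conn-type 'livy' \
    --conn-host 'ip-10-0-0-1.ec2.internal' \
    --conn-port '8998'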

There are two operators we can use to connect Airflow to Livy:
Using LivyBatchOperator
Using LivyOperator
In the following example, I will cover the LivyOperator API.
LivyOperator
Step1: Update the Livy connection:
Log in to the Airflow UI --> click the Admin tab --> Connections --> search for livy. Click the edit button and update the Host and Port parameters.
Step2: Install the apache-airflow-providers-apache-livy
pip install apache-airflow-providers-apache-livy
Step3: Create the DAG file under the $AIRFLOW_HOME/dags directory.
vi $AIRFLOW_HOME/dags/livy_operator_sparkpi_dag.py
from datetime import timedelta, datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.apache.livy.operators.livy import LivyOperator

default_args = {
    'owner': 'RangaReddy',
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# Initiate DAG
livy_operator_sparkpi_dag = DAG(
    dag_id="livy_operator_sparkpi_dag",
    default_args=default_args,
    schedule_interval='@once',
    start_date=datetime(2022, 3, 2),
    tags=['example', 'spark', 'livy']
)

# define livy task with LivyOperator
livy_sparkpi_submit_task = LivyOperator(
    file="/root/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    driver_memory="1g",
    driver_cores=1,
    executor_memory="1g",
    executor_cores=2,
    num_executors=1,
    name="LivyOperator SparkPi",
    task_id="livy_sparkpi_submit_task",
    dag=livy_operator_sparkpi_dag,
)

begin_task = DummyOperator(task_id="begin_task", dag=livy_operator_sparkpi_dag)
end_task = DummyOperator(task_id="end_task", dag=livy_operator_sparkpi_dag)

begin_task >> livy_sparkpi_submit_task >> end_task
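If you named your connection something other than the provider's default, the LivyOperator also accepts a livy_conn_id argument, so you can point the task at it explicitly. A minimal sketch (livy_conn_default here is just the ID from the question; use whichever ID you configured in Step1):
# pass a specific Livy connection ID to the operator (illustrative values)
livy_sparkpi_submit_task = LivyOperator(
    task_id="livy_sparkpi_submit_task",
    livy_conn_id="livy_conn_default",  # connection configured under Admin -> Connections
    file="/root/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    dag=livy_operator_sparkpi_dag,
)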
Step4: Once the DAG has run and the batch finishes, verify the result by fetching the Livy batch log:
LIVY_HOST=192.168.0.1
curl http://${LIVY_HOST}:8998/batches/0/log | python3 -m json.tool
Output:
"Pi is roughly 3.14144103141441"

Related

How to query Hive from python connecting using Zookeeper?

I can connect to a Hive (or LLAP) database using pyhive and I can query the database by hard-coding the server host. Here is a code example:
from pyhive import hive

host_name = "vrt1553.xxx.net"
port = 10000

connection = hive.Connection(
    host=host_name,
    port=port,
    username=user,
    kerberos_service_name='hive',
    auth='KERBEROS',
)

cursor = connection.cursor()
cursor.execute('show databases')
print(cursor.fetchall())
How could I connect using Zookeeper to get a server name?
You must install the Kazoo package to query Zookeeper and find the host and port of your Hive servers:
import random
from kazoo.client import KazooClient

zk = KazooClient(hosts='vrt1554.xxx.net:2181,vrt1552.xxx.net:2181,vrt1558.xxx.net:2181', read_only=True)
zk.start()

servers = [hiveserver2.split(';')[0].split('=')[1].split(':')
           for hiveserver2
           in zk.get_children(path='hiveserver2')]

hive_host, hive_port = random.choice(servers)
zk.stop()

print(hive_host, hive_port)
Then just pass hive_host and hive_port to your Connection constructor:
connection = hive.Connection(
    host=hive_host,
    port=hive_port,
    username=user,
    kerberos_service_name="hive",
    auth="KERBEROS",
)
And query as with a standard Python SQL cursor. Here is an example using pandas:
df = pd.read_sql(sql_query, connection)

PyHive unable to fetch logs from HiveServer2 when running in async mode

I am running into a strange issue with PyHive when running a Hive query in async mode. Internally, PyHive uses a Thrift client to execute the query and to fetch logs (along with execution status). I am unable to fetch the logs of the Hive query (map/reduce tasks, etc.); cursor.fetch_logs() returns an empty data structure.
Here is the code snippet
from pyhive import hive  # or import trino
from TCLIService.ttypes import TOperationState

def run():
    cursor = hive.connect(host="10.x.y.z", port='10003', username='xyz', password='xyz', auth='LDAP').cursor()
    cursor.execute("select count(*) from schema1.table1 where date = '2021-03-13' ", async_=True)
    status = cursor.poll(True).operationState
    print(status)
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        logs = cursor.fetch_logs()
        for message in logs:
            print("running ")
            print(message)
        # If needed, an asynchronous query can be cancelled at any time with:
        # cursor.cancel()
        print("running ")
        status = cursor.poll().operationState
    print(cursor.fetchall())
The cursor is able to get operationState correctly but its unable to fetch the logs. Is there anything on HiveServer2 side that needs to be configured?
Thanks in advance
Closing the loop here in case someone else has the same or a similar issue with Hive.
In my case the problem was the HiveServer2 configuration. HiveServer2 won't stream the logs if operation logging is not enabled. Following is the list I configured:
hive.server2.logging.operation.enabled - true
hive.server2.logging.operation.level - EXECUTION (basic logging; there are other values that increase the logging level)
hive.async.log.enabled - false
hive.server2.logging.operation.log.location
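For reference, these are server-side settings, so they belong in HiveServer2's hive-site.xml (a sketch with illustrative values; the log location path below is an assumption), followed by a HiveServer2 restart:
<property>
  <name>hive.server2.logging.operation.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.logging.operation.level</name>
  <value>EXECUTION</value>
</property>
<property>
  <name>hive.async.log.enabled</name>
  <value>false</value>
</property>
<property>
  <!-- assumed example path; point this at a directory writable by HiveServer2 -->
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/operation_logs</value>
</property>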

Copy records from one table to another using spark-sql-jdbc

I am trying to do a POC in pyspark with a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but at first I am trying to do it for only one table and later extend it to multiple tables.
The below code works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just never completes.
Spark UI :
Could you please suggest how I should handle it?
Host : Local Machine
Spark version : 3.0.0
database : Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser

# read configuration file
config = ConfigParser()
config.read('config.ini')

# setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)

# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Can you check the executor and driver details in the Spark UI and try increasing the resources? Also share the error it fails with; you can get it from the same UI, or use the CLI to fetch the logs: "yarn logs -applicationId <application_id>".
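One concrete thing to try, building on that answer: give the JDBC source a partition column and bounds so Spark splits the read into parallel tasks, and batch the writes. A sketch under the assumption that main_table has a numeric key column (called ID here) whose rough min/max you know:
# hypothetical parallel JDBC copy; ID, lowerBound and upperBound are assumptions
prprDF = spark.read.format("jdbc")\
    .option("url", url)\
    .option("user", dbUsr)\
    .option("password", dbPwd)\
    .option("driver", dbDrvr)\
    .option("dbtable", "(SELECT * FROM main_table) main_tbl")\
    .option("partitionColumn", "ID")\
    .option("lowerBound", 1)\
    .option("upperBound", 1000000)\
    .option("numPartitions", 8)\
    .option("fetchsize", 10000)\
    .load()
prprDF.write.format("jdbc")\
    .option("url", url)\
    .option("user", dbUsr)\
    .option("password", dbPwd)\
    .option("driver", dbDrvr)\
    .option("dbtable", "backup_tbl")\
    .option("batchsize", 10000)\
    .mode("overwrite")\
    .save()
Without partitionColumn/lowerBound/upperBound, the JDBC read goes through a single connection regardless of numPartitions, which matches the single long-running task seen in the Spark UI.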

Getting duplicate AMI name client error, even though it is creating for the first time

I am using Jenkins to trigger a Lambda, which creates an AMI image from an EC2 instance, then creates a launch configuration, updates the auto scaling group with the new launch configuration, and lets the auto scaling group create instances from the latest launch configuration while terminating the older instances.
My code sometimes runs properly, but sometimes it gives me "ClientError: An error occurred (InvalidAMIName.Duplicate) when calling the CreateImage operation: AMI name "API_AMI_200220_1629" is already in use by AMI ami-033a3681473f9acbd".
My AMI names are created dynamically, like "API_AMI_$(date +%d%m%y_%H%M)", so technically no duplicate AMI name should ever be created. Yet I get this error and the AMI stays in a pending state. Does anyone have a suggestion or a solution for why this happens only sometimes and not every time? Please check the code script below.
import json
import boto3
import time

def lambda_handler(event, context):
    flag_image = 1
    instance_id = event['instanceId']
    ami_name = event['amiName']
    launch_config = event['launchConfig']
    autoscaling_name = event['autoscalingName']
    ec2_client = boto3.client('ec2', region_name='us-east-1')
    autoscaling_client = boto3.client('autoscaling', region_name='us-east-1')
    print("############## Creating Image ########################")
    response = ec2_client.create_image(InstanceId=instance_id, Name=ami_name)
    print(response)
    imageId = response['ImageId']
    print(imageId)
    describe_image = ec2_client.describe_images(ImageIds=[imageId])
    while flag_image == 1:
        for i in describe_image['Images']:
            time.sleep(5)
            print(i['State'])
            if i['State'] == 'available':
                flag_image = 0
        describe_image = ec2_client.describe_images(ImageIds=[imageId])
Thanks in advance.
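A side note on the polling loop in that code: boto3 exposes an image_available waiter on the EC2 client, which could replace the manual while loop. A minimal sketch using the same imageId as above:
# hypothetical replacement for the manual polling loop
waiter = ec2_client.get_waiter('image_available')
waiter.wait(
    ImageIds=[imageId],
    WaiterConfig={'Delay': 15, 'MaxAttempts': 40},  # poll every 15 s, up to 10 minutes
)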

How to connect to HIVE using python?

I'm using a CDH cluster which is Kerberos-enabled and I'd like to use pyhive to connect to Hive and read Hive tables. Here is the code I have:
from pyhive import hive
from TCLIService.ttypes import TOperationState
cursor = hive.connect(host='xyz', port=10000, username='my_username', auth='KERBEROS', database='poc', kerberos_service_name='hive').cursor()
I'm getting the value of xyz from hive-site.xml under hive.metastore.uris; however, it says xyz:9083, but if I replace 10000 with 9083, it complains.
My problem is that when I connect (using port 10000), I get a permission error when executing a query, while I can read that table using the Hive CLI or beeline. My questions are: 1) is xyz the value I should use? 2) which port should I use? 3) if both are correct, why am I still getting a permission issue?