PyHive unable to fetch logs from HiveServer2 when running in async mode - hive

I am running into a strange issue with PyHive when running a Hive query in async mode. Internally, PyHive uses a Thrift client to execute the query and to fetch logs (along with the execution status). I am unable to fetch the logs of the Hive query (map/reduce tasks, etc.): cursor.fetch_logs() returns an empty data structure.
Here is the code snippet:
from pyhive import hive  # or import hive or import trino
from TCLIService.ttypes import TOperationState

def run():
    cursor = hive.connect(host="10.x.y.z", port='10003', username='xyz', password='xyz', auth='LDAP').cursor()
    cursor.execute("select count(*) from schema1.table1 where date = '2021-03-13' ", async_=True)
    status = cursor.poll(True).operationState
    print(status)
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        # Print any log lines the server has produced since the last poll
        logs = cursor.fetch_logs()
        for message in logs:
            print("running ")
            print(message)
        # If needed, an asynchronous query can be cancelled at any time with:
        # cursor.cancel()
        print("running ")
        status = cursor.poll().operationState
    print(cursor.fetchall())
The cursor is able to get operationState correctly, but it is unable to fetch the logs. Is there anything on the HiveServer2 side that needs to be configured?
Thanks in advance

Closing the loop here in case someone else has the same or a similar issue with Hive.
In my case the problem was the HiveServer2 configuration. HiveServer2 won't stream the logs if operation logging is not enabled. These are the properties I configured:
hive.server2.logging.operation.enabled - true
hive.server2.logging.operation.level - EXECUTION (basic logging; other values increase the logging level)
hive.async.log.enabled - false
hive.server2.logging.operation.log.location
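The properties above normally live in hive-site.xml as property/name/value entries. As a minimal sketch, the following Python renders the three properties that have values in the answer into that layout. The helper name to_hive_site_xml is hypothetical, and the log location property is left out because the answer does not give a value for it.

```python
import xml.etree.ElementTree as ET

# Operation-logging properties taken from the answer above. The log location
# property is omitted because the answer does not state a value for it.
OPERATION_LOG_PROPERTIES = {
    "hive.server2.logging.operation.enabled": "true",
    "hive.server2.logging.operation.level": "EXECUTION",
    "hive.async.log.enabled": "false",
}


def to_hive_site_xml(properties):
    """Render a dict of properties as a hive-site.xml <configuration> element.

    Hypothetical helper for illustration; it only builds the XML text and
    does not touch a running HiveServer2.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_hive_site_xml(OPERATION_LOG_PROPERTIES))
```

Server-side settings like these generally take effect only after the generated entries are merged into the server's hive-site.xml and HiveServer2 is restarted.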

Related

NullPointerException on loading data into Grakn

I have created a backup of Grakn with the exporter tool like this:
./grakn server export 'old_test' backup.grakn
$x isa export,
has status "completed",
has progress (100.0%),
has count (105 / 105);
I then wanted to import this into a new keyspace with
./grakn server import 'new_test' backup.grakn
But I got this error below:
An error has occurred during boot-up. Please run 'grakn server status' or check the logs located under the 'logs' directory.
io.grpc.StatusRuntimeException: INTERNAL: java.lang.NullPointerException
You need to import your schema into the new keyspace first. This error occurs because the server cannot find a schema label in your dataset. The steps for migrating a schema are described in the docs: https://dev.grakn.ai/docs/management/migration-and-backup

Copy records from one table to another using spark-sql-jdbc

I am trying to do a POC in PySpark for a very simple requirement. As a first step, I am just trying to copy the records from one table to another. There are more than 20 tables, but at first I am trying to do it for only one table and will later extend it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just does not complete.
Spark UI :
Could you please suggest how I should handle it?
Host : Local Machine
Spark version : 3.0.0
database : Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser

# read configuration file
config = ConfigParser()
config.read('config.ini')

# setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)

# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url",url)\
        .option("user",dbUsr)\
        .option("dbtable",pushdown_query)\
        .option("password",dbPwd)\
        .option("driver",dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url",url)\
        .option("user",dbUsr)\
        .option("dbtable","backup_tbl")\
        .option("password",dbPwd)\
        .option("driver",dbDrvr)\
        .mode("overwrite").save()

if __name__ =="__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Can you check the executor and driver details in the Spark UI and try increasing the resources? Also, please share the error it fails with. You can get this from the same UI, or use the CLI to fetch the logs: "yarn logs -applicationId "
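To illustrate the single-thread point: Spark's JDBC source reads through one connection unless partitionColumn, lowerBound, upperBound and numPartitions are all supplied, so numPartitions on its own does not parallelize the read. Below is a minimal sketch of a helper that builds such an options dict. The helper name, the partition column and the bounds are assumptions for illustration; the column has to be a numeric, date or timestamp column that actually exists in the table.

```python
def partitioned_jdbc_read_options(url, table, user, password, driver,
                                  partition_column, lower_bound, upper_bound,
                                  num_partitions=8, fetch_size=10000):
    """Build Spark JDBC read options that split the read into parallel ranges.

    Hypothetical helper: the option keys are Spark's documented JDBC options,
    but the default values here are illustrative, not tuned recommendations.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
        # These four options must be set together for a partitioned read.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip; some drivers default to a small value.
        "fetchsize": str(fetch_size),
    }


# Usage sketch (assumes an existing SparkSession named `spark` and a numeric
# column called "id", both of which are assumptions):
#   opts = partitioned_jdbc_read_options(url, "main_table", dbUsr, dbPwd,
#                                        dbDrvr, "id", 1, 1000000)
#   df = spark.read.format("jdbc").options(**opts).load()
```

On the write side, the JDBC "batchsize" option controls how many rows are sent per insert batch and may also be worth checking.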

Dataflow job fails and tries to create temp_dataset on Bigquery

I'm running a simple dataflow job to read data from a table and write back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset, though; it seems to be trying to create a temp_dataset because the job fails. But I don't get any information on the real error behind the scenes.
Reading isn't the issue; it's really the write step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using the BigQuerySource the SDK creates a temporary dataset and stores the output of the query into a temporary table. It then issues an export from that temporary table to read the results from.
So it is expected behavior for it to create this temp_dataset. This means that it is probably not hiding an error.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use. That way the process doesn't create a temp dataset.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery").withQueryTempDataset("existingDataset")
    .usingStandardSql().withMethod(TypedRead.Method.DEFAULT);

Getting duplicate AMI name client error, even though it is creating for the first time

I am using Jenkins to trigger a Lambda that creates an AMI from an EC2 instance, creates a launch configuration, and updates the Auto Scaling group with the new launch configuration. The Auto Scaling group then creates instances from the latest launch configuration and terminates the older instances.
My code sometimes runs properly, but sometimes it gives me "ClientError: An error occurred (InvalidAMIName.Duplicate) when calling the CreateImage operation: AMI name "API_AMI_200220_1629" is already in use by AMI ami-033a3681473f9acbd".
My AMI names are generated dynamically, like "API_AMI_$(date +%d%m%y_%H%M)", so technically no duplicate AMI names should be created. Still, I get this error and the AMI stays in a Pending state. Does anyone have a suggestion or a solution for why this happens only sometimes and not every time? Please check the script below.
import json
import boto3
import time

def lambda_handler(event, context):
    flag_image = 1
    instance_id = event['instanceId']
    ami_name = event['amiName']
    launch_config = event['launchConfig']
    autoscaling_name = event['autoscalingName']
    ec2_client = boto3.client('ec2', region_name='us-east-1')
    autoscaling_client = boto3.client('autoscaling', region_name='us-east-1')
    print("##############Creating Image########################")
    response = ec2_client.create_image(InstanceId=instance_id, Name=ami_name)
    print(response)
    imageId = response['ImageId']
    print(imageId)
    # Poll every 5 seconds until the new image reports the 'available' state
    describe_image = ec2_client.describe_images(ImageIds=[imageId])
    while flag_image == 1:
        for i in describe_image['Images']:
            time.sleep(5)
            print(i['State'])
            if i['State'] == 'available':
                flag_image = 0
        describe_image = ec2_client.describe_images(ImageIds=[imageId])
Thanks in advance.
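No answer is included for this question. One possible explanation, which is an assumption and not confirmed by the post, is that the Lambda gets invoked more than once within the same minute (for example through a retry or a re-triggered Jenkins job). Because the name only has minute granularity, the second call would reuse it. A minimal sketch of a name generator that avoids such collisions follows; the function name unique_ami_name and the suffix scheme are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def unique_ami_name(prefix, now=None):
    """Return an AMI name with a seconds-resolution stamp and a random suffix.

    Hypothetical helper: the random suffix keeps two calls made in the same
    second from producing the same name.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%d%m%y_%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}_{stamp}_{suffix}"


if __name__ == "__main__":
    print(unique_ami_name("API_AMI"))
```

The generated value could be passed as amiName in the event, or computed inside the handler just before create_image is called.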

mysql python multiprocessing pool issues

Error I keep getting:
Lost connection to MySQL server during query
My code:
from multiprocessing import Pool

def runDBQuery(bl_sel):
    dbResponse = []
    bl_cur.execute(bl_sel)
    myresult2 = bl_cur.fetchall()
    dbResponse.append(myresult2)
    return dbResponse

if __name__ == '__main__':
    p1abl_sel = bl_sel_template.replace("{firstupc}",p1afirstupc).replace("{lastupc}",p1alastupc)
    p2abl_sel = bl_sel_template.replace("{firstupc}",p2afirstupc).replace("{lastupc}",p2alastupc)
    list_of_columns = [ p1abl_sel, p2abl_sel ]
    #list_of_columns = [ p1abl_sel ]
    p = Pool(processes=2)
    data = p.map(runDBQuery, [i for i in list_of_columns])
    # the 4 lines below are my failed attempts to try to resolve this.
    bl_cur.close()
    if cur and con:
        cur.close()
        con.close()
    p.close()
    print(data)
Whenever I uncomment the list_of_columns so there's only one element(query) in the list, it works and I get back a response from the DB. However, if I have more than one element in the list, I encounter the listed error.
Can anyone help me solve this problem?
The problem may not be in your code.
The MySQL error "Lost connection to MySQL server during query" can occur because of a read timeout. The timeout can come from either the client side or the MySQL server configuration.
MySQL
max_execution_time: The execution timeout for SELECT statements, in milliseconds. If the value is 0, timeouts are not enabled.
connect_timeout: Number of seconds the mysqld server waits for a connect packet before responding with 'Bad handshake'.
interactive_timeout: Number of seconds the server waits for activity on an interactive connection before closing it.
wait_timeout: Number of seconds the server waits for activity on a connection before closing it.
https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_max_execution_time
For PyMySQL, check read_timeout:
https://pymysql.readthedocs.io/en/latest/modules/connections.html
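As a rough sketch of how these timeouts could be applied from the client, the helpers below build the keyword arguments for pymysql.connect and the SET SESSION statements for the two session-level server variables. The helper names and the default values are assumptions for illustration, not recommendations from the answer.

```python
def pymysql_timeout_kwargs(connect_timeout=10, read_timeout=600,
                           write_timeout=600):
    """Client-side timeouts (in seconds) accepted by pymysql.connect.

    Hypothetical helper; the keys are PyMySQL connection parameters, the
    default values here are only examples.
    """
    return {
        "connect_timeout": connect_timeout,
        "read_timeout": read_timeout,
        "write_timeout": write_timeout,
    }


def session_timeout_statements(max_execution_time_ms=0, wait_timeout_s=28800):
    """SQL statements that adjust server-side timeouts for one session.

    max_execution_time is in milliseconds (0 disables the limit);
    wait_timeout is in seconds.
    """
    return [
        f"SET SESSION max_execution_time = {int(max_execution_time_ms)}",
        f"SET SESSION wait_timeout = {int(wait_timeout_s)}",
    ]


# Usage sketch (host and credentials are placeholders):
#   conn = pymysql.connect(host="...", user="...", password="...",
#                          **pymysql_timeout_kwargs())
#   with conn.cursor() as cur:
#       for stmt in session_timeout_statements():
#           cur.execute(stmt)
```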