I use pyhive with dolphinscheduler in my program, it works good in develop environment, but it sometimes failed, sometimes succeed in production environment. I do not know why?
Example code
from pyhive import hive
conn = hive.Connection(host="cdh1", port=10000, username="root")
cursor = conn.cursor()
cursor.execute("""
set hive.exec.dynamic.partition.mode=nonstrict
""")
cursor.execute("""
INSERT INTO TABLE table_name
SELECT ...
""")
cursor.close()
conn.close()
Software version
CDH6.3
hive version 2.1.1
hadoop version 3.0.0
/tmp/hive/XXX log
ERROR [main] hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Maybe, you have to make sure your CDH cluster is healthy
Related
I have a file called im.db in my current working directory, as shown here:
I also am able to query this database directly from sqlite3 at the command line:
% sqlite3
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite\> .open im.db
sqlite\> select count(\*) from writers;
255873
However, when running the same query inside of my notebook:
import sqlite3 as sql
con = sql.connect("im.db")
writers_dataframe = pd.read_sql_query("SELECT COUNT(\*) from writers", con)
I get an error message which states no such table: writers
Any help would be much appreciated. Thanks!
I am running into a strange issue with PyHive running a Hive query in async mode. Internally, PyHive uses Thrift client to execute the query and to fetch logs (along with execution status). I am unable to fetch the logs of Hive query (map/reduce tasks, etc). cursor.fetch_logs() returns an empty data structure
Here is the code snippet
rom pyhive import hive # or import hive or import trino
from TCLIService.ttypes import TOperationState
def run():
cursor = hive.connect(host="10.x.y.z", port='10003', username='xyz', password='xyz', auth='LDAP').cursor()
cursor.execute("select count(*) from schema1.table1 where date = '2021-03-13' ", async_=True)
status = cursor.poll(True).operationState
print(status)
while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
logs = cursor.fetch_logs()
for message in logs:
print("running ")
print(message)
# If needed, an asynchronous query can be cancelled at any time with:
# cursor.cancel()
print("running ")
status = cursor.poll().operationState
print
cursor.fetchall()
The cursor is able to get operationState correctly but its unable to fetch the logs. Is there anything on HiveServer2 side that needs to be configured?
Thanks in advance
Closing the loop here in case someone else has same or similar issue with hive.
In my case the problem was the hiveserver configuration. Hive Server won't stream the logs if logging operation is not enabled. Following is the list I configured
hive.server2.logging.operation.enabled - true
hive.server2.logging.operation.level EXECUTION (basic logging - There are other values that increases the logging level)
hive.async.log.enabled false
hive.server2.logging.operation.log.location
I am trying to do POC in pyspark on a very simple requirement. As a first step, I am just trying to copy the table records from one table to another table. There are more than 20 tables but at first, I am trying to do it only for the one table and later enhance it to multiple tables.
The below code is working fine when I am trying to copy only 10 records. But, when I am trying to copy all records from the main table, this code is getting stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to happen in few seconds, but it just not getting completed.
Spark UI :
Could you please suggest how should I handle it ?
Host : Local Machine
Spark verison : 3.0.0
database : Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser
#read configuration file
config = ConfigParser()
config.read('config.ini')
#setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)
# database connection
def dbConnection(spark):
pushdown_query = "(SELECT * FROM main_table) main_tbl"
prprDF = spark.read.format("jdbc")\
.option("url",url)\
.option("user",dbUsr)\
.option("dbtable",pushdown_query)\
.option("password",dbPwd)\
.option("driver",dbDrvr)\
.option("numPartitions", 2)\
.load()
prprDF.write.format("jdbc")\
.option("url",url)\
.option("user",dbUsr)\
.option("dbtable","backup_tbl")\
.option("password",dbPwd)\
.option("driver",dbDrvr)\
.mode("overwrite").save()
if __name__ =="__main__":
spark = SparkSession\
.builder\
.appName("DB refresh")\
.getOrCreate()
dbConnection(spark)
spark.stop()
It looks like you are using only one thread(executor) to process the data by using JDBC connection. Can you check the executors and driver details in Spark UI and try increasing the resources. Also share the error by which it's failing. You can get this from the same UI or use CLI to logs "yarn logs -applicationId "
I configured sparkr normally from the tutorials, and everything was working. I was able to read the database with read.df, but suddenly nothing else works, and the following error appears:
Error in sparkR.init(master = "local") : JVM is not ready after 10 seconds
Why does it appear now suddenly? I've read other users with the same problem, but the solutions given did not work. Below is my code:
Sys.setenv(SPARK_HOME= "C:/Spark")
Sys.setenv(HADOOP_HOME = "C:/Hadoop")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
#initialeze SparkR environment
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
Sys.setenv(SPARK_MEM="4g")
#Create a spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
Try to do few things below:
Check if c:/Windows/System32/ is there in the PATH.
Check if spark-submit.cmd has proper execute permissions.
If both the above things are true and even if it is giving the same error, then delete spark directory and again create a fresh one by unzipping spark gzip file.
I'm a beginner of R, and I have solved the same problem "JVM is not ready after 10 seconds" by installing JDK(version 7+) before installing sparkr in my mac. And it works well now. Hope this can help you with your problem.
Just encountered a similar issue as described in the below article:
Question: Article with similar error description
java.rmi.UnmarshalException: cannot unmarshaling return; nested exception is:
java.rmi.UnexpectedException: Failed to parse descriptor file; nested exception is:
java.rmi.server.ExportException: Failed to export class
I found that the issue described is totally unrelated to any Java update and is rather an issue with the Weblogic bean-cache. It seems to use old compiled versions of classes when updating a deployment. I was hunting a similar issue in a related question (Question: Interface-Implementation-mismatch).
How can I fix this properly to allow proper automatic deployment (with WLST)?
After some feedback from the Oracle community it now works like this:
1) Shutdown the remote Managed Server
2) Delete directory "domains/#MyDomain#/servers/#MyManagedServer#/cache/EJBCompilerCache"
3) Redeploy EAR/application
In WLST (which one would need to automate this) this is quite tricky:
import shutil
servers=cmo.getServers()
domainPath = get('RootDirectory')
for thisServer in servers:
pathToManagedServer = domainPath + "\\servers\\" + thisServer.getName()
print ">Found managed server:" + pathToManagedServer
pathToCacheDir = pathToManagedServer + "\\" + "cache\\EJBCompilerCache"
if(os.path.exists(pathToCacheDir) and os.path.isdir(pathToCacheDir) ):
print ">Found a cache directory that will be deleted:" + pathToCacheDir
# shutil.rmtree(pathToCacheDir)
Note: Be careful when testing this, the path that is returned by "pathToCacheDir" depends on the MBean-context that is currently set. See samples for WLST command "cd()". You should first test the path output with "print domainPath" and later add the "rmtree" python command! (I uncommented the delete command in my sample, so that nobody accidentially deletes an entire domain!)