Kudu with PySpark2: Error with KuduStorageHandler - hive

I am trying to read data stored in Kudu using PySpark 2.1.0:
>>> from os.path import expanduser, join, abspath
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder \
.master("local") \
.appName("HivePyspark") \
.config("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
>>> spark.sql("select count(*) from mySchema.myTable").show()
I have Kudu 1.2.0 installed on the cluster. These are Hive/Impala tables.
When I execute the last line, I get the following error:
...
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:315)
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:284)
... 61 more
Caused by: java.lang.ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler
I am referring to the following resources:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
https://issues.apache.org/jira/browse/KUDU-1603
https://github.com/bkvarda/iot_demo/blob/master/total_data_count.py
https://kudu.apache.org/docs/developing.html#_kudu_python_client
I would like to know how I can include the Kudu-related dependencies in my PySpark program so that I can move past this error.

The way I solved this issue was to pass the kudu-spark JAR to the pyspark2 shell or to the spark2-submit command.
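For example (the paths and version placeholders are illustrative; use the kudu-spark2 artifact that matches your Spark and Scala versions):
pyspark2 --jars /path/to/kudu-spark2_2.11-<kudu-version>.jar
# or
spark2-submit --jars /path/to/kudu-spark2_2.11-<kudu-version>.jar my_script.py
# or resolve it from Maven instead of pointing at a local JAR
pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:<kudu-version>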

Apache Spark 2.3
Below is the code for your reference.
Read a Kudu table from PySpark:
kuduDF = spark.read.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .load()
kuduDF.show(5)
Write to a Kudu table:
DF.write.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .mode("append") \
    .save()
Reference link: https://medium.com/#sciencecommitter/how-to-read-from-and-write-to-kudu-tables-in-pyspark-via-impala-c4334b98cf05
If you want to use Scala instead, below is the reference link:
https://kudu.apache.org/docs/developing.html

Related

Is there a method to connect to PostgreSQL (DBeaver) from PySpark?

Hello, I just installed PySpark and I have a local PostgreSQL database in DBeaver.
How can I connect to Postgres from PySpark, please?
I tried this:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/coucou'
properties = {'user': 'postgres', 'password': 'admin'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tw_db', properties=properties
)
but I get an error:
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: C:/Users/Desktop/postgresql-42.2.23.jre7.jar
You need to add the JARs you want to use when creating the SparkSession.
See this:
https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management
Either when you start pyspark
pyspark --repositories MAVEN_REPO
# OR
pyspark --jars PATH_TO_JAR
or when you create your SparkSession object
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars.packages", "MAVEN_PACKAGE")
# OR
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars", "PATH_TO_JAR")
You need Maven packages when you do not have the JAR locally or when your JAR needs additional dependencies.
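For example, a minimal sketch for the PostgreSQL case above, assuming the driver is resolved from Maven (the Maven coordinates are illustrative; the connection details come from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("postgres_read") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.23") \
    .getOrCreate()

# Read the table over JDBC; url, table, user and password come from the question.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/coucou",
    table="tw_db",
    properties={"user": "postgres", "password": "admin", "driver": "org.postgresql.Driver"}
)
df.show()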

How to create a Spark data frame using snow flake connection in python?

I am new to Spark and Python. I have a SQL statement stored in a variable in Python, and we use a Snowflake database. How can I create a Spark DataFrame from that SQL using the Snowflake connection?
import sf_connectivity  # we have code for establishing a connection with the Snowflake database
emp = 'Select * From Employee'
snowflake_connection = sf_connectivity.collector()  # a method to establish the Snowflake connection
Requirement 1: create a Spark DataFrame (sf_df) using 'emp' and 'snowflake_connection'.
Requirement 2: sf_df.createOrReplaceTempView(Temp_Employee)
What are the packages or libraries it requires? How can I make this work?
The documentation that helped me figure this out is here: https://docs.databricks.com/data/data-sources/snowflake.html
It took me a while to figure out how to get it working, though! After a lot of questions, I had my company's IT department configure a Snowflake user account with private/public key authentication, and they configured that ID to be accessible within our corporate Databricks account.
After they set this up, the following code is an example how to pass a sql command as variable to Spark, and have Spark convert it into a dataframe.
optionsSource = dict(sfUrl="mycompany.east-us-2.azure.snowflakecomputing.com",  # Snowflake Account Name
                     sfUser="my_service_acct",
                     pem_private_key=dbutils.secrets.get("my_scope", "my_secret"),
                     sfDatabase="mydatabase",    # Snowflake Database
                     sfSchema="myschema",        # Snowflake Schema
                     sfWarehouse="mywarehouse",
                     sfRole="myrole"
                     )
sqlcmd = '''
select current_date;
'''
df = spark.read.format("snowflake").options(**optionsSource).option("query", sqlcmd).load()
display(df)
With public/private key authentication, you need to generate a key pair:
https://community.snowflake.com/s/article/How-to-connect-snowflake-with-Spark-connector-using-Public-Private-Key
and then you can use the code below.
from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization
import re
import os
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config("spark.jars", "<path/to/>/snowflake-jdbc-<version>.jar,<path/to/>/spark-snowflake_2.11-2.4.13-spark_2.4.jar") \
.config("spark.repl.local.jars",
"<path/to/>/snowflake-jdbc-<version>.jar,<path/to/>/spark-snowflake_2.11-2.4.13-spark_2.4.jar") \
.config("spark.sql.catalogImplementation", "in-memory") \
.getOrCreate()
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.disablePushdownSession(
spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
with open("<path/to/>/rsa_key.p8", "rb") as key_file:
    p_key = serialization.load_pem_private_key(
        key_file.read(),
        password=os.environ['PRIVATE_KEY_PASSPHRASE'].encode(),
        backend=default_backend()
    )
pkb = p_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption()
)
pkb = pkb.decode("UTF-8")
pkb = re.sub("-*(BEGIN|END) PRIVATE KEY-*\n", "", pkb).replace("\n", "")
sfOptions = {
"sfURL": "<URL>",
"sfAccount": "<ACCOUNTNAME>",
"sfUser": "<USER_NAME",
"pem_private_key": pkb,
# "sfPassword": "xxxxxxxxx",
"sfDatabase": "<DBNAME>",
"sfSchema": "<SCHEMA_NAME>",
"sfWarehouse": "<WH_NAME>",
"sfRole": "<ROLENAME>",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "<TABLENAME>") \
.load()
df.show()
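To cover requirement 2 from the question, the resulting DataFrame can then be registered as a temporary view and queried with Spark SQL (view name taken from the question):
# Register the DataFrame as a temporary view so it can be queried with Spark SQL.
df.createOrReplaceTempView("Temp_Employee")
spark.sql("select * from Temp_Employee").show()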

Reading Spark DataFrame from Redshift returns empty DataFrame

I'm using:
python 3.6.8
spark 2.4.4
I run pyspark with an EMR cluster (emr-5.28.0) with: pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
I have the following jars in the spark classpath:
http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/2.0.1/spark-redshift_2.11-2.0.1.jar
http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
https://github.com/ralfstx/minimal-json/releases/download/0.9.5/minimal-json-0.9.5.jar
https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.15.1025/RedshiftJDBC41-no-awssdk-1.2.15.1025.jar
I execute this piece of code:
url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"
df = sqlContext.read.format('com.databricks.spark.redshift')\
.option("url", url)\
.option("tempdir", "s3a://bucket/tmp_folder")\
.option("query", query)\
.option("aws_iam_role", "arn_iam_role")\
.load()
Calling df.count() returns 73 (the number of rows of that query), but df.show(4) returns an empty DataFrame: no error, it just prints the schema.
I made it work by changing the format to 'jdbc' and by only using the Databricks connector to write data, not to read it.
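A minimal sketch of that JDBC read path, assuming the Redshift JDBC driver JAR listed above is on the classpath (the driver class name matches the RedshiftJDBC41 JAR; url and query come from the question):
# Plain JDBC read; no S3 tempdir or IAM role is needed for this path.
df = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("driver", "com.amazon.redshift.jdbc41.Driver") \
    .option("dbtable", "({}) AS t".format(query)) \
    .load()
df.show(4)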

connecting pyspark to ignite

All,
I have been struggling with the PySpark and Ignite integration for about two weeks now and I am at my wits' end.
I have been trying to write a table created in PySpark to Ignite.
I have been starting the script like this:
spark-submit --master spark://my_host:my_port --jars $IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/*.jar,$IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar my_python_script.py
and my_python_script.py looked like this:
import pyspark
spark = pyspark.sql.SparkSession\
.builder\
.appName("Ignite")\
.getOrCreate()
# create the data frame
columns = ["COL1", "COL2", "ID"]
vals = [("a", "b", 0), ("c", "d", 1)]
df = spark.createDataFrame(vals, columns)
df.write\
.format("jdbc")\
.option("driver", "org.apache.ignite.IgniteJdbcThinDriver")\
.option("url", "jdbs:ignite:thin://my_url:my_port")\
.option("user", "my_user")\
.option("password", "my_password")\
.option("dbtable", "my_table")\
.option("mode", "overwrite")\
.save()
And I keep getting errors. For the above, the error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save. : java.sql.SQLException: no PRIMARY KEY defined for CREATE TABLE at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:750)
Can anyone please help?
My spark version is 2.4.0, python 2.7, ignite 2.7
Is there a reason you're not using the Spark-Ignite integration? JDBC should work, but there's a better way, especially since you're already including all the right JARs.
df.write.format("ignite") \
    .option("table", "my_table") \
    .option("primaryKeyFields", "COL1") \
    .option("config", configFile) \
    .mode("overwrite") \
    .save()
Also note the inclusion of the "primaryKeyFields" option. As your error message notes, the version using JDBC fails because you've not defined a primary key.
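The config option expects the path to an Ignite Spring XML configuration file; a hypothetical definition for the snippet above (the path is an assumption, not part of the original answer):
# Assumed location; point this at the Ignite Spring XML config your nodes use.
configFile = "/opt/ignite/config/default-config.xml"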

Hive on spark doesn't work in hue

I am trying to trigger Hive on Spark using the Hue interface. The job works perfectly when run from the command line, but when I try to run it from Hue it throws exceptions. In Hue, I tried mainly two things:
1) Giving all the properties in the .hql file using set commands:
set spark.home=/usr/lib/spark;
set hive.execution.engine=spark;
set spark.eventLog.enabled=true;
add jar /usr/lib/spark/assembly/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar;
set spark.eventLog.dir=hdfs://10.11.50.81:8020/tmp/;
set spark.executor.memory=2899102923;
I get the following error:
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Unsupported execution engine: Spark. Please set hive.execution.engine=mr)'
org.apache.hadoop.hive.ql.metadata.HiveException: Unsupported execution engine: Spark. Please set hive.execution.engine=mr
2) When I give the properties in the Hue properties section, it works with the MR engine but not with the Spark execution engine.
Any help would be appreciated.
I solved this issue by using a shell action in Oozie.
This shell action invokes a PySpark script that contains my SQL.
Even though the job shows up as MR in the JobTracker, the Spark history server recognizes it as a Spark job and the expected output is produced.
shell file:
#!/bin/bash
export PYTHONPATH=`pwd`
spark-submit --master local testabc.py
python file:
from pyspark.sql import HiveContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = HiveContext(sc)
result = sqlContext.sql("insert into table testing_oozie.table2 select * from testing_oozie.table1")
result.show()
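Note that the DataFrame returned by an INSERT statement is empty, so result.show() prints nothing useful; a small verification sketch that queries the target table instead:
# Check that the rows actually landed in the target table.
sqlContext.sql("select count(*) from testing_oozie.table2").show()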