Is there a method to connect to PostgreSQL (DBeaver) from PySpark? - sql

Hello, I just installed PySpark and I have a local PostgreSQL database that I manage with DBeaver.
How can I connect to Postgres from PySpark?
I tried this:
from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/coucou'
properties = {'user': 'postgres', 'password': 'admin'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tw_db', properties=properties
)
but I get an error:
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: C:/Users/Desktop/postgresql-42.2.23.jre7.jar

You need to add the JARs you want to use when creating the SparkSession.
See:
https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management
Either when you start pyspark:
pyspark --repositories MAVEN_REPO
# OR
pyspark --jars PATH_TO_JAR
or when you create your SparkSession object:
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars.packages", "MAVEN_PACKAGE")
# OR
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars", "PATH_TO_JAR")
You need Maven packages when you do not have the JAR locally or when your JAR has additional dependencies.
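For example, here is a minimal sketch of the full PostgreSQL read, assuming the driver JAR from your error message sits at C:/Users/Desktop/postgresql-42.2.23.jre7.jar and reusing the database, table and credentials from the question:
from pyspark.sql import SparkSession

# Point spark.jars at the PostgreSQL driver you downloaded
spark = SparkSession.builder \
    .appName("postgres-example") \
    .config("spark.jars", "C:/Users/Desktop/postgresql-42.2.23.jre7.jar") \
    .getOrCreate()

# Read the tw_db table from the local "coucou" database
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/coucou") \
    .option("dbtable", "tw_db") \
    .option("user", "postgres") \
    .option("password", "admin") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.show(5)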

Related

reading a partitioned dataset in AWS S3 with pyarrow doesn't add partition columns

I'm trying to read a partitioned dataset in AWS S3; it looks like:
MyDirectory/
    code=1/file.parquet
    code=2/another.parquet
    code=3/another.parquet
I created a file_list containing the paths to all the files in the directory, then executed
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works except that the partition column code doesn't exist in the DataFrame df.
I also tried using a single path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/Mydirectoty"; I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give Dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret}

df = dd.read_parquet("s3://folder",
                     storage_options=storage_options)
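For completeness, a pyarrow-only sketch, assuming a recent pyarrow with the dataset API and the same kind of s3fs filesystem object as in the question: pointing it at the dataset root directory (instead of a list of files) lets the hive-style code= directories be discovered as a partition column.
import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()  # same kind of filesystem object as `fs` in the question

# Read the whole directory; "hive" partitioning turns code=1/, code=2/, ... into a column
dataset = ds.dataset("bucket/MyDirectory", filesystem=fs,
                     format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()  # now contains the "code" column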

Reading Spark DataFrame from Redshift returns empty DataFrame

I'm using:
python 3.6.8
spark 2.4.4
I run pyspark with an EMR cluster (emr-5.28.0) with: pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
I have the following jars in the spark classpath:
http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/2.0.1/spark-redshift_2.11-2.0.1.jar
http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
https://github.com/ralfstx/minimal-json/releases/download/0.9.5/minimal-json-0.9.5.jar
https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.15.1025/RedshiftJDBC41-no-awssdk-1.2.15.1025.jar
I execute this piece of code:
url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"
df = sqlContext.read.format('com.databricks.spark.redshift')\
    .option("url", url)\
    .option("tempdir", "s3a://bucket/tmp_folder")\
    .option("query", query)\
    .option("aws_iam_role", "arn_iam_role")\
    .load()
Calling df.count() returns 73 (the number of rows of that query), but df.show(4) returns an empty DataFrame: no error, it just prints the schema.
I've made it work by changing the format to 'jdbc' and only using the Databricks driver to write data, not to read it.
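A minimal sketch of that read path, assuming the Redshift JDBC driver listed above is on the classpath (the driver class name is the one documented for the JDBC41 driver; adjust it if yours differs), reusing url and query from the question:
# Plain JDBC read instead of the com.databricks.spark.redshift source
subquery = "({}) as t".format(query)
df = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("driver", "com.amazon.redshift.jdbc41.Driver") \
    .option("dbtable", subquery) \
    .load()

df.show(4)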

connecting pyspark to ignite

All,
I have been struggling with PySpark and Ignite integration for the last two weeks and I am at my wits' end.
I have been trying to upload a table created in PySpark to Ignite.
I have been starting the script like:
spark-submit --master spark://my_host:my_port --jars $IGNITE_HOME/libs/*jar,$IGNITE_HOME/libs/optional/ignite-spark/*jar,$IGNITE_HOME/libs/ignite-spring/*jar,$IGNITE_HOME/libs/ignite-indexing/*jar my_python_script.py
and my_python_script.py was like:
import pyspark

spark = pyspark.sql.SparkSession\
    .builder\
    .appName("Ignite")\
    .getOrCreate()

# create the data frame
columns = ["COL1", "COL2", "ID"]
vals = [("a", "b", 0), ("c", "d", 1)]
df = spark.createDataFrame(vals, columns)

df.write\
    .format("jdbc")\
    .option("driver", "org.apache.ignite.IgniteJdbcThinDriver")\
    .option("url", "jdbc:ignite:thin://my_url:my_port")\
    .option("user", "my_user")\
    .option("password", "my_password")\
    .option("dbtable", "my_table")\
    .option("mode", "overwrite")\
    .save()
And I keep getting errors... For the above, the error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.sql.SQLException: no PRIMARY KEY defined for CREATE TABLE
    at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:750)
Can anyone please help?
My Spark version is 2.4.0, Python 2.7, Ignite 2.7.
Is there a reason that you're not using the Spark-Ignite integration? JDBC should work but there's a better way, especially since you're already including all the right JARs.
df.write.format("ignite")
.option("table","my_table")
.option("primaryKeyFields","COL1")
.option("config",configFile)
.option("mode","overwrite")
.save()
Also note the inclusion of the "primaryKeyFields" option. As your error message notes, the version using JDBC fails because you've not defined a primary key.
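For what it's worth, a sketch of reading the table back through the same data source (configFile again stands for the path to your Ignite Spring XML configuration):
df2 = spark.read.format("ignite") \
    .option("table", "my_table") \
    .option("config", configFile) \
    .load()
df2.show()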

Kudu with PySpark2: Error with KuduStorageHandler

I am trying to read data stored as Kudu using PySpark 2.1.0:
>>> from os.path import expanduser, join, abspath
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder \
.master("local") \
.appName("HivePyspark") \
.config("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
>>> spark.sql("select count(*) from mySchema.myTable").show()
I have Kudu 1.2.0 installed on the cluster. Those are Hive/Impala tables.
When I execute the last line, I get the following error:
.
.
.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
.
.
.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:315)
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:284)
... 61 more
Caused by: java.lang.ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler
I am referring to the following resources:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
https://issues.apache.org/jira/browse/KUDU-1603
https://github.com/bkvarda/iot_demo/blob/master/total_data_count.py
https://kudu.apache.org/docs/developing.html#_kudu_python_client
I would like to know how I can include the Kudu-related dependencies in my pyspark program so that I can get past this error.
The way I solved this issue was to pass the respective kudu-spark JAR to the pyspark2 shell or to the spark2-submit command (Apache Spark 2.3).
Below is the code for your reference.
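For instance, the shell or job could be launched like this (a sketch: the Maven coordinates assume the kudu-spark2 artifact built for Scala 2.11, the version shown is illustrative and should match your Kudu release, and the JAR path and script name are placeholders):
pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.9.0
# OR, if you already have the JAR locally
spark2-submit --jars /path/to/kudu-spark2_2.11.jar my_script.py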
Read a Kudu table from pyspark with the code below:
kuduDF = spark.read.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .load()
kuduDF.show(5)
Write to a Kudu table with the code below:
DF.write.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .mode("append") \
    .save()
Reference link: https://medium.com/#sciencecommitter/how-to-read-from-and-write-to-kudu-tables-in-pyspark-via-impala-c4334b98cf05
In case you want to use Scala, below is the reference link:
https://kudu.apache.org/docs/developing.html

Hive on spark doesn't work in hue

I am trying to trigger Hive on Spark using the Hue interface. The job works perfectly when run from the command line, but when I try to run it from Hue it throws exceptions. In Hue, I tried mainly two things:
1) When I give all the properties in the .hql file using set commands:
set spark.home=/usr/lib/spark;
set hive.execution.engine=spark;
set spark.eventLog.enabled=true;
add jar /usr/lib/spark/assembly/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar;
set spark.eventLog.dir=hdfs://10.11.50.81:8020/tmp/;
set spark.executor.memory=2899102923;
I get an error:
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Unsupported execution engine: Spark. Please set hive.execution.engine=mr)'
org.apache.hadoop.hive.ql.metadata.HiveException: Unsupported execution engine: Spark. Please set hive.execution.engine=mr
2) When I give the properties in Hue's properties, it only works with the MR engine, not with the Spark execution engine.
Any help would be appreciated.
I solved this issue by using a shell action in Oozie.
This shell action invokes a pyspark script that carries my SQL.
Even though the job shows as MR in the JobTracker, the Spark history server recognizes it as a Spark job and the output is produced.
shell file:
#!/bin/bash
export PYTHONPATH=`pwd`
spark-submit --master local testabc.py
python file:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
result = sqlContext.sql("insert into table testing_oozie.table2 select * from testing_oozie.table1")
result.show()
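For newer Spark versions, the same workaround can be written with SparkSession instead of the older SparkContext/HiveContext pair; a sketch, assuming Hive support is available on the cluster:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive_insert") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("insert into table testing_oozie.table2 select * from testing_oozie.table1")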