All,
I have been struggling with PySpark and Ignite integration for the last two weeks and I am at my wits' end.
I am trying to upload a table created in PySpark to Ignite.
I start the script like this:
spark-submit --master spark://my_host:my_port --jars $IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/*.jar,$IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar my_python_script.py
and my_python_script.py looks like this:
import pyspark

spark = pyspark.sql.SparkSession \
    .builder \
    .appName("Ignite") \
    .getOrCreate()

# create the data frame
columns = ["COL1", "COL2", "ID"]
vals = [("a", "b", 0), ("c", "d", 1)]
df = spark.createDataFrame(vals, columns)

df.write \
    .format("jdbc") \
    .option("driver", "org.apache.ignite.IgniteJdbcThinDriver") \
    .option("url", "jdbc:ignite:thin://my_url:my_port") \
    .option("user", "my_user") \
    .option("password", "my_password") \
    .option("dbtable", "my_table") \
    .option("mode", "overwrite") \
    .save()
And I keep getting errors. For the above, the error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
    at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:750)
Can anyone please help?
My versions: Spark 2.4.0, Python 2.7, Ignite 2.7.
Is there a reason that you're not using the Spark-Ignite integration? JDBC should work but there's a better way, especially since you're already including all the right JARs.
df.write.format("ignite")
.option("table","my_table")
.option("primaryKeyFields","COL1")
.option("config",configFile)
.option("mode","overwrite")
.save()
Also note the inclusion of the "primaryKeyFields" option. As your error message notes, the version using JDBC fails because you've not defined a primary key.
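For completeness, here is a hedged, self-contained PySpark sketch of that approach. The path to the Ignite XML configuration file is an assumption (point it at your own client config), and ID is picked as the primary key column only because the sample data in the question has one:

import pyspark

spark = pyspark.sql.SparkSession.builder.appName("Ignite").getOrCreate()

df = spark.createDataFrame([("a", "b", 0), ("c", "d", 1)], ["COL1", "COL2", "ID"])

df.write \
    .format("ignite") \
    .option("table", "my_table") \
    .option("primaryKeyFields", "ID") \
    .option("config", "/path/to/default-config.xml") \  # assumed path to your Ignite config
    .mode("overwrite") \
    .save()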
Related
Hello, I installed PySpark and I have a local Postgres database (in DBeaver).
How can I connect to Postgres from PySpark, please?
I tried this:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/coucou'
properties = {'user': 'postgres', 'password': 'admin'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tw_db', properties=properties
)
but I get this error:
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: C:/Users/Desktop/postgresql-42.2.23.jre7.jar
You need to add the jars you want to use when creating the SparkSession.
See this:
https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management
Either when you start pyspark
pyspark --repositories MAVEN_REPO
# OR
pyspark --jars PATH_TO_JAR
or when you create your SparkSession object:
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars.packages", "MAVEN_PACKAGE")
# OR
SparkSession.builder.master("yarn").appName(app_name).config("spark.jars", "PATH_TO_JAR")
You need Maven packages when you do not have the jar locally or when your jars need additional dependencies.
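For the PostgreSQL question above, a minimal sketch would look like this. The jar path, database name, table, and credentials are taken from the question; org.postgresql.Driver is the standard PostgreSQL JDBC driver class:

from pyspark.sql import SparkSession

# jar path taken from the error message in the question
spark = SparkSession.builder \
    .appName("postgres_read") \
    .config("spark.jars", "C:/Users/Desktop/postgresql-42.2.23.jre7.jar") \
    .getOrCreate()

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/coucou") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "tw_db") \
    .option("user", "postgres") \
    .option("password", "admin") \
    .load()

df.show()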
I am new to Spark and Python. I have a SQL statement stored in a Python variable, and we use a Snowflake database. How can I create a Spark DataFrame from that SQL using the Snowflake connection?
import sf_connectivity  # we have code for establishing a connection with the Snowflake database
emp = 'Select * From Employee'
snowflake_connection = sf_connectivity.collector()  # a method to establish the Snowflake connection
Requirement 1: create a Spark DataFrame (sf_df) using 'emp' and 'snowflake_connection'
Requirement 2: sf_df.createOrReplaceTempView("Temp_Employee")
What packages or libraries does this require? How can I make it work?
The documentation that helped me figure this out is here: https://docs.databricks.com/data/data-sources/snowflake.html
It took me a while to figure out how to get it working, though! After a lot of questions, I had my company's IT department configure a Snowflake user account with private/public key authentication, and they configured that ID to be accessible within our corporate Databricks account.
After they set this up, the following code is an example of how to pass a SQL command as a variable to Spark and have Spark convert it into a DataFrame.
optionsSource = dict(
    sfUrl="mycompany.east-us-2.azure.snowflakecomputing.com",  # Snowflake account URL
    sfUser="my_service_acct",
    pem_private_key=dbutils.secrets.get("my_scope", "my_secret"),
    sfDatabase="mydatabase",    # Snowflake database
    sfSchema="myschema",        # Snowflake schema
    sfWarehouse="mywarehouse",
    sfRole="myrole"
)
sqlcmd = '''
select current_date;
'''
df = spark.read.format("snowflake").options(**optionsSource).option("query", sqlcmd).load()
display(df)
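To cover requirement 2 from the question, the resulting DataFrame can then be exposed to SQL as a temp view:

# register the DataFrame under the view name used in the question
df.createOrReplaceTempView("Temp_Employee")
spark.sql("select * from Temp_Employee").show()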
With public/private key authentication, you need to generate a key pair first:
https://community.snowflake.com/s/article/How-to-connect-snowflake-with-Spark-connector-using-Public-Private-Key
Then you can use the code below.
from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization
import re
import os
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.jars", "<path/to/>/snowflake-jdbc-<version>.jar,<path/to/>/spark-snowflake_2.11-2.4.13-spark_2.4.jar") \
    .config("spark.repl.local.jars",
            "<path/to/>/snowflake-jdbc-<version>.jar,<path/to/>/spark-snowflake_2.11-2.4.13-spark_2.4.jar") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.disablePushdownSession(
    spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())

with open("<path/to/>/rsa_key.p8", "rb") as key_file:
    p_key = serialization.load_pem_private_key(
        key_file.read(),
        password=os.environ['PRIVATE_KEY_PASSPHRASE'].encode(),
        backend=default_backend()
    )

pkb = p_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption()
)
pkb = pkb.decode("UTF-8")
pkb = re.sub("-*(BEGIN|END) PRIVATE KEY-*\n", "", pkb).replace("\n", "")
sfOptions = {
    "sfURL": "<URL>",
    "sfAccount": "<ACCOUNTNAME>",
    "sfUser": "<USER_NAME>",
    "pem_private_key": pkb,
    # "sfPassword": "xxxxxxxxx",
    "sfDatabase": "<DBNAME>",
    "sfSchema": "<SCHEMA_NAME>",
    "sfWarehouse": "<WH_NAME>",
    "sfRole": "<ROLENAME>",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "<TABLENAME>") \
    .load()
df.show()
I'm using:
python 3.6.8
spark 2.4.4
I run pyspark with an EMR cluster (emr-5.28.0) with: pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
I have the following jars in the spark classpath:
http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/2.0.1/spark-redshift_2.11-2.0.1.jar
http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
https://github.com/ralfstx/minimal-json/releases/download/0.9.5/minimal-json-0.9.5.jar
https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.15.1025/RedshiftJDBC41-no-awssdk-1.2.15.1025.jar
I execute this piece of code:
url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"
df = sqlContext.read.format('com.databricks.spark.redshift') \
    .option("url", url) \
    .option("tempdir", "s3a://bucket/tmp_folder") \
    .option("query", query) \
    .option("aws_iam_role", "arn_iam_role") \
    .load()
df.count() returns 73 (which is the number of rows of that query), but df.show(4) returns an empty DataFrame: no error, it just prints the schema.
I made it work by changing the format to 'jdbc' and only using the Databricks connector to write data, not to read it.
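A minimal sketch of that workaround, reading through Spark's built-in JDBC source instead. The driver class name is an assumption based on the RedshiftJDBC41 jar listed above, and the "query" option needs Spark 2.4+, which matches the versions in the question:

url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"

# driver class assumed from the RedshiftJDBC41 jar; adjust if you use a different driver
df = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("driver", "com.amazon.redshift.jdbc41.Driver") \
    .option("query", query) \
    .load()

df.show(4)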
I am trying to read data stored as Kudu using PySpark 2.1.0:
>>> from os.path import expanduser, join, abspath
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("HivePyspark") \
...     .config("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse") \
...     .enableHiveSupport() \
...     .getOrCreate()
>>> spark.sql("select count(*) from mySchema.myTable").show()
I have Kudu 1.2.0 installed on the cluster. These are Hive/Impala tables.
When I execute the last line, I get the following error:
...
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
    at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:315)
    at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:284)
    ... 61 more
Caused by: java.lang.ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler
I am referring to the following resources:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
https://issues.apache.org/jira/browse/KUDU-1603
https://github.com/bkvarda/iot_demo/blob/master/total_data_count.py
https://kudu.apache.org/docs/developing.html#_kudu_python_client
I would like to know how I can include the Kudu-related dependencies in my PySpark program so that I can move past this error.
The way I solved this issue was to pass the respective jar for kudu-spark to the pyspark2 shell or to the spark2-submit command (Apache Spark 2.3).
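If you would rather pull the dependency in at session creation than on the command line, here is a hedged sketch; the exact Maven coordinate and version are assumptions and have to match your cluster's Kudu and Scala versions:

from pyspark.sql import SparkSession

# kudu-spark2_2.11 version is an assumption; pick the one matching your Kudu release
spark = SparkSession.builder \
    .appName("kudu_read") \
    .config("spark.jars.packages", "org.apache.kudu:kudu-spark2_2.11:1.9.0") \
    .getOrCreate()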
Below is the code for your reference. Read a Kudu table from PySpark with:
kuduDF = spark.read.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .load()
kuduDF.show(5)
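If you then want the SQL from the original question, register the DataFrame as a temporary view first:

kuduDF.createOrReplaceTempView("myTable")
spark.sql("select count(*) from myTable").show()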
Write to a Kudu table with:
DF.write.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .mode("append") \
    .save()
Reference link: https://medium.com/@sciencecommitter/how-to-read-from-and-write-to-kudu-tables-in-pyspark-via-impala-c4334b98cf05
If you want to use Scala, here is the reference link:
https://kudu.apache.org/docs/developing.html
I am very new to Apache Spark.
I have already configured Spark 2.0.2 on my local Windows machine.
I have done the "word count" example with Spark.
Now I have a problem executing SQL queries.
I have searched for this, but I am not getting proper guidance.
So you need to do the following to get it done.
In Spark 2.0.2 we have SparkSession, which contains the SparkContext instance as well as the sqlContext instance.
Hence the steps would be:
Step 1: Create SparkSession
val spark = SparkSession.builder().appName("MyApp").master("local[*]").getOrCreate()
Step 2: Load from the database, in your case MySQL.
val loadedData = spark
  .read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydatabase")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "mytable")
  .option("user", "root")
  .option("password", "toor")
  .load().createOrReplaceTempView("mytable")
Step 3: Now you can run your SQL query just like you do in a SQL database.
val dataFrame=spark.sql("Select * from mytable")
dataFrame.show()
P.S.: It would be better to use the DataFrame APIs, or even better the Dataset APIs, but for those you need to go through the documentation.
Link to Documentation: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset
In Spark 2.x you no longer reference sqlContext, but rather spark, so you need to do:
df = spark \
    .read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "mydb") \
    .option("user", "root") \
    .option("password", "") \
    .load()
You should have your Spark DataFrame.
Create a temp view out of the DataFrame:
df.createOrReplaceTempView("dftable")
dfsql = spark.sql("select * from dftable")
You can use long queries in statement format:
sql_statement = """
select sensorid, objecttemp_c,
year(DateTime) as year_value,
month(DateTime) as month_value,
day(DateTime) as day_value,
hour(DateTime) as hour_value
from dftable
order by 1 desc
"""
dfsql = spark.sql(sql_statement)
It's rather simple now in Spark to run SQL queries.
You can do SQL on DataFrames as pointed out by others, but the question is really how to run plain SQL:
spark.sql("SHOW TABLES;")
that's it.