I am very new to Apache Spark.
I have already configured Spark 2.0.2 on my local Windows machine.
I have completed the "word count" example with Spark.
Now I have a problem executing SQL queries.
I have searched for this, but I am not getting proper guidance.
So you need to do the following to get it done.
In Spark 2.0.2 we have SparkSession, which contains a SparkContext instance as well as a SQLContext instance.
Hence the steps would be :
Step 1: Create SparkSession
val spark = SparkSession.builder().appName("MyApp").master("local[*]").getOrCreate()
Step 2: Load from the database, in your case MySQL.
val loadedData = spark
  .read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydatabase")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "mytable")
  .option("user", "root")
  .option("password", "toor")
  .load()
loadedData.createOrReplaceTempView("mytable")
Step 3: Now you can run your SQL query just like you would in a SQL database.
val dataFrame=spark.sql("Select * from mytable")
dataFrame.show()
P.S.: It would be better to use the DataFrame API, or even better the Dataset API, but for those you need to go through the documentation.
Link to Documentation: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset
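For comparison, here is a minimal sketch of the same kind of read done through the DataFrame API instead of a SQL string (shown in PySpark; the col1/col2 column names and the filter are made up purely for illustration):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build the session and load the table over JDBC, as in the steps above.
spark = SparkSession.builder.appName("MyApp").master("local[*]").getOrCreate()
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydatabase") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "mytable") \
    .option("user", "root") \
    .option("password", "toor") \
    .load()

# Equivalent of: SELECT col1, col2 FROM mytable WHERE col2 > 10
df.select("col1", "col2").filter(F.col("col2") > 10).show()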
In Spark 2.x you no longer reference sqlContext, but rather spark, so you need to do:
df = spark \
    .read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "mydb") \
    .option("user", "root") \
    .option("password", "") \
    .load()
You should now have your Spark DataFrame.
Create a temp view out of the DataFrame:
df.createOrReplaceTempView("dftable")
dfsql = sc.sql("select * from dftable")
You can use long queries in statement format:
sql_statement = """
select sensorid, objecttemp_c,
year(DateTime) as year_value,
month(DateTime) as month_value,
day(DateTime) as day_value,
hour(DateTime) as hour_value
from dftable
order by 1 desc
"""
dfsql = spark.sql(sql_statement)
It's rather simple now in Spark to do SQL queries.
You can do SQL on DataFrames as pointed out by others, but the question is really how to run SQL.
spark.sql("SHOW TABLES;")
that's it.
Related
I am new to Spark and Python. I have a SQL query stored in a Python variable, and we use a Snowflake database. How do I create a Spark DataFrame using that SQL with a Snowflake connection?
import sf_connectivity (we have code for establishing a connection with the Snowflake database)
emp = 'Select * From Employee'
snowflake_connection = sf_connectivity.collector() (it is a method to establish the Snowflake connection)
Requirement 1: Create a Spark DataFrame (sf_df) using 'emp' and 'snowflake_connection'
Requirement 2: sf_df.createOrReplaceTempView("Temp_Employee")
What packages or libraries does this require? How can I make it work?
The documentation that helped me figure this out is here: https://docs.databricks.com/data/data-sources/snowflake.html
It took me a while to figure out how to get it working, though! After a lot of questions, I had my company's IT department configure a Snowflake user account with private/public key authentication, and they configured that ID to be accessible within our corporate Databricks account.
After they set this up, the following code is an example of how to pass a SQL command as a variable to Spark and have Spark convert it into a DataFrame.
optionsSource = dict(sfUrl="mycompany.east-us-2.azure.snowflakecomputing.com", # Snowflake Account Name
sfUser="my_service_acct",
pem_private_key=dbutils.secrets.get("my_scope", "my_secret"),
sfDatabase="mydatabase", # Snowflake Database
sfSchema="myschema", # Snowflake Schema
sfWarehouse="mywarehouse",
sfRole="myrole"
)
sqlcmd = '''
select current_date;
'''
df = spark.read.format("snowflake").options(**optionsSource).option("query", sqlcmd).load()
display(df)
With public/private key authentication, you need to generate a certificate:
https://community.snowflake.com/s/article/How-to-connect-snowflake-with-Spark-connector-using-Public-Private-Key
and then you can use the code below.
from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization
import re
import os
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config("spark.jars", "<path/to/>/snowflake-jdbc-<version>.jar,<path/to/>/spark-snowflake_2.11-2.4.13-spark_2.4.jar") \
.config("spark.repl.local.jars",
"<path/to/>/snowflake-jdbc-<version>.jar,<path/to/>/spark-snowflake_2.11-2.4.13-spark_2.4.jar") \
.config("spark.sql.catalogImplementation", "in-memory") \
.getOrCreate()
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.disablePushdownSession(
spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
with open("<path/to/>/rsa_key.p8", "rb") as key_file:
p_key = serialization.load_pem_private_key(
key_file.read(),
password=os.environ['PRIVATE_KEY_PASSPHRASE'].encode(),
backend=default_backend()
)
pkb = p_key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.PKCS8,
encryption_algorithm=serialization.NoEncryption()
)
pkb = pkb.decode("UTF-8")
pkb = re.sub("-*(BEGIN|END) PRIVATE KEY-*\n", "", pkb).replace("\n", "")
sfOptions = {
"sfURL": "<URL>",
"sfAccount": "<ACCOUNTNAME>",
"sfUser": "<USER_NAME",
"pem_private_key": pkb,
# "sfPassword": "xxxxxxxxx",
"sfDatabase": "<DBNAME>",
"sfSchema": "<SCHEMA_NAME>",
"sfWarehouse": "<WH_NAME>",
"sfRole": "<ROLENAME>",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "<TABLENAME>") \
.load()
df.show()
I'm using:
python 3.6.8
spark 2.4.4
I run pyspark with an EMR cluster (emr-5.28.0) with: pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
I have the following jars in the spark classpath:
http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/2.0.1/spark-redshift_2.11-2.0.1.jar
http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
https://github.com/ralfstx/minimal-json/releases/download/0.9.5/minimal-json-0.9.5.jar
https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.15.1025/RedshiftJDBC41-no-awssdk-1.2.15.1025.jar
I execute this piece of code:
url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"
df = sqlContext.read.format('com.databricks.spark.redshift')\
.option("url", url)\
.option("tempdir", "s3a://bucket/tmp_folder")\
.option("query", query)\
.option("aws_iam_role", "arn_iam_role")\
.load()
df.count() returns 73 (which is the number of rows of that query), but if I do df.show(4) it returns an empty DataFrame: no error, it just prints the schema.
I've made it work by changing the format to 'jdbc' and only using the Databricks driver to write data, not to read it.
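For anyone hitting the same issue, a minimal sketch of what that JDBC-based read could look like (this is my guess at the wiring, reusing the url and query variables from above; the driver class is the one shipped in the RedshiftJDBC41 jar):
# Workaround sketch: read through the plain JDBC source instead of
# com.databricks.spark.redshift. Requires Spark 2.4+ for the "query" option.
df = sqlContext.read \
    .format("jdbc") \
    .option("url", url) \
    .option("driver", "com.amazon.redshift.jdbc41.Driver") \
    .option("query", query) \
    .load()
df.show(4)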
How do I enable the %sql magic in a Jupyter notebook, and how do I use the %sql magic in a cell with the line of code below?
spark.sql('select * from test').show()
Try
%%sparksql
select * from test
Before trying it, install:
pip install sparksql-magic
Refer: https://github.com/cryeo/sparksql-magic
You don't need the %sql magic string to work with Spark SQL. You first need to create a Spark DataFrame as described in the SparkSession API docs, for example with df = spark.createDataFrame(data). Then create a temporary view by calling df.createOrReplaceTempView("test"). Then your query above would work.
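A minimal, self-contained sketch of that approach (the sample data is made up purely for illustration):
from pyspark.sql import SparkSession

# Plain spark.sql() in a notebook cell; no %sql magic needed.
spark = SparkSession.builder.appName("NoMagicNeeded").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("test")
spark.sql("select * from test").show()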
Try
%%sql
select * from test
Link
https://github.com/jupyter-incubator/sparkmagic
All,
I have been struggling with PySpark and Ignite integration for the last two weeks and I am at my wits' end.
I have been trying to upload a table created in PySpark to Ignite.
I have been starting the script like this:
spark-submit --master spark://my_host:my_port --jars $IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/*.jar,$IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar my_python_script.py
and my_python_script.py looked like this:
import pyspark
spark = pyspark.sql.SparkSession\
.builder\
.appName("Ignite")\
.getOrCreate()
# create the data frame
columns = ["COL1", "COL2", "ID"]
vals = [("a", "b", 0), ("c", "d", 1)]
df = spark.createDataFrame(vals, columns)
df.write\
.format("jdbc")\
.option("driver", "org.apache.ignite.IgniteJdbcThinDriver")\
.option("url", "jdbs:ignite:thin://my_url:my_port")\
.option("user", "my_user")\
.option("password", "my_password")\
.option("dbtable", "my_table")\
.option("mode", "overwrite")\
.save()
And I keep getting errors... For the above, the error is: py4j.protocol.Py4JJavaError: An error occurred while calling o48.save. : java.sql.SQLException: no PRIMARY KEY defined for CREATE TABLE at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:750)
Can anyone please help?
My Spark version is 2.4.0, Python 2.7, Ignite 2.7.
Is there a reason that you're not using the Spark-Ignite integration? JDBC should work but there's a better way, especially since you're already including all the right JARs.
df.write.format("ignite")
.option("table","my_table")
.option("primaryKeyFields","COL1")
.option("config",configFile)
.option("mode","overwrite")
.save()
Also note the inclusion of the "primaryKeyFields" option. As your error message notes, the version using JDBC fails because you've not defined a primary key.
I am using the code below to write a DataFrame of 43 columns and about 2,000,000 rows into a table in SQL Server:
dataFrame
.write
.format("jdbc")
.mode("overwrite")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", url)
.option("dbtable", tablename)
.option("user", user)
.option("password", password)
.save()
Sadly, while it does work for small DataFrames, it's either extremely slow or times out for large ones. Any hints on how to optimize it?
I've tried setting rewriteBatchedStatements=true
Thanks.
Try adding the batchsize option to your statement with a value of at least 10000 (tune this value to get better performance) and execute the write again.
From the Spark docs:
The JDBC batch size, which determines how many rows to insert per
round trip. This can help performance on JDBC drivers. This option
applies only to writing. It defaults to 1000.
It's also worth checking out (a sketch with these options set follows below):
the numPartitions option to increase the parallelism (this also determines the maximum number of concurrent JDBC connections), and
the queryTimeout option to increase the timeout for the write.
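A rough PySpark sketch with those options set (the batch size, partition count, and timeout values here are illustrative starting points, not universal recommendations):
# JDBC write tuned with batchsize, numPartitions and queryTimeout.
dataFrame.write \
    .format("jdbc") \
    .mode("overwrite") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", url) \
    .option("dbtable", tablename) \
    .option("user", user) \
    .option("password", password) \
    .option("batchsize", 20000) \
    .option("numPartitions", 8) \
    .option("queryTimeout", 300) \
    .save()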
We resorted to using the azure-sqldb-spark library instead of the default built-in exporting functionality of Spark. This library gives you a bulkCopyToSqlDB method which is a real batch insert and goes a lot faster. It's a bit less practical to use than the built-in functionality, but in my experience it's still worth it.
We use it more or less like this:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import com.microsoft.azure.sqldb.spark.query._
val options = Map(
"url" -> "***",
"databaseName" -> "***",
"user" -> "***",
"password" -> "***",
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
)
// first make sure the table exists, with the correct column types
// and is properly cleaned up if necessary
val query = dropAndCreateQuery(df, "myTable")
val createConfig = Config(options ++ Map("QueryCustom" -> query))
spark.sqlContext.sqlDBQuery(createConfig)
val bulkConfig = Config(options ++ Map(
"dbTable" -> "myTable",
"bulkCopyBatchSize" -> "20000",
"bulkCopyTableLock" -> "true",
"bulkCopyTimeout" -> "600"
))
df.bulkCopyToSqlDB(bulkConfig)
As you can see, we generate the CREATE TABLE query ourselves. You can let the library create the table, but it will just do dataFrame.limit(0).write.sqlDB(config), which can still be pretty inefficient, probably requires you to cache your DataFrame, and doesn't allow you to choose the SaveMode.
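The idea behind dropAndCreateQuery is simply to derive DROP/CREATE DDL from the DataFrame schema. Our helper is Scala, but here is a hypothetical sketch of the same idea in PySpark, with a deliberately simplified type mapping:
# Hypothetical helper: build DROP/CREATE TABLE DDL from a DataFrame's schema.
# The type mapping is simplified and only illustrative.
def drop_and_create_query(df, table):
    type_map = {"string": "NVARCHAR(MAX)", "int": "INT", "bigint": "BIGINT",
                "double": "FLOAT", "timestamp": "DATETIME2", "boolean": "BIT"}
    cols = ", ".join(
        "[{}] {}".format(f.name, type_map.get(f.dataType.simpleString(), "NVARCHAR(MAX)"))
        for f in df.schema.fields)
    return ("IF OBJECT_ID('{t}', 'U') IS NOT NULL DROP TABLE {t}; "
            "CREATE TABLE {t} ({c});").format(t=table, c=cols)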
Also potentially interesting: we had to use an ExclusionRule when adding this library to our sbt build, or the assembly task would fail.
libraryDependencies += "com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2" excludeAll(
ExclusionRule(organization = "org.apache.spark")
)
In order to improve performance using PySpark (due to administrative restrictions, only Python, SQL, and R can be used), one can use the options below.
Method 1: Using JDBC Connector
This method reads and writes the data row by row, resulting in performance issues. Not recommended.
df.write \
.format("jdbc") \
.mode("overwrite or append") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.save()
Method 2: Using Apache Spark connector (SQL Server & Azure SQL)
This method uses bulk insert to read/write data. There are a lot more options that can be further explored.
First install the library using its Maven coordinates on the Databricks cluster, and then use the code below.
Recommended for Azure SQL DB or a SQL Server instance.
https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15
df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.mode("overwrite or append") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option("batchsize", as per need) \
.option("mssqlIsolationLevel", "READ_UNCOMMITTED")\
.save()
Method 3: Using Connector for Azure Dedicated SQL Pool (formerly SQL DW)
This method previously used PolyBase to read and write data to and from Azure Synapse via a staging area (mainly Blob Storage or a Data Lake Storage directory), but now data are read and written using COPY, as the COPY method has improved performance.
Recommended for Azure Synapse
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.save()
Is converting the data to CSV files and copying those CSVs an option for you?
We have automated this process for bigger tables, transferring them to GCP in CSV format rather than moving them through JDBC.
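If that route is an option, a minimal sketch of the export step (df stands for your DataFrame; the bucket path is a placeholder, and writing to gs:// assumes the GCS connector is available on the cluster):
# Dump the DataFrame to object storage as CSV; the database-side bulk load
# (bcp, BULK INSERT, or a cloud-native loader) then happens outside Spark.
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("gs://my-bucket/exports/my_table/")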
You can use the sql-spark connector:
df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.save()
More info is also available in the connector's documentation.