Query SQL Server table in Azure Databricks - apache-spark-sql

I am using the code below to query a SQL Server table hr.employee in my Azure SQL database from Azure Databricks. I am new to Spark SQL and trying to learn the nuances one step at a time.
Azure Databricks:
%scala
val jdbcHostname = dbutils.widgets.get("hostName")
val jdbcPort = 1433
val jdbcDatabase = dbutils.widgets.get("database")
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
%scala
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
%scala
val employee = spark.read.jdbc(jdbcUrl, "hr.Employee", connectionProperties)
%scala
spark.sql("select * from employee")
%sql
select * from employee
employee.select("col1","col2").show()
I get the error below. I am not sure what I am doing wrong; I have tried a couple of variations as well, with no luck so far.
Error:
';' expected but integer literal found.
command-922779590419509:26: error: not found: value %
%sql
command-922779590419509:27: error: not found: value select
select * from employee
command-922779590419509:27: error: not found: value from
select * from employee
command-922779590419509:16: error: not found: value %
%scala
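For context: in Databricks, magic commands such as %scala and %sql are only recognized on the first line of a cell, so pasting them mid-cell leaves the Scala compiler to parse the literal % and the SQL text, which matches the errors above. Also, a DataFrame is not visible to spark.sql or %sql until it is registered as a temporary view. A minimal sketch of how the last steps could be split into separate cells (the cell boundaries and the view name employee are assumptions):
%scala
val employee = spark.read.jdbc(jdbcUrl, "hr.Employee", connectionProperties)
// register the DataFrame so it can be referenced by name from SQL
employee.createOrReplaceTempView("employee")
employee.select("col1", "col2").show()
%sql
-- run in its own cell so the notebook, not the Scala compiler, handles the magic command
select * from employee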

Related

Push a SQL query to a server from a JDBC connection which reads from multiple databases within that server

I'm pushing a query down to a server to read data into Databricks as below:
val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcHostname = ""
val jdbcPort = ...
val jdbcDatabase = ""
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"
// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")
Running display(table_names) provides a list of tables for the defined database. That works fine; however, when trying to read and join tables from multiple databases on the same server, I haven't yet found a solution that works.
An example would be:
// define a query to be passed to the database to produce a result across many tables
val report1_query = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 == b.Field8) as report1"
// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_query, connectionProperties)
report1_results.createOrReplaceTempView("report1_results")
Any pointers on restructuring this code would be appreciated (an equivalent in Python would also be super helpful).
SQL Server uses 3-part naming like database.schema.table. This example comes from the SQL Server information_schema docs:
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';
To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.
SELECT a.Field1, b.Field2
FROM database_1.schema_1.table_1 as a
LEFT JOIN database_2.schema_2.table_2 as b
ON a.Field4 = b.Field8
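Wired into the push-down pattern from the question, that looks roughly like this (a minimal sketch; the schema and table names are placeholders from the example above):
// wrap the three-part-name query in parentheses and give it an alias so it can be used as the dbtable argument
val report1_query = """(SELECT a.Field1, b.Field2
  FROM database_1.schema_1.table_1 AS a
  LEFT JOIN database_2.schema_2.table_2 AS b
  ON a.Field4 = b.Field8) AS report1"""
val report1_results = spark.read.jdbc(jdbcUrl, report1_query, connectionProperties)
report1_results.createOrReplaceTempView("report1_results")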

Spark SQL push-down query issue with MAX function in MS SQL

I want to execute the aggregate function MAX on a table's ID column residing in MS SQL. I am using Spark SQL 1.6 with the JDBC push-down query approach, as I don't want Spark SQL to pull all the data to the Spark side and compute MAX(ID) there. When I execute the code below I get the exception below, whereas if I use SELECT * FROM in the code it works as expected.
Code:
def getMaxID(sqlContext: SQLContext, tableName: String) = {
  val pushdown_query = s"(SELECT MAX(ID) FROM ${tableName}) as t"
  val maxID = sqlContext.read.jdbc(url = getJdbcProp(sqlContext.sparkContext).toString, table = pushdown_query, properties = getDBConnectionProperties(sqlContext.sparkContext))
    .head().getLong(0)
  maxID
}
Exception:
Exception in thread "main" java.sql.SQLException: No column name was specified for column 1 of 't'.
at net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java:372)
at net.sourceforge.jtds.jdbc.TdsCore.tdsErrorToken(TdsCore.java:2988)
at net.sourceforge.jtds.jdbc.TdsCore.nextToken(TdsCore.java:2421)
at net.sourceforge.jtds.jdbc.TdsCore.getMoreResults(TdsCore.java:671)
at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:505)
at net.sourceforge.jtds.jdbc.JtdsPreparedStatement.executeQuery(JtdsPreparedStatement.java:1029)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:124)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
This exception is not related to Spark itself: to resolve the schema, Spark wraps the push-down query as a subquery, and SQL Server requires every column of that subquery to have a name. You have to provide an alias for the aggregated column:
val pushdown_query = s"(SELECT MAX(ID) AS max_id FROM ${tableName}) as t"

Error while creating Hive table from MapR-DB (HBase-like)

I have created a MapR-DB table in Spark code:
case class MyLog(count: Int, message: String)
val conf = new SparkConf().setAppName("Launcher").setMaster("local[2]")
val sc = new SparkContext(conf)
val data = Seq(MyLog(3, "monmessage"))
val log_rdd = sc.parallelize(data)
log_rdd.saveToMapRDB("/tables/tablelog",createTable = true, idFieldPath = "message")
When I print this row from the Spark code, I get the following in the console:
{"_id":"monmessage","count":3,"message":"monmessage"}
I would like to create a Hive table so that I can run SELECT or other queries against this table, so I tried this:
CREATE EXTERNAL TABLE mapr_table_2(count int, message string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "message")
TBLPROPERTIES("hbase.table.name" = "/tables/tablelog");
but I get:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Error: the HBase columns mapping contains a badly formed column family, column qualifier specification.)
I took the CREATE TABLE query from this link:
http://maprdocs.mapr.com/home/Hive/HiveAndMapR-DBIntegration-GettingStarted.html
By the way, I don't understand what I need to put in the hbase.columns.mapping line. Do you have any idea how to create the table? Thanks.

Spark SQL SQLContext

I'm trying to select data from an MS SQL database via SQLContext.sql in a Spark application.
The connection works, but I'm not able to select data from the table because it always fails on the table name.
Here is my code:
val prop=new Properties()
val url2="jdbc:jtds:sqlserver://servername;instance=MSSQLSERVER;user=sa;password=Pass;"
prop.setProperty("user","username")
prop.setProperty("driver" , "net.sourceforge.jtds.jdbc.Driver")
prop.setProperty("password","mypassword")
val test=sqlContext.read.jdbc(url2,"[dbName].[dbo].[Table name]",prop)
sqlContext.sql("""
SELECT *
FROM 'dbName.dbo.Table name'
""")
I tried the table name without the quotes ('), and also as [dbName].[dbo].[Table name], but it's still the same:
Exception in thread "main" java.lang.RuntimeException: [3.14] failure:
``union'' expected but `.' found
dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.1" //%"provided"
// https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.6.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1" //%"provided"
I think the problem in your code is that the query you pass to sqlContext.sql has no access to the original table in the source database. It only has access to the tables saved within the SQLContext, for example with df.write.saveAsTable() or with df.registerTempTable() (df.createTempView in Spark 2+).
So, in your specific case, I can suggest a couple of options:
1) If you want the query to be executed on the source database with the exact SQL syntax of that database, you can pass the query to the "dbtable" argument:
val query = "SELECT * FROM dbName.dbo.TableName"
val df = sqlContext.read.jdbc(url2, s"($query) AS subquery", prop)
df.show
Note that the query needs to be in parentheses, because it will be passed to a "FROM" clause, as specified in the docs:
dbtable: The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
2) If you don't need to run the query on the source database, you can just pass the table name and then create a temp view in the sqlContext:
val table = sqlContext.read.jdbc(url2, "dbName.dbo.TableName", prop)
table.registerTempTable("temp_table")
val df = sqlContext.sql("SELECT * FROM temp_table")
// or sqlContext.table("temp_table")
df.show()
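For reference, the same flow in Spark 2+ would look roughly like this (a sketch; spark is the SparkSession, and the URL and properties are the ones defined in the question):
val table = spark.read.jdbc(url2, "dbName.dbo.TableName", prop)
table.createOrReplaceTempView("temp_table")
spark.sql("SELECT * FROM temp_table").show()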

Spark SQL aggregation from Cassandra

This is my query, written for a MySQL database:
SELECT dcm.user, du.full_name, ROUND(AVG(fcg.final)) ,ROUND(AVG(fcg.participation))
FROM dimclassmem dcm LEFT JOIN factGlobal fcg on fcg.class_id=dcm.class_id GROUP BY dcm.user ORDER BY dcm.user
I can run this using Java + MySQL.
Now I want to write this query using Spark SQL.
How can I write an aggregate function in Spark?
I am fetching data from a Cassandra table, and a simple query like the one in the code below works:
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext("local", "Cassandra Connector Test", conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext
.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "Sample", "table" -> "testTable"))
.load()
df.registerTempTable("testTable")
val ddf = sqlContext.sql("select name from testTable order by name desc limit 10")
ddf.show()
But if I use the following code, it doesn't work:
val countallRec = sqlContext.sql("Select count(name) from testTable")
countallRec.show()
I am getting the exception below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/MapPartitionsWithPreparationRDD
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
I want to run the above query using Spark SQL. How can I do it?
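Two notes here. First, a NoClassDefFoundError on an internal Spark class like MapPartitionsWithPreparationRDD typically points to mismatched Spark artifacts on the classpath rather than to the query itself, so it is worth checking that spark-core, spark-sql, and the spark-cassandra-connector are all built against the same Spark version. Second, once the classpath is consistent, the aggregation can be written as plain Spark SQL against registered temp tables; a minimal sketch, assuming dimclassmem and factGlobal are loaded from Cassandra and registered the same way as testTable (the dimuser join for full_name is omitted because it is not shown in the original query):
val report = sqlContext.sql("""
  SELECT dcm.user,
         ROUND(AVG(fcg.final)) AS avg_final,
         ROUND(AVG(fcg.participation)) AS avg_participation
  FROM dimclassmem dcm
  LEFT JOIN factGlobal fcg ON fcg.class_id = dcm.class_id
  GROUP BY dcm.user
  ORDER BY dcm.user
""")
report.show()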