This is my query, written for a MySQL database:
SELECT dcm.user, du.full_name, ROUND(AVG(fcg.final)) ,ROUND(AVG(fcg.participation))
FROM dimclassmem dcm LEFT JOIN factGlobal fcg on fcg.class_id=dcm.class_id GROUP BY dcm.user ORDER BY dcm.user
I can run this using Java + MySQL.
Now I want to write this query using Spark SQL.
How can I write aggregate functions in Spark?
I am fetching data from a Cassandra table, and the simple query in the code below works:
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext("local", "Cassandra Connector Test", conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext
.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "Sample", "table" -> "testTable"))
.load()
df.registerTempTable("testTable")
val ddf = sqlContext.sql("select name from testTable order by name desc limit 10")
ddf.show()
But if I use the following code, it doesn't work:
val countallRec = sqlContext.sql("Select count(name) from testTable")
countallRec.show()
I get the exception below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/MapPartitionsWithPreparationRDD
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
I want to run the above query using Spark SQL. How can I do it?
I'm pushing a query down to a server to read data into Databricks as below:
val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcHostname = ""
val jdbcPort = ...
val jdbcDatabase = ""
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"
// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")
Running display(table_names) provides a list of tables for the given database. This works fine; however, when trying to read and join tables from multiple databases on the same server, I haven't yet found a solution that works.
An example would be:
// define a query to be passed to database to display a result across many tables
val report1_query = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 = b.Field8) as report1"
// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_query, connectionProperties)
report1_results.createOrReplaceTempView("report1_results")
Any pointers on restructuring this code would be appreciated (an equivalent in Python would also be super helpful).
SQL Server uses 3-part naming like database.schema.table. This example comes from the SQL Server information_schema docs:
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';
To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.
SELECT a.Field1, b.Field2
FROM database_1.schema_1.table_1 as a
LEFT JOIN database_2.schema_2.table_2 as b
on a.Field4 = b.Field8
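As a rough sketch of how the three-part names could be wired into the pushdown string from the question: the TableRef case class and the leftJoinQuery helper below are hypothetical names introduced only for illustration, and the database, schema, table, and field names are the placeholders used above. The builder is plain string formatting, so it can be checked without a cluster.

```scala
// Hypothetical helper type: one SQL Server three-part name (database.schema.table).
final case class TableRef(database: String, schema: String, table: String) {
  def qualified: String = s"$database.$schema.$table"
}

object CrossDatabasePushdown {
  // Builds the parenthesized, aliased subquery that spark.read.jdbc
  // accepts in place of a table name.
  def leftJoinQuery(left: TableRef, right: TableRef, alias: String): String =
    s"""(SELECT a.Field1, b.Field2
       |FROM ${left.qualified} AS a
       |LEFT JOIN ${right.qualified} AS b
       |ON a.Field4 = b.Field8) AS $alias""".stripMargin

  def main(args: Array[String]): Unit = {
    val report1Query = leftJoinQuery(
      TableRef("database_1", "schema_1", "table_1"),
      TableRef("database_2", "schema_2", "table_2"),
      "report1")
    println(report1Query)

    // With jdbcUrl and connectionProperties from the question in scope:
    // val report1 = spark.read.jdbc(jdbcUrl, report1Query, connectionProperties)
    // report1.createOrReplaceTempView("report1_results")
  }
}
```

The same query string should carry over to PySpark unchanged, since its spark.read.jdbc also takes a url, a table (or aliased subquery), and a properties dictionary. The login in connectionProperties needs read access to both databases for the join to run on the server.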
I am using the code below to query a SQL Server table, hr.employee, in my Azure SQL Server database using Azure Databricks. I am new to Spark SQL and am trying to learn the nuances one step at a time.
Azure Databricks:
%scala
val jdbcHostname = dbutils.widgets.get("hostName")
val jdbcPort = 1433
val jdbcDatabase = dbutils.widgets.get("database")
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
%scala
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
%scala
val employee = spark.read.jdbc(jdbcUrl, "hr.Employee", connectionProperties)
%scala
spark.sql("select * from employee")
%sql
select * from employee
employee.select("col1","col2").show()
I get the error below. I am not sure what I am doing wrong. I tried a couple of variations as well, with no luck so far.
Error:
';' expected but integer literal found.
command-922779590419509:26: error: not found: value %
%sql
command-922779590419509:27: error: not found: value select
select * from employee
command-922779590419509:27: error: not found: value from
select * from employee
command-922779590419509:16: error: not found: value %
%scala
I'm trying to create a table in Spark (Scala) and then insert values from two existing dataframes, but I got this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `stat_type_predicate_percentage`, ErrorIfExists
Here is the code:
case class stat_type_predicate_percentage (type1: Option[String], predicate: Option[String], outin: Option[Int], percentage: Option[Float])
object LoadFiles1 {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "LoadFiles1")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
//statistics
val create = spark.sql("CREATE TABLE stat_type_predicate_percentage (type1 String, predicate String, outin INT, percentage FLOAT) USING hive")
val insert1 = spark.sql("INSERT INTO stat_type_predicate_percentage SELECT types.type, res.predicate, 0, 1.0*COUNT(subject)/(SELECT COUNT(subject) FROM MappingBasedProperties AS resinner WHERE res.predicate = resinner.predicate) FROM MappingBasedProperties AS res, MappingBasedTypes AS types WHERE res.subject = types.resource GROUP BY res.predicate,types.type")
val select = spark.sql("SELECT * from stat_type_predicate_percentage" )
}
}
How should I solve it?
You have to enable Hive support in your SparkSession:
val spark = SparkSession
.builder()
.appName("JOB2")
.master("local")
.enableHiveSupport()
.getOrCreate()
This problem may be twofold.
First, you might want to do what @Tanjin suggested in the comments, and it might work afterwards (try adding .config("spark.sql.catalogImplementation","hive") to your SparkSession.builder).
But if you actually want to use an existing Hive instance with its own metadata, which you will be able to query from outside your job, or if you want to use tables that already exist, you may want to add hive-site.xml to your configuration.
This configuration file contains properties you probably want, such as hive.metastore.uris, which lets your context add a new table that will be saved in the metastore. It will also be able to read tables from your Hive instance, thanks to the metastore, which holds the tables and their locations.
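If shipping a hive-site.xml is not convenient, the same property can usually be set on the builder instead. The following is only a sketch under assumptions: the application name and the thrift://metastore-host:9083 URI are placeholders rather than values from the question, and it expects the spark-hive module to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object HiveSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("hive-session-sketch") // placeholder name
      .master("local[*]")
      // Placeholder URI (an assumption): point this at your metastore's Thrift endpoint.
      .config("hive.metastore.uris", "thrift://metastore-host:9083")
      .enableHiveSupport()
      .getOrCreate()

    // Lists the tables the metastore knows about in the current database.
    spark.sql("SHOW TABLES").show()

    spark.stop()
  }
}
```

With a reachable metastore, tables created through spark.sql in this session are recorded there and stay visible to other Hive clients.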
I want to execute the aggregate function MAX on a table's ID column residing in MS SQL. I am using Spark SQL 1.6 and the JDBC pushdown-query approach, because I don't want Spark SQL to pull all the data to the Spark side and compute MAX(ID) there. But when I execute the code below I get the exception below, whereas if I use SELECT * FROM in the code it works as expected.
Code:
def getMaxID(sqlContext: SQLContext,tableName:String) =
{
val pushdown_query = s"(SELECT MAX(ID) FROM ${tableName}) as t"
val maxID = sqlContext.read.jdbc(url = getJdbcProp(sqlContext.sparkContext).toString, table = pushdown_query, properties = getDBConnectionProperties(sqlContext.sparkContext))
.head().getLong(0)
maxID
}
Exception:
Exception in thread "main" java.sql.SQLException: No column name was specified for column 1 of 't'.
at net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java:372)
at net.sourceforge.jtds.jdbc.TdsCore.tdsErrorToken(TdsCore.java:2988)
at net.sourceforge.jtds.jdbc.TdsCore.nextToken(TdsCore.java:2421)
at net.sourceforge.jtds.jdbc.TdsCore.getMoreResults(TdsCore.java:671)
at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:505)
at net.sourceforge.jtds.jdbc.JtdsPreparedStatement.executeQuery(JtdsPreparedStatement.java:1029)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:124)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
This exception is not related to Spark. SQL Server requires every column of a derived table to have a name, so you have to provide an alias for the column:
val pushdown_query = s"(SELECT MAX(ID) AS max_id FROM ${tableName}) as t"
I have a dataframe called df with a column named employee_id. I am doing:
df.registerTempTable("d_f")
val query = """SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) row_number FROM d_f"""
val result = Spark.getSqlContext().sql(query)
But I get the following error. Any help?
[1.29] failure: ``union'' expected but `(' found
SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) row_number FROM d_f
^
java.lang.RuntimeException: [1.29] failure: ``union'' expected but `(' found
SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) row_number FROM d_f
Spark 2.0+
Spark 2.0 introduces a native implementation of window functions (SPARK-8641), so HiveContext should no longer be required. Nevertheless, similar errors that are unrelated to window functions can still be attributed to differences between the SQL parsers.
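For illustration, a minimal sketch of the question's query on Spark 2.0+, assuming a local SparkSession and three made-up employee_id values (neither comes from the question). It shows the SQL form next to the equivalent DataFrame API form; the column alias rn is chosen here only to avoid reusing the function name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object RowNumberSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("row-number-sketch") // placeholder name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the question's df.
    val df = Seq(30, 10, 20).toDF("employee_id")
    df.createOrReplaceTempView("d_f")

    // SQL form: no HiveContext involved.
    val viaSql = spark.sql(
      "SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) AS rn FROM d_f")

    // DataFrame API form of the same window.
    val viaApi = df.withColumn("rn", row_number().over(Window.orderBy("employee_id")))

    viaSql.show()
    viaApi.show()

    spark.stop()
  }
}
```

A window with no PARTITION BY moves all rows into a single partition, so this form is best kept to small inputs.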
Spark <= 1.6
Window functions were introduced in Spark 1.4.0 and require HiveContext to work. SQLContext won't work here.
Be sure you use Spark >= 1.4.0 and create the HiveContext:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
Yes, it is true.
I am using Spark version 1.6.0, and there you need a HiveContext to use the dense_rank function.
From Spark 2.0.0 onwards, HiveContext is no longer needed for dense_rank.
So for Spark 1.4 through 1.6 (before 2.0), you should do it like this.
The table hive_employees has three fields:
place : String,
name : String,
salary : Int
val conf = new SparkConf().setAppName("denseRank test")//.setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val hqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hqlContext.sql("select empid,empname, dense_rank() over(partition by empsalary order by empname) as rank from hive_employees")
result.show()