I have a DataFrame called df with a column named employee_id. I am doing:
df.registerTempTable("d_f")
val query = """SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) row_number FROM d_f"""
val result = Spark.getSqlContext().sql(query)
But I am getting the following error. Any help?
[1.29] failure: ``union'' expected but `(' found
SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) row_number FROM d_f
^
java.lang.RuntimeException: [1.29] failure: ``union'' expected but `(' found
SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) row_number FROM d_f
Spark 2.0+
Spark 2.0 introduces a native implementation of window functions (SPARK-8641), so HiveContext should no longer be required. Nevertheless, similar errors that are not related to window functions can still be attributed to differences between the SQL parsers.
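On 2.0+ the original query should therefore run as-is through a SparkSession (a minimal sketch, assuming an existing session named spark and the df from the question):
df.createOrReplaceTempView("d_f")
val result = spark.sql(
  "SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) AS row_number FROM d_f")
result.show()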
Spark <= 1.6
Window functions were introduced in Spark 1.4.0 and require a HiveContext to work; a plain SQLContext won't do here.
Be sure you use Spark >= 1.4.0 and create a HiveContext:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
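With that context in place, the original query parses (a sketch that reuses df and sc from the question; only the context changes):
df.registerTempTable("d_f")
val result = sqlContext.sql(
  "SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) AS row_number FROM d_f")
result.show()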
Yes, it is true.
I am using Spark version 1.6.0, and there you need a HiveContext to use the 'dense_rank' method.
From Spark 2.0.0 onwards a HiveContext is no longer needed for 'dense_rank'.
So for Spark 1.4 and 1.6 (< 2.0) you should do it like this.
The table hive_employees has three fields:
place : String,
name : String,
salary : Int
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("denseRank test") //.setMaster("local")
val sc = new SparkContext(conf)
// A HiveContext (not a plain SQLContext) is required for window functions on Spark < 2.0
val hqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hqlContext.sql(
  "select place, name, dense_rank() over (partition by place order by salary) as rank from hive_employees")
result.show()
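For comparison, on Spark 2.0+ the same ranking can be expressed through the DataFrame API with no HiveContext at all (a minimal sketch, assuming a DataFrame named employees with the three columns above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// rank rows within each place by salary, mirroring the HiveQL above
val windowSpec = Window.partitionBy("place").orderBy("salary")
val ranked = employees.withColumn("rank", dense_rank().over(windowSpec))
ranked.show()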
Related
The following Scala code (you could run it in a Scala worksheet)
import org.apache.spark.sql.catalyst.parser._
import org.apache.spark.sql.internal.SQLConf
val sqlParser = new CatalystSqlParser(SQLConf.get)
val query = """select col1 from table1;"""
//import sqlParser.astBuilder
val parsed = sqlParser.parseExpression(query)
//println(astBuilder.toString)
println(s"parsed: ${parsed.prettyJson}")
throws what looks like an absurd error -
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'from' expecting {<EOF>, '-'}(line 1, pos 12)
== SQL ==
select col1 from table1;
------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:49)
... 37 elided
Has anybody seen this before? I saw the error message on SO, but this is a very simple query, and it shouldn't be erroring out this way.
I am not familiar with calling CatalystSqlParser directly, but the SparkSession.sql method seems happy with your query. Perhaps using that suits your needs?
The following is successfully parsed:
val query = """select col1 from table1;"""
val df = spark.sql(query)
(CatalystSqlParser doesn't appear to be part of the documented API).
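The underlying reason appears to be that parseExpression parses a single expression (something like col1 + 1), so the parser stops at the from keyword. If you do want to drive the internal parser directly, parsePlan looks like the entry point for complete statements (a sketch against the same undocumented API, so it may differ between Spark versions):
import org.apache.spark.sql.catalyst.parser._
import org.apache.spark.sql.internal.SQLConf

val sqlParser = new CatalystSqlParser(SQLConf.get)
// parsePlan accepts full statements and returns a LogicalPlan
val parsed = sqlParser.parsePlan("select col1 from table1")
println(s"parsed: ${parsed.prettyJson}")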
I am using the code below to query a SQL Server table hr.employee in my Azure SQL database using Azure Databricks. I am new to Spark SQL and trying to learn the nuances one step at a time.
Azure Databricks:
%scala
val jdbcHostname = dbutils.widgets.get("hostName")
val jdbcPort = 1433
val jdbcDatabase = dbutils.widgets.get("database")
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
%scala
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
%scala
val employee = spark.read.jdbc(jdbcUrl, "hr.Employee", connectionProperties)
%scala
spark.sql("select * from employee")
%sql
select * from employee
employee.select("col1","col2").show()
I get the error below. Not sure what I am doing wrong. I have tried a couple of variations as well, with no luck so far.
Error:
';' expected but integer literal found.
command-922779590419509:26: error: not found: value %
%sql
command-922779590419509:27: error: not found: value select
select * from employee
command-922779590419509:27: error: not found: value from
select * from employee
command-922779590419509:16: error: not found: value %
%scala
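For context, the "not found: value %" errors appear to come from the %sql and select lines being compiled as Scala inside the same cell. One pattern that sidesteps this is to register the DataFrame as a temp view and keep the SQL in a cell of its own (a sketch, reusing the employee DataFrame from above):
%scala
employee.createOrReplaceTempView("employee")
val fromSql = spark.sql("select * from employee")
fromSql.show()
%sql
select * from employee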
I'm trying to create a table in Spark (Scala) and then insert values from two existing DataFrames, but I get this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `stat_type_predicate_percentage`, ErrorIfExists
Here is the code :
case class stat_type_predicate_percentage (type1: Option[String], predicate: Option[String], outin: Option[Int], percentage: Option[Float])
object LoadFiles1 {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "LoadFiles1")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import java.io.File
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
//statistics
val create = spark.sql("CREATE TABLE stat_type_predicate_percentage (type1 String, predicate String, outin INT, percentage FLOAT) USING hive")
val insert1 = spark.sql("INSERT INTO stat_type_predicate_percentage SELECT types.type, res.predicate, 0, 1.0*COUNT(subject)/(SELECT COUNT(subject) FROM MappingBasedProperties AS resinner WHERE res.predicate = resinner.predicate) FROM MappingBasedProperties AS res, MappingBasedTypes AS types WHERE res.subject = types.resource GROUP BY res.predicate,types.type")
val select = spark.sql("SELECT * from stat_type_predicate_percentage" )
}
}
How should I solve it?
You have to enable Hive support in your SparkSession:
val spark = SparkSession
  .builder()
  .appName("JOB2")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()
This problem may be twofold.
For one, you might want to do what @Tanjin suggested in the comments, and it might work afterwards (try adding .config("spark.sql.catalogImplementation","hive") to your SparkSession builder).
But if you actually want to use an existing Hive instance with its own metadata, which you will be able to query from outside your job, or if you already want to use existing tables, you might want to add hive-site.xml to your configuration.
This configuration file contains some properties you probably want, like hive.metastore.uris, which enables your context to add new tables that will be saved in the metastore. It will also be able to read tables from your Hive instance, thanks to the metastore, which holds table definitions and locations.
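If you take the existing-metastore route, the key property can also be set straight on the builder rather than through hive-site.xml (a minimal sketch; the thrift URI is a placeholder for your metastore host):
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  // equivalent to .config("spark.sql.catalogImplementation", "hive")
  .enableHiveSupport()
  // points the Hive client at an existing metastore; replace the URI with your own
  .config("hive.metastore.uris", "thrift://your-metastore-host:9083")
  .getOrCreate()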
This is my query, written for a MySQL database:
SELECT dcm.user, du.full_name, ROUND(AVG(fcg.final)), ROUND(AVG(fcg.participation))
FROM dimclassmem dcm LEFT JOIN factGlobal fcg on fcg.class_id=dcm.class_id GROUP BY dcm.user ORDER BY dcm.user
I can run this using Java + MySQL.
Now I want to write this query using Spark SQL.
How can I write the aggregate functions in Spark?
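As one illustration (not the only way), the same aggregation can be written with Spark's built-in functions once the two tables are loaded as DataFrames. The DataFrame names below are hypothetical, and du.full_name is left out because the du alias is never joined in the original query:
import org.apache.spark.sql.functions.{avg, col, round}

val result = dimClassMemDF.alias("dcm")
  .join(factGlobalDF.alias("fcg"), col("fcg.class_id") === col("dcm.class_id"), "left")
  .groupBy(col("dcm.user"))
  .agg(
    round(avg(col("fcg.final"))).alias("avg_final"),
    round(avg(col("fcg.participation"))).alias("avg_participation"))
  .orderBy(col("dcm.user"))
result.show()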
I am fetching data from a Cassandra table and performing a simple query; the code below works:
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext("local", "Cassandra Connector Test", conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext
.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "Sample", "table" -> "testTable"))
.load()
df.registerTempTable("testTable")
val ddf = sqlContext.sql("select name from testTable order by name desc limit 10")
ddf.show()
But if I use the following code, it won't work:
val countallRec = sqlContext.sql("Select count(name) from testTable")
countallRec.show()
I am getting the exception below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/MapPartitionsWithPreparationRDD
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
I want to run the above query using Spark SQL. How can I do it?
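As an aside, the same count can also be expressed through the DataFrame API instead of raw SQL (a sketch reusing df from above; if the root cause is a Spark/connector version mismatch on the classpath, this will hit the same exception):
import org.apache.spark.sql.functions.count

val countAllRec = df.agg(count("name"))
countAllRec.show()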
For a table with
create table mytable (
..
)
partitioned by (my_part_column String)
We are executing a Hive SQL query as follows:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
data = hc.sql("select * from my_table limit 10")
The values read back show "my_part_column" as the FIRST item in each row instead of the last one.
Turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1:
https://issues.apache.org/jira/browse/SPARK-5049