Spark SQL equivalent for Oracle functions - apache-spark-sql

I am trying to find the Spark SQL equivalent of the following Oracle expression:
IIF(ISNULL(TYPE_CODE),'UNASSIGNED',TYPE_CODE)

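In Scala:
import org.apache.spark.sql.{functions => f}
// Replace null TYPE_CODE values with "UNASSIGNED"; keep all other values as they are
val new_df = df_name.withColumn("TYPE_CODE", f.when(f.col("TYPE_CODE").isNull, f.lit("UNASSIGNED")).otherwise(f.col("TYPE_CODE")))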
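For the SQL side of the question, one option is COALESCE, which returns its first non-null argument and so covers the "use a default when null" pattern. The following is only a sketch, written in Python rather than Scala. The helper names null_default_sql and with_null_default are made up for illustration, and the pyspark import is deferred so the string helper can be tried without a Spark installation.

```python
def null_default_sql(column, default):
    """Build a Spark SQL expression that substitutes `default` when `column` is NULL.

    Assumption: `column` is a trusted identifier and `default` contains no quotes.
    """
    return "COALESCE({}, '{}')".format(column, default)


def with_null_default(df, column, default):
    """DataFrame-API equivalent; `df` is assumed to be a pyspark.sql.DataFrame."""
    from pyspark.sql import functions as F  # deferred: only needed when this is called
    return df.withColumn(column, F.coalesce(F.col(column), F.lit(default)))


expr = null_default_sql("TYPE_CODE", "UNASSIGNED")
print(expr)  # COALESCE(TYPE_CODE, 'UNASSIGNED')
```

The resulting string can be placed in a query, for example spark.sql("SELECT " + expr + " AS TYPE_CODE FROM some_table"), where some_table is a placeholder name.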

Related

Cannot Create table with spark SQL : Hive support is required to CREATE Hive TABLE (AS SELECT);

I'm trying to create a table in Spark (Scala) and then insert values from two existing DataFrames, but I get this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `stat_type_predicate_percentage`, ErrorIfExists
Here is the code:
case class stat_type_predicate_percentage (type1: Option[String], predicate: Option[String], outin: Option[Int], percentage: Option[Float])
object LoadFiles1 {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "LoadFiles1")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
//statistics
val create = spark.sql("CREATE TABLE stat_type_predicate_percentage (type1 String, predicate String, outin INT, percentage FLOAT) USING hive")
val insert1 = spark.sql("INSERT INTO stat_type_predicate_percentage SELECT types.type, res.predicate, 0, 1.0*COUNT(subject)/(SELECT COUNT(subject) FROM MappingBasedProperties AS resinner WHERE res.predicate = resinner.predicate) FROM MappingBasedProperties AS res, MappingBasedTypes AS types WHERE res.subject = types.resource GROUP BY res.predicate,types.type")
val select = spark.sql("SELECT * from stat_type_predicate_percentage" )
}
}
How should I solve it?
--- You have to enable Hive support in your SparkSession:
val spark = SparkSession
.builder()
.appName("JOB2")
.master("local")
.enableHiveSupport()
.getOrCreate()
This problem may be twofold.
First, you might want to do what @Tanjin suggested in the comments, which may be enough on its own: try adding .config("spark.sql.catalogImplementation","hive") to your SparkSession.builder.
Second, if you want to use an existing Hive instance with its own metadata, so that tables can be queried from outside your job, or if you want to read tables that already exist there, add hive-site.xml to your configuration.
This configuration file contains properties you probably want, such as hive.metastore.uris, which lets your context create a new table that is saved in the metastore. It also lets the job read tables from your Hive instance, because the metastore records the tables and their locations.
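As a rough sketch of both options in PySpark form: the settings can also be passed through the session builder instead of (or in addition to) hive-site.xml. The helper names hive_session_settings and build_hive_session are hypothetical, the metastore URI is a placeholder, and whether the builder-supplied hive.metastore.uris is honored depends on your deployment, so treat this as an assumption to verify.

```python
def hive_session_settings(metastore_uri=None):
    """Collect the session options discussed above into a plain dict."""
    settings = {"spark.sql.catalogImplementation": "hive"}
    if metastore_uri is not None:
        # Placeholder value; normally this property lives in hive-site.xml.
        settings["hive.metastore.uris"] = metastore_uri
    return settings


def build_hive_session(app_name, metastore_uri=None):
    """Create a Hive-enabled SparkSession (requires pyspark with Hive classes)."""
    from pyspark.sql import SparkSession  # deferred: only needed when this is called
    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in hive_session_settings(metastore_uri).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(hive_session_settings("thrift://metastore.example.com:9083"))
```

Keeping the options in a dict makes it easy to print and check what the session will be started with before any table is created.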

Listagg Alternative in Spark SQL [duplicate]

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.
This is not exactly a concise or efficient solution, but you can use UserDefinedAggregateFunction, introduced in Spark 1.5.0:
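import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
import scala.collection.mutable.ArrayBuffer
object GroupConcat extends UserDefinedAggregateFunction {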
def inputSchema = new StructType().add("x", StringType)
def bufferSchema = new StructType().add("buff", ArrayType(StringType))
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[String])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
}
def evaluate(buffer: Row) = UTF8String.fromString(
buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend").alias("friends")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.
You can get a similar effect by combining collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, concat_ws}
df.groupBy($"username")
.agg(concat_ws(",", collect_list($"friend")).alias("friends"))
You can try the collect_list function:
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A")
Or you can register a UDF, something like:
sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))
and then use this function in the query:
sqlContext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
[
('jacques', 'nicolas'),
('jacques', 'georges'),
('jacques', 'francois'),
('bob', 'amelie'),
('bob', 'zoe'),
],
schema=['username', 'friend'],
)
(
friends
.orderBy('friend', ascending=False)
.groupBy('username')
.agg(
array_join(
collect_list('friend'),
delimiter=', ',
).alias('friends')
)
.show(truncate=False)
)
The Spark SQL solution is similar:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends                   |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob     |zoe, amelie               |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
def group_concat(col, distinct=False, sep=','):
    # collect_set drops duplicates; collect_list keeps them
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)
table.groupby('username').agg(group_concat(F.col('friends')).alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the spark SQL resolution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
One way to do it with PySpark < 1.6, which unfortunately doesn't support user-defined aggregate functions:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend")).alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using udfs but, unfortunately, this has led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD then grouping by and manipulating the data in the desired way and then converting the RDD back to a DF as follows:
val df = sc
.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")))
.toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(Row => (Row(0), Row(1)))
.groupByKey()
.map{ case(username:String, groupOfFriends:Iterable[String]) => (username, groupOfFriends.mkString(","))}
.toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below is Python code that achieves group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
spark = SparkSession.builder.master('yarn').getOrCreate()
# UDF to join all list elements with "|"
def combine_cars(car_list, sep='|'):
    collect = sep.join(car_list)
    return collect
test_udf = udf(combine_cars, StringType())
car_list_per_customer.groupBy("Cust_No").agg(F.collect_list("Cust_Cars").alias("car_list")).select("Cust_No",test_udf("car_list").alias("Final_List")).show(20,False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
You can also use the Spark SQL function collect_list, cast the result to a string, and then use regexp_replace to strip the unwanted characters:
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
This is an easier way.
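To see what those three nested replacements do, here is a small emulation with Python's re module. It assumes the array cast to a string looks like "[A, B]" and that these simple patterns behave the same in Python as in Spark's Java regex engine. The function name emulate_nested_regexp_replace is made up for illustration.

```python
import re


def emulate_nested_regexp_replace(text):
    """Apply the same three substitutions, innermost first."""
    no_spaces = re.sub(" ", "", text)         # regexp_replace(..., ' ', '')
    piped = re.sub(",", "|", no_spaces)       # regexp_replace(..., ',', '|')
    return re.sub("[^A-Z0-9|]", "", piped)    # drop everything else, brackets included


print(emulate_nested_regexp_replace("[TOYOTA, AUDI]"))  # TOYOTA|AUDI
print(emulate_nested_regexp_replace("[Toyota, Audi]"))  # T|A
```

Note the second result: the last pattern keeps only uppercase letters, digits and "|", so lowercase characters are removed, and so are spaces inside a value. With mixed-case data the character class would need to be widened, or concat_ws with collect_list used instead.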
The functions concat_ws() and collect_list(), combined with groupBy(), can be a good alternative:
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time              |status          |llamaType            |
+-------+------------------+----------------+---------------------+
|1      |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

PySpark and HIVE/Impala

I want to build a classification model in PySpark. The input to this model is the result of a SELECT query or a view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that to the model?
Yes, for this you need to use HiveContext with the SparkContext.
Here is an example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a DataFrame that carries the table's schema; check it with tableData.printSchema()
tableData.collect() # collect executes the query and returns all rows to the driver
Or you may refer to:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html

Spark Sql Aggregation from cassandra

This is my query, written for a MySQL database:
SELECT dcm.user, du.full_name, ROUND(AVG(fcg.final)) ,ROUND(AVG(fcg.participation))
FROM dimclassmem dcm LEFT JOIN factGlobal fcg on fcg.class_id=dcm.class_id GROUP BY dcm.user ORDER BY dcm.user
I can run this using Java + MySQL.
Now I want to write this query using Spark SQL.
How can I write an aggregate function in Spark?
I am fetching data from a Cassandra table and running a simple query, and the code below works:
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext("local", "Cassandra Connector Test", conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext
.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "Sample", "table" -> "testTable"))
.load()
df.registerTempTable("testTable")
val ddf = sqlContext.sql("select name from testTable order by name desc limit 10")
ddf.show()
But if I use the following code, it doesn't work:
val countallRec = sqlContext.sql("Select count(name) from testTable")
countallRec.show()
I am getting the exception below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/MapPartitionsWithPreparationRDD
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
I want to run the above query using Spark SQL. How can I do it?

how to write not like queries in spark sql using scala api?

I want to convert the following query to Spark SQL using Scala API:
select ag.part_id name from sample c join testing ag on c.part=ag.part and concat(c.firstname,c.lastname) not like 'Dummy%'
Any ideas?
Thanks in advance
Maybe this would work:
import org.apache.spark.sql.functions._
val c = sqlContext.table("sample")
val ag = sqlContext.table("testing")
val fullnameCol = concat(c("firstname"), c("lastname"))
val resultDF = c.join(ag, (c("part") === ag("part")) && !fullnameCol.like("Dummy%"))
For more information about the functions I used above, please check the following links:
org.apache.spark.sql.functions
org.apache.spark.sql.DataFrame
org.apache.spark.sql.Column
Do you mean this?
df.filter("field1 not like 'Dummy%'").show
or
df.filter("field1 ! like 'Dummy%'").show
Use it like this:
df.filter(!'col1.like("%COND%")).show