Get the last element from the Apache Spark SQL split() function - apache-spark-sql

I want to get the last element of the array returned by the Spark SQL split() function.
split('4:3-2:3-5:4-6:4-5:2', '-')
I know I can get it with
split('4:3-2:3-5:4-6:4-5:2', '-')[4]
but I want another way that works when I don't know the length of the array.
Please help me.

You can also use the Spark SQL reverse() function on a column after split().
For example:
SELECT reverse(split(MY_COLUMN,'-'))[0] FROM MY_TABLE
Here [0] gives you the first element of the reversed array, which is the last element of the initial array.
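A minimal Scala sketch of the same idea, run in spark-shell with a made-up sample value (note that reverse over arrays requires Spark 2.4+):
import spark.implicits._

// Hypothetical one-row DataFrame just to show the expression end to end.
val df = Seq("4:3-2:3-5:4-6:4-5:2").toDF("MY_COLUMN")

// reverse(...)[0] picks the last element of the original split array.
df.selectExpr("reverse(split(MY_COLUMN, '-'))[0] AS last_part").show()
// prints 5:2, the last element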

Please check substring_index; it should do exactly what you want:
substring_index(lit("1-2-3-4"), "-", -1) // 4
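Applied to a DataFrame column, a sketch would be (the df and its Value column are assumptions here):
import org.apache.spark.sql.functions.{col, substring_index}

// Keep everything after the last '-' delimiter; no need to know the array length.
val withLast = df.withColumn("Last", substring_index(col("Value"), "-", -1))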

You could use a UDF to do that, as follows:
import scala.util.Try
import org.apache.spark.sql.functions.{col, split, udf}

val df = sc.parallelize(Seq((1L,"one-last1"), (2L,"two-last2"), (3L,"three-last3"))).toDF("key","Value")
+---+-----------+
|key|Value |
+---+-----------+
|1 |one-last1 |
|2 |two-last2 |
|3 |three-last3|
+---+-----------+
val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)
val with_just_last = df.withColumn("Last" , get_last(split(col("Value"), "-")))
+---+-----------+--------+
|key|Value |Last |
+---+-----------+--------+
|1 |one-last1 |last1 |
|2 |two-last2 |last2 |
|3 |three-last3|last3 |
+---+-----------+--------+
Remember that the split function from SparkSQL can be applied to a column of the DataFrame.

If you are using Java, the same can be done with split(col("MY_COLUMN"), "-").getItem(i) for a known index i; combine it with reverse, as in the first answer, to take the last element.

Related

In Hive, how to compare an array of strings with a hivevar list?

In Hive, I have a column date that looks like the examples below (arrays of strings). I have another hivevar that looks like this:
set hivevar:sunny = ('2022-12-17', '2022-12-21', '2023-01-15');
|date|
|----|
|[["2022-11-14"],["2022-12-14"]]|
|[["2022-11-14","2022-11-17"],["2022-12-14","2022-12-17"]]|
|[["2022-11-21"],["2022-12-21"]]|
|[["2023-01-08"]]|
|[["2022-11-15"],["2022-12-15"],["2023-01-15"]]|
I want to check, for each row, whether any of the values is part of the sunny list, so I want to get something like the result below. I thought of using any and array &&, but they don't work in Hive. Can anyone help?
|result|
|----|
|false|
|true|
|true|
|false|
|true|
SELECT date, array_contains(sunny, explode(date)) AS result
FROM mytable
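As an alternative, if the table can be read through Spark SQL rather than plain Hive, the check can be expressed with Spark's built-in array functions (available since Spark 2.4). This is only a sketch under that assumption, with a DataFrame df holding the date column shown above:
import org.apache.spark.sql.functions.{array, arrays_overlap, col, flatten, lit}

val sunny = Seq("2022-12-17", "2022-12-21", "2023-01-15")

// Flatten the nested date arrays and check whether any element overlaps the sunny list.
val result = df.withColumn(
  "result",
  arrays_overlap(flatten(col("date")), array(sunny.map(lit): _*))
)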

Issue while converting string data to decimal in the proper format in Spark SQL

I am facing an issue in Spark SQL while converting a string to decimal(15,7).
Input data is:
'0.00'
'28.12'
'-39.02'
'28.00'
I have tried converting it to float and then to decimal, but got unexpected results.
sqlContext.sql("select cast(cast('0.00' as float) as decimal(15,7)) from table").show()
The result I received is as follows
0
But I need to have data in the below format:
0.0000000
28.1200000
-39.0200000
28.0000000
You can try using the format_number function (note that it returns a formatted string), something like this:
df.withColumn("num", format_number(col("value").cast("decimal(15,7)"), 7)).show()
The results should be like this.
+------+-----------+
| value| num|
+------+-----------+
| 0.00| 0.0000000|
| 28.12| 28.1200000|
|-39.02|-39.0200000|
| 28.00| 28.0000000|
+------+-----------+
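For completeness, a minimal sketch of how the df used above might be constructed (in spark-shell; the column name value is an assumption):
import org.apache.spark.sql.functions.{col, format_number}
import spark.implicits._

// Cast the raw strings to decimal(15,7), then render them with 7 decimal places.
val df = Seq("0.00", "28.12", "-39.02", "28.00").toDF("value")
df.withColumn("num", format_number(col("value").cast("decimal(15,7)"), 7)).show()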

How to pivot on an arbitrary column?

I use Apache Spark 2.2.0 and Scala.
I'm following the question as a guide to pivot a dataframe without using the pivot function.
I need to pivot the dataframe without using the pivot function as I have non-numerical data and pivot works with an aggregation function like sum, min, max on numerical data only. I've got a non-numerical column I'd like to use in pivot aggregation.
Here's my data:
+---+-------------+----------+-------------+----------+-------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|
+---+-------------+----------+-------------+----------+-------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)|
| 2|Question2Text| No| abcde1| 0|(x1,y1)|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)|
| 1|Question1Text| No| abcde2| 0|(x2,y2)|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)|
+---+-------------+----------+-------------+----------+-------+
I want it to group by ParticipantID, Assessment and GeoTag, "pivot" on the Question column, and take the values from the AnswerText column. In the end, the output should look as follows:
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment|GeoTag |Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
|abcde1       |0         |(x1,y1)|Yes  |No   |3    |
|abcde2       |0         |(x2,y2)|No   |Yes  |null |
+-------------+----------+-------+-----+-----+-----+
I have tried this:
val questions: Array[String] = df.select("Q_id")
  .distinct()
  .collect()
  .map(_.getAs[String]("Q_id"))
  .sortWith(_ < _)

val df2: DataFrame = questions.foldLeft(df) {
  case (data, question) =>
    data.selectExpr("*", s"IF(Q_id = '$question', AnswerText, 0) AS $question")
}
[followed by a GroupBy expression]
But I'm getting the following error, which must have something to do with the syntax of the final expression, AS $question:
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: *
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
Any ideas where I am going wrong? Is there a better way? I thought about reverting to Pandas and Python outside Spark if necessary, but I'd rather write all the code within the same framework if possible.
Because $question substitutes the value of the question variable into the SQL statement, you end up with a column name containing '?'. ? is not a valid character in a column name, so you have to at least quote it with backticks:
s"IF(Q_id = '$question', AnswerText, 0) AS `$question`"
or use select / withColumn:
import org.apache.spark.sql.functions.when
case (data, question) =>
  data.withColumn(question, when($"Q_id" === question, $"AnswerText"))
or sanitize the strings first, using regexp_replace.
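Putting the withColumn variant together with the foldLeft from the question, a sketch (using the question's column names) would be:
import org.apache.spark.sql.functions.when

val df2 = questions.foldLeft(df) { case (data, question) =>
  data.withColumn(question, when($"Q_id" === question, $"AnswerText"))
}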
"need to pivot the dataframe without using the pivot function as I have non-numerical data and df.pivot only works with an aggregation function like sum, min, max on numerical data."
You can use first: How to use pivot and calculate average on a non-numeric column (facing AnalysisException "is not a numeric column")?
data.groupBy($"ParticipantID", $"Assessment", $"GeoTag")
.pivot($"Question", questions).agg(first($"AnswerText"))
Just a note on the accepted answer by @user8371915 to make the query a bit faster.
There is a way to avoid the expensive scan used to generate questions for the headers.
Simply generate the headers (in the same job and stage!) and then pivot on that column.
// It's a simple and cheap map-like transformation
val qid_header = input.withColumn("header", concat(lit("Qid_"), $"Qid"))
scala> qid_header.show
+---+-------------+----------+-------------+----------+-------+------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|header|
+---+-------------+----------+-------------+----------+-------+------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)| Qid_1|
| 2|Question2Text| No| abcde1| 0|(x1,y1)| Qid_2|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)| Qid_3|
| 1|Question1Text| No| abcde2| 0|(x2,y2)| Qid_1|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)| Qid_2|
+---+-------------+----------+-------------+----------+-------+------+
With the headers as part of the dataset, let's pivot.
val solution = qid_header
  .groupBy('ParticipantID, 'Assessment, 'GeoTag)
  .pivot('header)
  .agg(first('AnswerText))
scala> solution.show
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment| GeoTag|Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
| abcde1| 0|(x1,y1)| Yes| No| 3|
| abcde2| 0|(x2,y2)| No| Yes| null|
+-------------+----------+-------+-----+-----+-----+

str_to_map returns map<string,string>. How to make it return map<string,int>?

I am using str_to_map as follows:
hive> select str_to_map("A:1,B:1,C:1");
OK
{"C":"1","A":"1","B":"1"}
As you can notice, it returns an object of type map<string,string>. I want an object of type map<string,int>. Is there any way out? Can we typecast it in some way to map<string,int>?
I stared at this and looked around for a while and didn't see much, but it turns out it's quite simple: just use cast:
select cast(str_to_map("A:1,B:1,C:1") as map<string, int>)
-- {'A': 1, 'B': 1, 'C': 1}
You can try the map_remove_keys UDF from Brickhouse:
> add jar hdfs:///lib/brickhouse-0.7.1-SNAPSHOT.jar;
> create temporary function map_remove_keys as 'brickhouse.udf.collect.MapRemoveKeysUDF';
> select map_remove_keys(map("", 0), array(""));
+------+
| _c0 |
+------+
| {} |
+------+

Aggregate array type in Spark Dataframe

I have a DataFrame orders:
+-----------------+-----------+--------------+
| Id| Order | Gender|
+-----------------+-----------+--------------+
| 1622|[101330001]| Male|
| 1622| [147678]| Male|
| 3837| [1710544]| Male|
+-----------------+-----------+--------------+
which I want to group by Id and Gender and then aggregate the orders.
I am using the org.apache.spark.sql.functions package and the code looks like this:
DataFrame group = orders.withColumn("orders", col("order"))
.groupBy(col("Id"), col("Gender"))
.agg(collect_list("products"));
However, since the Order column is of array type, I get this exception because collect_list expects a primitive type:
User class threw exception: org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but array<string> was passed as parameter 1
I have looked in the package and there are sort functions for arrays but no aggregate functions. Any idea how to do it? Thanks.
In this case you can define your own function and register it as a UDF:
val userDefinedFunction = ???
val udfFunctionName = udf[U,T](userDefinedFunction)
Then pass that column through the function so that it is converted into a primitive type, and use the result in the withColumn method.
Something like this:
val dataF: Array[Int] => Int = _.head
val dataUDF = udf[Int, Array[Int]](dataF)

DataFrame group = orders.withColumn("orders", dataUDF(col("order")))
    .groupBy(col("Id"), col("Gender"))
    .agg(collect_list("products"));
I hope it works!
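As a concrete sketch of that idea (the column names follow the question, and the array is assumed to hold strings, as the error message suggests): reduce the array column to a primitive value with a UDF, then aggregate with collect_list.
import org.apache.spark.sql.functions.{col, collect_list, udf}

// Take the first element of each Order array (null when empty), then group and collect.
val firstOrder = udf((xs: Seq[String]) => xs.headOption.orNull)

val grouped = orders
  .withColumn("order_id", firstOrder(col("Order")))
  .groupBy(col("Id"), col("Gender"))
  .agg(collect_list(col("order_id")).as("orders"))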