I use Apache Spark 2.2.0 and Scala.
I'm following the question as a guide to pivot a dataframe without using the pivot function.
I need to pivot the DataFrame without using the pivot function, because the column I'd like to use in the pivot aggregation is non-numerical and, as far as I can tell, pivot only works with aggregation functions like sum, min and max on numerical data.
Here's my data:
+---+-------------+----------+-------------+----------+-------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|
+---+-------------+----------+-------------+----------+-------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)|
| 2|Question2Text| No| abcde1| 0|(x1,y1)|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)|
| 1|Question1Text| No| abcde2| 0|(x2,y2)|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)|
+---+-------------+----------+-------------+----------+-------+
I want to group by the ParticipantID, Assessment and GeoTag columns, "pivot" on the Question column and take the values from the AnswerText column. In the end, the output should look as follows:
+-------------+-----------+----------+-------+-----+------+
|ParticipantID|Assessment |GeoTag |Qid_1 |Qid_2|Qid_3 |
+-------------+-----------+----------+-------+-----+------+
|abcde1 |0 |(x1,y1) |Yes |No |3 |
|abcde2 |0 |(x2,y2) |No |Yes |null |
+-------------+-----------+----------+-------+-----+------+
I have tried this:
val questions: Array[String] = df.select("Q_id")
.distinct()
.collect()
.map(_.getAs[String]("Q_id"))
.sortWith(_<_)
val df2: DataFrame = questions.foldLeft(df) {
case (data, question) => data.selectExpr("*", s"IF(Q_id = '$question', AnswerText, 0) AS $question")
}
[followed by a GroupBy expression]
But I'm getting the following error, which must have something to do with the syntax of the final AS $question:
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: *
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
Any ideas where I am going wrong? Is there a better way? I thought about reverting to Pandas and Python outside Spark if necessary, but I'd rather write all the code within the same framework if possible.
Because $question substitutes the value of the question variable into the SQL statement, you end up with a column name containing '?'. ? is not a valid character in a column name, so at the very least you have to quote it with backticks:
s"IF(Q_id = '$question', AnswerText, 0) AS `$question`"
or use select / withColumn:
import org.apache.spark.sql.functions.when
import spark.implicits._ // for the $"..." column syntax

case (data, question) =>
  data.withColumn(question, when($"Q_id" === question, $"AnswerText"))
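For completeness, here is a minimal sketch of the whole manual pivot along these lines; it assumes the questions array and the df and Q_id names from the question, plus import spark.implicits._ for the $ syntax:
import org.apache.spark.sql.functions.{col, first, when}

// One nullable column per question value ...
val wide = questions.foldLeft(df) {
  case (data, question) =>
    data.withColumn(question, when($"Q_id" === question, $"AnswerText"))
}

// ... then collapse each group, keeping the first non-null answer per question column
val aggExprs = questions.map(q => first(col(q), ignoreNulls = true).as(q))
val pivoted = wide
  .groupBy($"ParticipantID", $"Assessment", $"GeoTag")
  .agg(aggExprs.head, aggExprs.tail: _*)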
or sanitize the strings first, using regexp_replace.
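A possible sanitizing step before the selectExpr approach (the replacement regex here is an assumption, not part of the original answer):
import org.apache.spark.sql.functions.regexp_replace

// Replace any character that is not a letter, digit or underscore,
// so each value is safe to use as an unquoted column name
val cleaned = df.withColumn("Q_id", regexp_replace($"Q_id", "[^A-Za-z0-9_]", "_"))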
As for "I need to pivot the dataframe without using the pivot function as I have non-numerical data and df.pivot only works with an aggregation function like sum, min, max on numerical data": you can use first. See: How to use pivot and calculate average on a non-numeric column (facing AnalysisException "is not a numeric column")?
data.groupBy($"ParticipantID", $"Assessment", $"GeoTag")
.pivot($"Question", questions).agg(first($"AnswerText"))
Just a note on the accepted answer by user8371915 to make the query a bit faster.
There is a way to avoid the separate scan that collects the distinct values for questions. Simply generate the headers as a column (in the same job and stage!) and then pivot on that column.
// It's a simple and cheap map-like transformation
import org.apache.spark.sql.functions.{concat, lit}
val qid_header = input.withColumn("header", concat(lit("Qid_"), $"Qid"))
scala> qid_header.show
+---+-------------+----------+-------------+----------+-------+------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|header|
+---+-------------+----------+-------------+----------+-------+------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)| Qid_1|
| 2|Question2Text| No| abcde1| 0|(x1,y1)| Qid_2|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)| Qid_3|
| 1|Question1Text| No| abcde2| 0|(x2,y2)| Qid_1|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)| Qid_2|
+---+-------------+----------+-------------+----------+-------+------+
With the headers as part of the dataset, let's pivot.
val solution = qid_header
.groupBy('ParticipantID, 'Assessment, 'GeoTag)
.pivot('header)
.agg(first('AnswerText))
scala> solution.show
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment| GeoTag|Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
| abcde1| 0|(x1,y1)| Yes| No| 3|
| abcde2| 0|(x2,y2)| No| Yes| null|
+-------------+----------+-------+-----+-----+-----+
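If the possible Qid values are known up front, you can go one step further and pass the expected headers to pivot explicitly, which skips the extra distinct job that pivot(column) otherwise runs to discover them. A sketch, with the header values assumed from the sample data and the same spark-shell session as above:
import org.apache.spark.sql.functions.first

// Explicit pivot values: no distinct computation over the header column
val knownHeaders = Seq("Qid_1", "Qid_2", "Qid_3")
val solutionExplicit = qid_header
  .groupBy('ParticipantID, 'Assessment, 'GeoTag)
  .pivot("header", knownHeaders)
  .agg(first('AnswerText))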
Related
I want to filter a column of a DataFrame so that it contains only numbers (digit codes).
main_column
HKA1774348
null
774970331205
160-27601033
SGSIN/62/898805
null
LOCAL
217-29062806
null
176-07027893
724-22100374
297-00371663
217-11580074
I want to obtain this column:
main_column
774970331205
160-27601033
217-29062806
176-07027893
724-22100374
297-00371663
217-11580074
You can use rlike with a regexp that only includes digits and a hyphen:
df.where(df['main_column'].rlike(r'^[0-9\-]+$')).show()
Output:
+------------+
| main_column|
+------------+
|774970331205|
|160-27601033|
|217-29062806|
|176-07027893|
|724-22100374|
|297-00371663|
|217-11580074|
+------------+
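For reference, the same filter with the Scala DataFrame API (a sketch; the DataFrame name df and the column main_column are taken from the Python snippet above):
import org.apache.spark.sql.functions.col

// Keep only rows whose main_column consists solely of digits and hyphens
val onlyCodes = df.where(col("main_column").rlike("^[0-9-]+$"))
onlyCodes.show()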
spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)
To the above I need to add the following requirements:
1. The output should be validated against the format 'yyyyMMdd' using unix_timestamp.
2. If it is not valid, the extracted digits string should be transformed by moving its first four (4) characters to the end (MMDDYYYY to YYYYMMDD) and then validated against the 'yyyyMMdd' format again; if this condition is satisfied, that date should be printed.
I'm not sure how to include the Unix timestamp in my query.
Sample input and output 1:
input: 2021dgsth02hdg02
output: 20210202
Sample input and output 2:
input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101
Using unix_timestamp in place of to_date
spark.sql("select (case when length(regexp_replace(date,'[^0-9]', ''))==8 then CASE WHEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') ,'yyyyMMdd') IS NULL THEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy') ,'MMddyyyy') ELSE from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') ,'yyyyMMdd') END else regexp_replace(date,'[^0-9]','') end ) AS dt from input").show(false)
Try the code below.
scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
+----------------+
|dt |
+----------------+
|2021dgsth02hdg02|
|0101def20dr21 |
+----------------+
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z]+",""))
.withColumn("dt",
when(
to_date($"dt","yyyyMMdd").isNull,
to_date($"dt","MMddyyyy")
)
.otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
// Entering paste mode (ctrl-D to finish)
spark.sql("""
select (
CASE WHEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') IS NULL
THEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy')
ELSE to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd')
END
) AS dt from input
""")
.show(false)
// Exiting paste mode, now interpreting.
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
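If the final result needs to be the plain yyyyMMdd string required by the question rather than a date value, one option (an addition on top of the answer above, not part of it) is to wrap the expression in date_format:
spark.sql("""
select date_format(
         case when to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') is null
              then to_date(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy')
              else to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd')
         end,
         'yyyyMMdd') as dt
from input
""").show(false)
// with the rows above this yields 20210202 and 20210101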
I want to select all columns in a table except StudentAddress, and hence I wrote the following query:
select `(StudentAddress)?+.+` from student;
It gives the following error in the SQuirreL SQL client.
org.apache.spark.sql.AnalysisException: cannot resolve '(StudentAddress)?+.+' given input columns
You can use the drop() method in the DataFrame API to drop a particular column and then select all the remaining columns.
For example:
val df = hiveContext.read.table("student")
val dfWithoutStudentAddress = df.drop("StudentAddress")
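If you specifically need a SQL string (for example, to run it from the SQL client), one option is to derive the column list from the DataFrame and build the statement yourself. A sketch, reusing the hiveContext and table name from the example above; the variable names are just for illustration:
// Build an explicit SELECT list without StudentAddress
val remaining = df.columns.filterNot(_ == "StudentAddress")
val query = s"SELECT ${remaining.mkString(", ")} FROM student"
val result = hiveContext.sql(query)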
Using Spark SQL, try:
select * except(<columns to be excluded>) from tablename
Example:
select * from tmp
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#|a |b |c |d |
#+----+----+----+----+
#exclude col1,col2
select * except(col1,col2) from table_name
#+----+----+
#|col3|col4|
#+----+----+
#|c |d |
#+----+----+
I want to get the last element of the array returned by the Spark SQL split() function.
split('4:3-2:3-5:4-6:4-5:2', '-')
I know I can get it with
split('4:3-2:3-5:4-6:4-5:2', '-')[4]
But I want another way for when I don't know the length of the array.
Please help me.
You can also use the Spark SQL reverse() function on a column after split().
For example:
SELECT reverse(split(MY_COLUMN,'-'))[0] FROM MY_TABLE
Here [0] gives you the first element of the reversed array, which is the last element of the initial array.
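The same idea with the Scala DataFrame API, assuming Spark 2.4+ (where reverse also accepts array columns); the DataFrame and column names are placeholders:
import org.apache.spark.sql.functions.{col, reverse, split}

// Reverse the split array and take element 0, i.e. the last element of the original array
val lastElem = df.select(reverse(split(col("MY_COLUMN"), "-")).getItem(0).as("last"))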
Please check substring_index; it should work exactly as you want:
substring_index(lit("1-2-3-4"), "-", -1) // 4
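Applied to a DataFrame column it would look like this (a sketch; the column name Value is just an example):
import org.apache.spark.sql.functions.{col, substring_index}

// count = -1 keeps everything after the last '-'
df.select(substring_index(col("Value"), "-", -1).as("Last"))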
You could use a UDF to do that, as follows:
val df = sc.parallelize(Seq((1L,"one-last1"), (2L,"two-last2"), (3L,"three-last3"))).toDF("key","Value")
+---+-----------+
|key|Value |
+---+-----------+
|1 |one-last1 |
|2 |two-last2 |
|3 |three-last3|
+---+-----------+
import org.apache.spark.sql.functions.{col, split, udf}
import scala.util.Try
val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)
val with_just_last = df.withColumn("Last", get_last(split(col("Value"), "-")))
+---+-----------+--------+
|key|Value |Last |
+---+-----------+--------+
|1 |one-last1 |last1 |
|2 |two-last2 |last2 |
|3 |three-last3|last3 |
+---+-----------+--------+
Remember that the split function from SparkSQL can be applied to a column of the DataFrame.
If you are using Java, use .getItem(0) instead of the [0] indexing, i.e. reverse(split(col("MY_COLUMN"), "-")).getItem(0).
I have a DataFrame orders:
+-----------------+-----------+--------------+
| Id| Order | Gender|
+-----------------+-----------+--------------+
| 1622|[101330001]| Male|
| 1622| [147678]| Male|
| 3837| [1710544]| Male|
+-----------------+-----------+--------------+
which I want to group by Id and Gender and then aggregate the orders.
I am using the org.apache.spark.sql.functions package and the code looks like:
DataFrame group = orders.withColumn("orders", col("order"))
.groupBy(col("Id"), col("Gender"))
.agg(collect_list("products"));
However, since the column Order is of type array, I get this exception because collect_list expects a primitive type:
User class threw exception: org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but array<string> was passed as parameter 1
I have looked in the package and there are sort functions for arrays but no aggregate functions. Any idea how to do it? Thanks.
In this case you can define your own function and register it as a UDF:
val userDefinedFunction = ???
val udfFunctionName = udf[U,T](userDefinedFunction)
Then pass that column into the function so that it gets converted into a primitive type, and use the result in the withColumn method.
Something like this:
// an array column arrives in a UDF as a Seq, not an Array
val dataF: Seq[Int] => Int = _.head
val dataUDF = udf[Int, Seq[Int]](dataF)
DataFrame group = orders.withColumn("orders", dataUDF(col("order")))
.groupBy(col("Id"), col("Gender"))
.agg(collect_list("products"));
I hope it works!
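For reference, a minimal self-contained sketch of this idea applied to the orders example; the flattening choice (taking the first array element) is an assumption, and the column names follow the question:
import org.apache.spark.sql.functions.{col, collect_list, udf}

// Hypothetical UDF that reduces the array<string> column to a single primitive value
val firstElem = udf((xs: Seq[String]) => xs.headOption.orNull)

val grouped = orders
  .withColumn("order_flat", firstElem(col("Order"))) // primitive column that collect_list accepts
  .groupBy(col("Id"), col("Gender"))
  .agg(collect_list("order_flat").as("orders"))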