spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)
In the query above I need to add the following requirements:
1. The output should be validated against the format 'yyyyMMdd' using unix_timestamp.
2. If it is not valid, the extracted digits string should be transformed by moving its first four (4) characters to the end (MMDDYYYY to YYYYMMDD) and then validated against 'yyyyMMdd' again; if that check passes, that date should be printed.
I'm not sure how to include unix_timestamp in my query.
Sample input and output 1:
input: 2021dgsth02hdg02
output: 20210202
Sample input and output 2:
input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101
Using unix_timestamp in place of to_date
spark.sql("select (case when length(regexp_replace(date,'[^0-9]', ''))==8 then CASE WHEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') ,'yyyyMMdd') IS NULL THEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy') ,'MMddyyyy') ELSE from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') ,'yyyyMMdd') END else regexp_replace(date,'[^0-9]','') end ) AS dt from input").show(false)
Try the code below.
scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
+----------------+
|dt |
+----------------+
|2021dgsth02hdg02|
|0101def20dr21 |
+----------------+
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z]+",""))
.withColumn("dt",
when(
to_date($"dt","yyyyMMdd").isNull,
to_date($"dt","MMddyyyy")
)
.otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
// Entering paste mode (ctrl-D to finish)
spark.sql("""
select (
CASE WHEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') IS NULL
THEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy')
ELSE to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd')
END
) AS dt from input
""")
.show(false)
// Exiting paste mode, now interpreting.
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
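Note that to_date returns a date column, which show displays as 2021-02-02. If the output has to be the literal yyyyMMdd string from the question (e.g. 20210202), date_format can be wrapped around the parsed value. A minimal sketch based on the DataFrame version above:
import org.apache.spark.sql.functions.{col, date_format, regexp_replace, to_date, when}

// strip the letters, try yyyyMMdd first, fall back to MMddyyyy,
// then render the parsed date back as a yyyyMMdd string
val digits = regexp_replace(col("dt"), "[a-zA-Z]+", "")
val parsed = when(to_date(digits, "yyyyMMdd").isNull, to_date(digits, "MMddyyyy"))
  .otherwise(to_date(digits, "yyyyMMdd"))

df.withColumn("dt", date_format(parsed, "yyyyMMdd")).show(false)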
I want to filter a DataFrame column so that only the rows containing a numeric code remain.
main_column
HKA1774348
null
774970331205
160-27601033
SGSIN/62/898805
null
LOCAL
217-29062806
null
176-07027893
724-22100374
297-00371663
217-11580074
I want to obtain this column:
main_column
774970331205
160-27601033
217-29062806
176-07027893
724-22100374
297-00371663
217-11580074
You can use rlike with a regexp that allows only digits and a hyphen:
df.where(df['main_column'].rlike(r'^[0-9\-]+$')).show()
Output:
+------------+
| main_column|
+------------+
|774970331205|
|160-27601033|
|217-29062806|
|176-07027893|
|724-22100374|
|297-00371663|
|217-11580074|
+------------+
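For reference, the same filter written with the Scala API would look like the sketch below (assuming the DataFrame is again called df):
import org.apache.spark.sql.functions.col

// keep only rows whose main_column consists solely of digits and hyphens
df.where(col("main_column").rlike("^[0-9-]+$")).show()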
Is it possible to write a function that takes a tabular expression as input and returns a string?
E.g. if I want a function that will return the name of a datetime column in a schema:
let timeCol = (T1:(*)){T1 | getschema
| ColumnType == "datetime"
tostring(any(ColumnName));
};
Sure. Here you go (note the toscalar function, which transforms a tabular result into a scalar):
let timeCol = (T1:(*)){
toscalar(T1
| getschema
| where ColumnType == "datetime"
| project ColumnName
| take 1);
};
print (MyTable | invoke timeCol())
I have a df created with the query shown below:
SELECT `case_id` AS ID,`updt_dt` AS Update_date,`updt_tm` AS Update_time
FROM case_dly_snap
LIMIT 2"
In the resulting df, update_date is in date format and update_time is in string format.
How can I convert them into a single date+time (timestamp) column?
Expected output: one column containing both the date and the time.
You could try this approach: with Hive built-in functions, format update_time, concatenate it with update_date, and convert the result into a timestamp.
As an example:
val lstData = List((1,"2018-05-14","012230.627"),(2,"2018-05-15","070026.886"),(3,"2018-05-16","023525.669"))
val cols = Array("ID","update_date","update_time")
val dfTime = sc.parallelize(lstData).toDF(cols: _*)
dfTime.show()
dfTime.createOrReplaceTempView("df_time")
spark.sql(
"""SELECT ID, update_date, update_time,
|to_utc_timestamp(
|  concat_ws(' ',
|    CAST(update_date AS STRING),
|    concat_ws(':',
|      substr(split(update_time,'\\.')[0], 0, 2),
|      substr(split(update_time,'\\.')[0], 3, 2),
|      substr(split(update_time,'\\.')[0], 5, 2))),
|  'GMT') AS tms
|FROM df_time """.stripMargin)
.show()
+---+-----------+-----------+-------------------+
| ID|update_date|update_time| tms|
+---+-----------+-----------+-------------------+
| 1| 2018-05-14| 012230.627|2018-05-14 01:22:30|
| 2| 2018-05-15| 070026.886|2018-05-15 07:00:26|
| 3| 2018-05-16| 023525.669|2018-05-16 02:35:25|
+---+-----------+-----------+-------------------+
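The same conversion can also be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming the dfTime DataFrame above and Spark 2.2+ (for to_timestamp):
import org.apache.spark.sql.functions._

// rebuild "HH:mm:ss" from the first six digits of update_time, prepend the
// date, and parse the combined string as a timestamp
val withTs = dfTime.withColumn(
  "tms",
  to_timestamp(
    concat_ws(" ",
      col("update_date"),
      concat_ws(":",
        substring(col("update_time"), 1, 2),
        substring(col("update_time"), 3, 2),
        substring(col("update_time"), 5, 2))),
    "yyyy-MM-dd HH:mm:ss"))

withTs.show()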
I use Apache Spark 2.2.0 and Scala.
I'm following this question as a guide to pivot a dataframe without using the pivot function.
I need to avoid the pivot function because I have non-numerical data, and pivot works with aggregation functions like sum, min and max on numerical data only; the column I'd like to use in the pivot aggregation is non-numerical.
Here's my data:
+---+-------------+----------+-------------+----------+-------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|
+---+-------------+----------+-------------+----------+-------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)|
| 2|Question2Text| No| abcde1| 0|(x1,y1)|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)|
| 1|Question1Text| No| abcde2| 0|(x2,y2)|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)|
+---+-------------+----------+-------------+----------+-------+
I want to group by the ParticipantID, Assessment and GeoTag columns, "pivot" on the Question column, and take the values from the AnswerText column. In the end, the output should look as follows:
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment|GeoTag |Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
|abcde1       |0         |(x1,y1)|Yes  |No   |3    |
|abcde2       |0         |(x2,y2)|No   |Yes  |null |
+-------------+----------+-------+-----+-----+-----+
I have tried this:
val questions: Array[String] = df.select("Q_id")
  .distinct()
  .collect()
  .map(_.getAs[String]("Q_id"))
  .sortWith(_ < _)

val df2: DataFrame = questions.foldLeft(df) {
  case (data, question) =>
    data.selectExpr("*", s"IF(Q_id = '$question', AnswerText, 0) AS $question")
}
[followed by a GroupBy expression]
But I'm getting the following error, which must have something to do with the syntax of the final expression, AS $question:
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: *
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
Any ideas where I am going wrong? Is there a better way? I thought about reverting to Pandas and Python outside Spark if necessary, but I'd rather write all the code within the same framework if possible.
Because $question substitutes the value of the question variable into the SQL statement, you end up with a column name containing '?'. Since ? is not a valid character in a column name, you have to at least quote it with backticks:
s"IF(Q_id = '$question', AnswerText, 0) AS `$question`"
or use select / withColumn:
import org.apache.spark.sql.functions.when
case (data, question) =>
data.withColumn(question, when($"Q_id" === question, $"AnswerText"))
or sanitize the strings first, using regexp_replace.
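A minimal sketch of that last option, assuming the same df and Q_id column as above: clean the values with regexp_replace before collecting them, so every generated column name is parser-safe.
import org.apache.spark.sql.functions.{col, regexp_replace}

// replace anything that is not alphanumeric or underscore before the values
// are ever used as column names
val cleaned = df.withColumn("Q_id", regexp_replace(col("Q_id"), "[^a-zA-Z0-9_]", "_"))

val questions: Array[String] = cleaned.select("Q_id")
  .distinct()
  .collect()
  .map(_.getAs[String]("Q_id"))
  .sortWith(_ < _)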
"need to pivot the dataframe without using the pivot function as I have non-numerical data and df.pivot only works with an aggregation function like sum, min, max on numerical data."
You can use first: How to use pivot and calculate average on a non-numeric column (facing AnalysisException "is not a numeric column")?
data.groupBy($"ParticipantID", $"Assessment", $"GeoTag")
.pivot($"Question", questions).agg(first($"AnswerText"))
Just a note on the accepted answer by @user8371915 to make the query a bit faster.
There is a way to avoid the extra scan that generates questions for the headers: simply generate the header column (in the same job and stage!) and then pivot on it.
// It's a simple and cheap map-like transformation
val qid_header = input.withColumn("header", concat(lit("Qid_"), $"Qid"))
scala> qid_header.show
+---+-------------+----------+-------------+----------+-------+------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|header|
+---+-------------+----------+-------------+----------+-------+------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)| Qid_1|
| 2|Question2Text| No| abcde1| 0|(x1,y1)| Qid_2|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)| Qid_3|
| 1|Question1Text| No| abcde2| 0|(x2,y2)| Qid_1|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)| Qid_2|
+---+-------------+----------+-------------+----------+-------+------+
With the headers as part of the dataset, let's pivot.
val solution = qid_header
  .groupBy('ParticipantID, 'Assessment, 'GeoTag)
  .pivot('header)
  .agg(first('AnswerText))
scala> solution.show
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment| GeoTag|Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
| abcde1| 0|(x1,y1)| Yes| No| 3|
| abcde2| 0|(x2,y2)| No| Yes| null|
+-------------+----------+-------+-----+-----+-----+
I want to get the last element of the array returned by the Spark SQL split() function.
split('4:3-2:3-5:4-6:4-5:2', '-')
I know I can get it with
split('4:3-2:3-5:4-6:4-5:2', '-')[4]
but I want another way that works when I don't know the length of the array.
Please help me.
You can also use the Spark SQL reverse() function on the column after split().
For example:
SELECT reverse(split(MY_COLUMN,'-'))[0] FROM MY_TABLE
Here [0] gives you the first element of the reversed array, which is the last element of the initial array.
Please check substring_index; it should do exactly what you want:
substring_index(lit("1-2-3-4"), "-", -1) // 4
You could use a UDF to do that, as follows:
val df = sc.parallelize(Seq((1L,"one-last1"), (2L,"two-last2"), (3L,"three-last3"))).toDF("key","Value")
+---+-----------+
|key|Value |
+---+-----------+
|1 |one-last1 |
|2 |two-last2 |
|3 |three-last3|
+---+-----------+
import scala.util.Try
import org.apache.spark.sql.functions.{col, split, udf}

val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)
val with_just_last = df.withColumn("Last", get_last(split(col("Value"), "-")))
+---+-----------+--------+
|key|Value |Last |
+---+-----------+--------+
|1 |one-last1 |last1 |
|2 |two-last2 |last2 |
|3 |three-last3|last3 |
+---+-----------+--------+
Remember that the split function from SparkSQL can be applied to a column of the DataFrame.
If you are using Java (the DataFrame API), combine it with the reverse() trick above: reverse(split(col("MY_COLUMN"), "-")).getItem(0) returns the last element.
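One more option, a sketch assuming Spark 2.4+ and the df with the Value column from the UDF answer above: element_at accepts a negative index that counts from the end of the array.
import org.apache.spark.sql.functions.{col, element_at, split}

// -1 means "the last element of the array"
df.withColumn("Last", element_at(split(col("Value"), "-"), -1)).show(false)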