In Spark, how to check the date format? - dataframe

How can we check the date format in the below code?
DF = DF.withColumn("DATE", to_date(trim(col("DATE")), "yyyyMMdd"))
Error:
Caused by: java.time.format.DateTimeParseException: Text '2171121' could not be parsed at index 6
Expectation:
If the format is correct, keep the same data; otherwise populate null in the same column.

In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp and to_date will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they returned NULL. Check the documentation here.
To switch back to the previous behavior, you can use the configuration below.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
Read what has changed in the datetime parser from Spark 3.0 onwards here.
After setting the above configuration, you can use the when() and otherwise() functions to get the desired result.
>>> from pyspark.sql.functions import *
>>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
>>> df = spark.createDataFrame([(20210822,),(1234,)]).toDF("date")
# cast the column to string, since to_date accepts string, date or timestamp columns
>>> df.withColumn(
...     "date",
...     when(to_date(df["date"].cast("string"), "yyyyMMdd").isNull(), None)
...     .otherwise(df["date"])
... ).show()
+--------+
|    date|
+--------+
|20210822|
|    null|
+--------+
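Applied back to the column from the original question, a minimal sketch could look like the following (assuming the LEGACY policy from above and that DATE is a string column in DF, as in the question):

from pyspark.sql.functions import col, to_date, trim, when

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Under the LEGACY policy, to_date returns null instead of throwing for
# values that do not match yyyyMMdd; keep the original value only where
# parsing succeeds (DF and DATE are taken from the question).
parsed = to_date(trim(col("DATE")), "yyyyMMdd")
DF = DF.withColumn("DATE", when(parsed.isNull(), None).otherwise(col("DATE")))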

Related

Unix time stamp conversion in Azure synapse analytics

I am using the below script to refine the data in the silver layer:
# Read from existing internal table
dfAsset =(spark.read.option(Constants.SERVER,"xyz.sql.azuresynapse.net")
.synapsesql("abc.Salesforce.Asset")
.select("Id","ContactId","CreatedDate","CreatedById","LastModifiedDate")
.filter(col("productCode").contains("11061164")).limit(10))
dfAsset.show()
For the CreatedDate column, the data appears in Unix (epoch) format. Please refer to the below:
CreateDate
1652108980000
1632313243000
1632312269000
1632312410000
I need to convert the data into YYYY-MM-DD format in the above script.
Please advise how it can be done.
Regards,
RK
This is my sample DataFrame, saved in the variable dfAsset.
#+--------+
#|   date1|
#+--------+
#|16521089|
#|16323132|
#|16323122|
#|16323124|
#+--------+
Using the code below, you can convert the data into YYYY-MM-DD format.
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import col, to_date

# cast the epoch-seconds column to a timestamp, then truncate it to a date
df = dfAsset.withColumn('date', to_date(col('date1').cast(TimestampType())))
df.show()
Output: the date1 values converted to a date column.
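Note that the CreatedDate values shown in the question (e.g. 1652108980000) look like epoch milliseconds rather than seconds, so for the original column you would divide by 1000 first. A hedged sketch, reusing dfAsset and CreatedDate from the question:

from pyspark.sql.functions import col, from_unixtime

# from_unixtime expects seconds, so convert the millisecond epochs first,
# then format straight to yyyy-MM-dd
dfAsset = dfAsset.withColumn(
    "CreatedDate",
    from_unixtime((col("CreatedDate") / 1000).cast("long"), "yyyy-MM-dd")
)
dfAsset.show()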

Scala dataframe get last 6 months latest data

I have a column in a dataframe like below:
+-------------------+
|       timestampCol|
+-------------------+
|2020-11-27 00:00:00|
|2020-11-27 00:00:00|
+-------------------+
I need to filter the data based on this date, and I want to get the last 6 months of data only. Could anyone please suggest how I can do that?
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
dataset.filter(dataset.col("timestampCol").cast("date")
.gt(add_months(current_date(),-6)));
This keeps only the timestampCol values from the last 6 months; anything older is filtered out.
Depending on the dataset schema, you may need to cast the value as a date.
If the column is already a timestamp, you can compare it directly with a java.sql.Timestamp instance.
val someMomentInTime: java.sql.Timestamp =
  java.sql.Timestamp.valueOf("yyyy-[m]m-[d]d hh:mm:ss") // pass a literal in this format, e.g. "2020-11-27 00:00:00"
val df: DataFrame = ??? // your dataset
df.filter(col("timestampCol") > someMomentInTime) // DataFrame is Dataset[Row]

PySpark: Transform values of given column in the DataFrame

I am new to PySpark and Spark in general.
I would like to apply a transformation to a given column in the DataFrame, essentially calling a function for each value in that specific column.
I have my DataFrame df that looks like this:
df.show()
+-------+--------------------+
|version|                body|
+-------+--------------------+
|      1|9gIAAAASAQAEAAAAA...|
|      2|2gIAAAASAQAEAAAAA...|
|      3|3gIAAAASAQAEAAAAA...|
|      1|7gIAKAASAQAEAAAAA...|
+-------+--------------------+
I need to read the value of the body column for each row where the version is 1 and then decrypt it (I have my own logic/function which takes a string and returns a decrypted string). Finally, I need to write the decrypted values in CSV format to an S3 bucket.
def decrypt(encrypted_string: str) -> str:
    ...  # my own logic that returns the decrypted string
So, when I do the following, I get the corresponding filtered values to which I need to apply my decrypt function.
df.where(col('version') =='1')\
.select(col('body')).show()
+--------------------+
|                body|
+--------------------+
|9gIAAAASAQAEAAAAA...|
|7gIAKAASAQAEAAAAA...|
+--------------------+
However, I am not clear how to do that. I tried to use collect() but then it defeats the purpose of using Spark.
I also tried using .rdd.map as follows but that did not work.
df.where(col('version') =='1')\
.select(col('body'))\
.rdd.map(lambda x: decrypt).toDF().show()
OR
.rdd.map(decrypt).toDF().show()
Could someone please help with this?
Please try:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

decrypt_udf = udf(decrypt, StringType())
df.where(col('version') == '1').withColumn('body', decrypt_udf('body'))
Got some clue from this post: Pyspark DataFrame UDF on Text Column.
Looks like I can simply get it with the following. I was doing it without using a udf earlier, which is why it wasn't working.
dummy_function_udf = udf(decrypt, StringType())
df.where(col('version') == '1')\
.select(col('body')) \
.withColumn('decryptedBody', dummy_function_udf('body')) \
.show()
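For the last part of the question (writing the decrypted values as CSV to S3), a minimal sketch, assuming the decrypt function above and a placeholder bucket path, might look like:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

decrypt_udf = udf(decrypt, StringType())

decrypted_df = (df.where(col('version') == '1')
                  .select(decrypt_udf(col('body')).alias('decryptedBody')))

# 's3a://your-bucket/decrypted/' is a placeholder; point it at your own bucket
decrypted_df.write.mode('overwrite').csv('s3a://your-bucket/decrypted/', header=True)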

How to add Extra column with current date in Spark dataframe

I am trying to add one column to my existing PySpark DataFrame using the withColumn method. I want to insert the current date in this column. My source does not have a date column, so I am adding this current date column to my DataFrame and saving the DataFrame as a table, so that later I can use this date column for tracking purposes.
I am using the below code:
df2=df.withColumn("Curr_date",datetime.now().strftime('%Y-%m-%d'))
Here df is my existing DataFrame, and I want to save df2 as a table with the Curr_date column.
But it expects an existing column or the lit() method instead of datetime.now().strftime('%Y-%m-%d').
Could someone please guide me on how to add this date column to my DataFrame?
Use either lit or current_date:
from datetime import datetime
from pyspark.sql import functions as F

df2 = df.withColumn("Curr_date", F.lit(datetime.now().strftime("%Y-%m-%d")))
# OR
df2 = df.withColumn("Curr_date", F.current_date())
current_timestamp() works, but it is evaluated once at the start of query execution, so every row gets the same value.
If you prefer the actual processing time of each row, you can use the method below:
df.withColumn('current', expr("reflect('java.time.LocalDateTime', 'now')"))  # expr is in pyspark.sql.functions
There is a Spark function current_timestamp().
from pyspark.sql.functions import *
df.withColumn('current', date_format(current_timestamp(), 'yyyy-MM-dd')).show()
+----+----------+
|test|   current|
+----+----------+
|test|2020-09-09|
+----+----------+

How to convert a dataframe to a variable

Is there any direct function to convert a dataframe value and assign it to a variable?
For example, the below returns this:
>>> partitionRecordCount = spark.sql("select count(*) from mydb.mytable where partition_date = 'yyyymmdd'")
>>> partitionRecordCount.show()
+--------+
|count(1)|
+--------+
|  206157|
+--------+
What I need is like below:
>>> partitionRecordCount
206157
I need that record count as an integer value directly in that variable on the left-hand side, rather than a DataFrame. Please advise.
See this answer
get value out of dataframe
So, for your example, you can just change it to:
partitionRecordCount = partitionRecordCount.collect()[0]
Try
partitionRecordCount.collect()[0][0]
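For completeness, a short sketch showing two ways to end up with a plain integer (the placeholder partition value is kept from the question):

row = spark.sql(
    "select count(*) as cnt from mydb.mytable where partition_date = 'yyyymmdd'"
).collect()[0]
partitionRecordCount = row["cnt"]   # or row[0]

# Equivalent without SQL, using the table name from the question:
partitionRecordCount = (spark.table("mydb.mytable")
                        .where("partition_date = 'yyyymmdd'")
                        .count())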