Unix timestamp conversion in Azure Synapse Analytics - azure-synapse

I am using the script below to refine the data in the silver layer:
# Read from existing internal table
dfAsset = (spark.read.option(Constants.SERVER, "xyz.sql.azuresynapse.net")
           .synapsesql("abc.Salesforce.Asset")
           .select("Id", "ContactId", "CreatedDate", "CreatedById", "LastModifiedDate")
           .filter(col("productCode").contains("11061164"))
           .limit(10))
dfAsset.show()
For the particular column CreatedDate, the data appears in Unix format. Please refer to the sample below:
CreatedDate
1652108980000
1632313243000
1632312269000
1632312410000
I need to convert the data into YYYY-MM-DD format in the above script.
Please advise how it can be done.
Regards
RK

This is my sample DataFrame, saved in the variable dfAsset:
#+-----------+
#| date1 |
#+-----------+
#|16521089 |
#|16323132 |
#|16323122 |
#|16323124 |
#+-----------+
Using the code below, you can convert the data into YYYY-MM-DD format.
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import col,to_date
df = dfAsset.withColumn('date',to_date(col('date1').cast(TimestampType())))
df.show()
Output:
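Note that the CreatedDate values shown in the question (e.g. 1652108980000) are epoch milliseconds rather than seconds, so casting them directly to TimestampType would put the dates far in the future. A minimal sketch for that case (assuming the column is named CreatedDate and holds millisecond values):
from pyspark.sql.functions import col, from_unixtime, to_date
# Divide by 1000 to go from epoch milliseconds to epoch seconds,
# then convert to a date (yyyy-MM-dd).
dfAsset = dfAsset.withColumn("CreatedDate", to_date(from_unixtime(col("CreatedDate") / 1000)))
dfAsset.show()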

Related

How to convert String to JSON in Spark SQL?

I was working with the Delta logs of a Delta table whose data is stored in Azure Blob Storage.
I used the query below to fetch the JSON data of the Delta log:
SELECT * FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`
I was able to fetch the JSON data as a table in the Databricks notebook.
Now I can traverse into the JSON data in the query itself:
SELECT add.stats FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`
But beyond stats the JSON data is stored as a string, so I am not able to traverse any further.
I am attaching the JSON file here for reference:
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"d21b496a-7282-49c9-a71c-3013d780fbeb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"city\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"age\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"department\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"degree\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"workingLocation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"maxEducation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"experience\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"jobRole\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1664788829119}}
{"add":{"path":"part-00000-91e882d3-f9bc-481f-ba50-a7d061040401-c000.snappy.parquet","partitionValues":{},"size":6337133,"modificationTime":1664788872000,"dataChange":true,"stats":"{\"numRecords\":136251,\"minValues\":{\"Id\":1,\"Name\":\"Robin\",\"city\":\"Jhunjhunu\",\"age\":22,\"department\":\"Data Integration\",\"degree\":\"bsc\",\"workingLocation\":\"Jaipur\",\"maxEducation\":\"Graduation\",\"experience\":2,\"jobRole\":\"Data Engineer\"},\"maxValues\":{\"Id\":136251,\"Name\":\"Robin99999\",\"city\":\"Jhunjhunu99999\",\"age\":136272,\"department\":\"Data Integration99999\",\"degree\":\"bsc99999\",\"workingLocation\":\"Jaipur99999\",\"maxEducation\":\"Graduation99999\",\"experience\":136252,\"jobRole\":\"Data Engineer99999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-25af3704-bc8d-4201-bbe8-6e6b07864e40-c000.snappy.parquet","partitionValues":{},"size":5969233,"modificationTime":1664788870000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":136252,\"Name\":\"Robin136251\",\"city\":\"Jhunjhunu136251\",\"age\":136273,\"department\":\"Data Integration136251\",\"degree\":\"bsc136251\",\"workingLocation\":\"Jaipur136251\",\"maxEducation\":\"Graduation136251\",\"experience\":136253,\"jobRole\":\"Data Engineer136251\"},\"maxValues\":{\"Id\":264149,\"Name\":\"Robin264148\",\"city\":\"Jhunjhunu264148\",\"age\":264170,\"department\":\"Data Integration264148\",\"degree\":\"bsc264148\",\"workingLocation\":\"Jaipur264148\",\"maxEducation\":\"Graduation264148\",\"experience\":264150,\"jobRole\":\"Data Engineer264148\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-0fbaae5e-371e-45ea-b2ce-b959248ba88a-c000.snappy.parquet","partitionValues":{},"size":5961349,"modificationTime":1664788871000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":264150,\"Name\":\"Robin264149\",\"city\":\"Jhunjhunu264149\",\"age\":264171,\"department\":\"Data Integration264149\",\"degree\":\"bsc264149\",\"workingLocation\":\"Jaipur264149\",\"maxEducation\":\"Graduation264149\",\"experience\":264151,\"jobRole\":\"Data Engineer264149\"},\"maxValues\":{\"Id\":392047,\"Name\":\"Robin392046\",\"city\":\"Jhunjhunu392046\",\"age\":392068,\"department\":\"Data Integration392046\",\"degree\":\"bsc392046\",\"workingLocation\":\"Jaipur392046\",\"maxEducation\":\"Graduation392046\",\"experience\":392048,\"jobRole\":\"Data Engineer392046\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00003-1c8ac662-7680-42b8-8ffb-eef96ad37085-c000.snappy.parquet","partitionValues":{},"size":5961263,"modificationTime":1664788862000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":392048,\"Name\":\"Robin392047\",\"city\":\"Jhunjhunu392047\",\"age\":392069,\"department\":\"Data Integration392047\",\"degree\":\"bsc392047\",\"workingLocation\":\"Jaipur392047\",\"maxEducation\":\"Graduation392047\",\"experience\":392049,\"jobRole\":\"Data Engineer392047\"},\"maxValues\":{\"Id\":519945,\"Name\":\"Robin519944\",\"city\":\"Jhunjhunu519944\",\"age\":519966,\"department\":\"Data Integration519944\",\"degree\":\"bsc519944\",\"workingLocation\":\"Jaipur519944\",\"maxEducation\":\"Graduation519944\",\"experience\":519946,\"jobRole\":\"Data Engineer519944\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000003","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00004-5c355575-1751-464f-93a5-672ae1f60c29-c000.snappy.parquet","partitionValues":{},"size":5990168,"modificationTime":1664788867000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":519946,\"Name\":\"Robin519945\",\"city\":\"Jhunjhunu519945\",\"age\":519967,\"department\":\"Data Integration519945\",\"degree\":\"bsc519945\",\"workingLocation\":\"Jaipur519945\",\"maxEducation\":\"Graduation519945\",\"experience\":519947,\"jobRole\":\"Data Engineer519945\"},\"maxValues\":{\"Id\":647843,\"Name\":\"Robin647842\",\"city\":\"Jhunjhunu647842\",\"age\":647864,\"department\":\"Data Integration647842\",\"degree\":\"bsc647842\",\"workingLocation\":\"Jaipur647842\",\"maxEducation\":\"Graduation647842\",\"experience\":647844,\"jobRole\":\"Data Engineer647842\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000004","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00005-fac91d9b-9528-4381-8983-419261e7b6ba-c000.snappy.parquet","partitionValues":{},"size":5962518,"modificationTime":1664788865000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":647844,\"Name\":\"Robin647843\",\"city\":\"Jhunjhunu647843\",\"age\":647865,\"department\":\"Data Integration647843\",\"degree\":\"bsc647843\",\"workingLocation\":\"Jaipur647843\",\"maxEducation\":\"Graduation647843\",\"experience\":647845,\"jobRole\":\"Data Engineer647843\"},\"maxValues\":{\"Id\":775741,\"Name\":\"Robin775740\",\"city\":\"Jhunjhunu775740\",\"age\":775762,\"department\":\"Data Integration775740\",\"degree\":\"bsc775740\",\"workingLocation\":\"Jaipur775740\",\"maxEducation\":\"Graduation775740\",\"experience\":775742,\"jobRole\":\"Data Engineer775740\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000005","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00006-41366707-d8cc-1 QQQQQQQAQQQQQQQQQQ46d2-b781-4d9bcc5b5210-c000.snappy.parquet","partitionValues":{},"size":5967367,"modificationTime":1664788866000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":775742,\"Name\":\"Robin775741\",\"city\":\"Jhunjhunu775741\",\"age\":775763,\"department\":\"Data Integration775741\",\"degree\":\"bsc775741\",\"workingLocation\":\"Jaipur775741\",\"maxEducation\":\"Graduation775741\",\"experience\":775743,\"jobRole\":\"Data Engineer775741\"},\"maxValues\":{\"Id\":903639,\"Name\":\"Robin903638\",\"city\":\"Jhunjhunu903638\",\"age\":903660,\"department\":\"Data Integration903638\",\"degree\":\"bsc903638\",\"workingLocation\":\"Jaipur903638\",\"maxEducation\":\"Graduation903638\",\"experience\":903640,\"jobRole\":\"Data Engineer903638\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000006","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00007-8c0a3670-cb24-4cdc-ac37-a9f173c7fe71-c000.snappy.parquet","partitionValues":{},"size":4498552,"modificationTime":1664788864000,"dataChange":true,"stats":"{\"numRecords\":96361,\"minValues\":{\"Id\":903640,\"Name\":\"Robin903639\",\"city\":\"Jhunjhunu903639\",\"age\":903661,\"department\":\"Data Integration903639\",\"degree\":\"bsc903639\",\"workingLocation\":\"Jaipur903639\",\"maxEducation\":\"Graduation903639\",\"experience\":903641,\"jobRole\":\"Data Engineer903639\"},\"maxValues\":{\"Id\":1000000,\"Name\":\"Robin999999\",\"city\":\"Jhunjhunu999999\",\"age\":1000021,\"department\":\"Data Integration999999\",\"degree\":\"bsc999999\",\"workingLocation\":\"Jaipur999999\",\"maxEducation\":\"Graduation999999\",\"experience\":1000001,\"jobRole\":\"Data Engineer999999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000007","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1664788877620,"userId":"2102279527814428","userName":"robin30121999#gmail.com","operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"notebook":{"notebookId":"2204946593481188"},"clusterId":"1003-090126-rztrsnm4","isolationLevel":"WriteSerializable","isBlindAppend":false,"operationMetrics":{"numFiles":"8","numOutputRows":"1000000","numOutputBytes":"46647583"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"17e89e4f-380f-4888-9aaf-2ada42e13b8c"}}
Ultimately, I want to fetch the maximum value of experience using SQL.
Starting with the example "stats" column of type string which you have...
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("{\"numRecords\":136251,\"minValues\":{\"Id\":1,\"Name\":\"Robin\",\"city\":\"Jhunjhunu\",\"age\":22,\"department\":\"Data Integration\",\"degree\":\"bsc\",\"workingLocation\":\"Jaipur\",\"maxEducation\":\"Graduation\",\"experience\":2,\"jobRole\":\"Data Engineer\"},\"maxValues\":{\"Id\":136251,\"Name\":\"Robin99999\",\"city\":\"Jhunjhunu99999\",\"age\":136272,\"department\":\"Data Integration99999\",\"degree\":\"bsc99999\",\"workingLocation\":\"Jaipur99999\",\"maxEducation\":\"Graduation99999\",\"experience\":136252,\"jobRole\":\"Data Engineer99999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",),
("{\"numRecords\":127898,\"minValues\":{\"Id\":136252,\"Name\":\"Robin136251\",\"city\":\"Jhunjhunu136251\",\"age\":136273,\"department\":\"Data Integration136251\",\"degree\":\"bsc136251\",\"workingLocation\":\"Jaipur136251\",\"maxEducation\":\"Graduation136251\",\"experience\":136253,\"jobRole\":\"Data Engineer136251\"},\"maxValues\":{\"Id\":264149,\"Name\":\"Robin264148\",\"city\":\"Jhunjhunu264148\",\"age\":264170,\"department\":\"Data Integration264148\",\"degree\":\"bsc264148\",\"workingLocation\":\"Jaipur264148\",\"maxEducation\":\"Graduation264148\",\"experience\":264150,\"jobRole\":\"Data Engineer264148\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",),
("{\"numRecords\":127898,\"minValues\":{\"Id\":264150,\"Name\":\"Robin264149\",\"city\":\"Jhunjhunu264149\",\"age\":264171,\"department\":\"Data Integration264149\",\"degree\":\"bsc264149\",\"workingLocation\":\"Jaipur264149\",\"maxEducation\":\"Graduation264149\",\"experience\":264151,\"jobRole\":\"Data Engineer264149\"},\"maxValues\":{\"Id\":392047,\"Name\":\"Robin392046\",\"city\":\"Jhunjhunu392046\",\"age\":392068,\"department\":\"Data Integration392046\",\"degree\":\"bsc392046\",\"workingLocation\":\"Jaipur392046\",\"maxEducation\":\"Graduation392046\",\"experience\":392048,\"jobRole\":\"Data Engineer392046\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",)],
["stats"])
You can use from_json (providing a schema for the path to the object you need, "experience") to extract that object together with the structure leading to it. Then you can strip the structure away by selecting the struct path (F.col("exp.maxValues.experience")).
df = df.withColumn("exp", F.from_json("stats", "maxValues struct<experience:long>"))
df = df.withColumn("exp", F.col("exp.maxValues.experience"))
df.show()
# +--------------------+------+
# | stats| exp|
# +--------------------+------+
# |{"numRecords":136...|136252|
# |{"numRecords":127...|264150|
# |{"numRecords":127...|392048|
# +--------------------+------+
In SQL you could do it like this (after registering the DataFrame as a temporary view):
df.createOrReplaceTempView("df")
spark.sql("""
SELECT from_json(stats, 'maxValues struct<experience:long>').maxValues.experience AS exp
FROM df
""").show()
# +------+
# | exp|
# +------+
# |136252|
# |264150|
# |392048|
# +------+
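If you ultimately want the maximum experience straight from the Delta log with SQL, a sketch like the one below should work (assuming the same _delta_log path as in the question; rows without an add entry produce NULL and are ignored by the aggregate). For the log shown above this returns 1000001.
spark.sql("""
SELECT MAX(from_json(add.stats, 'maxValues struct<experience:long>').maxValues.experience) AS max_experience
FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`
""").show()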

Scala dataframe get last 6 months latest data

I have a column in a dataframe like the one below:
+-------------------+
| timestampCol|
+-------------------+
|2020-11-27 00:00:00|
|2020-11-27 00:00:00|
+-------------------+
I need to filter the data based on this date, and I want to get only the last 6 months of data. Could anyone please suggest how I can do that?
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
dataset.filter(dataset.col("timestampCol").cast("date")
.gt(add_months(current_date(),-6)));
This will keep only the timestampCol values from the last 6 months, filtering out anything older.
Depending on the dataset schema you may need to cast the value as a date.
If it's a date just compare it directly with a java.sql.Timestamp instance.
val someMomentInTime =
  java.sql.Timestamp.valueOf("2020-05-27 00:00:00") // any "yyyy-[m]m-[d]d hh:mm:ss" literal
val df: DataFrame =
  ???
df.filter(col("timestampCol") > someMomentInTime) // DataFrame is Dataset[Row]

In Spark, how to check the date format?

How can we check the date format in the code below?
DF = DF.withColumn("DATE", to_date(trim(col("DATE")), "yyyyMMdd"))
Error:
Caused by: java.time.format.DateTimeParseException: Text '2171121' could not be parsed at index 6
Expectation:
If the format is correct, keep the same data; otherwise, populate null in the same column.
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp and to_date will fail if the specified datetime pattern is invalid; in Spark 3.0 or earlier they return NULL. See the Spark SQL migration guide for details.
To switch back to the previous behavior, you can use the configuration below.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
The migration guide also describes what changed in the datetime parser from Spark 3.0 onwards.
After applying the above configuration, you can use the when() and otherwise() functions to get the desired result.
>>> from pyspark.sql.functions import *
>>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
>>> df = spark.createDataFrame([(20210822,),(1234,)]).toDF("date")
# cast the column to string, since the to_date function accepts string, date, or timestamp columns
>>> df.withColumn("date", when(to_date(df["date"].cast("string"),"yyyyMMdd").isNull(), None).otherwise(df["date"])).show()
+--------+
| date|
+--------+
|20210822|
| null|
+--------+
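If you would rather not fall back to the legacy parser, one option (reusing the same df as above) is to null out values that do not even look like yyyyMMdd before any parsing happens. This is only a sketch: the regex checks the shape (exactly 8 digits), not calendar validity (e.g. 20211399 would still pass), so the configuration-based approach above remains the more robust one.
>>> df.withColumn("date", when(col("date").cast("string").rlike("^[0-9]{8}$"), col("date"))).show()
+--------+
|    date|
+--------+
|20210822|
|    null|
+--------+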

How to replace the Timedelta Pandas function with a pure PySpark function?

I am developing a small script in PySpark that generates a date sequence (36 months before today's date), truncating each date to the first day of the month. Overall I succeeded at this task, but only with the help of the Pandas Timedelta function to calculate the time delta.
Is there a way to replace this Pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f
df = df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
.withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
.select("*", f.posexplode("repeat").alias("date", "val"))\ #
.withColumn("date", f.expr("add_months(minDate, date)"))\
.select('date')\
.show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+
You can use PySpark's built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the dataframe below.
data = [(1,)]
df = spark.createDataFrame(data, ['id'])
import pyspark.sql.functions as f
df = df.withColumn("start_date", f.add_months(f.trunc(f.current_date(), "month"), -36))
df = df.withColumn("max_date", f.trunc(f.current_date(), "month"))
df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
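On Spark 2.4 and later you can also generate the full month-by-month sequence directly, without the repeat/posexplode trick, using the built-in sequence function with an interval step. A sketch reusing df and the f alias from above:
df.select(f.explode(f.expr("sequence(start_date, max_date, interval 1 month)")).alias("date")).show(50)
This returns one row per month from start_date through max_date, already truncated to the first of the month.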
For more details on Spark date functions, see the PySpark date functions documentation.

How to convert timestamp to bigint in a pyspark dataframe

I am using Python in a Spark environment and want to convert a dataframe column from TIMESTAMP data type to bigint (Unix timestamp). The column values are in the format "yyyy-MM-dd hh:mm:ss.SSSSSS":
timestamp_col
2014-06-04 10:09:13.334422
2015-06-03 10:09:13.443322
2015-08-03 10:09:13.232431
I have read around and tried this among others:
from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import TimestampType
df1 = df.select((from_unixtime(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss.SSSSSS"))).cast(TimestampType()).alias("unix_time_col"))
but the output gives NULL values instead.
+-------------+
|unix_time_col|
+-------------+
| null|
| null|
| null|
I am using Python 3.7 in a Spark-on-Hadoop environment (spark-2.3.1-bin-hadoop2.7) on Google Colaboratory.
I must be missing something. Please, any help?
please remove ".SSSSSS" in your code, then it will work while converting to unixtimestamp i.e. instead of "yyyy-MM-dd hh:mm:ss.SSSSSS" write as below:
df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss"))
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName('abc').getOrCreate()
column_schema = StructType([StructField("timestamp_col", StringType())])
data = [['2014-06-04 10:09:13.334422'], ['2015-06-03 10:09:13.443322'], ['2015-08-03 10:09:13.232431']]
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame = data_frame.withColumn('timestamp_col', unix_timestamp('timestamp_col'))
data_frame.show()
Output:
+-------------+
|timestamp_col|
+-------------+
| 1401894553|
| 1433344153|
| 1438614553|
+-------------+
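As an alternative to spelling out a format string at all, you can let Spark cast the string straight to a timestamp (the default cast handles the fractional .SSSSSS part) and then cast that to long to get epoch seconds. A minimal sketch using the same sample data; it yields the same values as above (e.g. 1401894553 for the first row):
from pyspark.sql.functions import col
df = spark.createDataFrame(
    [('2014-06-04 10:09:13.334422',),
     ('2015-06-03 10:09:13.443322',),
     ('2015-08-03 10:09:13.232431',)],
    ['timestamp_col'])
# string -> timestamp keeps the microseconds; timestamp -> long truncates to whole epoch seconds
df.withColumn("unix_time_col", col("timestamp_col").cast("timestamp").cast("long")).show(truncate=False)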