How to convert timestamp to bigint in a pyspark dataframe

I am using Python in a Spark environment and want to convert a dataframe column from TIMESTAMP datatype to bigint (UNIX timestamp). The column values are in the format "yyyy-MM-dd hh:mm:ss.SSSSSS":
timestamp_col
2014-06-04 10:09:13.334422
2015-06-03 10:09:13.443322
2015-08-03 10:09:13.232431
I have read around and tried this among others:
from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import TimestampType
df1 = df.select((from_unixtime(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss.SSSSSS"))).cast(TimestampType()).alias("unix_time_col"))
but the output gives NULL values instead:
+-------------+
|unix_time_col|
+-------------+
|         null|
|         null|
|         null|
+-------------+
I am using Python 3.7 in a Spark-on-Hadoop environment (spark-2.3.1-bin-hadoop2.7) on Google Colaboratory.
I must be missing something. Any help, please?

Please remove ".SSSSSS" from the pattern in your code; the conversion to a Unix timestamp will then work. That is, instead of "yyyy-MM-dd hh:mm:ss.SSSSSS", write it as below:
df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss"))
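For illustration, a minimal sketch of that fix applied to the question's dataframe, assuming the Spark 2.x legacy parser (which tolerates the trailing fractional seconds in the data):
from pyspark.sql.functions import unix_timestamp

# sketch: unix_timestamp() already returns a bigint (epoch seconds),
# so no extra cast is needed; HH (24-hour clock) is safer than hh here
df1 = df.select(
    unix_timestamp(df.timestamp_col, "yyyy-MM-dd HH:mm:ss").alias("unix_time_col")
)
df1.printSchema()  # unix_time_col: long (i.e. bigint)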

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('abc').getOrCreate()

# sample data with the timestamp column as strings
column_schema = StructType([StructField("timestamp_col", StringType())])
data = [['2014-06-04 10:09:13.334422'], ['2015-06-03 10:09:13.443322'], ['2015-08-03 10:09:13.232431']]
data_frame = spark.createDataFrame(data, schema=column_schema)

# convert the string column to a Unix timestamp (bigint, in seconds)
data_frame = data_frame.withColumn('timestamp_col', unix_timestamp('timestamp_col'))
data_frame.show()
Output:
+-------------+
|timestamp_col|
+-------------+
| 1401894553|
| 1433344153|
| 1438614553|
+-------------+
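As a side note (a sketch, not part of either answer above): if timestamp_col is already a genuine TimestampType column rather than a string, casting it to long yields the epoch seconds as bigint directly:
from pyspark.sql.functions import col

# sketch: assumes timestamp_col is already TimestampType
df_bigint = df.select(col("timestamp_col").cast("long").alias("unix_time_col"))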

Related

Unix timestamp conversion in Azure Synapse Analytics

I am using the below script to refine the data in the silver layer:
# Read from existing internal table
from pyspark.sql.functions import col

dfAsset = (spark.read.option(Constants.SERVER, "xyz.sql.azuresynapse.net")
           .synapsesql("abc.Salesforce.Asset")
           .select("Id", "ContactId", "CreatedDate", "CreatedById", "LastModifiedDate")
           .filter(col("productCode").contains("11061164"))
           .limit(10))
dfAsset.show()
For the column CreatedDate the data appears in Unix (epoch) format. Please refer to the below:
CreatedDate
1652108980000
1632313243000
1632312269000
1632312410000
I need to convert the data into YYYY-MM-DD format in the above script.
Please advise how it can be done.
Regards,
RK
This is my sample DataFrame saved in the variable dfAsset:
#+-----------+
#| date1 |
#+-----------+
#|16521089 |
#|16323132 |
#|16323122 |
#|16323124 |
#+-----------+
Using the below code you can convert the data into YYYY-MM-DD format:
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import col,to_date
df = dfAsset.withColumn('date',to_date(col('date1').cast(TimestampType())))
df.show()
Output: the date column now holds the values in yyyy-MM-dd format.
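Note that this sample uses epoch seconds, while the question's CreatedDate values (e.g. 1652108980000) look like epoch milliseconds. A sketch under that assumption, dividing by 1000 before the cast:
from pyspark.sql.functions import col, to_date

# sketch: assumes CreatedDate holds epoch milliseconds
dfAsset = dfAsset.withColumn(
    "CreatedDate",
    to_date((col("CreatedDate") / 1000).cast("timestamp"))
)
dfAsset.show()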

In Spark, how to check the date format?

How can we check the date format in the below code?
DF = DF.withColumn("DATE", to_date(trim(col("DATE")), "yyyyMMdd"))
Error:
Caused by: java.time.format.DateTimeParseException: Text '2171121' could not be parsed at index 6
Expectation:
If the format is correct use the same data otherwise populate null in the same column.
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp and to_date will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they result in NULL. Check the documentation here.
To switch back to the previous behavior you can use the below configuration.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
Read what has changed from Spark 3.0 w.r.t. the datetime parser here.
You can use the when() and otherwise() functions to get the desired result after applying the above configuration.
>>> from pyspark.sql.functions import *
>>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
>>> df = spark.createDataFrame([(20210822,),(1234,)]).toDF("date")
# casting column to string as to_date function will accept string or date or timestamp type columns
>>> df.withColumn("date", when(to_date(df["date"].cast("string"),"yyyyMMdd").isNull(), None).otherwise(df["date"])).show()
+--------+
| date|
+--------+
|20210822|
| null|
+--------+

Pyspark dataframes left join with conditions (spatial join)

I use PySpark and I have created two dataframes (from txt files):
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext
+---+--------------------+------------------+-------------------+
| id|                name|               lat|                lon|
+---+--------------------+------------------+-------------------+
|  1|                 ...|               ...|                ...|
+---+--------------------+------------------+-------------------+

+---+-------------------+------------------+-------------------+
| id|               name|               lat|                lon|
+---+-------------------+------------------+-------------------+
|  1|                ...|               ...|                ...|
+---+-------------------+------------------+-------------------+
What I want is, using Spark techniques, to get every pair of items from the two dataframes whose euclidean distance is below a certain value (let's say 0.5), like:
record1, record2
or in any similar form; the exact output format does not matter.
Any help will be appreciated, thank you.
Since Spark does not include any provisions for geospatial computations, you need a user-defined function that computes the geospatial distance between two points, for example by using the haversine formula (from here):
from math import radians, cos, sin, asin, sqrt
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
@udf(returnType=FloatType())
def haversine(lat1, lon1, lat2, lon2):
    # great-circle distance in km (earth radius R = 6372.8 km)
    R = 6372.8
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))
    return R * c
Then you simply perform a cross join conditioned on the result from calling haversine():
df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
.select(df1.name, df2.name)
You need a cross join since Spark cannot embed the Python UDF in the join itself. That's expensive, but this is something that PySpark users have to live with.
Here is an example:
>>> df1.show()
+---------+-------------------+--------------------+
| lat| lon| name|
+---------+-------------------+--------------------+
|37.776181|-122.41341399999999|AAE SSFF European...|
|38.959716| -119.945595|Ambassador Motor ...|
| 37.66169| -121.887367|Alameda County Fa...|
+---------+-------------------+--------------------+
>>> df2.show()
+------------------+-------------------+-------------------+
| lat| lon| name|
+------------------+-------------------+-------------------+
| 34.19198813|-118.93756299999998|Daphnes Greek Cafe1|
| 37.755557|-122.25036084651899|Daphnes Greek Cafe2|
|38.423435999999995| -121.41361| Laguna Pizza|
+------------------+-------------------+-------------------+
>>> df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
.select(df1.name.alias("name1"), df2.name.alias("name2")).show()
+--------------------+-------------------+
| name1| name2|
+--------------------+-------------------+
|AAE SSFF European...|Daphnes Greek Cafe2|
|Alameda County Fa...|Daphnes Greek Cafe2|
|Alameda County Fa...| Laguna Pizza|
+--------------------+-------------------+
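If the plain euclidean distance on the raw lat/lon values (with the 0.5 threshold from the question) is really what is wanted, the condition can also be written with built-in column expressions instead of a Python UDF. A sketch, assuming lat and lon are numeric columns:
from pyspark.sql.functions import sqrt, pow

# sketch: euclidean distance on raw lat/lon, threshold 0.5 as in the question
euclid = sqrt(pow(df1.lat - df2.lat, 2) + pow(df1.lon - df2.lon, 2))
df1.join(df2, euclid < 0.5, 'cross') \
   .select(df1.name.alias("name1"), df2.name.alias("name2")) \
   .show()
Because this variant uses only built-in expressions, Spark does not have to ship rows to a Python worker to evaluate the condition, which is usually noticeably faster than the UDF-based cross join.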

Strange conversion of pandas dataframe to spark dataframe with defined schema

I'm facing the following problem and couldn't find an answer yet: when converting a pandas dataframe with integers to a PySpark dataframe with a schema that expects the data to come as strings, the values change to "strange" strings, as in the example below. I've saved a lot of important data like that, and I wonder why that happened and whether it is possible to "decode" these symbols back to integer form. Thanks in advance!
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType
df = pd.DataFrame(data={"a": [111, 222, 333]})
schema = StructType([
    StructField("a", StringType(), True)
])
sparkdf = spark.createDataFrame(df, schema)
sparkdf.show()
Output:
+---+
| a|
+---+
| o|
| Þ|
| ō|
+---+
I cannot reproduce the problem on any recent version, but the most likely reason is that you defined the schema incorrectly (in combination with Arrow support being enabled).
Either cast the input:
df["a"] = df.a.astype("str")
or define the correct schema:
from pyspark.sql.types import LongType
schema = StructType([
    StructField("a", LongType(), True)
])
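As for decoding the data that was already saved: the output above is consistent with each integer having been stored as the single Unicode character whose code point equals the original value (chr(111) == 'o', chr(222) == 'Þ', chr(333) == 'ō'). If that pattern holds for your data, a sketch to recover the integers:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# sketch: assumes each corrupted string is a single character whose
# code point is the original integer (as the output above suggests)
@udf(returnType=LongType())
def decode_int(s):
    return ord(s) if s else None

recovered = sparkdf.withColumn("a", decode_int("a"))
recovered.show()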

How to replace the Timedelta Pandas function with a pure PySpark function?

I am developing a small script in PySpark that generates a date sequence (36 months before today's date), while truncating each date to the first day of the month. Overall I succeeded at this task, but only with the help of the Pandas Timedelta function to calculate the time delta.
Is there a way to replace this Pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f
df = df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+
You can use PySpark's built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the below dataframe.
data = [(1,)]
df = spark.createDataFrame(data, ['id'])
import pyspark.sql.functions as f
df=df.withColumn("start_date" ,f.add_months(f.trunc(f.current_date(),"month") ,-36))
df=df.withColumn("max_date" ,f.trunc(f.current_date(),"month"))
>>> df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
Here's a link with more details on Spark date functions.
Pyspark date Functions
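As a follow-up for the original goal of generating the monthly sequence itself: on Spark 2.4+ the built-in sequence() function can replace the repeat()/posexplode() trick from the question. A sketch against the answer's start_date/max_date columns:
import pyspark.sql.functions as f

# sketch: assumes Spark 2.4+, where sequence() supports date ranges with an interval step
df.withColumn("date", f.explode(
        f.expr("sequence(start_date, max_date, interval 1 month)"))) \
  .select("date") \
  .show(50)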