How to replace the Timedelta Pandas function with a pure PySpark function? - pandas

I am developing a small script in PySpark that generates a date sequence (36 months before today's date), truncating each date to the first day of the month. Overall I succeeded at this task,
but only with the help of the pandas Timedelta to calculate the time delta.
Is there a way to replace this pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f
df = df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+

You can use PySpark's built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the below dataframe.
data = [(1,)]
df = spark.createDataFrame(data, ['id'])
import pyspark.sql.functions as f
df = df.withColumn("start_date", f.add_months(f.trunc(f.current_date(), "month"), -36))
df = df.withColumn("max_date", f.trunc(f.current_date(), "month"))
df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
Here's a link with more details on Spark date functions.
Pyspark date Functions
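For completeness, here is a hedged sketch of my own (not part of the original answer, and it assumes Spark 2.4+ where the sequence() SQL function accepts an interval step) that generates the full month series without pandas at all:
import pyspark.sql.functions as f
# Sketch: build min/max dates as above, then explode one row per month in between.
months_df = (spark.range(1)
             .withColumn("max_date", f.trunc(f.current_date(), "month"))
             .withColumn("min_date", f.add_months(f.col("max_date"), -36))
             .withColumn("date", f.explode(f.expr("sequence(min_date, max_date, interval 1 month)")))
             .select("date"))
months_df.show(50)
The result should match the add_months/posexplode approach in the question, just without the pandas dependency.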

Related

pandas_udf with pd.Series and other object as arguments

I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.
However, the most straightforward solution doesn't seem to be supported by the Pandas-on-Spark API:
A very simple example like the one below
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd

@F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
    return arr.add(addition)

df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the udf definition line:
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am I tackling this problem in the wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
Loading the example and adding the array import:
import pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The response is:
+---------+--------+
| array|addition|
+---------+--------+
|[1, 2, 3]| 10|
|[4, 5, 6]| 20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task, here is an option for a solution that uses the Pandas function within a PySpark UDF:
1. The Spark DF arr column is ArrayType, so convert it into a Pandas Series.
2. Apply the Pandas function.
3. Then, convert the Pandas Series back to an array.
@F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
    pd_arr = pd.Series(arr)
    added = pd_arr.add(addition)
    return array("l", added)

df = df.withColumn("added", addition_pd(F.col("array"), F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
Returns
+---------+--------+------------+
|array |addition|added |
+---------+--------+------------+
|[1, 2, 3]|10 |[11, 12, 13]|
|[4, 5, 6]|20 |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
However, it is worth stating that, when possible, it is recommended to use built-in PySpark functions rather than a PySpark UDF (see here).
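For this specific example, a UDF-free version is also possible. This is a hedged sketch of my own (assuming Spark 2.4+, where the transform() higher-order SQL function is available), not part of the answer above:
import pyspark.sql.functions as F
# Sketch: transform() applies a lambda to each element of the array column,
# so the per-row scalar can be added without any UDF or pandas round trip.
df_native = df.withColumn("added", F.expr("transform(array, x -> x + addition)"))
df_native.show(truncate=False)
It should produce the same added column as the UDF version.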

Unix time stamp conversion in Azure synapse analytics

I am using the below script to refine the data in the silver layer:
# Read from existing internal table
dfAsset = (spark.read.option(Constants.SERVER, "xyz.sql.azuresynapse.net")
           .synapsesql("abc.Salesforce.Asset")
           .select("Id", "ContactId", "CreatedDate", "CreatedById", "LastModifiedDate")
           .filter(col("productCode").contains("11061164"))
           .limit(10))
dfAsset.show()
For the column CreatedDate, the data appears in Unix (epoch) format. Please refer to
the below:
CreateDate
1652108980000
1632313243000
1632312269000
1632312410000
I need to convert the data into YYYY-MM-DD in the above script.
Please advise how it can be done.
This is my sample Dataframe saved in the variable dfAsset.
#+-----------+
#| date1 |
#+-----------+
#|16521089 |
#|16323132 |
#|16323122 |
#|16323124 |
#+-----------+
Using the below code you can convert the data into YYYY-MM-DD.
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import col,to_date
df = dfAsset.withColumn('date',to_date(col('date1').cast(TimestampType())))
df.show()
Output:
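Note that the CreatedDate values shown in the question (e.g. 1652108980000) look like epoch milliseconds rather than seconds, so they likely need to be divided by 1000 before casting. A hedged sketch of that variant applied to the question's dfAsset (my own assumption, not part of the original answer):
from pyspark.sql.functions import col, to_date
# Sketch: epoch milliseconds -> seconds -> timestamp -> date (yyyy-MM-dd).
dfAsset = dfAsset.withColumn('CreatedDate', to_date((col('CreatedDate') / 1000).cast('timestamp')))
dfAsset.show()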

how to split one spark dataframe column into two columns by conditional when

I would like to split a column of a PySpark dataframe into two columns.
The dataframe:
price
90.16|USD
I need:
dollar_price    currency
9016            USD
Pyspark code:
new_col = F.when(F.col("price").isNull() == False, F.substring(F.col('price'), 1, F.instr(F.col('retail_value'), '|')-1)).otherwise(null)
new_df = df.withColumn('dollar_price', new_col)
new_col = F.when(F.col("price").isNull() == False, F.substring(F.col('price'), F.instr(F.col('retail_value'), '|')+1, 3)).otherwise(null)
new_df_1 = new_df.withColumn('currency', new_col)
I got error:
TypeError: Column is not iterable
Could you please tell me what I missed?
I have tried
Split a dataframe column's list into two dataframe columns
but it does not work.
thanks
Try with expr, since you are computing a value from the instr function.
Example:
df.show()
#+---------+
#| price|
#+---------+
#|90.16|USD|
#+---------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("dollar_price",when(col("price").isNull()==False,expr("substring(price,1,instr(price,'|')-1)")).otherwise(None)).\
withColumn("currency",when(col("price").isNull()==False,expr("substring(price,instr(price,'|')+1,3)")).otherwise(None)).\
show()
#+---------+------------+--------+
#| price|dollar_price|currency|
#+---------+------------+--------+
#|90.16|USD| 90.16| USD|
#+---------+------------+--------+
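As a side note, a hedged alternative sketch (my own, not from the original answer) is to use the built-in split function on the literal '|' delimiter, which avoids the substring/instr combination entirely:
from pyspark.sql.functions import split, col
# Sketch: split on a literal '|' (escaped, because the pattern is a regex)
# and pick the two halves by index.
parts = split(col("price"), "\\|")
df.withColumn("dollar_price", parts.getItem(0))\
  .withColumn("currency", parts.getItem(1))\
  .show()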

How to convert timestamp to bigint in a pyspark dataframe

I am using Python in a Spark environment and want to convert a dataframe column from TIMESTAMP datatype to bigint (UNIX timestamp). The columns are as such: ("yyyy-MM-dd hh:mm:ss.SSSSSS")
timestamp_col
2014-06-04 10:09:13.334422
2015-06-03 10:09:13.443322
2015-08-03 10:09:13.232431
I have read around and tried this among others:
from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import TimestampType
df1 = df.select((from_unixtime(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss.SSSSSS"))).cast(TimestampType()).alias("unix_time_col"))
but the output gives NULL values instead.
+-------------+
|unix_time_col|
+-------------+
| null|
| null|
| null|
I am using Python 3.7 on a Spark-on-Hadoop environment (spark-2.3.1-bin-hadoop2.7) on Google Colaboratory.
I must be missing something. Please, any help?
Please remove ".SSSSSS" from your code; then it will work when converting to a Unix timestamp, i.e. instead of "yyyy-MM-dd hh:mm:ss.SSSSSS" write as below:
df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss"))
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import (DateType, StructType, StructField, StringType)
spark = SparkSession.builder.appName('abc').getOrCreate()
column_schema = StructType([StructField("timestamp_col", StringType())])
data = [['2014-06-04 10:09:13.334422'], ['2015-06-03 10:09:13.443322'], ['2015-08-03 10:09:13.232431']]
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.withColumn("timestamp_col", data_frame['timestamp_col'].cast(DateType()))
data_frame = data_frame.withColumn('timestamp_col', unix_timestamp('timestamp_col'))
data_frame.show()
output
+-------------+
|timestamp_col|
+-------------+
| 1401894553|
| 1433344153|
| 1438614553|
+-------------+
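Since the question asks for a bigint, here is a hedged sketch of my own (not part of the answer above): casting the string straight to a timestamp and then to long also yields the epoch seconds, with no format pattern needed.
from pyspark.sql.functions import col
# Sketch: the string cast to timestamp keeps the fractional seconds,
# and casting that timestamp to long gives Unix time in seconds as a bigint.
df1 = df.withColumn("unix_time_col", col("timestamp_col").cast("timestamp").cast("long"))
df1.show()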

How do I sum a certain value for certain day of the week?

I have a DataFrame of phone calls that contains the timestamp and the duration of each call. How would I sum the total duration for each day for all phone calls? The timestamp is a string, so I am having trouble parsing it to an actual date. I'm not sure if Spark has any support for timestamps.
DataFrame table
timestamp | duration
1414592818364 | 210
1414575535061 | 110
1411328461890 | 140
1434606396339 | 90
You can use a UDF to parse timestamps. Below you can find a Python solution, but it should be pretty easy to do the same thing using another supported language:
With raw SQL:
from datetime import datetime

df = sqlContext.createDataFrame(sc.parallelize([
    {'timestamp': 1414592818364, 'duration': 210},
    {'timestamp': 1414575535061, 'duration': 110},
    {'timestamp': 1411328461890, 'duration': 140},
    {'timestamp': 1434606396339, 'duration': 90}]))

def parse_timestamp(tm):
    dt = datetime.fromtimestamp(tm / 1000)
    return '{0}-{1}-{2}'.format(dt.year, dt.month, dt.day)
sqlContext.registerFunction('parse_timestamp', parse_timestamp)
df.registerTempTable('df')
query = '''
    SELECT parse_timestamp(timestamp) AS date, sum(duration) AS total_duration
    FROM df GROUP BY parse_timestamp(timestamp)'''
(sqlContext
    .sql(query)
    .show())
or SQL DSL:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
(df
    .withColumn('date', udf(parse_timestamp, StringType())(df.timestamp))
    .select('date', 'duration')
    .groupby('date')
    .sum()
    .show())
EDIT:
Since Spark 1.5 there is no need for a custom udf.
from pyspark.sql.functions import from_unixtime, col, sum
(df
    .groupBy(from_unixtime(df.timestamp / 1000, "yyyy-MM-dd").alias("date"))
    .agg(sum(col("duration"))))
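And since the question title mentions summing per day of the week, here is a hedged extension of the same idea (my own sketch, not part of the original answer) using date_format with the 'E' pattern to get the weekday name:
from pyspark.sql.functions import from_unixtime, date_format, sum as sum_
# Sketch: derive the weekday name ('Mon', 'Tue', ...) from the epoch-millisecond
# timestamp, then sum durations per weekday.
(df
    .groupBy(date_format(from_unixtime(df.timestamp / 1000), "E").alias("day_of_week"))
    .agg(sum_("duration").alias("total_duration"))
    .show())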