I am developing a small script in PySpark that generates a date sequence (36 months before today's date), truncating each date to the first day of the month. Overall I succeeded at this task, but only with the help of the Pandas Timedelta to calculate the time delta.
Is there a way to replace this Pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f
df = df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
.withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
.select("*", f.posexplode("repeat").alias("date", "val"))\ #
.withColumn("date", f.expr("add_months(minDate, date)"))\
.select('date')\
.show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+
You can use the PySpark built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the below dataframe.
data = [(1,)]
df = spark.createDataFrame(data, ['id'])
import pyspark.sql.functions as f
df = df.withColumn("start_date", f.add_months(f.trunc(f.current_date(), "month"), -36))
df = df.withColumn("max_date", f.trunc(f.current_date(), "month"))
>>> df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
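If you also need the full list of month-start dates between start_date and max_date (as in the question's posexplode approach), here is a minimal sketch assuming Spark 2.4+, which provides the sequence function:
import pyspark.sql.functions as f
# sequence() builds an array of month-start dates from start_date to
# max_date (inclusive); explode() turns it into one row per month.
df.select(f.explode(f.expr("sequence(start_date, max_date, interval 1 month)")).alias("date"))\
    .show(n=50)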
Here's a link with more details on Spark date functions.
Pyspark date Functions
I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.
However, the most straightforward solution doesn't seem to be supported by the Pandas on Spark API:
A very simple example like below
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd
@F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
return arr.add(addition)
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the udf definition line
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am I tackling this problem in the wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
Loading the example, adding import array:
import pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The response is,
+---------+--------+
| array|addition|
+---------+--------+
|[1, 2, 3]| 10|
|[4, 5, 6]| 20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task, here is an option for a solution that uses a Pandas function within a PySpark UDF:
The Spark DF array column (received as arr) is ArrayType, so convert it into a Pandas Series,
apply the Pandas function,
then convert the Pandas Series back to an array.
@F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
pd_arr = pd.Series(arr)
added = pd_arr.add(addition)
return array("l", added)
df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
Returns
+---------+--------+------------+
|array |addition|added |
+---------+--------+------------+
|[1, 2, 3]|10 |[11, 12, 13]|
|[4, 5, 6]|20 |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
However, it is worth stating that, when possible, it is recommended to use built-in PySpark functions over PySpark UDFs (see here).
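For completeness, here is a minimal UDF-free sketch, assuming Spark 2.4+ (it uses the transform higher-order function through expr, and the column names from the example above):
import pyspark.sql.functions as F
# transform() applies the lambda to every element of the array column;
# `addition` inside the lambda resolves to that row's addition value.
df_native = df.withColumn("added", F.expr("transform(`array`, x -> x + addition)"))
df_native.show(truncate=False)
Spark 3.1+ also exposes this as pyspark.sql.functions.transform, so the same expression can be built without expr.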
I am using the below script to refine the data in the silver layer:
# Read from existing internal table
dfAsset =(spark.read.option(Constants.SERVER,"xyz.sql.azuresynapse.net")
.synapsesql("abc.Salesforce.Asset")
.select("Id","ContactId","CreatedDate","CreatedById","LastModifiedDate")
.filter(col("productCode").contains("11061164")).limit(10))
dfAsset.show()
For the CreatedDate column, the data is appearing in Unix (epoch) format. Please refer to the below:
CreatedDate
1652108980000
1632313243000
1632312269000
1632312410000
I need to convert the data into YYYY-MM-DD format in the above script.
Please advise how it can be done.
Regards
RK
This is my sample Dataframe saved in the variable dfAsset.
#+-----------+
#| date1 |
#+-----------+
#|16521089 |
#|16323132 |
#|16323122 |
#|16323124 |
#+-----------+
Using the below code, you can convert the data into YYYY-MM-DD format.
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import col,to_date
df = dfAsset.withColumn('date',to_date(col('date1').cast(TimestampType())))
df.show()
Output: the new date column contains the converted yyyy-MM-dd values.
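Note that the CreatedDate values in the question are epoch milliseconds (13 digits), while casting a numeric column to TimestampType treats it as epoch seconds. A minimal sketch for that case, assuming the column holds milliseconds, is to divide by 1000 first:
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import TimestampType
# CreatedDate holds epoch milliseconds, so divide by 1000 to get seconds
# before casting to a timestamp and truncating to a yyyy-MM-dd date.
dfAsset = dfAsset.withColumn("CreatedDate", to_date((col("CreatedDate") / 1000).cast(TimestampType())))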
I would like to split a column of a PySpark dataframe into two columns.
The dataframe:
price
90.16|USD
I need:
dollar_price currency
90.16 USD
Pyspark code:
new_col = F.when(F.col("price").isNull() == False, F.substring(F.col('price'), 1, F.instr(F.col('retail_value'), '|')-1)).otherwise(null)
new_df = df.withColumn('dollar_price', new_col)
new_col = F.when(F.col("price").isNull() == False, F.substring(F.col('price'), F.instr(F.col('retail_value'), '|')+1, 3)).otherwise(null)
new_df_1 = new_df.withColumn('currency', new_col)
I got this error:
TypeError: Column is not iterable
Could you please tell me what I missed?
I have tried
Split a dataframe column's list into two dataframe columns
but it does not work.
thanks
Try with expr, since you are computing a value from the instr function.
Example:
df.show()
#+---------+
#| price|
#+---------+
#|90.16|USD|
#+---------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("dollar_price",when(col("price").isNull()==False,expr("substring(price,1,instr(price,'|')-1)")).otherwise(None)).\
withColumn("currency",when(col("price").isNull()==False,expr("substring(price,instr(price,'|')+1,3)")).otherwise(None)).\
show()
#+---------+------------+--------+
#| price|dollar_price|currency|
#+---------+------------+--------+
#|90.16|USD| 90.16| USD|
#+---------+------------+--------+
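Another option, as a minimal sketch assuming the value always contains a single '|' delimiter, is to split the column instead of computing substring positions:
from pyspark.sql.functions import col, split
# split on a literal '|' (escaped, since split takes a regex pattern)
split_col = split(col("price"), "\\|")
df.withColumn("dollar_price", split_col.getItem(0))\
    .withColumn("currency", split_col.getItem(1))\
    .show()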
I am using Python in a Spark environment and want to convert a dataframe column from the TIMESTAMP datatype to bigint (a UNIX timestamp). The column values look like this: ("yyyy-MM-dd hh:mm:ss.SSSSSS")
timestamp_col
2014-06-04 10:09:13.334422
2015-06-03 10:09:13.443322
2015-08-03 10:09:13.232431
I have read around and tried this among others:
from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import TimestampType
df1 = df.select((from_unixtime(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss.SSSSSS"))).cast(TimestampType()).alias("unix_time_col"))
but the output gives NULL values instead.
+-------------+
|unix_time_col|
+-------------+
| null|
| null|
| null|
I am using Python 3.7 on a Spark-on-Hadoop environment, with Spark and Hadoop versions spark-2.3.1-bin-hadoop2.7, on Google Colaboratory.
I must be missing something. Please, any help?
please remove ".SSSSSS" in your code, then it will work while converting to unixtimestamp i.e. instead of "yyyy-MM-dd hh:mm:ss.SSSSSS" write as below:
df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss"))
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import (DateType, StructType, StructField, StringType)
spark = SparkSession.builder.appName('abc').getOrCreate()
column_schema = StructType([StructField("timestamp_col", StringType())])
data = [['2014-06-04 10:09:13.334422'], ['2015-06-03 10:09:13.443322'], ['2015-08-03 10:09:13.232431']]
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.withColumn("timestamp_col", data_frame['timestamp_col'].cast(DateType()))
data_frame = data_frame.withColumn('timestamp_col', unix_timestamp('timestamp_col'))
data_frame.show()
output
+-------------+
|timestamp_col|
+-------------+
| 1401894553|
| 1433344153|
| 1438614553|
+-------------+
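Alternatively, if you want to keep the fractional seconds out of the parsing entirely, a minimal sketch (assuming timestamp_col is a string in 'yyyy-MM-dd HH:mm:ss.SSSSSS' form) is to cast through a timestamp and then to long, which yields epoch seconds directly; if the column is already a timestamp, the single cast to long is enough:
from pyspark.sql.functions import col
# casting the string to timestamp parses the full value (including the
# fractional part); casting the timestamp to long gives epoch seconds
df1 = df.select(col("timestamp_col").cast("timestamp").cast("long").alias("unix_time_col"))
df1.show()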
I have a DataFrame of phone calls that contains the timestamp and duration of each call. How would I sum the total duration for each day across all phone calls? The timestamp is a string, so I am having trouble parsing it to an actual date. I'm not sure if Spark has any support for timestamps.
DataFrame table
timestamp | duration
1414592818364 | 210
1414575535061 | 110
1411328461890 | 140
1434606396339 | 90
You can use a UDF to parse the timestamps. Below you can find a Python solution, but it should be pretty easy to do the same thing using another supported language:
With raw SQL:
from datetime import datetime
df = sqlContext.createDataFrame(sc.parallelize([
{'timestamp': 1414592818364, 'duration': 210},
{'timestamp': 1414575535061, 'duration': 110},
{'timestamp': 1411328461890, 'duration': 140},
{'timestamp': 1434606396339, 'duration': 90}]))
def parse_timestamp(tm):
dt = datetime.fromtimestamp(tm / 1000)
return '{0}-{1}-{2}'.format(dt.year, dt.month, dt.day)
sqlContext.registerFunction('parse_timestamp', parse_timestamp)
df.registerTempTable('df')
query = '''
SELECT parse_timestamp(timestamp) AS date, sum(duration) AS total_duration
FROM df GROUP BY parse_timestamp(timestamp)'''
(sqlContext
.sql(query)
.show())
or SQL DSL:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
(df
.withColumn('date', udf(parse_timestamp, StringType())(df.timestamp))
.select('date', 'duration')
.groupby('date')
.sum()
.show())
EDIT:
Since Spark 1.5 there is no need for a custom udf.
from pyspark.sql.functions import from_unixtime, col, sum
(df
.groupBy(from_unixtime(df.timestamp / 1000, "yyyy-MM-dd").alias("date"))
.agg(sum(col("duration"))))
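On more recent Spark versions you can also keep the grouping key as a proper DateType instead of a formatted string; a minimal sketch building on the same df:
from pyspark.sql.functions import col, from_unixtime, to_date, sum as sum_
# convert epoch milliseconds to a date, then sum durations per day
daily = (df
    .withColumn("date", to_date(from_unixtime(col("timestamp") / 1000)))
    .groupBy("date")
    .agg(sum_("duration").alias("total_duration")))
daily.show()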