I have the following sample dataframe with object ids and total hours. The decimal values are minutes expressed as a fraction of an hour.
# +----+-----------+
# |col1|total_hours|
# +----+-----------+
# |obj1|  48387.837|
# |obj2| 45570.0201|
# |obj3|  39339.669|
# |obj4|  37673.235|
# |obj5|       3576|
# |obj6| 15287.9999|
# +----+-----------+
I want to show the total hours in hours:minutes format.
Desired output:
# +----+-----------+
# |col1|total_hours|
# +----+-----------+
# |obj1|   48387:50|
# |obj2|   45570:01|
# |obj3|   39339:40|
# |obj4|   37673:14|
# |obj5|    3576:00|
# |obj6|   15288:00|
# +----+-----------+
In SQL I am able to do this with the following:
hr = trunc(total_hours);
minutes = round((total_hours - hr) * 0.6, 2);
hours_minutes = trim(replace(to_char(hr + minutes, '999999999990.90'), '.', ':'));
How can this be done in Pyspark?
This requires some string manipulation, since simple number formatting alone can't produce the hours:minutes output.
The code below takes the fractional part of the number (mod 1), multiplies it by 60, formats both parts and concatenates them:
import pyspark.sql.functions as f

df.withColumn('total_hours_str',
              f.concat(f.regexp_replace(f.format_number(f.floor(df.total_hours), 0), ',', ''),
                       f.lit(':'),
                       f.lpad(f.format_number(df.total_hours % 1 * 60, 0), 2, '0'))).show()
Output:
+----+-----------+---------------+
|col1|total_hours|total_hours_str|
+----+-----------+---------------+
|obj1| 48387.837| 48387:50|
|obj2| 45570.0201| 45570:01|
|obj3| 39339.669| 39339:40|
|obj4| 37673.235| 37673:14|
|obj5| 3576.0| 3576:00|
+----+-----------+---------------+
EDIT:
Since you have fractional values that should round up to a whole hour, I suggest rounding before processing the column:
df.withColumn('rounded_total_hours', f.round(df['total_hours'], 2))\
  .withColumn('total_hours_str',
              f.concat(f.regexp_replace(f.format_number(f.floor(f.col('rounded_total_hours')), 0), ',', ''),
                       f.lit(':'),
                       f.lpad(f.format_number(f.col('rounded_total_hours') % 1 * 60, 0), 2, '0'))).show()
Which produces:
+----+-----------+-------------------+---------------+
|col1|total_hours|rounded_total_hours|total_hours_str|
+----+-----------+-------------------+---------------+
|obj1| 48387.837| 48387.84| 48387:50|
|obj2| 45570.0201| 45570.02| 45570:01|
|obj3| 39339.669| 39339.67| 39339:40|
|obj4| 37673.235| 37673.24| 37673:14|
|obj5| 3576.0| 3576.0| 3576:00|
|obj6| 15287.9999| 15288.0| 15288:00|
+----+-----------+-------------------+---------------+
If your desired datatype is a string then this can be done with string concat.
Steps:
Extract the hours by creating a column that casts total_hours to IntegerType()
Extract the fractional hours by subtracting that value from total_hours
Multiply that fraction by 60 to get the number of minutes
Cast to string and concat with a : separator
Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import concat_ws
df = df.withColumn('total_hour_int', df['total_hours'].cast(IntegerType()))
df = df.withColumn('hours_remainder', df['total_hours'] - df['total_hour_int'])
df = df.withColumn('minutes', df['hours_remainder'] * 60)
df = df.withColumn('minutes_full', df['minutes'].cast(IntegerType()))
df = df.withColumn('total_hours_string', concat_ws(':', df['total_hour_int'], df['minutes_full']))
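Note that concat_ws here produces single-digit minutes (e.g. 45570:1) rather than the zero-padded form shown in the desired output. A minimal sketch of how you might pad them, reusing the minutes_full column created above:
from pyspark.sql.functions import concat, lit, lpad

# left-pad the minutes to two digits so 1 becomes '01'
df = df.withColumn('total_hours_string',
                   concat(df['total_hour_int'].cast('string'), lit(':'),
                          lpad(df['minutes_full'].cast('string'), 2, '0')))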
Related
I would like to get the last business day (LBD) of the month and use it to filter records in a dataframe. I did come up with Python code, but to achieve this functionality I need to use a UDF. Is there any way to get the last business day of the month without using a PySpark UDF?
import calendar

def last_business_day_in_month(calendarYearMonth):
    year = int(calendarYearMonth[0:4])
    month = int(calendarYearMonth[4:])
    return str(year) + str(month) + str(max(calendar.monthcalendar(year, month)[-1:][0][:5]))

last_business_day_in_month(calendarYearMonth)
calendarYearMonth is in format YYYYMM
Ref: https://stackoverflow.com/a/62392077/6187792
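For reference, the UDF-based approach the question is trying to avoid might look roughly like this (the column name yrmth is an assumption, not from the question):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# hypothetical wrapper: register the Python function above as a UDF
last_bizday_udf = udf(last_business_day_in_month, StringType())
df = df.withColumn('last_bizday', last_bizday_udf(col('yrmth').cast('string')))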
You can calculate it using last_day and its dayofweek.
from pyspark.sql import functions as func
spark.sparkContext.parallelize([(202010,), (202201,)]).toDF(['yrmth']). \
withColumn('lastday_mth', func.last_day(func.to_date(func.col('yrmth').cast('string'), 'yyyyMM'))). \
withColumn('dayofwk', func.dayofweek('lastday_mth')). \
withColumn('lastbizday_mth',
func.when(func.col('dayofwk') == 7, func.date_add('lastday_mth', -1)).
when(func.col('dayofwk') == 1, func.date_add('lastday_mth', -2)).
otherwise(func.col('lastday_mth'))
). \
show()
# +------+-----------+-------+--------------+
# | yrmth|lastday_mth|dayofwk|lastbizday_mth|
# +------+-----------+-------+--------------+
# |202010| 2020-10-31| 7| 2020-10-30|
# |202201| 2022-01-31| 2| 2022-01-31|
# +------+-----------+-------+--------------+
Create a small sequence of the last few days of the month, filter out weekends and use array_max to return the max date.
from pyspark.sql import functions as F
df = spark.createDataFrame([('202010',), ('202201',)], ['yrmth'])
last_day = F.last_day(F.to_date('yrmth', 'yyyyMM'))
last_days = F.sequence(F.date_sub(last_day, 3), last_day)
df = df.withColumn(
'last_business_day_in_month',
F.array_max(F.filter(last_days, lambda x: ~F.dayofweek(x).isin([1, 7])))
)
df.show()
# +------+--------------------------+
# | yrmth|last_business_day_in_month|
# +------+--------------------------+
# |202010| 2020-10-30|
# |202201| 2022-01-31|
# +------+--------------------------+
For lower Spark versions:
last_day = "last_day(to_date(yrmth, 'yyyyMM'))"
df = df.withColumn(
'last_business_day_in_month',
F.expr(f"array_max(filter(sequence(date_sub({last_day}, 3), {last_day}), x -> weekday(x) < 5))")
)
I have 1440 rows in my dataframe (one row for every minute of the day). I want to convert this into hours so that I have 24 values (rows) left in total.
This is a 2-column dataframe: the first column is minutes and the second column is integers. I would like 24 rows and 2 columns, where the first column is hours and the second column is the average of 60 values.
If your minutes column is an integer starting at 0, something along these lines should work:
hour = F.floor(F.col('minute') / 60).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
An example where I assume that every hour has 3 minutes:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(0, 5),
(1, 5),
(2, 5),
(3, 0),
(4, 0),
(5, 1)],
['minute', 'integer'])
hour = F.floor(F.col('minute') / 3).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
df.show()
# +----+------------------+
# |hour| average|
# +----+------------------+
# | 0| 5.0|
# | 1|0.3333333333333333|
# +----+------------------+
I need to convert a dataframe column of StringType to double and apply a format mask with a thousands separator and decimal places.
input dataframe:
column(StringType)
2655.00
15722.50
235354.66
required format:
(-1) * to_number(df.column, format mask)
The data should be delivered with . as the thousands separator, , as the decimal separator, and 2 decimal places.
Output column:
2.655,00
15.722,50
235.354,66
Spark format_number returns a string formatted like #,###,###.##, so you need to swap the , and . characters to get the European format you want.
First, replace dots with #, then commas with dots, and finally replace # with a comma.
from pyspark.sql.functions import col, format_number, regexp_replace

df.withColumn("european_format", regexp_replace(regexp_replace(regexp_replace(
    format_number(col("column").cast("double"), 2), '\\.', '#'), ',', '\\.'), '#', ',')
).show()
Gives:
+---------+---------------+
| column|european_format|
+---------+---------------+
| 2655.00| 2.655,00|
| 15722.50| 15.722,50|
|235354.66| 235.354,66|
+---------+---------------+
You can simply do:
import pyspark.sql.functions as F
# create a new column with the formatted number
df = df.withColumn('num_format', F.format_number('col', 2))
# switch the dot and comma
df = df.withColumn('num_format', F.regexp_replace(F.regexp_replace(F.regexp_replace('num_format', '\\.', '#'), ',', '\\.'), '#', ','))
df.show()
+---------+----------+
| col|num_format|
+---------+----------+
| 2655.0| 2.655,00|
| 15722.5| 15.722,50|
|235354.66|235.354,66|
+---------+----------+
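An alternative not shown in these answers: pyspark.sql.functions.translate can swap the two characters in a single pass, avoiding the # placeholder trick. A minimal sketch, assuming the same col column as above:
import pyspark.sql.functions as F

# translate maps '.' -> ',' and ',' -> '.' character by character in one call
df = df.withColumn('num_format',
                   F.translate(F.format_number(F.col('col').cast('double'), 2), '.,', ',.'))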
I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of stack overflow questions which are similar, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.
Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1,2,3,4,5,6,7],
'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# create mask that indicates sequential pair of days (except the first date)
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(),'mask'] = 0
# convert mask to numbers - each sequence has its own number
df['mask'] = df['mask'].cumsum()
# find largest sequence number and get this sequence
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
For example, if I have a table with transaction number and transaction date [as timestamp] columns, how do I find out the total number of transactions on an hourly basis?
Are there any Spark SQL functions available for this kind of range calculation?
You can use from_unixtime function.
val sqlContext = new SQLContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
.groupBy('hour)
.agg(count('transaction_number) as 'transactions)
Result:
+----+------------+
|hour|transactions|
+----+------------+
| 10| 1000|
| 12| 2000|
| 13| 3000|
| 14| 4000|
| ..| ....|
+----+------------+
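If you are working in PySpark rather than Scala, a rough equivalent of the above could look like this (assuming, as in the Scala snippet, that transaction_date is a unix timestamp in seconds):
from pyspark.sql.functions import col, count, from_unixtime, hour

# hour-of-day from the unix timestamp (seconds), then count transactions per hour
df.select('transaction_number', hour(from_unixtime(col('transaction_date'))).alias('hour')) \
  .groupBy('hour') \
  .agg(count('transaction_number').alias('transactions')) \
  .show()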
Here I'm trying to give some pointers to the approach rather than complete code; please see this:
Time Interval Literals:
Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like
“Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
INTERVAL value unit
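For instance, the "past hour" query mentioned above could be sketched in PySpark roughly as follows (the transactions_df and transaction_date names are assumptions for illustration):
from pyspark.sql.functions import col, current_timestamp, expr

# keep only transactions from the last hour (timestamp minus interval)
recent = transactions_df.filter(
    col('transaction_date') >= current_timestamp() - expr('INTERVAL 1 HOUR'))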
Below is the way in Python. You can modify the example to match your requirement, i.e. your transaction date as the start time / end time; instead of id, in your case it's the transaction number.
# Import functions.
from pyspark.sql.functions import *
# Create a simple DataFrame.
data = [
("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
    df.start_time.cast("timestamp").alias("start_time"),
    df.end_time.cast("timestamp").alias("end_time"),
    df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less than or equal to 1 hour.
condition = \
    (to_date(df.start_time) == to_date(df.end_time)) & \
    (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+---------------------+---------------------+---+
|start_time           |end_time             |id |
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+---------------------+---------------------+---+
Using this method, you can apply a group function to find the total number of transactions in your case; a sketch of that step follows.
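A minimal sketch of that grouping step, assuming the timestamps live in the start_time column from the example above and each row carries a transaction id:
from pyspark.sql.functions import count, hour, to_date

# count transactions per day and hour of day
df.groupBy(to_date('start_time').alias('day'), hour('start_time').alias('hour')) \
  .agg(count('id').alias('transactions')) \
  .show()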
The above is Python code; what about Scala?
The expr function used above is available in Scala as well.
Also have a look at spark-scala-datediff-of-two-columns-by-hour-or-minute,
which describes the following:
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )