I have a DF as below:
Name city starttime endtime
user1 London 2019-08-02 03:34:45 2019-08-02 03:52:03
user2 Boston 2019-08-13 13:34:10 2019-08-13 15:02:10
I would like to check the endtime and, if it crosses into the next hour, update the current record so it ends at the last minute/second of the current hour, then append another row (or rows) with similar data, as shown below for user2. Do I use flatMap, convert the DF to an RDD and use map, or is there another way?
Name city starttime endtime
user1 London 2019-08-02 03:34:45 2019-08-02 03:52:03
user2 Boston 2019-08-13 13:34:10 2019-08-13 13:59:59
user2 Boston 2019-08-13 14:00:00 2019-08-13 14:59:59
user2 Boston 2019-08-13 15:00:00 2019-08-13 15:02:10
Thanks
>>> from pyspark.sql.functions import *
>>> df.show()
+-----+------+-------------------+-------------------+
| Name| city| starttime| endtime|
+-----+------+-------------------+-------------------+
|user1|London|2019-08-02 03:34:45|2019-08-02 03:52:03|
|user2|Boston|2019-08-13 13:34:10|2019-08-13 15:02:10|
+-----+------+-------------------+-------------------+
>>> df1 = df.withColumn("diff", ((hour(col("endtime")) - hour(col("starttime")))).cast("Int")) \
        .withColumn("loop", expr("split(repeat(':', diff),':')")) \
        .select(col("*"), posexplode(col("loop")).alias("pos", "value")) \
        .drop("value", "loop")
>>> df1.withColumn("starttime", when(col("pos") == 0, col("starttime")).otherwise(from_unixtime(unix_timestamp(col("starttime")) + (col("pos") * 3600) - minute(col("starttime"))*60 - second(col("starttime"))))) \
       .withColumn("endtime", when((col("diff") - col("pos")) == 0, col("endtime")).otherwise(from_unixtime(unix_timestamp(col("endtime")) - ((col("diff") - col("pos")) * 3600) - minute(col("endtime"))*60 - second(col("endtime")) + lit(59) * lit(60) + lit(59)))) \
       .drop("diff", "pos") \
       .show()
+-----+------+-------------------+-------------------+
| Name| city| starttime| endtime|
+-----+------+-------------------+-------------------+
|user1|London|2019-08-02 03:34:45|2019-08-02 03:52:03|
|user2|Boston|2019-08-13 13:34:10|2019-08-13 13:59:59|
|user2|Boston|2019-08-13 14:00:00|2019-08-13 14:59:59|
|user2|Boston|2019-08-13 15:00:00|2019-08-13 15:02:10|
+-----+------+-------------------+-------------------+
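A possible alternative to the repeat/posexplode trick above is a sequence-based sketch (assuming Spark 2.4+, where the sequence function is available, and that starttime/endtime are, or can be cast to, timestamps): generate one row per clock hour the interval touches, then clamp each slice to its hour.

from pyspark.sql import functions as F

# cast to timestamps so the comparisons below are well-typed (skip if already timestamps)
ts = df.withColumn("starttime", F.col("starttime").cast("timestamp")) \
       .withColumn("endtime", F.col("endtime").cast("timestamp"))

# one row per clock hour the interval touches
hours = ts.withColumn(
    "hour_start",
    F.explode(F.sequence(F.date_trunc("hour", F.col("starttime")),
                         F.date_trunc("hour", F.col("endtime")),
                         F.expr("interval 1 hour"))))

# clamp each slice: start no earlier than the original start,
# end no later than the original end or the last second of that hour
result = hours.withColumn("starttime", F.greatest("starttime", "hour_start")) \
              .withColumn("endtime", F.least("endtime", F.expr("hour_start + interval 59 minutes 59 seconds"))) \
              .drop("hour_start")
result.show()

Either way the work stays in the DataFrame API, so there is no need to drop down to an RDD or use flatMap.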
Related
input dataset
Act status from to
123 1 2011-03-29 00:00:00 2011-03-29 23:59:59
123 1 2011-03-30 00:00:00 2011-03-30 23:59:59
123 1 2011-03-31 00:00:00 2011-03-31 23:59:59
123 0 2011-04-01 00:00:00 2011-04-03 23:59:59
123 0 2011-04-04 00:00:00 2011-04-04 23:59:59
123 0 2011-04-05 00:00:00 2011-04-05 23:59:59
123 1 2011-04-06 00:00:00 2011-04-06 23:59:59
123 1 2011-04-07 00:00:00 2011-04-07 23:59:59
123 1 2011-04-08 00:00:00 2011-04-10 23:59:59
I want the output to be:
act status from to
123 1 2011-03-29 00:00:00 2011-03-31 23:59:59
123 0 2011-04-01 00:00:00 2011-04-05 23:59:59
123 1 2011-04-06 00:00:00 2011-04-10 23:59:59
You can use the lag function to track the status change. After applying the lag function, you would use the results to build your rankings and use the rankings as your groupBy parameter. For example:
status  lag   changed  rankings
1       null  1        1
1       1     0        1
0       1     1        2
1       0     1        3
0       1     1        4
1       0     1        5
1       1     0        5
where:
status : current status
lag : status from previous row
changed : 0 if status == lag, otherwise 1
rankings : cumulative sum of changed
Anyway, here's the answer to your question in SQL.
spark.sql('''
SELECT Act, status, from, to
FROM (
SELECT Act, status, MIN(from) AS from, MAX(to) AS to, rankings
FROM (
SELECT *, SUM(changed) OVER(ORDER BY from) AS rankings
FROM (
SELECT *, IF(LAG(status) OVER(PARTITION BY Act ORDER BY from) = status, 0, 1) AS changed
FROM sample
)
)
GROUP BY 1,2,5
)
''').show()
output:
+---+------+-------------------+-------------------+
|Act|status| from| to|
+---+------+-------------------+-------------------+
|123| 1|2011-03-29 00:00:00|2011-03-31 23:59:59|
|123| 0|2011-04-01 00:00:00|2011-04-05 23:59:59|
|123| 1|2011-04-06 00:00:00|2011-04-10 23:59:59|
+---+------+-------------------+-------------------+
And, here's the pyspark version:
from pyspark.sql.functions import lag, sum, max, min, when
from pyspark.sql.window import Window
df2 = df.withColumn("lag_tracker",lag("status",1).over(Window.partitionBy("Act").orderBy("from")))
df2 = df2.withColumn("changed", when(df2.lag_tracker == df2.status, 0).otherwise(1))
df2 = df2.withColumn("rankings", sum("changed").over(Window.orderBy("from")))
df2 = df2.groupBy("Act", "status", "rankings").agg(min("from").alias("from"), max("to").alias("to"))
df2.select("Act", "status", "from", "to").show()
output:
+---+------+-------------------+-------------------+
|Act|status| from| to|
+---+------+-------------------+-------------------+
|123| 1|2011-03-29 00:00:00|2011-03-31 23:59:59|
|123| 0|2011-04-01 00:00:00|2011-04-05 23:59:59|
|123| 1|2011-04-06 00:00:00|2011-04-10 23:59:59|
+---+------+-------------------+-------------------+
If you have no gaps in the dates, I would suggest using the difference of row numbers:
select act, status, min(from), max(to)
from (select t.*,
row_number() over (partition by act order by from) as seqnum,
row_number() over (partition by act, status order by from) as seqnum_2
from t
) t
group by act, status, (seqnum - seqnum_2);
Why this works is a little tricky to explain. However, if you look at the results from the subquery, you will see that the difference between seqnum and seqnum_2 is constant on adjacent rows with the same status.
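For reference, a pyspark sketch of the same row-number-difference idea (assuming the frame is named df and has the columns Act, status, from, to):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_all = Window.partitionBy("Act").orderBy("from")
w_status = Window.partitionBy("Act", "status").orderBy("from")

collapsed = (df
    .withColumn("seqnum", F.row_number().over(w_all))
    .withColumn("seqnum_2", F.row_number().over(w_status))
    # the difference is constant within each run of identical status values
    .groupBy("Act", "status", (F.col("seqnum") - F.col("seqnum_2")).alias("grp"))
    .agg(F.min("from").alias("from"), F.max("to").alias("to"))
    .drop("grp"))
collapsed.show()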
Note: I would advise you to fix your data model so you don't miss the last second on each day. The to datetime of one row should be the same as the from datetime of the next row. When querying, you can use >= and < to get the values that the row matches.
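For example, with half-open intervals a point-in-time lookup in pyspark could look like this (a sketch only; the timestamp literal is hypothetical, the column names follow the sample data above):

from pyspark.sql import functions as F

# rows whose half-open interval [from, to) contains the timestamp
ts = F.lit("2011-04-02 12:00:00").cast("timestamp")
match = df.filter((ts >= F.col("from").cast("timestamp")) &
                  (ts < F.col("to").cast("timestamp")))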
I hope this makes sense - it's my first post here so I'm sorry if the question is badly formed.
I have tables OldData and NewData:
OldData
ID DateFrom DateTo Priority
1 2018-11-01 2018-12-01* 5
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-04-05* 5
NewData
ID DateFrom DateTo Priority
1 2018-11-13 2018-12-01 6
2 2018-03-21 2018-05-01 6
I need to merge these tables as below. Where IDs match, dates overlap, and Priority is higher in NewData, I need to update the dates in OldData to reflect NewData.
ID DateFrom DateTo Priority
1 2018-11-01 2018-11-13 5
1 2018-11-13 2018-12-01 6
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-03-21 5
2 2018-03-21 2018-05-01 6
I first tried to run nested for loops through each table, matching criteria and making changes one at a time, but I'm sure there is a much better way, e.g. possibly using SQL in R?
In general, I interpret this to be an rbind operation with some cleanup: per-ID, if there is any overlap in the date ranges, then the lower-priority date range is truncated to match. Though not shown in the data, if you have situations where two higher-priority rows may completely negate a middle row, then you might need to add to the logic (it might then turn into an iterative process).
tidyverse
library(dplyr)
out_tidyverse <- bind_rows(OldData, NewData) %>%
arrange(ID, DateFrom) %>%
group_by(ID) %>%
mutate(
DateTo = if_else(row_number() < n() &
DateTo > lead(DateFrom) & Priority < lead(Priority),
lead(DateFrom), DateTo),
DateFrom = if_else(row_number() > 1 &
DateFrom < lag(DateTo) & Priority < lag(Priority),
lag(DateTo), DateFrom)
) %>%
ungroup()
out_tidyverse
# # A tibble: 6 x 4
# ID DateFrom DateTo Priority
# <int> <chr> <chr> <int>
# 1 1 2018-11-01 2018-11-13 5
# 2 1 2018-11-13 2018-12-01 6
# 3 1 2018-12-01 2019-02-01 5
# 4 2 2017-06-01 2018-03-01 5
# 5 2 2018-03-01 2018-03-21 5
# 6 2 2018-03-21 2018-05-01 6
### confirm it is the same as your expected output
all(mapply(`==`, FinData, out_tidyverse))
# [1] TRUE
data.table
I am using magrittr here in order to break out the flow in a readable fashion, but it is not required. If you're comfortable with data.table by itself, then translating from magrittr::%>% to native data.table piping should be straightforward.
Also, I am using as.data.table instead of the often-preferred side-effect setDT, primarily so that you don't use it on your production frame and not realize that many data.frame operations in R (on those two frames) now behave somewhat differently. If you're up for using data.table, then feel free to step around this precaution.
library(data.table)
library(magrittr)
OldDT <- as.data.table(OldData)
NewDT <- as.data.table(NewData)
out_DT <- rbind(OldDT, NewDT) %>%
.[ order(ID, DateFrom), ] %>%
.[, .i := seq_len(.N), by = .(ID) ] %>%
.[, DateTo := fifelse(.i < .N &
DateTo > shift(DateFrom, type = "lead") &
Priority < shift(Priority, type = "lead"),
shift(DateFrom, type = "lead"), DateTo),
by = .(ID) ] %>%
.[, DateFrom := fifelse(.i > 1 &
DateFrom < shift(DateTo) &
Priority < shift(Priority),
shift(DateTo), DateFrom),
by = .(ID) ] %>%
.[, .i := NULL ]
out_DT[]
# ID DateFrom DateTo Priority
# 1: 1 2018-11-01 2018-11-13 5
# 2: 1 2018-11-13 2018-12-01 6
# 3: 1 2018-12-01 2019-02-01 5
# 4: 2 2017-06-01 2018-03-01 5
# 5: 2 2018-03-01 2018-03-21 5
# 6: 2 2018-03-21 2018-05-01 6
all(mapply(`==`, FinData, out_DT))
# [1] TRUE
Data:
OldData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-01 2018-12-01 5
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-04-05 5")
NewData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-13 2018-12-01 6
2 2018-03-21 2018-05-01 6")
FinData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-01 2018-11-13 5
1 2018-11-13 2018-12-01 6
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-03-21 5
2 2018-03-21 2018-05-01 6")
To my understanding this should be helpful:
library(data.table)
df_new <- setDT(df_new)
df_old <- setDT(df_old)
df_all <- rbind(df_old, df_new)
df_all[, .SD[.N], by = .(ID, DateFrom, DateTo)]
You simply rbind both dataframes, then group the resulting df by ID, DateFrom and DateTo. Within each group you extract the last row (i.e. the latest). This results in a dataframe that is basically equal to df_old, except that in some cases the values are 'updated' with the values from df_new. If df_new contains new groups (i.e. new combinations of ID, DateFrom and DateTo), those rows are included as well.
Edit: (after your comment)
df_all[, .SD[.N], by = .(ID, DateFrom)]
I want to create a column using pyspark that contains the date which is 3 years prior to the date in a given column. The date column looks like this:
date
2018-08-01
2016-08-11
2014-09-18
2018-12-08
2011-12-18
And I want this result:
date past date
2018-08-01 2015-08-01
2016-08-11 2013-08-11
2014-09-18 2011-09-18
2018-12-08 2015-12-08
2011-12-18 2008-12-18
You can use the date_sub function.
Here is Scala code, which translates almost directly to Python.
df.withColumn("past_date",date_sub(col("date"), 1095))
Try the add_months function in pyspark and multiply 12 by -3!
Example:
from pyspark.sql.functions import col, add_months
l = [('2018-08-01',), ('2016-08-11',)]
ll = ["date"]
df = spark.createDataFrame(l, ll)
df.withColumn("past_date", add_months(col("`date`"), -3*12)).show()
RESULT:
+----------+----------+
| date| past_date|
+----------+----------+
|2018-08-01|2015-08-01|
|2016-08-11|2013-08-11|
+----------+----------+
I am trying to categorise users based on their lifecycle. The Pandas dataframe given below shows the number of times a customer raised a ticket depending on how long they have used the product.
master dataframe
cust_id,start_date,end_date
101,02/01/2019,12/01/2019
101,14/02/2019,24/04/2019
101,27/04/2019,02/05/2019
102,25/01/2019,02/02/2019
103,02/01/2019,22/01/2019
Master lookup table
start_date,end_date,project_name
01/01/2019,13/01/2019,project_a
14/01/2019,13/02/2019,project_b
15/02/2019,13/03/2019,project_c
14/03/2019,13/06/2019,project_d
I am trying to map the above two data frames such that I am able to add project_name to the master dataframe.
Expected output:
cust_id,start_date,end_date,project_name
101,02/01/2019,12/01/2019,project_a
101,14/02/2019,24/04/2019,project_c
101,14/02/2019,24/04/2019,project_d
101,27/04/2019,02/05/2019,project_d
102,25/01/2019,02/02/2019,project_b
103,02/01/2019,22/01/2019,project_a
103,02/01/2019,22/01/2019,project_b
I do expect duplicate rows in the final output, as a single row in the master dataframe can fall under multiple rows of the master lookup table.
I think you need:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a')
m1 = df['start_date_y'].between(df['start_date_x'], df['end_date_x'])
m2 = df['end_date_y'].between(df['start_date_x'], df['end_date_x'])
df = df[m1 | m2]
print (df)
cust_id start_date_x end_date_x a start_date_y end_date_y project_name
1 101 2019-02-01 2019-12-01 1 2019-01-14 2019-02-13 project_b
2 101 2019-02-01 2019-12-01 1 2019-02-15 2019-03-13 project_c
3 101 2019-02-01 2019-12-01 1 2019-03-14 2019-06-13 project_d
6 101 2019-02-14 2019-04-24 1 2019-02-15 2019-03-13 project_c
7 101 2019-02-14 2019-04-24 1 2019-03-14 2019-06-13 project_d
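One hedged caveat about the sample data: the dates are written dd/mm/yyyy, so if they arrive as strings you may want to parse them day-first before the between comparisons; otherwise ambiguous values such as 02/01/2019 are parsed month-first (which is what the output above reflects):

import pandas as pd

# assumes df1 is the master dataframe and df2 the lookup table, as in the code above
for frame in (df1, df2):
    for c in ('start_date', 'end_date'):
        frame[c] = pd.to_datetime(frame[c], dayfirst=True)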
I have a Dataframe that captures the date a ticket was raised by a customer in a column labelled date. If the ref_column of the current row is the same as that of the following row (for the same cust_id), then the aging is the difference between the date of the current row and the date of the following row. If the ref_column is not the same, then the aging is the difference between date and ref_date of the same row.
Given below is my data:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging (in days) is 0 for the last entry for a given cust_id.
Here's my approach:
# convert dates to datetime type
# ignore if already are
df['date'] = pd.to_datetime(df['date'])
df['ref_date'] = pd.to_datetime(df['ref_date'])
# customer group
groups = df.groupby('cust_id')
# where ref_column is the same as in the next row:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))
# update these ones
df['aging'] = np.where(same_,
-groups['date'].diff(-1).dt.days, # same ref as next row
df['ref_date'].sub(df['date']).dt.days) # diff ref than next row
# update last elements in groups:
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0