pyspark pandas UDF merging date ranges in multiple rows

I am modifying the function described here to work with pyspark.
Input
from pyspark.sql import functions as F
data_in = spark.createDataFrame([
    [1, "2017-1-1", "2017-6-30"], [1, "2017-1-1", "2017-1-3"], [1, "2017-5-1", "2017-9-30"],
    [1, "2018-5-1", "2018-9-30"], [1, "2018-5-2", "2018-10-31"], [1, "2017-4-1", "2017-5-30"],
    [1, "2017-10-3", "2017-10-3"], [1, "2016-12-5", "2016-12-31"], [1, "2016-12-1", "2016-12-2"],
    [2, "2016-12-1", "2016-12-2"], [2, "2016-12-3", "2016-12-25"]
], schema=["id", "start_dt", "end_dt"])
data_in = data_in.select("id",
                         F.to_date("start_dt", "yyyy-M-d").alias("start_dt"),
                         F.to_date("end_dt", "yyyy-M-d").alias("end_dt")).sort(["id", "start_dt", "end_dt"])
Aggregate function to apply
from datetime import datetime
mydt = datetime(1970,1,1).date()
def merge_dates(grp):
    # a new group starts whenever the gap between this row's start and the
    # previous row's end is more than 1 day; cumsum labels the groups
    dt_groups = ((grp["start_dt"] - grp["end_dt"].shift(fill_value=mydt)).dt.days > 1).cumsum()
    grouped = grp.groupby(dt_groups).agg({"start_dt": "min", "end_dt": "max"})
    # recurse until no further merging happens
    return grouped if len(grp) == len(grouped) else merge_dates(grouped)
Testing using Pandas
df = data_in.toPandas()
df.groupby("id").apply(merge_dates).reset_index().drop('level_1', axis=1)
Output
id start_dt end_dt
0 1 2016-12-01 2016-12-02
1 1 2016-12-05 2017-09-30
2 1 2017-10-03 2017-10-03
3 1 2018-05-01 2018-10-31
4 2 2016-12-01 2016-12-25
When I try to run this using Spark
data_out = data_in.groupby("id").applyInPandas(merge_dates, schema=data_in.schema)
display(data_out)
I get the following error
PythonException: 'RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 3 Actual: 2'. Full traceback below:
When I change schema to data_in.schema[1:] I get back only the date columns, which are computed correctly (matching the Pandas output), but the output does not include the id field - which is obviously required. How can I fix this so that the final output has the id as well?

With Spark only, replicating what you have in pandas would look like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

w = W.partitionBy("id").orderBy(F.monotonically_increasing_id())
w1 = w.rangeBetween(W.unboundedPreceding, 0)
out = (data_in.withColumn("helper", F.datediff(F.col("start_dt"),
                                               F.lag("end_dt").over(w)) > 1)
              .fillna({"helper": True})
              .withColumn("helper2", F.sum(F.col("helper").cast("int")).over(w1))
              .groupBy("id", "helper2").agg(F.min("start_dt").alias("start_dt"),
                                            F.max("end_dt").alias("end_dt"))
              .drop("helper2"))
out.show()
+---+----------+----------+
| id| start_dt| end_dt|
+---+----------+----------+
| 1|2016-12-01|2016-12-02|
| 1|2016-12-05|2017-09-30|
| 1|2017-10-03|2017-10-03|
| 1|2018-05-01|2018-10-31|
| 2|2016-12-01|2016-12-25|
+---+----------+----------+
Note that this assumes mydt = datetime(1970,1,1).date() is just a placeholder for the nulls produced by the shift; I have used fillna with True for the same purpose after the lag. If that is not the case, you can fillna right after the lag, which is the equivalent of the shift with a fill value.
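If you prefer to keep the applyInPandas route from the question, a minimal sketch (my own addition, assuming Spark 3.x and the merge_dates function defined above) is to re-attach the grouping key inside the UDF so the returned frame matches all three columns of data_in.schema:
def merge_dates_with_id(pdf):
    # re-use the pandas merge_dates, then put the group's id back as the first column
    merged = merge_dates(pdf).reset_index(drop=True)
    merged.insert(0, "id", pdf["id"].iloc[0])
    return merged

data_out = data_in.groupby("id").applyInPandas(merge_dates_with_id, schema=data_in.schema)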

Related

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2}, index=timestamps)
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3  # 2 input timesteps + 1 target column
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
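If the slice-and-concat loop feels heavy, a hedged alternative sketch (assuming numpy >= 1.20 for sliding_window_view) builds the same frame from strided views:
import numpy as np
import pandas as pd

window_size = 3  # 2 input timesteps + 1 target column
# shape (n_windows, n_cols, window_size): one (A, B) block per window start
windows = np.lib.stride_tricks.sliding_window_view(df.to_numpy(), window_size, axis=0)
new_df = pd.concat(
    [
        pd.DataFrame(w, index=df.columns,
                     columns=[f"timestep_{j}" for j in range(1, window_size)] + ["target"])
        for w in windows
    ],
    keys=list(range(len(windows))),
)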

Filtering based on value and creating list in spark dataframe

I am new to Spark and I am trying to do the following, using PySpark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids doing the following
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas, which I'm more familiar with, I would now loop through the dataframe for each distinct id and gather all the rows for it, but I'm sure this is far from optimal.
Can you give me any ideas? Groupby comes to mind but I'm not sure how to approach it.
You can use groupby and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a numpy array, you can do
import numpy as np

result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
[4, 4]],
[[1, 1],
[2, 2]]])

Pandas: drop out of sequence row

My Pandas df:
import pandas as pd
import io
data = """date value
"2015-09-01" 71.925000
"2015-09-06" 71.625000
"2015-09-11" 71.333333
"2015-09-12" 64.571429
"2015-09-21" 72.285714
"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.date = pd.to_datetime(df.date)
Given a user input date (01-09-2015), I would like to keep only those dates where the difference between the date and the input date is a multiple of 5.
Expected output:
input = 01-09-2015
df:
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
3 2015-09-21 72.285714
My approach so far:
I take the delta between input_date and date in pandas and save this delta in a separate column.
If delta % 5 == 0, keep the row, else drop it. Is this the best that can be done?
Use boolean indexing to filter by a mask: convert the input value to a datetime, then convert the timedeltas to days with Series.dt.days:
input1 = '01-09-2015'
df = df[df.date.sub(pd.to_datetime(input1)).dt.days % 5 == 0]
print (df)
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
4 2015-09-21 72.285714
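One caveat worth noting: '01-09-2015' is ambiguous, and pd.to_datetime parses it month-first by default. A sketch of the same filter with an explicit format, assuming the input means 1 September 2015:
input1 = pd.to_datetime('01-09-2015', format='%d-%m-%Y')  # day-first, explicitly
df = df[df.date.sub(input1).dt.days % 5 == 0]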

Selecting specific rows by date in a multi-dimensional df Python

image of df
I would like to select a specific date, e.g. 2020-07-07, and get the Adj Cls and ExMA for each of the symbols. I'm new to Python and I tried using df.loc['xy'] (xy being a specific date in the datetime level) and keep getting a KeyError. Any insight is greatly appreciated.
Info on the df: MultiIndex: 30 entries, (SNAP, 2020-07-06 00:00:00) to (YUM, 2020-07-10 00:00:00)
Data columns (total 2 columns):
dtypes: float64(2)
You can use pandas.DataFrame.xs for this.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(8).reshape(4, 2), index=[[0, 0, 1, 1], [2, 3, 2, 3]], columns=list("ab")
)
print(df)
# a b
# 0 2 0 1
# 3 2 3
# 1 2 4 5
# 3 6 7
print(df.xs(3, level=1).filter(["a"]))
# a
# 0 2
# 1 6
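Applied to the question's frame, a sketch (this assumes the symbol is level 0 and the date is level 1 of the MultiIndex, and that the columns are named 'Adj Cls' and 'ExMA' as in the screenshot):
# cross-section at one date, keeping one row per symbol
df.xs('2020-07-07', level=1)[['Adj Cls', 'ExMA']]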

Copy columns to dataframe using panda

I have two dataframes and I want to copy the values from one to the other, but the copied values come back as NaN.
These are my dfs:
data1 = [[1, 2], [3, 4], [5, 6]]
rc = pd.DataFrame(data1, columns=['Sold', 'Leads'])
data2 = [['Company1','2017-05-01',0, 0], ['Company1','2017-05-01',0, 0], ['Company1','2017-05-01',0, 0]]
final = pd.DataFrame(data2, columns = ['company','date','2019_sold', '2019_leads'])
I tried loc indexing
final.loc[(final['date'] == '2017-05-01') & (final['company'] == 'Company1'),['2019_sold','2019_leads']] = rc[['Leads','Sold']]
I expected this to copy the exact values of the rc df into the final df, but the values returned are NaN.
By using update:
rc.index=final.index[(final['date'] == '2017-05-01') & (final['company'] == 'Company1')]
rc.columns=['2019_sold','2019_leads']
final.update(rc)
final
Out[165]:
company date 2019_sold 2019_leads
0 Company1 2017-05-01 1 2
1 Company1 2017-05-01 3 4
2 Company1 2017-05-01 5 6
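For completeness, a sketch of an alternative fix that keeps the original loc attempt (starting from the unmodified rc, i.e. before the rename above): the NaNs come from index alignment, so assigning the underlying values by position side-steps it. Note the column order has to line up with the target columns:
mask = (final['date'] == '2017-05-01') & (final['company'] == 'Company1')
# .values drops rc's index, so pandas assigns by position instead of aligning labels
final.loc[mask, ['2019_sold', '2019_leads']] = rc[['Sold', 'Leads']].values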