How to use interesting values with a training window in featuretools? - pandas

Code:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

# Create item details table
l = [[1, '1', '2018-05-02', 'A', 2.0, 10],
     [1, '1', '2018-05-02', 'A', 1.0, 10],
     [2, '1', '2018-05-28', 'B', 1.0, 40],
     [3, '1', '2018-06-13', 'A', 2.0, 30],
     [4, '1', '2019-08-20', 'C', 3.0, 60]]
item_detail = pd.DataFrame(l)
item_detail.columns = ['Ticket_id', 'Customer_id', 'trans_date', 'SKU', 'Qty', 'Amount']
item_detail["trans_date"] = pd.to_datetime(item_detail["trans_date"])
item_detail["index"] = item_detail.index
display(item_detail)

# Create ticket details table
b = [['1', '2018-05-02', 1],
     ['1', '2018-05-28', 2],
     ['1', '2018-06-13', 3],
     ['1', '2019-08-20', 4]]
ticket_detail = pd.DataFrame(b)
ticket_detail.columns = ['Customer_id', 'trans_date', 'Ticket_id']
ticket_detail["trans_date"] = pd.to_datetime(ticket_detail["trans_date"])
display(ticket_detail)

# Create featuretools entities & relationships
es = ft.EntitySet(id='customer_features')
es = es.entity_from_dataframe(entity_id="basket", dataframe=ticket_detail,
                              index="Ticket_id", time_index="trans_date")
es.entity_from_dataframe(entity_id='transactions', dataframe=item_detail, index='index')
tr_relationship = ft.Relationship(es["basket"]["Ticket_id"], es["transactions"]["Ticket_id"])
es = es.add_relationships([tr_relationship])
print(es)
es["transactions"]["SKU"].interesting_values = ["A"]

# Create cutoff times table necessary for the training window
cutoff_times = pd.DataFrame()
cutoff_times['instance_id'] = es['basket'].df['Ticket_id']
cutoff_times['time'] = es['basket'].df['trans_date']

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="basket",
                                      agg_primitives=["count", "sum"],
                                      where_primitives=["count", "sum"],
                                      cutoff_time=cutoff_times,
                                      cutoff_time_in_index=True,
                                      training_window='365 days')
display(feature_matrix)
Input data:
Item_detail-
Ticket_id Customer_id trans_date SKU Qty Amount index
1 1 2018-05-02 A 2.0 10 0
1 1 2018-05-02 A 1.0 10 1
2 1 2018-05-28 B 1.0 40 2
3 1 2018-06-13 A 2.0 30 3
4 1 2019-08-20 C 3.0 60 4
Ticket_detail-
Customer_id trans_date Ticket_id
1 2018-05-02 1
1 2018-05-28 2
1 2018-06-13 3
1 2019-08-20 4
Code output:
Ticket_id time Customer_id COUNT(transactions) SUM(transactions.Qty) SUM(transactions.Amount) DAY(trans_date) YEAR(trans_date) MONTH(trans_date) WEEKDAY(trans_date) COUNT(transactions WHERE SKU = A) SUM(transactions.Qty WHERE SKU = A) SUM(transactions.Amount WHERE SKU = A)
1 2018-05-02 1 2 3.0 20 2 2018 5 2 2.0 3.0 20.0
2 2018-05-28 1 1 1.0 40 28 2018 5 0 0.0 0.0 0.0
3 2018-06-13 1 1 2.0 30 13 2018 6 2 1.0 2.0 30.0
4 2019-08-20 1 1 3.0 60 20 2019 8 1 0.0 0.0 0.0
Expected output
(for columns COUNT(transactions WHERE SKU = A) SUM(transactions.Qty WHERE SKU = A) SUM(transactions.Amount WHERE SKU = A)):
Ticket_id time Customer_id COUNT(transactions) SUM(transactions.Qty) SUM(transactions.Amount) DAY(trans_date) YEAR(trans_date) MONTH(trans_date) WEEKDAY(trans_date) COUNT(transactions WHERE SKU = A) SUM(transactions.Qty WHERE SKU = A) SUM(transactions.Amount WHERE SKU = A)
1 2018-05-02 1 2 3.0 20 2 2018 5 2 2.0 3.0 20.0
2 2018-05-28 1 1 1.0 40 28 2018 5 0 0.0 0.0 0.0
3 2018-06-13 1 1 2.0 30 13 2018 6 2 3.0 5.0 50.0
4 2019-08-20 1 1 3.0 60 20 2019 8 1 0.0 0.0 0.0

In the example above, you are using interesting values with the training window correctly. In the DFS call, the aggregation features are calculated per basket, so the output feature COUNT(transactions WHERE SKU = A) for Ticket ID 3 is 1: there is only one transaction for Ticket ID 3 where SKU is A in the item details. The same reasoning applies to the other expected output features. Let me know if this helps.
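For intuition, here is a minimal sketch of what the where-aggregation computes, written in plain pandas rather than featuretools (the names where_a and per_basket are just illustrative). It reproduces the 1 / 2.0 / 30 that DFS returns for Ticket ID 3, rather than the window-wide totals assumed in the expected output:

import pandas as pd

item_detail = pd.DataFrame(
    [[1, '1', '2018-05-02', 'A', 2.0, 10],
     [1, '1', '2018-05-02', 'A', 1.0, 10],
     [2, '1', '2018-05-28', 'B', 1.0, 40],
     [3, '1', '2018-06-13', 'A', 2.0, 30],
     [4, '1', '2019-08-20', 'C', 3.0, 60]],
    columns=['Ticket_id', 'Customer_id', 'trans_date', 'SKU', 'Qty', 'Amount'])

# COUNT/SUM(... WHERE SKU = A) only look at each basket's own transactions,
# not at every transaction of the customer inside the training window.
where_a = item_detail[item_detail['SKU'] == 'A']
per_basket = where_a.groupby('Ticket_id').agg(
    count_where_A=('SKU', 'size'),
    sum_qty_where_A=('Qty', 'sum'),
    sum_amount_where_A=('Amount', 'sum'))
print(per_basket)  # Ticket 3 -> 1, 2.0, 30, matching the DFS output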

Related

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
The first difference 1 is the result of subtracting 20 from 21: (first "c" value - last "b" value).
I'm open to NumPy transformations as well.
Aggregate with GroupBy.agg using GroupBy.first and GroupBy.last, then subtract the shifted last values from the first values, and finally drop the first row by position:
df = df.reset_index()
df1 = df.groupby('A', as_index=False, sort=False).agg(first=('X', 'first'),
                                                      last=('X', 'last'),
                                                      Time=('Time', 'first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time', 'Difference']].iloc[1:].reset_index(drop=True)
print(df1)
Time Difference
0 2 7
1 4 1
2 6 4
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
   .pivot(index='A', columns='col', values='X')
   .ffill(axis=1)
   .assign(Time=g['Time'].first(),
           diff=lambda d: d[0] - d[1].shift())
   [['Time', 'diff']].iloc[1:]
   .rename_axis(index=None, columns=None)
)
Output:
  Time  diff
b    2   7.0
c    4   1.0
d    6   4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time diff
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution:
(df.assign(Y = df['X'].shift())
   .iloc[df.index % 2 == 0]
   .assign(Difference = lambda z: z['X'] - z['Y'])
   .reset_index()
   .loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0

Compute difference in column value for n-days before in time series in Pandas

I have a table like the one below. Each row contains the temperature in a city and the date. A given date can be duplicated, but for each city the temperature on the same day is the same. I want a new column with the change in temperature from the day before. For example, for 2 January 2019 (rows 5-8), the change in T for city 1 is 5º (20-15), and for city 2 the change is 1º (19-18).
I've tried pandas grouping, transform and merge operations, but cannot get it to work. Of course, a for loop works, but it's quite slow. I would also like other columns with changes in Tº over more than 1 day.
Index  Date        City  Temp  Temp Diff 1 Day
1      01/01/2019  1     15    na
2      01/01/2019  1     15    na
3      01/01/2019  2     18    na
4      01/01/2019  3     10    na
5      02/01/2019  1     20    5 (20-15)
6      02/01/2019  2     19    1 (19-18)
7      02/01/2019  2     19    1 (19-18)
8      02/01/2019  2     25    1
9      03/01/2019  3     22    na (nothing 2 Jan)
10     03/01/2019  1     22    2 (22-20)
Edit: I'm sorry, I did not include the fact that there may be no data for a city on a given date. I've inserted a row for city 3 on the first day (1 Jan) and another for 3 Jan. Because city 3 has nothing on 2 Jan, the row for city 3 on 3 Jan should report NA.
It's not the most efficient solution, but it seems to work:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df.drop_duplicates(['Date', 'City']).copy()
# shift each (Date, City) record one day forward so it lines up with the next day's rows
df1['Date_prev'] = df1['Date'] + pd.Timedelta(days=1)
df_temp = df.merge(df1, how='left',
                   left_on=['Date', 'City'],
                   right_on=['Date_prev', 'City'])
df['Temp Diff 1 Day'] = df_temp['Temp_x'] - df_temp['Temp_y']
>>> df
Index Date City Temp Temp Diff 1 Day
0 1 2019-01-01 1 15 NaN
1 2 2019-01-01 1 15 NaN
2 3 2019-01-01 2 18 NaN
3 4 2019-01-01 3 10 NaN
4 5 2019-01-02 1 20 5.0
5 6 2019-01-02 2 19 1.0
6 7 2019-01-02 2 19 1.0
7 8 2019-01-02 2 19 1.0
8 9 2019-01-03 3 22 NaN
9 10 2019-01-03 1 22 2.0
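The question also asks for columns with the change over more than one day. Here is a sketch of one way to generalize the merge idea above, parameterized by the lag (the helper name add_temp_diff and the column name pattern are illustrative, not from the original answer):

import pandas as pd

def add_temp_diff(df, n_days):
    """Add a 'Temp Diff {n_days} Day' column: Temp minus the same city's Temp
    n_days earlier; NaN when that earlier date is missing for the city."""
    daily = df.drop_duplicates(['Date', 'City'])[['Date', 'City', 'Temp']].copy()
    daily['Date'] = daily['Date'] + pd.Timedelta(days=n_days)  # shift forward by n days
    merged = df.merge(daily, on=['Date', 'City'], how='left', suffixes=('', '_prev'))
    df[f'Temp Diff {n_days} Day'] = (merged['Temp'] - merged['Temp_prev']).values
    return df

# usage (assumes df['Date'] is already datetime, as above):
# df = add_temp_diff(df, 1)
# df = add_temp_diff(df, 2)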
One way to achieve this is as follows:
import pandas as pd

data = {'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
        'Date': {0: '01/01/2019', 1: '01/01/2019', 2: '01/01/2019',
                 3: '02/01/2019', 4: '02/01/2019', 5: '02/01/2019',
                 6: '02/01/2019', 7: '03/01/2019', 8: '03/01/2019'},
        'City': {0: 1, 1: 1, 2: 2, 3: 1, 4: 2, 5: 2, 6: 3, 7: 1, 8: 1},
        'Temp': {0: 15, 1: 15, 2: 18, 3: 20, 4: 19, 5: 19, 6: 25, 7: 22, 8: 22},
        'Temp Diff 1 Day': {0: 'na', 1: 'na', 2: 'na', 3: '5 (20-15)',
                            4: '1 (19-18)', 5: '1 (19-18)', 6: 'na',
                            7: '2 (22-20)', 8: '2 (22-20)'}}
df = pd.DataFrame(data)

# create a duplicated mask that keeps the first occurrence
dupl_keep_first = df.duplicated(subset=['Date', 'City'], keep='first')
# create another that keeps none
dupl_keep_false = df.duplicated(subset=['Date', 'City'], keep=False)
# get diff on groupby of all rows *except* dupl_keep_first (N.B.: ~ operator "flips" True/False)
df['Temp Diff 1 Day'] = df.loc[~dupl_keep_first].groupby('City')['Temp'].diff()
# now select *all* duplicates and ffill to add the same diff values to all duplicated rows
df.loc[dupl_keep_false, 'Temp Diff 1 Day'] = \
    df.loc[dupl_keep_false, 'Temp Diff 1 Day'].ffill()
print(df)
Index Date City Temp Temp Diff 1 Day
0 1 01/01/2019 1 15 NaN
1 2 01/01/2019 1 15 NaN
2 3 01/01/2019 2 18 NaN
3 4 02/01/2019 1 20 5.0
4 5 02/01/2019 2 19 1.0
5 6 02/01/2019 2 19 1.0
6 7 02/01/2019 3 25 NaN
7 8 03/01/2019 1 22 2.0
8 9 03/01/2019 1 22 2.0
# Group by city, perform a transform where you drop duplicates and find the diff.
df['Temp_diff'] = df.groupby('City')['Temp'].transform(lambda x: x.drop_duplicates().diff())
# Then group by city and date, and ffill the values.
df['Temp_diff'] = df.groupby(['City', 'Date'])['Temp_diff'].ffill()
print(df)
Output:
Index Date City Temp Temp_diff
0 1 2019-01-01 1 15 NaN
1 2 2019-01-01 1 15 NaN
2 3 2019-01-01 2 18 NaN
3 4 2019-02-01 1 20 5.0
4 5 2019-02-01 2 19 1.0
5 6 2019-02-01 2 19 1.0
6 7 2019-02-01 3 25 NaN
7 8 2019-03-01 1 22 2.0
8 9 2019-03-01 1 22 2.0

Python - count and difference between data frames

I have two data frames about occupation and industry in 2005 and 2006. I would like to create a df with a column holding the change between these years, i.e. whether the Result grew or decreased. Here is a sample:
import pandas as pd
d = {'OCC2005': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4321,4321, 3333], 'IND2005': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5], 'Result': [7, 8, 12, 1, 11,15,20,1,5,12,8,4,3]}
df = pd.DataFrame(data=d)
print(df)
d2 = {'OCC2006': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4361,4321, 3333,4444], 'IND2006': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5,8], 'Result': [17, 18, 12, 1, 1,5,20,1,5,2,18,4,0,15]}
df2 = pd.DataFrame(data=d2)
print(df2)
Final_Result = df2['Result'] - df['Result']
print(Final_Result)
I would like to create a df with occ, ind and final_result columns.
Rename columns of df to match column names of df2:
MAP = dict(zip(df.columns, df2.columns))
out = (df2.set_index(['OCC2006', 'IND2006'])
          .sub(df.rename(columns=MAP).set_index(['OCC2006', 'IND2006']))
          .reset_index())
print(out)
# Output
OCC2006 IND2006 Result
0 1234 4 10.0
1 1234 5 10.0
2 1234 6 0.0
3 1234 7 0.0
4 2357 4 0.0
5 2357 5 -10.0
6 2357 6 -10.0
7 2357 7 0.0
8 3333 5 -3.0
9 4321 4 0.0
10 4321 5 NaN
11 4321 6 0.0
12 4321 7 -10.0
13 4361 5 NaN
14 4444 8 NaN
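A possible refinement (my addition, not part of the original answer, reusing df, df2 and MAP from above): if occupation/industry combinations that exist in only one year should show the change from zero instead of NaN, DataFrame.sub accepts a fill_value:

out = (df2.set_index(['OCC2006', 'IND2006'])
          .sub(df.rename(columns=MAP).set_index(['OCC2006', 'IND2006']), fill_value=0)
          .reset_index())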

Calculating temporal and spatial gradients while using groupby in a multi-index pandas dataframe

Say I have the following sample pandas dataframe of water content (i.e. "wc") values at specified depths along a column of soil:
import pandas as pd
df = pd.DataFrame([[1, 2,5,3,1], [1, 3, 5,3, 2], [4, 6, 6,3,1], [1, 2,5,3,1], [1, 3, 5,3, 2], [4, 6, 6,3,1]], columns=pd.MultiIndex.from_product([['wc'], [10, 20, 30, 45, 80]]))
df['model'] = [5,5, 5, 6,6,6]
df['time'] = [0, 1, 2,0, 1, 2]
df.set_index(['time', 'model'], inplace=True)
>> df
[Out]:
wc
10 20 30 45 80
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
I would like to calculate the spatial (between columns) and temporal (between rows) gradients for each model "group" in the following structure:
wc temp_grad spat_grad
10 20 30 45 80 10 20 30 45 80 10 20 30 45
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
My attempt involved writing a function first for the temporal gradients and combining this with groupby:
import numpy as np

def temp_grad(df):
    temp_grad = np.gradient(df[('wc', 10.0)], df.index.get_level_values(0))
    return pd.Series(temp_grad, index=df.index)

df[('temp_grad', 10.0)] = (df.groupby(level=['model'], group_keys=False)
                             .apply(temp_grad))
but I am not sure how to automate this to apply for all wc columns as well as navigate the multi-indexing issues.
Assuming the function you wrote is what you actually want, then for temp_grad you can handle all the columns at once inside the apply: use np.gradient the same way you did in your function, but specify axis=0 (rows), and build a DataFrame with the same index and columns as the original data. For spat_grad, the model grouping does not really matter, so no groupby is needed: apply np.gradient directly on df['wc'], this time along axis=1 (columns), and build a DataFrame the same way. To get the expected output, concat all three of them like this:
df = pd.concat([
    df['wc'],  # original data
    # add the temp_grad
    df['wc'].groupby(level=['model'], group_keys=False)
            .apply(lambda x:  # do all the columns at once, specifying the axis in gradient
                   pd.DataFrame(np.gradient(x, x.index.get_level_values(0), axis=0),
                                columns=x.columns, index=x.index)),  # build a dataframe
    # for spat, no need of groupby as it is a row-wise operation
    # change the axis, and the values for the x
    pd.DataFrame(np.gradient(df['wc'], df['wc'].columns, axis=1),
                 columns=df['wc'].columns, index=df['wc'].index)
    ],
    keys=['wc', 'temp_grad', 'spat_grad'],  # redefine the multiindex columns
    axis=1  # concat along the columns
)
and you get
print(df)
wc temp_grad spat_grad \
10 20 30 45 80 10 20 30 45 80 10 20
time model
0 5 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 5 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 5 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
0 6 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 6 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 6 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
30 45 80
time model
0 5 0.126667 -0.110476 -0.057143
1 5 0.066667 -0.101905 -0.028571
2 5 -0.080000 -0.157143 -0.057143
0 6 0.126667 -0.110476 -0.057143
1 6 0.066667 -0.101905 -0.028571
2 6 -0.080000 -0.157143 -0.057143
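As a quick sanity check on the temp_grad numbers (my verification, not part of the original answer): np.gradient uses one-sided differences at the endpoints and central differences in between, so for the ('wc', 10) column of model 5 the values [1, 1, 4] at times [0, 1, 2] give exactly the 0.0, 1.5, 3.0 shown above:

import numpy as np

wc10 = np.array([1, 1, 4])   # ('wc', 10) for model 5 at times 0, 1, 2
t = np.array([0, 1, 2])
print(np.gradient(wc10, t))  # [0.  1.5 3. ] -> matches the temp_grad 10 column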

Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can look like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
or like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The only difference between the two is that sometimes one zone, or several but not all, has data for the latest time period (column date). My desired result is to complete the dataframe up to a given period (3 in the example), in the following way for each case:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the result above.
I hope my explanation is clear.
Many thanks in advance.
You can use reindex to create entries for all dates in the range, and then forward-fill the last value into them.
import pandas as pd

df1 = pd.DataFrame([['A', 1, 154, 2],
                    ['B', 1, 2647, 7],
                    ['C', 1, 0, 0],
                    ['A', 2, 1280, 3],
                    ['B', 2, 6809, 20],
                    ['C', 2, 288, 5],
                    ['A', 3, 2000, 4]],
                   columns=['zone', 'date', 'p1', 'p2'])
result = df1.groupby("zone").apply(lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5
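The same line also covers the df2 case (my illustration, not part of the original answer): dropping the ('A', 3) row to mimic df2 and rerunning the groupby/reindex carries each zone's period-2 values forward to period 3, which matches df2_result:

df2 = df1.iloc[:-1]  # df2 from the question is df1 without its last row
result2 = df2.groupby("zone").apply(lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result2)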
IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna to get the max from each subgroup of zone you have.
First, build your index:
import numpy as np

ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n = len(levels[0])
labels = [np.tile(np.arange(n), n), np.repeat(np.arange(0, n), n)]
Then use the pd.MultiIndex constructor to reindex (note: newer pandas versions call the labels argument codes):
df1.set_index(['zone', 'date'])\
   .reindex(pd.MultiIndex(levels=levels, labels=labels))\
   .fillna(df1.groupby(['zone']).max())
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2, just replace df1 with df2 in this last block of code and you get
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 1280.0 3.0
B 3 6809.0 20.0
C 3 288.0 5.0
I suggest not copying and pasting the code directly and running it, but rather trying to understand the process and making slight changes as needed, depending on how your original data frame differs from what you posted.