Python - count and Difference data frames - pandas

I have two data frames about occupation in industry in 2005 and 2006. I would like to create a df with a column holding the change between these years, i.e. whether each value grew or decreased. Here is a sample:
import pandas as pd
d = {'OCC2005': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4321,4321, 3333], 'IND2005': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5], 'Result': [7, 8, 12, 1, 11,15,20,1,5,12,8,4,3]}
df = pd.DataFrame(data=d)
print(df)
d2 = {'OCC2006': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4361,4321, 3333,4444], 'IND2006': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5,8], 'Result': [17, 18, 12, 1, 1,5,20,1,5,2,18,4,0,15]}
df2 = pd.DataFrame(data=d2)
print(df2)
Final_Result = df2['Result'] - df['Result']
print(Final_Result)
I would like to create a df with the columns occ, ind and final_result.

Rename columns of df to match column names of df2:
MAP = dict(zip(df.columns, df2.columns))
out = (df2.set_index(['OCC2006', 'IND2006'])
          .sub(df.rename(columns=MAP).set_index(['OCC2006', 'IND2006']))
          .reset_index())
print(out)
# Output
OCC2006 IND2006 Result
0 1234 4 10.0
1 1234 5 10.0
2 1234 6 0.0
3 1234 7 0.0
4 2357 4 0.0
5 2357 5 -10.0
6 2357 6 -10.0
7 2357 7 0.0
8 3333 5 -3.0
9 4321 4 0.0
10 4321 5 NaN
11 4321 6 0.0
12 4321 7 -10.0
13 4361 5 NaN
14 4444 8 NaN
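The same result can also be reached with an outer merge instead of index alignment. A minimal sketch using the sample data above (the `_2005`/`_2006` suffixes are chosen here for illustration):

```python
import pandas as pd

d = {'OCC2005': [1234, 1234, 1234, 1234, 2357, 2357, 2357, 2357, 4321, 4321, 4321, 4321, 3333],
     'IND2005': [4, 5, 6, 7, 5, 6, 7, 4, 6, 7, 5, 4, 5],
     'Result': [7, 8, 12, 1, 11, 15, 20, 1, 5, 12, 8, 4, 3]}
df = pd.DataFrame(data=d)
d2 = {'OCC2006': [1234, 1234, 1234, 1234, 2357, 2357, 2357, 2357, 4321, 4321, 4361, 4321, 3333, 4444],
      'IND2006': [4, 5, 6, 7, 5, 6, 7, 4, 6, 7, 5, 4, 5, 8],
      'Result': [17, 18, 12, 1, 1, 5, 20, 1, 5, 2, 18, 4, 0, 15]}
df2 = pd.DataFrame(data=d2)

# align the key column names, then outer-merge so unmatched pairs survive as NaN
out = (df.rename(columns={'OCC2005': 'OCC2006', 'IND2005': 'IND2006'})
         .merge(df2, on=['OCC2006', 'IND2006'], how='outer',
                suffixes=('_2005', '_2006')))
out['Result'] = out['Result_2006'] - out['Result_2005']
out = out[['OCC2006', 'IND2006', 'Result']]
print(out)
```

Key pairs present in only one of the frames, such as (4321, 5) or (4444, 8), end up with NaN, exactly as in the index-subtraction approach.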

Setting multiple columns at once gives "Not in index" error

import pandas as pd
df = pd.DataFrame(
    [
        [5, 2],
        [3, 5],
        [5, 5],
        [8, 9],
        [90, 55]
    ],
    columns=['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
= 'overspeed', df['max_speed'] - df['shield']
I am setting multiple columns using .loc as above, but in some cases I get a "Not in index" error. Am I doing something wrong above?
Create a list of tuples with the same length as the number of True values in the mask, pairing the repeated scalar overspeed with the filtered difference Series:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
Another idea with a helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
Simplest is to assign separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
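If you prefer to avoid building the tuple list, a sketch of the same idea using Series.where, which keeps values where the mask is True and inserts NaN elsewhere (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'max_speed': [5, 3, 5, 8, 90],
                   'shield': [2, 5, 5, 9, 55]})
m = df['max_speed'] > df['shield']
# where(m) keeps the value where m is True and sets NaN elsewhere
df['delta'] = (df['max_speed'] - df['shield']).where(m)
df['stat'] = pd.Series('overspeed', index=df.index).where(m)
print(df)
```

Because both right-hand sides are full-length Series, there is no length mismatch between the selection and the assigned values, which is what triggers the "Not in index" problem.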

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
The difference 1 is the result of subtracting 20 from 21 (first "c" value - last "b" value).
I'm open to NumPy transformations as well.
Aggregate with GroupBy.agg using GroupBy.first and GroupBy.last, then subtract the shifted last values, and finally drop the first row by position:
df = df.reset_index()
df1 = df.groupby('A', as_index=False, sort=False).agg(first=('X', 'first'),
                                                      last=('X', 'last'),
                                                      Time=('Time', 'first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time','Difference']].iloc[1:].reset_index(drop=True)
print (df1)
Time Difference
0 2 7
1 4 1
2 6 4
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
   .pivot(index='A', columns='col', values='X')
   .ffill(axis=1)
   .assign(Time=g['Time'].first(),
           diff=lambda d: d[0] - d[1].shift())
   [['Time', 'diff']].iloc[1:]
   .rename_axis(index=None, columns=None)
)
output:
Time diff
b 2 7.0
c 4 1.0
d 6 4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time diff
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution:
(df.assign(Y=df['X'].shift())
   .iloc[df.index % 2 == 0]
   .assign(Difference=lambda z: z['X'] - z['Y'])
   .reset_index()
   .loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0
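Essentially the same computation can be written more compactly by taking per-group first/last values directly. A sketch assuming, as in the question, that each group's first value is compared with the previous group's last value:

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6],
                   'A': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')

g = df.reset_index().groupby('A', sort=False)
first = g.first()                   # first Time and X of each group
prev_last = g['X'].last().shift()   # last X of the previous group
out = (pd.DataFrame({'Time': first['Time'],
                     'Difference': first['X'] - prev_last})
         .iloc[1:]                  # drop group 'a', which has no predecessor
         .reset_index(drop=True))
print(out)
```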

Replace all values from one pandas dataframe to another without extra columns

These are my two dataframes:
df1 = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],'num_legs': [2, 4, 8, 0],'num_wings': [2, 0, 0, 0],'num_specimen_seen': [10, 2, 1, 8]})
df2 = pd.DataFrame({'animal': ['falcon', 'dog'],'num_legs': [4, 2],'num_wings': [0, 2],'num_specimen_seen': [2, 10]})
When I use a left join, this is the result:
merge = df1.merge(df2, on='animal', how='left')
Output:
animal num_legs_x num_wings_x num_specimen_seen_x num_legs_y num_wings_y num_specimen_seen_y
falcon 2 2 10 4 0 2
dog 4 0 2 2 2 10
spider 8 0 1 NaN NaN NaN
fish 0 0 8 NaN NaN NaN
I am looking for an output like this , where row 1 and 2 values are replaced by values coming from df2 :
animal num_legs num_wings num_specimen_seen
falcon 4 0 2
dog 2 2 10
spider 8 0 1
fish 0 0 8
I attempted using np.where but couldn't write it correctly:
df = np.where(df1.animal == df2.animal, ?, ?)
Maybe a left join isn't the correct way to achieve what I want. I am new to pandas, any help would be appreciated.
Let us use DataFrame.update:
df1 = df1.set_index('animal')
df1.update(df2.set_index('animal'))
df1 = df1.reset_index()
df1
animal num_legs num_wings num_specimen_seen
0 falcon 4.0 0.0 2.0
1 dog 2.0 2.0 10.0
2 spider 8.0 0.0 1.0
3 fish 0.0 0.0 8.0
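A combine_first-based sketch gives the same values; note that combine_first sorts rows and columns alphabetically during alignment, so the original order is restored explicitly here:

```python
import pandas as pd

df1 = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],
                    'num_legs': [2, 4, 8, 0],
                    'num_wings': [2, 0, 0, 0],
                    'num_specimen_seen': [10, 2, 1, 8]})
df2 = pd.DataFrame({'animal': ['falcon', 'dog'],
                    'num_legs': [4, 2],
                    'num_wings': [0, 2],
                    'num_specimen_seen': [2, 10]})

# df2 values win wherever the animal exists in both frames
out = (df2.set_index('animal')
          .combine_first(df1.set_index('animal'))
          .reindex(df1['animal'])    # restore df1's row order
          .reset_index())
out = out[df1.columns]               # restore df1's column order
print(out)
```

As with update, the numeric columns come back as floats because alignment introduces intermediate NaNs.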

How to use interesting values with training window in feature tools?

Code:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
#Create item details table
l = [[1, '1', '2018-05-02', 'A', 2.0, 10],
     [1, '1', '2018-05-02', 'A', 1.0, 10],
     [2, '1', '2018-05-28', 'B', 1.0, 40],
     [3, '1', '2018-06-13', 'A', 2.0, 30],
     [4, '1', '2019-08-20', 'C', 3.0, 60]]
item_detail = pd.DataFrame(l)
item_detail.columns = ['Ticket_id','Customer_id','trans_date','SKU','Qty','Amount']
item_detail["trans_date"] = pd.to_datetime(item_detail["trans_date"])
item_detail["index"] = item_detail.index
display(item_detail)
#Create ticket details table
b = [['1', '2018-05-02', 1],
     ['1', '2018-05-28', 2],
     ['1', '2018-06-13', 3],
     ['1', '2019-08-20', 4]]
ticket_detail = pd.DataFrame(b)
ticket_detail.columns = ['Customer_id','trans_date','Ticket_id']
ticket_detail["trans_date"] = pd.to_datetime(ticket_detail["trans_date"])
display(ticket_detail)
#Create feature tools relationships & entities
es = ft.EntitySet(id = 'customer_features')
es = es.entity_from_dataframe(entity_id="basket",dataframe=ticket_detail,index="Ticket_id",time_index="trans_date")
es.entity_from_dataframe(entity_id='transactions', dataframe= item_detail,index = 'index')
tr_relationship = ft.Relationship(es["basket"]["Ticket_id"],es["transactions"]["Ticket_id"])
es = es.add_relationships([tr_relationship])
print(es)
es["transactions"]["SKU"].interesting_values = ["A"]
#Create cutoff times table necessary for training window
cutoff_times = pd.DataFrame()
cutoff_times['instance_id'] = es['basket'].df['Ticket_id']
cutoff_times['time'] = es['basket'].df['trans_date']
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="basket",
                                      agg_primitives=["count", "sum"],
                                      where_primitives=["count", "sum"],
                                      cutoff_time=cutoff_times,
                                      cutoff_time_in_index=True,
                                      training_window='365 days')
display(feature_matrix)
Input data:
Item_detail-
Ticket_id Customer_id trans_date SKU Qty Amount index
1 1 2018-05-02 A 2.0 10 0
1 1 2018-05-02 A 1.0 10 1
2 1 2018-05-28 B 1.0 40 2
3 1 2018-06-13 A 2.0 30 3
4 1 2019-08-20 C 3.0 60 4
Ticket_detail-
Customer_id trans_date Ticket_id
1 2018-05-02 1
1 2018-05-28 2
1 2018-06-13 3
1 2019-08-20 4
Code output:
Ticket_id time Customer_id COUNT(transactions) SUM(transactions.Qty) SUM(transactions.Amount) DAY(trans_date) YEAR(trans_date) MONTH(trans_date) WEEKDAY(trans_date) COUNT(transactions WHERE SKU = A) SUM(transactions.Qty WHERE SKU = A) SUM(transactions.Amount WHERE SKU = A)
1 2018-05-02 1 2 3.0 20 2 2018 5 2 2.0 3.0 20.0
2 2018-05-28 1 1 1.0 40 28 2018 5 0 0.0 0.0 0.0
3 2018-06-13 1 1 2.0 30 13 2018 6 2 1.0 2.0 30.0
4 2019-08-20 1 1 3.0 60 20 2019 8 1 0.0 0.0 0.0
Expected output
(for columns COUNT(transactions WHERE SKU = A) SUM(transactions.Qty WHERE SKU = A) SUM(transactions.Amount WHERE SKU = A)):
Ticket_id time Customer_id COUNT(transactions) SUM(transactions.Qty) SUM(transactions.Amount) DAY(trans_date) YEAR(trans_date) MONTH(trans_date) WEEKDAY(trans_date) COUNT(transactions WHERE SKU = A) SUM(transactions.Qty WHERE SKU = A) SUM(transactions.Amount WHERE SKU = A)
1 2018-05-02 1 2 3.0 20 2 2018 5 2 2.0 3.0 20.0
2 2018-05-28 1 1 1.0 40 28 2018 5 0 0.0 0.0 0.0
3 2018-06-13 1 1 2.0 30 13 2018 6 2 3.0 5.0 50.0
4 2019-08-20 1 1 3.0 60 20 2019 8 1 0.0 0.0 0.0
In the example above, you are correctly using the interesting values with the training window. In the DFS call, the aggregation features are calculated per basket. So, the output feature COUNT(transactions WHERE SKU = A) for Ticket ID 3 is 1, because there is only one transaction for Ticket ID 3 where SKU is A in Item Details. The same reason applies for the other expected output features. Let me know if this helps.
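The per-ticket filtered count can be double-checked in plain pandas, independently of Featuretools. A quick sketch over the item details above:

```python
import pandas as pd

item_detail = pd.DataFrame({
    'Ticket_id': [1, 1, 2, 3, 4],
    'SKU': ['A', 'A', 'B', 'A', 'C'],
    'Qty': [2.0, 1.0, 1.0, 2.0, 3.0],
    'Amount': [10, 10, 40, 30, 60],
})

# COUNT(transactions WHERE SKU = A), computed per basket (Ticket_id)
counts = item_detail[item_detail['SKU'] == 'A'].groupby('Ticket_id').size()
print(counts)  # Ticket 1 -> 2, Ticket 3 -> 1, matching the actual DFS output
```

Only two rows belong to Ticket 1 with SKU A and one row to Ticket 3, so the DFS output of 2 and 1 (rather than the cumulative 3 in the expected table) is correct.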

Take product of columns in dataframe with lags

I have the following dataframe:
A = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])
B = pd.Series([6, 7, 8, 9], index=[1, 2, 3, 4])
Aw = pd.Series([0.25, 0.3, 0.33, 0.36], index=[1, 2, 3, 4])
Bw = pd.Series([0.75, 0.7, 0.67, 0.65], index=[1, 2, 3, 4])
df = pd.DataFrame({'A': A, 'B': B, 'Aw': Aw, 'Bw': Bw})
df
Index A B Aw Bw
1 2 6 0.25 0.75
2 3 7 0.30 0.70
3 4 8 0.33 0.67
4 5 9 0.36 0.65
What I would like to do is multiply 'A' and lag of 'Aw' and likewise 'B' with 'Bw'. The resulting dataframe will look like the following:
Index A B Aw Bw A_ctr B_ctr
1 2 6 NaN NaN NaN NaN
2 3 7 0.25 0.75 0.75 5.25
3 4 8 0.3 0.7 1.2 5.6
4 5 9 0.33 0.67 1.65 6.03
Thank you in advance
To get your desired output, first shift Aw and Bw, then multiply them by A and B:
df[['Aw','Bw']] = df[['Aw','Bw']].shift()
df[['A_ctr','B_ctr']] = df[['A','B']].to_numpy() * df[['Aw','Bw']].to_numpy()
A B Aw Bw A_ctr B_ctr
1 2 6 NaN NaN NaN NaN
2 3 7 0.25 0.75 0.75 5.25
3 4 8 0.30 0.70 1.20 5.60
4 5 9 0.33 0.67 1.65 6.03
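If you'd rather keep the original Aw/Bw columns untouched, the lag can be applied inline instead of overwriting them first. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 5], 'B': [6, 7, 8, 9],
                   'Aw': [0.25, 0.30, 0.33, 0.36],
                   'Bw': [0.75, 0.70, 0.67, 0.65]},
                  index=[1, 2, 3, 4])

# multiply each value column by the previous row's weight, leaving Aw/Bw as-is
df['A_ctr'] = df['A'] * df['Aw'].shift()
df['B_ctr'] = df['B'] * df['Bw'].shift()
print(df)
```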