How to apply subtraction to groupby object - pandas

I have a dataframe like this
import pandas as pd

test = pd.DataFrame({'category': [1, 1, 2, 2, 3, 3],
                     'type': ['new', 'old', 'new', 'old', 'new', 'old'],
                     'ratio': [0.1, 0.2, 0.2, 0.4, 0.4, 0.8]})
   category    ratio type
0         1  0.10000  new
1         1  0.20000  old
2         2  0.20000  new
3         2  0.40000  old
4         3  0.40000  new
5         3  0.80000  old
I would like to subtract each category's old ratio from the new ratio, but I'm not sure how to reshape the DataFrame to do so.

Use DataFrame.pivot first, so the subtraction becomes easy:
df = test.pivot(index='category', columns='type', values='ratio')
df['val'] = df['old'] - df['new']
print (df)
type      new  old  val
category               
1         0.1  0.2  0.1
2         0.2  0.4  0.2
3         0.4  0.8  0.4

Another approach
df = (test.groupby('category')
          .apply(lambda x: x.loc[x['type'] == 'old', 'ratio'].iloc[0]
                         - x.loc[x['type'] == 'new', 'ratio'].iloc[0])
          .reset_index(name='val'))
Output
   category  val
0         1  0.1
1         2  0.2
2         3  0.4
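A third option, as a minimal sketch: because each category in the example has exactly two rows ordered 'new' then 'old' (an assumption), a plain groupby diff yields the same old - new values without any reshaping:
# assumes each category has exactly two rows, ordered 'new' then 'old'
val = test.groupby('category')['ratio'].diff().dropna()
out = test.loc[val.index, ['category']].assign(val=val).reset_index(drop=True)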

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
import pandas as pd

x = pd.DataFrame({"int_var1": range(3),
                  "int_var2": range(3, 6),
                  "cat_var": pd.Categorical(["a", "b", "a"]),
                  "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I use groupby followed by agg, I seem to have only two options. Either I show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var       
0        3        a          0.1
1        4        b          0.2
2        5        a          0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var       
0        3        a          0.1
                  b          0.0
         4        a          0.0
                  b          0.0
         5        a          0.0
                  b          0.0
1        3        a          0.0
                  b          0.0
         4        a          0.0
                  b          0.2
         5        a          0.0
                  b          0.0
2        3        a          0.0
                  b          0.0
         4        a          0.0
                  b          0.0
         5        a          0.3
                  b          0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
  .agg({'value': 'sum'})
  .unstack('cat_var', fill_value=0)
)
Output:
                  value     
cat_var               a    b
int_var1 int_var2           
0        3          0.1  0.0
1        4          0.0  0.2
2        5          0.3  0.0
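If you prefer the long format back, one row per observed (int_var1, int_var2) pair with both categories shown, stacking the unstacked level again should work; a short sketch:
long = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
          .agg({'value': 'sum'})
          .unstack('cat_var', fill_value=0)   # fill the missing category with 0
          .stack('cat_var')                   # back to one row per combination
          .reset_index())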

Excel sumproduct function in pandas dataframes

OK, as a Python beginner I have found that matrix multiplication in pandas dataframes is very difficult to carry out.
I have two tables that look like:
df1
   Id  lifetime    0    1    2    3    4    5 ...   30
0   1         4  0.1  0.2  0.1  0.4  0.5  0.4 ...  0.2
1   2         7  0.3  0.2  0.5  0.4  0.5  0.4 ...  0.2
2   3         8  0.5  0.2  0.1  0.4  0.5  0.4 ...  0.6
.......
9   6        10  0.3  0.2  0.5  0.4  0.5  0.4 ...  0.2
df2
   Group  lifetime    0    1    2    3    4    5 ...   30
0      2         4  0.9  0.8  0.9  0.8  0.8  0.8 ...  0.9
1      2         7  0.8  0.9  0.9  0.9  0.8  0.8 ...  0.9
2      3         8  0.9  0.7  0.8  0.8  0.9  0.9 ...  0.9
.......
9      5        10  0.8  0.9  0.7  0.7  0.9  0.7 ...  0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to sum over is given by the lifetime column of both dfs, e.g.:
for row 0 in df1 & df2, lifetime = 4:
sumproduct(df1 row 0 from column 0 to column 3,
           df2 row 0 from column 0 to column 3)
for row 1 in df1 & df2, lifetime = 7:
sumproduct(df1 row 1 from column 0 to column 6,
           df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to access rows and columns by integer position.
Row 0 is the row where lifetime == 4. Counting positions from the Id column, the column labeled 0 sits at position 2 and the column labeled 3 sits at position 5, so the interval you want is 2:6.
Once you get the correct data from both data frames with .iloc[0, 2:6], you can run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, try just running
df1.iloc[0,2:6]
Then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
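To do this for every row at once, here is a vectorized sketch; it assumes the two frames line up row for row and that the payoff columns 0 through 30 are everything to the right of the Id/Group and lifetime columns:
import numpy as np

# payoff columns are assumed to start at integer position 2 in both frames
v1 = df1.iloc[:, 2:].to_numpy()
v2 = df2.iloc[:, 2:].to_numpy()

# keep only the first `lifetime` columns of each row, zero out the rest
mask = np.arange(v1.shape[1]) < df1['lifetime'].to_numpy()[:, None]
df1['sumproduct'] = (v1 * v2 * mask).sum(axis=1)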

Pandas dividing rows from 2 df

Is it possible to perform row division between 2 dfs by matching columns? For example,
df1:
Name    1   2   3   5  Total
----------------------------
A       2   2   2   2      8
B       1   1   1   1      4
C       0   1   2   3      6
df2:
Alias   1   2   3   4  Total
----------------------------
X       5   5   5   5     20
Y      10  10   0   0     20
Z       1   2   3   4     10
The result would be:
r
NewName     1     2    3    4   5  Total
----------------------------------------  (these rows will be set manually)
I         2/5   2/5  2/5  0/5   -   8/20   <--- I = A/X
J         1/5   1/5  1/5  0/5   -   4/20   <--- J = B/X
K        1/10  1/10    -    -   -   4/20   <--- K = B/Y
L         0/5   1/5  2/5  0/5   -   6/20   <--- L = C/X
Thanks! :)
This needs an involved solution, but can be done. First, declare your manually controlled parameters.
i = ['A', 'B', 'B', 'C']
j = ['X', 'X', 'Y', 'X']
k = ['I', 'J', 'K', 'L']
Now, the idea is to align the two dataframes.
x = df1.set_index('Name')
y = df2.set_index('Alias')
x, y = x.align(y)
Perform division, and create a new dataframe. Since we're dividing numpy arrays, you might encounter runtime warnings. Ignore them.
z = x.reindex(i, axis=0).values / y.reindex(j, axis=0).values
df = pd.DataFrame(z, index=k, columns=x.columns)
df
     1    2         3   4   5  Total
I  0.4  0.4  0.400000 NaN NaN    0.4
J  0.2  0.2  0.200000 NaN NaN    0.2
K  0.1  0.1       inf NaN NaN    0.2
L  0.0  0.2  0.400000 NaN NaN    0.3
Edit: on older versions, reindex does not accept an axis parameter. In that case, use
z = x.reindex(index=i).values / y.reindex(index=j).values
Additionally, to replace non-finite values with a placeholder, use np.isfinite:
df[np.isfinite(df)].fillna('-')
     1    2    3  4  5  Total
I  0.4  0.4  0.4  -  -    0.4
J  0.2  0.2  0.2  -  -    0.2
K  0.1  0.1    -  -  -    0.2
L  0.0  0.2  0.4  -  -    0.3
I = df1.T['A']/df2.T['X']
J = df1.T['B']/df2.T['X']
K = df1.T['B']/df2.T['Y']
L = df1.T['C']/df2.T['X']
df = (pd.concat([I, J, K, L], axis=1)
        .rename(columns={0: 'I', 1: 'J', 2: 'K', 3: 'L'})
        .T)
Then, to make it look more like the output you wanted:
df[np.isfinite(df)].fillna('-')
Edit
More generally, to avoid writing out each division by hand, you can do:
pairs = [('A','X'), ('B','X'), ('B','Y'), ('C','X')]
series_to_concat = [df1.T[col_df1]/df2.T[col_df2] for (col_df1, col_df2) in pairs]
names = ['I', 'J', 'K', 'L']
col_names = {col_num : name for col_num, name in enumerate(names)}
df = pd.concat(series_to_concat, axis=1).rename(columns=col_names).T
It looks like you don't care about indices, so this should work:
r = df1.reset_index(drop=True) / df2.reset_index(drop=True)
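Note that / aligns on column names, so the string columns Name and Alias would not divide cleanly; dropping them first is safer. A minimal sketch under that assumption:
num1 = df1.drop(columns='Name').reset_index(drop=True)
num2 = df2.drop(columns='Alias').reset_index(drop=True)
# columns 4 and 5 appear in only one frame each and become NaN
r = num1 / num2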

Efficient method for using formulas in a pandas dataframe

I am trying to add a column to a dataframe based on a formula. I don't think my current solution is very pythonic or efficient, so I am looking for faster options.
I have a table with 3 columns:
import pandas as pd
df = pd.DataFrame([
    [1, 1, 20.0],
    [1, 2, 50.0],
    [1, 3, 30.0],
    [2, 1, 30.0],
    [2, 2, 40.0],
    [2, 3, 30.0],
    ],
    columns=['seg', 'reach', 'len']
)
# print df
df
   seg  reach   len
0    1      1  20.0
1    1      2  50.0
2    1      3  30.0
3    2      1  30.0
4    2      2  40.0
5    2      3  30.0
# Formula here
for index, row in df.iterrows():
    if row['reach'] == 1:
        df.loc[index, 'cumseglen'] = row['len'] * 0.5
    else:
        df.loc[index, 'cumseglen'] = (df.loc[index-1, 'cumseglen']
                                      + 0.5 * (df.loc[index-1, 'len'] + row['len']))
#print final results
df
   seg  reach   len  cumseglen
0    1      1  20.0       10.0
1    1      2  50.0       45.0
2    1      3  30.0       85.0
3    2      1  30.0       15.0
4    2      2  40.0       50.0
5    2      3  30.0       85.0
How can I improve the efficiency of the formula step?
To me this looks like a group-by operation. That is, within each "segment" group, you want to apply some operation to that group.
Here's one way to perform your calculation from above, using a group-by and some cumulative sums within each group:
import numpy as np

def cumulate(group):
    # half the running total, plus the previous row's half-total
    cuml = 0.5 * np.cumsum(group)
    return cuml + cuml.shift(1).fillna(0)

df['cumseglen'] = df.groupby('seg')['len'].apply(cumulate)
print(df)
The result:
   seg  reach   len  cumseglen
0    1      1  20.0       10.0
1    1      2  50.0       45.0
2    1      3  30.0       85.0
3    2      1  30.0       15.0
4    2      2  40.0       50.0
5    2      3  30.0       85.0
Algorithmically, this is not exactly the same as what you wrote, but under the assumption that the "reach" column starts from 1 at the beginning of each new segment indicated by the "seg" column, this should work.
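If you want to skip apply entirely, an equivalent fully vectorized sketch using a grouped cumulative sum (same arithmetic as cumulate, done column-wise):
# half the running total of 'len' within each segment
half = 0.5 * df.groupby('seg')['len'].cumsum()
# add the previous row's half-total within the same segment
df['cumseglen'] = half + half.groupby(df['seg']).shift(1).fillna(0)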

Transpose table then set and rename index

I want to transpose a table and rename the index.
If I display the df with existing index Time I get
Time   v1   v2
1     0.5  0.3
2     0.2  0.1
3     0.3  0.3
and after df.transpose() I'm at
Time    1    2    3
v1    0.5  0.2  0.3
v2    0.3  0.1  0.3
Interestingly, if I now do df.Time I get
AttributeError: 'DataFrame' object has no attribute 'Time'
although Time is displayed in the output.
I can't find a way to easily rename the column Time to Variable and set that as the new index.
I tried df.reset_index().set_index("index") but what I get is something that looks like this:
Time    1    2    3
index               
v1    0.5  0.2  0.3
v2    0.3  0.1  0.3
You only need to rename the columns-axis name with rename_axis:
print (df.transpose().rename_axis('Variable', axis=1))
Variable    1    2    3
v1        0.5  0.2  0.3
v2        0.3  0.1  0.3
Or set the new columns name by assigning to the name attribute:
df1 = df.transpose()
df1.columns.name = 'Var'
print (df1)
Var    1    2    3
v1   0.5  0.2  0.3
v2   0.3  0.1  0.3
But I think what you need is to turn the index into a new column, rename that column to var, and reset the columns name to None:
df1 = df.transpose().reset_index().rename(columns={'index':'var'})
df1.columns.name = None
print (df1)
  var    1    2    3
0  v1  0.5  0.2  0.3
1  v2  0.3  0.1  0.3
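If you then want var as the actual index (as the question asked), a final set_index should do it:
df1 = df1.set_index('var')
print(df1)
      1    2    3
var              
v1  0.5  0.2  0.3
v2  0.3  0.1  0.3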