Efficient method for using formulas in a pandas dataframe - pandas

I am trying to add a column to a dataframe based on a formula. I don't think my current solution is very pythonic or efficient, so I am looking for faster options.
I have a table with 3 columns:
import pandas as pd
df = pd.DataFrame([
    [1, 1, 20.0],
    [1, 2, 50.0],
    [1, 3, 30.0],
    [2, 1, 30.0],
    [2, 2, 40.0],
    [2, 3, 30.0],
], columns=['seg', 'reach', 'len'])
# print df
df
   seg  reach   len
0    1      1  20.0
1    1      2  50.0
2    1      3  30.0
3    2      1  30.0
4    2      2  40.0
5    2      3  30.0
# Formula here (.loc is used below; the long-deprecated .ix no longer exists in current pandas)
for index, row in df.iterrows():
    if row['reach'] == 1:
        df.loc[index, 'cumseglen'] = row['len'] * 0.5
    else:
        df.loc[index, 'cumseglen'] = df.loc[index - 1, 'cumseglen'] + 0.5 * (df.loc[index - 1, 'len'] + row['len'])
# print final results
df
   seg  reach   len  cumseglen
0    1      1  20.0       10.0
1    1      2  50.0       45.0
2    1      3  30.0       85.0
3    2      1  30.0       15.0
4    2      2  40.0       50.0
5    2      3  30.0       85.0
How can I improve the efficiency of the formula step?

To me this looks like a group-by operation. That is, within each "segment" group, you want to apply some operation to that group.
Here's one way to perform your calculation from above, using a group-by and some cumulative sums within each group:
import numpy as np

def cumulate(group):
    cuml = 0.5 * np.cumsum(group)
    return cuml + cuml.shift(1).fillna(0)

df['cumseglen'] = df.groupby('seg')['len'].apply(cumulate)
print(df)
The result:
   seg  reach   len  cumseglen
0    1      1  20.0       10.0
1    1      2  50.0       45.0
2    1      3  30.0       85.0
3    2      1  30.0       15.0
4    2      2  40.0       50.0
5    2      3  30.0       85.0
Algorithmically, this is not exactly the same as what you wrote, but under the assumption that the "reach" column starts from 1 at the beginning of each new segment indicated by the "seg" column, this should work.
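If you want to skip the Python-level apply entirely, the same half-length cumulative sum can also be written with a grouped cumsum plus a grouped shift. A minimal sketch under the same assumption about the "reach" column, using the column names from the example above:
# vectorized alternative (sketch): cumulative half-lengths per segment
half_cum = 0.5 * df.groupby('seg')['len'].cumsum()
# add the previous row's cumulative value within the same segment (0 at each segment start)
df['cumseglen'] = half_cum + half_cum.groupby(df['seg']).shift(1).fillna(0)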

Related

Pandas dataframe: auto-fill values if rows share the same value in a specific column [duplicate]

I have the data below. The new pandas version doesn't preserve the grouped columns after a fillna/ffill/bfill operation. Is there a way to keep the grouping columns in the result?
import io
import pandas as pd

data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With the most recent pandas, if we would like to keep the groupby columns, we need to add apply here:
out = df.groupby(['one','two']).apply(lambda x : x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Is this what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
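If there are several value columns to fill, the same one-liner generalizes. A small sketch, assuming the original df from above, with key columns 'one' and 'two' and every other column forward-filled within its group:
# forward-fill every non-key column within each ('one', 'two') group (sketch)
value_cols = [c for c in df.columns if c not in ('one', 'two')]
df[value_cols] = df.groupby(['one', 'two'])[value_cols].ffill()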
Yes, set the index first and then group, so that the grouping columns are preserved, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
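If you then want 'one' and 'two' back as ordinary columns, a reset_index at the end does it. A minimal sketch, assuming df is the indexed frame from the lines just above:
out = df.groupby(['one', 'two']).ffill().reset_index()
print(out)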

Operations with multiple dataframes partially sharing indexes in pandas

I have two dataframes: (i) one has a two-level index and a two-level header, and (ii) the other has a single index and a single header. The second level of each axis in the first dataframe corresponds to the matching axis of the second dataframe. I need to multiply the two dataframes based on that correspondence between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd
df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1,.3,.6],[.4,.4,.3],[.5,.4,.1]], index=[1,2,3], columns=[1,2,3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
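Another way to see what the level argument is doing is to broadcast df2 to df's shape by hand and multiply element-wise. A sketch of that equivalent, using the same df and df2 as above:
# repeat df2's rows/columns along level 1 of df's MultiIndex, then multiply (sketch)
df2_big = df2.reindex(index=df.index.get_level_values(1),
                      columns=df.columns.get_level_values(1))
df2_big.index = df.index
df2_big.columns = df.columns
df_out2 = df * df2_big  # same result as df.mul(df2, level=1)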

How to compare 2 columns of 2 different dataframes in pandas, and sum the result

I have 2 dataframes with the same length, but different number of columns.
I'd like to compare 2 specific columns from those dataframes, and whenever the values are equal, increment a counter by 1, like so:
df1:
count = 0
num
0 0
1 1
2 0
3 0
4 1
df2:
Preg Glu outcome
0 5.0 116.0 0.0
1 10.0 115.0 0.0
2 2.0 197.0 0.0
3 7.0 196.0 1.0
4 10.0 125.0 1.0
Thus, since they were equal in 3 positions, the result should be:
count = 3
What is the best way to do that?
You can check by performing an elementwise comparison between the two:
>>> (df1['num'] == df2['outcome']).sum()
3
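If the two frames might not share the same index labels (for example after filtering), comparing positions instead of labels avoids alignment surprises. A small sketch, assuming both columns have the same length:
# position-wise comparison that ignores the index labels (sketch)
count = (df1['num'].to_numpy() == df2['outcome'].to_numpy()).sum()
print(count)  # 3 for the example data above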

Is there a way to get the nlargest items per group in dask?

I have the following dataset:
location category percent
A 5 100.0
B 3 100.0
C 2 50.0
4 13.0
D 2 75.0
3 59.0
4 13.0
5 4.0
And I'm trying to get the nlargest items per group when the dataframe is grouped by location, i.e. if I want the top 2 largest percentages for each group, the output should be:
location category percent
A 5 100.0
B 3 100.0
C 2 50.0
4 13.0
D 2 75.0
3 59.0
It looks like in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't have an nlargest function for groupby. I have been playing around with apply but can't seem to get it to work properly.
df.groupby(['location'].apply(lambda x: x['percent'].nlargest(2)).compute()
But I just get the error ValueError: Wrong number of items passed 0, placement implies 8
The apply should work, but your syntax is a little off:
In [11]: df
Out[11]:
Dask DataFrame Structure:
Unnamed: 0 location category percent
npartitions=1
int64 object int64 float64
... ... ... ...
Dask Name: from-delayed, 3 tasks
In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A 0 100.0
B 1 100.0
C 2 50.0
3 13.0
D 4 75.0
5 59.0
Name: x, dtype: float64
In pandas you'd have .nlargest and .rank as groupby methods which would let you do this without the apply:
In [21]: df1
Out[21]:
location category percent
0 A 5 100.0
1 B 3 100.0
2 C 2 50.0
3 C 4 13.0
4 D 2 75.0
5 D 3 59.0
6 D 4 13.0
7 D 5 4.0
In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A 0 100.0
B 1 100.0
C 2 50.0
3 13.0
D 4 75.0
5 59.0
Name: percent, dtype: float64
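The .rank route mentioned above looks roughly like this; a sketch on the same df1, keeping each location's two largest percentages as full rows:
# rank percentages within each location, largest first, and keep the top 2 (sketch)
mask = df1.groupby("location")["percent"].rank(method="first", ascending=False) <= 2
print(df1[mask])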
The dask documentation notes:
Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:
The pandas API is huge
Some operations are genuinely hard to do in parallel (for example sort).

Pandas Dataframe merge 2 columns

I have a data table like this:
Run, test1, test2
1, 100, 102
2, 110, 100
3, 108, 105
I would like to have the 2 columns merged together like this:
Run, results
1, 100
1, 102
2, 110
2, 100
3, 108
3, 105
How do I do it in Pandas? Thanks a lot!
Use stack, which moves the columns into the index as a second level, then flatten with a double reset_index:
df = df.set_index('Run').stack().reset_index(drop=True, level=1).reset_index(name='results')
print (df)
Run results
0 1 100.0
1 1 102.0
2 2 110.0
3 2 100.0
4 3 108.0
5 3 105.0
Or melt:
df = df.melt('Run', value_name='results').drop('variable', axis=1).sort_values('Run')
print (df)
Run results
0 1 100.0
3 1 102.0
1 2 110.0
4 2 100.0
2 3 108.0
5 3 105.0
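If a clean 0..n-1 index is wanted after the melt (the sort above keeps the original row labels), finish with reset_index. A small sketch of the same approach, starting again from the original df:
df_melted = (df.melt('Run', value_name='results')
               .drop('variable', axis=1)
               .sort_values('Run', kind='stable')
               .reset_index(drop=True))
print(df_melted)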
Numpy solution with numpy.repeat:
import numpy as np

a = np.repeat(df['Run'].values, 2)
b = df[['test1', 'test2']].values.flatten()
df = pd.DataFrame({'Run': a, 'results': b}, columns=['Run', 'results'])
print (df)
Run results
0 1 100.0
1 1 102.0
2 2 110.0
3 2 100.0
4 3 108.0
5 3 105.0
This is how I would achieve it.
Option 1
wide_to_long
pd.wide_to_long(df, stubnames='test', i='Run', j='LOL').reset_index().drop('LOL', axis=1)
Out[776]:
Run test
0 1 100.0
1 2 110.0
2 3 108.0
3 1 102.0
4 2 100.0
5 3 105.0
Notice: here I did not change the column name from test to results; I think keeping test as the new column name is better in your situation.
Option 2
pd.concat
df = df.set_index('Run')
pd.concat([df[col] for col in df.columns], axis=0).reset_index().rename(columns={0: 'results'})
Out[786]:
Run results
0 1 100.0
1 2 110.0
2 3 108.0
3 1 102.0
4 2 100.0
5 3 105.0
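Note that both options above list all of test1 before all of test2 rather than interleaving them per Run. If the per-Run order from the question matters, a stable sort on Run restores it; a sketch building on Option 2's indexed df:
out = (pd.concat([df[col] for col in df.columns], axis=0)
         .reset_index()
         .rename(columns={0: 'results'})
         .sort_values('Run', kind='stable')
         .reset_index(drop=True))
print(out)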