index compatibility of dataframe with multiindex result from apply on group - pandas

We have to apply an algorithm to columns in a dataframe, the data has to be grouped by a key and the result shall form a new column in the dataframe. Since it is a common use-case we wonder if we have chosen a correct approach or not.
Following code reflects our approach to the problem in a simplified manner.
import numpy as np
import pandas as pd
np.random.seed(42)
N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)
df = pd.DataFrame(data)
This generates a DataFrame as follows.
key x
0 0 0.969585
1 1 0.775133
2 1 0.939499
3 1 0.894827
4 1 0.597900
.. ... ...
95 53 0.036887
96 54 0.609564
97 55 0.502679
98 56 0.051479
99 56 0.278646
Application of exemplary methods on the DataFrame groups.
def magic(x, const):
return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)
def pandas_confrom_magic(df_per_key, const=1):
index = df_per_key['x'].index # preserve index
x = df_per_key['x'].to_numpy()
y = magic(x, const) # perform some pandas incompatible magic
return pd.Series(y, index=index) # reconstruct index
g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))
When assigning a new column to the result df['y'] = y_per_g it will throw a TypeError.
TypeError: incompatible index of inserted column with frame index
Thus a compatible multiindex needs to be introduced first.
df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)
Which yields the intended result.
key x y
index
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
... ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
Now we wonder if there is a more straight forward way of dealing with the index and if we generally have chosen a favorable approach.

Use Series.droplevel to remove first level of MultiIndex, such that it has the same index as df, then assign will working well:
g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print (df)
key x y
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
.. ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
[100 rows x 3 columns]

Related

multi-dimensional indexing warning with pandas

x = df.x_value
y = df.y_value
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=2)
x_transformed = polynomial_features.fit_transform(x)
The above code is giving following warning...how can I avoid these
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
A full working example with solution as suggested by the warning:
In [194]: df
Out[194]:
age rank height weight
0 20 2 155 53
1 15 7 159 60
2 34 6 180 75
3 40 5 163 80
4 60 1 170 49
In [195]: df.height
Out[195]:
0 155
1 159
2 180
3 163
4 170
Name: height, dtype: int64
In [196]: df.height[:,None]
<ipython-input-196-1af0bb09495a>:1: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
df.height[:,None]
Out[196]:
array([[155],
[159],
[180],
[163],
[170]])
In [197]: df.height.to_numpy()[:,None]
Out[197]:
array([[155],
[159],
[180],
[163],
[170]])

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
df_temp = df.loc[df.Income_Quantile == level]
dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "ttest_ind", can be accessed from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think better is use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01

Summing columns and rows

How do I add up rows and columns.
The last column Sum needs to be the sum of the rows R0+R1+R2.
The last row needs to be the sum of these columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc position selection
df.iloc[-1,1:]=df.iloc[:-1,1:].sum()
df['Sum']=df.iloc[:,1:].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum']=df[cols_to_sum].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341

Finding greatest fall and rise in a dynamic rolling window based on index

Have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
Want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
Tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer
UPDATE: Based on #jezrael dataset the output for gr8Rise is expected as follows:
Use:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns = ['reading'])
df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda x: x[0]-x[-1], raw=True).max()
for x in range(1, len(df)+1)]
df.loc[0, 'gr8Rise']= np.nan
print (df)
reading gr8Rise
0 72 NaN
1 31 41.0
2 37 64.0
3 88 59.0
4 62 73.0
5 24 76.0
6 29 72.0
7 15 57.0
8 12 60.0
9 16 56.0

assigning title to intervals in pandas

import numpy as np
xlist = np.arange(1, 100).tolist()
df = pd.DataFrame(xlist,columns=['Numbers'],dtype=int)
pd.cut(df['Numbers'],5)
how to assign column name to each distinct intervals created ?
IIUC, you can use pd.concat function and join them in a new data frame based on indexes:
# get indexes
l = df.index.tolist()
n =20
indexes = [l[i:i + n] for i in range(0, len(l), n)]
# create new data frame
new_df = pd.concat([df.iloc[x].reset_index(drop=True) for x in indexes], axis=1)
new_df.columns = ['Numbers'+str(x) for x in range(new_df.shape[1])]
print(new_df)
Numbers0 Numbers1 Numbers2 Numbers3 Numbers4
0 1 21 41 61 81.0
1 2 22 42 62 82.0
2 3 23 43 63 83.0
3 4 24 44 64 84.0
4 5 25 45 65 85.0