List of Pandas Dataframes: Merging Function Outputs - pandas

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
df_temp = df.loc[df.Income_Quantile == level]
dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "ttest_ind", can be accessed from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?

I think better is use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01

Related

Kronecker product over the rows of a pandas dataframe

So I have these two dataframes and I would like to get a new dataframe which consists of the kronecker product of the rows of the two dataframes. What is the correct way to this?
As an example:
DataFrame1
c1 c2
0 10 100
1 11 110
2 12 120
and
DataFrame2
a1 a2
0 5 7
1 1 10
2 2 4
Then I would like to have the following matrix:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
I hope my question is clear.
PS. I saw this question was posted here kronecker product pandas dataframes. However, the answer given is not the correct answer (I believe to mine and the original question, but definitely not to mine). The answer there gives a Kronecker product of both dataframes, but I only want it over the rows.
Create MultiIndex by MultiIndex.from_product, convert both columns to MultiIndex by DataFrame.reindex and multiple Dataframe, last flatten MultiIndex:
c = pd.MultiIndex.from_product([df1, df2])
df = df1.reindex(c, axis=1, level=0).mul(df2.reindex(c, axis=1, level=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
print (df)
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
Use numpy for efficiency:
import numpy as np
pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
columns=pd.MultiIndex.from_product([df1, df2]).map(''.join)
)
Output:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
to this dataframe, i want to add 72 rows to make the dataframe exactly 100 rows. So i need to fill loc[22:99] but with a certain value, let's say 100.
I tried something like this
uncon_dstn_2021['balance'].loc[22:99] = 100
but did not work. Any idea?
You can do reindex
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]

pandas return multiple DataFrames from apply function

EDIT Based on comments, clarifying the examples further to depict more realistic use case
I want to call a function with df.apply. This function returns multiple DataFrames. I want to join each of these DataFrames into logical groups. I am unable to do that without using for loop (which defeats the purpose of calling with apply).
I have tried calling function for each row of dataframe and it is slower than apply. However, with apply combining the results slows down things again.
Any tips?
# input data frame
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in=pd.DataFrame(data)
print(df_in)
Output>
Name Age Score
0 Ani 15 93
1 Bob 12 98
2 Cal 13 95
3 Dom 14 99
Function to be applied>
def func1(name, age):
num_rows = np.random.randint(int(age/3))
age_mul_1 = np.random.randint(low=1, high=age, size = num_rows)
age_mul_2 = np.random.randint(low=1, high=age, size = num_rows)
data = {'Name': [name]*num_rows, 'Age_Mul_1': age_mul_1, 'Age_Mul_2': age_mul_2}
df_func1 = pd.DataFrame(data)
return df_func1
def func2(name, age, score, other_params):
num_rows = np.random.randint(int(score/10))
score_mul_1 = np.random.randint(low=age, high=score, size = num_rows)
data2 = {'Name': [name]*num_rows, 'score_Mul_1': score_mul_1}
df_func2 = pd.DataFrame(data2)
return df_func2
def ret_mul_df(row):
df_A = func1(row['Name'], row['Age'])
#print(df_A)
df_B = func2(row['Name'], row['Age'], row['Score'],1)
#print(df_B)
return df_A, df_B
What I want to do is essentially create is two dataframes df_A_combined and df_B_combined
However, How I am currently combining is as follows:
df_out = df_in.apply(lambda row: ret_mul_df(row), axis=1)
df_A_combined = pd.DataFrame()
df_B_combined = pd.DataFrame()
for ser in df_out:
df_A_combined = df_A_combined.append(ser[0], ignore_index=True)
df_B_combined = df_B_combined.append(ser[1], ignore_index=True)
print(df_A_combined)
Name Age_Mul_1 Age_Mul_2
0 Ani 7 8
1 Ani 1 4
2 Ani 1 8
3 Ani 12 6
4 Bob 9 8
5 Cal 8 7
6 Cal 8 1
7 Cal 4 8
print(df_B_combined)
Name score_Mul_1
0 Ani 28
1 Ani 29
2 Ani 50
3 Ani 35
4 Ani 84
5 Ani 24
6 Ani 51
7 Ani 28
8 Bob 32
9 Cal 26
10 Cal 70
11 Dom 56
12 Dom 53
How can I avoid the iteration?
The func1, func2 are calls to 3rd party libraries (which are very computation intensive) and several such calls are made. Also dataframes df_A_combined and df_B_combined are not combinable among themselves
Note: This is a much simplified example and splitting the function will lead to lot of redundancies.
If this isn't what you want, I'll update if you can post what the two dataframes should look like.
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in=pd.DataFrame(data)
print(df_in)
df_A = df_in[['Name','Age']]
df_A['Age_Multiplier'] = df_A['Age'] * 3
print(df_A)
...: print(df_A)
Name Age Age_Multiplier
0 Ani 15 45
1 Bob 12 36
2 Cal 13 39
3 Dom 14 42
df_B = df_in[['Name','Score']]
df_B['Score_Multiplier'] = df_B['Score'] * 2
print(df_B)
...: print(df_B)
Name Score Score_Multiplier
0 Ani 93 186
1 Bob 98 196
2 Cal 95 190
3 Dom 99 198

assigning title to intervals in pandas

import numpy as np
xlist = np.arange(1, 100).tolist()
df = pd.DataFrame(xlist,columns=['Numbers'],dtype=int)
pd.cut(df['Numbers'],5)
how to assign column name to each distinct intervals created ?
IIUC, you can use pd.concat function and join them in a new data frame based on indexes:
# get indexes
l = df.index.tolist()
n =20
indexes = [l[i:i + n] for i in range(0, len(l), n)]
# create new data frame
new_df = pd.concat([df.iloc[x].reset_index(drop=True) for x in indexes], axis=1)
new_df.columns = ['Numbers'+str(x) for x in range(new_df.shape[1])]
print(new_df)
Numbers0 Numbers1 Numbers2 Numbers3 Numbers4
0 1 21 41 61 81.0
1 2 22 42 62 82.0
2 3 23 43 63 83.0
3 4 24 44 64 84.0
4 5 25 45 65 85.0

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
of which I want to add a column for each level 0 column, called "Value" which is the result of the following function;
def my_func(df, scale):
return df['QTY']*df['PRICE']*scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
Even if that wasn't had enough, I want to apply one "scale" value for the "OP" level 0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
#select second level of columns
df1 = df.xs('QTY', axis=1, level=1).values *df.xs('PRICE', axis=1, level=1) * scale
#create MultiIndex in columns
df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
#join to original
return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
EDIT:
For multiple by scaled values different for each level is possible use list of values:
print (my_func(df, [10, 20]))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 120
1 4 5 200 6 7 840
2 8 9 720 10 11 2200
3 12 13 1560 14 15 4200
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
OP PK
PRICE QTY value PRICE QTY value
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100