Calculate mean-deviated values (subtract mean of all columns except one from this one column) - purrr

I have a dataset with the following structure:
df <- data.frame(id = 1:5,
                 study = c("st1","st2","st3","st4","st5"),
                 a_var = c(10,20,30,40,50),
                 b_var = c(6,5,4,3,2),
                 c_var = c(3,4,5,6,7),
                 d_var = c(80,70,60,50,40))
I would like to calculate the difference between each column that has _var in its name and the mean of all other columns containing _var in their names, like this:
mean_deviated_value <- function(data, variable) {
  md_value = data[, variable] - rowMeans(data[, names(data) != variable])
  md_value
}
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
df$b_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "b_var")
df$c_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "c_var")
df$d_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "d_var")
Which gives me my desired output:
id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
1 1 st1 10 6 3 80 -19.666667 -12.33333 -9.80 83.80000
2 2 st2 20 5 4 70 -6.333333 -16.91667 -10.35 70.76667
3 3 st3 30 4 5 60 7.000000 -21.50000 -10.90 57.73333
4 4 st4 40 3 6 50 20.333333 -26.08333 -11.45 44.70000
5 5 st5 50 2 7 40 33.666667 -30.66667 -12.00 31.66667
How do I do it in one go, without repeating the code, preferably with dplyr/purrr?
I tried this:
df %>%
  mutate(across(contains("_var"), ~ list(md = .x - rowMeans(select(., contains("_var") & !.x)))))
And got this error:
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'select' applied to an object of class "c('double', 'numeric')"

We can use map_dfc with transmute to create *_md columns, and glue syntax for the names.
library(tidyverse)
nms <- names(df) %>%
  str_subset('^.*_')
bind_cols(df, map_dfc(nms, ~ transmute(df, '{.x}_md' := mean_deviated_value(select(df, contains("_var")), .x))))
#> id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
#> 1 1 st1 10 6 3 80 -19.666667 -25.00000 -29.00000 73.66667
#> 2 2 st2 20 5 4 70 -6.333333 -26.33333 -27.66667 60.33333
#> 3 3 st3 30 4 5 60 7.000000 -27.66667 -26.33333 47.00000
#> 4 4 st4 40 3 6 50 20.333333 -29.00000 -25.00000 33.66667
#> 5 5 st5 50 2 7 40 33.666667 -30.33333 -23.66667 20.33333
Note that with the sequential assignment approach, the first call to rowMeans computes with b_var, c_var and d_var, but by the second call contains("_var") also captures the previously created a_var_md and uses it to compute the means. I don't know if this is intended behaviour, but it is worth mentioning.
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
select(df, contains("_var"))
#> a_var b_var c_var d_var a_var_md
#> 1 10 6 3 80 -19.666667
#> 2 20 5 4 70 -6.333333
#> 3 30 4 5 60 7.000000
#> 4 40 3 6 50 20.333333
#> 5 50 2 7 40 33.666667
We can avoid this by replacing contains("_var") with matches("^.*_var$"), which matches only the columns whose names end in _var.
Created on 2021-12-20 by the reprex package (v2.0.1)

Related

Add one item each time into a cell of dataframe pandas

I need to add/append the results of various functions into a dataframe, each result in one cell. In the example below I'm putting only 3 functions. I could add to lists first and then drop the lists into each column, but there are too many functions to do this individually. Any help is most welcome!
import pandas as pd
J = pd.DataFrame()
J['f1'] = []
J['f2'] = []
J['f3'] = []
for i in range(1000):
    f1x = i + 1
    f2x = i ** 2
    f3x = 3 * i
    J['f1'].append(f1x)
    J['f2'].append(f2x)
    J['f3'].append(f3x)
print(J)
J.to_csv(r'00/G/graficos/Resultado.csv')
A list comprehension to create the DataFrame directly is likely the most performant:
# Create DataFrame from computations
J = pd.DataFrame([[i + 1, i ** 2, 3 * i] for i in range(1_000)])
# Rename Columns
J.columns = 'f' + (J.columns + 1).astype(str)
The same approach also works with named functions:
def f1(i):
    return i + 1

def f2(i):
    return i ** 2

def f3(i):
    return 3 * i

# Create DataFrame from functions
J = pd.DataFrame([[f1(i), f2(i), f3(i)]
                  for i in range(1_000)])
# Rename Columns
J.columns = 'f' + (J.columns + 1).astype(str)
J:
f1 f2 f3
0 1 0 0
1 2 1 3
2 3 4 6
3 4 9 9
4 5 16 12
.. ... ... ...
995 996 990025 2985
996 997 992016 2988
997 998 994009 2991
998 999 996004 2994
999 1000 998001 2997
[1000 rows x 3 columns]
In my opinion, the best approach is to take advantage of vectorized operations:
import numpy as np
import pandas as pd

def f1(i):
    return i + 1

def f2(i):
    return i ** 2

def f3(i):
    return 3 * i

xs = np.arange(1000)
pd.DataFrame({f.__name__: f(xs) for f in (f1, f2, f3)})
output:
f1 f2 f3
0 1 0 0
1 2 1 3
2 3 4 6
3 4 9 9
4 5 16 12
.. ... ... ...
995 996 990025 2985
996 997 992016 2988
997 998 994009 2991
998 999 996004 2994
999 1000 998001 2997
Here is a comparison of the vector and loop approaches:
NB: the scales are logarithmic! This means the loop gets even worse as n grows.
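A minimal sketch of how such a comparison could be reproduced with timeit (the sizes and repeat counts here are arbitrary choices, not from the original benchmark):
import timeit
import numpy as np
import pandas as pd

def loop_version(n):
    # build rows one at a time, then construct the frame
    rows = []
    for i in range(n):
        rows.append([i + 1, i ** 2, 3 * i])
    return pd.DataFrame(rows, columns=['f1', 'f2', 'f3'])

def vector_version(n):
    # compute each column in one vectorized operation
    xs = np.arange(n)
    return pd.DataFrame({'f1': xs + 1, 'f2': xs ** 2, 'f3': 3 * xs})

for n in (1_000, 10_000, 100_000):
    t_loop = timeit.timeit(lambda: loop_version(n), number=5)
    t_vec = timeit.timeit(lambda: vector_version(n), number=5)
    print(n, round(t_loop, 3), round(t_vec, 3))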

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function, which calculates the Spearman correlation, p-value and t-statistic, to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1', 'Score_2', 'Score_3']

def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    # note: spearmanr(...)[1] is the p-value; use [0] for the correlation coefficient
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
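For reference, slices of the resulting MultiIndexed frame can then be pulled out with .loc or .xs; a small sketch against the result printed above:
# all p-values, indexed by Income_Quantile
pvals = df.loc['p-value']
# one statistic for a single quantile, via a tuple key
row = df.loc[('t-statistic', 3)]
# every statistic for quantile 3
q3 = df.xs(3, level='Income_Quantile')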

How to repeat a dataframe - python

I have a simple csv dataframe as follow:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
I would like to repeat this dataframe over 10 years, with the year stamp changing accordingly, as follow:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
2001-01-31,9
2001-02-28,8
2001-03-31,7
2001-04-30,6
2001-05-31,5
2001-06-30,4
2001-07-31,3
2001-08-31,2
2001-09-30,1
2001-10-31,0
2001-11-30,11
2001-12-31,12
....
How can I do that?
You can do this just using concat:
n = 2
Newdf = pd.concat([df] * n, keys=range(n))
Newdf.Date += pd.to_timedelta(Newdf.index.get_level_values(level=0), 'Y')
Newdf.reset_index(level=0, drop=True, inplace=True)
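Note that a 'Y' timedelta is a fixed average year length (and the 'Y' unit is rejected by newer pandas versions), so the dates can drift off month-end; a minimal sketch of the same idea with pd.DateOffset, assuming Date has already been parsed as datetime:
import pandas as pd

n = 10
parts = []
for k in range(n):
    part = df.copy()                                      # df from the question, Date as datetime64
    part['Date'] = part['Date'] + pd.DateOffset(years=k)  # calendar-aware year shift
    parts.append(part)
newdf = pd.concat(parts, ignore_index=True)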
Try:
df1 = pd.concat([df] * 10)
date_fix = pd.date_range(start='2000-01-31', freq='M', periods=len(df1))
df1['Date'] = date_fix
df1
[out]
Date Data
0 2000-01-31 9
1 2000-02-29 8
2 2000-03-31 7
3 2000-04-30 6
4 2000-05-31 5
5 2000-06-30 4
6 2000-07-31 3
... ... ...
5 2009-06-30 4
6 2009-07-31 3
7 2009-08-31 2
8 2009-09-30 1
9 2009-10-31 0
10 2009-11-30 11
11 2009-12-31 12
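One caveat on recent versions: pandas 2.2 deprecated the 'M' month-end alias in favour of 'ME', so there the date_range call becomes:
# pandas >= 2.2: month-end alias is 'ME' ('M' is deprecated)
date_fix = pd.date_range(start='2000-01-31', freq='ME', periods=len(df1))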

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
to which I want to add, for each level-0 column, a column called "Value" that is the result of the following function;
def my_func(df, scale):
    return df['QTY'] * df['PRICE'] * scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
Even if that wasn't hard enough, I want to apply one "scale" value to the "OP" level-0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
    # select second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    # create MultiIndex in columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    # join to original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
EDIT:
To multiply by a different scale value for each level-0 group, pass a list of values:
print (my_func(df, [10, 20]))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 120
1 4 5 200 6 7 840
2 8 9 720 10 11 2200
3 12 13 1560 14 15 4200
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
OP PK
PRICE QTY value PRICE QTY value
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
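A dict keyed by the level-0 label is another option for per-group scales; a minimal sketch (the scales mapping here is illustrative):
# assigning with a tuple key adds a column under that level-0 group
scales = {'OP': 10, 'PK': 20}
for key, scale in scales.items():
    df[(key, 'value')] = df[(key, 'PRICE')] * df[(key, 'QTY')] * scale
df = df.sort_index(axis=1)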

How to substitute a column in a pandas dataframe with a series?

Suppose we have a dataframe df and a series s1 in pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000,1000))
s1 = pd.Series(range(0,10000))
How can I modify df so that the column 42 become equal to s1?
How can I modify df so that the columns between 42 and 442 become equal to s1?
I would like to know the simplest way to do that but also a way to do that in place.
I think you first need a Series with the same length as the DataFrame, here 20:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(20,10))
#print (df)
s1 = pd.Series(range(0,20))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 3:5].columns
df[cols] = pd.concat([s1] * len(cols), axis=1)
print (df)
0 1 2 3 4 5 6 7 8 9
0 -0.668129 -0.498210 0.618576 0 0 0 0.301966 0.449483 0 -0.315231
1 -2.015971 -1.130231 -1.111846 1 1 1 1.915676 0.920348 1 1.157552
2 -0.106208 -0.088752 -0.971485 2 2 2 -0.366948 -0.301085 2 1.141635
3 -1.309529 -0.274381 0.864837 3 3 3 0.670294 0.086347 3 -1.212503
4 0.120359 -0.358880 1.199936 4 4 4 0.389167 1.201631 4 0.445432
5 -1.031109 0.067133 -1.213451 5 5 5 -0.636896 0.013802 5 1.726135
6 -0.491877 0.254206 -0.268168 6 6 6 0.671070 -0.633645 6 1.813671
7 0.080433 -0.882443 1.152671 7 7 7 0.249225 1.385407 7 1.010374
8 0.307274 0.806150 0.071719 8 8 8 1.133853 -0.789922 8 -0.286098
9 -0.767206 1.094445 1.603907 9 9 9 0.083149 2.322640 9 0.396845
10 -0.740018 -0.853377 -2.039522 10 10 10 0.764962 -0.472048 10 -0.071255
11 -0.238565 1.077573 2.143252 11 11 11 1.542892 2.572560 11 -0.803516
12 -0.139521 -0.992107 -0.892619 12 12 12 0.259612 -0.661760 12 -1.508976
13 -1.077001 0.381962 0.205388 13 13 13 -0.023986 -1.293080 13 1.846402
14 -0.714792 -0.728496 -0.127079 14 14 14 0.606065 -2.320500 14 -0.992798
15 -0.127113 -0.563313 -0.101387 15 15 15 0.647325 -0.816023 15 -0.309938
16 -1.151304 -1.673719 0.074930 16 16 16 -0.392157 0.736714 16 1.142983
17 -1.247396 -0.471524 1.173713 17 17 17 -0.005391 0.426134 17 0.781832
18 -0.325111 0.579248 0.040363 18 18 18 0.361926 0.036871 18 0.581314
19 -1.057501 -1.814500 0.109628 19 19 19 -1.738658 -0.061883 19 0.989456
Timings
Here are some other solutions; it seems the concat solution is the fastest:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(1000,1000))
#print (df)
s1 = pd.Series(range(0,1000))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 42:442].columns
print (df)
In [310]: %timeit df[cols] = np.broadcast_to(s1.values[:, np.newaxis], (len(df),len(cols)))
1 loop, best of 3: 202 ms per loop
In [311]: %timeit df[cols] = np.repeat(s1.values[:, np.newaxis], len(cols), axis=1)
1 loop, best of 3: 208 ms per loop
In [312]: %timeit df[cols] = np.array([s1.values]*len(cols)).transpose()
10 loops, best of 3: 175 ms per loop
In [313]: %timeit df[cols] = pd.concat([s1] * len(cols), axis=1)
10 loops, best of 3: 53.8 ms per loop
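Applied back to the original 10000x1000 frame from the question, the fastest pattern would look like this (a sketch; .values makes the assignment positional, sidestepping alignment on the concat result's 0..400 column labels):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 1000))
s1 = pd.Series(range(0, 10000))

df[42] = s1                               # single column, modified in place
cols = df.loc[:, 42:442].columns          # label slice is inclusive: 401 columns
df[cols] = pd.concat([s1] * len(cols), axis=1).values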