How to substitute a column in a pandas dataframe with a series? - pandas

Say we have a dataframe df and a series s1 in pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000,1000))
s1 = pd.Series(range(0,10000))
How can I modify df so that column 42 becomes equal to s1?
How can I modify df so that the columns between 42 and 442 become equal to s1?
I would like to know the simplest way to do that but also a way to do that in place.

I think you first need a Series with the same length as the DataFrame, here 20:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(20,10))
#print (df)
s1 = pd.Series(range(0,20))
#print (s1)
#set column by Series
df[8] = s1
#set Series to a range of columns; .loc label slicing is inclusive,
#so 3:5 selects columns 3, 4 and 5
cols = df.loc[:, 3:5].columns
df[cols] = pd.concat([s1] * len(cols), axis=1)
print (df)
0 1 2 3 4 5 6 7 8 9
0 -0.668129 -0.498210 0.618576 0 0 0 0.301966 0.449483 0 -0.315231
1 -2.015971 -1.130231 -1.111846 1 1 1 1.915676 0.920348 1 1.157552
2 -0.106208 -0.088752 -0.971485 2 2 2 -0.366948 -0.301085 2 1.141635
3 -1.309529 -0.274381 0.864837 3 3 3 0.670294 0.086347 3 -1.212503
4 0.120359 -0.358880 1.199936 4 4 4 0.389167 1.201631 4 0.445432
5 -1.031109 0.067133 -1.213451 5 5 5 -0.636896 0.013802 5 1.726135
6 -0.491877 0.254206 -0.268168 6 6 6 0.671070 -0.633645 6 1.813671
7 0.080433 -0.882443 1.152671 7 7 7 0.249225 1.385407 7 1.010374
8 0.307274 0.806150 0.071719 8 8 8 1.133853 -0.789922 8 -0.286098
9 -0.767206 1.094445 1.603907 9 9 9 0.083149 2.322640 9 0.396845
10 -0.740018 -0.853377 -2.039522 10 10 10 0.764962 -0.472048 10 -0.071255
11 -0.238565 1.077573 2.143252 11 11 11 1.542892 2.572560 11 -0.803516
12 -0.139521 -0.992107 -0.892619 12 12 12 0.259612 -0.661760 12 -1.508976
13 -1.077001 0.381962 0.205388 13 13 13 -0.023986 -1.293080 13 1.846402
14 -0.714792 -0.728496 -0.127079 14 14 14 0.606065 -2.320500 14 -0.992798
15 -0.127113 -0.563313 -0.101387 15 15 15 0.647325 -0.816023 15 -0.309938
16 -1.151304 -1.673719 0.074930 16 16 16 -0.392157 0.736714 16 1.142983
17 -1.247396 -0.471524 1.173713 17 17 17 -0.005391 0.426134 17 0.781832
18 -0.325111 0.579248 0.040363 18 18 18 0.361926 0.036871 18 0.581314
19 -1.057501 -1.814500 0.109628 19 19 19 -1.738658 -0.061883 19 0.989456
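Note that both assignments above already modify df in place; nothing is copied or returned. A quick sketch to confirm this, with illustrative assertions only:
#column assignment mutates the existing DataFrame object
before = id(df)
df[8] = s1
assert id(df) == before        #same object, modified in place
assert (df[8] == s1).all()     #column 8 now holds s1's values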
Timings
Some other solutions, but it seems the concat solution is the fastest:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(1000,1000))
#print (df)
s1 = pd.Series(range(0,1000))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 42:442].columns
print (df)
In [310]: %timeit df[cols] = np.broadcast_to(s1.values[:, np.newaxis], (len(df),len(cols)))
1 loop, best of 3: 202 ms per loop
In [311]: %timeit df[cols] = np.repeat(s1.values[:, np.newaxis], len(cols), axis=1)
1 loop, best of 3: 208 ms per loop
In [312]: %timeit df[cols] = np.array([s1.values]*len(cols)).transpose()
10 loops, best of 3: 175 ms per loop
In [313]: %timeit df[cols] = pd.concat([s1] * len(cols), axis=1)
10 loops, best of 3: 53.8 ms per loop
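For reference, a minimal end-to-end sketch of the fastest (concat) approach on the 10000x1000 frame from the question; .loc label slicing is inclusive, so 42:442 selects 401 columns, and .to_numpy() sidesteps any column alignment during assignment:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000, 1000))
s1 = pd.Series(range(0, 10000))
#single column
df[42] = s1
#range of columns, assigned in place
cols = df.loc[:, 42:442].columns
df[cols] = pd.concat([s1] * len(cols), axis=1).to_numpy()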

Related

Find Max Gradient by Row in For Loop Pandas

I have a df of 15 x 4 and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row, using an "S" and "N" value for each min or max in the rows below. I'm not sure this is the most pythonic way to do it. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = (ms.maxSlats.iloc[i] - ms.minNlats.iloc[i]) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T  # need to transpose
print(grad)
I obtain the correct answer, but I'm wondering if there is a cleaner way to arrive at the same result below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient, then keep only the last of any duplicated index values.
grad = np.where(ms.maxSlats > ms.maxNlats,
                (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms) + 1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
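If you want to sanity-check the vectorized result against the original loop (a sketch; loop_grad is a hypothetical name assumed to hold the loop's transposed output, saved before grad is reused above):
#align column labels, then compare after sorting by index;
#check_dtype=False tolerates differently typed columns
loop_out = loop_grad.copy()
loop_out.columns = ['A', 'B']
pd.testing.assert_frame_equal(loop_out.sort_index(), df.sort_index(),
                              check_dtype=False)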

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function, which calculates the Spearman correlation, p-value and t-statistic, to each dataframe in the list (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be accessed from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the columns "Score_1, Score_2, and Score_3" repeated five times, once per quantile.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']

def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result

df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
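One caveat: scipy's spearmanr returns a (correlation, pvalue) tuple, so indexing [1] on the 'correlation' row above actually stores the p-value, not the coefficient. If the coefficient itself is wanted, index [0] instead:
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[0] for x in cols]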

Cumulative value counts of categorical data, with group by

In my data frame, I have a text column group with the group name, and a column drop_week holding a categorical value in the range [1, 4]. I want to store, for each group, the cumulative count of values 1 to 4 of drop_week. I'm doing this:
drop_data = all_data[['group', 'drop_week']].groupby('group')['drop_week'] \
.value_counts().unstack().transpose().fillna(0).cumsum().transpose()
and it works. But since it took me 2 hours of googling to come up with this solution, I was wondering if there is a better way to do it.
You could use pd.crosstab to create the frequency table. Then use cumsum(axis=1) to compute the cumulative sum across each row:
pd.crosstab(index=all_data['group'], columns=all_data['drop_week']).cumsum(axis=1)
# drop_week 1 2 3 4
# group
# 0 12 17 21 27
# 1 7 13 18 25
# 2 9 14 22 26
# 3 5 11 16 22
which agrees with
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
.value_counts().unstack().transpose().fillna(0).cumsum().transpose())
# drop_week 1 2 3 4
# group
# 0 12 17 21 27
# 1 7 13 18 25
# 2 9 14 22 26
# 3 5 11 16 22
The setup I used for this was:
import numpy as np
import pandas as pd
np.random.seed(2019)
N = 100
all_data = pd.DataFrame({'group':np.random.randint(4, size=N),
'drop_week':np.random.randint(1,5, size=N)})
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
.value_counts().unstack().transpose().fillna(0).cumsum().transpose())
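As a side note, the double transpose can also be avoided by keeping groups on the rows and running the cumulative sum across the columns; a sketch that should give the same table:
drop_data = (all_data.groupby('group')['drop_week']
                     .value_counts()
                     .unstack(fill_value=0)
                     .cumsum(axis=1))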

How to repeat a dataframe - python

I have a simple csv dataframe as follow:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
I would like to repeat this dataframe over 10 years, with the year stamp changing accordingly, as follow:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
2001-01-31,9
2001-02-28,8
2001-03-31,7
2001-04-30,6
2001-05-31,5
2001-06-30,4
2001-07-31,3
2001-08-31,2
2001-09-30,1
2001-10-31,0
2001-11-30,11
2001-12-31,12
....
How can I do that?
You can just use concat:
n = 2
Newdf = pd.concat([df]*n, keys=range(n))
#shift each repetition block by its block number, in years
Newdf.Date += pd.to_timedelta(Newdf.index.get_level_values(level=0), 'Y')
Newdf.reset_index(level=0, drop=True, inplace=True)
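Since the 'Y' unit for to_timedelta is deprecated (and rejected by newer pandas versions), a safer variant is to shift each repetition block by whole calendar years with DateOffset; a sketch, assuming Date was parsed as datetime (e.g. via parse_dates=['Date']):
n = 10
Newdf = pd.concat([df] * n, keys=range(n))
#add k calendar years to block k (Feb 29 is clipped to Feb 28 when needed)
Newdf['Date'] = [d + pd.DateOffset(years=int(k))
                 for d, k in zip(Newdf['Date'], Newdf.index.get_level_values(0))]
Newdf = Newdf.reset_index(level=0, drop=True)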
Try:
df1 = pd.concat([df] * 10)
date_fix = pd.date_range(start='2000-01-31', freq='M', periods=len(df1))
df1['Date'] = date_fix
df1
[out]
Date Data
0 2000-01-31 9
1 2000-02-29 8
2 2000-03-31 7
3 2000-04-30 6
4 2000-05-31 5
5 2000-06-30 4
6 2000-07-31 3
... ... ...
5 2009-06-30 4
6 2009-07-31 3
7 2009-08-31 2
8 2009-09-30 1
9 2009-10-31 0
10 2009-11-30 11
11 2009-12-31 12

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
to which I want to add a column called "Value" under each level-0 column, as the result of the following function:
def my_func(df, scale):
    return df['QTY'] * df['PRICE'] * scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
Even if that weren't hard enough, I want to apply one "scale" value for the "OP" level-0 column and a different "scale" value for the "PK" column.
Use:
def my_func(df, scale):
    #select second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    #create MultiIndex in columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    #join to original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
EDIT:
To multiply by a different scale value for each level, it is possible to pass a list of values:
print (my_func(df, [10, 20]))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 120
1 4 5 200 6 7 840
2 8 9 720 10 11 2200
3 12 13 1560 14 15 4200
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
OP PK
PRICE QTY value PRICE QTY value
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
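If a different scale per level-0 label is wanted with this style as well, one option (a sketch; the scales dict is hypothetical) is to compute each product directly from the MultiIndex columns:
scales = {'OP': 10, 'PK': 20}
parts = {name: df[(name, 'PRICE')] * df[(name, 'QTY')] * scales[name]
         for name in df.columns.get_level_values(0).unique()}
v = pd.DataFrame(parts)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)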