I have a dataframe (movielens dataset)
(Pdb) self.data_train.head()
userId movieId rating timestamp
65414 466 608 4.0 945139883
79720 547 6218 4.0 1089518106
63354 457 4007 3.5 1471383787
29923 213 59333 2.5 1462636955
63651 457 102194 2.5 1471383710
I found the mean of each user's rating:
user_mean = self.data_train['rating'].groupby(self.data_train['userId']).mean()
(Pdb) user_mean.head()
userId
1 2.527778
2 3.426471
3 3.588889
4 4.363158
5 3.908602
I want to subtract this mean value from the ratings in the first dataframe for the matching user.
Is there a way of doing this without an explicit for loop?
I think you need GroupBy.transform with 'mean', which returns a Series the same size as the original DataFrame, so you can subtract it from the column with Series.sub:
s = self.data_train.groupby('userId')['rating'].transform('mean')
self.data_train['new'] = self.data_train['rating'].sub(s)
Sample (userId values changed to make the grouping clearer):
print(data_train)
userId movieId rating timestamp
65414 466 608 4.0 945139883
79720 466 6218 4.0 1089518106
63354 457 4007 3.5 1471383787
29923 466 59333 2.5 1462636955
63651 457 102194 2.5 1471383710
s = data_train.groupby('userId')['rating'].transform('mean')
print(s)
65414 3.5
79720 3.5
63354 3.0
29923 3.5
63651 3.0
Name: rating, dtype: float64
data_train['new'] = data_train['rating'].sub(s)
print(data_train)
userId movieId rating timestamp new
65414 466 608 4.0 945139883 0.5
79720 466 6218 4.0 1089518106 0.5
63354 457 4007 3.5 1471383787 0.5
29923 466 59333 2.5 1462636955 -1.0
63651 457 102194 2.5 1471383710 -0.5
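The same result can be written in one step, since the transformed mean is aligned with the original index. A minimal runnable sketch using the sample data above:

```python
import pandas as pd

data_train = pd.DataFrame({'userId': [466, 466, 457, 466, 457],
                           'rating': [4.0, 4.0, 3.5, 2.5, 2.5]},
                          index=[65414, 79720, 63354, 29923, 63651])

# transform('mean') returns a Series aligned to data_train's index,
# so it can be subtracted directly from the rating column
data_train['new'] = data_train['rating'] - data_train.groupby('userId')['rating'].transform('mean')
```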
Related
Here was the original question:
With only being able to import numpy and pandas, I need to do the following: scale medianIncome to express the values in units of $10,000 (example: 150000 becomes 15, 30000 becomes 3, 15000 becomes 1.5, etc.).
Here's the code that works:
temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float) / 10000
But when I call the df afterwards, it still shows the original amounts instead of 15 or 1.5. I'm not sure what I'm missing here.
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
The result is:
id medianIncome
0 1.7250
1 2.1806
2 2.4038
3 2.4597
4 1.8080
Name: medianIncome, Length: 20640, dtype: float64
But then when I call the df with housing_cal, it's back to:
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
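The expression computes the scaled values but never stores them: pandas operations return a new object rather than modifying the dataframe in place, so the result is printed and then discarded. A minimal sketch of the fix, assuming the column is already numeric, is to assign the result back to the column:

```python
import pandas as pd

# small sample with the medianIncome values from the question
temp_housing = pd.DataFrame({'medianIncome': [17250.0, 21806.0, 24038.0]})

# assign the scaled result back to the column so the change persists
temp_housing['medianIncome'] = temp_housing['medianIncome'] / 10000
```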
I have created a new column by adding the values of one column to the index of the column from which the new column is built. My problem is that the code works fine on a sample dataframe, but when I run it on my existing dataframe it throws the error "can only perform ops with scalar values". From what I found, the operation expects scalar operands, which is why it throws this error.
I tried converting the dataframe to dictionary or to a list, but no luck.
df = pd.DataFrame({'Name': ['Sam', 'Andrea', 'Alex', 'Robin', 'Kia', 'Sia'], 'Age':[14,25,55,8,21,43], 'd_id_max':[2,1,1,2,0,0]})
df['Expected_new_col'] = df.loc[df.index + df['d_id_max'].to_list, 'Age'].to_numpy()
print(df)
error: can only perform ops with scalar values.
This is the dataframe on which I want to run this code:
Weight Name Age 1 2 abs_max d_id_max
0 45 Sam 14 11.0 41.0 41.0 2
1 88 Andrea 25 30.0 -17.0 30.0 1
2 56 Alex 55 -47.0 -34.0 47.0 1
3 15 Robin 8 13.0 35.0 35.0 2
4 71 Kia 21 22.0 24.0 24.0 2
5 44 Sia 43 2.0 22.0 22.0 2
6 54 Ryan 45 20.0 0.0 20.0 1
Writing your new column like this will not return an error:
df.loc[df.index + df['d_id_max'], 'Age'].to_numpy()
EDIT:
You should first convert d_id_max to int (or float):
df['d_id_max'] = df['d_id_max'].astype(int)
The solution was very simple: I was getting the error because the data type of the d_id_max column was object, when it should be either float or integer. I changed the data type and it worked fine.
I'm trying to do something that should be really simple in pandas, but it seems anything but. I have two large dataframes.
df1 has 243 columns which include:
ID2 K. C type
1 123 1. 2. T
2 132 3. 1. N
3 111 2. 1. U
df2 has 121 columns which include:
ID3 A B
1 123 0. 3.
2 111 2. 3.
3 132 1. 2.
df2 contains different information about the same IDs (ID2 = ID3), but in a different order.
I wanted to create a new column in df2 named type that matches the type column in df1: if an ID matches one in df1, it should copy the corresponding type (T, N or U) from df1. In other words, I need it to look like the following dataframe, but with all 121 columns from df2 plus type:
ID3 A B type
123 0. 3. T
111 2. 3. U
132 1. 2. N
I tried
pd.merge and pd.join.
I also tried
df2['type'] = df1['ID2'].map(df2.set_index('ID3')['type'])
but none of them worked; the last one shows KeyError: 'ID3'.
As far as I can see, your last command is almost correct. Try this:
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
join
df2.join(df1.set_index('ID2')['type'], on='ID3')
ID3 A B type
1 123 0.0 3.0 T
2 111 2.0 3.0 U
3 132 1.0 2.0 N
merge (take 1)
df2.merge(df1[['ID2', 'type']].rename(columns={'ID2': 'ID3'}))
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
merge (take 2)
df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2').drop(columns='ID2')
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
map and assign
df2.assign(type=df2.ID3.map(dict(zip(df1.ID2, df1['type']))))
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
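Any of the variants above can be verified on a small reconstruction of the sample data; here is the corrected map from the first answer, run end to end:

```python
import pandas as pd

df1 = pd.DataFrame({'ID2': [123, 132, 111], 'type': ['T', 'N', 'U']})
df2 = pd.DataFrame({'ID3': [123, 111, 132],
                    'A': [0.0, 2.0, 1.0],
                    'B': [3.0, 3.0, 2.0]})

# map ID3 through a lookup table built from df1 (ID2 -> type)
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
```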
I am using pandas.DataFrame.rolling to calculate rolling means for a stock index close-price series, but the window length varies row by row. I can do this in Excel. How can I do the same thing in pandas? Thanks!
Below is my Excel formula to calculate the moving average; the window length is in the ma window column:
date close ma window ma
2018/3/21 4061.0502
2018/3/22 4020.349
2018/3/23 3904.9355 3 =AVERAGE(INDIRECT("B"&(ROW(B4)-C4+1)):B4)
2018/3/26 3879.893 2 =AVERAGE(INDIRECT("B"&(ROW(B5)-C5+1)):B5)
2018/3/27 3913.2689 4 =AVERAGE(INDIRECT("B"&(ROW(B6)-C6+1)):B6)
2018/3/28 3842.7155 7 =AVERAGE(INDIRECT("B"&(ROW(B7)-C7+1)):B7)
2018/3/29 3894.0498 1 =AVERAGE(INDIRECT("B"&(ROW(B8)-C8+1)):B8)
2018/3/30 3898.4977 6 =AVERAGE(INDIRECT("B"&(ROW(B9)-C9+1)):B9)
2018/4/2 3886.9189 2 =AVERAGE(INDIRECT("B"&(ROW(B10)-C10+1)):B10)
2018/4/3 3862.4796 8 =AVERAGE(INDIRECT("B"&(ROW(B11)-C11+1)):B11)
2018/4/4 3854.8625 1 =AVERAGE(INDIRECT("B"&(ROW(B12)-C12+1)):B12)
2018/4/9 3852.9292 9 =AVERAGE(INDIRECT("B"&(ROW(B13)-C13+1)):B13)
2018/4/10 3927.1729 3 =AVERAGE(INDIRECT("B"&(ROW(B14)-C14+1)):B14)
2018/4/11 3938.3434 1 =AVERAGE(INDIRECT("B"&(ROW(B15)-C15+1)):B15)
2018/4/12 3898.6354 3 =AVERAGE(INDIRECT("B"&(ROW(B16)-C16+1)):B16)
2018/4/13 3871.1443 8 =AVERAGE(INDIRECT("B"&(ROW(B17)-C17+1)):B17)
2018/4/16 3808.863 2 =AVERAGE(INDIRECT("B"&(ROW(B18)-C18+1)):B18)
2018/4/17 3748.6412 2 =AVERAGE(INDIRECT("B"&(ROW(B19)-C19+1)):B19)
2018/4/18 3766.282 4 =AVERAGE(INDIRECT("B"&(ROW(B20)-C20+1)):B20)
2018/4/19 3811.843 6 =AVERAGE(INDIRECT("B"&(ROW(B21)-C21+1)):B21)
2018/4/20 3760.8543 3 =AVERAGE(INDIRECT("B"&(ROW(B22)-C22+1)):B22)
Here is a snapshot of the Excel version.
I figured it out, but I don't think it is the best solution...
import pandas as pd

data = pd.read_excel('data.xlsx', index_col='date')

def get_price_mean(x):
    win = data.loc[:, 'ma window'].iloc[x.shape[0] - 1].astype('int')
    win = max(win, 0)
    return pd.Series(x).rolling(window=win).mean().iloc[-1]

data.loc[:, 'ma'] = data.loc[:, 'close'].expanding().apply(get_price_mean)
print(data)
The result is:
close ma window ma
date
2018-03-21 4061.0502 NaN NaN
2018-03-22 4020.3490 NaN NaN
2018-03-23 3904.9355 3.0 3995.444900
2018-03-26 3879.8930 2.0 3892.414250
2018-03-27 3913.2689 4.0 3929.611600
2018-03-28 3842.7155 7.0 NaN
2018-03-29 3894.0498 1.0 3894.049800
2018-03-30 3898.4977 6.0 3888.893400
2018-04-02 3886.9189 2.0 3892.708300
2018-04-03 3862.4796 8.0 3885.344862
2018-04-04 3854.8625 1.0 3854.862500
2018-04-09 3852.9292 9.0 3876.179456
2018-04-10 3927.1729 3.0 3878.321533
2018-04-11 3938.3434 1.0 3938.343400
2018-04-12 3898.6354 3.0 3921.383900
2018-04-13 3871.1443 8.0 3886.560775
2018-04-16 3808.8630 2.0 3840.003650
2018-04-17 3748.6412 2.0 3778.752100
2018-04-18 3766.2820 4.0 3798.732625
2018-04-19 3811.8430 6.0 3817.568150
2018-04-20 3760.8543 3.0 3779.659767
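An alternative sketch that avoids expanding().apply: slice each variable-length window directly with a list comprehension. Shown here on the first seven rows of the sample; windows longer than the available history produce NaN, matching the result above:

```python
import pandas as pd
import numpy as np

close = pd.Series([4061.0502, 4020.3490, 3904.9355, 3879.8930,
                   3913.2689, 3842.7155, 3894.0498])
win = pd.Series([np.nan, np.nan, 3, 2, 4, 7, 1])

# for each row, average the last `w` closes; NaN when the window is
# undefined or longer than the available history
ma = [close.iloc[i - int(w) + 1:i + 1].mean()
      if pd.notna(w) and w <= i + 1 else np.nan
      for i, w in enumerate(win)]
```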
I have a pandas dataframe
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
From this I would like to create a dataframe B:
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than by brute force?
Cheers, Mike
This will give you a series with the partial sums:
df['bin'] = (df.index - 1) // 5  # integer division; the -1 offsets the 1-based index
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%s-%s' % (5*i + 1, 5*(i+1)) for i in bin_sums.index]
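Putting it together as a runnable sketch on a 10-row version of the data (note the integer division and the offset for the 1-based index, so rows 1-5 land in the first bin):

```python
import pandas as pd

df = pd.DataFrame({'A': [3.4, 4.5, 5.3, 2.1, 4.0,
                         5.3, 1.2, 8.9, 3.4, 2.7]},
                  index=range(1, 11))

df['bin'] = (df.index - 1) // 5          # rows 1-5 -> bin 0, rows 6-10 -> bin 1
bin_sums = df.groupby('bin')['A'].sum()
bin_sums.index = ['%s-%s' % (5*i + 1, 5*(i+1)) for i in bin_sums.index]
```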