I want to use pandas.DataFrame.rolling to calculate rolling means for a stock index close-price series, but the window length changes from row to row. I can do this in Excel; how can I do the same thing in pandas? Thanks!
Below is my Excel formula for the moving average; the window length for each row is in the ma window column:
date close ma window ma
2018/3/21 4061.0502
2018/3/22 4020.349
2018/3/23 3904.9355 3 =AVERAGE(INDIRECT("B"&(ROW(B4)-C4+1)):B4)
2018/3/26 3879.893 2 =AVERAGE(INDIRECT("B"&(ROW(B5)-C5+1)):B5)
2018/3/27 3913.2689 4 =AVERAGE(INDIRECT("B"&(ROW(B6)-C6+1)):B6)
2018/3/28 3842.7155 7 =AVERAGE(INDIRECT("B"&(ROW(B7)-C7+1)):B7)
2018/3/29 3894.0498 1 =AVERAGE(INDIRECT("B"&(ROW(B8)-C8+1)):B8)
2018/3/30 3898.4977 6 =AVERAGE(INDIRECT("B"&(ROW(B9)-C9+1)):B9)
2018/4/2 3886.9189 2 =AVERAGE(INDIRECT("B"&(ROW(B10)-C10+1)):B10)
2018/4/3 3862.4796 8 =AVERAGE(INDIRECT("B"&(ROW(B11)-C11+1)):B11)
2018/4/4 3854.8625 1 =AVERAGE(INDIRECT("B"&(ROW(B12)-C12+1)):B12)
2018/4/9 3852.9292 9 =AVERAGE(INDIRECT("B"&(ROW(B13)-C13+1)):B13)
2018/4/10 3927.1729 3 =AVERAGE(INDIRECT("B"&(ROW(B14)-C14+1)):B14)
2018/4/11 3938.3434 1 =AVERAGE(INDIRECT("B"&(ROW(B15)-C15+1)):B15)
2018/4/12 3898.6354 3 =AVERAGE(INDIRECT("B"&(ROW(B16)-C16+1)):B16)
2018/4/13 3871.1443 8 =AVERAGE(INDIRECT("B"&(ROW(B17)-C17+1)):B17)
2018/4/16 3808.863 2 =AVERAGE(INDIRECT("B"&(ROW(B18)-C18+1)):B18)
2018/4/17 3748.6412 2 =AVERAGE(INDIRECT("B"&(ROW(B19)-C19+1)):B19)
2018/4/18 3766.282 4 =AVERAGE(INDIRECT("B"&(ROW(B20)-C20+1)):B20)
2018/4/19 3811.843 6 =AVERAGE(INDIRECT("B"&(ROW(B21)-C21+1)):B21)
2018/4/20 3760.8543 3 =AVERAGE(INDIRECT("B"&(ROW(B22)-C22+1)):B22)
The table above is a snapshot of the Excel version.
I figured it out. But I don't think it is the best solution...
import pandas as pd

data = pd.read_excel('data.xlsx', index_col='date')

def get_price_mean(x):
    # Look up the window length that belongs to the last row of the
    # expanding slice; it is NaN for the first few rows.
    win = data['ma window'].iloc[x.shape[0] - 1]
    if pd.isna(win):
        return float('nan')
    return pd.Series(x).rolling(window=int(win)).mean().iloc[-1]

data['ma'] = data['close'].expanding().apply(get_price_mean)
print(data)
The result is:
close ma window ma
date
2018-03-21 4061.0502 NaN NaN
2018-03-22 4020.3490 NaN NaN
2018-03-23 3904.9355 3.0 3995.444900
2018-03-26 3879.8930 2.0 3892.414250
2018-03-27 3913.2689 4.0 3929.611600
2018-03-28 3842.7155 7.0 NaN
2018-03-29 3894.0498 1.0 3894.049800
2018-03-30 3898.4977 6.0 3888.893400
2018-04-02 3886.9189 2.0 3892.708300
2018-04-03 3862.4796 8.0 3885.344862
2018-04-04 3854.8625 1.0 3854.862500
2018-04-09 3852.9292 9.0 3876.179456
2018-04-10 3927.1729 3.0 3878.321533
2018-04-11 3938.3434 1.0 3938.343400
2018-04-12 3898.6354 3.0 3921.383900
2018-04-13 3871.1443 8.0 3886.560775
2018-04-16 3808.8630 2.0 3840.003650
2018-04-17 3748.6412 2.0 3778.752100
2018-04-18 3766.2820 4.0 3798.732625
2018-04-19 3811.8430 6.0 3817.568150
2018-04-20 3760.8543 3.0 3779.659767
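For what it's worth, the expanding/apply detour recomputes a rolling mean over the whole prefix at every step. A plain positional loop may be simpler; here is a minimal sketch, assuming data is the frame loaded above:
import numpy as np
import pandas as pd

def variable_ma(close, windows):
    # For each row, average the last `win` closes; leave NaN when the
    # window is undefined or longer than the available history.
    out = np.full(len(close), np.nan)
    for i, win in enumerate(windows):
        if pd.notna(win) and 1 <= int(win) <= i + 1:
            out[i] = close[i - int(win) + 1 : i + 1].mean()
    return out

data['ma'] = variable_ma(data['close'].to_numpy(), data['ma window'].to_numpy())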
Related
Here is the original question:
With only being able to import numpy and pandas, I need to do the following: scale medianIncome to express the values in units of $10,000 (for example, 150000 becomes 15, 30000 becomes 3, 15000 becomes 1.5, etc.).
Here's the code that works:
temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float) / 10000
But when I call the df afterwards, it still shows the original amounts instead of the 15 or 1.5. I'm not sure what I'm missing here.
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
The result is:
id medianIncome
0 1.7250
1 2.1806
2 2.4038
3 2.4597
4 1.8080
Name: medianIncome, Length: 20640, dtype: float64
But then when I call the df with housing_cal, it's back to:
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
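What's missing is almost certainly the assignment: the expression computes a new Series but never writes it back to the frame. A minimal sketch of the fix, assuming the frame and column names from the snippet above:
temp_housing['medianIncome'] = (
    temp_housing['medianIncome']
    .replace('[(]', '-', regex=True)  # same cleanup as before
    .astype(float)
    / 10000  # express in units of $10,000
)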
I have a df (car_data) where there are 2 relevant columns: model and is_4wd.
The is_4wd column is either 0 or 1 and has about 25,000 missing values. However, I know that some models are 4WD because they already have a 1 in some rows, while other rows for the same models have NaN.
How can I replace the NaN values with 1 for the models I already know are 4WD?
I created a for loop, but I had to change all NaN values to 0 first, build a variable of unique car models, and the loop takes a long time to complete.
car_data['is_4wd'] = car_data['is_4wd'].fillna(0)
car_4wd = car_data.query('is_4wd == 1')
caru = car_4wd['model'].unique()  # models known to be 4WD
for index, row in car_data.iterrows():
    if row['is_4wd'] == 0:
        if row['model'] in caru:
            car_data.loc[car_data.model == row['model'], 'is_4wd'] = 1
Is there a better way to do it? I tried several replace() methods, but to no avail.
The df head looks like this (you can see that ford f-150, for example, has both 1 and NaN in is_4wd). The expected outcome is to replace with 1 all the NaNs for the models that already have a value entered.
price model_year model condition cylinders fuel odometer \
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0
1 25500 NaN ford f-150 good 6.0 gas 88705.0
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0
3 1500 2003.0 ford f-150 fair 8.0 gas NaN
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0
transmission type paint_color is_4wd date_posted days_listed
0 automatic SUV NaN 1.0 2018-06-23 19
1 automatic pickup white 1.0 2018-10-19 50
2 automatic sedan red NaN 2019-02-07 79
3 automatic pickup NaN NaN 2019-03-22 9
4 automatic sedan black NaN 2019-04-02 28
Group your data by the model column and fill the is_4wd column with the max value of each group:
df['is_4wd'] = df.groupby('model')['is_4wd'] \
.transform(lambda x: x.fillna(x.max())).fillna(0).astype(int)
print(df[['model', 'is_4wd']])
# Output:
model is_4wd
0 bmw x5 1
1 ford f-150 1
2 hyundai sonata 0
3 ford f-150 1
4 chrysler 200 0
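A loop-free alternative for comparison, sketched under the assumption that the frame is car_data as above: collect the models already marked 1, then fill the NaNs from a boolean isin mask.
# Models that already have a 1 somewhere are known to be 4WD.
known_4wd = car_data.loc[car_data['is_4wd'] == 1, 'model'].unique()
# Fill NaNs from the mask (1 for known 4WD models, 0 otherwise).
car_data['is_4wd'] = car_data['is_4wd'] \
    .fillna(car_data['model'].isin(known_4wd).astype(int)) \
    .astype(int)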
I have a huge pandas data frame containing data about hospital encounters. This data frame has the following columns: a hospital encounter id (hadm_id), a datetime object indicating the time a variable was charted (ce_charttime), and the values of the recorded variables. There are many variables, but for simplicity I am currently working with only 2: heart rate (hr) and respiratory rate (resp). Here is the head of the data frame:
hadm_id ce_charttime hr resp
0 100020 2142-11-30 23:06:00 62.0 NaN
1 100020 2142-11-30 23:06:00 NaN 13.0
2 100021 2109-08-21 20:00:00 134.0 NaN
3 100021 2109-08-21 19:30:00 133.0 NaN
4 100021 2109-08-21 20:00:00 NaN 18.0
If you notice, the encounter with hadm_id=100020 has two rows. However, both rows have the same ce_charttime with value 2142-11-30 23:06:00, which means it should really be one row, with one ce_charttime having a value for both hr and resp: ce_charttime=2142-11-30 23:06:00, hr=62.0, resp=13.0.
Similarly, for the encounter with hadm_id=100021, there are 3 rows; however, there really need to be only 2 rows. After sorting by time, the first row would have the values ce_charttime=2109-08-21 19:30:00, hr=133.0, resp=NaN, and the second row would have the values ce_charttime=2109-08-21 20:00:00, hr=134.0, resp=18.0.
Essentially, I need the data frame to look like this:
hadm_id ce_charttime hr resp
0 100020 2142-11-30 23:06:00 62.0 13.0
1 100021 2109-08-21 19:30:00 133.0 NaN
2 100021 2109-08-21 20:00:00 134.0 18.0
This is just a sample of the dataframe; the full dataframe has more than 30 variables and more than 8000 unique encounters, with a lot of rows carrying this kind of redundant information. Is there a way to collapse these redundant rows?
Any help is appreciated. Please let me know if further information is needed.
Thanks.
Use GroupBy.sum with min_count=1 to keep NaN values:
df.groupby(['hadm_id', 'ce_charttime']).sum(min_count=1).reset_index()
This works as long as there is no more than one non-NaN value per column (hr, resp) within each group.
Output:
hadm_id ce_charttime hr resp
0 100020 2142-11-30 23:06:00 62.0 13.0
1 100021 2109-08-21 19:30:00 133.0 NaN
2 100021 2109-08-21 20:00:00 134.0 18.0
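If a group could ever hold several non-NaN readings per column, sum would add them together; groupby.first, which returns the first non-NaN value in each column, may be the safer aggregation. A sketch under that assumption:
out = df.groupby(['hadm_id', 'ce_charttime'], as_index=False).first()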
I have a dataframe (movielens dataset)
(Pdb) self.data_train.head()
userId movieId rating timestamp
65414 466 608 4.0 945139883
79720 547 6218 4.0 1089518106
63354 457 4007 3.5 1471383787
29923 213 59333 2.5 1462636955
63651 457 102194 2.5 1471383710
I found the mean of each user's ratings:
user_mean = self.data_train['rating'].groupby(self.data_train['userId']).mean()
(Pdb) user_mean.head()
userId
1 2.527778
2 3.426471
3 3.588889
4 4.363158
5 3.908602
I want to subtract this mean value from the first dataframe for the matching user.
Is there a way of doing it without an explicit for loop?
I think you need GroupBy.transform with 'mean' to get a Series the same size as the original DataFrame, so you can subtract it from the column with Series.sub:
s = self.data_train.groupby('userId')['rating'].transform('mean')
self.data_train['new'] = self.data_train['rating'].sub(s)
Sample (with the userId values changed to make a better example):
print (data_train)
userId movieId rating timestamp
65414 466 608 4.0 945139883
79720 466 6218 4.0 1089518106
63354 457 4007 3.5 1471383787
29923 466 59333 2.5 1462636955
63651 457 102194 2.5 1471383710
s = data_train.groupby('userId')['rating'].transform('mean')
print (s)
65414 3.5
79720 3.5
63354 3.0
29923 3.5
63651 3.0
Name: rating, dtype: float64
data_train['new'] = data_train['rating'].sub(s)
print (data_train)
userId movieId rating timestamp new
65414 466 608 4.0 945139883 0.5
79720 466 6218 4.0 1089518106 0.5
63354 457 4007 3.5 1471383787 0.5
29923 466 59333 2.5 1462636955 -1.0
63651 457 102194 2.5 1471383710 -0.5
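For comparison, an equivalent approach without transform (a sketch): compute the per-user means once, then map them back onto the rows by userId.
# Per-user mean rating, indexed by userId.
user_mean = data_train.groupby('userId')['rating'].mean()
# Align each row with its user's mean and subtract.
data_train['new'] = data_train['rating'] - data_train['userId'].map(user_mean)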
I have a data frame with a 2-level index, "DATE" (monthly data) and "ID", and a column named Volume. Now I want to iterate over it and fill, for every unique ID, a new column with the average value of the Volume column.
The basic idea is to figure out which months are above the yearly average for every ID.
list(df.index)
(Timestamp('1970-09-30 00:00:00'), 12167.0)
print(df.index.name)
None
I could not seem to find a tutorial that addresses this :(
Can someone please point me in the right direction?
SHRCD EXCHCD SICCD PRC VOL RET SHROUT \
DATE PERMNO
1970-08-31 10559.0 10.0 1.0 5311.0 35.000 1692.0 0.030657 12048.0
12626.0 10.0 1.0 5411.0 46.250 926.0 0.088235 6624.0
12749.0 11.0 1.0 5331.0 45.500 5632.0 0.126173 34685.0
13100.0 11.0 1.0 5311.0 22.000 1759.0 0.171242 15107.0
13653.0 10.0 1.0 5311.0 13.125 141.0 0.220930 1337.0
13936.0 11.0 1.0 2331.0 11.500 270.0 -0.053061 3942.0
14322.0 11.0 1.0 5311.0 64.750 6934.0 0.024409 154187.0
16969.0 10.0 1.0 5311.0 42.875 1069.0 0.186851 13828.0
17072.0 10.0 1.0 5311.0 14.750 777.0 0.026087 5415.0
17304.0 10.0 1.0 5311.0 24.875 1939.0 0.058511 8150.0
You can use transform with the year level of the index to get a Series the same size as the original DataFrame:
print (df)
VOL
DATE PERMNO
1970-08-31 10559.0 1
10559.0 2
12749.0 3
1971-08-31 13100.0 4
13100.0 5
df['avg'] = df.groupby([df.index.get_level_values(0).year, 'PERMNO'])['VOL'].transform('mean')
print (df)
VOL avg
DATE PERMNO
1970-08-31 10559.0 1 1.5
10559.0 2 1.5
12749.0 3 3.0
1971-08-31 13100.0 4 4.5
13100.0 5 4.5
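With the yearly average in place, the underlying goal (flagging months above the yearly average for each ID) is a single comparison; a sketch:
# True where the month's volume exceeds that ID's yearly average.
df['above_avg'] = df['VOL'] > df['avg']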