I have a daily time series with many NaN values. I want to resample it to monthly data, taking into account only the months that have fewer than 10 days of NaN values.
I've tried using the resample function like this:
df =
Date Sr_1 Sr_2 Sr_3
01/12/1978 32.2 20.8 NaN
02/12/1978 32.2 20.6 NaN
03/12/1978 31.6 22 NaN
04/12/1978 28.2 19.4 NaN
05/12/1978 29.8 22.8 24.6
06/12/1978 32 22.2 25.8
07/12/1978 32.8 23.2 NaN
08/12/1978 29.8 NaN 26.8
09/12/1978 31.4 21.4 25.4
10/12/1978 28.8 24 NaN
11/12/1978 30.8 20 NaN
12/12/1978 32 24 25.6
13/12/1978 33 23.2 25.8
14/12/1978 32.4 22.4 24.6
15/12/1978 30 20.6 NaN
16/12/1978 32.6 21.2 NaN
17/12/1978 33 23.4 NaN
18/12/1978 30.4 20.4 26.4
19/12/1978 32 22.2 NaN
20/12/1978 32.2 NaN NaN
21/12/1978 32.8 22.8 NaN
22/12/1978 32 22.2 NaN
23/12/1978 32.2 NaN NaN
24/12/1978 31.4 NaN NaN
25/12/1978 33 NaN 25.6
26/12/1978 33.4 20.6 NaN
27/12/1978 33.6 22.2 NaN
28/12/1978 33.6 23.4 NaN
29/12/1978 33.8 23.4 NaN
30/12/1978 33.2 NaN 25.2
31/12/1978 33.6 23.4 25.2
df.resample('1MS', how='mean')
The result is:
01/12/1978 31.9 22.1 25.5
But Sr_3 has more than 10 NaN values, so its result should be NaN.
Thanks
Here's a hackyish way. First count the number of NaNs, then use where to NaN out the months that have too many.
In [11]: g = df1.groupby(pd.TimeGrouper('1MS'))
Note: count by using isnull and sum.
In [12]: g.apply(lambda x: pd.isnull(x).sum()).unstack(1) # Note: columns match res
Out[12]:
Sr_1 Sr_2 Sr_3
Date
1978-01-01 0 0 1
1978-02-01 0 0 1
1978-03-01 0 0 1
1978-04-01 0 0 1
1978-05-01 0 0 0
1978-06-01 0 0 0
1978-07-01 0 0 1
1978-08-01 0 1 0
1978-09-01 0 0 0
1978-10-01 0 0 1
1978-11-01 0 0 1
1978-12-01 0 5 13
In [13]: under_ten_nan = g.apply(lambda x: pd.isnull(x).sum()).unstack(1) <= 10
Use where to NaN out the entries with more than 10 NaNs:
In [14]: res.where(under_ten_nan)
Out[14]:
Sr_1 Sr_2 Sr_3
Date
1978-01-01 32.20 20.80 NaN
1978-02-01 32.20 20.60 NaN
1978-03-01 31.60 22.00 NaN
1978-04-01 28.20 19.40 NaN
1978-05-01 29.80 22.80 24.6
1978-06-01 32.00 22.20 25.8
1978-07-01 32.80 23.20 NaN
1978-08-01 29.80 NaN 26.8
1978-09-01 31.40 21.40 25.4
1978-10-01 28.80 24.00 NaN
1978-11-01 30.80 20.00 NaN
1978-12-01 32.51 22.36 NaN
You can pre-filter the groups (using a similar algorithm to @Andy Hayden's). Not sure if this is any less hacky!
This is new in 0.14.0 (you can use pd.TimeGrouper('1MS') in prior versions).
In [20]: g = pd.Grouper(freq='1MS')
Filter and keep only the groups where the column satisfies the criterion of having < 10 NaNs.
Then do the resample (this is what groupby(g).mean() does):
In [28]: pd.concat([
             df.groupby(g)[c].filter(lambda x: x.isnull().sum() < 10).groupby(g).mean()
             for c in df.columns], axis=1)
Out[28]:
Sr_1 Sr_2 Sr_3
Date
1978-01-01 32.20 20.80 NaN
1978-02-01 32.20 20.60 NaN
1978-03-01 31.60 22.00 NaN
1978-04-01 28.20 19.40 NaN
1978-05-01 29.80 22.80 24.6
1978-06-01 32.00 22.20 25.8
1978-07-01 32.80 23.20 NaN
1978-08-01 29.80 NaN 26.8
1978-09-01 31.40 21.40 25.4
1978-10-01 28.80 24.00 NaN
1978-11-01 30.80 20.00 NaN
1978-12-01 32.51 22.36 NaN
This has to be done column by column and then concatenated, because filter works on the entire group.
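Both answers above rely on APIs from older pandas releases (resample(..., how=...) and pd.TimeGrouper are no longer available in current pandas). A minimal sketch of the same count-then-mask idea with the current resample API follows. The helper name monthly_mean_max_nan and the sample data are made up for illustration, and the threshold is read as "strictly fewer than 10 NaN days", which is an assumption about the question's intent.

```python
import numpy as np
import pandas as pd


def monthly_mean_max_nan(df, max_nan=10):
    """Monthly mean per column; NaN where a month has max_nan or more NaN days.

    Hypothetical helper: assumes df has a daily DatetimeIndex and that
    missing days appear as NaN rows (days absent from the index are not counted).
    """
    monthly_mean = df.resample("MS").mean()       # mean() skips NaN by default
    nan_count = df.isna().resample("MS").sum()    # NaN days per month and column
    return monthly_mean.where(nan_count < max_nan)


# Illustrative data only: December 1978, Sr_3 missing on the first 20 days.
idx = pd.date_range("1978-12-01", "1978-12-31", freq="D")
df = pd.DataFrame({"Sr_1": np.linspace(30.0, 33.0, len(idx)), "Sr_3": 25.0}, index=idx)
df.loc[df.index[:20], "Sr_3"] = np.nan

result = monthly_mean_max_nan(df)
print(result)
```

Because the NaN counts and the means are resampled with the same rule, the two frames share index and columns, so where needs no extra alignment step.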
Link to table: https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page=0
This table goes from page 0 to page 27.
I have successfully scraped the table into a pandas df for page 0:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
# getting the table
table = soup.find('table', {'class':'views-table views-view-table cols-20'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
df = pd.DataFrame(columns = headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
Now I need to do the same for all the pages and store the result in a single df.
You can use pandas.read_html to parse tables to dataframes and then concat them:
import pandas as pd
url = "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value=all&page={}"
all_df = []
for page in range(0, 10): # <-- increase number of pages here
    print("Getting page", page)
    all_df.append(pd.read_html(url.format(page))[0])
final_df = pd.concat(all_df).reset_index(drop=True)
print(final_df.tail(10).to_markdown(index=False))
| Date | 20 YR | 30 YR | Extrapolation Factor | 8 WEEKS BANK DISCOUNT | COUPON EQUIVALENT | 52 WEEKS BANK DISCOUNT | COUPON EQUIVALENT.1 | 1 Mo | 2 Mo | 3 Mo | 6 Mo | 1 Yr | 2 Yr | 3 Yr | 5 Yr | 7 Yr | 10 Yr | 20 Yr | 30 Yr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12/13/2001 | nan | nan | nan | nan | nan | nan | nan | 1.69 | nan | 1.69 | 1.78 | 2.2 | 3.09 | 3.62 | 4.4 | 4.9 | 5.13 | 5.81 | 5.53 |
| 12/14/2001 | nan | nan | nan | nan | nan | nan | nan | 1.7 | nan | 1.73 | 1.81 | 2.22 | 3.2 | 3.73 | 4.52 | 5.01 | 5.24 | 5.89 | 5.59 |
| 12/17/2001 | nan | nan | nan | nan | nan | nan | nan | 1.72 | nan | 1.74 | 1.84 | 2.24 | 3.21 | 3.74 | 4.54 | 5.03 | 5.26 | 5.91 | 5.61 |
| 12/18/2001 | nan | nan | nan | nan | nan | nan | nan | 1.72 | nan | 1.71 | 1.81 | 2.24 | 3.13 | 3.66 | 4.46 | 4.93 | 5.16 | 5.81 | 5.52 |
| 12/19/2001 | nan | nan | nan | nan | nan | nan | nan | 1.69 | nan | 1.69 | 1.8 | 2.23 | 3.11 | 3.63 | 4.38 | 4.84 | 5.08 | 5.73 | 5.45 |
| 12/20/2001 | nan | nan | nan | nan | nan | nan | nan | 1.67 | nan | 1.69 | 1.79 | 2.22 | 3.15 | 3.67 | 4.42 | 4.86 | 5.08 | 5.73 | 5.43 |
| 12/21/2001 | nan | nan | nan | nan | nan | nan | nan | 1.67 | nan | 1.71 | 1.81 | 2.23 | 3.17 | 3.69 | 4.45 | 4.89 | 5.12 | 5.76 | 5.45 |
| 12/24/2001 | nan | nan | nan | nan | nan | nan | nan | 1.66 | nan | 1.72 | 1.83 | 2.24 | 3.22 | 3.74 | 4.49 | 4.95 | 5.18 | 5.81 | 5.49 |
| 12/26/2001 | nan | nan | nan | nan | nan | nan | nan | 1.77 | nan | 1.75 | 1.87 | 2.34 | 3.26 | 3.8 | 4.55 | 5 | 5.22 | 5.84 | 5.52 |
| 12/27/2001 | nan | nan | nan | nan | nan | nan | nan | 1.75 | nan | 1.74 | 1.84 | 2.27 | 3.19 | 3.71 | 4.46 | 4.9 | 5.13 | 5.78 | 5.49 |
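Rather than hard-coding the page count, one option is to keep requesting pages until one of them yields no table. The sketch below assumes that a page past the last one contains no table (not verified for this site), in which case pd.read_html raises ValueError. The names collect_pages and fake_fetch are hypothetical, and the fetcher is passed in so the loop can be tried without network access.

```python
import pandas as pd


def collect_pages(fetch_page, max_pages=100):
    """Call fetch_page(0), fetch_page(1), ... and concatenate the results.

    Stops at the first ValueError (which pd.read_html raises when a page
    has no table) or after max_pages pages, whichever comes first.
    """
    frames = []
    for page in range(max_pages):
        try:
            frames.append(fetch_page(page))
        except ValueError:
            break
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stand-in fetcher for demonstration: three pages of made-up rows, then none.
def fake_fetch(page):
    if page >= 3:
        raise ValueError("No tables found")
    return pd.DataFrame({"Date": [f"day-{page}"], "1 Mo": [1.0 + page]})


demo = collect_pages(fake_fetch)
print(demo)

# With the real site, the fetcher would wrap the call used in the answer:
# final_df = collect_pages(lambda p: pd.read_html(url.format(p))[0])
```

The max_pages cap is a safety net in case the site keeps returning a table for out-of-range page numbers.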
I have this df:
CODE YEAR MONTH DAY TMAX TMIN PP
0 130 1991 1 1 32.6 23.4 0.0
1 130 1991 1 2 31.2 22.4 0.0
2 130 1991 1 3 32.0 NaN 0.0
3 130 1991 1 4 32.2 23.0 0.0
4 130 1991 1 5 30.5 22.0 0.0
... ... ... ... ... ... ...
20118 130 2018 9 30 31.8 21.2 NaN
30028 132 1991 1 1 35.2 NaN 0.0
30029 132 1991 1 2 34.6 NaN 0.0
30030 132 1991 1 3 35.8 NaN 0.0
30031 132 1991 1 4 34.8 NaN 0.0
... ... ... ... ... ... ...
45000 132 2019 10 5 35.5 NaN 21.1
46500 133 1991 1 1 35.5 NaN 21.1
I need to count the months that have at least one non-NaN value in the TMAX, TMIN and PP columns. If a month has only NaN values, that month doesn't count. I need to do this for each CODE.
Expected value:
CODE YEAR MONTH DAY TMAX TMIN PP JANUARY_TMAX FEBRUARY_TMAX MARCH_TMAX APRIL_TMAX etc
130 1991 1 1 32.6 23.4 0 23 25 22 27 …
130 1991 1 2 31.2 22.4 0 NaN NaN NaN NaN NaN
130 1991 1 3 32 NaN 0 NaN NaN NaN NaN NaN
130 1991 1 4 32.2 23 0 NaN NaN NaN NaN NaN
130 1991 1 5 30.5 22 0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
130 2018 9 30 31.8 21.2 NaN NaN NaN NaN NaN NaN
132 1991 1 1 35.2 NaN 0 21 23 22 22 …
132 1991 1 2 34.6 NaN 0 NaN NaN NaN NaN NaN
132 1991 1 3 35.8 NaN 0 NaN NaN NaN NaN NaN
132 1991 1 4 34.8 NaN 0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
132 2019 1 1 35.5 NaN 21.1 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
133 1991 1 1 35.5 NaN 21.1 25 22 22 21 …
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
For example: for code 130 in the TMAX column, I have 23 Januarys with at least one non-NaN value, 25 Februarys with at least one non-NaN value, etc.
Would you mind helping me? Thanks in advance.
This may not be super efficient, but here is how you can do it for one of the columns, TMAX in this case. Just repeat the process for the other columns.
# Count occurrences of each month when TMAX is not null
tmax_cts_long = df[df.TMAX.notnull()].drop_duplicates(subset=['CODE', 'YEAR', 'MONTH']).groupby(['CODE', 'MONTH']).size().reset_index(name='COUNT')
# Transpose the long table of counts to wide format
tmax_cts_wide = tmax_cts_long.pivot(index='CODE', columns='MONTH', values='COUNT')
# Merge table of counts with the original dataframe
final_df = df.merge(tmax_cts_wide, on='CODE', how='left')
# Keep the counts only on the first row of each CODE; set them to NaN on all other rows
mask = ~final_df.duplicated(subset='CODE')
final_df.loc[~mask, [col for col in final_df.columns if isinstance(col, int)]] = None
# Rename new columns to follow the desired naming format
mon_dict = {1: 'JANUARY', 2: 'FEBRUARY', ...}
tmax_mon_dict = {k: v + '_TMAX' for k, v in mon_dict.items()}
final_df.rename(columns=tmax_mon_dict, inplace=True)
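The per-column process above can also be done for TMAX, TMIN and PP in one pass with two groupby steps. This is only a sketch: count_valid_months is a hypothetical helper name, the sample rows are made up, and it returns a compact table of counts per (CODE, MONTH) rather than merging them back into the daily frame.

```python
import numpy as np
import pandas as pd


def count_valid_months(df, value_cols=("TMAX", "TMIN", "PP")):
    """For each CODE and calendar MONTH, count the years in which the month
    has at least one non-NaN value, separately for each value column."""
    cols = list(value_cols)
    # count() ignores NaN, so > 0 means "at least one valid value that month"
    has_data = df.groupby(["CODE", "YEAR", "MONTH"])[cols].count().gt(0)
    # Summing the booleans over YEAR gives the number of valid months
    return has_data.groupby(level=["CODE", "MONTH"]).sum()


nan = np.nan
df = pd.DataFrame(
    [
        (130, 1991, 1, 1, 32.6, 23.4, 0.0),
        (130, 1991, 1, 2, nan, nan, 0.0),
        (130, 1992, 1, 1, nan, 22.0, nan),
        (130, 1992, 1, 2, nan, nan, nan),
        (130, 1991, 2, 1, 30.0, nan, 0.0),
        (132, 1991, 1, 1, 35.2, nan, 0.0),
        (132, 1992, 1, 1, 34.0, nan, 0.0),
    ],
    columns=["CODE", "YEAR", "MONTH", "DAY", "TMAX", "TMIN", "PP"],
)

counts = count_valid_months(df)
print(counts)
```

Calling counts.unstack("MONTH") would give one row per CODE with a column per (variable, month) pair, which is close to the wide layout in the question.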
I have this df:
CODE YEAR MONTH DAY TMAX TMIN PP BAD PERIOD 1 BAD PERIOD 2
9984 000130 1991 1 1 32.6 23.4 0.0 1991 1998
9985 000130 1991 1 2 31.2 22.4 0.0 NaN NaN
9986 000130 1991 1 3 32.0 NaN 0.0 NaN NaN
9987 000130 1991 1 4 32.2 23.0 0.0 NaN NaN
9988 000130 1991 1 5 30.5 22.0 0.0 NaN NaN
... ... ... ... ... ... ...
20118 000130 2018 9 30 31.8 21.2 NaN NaN NaN
30028 000132 1991 1 1 35.2 NaN 0.0 2005 2010
30029 000132 1991 1 2 34.6 NaN 0.0 NaN NaN
30030 000132 1991 1 3 35.8 NaN 0.0 NaN NaN
30031 000132 1991 1 4 34.8 NaN 0.0 NaN NaN
... ... ... ... ... ... ...
50027 000132 2019 10 5 36.5 NaN 13.1 NaN NaN
50028 000133 1991 1 1 36.2 NaN 0.0 1991 2010
50029 000133 1991 1 2 36.6 NaN 0.0 NaN NaN
50030 000133 1991 1 3 36.8 NaN 5.0 NaN NaN
50031 000133 1991 1 4 36.8 NaN 0.0 NaN NaN
... ... ... ... ... ... ...
54456 000133 2019 10 5 36.5 NaN 12.1 NaN NaN
I want to set the values in the TMAX, TMIN and PP columns to NaN, but only for the periods specified in BAD PERIOD 1 and BAD PERIOD 2, and only within their respective CODE. For example, if BAD PERIOD 1 is 1991 and BAD PERIOD 2 is 1998, I want all TMAX, TMIN and PP values for code 000130 to be NaN from 1991 (BAD PERIOD 1) through 1998 (BAD PERIOD 2). I have 371 unique codes in the CODE column, so I might use df.groupby("CODE").
Expected result after the change:
CODE YEAR MONTH DAY TMAX TMIN PP BAD PERIOD 1 BAD PERIOD 2
9984 000130 1991 1 1 NaN NaN NaN 1991 1998
9985 000130 1991 1 2 NaN NaN NaN NaN NaN
9986 000130 1991 1 3 NaN NaN NaN NaN NaN
9987 000130 1991 1 4 NaN NaN NaN NaN NaN
9988 000130 1991 1 5 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
20118 000130 2018 9 30 31.8 21.2 NaN NaN NaN
30028 000132 1991 1 1 35.2 NaN 0.0 2005 2010
30029 000132 1991 1 2 34.6 NaN 0.0 NaN NaN
30030 000132 1991 1 3 35.8 NaN 0.0 NaN NaN
30031 000132 1991 1 4 34.8 NaN 0.0 NaN NaN
... ... ... ... ... ... ...
50027 000132 2019 10 5 36.5 NaN 13.1 NaN NaN
50028 000133 1991 1 1 NaN NaN NaN 1991 2010
50029 000133 1991 1 2 NaN NaN NaN NaN NaN
50030 000133 1991 1 3 NaN NaN NaN NaN NaN
50031 000133 1991 1 4 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
54456 000133 2019 10 5 36.5 NaN 12.1 NaN NaN
You can propagate the values in your bad-period columns with ffill, provided the non-NaN values are always in the first row of each CODE group and your data is ordered by CODE. If not, use groupby.transform with first. Then use mask to replace values with NaN where YEAR lies between the two bad-period columns once they are filled. The code refers to the two columns as BAD_1 and BAD_2, as in the printed output.
df_ = df[['BAD_1', 'BAD_2']].ffill()
#or more flexible df_ = df.groupby("CODE")[['BAD_1', 'BAD_2']].transform('first')
cols = ['TMAX', 'TMIN', 'PP']
df[cols] = df[cols].mask(df['YEAR'].ge(df_['BAD_1'])
& df['YEAR'].le(df_['BAD_2']))
print(df)
CODE YEAR MONTH DAY TMAX TMIN PP BAD_1 BAD_2
9984 130 1991 1 1 NaN NaN NaN 1991.0 1998.0
9985 130 1991 1 2 NaN NaN NaN NaN NaN
9986 130 1991 1 3 NaN NaN NaN NaN NaN
9987 130 1991 1 4 NaN NaN NaN NaN NaN
9988 130 1991 1 5 NaN NaN NaN NaN NaN
20118 130 2018 9 30 31.8 21.2 NaN NaN NaN
30028 132 1991 1 1 35.2 NaN 0.0 2005.0 2010.0
30029 132 1991 1 2 34.6 NaN 0.0 NaN NaN
30030 132 1991 1 3 35.8 NaN 0.0 NaN NaN
30031 132 1991 1 4 34.8 NaN 0.0 NaN NaN
50027 132 2019 10 5 36.5 NaN 13.1 NaN NaN
50028 133 1991 1 1 NaN NaN NaN 1991.0 2010.0
50029 133 1991 1 2 NaN NaN NaN NaN NaN
50030 133 1991 1 3 NaN NaN NaN NaN NaN
50031 133 1991 1 4 NaN NaN NaN NaN NaN
54456 133 2019 10 5 36.5 NaN 12.1 NaN NaN
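For a version that can be run end to end, here is a small sketch of the groupby.transform('first') variant on made-up rows. It uses Series.between and a .loc assignment in place of mask, which is a substitution on my part rather than the answer's exact code, and it keeps the answer's BAD_1 and BAD_2 column names.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "CODE": ["000130"] * 3 + ["000132"] * 3,
    "YEAR": [1991, 1998, 2018, 1991, 2005, 2019],
    "TMAX": [32.6, 31.2, 31.8, 35.2, 34.6, 36.5],
    "TMIN": [23.4, 22.4, 21.2, 20.0, 21.0, 22.0],
    "BAD_1": [1991, nan, nan, 2005, nan, nan],
    "BAD_2": [1998, nan, nan, 2010, nan, nan],
})

# Broadcast each CODE's first non-NaN bounds to every row of that CODE
bounds = df.groupby("CODE")[["BAD_1", "BAD_2"]].transform("first")

# True where YEAR falls inside the bad period (both ends inclusive)
bad = df["YEAR"].between(bounds["BAD_1"], bounds["BAD_2"])

cols = ["TMAX", "TMIN"]
df.loc[bad, cols] = np.nan
print(df)
```

Since the bounds are taken per group, the result does not depend on the rows being sorted by CODE.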
I have this df:
CODE DATE TMAX TMIN PP
0 000130 1991-01-01 NaN NaN 0.0
1 000130 1991-01-02 31.2 NaN 0.0
2 000130 1991-01-03 32.0 21.2 0.0
3 000130 1991-01-04 NaN NaN 0.0
4 000130 1991-01-05 NaN 22.0 0.0
... ... ... ... ...
34995 000135 1997-04-24 NaN NaN 0.0
34996 000135 1997-04-25 NaN NaN 4.0
34997 000135 1997-04-26 NaN 22.1 0.0
34998 000135 1997-04-27 31.0 NaN 5.0
34999 000135 1997-04-28 28.8 24.0 0.0
I'm counting the NaN values per CODE in the TMAX, TMIN and PP columns, using this code:
dfna=df[['TMAX','TMIN','PP']].isna().groupby(df.CODE).sum()
But I want to start counting NaN values only after the first non-NaN value.
Expected df:
CODE TMAX TMIN PP
000130 2 1 0
000135 0 1 0
...
...
How can I do this?
Thanks in advance.
Thinking in terms of the whole frame, you can use ffill to fill the later NaN values while the leading ones stay NaN. That lets you detect the NaNs that come after the first valid value:
df.isna() & df.ffill().notna()
Now you can use groupby.apply so that the fill happens within each CODE:
(df[['TMAX','TMIN','PP']].groupby(df['CODE'])
    .apply(lambda d: (d.isna() & d.ffill().notna()).sum())
)
Output:
TMAX TMIN PP
CODE
130 2 1 0
135 0 1 0
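The same idea can be written without apply by forward-filling inside each group first. The sketch below rebuilds the ten rows shown in the question (the elided middle rows are left out, so the counts only reflect these rows) and keeps CODE as a string so the leading zeros survive.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "CODE": ["000130"] * 5 + ["000135"] * 5,
    "TMAX": [nan, 31.2, 32.0, nan, nan, nan, nan, nan, 31.0, 28.8],
    "TMIN": [nan, nan, 21.2, nan, 22.0, nan, nan, 22.1, nan, 24.0],
    "PP": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 5.0, 0.0],
})
cols = ["TMAX", "TMIN", "PP"]

# After a per-group forward fill, a cell is non-NaN exactly when some valid
# value has already appeared at or before that row within the same CODE.
seen_valid = df.groupby("CODE")[cols].ffill().notna()

# NaNs that come after the first valid value, summed per CODE
nan_after_first = (df[cols].isna() & seen_valid).groupby(df["CODE"]).sum()
print(nan_after_first)
```

Filling per group matters: a plain df.ffill() would carry the last value of one CODE into the leading NaNs of the next.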
I am trying to calculate a rolling mean on a dataframe with NaNs in pandas, but pandas seems to reset the window when it meets a NaN. Here's some code as an example:
import numpy as np
from pandas import *
foo = DataFrame(np.arange(0.0,13.0))
foo['1'] = np.arange(13.0,26.0)
foo.ix[4:6,0] = np.nan
foo.ix[4:7,1] = np.nan
bar = rolling_mean(foo, 4)
This gives a rolling mean that resets the window after each NaN, rather than just skipping the NaNs:
bar =
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 8.5 NaN
11 9.5 22.5
12 10.5 23.5
I have found an ugly iter/dropna() workaround that gives the right answer:
def sparse_rolling_mean(df_data, window):
    f_data = DataFrame(np.nan,index=df_data.index, columns=df_data.columns)
    for i in f_data.columns:
        f_data.ix[:,i] = rolling_mean(df_data.ix[:,i].dropna(),window)
    return f_data
bar = sparse_rolling_mean(foo,4)
bar
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
Does anyone know if it is possible to do this as an array function?
Many thanks in advance.
You may do:
>>> def sparse_rolling_mean(ts, window):
...     return rolling_mean(ts.dropna(), window).reindex_like(ts)
...
>>> foo.apply(sparse_rolling_mean, args=(4,))
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
[13 rows x 2 columns]
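The top-level rolling_mean function and the .ix indexer used in this thread were removed in later pandas releases. A sketch of the same dropna-then-reindex approach with the current .rolling() and .loc API is shown below. The column labels are strings here purely for convenience.

```python
import numpy as np
import pandas as pd


def sparse_rolling_mean(ts, window):
    """Rolling mean over the non-NaN values only, put back on the original index."""
    return ts.dropna().rolling(window).mean().reindex(ts.index)


foo = pd.DataFrame({"0": np.arange(0.0, 13.0), "1": np.arange(13.0, 26.0)})
foo.loc[4:6, "0"] = np.nan   # .loc label slices include both endpoints
foo.loc[4:7, "1"] = np.nan

bar = foo.apply(sparse_rolling_mean, args=(4,))
print(bar)
```

Rows that were NaN in the input stay NaN after the reindex, and the first valid row after a gap averages the last values before the gap with the new one, as in the output above.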
You can control what gets set to NaN with the min_periods argument:
In [12]: rolling_mean(foo, 4,min_periods=1)
Out[12]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 2.0 15.0
5 2.5 15.5
6 3.0 16.0
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
You can do this if you want results everywhere except where the original was NaN:
In [27]: rolling_mean(foo, 4,min_periods=1)[foo.notnull()]
Out[27]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
Your expected results are a bit odd, as the first 3 rows should have values.
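For reference, the min_periods variant can be sketched with the current .rolling() API as follows. It uses the same foo frame as the question, rebuilt with string column labels and .loc instead of .ix.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({"0": np.arange(0.0, 13.0), "1": np.arange(13.0, 26.0)})
foo.loc[4:6, "0"] = np.nan
foo.loc[4:7, "1"] = np.nan

# A window needs only one valid value to produce a result
filled = foo.rolling(4, min_periods=1).mean()

# Blank out the positions that were NaN in the original data
masked = filled.where(foo.notna())
print(masked)
```

Note that this differs from the dropna approach: each window still spans four consecutive rows and simply averages whatever valid values fall inside it.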