Converting multi-index dataframe to Xarray dataset either loses annual sequence or gives an error - pandas

Firstly, apologies, but I am unable to reproduce this error with code. I will try to describe it as best as possible using screenshots of the data and errors.
I've got a large dataframe indexed by 'Year' and 'Season', with values for latitude, longitude, and Rainfall among others, which looks like this:
This is organised to respect the annual sequence of 'Winter', 'Spring', 'Summer', 'Autumn' (numbers 1:4 in the Season column), and I need to keep this sequence after conversion to an Xarray Dataset too. But if I try to convert straight to a Dataset:
future = future.to_xarray()
I get the following error:
So it is clear I need to reindex by unique identifiers. I tried using just lat and lon, but this gives the same error (as there are duplicates). Resetting the index and then re-indexing using lat, lon, and time,
like so:
future = future.reset_index()
future.head()
future.set_index(['latitude', 'longitude', 'time'], inplace=True)
future.head()
allows for the
future = future.to_xarray()
code to work:
The problem is that this has now lost its annual sequencing: you can see from the Season variable in the dataset that it starts at '1' '1' '1' for the first 3 months of the year but then jumps to '3' '3' '3', meaning we're going from winter to summer and skipping spring.
This is only the case after re-indexing the dataframe, but I can't convert it to a Dataset without re-indexing, and I can't seem to re-index without disrupting the annual sequence. Is there some way to fix this?
I hope this is clear and the error is illustrated enough for someone to be able to help!
EDIT:
I think the issue here is that when it indexes by date it automatically orders the dates chronologically (e.g. 1952 follows 1951, etc.), but I don't want this; I want it to maintain the sequence in the initial dataframe (which is organised seasonally, but could have a spring from 1955 followed by a summer from 2000 followed by an autumn from 1976) - I need to retain this sequence.
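The re-ordering described in this EDIT can be reproduced in miniature with plain pandas; the frame below is a toy stand-in for the real data, and the explicit order column is the same trick the answer further down uses:

```python
import pandas as pd

# toy frame standing in for the real data: rows are in seasonal order,
# but the years are deliberately non-chronological
df = pd.DataFrame({'Year': [1955, 2000, 1976],
                   'Season': [2, 3, 4],
                   'tg': [10.0, 20.0, 15.0]})

df['Order'] = range(len(df))                # remember the original row order
chrono = df.set_index('Year').sort_index()  # sorting by date loses the sequence
restored = chrono.sort_values('Order')      # 'Order' gets the sequence back
```

Any operation that sorts by the date index will scramble the seasonal sequence; carrying an order column through the conversion lets you sort back afterwards.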
EDIT 2:
So the dataset looks like this when I set 'Year' as the index (or just keep the index generic), but I need the tg variable to have lat/lon associated with it, so that the dataset looks like this:
<xarray.Dataset>
Dimensions: (Year: 190080)
Coordinates:
* Year (Year) int64 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
Data variables:
Season (Year) object '1' '1' '2' '2' '2' '3' '3' '3' '4' '4' '4' '1' ...
latitude (Year) float64 51.12 51.12 51.12 51.12 51.12 51.12 51.12 ...
longitude (Year) float64 -10.88 -10.88 -10.88 -10.88 -10.88 -10.88 ...
seasdif (Year) float32 -0.79192877 -0.79192877 -0.55932236 ...
tg (Year, latitude, longitude) float32 nan nan nan nan nan nan nan nan nan nan nan ...
time (Year) datetime64[ns] 1970-01-31 1970-02-28 1970-03-31 ...

Tell me if this works for you. I have added an extra index column and used it to sort at the end.
import pandas as pd
import xarray as xr
import numpy as np
df = pd.DataFrame({'Year': [1951, 1951, 1951, 1951],
                   'Season': [1, 1, 1, 3],
                   'lat': [51, 51, 51, 51],
                   'long': [10.8, 10.8, 10.6, 10.6],
                   'time': ['1950-12-31', '1951-01-31', '1951-02-28', '1950-12-31']})
I made the index into a separate column, 'Order', and then used it along with set_index. This is because sorting is only possible on an index or a 1-D column, and we have three coordinates.
df.reset_index(level=0, inplace=True)
df = df.rename(columns={'index': 'Order'})
df['time'] = pd.to_datetime(df['time'])
df.set_index(['lat', 'long', 'time','Order'], inplace=True)
df.head()
df = df.to_xarray()
This should preserve the order and have lat, lon, and time associated with tg (I don't have it in my df, though).
df2 = df.sortby('Order')
You could also drop the 'Order' coordinate afterwards, though I am not sure if it will alter your order. (It does not alter mine.) Note that sortby and drop return new objects rather than modifying in place:
df2 = df2.drop_vars('Order')

Related

Pandas reindex Dates To Subset of Dates from List

I am sorry, but there is online documentation with examples and I'm still not understanding. I have a pandas df with an index of dates in datetime format (yyyy-mm-dd), and I'm trying to resample or reindex this dataframe based on a subset of dates in the same format (yyyy-mm-dd) that are in a list. I have converted the df.index values to datetime using:
dfmla.index = pd.to_datetime(dfmla.index)
I've tried various things and I keep getting NaN's after applying the reindex. I know this must be a datatypes problem and my df is in the form of:
df.dtypes
Out[30]:
month int64
mean_mon_flow float64
std_mon_flow float64
monthly_flow_ln float64
std_anomaly float64
dtype: object
My data looks like this:
df.head(5)
Out[31]:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1949-10-01 10 8.565828 0.216126 8.848631 1.308506
1949-11-01 11 8.598055 0.260254 8.368006 -0.883938
1949-12-01 12 8.612080 0.301156 8.384662 -0.755149
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
My month_list (list datatype) looks like this:
month_list[0:2]
Out[37]: ['1950-08-01', '1950-09-01']
I need my condensed, new reindexed df to look like this:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
thank you for your suggestions,
If you're certain that all the dates in month_list are in the index, you can do df.loc[month_list]; otherwise you can use reindex:
df.reindex(pd.to_datetime(month_list))
Output:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
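A runnable sketch of that fix, using a cut-down version of the frame from the question (only two of the five columns, values copied from the post):

```python
import pandas as pd

# toy frame mirroring the question's data
df = pd.DataFrame(
    {'month': [10, 11, 12, 8, 9],
     'monthly_flow_ln': [8.848631, 8.368006, 8.384662, 8.173776, 8.437089]},
    index=pd.to_datetime(['1949-10-01', '1949-11-01', '1949-12-01',
                          '1950-08-01', '1950-09-01']))
month_list = ['1950-08-01', '1950-09-01']

# the strings must be converted to datetimes first; reindexing a
# DatetimeIndex with labels of the wrong type is what produces NaN rows
sub = df.reindex(pd.to_datetime(month_list))
```

The key point is the dtype match: reindex looks labels up by equality, so string labels against a DatetimeIndex come back empty, which is exactly the all-NaN symptom described above.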

pandas resampling: aggregating monthly values with offset

I work with monthly climate data (e.g. monthly mean temperature or precipitation) where I am often interested in taking several-month means e.g. December-March or May-September. To do this, I'm attempting to aggregrate monthly time series data using offsets in pandas (version 1.3.5) following the documentation.
For example, I have a monthly time series:
import pandas as pd
index = pd.date_range(start="2000-01-31", end="2000-12-31", freq="M")
data = pd.Series(range(12), index=index)
Taking a 4-month mean:
data_4M = data.resample("4M").mean()
>>> data_4M
2000-01-31 0.0
2000-05-31 2.5
2000-09-30 6.5
2001-01-31 10.0
Freq: 4M, dtype: float64
Attempting a 4-month mean with a 2-month offset produces a warning with the same results as the no-offset example above:
data_4M_offset = data.resample("4M", offset="2M").mean()
c:\program files\python39\lib\site-packages\pandas\core\resample.py:1381: FutureWarning: Units 'M', 'Y' and 'y' do not represent unambiguous timedelta values and will be removed in a future version
tg = TimeGrouper(**kwds)
>>> data_4M_offset
2000-01-31 0.0
2000-05-31 2.5
2000-09-30 6.5
2001-01-31 10.0
Freq: 4M, dtype: float64
Does this mean that the monthly offset functionality has already been removed?
Is there another way that I can take multi-month averages with offsets?
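One workaround, sketched here as an assumption rather than part of the resample API: build the bins by hand from a running month number. The constant subtracted before the floor division controls where the bin edges fall (month-start stamps are used below purely for simplicity):

```python
import pandas as pd

# monthly series as in the question: twelve months of 2000, values 0..11
index = pd.to_datetime([f"2000-{m:02d}-01" for m in range(1, 13)])
data = pd.Series(range(12), index=index)

# label every month with a running month number, shift by the desired
# offset, and floor-divide by the window length to form 4-month bins
month_num = 12 * index.year + index.month
bin_id = (month_num - 3) // 4   # bins: Jan-Feb | Mar-Jun | Jul-Oct | Nov-Dec
result = data.groupby(bin_id).mean()
```

This avoids the deprecated 'M' offset entirely; changing the window length or the shift constant gives December-March, May-September, or any other anchored multi-month grouping.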

Seasonal Decomposition plots won't show despite pandas recognizing DateTime Index

I loaded the data as follows
og_data=pd.read_excel(r'C:\Users\hp\Downloads\dataframe.xlsx',index_col='DateTime')
And it looks like this:
DateTime A
2019-02-04 10:37:54 0
2019-02-04 10:47:54 1
2019-02-04 10:57:54 2
2019-02-04 11:07:54 3
2019-02-04 11:17:54 4
Problem is, I'm trying to set the data to a frequency, but there are NaN values that I have to drop, and even when I do, the frequency still seems irregular. I've got pandas to recognize the DateTime as the index:
og_data.index
DatetimeIndex
dtype='datetime64[ns]', name='DateTime', length=15536, freq=None)
but when I try doing this:
og_data.index.freq = '10T'
That should mean 10min, right?
But I get the following error instead:
ValueError: Inferred frequency None from passed values does not conform to passed frequency 10T
Even if I set the frequency to days:
og_data.index.freq = 'D'
I get a similar error.
The goal is to get a seasonal decomposition plots because I want to forecast the data. But I get the following error when I try to do so:
result=seasonal_decompose(og_data['A'],model='add')
result.plot();
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
Which makes sense: I can't set the datetime index to a specified frequency. I need it to be every 10 min. Please advise.
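A sketch of one way around this, on the assumption that the data really is roughly 10-minutely (toy values stand in for the real file): you can't assign a freq to an index that doesn't conform to it, but you can resample onto a regular 10-minute grid and interpolate the gaps, and the resulting index carries a freq, which is what seasonal_decompose checks for.

```python
import pandas as pd

# hypothetical irregular series standing in for og_data['A']
idx = pd.to_datetime(['2019-02-04 10:37:54', '2019-02-04 10:47:54',
                      '2019-02-04 10:57:54', '2019-02-04 11:07:54',
                      '2019-02-04 11:17:54'])
og_data = pd.DataFrame({'A': [0, 1, 2, 3, 4]}, index=idx)

# snap onto a regular 10-minute grid, then fill any gaps so freq is set
regular = og_data.resample('10min').mean().interpolate()
```

After this, regular.index.freq is no longer None, so seasonal_decompose(regular['A'], model='add') should get past the ValueError (given enough data for at least two full periods).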

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try: df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear')), and it throws {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns, tz] columns, which leads to that rather cryptic error message. E.g.
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill
(bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
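For reference, a self-contained variant of that assignment using transform instead of apply (column names are assumed from the question; the question's sample only shows lat, but its groupby uses both lat and lon). transform keeps the original row alignment, which sidesteps the index juggling that groupby-apply can produce:

```python
import pandas as pd
import numpy as np

# toy frame mirroring the question: NaNs in 'data' for interpolation
df = pd.DataFrame({'lat': [26.1, 26.1, 26.1, 26.1],
                   'lon': [50.0, 50.0, 50.0, 50.0],
                   'data': [np.nan, 1.0, np.nan, 3.0]})

# interpolate only the numeric column within each (lat, lon) group;
# transform returns a result aligned to df, so it assigns straight back
df['data'] = (df.groupby(['lat', 'lon'])['data']
                .transform(lambda s: s.interpolate(method='linear')))
```

Note that a leading NaN stays NaN, since linear interpolation has no earlier point to draw from.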
This error happens because one of the columns you are interpolating is of object data type. Interpolating works only for numerical data types such as integer or float.
If you need to use interpolating for an object or categorical data type, then first convert it to a numerical data type. For this, you need to encode your column first. The following piece of code will resolve your problem:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
notes_encoder=LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type. It must be int. If it is categorical, then change its type to int using the following code:
df['notes']=df['notes'].astype('int32')

What is the functionality of the filling method when reindexing?

When reindexing, say, 1 minute data to daily data (e.g. and index for daily prices at 16:00), if there is a situation that there is no 1 minute data for the 16:00 timestamp on a day, we would want to forward fill from the last non-null 1min data. In the following case, there is no 1min data before 16:00 on the 13th, and the last 1min data comes from 10th.
When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows that it is missing however.
import pandas as pd
import numpy as np
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan
hf.plot()
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')
To fill as one (or rather I) would expect, I need to do this:
daily2 = daily1.fillna(method='ffill')
If this is the case, what is the fill method in reindex actually doing? It is not clear to me from the pandas documentation alone. It seems to me I should not have to do the above line.
I'll post my comment from GitHub here as well:
The current behavior, in my opinion, makes more sense. NaN values can be valid "actual" values in some scenarios; the concept of an actual NaN value should be distinct from a NaN introduced by changing the index. If I have a dataframe like this:
A B C
1 1.242 NaN 0.110
3 NaN -0.185 -0.209
5 -0.581 1.483 NaN
and I want to keep all NaN as NaN, it makes much more sense to have:
df.reindex( [2, 4, 6], method='ffill' )
A B C
2 1.242 NaN 0.110
4 NaN -0.185 -0.209
6 -0.581 1.483 NaN
Just take whatever value there is (NaN or not) and fill forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.
This is completely different from
df.reindex( [2, 4, 6], method=None )
which produces
A B C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
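The contrast above can be checked directly with the same toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.242, np.nan, -0.581],
                   'B': [np.nan, -0.185, 1.483],
                   'C': [0.110, -0.209, np.nan]},
                  index=[1, 3, 5])

# method='ffill' carries each whole existing row forward, NaNs included
filled = df.reindex([2, 4, 6], method='ffill')
# method=None inserts brand-new rows, which are all-NaN
blank = df.reindex([2, 4, 6], method=None)
```

So ffill propagates rows by label position; it does not run a per-column fillna, which is exactly the distinction the comment is making.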
Here is an example:
np.nan can just mean "not applicable"; say I have hourly data, and on weekends some calculations are just not applicable. I will fill NaN for those columns during the weekends. Now if I reindex to a finer index, say every minute, the reindex will pick the last value from Friday and fill it out for the whole weekend. This is wrong.
In reindexing a dataframe, forward fill means: just take whatever value there is (NaN or not) and fill forward until the next available index. A NaN value can be just an actual valid observation which you want to keep as is.
Reindexing should not enforce a mandatory fillna on the data.