Pandas reindex Dates To Subset of Dates from List - pandas

I am sorry, but there is online documentation and examples and I'm still not understanding. I have a pandas df with an index of dates in datetime format (yyyy-mm-dd) and I'm trying to resample or reindex this dataframe based on a subset of dates in the same format (yyyy-mm-dd) that are in a list. I have converted the df.index values to datetime using:
dfmla.index = pd.to_datetime(dfmla.index)
I've tried various things and I keep getting NaN's after applying the reindex. I know this must be a datatypes problem and my df is in the form of:
df.dtypes
Out[30]:
month int64
mean_mon_flow float64
std_mon_flow float64
monthly_flow_ln float64
std_anomaly float64
dtype: object
My data looks like this:
df.head(5)
Out[31]:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1949-10-01 10 8.565828 0.216126 8.848631 1.308506
1949-11-01 11 8.598055 0.260254 8.368006 -0.883938
1949-12-01 12 8.612080 0.301156 8.384662 -0.755149
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
My month_list (list datatype) looks like this:
month_list[0:2]
Out[37]: ['1950-08-01', '1950-09-01']
I need my condensed, new reindexed df to look like this:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
thank you for your suggestions,

If you're certain that all month_list are in the index, you can do df.loc[month_list], else you can use reindex:
df.reindex(pd.to_datetime(month_list))
Output:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
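The difference between the two approaches shows up when a requested date is missing from the index. The following is a minimal sketch on a cut-down version of the data; the extra date '1951-01-01' and the variable names wanted, reindexed and present are made up for illustration:

```python
import pandas as pd

# Small stand-in for the question's frame (only two of the columns).
df = pd.DataFrame(
    {"month": [10, 8, 9], "std_anomaly": [1.308506, -1.416887, -0.644967]},
    index=pd.to_datetime(["1949-10-01", "1950-08-01", "1950-09-01"]),
)
df.index.name = "date"

# Hypothetical list: the last date does not exist in df.
month_list = ["1950-08-01", "1950-09-01", "1951-01-01"]

# Convert the strings first so they have the same type as the index.
wanted = pd.to_datetime(month_list)

# reindex keeps every requested date; the missing one becomes a row of NaN.
reindexed = df.reindex(wanted)

# A boolean mask keeps only the dates that are actually present.
present = df.loc[df.index.isin(wanted)]

print(reindexed)
print(present)
```

If the NaN rows are not wanted, the isin mask is one way to drop them up front.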

Related

unable to fetch row where index is of type dtype='datetime64[ns]'

I have a pandas main_df dataframe with date as index
<bound method Index.get_value of DatetimeIndex(['2021-05-11', '2021-05-12','2021-05-13'],
dtype='datetime64[ns]', name='date', freq=None)>
What I am trying to do is fetch a row based on a certain date.
I tried it like this:
main_df.loc['2021-05-11'] and it works fine.
But if I pass a date object, it fails:
main_df.loc[datetime.date(2021, 5, 12)] raises a KeyError.
The index is a DatetimeIndex, so why does it throw an error when the key is not a string?
The reason is that a DatetimeIndex is an array of datetimes, so selecting by date objects fails.
You need to select by datetimes instead:
main_df = pd.DataFrame({'a':range(3)},
index=pd.to_datetime(['2021-05-11', '2021-05-12','2021-05-13']))
print (main_df)
a
2021-05-11 0
2021-05-12 1
2021-05-13 2
print (main_df.index)
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'], dtype='datetime64[ns]', freq=None)
print (main_df.loc[datetime.datetime(2021, 5, 12)])
a 1
Name: 2021-05-12 00:00:00, dtype: int64
If you need to select by dates, first convert the datetimes to dates with DatetimeIndex.date:
main_df.index = main_df.index.date
print (main_df.index)
Index([2021-05-11, 2021-05-12, 2021-05-13], dtype='object')
print (main_df.loc[datetime.date(2021, 5, 12)])
a 1
Name: 2021-05-12, dtype: int64
If you pass a string, pandas parses it into a datetime before the lookup, so selection on a DatetimeIndex works as expected.
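Another option, if the index should stay a DatetimeIndex, is to convert the date object into a pd.Timestamp just before the lookup. This is a small sketch rather than part of the original answer; the names day and row are only illustrative:

```python
import datetime

import pandas as pd

main_df = pd.DataFrame(
    {"a": range(3)},
    index=pd.to_datetime(["2021-05-11", "2021-05-12", "2021-05-13"]),
)

day = datetime.date(2021, 5, 12)

# pd.Timestamp accepts a datetime.date and yields midnight of that day,
# which matches the entries stored in the DatetimeIndex.
row = main_df.loc[pd.Timestamp(day)]
print(row)
```

This leaves the index untouched, so string lookups and other datetime functionality keep working.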

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates, unfortunately my import (using read_excel) brought in format of dates as datetime and also excel dates as integers.
What I am seeking is a column with dates only in format %Y-%m-%d
From research, excel starts at 1900-01-00, so I could add these integers. I have tried to use str.extract and a regex in order to separate the columns into two, one of datetimes, the other as integers. However the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
Attempt to first separate the columns by extracting the integers( dates imported from MS excel)
df.date_from.str.extract(r'(\d\d\d\d\d)')
however this gives NaN.
The reason I have tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time,'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert the values to timedeltas with to_timedelta and errors='coerce', so that non-integers become NaT, and add the Timestamp called d. Then convert the datetimes with to_datetime and errors='coerce', and finally combine both results with Series.fillna in a custom function:
def f(x):
    #https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)
cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
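For comparison, the same idea can be written element by element, which may be easier to follow than the coerce-and-fill version. This is only a sketch: the helper name excel_mixed_to_datetime and the constant EXCEL_EPOCH are made up here, and it assumes every cell is either a number (an Excel serial day) or a Timestamp, with no missing values:

```python
import numbers

import pandas as pd

# Day zero used by the answer above for Excel serial dates.
EXCEL_EPOCH = pd.Timestamp(1899, 12, 30)


def excel_mixed_to_datetime(col):
    """Convert a column mixing Excel serial days and Timestamps (hypothetical helper)."""
    converted = []
    for v in col:
        if isinstance(v, numbers.Real):
            # Numeric cell: treat it as a day count from the Excel epoch.
            converted.append(EXCEL_EPOCH + pd.Timedelta(days=float(v)))
        else:
            # Already a datetime-like cell: normalise to a Timestamp.
            converted.append(pd.Timestamp(v))
    return pd.Series(converted, index=col.index)


df = pd.DataFrame({
    "date_from": [pd.Timestamp("2022-09-10"), 44476, pd.Timestamp("2021-02-16")],
    "date_to": [pd.Timestamp("2022-12-11"), 44455, pd.Timestamp("2021-12-16")],
})

result = df.apply(excel_mixed_to_datetime)
print(result)
```

The loop is slower than the vectorised function on large columns, but it makes the type check explicit.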

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import numpy as np
import pandas as pd
import datetime as dt
df = pd.DataFrame({
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    # note: this value is overwritten by the later 'g' key below
    'g':{1002,300,1002,300,1002,300},
    'c':np.random.choice( ['panda','python','shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile( range(2), 3 ),
    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
                          freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with pd.DatetimeIndex index to group data with specified frequency using the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the date column as the index, which converts it to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns. The following example uses the category column.
The freq='M' parameter groups the index by month. There are a number of time series frequency alias strings that can be used in pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
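pd.Grouper also takes a key argument, so the date column does not have to be moved into the index first. The sketch below is an assumption-laden variant of the example above: it builds the frame row-wise and uses freq='MS' (month start), so the group labels are the first day of each month rather than the last. Recent pandas releases (2.2 and later) warn about the 'M' alias and prefer 'ME' for month end, which is why a different alias is used here.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01", "2020-07-27"]
    ),
    "category": ["a", "a", "b", "a", "a"],
    "value": [1, 2, 6, 1, 3],
})

# Group by calendar month of the 'date' column and by 'category'.
# key= points the Grouper at a column instead of the index.
monthly = df.groupby([pd.Grouper(key="date", freq="MS"), "category"])["value"].sum()
print(monthly)
```

The sums match the result shown earlier; only the date labels differ because of the month-start alias.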

Converting multi-index dataframe to Xarray dataset either loses annual sequence or gives an error

Firstly - apologies but I am unable to reproduce this error using code. I will try and describe it as best as possible using screenshots of the data and errors.
I've got a large dataframe indexed by 'Year' and 'Season' with values for latitude, longitude, and Rainfall with some others which looks like this:
This is organised to respect the annual sequence of 'Winter', 'Spring', 'Summer', 'Autumn' (numbers 1:4 in Season column) - and I need to keep this sequence after conversion to an Xarray Dataset too. But if I try and convert straight to Dataset:
future = future.to_xarray()
I get the following error:
So it is clear I need to reindex by unique identifiers, I tried using just lat and lon but this gives the same error (as there are duplicates). Resetting the index then reindexing then using lat, lon and time
like so:
future = future.reset_index()
future.head()
future.set_index(['latitude', 'longitude', 'time'], inplace=True)
future.head()
allows for the
future = future.to_xarray()
code to work:
The problem is that this has now lost its annual sequencing, you can see from the Season variable in the dataset that it starts at '1' '1' '1' for the first 3 months of the year but then jumps to '3','3','3' meaning we're going from winter to summer and skipping spring.
This is only the case after re-indexing the dataframe, but I can't convert it to a Dataset without re-indexing, and I can't seem to re-index without disrupting the annual sequence. Is there some way to fix this?
I hope this is clear and the error is illustrated enough for someone to be able to help!
EDIT:
I think the issue here is when it indexes by date it automatically orders the dates chronologically (e.g. 1952 follows 1951 etc), but I don't want this, I want it to maintain the sequence in the initial dataframe (which is organised seasonally, but it could have a spring from 1955 followed by a summer from 2000 followed by an autumn from 1976) - I need to retain this sequence.
EDIT 2:
So the dataset looks like this when I set 'Year' as the index, or just keep the index generic but I need the tg variable to have lat/lon associated with it so the dataset looks like this:
<xarray.Dataset>
Dimensions: (Year: 190080)
Coordinates:
* Year (Year) int64 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
Data variables:
Season (Year) object '1' '1' '2' '2' '2' '3' '3' '3' '4' '4' '4' '1' ...
latitude (Year) float64 51.12 51.12 51.12 51.12 51.12 51.12 51.12 ...
longitude (Year) float64 -10.88 -10.88 -10.88 -10.88 -10.88 -10.88 ...
seasdif (Year) float32 -0.79192877 -0.79192877 -0.55932236 ...
tg (Year, latitude, longitude) float32 nan nan nan nan nan nan nan nan nan nan nan ...
time (Year) datetime64[ns] 1970-01-31 1970-02-28 1970-03-31 ...
Tell me if this works for you. I have added an extra index column and use it to sort in the end.
import pandas as pd
import xarray as xr
import numpy as np
df = pd.DataFrame({'Year':[1951,1951,1951,1951],'Season':[1,1,1,3],
                   'lat':[51,51,51,51],'long':[10.8,10.8,10.6,10.6],
                   'time':['1950-12-31','1951-01-31','1951-02-28','1950-12-31']})
I turned the index into a separate column, 'Order', and then used it together with the coordinates in set_index. This is because I could only sort by an index or a 1-D column, and we have three coordinates.
df.reset_index(level=0, inplace=True)
df = df.rename(columns={'index': 'Order'})
df['time'] = pd.to_datetime(df['time'])
df.set_index(['lat', 'long', 'time','Order'], inplace=True)
df.head()
df = df.to_xarray()
This should preserve the order and keep lat, lon and time associated with tg (I don't have tg in my df, though).
df2 = df.sortby('Order')
You could also drop the 'Order' column, though I am not sure whether that will alter your order. (It does not alter mine.)
df2.drop('Order')

Understanding resampling of datetime in pandas

I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
['2005-01-27', 40], ['2005-01-28', 50]],
columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(), ['num'] and sum() do here.
QUESTION #1
Why does the following happen?
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.
Answer1
As the docs say, .resample returns a Resampler object. Hence you get DatetimeIndexResampler, because date holds datetimes.
Now, you get <pandas.core.groupby.SeriesGroupBy because you are selecting a Series from the dataframe through the Resampler object.
Oh by the way,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
Would return
<pandas.core.groupby.SeriesGroupBy as well.
Now when you do .sum(), you get the sum over the requested axis. You get a Series because you are summing over a pandas.core.series.Series.
Answer2
You can achieve the same result using groupby with the help of Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64
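To check that the two routes really agree, both can be run side by side on the question's data. This is a small sketch; the names via_resample and via_groupby are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [["2005-01-20", 10], ["2005-01-21", 20], ["2005-01-27", 40], ["2005-01-28", 50]],
    columns=["date", "num"],
)
df["date"] = pd.to_datetime(df["date"])

# Route 1: resample on the 'date' column; 'W' means weeks ending on Sunday.
via_resample = df.resample("W", on="date")["num"].sum()

# Route 2: groupby with a Grouper that uses the same weekly frequency.
via_groupby = df.groupby(pd.Grouper(key="date", freq="W-SUN"))["num"].sum()

print(via_resample)
print(via_groupby)

# Same bin labels and the same sums from both routes.
print(list(via_resample.index) == list(via_groupby.index))
print(via_resample.tolist() == via_groupby.tolist())
```

Both bins are labelled with the Sunday that closes the week, which is why 2005-01-20 and 2005-01-21 land under 2005-01-23.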