I have a DataFrame (df2), 8784 x 13, with a "DATE" column in yyyy-mm-dd format and a "TIME" column in hours, as shown below, and I need to calculate daily and monthly averages for the year 2016:
DATE TIME BAFFIN BAY GATUN II GATUN I KLONDIKE IIIG \
8759 2016-01-01 0000 8.112838 3.949518 3.291540 7.629178
8760 2016-01-01 0100 7.977169 4.028678 3.097562 7.477159
KLONDIKE II LAGOA II LAGOA I PENASCAL II PENASCAL I SABINA \
8759 7.095450 NaN NaN 8.250527 8.911508 3.835205
8760 7.362562 NaN NaN 7.877099 7.858908 3.766714
SIERRA QUEMADA
8759 3.405049
8760 4.386598
I have tried converting the 'DATE' column to datetime so I can use groupby, but I'm not sure how to do this. I have tried the following, but it is not grouping my data as expected for daily or monthly averages when I test the calculation in Excel:
davg_df2 = df2.groupby(by=df2['DATE'].dt.date).mean()
davg_df2m = df2.groupby(by=df2['DATE'].dt.month).mean()
Thank you; I'm still learning Python and trying to understand working with dates and different data types!
Try this:
df2['DATE'] = pd.to_datetime(df2['DATE'], format='%Y-%m-%d')
# monthly
davg_df2m = df2.groupby(pd.Grouper(freq='M', key='DATE')).mean()
# daily
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DATE')).mean()
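One caveat, assuming the TIME column still holds strings like '0000': recent pandas versions raise an error when mean() hits non-numeric columns, so it is safer to restrict the aggregation explicitly (newer pandas also prefers the 'ME' alias over 'M' for month end). A minimal sketch:
# numeric_only skips the non-numeric TIME column
davg_df2m = df2.groupby(pd.Grouper(freq='M', key='DATE')).mean(numeric_only=True)
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DATE')).mean(numeric_only=True)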
# first convert the DATE column to datetime data type:
df2['DATE'] = pd.to_datetime(df2['DATE'])
# create new columns for month and day using the .dt accessor:
df2['month'] = df2['DATE'].dt.month
df2['day'] = df2['DATE'].dt.day
# then group by day and month and get the mean like so:
davg_df2m = df2.groupby('month').mean()
davg_df2 = df2.groupby('day').mean()
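Note that grouping by 'day' alone pools the same day-of-month across all twelve months (at most 31 rows), which is probably not what you want for per-date averages over 2016. A minimal sketch for true daily means, using the columns created above:
davg_df2 = df2.groupby(['month', 'day']).mean()
# or group directly on the calendar date (366 rows for 2016):
davg_df2 = df2.groupby(df2['DATE'].dt.date).mean()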
I have imported a time series that I resampled to monthly time steps; however, I would like to select only the March, April, and May months (months 3, 4, and 5) of every year.
Unfortunately this is not exactly reproducible data, since it comes from a particular text file, but is there a way to isolate just months 3, 4, and 5 of this time series?
import numpy as np
import pandas as pd
from pandas import Series

# loading textfile
mjo = np.loadtxt('.../omi.1x.txt')
# setting up dates
dates = pd.date_range('1979-01', periods=mjo.shape[0], freq='D')
# resampling one of the columns to monthly data
MJO_amp = Series(mjo[:, 6], index=dates)
MJO_amp_month = MJO_amp.resample("M").mean()[:-27]  # match to precipitation time series (ends Feb 2019)
MJO_amp_month_normed = (MJO_amp_month - MJO_amp_month.mean()) / MJO_amp_month.std()
MJO_amp_month_normed
1979-01-31 0.032398
1979-02-28 -0.718921
1979-03-31 0.999467
1979-04-30 -0.790618
1979-05-31 1.113730
...
2018-10-31 0.198834
2018-11-30 0.221942
2018-12-31 1.804934
2019-01-31 1.359485
2019-02-28 1.076308
Freq: M, Length: 482, dtype: float64
print(MJO_amp_month_normed['2018-10'])
2018-10-31 0.198834
Freq: M, dtype: float64
I was thinking something along the lines of this:
def is_amj(month):
    return (month >= 4) & (month <= 6)

seasonal_data = MJO_amp_month_normed.sel(time=is_amj(MJO_amp_month_normed))
but I think my issue is that the text file isn't exactly in pandas format and doesn't have column titles...
You can use the month attribute of pd.DatetimeIndex with isin like this:
df[df.index.month.isin([3,4,5])]
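Applied to the monthly series from the question (both names taken from your code), that would be:
# keep only March, April and May of every year
seasonal_data = MJO_amp_month_normed[MJO_amp_month_normed.index.month.isin([3, 4, 5])]
This works because resample("M") leaves the series with a DatetimeIndex, so no column titles are needed.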
I have the price history for Google stock in a Google Colab doc, like this:
import pandas_datareader.data as web
df = web.DataReader('GOOG', data_source='yahoo', start='08-01-2004')
These are open, high, low, close and adjusted close prices for each trading day in the price history. I can create a new column in the DataFrame for the rate of return of the stock over the trailing 12 months like this:
DAYS_TRADING_PER_YEAR = 252  # assumed value: roughly 252 trading days per year
df['Trailing 12 month return'] = (
    (df['Adj Close'] - df['Adj Close'].shift(DAYS_TRADING_PER_YEAR))
    / df['Adj Close'].shift(DAYS_TRADING_PER_YEAR)
)
But what if what I actually want is one rate-of-return value per year, looking at the return over the previous calendar year? So, for 2015, just find the first trading day (more correctly, the first day for which we have data) in 2014 and the last trading day in 2014, and get the percentage change over that period?
Assuming Date is a proper datetime column:
- groupby(df.Date.dt.year) to group by year
- apply() the yearly rates computed from first_valid_index() and last_valid_index()
- shift() the results to get the previous year
rates = df.groupby(df.Date.dt.year)['Adj Close'].apply(
    lambda g: (g.loc[g.last_valid_index()] - g.loc[g.first_valid_index()]) / g.loc[g.first_valid_index()]
).shift()
# Date
# 2014 NaN
# 2015 0.128019
# 2016 2.232232
# 2017 1.041269
# 2018 0.292042
# 2019 0.154558
# 2020 -0.136102
# 2021 0.396961
# Name: Adj Close, dtype: float64
Then map() these rates with df.Date.dt.year to create the new column:
df['Previous year rate of return'] = df.Date.dt.year.map(rates)
# Date Adj Close Previous year rate of return
# 0 2014-08-01 166.724074 NaN
# 1 2014-08-02 69.634211 NaN
# ... ... ... ...
# 999 2017-04-26 165.225121 1.041269
# 1000 2017-04-27 40.165297 1.041269
# ... ... ... ...
# 2433 2021-03-30 67.864861 0.396961
# 2434 2021-03-31 31.408317 0.396961
I have the last 5 years of monthly data. I am using it to create a forecasting model with fbprophet. The last 5 months of my data are as follows:
data1['ds'].tail()
Out[86]: 55 2019-01-08
56 2019-01-09
57 2019-01-10
58 2019-01-11
59 2019-01-12
I have created the model on this and made a future prediction dataframe.
model = Prophet(
interval_width=0.80,
growth='linear',
daily_seasonality=False,
weekly_seasonality=False,
yearly_seasonality=True,
seasonality_mode='additive'
)
# fit the model to data
model.fit(data1)
future_data = model.make_future_dataframe(periods=4, freq='m', include_history=True)
After December 2019, I need the first four months of the next year. But it's adding the next 4 months within the same year, 2019.
future_data.tail()
ds
59 2019-01-12
60 2019-01-31
61 2019-02-28
62 2019-03-31
63 2019-04-30
How do I get the next year's first 4 months in the future dataframe? Is there a specific parameter to adjust the year?
The issue is the date format: 2019-01-12 (December 2019, as per your question) is in the format "%Y-%d-%m", so it is parsed as January 12, 2019.
Hence, Prophet creates data with month-end frequency (specified by 'm') for the next 4 periods starting from January 2019.
Just for reference, this is how the future dataframe is created by Prophet:
dates = pd.date_range(
    start=last_date,
    periods=periods + 1,  # An extra in case we include start
    freq=freq)
dates = dates[dates > last_date]  # Drop start if equals last_date
dates = dates[:periods]  # Return correct number of periods
Because pandas infers the date format when parsing, the extrapolation continues from the mis-parsed last date.
Solution: change the date format in the training data to "%Y-%m-%d".
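A minimal sketch of that fix, assuming data1['ds'] currently holds strings in year-day-month order (as the tail above suggests): parse them explicitly before fitting, so the future dataframe continues from December 2019:
# assumed input format '%Y-%d-%m', e.g. '2019-01-12' -> December 1, 2019
data1['ds'] = pd.to_datetime(data1['ds'], format='%Y-%d-%m')
model.fit(data1)
future_data = model.make_future_dataframe(periods=4, freq='M', include_history=True)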
Stumbled here while searching for the appropriate frequency string for minutes.
As per the docs, the datestamp needs to be in YYYY-MM-DD format:
The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric, and represents the measurement we wish to forecast.
2019-01-12 in YYYY-MM-DD is 2019-12-01; using this:
>>> dates = pd.date_range(start='2019-12-01',periods=4 + 1,freq='M')
>>> dates
DatetimeIndex(['2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31',
'2020-04-30'],
dtype='datetime64[ns]', freq='M')
Other frequency strings are listed here; they are not given explicitly for Python in the Prophet docs:
https://pandas.pydata.org/docs/reference/api/pandas.tseries.frequencies.to_offset.html
>>> dates = pd.date_range(start='2022-03-17 11:40:00', periods=10 + 1, freq='min')
>>> dates
DatetimeIndex(['2022-03-17 11:40:00', '2022-03-17 11:41:00',
'2022-03-17 11:42:00', '2022-03-17 11:43:00',
..],
dtype='datetime64[ns]', freq='T')
Please excuse obvious errors - still in the learning process.
I am trying to do a simple time series plot of my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data every hour and including only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for daily and monthly means.
This would be relatively simple if only this error did not crop up: "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
#Original dataframe
Date value
0 1/1/2016 0:00 405.22
1 1/1/2016 0:15 418.56
Date object
value object
dtype: object
# Conversion of the 'Date' column to datetime and the 'value' column to numeric/float values.
df.Date = pd.to_datetime(df.Date, errors='coerce')
df['Year'] = df['Date'].map(lambda x: x.year)
df.value = pd.to_numeric(df.value, errors='coerce')
Date datetime64[ns]
value float64
Year int64
dtype: object
Date value Year
0 2016-01-01 00:00:00 405.22 2016
1 2016-01-01 00:15:00 418.56 2016
df=df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count()>=2
(this is the line that raises the error)
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df = df.set_index('Date') ('Date' is a string). Also, I would move the conversions into the constructor, if possible, after you get it working.
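A minimal sketch of how the hourly step could then look, with names taken from your snippet and keeping the count() >= 2 condition from your code:
df = df.set_index('Date')
hourly = df['value'].resample('h')
# hourly means, kept only where the hour has at least 2 of the 15-minute observations
diurnal1 = hourly.mean()[hourly.count() >= 2]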
I have a file containing dates from June 2015 + 365 days. I am using this as a lookup table because there are custom business dates (only certain holidays are observed, and there are some no-work dates for internal reasons). Using custom business date offsets was just too slow with 3.5 million records.
initial_date | day_1 | day_2 | day_3 | ... | day_365
2015-06-01 2015-06-02 2015-06-03 2015-06-04
2015-06-02 2015-06-03 2015-06-04 2015-06-05
The idea is to 'tag' each row in the data based on the number of (custom) business dates since specific entries. Is there a better way to do this?
For example, if a new entry happens on 2016-06-28 then this is labeled as 'initial_date'. day_1 is tomorrow, day_2 is the next day, day_3 is Friday, and day_4 is next Monday.
My naive way of doing this is to create a loop which basically does this:
df.day_1_label = np.where(df.date == df.day_1, 'Day 1', '')
df.day_2_label = np.where(df.date == df.day_2, 'Day 2', '')
df.day_label = (df.day_1_label + df.day_2_label + ...).replace('', regex=True, inplace=True)
This would result in one label per row which I could then aggregate or plot. Eventually this would be used for forecasting. initial_date + the subset of customers from previous dates = total customers
Also, depending on what events happen, a subset of the customers would have a certain event occur in the future. I want to know on what business date from the initial_date this happens on, on average. So if we have 100 customers today, a certain percent will have an event next July 15th or whatever it might be.
Edit- pastebin with some sample data:
http://pastebin.com/ZuE1Q2KJ
So the output I am looking for is the day_label. This basically checks whether date equals each date from day_0 to day_n. Is there a better way to fill that in? I am just trying to match the date of each row with a value in one of the day_ columns.
I think this can be done more efficiently, if your data looks the way I think it looks.
Say you have a calendar of dates (June 2015 + 365 days) and a data frame as for example:
cal = ['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-08',
'2015-06-09']
df = pd.DataFrame({'date': ['2015-06-04', '2015-06-09', '2015-06-09'],
                   'initial_date': ['2015-06-02', '2015-06-02', '2015-06-01']})
Keeping dates in numpy.datetime types is more efficient, so I convert:
df['date'] = df['date'].astype(np.datetime64)
df['initial_date'] = df['initial_date'].astype(np.datetime64)
Now, let's turn cal into a fast lookup table:
# converting types:
cal = np.asarray(cal, dtype=np.datetime64)
# lookup series
s_cal = pd.Series(range(len(cal)), index=cal)
# a convenience lookup function
def lookup(dates):
    return s_cal[dates].reset_index(drop=True)
EDIT:
The above lookup returns a series with a RangeIndex, which may be different from the index of df (this causes problems when assigning the result to a column). So it's probably better to rewrite this lookup so that it returns a plain numpy array:
def lookup2(dates):
    return s_cal[dates].values
It's possible to set the correct index in lookup (taken from the input dates). But such a solution would be less flexible (would require that the input be a Series) or unnecessarily complicated.
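One more caveat: s_cal[dates] raises a KeyError if any date is missing from cal (e.g. a weekend or an unobserved holiday). If your data can contain such dates, a variant built on reindex (my addition, not required by the approach above) returns NaN instead of failing:
def lookup3(dates):
    # NaN (instead of KeyError) for dates not present in the custom calendar
    return s_cal.reindex(dates).values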
The series:
s_cal
Out[220]:
2015-06-01 0
2015-06-02 1
2015-06-03 2
2015-06-04 3
2015-06-08 4
2015-06-09 5
dtype: int64
and how it works:
lookup(df.date)
Out[221]:
0 3
1 5
2 5
dtype: int64
lookup2(df.date)
Out[36]: array([3, 5, 5])
The rest is straightforward. This adds a column of integers (day differences) to the data frame:
df['day_label'] = lookup2(df.date) - lookup2(df.initial_date)
and if you want to convert it to labels:
df['day_label'] = 'day ' + df['day_label'].astype(str)
df
Out[225]:
date initial_date day_label
0 2015-06-04 2015-06-02 day 2
1 2015-06-09 2015-06-02 day 4
2 2015-06-09 2015-06-01 day 5
Hope this is what you wanted.