Please excuse obvious errors - still in the learning process.
I am trying to do a simple time-series plot of my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data every hour - including only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for the daily and monthly means.
This would be relatively simple if this error did not crop up: "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
#Original dataframe
Date value
0 1/1/2016 0:00 405.22
1 1/1/2016 0:15 418.56
Date object
value object
dtype: object
#Conversion of 'Date' to datetime and of the 'value' column to numeric/float values.
df.Date = pd.to_datetime(df.Date,errors='coerce')
df['Year'] = df['Date'].dt.year
df.value = pd.to_numeric(df.value,errors='coerce' )
Date datetime64[ns]
value float64
Year int64
dtype: object
Date value Year
0 2016-01-01 00:00:00 405.22 2016
1 2016-01-01 00:15:00 418.56 2016
df=df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count()>=2
**(line of error)**
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df=df.set_index('Date') ('Date' is a string column name, not a variable). Also, after you get it working, I would move the conversions into the constructor if possible.
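A minimal sketch of that fix, with the column names and the "at least 1 observation per hour" condition taken from the question (the sample data here is made up to be runnable):

```python
import pandas as pd

# Hypothetical data shaped like the question's 15-minute series.
df = pd.DataFrame({
    'Date': ['1/1/2016 0:00', '1/1/2016 0:15', '1/1/2016 1:00'],
    'value': ['405.22', '418.56', '400.00'],
})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')

# set_index takes the column *name* as a string, not the column itself.
df = df.set_index('Date')

# Hourly means, keeping only the hours with at least one observation.
counts = df['value'].resample('H').count()
hourly = df['value'].resample('H').mean()[counts >= 1]
```

The same count-filter pattern then repeats for the daily and monthly steps.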
I have imported a time series that I resampled to monthly time steps, however I would like to select all the years with only March, April, and May months (months 3,4, and 5).
Unfortunately this is not exactly reproducible data since it's a particular text file, but is there a way to just isolate all months 3, 4, and 5 of this time series?
# loading textfile
mjo = np.loadtxt('.../omi.1x.txt')
# setting up dates
dates = pd.date_range('1979-01', periods=mjo.shape[0], freq='D')
#resampling one of the columns to monthly data
MJO_amp = Series(mjo[:,6], index=dates)
MJO_amp_month = MJO_amp.resample("M").mean()[:-27] #match to precipitation time series (ends feb 2019)
MJO_amp_month_normed = (MJO_amp_month - MJO_amp_month.mean())/MJO_amp_month.std()
MJO_amp_month_normed
1979-01-31 0.032398
1979-02-28 -0.718921
1979-03-31 0.999467
1979-04-30 -0.790618
1979-05-31 1.113730
...
2018-10-31 0.198834
2018-11-30 0.221942
2018-12-31 1.804934
2019-01-31 1.359485
2019-02-28 1.076308
Freq: M, Length: 482, dtype: float64
print(MJO_amp_month_normed['2018-10'])
2018-10-31 0.198834
Freq: M, dtype: float64
I was thinking something along the lines of this:
def is_amj(month):
    return (month >= 4) & (month <= 6)
seasonal_data = MJO_amp_month_normed.sel(time=is_amj(MJO_amp_month_normed))
but I think my issue is the textfile isn't exactly in pandas format and doesn't have column titles...
You can use the month attribute of pd.DatetimeIndex with isin like this:
df[df.index.month.isin([3,4,5])]
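For instance, with a small hypothetical series standing in for MJO_amp_month_normed (one value per month of one year, made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for MJO_amp_month_normed: one value per month of 2018.
idx = pd.date_range('2018-01-01', periods=12, freq='MS')
s = pd.Series(np.arange(12.0), index=idx)

# Keep only March, April and May of every year.
mam = s[s.index.month.isin([3, 4, 5])]
```

This works on the index directly, so no column titles are needed.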
So I would like to make a column named 'Date'.
The first entry would be today's date, i.e. 23/07/2019,
and the following row should be the date + 1, i.e. 24/07/2019, and so on.
This is easily done in Excel, but I tried this simple thing in pandas and I just can't figure it out!
I already have a dataframe called df,
so putting down today's date is relatively simple:
df.Date = pd.datetime.now().date()
But I'm not sure which function would get me the date + 1 in the following rows.
Thanks
pd.date_range can use 'today' to set the dates. Normalize then create the Series yourself, otherwise pandas thinks the DatetimeIndex should be the Index too.
import pandas as pd
pd.Series(pd.date_range('today', periods=30, freq='D').normalize(),
          name='Date')
0 2019-07-23
1 2019-07-24
...
28 2019-08-20
29 2019-08-21
Name: Date, dtype: datetime64[ns]
If adding a new column to the DataFrame:
df['Date'] = pd.date_range('today', periods=len(df), freq='D').normalize()
pd.date_range is what you are looking for. To build a series of 31 days starting from today:
today = pd.Timestamp.now().normalize()
s = pd.date_range(today, today + pd.Timedelta(days=30), freq='D').to_series()
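One caveat worth noting: to_series keeps the dates as the index as well as the values, so when attaching the result to an existing DataFrame you may want to drop that index first. A small sketch, with the start date pinned instead of using now() so it is reproducible:

```python
import pandas as pd

today = pd.Timestamp('2019-07-23')  # pinned instead of now() for a reproducible example
s = pd.date_range(today, today + pd.Timedelta(days=30), freq='D').to_series()

df = pd.DataFrame({'id': range(31)})
df['Date'] = s.reset_index(drop=True)  # align on df's RangeIndex, not on the dates
```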
I am stuck with this problem. Although I found some similar questions, I could not manage to apply the solutions to my case.
I have a small series containing the start and the end date of an experimental deployment. My goal is to get the start of the week (Monday, 00h 00min) in which the deployment was started, and the same for the last week.
This is my series:
Input
print(df_startend)
Output
Camera_Deployment_Start 2015-09-28 11:00:00
Camera_Deployment_End 2017-12-25 16:40:00
dtype: datetime64[ns]
I thought I could first get the week number and then go back to a datetime object, which would represent the very start of the week. So I did this:
df_startend=df_startend.apply(lambda x: x.isocalendar())
Input
print(df_startend)
Output
Camera_Deployment_Start (2015, 40, 1)
Camera_Deployment_End (2017, 52, 1)
dtype: object
It is worth saying that I can ignore the third element of the tuple (tuple[2]). In this example both happen to be 1 - the first day of the week - but that may not be the case with other data samples.
And from here on I cannot manage.
My ultimate goal is to generate all the start days of all the weeks in between. Probably using something like:
ws=pd.date_range(start=,end=,freq='W')
Your attention is very appreciated, thank you very much!
If it is only a 2-element Series, first subtract the days extracted with dayofweek, then use floor to remove the times, and then build a date_range with the W-MON offset:
print (df_startend)
Camera_Deployment_Start 2015-09-28 11:00:00
Camera_Deployment_End 2015-12-25 16:40:00
dtype: datetime64[ns]
s = (df_startend - pd.to_timedelta(df_startend.dt.dayofweek, unit='d')).dt.floor('d')
ws = pd.date_range(start=s['Camera_Deployment_Start'],
                   end=s['Camera_Deployment_End'],
                   freq='W-MON')
print (ws)
DatetimeIndex(['2015-09-28', '2015-10-05', '2015-10-12', '2015-10-19',
'2015-10-26', '2015-11-02', '2015-11-09', '2015-11-16',
'2015-11-23', '2015-11-30', '2015-12-07', '2015-12-14',
'2015-12-21'],
dtype='datetime64[ns]', freq='W-MON')
Detail:
print (s)
Camera_Deployment_Start 2015-09-28
Camera_Deployment_End 2015-12-21
dtype: datetime64[ns]
Solution with isocalendar:
s = df_startend.apply(lambda x: '-'.join(str(y) for y in x.isocalendar()[:2]))
s = pd.to_datetime(s + '-1', format='%Y-%W-%w') - pd.Timedelta(7, 'd')
print (s)
Camera_Deployment_Start 2015-09-28
Camera_Deployment_End 2015-12-21
dtype: datetime64[ns]
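As an aside (an alternative, not part of the answer above): the same Monday floor can also be obtained from weekly Periods, since a 'W' period runs Monday through Sunday, so its start_time is that week's Monday at midnight:

```python
import pandas as pd

df_startend = pd.Series(
    pd.to_datetime(['2015-09-28 11:00:00', '2015-12-25 16:40:00']),
    index=['Camera_Deployment_Start', 'Camera_Deployment_End'])

# 'W' periods run Monday..Sunday, so start_time is that week's Monday at midnight.
s = df_startend.dt.to_period('W').dt.start_time
```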
I have a file containing dates from June 2015 + 365 days. I am using this as a lookup table because there are custom business dates (only certain holidays are observed, and there are some no-work dates for internal reasons). Using custom business date offsets was just too slow with 3.5 million records.
initial_date | day_1 | day_2 | day_3 | ... | day_365
2015-06-01 2015-06-02 2015-06-03 2015-06-04
2015-06-02 2015-06-03 2015-06-04 2015-06-05
The idea is to 'tag' each row in the data based on the number of (custom) business dates since specific entries. Is there a better way to do this?
For example, if a new entry happens on 2016-06-28 then this is labeled as 'initial_date'. day_1 is tomorrow, day_2 is the next day, day_3 is Friday, and day_4 is next Monday.
My naive way of doing this is to create a loop which basically does this:
df['day_1_label'] = np.where(df.date == df.day_1, 'Day 1', '')
df['day_2_label'] = np.where(df.date == df.day_2, 'Day 2', '')
df['day_label'] = df['day_1_label'] + df['day_2_label'] + ...
This would result in one label per row which I could then aggregate or plot. Eventually this would be used for forecasting. initial_date + the subset of customers from previous dates = total customers
Also, depending on what events happen, a subset of the customers would have a certain event occur in the future. I want to know on what business date from the initial_date this happens on, on average. So if we have 100 customers today, a certain percent will have an event next July 15th or whatever it might be.
Edit- pastebin with some sample data:
http://pastebin.com/ZuE1Q2KJ
So the output I am looking for is the day_label. This basically checks if the date == each date from day_0 - day_n. Is there a better way to fill that in? I am just trying to match the date of each row with a value in one of the day_ columns.
I think this can be made more efficiently if your data looks as I think it looks.
Say you have a calendar of dates (June 2015 + 365 days) and a data frame as for example:
cal = ['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-08',
'2015-06-09']
df = pd.DataFrame({'date': ['2015-06-04', '2015-06-09', '2015-06-09'],
'initial_date': ['2015-06-02', '2015-06-02', '2015-06-01']})
Keeping dates in numpy.datetime types is more efficient, so I convert:
df['date'] = df['date'].astype(np.datetime64)
df['initial_date'] = df['initial_date'].astype(np.datetime64)
Now, let's turn cal into a fast lookup table:
# converting types:
cal = np.asarray(cal, dtype=np.datetime64)
# lookup series
s_cal = pd.Series(range(len(cal)), index=cal)
# a convenience lookup function
def lookup(dates):
    return s_cal[dates].reset_index(drop=True)
EDIT:
The above lookup returns a series with a RangeIndex, which may be different from the index of df (this causes problems when assigning the result to a column). So it's probably better to rewrite this lookup so that it returns a plain numpy array:
def lookup2(dates):
    return s_cal[dates].values
It's possible to set the correct index in lookup (taken from the input dates). But such a solution would be less flexible (would require that the input be a Series) or unnecessarily complicated.
The series:
s_cal
Out[220]:
2015-06-01 0
2015-06-02 1
2015-06-03 2
2015-06-04 3
2015-06-08 4
2015-06-09 5
dtype: int64
and how it works:
lookup(df.date)
Out[221]:
0 3
1 5
2 5
dtype: int64
lookup2(df.date)
Out[36]: array([3, 5, 5])
The rest is straightforward. This adds a column of integers (day differences) to the data frame:
df['day_label'] = lookup2(df.date) - lookup2(df.initial_date)
and if you want to convert it to labels:
df['day_label'] = 'day ' + df['day_label'].astype(str)
df
Out[225]:
date initial_date day_label
0 2015-06-04 2015-06-02 day 2
1 2015-06-09 2015-06-02 day 4
2 2015-06-09 2015-06-01 day 5
Hope this is what you wanted.
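For what it's worth, since the calendar is sorted, np.searchsorted gives the same integer positions without building the lookup Series at all. A sketch on the same sample data (an alternative, not the approach above):

```python
import numpy as np
import pandas as pd

# Sorted calendar of valid business dates (same sample as above).
cal = pd.to_datetime(['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04',
                      '2015-06-08', '2015-06-09']).values
df = pd.DataFrame({'date': pd.to_datetime(['2015-06-04', '2015-06-09', '2015-06-09']),
                   'initial_date': pd.to_datetime(['2015-06-02', '2015-06-02', '2015-06-01'])})

# Position of each date in the sorted calendar; the difference is the
# number of custom business days elapsed.
diff = np.searchsorted(cal, df['date'].values) - np.searchsorted(cal, df['initial_date'].values)
df['day_label'] = 'day ' + pd.Series(diff, index=df.index).astype(str)
```

Note this assumes every date in df actually occurs in the calendar; searchsorted does not check membership.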
I am trying to find the mean for each day and put that value as an input for every hour in a year. The problem is that the last day only contains the first hour.
rng = pd.date_range('2011-01-01', '2011-12-31')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
days = ts.resample('D').mean()
day_mean_in_hours = days.asfreq('H', method='ffill')
day_mean_in_hours.tail(5)
2011-12-30 20:00:00 -0.606819
2011-12-30 21:00:00 -0.606819
2011-12-30 22:00:00 -0.606819
2011-12-30 23:00:00 -0.606819
2011-12-31 00:00:00 -2.086733
Freq: H, dtype: float64
Is there a nice way to change the frequency to hour and still get the full last day?
You could reindex the frame using an hourly DatetimeIndex that covers the last full day.
hourly_rng = pd.date_range('2011-01-01', '2012-01-01', freq='1H', closed='left')
day_mean_in_hours = day_mean_in_hours.reindex(hourly_rng, method='ffill')
See Resample a time series with the index of another time series for another example.
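Putting the pieces together, a runnable sketch (using an explicit end timestamp of the last hour instead of closed='left', which covers the same range):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2011-01-01', '2011-12-31')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

# Daily means, then forward-fill each day's mean into its 24 hours,
# including all of the last day.
days = ts.resample('D').mean()
hourly_rng = pd.date_range('2011-01-01', '2011-12-31 23:00', freq='H')
day_mean_in_hours = days.reindex(hourly_rng, method='ffill')
```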