Selecting data between two dates in Dataframe 'ValueError: Lengths must match to compare' - pandas

I want to select all the values that fall between two dates in my large DataFrame df_data. This works when I do it outside of a loop for a single day's worth of data:
df_data['datetime'] = pd.to_datetime(df_data['TimeStamp'])
twelveearlier = datetime.datetime(2017, 12, 23, 0, 0, 0)
twelvelater = datetime.datetime(2017, 12, 24, 0, 0, 0)
df = df_data[(df_data['datetime'] >= twelveearlier) &
             (df_data['datetime'] < twelvelater)]
But when I try and do this by looping through a list of dates below, I get ValueError: Lengths must match to compare.
event_name_list = ['noEvent_20161208174900', 'NoEvent_20161209174200', 'NoEvent20161211_061400']
for event in event_name_list:
    event_time = re.findall(r'\d+', event)
    event_timestamp = pd.to_datetime(event_time)
    twelvelater = event_timestamp + datetime.timedelta(hours=12)
    twelveearlier = event_timestamp - datetime.timedelta(hours=12)
    df = df_data[(df_data['datetime'] >= twelveearlier.values) &
                 (df_data['datetime'] < twelvelater.values)]
I think this is because twelveearlier and twelvelater are a different type in the loop version, since they come from event_timestamp - datetime.timedelta(hours=12), but converting them with to_datetime, to_pydatetime, etc. doesn't help. How do I get twelveearlier and twelvelater into the same format as df_data['datetime'] so that I can build df from only the rows between twelveearlier and twelvelater?
df_data['datetime']
3250592 2017-12-31 23:40:00
3250593 2017-12-31 23:50:00
Name: datetime, dtype: datetime64[ns]
print event_timestamp
DatetimeIndex(['2016-12-16 06:22:29'], dtype='datetime64[ns]', freq=None)
print twelveearlier
DatetimeIndex(['2016-12-08 05:49:00'], dtype='datetime64[ns]', freq=None)
print twelvelater
DatetimeIndex(['2016-12-09 05:49:00'], dtype='datetime64[ns]', freq=None)

You are comparing against an array of datetimes rather than a single value: twelvelater.values gives you a single-element array.
This means the condition tries to match the whole df_data['datetime'] column element by element against an array of a different length. Taking only the first element of each of these datetime arrays, e.g. twelvelater.values[0], should fix the problem with minimal code changes.
event_name_list = ['noEvent_20161208174900', 'NoEvent_20161209174200', 'NoEvent20161211_061400']
for event in event_name_list:
    event_time = re.findall(r'\d+', event)
    event_timestamp = pd.to_datetime(event_time)
    twelvelater = event_timestamp + datetime.timedelta(hours=12)
    twelveearlier = event_timestamp - datetime.timedelta(hours=12)
    df = df_data[(df_data['datetime'] >= twelveearlier.values[0]) &
                 (df_data['datetime'] < twelvelater.values[0])]

You are comparing a datetime column to a DatetimeIndex of length one. This is because re.findall returns a list of all the matches it finds, so pd.to_datetime gives back a DatetimeIndex rather than a single Timestamp. Try this:
event_name_list = pd.to_datetime([re.findall(r'\d+', x)[0] for x in event_name_list])
for event_timestamp in event_name_list:
    twelvelater = event_timestamp + datetime.timedelta(hours=12)
    twelveearlier = event_timestamp - datetime.timedelta(hours=12)
    df = df_data[(df_data['datetime'] >= twelveearlier) &
                 (df_data['datetime'] < twelvelater)]
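
Both answers come down to comparing the column against a single timestamp instead of an array. The following is a minimal sketch of that idea, not the original poster's code: the toy df_data and the parse_event_time helper are assumptions made for illustration. The helper joins every digit group before parsing, so a name such as 'NoEvent20161211_061400', where the underscore separates the date from the time, keeps its time component.

```python
import re
import pandas as pd

# Toy stand-in for the large df_data in the question (assumed data, 6-hourly rows).
df_data = pd.DataFrame({
    "datetime": pd.date_range("2016-12-08", periods=17, freq=pd.Timedelta(hours=6)),
})

def parse_event_time(name):
    """Hypothetical helper: join all digit groups and parse them as one timestamp."""
    digits = "".join(re.findall(r"\d+", name))
    return pd.to_datetime(digits, format="%Y%m%d%H%M%S")

event_name_list = ["noEvent_20161208174900", "NoEvent_20161209174200", "NoEvent20161211_061400"]

windows = {}
for event in event_name_list:
    event_timestamp = parse_event_time(event)  # a scalar pd.Timestamp
    twelveearlier = event_timestamp - pd.Timedelta(hours=12)
    twelvelater = event_timestamp + pd.Timedelta(hours=12)
    # Comparing a Series against a scalar broadcasts without any length check.
    mask = (df_data["datetime"] >= twelveearlier) & (df_data["datetime"] < twelvelater)
    windows[event] = df_data[mask]
```

Keeping the result for each event in a dictionary avoids overwriting df on every pass through the loop, which the snippets above would do.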

Related

Extract PI OSIsoft Monthly Interval in Python

I am trying to extract the sum of PI data from OSIsoft 10m (10-minute) data over one-month intervals using Python pandas. However, I get an error from either OSIsoft or Python when I set the interval notation to "M" or "1mo". Neither notation works without an error. I have a function that fetches the data for a given interval to plot and save, and it works for intervals such as "1d", "30d", "1w" and "1y", but I cannot get the sum of the data for each one-month interval. Is it a conflict between how Python describes a month, with "M", and OSIsoft, which requires "1mo"? Thank you. Here is my code:
def get_tag_history2(tagname, starttime, endtime, interval="10m"):
    # pull historical data
    tag = PIPoint.FindPIPoint(piServer, tagname)
    # name = tag.Name.lower()
    timerange = AFTimeRange(starttime, endtime)
    span = AFTimeSpan.Parse(interval)
    #summariesvalues
    summaries = tag.Summaries(timerange, span, AFSummaryTypes.Average, AFCalculationBasis.TimeWeighted, AFTimestampCalculation.Auto)
    recordedValuesDict = dict()
    for summary in summaries:
        for event in summary.Value:
            dt = datetime.strptime(
                event.Timestamp.LocalTime.ToString(),'%m/%d/%Y %I:%M:%S %p')
            recordedValuesDict[dt] = event.Value
    # turn dictionary into pd.DataFrame
    df = pd.DataFrame(
        recordedValuesDict.items(), columns=['TimeStamp', 'Value'])
    #Send it to a dateTime Index then set the index
    df['TimeStamp'] = pd.to_datetime(df['TimeStamp']) + pd.Timedelta(interval)
    df.set_index(['TimeStamp'], inplace=True)
    return df
if __name__ == '__main__':
    """
    Set inputs
    """
    pitags = ['JC1.WF.DOMINA.ProdEffective','HO1.WF.DOMINA.ProdEffective','BC1.WF.DOMINA.ProdEffective']
    start_time = '2020-01-01 00:00'
    end_time = '2022-01-01 00:00'
    interval = "M"
    """
    Run Script
    """
    connect_to_Server('PDXPI01')
    output = pd.DataFrame()
    for tag in pitags:
        values = get_tag_history2(
            tag, start_time, end_time, interval=interval)
        output[tag] = values['Value']
    for i, col in enumerate(output.columns):
        output[col].plot(fig=plt.figure(i))
        plt.title(col)
    plt.show()
The error when using interval = "1mo" is --- >
ValueError: invalid unit abbreviation: mo
The error when using interval = "M" is --- >
FormatException: The 'M' token in the string 'M' was not expected.
at OSIsoft.AF.Time.AFTimeSpan.FormatError(String input, Char token, Boolean throwErrors, AFTimeSpan& result)
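
One reading of the two errors: the ValueError for "1mo" is a pandas message, so it most likely comes from pd.Timedelta(interval) inside the function, while the FormatException for "M" comes from AFTimeSpan.Parse. The following is a minimal sketch under the assumption that the PI side accepts "1mo" and only the pandas timestamp shift needs a different type. The helper name to_pandas_offset and its "mo" suffix rule are hypothetical and not part of the OSIsoft or pandas APIs.

```python
import pandas as pd

def to_pandas_offset(interval):
    """Hypothetical helper: map an interval string such as '1mo' or '10m' to a pandas offset."""
    if interval.endswith("mo"):
        count = int(interval[:-2] or 1)
        return pd.DateOffset(months=count)  # calendar-aware month shift
    return pd.Timedelta(interval)           # fixed-length spans such as '10m' or '1d'

stamps = pd.Series(pd.to_datetime(["2020-01-01", "2020-02-01"]))
shifted_month = stamps + to_pandas_offset("1mo")
shifted_10min = stamps + to_pandas_offset("10m")
```

With a helper like this, the same interval string could still be passed unchanged to AFTimeSpan.Parse, and only the line that adds pd.Timedelta(interval) would use the translated offset.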

Date is not working even when date column is set to index

I have a dictionary of multiple dataframes where the index is set to 'Date', but I am having trouble capturing the specific day for a search.
Dictionary created as per link:
Call a report from a dictionary of dataframes
Then I tried to add the following column to create specific days for each row:
df_dict[k]['Day'] = pd.DatetimeIndex(df['Date']).day
It's not working. The idea is to extract only the day of the month (from 1 to 31) for each row, so that when I call the report it gives me the day of the month of that occurrence.
More details if needed.
Regards and thanks!
In the case of your code, there is no 'Date' column, because it's set as the index.
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
To extract the day from the index use the following code.
df_dict[k]['Day'] = df.index.day
Pulling the code from this question
# here you can see the Date column is set as the index
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
data_dict = dict() # create an empty dict here
for k, df in df_dict.items():
    df_dict[k]['Return %'] = df.iloc[:, 0].pct_change(-1)*100
    # create a day column; this may not be needed
    df_dict[k]['Day'] = df.index.day
    # aggregate the max and min of Return
    mm = df_dict[k]['Return %'].agg(['max', 'min'])
    # get the min and max day of the month
    date_max = df.Day[df['Return %'] == mm.max()].values[0]
    date_min = df.Day[df['Return %'] == mm.min()].values[0]
    # add it to the dict, with ticker as the key
    data_dict[k] = {'max': mm.max(), 'min': mm.min(), 'max_day': date_max, 'min_day': date_min}
# print(data_dict)
[out]:
{'aapl': {'max': 8.702843218147871,
'max_day': 2,
'min': -4.900700398891522,
'min_day': 20},
'msft': {'max': 6.603769278967109,
'max_day': 2,
'min': -4.084428935702855,
'min_day': 8}}
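
The point that 'Date' lives in the index rather than in a column can be checked on a toy frame. The data below is made up for illustration and is not the original poster's dataset.

```python
import pandas as pd

# Toy frame with 'Date' as the index, mirroring index_col='Date' in read_csv.
df = pd.DataFrame(
    {"Close": [10.0, 10.5, 9.8]},
    index=pd.DatetimeIndex(["2020-03-02", "2020-03-20", "2020-04-08"], name="Date"),
)

has_date_column = "Date" in df.columns  # False: 'Date' is the index name, not a column
df["Day"] = df.index.day                # day of the month, read from the index
```

Because 'Date' is not among the columns, df['Date'] raises a KeyError, which is why the DatetimeIndex has to be read through df.index.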

Add a column of minutes to a datetime in pandas

I have a dataframe with a start time and the length of an operation. I'm trying to figure out how to add the length (in minutes) to the start time in order to get the end time of the session. I've run a few different variations of the same general idea and keep getting the same error, "unsupported type for timedelta minutes component: Series". The code extract is below:
data= {'Name': ['John', 'Peter'],
'Start' : [2, 2],
'Length': [120, 90],
}
df = pd.DataFrame.from_records(data)
df['Start'] = pd.to_datetime(df['Start'])
df['Length'] = pd.to_datetime(df['Length'])
df["tdiffinmin"] = df['Start'].apply(lambda x: x + pd.DateOffset(minutes = df["Length"]))
I've also tried the following variations of the same calculation and keep getting similar errors.
df["tdiffinmin"] = df['Start'].apply(lambda x: x -pd.DateOffset(minutes = df["Length"]))
df["tdiffinmin"] = (df['Start']. + timedelta(minutes = df["Length"])).dt.total_seconds() / 60
df['tdiffinmin'] = df['Start'] - pd.DateOffset(minutes = df["Length"])
The full code reads from a data set (excel sheet or CSV), populates a Dataframe, and this is some of the math I am doing. Originally it was done with Start and Stop times, so I know something similar is possible. In the dataset, Length is in minutes and Start is a date and time, so datetime is necessary.
You should convert Length into timedelta, not datetime:
df['Start'] = pd.to_datetime(df['Start'])
df['Length'] = pd.to_timedelta(df['Length'], unit='min')
df['tdiffinmin'] = df['Start'] + df['Length']
Output:
     Length   Name                          Start                     tdiffinmin
0  02:00:00   John  1970-01-01 00:00:00.000000002  1970-01-01 02:00:00.000000002
1  01:30:00  Peter  1970-01-01 00:00:00.000000002  1970-01-01 01:30:00.000000002
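
The Start values in that output are nanoseconds after the epoch only because the sample data used the integer 2. A minimal sketch with realistic start times, which are assumed values chosen for illustration, shows the same approach producing end times:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Peter"],
    "Start": ["2021-06-01 09:00", "2021-06-01 23:15"],  # assumed sample start times
    "Length": [120, 90],                                # session length in minutes
})

df["Start"] = pd.to_datetime(df["Start"])
# Convert the minutes to timedeltas, then add the two columns element-wise.
df["End"] = df["Start"] + pd.to_timedelta(df["Length"], unit="min")
```

The second row crosses midnight, and the addition rolls the date forward as expected.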

How to set an index within multiindex for datetime?

I have this df:
                      open     high      low    close    volume
date       symbol
2014-02-20 AAPL    69.9986  70.5252  69.4746  69.7569  76529103
           MSFT    33.5650  33.8331  33.4087  33.7259  27541038
2014-02-21 AAPL    69.9727  70.2061  68.8967  68.9821  69757247
           MSFT    33.8956  34.2619  33.8241  33.9313  38030656
2014-02-24 AAPL    68.7063  69.5954  68.6104  69.2841  72364950
           MSFT    33.6723  33.9269  33.5382  33.6723  32143395
which is returned from here:
from datetime import datetime
from iexfinance.stocks import get_historical_data
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
start = '2014-01-01'
end = datetime.today().utcnow()
symbol = ['AAPL', 'MSFT']
prices = pd.DataFrame()
datasets_test = []
for d in symbol:
    data_original = data.DataReader(d, 'iex', start, end)
    data_original['symbol'] = d
    prices = pd.concat([prices,data_original],axis=0)
prices = prices.set_index(['symbol'], append=True)
prices.sort_index(inplace=True)
when trying to get the day of the week:
A['day_of_week'] = features.index.get_level_values('date').weekday
I get error:
AttributeError: 'Index' object has no attribute 'weekday'
I tried to change the date index to date time with
prices['date'] = pd.to_datetime(prices['date'])
but got this error:
KeyError: 'date'
Any idea how to keep both index levels (date and symbol) but convert one of them (date) to datetime so I can get the day of the week?
Looks like the date level of the index contains strings, not datetime objects. One solution is to reset all MultiIndex levels into columns, convert the date column to datetime, and set the MultiIndex back. Then you can proceed with pandas datetime accessors like .weekday in the usual way.
prices = prices.reset_index()
prices['date'] = pd.to_datetime(prices['date'])
prices = prices.set_index(['date', 'symbol'])
prices.index.get_level_values('date').weekday
Int64Index([3, 3, 4, 4, 0, 0, 1, 1, 2, 2,
...
1, 1, 2, 2, 3, 3, 4, 4, 1, 1],
dtype='int64', name='date', length=2516)
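
As an alternative to resetting and rebuilding the index, the date level can be converted in place with MultiIndex.set_levels. This is a different technique from the one in the answer above, shown here as a minimal sketch on a toy frame whose values are copied from the sample table in the question.

```python
import pandas as pd

# Toy MultiIndex whose 'date' level holds strings, as in the question.
index = pd.MultiIndex.from_product(
    [["2014-02-20", "2014-02-21", "2014-02-24"], ["AAPL", "MSFT"]],
    names=["date", "symbol"],
)
prices = pd.DataFrame(
    {"close": [69.7569, 33.7259, 68.9821, 33.9313, 69.2841, 33.6723]},
    index=index,
)

# Replace only the 'date' level with its datetime version; 'symbol' is untouched.
prices.index = prices.index.set_levels(
    pd.to_datetime(prices.index.levels[0]), level="date"
)
prices["day_of_week"] = prices.index.get_level_values("date").weekday
```

The weekday numbers follow the pandas convention of Monday as 0, so 2014-02-20 (a Thursday) comes out as 3.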

How to set frequency with pd.to_datetime()?

When fitting a statsmodel, I'm receiving a warning about the date frequency.
First, I import a dataset:
import statsmodels as sm
df = sm.datasets.get_rdataset(package='datasets', dataname='airquality').data
df['Year'] = 1973
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.drop(columns=['Year', 'Month', 'Day'], inplace=True)
df.set_index('Date', inplace=True, drop=True)
Next I try to fit a SES model:
fit = sm.tsa.api.SimpleExpSmoothing(df['Wind']).fit()
Which returns this warning:
/anaconda3/lib/python3.6/site-packages/statsmodels/tsa/base/tsa_model.py:171: ValueWarning: No frequency information was provided, so inferred frequency D will be used.
% freq, ValueWarning)
My dataset is daily so inferred 'D' is ok, but I was wondering how I can manually set the frequency.
Note that the DatetimeIndex doesn't have the freq (last line) ...
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq=None)
As per this answer I've checked for missing dates, but there doesn't appear to be any:
pd.date_range(start = '1973-05-01', end = '1973-09-30').difference(df.index)
DatetimeIndex([], dtype='datetime64[ns]', freq='D')
How should I set the frequency for the index?
I think pd.to_datetime does not set a default frequency; you need DataFrame.asfreq:
df = df.set_index('Date').asfreq('d')
print (df.index)
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq='D')
But if the index contains duplicated values, you get an error:
df = pd.concat([df, df])
df = df.set_index('Date')
print (df.asfreq('d').index)
ValueError: cannot reindex from a duplicate axis
The solution is to use resample with an aggregate function:
print (df.resample('2D').mean().index)
DatetimeIndex(['1973-05-01', '1973-05-03', '1973-05-05', '1973-05-07',
'1973-05-09', '1973-05-11', '1973-05-13', '1973-05-15',
'1973-05-17', '1973-05-19', '1973-05-21', '1973-05-23',
'1973-05-25', '1973-05-27', '1973-05-29', '1973-05-31',
'1973-06-02', '1973-06-04', '1973-06-06', '1973-06-08',
'1973-06-10', '1973-06-12', '1973-06-14', '1973-06-16',
'1973-06-18', '1973-06-20', '1973-06-22', '1973-06-24',
'1973-06-26', '1973-06-28', '1973-06-30', '1973-07-02',
'1973-07-04', '1973-07-06', '1973-07-08', '1973-07-10',
'1973-07-12', '1973-07-14', '1973-07-16', '1973-07-18',
'1973-07-20', '1973-07-22', '1973-07-24', '1973-07-26',
'1973-07-28', '1973-07-30', '1973-08-01', '1973-08-03',
'1973-08-05', '1973-08-07', '1973-08-09', '1973-08-11',
'1973-08-13', '1973-08-15', '1973-08-17', '1973-08-19',
'1973-08-21', '1973-08-23', '1973-08-25', '1973-08-27',
'1973-08-29', '1973-08-31', '1973-09-02', '1973-09-04',
'1973-09-06', '1973-09-08', '1973-09-10', '1973-09-12',
'1973-09-14', '1973-09-16', '1973-09-18', '1973-09-20',
'1973-09-22', '1973-09-24', '1973-09-26', '1973-09-28',
'1973-09-30'],
dtype='datetime64[ns]', name='Date', freq='2D')
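
The same two steps can be tried on a small frame. This is a minimal sketch with three made-up daily rows rather than the airquality dataset; it uses a daily 'D' rule for the resample so the duplicated rows collapse back to one row per day.

```python
import pandas as pd

dates = pd.to_datetime(["1973-05-01", "1973-05-02", "1973-05-03"])
df = pd.DataFrame({"Wind": [7.4, 8.0, 12.6]}, index=dates)
freq_before = df.index.freq        # None: to_datetime does not set a frequency

daily = df.asfreq("D")             # same rows, index now carries freq='D'

# With duplicated dates asfreq fails, so aggregate with resample instead.
dup = pd.concat([df, df]).sort_index()
daily_mean = dup.resample("D").mean()
```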
The problem is caused by the frequency not being set explicitly. In most cases you can't be sure that your data has no gaps, so generate a date range with
rng = pd.date_range(start = '1973-05-01', end = '1973-09-30', freq='D')
then reindex your DataFrame with this rng and fill the resulting np.nan values with your method or value of choice.
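
A minimal sketch of that reindex step, using a made-up frame with one missing day and a forward fill as the (assumed) fill method:

```python
import pandas as pd

# Toy data with a gap: 1973-05-03 is missing.
df = pd.DataFrame(
    {"Wind": [7.4, 8.0, 11.5]},
    index=pd.to_datetime(["1973-05-01", "1973-05-02", "1973-05-04"]),
)

rng = pd.date_range(start="1973-05-01", end="1973-05-04", freq="D")
reindexed = df.reindex(rng)   # the missing day appears as NaN
filled = reindexed.ffill()    # forward fill is one possible choice
```

The reindexed frame takes rng as its index, so it carries the explicit daily frequency along with it.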