Grouping a Group with Accumulator - pandas

I want to segment time-series data using separate interval sets. For example, one set has smaller periodic intervals, and one set has larger aperiodic intervals. The interval sets are likely unaligned, such that an interval from one set may bisect an interval from the other. In that case the aggregate over the bisected, second interval should resume with the previous value as its initial value, i.e. an accumulator. The sets are nested in that the carry only occurs within the "inner" set.
For example, sum:
[.,.,.,.,.,.,.,.,., .,.,.,.,.,.][.,.,.,.,.,.,....
[1,2,3,4,5,6,7,8,9][1,2,3,4,5,6 ,7,8,9]
[[45], [21] ],[[45], ....
|_value carries_|
I want to say I should iterate over the dataframe with a for-loop, but...
Specifics:
Both interval sets are series of datetime64 dtype representing boundary or cut points which can be closed either to the left or right. The aperiodic interval set is generated manually, but the periodic set is driven by the time-series data:
d = pd.read_csv...
o = pd.offsets.Week(weekday=1)
p = pd.date_range(o.rollback(d.index[0]), o.rollforward(d.index[-1]), freq=o)
But it is easy enough to convert both to interval indices, if that helps:
p = pd.interval_range(o.rollback(d.index[0]), o.rollforward(d.index[-1]), freq=o)
a = pd.IntervalIndex(pd.arrays.IntervalArray.from_breaks(a))
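A rough sketch of one way the carry could be expressed, purely for illustration (it assumes the IntervalIndex forms of p and a above and a hypothetical value column named 'value'; not a vetted solution):
inner = pd.cut(d.index, p)  # periodic (inner) interval of each timestamp
outer = pd.cut(d.index, a)  # aperiodic (outer) interval of each timestamp
# sum within each (outer, inner) pair, then accumulate across the outer pieces
# of the same inner interval so a bisected interval resumes from its prior total
g = d['value'].groupby([outer, inner], observed=True).sum()  # 'value' is a hypothetical column name
carried = g.groupby(level=1, observed=True).cumsum()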

Related

pd.to_timedelta('1D') works while pd.to_timedelta('D') fails. Issue when converting frequency to timedelta

I'm encountering a silly problem I guess.
Currently, I'm using pd.infer_freq to get the frequency of a dataframe index. Afterwards, I use pd.to_timedelta() to convert this frequency to a timedelta object, to be added to another date.
This works fine, except when the dataframe index has a frequency which can be expressed as a single time unit (e.g. 1 day, or 1 minute).
To be more precise,
freq = pd.infer_freq(df.index)
# let's say it gives '2D' because data index is spaced on that interval
timedelta = pd.to_timedelta(freq)
works, while
freq = pd.infer_freq(df.index)
# let's say it gives 'D' because data index is spaced on that interval
timedelta = pd.to_timedelta(freq)
fails and returns
ValueError: unit abbreviation w/o a number
This could work if I supplied '1D' instead of 'D' though.
I could try to check if the first character of the freq string is numeric, and add '1' otherwise, but that seems quite cumbersome.
Is anyone aware of a better approach?
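One workaround (a sketch; it assumes the inferred frequency is a fixed-length "tick" offset such as days, hours, or minutes, since calendar offsets like 'M' have no fixed Timedelta) is to convert the string to an offset first, then to a Timedelta:
import pandas as pd
from pandas.tseries.frequencies import to_offset

freq = pd.infer_freq(df.index)    # e.g. 'D' or '2D'
offset = to_offset(freq)          # accepts 'D' without a leading number
timedelta = pd.Timedelta(offset)  # works for fixed-length (tick) offsets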

In Python, is there a way to divide an element in a list by the next element in a list using a for loop or list comprehension?

I have a list of metrics that each have values for multiple time periods. I would like to write a script that takes the value of a metric for a particular time period and divides it by that metric's value for the previous period (the prior year).
Currently my code looks like this:
for metric in metrics:
    iya_df[metric+' '+period[0][-4:]+' IYA'] = pivot[metric][period[0]]/pivot[metric][period[1]]*100
    iya_df[metric+' '+period[1][-4:]+' IYA'] = pivot[metric][period[1]]/pivot[metric][period[2]]*100
    iya_df[metric+' '+period[2][-4:]+' IYA'] = pivot[metric][period[2]]/pivot[metric][period[3]]*100
    iya_df[metric+' '+period[3][-4:]+' IYA'] = pivot[metric][period[3]]/pivot[metric][period[4]]*100
I have a list of metrics and a list of periods. (The slice after period is just to grab the 4-digit year.)
The source table is a pivot table with multiple indices.
I would like to change the code so that I don't have to change it if my list of time periods changes in length.
There's probably a more efficient way to do this with list comprehension than loops but I'm still getting stronger in Python.
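One way to generalize this so it no longer depends on how many periods there are (a sketch assuming the lists are named metrics and period as above):
for metric in metrics:
    # pair each period with the one that follows it in the list (the prior year)
    for current, previous in zip(period, period[1:]):
        iya_df[metric + ' ' + current[-4:] + ' IYA'] = pivot[metric][current] / pivot[metric][previous] * 100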

How to set Custom Business Day End Frequency in Pandas

I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
'1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
'1985-01-14', '1985-01-15',
...
'1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
'1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
'1990-12-28', '1990-12-31'],
dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column such that a value at the last day of a month (which in my DatetimeIndex could e.g. be '1985-05-30') is shifted to the last day of the next month (which in my DatetimeIndex could e.g. be '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over the Offset Aliases provided by pandas.tseries.offsets. Among them are the custom business day frequency (C) and the custom business month end frequency (CBM). Looking at an example, it seems like this could provide exactly what I need:
from pandas.tseries.holiday import USFederalHolidayCalendar

mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_us)  # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_us)  # shifted by 1 day
The problem is that my DatetimeIndex does not match USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, has any of you an idea how to tackle this issue in a different way?
Thanks a lot for your help!
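One possible workaround (a sketch, assuming df is the frame with the DatetimeIndex shown above): treat every business day that is missing from the index as a holiday and build the custom offsets from those dates, so they follow your own calendar instead of USFederalHolidayCalendar:
import pandas as pd

# business days in the covered range that are absent from the index
all_bdays = pd.date_range(df.index[0], df.index[-1], freq='B')
missing = all_bdays.difference(df.index)

day_custom = pd.offsets.CustomBusinessDay(holidays=missing)
mth_custom = pd.offsets.CustomBusinessMonthEnd(holidays=missing)

df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_custom)  # shifted by 1 custom month end
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_custom)  # shifted by 1 custom business day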

Defining an RDLC chart axis with an aggregate function

The autoaxis for one of my embedded charts isn't behaving well, sometimes only showing one other major value besides top and bottom. So I thought I'd set my own boundaries, which seemed pretty easy given that one of the columns on the chart is always going to be larger than any of the others.
<Maximum>=(((Max(Fields!Entered.Value, "Chart1") + 10) \ 50) + 1) * 50</Maximum>
(the other columns detail what happened to the things that entered this process)
Round up to the nearest 50 with a little overage to put the label on top. Then I can put the intervals at this divided by 5 and I'm gold.
Except I'm not gold. The chart groups records by date and the individual bars are Sum(Fields!Entered.Value) et cetera, so it's drastically underscaling when multiple batches get processed on a single date. But hey, it groups records by date, I can use that:
<ChartCategoryHierarchy>
<ChartMembers>
<ChartMember>
<Group Name="Chart1_CategoryGroup">
<GroupExpressions>
<GroupExpression>=Fields!Date.Value</GroupExpression>
</GroupExpressions>
</Group>
</ChartMember>
</ChartMembers>
</ChartCategoryHierarchy>
as:
<Maximum>=(((Max(Fields!Entered.Value, "Chart1_CategoryGroup") + 10) \ 50) + 1) * 50</Maximum>
and it'll aggregate over the group just fine. Right?
The ValueAxis_Primary.Maximum expression for the chart 'Chart1' has a scope parameter that is not valid for an aggregate function. The scope parameter must be set to a string constant that is equal to either the name of a containing group, the name of a containing data region, or the name of a dataset.
Nope! It works just fine for "Chart1" but not for "Chart1_CategoryGroup"!
So, uh:
what scope are the axis calculations operating in, 'cause it ain't the category scope?
is there some way to provide them an aggregate scope that groups the data by date so they can do their calculations proper?
You Have To Nest The Scope
A little extra work gave me this insight:
Max(Fields!Entered.Value, "Chart1_CategoryGroup") returns the maximum of the entered fields within one single category group, which is not the level the Y axis is concerned with. What you're interested in is the maximum value of the summed calculation (within a group) for the whole chart, so specify the scopes to do that:
<Maximum>
=(((Max(
Sum(Fields!Entered.Value, "Chart1_CategoryGroup")
, "Chart1") + 10) \ 50) + 1) * 50
</Maximum>

Calling preprocessing.scale on a heterogeneous array

I have this TypeError as per below. I have checked my df and it contains numbers only; can this be caused by the conversion to a numpy array? After the conversion the array has items like
[Timestamp('1993-02-11 00:00:00') 28.1216 28.3374 ...]
Any suggestion how to solve this, please?
df:
Date Open High Low Close Volume
9 1993-02-11 28.1216 28.3374 28.1216 28.2197 19500
10 1993-02-12 28.1804 28.1804 28.0038 28.0038 42500
11 1993-02-16 27.9253 27.9253 27.2581 27.2974 374800
12 1993-02-17 27.2974 27.3366 27.1796 27.2777 210900
X = np.array(df.drop(['High'], 1))
X = preprocessing.scale(X)
TypeError: float() argument must be a string or a number
While you're saying that your dataframe "all contains numbers only", you also note that the first column consists of datetime objects. The error is telling you that preprocessing.scale only wants to work with float values.
The real question, however, is what you expect to happen to begin with. preprocessing.scale centers values on the mean and normalizes the variance. This is such that measured quantities are all represented on roughly the same footing. Now, your first column tells you what dates your data correspond to, while the rest of the columns are numeric data themselves. Why would you want to normalize the dates? How would you normalize the dates?
Semantically speaking, I believe you should leave your dates alone. Whatever post-processing you're planning to perform on your numerical data, the normalized data should still be parameterized by the original dates. If you want to process your dates too, you need to come up with an explicit way to map your dates to something numeric (say, elapsed time from a given date in given units).
So I believe you should drop your dates from your processing round altogether, and start with
X = df.drop(columns=['Date', 'High']).to_numpy()
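For completeness, a short end-to-end sketch along those lines (assuming scikit-learn's preprocessing module and the df shown above):
from sklearn import preprocessing

dates = df['Date']  # keep the dates aside so the scaled rows can be re-aligned later
X = df.drop(columns=['Date', 'High']).to_numpy()
X_scaled = preprocessing.scale(X)  # zero mean, unit variance per column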