What are Pandas "expanding window" functions? - pandas

The pandas documentation lists a number of "expanding window functions":
http://pandas.pydata.org/pandas-docs/version/0.17.0/api.html#standard-expanding-window-functions
But I couldn't figure out what they do from the documentation.

You may want to read this part of the pandas docs:
A common alternative to rolling statistics is to use an expanding
window, which yields the value of the statistic with all the data
available up to that point in time.
These follow a similar interface to .rolling, with the .expanding
method returning an Expanding object.
As these calculations are a special case of rolling statistics, they
are implemented in pandas such that the following two calls are
equivalent:
In [96]: df.rolling(window=len(df), min_periods=1).mean()[:5]
Out[96]:
                   A         B         C         D
2000-01-01  0.314226 -0.001675  0.071823  0.892566
2000-01-02  0.654522 -0.171495  0.179278  0.853361
2000-01-03  0.708733 -0.064489 -0.238271  1.371111
2000-01-04  0.987613  0.163472 -0.919693  1.566485
2000-01-05  1.426971  0.288267 -1.358877  1.808650

In [97]: df.expanding(min_periods=1).mean()[:5]
Out[97]:
                   A         B         C         D
2000-01-01  0.314226 -0.001675  0.071823  0.892566
2000-01-02  0.654522 -0.171495  0.179278  0.853361
2000-01-03  0.708733 -0.064489 -0.238271  1.371111
2000-01-04  0.987613  0.163472 -0.919693  1.566485
2000-01-05  1.426971  0.288267 -1.358877  1.808650

To sum up the difference between the rolling and expanding functions in one line:
In a rolling function the window size remains constant, whereas in an expanding function it grows.
Example:
Suppose you want to predict the weather and you have 100 days of data:
Rolling: say the window size is 10. For the first prediction, it uses the previous 10 days of data to predict day 11. For the next prediction, it uses days 2 through 11 to predict day 12.
Expanding: for the first prediction it uses 10 days of data. However, for the second prediction it uses 10 + 1 days of data. The window has therefore "expanded."
The window size keeps growing in the latter method.
Code example:
sums = series.expanding(min_periods=2).sum()
Here series contains the number of apps downloaded over time.
The line above computes, at each point in time, the cumulative sum of downloaded apps so far.
Note: min_periods=2 means the window needs at least 2 observations before a value is produced. Our aggregate here is the sum.
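For a concrete comparison, here is a minimal, self-contained sketch (the data below is made up purely for illustration) contrasting a fixed rolling window with an expanding one:
import pandas as pd

# Made-up daily download counts, purely for illustration
series = pd.Series([3, 1, 4, 1, 5, 9, 2, 6],
                   index=pd.date_range("2020-01-01", periods=8, freq="D"))

rolling_sums = series.rolling(window=3).sum()           # fixed 3-day window
expanding_sums = series.expanding(min_periods=2).sum()  # window grows each day

print(pd.DataFrame({"rolling(3)": rolling_sums, "expanding": expanding_sums}))
The rolling column always looks back exactly 3 days, while the expanding column keeps accumulating everything seen so far.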

Those illustrations from Uber explain the concepts very well:
[illustration: expanding window]
[illustration: sliding window]
Original article: https://eng.uber.com/omphalos/

Related

Is there an R package or function that can be used to calculate the effect size of interactions within a fully factorial study using Hedges d?

I am currently doing a meta-analysis and need to calculate the effect size of interactions. Most of the studies within the meta-analysis have four groups (Control, Stressor 1, Stressor 2 and Stressor 1 x Stressor 2 (interaction)). However, some studies include the interaction effect of three stressors.
I am looking for a function where I can calculate the SMD (Hedges d) of these interactions but so far I have only found functions that calculate the difference between a treatment and a control. There is the multiplestressorR package but that does not support instances in which there are more than two stressors.

groupBy with 3 ConstraintCollectors?

I would like to schedule observations of variable duration (the planning entity) into hourly time slots over several nights. I need to impose that there are no gaps within particular groups, and I need collectors for minimum, maximum and sum. Is there a workaround to have a groupBy with one groupKeyMapping and three collectors?
constraintFactory.from(OB.class)
.groupBy(OB::getGroupID, min(OB::getStart), max(OB::getEnd), sum(OB::getDuration))
I tried to work around this using toList() and computing the values myself, but strangely it doesn't pass down a List<OB> but single OBs. The code below prints class my.package.OB
constraintFactory.from(OB.class)
.groupBy(OB::getGroupID, toList())
.filter((groupID, entries) -> {
println "=> ${entries.class} "
return true
})
This was a gap in our API (see PLANNER-2330). It is being addressed for OptaPlanner 8.3.0. We encourage you to upgrade - not only to get this particular feature, but also some nice performance improvements, assuming you're using higher cardinality joins.

Is there a way to generate random numbers between 0 and 500, but if first number is 300 not to deviate more than 20 for the next?

Is there a way to generate random numbers between 0 and 500, but if the first number is, for example, 300, not deviate more than 20 for the next? I don't want 500, then 0, then 399, then 1. Thanks.
Just plug the first random number back into the "Random Number (Range)" built-in VI.
Bonus
Use a shift register to find a new random number within range of the last random number:
The previous answer requires LabVIEW 2019 or later.
The OpenG Numeric Library has a similar function for generating a random number in a specified range, and it supports earlier versions of LabVIEW.
Also, based on the task description - if I've understood it correctly - the random numbers should always stay in the range 0 - 500, so we need an additional check that the +/- 20 offset does not push the number out of that range.
Here is a snippet of the solution which implements this. Note that I used Select functions only in order to show all the code in one snippet (instead of a Case Structure with multiple pages).
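The snippet itself is a LabVIEW diagram, but the same clamping idea can be sketched in Python purely as an illustration of the logic (the function name and parameters below are made up for this example):
import random

# Illustrative sketch: each new number stays within +/- 20 of the previous one,
# and the candidate range is clamped to the overall 0 - 500 bounds.
def bounded_walk(n, low=0, high=500, max_step=20):
    values = [random.randint(low, high)]        # first number: anywhere in range
    for _ in range(n - 1):
        prev = values[-1]
        lo = max(low, prev - max_step)          # don't go below 0
        hi = min(high, prev + max_step)         # don't go above 500
        values.append(random.randint(lo, hi))
    return values

print(bounded_walk(10))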

pivot_table error - InvalidOperation: [<class 'decimal.InvalidOperation'>]

The above error is being raised from a pivot_table operation for a variable set to be the column grouping (if it matters, it is failing in the format.py module):
/anaconda/lib/python3.4/site-packages/pandas/core/format.py in __call__(self, num)
2477 sign = 1
2478
-> 2479 if dnum < 0: # pragma: no cover
2480 sign = -1
2481 dnum = -dnum
(pandas v0.17.1)
If I create random values for the 'problem' variable via numpy there is no error.
Whilst I doubt it's an edge case for the pivot_table function, I can't figure out what might be causing the problem on the data side:
i) The variable is the first digit of a modest-sized sequence of digits (e.g. 2 from 246), obtained via df.var.str[0].
ii) pd.unique(df.var) returns the expected 1-9 values
iii) There are no NaNs: notnull(df.var).all() returns True
iv) The dtype is int64 (and if the integer is cast as a string, or set as a label, these alternatives still fail with the same error).
v) A period index is used, and that forms the index for the pivot table.
vi) The aggregation is 'count'.
Creating another variable with those characteristics and random values (1-9, from numpy's random.randint), the pivot_table call works. If I cast it as a string, or use labels, it still works.
Likewise, I've been playing with the data set for a while, usually on some other position in the sequence, without issue. But today the first position is causing a problem.
Possibly it's a data issue - but why doesn't pivot_table return empty cells or NaNs, rather than failing at that point?
But I'm at a loss after a day exploring.
Any thoughts on why the above error is being raised would be much appreciated (as it will help me track down the data issue, if that is the case).
Thanks,
Chris
The simplest solution is to reset the pandas formatting options with
pd.set_option('display.float_format', None)
I had encountered the same problem. As a workaround you can also filter the dataframe that is pivoted, to avoid NaNs in the result.
My problem was related to the use of pd.set_eng_float_format(2, True). Without it, all pivots work fine.
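A minimal sketch of that fix (the DataFrame and its column names below are made up purely for illustration): activate the engineering float format and then reset it before displaying the pivot.
import pandas as pd

# Made-up example data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "period": pd.period_range("2015-01", periods=6, freq="M"),
    "digit": [1, 2, 3, 1, 2, 3],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

pd.set_eng_float_format(2, True)   # the formatter that triggered the error in my case

# Resetting the float formatter restores the default display behaviour:
pd.set_option("display.float_format", None)

print(pd.pivot_table(df, index="period", columns="digit",
                     values="value", aggfunc="count"))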

Time Trend Variable in Balanced Panel Data, Stata

I have some balanced panel data and want to include a trend variable in my regression. However, I have 60 districts over a 7-year time period and I am not sure how to include the trend variable. The year variable is repetitive as expected and runs from 2005 to 2011. I am thinking about the following:
gen t = .
replace t = 1 if year==2005
replace t = 2 if year==2006
up to year 2011, and it gives me a t variable from 1 to 7, for 180 different panels in the data.
My question: is it OK to include the trend variable as I described above, or should I directly throw the year variable into the regression?
Your variable t is just
gen t = year - 2004
and can be obtained in one line as above. Your variable t has one small advantage over year: if you regress a variable on t, the intercept refers to values in 2004, which is a gain over referring to values in year 0, which is way outside the range of the data.
In panel data analysis we call that a time effect. If you include only dummy variables for individual districts, they are called individual effects (in your case, district effects). Including either individual effects or time effects in the panel data is called a one-way fixed effects model, whereas including both is called a two-way fixed effects model. In Stata you do the following:
use http://dss.princeton.edu/training/Panel101.dta
reg y x1 i.year           // time effect
reg y x1 i.country        // country effect (in your case, district effect)
reg y x1 i.year i.country // two-way fixed effects
For details see tutorial from UCLA.