Shifting a Pandas column, then taking the mean of the next 3 values (post_shift) - pandas

I have a dataframe, df, which looks like this:
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
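(For anyone who wants to run the snippets below, this frame can be reconstructed with a sketch like the following; all values are copied from the table above.)
import pandas as pd

df = pd.DataFrame(
    {'Open':   [2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70],
     'High':   [2.95, 2.87, 2.83, 3.29, 4.10, 3.91, 3.88],
     'Low':    [2.64, 2.78, 2.51, 2.60, 3.60, 3.33, 3.66],
     'Close':  [2.86, 2.78, 2.52, 3.28, 3.80, 3.57, 3.71],
     'Volume': [176389, 63316, 54051, 589443, 1114659, 360501, 185787]},
    index=pd.to_datetime(['2007-03-22', '2007-03-23', '2007-03-26', '2007-03-27',
                          '2007-03-28', '2007-03-29', '2007-03-30']))
df.index.name = 'Date'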
I'm attempting to create a new column that first shifts the Open column 3 rows (df.Open.shift(-3)) and then takes the average of itself and the next 2 values.
So, for example, the above dataframe's Open column would be shifted -3 rows and look something like this:
Date
2007-03-22 2.610
2007-03-23 3.650
2007-03-26 3.910
2007-03-27 3.700
2007-03-28 3.710
2007-03-29 3.710
2007-03-30 3.500
I then want to take the forward mean of the next 3 values (including itself) via iteration.
So the first iteration would be (2.610 + 3.650 + 3.910) / 3, i.e. the first value plus the next two values, divided by 3.
Then we take the next value, 3.650, as the first value and do the same, creating a column of values.
At first I tried something like:
df['Avg'] = df.Open.shift(-3).iloc[0:3].mean()
But this doesn't iterate through all the values of Open.shift(-3).
This next loop seems to work but is very slow, and I was told it's bad practice to use for loops in Pandas.
for i in range(0, len(df.Open)):
    df['Avg'][i] = df.Open.shift(-3).iloc[i:i+3].mean()
I tried thinking of ways to use apply:
df.Open.shift(-3).apply(loc[0:4].mean())
df.Open.shift(-3).apply(lambda x: x[0:4].mean())
but these seem to give errors such as
TypeError: 'float' object is not subscriptable, etc.
I can't think of an elegant way of doing this.
Thank you.

You can use pandas rolling_mean. Since it uses a backward window, it will give you the first two rows as 2.61 (the value itself) and 3.13 (the mean of row 0 and row 1). To handle that, you can use shift(-2) to shift the values by 2 rows.
pd.rolling_mean(df, window=3, min_periods=1).shift(-2)
output:
open
date
2007-03-22 3.390000
2007-03-23 3.753333
2007-03-26 3.773333
2007-03-27 3.706667
2007-03-28 3.640000
2007-03-29 NaN
2007-03-30 NaN
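Note that pd.rolling_mean was removed in later pandas releases; an equivalent using the .rolling accessor (a sketch, assuming you start from the raw Open column rather than a pre-shifted frame) is:
# shift first, take a backward 3-row rolling mean, then shift(-2) to re-centre it forward,
# mirroring the rolling_mean call above
df['Avg'] = df.Open.shift(-3).rolling(window=3, min_periods=1).mean().shift(-2)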

numpy solution
As promised
NOTE: HUGE CAVEAT
This is an advanced technique and is not recommended for any beginner!!!
Using this might actually shave your poodle bald by accident. BE CAREFUL!
as_strided
from numpy.lib.stride_tricks import as_strided
import numpy as np
import pandas as pd
# I didn't have your full data for all dates
# so I created my own array
# You should be able to just do
# o = df.Open.values
o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
# because we shift 3 rows, I trim with 3:
# because it'll be rolling 3 period mean
# add two np.nan at the end
# this makes the strides cleaner.. sortof
# whatever, I wanted to do it
o = np.append(o[3:], [np.nan] * 2)
# strides are the size of the chunk of memory
# allocated to each array element. there will
# be a stride for each numpy dimension. for
# a one dimensional array, I only want the first
s = o.strides[0]
# it gets fun right here
as_strided(o, (len(o) - 2, 3), (s, s))
# arguments:
#   o               -> the object to stride over
#   (len(o) - 2, 3) -> shape of the array to make
#   (s, s)          -> size of the memory chunk to step forward, per dimension
[[ 2.61  3.65  3.91]
 [ 3.65  3.91  3.7 ]
 [ 3.91  3.7   3.71]
 [ 3.7   3.71  3.71]
 [ 3.71  3.71  3.5 ]
 [ 3.71  3.5    nan]
 [ 3.5    nan   nan]]
Now we just take the mean. All together
o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
o = np.append(o[3:], [np.nan] * 2)
s = o.strides[0]
as_strided(o, (len(o) - 2, 3), (s, s)).mean(1)
array([ 3.39, 3.75333333, 3.77333333, 3.70666667, 3.64, nan, nan])
You can wrap it in a pandas Series:
pd.Series(
    as_strided(o, (len(o) - 2, 3), (s, s)).mean(1),
    df.index[3:],
)
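On NumPy 1.20+ you can get the same windows without hand-computed strides by using sliding_window_view (a sketch over the same o array as above):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
o = np.append(o[3:], [np.nan] * 2)       # shift by 3 and pad with NaN, as before

windows = sliding_window_view(o, 3)      # read-only view with shape (len(o) - 2, 3)
print(windows.mean(axis=1))              # same result as the as_strided version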

Related

iterating through a dataframe alternative of for loop

I have a very large dataframe. I did a for loop, but it is taking forever, and I am wondering if there is an alternative?
index  ids   year
0      1890  2001
1      2678  NaN
2      4780  NaN
3      9844  1999
The idea is to get an array of ids of people who have NaN values in the 'year' column, so what I did was turn NaN into 0 and write this for loop.
df_nan = []
for i in range(0, len(df.index)):
    for j in range(0, len(df.columns)):
        if int(df.values[i, j]) == 0:
            df_nan.append(df.values[i, 0])
The for loop works (I tried it on a smaller dataframe), but I can't use it on the main one because it takes so long.
You can use filtering.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids': [1890, 2678, 4780, 9844], 'year': [2001, np.nan, np.nan, 1999]})
nan_rows = df[df['year'].isnull()]
ids = nan_rows['ids'].values
print(ids)  # outputs: [2678 4780]
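Equivalently, the boolean mask and the column selection can be combined in a single .loc call (same df as above):
ids = df.loc[df['year'].isna(), 'ids'].to_numpy()
print(ids)  # [2678 4780]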

Random Choice loop through groups of samples

I have a df containing columns "Income_Groups", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd
df = {'Income_Groups': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
      'Rate': [1.23, 1.25, 1.56, 2.11, 2.32, 2.36, 3.12, 3.45, 3.55],
      'Probability': [0.25, 0.50, 0.25, 0.50, 0.25, 0.25, 0.10, 0.70, 0.20]}
df2 = pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability'])))
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability'])))
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
Use GroupBy.apply because of the weights.
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another silly way, because your weights seem to have precision to 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat((s.index.get_level_values('Probability') * 100).round().astype(int))  # weight by repetition
  .sample(frac=1)                                                               # shuffle
  .reset_index()
  .drop_duplicates(subset=['Income_Groups'])                                    # keep the first row per group
  .drop(columns='Probability'))
#   Income_Groups  Rate
# 0             2  2.32
# 1             1  1.25
# 3             3  3.45
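For what it's worth, newer pandas (1.1+) also exposes GroupBy.sample, which takes per-row weights directly; a minimal sketch using the same df2 (random_state is only added for reproducibility):
picked = (df2.groupby('Income_Groups')
             .sample(n=1, weights=df2['Probability'], random_state=0)
             .set_index('Income_Groups')['Rate'])
print(picked)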

Pandas .loc without KeyError

>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']] # Succeeds, as in the answer below.
I'd like something that doesn't fail in either of
>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
Is there a function like loc which gracefully handles this, or some other way of expressing this query?
Update for @AlexLenail's comment
It's a fair point that this will be slow for large lists. I did a little more digging and found that the intersection method is available for indexes and columns. I'm not sure about the algorithmic complexity, but it's much faster empirically.
You can do something like this.
good_keys = df.index.intersection(all_keys)
df.loc[good_keys]
Or like your example
df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
Here is a little experiment below
import numpy as np
import pandas as pd

n = 100000
# Create random values and random string indexes;
# the bad indexes contain extra values not in the DataFrame index.
rand_val = np.random.rand(n)
rand_idx = []
for x in range(n):
    rand_idx.append(str(x))
bad_idx = []
for x in range(n * 2):
    bad_idx.append(str(x))
df = pd.DataFrame(rand_val, index=rand_idx)
df.head()

def get_valid_keys_list_comp():
    # Return filtered DataFrame using a list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return filtered DataFrame using Index.intersection() to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]
%%timeit
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
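For completeness, Index.isin gives another vectorized filter that simply ignores missing keys; a sketch using the same df and bad_idx as in the experiment above:
def get_valid_keys_isin():
    # Boolean mask over the index; keys that are absent just don't match
    return df[df.index.isin(bad_idx)]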
Original answer
I'm not sure if pandas has a built-in function to handle this, but you can use a Python list comprehension to filter to valid indexes with something like this.
Given a DataFrame df2
A B C D F
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
You can filter your index query with this
keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]
This will also work for columns if you use df2.columns instead of df2.index.values
I found an alternative (provided a check for df.empty is made beforehand). You could do something like this:
df[df.index == '2'] -> returns either a dataframe with the matched values or an empty dataframe.
It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
# Create a subset of the dataframe.
df.loc[keys]
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
Virginia NaN NaN
Or if you want to exclude the NaN row:
df.loc[keys].dropna()
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
This page https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike has the solution:
In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
Out[8]:
0
2 NaN
Using the sample dataframe from @binjip's answer:
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!
df.reindex(keys)
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
Virginia NaN NaN
If you want to omit missing keys:
df.reindex(df.index.intersection(keys))
distance population
Alabama 0 4.8
Alaska 300 0.7
Arizona 600 6.4
df.loc uses the index (values from df.index), not the position of the row. Did you mean to use .iloc instead?
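A tiny illustration of the difference, using a throwaway frame:
import pandas as pd

df = pd.DataFrame([10, 20], index=['1', '2'])
print(df.loc['2'])   # label-based: the row whose index label is '2'
print(df.iloc[1])    # position-based: the second row (the same row here)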

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut to. Periodically there will be a series (row) that has fewer values than the desired number of quantiles (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking into 2 quantiles (of the same boundary value)
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer value for the number of quantiles, an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
Ok, this is a workaround which might work for you.
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
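As a self-contained version of that workaround (same s as in the question), capping the number of bins at the number of non-NaN values:
import numpy as np
import pandas as pd

s = pd.Series([5, np.nan, np.nan])
binned = pd.qcut(s, len(s.dropna()), duplicates='drop')  # never ask for more bins than there are non-NaN values
print(binned)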
You can try filling your object/numeric columns with the appropriate filler ('null' for strings and 0 for numeric):
# fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)

# fill object cols with 'null'
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7. This worked for me.

Numpy or Pandas function for "x-value-window" means or other stats?

Let's say I have x-y data samples sorted by x-value. I'm going to use Pandas as example, but I would be perfectly happy with a Numpy/Scipy-only solution, of course.
In [24]: pd.set_option('display.max_rows', 10)
In [25]: df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
In [26]: df = df.sort_values('x')
In [27]: df
Out[27]:
x y
13 -3.403818 0.717744
49 -2.688876 1.936267
74 -2.388332 -0.121599
52 -2.185848 0.617896
90 -2.155343 -1.132673
.. ... ...
65 1.736506 -0.170502
0 1.770901 0.520490
60 1.878376 0.206113
63 2.263602 1.112115
33 2.384195 -1.877502
[100 rows x 2 columns]
Now, I want to kind of "window" it or "discretize" it and get statistics on each window. But I don't want to do the Pandas moving-window functions because they define windows by rows. I want to define windows by a span of x-values, thus "x-value-window". Specifically, let's define each x-value-window with 2 parameters:
center x-value of each window
in this example, let's say I want x = 0.0 + 0.4 * k for all positive or negative k
thus -3.2, -2.8, -2.4, ..., 1.6, 2.0, 2.4
width of each window
in this example, let's say I want W = 0.5
thus, the example windows will be [-3.2-0.25, -3.2+0.25], [-2.8-0.25, -2.8+0.25], ..., [2.4-0.25, 2.4+0.25]
note that the windows overlap, which is intended
Having thus defined the windows, I would like to ask if there's a function that will produce the following data frame (or numpy array):
x y
-3.2 mean of y-values in x-value-window centered at -3.2
-2.8 mean of y-values in x-value-window centered at -2.8
-2.4 mean of y-values in x-value-window centered at -2.4
... ...
1.6 mean of y-values in x-value-window centered at 1.6
2.0 mean of y-values in x-value-window centered at 2.0
2.4 mean of y-values in x-value-window centered at 2.4
Is there anything that will do this for me? Or do I have to totally roll my own (and probably in a very slow python loop instead of fast numpy or pandas code)?
Extra 1: It would be even better if there's support for weighted windows (such as supported by Pandas's rolling_window function) but of course the weights in this case would not be based on how far the sample's row is from the center row of the window, but rather, how far the sample's x-value is from the center of the x-value-window.
Extra 2: It would be nice if there's support for statistics other than mean on the x-value-windows, e.g. (a) variance of the y-values in each x-value-window or (b) count of the number of samples falling within each x-value-window.
I first create a range of x values centered at zero. This range is wide enough that the min value minus the width and the max value plus the width will capture all x values.
I then iterate through this range of x values, which has k as the step size. At each point, I use loc to capture the y values located at the selected x value plus and minus the width. The mean of these selected values is then calculated. These values are used to create the result dataframe.
import math
import numpy as np
import pandas as pd
k = .4
w = .5
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
x_range = np.arange(math.floor((df.x.min() + w) / k) * k,
                    k * (math.ceil((df.x.max() - w) / k) + 1), k)
result = pd.DataFrame((df.loc[df.x.between(x - w, x + w), 'y'].mean() for x in x_range),
                      index=x_range, columns=['y_mean'])
result.index.name = 'centered_x'
>>> result
y_mean
centered_x
-2.400000e+00 0.653619
-2.000000e+00 0.733606
-1.600000e+00 0.576594
-1.200000e+00 0.150462
-8.000000e-01 0.065884
-4.000000e-01 0.022925
-8.881784e-16 0.211693
4.000000e-01 0.057527
8.000000e-01 -0.141970
1.200000e+00 0.233695
1.600000e+00 0.203570
2.000000e+00 0.306409
2.400000e+00 0.576789
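A vectorized alternative to iterating over x_range is to build one boolean mask per window with broadcasting; a sketch reusing df, x_range and w from above (memory grows as n_windows * n_samples):
x = df['x'].to_numpy()
y = df['y'].to_numpy()

mask = np.abs(x[None, :] - x_range[:, None]) <= w        # shape (n_windows, n_samples)
counts = mask.sum(axis=1)                                # per-window sample count ("Extra 2")
sums = (mask * y).sum(axis=1)
y_mean = np.where(counts > 0, sums / np.where(counts > 0, counts, 1), np.nan)

result_vec = pd.Series(y_mean, index=x_range, name='y_mean')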