Pandas expanding window with min_periods - pandas

I want to compute expanding window statistics, but with a minimum number of periods of 3 rather than 1. That is, I want it to start computing the statistic once the window has 3 values, and then include all values up to that point:
value expanding_min
------------------------
6 NaN
5 NaN
2 NaN
3 2
1 1
however, using
df['expanding_min']= df.groupby(groupby)['value'].transform(lambda x: pd.rolling_min(x, window=len(x), min_periods=3))
or
df['expanding_min']= df.groupby(groupby)['value'].transform(lambda x: pd.expanding_min(x, min_periods=3))
I get the following error:
ValueError: min_periods (3) must be <= window (1)

This works for me, changing from value to df.value:
pd.expanding_min(df.value, min_periods=3)
or
pd.rolling_min(df.value, window=len(df.value), min_periods=3)
both output:
0 NaN
1 NaN
2 2
3 2
4 1
dtype: float64
Perhaps your window is being set by some other 'value' whose length is 1? That would explain why pandas gives this error message.
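Note that pd.expanding_min and pd.rolling_min were removed in later pandas versions; the modern equivalent is the .expanding() method, which sidesteps the window-length issue entirely. A minimal sketch (the group column here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a"] * 5,          # hypothetical grouping column
    "value": [6, 5, 2, 3, 1],
})

# Expanding min per group, requiring at least 3 observations before
# emitting a value; earlier rows stay NaN
df["expanding_min"] = (
    df.groupby("group")["value"]
      .transform(lambda x: x.expanding(min_periods=3).min())
)
print(df["expanding_min"].tolist())  # [nan, nan, 2.0, 2.0, 1.0]
```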

Related

sort and count values in a column DataFrame (Python Pandas)

I have the following DataFrame
df
I count the values this way
I want to have the category values in the following order:
1.0 1
4.0 1
7.0 2
10.0 1
and so on ...
In ascending order, each with its respective count.
You can sort by index using sort_index()
df['col_1'].value_counts().sort_index()
You can sort on the index after calling value_counts. Here's an example:
df = pd.DataFrame({'x':[1,2,2,2,1,4,5,5,4,3,5,6,3]})
df['x'].value_counts().sort_index()
Output:
1 2
2 3
3 2
4 2
5 3
6 1
Name: x, dtype: int64
I created a DataFrame and then sorted it with the code below:
df1.groupby(by=['Cat']).count().sort_values(by='col1')
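Since the original DataFrame isn't shown, here is a hypothetical df1 (the column names Cat and col1 and the data are assumptions) illustrating the groupby/count approach:

```python
import pandas as pd

# Hypothetical data; the original post's DataFrame was not shown
df1 = pd.DataFrame({
    "Cat": ["a", "b", "a", "c", "a", "b"],
    "col1": [1, 2, 3, 4, 5, 6],
})

# count() gives the number of rows per category; sort_values orders
# the categories by that count, ascending
out = df1.groupby(by=["Cat"]).count().sort_values(by="col1")
print(out)
```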

Pandas Groupby and Apply

I am performing a groupby and apply over a dataframe, and it is returning some strange results. I am using pandas 1.3.1.
Here is the code:
ddf = pd.DataFrame({
    "id": [1, 1, 1, 1, 2]
})
def do_something(df):
    return "x"
ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x", but instead I get this data:
id title
0 1 NaN
1 1 x
2 1 x
3 1 NaN
4 2 NaN
Is this expected?
The result is not strange; it's the right behavior: apply returns one value per group, here for groups 1 and 2, and the group names become the index of the aggregation:
>>> list(ddf.groupby("id"))
[(1,        # the group name (the future index of the grouped df)
     id     # the subset dataframe of the group
  0   1
  1   1
  2   1
  3   1),
 (2,        # the group name (the future index of the grouped df)
     id     # the subset dataframe of the group
  4   2)]
Why do some rows get a value at all? Because the group labels happen to coincide with labels in your dataframe's index, so the assignment aligns on those labels:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
Yes, it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem.
A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function to use it (mean, max, or sum). If you run one of them, for example:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
After that, do_something is applied to index 1 and 2 only, and the result is then merged back into your original df. This is why only index 1 and 2 have x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway, and having a deeper look at the GroupBy object.
If you need a new column from an aggregate-like function, use GroupBy.transform; you must specify the column to process after the groupby, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign new column in function:
def do_something(x):
    x['title'] = 'x'
    return x
ddf = ddf.groupby("id").apply(do_something)
An explanation of why the original approach does not work is in the other answers.
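The transform approach above can be sketched end to end: transform broadcasts the returned scalar back to every row of its group, so the result aligns with the original index instead of the group labels.

```python
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

def do_something(s):
    # receives one group's column as a Series; a scalar return value
    # is broadcast to every row of that group
    return "x"

ddf["title"] = ddf.groupby("id")["id"].transform(do_something)
print(ddf["title"].tolist())  # ['x', 'x', 'x', 'x', 'x']
```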

Pandas dataframe first value shows up as column name

I am new to pandas. I have a pandas data frame, but the first value, at position (0, 0), is being used as an index/name. I want 0.9121 to be the first value; how do I do that?
0 0.2171
1 0.21163
2 0.87221
3 0.432735
4 0.3231
Name: 0.9121, dtype: float64
I would like to have:
0 0.9121
1 0.2171
2 0.21163
3 0.87221
4 0.432735
5 0.3231
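No answer is shown above, but one likely cause (an assumption, since the loading code isn't shown in the question) is that the data was read with pandas' default header=0, so the first value was consumed as the column name. Passing header=None keeps it as data:

```python
import io
import pandas as pd

# Hypothetical reproduction: the question's loading code was not shown
raw = "0.9121\n0.2171\n0.21163\n0.87221\n0.432735\n0.3231\n"

# header=None tells read_csv there is no header row, so 0.9121
# stays in the data instead of becoming the column name
s = pd.read_csv(io.StringIO(raw), header=None)[0]
print(s.tolist())  # [0.9121, 0.2171, 0.21163, 0.87221, 0.432735, 0.3231]
```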

How to get the mode of a column in pandas when several values tie for the mode

I have a data frame and I'd like to get the mode of a specific column.
I'm using:
freq_mode = df.mode()['my_col'][0]
However I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because several values are tied for the mode.
Any of the modes will do; it doesn't matter which. How can I get any one of them?
For me, your code works fine with sample data.
If necessary select first value of Series from mode use:
freq_mode = df['my_col'].mode().iat[0]
Here you can see a column with more than one mode:
df = pd.DataFrame({"A": [14, 4, 5, 4, 1, 5],
                   "B": [5, 2, 54, 3, 2, 7],
                   "C": [20, 20, 7, 3, 8, 7],
                   "train_label": [7, 7, 6, 6, 6, 7]})
X = df['train_label'].mode()
print(X)
DataFrame
A B C train_label
0 14 5 20 7
1 4 2 20 7
2 5 54 7 6
3 4 3 3 6
4 1 2 8 6
5 5 7 7 7
Output
0 6
1 7
dtype: int64
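The tie-safe pattern from the first answer can be sketched on this data: .mode() returns all tied modes sorted ascending, and .iat[0] picks the first one.

```python
import pandas as pd

df = pd.DataFrame({"my_col": [7, 7, 6, 6, 6, 7]})

modes = df["my_col"].mode()        # both 6 and 7 appear three times
freq_mode = modes.iat[0]           # first (smallest) of the tied modes
print(modes.tolist(), freq_mode)   # [6, 7] 6
```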

get second largest value in row in selected columns in dataframe in pandas

I have a dataframe; a subset of it is shown below. There are more columns to the right and left of the ones I am showing.
M_cols   10D_MA    30D_MA   50D_MA   100D_MA   200D_MA    Max    Min   2nd smallest
          68.58     70.89    69.37  **68.24**   64.41    70.89  64.41     68.24
       **68.32**    71.00    69.47    68.50     64.49    71.00  64.49     68.32
          68.57  **68.40**   69.57    71.07     64.57    71.07  64.57     68.40
I can get the min (and the max is just as easy) with the following code:
df2['MIN'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].min(axis=1)
But how do I get the 2nd smallest? I tried this and got the following error:
df2['2nd SMALLEST'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].nsmallest(2)
TypeError: nsmallest() missing 1 required positional argument: 'columns'
Seems like this should have a simple answer, but I am stuck.
For example, suppose you have the following df:
df=pd.DataFrame({'V1':[1,2,3],'V2':[3,2,1],'V3':[3,4,9]})
After picking the columns to compare, we just need to sort the values along each row (np.sort's default axis is the last one, here axis=1):
sortdf=pd.DataFrame(np.sort(df[['V1','V2','V3']].values))
sortdf
Out[419]:
0 1 2
0 1 3 3
1 2 2 4
2 1 3 9
1st max:
sortdf.iloc[:,-1]
Out[421]:
0 3
1 4
2 9
Name: 2, dtype: int64
2nd max
sortdf.iloc[:,-2]
Out[422]:
0 3
1 2
2 3
Name: 1, dtype: int64
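Applying the same idea directly to the question's columns (the numbers below are the three sample rows from the question's table), the 2nd smallest per row is column 1 of the row-sorted array:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "10D_MA":  [68.58, 68.32, 68.57],
    "30D_MA":  [70.89, 71.00, 68.40],
    "50D_MA":  [69.37, 69.47, 69.57],
    "100D_MA": [68.24, 68.50, 71.07],
    "200D_MA": [64.41, 64.49, 64.57],
})
cols = ["10D_MA", "30D_MA", "50D_MA", "100D_MA", "200D_MA"]

# Sort each row's values ascending; column 0 is the min, column 1
# is the 2nd smallest, column -1 is the max
df2["2nd SMALLEST"] = np.sort(df2[cols].values, axis=1)[:, 1]
print(df2["2nd SMALLEST"].tolist())  # [68.24, 68.32, 68.4]
```

This matches the "2nd smallest" column in the question's table.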