I am trying to find whether 3 or more occurrences of any consecutive number are present in a column, and if so, mark the last one with a 1 and the rest with zeros.
df['a'] = df.assign(consecutive=df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size')).query('consecutive > @threshold')
is what I found here: Identifying consecutive occurrences of a value. However, this gives me the error: ValueError: Wrong number of items passed 6, placement implies 1. I understand the issue, that the whole result cannot be assigned into a single column, but what would be the correct approach to get the desired result?
Secondly, if this condition is satisfied, I would like to apply an equation (e.g. 2*b) to multiple rows neighbouring the 1 (like the shift function, but repeated over e.g. the 3 previous rows). I'm quite sure this must be possible, but I have not been able to get this whole objective to work. It does not necessarily have to be based on the 1 in column c; this is just a proposal.
A small data excerpt is below for interpretation; columns c and d present the desired result:
a b c d
16215 2 0 0
24848 4 0 0
24849 4 0 8
24850 4 0 8
24851 4 1 8
24852 6 0 0
24853 6 0 0
24854 8 0 0
24855 8 0 0
24856 8 0 16
25208 8 0 16
25932 8 1 16
28448 10 0 0
28449 10 0 0
28450 10 0 0
Use diff with cumsum to create the group key, then flag the last position of each group whose total count is more than 3, then back-fill with bfill and a limit:
s = df.b.diff().ne(0).cumsum()          # group key: increments whenever b changes
s1 = s.groupby(s).transform('count')    # size of each group
s2 = s.groupby(s).cumcount()            # position of each row within its group
df['c'] = ((s1 == s2 + 1) & (s1 > 3)).astype(int)  # 1 on the last row of groups with more than 3 rows
df['d'] = (df.c.mask(df.c == 0) * df.b * 2).bfill(limit=2).combine_first(df.c)  # 2*b back-filled onto 2 previous rows
df
Out[87]:
a b c d
0 16215 2 0 0.0
1 24848 4 0 0.0
2 24849 4 0 8.0
3 24850 4 0 8.0
4 24851 4 1 8.0
5 24852 6 0 0.0
6 24853 6 0 0.0
7 24854 8 0 0.0
8 24855 8 0 0.0
9 24856 8 0 16.0
10 25208 8 0 16.0
11 25932 8 1 16.0
12 28448 10 0 0.0
13 28449 10 0 0.0
14 28450 10 0 0.0
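The question also asks for the rule to extend to a variable number of preceding rows; the bfill limit controls exactly that. Here is a self-contained sketch with the run-length threshold and the neighbour count pulled out as variables (`min_run` and `n_prev` are my own names, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [16215, 24848, 24849, 24850, 24851, 24852, 24853,
          24854, 24855, 24856, 25208, 25932, 28448, 28449, 28450],
    'b': [2, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10],
})

min_run = 3   # a run must be longer than this to be flagged
n_prev = 2    # how many rows before the flag also receive 2*b

runs = df.b.diff().ne(0).cumsum()               # run id per row
size = runs.groupby(runs).transform('count')    # length of each run
pos = runs.groupby(runs).cumcount()             # position within the run
df['c'] = ((size == pos + 1) & (size > min_run)).astype(int)
df['d'] = (df.c.mask(df.c == 0) * df.b * 2).bfill(limit=n_prev).combine_first(df.c)
```

Increasing `n_prev` spreads the 2*b value over more preceding rows without touching the rest of the chain.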
I have a data frame df:
df=
A B C D
1 4 7 2
2 6 -3 9
-2 7 2 4
I am interested in changing the whole row's values to 0 if its element in column C is negative, i.e. if df['C'] < 0, the corresponding row should be filled with the value 0, as shown below:
df=
A B C D
1 4 7 2
0 0 0 0
-2 7 2 4
You can use DataFrame.where or mask:
df.where(df['C'] >= 0, 0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Another option is simple masking via multiplication:
df.mul(df['C'] >= 0, axis=0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
You can also set values directly via loc as shown in this comment:
df.loc[df['C'] < 0] = 0
df
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Which has the added benefit of modifying the original DataFrame (if you'd rather not return a copy).
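The answer mentions DataFrame.mask alongside where but only demonstrates where; mask is the inverse, replacing rows where the condition is True. A minimal sketch of the same example:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, -2], 'B': [4, 6, 7],
                   'C': [7, -3, 2], 'D': [2, 9, 4]})

# mask replaces rows where the condition holds, so the test flips to < 0
out = df.mask(df['C'] < 0, 0)
```

As with where, the Boolean Series is aligned on the index and broadcast across all columns.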
I have a very simple problem (I guess) but don't find the right syntax to do it :
The following Dataframe :
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0), but it doesn't match and gives me the maximum value of the calculated column for each row (= 10 in the example).
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the maximum function to each row separately (note that apply runs a Python function per row, so it is slower than the vectorized options above):
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
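A quick self-contained check of the vectorized form on the sample frame (a minimal sketch; for very big files the eval/clip and np.where versions avoid apply's per-row Python overhead):

```python
import pandas as pd

df = pd.DataFrame({'A': [7, 5, 4, 9], 'B': [12, 4, 8, 2], 'C': [2, 4, 2, 3]})

# max(0, A - B + C) per row, fully vectorized
df['D'] = (df['A'] - df['B'] + df['C']).clip(lower=0)
```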
Here is a sample df >> real one > 500k rows. I am trying to get the row index of every instance where column ‘Trigger’ is == 1 so I can get the value in column ‘Price’. See desired column.
df10 = pd.DataFrame({
'Trigger': [0,0,1,1,1,0,0,1,0,1],
'Price': [12,14,16,18,20,2,4,6,8,10],
'Stock': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'IBM','IBM','IBM','IBM','IBM'],
'desired':[0,0,16,18,20,0,0,6,0,10]
})
I was looking at answers online and found this code, but it gives an array of all instances and I don't know how to move the position in the array, or whether that is even possible:
df10['not_correct'] = np.where(df10['Trigger'] ==1 , df10.iloc[df10.index[df10['Trigger'] == 1][0],0],0)
So essentially, I want to find the index row number of (all) instances where column 'Trigger' == 1. It would be similar to a simple IF statement in Excel: if (a[row#] == 1, b[row#], 0).
Keep in mind this is example and I will NOT know where the 1 and 0 are in the actual df or how many 1’s there actually are in the ‘Trigger’ column >> it could be 0, 1 or 50.
To get the row number, use df.index in your np.where.
df10['row']=np.where(df10['Trigger']==1,df10.index,0)
df10
Out[7]:
Trigger Price Stock desired row
0 0 12 AAPL 0 0
1 0 14 AAPL 0 0
2 1 16 AAPL 16 2
3 1 18 AAPL 18 3
4 1 20 AAPL 20 4
5 0 2 IBM 0 0
6 0 4 IBM 0 0
7 1 6 IBM 6 7
8 0 8 IBM 0 0
9 1 10 IBM 10 9
With np.where you do not need to filter the result; just pass the Price column directly:
df10['New']=np.where(df10.Trigger==1,df10.Price,0)
df10
Out[180]:
Trigger Price Stock desired New
0 0 12 AAPL 0 0
1 0 14 AAPL 0 0
2 1 16 AAPL 16 16
3 1 18 AAPL 18 18
4 1 20 AAPL 20 20
5 0 2 IBM 0 0
6 0 4 IBM 0 0
7 1 6 IBM 6 6
8 0 8 IBM 0 0
9 1 10 IBM 10 10
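Since the question also asks for the row indices themselves, a Boolean index retrieves them directly, without np.where. A minimal sketch on the sample frame:

```python
import pandas as pd

df10 = pd.DataFrame({
    'Trigger': [0, 0, 1, 1, 1, 0, 0, 1, 0, 1],
    'Price': [12, 14, 16, 18, 20, 2, 4, 6, 8, 10],
})

idx = df10.index[df10['Trigger'] == 1]   # row labels where Trigger is 1
prices = df10.loc[idx, 'Price']          # the corresponding prices
```

This works whether the Trigger column contains zero, one, or fifty 1s, since `idx` is simply empty when nothing matches.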
I have a pandas dataframe and it is something like this:
x y
1 0
2 1
3 2
4 0 <<<< Reset
5 1
6 2
7 3
8 0 <<<< Reset
9 1
10 2
The x values could be anything, they are not meaningful for this question. The y values increment, and reset and increment again. I need a third column (z) which is a number that represents the groups, so it increments when the y values are reset.
I cannot guarantee that the reset will be to zero; only a value that is less than the previous one should indicate a reset.
x y z
1 0 0
2 1 0
3 2 0
4 0 1 <<<< Incremented by 1
5 1 1
6 2 1
7 3 1
8 0 2 <<<< Incremented by 1
9 1 2
10 2 2
So, to produce z, I understand what needs to be done; I'm just not familiar with the syntax. My solution would be to first assign z as a sparse column of 0s and 1s, where everything is zero except for a 1 where y[ix] < y[ix-1], indicating that the y counter has been reset. Then a cumulative running sum should be performed on the z column, meaning that z[ix] = sum(z[0], z[1], ..., z[ix]).
I'd appreciate some help with the syntax of assigning column z, if someone has a moment.
Based on your logic:
#general case
df['z'] = df['y'].diff().lt(0).cumsum()
# or equivalently
# df['z'] = df['y'].lt(df['y'].shift()).cumsum()
Output:
x y z
0 1 0 0
1 2 1 0
2 3 2 0
3 4 0 1
4 5 1 1
5 6 2 1
6 7 3 1
7 8 0 2
8 9 1 2
9 10 2 2
Using ne(1) (note this assumes y always increments by exactly 1 between resets):
df.y.diff().ne(1).cumsum().sub(1)
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: y, dtype: int32
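The two answers differ on edge cases: diff().lt(0) flags any decrease (the asker's stated rule), while diff().ne(1) also starts a new group on jumps. A sketch contrasting them (the jump data in `y2` is my own illustration, not from the question):

```python
import pandas as pd

y = pd.Series([0, 1, 2, 0, 1, 2, 3, 0, 1, 2])

z_lt = y.diff().lt(0).cumsum()           # new group only when y decreases
z_ne = y.diff().ne(1).cumsum().sub(1)    # new group on any step that is not +1

# Both agree on the sample data, but they diverge when y skips
# ahead without resetting:
y2 = pd.Series([0, 1, 5, 6])
z2_lt = y2.diff().lt(0).cumsum()         # no decrease, so no new group
z2_ne = y2.diff().ne(1).cumsum().sub(1)  # the jump from 1 to 5 starts a group
```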
I have the following data frame
val sum
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 0
6 7 0
I would like to calculate the sum of the next three rows' values (including the current row). I need to do this for very big files. What is the most efficient way? The expected result is
val sum
0 1 6
1 2 9
2 3 12
3 4 15
4 5 18
5 6 13
6 7 7
In general, how can I dynamically reference other rows (via boolean operations) while making assignments?
> df['val'].rolling(window=3).sum().shift(-2)
0     6.0
1     9.0
2    12.0
3    15.0
4    18.0
5     NaN
6     NaN
Name: val, dtype: float64
If you want the last values to be "filled in", then you'll need to tack NaNs onto the end of your dataframe.
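In newer pandas a forward-looking window handles the trailing rows without any padding and matches the desired output exactly. A sketch, assuming pandas >= 1.1 (where FixedForwardWindowIndexer was introduced):

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6, 7]})

# The window covers the current row plus the next two;
# min_periods=1 keeps partial sums at the trailing edge instead of NaN.
indexer = FixedForwardWindowIndexer(window_size=3)
df['sum'] = df['val'].rolling(window=indexer, min_periods=1).sum().astype(int)
```

The astype(int) is safe here only because min_periods=1 guarantees there are no NaNs in the result.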