Pandas DataFrame: Checking Consecutive Values in a Column

Have a Pandas Dataframe like below.
EventOccurrence  Month
1                4
1                5
1                6
1                9
1                10
1                12
I need to add an identifier column to the above pandas DataFrame such that whenever Month is consecutive three times, a value of True is filled in, else False. I explored a few options like shift and rolling windows without luck. Any pointer is appreciated.
EventOccurrence  Month  Flag
1                4      F
1                5      F
1                6      T
1                9      F
1                10     F
1                12     F
Thank You.

You can check whether the diff between rows is one, and the diff shifted by 1 is one as well:
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1)
EventOccurrence Month Flag
0 1 4 False
1 1 5 False
2 1 6 True
3 1 9 False
4 1 10 False
5 1 12 False
Note that this will also return True when a run is consecutive more than 3 times (the fourth, fifth, etc. rows of a run are flagged too), but that behaviour wasn't specified in the question, so I'll assume it's OK.
If it needs to only flag the third one, and not for example the fourth consecutive instance, you could add a condition:
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1) & (df.Month.diff().shift(2) != 1)
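For reference, a minimal runnable sketch of the whole approach (the data is copied from the question; only the import is added):
import pandas as pd

df = pd.DataFrame({'EventOccurrence': [1, 1, 1, 1, 1, 1],
                   'Month': [4, 5, 6, 9, 10, 12]})

# diff() is 1 wherever a month directly follows the previous row's month;
# requiring that for both the current and the previous row marks the third
# element of every consecutive run.
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1)
print(df)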

Related

How to keep only the last index in groups of rows where a condition is met in pandas?

I have the following dataframe:
import pandas as pd

d = {'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     'flag_1': [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
     'flag_2': [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
     'index': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
I need to perform the following filter on it:
If flag_1 and flag_2 are equal, keep only the row with the maximum index from each run of consecutive indices. Below, flag_1 and flag_2 are equal for rows 4, 5, 6 and for rows 9, 10. From the group of consecutive indices 4, 5, 6 I therefore wish to keep only row 6 and drop rows 4 and 5. From the next group, rows 9 and 10, I wish to keep only row 10. The rows where flag_1 and flag_2 are not equal should all be retained. I want my final output to look as shown below:
I am really not sure how to achieve what is required so I would be grateful for any advice on how to do it.
IIUC, you can compare consecutive rows with shift. This solution requires a sorted index.
In [5]: df[~df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)]
Out[5]:
value flag_1 flag_2 index
0 1 0 1 1
1 1 1 0 2
2 1 0 1 3
5 1 1 1 6
6 1 0 1 7
7 1 1 0 8
9 1 1 1 10
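A commented sketch of the same one-liner, broken into named steps (the intermediate variable is mine, not from the original answer):
# True where a row's (flag_1, flag_2) pair equals the pair on the row directly
# below it, i.e. the row is not the last of its run of equal pairs.
same_as_next = df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)

# Keep only the last row of each run, plus every row whose pair differs from the next.
result = df[~same_as_next]
print(result)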

pandas: idxmax for k-th largest

Given a df of probability distributions, I get the max probability for each row with df.idxmax(axis=1), like this:
df['1k-th'] = df.idxmax(axis=1)
and get the following result:
0 1 2 3 4 5 6 1k-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1
The question is how to get the column indices of the 2nd, 3rd, etc. largest probabilities, so that I get the following result:
0 1 2 3 4 5 6 1k-th 2-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6 0
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4 3
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1 4
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5 4
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1 2
Thank you!
My own solution is not the prettiest, but it does its job and works fast:
import numpy as np

for i in range(7):
    p[f'{i}k'] = p[[0, 1, 2, 3, 4, 5, 6]].idxmax(axis=1)  # column of the current largest
    p[f'{i}k_v'] = p[[0, 1, 2, 3, 4, 5, 6]].max(axis=1)   # the current largest value
    for x in range(7):
        # knock out the found value so the next pass finds the next largest
        p[x] = np.where(p[x] == p[f'{i}k_v'], np.nan, p[x])
The loop does the following:
finds the largest value and its column index
drops the found value (sets it to NaN)
then, on the next pass:
finds the 2nd largest value
drops the found value
and so on...
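An alternative sketch that avoids mutating the data, using numpy's argsort (my own variant, not the original answer's method; the Dirichlet toy data is illustrative):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
p = pd.DataFrame(rng.dirichlet(np.ones(7), size=5))  # each row sums to 1

cols = p.columns.to_numpy()
order = np.argsort(-p.to_numpy(), axis=1)  # column positions, largest value first

p['1k-th'] = cols[order[:, 0]]  # column of the largest value per row
p['2-th'] = cols[order[:, 1]]   # column of the 2nd largest value per row
print(p)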

Pandas : Get a column value where another column is the minimum in a sub-grouping [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
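For completeness, a runnable setup for these snippets (the data is copied from the question; only the import is added):
import pandas as pd

df = pd.DataFrame({'item': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                   'diff': [2, 1, 3, -1, 1, 4, -6, 0, 2],
                   'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9]})

# Method #1: pick the rows at each group's index of minimal diff
print(df.loc[df.groupby("item")["diff"].idxmin()])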
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform, which broadcasts each group's minimal value back to every row of that group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is (or you want) one min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6,
                   'v1': (list(range(3)) + list(range(3))) * 2,
                   'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step by step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort the items by the minimum value you want
Drop the duplicates of the column you want to sort by
Re-sort the data, because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
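A short sketch of how that boolean mask would then be applied (my addition; the original answer stops at the mask itself):
sorted_df = df.sort_values(by='diff')
mask = sorted_df.duplicated(subset='item', keep='first')
print(sorted_df[~mask])  # the first, i.e. minimal-diff, row of each item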

Pandas : dataframe cumsum , reset if other column is false [duplicate]

This question already has an answer here: How to reset cumsum after change in sign of values?
I have a dataframe with 2 columns. The objective here is simple: reset the df.cumsum() whenever a row's condition column is set to False (0).
df
value condition
0 1 1
1 2 1
2 3 1
3 4 0
4 5 1
The wanted result is as follows:
df
value condition
0 1 1
1 3 1
2 6 1
3 4 0
4 9 1
If I loop over the dataframe as described in this post, Python pandas cumsum() reset after hitting max,
I can achieve the wanted results, but I was looking for a more vectorized way using standard pandas functions.
How about:
df['cSum'] = df.groupby((df.condition == 0).cumsum()).value.cumsum()
Output:
value condition cSum
0 1 1 1
1 2 1 3
2 3 1 6
3 4 0 4
4 5 1 9
You'll group consecutive rows together until you encounter a 0 in the condition column, and then you apply the cumsum within each group separately.
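A sketch that makes the grouping key visible (same example data; the import is added):
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5], 'condition': [1, 1, 1, 0, 1]})

# Every 0 in `condition` bumps the running group id: here 0, 0, 0, 1, 1,
# so the cumsum restarts at each row where condition is 0.
group_id = (df.condition == 0).cumsum()
df['cSum'] = df.groupby(group_id).value.cumsum()
print(df)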

if statement in excel, adding 1 if cell with text but

I am creating an Excel sheet that has three columns: Detail, Month, and Month Count.
1 -- I would like the formula to look at the Detail column and, if there is text, add 1 to the previous cell's number for the new Month Count; if not, insert 0.
2 -- I would like the formula to add the previous cell before the cell with 0, and for the cell with 0 not to impact the other cells or reset the count back to 1, which is the problem I am having.
3 -- I also need the formula to reset for every month, from whatever number it was back to 0 or 1, depending on whether the new month's first cell has text or not. For this I need the formula to look at the Month column.
This is what I have so far:
=IF(ISTEXT(G95), I94+1, 0)
The formula for the count column should be as follows.
=IF(A2<>"",COUNTIF($B$1:B2,B2)-COUNTIFS($A$1:A2,"",$B$1:B2,B2),0)
Breakdown of how this works:
A2<>"" checks whether the Detail column is populated.
COUNTIF($B$1:B2,B2) figures out how many entries above this row reference the same month.
COUNTIFS($A$1:A2,"",$B$1:B2,B2) finds how many of those cells are blank while also matching the month. Subtracting this from the previous count gives you how many are not blank.
The IF returns 0 if the Detail is empty.
This returned the following data:
Orderly              Random
Det  Mon  Count      Det  Mon  Count
X    1    1               2    0
X    1    2          X    1    1
X    1    3          X    1    2
     1    0               2    0
X    1    4          X    2    1
X    2    1          X    1    3
X    2    2          X    1    4
     2    0               1    0
     2    0               1    0
     2    0               2    0
     3    0               3    0
X    3    1          X    3    1
     3    0               1    0
X    3    2               3    0
X    3    3          X    1    5
     3    0          X    2    2
X    3    4          X    3    2
     3    0               3    0
X    3    5               3    0
X    3    6               2    0
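For readers comparing with the pandas questions above, a rough pandas equivalent of this counting logic (a sketch only; the Det/Mon column names come from the sample table):
import pandas as pd

df = pd.DataFrame({'Det': ['X', 'X', 'X', None, 'X', 'X', 'X'],
                   'Mon': [1, 1, 1, 1, 1, 2, 2]})

has_text = df['Det'].notna()
# Running count of text rows within each month, forced to 0 on blank Detail rows
df['Count'] = has_text.astype(int).groupby(df['Mon']).cumsum().where(has_text, 0)
print(df)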
It sounds like you want to keep a running total for the month count in the column, and put a 0 if there is no text. If that is the case, you can put this formula in I95:
=IF(ISTEXT(G95), MAX($I$2:I94)+1, 0)