pandas: How to compare the value of a column with the next value

I have a dataframe which looks as follows:
colA colB
0 A 10
1 B 20
2 C 5
3 D 2
4 F 30
....
I would like to compare successive values of colB to detect two successive decrements. That is, I want to report the colA labels of the rows that are followed by two successive decrements in colB. For example, I want to report 'B' because colB decreases in each of the two rows following B (from 20 to 5, then from 5 to 2). I am not sure how to approach this without writing a loop. (If there is no way to avoid a loop, I'd like to know.)
Thanks

You can use loc for this:
desired=frame.loc[(frame["colB"]>=frame["colB"].shift(-1)) &
(frame["colB"].shift(-1)>=frame["colB"].shift(-2) )]
print(desired)
The output will be:
colA colB
1 B 20
If you only wish to report the value B:
desired=frame["colA"].loc[(frame["colB"]>=frame["colB"].shift(-1)) &
(frame["colB"].shift(-1)>=frame["colB"].shift(-2) )]
print(desired.values)
The output will be:
['B']
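A self-contained sketch of the shift-based comparison above, assuming the sample frame from the question. It uses a strict > so that equal neighbouring values do not count as a decrement, which is an assumption about what the question intends. The names nxt, nxt2 and mask are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({"colA": ["A", "B", "C", "D", "F"],
                      "colB": [10, 20, 5, 2, 30]})

# Value one row ahead and two rows ahead; the last rows become NaN
nxt = frame["colB"].shift(-1)
nxt2 = frame["colB"].shift(-2)

# Comparisons against NaN are False, so the trailing rows are never selected
mask = (frame["colB"] > nxt) & (nxt > nxt2)
result = frame.loc[mask, "colA"].tolist()
print(result)
```

Switching both comparisons back to >= reproduces the behaviour of the answer above, where ties also count.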

Yes, you can do this without using a loop.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'colA':['A', 'B', 'C', 'D', 'F'], 'colB':[10, 20, 5, 2, 30]})
>>> df['colC'] = df['colB'].diff(-1)
>>> df
colA colB colC
0 A 10 -10.0
1 B 20 15.0
2 C 5 3.0
3 D 2 -28.0
4 F 30 NaN
'colC' is the difference between each row and the next one (current value minus next value).
>>> df['colD'] = np.where(df['colC'] > 0, 1, 0)
>>> df
colA colB colC colD
0 A 10 -10.0 0
1 B 20 15.0 1
2 C 5 3.0 1
3 D 2 -28.0 0
4 F 30 NaN 0
In 'colD' we flag the rows where the difference is greater than 0, that is, where the next value is smaller.
>>> df['s'] = df['colD'].shift(-1)
>>> df
colA colB colC colD s
0 A 10 -10.0 0 1.0
1 B 20 15.0 1 1.0
2 C 5 3.0 1 0.0
3 D 2 -28.0 0 0.0
4 F 30 NaN 0 NaN
In column 's' we shift the values of 'colD' up by one row.
>>> df['flag'] = np.where((df['colD'] == 1) & (df['colD'] == df['s']), 1, 0)
>>> df
colA colB colC colD s flag
0 A 10 -10.0 0 1.0 0
1 B 20 15.0 1 1.0 1
2 C 5 3.0 1 0.0 0
3 D 2 -28.0 0 0.0 0
4 F 30 NaN 0 NaN 0
Then 'flag' is the required column: it is 1 for the rows that are followed by two successive decrements.
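The same steps can probably be condensed without the helper columns. The following is a minimal sketch under the same sample data; dec and flag are illustrative names, and the boolean intermediate replaces the 0/1 columns used above.

```python
import pandas as pd

df = pd.DataFrame({"colA": ["A", "B", "C", "D", "F"],
                   "colB": [10, 20, 5, 2, 30]})

# True where the next value is strictly smaller (NaN in the last row compares as False)
dec = df["colB"].diff(-1).gt(0)

# A row qualifies when both it and the following row are followed by a decrement
flag = dec & dec.shift(-1, fill_value=False)
df["flag"] = flag.astype(int)
print(df)
```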

This needs a little logic:
s = df.colB.diff().gt(0)  # True where colB increased, i.e. where a new run starts
# Group the rows into runs; a run of at least 3 rows (the starting row plus two rows that do not increase) qualifies
df.loc[df.groupby(s.cumsum()).colA.transform('count').ge(3) & s, 'colA']
Out[45]:
1 B
Name: colA, dtype: object
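If the number of successive decrements needs to vary, one possible generalisation is a rolling sum over the "next value is smaller" flags. This is only a sketch: followed_by_n_decrements is a hypothetical helper, not a pandas function, and it treats a decrement as strictly smaller.

```python
import pandas as pd

df = pd.DataFrame({"colA": ["A", "B", "C", "D", "F"],
                   "colB": [10, 20, 5, 2, 30]})

def followed_by_n_decrements(values: pd.Series, n: int) -> pd.Series:
    """Hypothetical helper: True where the next n steps are all strict decrements."""
    # 1 where the next value is strictly smaller than the current one
    dec = values.diff(-1).gt(0).astype(int)
    # Reverse the series so a trailing rolling window looks forward in the original order
    forward_sum = dec[::-1].rolling(n).sum()[::-1]
    return forward_sum.eq(n)

mask = followed_by_n_decrements(df["colB"], 2)
labels = df.loc[mask, "colA"].tolist()
print(labels)
```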

Related

Pandas transformation, duplicate index values to column values

I have the following pandas dataframe:
0
0
A 0
B 0
C 0
C 4
A 1
A 7
Some index letters (A and C) appear multiple times. I want the values of these index letters placed in extra columns instead of extra rows. The desired pandas dataframe looks like:
0 1 3
0
A 0 1 7
B 0 np.nan np.nan
C 0 4 np.nan
Anything would help!
IIUC, you need to add a helper column:
(df.assign(group=df.groupby(level=0).cumcount())
.set_index('group', append=True)[0] # 0 is the name of the column here
.unstack('group')
)
or:
(df.reset_index()
.assign(group=lambda d: d.groupby('index').cumcount())
.pivot(index='index', columns='group', values=0) # col name here again
)
output:
group 0 1 2
A 0.0 1.0 7.0
B 0.0 NaN NaN
C 0.0 4.0 NaN
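For reference, a runnable sketch of the first variant, assuming the input is a single column named 0 with the repeated letters as index (the exact construction of the original frame is an assumption, and wide is an illustrative name).

```python
import pandas as pd

# Assumed reconstruction of the question's frame: one column named 0, letters as index
df = pd.DataFrame({0: [0, 0, 0, 4, 1, 7]}, index=list("ABCCAA"))

wide = (
    df.assign(group=df.groupby(level=0).cumcount())  # 0, 1, 2, ... within each letter
      .set_index("group", append=True)[0]            # Series indexed by (letter, group)
      .unstack("group")                              # one column per occurrence number
)
print(wide)
```

Missing combinations become NaN, which is why the result is float even though the input is integer.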

Pandas - groupby one column and get mean of all other columns

I have a dataframe, with columns:
cols = ['A', 'B', 'C']
If I groupby one column, say, 'A', like so:
df.groupby('A')['B'].mean()
It works.
But I need to groupby one column and then get the mean of all other columns. I've tried:
df[cols].groupby('A').mean()
But I get the error:
KeyError: 'A'
What am I missing?
Please try:
df.groupby('A').agg('mean')
sample data
B C A
0 1 4 K
1 2 6 S
2 4 7 K
3 6 3 K
4 2 1 S
5 7 3 K
6 8 9 K
7 9 3 K
print(df.groupby('A').agg('mean'))
B C
A
K 5.833333 4.833333
S 2.000000 3.500000
You can use df.groupby('col').mean(). For example, to calculate the mean of columns 'B' and 'C' grouped by 'A':
A B C D
0 1 NaN 1 1
1 1 2.0 2 1
2 2 3.0 1 1
3 1 4.0 1 1
4 2 5.0 2 1
df[['A', 'B', 'C']].groupby('A').mean()
or
df.groupby('A')[['B', 'C']].mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
If you need mean for all columns:
df.groupby('A').mean()
Output:
B C D
A
1 3.0 1.333333 1.0
2 4.0 1.500000 1.0
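A self-contained sketch of the example above, assuming the five-row sample frame shown in this answer. NaN values are skipped when the mean is computed, which is why B for group 1 averages only 2.0 and 4.0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 1, 2],
    "B": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "C": [1, 2, 1, 1, 2],
    "D": [1, 1, 1, 1, 1],
})

# Group by A and average every remaining column
means = df.groupby("A").mean()

# Restrict the result to a subset of columns
means_bc = df.groupby("A")[["B", "C"]].mean()
print(means)
```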
Perhaps the missing column is string rather than numeric?
df = pd.DataFrame({
'A': ['big', 'small','small', 'small'],
'B': [1,0,0,0],
'C': [1,1,1,0],
'D': ['1','0','0','0']
})
df.groupby(['A']).mean()
Output:
B C
A
big 1.0 1.0
small 0.0 0.6666666666666666
Here, converting the column to a numeric type such as int or float produces the desired result:
df.D = df.D.astype(int)
df.groupby(['A']).mean()
Output:
B C D
A
big 1.0 1.0 1.0
small 0.0 0.6666666666666666 0.0
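As far as I know, pandas 2.0 and later no longer drop the string column silently: calling mean() on a group that contains a column like 'D' raises a TypeError instead. A hedged sketch of two ways around that, using the frame from this answer (numeric_means and all_means are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["big", "small", "small", "small"],
    "B": [1, 0, 0, 0],
    "C": [1, 1, 1, 0],
    "D": ["1", "0", "0", "0"],
})

# Option 1: ask explicitly for numeric columns only, so 'D' is left out
numeric_means = df.groupby("A").mean(numeric_only=True)

# Option 2: convert the string column first, so 'D' is included
df["D"] = pd.to_numeric(df["D"])
all_means = df.groupby("A").mean()
print(all_means)
```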

How to calculate the percentage of counts to condition in pandas?

I have a dataframe and I want to compute the proportion of rows that satisfy a specific condition, as in the equation below.
$$\frac{N(A=a \text{ and } B=0)}{N(A=a)}$$
id A B
0 a 0
1 b 1
2 c 0
3 a 1
4 a 1
Now I want to get these percentages:
id A B perc
0 a 0 0.3333
1 b 1 1.0
2 c 0 1.0
3 a 1 0.6666
Furthermore, I want to be able to drop rows based on this percentage. For example, if the positives (1) and the negatives (0) of a group are approximately equal, I want to drop that group's rows.
id A B
0 a 0
1 a 1
2 b 0
3 b 0
4 b 1
The result will be:
id A B
2 b 0
3 b 0
4 b 1
I think you need SeriesGroupBy.value_counts:
df = df.groupby('A')['B'].value_counts(normalize=True).reset_index(name='perc')
print (df)
A B perc
0 a 1 0.666667
1 a 0 0.333333
2 b 1 1.000000
3 c 0 1.000000
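The ratio in the question can also be read as the mean of the boolean B == 0 within each group of A. A minimal sketch of that reading, plus a per-row variant that keeps the original rows; share_zero and the 'perc' computation are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a", "a"],
                   "B": [0, 1, 0, 1, 1]})

# N(A=a and B=0) / N(A=a): the mean of "B == 0" inside each group of A
share_zero = df["B"].eq(0).groupby(df["A"]).mean()

# Per-row share of the row's own (A, B) combination within its A group
pair_count = df.groupby(["A", "B"])["B"].transform("count")
group_count = df.groupby("A")["B"].transform("count")
df["perc"] = pair_count / group_count
print(df)
```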
For the second part, compute the per-group proportions with crosstab, collect the values of A whose two proportions are equal, and finally filter those groups out with Series.isin, inverting the mask with ~:
print (df)
id A B
0 0 a 0
1 1 a 1
2 2 b 0
3 3 b 0
4 4 b 1
df1 = pd.crosstab(df['A'], df['B'], normalize='index')
print (df1)
B 0 1
A
a 0.500000 0.500000
b 0.666667 0.333333
idx = df1.index[df1[0].eq(df1[1])]
print (idx)
Index(['a'], dtype='object', name='A')
df = df[~df['A'].isin(idx)]
print (df)
id A B
2 2 b 0
3 3 b 0
4 4 b 1
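The question asks for "approximately equal" rather than exactly equal proportions. One way to sketch that is a tolerance on the difference between the two shares. The threshold tol below is a hypothetical value chosen only for illustration, and the reindex guards against a group that has no 0s or no 1s at all.

```python
import pandas as pd

df = pd.DataFrame({"id": range(5),
                   "A": ["a", "a", "b", "b", "b"],
                   "B": [0, 1, 0, 0, 1]})

tol = 0.1  # hypothetical tolerance for "approximately equal"

# Row-normalised shares of B == 0 and B == 1 per value of A
shares = (pd.crosstab(df["A"], df["B"], normalize="index")
            .reindex(columns=[0, 1], fill_value=0.0))

# Groups whose two shares differ by at most tol are considered balanced
balanced = shares.index[(shares[0] - shares[1]).abs() <= tol]

# Keep only the rows of the unbalanced groups
filtered = df[~df["A"].isin(balanced)]
print(filtered)
```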

pandas column operation on certain row in succession

I have a pandas dataframe like this:
second block
0 1 a
1 2 b
2 3 c
3 4 a
4 5 c
This is sequential data, and I would like a new column holding the time difference between the current row and the next row in which the same block appears.
second block freq
0 1 a 3 //(4-1)
1 2 b 0 //(not repeating)
2 3 c 2 //(5-3)
3 4 a 0 //(not repeating)
4 5 c 0 //(not repeating)
I have tried to get the unique list of blocks, followed by a for loop like the one below.
for i in unique_block:
df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
I do not know how to get 0 for row indices 1, 3 and 4. Because the dataframe is very large, a loop is also not efficient, and this attempt does not work.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block')['second'].diff(-1)*-1).fillna(0)
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
Using diff and shift inside apply:
df.groupby('block', group_keys=False).second.apply(lambda x : x.diff().shift(-1)).fillna(0)  # group_keys=False keeps the original index
Out[242]:
0 3.0
1 0.0
2 2.0
3 0.0
4 0.0
Name: second, dtype: float64
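A self-contained sketch that avoids the lambda, assuming the five-row sample from the question. It shifts 'second' within each block and subtracts the current value; the final astype(int) is an optional step added here to match the integer output the question asks for.

```python
import pandas as pd

df = pd.DataFrame({"second": [1, 2, 3, 4, 5],
                   "block": ["a", "b", "c", "a", "c"]})

# Time of the next row with the same block (NaN if the block does not repeat)
next_second = df.groupby("block")["second"].shift(-1)

# Gap to the next occurrence; non-repeating rows get 0
df["freq"] = (next_second - df["second"]).fillna(0).astype(int)
print(df)
```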

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
If the required values for C are [1,2,3], the desired output would be:
val
A B C
1 1 1 3
2 0
3 0
2 2 1 4
2 0
3 0
3 3 1 0
2 5
3 0
I know how to achieve this using groupby and applying a defined function for each group, but I was wondering if there was a cleaner way of doing this with reindex (I couldn't make this one work for a MultiIndex case, but perhaps I'm missing something)
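Use:
# (A, B) prefix of every existing index entry
partial_indices = [i[0:2] for i in df.index.values]
C_reqd = [1, 2, 3]
# Combine every existing (A, B) pair with every required value of C
final_indices = [j + (i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
# Start from an all-zero frame, then overwrite the rows that exist in df
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)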
Output
df2
val
A B C
1 1 1 3.0
2 0.0
3 0.0
2 2 1 4.0
2 0.0
3 0.0
3 3 1 0.0
2 5.0
3 0.0
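Since the question asks about reindex specifically, here is a hedged sketch of that route, under the same example data. It builds the full index from the existing (A, B) pairs and passes fill_value=0, which should also keep 'val' as integers. The names c_required, pairs and full_index are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 3], [2, 2, 1, 4], [3, 3, 2, 5]],
                  columns=["A", "B", "C", "val"]).set_index(["A", "B", "C"])

c_required = [1, 2, 3]

# Unique (A, B) pairs that already exist in the index
pairs = df.index.droplevel("C").unique()

# Every existing pair combined with every required value of C
full_index = pd.MultiIndex.from_tuples(
    [(a, b, c) for a, b in pairs for c in c_required],
    names=["A", "B", "C"],
)

# Rows missing from df are created and filled with 0
result = df.reindex(full_index, fill_value=0)
print(result)
```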