backwards average only when value in column changes - pandas

I tried to calculate the average of the last x rows in a DataFrame, but only when the value changes.
A and B are my inputs and C is my desired output.
def iloc_backwards(df, col):
    a = 0
    for i in df.index[:-1]:
        val1 = df[col].iloc[i]
        val2 = df[col].iloc[i + 1]
        if val1 == val2:
            a += 1
        else:
            df.at[i, col] = df[col].rolling(window=a).mean()
A B C
1 0 0.25
2 0 0.25
3 0 0.25
4 1 0.25
5 0 0.5
6 1 0.5

If you want the average of all values up to and including the first non-zero value, try this code:
df['group'] = df['B'].shift().ne(0).cumsum()
df['C'] = df.groupby('group').B.transform('mean')
df[['A', 'B', 'C']]
This corresponds with your desired output.
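As a quick sanity check, here is the grouping trick end to end on the question's sample data (a sketch; `shift().ne(0)` marks each row after a non-zero B as the start of a new group):

```python
import pandas as pd

# Sample data reconstructed from the question (A is just a row label)
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [0, 0, 0, 1, 0, 1]})

# A new group starts on the row *after* each non-zero value in B
df['group'] = df['B'].shift().ne(0).cumsum()

# C is the mean of B within each group, broadcast back to every row
df['C'] = df.groupby('group')['B'].transform('mean')

print(df[['A', 'B', 'C']])
```

Each group therefore runs from just after a non-zero value up to and including the next non-zero value, which reproduces the desired C column.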

Count number of Columns with specific value in pandas

I am searching for a way to do a countif over rows in pandas. An example would be:
df = pd.DataFrame(data={'A': ['x', 'y', 'z'], 'B': ['z', 'y', 'x'], 'C': ['y', 'x', 'z']})
I want to count the number of repetitions on each row and add it to new columns based on specific criteria:
Criteria
C1 = x
C2 = y
C3 = z
In the example above, C3 will be [1, 0, 2], as there is one 'z' in row 0, no 'z' in row 1, and two 'z' in row 2.
The end table would look like:
A B C | C1 C2 C3
x z y | 1 1 1
y y x | 1 2 0
z x z | 1 0 2
How can I do this in Pandas?
Thanks a lot!
Do you mean:
df.join(df.apply(pd.Series.value_counts, axis=1).fillna(0))
Output:
A B C x y z
0 x z y 1.0 1.0 1.0
1 y y x 1.0 2.0 0.0
2 z x z 1.0 0.0 2.0
You can iterate through the values and sum across axis 1:
df = pd.concat([df.eq(val).sum(1) for val in ['x', 'y', 'z']], axis=1)
0 1 2
0 1 1 1
1 1 2 0
2 1 0 2
Then rename your column names accordingly.
For a more general solution, consider np.unique and using the pd.Series.name attr.
pd.concat([df.eq(val).sum(1).rename(val) for val in np.unique(df)], axis=1)
x y z
0 1 1 1
1 1 2 0
2 1 0 2
And with some trivial tweaks, you can have your end table:
map_ = {'x': 'C1', 'y': 'C2', 'z': 'C3'}
df.join(pd.concat([df.eq(i).sum(axis=1).rename(map_[i]) for i in np.unique(df)], axis=1))
A B C C1 C2 C3
0 x z y 1 1 1
1 y y x 1 2 0
2 z x z 1 0 2
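For completeness, one more sketch (not from the answers above, so treat it as an assumption-laden alternative): one-hot encode the stacked values and sum per original row.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'],
                   'B': ['z', 'y', 'x'],
                   'C': ['y', 'x', 'z']})

# Stack to a single column of values, one-hot encode, then sum per row
counts = pd.get_dummies(df.stack()).groupby(level=0).sum()

# Map the value labels to the requested criterion column names
out = df.join(counts.rename(columns={'x': 'C1', 'y': 'C2', 'z': 'C3'}))
print(out)
```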

How to apply subtraction to groupby object

I have a dataframe like this
test = pd.DataFrame({'category':[1,1,2,2,3,3],
'type':['new', 'old','new', 'old','new', 'old'],
'ratio':[0.1,0.2,0.2,0.4,0.4,0.8]})
category ratio type
0 1 0.10000 new
1 1 0.20000 old
2 2 0.20000 new
3 2 0.40000 old
4 3 0.40000 new
5 3 0.80000 old
I would like to subtract each category's old ratio from the new ratio, but I'm not sure how to reshape the DataFrame to do so.
Use DataFrame.pivot first, so the subtraction becomes very easy:
df = test.pivot(index='category', columns='type', values='ratio')
df['val'] = df['old'] - df['new']
print (df)
type new old val
category
1 0.1 0.2 0.1
2 0.2 0.4 0.2
3 0.4 0.8 0.4
Another approach:
df = (test.groupby('category')
          .apply(lambda x: x.loc[x['type'] == 'old', 'ratio'].iloc[0]
                         - x.loc[x['type'] == 'new', 'ratio'].iloc[0])
          .reset_index(name='val'))
Output
category val
0 1 0.1
1 2 0.2
2 3 0.4
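If you can assume each category has exactly one 'new' and one 'old' row, a shorter sketch avoids the pivot entirely; the sort_values call relies on 'new' sorting before 'old' alphabetically:

```python
import pandas as pd

test = pd.DataFrame({'category': [1, 1, 2, 2, 3, 3],
                     'type': ['new', 'old', 'new', 'old', 'new', 'old'],
                     'ratio': [0.1, 0.2, 0.2, 0.4, 0.4, 0.8]})

# Within each category, old - new is the difference of the two sorted rows
val = (test.sort_values(['category', 'type'])      # 'new' sorts before 'old'
           .groupby('category')['ratio']
           .agg(lambda s: s.iloc[1] - s.iloc[0]))  # old minus new
print(val)
```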

pandas: How to compare value of column with next value

I have a dataframe which looks as follows:
colA colB
0 A 10
1 B 20
2 C 5
3 D 2
4 F 30
....
I would like to compare colB values to detect two successive decrements. That is, I want to report the index values where colB decreases over the next two rows. For example, I want to report 'B' because the two rows following B each have a smaller colB value. I am not sure how to approach this without writing a loop. (If there is no way to avoid a loop, I'd like to know.)
Thanks
You can use loc for this:
desired=frame.loc[(frame["colB"]>=frame["colB"].shift(-1)) &
(frame["colB"].shift(-1)>=frame["colB"].shift(-2) )]
print(desired)
The output will be:
colA colB
1 B 20
if you only wish to report the value B:
desired=frame["colA"].loc[(frame["colB"]>=frame["colB"].shift(-1)) &
(frame["colB"].shift(-1)>=frame["colB"].shift(-2) )]
print(desired.values)
The output will be:
['B']
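Put together as a self-contained sketch (the frame variable and the >= comparisons are taken from the answer above):

```python
import pandas as pd

frame = pd.DataFrame({'colA': ['A', 'B', 'C', 'D', 'F'],
                      'colB': [10, 20, 5, 2, 30]})

# A row starts two successive decrements when it is >= the next value
# and the next value is >= the one after that (NaN comparisons are False)
mask = (frame["colB"] >= frame["colB"].shift(-1)) & \
       (frame["colB"].shift(-1) >= frame["colB"].shift(-2))

print(frame.loc[mask, 'colA'].tolist())
```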
Yes, you can do this without using a loop.
df = pd.DataFrame({'colA':['A', 'B', 'C', 'D', 'F'], 'colB':[10, 20, 5, 2, 30]})
>>> df['colC'] = df['colB'].diff(-1)
>>> df
colA colB colC
0 A 10 -10.0
1 B 20 15.0
2 C 5 3.0
3 D 2 -28.0
4 F 30 NaN
'colC' is the difference between each row and the next.
>>> df['colD'] = np.where(df['colC'] > 0, 1, 0)
>>> df
colA colB colC colD
0 A 10 -10.0 0
1 B 20 15.0 1
2 C 5 3.0 1
3 D 2 -28.0 0
4 F 30 NaN 0
In 'colD' we flag the rows where the difference is greater than 0.
>>> df['s'] = df['colD'].shift(-1)
>>> df
colA colB colC colD s
0 A 10 -10.0 0 1.0
1 B 20 15.0 1 1.0
2 C 5 3.0 1 0.0
3 D 2 -28.0 0 0.0
4 F 30 NaN 0 NaN
In column 's' we shift 'colD' up by one row, so each row sees the next row's flag.
>>> df['flag'] = np.where((df['colD'] == 1) & (df['colD'] == df['s']), 1, 0)
>>> df
colA colB colC colD s flag
0 A 10 -10.0 0 1.0 0
1 B 20 15.0 1 1.0 1
2 C 5 3.0 1 0.0 0
3 D 2 -28.0 0 0.0 0
4 F 30 NaN 0 NaN 0
Then 'flag' is the required column.
You need a little bit of logic here:
s = df.colB.diff().gt(0)  # True where the value increases from the previous row
# Each increase starts a new run; a run of 3 or more rows means the starting
# row is followed by at least two non-increasing values
df.loc[df.groupby(s.cumsum()).colA.transform('count').ge(3) & s, 'colA']
Out[45]:
1 B
Name: colA, dtype: object
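The same idea spelled out step by step as a runnable sketch, with the intermediate Series named for clarity:

```python
import pandas as pd

df = pd.DataFrame({'colA': ['A', 'B', 'C', 'D', 'F'],
                   'colB': [10, 20, 5, 2, 30]})

s = df['colB'].diff().gt(0)          # True where the value increased
groups = s.cumsum()                  # each increase starts a new run
run_len = df.groupby(groups)['colA'].transform('count')

# A run of >= 3 rows whose first row is an increase means that row
# is followed by at least two decreasing values
result = df.loc[run_len.ge(3) & s, 'colA']
print(result.tolist())
```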

Pandas - manipulating column based on multiple condition on dataframe

I have a Pandas dataframe with two columns: fruit, which has values 0 or 1, and qty, which has float values with some missing.
Now I need to overwrite the qty column with a condition: if fruit is 1 and qty is missing, replace qty with 0; otherwise keep the existing qty value.
Any help is appreciated.
Regards,
Dj
Use:
df = pd.DataFrame({
    'qty': [np.nan, np.nan, 10, 23, np.nan],
    'fruit': [1, 0, 1, 0, 1]
})
print (df)
qty fruit
0 NaN 1
1 NaN 0
2 10.0 1
3 23.0 0
4 NaN 1
mask = (df['fruit'] == 1) & (df['qty'].isna())
df['qty'] = np.where(mask, 0, df['qty'])
Other solutions:
df.loc[mask, 'qty'] = 0
df['qty'] = df['qty'].mask(mask, 0)
print (df)
qty fruit
0 0.0 1
1 NaN 0
2 10.0 1
3 23.0 0
4 0.0 1
df1 = df[df.fruit == 1].fillna(0)
df.update(df1)
print(df)
df is your original dataframe.
You can add df.fruit = df.fruit.astype(int) before printing df if you like, since update can cast the column to float.
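One more one-liner sketch (an assumption-level alternative, not from the answers above): map fruit to a per-row fill value and let fillna skip the rows where that value is NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'qty': [np.nan, np.nan, 10, 23, np.nan],
    'fruit': [1, 0, 1, 0, 1]
})

# fruit.map({1: 0}) is 0 where fruit == 1 and NaN elsewhere;
# fillna with a Series only fills where the fill value is non-NaN
df['qty'] = df['qty'].fillna(df['fruit'].map({1: 0}))
print(df)
```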

Find multiple strings in a given column

I'm not sure whether this is easy to do.
I have 2 dataframes. The first one (df1) has a column with texts ('Texts'), and the second one has 2 columns: one with some short texts ('SubString') and one with a score ('Score').
What I want is to sum up all the scores associated with the SubString field in the second dataframe whenever that SubString is a substring of the Texts column in the first dataframe.
For example, if I have a dataframe like this:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
    'c': ['C', 'C', 'C', 'C', 'C', 'C'],
    'd': ['D', 'D', 'D', 'D', 'NaN', 'NaN']
}, columns=['ID', 'Texts', 'c', 'd'])
df1
Out[2]:
ID Texts c d
0 1 this is a string C D
1 2 here we have another string C D
2 3 this one is completly different C D
3 4 one more C D
4 5 this is one more C NaN
5 6 and the last one C NaN
And another dataframe like this:
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5]
}, columns=['SubString', 'Score'])
df2
Out[3]:
SubString Score
0 This 0.50
1 one 0.20
2 this is 0.75
3 is one -0.50
I want to get something like this:
df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
    score = 0
    for index2, row2 in df2.iterrows():
        if row2['SubString'] in row1['Texts']:
            score += row2['Score']
    df1.at[index1, 'Score'] = score  # set_value is deprecated; use at
df1
Out[4]:
ID Texts c d Score
0 1 this is a string C D 0.75
1 2 here we have another string C D 0.00
2 3 this one is completly different C D -0.30
3 4 one more C D 0.20
4 5 this is one more C NaN 0.45
5 6 and the last one C NaN 0.20
Is there a less garbled and faster way to do it?
Thanks!
Option 1
In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
                    for _, x in df2.iterrows()]).sum(axis=0)
Out[691]: array([ 0.75, 0. , -0.3 , 0.2 , 0.45, 0.2 ])
Option 2
In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0 0.75
1 0.00
2 -0.30
3 0.20
4 0.45
5 0.20
Name: Texts, dtype: float64
Note: apply doesn't get rid of loops, it just hides them.
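Since both options still loop over df2 in Python, one further sketch vectorizes the per-substring test over df1 with str.contains; regex=False is assumed here so each SubString is matched literally, mirroring the in operator in the question's loop:

```python
import pandas as pd

df1 = pd.DataFrame({'Texts': ['this is a string',
                              'here we have another string',
                              'this one is completly different',
                              'one more',
                              'this is one more',
                              'and the last one']})
df2 = pd.DataFrame({'SubString': ['This', 'one', 'this is', 'is one'],
                    'Score': [0.5, 0.2, 0.75, -0.5]})

# For each substring, a boolean mask over df1.Texts weighted by its score;
# summing the weighted masks gives the total score per text
df1['Score'] = sum(df1['Texts'].str.contains(sub, regex=False) * score
                   for sub, score in zip(df2['SubString'], df2['Score']))
print(df1)
```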