Iterate over every row and compare a column value of a dataframe - pandas

I have the following dataframe. I want to iterate over every row and check whether the value in the score column is >= each value in the cut_off list.
                         seq  score    status
7    TTGTTCTCTGTGTATTTCAGGCT  10.42  positive
56                 CAGGTGAGA   9.22  positive
64   AATTCCTGTGGACTTTCAAGTAT   1.23  positive
116                AAGGTATAT   7.84  positive
145                AAGGTAATA   8.49  positive
172                TGGGTAGGT   6.86  positive
204                CAGGTAGAG   7.10  positive
214  GCGTTTCTTGAATCCAGCAGGGA   3.58  positive
269                GAGGTAATG   8.73  positive
274  CACCCATTCCTGTACCTTAGGTA   8.96  positive
325                GCCGTAAGG   5.46  positive
356                GAGGTGAGG   8.41  positive
cut_off = range(0, 11)
The code I tried so far is:
cutoff_list_pos = []
number_list_pos = []
cut_off = range(0, int(new_df['score'].max())+1)
for co in cut_off:
    for df in df_elements:
        val = (df['score']>=co).value_counts()
        cutoff_list_pos.append(co)
        number_list_pos.append(val)
The desired output is:
   cutoff  true  false
0       0  12.0      0
1       1  12.0      0
and so on..
If the score is >= the value in cut_off, the row should be counted as true, otherwise as false.

You can pass the values of cutoff_list_pos to the keys parameter of concat, then transpose and convert the index to a column with DataFrame.reset_index:
df = (pd.concat(number_list_pos, axis=1, keys=cutoff_list_pos, sort=False)
        .T
        .rename_axis('cutoff')
        .reset_index())

Another pandas implementation:
res_df = pd.DataFrame(columns=['cutoff', 'true'])
for i in range(1,int(df['score'].max()+1)):
    temp_df = pd.DataFrame(data={'cutoff': i, 'true': (df['score']>=i).sum()}, index=[i])
    res_df = pd.concat([res_df, temp_df])
res_df
    cutoff  true
1        1    12
2        2    11
3        3    11
4        4    10
5        5    10
6        6     9
7        7     8
8        8     6
9        9     2
10      10     1
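The same counts can also be obtained without an explicit loop. The following is a minimal sketch, assuming the scores from the question are held in a Series and that NumPy broadcasting is acceptable; the names scores, n_true and result are illustrative and do not come from the original posts.

```python
import numpy as np
import pandas as pd

# Scores copied from the question's dataframe.
scores = pd.Series([10.42, 9.22, 1.23, 7.84, 8.49, 6.86,
                    7.10, 3.58, 8.73, 8.96, 5.46, 8.41], name='score')

# Cutoffs 0..10, built the same way as in the question.
cut_off = np.arange(0, int(scores.max()) + 1)

# Compare every score (rows) against every cutoff (columns),
# then count the True values per cutoff.
n_true = (scores.to_numpy()[:, None] >= cut_off).sum(axis=0)

result = pd.DataFrame({'cutoff': cut_off,
                       'true': n_true,
                       'false': len(scores) - n_true})
print(result)
```

Unlike value_counts, this always yields a number in the false column, so a cutoff that every score passes shows 0 rather than a missing value.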

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values (more than 5 std).
I want to replace, per column, each value that is more than 5 std with the maximum of the remaining values.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the per-column max and re-sort the frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q=100; then you can do:
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
Calculate a column-wise z-score (if you consider a value an outlier when it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
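One way to fill in the masked values, matching the replacement rule from the question, is sketched below. It assumes that an outlier is defined by the absolute z-score and that the replacement is the largest non-outlier value in the same column; the function name replace_outliers and the threshold parameter are illustrative. Note that with the sample standard deviation, no z-score in a five-row frame can reach 5, so the demo uses a threshold of 1.5 purely for illustration.

```python
import pandas as pd

def replace_outliers(df, threshold):
    """Replace values whose |z-score| exceeds `threshold` with the
    largest remaining (non-outlier) value of the same column."""
    zscores = (df - df.mean()) / df.std()
    outlier_mask = zscores.abs() > threshold
    kept = df.mask(outlier_mask)      # outliers become NaN
    return kept.fillna(kept.max())    # fill each column with its own max

# Example frame from the question.
df = pd.DataFrame({'A': [1, 1, 2, 1, 191],
                   'B': [2, 6, 8, 115, 1]})

# Threshold chosen for the demo only (assumption, see note above).
result = replace_outliers(df, threshold=1.5)
print(result)
```

The result is a float frame because masking introduces NaN as an intermediate step; cast back with astype if integer columns are needed.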

Pandas with a condition select a value from a column and multiply by scalar in new column, row by row

A value in 'Target_Labels' is either 0.0,1.0,2.0 in float64.
Based on this value, I would like to look up a value in one of three columns 'B365A','B365D','B365H' and multiply this value by 10 in a new column. This operation needs to be row wise throughout the entire DataFrame.
I have tried many combinations but nothing seems to work...
final['amount'] = final['Target_Labels'].apply((lambda x: 'B365A' * 10 if x==0.0 else ('B365D' * 10 if x ==1 else 'B365H' * 10))
def prod(x, var1, var2, var3, var4):
    if (x[var4])==0:
        x[var3]*10
    elif (x[var4])==1:
        x[var1]*10
    else:
        x[var2]*10
    return x
final['montant'] = final.apply(lambda x: prod(x, 'B365D', 'B365H','B365A', 'Target_Labels'), axis=1)
I'm new to Pandas and any help is welcome...
Use NumPy integer-array indexing to pick one cell per row (the example below uses labels 1, 2, 3, hence the - 1):
array = final.values
row = range(len(final))
col = final['Target_Labels'] - 1
>>> final
   B365A  B365D  B365H  Target_Labels
0     11     12     13              1
1     11     12     13              2
2     11     12     13              3
>>> final['amount'] = final.values[(range(len(final)),
                                    final['Target_Labels'] - 1)] * 10
>>> final
   B365A  B365D  B365H  Target_Labels  amount
0     11     12     13              1     110
1     11     12     13              2     120
2     11     12     13              3     130
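The question states that Target_Labels holds 0.0, 1.0 or 2.0 as float64, whereas the example above uses 1, 2, 3. A minimal sketch for the float case follows, assuming the mapping 0 -> 'B365A', 1 -> 'B365D', 2 -> 'B365H' implied by the asker's lambda; the sample odds values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
final = pd.DataFrame({
    'B365A': [11.0, 21.0, 31.0],
    'B365D': [12.0, 22.0, 32.0],
    'B365H': [13.0, 23.0, 33.0],
    'Target_Labels': [0.0, 1.0, 2.0],
})

# Column order defines the label -> column mapping (assumed: 0, 1, 2).
odds_cols = ['B365A', 'B365D', 'B365H']
odds = final[odds_cols].to_numpy()

# Float labels must be cast to integers before they can index an array.
col_idx = final['Target_Labels'].astype(int).to_numpy()
row_idx = np.arange(len(final))

# Pick one cell per row, then scale by 10.
final['amount'] = odds[row_idx, col_idx] * 10
print(final)
```

Selecting only the three odds columns before indexing keeps the lookup independent of where Target_Labels sits in the frame.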

Find Max Gradient by Row in For Loop Pandas

I have a df of 15 x 4 and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row, using the "S" and "N" values for the min or max in the rows below. I'm not sure that this is the most Pythonic way to do this. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = ( ms.maxSlats.iloc[i] - ms.minNlats.iloc[i] ) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient, then keep only the last row for each duplicated index value:
grad = np.where(ms.maxSlats > ms.maxNlats, (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms)+1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
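If dropping rows with a repeated gradient is not desired, the gradient can be stored as an ordinary column instead of the index. This is a minimal sketch under that assumption, using three rows copied from the question's data; the column name grad is illustrative. One difference from the original loop: rows where maxSlats equals maxNlats are skipped by the loop but fall into the second branch here.

```python
import numpy as np
import pandas as pd

# Rows 0, 1 and 11 from the question's "ms" frame.
ms = pd.DataFrame({
    'minSlats': [57839.4, 57763.2, 57559.9],
    'minNlats': [54917.0, 55656.7, 57121.3],
    'maxSlats': [57962.6, 58120.0, 57890.8],
    'maxNlats': [56979.9, 57766.0, 58043.0],
})

# Negative gradient when the southern maximum dominates,
# positive gradient otherwise.
south_dominant = ms['maxSlats'] > ms['maxNlats']
ms['grad'] = np.where(south_dominant,
                      -(ms['maxSlats'] - ms['minNlats']),
                      ms['maxNlats'] - ms['minSlats'])
print(ms)
```

Keeping the gradient as a column preserves one result per input row, so the row position no longer has to be stored alongside it.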

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' from it, like a ranking with 20 deciles in total. I need to assign a rank number out of 20 based on the value.
10 to 20 is the first decile, number 1
20 to 30 is the second decile, number 2
x = np.random.randint(100,501,size=(1000)) # column of 1000 rows with values ranging between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map( lambda x: int(x/20) )
df.head()
Use integer division by 10:
df = pd.DataFrame({
    'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
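For the 100 to 500 range described in the question, one possible reading is 20 equal-width bins of width 20, ranked 1 to 20. The sketch below uses pd.cut under that assumption; the bin edges, the right-inclusive intervals and the column name credit_rank are choices made for illustration, not something the question specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'credit_score': [100, 120, 121, 300, 500]})

# 21 edges from 100 to 500 give 20 bins of width 20.
edges = np.linspace(100, 500, 21)

# Intervals are right-inclusive: (100, 120], (120, 140], ...
# include_lowest=True makes the first bin also contain 100.
df['credit_rank'] = pd.cut(df['credit_score'],
                           bins=edges,
                           labels=list(range(1, 21)),
                           include_lowest=True).astype(int)
print(df)
```

If the ranks should instead hold roughly equal numbers of rows rather than equal score ranges, pd.qcut is the quantile-based counterpart.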

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4,4)),
                  columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
to which I want to add, for each level 0 column, a column called "Value" that is the result of the following function:
def my_func(df, scale):
    return df['QTY']*df['PRICE']*scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
As if that wasn't hard enough, I want to apply one "scale" value for the "OP" level 0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
    #select second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values *df.xs('PRICE', axis=1, level=1) * scale
    #create MultiIndex in columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    #join to original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
      OP              PK
   PRICE QTY   val PRICE QTY   val
0      0   1     0     2   3    60
1      4   5   200     6   7   420
2      8   9   720    10  11  1100
3     12  13  1560    14  15  2100
EDIT:
To multiply by a different scale value for each level, pass a list of values (one per level 0 column, in column order):
print (my_func(df, [10, 20]))
      OP              PK
   PRICE QTY   val PRICE QTY   val
0      0   1     0     2   3   120
1      4   5   200     6   7   840
2      8   9   720    10  11  2200
3     12  13  1560    14  15  4200
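Passing a list relies on the scale values being in the same order as the level 0 columns. A variation that ties each scale to its level by name is sketched below; the function name add_value and the dict argument are illustrative, and the new column is named "Value" as requested in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP', 'PK'],
                                                      ['PRICE', 'QTY']]))

def add_value(df, scales):
    """Add a 'Value' column under each level 0 label.

    `scales` maps each level 0 label to its own scale factor.
    """
    price = df.xs('PRICE', axis=1, level=1)   # columns: OP, PK
    qty = df.xs('QTY', axis=1, level=1)       # columns: OP, PK
    # The Series index aligns with the column labels, so each
    # level 0 column is multiplied by its own scale.
    value = price * qty * pd.Series(scales)
    value.columns = pd.MultiIndex.from_product([value.columns, ['Value']])
    return pd.concat([df, value], axis=1).sort_index(axis=1)

result = add_value(df, {'OP': 10, 'PK': 20})
print(result)
```

Because the scales are aligned by label rather than by position, reordering the level 0 columns does not change which scale is applied to which level.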
Use groupby + agg, then concatenate the pieces with pd.concat (note that groupby with axis=1 is deprecated as of pandas 2.1):
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
      OP              PK
   PRICE QTY value PRICE QTY value
0      0   1     0     2   3    60
1      4   5   200     6   7   420
2      8   9   720    10  11  1100
3     12  13  1560    14  15  2100