Pandas with a condition select a value from a column and multiply by scalar in new column, row by row - pandas

A value in 'Target_Labels' is either 0.0,1.0,2.0 in float64.
Based on this value, I would like to look up a value in one of three columns 'B365A','B365D','B365H' and multiply this value by 10 in a new column. This operation needs to be row wise throughout the entire DataFrame.
I have tried many combinations, but nothing seems to work:
final['amount'] = final['Target_Labels'].apply((lambda x: 'B365A' * 10 if x==0.0 else ('B365D' * 10 if x ==1 else 'B365H' * 10))
def prod(x, var1, var2, var3, var4):
    if (x[var4])==0:
        x[var3]*10
    elif (x[var4])==1:
        x[var1]*10
    else:
        x[var2]*10
    return x
final['montant'] = final.apply(lambda x: prod(x, 'B365D', 'B365H','B365A', 'Target_Labels'), axis=1)
I'm new to Pandas and any help is welcome...

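Use NumPy integer-array indexing to pick one cell per row. The example below uses labels 1, 2, 3, hence the - 1; with labels 0, 1, 2, drop the subtraction and cast the labels to int first:
array = final.values
row = range(len(final))
col = final['Target_Labels'] - 1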
>>> final
   B365A  B365D  B365H  Target_Labels
0     11     12     13              1
1     11     12     13              2
2     11     12     13              3
>>> final['amount'] = final.values[(range(len(final)),
...                                final['Target_Labels'] - 1)] * 10
>>> final
   B365A  B365D  B365H  Target_Labels  amount
0     11     12     13              1     110
1     11     12     13              2     120
2     11     12     13              3     130
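For the labels described in the question (0.0, 1.0, 2.0 stored as float64), the same indexing idea might look like the sketch below. The sample odds are made up for illustration, and the mapping 0 -> B365A, 1 -> B365D, 2 -> B365H is an assumption taken from the first attempt in the question.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real frame is not shown in the question.
final = pd.DataFrame({
    "B365A": [11.0, 21.0, 31.0],
    "B365D": [12.0, 22.0, 32.0],
    "B365H": [13.0, 23.0, 33.0],
    "Target_Labels": [0.0, 1.0, 2.0],
})

# Select only the odds columns, in the order that matches labels 0, 1, 2.
odds = final[["B365A", "B365D", "B365H"]].to_numpy()

# Float labels cannot be used as positions, so cast them to int first.
col = final["Target_Labels"].astype(int).to_numpy()
row = np.arange(len(final))

# One (row, col) pair per row picks a single cell from each row.
final["amount"] = odds[row, col] * 10
print(final["amount"].tolist())  # [110.0, 220.0, 330.0]
```

Selecting the three odds columns explicitly keeps the label-to-column mapping independent of whatever other columns the frame contains.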

Related

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i,j) in zip(range(0, 5999),range(1,len(df.columns))):
    if j==1:
        df2.values[i,j] = df.values[i,j] + df.values[0,1]
    elif j>1:
        df2.iloc[i,j] = df.iloc[i,j] - df.iloc[0,j]
print(df2)
Any help would be greatly appreciated. Thank you.
Subtract the first row from the whole DataFrame:
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. Its index (the first column printed here) holds the column names of the dataframe, and its values (the second column) are the entries of the dataframe's first row.
We can convert it to a list to see its values more clearly:
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, through broadcasting, each of these values is subtracted from every entry of the column it came from, so the first row of the result is all zeros.
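A minimal runnable sketch of the explanation above, using the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))

# df.iloc[0] is a Series indexed by the column labels, so the subtraction
# is aligned on columns and applied to every row.
df2 = df - df.iloc[0]
print(df2)
```

Each row of df2 then holds that row's offset from the first row, column by column.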

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe that currently has 22 rows:
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows so that it has exactly 100 rows. So I need to fill loc[22:99] with a fixed value, let's say 100.
I tried something like this:
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any ideas?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
    balance
0         1
1        14
2        11
3        11
4        10
..      ...
96      100
97      100
98      100
99      100

[100 rows x 1 columns]
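As a possible shortcut for the reindex approach, assuming the existing index is the default 0..n-1, the target range can be passed to reindex directly. The sketch below uses a smaller, made-up frame (4 rows padded to 10) to keep the output short.

```python
import pandas as pd

# Illustrative data; the default RangeIndex 0..3 is assumed.
df = pd.DataFrame({"balance": [23, 22, 19, 20]})

target_rows = 10  # would be 100 in the question

# Labels 4..9 do not exist yet, so they are created and filled with 100.
out = df.reindex(range(target_rows), fill_value=100)
print(out["balance"].tolist())  # [23, 22, 19, 20, 100, 100, 100, 100, 100, 100]
```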

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' from it, like a ranking with 20 deciles in total. Each value should be assigned a rank number out of 20.
10 to 20 is the first decile, number 1
20 to 30 is the second decile, number 2
x = np.random.randint(100,501,size=(1000)) # 1000 values ranging from 100 to 500 (inclusive)
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map( lambda x: int(x/20) )
df.head()
Use integer division by 10:
df = pd.DataFrame({
'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
   credit_score  credit_decile_rank
0             4                   0
1            15                   1
2            24                   2
3            55                   5
4            77                   7
5            81                   8
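If the scores really span 100 to 500 and 20 equal-width bins numbered 1 to 20 are wanted, as the question suggests, the integer-division idea could be adapted as sketched below. The bin width of 20 and the choice to put the maximum score 500 into the last bin are assumptions, and the sample scores are made up.

```python
import pandas as pd

df = pd.DataFrame({"credit_score": [100, 119, 120, 250, 499, 500]})

low, width, n_bins = 100, 20, 20  # assumed: (500 - 100) / 20 bins = width 20

# Shift so the lowest score maps to 0, divide into bins, then number from 1.
rank = (df["credit_score"] - low) // width + 1

# The top score 500 would otherwise land in a 21st bin, so cap it at 20.
df["credit_decile_rank"] = rank.clip(upper=n_bins)
print(df["credit_decile_rank"].tolist())  # [1, 1, 2, 8, 20, 20]
```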

subset df by masking between specific rows

I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is these values can be at different rows so I can't select fixed rows.
Specifically, I want to remove rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df and be of unequal length.
Note: The string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values.
But mask could work better if I could return all rows between these two values?
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
Val
0 None
8 X
9 1
10 2
15 Y
16 1
17 2
If ABC always comes before 5 and the two always occur in pairs (ABC, 5), get the positions of both values with np.where, zip them to build the index ranges in between, and finally filter with isin, inverting the mask with ~:
# two (ABC, 5) pairs in the data
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'None','None','None',
'None','ABC','None',1,2,3,4,5,'None','None','None']
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
     Val
0   None
8   None
9   None
10  None
11  None
19  None
20  None
21  None
a = df.index[df['Val'].str.contains('ABC')==True][0]
b = df.index[df['Val']==5][0]+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
Output
Val
0 None
8 X
9 1
10 2
If there is more than one 'ABC' and more than one 5, use the version below.
It removes everything from the first ABC through the last 5 and keeps the rest of the df:
a = (df['Val'].str.contains('ABC')==True).idxmax()
b = df['Val'].where(df['Val']==5).last_valid_index()+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
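A loop-free alternative is sketched below with cumulative sums, applied to the data from the question. It assumes every ABC row is followed by a 5 before the next ABC, and that 5 never appears outside such a block; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Val": ["None", "ABC", "None", 1, 2, 3, 4, 5, "X", 1, 2,
            "ABC", 1, 4, 5, "Y", 1, 2],
})

# True on rows that open a block (any string containing ABC).
starts = df["Val"].astype(str).str.contains("ABC")
# True on rows that close a block (the integer 5).
ends = df["Val"].eq(5)

# A row is inside a block when more blocks have opened than have closed
# before it; shifting the end marker keeps the closing 5 inside the block.
inside = starts.cumsum() > ends.shift(fill_value=False).cumsum()

result = df[~inside]
print(result.index.tolist())  # [0, 8, 9, 10, 15, 16, 17]
```

The remaining index labels match the intended output given in the question.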

Combine two columns of numbers in dataframe into single column using pandas/python

I'm very new to Pandas and Python.
I have a 3226 x 61 dataframe and I would like to combine two columns into a single one.
The two columns I would like to combine are both integers - one has either one or two digits (1 through 52) while the other has three digits (e.g., 1 or 001, 23 or 023). I need the output to be a five digit integer (e.g., 01001 or 52023). There will be no mathematical operations with the resulting integers - I will need them only for look-up purposes.
Based on some of the other posts on this fantastic site, I tried the following:
df['YZ'] = df['Y'].map(str) + df['Z'].map(str)
But that returns "1.00001" for a first column of "1" and a second column of "001", I believe because converting "1" to a str turns it into "1.0", to which "001" is then appended.
I've also tried:
df['YZ'] = df['Y'].join(df['Z'])
Getting the following error:
AttributeError: 'Series' object has no attribute 'join'
I've also tried:
df['Y'] = df['Y'].astype(int)
df['Z'] = df['Z'].astype(int)
df['YZ'] = df[['Y','Z']].apply(lambda x: ''.join(x), axis=1)
Getting the following error:
TypeError: ('sequence item 0: expected str instance, numpy.int32 found', 'occurred at index 0')
A copy of the columns is below:
1 1
1 3
1 5
1 7
1 9
1 11
1 13
I understand there are two issues here:
Combining the two columns
Getting the correct format (five digits)
Frankly, I need help with both, but would most appreciate help with combining the columns.
I think you need to convert the columns to strings, pad them with leading zeros using zfill, and concatenate them with +:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
Sample:
df=pd.DataFrame({'Y':[1,3,5,7], 'Z':[10,30,51,74]})
print (df)
   Y   Z
0  1  10
1  3  30
2  5  51
3  7  74
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
print (df)
   Y   Z     YZ
0  1  10  01010
1  3  30  03030
2  5  51  05051
3  7  74  07074
If you also need to change the original columns:
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df['Y'] + df['Z']
print (df)
    Y    Z     YZ
0  01  010  01010
1  03  030  03030
2  05  051  05051
3  07  074  07074
Solution with join (here with a '-' separator):
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df[['Y','Z']].apply('-'.join, axis=1)
print (df)
    Y    Z      YZ
0  01  010  01-010
1  03  030  03-030
2  05  051  05-051
3  07  074  07-074
And without changing the original columns:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + '-' + df['Z'].astype(str).str.zfill(3)
print (df)
   Y   Z      YZ
0  1  10  01-010
1  3  30  03-030
2  5  51  05-051
3  7  74  07-074
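The "1.0" symptom described in the question suggests the columns are stored as floats. Under that assumption, and assuming there are no missing values, a cast to int before the string conversion might be needed, as in this sketch with made-up sample values:

```python
import pandas as pd

# Float columns, as the "1.0" in the question's output suggests (assumption).
df = pd.DataFrame({"Y": [1.0, 3.0, 52.0], "Z": [1.0, 23.0, 123.0]})

# Cast to int first so that str() yields "1" rather than "1.0",
# then pad to the fixed widths (2 + 3 = 5 digits).
y = df["Y"].astype(int).astype(str).str.zfill(2)
z = df["Z"].astype(int).astype(str).str.zfill(3)

df["YZ"] = y + z
print(df["YZ"].tolist())  # ['01001', '03023', '52123']
```

The result stays a string column, which preserves the leading zeros needed for look-ups.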