How to split numbers in pandas column into deciles? - pandas

I have a column in a pandas dataset with random values ranging between 100 and 500.
I need to create a new column 'deciles' from it, like a ranking, with 20 deciles in total. I need to assign a rank number out of 20 based on the value.
10 to 20 - is the first decile, number 1
20 to 30 - is the second decile, number 2
x = np.random.randint(100, 501, size=1000)  # 1000 values ranging between 100 and 500 (inclusive)
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map( lambda x: int(x/20) )
df.head()

Use integer division by 10:
df = pd.DataFrame({
    'credit_score': [4, 15, 24, 55, 77, 81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print(df)
   credit_score  credit_decile_rank
0             4                   0
1            15                   1
2            24                   2
3            55                   5
4            77                   7
5            81                   8
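The integer-division answer above gives fixed-width bins of 10. For the data described in the question (values from 100 to 500, 20 ranks), a minimal sketch using pd.cut for equal-width bins and pd.qcut for equal-population bins might look like this. The helper name fixed_width_rank, the seed, and the column names width_rank and quantile_rank are illustrative assumptions, not from the original post.

```python
import numpy as np
import pandas as pd


def fixed_width_rank(scores, low=100, high=500, n_bins=20):
    """Hypothetical helper: rank 1..n_bins using equal-width bins on [low, high]."""
    edges = np.linspace(low, high, n_bins + 1)
    # labels=False returns the 0-based bin number; include_lowest keeps `low` in bin 0
    return pd.cut(scores, bins=edges, labels=False, include_lowest=True) + 1


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({"credit_score": rng.integers(100, 501, size=1000)})

# Equal-width bins: each rank covers a span of 20 score points
df["width_rank"] = fixed_width_rank(df["credit_score"])

# Equal-population bins: each rank holds roughly 5% of the rows
df["quantile_rank"] = pd.qcut(df["credit_score"], q=20, labels=False) + 1

print(df.head())
```

Which of the two fits depends on whether "rank" means a position on the score scale (pd.cut) or a position within the population (pd.qcut).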

Related

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows so that it has exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this:
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any idea?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value=100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
    balance
0         1
1        14
2        11
3        11
4        10
..      ...
96      100
97      100
98      100
99      100
[100 rows x 1 columns]
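As a sketch of the reindex idea on a smaller frame: reindexing to the full target range avoids hard-coding where the new rows start. The sample values and the target_len variable are assumptions for illustration. As far as I understand, the original attempt fails because assigning to a .loc slice only writes to labels that already exist; it does not append rows.

```python
import pandas as pd

# Small stand-in for the 22-row frame in the question (values are made up)
df = pd.DataFrame({"balance": [23, 22, 19, 20]})
target_len = 10  # would be 100 in the question

# Existing rows keep their values; new index labels get fill_value
padded = df.reindex(range(target_len), fill_value=100)
print(padded)
```

This assumes the frame has a default 0..n-1 index, as in the question.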

Pandas with a condition select a value from a column and multiply by scalar in new column, row by row

Each value in 'Target_Labels' is 0.0, 1.0, or 2.0 (float64).
Based on this value, I would like to look up a value in one of three columns 'B365A','B365D','B365H' and multiply this value by 10 in a new column. This operation needs to be row-wise throughout the entire DataFrame.
I have tried many combinations but nothing seems to work...
final['amount'] = final['Target_Labels'].apply((lambda x: 'B365A' * 10 if x==0.0 else ('B365D' * 10 if x ==1 else 'B365H' * 10))
def prod(x, var1, var2, var3, var4):
    if (x[var4])==0:
        x[var3]*10
    elif (x[var4])==1:
        x[var1]*10
    else:
        x[var2]*10
    return x
final['montant'] = final.apply(lambda x: prod(x, 'B365D', 'B365H','B365A', 'Target_Labels'), axis=1)
I'm new to Pandas and any help is welcome...
Use NumPy integer indexing to pick one cell per row (the sample data below uses labels 1, 2, 3, hence the - 1; with 0-based float labels as in the question, cast them to int and drop the subtraction):
array = final.values
row = range(len(final))
col = final['Target_Labels'] - 1
>>> final
   B365A  B365D  B365H  Target_Labels
0     11     12     13              1
1     11     12     13              2
2     11     12     13              3
>>> final['amount'] = final.values[(range(len(final)),
...                                 final['Target_Labels'] - 1)] * 10
>>> final
   B365A  B365D  B365H  Target_Labels  amount
0     11     12     13              1     110
1     11     12     13              2     120
2     11     12     13              3     130
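For the labels exactly as described in the question (0.0, 1.0, 2.0 stored as float64), a minimal sketch of the same row-wise lookup could look like this. The sample odds values are made up, and the mapping 0 to B365A, 1 to B365D, 2 to B365H is taken from the asker's first attempt.

```python
import numpy as np
import pandas as pd

final = pd.DataFrame({
    "B365A": [11.0, 11.0, 11.0],
    "B365D": [12.0, 12.0, 12.0],
    "B365H": [13.0, 13.0, 13.0],
    "Target_Labels": [0.0, 1.0, 2.0],
})

# Column order must match the label meaning: 0 -> B365A, 1 -> B365D, 2 -> B365H
odds_cols = ["B365A", "B365D", "B365H"]
odds = final[odds_cols].to_numpy()

row_idx = np.arange(len(final))
col_idx = final["Target_Labels"].astype(int).to_numpy()  # float labels -> integer positions

# Pick one cell per row, then scale
final["amount"] = odds[row_idx, col_idx] * 10
print(final)
```

Selecting only the three odds columns before indexing keeps the label column itself out of the lookup array.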

Replacing -999 with a number but I want all replaced number to be different

I have a pandas DataFrame named df, and in the df['salary'] column there are 400 values represented by the same number, -999. I want to replace each -999 with a number between 200 and 500, and each of the 400 values should get a different number. So far I have written this code:
df['salary'] = df['salary'].replace(-999, random.randint(200, 500))
but this code replaces every -999 with the same value. I want all the replaced values to be different from each other. How can I do this?
You can use Series.mask with np.random.randint:
df = pd.DataFrame({"salary":[0,1,2,3,4,5,-999,-999,-999,1,3,5,-999]})
df['salary'] = df["salary"].mask(df["salary"].eq(-999), np.random.randint(200, 500, size=len(df)))
print(df)
    salary
0        0
1        1
2        2
3        3
4        4
5        5
6      413
7      497
8      234
9        1
10       3
11       5
12     341
If you want non-repeating numbers instead:
s = pd.Series(range(200, 500)).sample(frac=1).reset_index(drop=True)
df['salary'] = df["salary"].mask(df["salary"].eq(-999), s)
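Another way to get distinct replacements, sketched here with NumPy's Generator.choice and replace=False, is to draw exactly as many values as there are -999 entries and assign them through a boolean mask, which does not depend on the index being 0..n-1. The seed and variable names are assumptions. Note that distinct integers are only possible when the pool is at least as large as the number of replacements: 200 to 500 inclusive holds 301 integers, so 400 distinct integer replacements would need a wider range or non-integer values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [0, 1, 2, 3, 4, 5, -999, -999, -999, 1, 3, 5, -999]})

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
missing = df["salary"].eq(-999)  # boolean mask of the rows to replace
candidates = np.arange(200, 501)  # 200..500 inclusive

# replace=False guarantees no value is drawn twice;
# it raises ValueError if more values are requested than the pool holds
df.loc[missing, "salary"] = rng.choice(candidates, size=int(missing.sum()), replace=False)
print(df)
```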

Pandas column merging on condition

This is my pandas df:
Id  Protein  A_Egg  B_Meat  C_Milk Category
A        10     10      20       0      egg
B        20     10       0      10     milk
C        20     10      10      10     meat
D        25     20      10       0      egg
I wish to merge the Protein column with one of the other columns, chosen based on "Category".
My expected output is:
Id  Protein_final
A              20
B              30
C              30
D              45
Ideally I would show how I am approaching this, but I am frankly clueless!
EDIT: Also, how should I handle the case where the category is blank or does not match any of the columns? (In that case the final value should be the same as the initial value in the Protein column.)
Use DataFrame.lookup after some preprocessing: rename the columns by dropping everything before the _ and lowercasing, look up the values, then add them to the Protein column. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only runs on older versions:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print(df)
  Id  Protein  A_Egg  B_Meat  C_Milk Category
0  A       20     10      20       0      egg
1  B       30     10       0      10     milk
2  C       30     10      10      10     meat
3  D       45     20      10       0      egg
If you only need the 2 columns in the end:
df = df[['Id','Protein']]
You can melt the dataframe, add each melted value to Protein, and keep only the rows where Category equals the variable column:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
  Id  Protein_final
0  A             20
3  D             45
6  C             30
9  B             30
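Neither answer above addresses the EDIT about blank or unmatched categories. A minimal sketch that covers it, using plain NumPy indexing instead of DataFrame.lookup, might look like this. Row E, the value_cols list, and the helper variable names are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": ["A", "B", "C", "D", "E"],
    "Protein": [10, 20, 20, 25, 5],
    "A_Egg": [10, 10, 10, 20, 1],
    "B_Meat": [20, 0, 10, 10, 1],
    "C_Milk": [0, 10, 10, 0, 1],
    "Category": ["egg", "milk", "meat", "egg", None],  # row E is an added blank case
})

value_cols = ["A_Egg", "B_Meat", "C_Milk"]
keys = pd.Index([c.split("_")[-1].lower() for c in value_cols])  # egg, meat, milk

# Position of each row's category among the value columns; -1 means no match
col_idx = keys.get_indexer(df["Category"].fillna(""))
matched = col_idx >= 0

values = df[value_cols].to_numpy()
rows = np.arange(len(df))
# Clip -1 to 0 so the indexing is valid, then zero out the unmatched rows
extra = np.where(matched, values[rows, np.clip(col_idx, 0, None)], 0)

df["Protein_final"] = df["Protein"] + extra
print(df[["Id", "Protein_final"]])
```

Rows with a blank or unknown category keep their original Protein value, which is the behaviour the EDIT asks for.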

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a MultiIndex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4,4)),
                  columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
to which I want to add a column for each level 0 column, called "Value", which is the result of the following function:
def my_func(df, scale):
    return df['QTY']*df['PRICE']*scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's MultiIndex columns to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
As if that weren't hard enough, I want to apply one "scale" value to the "OP" level 0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
    # select QTY and PRICE from the second level of the columns and multiply them
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    # create a MultiIndex in the columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    # join to the original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)
print(my_func(df, 10))
     OP              PK
  PRICE QTY   val PRICE QTY   val
0     0   1     0     2   3    60
1     4   5   200     6   7   420
2     8   9   720    10  11  1100
3    12  13  1560    14  15  2100
EDIT:
To scale each level by a different value, pass a list of values:
print(my_func(df, [10, 20]))
     OP              PK
  PRICE QTY   val PRICE QTY   val
0     0   1     0     2   3   120
1     4   5   200     6   7   840
2     8   9   720    10  11  2200
3    12  13  1560    14  15  4200
Use groupby + agg, and then concatenate the pieces together with pd.concat (note that groupby(..., axis=1) is deprecated as of pandas 2.1):
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
     OP              PK
  PRICE QTY value PRICE QTY value
0     0   1     0     2   3    60
1     4   5   200     6   7   420
2     8   9   720    10  11  1100
3    12  13  1560    14  15  2100
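To make the per-level scale explicit by name rather than by list position, one option is a dict keyed by the level 0 label. This is a sketch under that assumption; the scales dict and the variable names are illustrative and not part of either answer, and it avoids groupby on columns entirely.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    columns=pd.MultiIndex.from_product([["OP", "PK"], ["PRICE", "QTY"]]),
)

scales = {"OP": 10, "PK": 20}  # hypothetical per-level scale values

# Tuple keys become MultiIndex columns: (level 0 label, "Value")
value = pd.DataFrame({
    (top, "Value"): df[(top, "QTY")] * df[(top, "PRICE")] * scale
    for top, scale in scales.items()
})

# Append the new columns and sort so each Value sits under its level 0 label
result = pd.concat([df, value], axis=1).sort_index(axis=1)
print(result)
```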