Create column with values only for some multiindex in pandas

I have a dataframe like this:
df = pd.DataFrame(np.random.randint(50, size=(4, 4)),
index=[['a', 'a', 'b', 'b'], [800, 900, 800, 900]],
columns=['X', 'Y', 'r_value', 'z_value'])
df.index.names = ["dat", "recor"]
X Y r_value z_value
dat recor
a 800 14 28 12 18
900 47 34 59 49
b 800 33 18 24 33
900 18 25 44 19
...
I want to apply a function to create a new column based on r_value that gives values only for the case of recor==900, so, in the end I would like something like:
X Y r_value z_value BB
dat recor
a 800 14 28 12 18 NaN
900 47 34 59 49 0
b 800 33 18 24 33 NaN
900 18 25 44 19 2
...
I have created the function like:
x = df.loc[pd.IndexSlice[:,900], "r_value"]
conditions = [x >=70, np.logical_and(x >= 40, x < 70), \
np.logical_and(x >= 10, x < 40), x <10]
choices = [0, 1, 2, 3]
BB = np.select(conditions, choices)
So now I need to append BB as a column, filling with NaNs the rows corresponding to recor==800. How can I do it? I have tried a couple of ideas (not commented here) without result. Thx.

Try
df.loc[df.index.get_level_values('recor')==900, 'BB'] = BB
The expression df.index.get_level_values('recor') == 900 creates a boolean array that is True wherever the index level "recor" equals 900.
Indexing with a column that does not exist yet, i.e. "BB", creates that new column.
The rest of the column is automatically filled with NaN.
I can't test it since you didn't include a minimal reproducible example.
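For reference, here is a minimal sketch that puts the question's code and the line above together. It assumes the fixed sample values printed in the question rather than random data, and the variable name mask is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Fixed sample values copied from the question's printed frame (assumption:
# the original used random data, so any numbers would do).
df = pd.DataFrame(
    [[14, 28, 12, 18], [47, 34, 59, 49], [33, 18, 24, 33], [18, 25, 44, 19]],
    index=[['a', 'a', 'b', 'b'], [800, 900, 800, 900]],
    columns=['X', 'Y', 'r_value', 'z_value'],
)
df.index.names = ['dat', 'recor']

# Boolean mask: True for rows whose "recor" level equals 900.
mask = df.index.get_level_values('recor') == 900

# Same binning as in the question, evaluated only on the selected rows.
x = df.loc[mask, 'r_value']
conditions = [x >= 70, (x >= 40) & (x < 70), (x >= 10) & (x < 40), x < 10]
choices = [0, 1, 2, 3]

# Assigning to a new column through the mask leaves the other rows as NaN.
df.loc[mask, 'BB'] = np.select(conditions, choices)
print(df)
```

Because the unselected rows hold NaN, the new column ends up with a float dtype.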

Related

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows to make the dataframe exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any idea?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]
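As a shorter variant of the reindex idea, the target index can be passed directly as a range. This is only a sketch: the 22-row frame below is made-up placeholder data, since the question does not show all of its values.

```python
import pandas as pd

# Placeholder frame with 22 rows (hypothetical values).
df = pd.DataFrame({'balance': [23, 22, 19] + [20] * 19})

# Reindex to labels 0..99; labels that do not exist yet get fill_value.
out = df.reindex(range(100), fill_value=100)
print(out.tail())
```

This assumes the existing index is already 0..21; with a different index, the existing rows would not line up with the new labels.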

Summing columns and rows

How do I add up rows and columns?
The last column Sum needs to be the sum of the rows R0+R1+R2.
The last row needs to be the sum of these columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc positional selection:
df.iloc[-1,1:]=df.iloc[:-1,1:].sum()
df['Sum']=df.iloc[:,1:].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum']=df[cols_to_sum].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
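Both answers above rely on the placeholder 'Total' row already holding zeros, so including it in the column sums does no harm. If that row is not there yet, a possible sketch is to compute the totals first and append them as a new row. The names totals, total_row and result are introduced here for illustration.

```python
import pandas as pd

# Same data as in the question, but without the placeholder 'Total' row.
data = [['AP', 16, 20, 78], ['AP+', 10, 14, 55], ['SP', 32, 26, 90]]
df = pd.DataFrame(data, columns=['Type', 'R0', 'R1', 'R2'])

cols_to_sum = ['R0', 'R1', 'R2']

# Row-wise sum goes into the new 'Sum' column.
df['Sum'] = df[cols_to_sum].sum(axis=1)

# Column-wise sums (including 'Sum') become a one-row frame labelled 'Total'.
totals = df[cols_to_sum + ['Sum']].sum().to_dict()
total_row = pd.DataFrame([{'Type': 'Total', **totals}])

result = pd.concat([df, total_row], ignore_index=True)
print(result)
```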

How to create a partially filled column in pandas

I have a df_trg with, say 10 rows numbered 0-9.
I get from various sources values for an additional column foo which contains only a subset of rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with the updating data), and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation by adding a foo column, initially filled with an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer and shorter than yours.
Alternative: fill the foo column initially with NaN. In that case the updating values are converted to float (a side effect of using NaN).
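Another option, sketched here under the assumption that the sources never share an index label, is to concatenate them and let pandas align the result on the index during assignment. Rows missing from every source stay NaN, so the column becomes float.

```python
import numpy as np
import pandas as pd

# Same sample objects as above.
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]

# Assumption: index labels are unique across all sources; duplicated labels
# would make the alignment ambiguous.
df['foo'] = pd.concat(sources)
print(df)
```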

how to calculate percentage changes across 2 columns in a dataframe using pct_change in Python

I have a dataframe and want to use the pct_change method to calculate the % change between only 2 selected columns, B and C, and put the output into a new column. The code below doesn't seem to work. Can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']
Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']
pct_change computes the percentage change across all the columns; you can then select the required column and assign it to a new column.
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833
IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']
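To check that the pct_change approach and the explicit formula agree, here is a small sketch using the first three rows of the output shown above as fixed data. The names via_pct and via_formula are introduced here for illustration.

```python
import pandas as pd

# First three rows of the sample output above, used as fixed data.
df2 = pd.DataFrame({'A': [29, 14, 5], 'B': [4, 35, 18],
                    'C': [29, 2, 31], 'D': [5, 40, 10]})

# Column-wise percentage change restricted to B and C; the 'C' column then
# holds (C - B) / B.
via_pct = df2[['B', 'C']].pct_change(axis=1)['C']

# The same quantity written out explicitly.
via_formula = (df2['C'] - df2['B']) / df2['B']

df2['new'] = via_pct
print(df2)
```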

assigning title to intervals in pandas

import numpy as np
import pandas as pd
xlist = np.arange(1, 100).tolist()
df = pd.DataFrame(xlist,columns=['Numbers'],dtype=int)
pd.cut(df['Numbers'],5)
How can I assign a column name to each distinct interval created?
IIUC, you can use the pd.concat function to join the chunks into a new data frame based on their indexes:
# get indexes
l = df.index.tolist()
n = 20
indexes = [l[i:i + n] for i in range(0, len(l), n)]
# create new data frame
new_df = pd.concat([df.iloc[x].reset_index(drop=True) for x in indexes], axis=1)
new_df.columns = ['Numbers'+str(x) for x in range(new_df.shape[1])]
print(new_df)
Numbers0 Numbers1 Numbers2 Numbers3 Numbers4
0 1 21 41 61 81.0
1 2 22 42 62 82.0
2 3 23 43 63 83.0
3 4 24 44 64 84.0
4 5 25 45 65 85.0
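If the goal is instead to give each interval produced by pd.cut its own name, one possible reading of the question, the labels parameter of pd.cut may be enough. The label names below are hypothetical, as is the 'Bin' column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Numbers': np.arange(1, 100)})

# Hypothetical names, one per bin, ordered from lowest to highest interval.
labels = ['very_low', 'low', 'medium', 'high', 'very_high']

# pd.cut splits the value range into 5 equal-width bins and tags each row
# with the label of the bin it falls into.
df['Bin'] = pd.cut(df['Numbers'], bins=5, labels=labels)
print(df.groupby('Bin', observed=True)['Numbers'].agg(['min', 'max']))
```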