Generate a new column based on other columns' value - pandas

Here is my sample data input and desired output:
df = pd.DataFrame({'A_flag': [1, 1, 1], 'B_flag': [1, 1, 0], 'C_flag': [0, 1, 0],
                   'A_value': [5, 3, 7], 'B_value': [2, 7, 4], 'C_value': [4, 2, 5]})
df1 = pd.DataFrame({'A_flag': [1, 1, 1], 'B_flag': [1, 1, 0], 'C_flag': [0, 1, 0],
                    'A_value': [5, 3, 7], 'B_value': [2, 7, 4], 'C_value': [4, 2, 5],
                    'Final': [3.5, 3, 7]})
I want to generate another column called 'Final', conditional on A_flag, B_flag and C_flag:
(a) If the number of flag columns equal to 1 is 3, then 'Final' = median of (A_value, B_value, C_value).
(b) If the number of satisfied conditions is 2, then 'Final' = mean of those two values.
(c) If the number is 1, then 'Final' = that one value.
For example, in row 1, A_flag=1 and B_flag=1, so 'Final' = (A_value + B_value)/2 = (5 + 2)/2 = 3.5.
In row 2, all three flags are 1, so 'Final' = median of (3, 7, 2) = 3.
In row 3, only A_flag=1, so 'Final' = A_value = 7.
I tried the following:
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==3, "Final"] = df[['A_value','B_value','C_value']].median(axis=1)
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==2, "Final"] = ...
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==1, "Final"] = ...
I don't know how to subset the columns for the second and third scenarios.

Assuming the order of the flag and value columns matches, you can first filter the flag-like and value-like columns, mask the values in the value columns where the flag is 0, then calculate the median along axis=1. A single median call covers all three cases, since the median of three values is the median, the median of two values is their mean, and the median of one value is that value.
flag = df.filter(like='_flag')
value = df.filter(like='_value')
df['median'] = value.mask(flag.eq(0).to_numpy()).median(axis=1)
A_flag B_flag C_flag A_value B_value C_value median
0 1 1 0 5 2 4 3.5
1 1 1 1 3 7 2 3.0
2 1 0 0 7 4 5 7.0
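Printing the intermediate masked frame makes it easy to see which values feed each row's median:
print(value.mask(flag.eq(0).to_numpy()))
   A_value  B_value  C_value
0        5      2.0      NaN
1        3      7.0      2.0
2        7      NaN      NaN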

When working with dataframes, often the easiest approach is to define a function and apply it to the dataframe, iterating over either the columns or the rows. I think in your case this might work:
import pandas as pd
from statistics import median

df = pd.DataFrame(
    {
        "A_flag": [1, 1, 1],
        "B_flag": [1, 1, 0],
        "C_flag": [0, 1, 0],
        "A_value": [5, 3, 7],
        "B_value": [2, 7, 4],
        "C_value": [4, 2, 5],
    }
)

def make_final_column(row):
    # Pair each flag with its value and keep the values whose flag is 1
    pairs = [(row['A_flag'], row['A_value']),
             (row['B_flag'], row['B_value']),
             (row['C_flag'], row['C_value'])]
    met_condition = [value for flag, value in pairs if flag == 1]
    # median handles all three cases: one value -> itself,
    # two values -> their mean, three values -> the middle one
    return median(met_condition)

df["Final"] = df.apply(make_final_column, axis=1)
df
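With the sample data this should produce:
   A_flag  B_flag  C_flag  A_value  B_value  C_value  Final
0       1       1       0        5        2        4    3.5
1       1       1       1        3        7        2    3.0
2       1       0       0        7        4        5    7.0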

With numpy:
import numpy as np

flags = df[["A_flag", "B_flag", "C_flag"]].to_numpy()
values = df[["A_value", "B_value", "C_value"]].to_numpy()
# Sort each row so that the 0 flags appear first
index = np.argsort(flags)
flags = np.take_along_axis(flags, index, axis=1)
# Rearrange the values to match the sorted flags
values = np.take_along_axis(values, index, axis=1)
# Result
np.select(
    [
        flags[:, 0] == 1,  # all three flags are 1
        flags[:, 1] == 1,  # exactly two flags are 1
        flags[:, 2] == 1,  # exactly one flag is 1
    ],
    [
        np.quantile(values, 0.5, axis=1),  # median of all 3 values
        np.mean(values[:, -2:], axis=1),   # mean of the two flagged values
        values[:, 2],                      # the single flagged value
    ],
    default=np.nan,
)
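For the sample data the expression should evaluate to
array([3.5, 3. , 7. ])
which can be assigned directly with df["Final"] = np.select(...).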

Quite interesting solutions already. I have used a masked-array approach.
Explanation:
With the flags given, it is easy to find which values matter just by multiplying the values by their flags. Thereafter, mask the entries that are zero in each row and take the median over the axis.
>>> import numpy as np
>>> t_arr = np.array((df.A_flag * df.A_value, df.B_flag * df.B_value, df.C_flag * df.C_value)).T
>>> maskArr = np.ma.masked_array(t_arr, mask=(t_arr == 0))
>>> df["Final"] = np.ma.median(maskArr, axis=1)
>>> df
A_flag B_flag C_flag A_value B_value C_value Final
0 1 1 0 5 2 4 3.5
1 1 1 1 3 7 2 3.0
2 1 0 0 7 4 5 7.0
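Note that masking on zero conflates a genuine value of 0 with "not flagged". A variant sketch that builds the mask from the flags themselves avoids that assumption:
>>> flag_arr = df[['A_flag', 'B_flag', 'C_flag']].to_numpy()
>>> value_arr = df[['A_value', 'B_value', 'C_value']].to_numpy()
>>> # mask entries whose flag is 0, then take the row-wise median
>>> maskArr = np.ma.masked_array(value_arr, mask=(flag_arr == 0))
>>> df["Final"] = np.ma.median(maskArr, axis=1)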

Related

pandas read dataframe multi-header values

I have this dataframe with multiple header rows:
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
                  skiprows=0,
                  header=[0, 1, 2, 3],
                  index_col=0,
                  parse_dates=True)
I would like to extract the values of the rows named 'lat' and 'long'.
An easy way could be to read the file in two steps, i.e. ending up with two dataframes. I don't like this because it is not very elegant and does not seem to take advantage of pandas' potential. I believe I could use some feature of the MultiIndex.
What do you think?
You can use get_level_values:
dfr = pd.read_csv(f_name, skiprows=0, header=[0, 1, 2, 3], index_col=0,
                  parse_dates=[0], skipinitialspace=True)
lat = dfr.columns.get_level_values('lat').astype(int)
long = dfr.columns.get_level_values('long').astype(int)
elv = dfr.columns.get_level_values('elv').astype(int)
Output:
>>> lat.to_list()
[613297, 626278, 626323, 616720]
>>> long.to_list()
[5185127, 5188418, 5188431, 5181393]
>>> elv.to_list()
[1833, 1915, 1915, 1499]
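If you also want that metadata as its own small frame, the level values can be combined; a sketch (assuming the first header row is named 'name', as in the sample file):
meta = pd.DataFrame({'lat': lat, 'long': long, 'elv': elv},
                    index=dfr.columns.get_level_values('name'))
>>> meta
            lat     long   elv
00590BL  613297  5185127  1833
01090BL  626278  5188418  1915
01100MS  626323  5188431  1915
02200MS  616720  5181393  1499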
If you only need the first header row as column labels, use droplevel:
df = dfr.droplevel(['lat', 'long', 'elv'], axis=1).rename_axis(columns=None)
print(df)
# Output
00590BL 01090BL 01100MS 02200MS
1956-01-01 1 2 2 -2
1956-01-02 2 3 3 -1
1956-01-03 3 4 4 0
1956-01-04 4 5 5 1
1956-01-05 5 6 6 2
Note that with header=[0, 1, 2, 3] the 'lat' and 'long' labels end up as names of the column MultiIndex levels, not as row labels. If you instead read the file with a single header row (header=0, index_col=0), 'lat' and 'long' become rows, and you can select them by label with .loc. For example, to extract the 'lat' values:
lat_values = dfr.loc['lat']
And similarly for the 'long' values:
long_values = dfr.loc['long']
Alternatively, you can use the .xs method to take a cross-section at the desired label:
lat_values = dfr.xs('lat', axis=0)
long_values = dfr.xs('long', axis=0)
Both approaches extract the 'lat' and 'long' rows while keeping everything in one dataframe.

In pandas, how to reindex(fill 0) in level 2 in multiindex

I have a dataframe pivot table with a 2-level index: month and rating. The rating should be 1, 2, 3 (not to be confused with the columns 1, 2, 3). I found that for some months a rating can be missing; e.g. 'Population' and '2021-10' only have ratings 1 and 2. I need every month to have ratings 1, 2, 3, so I need to fill 0 for the missing rating index.
tbl = pd.pivot_table(self.df, values=['ID'], index=['month', 'risk'],
                     columns=["Factor"], aggfunc='count', fill_value=0)
tbl = tbl.droplevel(None, axis=1).rename_axis(None, axis=1).rename_axis(
    index={'month': None, 'risk': 'Client Risk Rating'})
# show Low for rating 1, Moderate for rating 2, Potential High for rating 3
rating = {1: 'Low',
          2: 'Moderate',
          3: 'Potential High'}
pop = {'N': 'Refreshed Clients', 'Y': 'Population'}
tbl.rename(index={**rating, **pop}, inplace=True)
tbl = tbl.applymap(lambda x: x.replace(',', '')).astype(np.int64)
tbl = tbl.div(tbl.sum(axis=1), axis=0)
# client risk rating may be missing (e.g., only 1,2).
# To draw, need to fill the missing client risk rating with 0
print("before",tbl)
tbl=tbl.reindex(pd.MultiIndex.from_product(tbl.index.levels), fill_value=0)
print("after pd.MultiIndex.from_product",tbl)
I have used pd.MultiIndex.from_product, but it does not work when one rating is missing from all of the data. For example, 'Population' has only Moderate, and 2021-03 and 2021-04 have Low and Moderate. After pd.MultiIndex.from_product, 'Population' gains Low and Moderate, but every group is still missing High. My question is how to get every month with ratings 1, 2, 3; it seems the index values are taken only from the data.
You can use pd.MultiIndex.from_product to create a full index:
>>> df
1 2 3
(Population) 1 0.436954 0.897747 0.387058
2 0.464940 0.611953 0.133941
2021-08(Refreshed) 1 0.496111 0.282798 0.048384
2 0.163582 0.213310 0.504647
3 0.008980 0.651175 0.400103
>>> df.reindex(pd.MultiIndex.from_product(df.index.levels), fill_value=0)
1 2 3
(Population) 1 0.436954 0.897747 0.387058
2 0.464940 0.611953 0.133941
3 0.000000 0.000000 0.000000 # New record
2021-08(Refreshed) 1 0.496111 0.282798 0.048384
2 0.163582 0.213310 0.504647
3 0.008980 0.651175 0.400103
Update
I wonder why df=df.reindex([1,2,3], level='rating', fill_value=0) doesn't work: the new index values [1,2,3] cannot fill the missing values of the previous rating index. By using from_product, it creates the product of the two indexes.
In fact it does work, in the sense that it has an effect, just not the one you expect. The method reindexes the level, not the values. Let me show you:
# It seems there is no effect because you don't see 3 and 4 as expected
>>> df.reindex([1, 2, 3, 4], level='ratings')
0 1 2
ratings
(Population) 1 0.536154 0.671380 0.839362
2 0.729484 0.512379 0.440018
2021-08(Refreshed) 1 0.279990 0.295757 0.405536
2 0.864217 0.798092 0.144219
3 0.214566 0.407581 0.736905
# But yes something happens
>>> df.reindex([1, 2, 3, 4], level='ratings').index.levels
FrozenList([['(Population)', '2021-08(Refreshed)'], [1, 2, 3, 4]])
The level has been reindexed ---^
# It's different from values
>>> df.reindex([1, 2, 3, 4], level='ratings').index.get_level_values('ratings')
Int64Index([1, 2, 1, 2, 3], dtype='int64', name='ratings')
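Note that if a rating never occurs anywhere in the data (the situation described in the question), it will not appear in index.levels either, so from_product has nothing to add. In that case, spell the full rating set out explicitly; a sketch, assuming the ratings are always 1, 2, 3:
full = pd.MultiIndex.from_product([df.index.levels[0], [1, 2, 3]],
                                  names=df.index.names)
df = df.reindex(full, fill_value=0)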

Filtering a column of lists of strings in a Pandas DataFrame

df = pd.DataFrame({'sym': ['A', 'B', 'C', 'D'],
                   'event': [['1', '2', '3'], ['1'], ['2', '3'], ['2']]})
df
sym event
0 A [1, 2, 3]
1 B [1]
2 C [2, 3]
3 D [2]
The event column is made up of lists of strings. I am trying to filter the event column for any rows whose list contains '3', so here I am looking for rows 0 and 2.
I know I can use
"3" in df.event[0]
for each row, and I think a lambda function would push me over the finish line.
Please try:
print(df[df.event.astype(str).str.contains(r'\b3\b')])
sym event
0 A [1, 2, 3]
2 C [2, 3]
Use Series.explode to split the list-like values into rows. To check whether any row contains '3':
'3' in df['event'].explode().values
To find which rows contain '3', use the index:
idx = df['event'].explode() == '3'
df.loc[idx[idx].index]
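For the sample frame this should select:
  sym      event
0   A  [1, 2, 3]
2   C     [2, 3]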
Let us try:
out = df[pd.DataFrame(df.event.tolist()).isin(['3']).any(axis=1).values]
Out[78]:
sym event
0 A [1, 2, 3]
2 C [2, 3]

Select rows where number can be found in list

Given the following data
I hope to select the rows where num appears in list. In this case it will select the first two rows; the third row is not selected since 3 can't be found in [4, 5].
Following is the dataframe, how should we write the filter query?
cat1 = pd.DataFrame({"num": [1, 2, 3],
                     "list": [[1, 2, 3], [3, 2], [4, 5]]})
One possible solution with list comprehension, zip and in passed to boolean indexing:
df = cat1[[a in b for a, b in zip(cat1.num, cat1.list)]]
Or solution with DataFrame.apply with axis=1 for processing per rows:
df = cat1[cat1.apply(lambda x: x.num in x.list, axis=1)]
Or create DataFrame and test membership:
df = cat1[pd.DataFrame(cat1.list.tolist()).isin(cat1.num).any(axis=1)]
print (df)
num list
0 1 [1, 2, 3]
1 2 [3, 2]
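The third option works because DataFrame.isin with a Series aligns on the index, so each list element in row i is compared against num[i]. Printing the intermediate mask (a quick check, not part of the solution) makes that visible:
print(pd.DataFrame(cat1.list.tolist()).isin(cat1.num))
       0      1      2
0   True  False  False
1  False   True  False
2  False  False  False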
A different solution, if you are using pandas 0.25+, is explode():
cat1[cat1['num'].isin(cat1.explode('list').query("num == list").loc[:, 'num'])]
   num       list
0    1  [1, 2, 3]
1    2     [3, 2]

Access Row Based on Column Value

I have the following pandas dataframe:
data = {'ID': [1, 2, 3], 'Neighbor': [3, 1, 2], 'x': [5, 6, 7]}
Now I want to create a new column 'y' which, for each row, holds the value of x from the row referenced by the Neighbor column (i.e. the row whose ID equals this row's Neighbor). E.g. for row 0 (ID 1), Neighbor is 3, thus y should be 7.
So the resulting dataframe should have the column y = [7, 5, 6].
Can I solve this without using df.apply? (It is rather time-consuming for my big dataframes.)
I would like to use something like
df.loc[:, 'y'] = df.loc[df.Neighbor.eq(df.ID), 'x']
but this returns NaN.
We can build a dict from your ID and x columns, then map it into your new column:
your_dict_ = dict(zip(df['ID'], df['x']))
print(your_dict_)
{1: 5, 2: 6, 3: 7}
Then we can use .map to fill your new column, using the Neighbor column as the key:
df['Y'] = df['Neighbor'].map(your_dict_)
print(df)
ID Neighbor x Y
0 1 3 5 7
1 2 1 6 5
2 3 2 7 6
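The intermediate dict can also be skipped by mapping through an ID-indexed Series, which performs the same lookup in one step (a sketch):
df['y'] = df['Neighbor'].map(df.set_index('ID')['x'])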