Apply Process Function to Groups in a Dataframe - pandas

I've got a dataframe, like this:
df_1 = pd.DataFrame({'X': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                     'Y': [1, 0, 1, 1, 0, 0, 'Nan']})
I would like to group it by X and create a column Z:
df_2 = pd.DataFrame({'X': ['A', 'B'],
                     'Z': [0.5, 0.5]})
The part that is difficult to describe is that I would then like to apply this function:
def fun(Y, Z):
    if Y == 1:
        Z = Z + 1
    elif Y == 0:
        Z = Z - 1
So the first Y in df_1 is a 1 in group A, so the Z for group A increases to 1.5. The next one is a 0, so it goes back to 0.5; then there are two more 1's, so it ends up at 2.5.
Which would give me:
X Z
A 2.5
B -1.5

You can remap the values in your first DataFrame and use sum with index alignment:
1 -> 1 (add 1 when a one is found)
0 -> -1 (subtract 1 when a zero is found)
NaN -> 0 (unchanged when a NaN is found)
df_1['Y'] = pd.to_numeric(df_1.Y, errors='coerce')  # turn the string 'Nan' into a real NaN
u = df_1.assign(Z=df_1.Y.replace({0: -1, np.nan: 0})).groupby('X')['Z'].sum().to_frame()
df_2 = df_2.set_index('X') + u
     Z
X
A  2.5
B -1.5
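A minimal one-expression sketch of the same idea (a map-based variant, not the answer's exact code; it assumes pandas is imported and uses the starting value 0.5 from df_2):
import pandas as pd

df_1 = pd.DataFrame({'X': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                     'Y': [1, 0, 1, 1, 0, 0, 'Nan']})
df_1['Y'] = pd.to_numeric(df_1['Y'], errors='coerce')

# map 1 -> +1 and 0 -> -1; NaN falls through map and becomes 0 via fillna
delta = df_1['Y'].map({1: 1, 0: -1}).fillna(0)
df_2 = (delta.groupby(df_1['X']).sum() + 0.5).rename('Z').reset_index()
print(df_2)
# expected: A -> 2.5, B -> -1.5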

Related

Why doesn't this iteration correctly change the global DF variables?

I have code similar to the example below for my job, and I don't know why the nested for loop over arrays doesn't correctly change the global DF variables.
>> df1 = pd.DataFrame({
>>     'x': [1, 2, 3, 4, 5],
>>     'y': ['a', 'b', 'c', 'd', 'e']
>> })
>> df2 = df1
>> for array in [[df1, 9], [df2, 'z']]:
>>     array[0]['x'] = array[1]
>>     array[0]['y'] = array[1]
>>     print(array[0])
x y
0 9 9
1 9 9
2 9 9
3 9 9
4 9 9
x y
0 z z
1 z z
2 z z
3 z z
4 z z
>> print(df1)
x y
0 z z
1 z z
2 z z
3 z z
4 z z
>> print(df2)
x y
0 z z
1 z z
2 z z
3 z z
4 z z
So in each iteration we see the expected changes: df1 with 9 in both columns and df2 with z in both columns.
But then when we check the global variables we see everything as z, even df1, and I don't know why.
When an object in Python is mutable, assignment copies the reference, not the value. For example, int and str are immutable types, while list, dict and pandas.DataFrame are mutable. The example below shows what this means for int and list:
a = 1
b = a
b += 1
print(a)
# >> 1

x = [1, 2, 3]
y = x
y.append(4)
print(x)
# >> [1, 2, 3, 4]
So when you assigned df2, you bound it to the exact same object that df1 was referring to. That means that when you change df2, you also change the object referred to by df1, because it is physically the same object. You can check this with the built-in id() function:
df1 = pd.DataFrame({'x': [1,2,3,4,5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1
print(id(df1), id(df2))
# >> 4695746416 4695746416
To have a new copy of the same dataframe, you need to use copy():
df1 = pd.DataFrame({'x': [1,2,3,4,5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1.copy()
print(id(df1), id(df2))
# >> 4695749728 4695742816
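To get the behaviour the question expected, a minimal sketch is to make df2 an independent copy before the loop (the same loop as in the question, just unpacked into clearer names):
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1.copy()  # a separate object, not a second name for the same one

for frame, value in [(df1, 9), (df2, 'z')]:
    frame['x'] = value
    frame['y'] = value

print(df1)  # both columns now hold 9
print(df2)  # both columns now hold 'z'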

How to create a new column based on row values in python?

I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create output like the following: basically, if 25%_coll is 1 for a row, a new column based on that row's month should be created.
Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with index columns month and collection_amount.
Set the appropriate collection_amount values to 0.
Build the new column names in column new_cols.
month collection_amount variable value new_cols
0 4 100 25%_coll 1 month_4_25%_coll
1 5 0 25%_coll 0 month_5_25%_coll
2 6 300 25%_coll 1 month_6_25%_coll
3 4 0 75%_coll 0 month_4_75%_coll
4 5 200 75%_coll 1 month_5_75%_coll
5 6 300 75%_coll 1 month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (with .reset_index() without drop=True); see the sketch after the result below.
Result for the sample dataframe
df = pd.DataFrame({
"collection_amount": [100, 200, 300],
"25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
"month": [4, 5, 6]
})
is
new_cols month_4_25%_coll month_4_75%_coll month_5_25%_coll \
0 100 0 0
1 0 0 0
2 0 0 0
new_cols month_5_75%_coll month_6_25%_coll month_6_75%_coll
0 0 0 0
1 200 0 0
2 0 300 300
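As a self-contained sketch of the concat option mentioned above (using the sample dataframe, and assuming the positional row order of df and df2 should line up), attaching the new columns to the original frame could look like:
import pandas as pd

df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})

df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)

# rows of df2 are in month order, matching df, so a column-wise concat aligns them
out = pd.concat([df, df2], axis=1)
print(out)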

Fill zeroes with increment of the max value

I have the following dataframe
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 0}, {'id':'d', 'val':0}])
What I want is to replace each 0 with an increment above the max value (the first zero becomes max + 1, the next max + 2, and so on).
The result I want is as follows:
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 3}, {'id':'d', 'val':4}])
I tried the following:
for _, r in df.iterrows():
    if r.val == 0:
        r.val = df.val.max() + 1
However, is there a one-line way to do the above?
Filter the 0 rows with boolean indexing and DataFrame.loc, and assign a range whose length is the count of True values in the condition, shifted by the maximum value plus 1 (the plus 1 is needed because ranges count from 0):
df.loc[df['val'].eq(0), 'val'] = np.arange(df['val'].eq(0).sum()) + df.val.max() + 1  # np.arange, because a plain range cannot be added to an int
print (df)
id val
0 a 1
1 b 2
2 c 3
3 d 4
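A pure-pandas alternative sketch (no numpy required) numbers the zeros with cumsum over the boolean mask:
import pandas as pd

df = pd.DataFrame([{'id': 'a', 'val': 1}, {'id': 'b', 'val': 2},
                   {'id': 'c', 'val': 0}, {'id': 'd', 'val': 0}])

m = df['val'].eq(0)
# m.cumsum() labels the zero rows 1, 2, ..., so they become max+1, max+2, ...
df.loc[m, 'val'] = df['val'].max() + m.cumsum()[m]
print(df)
#   id  val
# 0  a    1
# 1  b    2
# 2  c    3
# 3  d    4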

Sort specific values from a column in a pandas data frame

I have a data frame, for example:
df = ID aa_len aa_seq \
0 001 45 [M, R, S, R, Y, P, L, L, R, G, E, A, V, A, V, ...
1 002 45 [M, R, S, R, Y, P, L, L, R, G, E, A, V, A, V, ...
mut_position
0 [-1]
1 [5, 94, 95, 132]
The "mut_position" can be -1 or other non negative number (2,3,4) or a list of few numbers.
for example it can be -1 as in 001. a list of a few like in 002 or one number- for example 4.
i need to count the number of subjects who doesnt have -1.
i tried to so that by comparing to -1 and collect the ones that r different but it dosent seems to work...
def count_mutations(df, ref_aa_len):
    nomis = -1
    mutation = (df['mut_position']) != nomis
    print(mutation)
What I get is True for both (ignore the ref_aa_len, that comes later):
0 True
1 True
I think you need a list comprehension with a generator and a sum of boolean True values:
df['non_negative'] = [sum(y != -1 for y in x) for x in df['mut_position']]
print (df)
mut_position non_negative
0 [-1] 0
1 [5, 94, 95, 132] 4
If scalars are also possible:
print (df)
mut_position
0 [-1]
1 [5,94,95,132]
2 6
3 -1
df['non_negative'] = [sum(y != -1 for y in x)
                      if isinstance(x, list)
                      else int(x != -1) for x in df['mut_position']]
print (df)
mut_position non_negative
0 [-1] 0
1 [5, 94, 95, 132] 4
2 6 1
3 -1 0
If you need to check only the first value of each list for -1 and filter by boolean indexing:
df = pd.DataFrame({'mut_position':[[-1], [5,94,95,132],[2,-1], [-1]]})
print (df)
mut_position
0 [-1]
1 [5, 94, 95, 132]
2 [2, -1]
3 [-1]
df1 = df[df['mut_position'].str[0] != -1]
print (df1)
mut_position
1 [5, 94, 95, 132]
2 [2, -1]
Detail:
.str[0] works for selecting the first character of a string or the first value of an iterable:
print (df['mut_position'].str[0])
0 -1
1 5
2 2
3 -1
Name: mut_position, dtype: int64
And to check for -1 at any position, use all:
df1 = df[[all(y != -1 for y in x) for x in df['mut_position']]]
print (df1)
mut_position
1 [5, 94, 95, 132]
The list comprehension returns a boolean list:
print ([all(y != -1 for y in x) for x in df['mut_position']])
[False, True, False, False]
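Since the question ultimately asks for the count of subjects without a -1, a short sketch building on that same boolean list (assuming every mut_position entry is a list, as in the last example) is:
import pandas as pd

df = pd.DataFrame({'mut_position': [[-1], [5, 94, 95, 132], [2, -1], [-1]]})

# True where no position equals -1; summing booleans counts those subjects
print(sum(all(y != -1 for y in x) for x in df['mut_position']))
# 1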

Getting count of rows from breakpoints of different column

Consider two columns A and B in a dataframe. How can I decile column A and use the breakpoints of those deciles to calculate the count of rows in column B?
import pandas as pd
import numpy as np
df = pd.read_excel(r"E:\Sai\Development\UCG\qcut.xlsx")  # raw string, otherwise backslash sequences like \U are treated as escapes
df['Range']=pd.qcut(df['a'],10)
df_gb=df.groupby('Range',as_index=False).agg({'a':[min,max,np.size]})
df_gb.columns = df_gb.columns.droplevel()
df_gb=df_gb.rename(columns={'':'Range','size':'count_A'})
df['Range_B']=0
df['Range_B'].loc[df['b']<=df_gb['max'][0]]=1
df['Range_B'].loc[(df['b']>df_gb['max'][0]) & (df['b']<=df_gb['max'][1])]=2
df['Range_B'].loc[(df['b']>df_gb['max'][1]) & (df['b']<=df_gb['max'][2])]=3
df['Range_B'].loc[(df['b']>df_gb['max'][2]) & (df['b']<=df_gb['max'][3])]=4
df['Range_B'].loc[(df['b']>df_gb['max'][3]) & (df['b']<=df_gb['max'][4])]=5
df['Range_B'].loc[(df['b']>df_gb['max'][4]) & (df['b']<=df_gb['max'][5])]=6
df['Range_B'].loc[(df['b']>df_gb['max'][5]) & (df['b']<=df_gb['max'][6])]=7
df['Range_B'].loc[(df['b']>df_gb['max'][6]) & (df['b']<=df_gb['max'][7])]=8
df['Range_B'].loc[(df['b']>df_gb['max'][7]) & (df['b']<=df_gb['max'][8])]=9
df['Range_B'].loc[df['b']>df_gb['max'][8]]=10
df_gb_b=df.groupby('Range_B',as_index=False).agg({'b':np.size})
df_gb_b=df_gb_b.rename(columns={'b':'count_B'})
df_final = pd.concat([df_gb, df_gb_b], axis=1)
df_final=df_final[['Range','count_A','count_B']]
Is there any simpler solution, as I intend to do this for many columns?
I hope this would help:
df['Range'] = pd.qcut(df['a'], 10)
df2 = df.groupby(['Range'])['a'].count().reset_index().rename(columns={'a': 'count_A'})
for item in df2['Range'].values:
    df2.loc[df2['Range'] == item, 'count_B'] = df['b'].apply(lambda x: x in item).sum()
df2 = df2.sort_values('Range', ascending=True)
If you additionally want to count values of b that fall outside the range of a:
min_border = df2['Range'].values[0].left
max_border = df2['Range'].values[-1].right
df2.loc[0, 'count_B'] += df.loc[df['b'] <= min_border, 'b'].count()
df2.iloc[-1, 2] += df.loc[df['b'] > max_border, 'b'].count()
One way -
df = pd.DataFrame({'A': np.random.randint(0, 100, 20), 'B': np.random.randint(0, 10, 20)})
bins = [0, 1, 4, 8, 16, 32, 60, 100, 200, 500, 5999]
labels = ["{0} - {1}".format(i, j) for i, j in zip(bins, bins[1:])]
df['group_A'] = pd.cut(df['A'], bins, right=False, labels=labels)
df['group_B'] = pd.cut(df.B, bins, right=False, labels=labels)
df1 = df.groupby(['group_A'])['A'].count().reset_index()
df2 = df.groupby(['group_B'])['B'].count().reset_index()
df_final = (pd.merge(df1, df2, left_on=['group_A'], right_on=['group_B'])
            .drop(['group_B'], axis=1)
            .rename(columns={'group_A': 'group'}))
print(df_final)
Output
group A B
0 0 - 1 0 1
1 1 - 4 1 3
2 4 - 8 1 9
3 8 - 16 2 7
4 16 - 32 3 0
5 32 - 60 7 0
6 60 - 100 6 0
7 100 - 200 0 0
8 200 - 500 0 0
9 500 - 5999 0 0
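A shorter sketch for the original decile question (column names a and b as in the question; this is an alternative, not either answer's exact code) lets pd.qcut hand back its breakpoints via retbins=True and reuses them with pd.cut on b:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 1000, 100),
                   'b': np.random.randint(0, 1000, 100)})

# qcut assigns the decile of a and returns the breakpoints via retbins=True
df['Range'], bins = pd.qcut(df['a'], 10, retbins=True)
# reuse the exact same breakpoints to bucket b
# (values of b outside a's range become NaN and are not counted)
df['Range_B'] = pd.cut(df['b'], bins, include_lowest=True)

# value_counts(sort=False) on a categorical lists every bucket in order,
# so the two count arrays line up positionally
df_final = pd.DataFrame({
    'count_A': df['Range'].value_counts(sort=False).to_numpy(),
    'count_B': df['Range_B'].value_counts(sort=False).to_numpy(),
}, index=df['Range'].cat.categories)
print(df_final)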