pandas cut by the mount column - pandas

I have a pandas DataFrame (x) with two columns: sum and value. sum is the number of records that have the given value. For example:
sum value
2 3
4 1
means 2 records have value 3 and 4 records have value 1.
What I want to do is sort by value and then cut [1,1,1,1,3,3] into 3 parts: [1,1], [1,1], [3,3].
How can I cut the values into 3 parts so that each part has an equal number of records?
pandas.cut can't take the sum column into consideration.

I think you can use cumsum with double numpy.where:
sumall = df['sum'].sum()
df = df.sort_values(by='value')
df['sum_sum'] = df['sum'].cumsum()
df['tag'] = np.where(df['sum_sum'] < sumall / 3, 0,
np.where(df['sum_sum'] < 2 * sumall / 3, 1, 2) )
print (df)
sum value sum_sum tag
1 4 1 4 2
0 2 3 6 2
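Note that the tags above are assigned per row, so a row whose count spans several parts (here the row with sum 4) lands in a single part. If the goal is to split at the level of individual records, as in [1,1], [1,1], [3,3], one possible sketch is to expand each row by its count and assign parts by position. The names expanded, n_parts and part are introduced here for illustration, and the approach assumes the expanded data fits in memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sum": [2, 4], "value": [3, 1]})
n_parts = 3  # assumed number of parts, as in the question

# Sort by value, then repeat every row as many times as its count says.
df = df.sort_values("value").reset_index(drop=True)
expanded = df.loc[df.index.repeat(df["sum"]), ["value"]].reset_index(drop=True)

# Record i (0-based) out of n records goes to part floor(i * n_parts / n).
expanded["part"] = np.arange(len(expanded)) * n_parts // len(expanded)

parts = expanded.groupby("part")["value"].apply(list).tolist()
print(parts)  # [[1, 1], [1, 1], [3, 3]]
```

When the total number of records is not divisible by n_parts, the parts differ in size by at most one record.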

This works for me, but it's ugly:
total = df['sum'].sum()  # named total to avoid shadowing the built-in sum()
def func(x):
    if x < total / 3:
        return 0
    elif x < 2 * total / 3:
        return 1
    return 2
df = df.sort_values(by='value')
df['sum_sum'] = np.cumsum(df['sum'].values)
df['tag'] = df['sum_sum'].apply(func)
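Both answers hard-code three parts. As a sketch, the same thresholds can be written with integer arithmetic on the cumulative sum, so the number of parts becomes a parameter. n_parts is a name assumed here, and the result is the same per-row tagging as in the answers, not a record-level split.

```python
import pandas as pd

df = pd.DataFrame({"sum": [2, 4], "value": [3, 1]})
n_parts = 3  # assumed parameter replacing the hard-coded 3

df = df.sort_values(by="value")
total = df["sum"].sum()
df["sum_sum"] = df["sum"].cumsum()

# For integer counts, sum_sum < k * total / n_parts is the same test as
# sum_sum * n_parts // total < k, so the floor division yields the tag.
# The last row would get n_parts itself, hence the clip.
df["tag"] = (df["sum_sum"] * n_parts // total).clip(upper=n_parts - 1)
print(df)
```

On the two-row example this gives tag 2 for both rows, matching the output of the numpy.where version.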

Related

Duplicate row and add string

I wish to duplicate each row of a pandas DataFrame and append a string to the id of each copy, while keeping the rest of the data intact:
I_have = pd.DataFrame({'id': ['a', 'b', 'c'], 'my_data': [1, 2, 3]})
I want:
id my_data
a 1
a_dup1 1
a_dup2 1
b 2
b_dup1 2
b_dup2 2
c 3
c_dup1 3
c_dup2 3
I could do this with 1) iterrows() or 2) three copies of the existing data and appending, but hopefully there is a more Pythonic way to do this.
This seems to work:
tmp1 = I_have.copy(deep=True)
tmp2 = I_have.copy(deep=True)
tmp1['id'] = tmp1['id']+'_dup1'
tmp2['id'] = tmp2['id']+'_dup2'
pd.concat([I_have, tmp1, tmp2])
Use Index.repeat with DataFrame.loc to duplicate the rows, build a counter with numpy.tile, then use Series.mask to append the suffix only where the counter is not 0:
N = 3
n_rows = len(df)  # number of original rows, taken before repeating
df = df.loc[df.index.repeat(N)].reset_index(drop=True)
a = np.tile(np.arange(N), n_rows)  # 0, 1, ..., N-1 for every original row
df['id'] = df['id'].mask(a != 0, df['id'] + '_dup' + a.astype(str))
# alternative solution
# df.loc[a != 0, 'id'] = df['id'] + '_dup' + a.astype(str)
print (df)
id my_data
0 a 1
1 a_dup1 1
2 a_dup2 1
3 b 2
4 b_dup1 2
5 b_dup2 2
6 c 3
7 c_dup1 3
8 c_dup2 3
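For reference, a self-contained sketch of the same repeat-and-suffix idea, starting from the question's I_have frame. It derives the counter from the row position modulo N and uses Series.where instead of Series.mask. The names out and counter are introduced here for illustration.

```python
import numpy as np
import pandas as pd

I_have = pd.DataFrame({"id": ["a", "b", "c"], "my_data": [1, 2, 3]})
N = 3  # total copies of each row, including the original

out = I_have.loc[I_have.index.repeat(N)].reset_index(drop=True)

# 0 for the original row, 1..N-1 for its duplicates.
counter = pd.Series(np.arange(len(out)) % N, index=out.index)

# Keep the id where counter == 0, otherwise append the suffix.
out["id"] = out["id"].where(counter == 0, out["id"] + "_dup" + counter.astype(str))
print(out)
```

Because the counter depends only on the position within each block of N rows, this works for any number of input rows.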

get first row in a group and assign values

I have a pandas dataframe in the below format
id name value_1 value_2
1 def 1 0
2 abc 0 1
I would need to sort the above dataframe based on id, name, value_1 & value_2. Following that, for every group of [id,name,value_1,value_2], get the first row and set df['result'] = 1. For the other rows in that group, set df['result'] = 0.
I do the sorting and get the first row using the below code:
df = df.sort_values(["id","name","value_1","value_2"], ascending=True)
first_row_per_group = df.groupby(["id","name","value_1","value_2"]).agg('first')
After getting the first row, I set first_row_per_group['result'] = 1, but I am not sure how to set the other (non-first) rows to 0.
Any suggestions would be appreciated.
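duplicated would be faster than groupby. Pass all four key columns so that a row counts as a duplicate only when the whole key repeats, not just the id:
df = df.sort_values(['id', 'name', 'value_1', 'value_2'])
df['result'] = (~df.duplicated(['id', 'name', 'value_1', 'value_2'])).astype(int)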
Use df.groupby(...).cumcount() to get a counter of rows within each group, which you can then manipulate.
In [51]: df
Out[51]:
a b c
0 def 1 0
1 abc 0 1
2 def 1 0
3 abc 0 1
In [52]: df2 = df.sort_values(['a','b','c'])
In [53]: df2['result'] = df2.groupby(['a', 'b', 'c']).cumcount()
In [54]: df2['result'] = np.where(df2['result'] == 0, 1, 0)
In [55]: df2
Out[55]:
a b c result
1 abc 0 1 1
3 abc 0 1 0
0 def 1 0 1
2 def 1 0 0
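To compare the two suggestions side by side, here is a small sketch using the question's column names on made-up data. It flags the first row of every [id, name, value_1, value_2] group with DataFrame.duplicated and checks that groupby(...).cumcount() gives the same answer. The sample values and the name via_cumcount are assumptions for illustration.

```python
import pandas as pd

keys = ["id", "name", "value_1", "value_2"]
df = pd.DataFrame({
    "id": [2, 1, 2, 1, 2],
    "name": ["abc", "def", "abc", "def", "xyz"],
    "value_1": [0, 1, 0, 1, 0],
    "value_2": [1, 0, 1, 0, 1],
})

df = df.sort_values(keys)

# First occurrence of each full key -> 1, later occurrences -> 0.
df["result"] = (~df.duplicated(subset=keys)).astype(int)

# Same flag via the within-group row counter.
via_cumcount = (df.groupby(keys).cumcount() == 0).astype(int)
print(df)
```

The (2, xyz) row shares its id with other rows but has its own key, so it gets result 1. Checking id alone would mark it 0.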

Filter rows from subsets of a Pandas DataFrame efficiently

I have a DataFrame consisting of medical data where the columns are ["Patient_ID", "Code", "Date"], where "Code" just represents some medical interaction that patient "Patient_ID" had on "Date". Any patient will generally have more than one row, since they have more than one interaction. I want to apply two types of filtering to this data:
1. Remove any patients who have fewer than some min_len interactions.
2. To each patient, apply a half-overlapping sliding window of length T days. Within each window, keep only the first of any duplicate codes, and then shuffle the codes within the window.
So I need to modify subsets of the overall dataframe, but the modification involves changing the size of the subset. I have both of these implemented as part of a larger pipeline; however, they are a significant bottleneck in terms of time. I'm wondering if there's a more efficient way to achieve the same thing, as I really just threw together what worked and I'm not too familiar with the efficiency of pandas operations. Here is how I have them currently:
import datetime as dt
import numpy as np
import pandas as pd
from tqdm import tqdm
def Filter_by_length(df, min_len = 1):
    print("Filtering short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
def shuffle_col(df, col):
    df[col] = np.random.permutation(df[col])
    return df
def Filter_by_redundancy(df, T, min_len = 1):
    print("Filtering redundant concepts and short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        start_date = sub_df.DATE.min()
        end_date = sub_df.DATE.max()
        next_date = start_date + dt.timedelta(days = T)
        while start_date <= end_date:
            sub_df = pd.concat([sub_df[sub_df.DATE < start_date],\
                shuffle_col(sub_df[(sub_df.DATE <= next_date) & (sub_df.DATE >= start_date)]\
                    .drop_duplicates(subset = ['CODE']), "CODE"),\
                sub_df[sub_df.DATE > next_date]], sort = False )
            start_date += dt.timedelta(days = int(T/2))
            next_date += dt.timedelta(days = int(T/2))
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
As you can see, in the second case I am actually applying both filters, because it is important to have the option to apply both together or either one on its own, but I am interested in any performance improvement that can be made to either one or both.
For the first part, instead of counting in your group-by like that, I would use this approach:
>>> d = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q': [np.random.randint(1, 15, size=np.random.randint(1, 5)) for _ in range(5)]}).explode('q')
id q
0 1 1
0 1 9
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
3 4 11
3 4 5
4 5 5
4 5 6
4 5 3
4 5 2
>>> sizes = d.groupby('id').size()
>>> d[d['id'].isin(sizes[sizes >= 3].index)] # index is list of IDs meeting criteria
id q
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
4 5 5
4 5 6
4 5 3
4 5 2
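The same length filter can also be written as a boolean mask with groupby(...).transform("size"), which avoids the per-patient Python loop. This is a minimal sketch using the ID column name from the question's code; the function name filter_by_length, the id_col parameter and the sample data are assumptions for illustration.

```python
import pandas as pd

def filter_by_length(df, min_len=1, id_col="ID"):
    """Keep only rows whose ID occurs at least min_len times."""
    # transform("size") returns, for every row, the size of its group.
    counts = df.groupby(id_col)[id_col].transform("size")
    return df[counts >= min_len]

df = pd.DataFrame({"ID": [1, 1, 2, 3, 3, 3], "CODE": list("abcdef")})
kept = filter_by_length(df, min_len=2)
print(kept)
```

Unlike the loop version, this keeps the original row order and does not sort by ID and DATE. Sort beforehand if the later steps rely on that order.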
I'm not sure why you want to shuffle your codes within some window. To avoid an X-Y problem, what are you in fact trying to do there?

Pandas Dataframe: split column into multiple columns

I need to break up a column in a DataFrame that at present collects multiple values (someone else's Excel sheet, unfortunately) for a categorical data field that can have multiple values.
As you can see below, the column has 15 category codes, shown in the column header.
Original DataFrame
I want to split the column based on the category codes in the column header ['Pamphlet'] and then map the values collected for each record in the original column to their respective new columns as 1 for checked and 0 for unchecked, instead of the raw value [1,2,4,5].
This is the code to split on the , between values, but I need to put the results into the new columns that I set up by splitting the column ['Pamphlet'] by the values in the header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].
df_old['Pamphlets'].str.split(pat = ',', n = -1, expand = True)
Shape of desired DataFrame
If I could just get an outline of the best approach, or whether it is even possible to do this within pandas, that would be great. Thanks.
You need to go through your columns one by one and split the headers, then create a new dataframe for each column made up of the split columns, then join all of that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First let's define the dataframe. (It would be much appreciated if, in future questions, you supplied a reproducible dataframe and any other data.)
data = {
"1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
"1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now, let's build the result manually for the first column. Afterwards we'll turn these steps into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0 2
1 1
2 3
3 2
4 3
5 1
6 3
7 2
8 3
9 1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
1) Mail 2) Email 3) At PAC/TPAC
0 2 2 2
1 1 1 1
2 3 3 3
3 2 2 2
4 3 3 3
5 1 1 1
6 3 3 3
7 2 2 2
8 3 3 3
9 1 1 1
Create a copy to make changes to the values.
df_res = df.copy()
Go through the column headers, get the leading number, then set each cell to 1 where it matches that number and 0 otherwise.
for col in df.columns:
    value = pd.to_numeric(col[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0
print(df_res)
1) Mail 2) Email 3) At PAC/TPAC
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 1 0
8 0 0 1
9 1 0 0
Now we have split a column into its components and assigned a bool value.
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res
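One caveat: col[0] reads only the first character of the header, so a code of 10 or more (the question mentions 15 categories) would be read as 1. A hedged variant is sketched here that parses everything before the ")" and builds the indicator columns in a single comparison. The name split_column_multi_digit and the "10) Other" label are hypothetical, and the sketch still assumes each cell holds a single integer code.

```python
import numpy as np
import pandas as pd

def split_column_multi_digit(s_col):
    # Split the header into cleaned labels such as "1) Mail".
    labels = [part.strip() for part in s_col.name.split("\n")]
    # Parse the whole number before ")" rather than only the first character.
    codes = np.array([int(label.split(")")[0]) for label in labels])
    # Compare every cell against every code: shape (n_rows, n_codes).
    hits = s_col.to_numpy()[:, None] == codes
    return pd.DataFrame(hits.astype(int), columns=labels, index=s_col.index)

# "10) Other" is a made-up label to exercise a two-digit code.
s = pd.Series([2, 1, 3, 10], name="1) Mail\n2) Email \n3) At PAC/TPAC\n10) Other")
res = split_column_multi_digit(s)
print(res)
```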
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
1) Mail 2) Email 3) At PAC/TPAC 1) ACC 2) IM 3) PT 4) Smoking, 5) Cessation
0 0 1 0 0 0 0 0 1
1 1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0
4 0 0 1 0 1 0 0 0
5 1 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
7 0 1 0 0 0 0 1 0
8 0 0 1 0 0 1 0 0
9 1 0 0 0 1 0 0 0
Here is the full code:
import pandas as pd
data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res
for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
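The walkthrough above assumes each cell holds a single code. If the cells instead hold comma-separated codes such as "1,2,4,5", as the question describes, Series.str.get_dummies may be a shorter route. This is a sketch under that assumption: the three labels come from the question's header, while the cell values and the names header, labels, cells and dummies are made up for illustration.

```python
import pandas as pd

header = "1) OSA\n2) Nutrition\n3) Activity"
df_old = pd.DataFrame({header: ["1,2", "3", "1, 3"]})

# Map the code before ")" to the cleaned label, e.g. "1" -> "1) OSA".
labels = {part.split(")")[0].strip(): part.strip() for part in header.split("\n")}

# Drop stray spaces, then build one 0/1 column per code found in the cells.
cells = df_old[header].str.replace(" ", "", regex=False)
dummies = cells.str.get_dummies(sep=",")

# Add codes that never occur, fix the column order, and attach the labels.
dummies = dummies.reindex(columns=list(labels), fill_value=0).rename(columns=labels)
print(dummies)
```

The reindex step matters when some category is never ticked: get_dummies only creates columns for codes that actually appear in the data.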

subset df by masking between specific rows

I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is these values can be at different rows so I can't select fixed rows.
Specifically, I want to remove rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df and be of unequal length.
Note: The string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values.
But mask could work better if I could return all rows between these two values?
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
Val
0 None
8 X
9 1
10 2
15 Y
16 1
17 2
If ABC always comes before 5 and they always occur in (ABC, 5) pairs, get the positions of both with np.where, zip them to build the index values in between, and finally filter with isin, inverting the mask with ~:
# two (ABC, 5) pairs in the data
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'None','None','None',
'None','ABC','None',1,2,3,4,5,'None','None','None']
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
Val
0 None
8 None
9 None
10 None
11 None
19 None
20 None
21 None
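A loop-free variant of the same idea is to count how many blocks have been opened and closed so far with cumsum. This is a sketch run on the question's data; it assumes every ABC is eventually closed by a 5 and that 5 never appears outside a block. The names start, end, inside and result are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Val": ["None", "ABC", "None", 1, 2, 3, 4, 5, "X", 1, 2, "ABC", 1, 4, 5, "Y", 1, 2],
})

# 1 on rows that open a block (contain ABC) or close one (equal 5).
start = df["Val"].astype(str).str.contains("ABC").astype(int)
end = (df["Val"] == 5).astype(int)

# Blocks opened so far minus blocks closed before this row; shifting `end`
# by one row keeps the closing 5 itself inside the block.
inside = (start.cumsum() - end.shift(fill_value=0).cumsum()) > 0

result = df[~inside]
print(result)
```

On this data the remaining index values are 0, 8, 9, 10, 15, 16 and 17, which is the intended output from the question.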
a = df.index[df['Val'].str.contains('ABC')==True][0]
b = df.index[df['Val']==5][0]+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
Output
Val
0 None
8 X
9 1
10 2
If there is more than one 'ABC' and more than one 5, use the version below.
It removes everything from the first ABC through the last 5 and returns the rest of the df:
a = (df['Val'].str.contains('ABC')==True).idxmax()
b = df['Val'].where(df['Val']==5).last_valid_index()+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]