Duplicate row and add string - pandas

I wish to duplicate a Pandas DataFrame's rows, appending a string to the id of each duplicate, while keeping the rest of the data intact:
I_have = pd.DataFrame({'id': ['a', 'b', 'c'], 'my_data': [1, 2, 3]})
I want:
id      my_data
a       1
a_dup1  1
a_dup2  1
b       2
b_dup1  2
b_dup2  2
c       3
c_dup1  3
c_dup2  3
I could do this with 1) iterrows() or 2) making 3 copies of the existing data and appending them, but hopefully there is a more Pythonic way to do this.
This seems to work:
tmp1 = I_have.copy(deep=True)
tmp2 = I_have.copy(deep=True)
tmp1['id'] = tmp1['id']+'_dup1'
tmp2['id'] = tmp2['id']+'_dup2'
pd.concat([I_have, tmp1, tmp2])

Use Index.repeat with DataFrame.loc to duplicate the rows, then build a counter with numpy.tile, and finally append the suffix to the duplicated values (counter not equal to 0) with Series.mask:
N = 3
# repeat every row N times
df = df.loc[df.index.repeat(N)].reset_index(drop=True)
# counter 0, 1, ..., N-1 for each original row
a = np.tile(np.arange(N), len(df) // N)
# append the suffix only where the counter is not 0
df['id'] = df['id'].mask(a != 0, df['id'] + '_dup' + a.astype(str))
# alternative solution
# df.loc[a != 0, 'id'] = df['id'] + '_dup' + a.astype(str)
print(df)
id my_data
0 a 1
1 a_dup1 1
2 a_dup2 1
3 b 2
4 b_dup1 2
5 b_dup2 2
6 c 3
7 c_dup1 3
8 c_dup2 3
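For reference, the same output can also be built without the positional counter by concatenating suffixed copies, generalising the questioner's own two-copy approach to N copies - a minimal sketch, assuming the I_have frame from the question:
import pandas as pd

I_have = pd.DataFrame({'id': ['a', 'b', 'c'], 'my_data': [1, 2, 3]})

# the original plus N-1 suffixed copies, concatenated once
N = 3
parts = [I_have] + [I_have.assign(id=I_have['id'] + f'_dup{i}') for i in range(1, N)]
result = (pd.concat(parts, ignore_index=True)
            .sort_values('id')
            .reset_index(drop=True))
print(result)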

Related

Filter rows from subsets of a Pandas DataFrame efficiently

I have a DataFrame of medical data with the columns ["Patient_ID", "Code", "Date"], where "Code" represents some medical interaction that patient "Patient_ID" had on "Date". Any patient will generally have more than one row, since they have more than one interaction. I want to apply two types of filtering to this data.
Remove any patients who have fewer than some min_len interactions.
To each patient, apply a half-overlapping sliding window of length T days; within each window keep only the first of any duplicate codes, and then shuffle the codes within the window.
So I need to modify subsets of the overall dataframe, and the modification involves changing the size of the subset. I have both of these implemented as part of a larger pipeline, but they are a significant bottleneck in terms of time. I'm wondering if there's a more efficient way to achieve the same thing, as I really just threw together what worked and I'm not too familiar with the efficiency of pandas operations. Here is how I have them currently:
def Filter_by_length(df, min_len=1):
    print("Filtering short sequences...")
    df = df.sort_values(axis=0, by=['ID', 'DATE']).copy(deep=True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()),
                       total=len(df.ID.unique()), miniters=1):
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep=True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort=False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
def shuffle_col(df, col):
    df[col] = np.random.permutation(df[col])
    return df
def Filter_by_redundancy(df, T, min_len=1):
    print("Filtering redundant concepts and short sequences...")
    df = df.sort_values(axis=0, by=['ID', 'DATE']).copy(deep=True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()),
                       total=len(df.ID.unique()), miniters=1):
        start_date = sub_df.DATE.min()
        end_date = sub_df.DATE.max()
        next_date = start_date + dt.timedelta(days=T)
        while start_date <= end_date:
            sub_df = pd.concat([sub_df[sub_df.DATE < start_date],
                                shuffle_col(sub_df[(sub_df.DATE <= next_date) & (sub_df.DATE >= start_date)]
                                            .drop_duplicates(subset=['CODE']), "CODE"),
                                sub_df[sub_df.DATE > next_date]], sort=False)
            start_date += dt.timedelta(days=int(T / 2))
            next_date += dt.timedelta(days=int(T / 2))
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep=True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort=False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
As you can see, in the second case I am actually applying both filters, because it is important to have the option to apply both together or either one on its own, but I am interested in any performance improvement that can be made to either one or both.
For the first part, instead of counting per patient in a loop like that, I would use this approach:
>>> d = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q': [np.random.randint(1, 15, size=np.random.randint(1, 5)) for _ in range(5)]}).explode('q')
>>> d
id q
0 1 1
0 1 9
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
3 4 11
3 4 5
4 5 5
4 5 6
4 5 3
4 5 2
>>> sizes = d.groupby('id').size()
>>> d[d['id'].isin(sizes[sizes >= 3].index)] # index is list of IDs meeting criteria
id q
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
4 5 5
4 5 6
4 5 3
4 5 2
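As an aside, the same length filter can also be written without building the sizes index, by broadcasting each group's size back onto its rows with transform - a small sketch, assuming the d frame above:
# keep only rows whose id occurs at least min_len times
min_len = 3
filtered = d[d.groupby('id')['id'].transform('size') >= min_len]
print(filtered)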
I'm not sure why you want to shuffle your codes within some window. To avoid an X-Y problem, what are you in fact trying to do there?

Pandas Dataframe: split column into multiple columns

I need to break apart a column in a DataFrame that currently collects multiple values (someone else's Excel sheet, unfortunately) for a categorical data field that can take multiple values.
As you can see below the column has 15 category codes seen in the column header.
Original DataFrame
I want to split the column based on the category codes seen in the column header ['Pamphlet'] and then transform the values collected for each record in the original column into their respective new columns, as 1 for checked and 0 for unchecked instead of the raw values [1,2,4,5].
This is the code to split on the ',' between values, but I still need to put the results into new columns, which I need to set up by splitting the ['Pamphlet'] column according to the values in its header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].
df_old['Pamphlets'].str.split(pat=',', n=-1, expand=True)
Shape of desired DataFrame
If I could just get an outline of what the best approach is, and whether it is even possible to do this within pandas, thanks.
You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First let's define the dataframe. (It would be much appreciated if, in future questions, you supply a reproducible dataframe and any other data.)
import pandas as pd

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now, let's build the result for the first column manually; afterwards we'll turn this into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0 2
1 1
2 3
3 2
4 3
5 1
6 3
7 2
8 3
9 1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
1) Mail 2) Email 3) At PAC/TPAC
0 2 2 2
1 1 1 1
2 3 3 3
3 2 2 2
4 3 3 3
5 1 1 1
6 3 3 3
7 2 2 2
8 3 3 3
9 1 1 1
Create a copy to make changes to the values.
df_res = df.copy()
Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
    value = pd.to_numeric(col[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0
print(df_res)
1) Mail 2) Email 3) At PAC/TPAC
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 1 0
8 0 0 1
9 1 0 0
Now we have split a column into its components and assigned a bool value.
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
1) Mail 2) Email 3) At PAC/TPAC 1) ACC 2) IM 3) PT 4) Smoking, 5) Cessation
0 0 1 0 0 0 0 0 1
1 1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0
4 0 0 1 0 1 0 0 0
5 1 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
7 0 1 0 0 0 0 1 0
8 0 0 1 0 0 1 0 0
9 1 0 0 0 1 0 0 0
Here is the full code...
import pandas as pd

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)

def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res

for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
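As an aside, if every category value actually occurs in the data, the per-column work could arguably also be done with pd.get_dummies and the labels parsed from the header - a sketch, assuming df_full still holds the original two raw columns (the split_column_dummies helper name is only illustrative):
import pandas as pd

def split_column_dummies(s_col):
    # one label per category value, parsed from the header
    names = [x.strip() for x in s_col.name.split("\n")]
    # one indicator column per value that occurs; values never observed would be missing
    dummies = pd.get_dummies(s_col).astype(int)
    dummies.columns = [names[v - 1] for v in dummies.columns]
    return dummies

out = pd.concat([split_column_dummies(df_full[c]) for c in df_full.columns], axis=1)
print(out)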

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value first and then modify it with a mask using isin:
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follow:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider below example:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1
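For what it's worth, the original np.where attempt also works once the Python or is replaced by isin (or by the element-wise | operator), since those produce a boolean Series rather than trying to evaluate the whole Series as a single truth value:
import numpy as np

df['IND'] = np.where(df['VALUE'].isin([1, 4]), 1, 0)
# or, equivalently, with element-wise | instead of `or`
df['IND'] = np.where((df['VALUE'] == 1) | (df['VALUE'] == 4), 1, 0)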

How do I find top most used words by some kind of formula with another column

My dataframe is like this:
id  text                             c1
1   Hello world how are you people   1
2   Hello people I am fine people    1
3   Good Morning people              0
4   Good Evening                     0
Now I want to find the most frequently used words, but counted in a particular way; let me explain.
Let me show you the expected output first, then I will explain:
Hello - 2
People - 1
world - 1
how - 1
are - 1
you - 1
I - 1
am - 1
fine - 1
What I am trying to say is: here 'people' appears in 3 rows, but the count shown in the output is only 1, because:
row 1 contain people and c1 = 1
row 2 contain people and c1 = 1
row 3 contain people and c1 = 0
So row1 + row2 - row3 = 1 (rows with c1 = 1 add to the count and rows with c1 = 0 subtract from it).
In the same way, Hello's value in the output is 2, because:
row 1 contain hello and c1 = 1
row 2 contain hello and c1 = 1
So row1 + row2 = 2
I do not want to create a new column of output, just want to print it.
I am using this to count most used words
print(pd.Series(' '.join(df['text']).lower().split()).value_counts()[:10])
But I don't know how to calculate things in my way.
You can use a defaultdict for storing the values - first zip the text column with c1, loop over the pairs with Counter, and add negative counts where c1 == 0.
Then keep only the positive counts with a dictionary comprehension:
from collections import Counter, defaultdict

zipped = zip(df['text'], df['c1'])
d = defaultdict(int)
for a, b in zipped:
    c = Counter(set(a.lower().split()))
    for k, v in c.items():
        if b == 0:
            v = -v
        d[k] += v
d = {k: v for k, v in d.items() if v > 0}
print(d)
{'are': 1, 'hello': 2, 'how': 1, 'people': 1, 'world': 1, 'you': 1, 'i': 1, 'am': 1, 'fine': 1}
A similar solution works if the values in c1 are sorted - first all 1s and then all 0s:
from collections import Counter, defaultdict

df = df.sort_values('c1', ascending=False)
zipped = zip(df['text'], df['c1'])
d = defaultdict(int)
for a, b in zipped:
    c = Counter(set(a.lower().split()))
    for k, v in c.items():
        if (b == 0) and (k in d):
            d[k] -= v
        elif b == 1:
            d[k] += v
print(d)
defaultdict(<class 'int'>, {'are': 1, 'hello': 2, 'how': 1, 'people': 1,
'world': 1, 'you': 1, 'i': 1, 'am': 1, 'fine': 1})
df = pd.DataFrame({'val': list(d.keys()),
                   'No': list(d.values())}).sort_values('No', ascending=False)
print(df)
val No
1 hello 2
0 are 1
2 how 1
3 people 1
4 world 1
5 you 1
6 i 1
7 am 1
8 fine 1
s = pd.Series(d).sort_values(ascending=False)
print(s)
hello 2
fine 1
am 1
i 1
you 1
world 1
people 1
how 1
are 1
dtype: int64
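As far as I can tell, the same weighted count can also be done in pandas alone by exploding the words and summing signed weights - a sketch, assuming the df with id, text and c1 columns from the question:
import pandas as pd

# one row per (id, word), each word counted at most once per row
words = (df.assign(word=df['text'].str.lower().str.split())
           .explode('word')
           .drop_duplicates(subset=['id', 'word']))
# weight each row +1 (c1 == 1) or -1 (c1 == 0) and sum per word
counts = words['c1'].map({1: 1, 0: -1}).groupby(words['word']).sum()
print(counts[counts > 0].sort_values(ascending=False))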

pandas cut by the amount column

A pandas DataFrame (x) with two columns: sum and value. sum is the count of records that have the same value. For example:
sum value
2 3
4 1
means 2 records have value 3 and 4 records have value 1.
And what I want to do is: sort by value and then cut [1,1,1,1,3,3] into 3 parts: [1,1], [1,1], [3,3]
How can I cut the values into 3 parts so that each part has an equal number of records?
pandas.cut can't take the sum column into consideration.
I think you can use cumsum with double numpy.where:
sumall = df['sum'].sum()
df = df.sort_values(by='value')
df['sum_sum'] = df['sum'].cumsum()
df['tag'] = np.where(df['sum_sum'] < sumall / 3, 0,
                     np.where(df['sum_sum'] < 2 * sumall / 3, 1, 2))
print(df)
sum value sum_sum tag
1 4 1 4 2
0 2 3 6 2
This works for me, but it's ugly:
sum = df['sum'].sum()

def func(x):
    if x < sum / 3:
        return 0
    elif x < 2 * sum / 3:
        return 1
    return 2

df = df.sort_values(by='value')
df['sum_sum'] = np.cumsum(df['sum'].values)
df['tag'] = df['sum_sum'].apply(func)
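For reference, the same three-way bucketing of the cumulative sum can arguably also be written with pd.cut and explicit bin edges - a sketch, assuming the sorted df with the sum_sum column from above:
import numpy as np
import pandas as pd

sumall = df['sum'].sum()
# left-closed bins at 1/3 and 2/3 of the total, matching the strict `<` comparisons above
df['tag'] = pd.cut(df['sum_sum'],
                   bins=[-np.inf, sumall / 3, 2 * sumall / 3, np.inf],
                   right=False,
                   labels=[0, 1, 2])
print(df)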