Pandas Dataframe: split column into multiple columns - pandas

I need to break a column in a DataFrame that at present collects multiple values (someone else's excel sheet unfortunately) for a categorical data field that can have multiple values.
As you can see below the column has 15 category codes seen in the column header.
Original DataFrame
I want to split the column based on the category codes seen in the column header ['Pamphlet'] and then transform the values collected for each record in the original column to be mapped to there respective new columns as a (1) for checked and (0) for unchecked instead of the raw value [1,2,4,5].
This is the code to split based on , between values but I need to put these into the new columns I need to set up by splitting the column ['Pamphlet'] up by the values in the header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].
'''df_old['Pamphlets'].str.split(pat = ',', n = -1, expand = True)'''
Shape of desired DatFrame
If I could just get an outline of whats the best approach, if it is even possible to do this within Pandas, Thanks.

You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First lets define the dataframe. (It would be much appreciated if in future questions you supply a replicatable dataframe and any other data.
data = {
"1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
"1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now let's build the column manually for the first column. After we'll turn this next part into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0 2
1 1
2 3
3 2
4 3
5 1
6 3
7 2
8 3
9 1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
1) Mail 2) Email 3) At PAC/TPAC
0 2 2 2
1 1 1 1
2 3 3 3
3 2 2 2
4 3 3 3
5 1 1 1
6 3 3 3
7 2 2 2
8 3 3 3
9 1 1 1
Create a copy to make changes to the values.
df_res = df.copy()
Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
value = pd.to_numeric(col[0])
df_res.loc[df[col] == value, col] = 1
df_res.loc[df[col] != value, col] = 0
print(df_res)
1) Mail 2) Email 3) At PAC/TPAC
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 1 0
8 0 0 1
9 1 0 0
Now we have split a column into its components and assigned a bool value.
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
# Split the header into individual pieces.
col = s_col.name.split("\n")
# Clean up any leading or trailing white space.
col = [x.strip() for x in col]
# Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
# Create a copy to make changes to the values.
df_res = df.copy()
# Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
value = pd.to_numeric(col[0])
df_res.loc[df[col] == value, col] = 1
df_res.loc[df[col] != value, col] = 0
return df_res
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
# Call the function to get the split columns in a new dataframe.
df_split = split_column(df_full[c])
# Join it with the origianl full dataframe but drop the current column.
df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
1) Mail 2) Email 3) At PAC/TPAC 1) ACC 2) IM 3) PT 4) Smoking, 5) Cessation
0 0 1 0 0 0 0 0 1
1 1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0
4 0 0 1 0 1 0 0 0
5 1 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
7 0 1 0 0 0 0 1 0
8 0 0 1 0 0 1 0 0
9 1 0 0 0 1 0 0 0
Here is the full code...
data = {
"1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
"1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
def split_column(s_col):
# Split the header into individual pieces.
col = s_col.name.split("\n")
# Clean up any leading or trailing white space.
col = [x.strip() for x in col]
# Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
# Create a copy to make changes to the values.
df_res = df.copy()
# Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
value = pd.to_numeric(col[0])
df_res.loc[df[col] == value, col] = 1
df_res.loc[df[col] != value, col] = 0
return df_res
for c in df_full.columns:
# Call the function to get the split columns in a new dataframe.
df_split = split_column(df_full[c])
# Join it with the origianl full dataframe but drop the current column.
df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)

Related

Duplicate row and add string

I wish to duplicate Pandas data row and add string to end while keeping rest of data intact:
I_have = pd.DataFrame({'id':['a','b','c'], 'my_data' = [1,2,3])
I want:
Id my_data
a 1
a_dup1 1
a_dup2 1
b 2
b_dup1 2
b_dup2 2
c 3
c_dup1 3
c_dup2 3
I could do this by 1) iterrows() or 2) 3x copies of existing data and appending, but hopefully there is more pythonic way to do this.
This seems to work:
tmp1 = I_have.copy(deep=True)
tmp2 = I_have.copy(deep=True)
tmp1['id'] = tmp1['id']+'_dup1'
tmp2['id'] = tmp2['id']+'_dup2'
pd.concat([I_have, tmp1, tmp2])
Use Index.repeat with DataFrame.loc for duplicated rows and then add counter by numpy.tile, last add substrings for duplicated values - not equal 0 in Series.mask:
N = 3
df = df.loc[df.index.repeat(N)].reset_index(drop=True)
a = np.tile(np.arange(N), N)
df['id'] = df['id'].mask(a != 0, df['id'] + '_dup' + a.astype(str))
#alternative solution
#df.loc[a != 0, 'id'] = df['id'] + '_dup' + a.astype(str)
print (df)
id my_data
0 a 1
1 a_dup1 1
2 a_dup2 1
3 b 2
4 b_dup1 2
5 b_dup2 2
6 c 3
7 c_dup1 3
8 c_dup2 3

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i,j) in zip(range(0, 5999),range(1,len(df.columns))):
if j==1:
df2.values[i,j] = df.values[i,j] + df.values[0,1]
elif j>1:
df2.iloc[i,j] = df.iloc[i,j] - df.iloc[0,j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
0
1
2
3
4
0
0
1
2
3
4
1
5
6
7
8
9
2
10
11
12
13
14
3
15
16
17
18
19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (column names of the dataframe), and the second one - the actual values of the first row of the dataframe.
We can convert it to a list to better see its values
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, we are subtracting each value from the whole column where it has come from.

Adding new column to an existing dataframe at an arbitrary position [duplicate]

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
see docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
using loc = 0 will insert at the beginning
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column, with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass an optional parameter allow_duplicates with True value to create a new column with already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line ; however, this looks a bit ugly. Maybe some cleaner proposal may come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this(only one line).
You can do that after you added the 'n' column into your df as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your columns names instead of letters. It should include two brackets around your column names.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value and then modify it with a mask using isin
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follow:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider below example:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1

Delete all rows from pandas dataframe containing 0 as string and integer

I am using read_csv to load data from Yahoo Finance leads to rows containing 0 sometimes as string and at other times as integer. Trying to drop / delete these rows per Boolean masking:
df[(df != '0') & (df != 0)]
leads to errors:
TypeError: Could not compare ['0'] with block values
(in case the dataframe does not have any row with the string value '0') and
TypeError: Could not compare [0] with block values
(in case the frame does not have any integer value 0).
With the following dataframe:
df = pd.DataFrame({'int': [0,0,2,3,0,0,1,2,3],
'string': ['0','1','2','3','0','0','1','2','0']})
int string
0 0 0
1 0 1
2 2 2
3 3 3
4 0 0
5 0 0
6 1 1
7 2 2
8 3 0
The following code should work:
df = df[df.string != '0']
df = df[df.int != 0]
This gives the following output:
int string
2 2 2
3 3 3
6 1 1
7 2 2