pandas creating new columns for each value in categorical columns - pandas

I have a pandas dataframe with some numeric and some categorical columns. I want to create a new column for each value of every categorical column, holding 1 in every row where the row has that value and 0 in every other row. The df is something like this:
col1 col2 col3
A P 1
B P 3
A Q 7
The expected result is something like this:
col1 col2 col3 A B P Q
A P 1 1 0 1 0
B P 3 0 1 1 0
A Q 7 1 0 0 1
Is this possible? Can someone please help me?

Use df.select_dtypes, pd.get_dummies with pd.concat:
# First select all columns which have object dtypes
In [826]: categorical_cols = df.select_dtypes('object').columns
# Create one-hot encoding for the above cols and concat with df
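# axis must be passed by keyword on recent pandas; dtype=int gives 0/1 rather than True/False
In [817]: out = pd.concat([df, pd.get_dummies(df[categorical_cols], dtype=int)], axis=1)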
In [818]: out
Out[818]:
col1 col2 col3 col1_A col1_B col2_P col2_Q
0 A P 1 1 0 1 0
1 B P 3 0 1 1 0
2 A Q 7 1 0 0 1
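
If the new columns should be named exactly as in the question (A, B, P, Q, without the col1_/col2_ prefixes), a minimal sketch is below. It assumes no value appears in more than one categorical column, since otherwise the unprefixed names would collide. Selecting the non-numeric columns with exclude="number" is a substitution made here for the select_dtypes('object') call above, so that the example does not depend on which string dtype pandas uses.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "A"],
                   "col2": ["P", "P", "Q"],
                   "col3": [1, 3, 7]})

# Treat every non-numeric column as categorical (assumption for this sketch).
categorical_cols = df.select_dtypes(exclude="number").columns

# Empty prefix and separator keep the raw values as column names;
# dtype=int produces 0/1 instead of booleans.
dummies = pd.get_dummies(df[categorical_cols], prefix="", prefix_sep="", dtype=int)

out = pd.concat([df, dummies], axis=1)
print(out)
```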

Related

Perform similar computations in every dataframe in a list of dataframes

I have a list of 18 different dataframes. The only thing these dataframes have in common is that each contains a variable that ends with "_spec". The computations I would like to perform on each dataframe in the list are as follows:
1. return the number of columns in each dataframe that are numeric;
2. filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above); and
3. store the results of #2 in a separate list of 18 dataframes
I can get the output that I would like for each individual dataframe with the following:
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) # count (negative) no. of numeric vars in df
lvmo_spec = df[df.sum(numeric_only=True,axis=1)==lvmo_numlength].filter(regex='_spec') # does ^ = sum of numeric vars?
lvmo_spec.to_list()
but I don't want to copy and paste this 18(+) times...
I am new to writing functions and loops, but I know these can be utilized to perform the procedure I desire; yet I don't know how to execute it. The below code shows the abomination I have created, which can't even make it off the ground. Any suggestions?
# make list of dataframes
name_list = [lvmo, trx_nonrx, pd, odose_drg, fx, cpn_use, dem_hcc, dem_ori, drg_man, drg_cou, nlx_gvn, nlx_ob, opd_rsn, opd_od, psy_yn, sti_prep_tkn, tx_why, tx_curtx]
# create variable that satisfies condition 1
def numlen(name):
    return name + "_numlen"
# create variable that satisfies condition 2
def spec(name):
    return name + "_spec"
# loop it all together
for name in name_list:
    numlen(name) = -len(name.select_dtypes('number').columns.tolist())
    spec(name) = name[name.sum(numeric_only=True,axis=1)]==numlen(name).filter(regex='spec')
You can achieve what I believe your question is asking as follows, given input df_list which is a list of dataframes:
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
Explanation:
for each input dataframe, create a new dataframe as follows: keep the rows where the sum of the values in the numeric columns equals the negative of the number of numeric columns, then select only those columns whose label contains '_spec' (filter(regex='_spec') matches anywhere in the label; use regex='_spec$' to match only labels ending in '_spec')
use a list comprehension to compile the above new dataframes into a list
Note that this can also be expressed using a standard for loop instead of a list comprehension as follows:
res_list = []
for df in df_list:
    res_list.append( df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') )
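
Since the question mentions being new to functions, the same logic can also be wrapped in a small helper and applied to a dict of named dataframes, which keeps each result tied to its source name. This is only a sketch: the helper name spec_rows and the two tiny dataframes are hypothetical stand-ins, not the asker's real data.

```python
import pandas as pd

def spec_rows(df):
    """Keep rows whose numeric sum equals -(number of numeric columns),
    then keep only the columns whose label contains '_spec'."""
    n_numeric = len(df.select_dtypes("number").columns)
    mask = df.sum(numeric_only=True, axis=1) == -n_numeric
    return df[mask].filter(regex="_spec")

# Hypothetical stand-ins for two of the 18 dataframes.
frames = {
    "lvmo": pd.DataFrame({"a": [-1, 0], "b": [-1, -1], "lvmo_spec": ["x", "y"]}),
    "fx": pd.DataFrame({"a": [0, -1], "fx_spec": ["p", "q"]}),
}

# One filtered dataframe per input, keyed by the same name.
results = {name: spec_rows(df) for name, df in frames.items()}
for name, res in results.items():
    print(name)
    print(res)
```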
Sample code (using 7 input dataframe objects instead of 18):
import pandas as pd
df_list = [pd.DataFrame({'b':['a','b','c','d']} | {f'col{i+1}{"_spec" if not i%3 else ""}':[-1,0,0]+([0 if i!=n-1 else -n]) for i in range(n)}) for n in range(7)]
for df in df_list: print(df)
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
for df in res_list: print(df)
Input:
b
0 a
1 b
2 c
3 d
b col1_spec
0 a -1
1 b 0
2 c 0
3 d -1
b col1_spec col2
0 a -1 -1
1 b 0 0
2 c 0 0
3 d 0 -2
b col1_spec col2 col3
0 a -1 -1 -1
1 b 0 0 0
2 c 0 0 0
3 d 0 0 -3
b col1_spec col2 col3 col4_spec
0 a -1 -1 -1 -1
1 b 0 0 0 0
2 c 0 0 0 0
3 d 0 0 0 -4
b col1_spec col2 col3 col4_spec col5
0 a -1 -1 -1 -1 -1
1 b 0 0 0 0 0
2 c 0 0 0 0 0
3 d 0 0 0 0 -5
b col1_spec col2 col3 col4_spec col5 col6
0 a -1 -1 -1 -1 -1 -1
1 b 0 0 0 0 0 0
2 c 0 0 0 0 0 0
3 d 0 0 0 0 0 -6
Output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
col1_spec
0 -1
3 -1
col1_spec
0 -1
3 0
col1_spec
0 -1
3 0
col1_spec col4_spec
0 -1 -1
3 0 -4
col1_spec col4_spec
0 -1 -1
3 0 0
col1_spec col4_spec
0 -1 -1
3 0 0
Also, a couple of comments about the original question:
lvmo_spec.to_list() doesn't work because to_list() is not defined for a DataFrame. The methods tolist() and to_list() do exist, but they only work for a Series (not a DataFrame).
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) gives a negative result. I have assumed this is your intention, and that you want the sum of each row's numeric values to have a negative value, but this is slightly at odds with your description which states:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above);
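
To illustrate the tolist() point, here is a minimal sketch with a made-up two-column dataframe (the column names and values are assumptions). A single column is a Series and can be converted directly. For a whole DataFrame, one option is to go through the underlying NumPy array.

```python
import pandas as pd

df = pd.DataFrame({"lvmo_spec": ["x", "y"], "other_spec": ["p", "q"]})

# A single column is a Series, which has tolist().
one_column = df["lvmo_spec"].tolist()

# A DataFrame has no tolist(); convert via the NumPy array instead,
# which gives one inner list per row.
rows = df.to_numpy().tolist()

# Confirms that the attribute is missing on the DataFrame itself.
has_method = hasattr(df, "tolist")
print(one_column, rows, has_method)
```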

How to make a new 0 and 1 column?

I have a pandas data frame and I want to make a new column with 0 and 1:
if col1 is zero and col2 is positive, set the new column to 1.
if col1 is zero and col2 is negative, set the new column to 0.
if col1 is 1 and col2 is positive, set the new column to 0.
if col1 is 1 and col2 is negative, set the new column to 1.
col1 col2
0 2
0 -4
1 -2
1 5
1 9
new_column
1
0
1
0
0
You can determine if col2 is positive and get the absolute difference with col1 (booleans behave like 0/1):
df['new_column'] = df['col1'].sub(df['col2'].gt(0)).abs()
Or compare the two directly, since you want a 1 exactly when they differ:
df['new_column'] = df['col1'].ne(df['col2'].gt(0)).astype(int)
output:
col1 col2 new_column
0 0 2 1
1 0 -4 0
2 1 -2 1
3 1 5 0
4 1 9 0
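
The four rules in the question amount to an exclusive-or between "col1 is 1" and "col2 is positive". As a sketch of that reading, using the sample data from the question and assuming col1 only ever holds 0 or 1, the XOR can also be written explicitly:

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 0, 1, 1, 1], "col2": [2, -4, -2, 5, 9]})

# True where col2 is strictly positive.
positive = df["col2"].gt(0)

# XOR of the two boolean conditions, converted back to 0/1.
df["new_column"] = (df["col1"].astype(bool) ^ positive).astype(int)
print(df)
```

Note that a col2 value of exactly 0 is treated as "not positive" here, which the question does not specify.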

Create new columns from categorical variables

ID  column_factors  column1  column2
0   fact1           d        w
1   fact1, fact2    a        x
2   fact3           b        y
3   fact1,fact4     c        z
I have a table in a pandas dataframe. I would like to remove the column "column_factors" and create new columns called "fact1", "fact2", "fact3", "fact4", filling the new columns with dummy values as shown below. Thanks in advance,
ID  fact1  fact2  fact3  fact4  column1  column2
0   1      0      0      0      d        w
1   1      1      0      0      a        x
2   0      0      1      0      b        y
3   1      0      0      1      c        z
Use Series.str.get_dummies
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html#pandas.Series.str.get_dummies
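# strip spaces around the commas first, otherwise 'fact1, fact2' yields a separate ' fact2' column
dummy_cols = df['column_factors'].str.replace(r'\s*,\s*', ',', regex=True).str.get_dummies(sep=',')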
df = df.join(dummy_cols).drop(columns='column_factors')
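
A self-contained sketch of this approach on the question's sample table follows. It adds two things beyond the answer's two lines, both assumptions about what the asker wants: the spacing around commas is normalized so that "fact1, fact2" and "fact1,fact4" split consistently, and the columns are reordered to match the desired layout.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "column_factors": ["fact1", "fact1, fact2", "fact3", "fact1,fact4"],
    "column1": ["d", "a", "b", "c"],
    "column2": ["w", "x", "y", "z"],
})

# Remove whitespace around commas so " fact2" and "fact2" are the same label.
cleaned = df["column_factors"].str.replace(r"\s*,\s*", ",", regex=True)

# One 0/1 column per distinct factor.
dummy_cols = cleaned.str.get_dummies(sep=",")

out = df.join(dummy_cols).drop(columns="column_factors")

# Place the factor columns between ID and column1, as in the desired table.
out = out[["ID", *dummy_cols.columns, "column1", "column2"]]
print(out)
```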

Drop all group rows when met a condition?

I have a pandas data frame with a two-level grouping based on 'col0' and 'col1'. I want to drop all rows of a group if a specified value in another column is repeated within the group or does not exist in it (that is, keep only the groups in which the specified value occurs exactly once). For example:
The original data frame:
df = pd.DataFrame( {'col0':['A','A','A','A','A','B','B','B','B','B','B','B','c'],'col1':[1,1,2,2,2,1,1,1,1,2,2,2,1], 'col2':[1,2,1,2,3,1,2,1,2,2,2,2,1]})
I need to keep the rows for the groups, for example (['A',1],['A',2],['B',2]), in this original DF
The desired dataframe:
I tried this step:
df.groupby(['col0','col1']).apply(lambda x: (x['col2']==1).sum()==1)
where the result is
col0 col1
A 1 True
2 True
B 1 False
2 True
c 1 False
dtype: bool
How to create the desired Df based on this bool?
You can do this as below:
import numpy as np
m=(df.groupby(['col0','col1'])['col2'].
   transform(lambda x: np.where((x.eq(1)).sum()==1,x,np.nan)).dropna().index)
df.loc[m]
Or:
df[df.groupby(['col0','col1'])['col2'].transform(lambda x: x.eq(1).sum()==1)]
col0 col1 col2
0 A 1 1
1 A 1 2
2 A 2 1
3 A 2 2
4 A 2 3
12 c 1 1
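
Two further ways to express the same selection, offered as a sketch rather than as the answerer's method: count the 1s per group with a built-in transform('sum') on a boolean mask, or use GroupBy.filter, which keeps or drops whole groups. Both use the dataframe exactly as given in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "col0": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "c"],
    "col1": [1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1],
    "col2": [1, 2, 1, 2, 3, 1, 2, 1, 2, 2, 2, 2, 1],
})

# Count how many times col2 == 1 in each (col0, col1) group and
# broadcast that count back to every row of the group.
ones_per_group = df["col2"].eq(1).groupby([df["col0"], df["col1"]]).transform("sum")
kept = df[ones_per_group.eq(1)]

# Alternative: filter() evaluates the condition once per group and
# returns the rows of the groups that pass, in their original order.
kept_filter = df.groupby(["col0", "col1"]).filter(lambda g: g["col2"].eq(1).sum() == 1)

print(kept)
```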

Pandas, create new column applying groupby values

I have a DF:
Col1 Col2 Label
0 0 5345
1 0 7574
2 0 3445
0 1 2126
1 1 4653
2 1 9566
So I'm trying to groupby on Col1 and Col2 to get index value based on Label column like this:
df_gb = df.groupby(['Col1','Col2'])['Label'].agg(['sum', 'count'])
df_gb['sum_count'] = df_gb['sum'] / df_gb['count']
sum_count_total = df_gb['sum_count'].sum()
index = df_gb['sum_count'] / 10
Col2 Col1
0 0 2.996036
1 3.030063
2 3.038579
1 0 2.925314
1 2.951295
2 2.956083
2 0 2.875549
1 2.899254
2 2.905063
Everything so far is as I expected. But now I would like to assign this 'index' groupby result to my original 'df' based on those two groupby columns. With only one column this works with the map() function, but not when I want to assign the index values based on two columns.
df_index = df.copy()
df_index['index'] = df.groupby([]).apply(index)
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Tried with agg() and transform() but without success. Any ideas how to proceed?
Thanks in advance.
Hristo.
I believe you need join:
a = df.join(index.rename('new'), on=['Col1','Col2'])
print (a)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
Or GroupBy.transform:
df['new']=df.groupby(['Col1','Col2'])['Label'].transform(lambda x: x.sum() / x.count()) / 10
print (df)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
And if there are no NaNs in the Label column, use the solution suggested by Zero, thank you:
df.groupby(['Col1','Col2'])['Label'].transform('mean') / 10
If you need to count only the non-NaN values with count, use the solution with transform.
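
For a runnable check on the six-row sample from the question, the sketch below builds the grouped Series, attaches it with join, and compares the result with the transform version. It also shows merge on the reset index as one more option, which is an addition not taken from the answer. With this sample every (Col1, Col2) pair occurs once, so the group mean is just the Label value.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [0, 1, 2, 0, 1, 2],
                   "Col2": [0, 0, 0, 1, 1, 1],
                   "Label": [5345, 7574, 3445, 2126, 4653, 9566]})

# Grouped result indexed by (Col1, Col2), equivalent to sum / count / 10.
index = df.groupby(["Col1", "Col2"])["Label"].mean() / 10

# Option 1: join the MultiIndex Series on the two key columns.
joined = df.join(index.rename("new"), on=["Col1", "Col2"])

# Option 2: compute the per-group value directly aligned to the rows.
transformed = df.groupby(["Col1", "Col2"])["Label"].transform("mean") / 10

# Option 3: turn the Series into a regular table and left-merge on both keys.
merged = df.merge(index.rename("new").reset_index(), on=["Col1", "Col2"], how="left")

print(joined)
```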