Create new columns from categorical variables - pandas

ID  column_factors  column1  column2
0   fact1           d        w
1   fact1, fact2    a        x
2   fact3           b        y
3   fact1,fact4     c        z
I have a table in a pandas DataFrame. What I would like to do is remove the column "column_factors" and create new columns called "fact1", "fact2", "fact3", and "fact4", filling them with dummy values as shown below. Thanks in advance.
ID  fact1  fact2  fact3  fact4  column1  column2
0   1      0      0      0      d        w
1   1      1      0      0      a        x
2   0      0      1      0      b        y
3   1      0      0      1      c        z

Use Series.str.get_dummies
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html#pandas.Series.str.get_dummies
dummy_cols = df['column_factors'].str.get_dummies(sep=',')
df = df.join(dummy_cols).drop(columns='column_factors')
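Note that one of the sample values ("fact1, fact2") has a space after the comma, so splitting on ',' alone would produce a stray ' fact2' column; stripping spaces first avoids that. A minimal, self-contained sketch on the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'ID': [0, 1, 2, 3],
    'column_factors': ['fact1', 'fact1, fact2', 'fact3', 'fact1,fact4'],
    'column1': ['d', 'a', 'b', 'c'],
    'column2': ['w', 'x', 'y', 'z'],
})

# strip spaces, then split on ',' and build the indicator columns
dummy_cols = (df['column_factors']
              .str.replace(' ', '', regex=False)
              .str.get_dummies(sep=','))
df = df.join(dummy_cols).drop(columns='column_factors')
print(df)
#    ID column1 column2  fact1  fact2  fact3  fact4
# 0   0       d       w      1      0      0      0
# 1   1       a       x      1      1      0      0
# 2   2       b       y      0      0      1      0
# 3   3       c       z      1      0      0      1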

Related

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a data frame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set the first row of every unique element to one. Currently I am using code like:
def store_counter(df):
    unique_names = list(df.Name.unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
        else:
            pass
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values, use numpy.where:
import numpy as np

df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
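An equivalent approach (not from the original answer, just a sketch) marks the first row of each Name group with GroupBy.cumcount, which is 0 for the first occurrence in each group:
df['quantity'] = df.groupby('Name').cumcount().eq(0).astype(int)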

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing two columns, ID and Code; the Flag column below is the desired output.
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists with 'B' or 'C', then it should flag 1.
I tried groupby('ID') with filter(), but it is not giving the expected result. Could anyone please help?
You can do the following:
First use df.groupby('ID') and concatenate the codes using 'sum' to create a new column. Then assign the value 1 where a row's Code is B or C and the new column contains an A:
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns = 's')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
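The transform('sum') above relies on string addition concatenating the codes within each group; if you prefer to be explicit about that (a sketch, not part of the original answer, assuming df as above), you can pass ''.join directly:
# concatenate all codes of each ID group into one string
df['s'] = df.groupby('ID')['Code'].transform(''.join)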
You can use boolean masks: a direct one for B/C, a per-group one for A; then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0

pandas creating new columns for each value in categorical columns

I have a pandas dataframe with some numeric and some categorical columns. I want to create a new column for each value of every categorical column, containing 1 in every row where that value occurs and 0 where it does not. So the df is something like this:
col1 col2 col3
A P 1
B P 3
A Q 7
Expected result is something like this:
col1 col2 col3 A B P Q
A P 1 1 0 1 0
B P 3 0 1 1 0
A Q 7 1 0 0 1
Is this possible? Can someone please help me?
Use df.select_dtypes, pd.get_dummies with pd.concat:
# First select all columns which have object dtypes
In [826]: categorical_cols = df.select_dtypes('object').columns
# Create one-hot encoding for the above cols and concat with df
In [817]: out = pd.concat([df, pd.get_dummies(df[categorical_cols])], axis=1)
In [818]: out
Out[818]:
col1 col2 col3 col1_A col1_B col2_P col2_Q
0 A P 1 1 0 1 0
1 B P 3 0 1 1 0
2 A Q 7 1 0 0 1
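If keeping the original categorical columns is not required, pd.get_dummies can also be applied to the whole frame (a sketch; note that this replaces col1 and col2 with their indicator columns rather than keeping them alongside):
out = pd.get_dummies(df, columns=categorical_cols)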

pandas: create a new dataframe from existing columns values

I have a dataframe like this:
ID code num
333_c_132 x 0
333_c_132 n36 1
998_c_134 x 0
998_c_134 n36 0
997_c_135 x 1
997_c_135 n36 0
From this I have to create a new dataframe like the one below; you can see a new column numX is formed per unique ID. Please note that the numX values are taken from the num column corresponding to n36.
ID code num numX
333_c_132 x 0 1
998_c_134 x 0 0
997_c_135 x 1 0
How can I do this only using pandas?
You can use a mask, then merge after pivoting:
m = df['code'].eq('n36')
(df[~m].merge(df[m].set_index(['ID', 'code'])['num'].unstack(),
              left_on='ID', right_index=True))
ID code num n36
0 333_c_132 x 0 1
2 998_c_134 x 0 0
4 997_c_135 x 1 0
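To get the numX column name requested in the question, the merged n36 column can simply be renamed (a small extension of the answer above):
out = (df[~m].merge(df[m].set_index(['ID', 'code'])['num'].unstack(),
                    left_on='ID', right_index=True)
             .rename(columns={'n36': 'numX'}))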

Creating new columns with value names of other columns in pandas

I have a DataFrame as shown below.
DF =
id w R
1 A L
2 B J
3 C L,J
I now want to create new columns that show whether each value in the column R appears in the row.
DF2 =
id w R L J
1 A L 1 0
2 B J 0 1
3 C L,J 1 1
I tried this line, but the result wasn't what I wanted:
for x in DF.R.unique():
    DF[x] = (DF.R == x).astype(int)
DF2 =
id w R L J L,J
1 A L 1 0 0
2 B J 0 1 0
3 C L,J 0 0 1
What is needed to fix this? The DF is also very big and slow methods won't work.
You need to specify the sep; in your example it is ',':
df.R.str.get_dummies(sep=',')
Out[192]:
J L
0 0 1
1 1 0
2 1 1
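To attach these indicator columns to the original frame and reproduce DF2 (assuming the DataFrame is named DF as in the question):
DF2 = DF.join(DF['R'].str.get_dummies(sep=','))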
I would use pandas' built-in str methods:
chars_to_count = ['L', 'J']
for char in chars_to_count:
    DF[char] = DF['R'].str.count(char)