Creating new columns with value names of other columns in pandas

I have a DataFrame as shown below.
DF =
id w R
1 A L
2 B J
3 C L,J
I now want to create a new column for each value in column R that shows whether that value appears in the row.
DF2 =
id w R L J
1 A L 1 0
2 B J 0 1
3 C L,J 1 1
I tried this line, but the result wasn't what I wanted:
for x in DF.R.unique():
    DF[x] = (DF.R == x).astype(int)
DF2 =
id w R L J L,J
1 A L 1 0 0
2 B J 0 1 0
3 C L,J 0 0 1
What is needed to fix this? DF is also very large, so slow methods won't work.

You need to specify the sep; in your example it is ','.
df.R.str.get_dummies(sep=',')
Out[192]:
J L
0 0 1
1 1 0
2 1 1
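As a runnable sketch of the approach above, using the question's sample frame and joining the dummies back onto it:

```python
import pandas as pd

# Sample frame from the question
DF = pd.DataFrame({'id': [1, 2, 3],
                   'w': ['A', 'B', 'C'],
                   'R': ['L', 'J', 'L,J']})

# Split each R value on ',' and one-hot encode the tokens,
# then join the dummy columns back onto the original frame
dummies = DF['R'].str.get_dummies(sep=',')
DF2 = DF.join(dummies)
print(DF2)
#    id  w    R  J  L
# 0   1  A    L  0  1
# 1   2  B    J  1  0
# 2   3  C  L,J  1  1
```

Note that get_dummies sorts the new columns alphabetically (J before L).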

I would use pandas' built-in str methods:
chars_to_count = ['L', 'J']
for char in chars_to_count:
    DF[char] = DF['R'].str.count(char)
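A self-contained version of that loop on the question's sample data. Note that str.count counts pattern matches per row, which for these single-letter labels (appearing at most once each) yields the desired 0/1 flags:

```python
import pandas as pd

DF = pd.DataFrame({'R': ['L', 'J', 'L,J']})

chars_to_count = ['L', 'J']
for char in chars_to_count:
    # number of occurrences of char in each row of R
    DF[char] = DF['R'].str.count(char)
print(DF)
#      R  L  J
# 0    L  1  0
# 1    J  0  1
# 2  L,J  1  1
```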

Related

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a data frame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set the first row of every unique element to one.
Currently I am using code like:
def store_counter(df):
    unique_names = list(df.Name.unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
    return df
This is highly inefficient. Is there a better approach?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values at once, use numpy.where:
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
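Both one-liners above can be tried end-to-end; a minimal sketch on the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'a', 'b', 'b', 'c'], 'quantity': 0})

# duplicated() is False on the first occurrence of each Name,
# so np.where writes 1 there and 0 everywhere else
df['quantity'] = np.where(df['Name'].duplicated(), 0, 1)
print(df)
#   Name  quantity
# 0    a         1
# 1    a         0
# 2    b         1
# 3    b         0
# 4    c         1
```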

Get column name in data frame if value is between values

I have a dataframe:
import numpy as np
import pandas as pd
random_number_gen = np.random.default_rng()
df = pd.DataFrame(random_number_gen.integers(-5, 5, size=(1, 13)), columns=list('ABCDEFGHIJKLM'))
A  B  C  D  E  F  G  H  I  J  K  L  M
0  1  4 -4 -1  3 -5 -3  0 -4 -1  3  2
I would like to obtain the names of the columns where a value falls between -1 and 1. I tried this and others:
df.columns[(( -1<= df.any()) & (df.any() <=1)).iloc[0]]
Any help is welcome. Thanks.
If you have a single row:
df.columns[df.iloc[0].between(-1,1)]
# or
df.columns[df.squeeze().between(-1,1)]
If you can have multiple rows:
df.columns[(df.ge(-1)&df.le(1)).any()]
Example output:
Index(['E', 'G', 'J'], dtype='object')
Used input:
A B C D E F G H I J K L M
0 3 -3 -4 -3 -1 3 -1 -5 -2 1 3 2 4
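A deterministic sketch of the single-row case, using the answer's "Used input" row instead of freshly generated random data:

```python
import pandas as pd

# Fixed single-row frame mirroring the answer's "Used input"
df = pd.DataFrame([[3, -3, -4, -3, -1, 3, -1, -5, -2, 1, 3, 2, 4]],
                  columns=list('ABCDEFGHIJKLM'))

# Boolean mask over the first row: which values lie in [-1, 1]?
cols = df.columns[df.iloc[0].between(-1, 1)]
print(cols)
# Index(['E', 'G', 'J'], dtype='object')
```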

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing two columns, ID and Code; the Flag column below shows the desired output.
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists together with 'B' or 'C', then Flag should be 1.
I tried groupby('ID') with filter(), but it did not give the expected result. Could anyone please help?
You can do the following:
First use groupby('ID') and concatenate the codes within each group using 'sum' to create a helper column. Then assign the value 1 where a row's Code is 'B' or 'C' and the helper column contains an 'A':
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns = 's')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
You can use boolean masks: one direct mask for B/C, one per-group mask for A, then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
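Putting the mask approach together as a runnable sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 4],
                   'Code': ['A', 'C', 'B', 'A', 'B', 'A', 'C']})

# Rows whose Code is B or C
m1 = df['Code'].isin(['B', 'C'])
# Rows belonging to an ID group that also contains an A
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# Flag rows where both conditions hold
df['Flag'] = (m1 & m2).astype(int)
print(df)
#    ID Code  Flag
# 0   1    A     0
# 1   1    C     1
# 2   1    B     1
# 3   2    A     0
# 4   2    B     1
# 5   3    A     0
# 6   4    C     0
```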

Create new columns from categorical variables

ID  column_factors  column1  column2
0   fact1           d        w
1   fact1, fact2    a        x
2   fact3           b        y
3   fact1,fact4     c        z
I have this table in a pandas dataframe. What I would like to do is remove the column "column_factors" and create new columns called "fact1", "fact2", "fact3", "fact4", filled with dummy values as shown below. Thanks in advance.
ID  fact1  fact2  fact3  fact4  column1  column2
0   1      0      0      0      d        w
1   1      1      0      0      a        x
2   0      0      1      0      b        y
3   1      0      0      1      c        z
Use Series.str.get_dummies (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html):
dummy_cols = df['column_factors'].str.get_dummies(sep=',')
df = df.join(dummy_cols).drop(columns='column_factors')
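A runnable sketch on the question's table. Note the input mixes 'fact1, fact2' (with a space after the comma) and 'fact1,fact4' (without); stripping spaces before get_dummies avoids a stray ' fact2' column. That cleanup step is an addition beyond the answer above:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 3],
                   'column_factors': ['fact1', 'fact1, fact2',
                                      'fact3', 'fact1,fact4'],
                   'column1': list('dabc'),
                   'column2': list('wxyz')})

# Strip spaces so 'fact1, fact2' and 'fact1,fact4' split consistently
cleaned = df['column_factors'].str.replace(' ', '', regex=False)
dummy_cols = cleaned.str.get_dummies(sep=',')

# Attach the dummies and drop the original factor column
df = df.join(dummy_cols).drop(columns='column_factors')
print(df)
```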

Pandas merge conflict rows by counts?

A conflict row means two rows have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge each set of conflicting rows into a single label based on their counts. If a has more 1s than 0s, then a will be labeled 1. Otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby('feature')['label'].value_counts().reset_index(name='counts').
But how do I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Let's take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as an example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you need to first filter groups with more than one unique value, using DataFrameGroupBy.nunique with GroupBy.transform, before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For the second, get each feature's label by maximum occurrences:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
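Both steps combined as a runnable sketch on the example frame. One caveat: value_counts().index[0] resolves ties arbitrarily, so for a group like b with equal counts of 0 and 1 the chosen label depends on ordering:

```python
import pandas as pd

df = pd.DataFrame({'feature': ['a', 'a', 'b', 'b', 'a', 'c', 'c', 'd'],
                   'label':   [1, 0, 0, 1, 1, 0, 0, 1]})

# Keep only rows of features that carry more than one distinct label
conflict = df.groupby('feature')['label'].transform('nunique').gt(1)
df_conflict = (df[conflict].groupby('feature')['label']
               .value_counts().reset_index(name='count'))

# Majority label per feature; index[0] is the most frequent label
df_merged = (df.groupby('feature')['label']
             .agg(lambda x: x.value_counts().index[0])
             .reset_index())
print(df_merged)
```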