Efficiently aggregate multi-row data into a single-row representation in a pandas DataFrame

I have a one-hot encoded pandas DataFrame like the following:
p c1 c2 c3
A 1 0 0
B 1 0 0
A 0 1 0
A 0 0 1
B 0 0 1
I want to collapse the rows for each value of p into a single row, so that a column is 1 if any row for that p has a 1 in it.
Desired output:
p c1 c2 c3
A 1 1 1
B 1 0 1

Like this:
In [463]: df.groupby('p').agg(sum).reset_index()
Out[463]:
p c1 c2 c3
0 A 1 1 1
1 B 1 0 1

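# long format: one value per (p, column) pair
s = df.set_index('p').stack()
# keep only the 1s, pivot back to wide, and fill the gaps with 0
df = s[s.eq(1)].unstack().fillna(0).astype(int).reset_index()
df.shape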
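
For reference, a minimal self-contained sketch of the aggregation, assuming the sample frame from the question. It uses max instead of sum, which is a substitution and not what the answers above use: on this data both give the same result, but max keeps each indicator at 0/1 even if the same category shows up more than once for a key.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "p":  ["A", "B", "A", "A", "B"],
    "c1": [1, 1, 0, 0, 0],
    "c2": [0, 0, 1, 0, 0],
    "c3": [0, 0, 0, 1, 1],
})

# One row per p; each indicator is 1 if any row in the group has a 1
collapsed = df.groupby("p", as_index=False).max()
print(collapsed)
```

If the counts themselves matter (how many rows per key had the category), sum is the better choice.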

Related

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a data frame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set quantity to 1 in the first row of every unique Name.
Currently I am using code like this:
def store_counter(df):
    unique_names = list(df['Name'].unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        # the first time a name is seen, mark the row and forget the name
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
        else:
            pass
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values (0 and 1), use numpy.where:
import numpy as np
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
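
A self-contained sketch for trying this out, assuming the sample frame from the question. Besides the duplicated approach from the answer, it shows groupby with cumcount as an alternative; that second variant is an addition here, not part of the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Name": ["a", "a", "b", "b", "c"], "quantity": 0})

# Approach from the answer: duplicated() is False on the first occurrence
first_seen = (~df["Name"].duplicated()).astype(int)

# Alternative: cumcount() numbers the rows within each Name starting at 0,
# so the first occurrence is the row whose count equals 0
df["quantity"] = df.groupby("Name").cumcount().eq(0).astype(int)
print(df)
```

Both are vectorized and avoid the row-by-row loop with iterrows.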

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing 2 columns, ID and Code; Flag is the column I want to compute:
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists together with 'B' or 'C', then the 'B' and 'C' rows should be flagged 1.
I tried groupby('ID') with filter(), but it does not give the right result. Could anyone please help?
You can do the following:
First use df.groupby('ID') and concatenate the codes using 'sum' to create a new column. Then assign the value 1 if a row has B or C as its Code and the new column contains an A:
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns = 's')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
You can use boolean masks, computed directly for B/C and per group for A, then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
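
The two answers agree on the sample data, but they can differ when codes are longer than one character, because the first one tests for a substring in the concatenated codes. A small sketch of that difference, assuming the question's data plus a hypothetical extra group (ID 5 with the made-up code "AB") that is not in the original:

```python
import pandas as pd

# IDs 1-4 are from the question; ID 5 and the code "AB" are hypothetical,
# added only to show substring matching versus exact matching
df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Code": ["A", "C", "B", "A", "B", "A", "C", "AB", "B"],
})

is_b_or_c = df["Code"].isin(["B", "C"])

# Substring test: the codes of ID 5 concatenate to "ABB", which contains "A"
concat = df.groupby("ID")["Code"].transform("".join)
flag_substring = (is_b_or_c & concat.str.contains("A")).astype(int)

# Exact test: no row of ID 5 has a Code equal to "A"
has_a = df["Code"].eq("A").groupby(df["ID"]).transform("any")
flag_exact = (is_b_or_c & has_a).astype(int)

print(pd.DataFrame({"ID": df["ID"], "Code": df["Code"],
                    "substring": flag_substring, "exact": flag_exact}))
```

With single-letter codes only, as in the question, either approach is fine.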

Is there a function to do one-hot encoding and remove duplicates in R?

I have this database:
ID LABEL
1 A
1 B
2 B
3 c
I'm trying to do one-hot encoding, which I was able to do. However, I also need to remove the duplicated IDs. Currently my one-hot encoded data looks like this:
ID A B C
1 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
and I need this to be the final database:
ID A B C
1 1 1 0
2 0 1 0
3 0 0 1
This is my code:
library(caret)  # provides dummyVars
dummy <- dummyVars('~ .', data = data_to_be_encoded)
encoded_data <- data.frame(predict(dummy, newdata = data_to_be_encoded))

Create new columns from categorical variables

ID  column_factors  column1  column2
0   fact1           d        w
1   fact1, fact2    a        x
2   fact3           b        y
3   fact1,fact4     c        z
I have a table in a pandas dataframe. I would like to remove the column "column_factors" and create new columns called "fact1", "fact2", "fact3" and "fact4", filled with dummy values as shown below. Thanks in advance.
ID  fact1  fact2  fact3  fact4  column1  column2
0   1      0      0      0      d        w
1   1      1      0      0      a        x
2   0      0      1      0      b        y
3   1      0      0      1      c        z
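Use Series.str.get_dummies (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html#pandas.Series.str.get_dummies):
# drop whitespace around the commas first, otherwise "fact1, fact2" yields a column named " fact2"
factors = df['column_factors'].str.replace(r'\s*,\s*', ',', regex=True)
dummy_cols = factors.str.get_dummies(sep=',')
df = df.join(dummy_cols).drop(columns='column_factors')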
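
A self-contained sketch that reproduces the desired table, assuming the sample data from the question. The whitespace normalization and the pd.concat step that places the dummy columns between ID and column1 are additions for illustration; join alone would append them at the end.

```python
import pandas as pd

# Sample data copied from the question (note the space in "fact1, fact2")
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "column_factors": ["fact1", "fact1, fact2", "fact3", "fact1,fact4"],
    "column1": ["d", "a", "b", "c"],
    "column2": ["w", "x", "y", "z"],
})

# Remove whitespace around commas so " fact2" and "fact2" are the same label
factors = df["column_factors"].str.replace(r"\s*,\s*", ",", regex=True)
dummies = factors.str.get_dummies(sep=",")

# Rebuild the frame in the column order shown in the question
result = pd.concat([df[["ID"]], dummies, df[["column1", "column2"]]], axis=1)
print(result)
```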

How to find difference between rows in a pandas multiIndex, by level 1

Suppose we have a DataFrame like this, only with many, many more index A values:
df = pd.DataFrame([[1,2,1,2],
                   [1,1,2,2],
                   [2,2,1,0],
                   [1,2,1,2],
                   [2,1,1,2]], columns=['A','B','c1','c2'])
df.groupby(['A','B']).sum()
## result
     c1  c2
A B
1 1   2   2
  2   2   4
2 1   1   2
  2   1   0
How can I get a data frame that consists of the difference between rows, by the second level of the index, level B?
The output here would be
A c1 c2
1 0 -2
2 0 2
Note: In my particular use case, I have a lot of column A values, so I can't write out the value for A explicitly.
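Check diff and dropna. Note that diff subtracts the previous row from the current one (B=2 minus B=1), so the signs are the opposite of the desired output; negate the result if you need first minus second: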
g = df.groupby(['A','B'])[['c1','c2']].sum()
g = g.groupby(level=0).diff().dropna()
g
Out[25]:
c1 c2
A B
1 2 0.0 2.0
2 2 0.0 -2.0
Assigning the first grouping to result variable:
result = df.groupby(['A','B']).sum()
You could use a pipe operation with nth:
result.groupby('A').pipe(lambda df: df.nth(0) - df.nth(-1))
c1 c2
A
1 0 -2
2 0 2
A simpler option, in my opinion, would be to use agg combined with numpy's ufunc reduce, as this covers scenarios where you have more than two rows:
result.groupby('A').agg(np.subtract.reduce)
c1 c2
A
1 0 -2
2 0 2
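
In pandas 2.0 and later, groupby nth acts as a filter and keeps the original (A, B) index, so the nth-based subtraction above may no longer align on A. A sketch of an alternative, assuming the sample frame from the question: first() and last() each return one row per A, so the subtraction lines up. This is a substitution for nth, not the answer's own method, and it only matches "first minus last", not a reduction over more than two rows.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[1, 2, 1, 2],
     [1, 1, 2, 2],
     [2, 2, 1, 0],
     [1, 2, 1, 2],
     [2, 1, 1, 2]],
    columns=["A", "B", "c1", "c2"],
)

result = df.groupby(["A", "B"]).sum()

# first()/last() take the first/last non-null value per column within each A;
# the sample has no missing values, so these are simply the B=1 and B=2 rows
by_a = result.groupby(level="A")
diff = by_a.first() - by_a.last()
print(diff)
```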