Change 1st instance of every unique row as 1 in pandas - pandas

Hi let us assume i have a data frame
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And i want something like
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
which is essentially i want to change first row of every unique element with one
currently i am using code like:
def store_counter(df):
unique_names = list(df.name.unique())
df['quantity'] = 0
for i,j in df.iterrows():
if j['name'] in unique_outlets:
df.loc[i, 'quantity'] = 1
unique_names.remove(j['name'])
else:
pass
return df
which is highly inefficient. is there a better approach for this?
Thank you in advance.

Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If need set both values use numpy.where:
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1

Related

incompatible index of inserted column with frame index with group by and count

I have data that looks like this:
CHROM POS REF ALT ... is_sever_int is_sever_str is_sever_f encoding_str
0 chr1 14907 A G ... 1 1 one one
1 chr1 14930 A G ... 1 1 one one
These are the columns that I'm interested to perform calculations on (example) :
is_severe snp _id encoding
1 1 one
1 1 two
0 1 one
1 2 two
0 2 two
0 2 one
what I want to do is to count for each snp_id and severe_id how many ones and twos are in the encoding column :
snp_id is_svere encoding_one encoding_two
1 1 1 1
1 0 1 0
2 1 0 1
2 0 1 1
I tried this :
df.groupby(["snp_id","is_sever_f","encoding_str"])["encoding_str"].count()
but it gave the error :
incompatible index of inserted column with frame index
then i tried this:
df["count"]=df.groupby(["snp_id","is_sever_f","encoding_str"],as_index=False)["encoding_str"].count()
and it returned:
Expected a 1D array, got an array with shape (2532831, 3)
how can i fix this? thank you:)
Let's try groupby with whole columns and get size of each group then unstack the encoding index.
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
.unstack(fill_value=0)
.add_prefix('encoding_')
.reset_index())
print(out)
encoding is_severe snp_id encoding_one encoding_two
0 0 1 1 0
1 0 2 1 1
2 1 1 1 1
3 1 2 0 1
Try as follows:
Use pd.get_dummies to convert categorical data in column encoding into indicator variables.
Chain df.groupby and get sum to turn double rows per group into one row (i.e. [0,1] and [1,0] will become [1,1] where df.snp_id == 2 and df.is_severe == 0).
res = pd.get_dummies(data=df, columns=['encoding'])\
.groupby(['snp_id','is_severe'], as_index=False, sort=False).sum()
print(res)
snp_id is_severe encoding_one encoding_two
0 1 1 1 1
1 1 0 1 0
2 2 1 0 1
3 2 0 1 1
If your actual df has more columns, limit the assigment to the data parameter inside get_dummies. I.e. use:
res = pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']],
columns=['encoding']).groupby(['snp_id','is_severe'],
as_index=False, sort=False)\
.sum()

How to check pair of string values in a column, after grouping the dataframe using ID column?

My Doubt in a Table/Dataframe viewI have a dataframe containing 2 columns: ID and Code.
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists with 'B' or 'C', then it should flag 1.
I tried Groupby('ID') with filter(). but it is not showing the perfect result. Could anyone please help ?
You can do the following:
First use pd.groupby('ID') and concatenate the codes using 'sum' to create a new column. Then assing the value 1 if a row contains A or B as Code and when the new column contains an A:
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns = 's')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
You can use boolean masks, direct for B/C, per group for A, then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also a A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0

is there a function where I can do one hot encoding and removing duplicates in R?

I have this database
ID
LABEL
1
A
1
B
2
B
3
c
I'm trying to do an one hot encoding, which I was able to do. However, I also need to remove the duplicated IDs, so my one hot code appears to be like below:
ID
A
B
C
1
1
0
0
1
0
1
0
2
0
1
0
3
0
0
1
and I need this to be the final database
ID
A
B
C
1
1
1
0
2
0
1
0
3
0
0
1
this is my code
dummy <- dummyVars('~ .', data = data_to_be_encoded)
encoded_data <- data.frame(predict(dummy, newdata = data_to_be_encoded))

Pandas merge conflict rows by counts?

A conflict row is that two rows have same feature but with different label, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows to only one label getting from their counts. If I have more a 1, then a will be labeled as 1. Otherwise, a should be labeled as 0.
I can find these conflicts by df1=df.groupy('feature', as_index=Fasle).nunique(),df1 = df1[df1['label]==2]' , and their value counts by df2 = df.groupby("feature")["label"].value_counts().reset_index(name="counts").
But how to find these conflic rows and their counts in one Dataframe (df_conflict = ?), and then merge them by counts, (df_merged = merge(df))?
Lets take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you need first filter groups with count of unique values by DataFrameGroupBy.nunique with GroupBy.transform before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For second get feature with labels by maximum occurencies:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value and then modify it with a mask using isin
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follow:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider below example:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1