pandas: create a new dataframe from existing column values

I have a dataframe like this:
ID code num
333_c_132 x 0
333_c_132 n36 1
998_c_134 x 0
998_c_134 n36 0
997_c_135 x 1
997_c_135 n36 0
From this I have to create a new dataframe like the one below: a new column numX is added, with one row per unique ID. Please note that the numX values are taken from the num column of the corresponding n36 rows.
ID code num numX
333_c_132 x 0 1
998_c_134 x 0 0
997_c_135 x 1 0
How can I do this only using pandas?

You can use a mask, then merge after pivoting:
m = df['code'].eq('n36')
(df[~m].merge(df[m].set_index(['ID', 'code'])['num'].unstack(),
              left_on='ID', right_index=True))
ID code num n36
0 333_c_132 x 0 1
2 998_c_134 x 0 0
4 997_c_135 x 1 0
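Put together as a runnable sketch (the frame constructor and the rename to numX are mine, to match the desired header in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['333_c_132', '333_c_132', '998_c_134', '998_c_134',
           '997_c_135', '997_c_135'],
    'code': ['x', 'n36', 'x', 'n36', 'x', 'n36'],
    'num': [0, 1, 0, 0, 1, 0],
})

m = df['code'].eq('n36')
# Pivot the n36 rows so each ID maps to its n36 num, then merge onto the x rows
out = (df[~m].merge(df[m].set_index(['ID', 'code'])['num'].unstack(),
                    left_on='ID', right_index=True)
             .rename(columns={'n36': 'numX'}))  # rename is my addition
print(out)
```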

Related

incompatible index of inserted column with frame index with group by and count

I have data that looks like this:
CHROM POS REF ALT ... is_sever_int is_sever_str is_sever_f encoding_str
0 chr1 14907 A G ... 1 1 one one
1 chr1 14930 A G ... 1 1 one one
These are the columns that I'm interested to perform calculations on (example) :
is_severe snp_id encoding
1 1 one
1 1 two
0 1 one
1 2 two
0 2 two
0 2 one
What I want to do is count, for each snp_id and is_severe combination, how many ones and twos are in the encoding column:
snp_id is_severe encoding_one encoding_two
1 1 1 1
1 0 1 0
2 1 0 1
2 0 1 1
I tried this:
df.groupby(["snp_id", "is_sever_f", "encoding_str"])["encoding_str"].count()
but it gave the error:
incompatible index of inserted column with frame index
Then I tried this:
df["count"] = df.groupby(["snp_id", "is_sever_f", "encoding_str"], as_index=False)["encoding_str"].count()
and it returned:
Expected a 1D array, got an array with shape (2532831, 3)
How can I fix this? Thank you :)
Let's try groupby on all three columns, get the size of each group, then unstack the encoding level:
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
         .unstack(fill_value=0)
         .add_prefix('encoding_')
         .reset_index())
print(out)
encoding is_severe snp_id encoding_one encoding_two
0 0 1 1 0
1 0 2 1 1
2 1 1 1 1
3 1 2 0 1
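As a self-contained sketch with the example columns from the question (the frame constructor is mine):

```python
import pandas as pd

df = pd.DataFrame({'is_severe': [1, 1, 0, 1, 0, 0],
                   'snp_id':    [1, 1, 1, 2, 2, 2],
                   'encoding':  ['one', 'two', 'one', 'two', 'two', 'one']})

# Count rows per (is_severe, snp_id, encoding), then spread encoding into columns
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
         .unstack(fill_value=0)
         .add_prefix('encoding_')
         .reset_index())
print(out)
```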
Try as follows:
Use pd.get_dummies to convert categorical data in column encoding into indicator variables.
Chain df.groupby and sum to collapse the duplicate rows per group into one row (i.e. [0,1] and [1,0] become [1,1] where df.snp_id == 2 and df.is_severe == 0).
res = pd.get_dummies(data=df, columns=['encoding'])\
        .groupby(['snp_id', 'is_severe'], as_index=False, sort=False).sum()
print(res)
snp_id is_severe encoding_one encoding_two
0 1 1 1 1
1 1 0 1 0
2 2 1 0 1
3 2 0 1 1
If your actual df has more columns, limit the data parameter inside get_dummies to the relevant columns. I.e. use:
res = pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']],
                     columns=['encoding'])\
        .groupby(['snp_id', 'is_severe'], as_index=False, sort=False)\
        .sum()
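A runnable version of this approach, using the example data from the question (the frame constructor is mine):

```python
import pandas as pd

df = pd.DataFrame({'is_severe': [1, 1, 0, 1, 0, 0],
                   'snp_id':    [1, 1, 1, 2, 2, 2],
                   'encoding':  ['one', 'two', 'one', 'two', 'two', 'one']})

# One indicator column per encoding value, then sum the indicators per group
res = (pd.get_dummies(data=df, columns=['encoding'])
         .groupby(['snp_id', 'is_severe'], as_index=False, sort=False)
         .sum())
print(res)
```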

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a data frame
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
which essentially means I want to set the first row of every unique element to one.
Currently I am using code like:
def store_counter(df):
    unique_names = list(df.Name.unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
        else:
            pass
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values, use numpy.where:
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
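Both one-liners can be checked with a minimal frame built from the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'a', 'b', 'b', 'c'],
                   'quantity': [0, 0, 0, 0, 0]})

# First occurrence of each Name gets 1, later duplicates stay 0
df['quantity'] = np.where(df['Name'].duplicated(), 0, 1)
print(df)
```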

Create new columns from categorical variables

ID column_factors column1 column2
0 fact1 d w
1 fact1, fact2 a x
2 fact3 b y
3 fact1,fact4 c z
I have a table in a pandas dataframe. What I would like to do is remove the column "column_factors" and create new columns called "fact1", "fact2", "fact3", "fact4", filling them with dummy values as shown below. Thanks in advance.
ID fact1 fact2 fact3 fact4 column1 column2
0 1 0 0 0 d w
1 1 1 0 0 a x
2 0 0 1 0 b y
3 1 0 0 1 c z
Use Series.str.get_dummies
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html#pandas.Series.str.get_dummies
dummy_cols = df['column_factors'].str.get_dummies(sep=',')
df = df.join(dummy_cols).drop(columns='column_factors')
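A runnable sketch with the question's table (the str.replace to drop spaces is my addition, since the sample mixes "fact1, fact2" and "fact1,fact4"):

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 3],
                   'column_factors': ['fact1', 'fact1, fact2', 'fact3', 'fact1,fact4'],
                   'column1': ['d', 'a', 'b', 'c'],
                   'column2': ['w', 'x', 'y', 'z']})

# Drop spaces so "fact1, fact2" splits cleanly, then build indicator columns
dummy_cols = df['column_factors'].str.replace(' ', '').str.get_dummies(sep=',')
df = df.join(dummy_cols).drop(columns='column_factors')
print(df)
```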

Pandas merge conflict rows by counts?

A conflict row is one where two rows have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows into only one label, based on their counts. If a has more 1s, then a will be labeled 1. Otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby("feature")["label"].value_counts().reset_index(name="counts").
But how do I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Let's take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as an example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you first need to filter the groups with more than one unique value, using DataFrameGroupBy.nunique with GroupBy.transform, before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For the second, get each feature's label by maximum occurrences:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
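Putting both steps together with the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'feature': ['a', 'a', 'b', 'b', 'a', 'c', 'c', 'd'],
                   'label':   [1, 0, 0, 1, 1, 0, 0, 1]})

# Keep only the features whose label is not unique (the conflicts)
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')

# Resolve every feature to its most frequent label
df_merged = (df.groupby('feature')['label']
               .agg(lambda x: x.value_counts().index[0])
               .reset_index())
print(df_conflict)
print(df_merged)
```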

Creating new columns with value names of other columns in pandas

I have a DataFrame as shown below.
DF =
id w R
1 A L
2 B J
3 C L,J
I now want to create new columns that show whether each value from the column R appears in that row.
DF2 =
id w R L J
1 A L 1 0
2 B J 0 1
3 C L,J 1 1
I tried this line, but the result wasn't what I wanted:
for x in DF.R.unique():
    DF[x] = (DF.R == x).astype(int)
DF2 =
id w R L J L,J
1 A L 1 0 0
2 B J 0 1 0
3 C L,J 0 0 1
What is needed to fix this? The DF is also very big, so slow methods won't work.
You need to specify the sep, which in your example is ,:
df.R.str.get_dummies(sep=',')
Out[192]:
J L
0 0 1
1 1 0
2 1 1
I would use pandas' built-in str methods:
chars_to_count = ['L', 'J']
for char in chars_to_count:
    DF[char] = DF['R'].str.count(char)
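For completeness, a runnable sketch of the get_dummies variant joined back onto the frame (the join is my addition, since the desired DF2 keeps the original columns):

```python
import pandas as pd

DF = pd.DataFrame({'id': [1, 2, 3],
                   'w': ['A', 'B', 'C'],
                   'R': ['L', 'J', 'L,J']})

# Split on ',' so 'L,J' yields both indicators, then keep the original columns
DF = DF.join(DF['R'].str.get_dummies(sep=','))
print(DF)
```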