Simple addition of different-sized DataFrames in Pandas

I have two very simple addition questions about Pandas; I hope you can help me.
My first question:
Let's say I have the following two DataFrames, a_df and b_df:
a = [[1,1,1,1],[0,0,0,0],[1,1,0,0]]
a_df = pd.DataFrame(a)
a_df =
0 1 2 3
0 1 1 1 1
1 0 0 0 0
2 1 1 0 0
b = [1,1,1,1]
b_df = pd.DataFrame(b).T
b_df=
0 1 2 3
0 1 1 1 1
I would like to add b_df to a_df to obtain c_df, such that my expected output would be the following:
c_df =
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
The current method I use is to replicate b_df to the same size as a_df and then carry out the addition, as shown below. However, this method is not very efficient when a_df is very large.
a = [[1,1,1,1],[0,0,0,0],[1,1,0,0]]
a_df = pd.DataFrame(a)
b = [1,1,1,1]
b_df = pd.DataFrame(b).T
b_df = pd.concat([b_df]*len(a_df)).reset_index(drop=True)
c_df = a_df + b_df
Are there any other ways to add b_df to a_df (without replicating it) in order to obtain the c_df I want?
My second question is very similar to my first one:
Let's say I have d_df and e_df as follows:
d = [1,1,1,1]
d_df = pd.DataFrame(d)
d_df=
0
0 1
1 1
2 1
3 1
e = [1]
e_df = pd.DataFrame(e)
e_df=
0
0 1
I want to add e_df to d_df such that I would get the following result:
0
0 2
1 2
2 2
3 2
Again, I am currently replicating e_df using the same method as in Question 1 before adding it to d_df:
d = [1,1,1,1]
d_df = pd.DataFrame(d)
e = [1]
e_df = pd.DataFrame(e)
e_df = pd.concat([e_df]*len(d_df)).reset_index(drop=True)
f_df = d_df + e_df
Is there a way without replicating e_df?
Please advise and help me. Thank you so much in advance.
Tommy

Try this:
pd.DataFrame(a_df.to_numpy() + b_df.to_numpy())
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
NumPy offers broadcasting, which lets you add arrays of different sizes the way you want, as long as the shapes are compatible along one dimension. I feel someone has answered something similar to this before; once I find it I will reference it here.
This article from NumPy explains broadcasting pretty well.
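A related sketch: if you extract b_df's single row as a flat NumPy array, adding it to a_df broadcasts row-wise while preserving a_df's labels, so no re-wrapping in pd.DataFrame is needed:

```python
import pandas as pd

a_df = pd.DataFrame([[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
b_df = pd.DataFrame([1, 1, 1, 1]).T

# A 1-D NumPy array of length 4 broadcasts across each of a_df's
# rows, and the result keeps a_df's index and column labels.
c_df = a_df + b_df.to_numpy().ravel()
print(c_df)
```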

For the first question, convert the one-row DataFrame to a Series:
c_df = a_df + b_df.iloc[0]
print (c_df)
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
The same principle works for the second question:
c_df = d_df + e_df.iloc[0]
print (c_df)
0
0 2
1 2
2 2
3 2
More information can be found in How do I operate on a DataFrame with a Series for every column.
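Spelled out with DataFrame.add, which makes the broadcast axis explicit, here is a minimal sketch covering both questions:

```python
import pandas as pd

a_df = pd.DataFrame([[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
b_df = pd.DataFrame([1, 1, 1, 1]).T
d_df = pd.DataFrame([1, 1, 1, 1])
e_df = pd.DataFrame([1])

# The Series taken from b_df's first row is aligned with a_df's
# columns and broadcast down all rows.
c_df = a_df.add(b_df.iloc[0], axis='columns')

# Likewise, e_df's first row aligns with d_df's single column 0.
f_df = d_df.add(e_df.iloc[0], axis='columns')
```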

Related

incompatible index of inserted column with frame index with group by and count

I have data that looks like this:
CHROM POS REF ALT ... is_sever_int is_sever_str is_sever_f encoding_str
0 chr1 14907 A G ... 1 1 one one
1 chr1 14930 A G ... 1 1 one one
These are the columns that I'm interested to perform calculations on (example) :
is_severe snp_id encoding
1 1 one
1 1 two
0 1 one
1 2 two
0 2 two
0 2 one
what I want to do is to count, for each snp_id and is_severe, how many ones and twos are in the encoding column:
snp_id is_severe encoding_one encoding_two
1 1 1 1
1 0 1 0
2 1 0 1
2 0 1 1
I tried this :
df.groupby(["snp_id","is_sever_f","encoding_str"])["encoding_str"].count()
but it gave the error :
incompatible index of inserted column with frame index
Then I tried this:
df["count"]=df.groupby(["snp_id","is_sever_f","encoding_str"],as_index=False)["encoding_str"].count()
and it returned:
Expected a 1D array, got an array with shape (2532831, 3)
How can I fix this? Thank you :)
Let's try groupby on all three columns, get the size of each group, and then unstack the encoding level:
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
.unstack(fill_value=0)
.add_prefix('encoding_')
.reset_index())
print(out)
encoding is_severe snp_id encoding_one encoding_two
0 0 1 1 0
1 0 2 1 1
2 1 1 1 1
3 1 2 0 1
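A self-contained sketch of this answer, with the small example table from the question transcribed into a DataFrame (column names assumed from that table):

```python
import pandas as pd

df = pd.DataFrame({
    'is_severe': [1, 1, 0, 1, 0, 0],
    'snp_id':    [1, 1, 1, 2, 2, 2],
    'encoding':  ['one', 'two', 'one', 'two', 'two', 'one'],
})

# Count rows per (is_severe, snp_id, encoding), then pivot the
# encoding level into columns, filling absent combinations with 0.
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
         .unstack(fill_value=0)
         .add_prefix('encoding_')
         .reset_index())
print(out)
```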
Try as follows:
Use pd.get_dummies to convert categorical data in column encoding into indicator variables.
Chain df.groupby and get sum to turn double rows per group into one row (i.e. [0,1] and [1,0] will become [1,1] where df.snp_id == 2 and df.is_severe == 0).
res = pd.get_dummies(data=df, columns=['encoding'])\
.groupby(['snp_id','is_severe'], as_index=False, sort=False).sum()
print(res)
snp_id is_severe encoding_one encoding_two
0 1 1 1 1
1 1 0 1 0
2 2 1 0 1
3 2 0 1 1
If your actual df has more columns, limit the assignment to the data parameter inside get_dummies. I.e. use:
res = pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']],
columns=['encoding']).groupby(['snp_id','is_severe'],
as_index=False, sort=False)\
.sum()
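A minimal runnable sketch of this get_dummies approach on the question's example data (values transcribed from the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'is_severe': [1, 1, 0, 1, 0, 0],
    'snp_id':    [1, 1, 1, 2, 2, 2],
    'encoding':  ['one', 'two', 'one', 'two', 'two', 'one'],
})

# One indicator column per encoding value, then collapse to one row
# per (snp_id, is_severe) group by summing the indicators.
res = (pd.get_dummies(data=df, columns=['encoding'])
         .groupby(['snp_id', 'is_severe'], as_index=False, sort=False)
         .sum())
print(res)
```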

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a DataFrame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set the quantity of the first row of every unique Name to one.
Currently I am using code like:
def store_counter(df):
    unique_names = list(df.Name.unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
        else:
            pass
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values at once, use numpy.where:
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
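Since a boolean mask cast to int already gives 0/1 values, the same answer can be written as a one-liner; a small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'a', 'b', 'b', 'c'],
                   'quantity': [0, 0, 0, 0, 0]})

# duplicated() is False on the first occurrence of each Name;
# negate it and cast the booleans to integers.
df['quantity'] = (~df['Name'].duplicated()).astype(int)
print(df)
```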

How to explode a list into new columns pandas

Let's say I have the following df
x
1 ['abc','bac','cab']
2 ['bac']
3 ['abc','cab']
And I would like to take each element of each list and turn it into a new column, like so:
abc bac cab
1 1 1 1
2 0 1 0
3 1 0 1
I have referred to multiple links but can't seem to get this correctly.
Thanks!
One approach with str.join + str.get_dummies:
out = df['x'].str.join(',').str.get_dummies(',')
out:
abc bac cab
0 1 1 1
1 0 1 0
2 1 0 1
Or with explode + pd.get_dummies then groupby max:
out = pd.get_dummies(df['x'].explode()).groupby(level=0).max()
out:
abc bac cab
0 1 1 1
1 0 1 0
2 1 0 1
You can also use pd.crosstab after explode if you want counts instead of dummies:
s = df['x'].explode()
out = pd.crosstab(s.index, s)
out:
x abc bac cab
row_0
0 1 1 1
1 0 1 0
2 1 0 1
*Note: the output is the same here, but crosstab will give counts if there are duplicate entries.
DataFrame:
import pandas as pd
df = pd.DataFrame({
'x': [['abc', 'bac', 'cab'], ['bac'], ['abc', 'cab']]
})
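Putting the first approach together with this DataFrame as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [['abc', 'bac', 'cab'], ['bac'], ['abc', 'cab']]
})

# Join each list into a comma-delimited string, then let
# str.get_dummies split it into one indicator column per value.
out = df['x'].str.join(',').str.get_dummies(',')
print(out)
```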
I would use MultiLabelBinarizer from scikit-learn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['x']), columns=mlb.classes_, index=df.index)

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum based on the prefixes C1, H2, K3 and output three new columns with the total sums. The final result is this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
    idx = df.columns.str.startswith(k)
    for j in lst2:
        df[j] = df.iloc[:, idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting error
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help will be appreciated (or better suggestion as oppose to loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try split, then groupby on the prefixes with axis=1:
out = (df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
         .set_index('sn')
         .add_prefix('Total_')
         .reset_index())
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value first and then modify the masked rows using isin:
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follows:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider the example below:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1
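The original np.where attempt only needs isin in place of the Python or, which is what triggered the ambiguity error; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VALUE': [1, 1, 2, 2, 3, 3, 4, 4]})

# isin returns a boolean Series, so np.where can branch on it
# without triggering the "truth value is ambiguous" error.
df['IND'] = np.where(df['VALUE'].isin([1, 4]), 1, 0)
print(df)
```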