Combine nested columns - pandas

I have the following:
import pandas as pd

df1 = pd.DataFrame({'data': [1,2,3]})
df2 = pd.DataFrame({'data': [4,5,6]})
df = pd.concat([df1, df2], keys=['hello','world'], axis=1)
df[('hello','new_col')] = df[('world','data')] * 2
print (df)
  hello world   hello
   data  data new_col
0     1     4       8
1     2     5      10
2     3     6      12
When I add a new nested column as above, it gets separated from the existing hello column. How do I add a new nested column so that new_col sits beneath the existing hello level? Can this be done during assignment, or only afterwards? I.e. I want the below:
  hello         world
   data new_col  data
0     1       8     4
1     2      10     5
2     3      12     6

You can do this afterwards by reselecting the top-level keys in the desired order:
df = df[['hello', 'world']]
print(df)
  hello         world
   data new_col  data
0     1       8     4
1     2      10     5
2     3      12     6
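If you would rather not list the keys by hand, a small alternative sketch (not from the original answer): sorting the column MultiIndex groups every sub-column under its top-level key, which gives the same layout here.
# not in the original answer: lexicographically sort the column MultiIndex
df = df.sort_index(axis=1)
print (df)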

Related

Add values in columns of multiple dataframes if values in another column are same

Question related to pandas dataframe
df1:
id  count
 1      3
 2      7
 3     11
df2:
id  count
 3      6
 4      8
 5      2
df3:
id  count
 2      1
 4      3
 6      9
Expected output df:
id  count
 1      3
 2      8
 3     17
 4     11
 5      2
 6      9
Any help is appreciated. Thanks in advance!
Use concat and aggregate sum:
df = pd.concat([df1, df2, df3]).groupby('id', as_index=False).sum()
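A minimal, self-contained reproduction, with the column data taken from the question:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'count': [3, 7, 11]})
df2 = pd.DataFrame({'id': [3, 4, 5], 'count': [6, 8, 2]})
df3 = pd.DataFrame({'id': [2, 4, 6], 'count': [1, 3, 9]})

# stack all rows, then sum count per id; as_index=False keeps id as a column
df = pd.concat([df1, df2, df3]).groupby('id', as_index=False).sum()
print (df)
#    id  count
# 0   1      3
# 1   2      8
# 2   3     17
# 3   4     11
# 4   5      2
# 5   6      9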

Dataframe count of columns matching value in another column in that row

How to find the count of columns with the same value as a specified column, in a dataframe with a large number of rows?
For instance, the df below was generated with random values:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(5, 4)), columns=list('ABCD'))
df.index.name = 'id'
    A  B  C  D
id
0   7  6  6  2
1   6  5  3  5
2   8  8  0  9
3   0  2  8  9
4   4  3  8  5
bc_cols = ['B', 'C']
df['BC_max'] = df[bc_cols].max(axis=1)
    A  B  C  D  BC_max
id
0   7  6  6  2       6
1   6  5  3  5       5
2   8  8  0  9       8
3   0  2  8  9       8
4   4  3  8  5       8
For each row, we want to get the number of columns whose value matches the max. I was able to get it by doing this.
df["BC_freq"] = df[bc_cols].stack().groupby(by='id').apply(lambda g: g[g==g.max()].count())
    A  B  C  D  BC_max  BC_freq
id
0   7  6  6  2       6        2
1   6  5  3  5       5        1
2   8  8  0  9       8        1
3   0  2  8  9       8        1
4   4  3  8  5       8        1
But this is turning out to be very inefficient and slow. We need to do this on a fairly large dataframe with several hundred thousand rows so I am looking for an efficient way to do this. Any ideas?
Once you have BC_max why not re-use it:
def get_bc_freq(row):
    if (row.B == row.BC_max) and (row.C == row.BC_max):
        return 2
    elif (row.B == row.BC_max) or (row.C == row.BC_max):
        return 1
    return 0

df['freq'] = df.apply(lambda row: get_bc_freq(row), axis=1)
Or the prettier one-liner:
df['freq'] = df.apply(lambda row: [row.B, row.C].count(row.BC_max), axis=1)
UPDATE - to make the columns you use more dynamic, you could use a list comprehension (not sure how much this helps with performance, but...):
cols_to_use = ['B', 'C']
df['freq'] = df.apply(lambda row: [row[x] for x in cols_to_use].count(row.BC_max), axis=1)
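If apply is still too slow on several hundred thousand rows, here is a vectorized sketch (my addition, not part of the original answer) that compares the chosen columns against BC_max and counts the matches per row:
# not from the original answer: row-wise comparison without a Python-level loop
cols_to_use = ['B', 'C']
df['freq'] = df[cols_to_use].eq(df['BC_max'], axis=0).sum(axis=1)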

Conditional count of cumulative sum Dataframe - Loop through columns

I'm trying to compute a cumulative sum with a reset within a dataframe, based on the sign of each value. The idea is to do the same exercise for each column separately.
For example, let's assume I have the following dataframe:
df = pd.DataFrame({'A': [1,1,1,-1,-1,1,1,1,1,-1,-1,-1],'B':[1,1,-1,-1,-1,1,1,1,-1,-1,-1,1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
For each column, I want to compute the cumulative sum until I find a change in sign; in which case, the sum should be reset to 1. For the example above, I am expecting the following result:
df1 = pd.DataFrame({'A_cumcount': [1,2,3,1,2,1,2,3,4,1,2,3],
                    'B_cumcount': [1,2,1,2,3,1,2,3,1,2,3,4]},
                   index=[0,1,2,3,4,5,6,7,8,9,10,11])
Similar issue has been discussed here: Pandas: conditional rolling count
I have tried the following code:
nb_col = len(df.columns)  # number of columns in the dataframe
for i in range(0, int(nb_col)):  # loop through the columns of the dataframe
    name = df.columns[i]  # read the column name
    name = name + '_cumcount'
    # add a column for the calculation
    df = df.reindex(columns=np.append(df.columns.values, [name]))
    df[df.columns[nb_col+i]] = df.groupby((df[df.columns[i]] != df[df.columns[i]].shift(1)).cumsum()).cumcount() + 1
My question is, is there a way to avoid this for loop? So I can avoid appending a new column each time and make the computation faster. Thank you
Answers received (all working fine):
From #nixon
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).add_suffix('_cumcount')
From #jezrael
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1).add_suffix('_cumcount'))
From #Scott Boston:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
I think in pandas a loop is needed, e.g. via apply:
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
.add_suffix('_cumcount'))
print (df1)
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
You can try this:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
Output:
    A  B
0   1  1
1   2  2
2   3  1
3   1  2
4   2  3
5   1  1
6   2  2
7   3  3
8   4  1
9   1  2
10  2  3
11  3  1
You can start by grouping by where the changes in the sequence occur, via x.diff().ne(0).cumsum(), and then using cumcount over the groups:
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum())
.cumcount()+1).add_suffix('_cumcount')
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
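To see why this works, here is a small illustrative sketch (my addition, using the question's df) that prints the grouping key x.diff().ne(0).cumsum() produces for column A; cumcount then restarts inside each run of identical sign:
# illustration only: the helper grouping key for column A
s = df['A']
key = s.diff().ne(0).cumsum()
print (pd.concat({'A': s, 'group': key}, axis=1))
# every run of equal values gets its own group id, so
# groupby(key).cumcount() + 1 restarts at 1 when the sign flips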

Add new columns using a pandas Series [duplicate]

This question already has answers here:
How to assign values to multiple non existing columns in a pandas dataframe?
(2 answers)
Closed 4 years ago.
I have a pandas DataFrame and a pandas Series. I want to add new constant columns whose values come from the Series. In an example:
In [1]: import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,2,3,2,5]})
In [2]: df1
Out[2]:
   a  b
0  1  2
1  2  2
2  3  3
3  4  2
4  5  5
In [3]: s1 = pd.Series({'c':2, 'd':3})
In [4]: s1
Out[4]:
c    2
d    3
dtype: int64
In [5]: for key, value in s1.to_dict().items():
   ...:     df1[key] = value
My ugly loop does what I want, but there must be a better solution, maybe using some merge or group operation, I guess.
In [6]: df1
Out[6]:
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
Any suggestions?
Use assign, unpacking the Series with **:
df1 = df1.assign(**s1)
print (df1)
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
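For reference, the ** unpacking is equivalent to spelling the keyword arguments out; a scalar passed to assign is broadcast down the whole column:
df1 = df1.assign(c=2, d=3)  # same result as df1.assign(**s1)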
A NumPy solution for a new DataFrame, using numpy.broadcast_to and join:
import numpy as np

df = pd.DataFrame(np.broadcast_to(s1.values, (len(df1), len(s1))),
                  index=df1.index,
                  columns=s1.index)
df1 = df1.join(df)
print (df1)
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their location in the dataframe, and store this in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
  Num
0   1
1   2
2   3
3   4
4   5
dataframe : df2
    0   1  2   3   4
0   9  12  8   6   7
1  11   1  4  10  13
2   5  14  2   0   3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
So I'm looking for a result like this:
dataframe : df1 updated
  Num  X  Y
0   1  1  1
1   2  2  2
2   3  2  4
3   4  1  2
4   5  2  0
Any hints or ideas would be appreciated.
Instead of trying to find the value in df2, why not just make df2 a flat dataframe?
df2 = pd.melt(df2)
df2.reset_index(inplace=True)
df2.columns = ['X', 'Y', 'Num']
so now your df2 just looks like this:
Index  X  Y  Num
    0  0  0    9
    1  1  0   11
    2  2  0    5
    3  3  1   12
    4  4  1    1
    5  5  1   14
You can of course sort by Num and if you just want the values from your list you can further filter df2:
df2 = df2[df2.Num.isin(my_list)]
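If you also want df1 updated with the X/Y coordinates as in the question, here is a possible follow-up sketch (my addition, not part of the answer, starting again from the original wide df2): keep df2's row index while melting, so X is the row and Y is the column, then merge back on Num; df1['Num'] holds strings, so the type is aligned first.
# assumption: X = row of the original df2, Y = its column, per the expected output
flat = df2.reset_index().melt(id_vars='index', var_name='Y', value_name='Num')
flat = flat.rename(columns={'index': 'X'})
flat['Num'] = flat['Num'].astype(str)   # df1['Num'] contains strings
df1 = df1.merge(flat[['Num', 'X', 'Y']], on='Num', how='left')
print (df1)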