Combine nested columns - pandas

I have the following:
import pandas as pd

df1 = pd.DataFrame({'data': [1,2,3]})
df2 = pd.DataFrame({'data': [4,5,6]})
df = pd.concat([df1, df2], keys=['hello','world'], axis=1)
df[('hello','new_col')] = df[('world','data')] * 2
print (df)
  hello world   hello
   data  data new_col
0     1     4       8
1     2     5      10
2     3     6      12
When I add a new nested column as above, it gets separated from the existing hello column. How do I add a new nested column so that new_col sits beneath the existing hello level? Can this be done during assignment, or only afterwards? I.e. I want the below:
  hello         world
   data new_col  data
0     1       8     4
1     2      10     5
2     3      12     6

You can do this afterwards by reselecting the top-level keys in the desired order:
df = df[['hello', 'world']]
print(df)
  hello         world
   data new_col  data
0     1       8     4
1     2      10     5
2     3      12     6
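If you would rather not list the keys by hand, a small alternative sketch (not from the original answer): sorting the column MultiIndex groups every sub-column under its top-level key, which gives the same layout here.
# not in the original answer: lexicographically sort the column MultiIndex
df = df.sort_index(axis=1)
print (df)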

Related

Add values in columns of multiple dataframes if values in another column are same

Question related to pandas dataframe
df1:
id  count
 1      3
 2      7
 3     11
df2:
id  count
 3      6
 4      8
 5      2
df3:
id  count
 2      1
 4      3
 6      9
Expected output df:
id  count
 1      3
 2      8
 3     17
 4     11
 5      2
 6      9
Any help is appreciated. Thanks in advance!
Use concat and aggregate sum:
df = pd.concat([df1, df2, df3]).groupby('id', as_index=False).sum()
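A minimal, self-contained reproduction, with the column data taken from the question:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'count': [3, 7, 11]})
df2 = pd.DataFrame({'id': [3, 4, 5], 'count': [6, 8, 2]})
df3 = pd.DataFrame({'id': [2, 4, 6], 'count': [1, 3, 9]})

# stack all rows, then sum count per id; as_index=False keeps id as a column
df = pd.concat([df1, df2, df3]).groupby('id', as_index=False).sum()
print (df)
#    id  count
# 0   1      3
# 1   2      8
# 2   3     17
# 3   4     11
# 4   5      2
# 5   6      9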

Dataframe count of columns matching value in another column in that row

How to find the count of columns with the same value as a specified column, in a dataframe with a large number of rows?
For instance, the df below was generated with random values:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(5, 4)), columns=list('ABCD'))
df.index.name = 'id'
    A  B  C  D
id
0   7  6  6  2
1   6  5  3  5
2   8  8  0  9
3   0  2  8  9
4   4  3  8  5
bc_cols = ['B', 'C']
df['BC_max'] = df[bc_cols].max(axis=1)
    A  B  C  D  BC_max
id
0   7  6  6  2       6
1   6  5  3  5       5
2   8  8  0  9       8
3   0  2  8  9       8
4   4  3  8  5       8
For each row, we want to get the number of columns whose value matches the max. I was able to get it by doing this.
df["BC_freq"] = df[bc_cols].stack().groupby(by='id').apply(lambda g: g[g==g.max()].count())
    A  B  C  D  BC_max  BC_freq
id
0   7  6  6  2       6        2
1   6  5  3  5       5        1
2   8  8  0  9       8        1
3   0  2  8  9       8        1
4   4  3  8  5       8        1
But this is turning out to be very inefficient and slow. We need to do this on a fairly large dataframe with several hundred thousand rows so I am looking for an efficient way to do this. Any ideas?
Once you have BC_max why not re-use it:
def get_bc_freq(row):
    if (row.B == row.BC_max) and (row.C == row.BC_max):
        return 2
    elif (row.B == row.BC_max) or (row.C == row.BC_max):
        return 1
    return 0

df['freq'] = df.apply(lambda row: get_bc_freq(row), axis=1)
Or the prettier one-liner:
df['freq'] = df.apply(lambda row: [row.B, row.C].count(row.BC_max), axis=1)
UPDATE - to make the columns you use more dynamic, you could use a list comprehension (not sure how much this helps with performance, but...):
cols_to_use = ['B', 'C']
df['freq'] = df.apply(lambda row: [row[x] for x in cols_to_use].count(row.BC_max), axis=1)
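If apply is still too slow on several hundred thousand rows, here is a vectorized sketch (my addition, not part of the original answer) that compares the chosen columns against BC_max and counts the matches per row:
# not from the original answer: row-wise comparison without a Python-level loop
cols_to_use = ['B', 'C']
df['freq'] = df[cols_to_use].eq(df['BC_max'], axis=0).sum(axis=1)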

Conditional count of cumulative sum Dataframe - Loop through columns

I'm trying to compute a cumulative sum with a reset within a dataframe, based on the sign of each value. The idea is to do the same exercise for each column separately.
For example, let's assume I have the following dataframe:
df = pd.DataFrame({'A': [1,1,1,-1,-1,1,1,1,1,-1,-1,-1],'B':[1,1,-1,-1,-1,1,1,1,-1,-1,-1,1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
For each column, I want to compute the cumulative sum until I find a change in sign; in which case, the sum should be reset to 1. For the example above, I am expecting the following result:
df1 = pd.DataFrame({'A_cumcount': [1,2,3,1,2,1,2,3,4,1,2,3],
                    'B_cumcount': [1,2,1,2,3,1,2,3,1,2,3,4]},
                   index=[0,1,2,3,4,5,6,7,8,9,10,11])
Similar issue has been discussed here: Pandas: conditional rolling count
I have tried the following code:
nb_col = len(df.columns)  # number of columns in the dataframe
for i in range(0, int(nb_col)):  # loop through the columns of the dataframe
    name = df.columns[i]  # read the column name
    name = name + '_cumcount'
    # add a column for the calculation
    df = df.reindex(columns=np.append(df.columns.values, [name]))
    df[df.columns[nb_col+i]] = df.groupby((df[df.columns[i]] != df[df.columns[i]].shift(1)).cumsum()).cumcount() + 1
My question is, is there a way to avoid this for loop? So I can avoid appending a new column each time and make the computation faster. Thank you
Answers received (all working fine):
From #nixon
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).add_suffix('_cumcount')
From #jezrael
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1).add_suffix('_cumcount'))
From #Scott Boston:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
I think in pandas a loop is needed, e.g. via apply:
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
.add_suffix('_cumcount'))
print (df1)
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
You can try this:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
Output:
    A  B
0   1  1
1   2  2
2   3  1
3   1  2
4   2  3
5   1  1
6   2  2
7   3  3
8   4  1
9   1  2
10  2  3
11  3  1
You can start by grouping by where the changes in the sequence occur, via x.diff().ne(0).cumsum(), and then using cumcount over the groups:
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum())
.cumcount()+1).add_suffix('_cumcount')
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
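To see why this works, here is a small illustrative sketch (my addition, using the question's df) that prints the grouping key x.diff().ne(0).cumsum() produces for column A; cumcount then restarts inside each run of identical sign:
# illustration only: the helper grouping key for column A
s = df['A']
key = s.diff().ne(0).cumsum()
print (pd.concat({'A': s, 'group': key}, axis=1))
# every run of equal values gets its own group id, so
# groupby(key).cumcount() + 1 restarts at 1 when the sign flips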

Add new columns using a pandas Series [duplicate]

This question already has answers here:
How to assign values to multiple non existing columns in a pandas dataframe?
(2 answers)
Closed 4 years ago.
I have a pandas DataFrame and a pandas Series. I want to add new constant columns whose values come from the Series. In an example:
In [1]: import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,2,3,2,5]})
In [2]: df1
Out[2]:
   a  b
0  1  2
1  2  2
2  3  3
3  4  2
4  5  5
In [3]: s1 = pd.Series({'c':2, 'd':3})
In [4]: s1
Out[4]:
c    2
d    3
dtype: int64
In [5]: for key, value in s1.to_dict().items():
   ...:     df1[key] = value
My ugly loop does what I want, but there must be a better solution, maybe using some merge or group operation, I guess.
In [6]: df1
Out[6]:
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
Any suggestions?
Use assign, unpacking the Series with **:
df1 = df1.assign(**s1)
print (df1)
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
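For reference, the ** unpacking is equivalent to spelling the keyword arguments out; a scalar passed to assign is broadcast down the whole column:
df1 = df1.assign(c=2, d=3)  # same result as df1.assign(**s1)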
A NumPy solution for a new DataFrame, using numpy.broadcast_to and join:
import numpy as np

df = pd.DataFrame(np.broadcast_to(s1.values, (len(df1), len(s1))),
                  index=df1.index,
                  columns=s1.index)
df1 = df1.join(df)
print (df1)
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their location in the dataframe, and store this in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
  Num
0   1
1   2
2   3
3   4
4   5
dataframe : df2
    0   1  2   3   4
0   9  12  8   6   7
1  11   1  4  10  13
2   5  14  2   0   3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
So I'm looking for a result like this:
dataframe : df1 updated
  Num  X  Y
0   1  1  1
1   2  2  2
2   3  2  4
3   4  1  2
4   5  2  0
Any hints or ideas would be appreciated.
Instead of trying to find the value in df2, why not just make df2 a flat dataframe?
df2 = pd.melt(df2)
df2.reset_index(inplace=True)
df2.columns = ['X', 'Y', 'Num']
so now your df2 just looks like this:
Index  X  Y  Num
    0  0  0    9
    1  1  0   11
    2  2  0    5
    3  3  1   12
    4  4  1    1
    5  5  1   14
You can of course sort by Num and if you just want the values from your list you can further filter df2:
df2 = df2[df2.Num.isin(my_list)]
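If you also want df1 updated with the X/Y coordinates as in the question, here is a possible follow-up sketch (my addition, not part of the answer, starting again from the original wide df2): keep df2's row index while melting, so X is the row and Y is the column, then merge back on Num; df1['Num'] holds strings, so the type is aligned first.
# assumption: X = row of the original df2, Y = its column, per the expected output
flat = df2.reset_index().melt(id_vars='index', var_name='Y', value_name='Num')
flat = flat.rename(columns={'index': 'X'})
flat['Num'] = flat['Num'].astype(str)   # df1['Num'] contains strings
df1 = df1.merge(flat[['Num', 'X', 'Y']], on='Num', how='left')
print (df1)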