Need to loop over pandas series to find indices of variable - pandas

I have a dataframe and a list. I would like to iterate over the elements in the list, find their locations in the dataframe, and then store these in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
Num
0 1
1 2
2 3
3 4
4 5
dataframe : df2
0 1 2 3 4
0 9 12 8 6 7
1 11 1 4 10 13
2 5 14 2 0 3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
So I'm looking for a result like this:
dataframe : df1 updated
Num X Y
0 1 1 1
1 2 2 2
2 3 2 4
3 4 1 2
4 5 2 0
Any hints or ideas would be appreciated.

Instead of trying to find the value in df2, why not just make df2 a flat dataframe?
df2 = pd.melt(df2)
df2.reset_index(inplace=True)
df2.columns = ['X', 'Y', 'Num']
So now your df2 just looks like this:
Index X Y Num
0 0 0 9
1 1 0 11
2 2 0 5
3 3 1 12
4 4 1 1
5 5 1 14
You can of course sort by Num, and if you just want the values from your list you can filter df2 further:
df2 = df2[df2.Num.isin(my_list)]
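If you do want the exact X/Y coordinates from the question, here is a minimal sketch of an alternative (assuming df2 is laid out as shown above; note that df1 stores Num as strings): stack df2 so every value keeps its row/column labels, then merge onto df1.
import pandas as pd

my_list = ['1', '2', '3', '4', '5']
df1 = pd.DataFrame(my_list, columns=['Num'])
df2 = pd.DataFrame([[9, 12, 8, 6, 7],
                    [11, 1, 4, 10, 13],
                    [5, 14, 2, 0, 3]])

# stack() turns df2 into a Series indexed by (row, column) pairs
coords = df2.stack().reset_index()
coords.columns = ['X', 'Y', 'Num']

# df1 stores Num as strings, so align the types before merging
coords['Num'] = coords['Num'].astype(str)
print(df1.merge(coords, on='Num', how='left'))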

Related

change all values in a dataframe with other values from another dataframe

I just started learning pandas.
I have 2 dataframes.
The first one is
val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4
and the second one is
0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2
I want to change my second dataframe so that its values are compared with the val column of the first dataframe, and every matching value is replaced with the corresponding value from the num column of dataframe 1. This means that in the end I need to get the following dataframe:
0 1 2 3
0 0 1 2 3
1 4 2 1 1
2 1 4 2 1
How do I do that in pandas?
You can use DataFrame.replace() to do this:
df2.replace(df1.set_index('val')['num'])
Explanation:
The first step is to set the val column of the first DataFrame as the index. This changes how the matching is performed in the third step.
Subsetting to the num column then converts the first DataFrame to a Series (keeping val as the index). It looks like this:
val
1 0
2 1
3 2
4 3
5 4
Name: num, dtype: int64
Next, use DataFrame.replace() to do the replacement in the second DataFrame. It looks up each value from the second DataFrame, finds a matching index in the Series, and replaces it with the value from the Series.
Full reproducible example:
import pandas as pd
import io
s = """ val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4"""
df1 = pd.read_csv(io.StringIO(s), sep=r'\s+')
s = """ 0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2"""
df2 = pd.read_csv(io.StringIO(s), sep=r'\s+')
print(df2.replace(df1.set_index('val')['num']))
Create the mapping dict, then replace:
mpd = dict(zip(df1.val, df1.num))
df2.replace(mpd, inplace=True)
0 1 2 3
0 0 1 2 3
1 4 2 1 1
2 1 4 2 1
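For large frames, replace with a dict can be slow; a per-column map is often faster. A sketch, assuming you want to fall back to the original value wherever the dict has no matching key:
# map each column through the dict; unmapped values become NaN, so restore them
df2 = df2.apply(lambda col: col.map(mpd).fillna(col))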

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
The following DataFrame:
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0), but it doesn't match and gives me the maximum value of the calculated column for every row (= 10 in the example).
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try:
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line, you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
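For what it's worth, the np.maximum call quoted in the question is elementwise, so it should already work as written; the behavior described there (a single scalar, 10) is what np.max would produce instead. A quick check:
import numpy as np
# np.maximum broadcasts the scalar 0 and compares element by element
df['D'] = np.maximum(df.A - df.B + df.C, 0)
print(df)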

Dataframe count of columns matching value in another column in that row

How do I find the count of columns with the same value as a specified column in a dataframe with a large number of rows?
For instance, consider the df below:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 4)), columns=list('ABCD'))
df.index.name = 'id'
A B C D
id
0 7 6 6 2
1 6 5 3 5
2 8 8 0 9
3 0 2 8 9
4 4 3 8 5
bc_cols = ['B', 'C']
df['BC_max'] = df[bc_cols].max(axis=1)
A B C D BC_max
id
0 7 6 6 2 6
1 6 5 3 5 5
2 8 8 0 9 8
3 0 2 8 9 8
4 4 3 8 5 8
For each row, we want to get the number of columns with the value matching the max. I was able to get it by doing this.
df["freq"] = df[bc_cols].stack().groupby(by='id').apply(lambda g: g[g==g.max()].count())
A B C D BC_max BC_freq
id
0 7 6 6 2 6 2
1 6 5 3 5 5 1
2 8 8 0 9 8 1
3 0 2 8 9 8 1
4 4 3 8 5 8 1
But this is turning out to be very inefficient and slow. We need to do this on a fairly large dataframe with several hundred thousand rows, so I am looking for an efficient way to do this. Any ideas?
Once you have BC_max, why not re-use it:
def get_bc_freq(row):
    if (row.B == row.BC_max) and (row.C == row.BC_max):
        return 2
    elif (row.B == row.BC_max) or (row.C == row.BC_max):
        return 1
    return 0

df['freq'] = df.apply(get_bc_freq, axis=1)
Or the prettier one-liner:
df['freq'] = df.apply(lambda row: [row.B, row.C].count(row.BC_max), axis=1)
UPDATE: to make the columns you use more dynamic, you could use a list comprehension (not sure how much this helps with performance, but...):
cols_to_use = ['B', 'C']
df['freq'] = df.apply(lambda row: [row[x] for x in cols_to_use].count(row.BC_max), axis=1)
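Since the question asks about hundreds of thousands of rows, a fully vectorized sketch (no per-row apply, using the bc_cols list from the question) would compare the chosen columns to the row max and count the matches:
df['BC_max'] = df[bc_cols].max(axis=1)
# eq(..., axis=0) compares each column to the BC_max Series row by row
df['BC_freq'] = df[bc_cols].eq(df['BC_max'], axis=0).sum(axis=1)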

Conditional count of cumulative sum Dataframe - Loop through columns

I'm trying to compute a cumulative sum with a reset within a dataframe, based on the sign of each value. The idea is to do the same exercise for each column separately.
For example, let's assume I have the following dataframe:
df = pd.DataFrame({'A': [1,1,1,-1,-1,1,1,1,1,-1,-1,-1],'B':[1,1,-1,-1,-1,1,1,1,-1,-1,-1,1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
For each column, I want to compute the cumulative sum until I find a change in sign, in which case the sum should reset to 1. For the example above, I am expecting the following result:
df1 = pd.DataFrame({'A_cumcount': [1,2,3,1,2,1,2,3,4,1,2,3], 'B_cumcount': [1,2,1,2,3,1,2,3,1,2,3,4]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
Similar issue has been discussed here: Pandas: conditional rolling count
I have tried the following code:
nb_col = len(df.columns)  # number of columns in the dataframe
for i in range(nb_col):  # loop through the columns of the dataframe
    name = df.columns[i] + '_cumcount'  # build the new column name
    # add a column for the calculation
    df = df.reindex(columns=np.append(df.columns.values, [name]))
    df[df.columns[nb_col + i]] = df.groupby((df[df.columns[i]] != df[df.columns[i]].shift(1)).cumsum()).cumcount() + 1
My question is: is there a way to avoid this for loop, so I can avoid appending a new column each time and make the computation faster? Thank you.
Answers received (all working fine):
From #nixon
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).add_suffix('_cumcount')
From #jezrael
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1).add_suffix('_cumcount'))
From #Scott Boston:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
I think a loop is needed in pandas here, e.g. via apply:
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
.add_suffix('_cumcount'))
print (df1)
A_cumcount B_cumcount
0 1 1
1 2 2
2 3 1
3 1 2
4 2 3
5 1 1
6 2 2
7 3 3
8 4 1
9 1 2
10 2 3
11 3 1
You can try this:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
Output:
A B
0 1 1
1 2 2
2 3 1
3 1 2
4 2 3
5 1 1
6 2 2
7 3 3
8 4 1
9 1 2
10 2 3
11 3 1
You can start by grouping by the points where the sequence changes, obtained with x.diff().ne(0).cumsum(), and then use cumcount over the groups:
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum())
.cumcount()+1).add_suffix('_cumcount')
A_cumcount B_cumcount
0 1 1
1 2 2
2 3 1
3 1 2
4 2 3
5 1 1
6 2 2
7 3 3
8 4 1
9 1 2
10 2 3
11 3 1
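To see why the x.diff().ne(0).cumsum() key restarts the count, here is a small illustration of the intermediate group labels for a single column:
import pandas as pd

s = pd.Series([1, 1, 1, -1, -1, 1])
# diff() is nonzero exactly where the value changes, so cumsum() starts a new group there
print(s.diff().ne(0).cumsum().tolist())                              # [1, 1, 1, 2, 2, 3]
print((s.groupby(s.diff().ne(0).cumsum()).cumcount() + 1).tolist())  # [1, 2, 3, 1, 2, 1]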

Pandas dataframe rename column

I split a dataframe into two parts and changed their column names separately. Here's what I got:
df1 = df[df['colname'] == 0]
df2 = df[df['colname'] == 1]
df1.columns = ['a' + x for x in df1.columns]
df2.columns = ['b' + x for x in df2.columns]
And it turned out df2's columns start with 'ba' rather than just 'b'. What happened?
I cannot reproduce your problem; for me it works fine.
An alternative solution is add_prefix instead of a list comprehension:
df = pd.DataFrame({'colname':[0,1,0,0,0,1],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
C D E F colname
0 7 1 5 a 0
1 8 3 3 a 1
2 9 5 6 a 0
3 4 7 9 b 0
4 2 1 2 b 0
5 3 0 4 b 1
df1 = df[df['colname']==0].add_prefix('a')
df2 = df[df['colname']==1].add_prefix('b')
print (df1)
aC aD aE aF acolname
0 7 1 5 a 0
2 9 5 6 a 0
3 4 7 9 b 0
4 2 1 2 b 0
print (df2)
bC bD bE bF bcolname
1 8 3 3 a 1
5 3 0 4 b 1
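One plausible cause of the original 'ba' symptom, as a guess (the problem could not be reproduced): if the 'b' prefix line ran against columns that had already received the 'a' prefix, for example by re-running the cell or by slicing df2 out of an already-prefixed frame, each run prepends another character:
cols = ['aC', 'aD']             # columns that already carry the 'a' prefix
cols = ['b' + x for x in cols]  # -> ['baC', 'baD']
Using add_prefix on a fresh slice, as above, avoids stacking prefixes.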