Create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)

You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
...     df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
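
The loop above can also be written without an explicit loop. The following is a minimal sketch, assuming the same sample data as in the answer; the names `indicators`, `wide` and `result` are only illustrative. It builds one 0/1 indicator column per product with `pd.get_dummies` and scales each row by its amount.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({'amount': [6, 3, 3, 7, 7],
                   'product': ['b', 'c', 'a', 'a', 'a']})

# One 0/1 indicator column per distinct product (dtype=int avoids booleans).
indicators = pd.get_dummies(df['product'], dtype=int)

# Multiply every indicator column by the row's amount, then rename the columns.
wide = indicators.mul(df['amount'], axis=0).add_suffix('_amt')

# Attach the new columns to the original frame.
result = df.join(wide)
print(result)
```

Unlike the loop, this produces the new columns in sorted label order (a_amt, b_amt, c_amt) rather than in order of first appearance.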

Related

Fill the row in a data frame with a specific value based on a condition on the specific column

I have a data frame df:
df=
A B C D
1 4 7 2
2 6 -3 9
-2 7 2 4
I am interested in changing all values in a row to 0 if its element in column C is negative, i.e. if df['C']<0, the corresponding row should be filled with the value 0 as shown below:
df=
A B C D
1 4 7 2
0 0 0 0
-2 7 2 4
You can use DataFrame.where or mask:
df.where(df['C'] >= 0, 0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Another option is simple masking via multiplication:
df.mul(df['C'] >= 0, axis=0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
You can also set values directly via loc as shown in this comment:
df.loc[df['C'] < 0] = 0
df
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Which has the added benefit of modifying the original DataFrame (if you'd rather not return a copy).
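
To make the copy versus in-place difference concrete, here is a small sketch that assumes the sample frame from the question. It uses `DataFrame.mask` (the inverse of `where`, mentioned above) to get a new frame and then `loc` to change the original; the variable names are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'A': [1, 2, -2],
                   'B': [4, 6, 7],
                   'C': [7, -3, 2],
                   'D': [2, 9, 4]})

negative = df['C'] < 0          # boolean Series, True for rows to blank out

# mask() replaces values where the condition is True and returns a new frame.
zeroed = df.mask(negative, 0)
row_after_mask = df.loc[1].tolist()   # df itself still holds the old values

# loc assignment writes into df directly.
df.loc[negative] = 0

print(zeroed)
print(df)
```

Which one fits depends on whether the original values are still needed afterwards.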

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
Given the following DataFrame:
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal for each row to max (0 ; A-B+C)
I tried np.max(df.A-df.B+df.C,0), but that is not what I want: it gives me the maximum value of the whole calculated column (= 10 in the example) instead of a value for each row.
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the max function to each row separately (this is slower than the vectorized options above on large frames).
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
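
For comparison, here is a short sketch of the difference between a reduction and an elementwise maximum in NumPy, assuming the sample frame from the question. `Series.max` collapses the column to a single number, while `np.maximum` compares each element with 0 and keeps the shape.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'A': [7, 5, 4, 9],
                   'B': [12, 4, 8, 2],
                   'C': [2, 4, 2, 3]})

s = df['A'] - df['B'] + df['C']

# Reduction: one scalar for the whole column.
overall_max = s.max()

# Elementwise: one value per row, negative results replaced by 0.
df['D'] = np.maximum(s, 0)

print(overall_max)
print(df)
```

This gives the same column D as the clip and np.where answers above.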

Delete rows in dataframe based on info from Series

I would like to delete all rows in the DataFrame whose ID appears 10 times with status = 1.
Example of Dataframe X is
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
...
First I found all rows with status=1 with count()=10
exclude=X[X.Status == 1].groupby('ID')['Status'].value_counts().loc[lambda x: x==10].index
exclude is the index of that Series (a MultiIndex):
MultiIndex([( 371391, 1),
( 383537, 1),
...
Is it possible to delete rows in DataFrame X based on the ID info from the Series?
If your original DataFrame looks something like this:
print(df)
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
5 371391 1
6 371391 1
7 371391 1
8 371391 1
9 371391 1
10 371391 1
11 371391 1
12 371391 1
13 371391 1
And you group IDs and statuses together to find the IDs you want to exclude:
df2 = df.groupby(['ID', 'Status']).size().to_frame('size').reset_index()
print(df2)
ID Status size
0 366804 0 1
1 371391 1 10
2 383537 1 1
3 383538 0 1
4 383539 0 1
excludes = df2.loc[(df2['size'] == 10) & (df2['Status'] == 1), 'ID']
print(excludes)
1 371391
Name: ID, dtype: int64
Then you could use Series.isin and invert the boolean Series ~s:
df = df[~df['ID'].isin(excludes)]
print(df)
ID Status
0 366804 0
2 383537 1
3 383538 0
4 383539 0
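
A variation that skips the intermediate frame is sketched below, assuming the sample data from this answer; `group_size` and `filtered` are illustrative names. It uses `groupby(...).transform('size')` to attach the group size to every row and filters in one step. Note one difference in behavior: this drops only the rows that themselves have Status 1 in a group of size 10, whereas the isin approach drops every row of an excluded ID, whatever its status. On this sample the two agree.

```python
import pandas as pd

# Sample data copied from the answer: ID 371391 appears 10 times with Status 1.
ids = [366804, 371391, 383537, 383538, 383539] + [371391] * 9
status = [0, 1, 1, 0, 0] + [1] * 9
df = pd.DataFrame({'ID': ids, 'Status': status})

# Size of each (ID, Status) group, broadcast back to the original rows.
group_size = df.groupby(['ID', 'Status'])['ID'].transform('size')

# Keep every row that is not part of a Status == 1 group of size 10.
filtered = df[~((group_size == 10) & (df['Status'] == 1))]
print(filtered)
```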

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over elements in the list and find their location in dataframe then store this to a new dataframe
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
Num
0 1
1 2
2 3
3 4
4 5
dataframe : df2
0 1 2 3 4
0 9 12 8 6 7
1 11 1 4 10 13
2 5 14 2 0 3
I've tried something similar to this but doesn't work
for x in my_list:
    i,j= np.array(np.where(df==x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
so looking for a result like this
dataframe : df1 updated
Num X Y
0 1 1 1
1 2 2 2
2 3 2 4
3 4 1 2
4 5 2 0
any hints or ideas would be appreciated
Instead of trying to find each value in df2, why not just make df2 a flat dataframe? Keep the row index as a column before melting, so that X holds the row and Y the column of each value:
df2 = pd.melt(df2.reset_index(), id_vars='index')
df2.columns = ['X', 'Y', 'Num']
so now your df2 just looks like this:
   X  Y  Num
0  0  0    9
1  1  0   11
2  2  0    5
3  0  1   12
4  1  1    1
5  2  1   14
You can of course sort by Num, and if you just want the values from your list you can further filter df2. Since my_list holds strings while Num holds integers, convert the list first:
df2 = df2[df2.Num.isin([int(n) for n in my_list])]
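
Putting the pieces together, the following is a sketch of one way to reach the table the question asks for, assuming the sample data from the question. The names `flat` and `result` are illustrative, and the conversion of Num to integers is an assumption made so the two frames can be matched.

```python
import pandas as pd

# Sample data copied from the question.
my_list = ['1', '2', '3', '4', '5']
df1 = pd.DataFrame(my_list, columns=['Num'])
df2 = pd.DataFrame([[9, 12, 8, 6, 7],
                    [11, 1, 4, 10, 13],
                    [5, 14, 2, 0, 3]])

# Flatten df2 to one row per cell: X = row label, Y = column label, Num = value.
flat = df2.reset_index().melt(id_vars='index', var_name='Y', value_name='Num')
flat = flat.rename(columns={'index': 'X'})

# my_list holds strings, df2 holds integers, so align the types before matching.
df1['Num'] = df1['Num'].astype(int)

# A left merge keeps the order of df1 and looks up each value's position.
result = df1.merge(flat, on='Num', how='left')[['Num', 'X', 'Y']]
print(result)
```

If a value occurred more than once in df2, the merge would return one row per occurrence.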

Pandas, bygroup operation

Using groupby() in pandas I get the following output (A, B, C are the columns of the input table):
C
A B
0 0 6
2 1
6 5
. . .
Output details: [244 rows x 1 columns]. I want all 3 columns as regular columns instead of just one. How can I do that?
Output, which I wish:
A B C
0 0 6
0 2 1
. . .
It appears to be undocumented, but simply: gb.bfill(), see this example:
In [68]:
df=pd.DataFrame({'A':[0,0,0,0,0,0,0,0],
                 'B':[0,0,0,0,1,1,1,1],
                 'C':[1,2,3,4,1,2,3,4],})
In [69]:
gb=df.groupby(['A', 'B'])
In [70]:
print(gb.bfill())
A B C
0 0 0 1
1 0 0 2
2 0 0 3
3 0 0 4
4 0 1 1
5 0 1 2
6 0 1 3
7 0 1 4
[8 rows x 3 columns]
But I don't see why you need to do that, don't you end up with the original DataFrame (only maybe rearranged)?
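
If the goal is to turn the group keys of an aggregated result back into ordinary columns, `reset_index()` is the usual tool. The sketch below assumes made-up input data and a `sum` aggregation, since the question shows neither; both are chosen only so that the result resembles the output in the question.

```python
import pandas as pd

# Hypothetical input; the real table and aggregation are not shown in the question.
df = pd.DataFrame({'A': [0, 0, 0, 0],
                   'B': [0, 0, 2, 6],
                   'C': [2, 4, 1, 5]})

# Aggregating puts the group keys A and B into a MultiIndex, leaving only column C.
grouped = df.groupby(['A', 'B']).sum()

# reset_index() moves the index levels back into regular columns.
flat = grouped.reset_index()
print(flat)
```

Passing `as_index=False` to groupby gives the same flat layout for aggregations like this one without the extra call.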