how do I sum each column based on condition of another column without iterating over the columns in pandas datframe - pandas

I have a data frame as below:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 1.0 85.0 66.0 29.0 0.0 0.0
1 8.0 183.0 64.0 0.0 0.0 0.0
2 1.0 89.0 66.0 23.0 94.0 1.0
3 0.0 137.0 40.0 35.0 168.0 1.0
4 5.0 116.0 74.0 0.0 0.0 1.0
I would like a pythonic way to sum each column in separate based on a condition of one of the columns. I could do it with iterating over the df columns, but I'm sure there is a better way I'm not familiar with.
In specific to the data I have, I'd like to sum each column values if at the last column 'Outcome' is equal to 1. In the end, I should get as below:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Any ideas?

Here is a solution to get the expected output:
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0
Output:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Documentation:
loc: access a group of rows / columns by labels or boolean array
sum: sum by default over all columns and return a Series indexed by the columns.
to_frame: convert a Series to a DataFrame.
.T: accessor the transpose function, transpose the DataFrame.

use np.where
df1[np.where(df1['Outcome'] == 1,True,False)].sum().to_frame().T
Output
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 3.0

Will these work for you?
df1.loc[~(df1['Outcome'] == 0)].groupby('Outcome').agg('sum').reset_index()
or
df1.loc[df1.Outcome == 1.0].sum().to_frame().T

Related

Matplotlib eventplot - raster plot from binary values

I have created a dataframe where each column is an equal-length series of 1.0s and 0.0s. There is nothing else in the dataframe. I want to create a raster-style plot from this data where each column would be a horizontal line stacked up along the y-axis and each tick on the x-axis would correspond to a row index value.
However, when I try to do this, I get an "axis -1 is out of bounds for array of dimension 0" error. None of the other entries for this or very similar errors seem to relate to eventplot. I thought the type of data I had would be perfect for eventplot (a discrete black dash wherever there's a 1.0, otherwise nothing), but maybe I'm very wrong.
Here's a toy example of the kind of dataframe I'm trying to pass plus the function as I'm calling it:
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
plt.eventplot(df, colors='black', lineoffsets=1,
linelengths=1, orientation='vertical')
Any help appreciated, thank you.
Edit: If I convert my df into an np.array and pass that instead, I no longer get that particular error, but I don't at all get the result I'm looking for. I do get the correct values on the x-axis (in my real data, this is 0-22), but I don't get each column of data represented as a separate line, and I'm having no luck advancing in that direction.
When using eventplot, the array passed to positions needs to contain the row numbers of the ones in each column. Here is an example with your toy data:
import io
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import data into dataframe
data = """
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create series of indexes containing positions for raster plot
positions = df.apply(lambda x: df.index[x == 1])
# Create raster plot with inverted y-axis to display columns in ascending order
plt.eventplot(positions, lineoffsets=df.index, linelengths=0.75, colors='black')
plt.yticks(range(positions.index.size), positions.index)
plt.gca().invert_yaxis()

pandas - how to select rows based on a conjunction of a non indexed column?

Consider the following DataFrame -
In [47]: dati
Out[47]:
x y
frame face lmark
1 NaN NaN NaN NaN
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
2201 0.0 1.0 634.0 395.0
2.0 629.0 439.0
3.0 630.0 486.0
How can we select the rows where dati['x'] > 629.5 for all rows sharing the same value in the 'frame' column. For this example, I would expect to result to be
x y
frame face lmark
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
because column 'x' of 'frame' 2201, 'lmark' 2.0 is not greater than 629.5
Use GroupBy.transform with GroupBy.all for test if all Trues per groups and filter in boolean indexing:
df = dat[(dat['x'] > 629.5).groupby(level=0).transform('all')]
print (df)
x y
frame face lmark
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0

Python: group by with sum special columns and keep the initial rows too

I have a df:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 0.0 3.0 0.0 0.0
I would like to change the values for row #6: Passat value in Car column by add the values from row#2 & row#3 & row#4 (Golf, Tiguan, Touareg) in the Car column) and also keep the values of row#2 & row#3 & row#4 as initial.
Because Passat includes Golf, Touareg, Tiguan and due to it I need to add the values of Golf, Touareg, Tiguanrows to Passat row.
I tried to do it the following code:
car_list = ['Golf', 'Tiguan', 'Touareg']
for car in car_list:
df['Car'][df['Car']==car]='Passat'
and after I used groupby by Car and sum() function:
df1 = df.groupby(['Car'])['Jan17', 'Jun18', 'Dec18', 'Apr19'].sum().reset_index()
In result, df1 doesn't have initial (Golf, Tiguan, Touareg) rows. So, this way is wrong.
Expected result is df1:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 1.0 4.7 9.0 11.4
I'd appreciate for any idea. Thanks)
First we use .isin to get the correct Cars, then we use .filter to get the correct value columns, finally we sum the values and put them in our variable sums.
Then we select the Passat row and add the values to that row:
sums = df[df['Car'].isin(car_list)].filter(regex='\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
Output
ID Car Jan17 Jun18 Dec18 Apr19
0 0 Nissan 0.0 1.7 3.7 0.0
1 1 Porsche 10.0 0.0 2.8 3.5
2 2 Golf 0.0 1.7 3.0 2.0
3 3 Tiguan 1.0 0.0 3.0 5.2
4 4 Touareg 0.0 0.0 3.0 4.2
5 5 Mercedes 0.0 0.0 0.0 7.2
6 6 Passat 1.0 4.7 9.0 11.4
Solution is in view of function:
car_list = ['Golf', 'Tiguan', 'Touareg', 'Passat']
def updateCarInfoBySum(df, car_list, name, id):
req = df[df['Car'].isin(car_list)]
req.set_index(['Car', 'ID], inplace=True)
req.loc[('new_value', '000'), :] = req.sum(axis=0)
req.reset_index(inplace=True)
req = req[req.Car != name]
req['Car'][req['Car'] == 'new_value'] = name
req['ID'][req['ID'] == '000'] = id
req.set_index(['Car', 'ID], inplace=True)
df_final = df.copy()
df_final.set_index(['Car', 'ID], inplace=True)
df_final.update(req)
return df_final

Sum of NaNs to equal NaN (not zero)

I can add a TOTAL column to this DF using df['TOTAL'] = df.sum(axis=1), and it adds the row elements like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN 0.0
However, I would like the total of the bottom row to be NaN, not zero, like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN Nan
Is there a way I can achieve this in a performant way?
Add parameter min_count=1 to DataFrame.sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
df['TOTAL'] = df.sum(axis=1, min_count=1)
print (df)
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN

pandas pivot table strange results

I am have some trouble with grouping by pandas pivot table. I have a dataset and I am taking two subsets of it.
Here is how I create the subsets and how they look like
df3= df2.head(170).tail()
df3
cuts delta_2 tag
165 (360, 2000] 426.0 0.0
166 (360, 2000] 426.0 0.0
167 (360, 2000] 426.0 0.0
168 (360, 2000] 426.0 0.0
169 NaN NaN 0.0
df4= (df2.head(171)).tail()
df4
cuts delta_2 tag
166 (360, 2000] 426.0 0.0
167 (360, 2000] 426.0 0.0
168 (360, 2000] 426.0 0.0
169 NaN NaN 0.0
170 (180, 360] 183.0 0.0
Now I am just trying to group them using pivot tables and I get strange results:
df3.pivot_table(values = 'tag', index= 'cuts', aggfunc=['sum', 'count'],dropna=True).sort_values('cuts')
sum count
tag tag
cuts
NaN 0.0 0
(360, 2000] 0.0 4
The above seems to have not counted anything for NaN category. However issue becomes much bigger in following
df4.pivot_table(values = 'tag', index= 'cuts', aggfunc=['sum', 'count'],dropna=True).sort_values('cuts')
sum count
tag tag
cuts
NaN 0.0 3
(180, 360] 0.0 0
(360, 2000] 0.0 1
Here counting gets really weird. I am not able to figure out why. Variable Cuts was created using pd.cut function on the variable delta_2. My objective is just to get mean but since mean was showing strange results, I tried to calculate sum and count.
df3.pivot_table(values = 'tag', index= 'cuts', aggfunc=[np.sum,
np.mean],dropna=True).sort_values('cuts')
use numpy sum and numpy mean to compute the sum and mean.