Pandas pivot table strange results

I am having some trouble grouping with a pandas pivot table. I have a dataset and I am taking two subsets of it.
Here is how I create the subsets and what they look like:
df3= df2.head(170).tail()
df3
cuts delta_2 tag
165 (360, 2000] 426.0 0.0
166 (360, 2000] 426.0 0.0
167 (360, 2000] 426.0 0.0
168 (360, 2000] 426.0 0.0
169 NaN NaN 0.0
df4= (df2.head(171)).tail()
df4
cuts delta_2 tag
166 (360, 2000] 426.0 0.0
167 (360, 2000] 426.0 0.0
168 (360, 2000] 426.0 0.0
169 NaN NaN 0.0
170 (180, 360] 183.0 0.0
Now I am just trying to group them using pivot tables and I get strange results:
df3.pivot_table(values='tag', index='cuts', aggfunc=['sum', 'count'], dropna=True).sort_values('cuts')
sum count
tag tag
cuts
NaN 0.0 0
(360, 2000] 0.0 4
The above seems not to have counted anything for the NaN category. However, the issue becomes much bigger in the following:
df4.pivot_table(values='tag', index='cuts', aggfunc=['sum', 'count'], dropna=True).sort_values('cuts')
sum count
tag tag
cuts
NaN 0.0 3
(180, 360] 0.0 0
(360, 2000] 0.0 1
Here the counting gets really weird, and I am not able to figure out why. The variable cuts was created using the pd.cut function on the variable delta_2. My objective is just to get the mean, but since the mean was showing strange results, I tried calculating sum and count instead.

Use numpy's sum and mean to compute the sum and the mean:
import numpy as np

df3.pivot_table(values='tag', index='cuts', aggfunc=[np.sum, np.mean],
                dropna=True).sort_values('cuts')
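A groupby equivalent, as a sketch (by default groupby drops the NaN group, and because cuts is Categorical from pd.cut, unobserved bins still appear unless observed=True is passed):
df3.groupby('cuts')['tag'].agg(['sum', 'count', 'mean'])
# keep only bins that actually occur in df3:
df3.groupby('cuts', observed=True)['tag'].agg(['sum', 'count', 'mean'])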

Related

Pandas groupby aggregation yields extraneous groups. Bug?

Having some issues with data manipulation in pandas; it seems like it might be a pandas bug. Would love some ideas.
I've got an index-sorted dataframe my_df that looks something like this:
value1 value2
col0 col1 col2 col3 col4
0035ca76-b209-4c4e-9bba-b18459c4dceb 203 positive 173 148 0.0 0.086892
negative 1156 148 0.0 0.090347
1157 148 0.0 0.090347
1158 148 0.0 0.090347
1159 148 0.0 0.084884
1160 148 0.0 0.079942
1161 148 0.0 0.079824
1162 148 0.0 0.071289
positive 173 66 0.0 0.079831
negative 1156 66 0.0 0.082660
1157 66 0.0 0.082660
1158 66 0.0 0.082660
1159 66 0.0 0.084353
1160 66 0.0 0.076934
1161 66 0.0 0.076494
1162 66 0.0 0.070424
00e35aaf-050a-4f09-bf94-df994e4bf681 24 positive 14 38 0.0 0.073936
negative 134 38 0.0 0.075913
135 38 0.0 0.075913
136 38 0.0 0.074403
137 38 0.0 0.081120
138 38 0.0 0.078560
139 38 0.0 0.080680
140 38 0.0 0.073892
positive 14 1 0.0 0.051979
negative 134 1 0.0 0.043818
135 1 0.0 0.043818
136 1 0.0 0.049795
137 1 0.0 0.052171
138 1 0.0 0.048573
139 1 0.0 0.045205
140 1 0.0 0.054696
... more rows for this and other col0 + col1 combos
I'm trying to just compute the sum of "value2" for each unique combination of [col0, col1, col2, col3]. As far as I can tell, the most logical way to do this would be
my_df.groupby(level=list(range(4))).sum()
However, I'm getting really odd results that seem like a pandas bug.
grouped = my_df.groupby(level=list(range(4)))
for name, group in grouped:
    print(group)
    break
sums = grouped.sum()
Indeed the first group is as I'd expect
value1 value2
col0 col1 col2 col3 col4
0035ca76-b209-4c4e-9bba-b18459c4d681 199 positive 174 151 0.0 0.089186
158 0.0 0.104250
and the number of groups in grouped is correct (you'll have to take my word for that; I've verified it other ways), but sums is haywire and has a bajillion extraneous rows:
(Pdb) len(grouped)
334
(Pdb) len(sums)
53760
(Pdb) sums[:30]
value1 value2
col0 col1 col2 col3 col4
1f11aede-6aed-44ef-9296-004b6269662c 17 positive 7 1 0.0 0.0
4 0.0 0.0
5 0.0 0.0
6 0.0 0.0
7 0.0 0.0
8 0.0 0.0
11 0.0 0.0
12 0.0 0.0
24 0.0 0.0
32 0.0 0.0
33 0.0 0.0
38 0.0 0.0
39 0.0 0.0
53 0.0 0.0
56 0.0 0.0
66 0.0 0.0
69 0.0 0.0
70 0.0 0.0
72 0.0 0.0
73 0.0 0.0
75 0.0 0.0
85 0.0 0.0
91 0.0 0.0
94 0.0 0.0
116 0.0 0.0
119 0.0 0.0
The values given in col4 are widely varied throughout the dataframe. It looks like the groupby + aggregation op creates a sum row for every value of col4 in the entire dataframe, as opposed to just values of col4 that actually pertain to each group. In other words, most of these rows don't even have entries in the original dataframe:
(Pdb) my_df.loc[("1f11aede-6aed-44ef-9296-004b6269662c", 17, "positive", 7, 1)]
*** KeyError: ('1f11aede-6aed-44ef-9296-004b6269662c', 17, 'positive', 7, 1)
Any idea what's going on here? This seems totally off-script from what the groupby API and tutorial describe. For example, as far as I know, groupby => agg should create one row per group here.
TLDR: As of pandas 1.2.4, groupby has nonintuitive behavior when one of the index levels is Categorical. Fixed with groupby(..., observed=True).
OK, this one took me a while. It turns out that if one of your index levels is Categorical, then groupby has, in my opinion, totally nonintuitive behavior.
# Does a cartesian product of index values if any of the index levels is Categorical.
my_df.groupby(level=list(range(4)))
# Doesn't do the cartesian product, in line with behavior for every other index type.
my_df.groupby(level=list(range(4)), observed=True)
In my case, col2 was Categorical. That is, at some point in my code, I had:
col2_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=True)
col2 = ["positive", "negative", "negative", "negative", #...]
# Make col2 categorical
my_df['col2'] = my_df.assign(col2=col2)['col2'].astype(col2_type)
A bit of a mea culpa on this one: the behavior is documented (see the observed parameter in the groupby documentation). That said, I'm not the only one who has been thoroughly confused by this. There is an open issue about it, and a PR to change the default value for observed in some future release of pandas (as of this writing, the PR is unmerged). With any luck, this "issue" will be fixed in an upcoming release.
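For illustration, a minimal self-contained sketch (made-up data, not the asker's dataframe) showing the cartesian-product behavior and the observed=True fix:
import pandas as pd

cat_type = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=True)
df = pd.DataFrame({"k1": [1, 1, 2],
                   "k2": pd.Series(["a", "a", "b"], dtype=cat_type),
                   "v": [1.0, 2.0, 3.0]}).set_index(["k1", "k2"])

# Default (observed=False in pandas 1.x/2.x): one row per combination of k1 and
# every category of k2, even unobserved ones -> 2 * 3 = 6 rows
print(len(df.groupby(level=[0, 1]).sum()))

# observed=True: only combinations that actually occur in the index -> 2 rows
print(len(df.groupby(level=[0, 1], observed=True).sum()))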
Try the following statements to get the sum of the value2 column, grouping by the col0, col1, col2 and col3 columns:
my_df.groupby(['col0', 'col1', 'col2', 'col3']).agg('sum')
or
my_df.groupby(['col0', 'col1', 'col2', 'col3'])[['value2']].agg('sum')
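Note that if col2 is Categorical, as described above, this form hits the same cartesian-product behavior unless observed=True is passed; a sketch, assuming col0-col3 are named index levels or columns:
my_df.groupby(['col0', 'col1', 'col2', 'col3'], observed=True)[['value2']].sum()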

Matplotlib eventplot - raster plot from binary values

I have created a dataframe where each column is an equal-length series of 1.0s and 0.0s. There is nothing else in the dataframe. I want to create a raster-style plot from this data where each column would be a horizontal line stacked up along the y-axis and each tick on the x-axis would correspond to a row index value.
However, when I try to do this, I get an "axis -1 is out of bounds for array of dimension 0" error. None of the other entries for this or very similar errors seem to relate to eventplot. I thought the type of data I had would be perfect for eventplot (a discrete black dash wherever there's a 1.0, otherwise nothing), but maybe I'm very wrong.
Here's a toy example of the kind of dataframe I'm trying to pass plus the function as I'm calling it:
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
plt.eventplot(df, colors='black', lineoffsets=1,
              linelengths=1, orientation='vertical')
Any help appreciated, thank you.
Edit: If I convert my df into an np.array and pass that instead, I no longer get that particular error, but I don't at all get the result I'm looking for. I do get the correct values on the x-axis (in my real data, this is 0-22), but I don't get each column of data represented as a separate line, and I'm having no luck advancing in that direction.
When using eventplot, the array passed to positions needs to contain the row numbers of the ones in each column. Here is an example with your toy data:
import io
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import data into dataframe
data = """
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create series of indexes containing positions for raster plot
positions = df.apply(lambda x: df.index[x == 1])
# Create raster plot with inverted y-axis to display columns in ascending order
plt.eventplot(positions, lineoffsets=df.index, linelengths=0.75, colors='black')
plt.yticks(range(positions.index.size), positions.index)
plt.gca().invert_yaxis()
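If you prefer to avoid apply, a plain list comprehension gives the same row positions (a minimal sketch with the same toy data):
import numpy as np

# Row positions of the ones in each column, one array per column
positions = [np.flatnonzero(df[col].to_numpy()) for col in df.columns]
plt.eventplot(positions, lineoffsets=range(len(positions)), linelengths=0.75, colors='black')
plt.yticks(range(len(positions)), df.columns)
plt.gca().invert_yaxis()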

Fill missing values in DataFrame

I have a dataframe in which each row is either missing two values in two columns, or one value in one column.
Date 30 45 60 90
0 2004-01-02 0.88 0.0 0.0 0.93
1 2004-01-05 0.88 0.0 0.0 0.91
...
20 2019-12-24 1.55 0 1.58 1.58
21 2019-12-26 1.59 0 1.60 1.58
I would like to fill in all the zero values in the dataframe by some simple linear method. Here is the thing: if there is a value in the 60 column, use the average of the 60 and the 30 for the 45. Otherwise, use some simple method to compute both the 45 and the 60.
What is the pandas way to do this? [Prefer no loops]
EDIT 1
As per the suggestions in the comment, I tried
df.replace(0, np.nan, inplace=True)
df=df.interpolate(method='linear', limit_direction='forward', axis=0)
But the df still contains all the np.nan values.
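One possible reason (an assumption, since the full data is not shown) is that the interpolation above runs down each column (axis=0), while the missing 45 and 60 values sit between the 30 and 90 values of the same row. A minimal sketch interpolating across the columns instead, using made-up rows shaped like the ones shown:
import io
import numpy as np
import pandas as pd

data = """Date 30 45 60 90
2004-01-02 0.88 0.0 0.0 0.93
2004-01-05 0.88 0.0 0.0 0.91
2019-12-24 1.55 0 1.58 1.58
2019-12-26 1.59 0 1.60 1.58"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)

rates = df.drop(columns='Date').replace(0, np.nan)
# Fill each row from its horizontal neighbours: a missing 45 becomes the
# average of 30 and 60; rows missing both 45 and 60 are filled linearly
# between 30 and 90.
df[rates.columns] = rates.interpolate(method='linear', axis=1)
print(df)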

How to calculate statistic values over 2D DataFrame bin wise for column ranges defined via IntervalIndex?

I have a 2D DataFrame like the following:
0.0 0.1 0.2 0.3 0.4 ...
0 0 1 NaN 3 4
1 NaN NaN NaN NaN 9
...
For every row I'd like to calculate the arithmetic mean and the arithmetic standard deviation for specific, equal-width column ranges (bins), which shall be defined via an IntervalIndex. NaN shall be ignored. E.g. with pd.IntervalIndex.from_tuples([(0.0, 0.2), (0.2, 0.4)]) I'd expect something like
(0.0, 0.2) (0.2, 0.4)
mean 0. 3.5
std ...
The intervals shall support different widths. As the DataFrame has many rows and many columns, memory and execution performance are critical. How can I get my expected output as performantly as possible?
You can do a cut, and groupby:
df.columns = df.columns.astype(float)
cuts = pd.cut(df.columns, bins=[0, 0.2, 0.4], include_lowest=True)
df.groupby(cuts, axis=1).mean()
Output:
(-0.001, 0.2] (0.2, 0.4]
0 0.5 3.5
1 NaN 9.0
Note: you can also pass pd.IntervalIndex.from_tuples([(0.0, 0.2), (0.2, 0.4)]) to bins in pd.cut, if you already have them defined somewhere. But you need to be careful about 0, which is not included in the intervals above.
Note 2: it appears that groupby().agg does not support std on axis=1. You can transpose the dataframe instead:
df.T.groupby(cuts).agg(['mean','std']).T
Output:
(-0.001, 0.2] (0.2, 0.4]
0 mean 0.500000 3.500000
std 0.707107 0.707107
1 mean NaN 9.000000
std NaN NaN
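Following up on the first note, a sketch that passes the IntervalIndex straight to pd.cut; the first bound is widened slightly so that column 0.0 is included:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, np.nan, 3, 4], [np.nan, np.nan, np.nan, np.nan, 9]],
                  columns=[0.0, 0.1, 0.2, 0.3, 0.4])
bins = pd.IntervalIndex.from_tuples([(-0.001, 0.2), (0.2, 0.4)])
cuts = pd.cut(df.columns, bins=bins)
df.T.groupby(cuts).agg(['mean', 'std']).T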

How do I sum each column based on a condition on another column, without iterating over the columns, in a pandas dataframe

I have a data frame as below:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 1.0 85.0 66.0 29.0 0.0 0.0
1 8.0 183.0 64.0 0.0 0.0 0.0
2 1.0 89.0 66.0 23.0 94.0 1.0
3 0.0 137.0 40.0 35.0 168.0 1.0
4 5.0 116.0 74.0 0.0 0.0 1.0
I would like a pythonic way to sum each column separately based on a condition on one of the columns. I could do it by iterating over the df columns, but I'm sure there is a better way that I'm not familiar with.
Specifically, for the data I have, I'd like to sum each column's values over the rows where the last column, 'Outcome', is equal to 1. In the end, I should get the following:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Any ideas?
Here is a solution to get the expected output:
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0
Output:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Documentation:
loc: access a group of rows / columns by labels or a boolean array.
sum: sums each column by default and returns a Series indexed by the column names.
to_frame: converts a Series to a DataFrame.
.T: accessor for the transpose; transposes the DataFrame.
Use np.where:
df1[np.where(df1['Outcome'] == 1, True, False)].sum().to_frame().T
Output:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 3.0
Will these work for you?
df1.loc[~(df1['Outcome'] == 0)].groupby('Outcome').agg('sum').reset_index()
or
df1.loc[df1.Outcome == 1.0].sum().to_frame().T