Pandas - possible to aggregate two columns using two different aggregations?

I'm loading a csv file, which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns: date, textA and textB - but want to apply "sum" to numberA, but "min" to numberB.
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions to two different columns?
I.e. sum(numberA) and min(numberB).

The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})
grouped = df.groupby('A')
print(grouped.agg({
    'number A': 'sum',
    'number B': 'min'}))
yields
     number B  number A
A
bar         2         9
foo         0        19
This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
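For instance, a minimal sketch of such a check (assuming the DataFrame is named df); printing the repr of each name exposes invisible characters:
# Print the repr of every column name; a stray u'\xa0' (NO-BREAK SPACE)
# or other hidden character will show up clearly in the output.
for col in df.columns:
    print(repr(col))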

Related

How to concatenate values from multiple rows using Pandas?

In the screenshot, the 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate. I would like to concatenate the rest of the column values for OTC-07, i.e. after consolidation OTC-07 should have Type A,B and Assertion a,b,c,d. Can anyone help me with this? :o
First define a dataframe of given structure:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
    'Type': ['A', 'A', 'A', 'B', np.nan],
    'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
     Ctrl Type Assertion
0  OTC-05    A     a,b,c
1  OTC-06    A       c,b
2  OTC-07    A       a,c
3  OTC-07    B     b,c,d
4  OTC-08  NaN     a,b,c
Then replace NaN values with empty strings:
df = df.replace(np.nan, '', regex=True)
Then group by column 'Ctrl' and aggregate columns 'Type' and 'Assertion'. Please note that the 'Assertion' aggregation is a bit tricky: you need not a simple concatenation but a sorted list of the unique letters:
df.groupby(['Ctrl']).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(sorted(set(','.join(x).split(','))))
})
Output:
       Type Assertion
Ctrl
OTC-05    A     a,b,c
OTC-06    A       b,c
OTC-07  A,B   a,b,c,d
OTC-08          a,b,c
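To see what the 'Assertion' lambda does step by step, here is a hypothetical walk-through on the two OTC-07 rows:
import pandas as pd

x = pd.Series(['a,c', 'b,c,d'])  # the 'Assertion' values of one group
joined = ','.join(x)             # 'a,c,b,c,d'
unique = set(joined.split(','))  # {'a', 'b', 'c', 'd'}
print(','.join(sorted(unique)))  # 'a,b,c,d'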

MultiIndex: advanced indexing - how to select different parts of DataFrame?

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df2 = pd.DataFrame(np.random.randn(8, 4), index=arrays)
The matrix I have is df2. Now I want to select all the rows of 'foo' ('one' & 'two'), but only the 'one' row of 'bar'. This seems very easy but I have tried multiple things without success.
df2.loc['bar':('foo','one')]
produces a similar matrix but includes the rows of 'baz' that I don't want.
df2.loc[idx['foo','bar'], idx['one','two'], :]
(with idx = pd.IndexSlice) is also similar, but includes the row ('bar', 'two') that I don't want.
Would be great if anybody could help and has some tips for handling the multiIndex!
In a single line, the simplest way IMO is to build an expression with query, as described here:
df2.query("ilevel_0 == 'foo' or (ilevel_0 == 'bar' and ilevel_1 == 'one')")
                0         1         2         3
bar one  0.249768  0.619312  1.851270 -0.593451
foo one  0.770139 -2.205407  0.359475 -0.754134
    two -1.109005 -0.802934  0.874133  0.135057
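As an aside, ilevel_0 and ilevel_1 refer to the otherwise unnamed index levels; if you give the levels names, the query can use those names instead. A hypothetical variant:
# Name the index levels, then refer to them directly in query().
df2.index.names = ['first', 'second']
df2.query("first == 'foo' or (first == 'bar' and second == 'one')")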
Otherwise, using more conventional means, you may consider
pd.concat([df2.loc[['foo']], df2.loc[[('bar', 'one')]]])
                0         1         2         3
foo one  0.770139 -2.205407  0.359475 -0.754134
    two -1.109005 -0.802934  0.874133  0.135057
bar one  0.249768  0.619312  1.851270 -0.593451
Which has two parts:
df2.loc[['foo']]
                0         1         2         3
foo one  0.770139 -2.205407  0.359475 -0.754134
    two -1.109005 -0.802934  0.874133  0.135057
and,
df2.loc[[('bar', 'one')]]
                0         1        2         3
bar one  0.249768  0.619312  1.85127 -0.593451
The brackets around each indexer prevent the level from being dropped during the slicing operation.
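A minimal sketch of that difference (assuming the df2 from above):
# A scalar label drops the outer level...
print(df2.loc['bar'].index)    # Index(['one', 'two'], dtype='object')
# ...while a list of labels keeps the MultiIndex intact.
print(df2.loc[['bar']].index)  # MultiIndex: ('bar', 'one'), ('bar', 'two')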

Cumulative standard deviation of groups

How can one calculate cumulative standard deviation of groups with varying lengths?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df.groupby('A')['B'].nunique() gives bar: 2, foo: 3
...but...
df.groupby('A')['C', 'D'].rolling(df.groupby('A')['B'].nunique(), min_periods=2).std()
...gives...
ValueError: window must be an integer
I think you could use expanding (new since Pandas 0.18) to get a rolling window that expands with the size of the group, first adding B as index and sorting:
df.set_index('B').sort_index().groupby('A')[['C', 'D']].expanding(2).std()
                  C         D
A   B
bar one         NaN       NaN
    two    0.174318  0.039794
foo one         NaN       NaN
    one    1.395085  1.364566
    three  1.010592  1.029694
    three  0.986744  0.957615
    two    0.854773  0.876763
    two    1.048024  0.807519
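For reference, a minimal sketch contrasting rolling (fixed-size window) with expanding (growing window), using a hypothetical Series s:
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])
# rolling(2) only ever looks at the last two observations...
print(s.rolling(2).std())
# ...while expanding(2) uses everything seen so far, which is
# what a cumulative standard deviation needs.
print(s.expanding(2).std())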

std() groupby Pandas issue

Could this be a bug? When I use describe() or std() on a groupby object, I get different answers:
import pandas as pd
import numpy as np
import random as rnd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df.head()
df[['C','D']].groupby(['C'],as_index=False).describe()
# this line gives the standard deviation of 'C' as 0,0. Within each group the value of C is constant, so that makes sense.
df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives the standard deviation of 'C' as 0,1. I think this is wrong
It makes sense. In the second case, you only compute the std of column D.
How? That's just how the groupby works. You
slice on C and D
groupby on C
call GroupBy.std
At step 3, you did not specify any column, so std was assumed to be computed on the column that was not the grouper... aka, column D.
As for why you see C with 0, 1... that's because you specified as_index=False, so the C column is inserted with values coming from the original DataFrame... which in this case are 0 and 1.
Run this and it'll become clear.
df[['C','D']].groupby(['C']).std()
          D
C
0  0.998201
1       NaN
When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,
df[['C','D']].groupby(['C'])[['C', 'D']].std()
     C         D
C
0  0.0  0.998201
1  NaN       NaN
Which is exactly what describe gives, and what you're looking for.
My friend mukherjees and I have done some more trials with this one and decided that there is really an issue with std(). You can see in the following link how we show that "std() is not the same as .apply(np.std, ddof=1)". After noticing that, we also found the following related bug report:
https://github.com/pandas-dev/pandas/issues/10355
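A minimal sketch of that comparison (assuming the df defined above):
g = df[['C', 'D']].groupby('C')['D']
print(g.std())                  # GroupBy.std, ddof=1 by default
print(g.apply(np.std, ddof=1))  # numpy std with the same ddof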
Even with std(), you will get a zero standard deviation of C within each group. I just added a seed to your code to make it replicable. I am not sure what the issue is:
import pandas as pd
import numpy as np

np.random.seed(1987)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df
df[['C','D']].groupby(['C'],as_index=False).describe()
df[['C','D']].groupby(['C'],as_index=False).std()
To dig deeper, if you look at the source code of describe for groupby, which inherits from DataFrame.describe:
def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)
The code above shows that describe simply reports the result of std().
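So, as a quick sanity check (hypothetical, assuming the seeded df above), the 'std' entry of describe() should match GroupBy.std() for the non-grouper column:
g = df[['C', 'D']].groupby('C')
print(g['D'].describe()['std'])  # std column taken from describe
print(g['D'].std())              # should print the same values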

How do I use pandas to add a calculated column in a pivot table?

I'm using pandas 0.16.0 & numpy 1.9.2
I did the following to add a calculated field (column) in the pivot table
Set up the dataframe as follows,
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24),
                   'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)] +
                        [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
Pivoted the data frame as follows,
df1 = df.pivot_table(values=['D'],index=['A'],columns=['C'],aggfunc=np.sum,margins=False)
Tried adding a calculated field as follows, but I get an error (see below),
df1['D2'] = df1['D'] * 2
Error,
ValueError: Wrong number of items passed 2, placement implies 1
This is because you have a hierarchical index (i.e. a MultiIndex) as columns in your pivot-table dataframe.
If you print out the result of df1['D'] * 2 you will notice that you get two columns:
C         bar     foo
A
one    -3.163 -10.478
three  -2.988   1.418
two    -2.218   3.405
So to put it back to df1 you need to provide two columns to assign it to:
df1[[('D2','bar'), ('D2','foo')]] = df1['D'] * 2
Which yields:
           D              D2
C        bar    foo     bar     foo
A
one   -1.581 -5.239  -3.163 -10.478
three -1.494  0.709  -2.988   1.418
two   -1.109  1.703  -2.218   3.405
A more generalized approach:
new_cols = pd.MultiIndex.from_product((['D2'], df1.D.columns))
df1[new_cols] = df1.D * 2
You can find more info on how to deal with MultiIndex in the docs
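To illustrate what from_product builds here, a hypothetical check using the column labels from this example:
import pandas as pd

# ['D2'] must be wrapped in a list: a bare string would be iterated
# character by character and yield ('D', ...) and ('2', ...) keys.
new_cols = pd.MultiIndex.from_product((['D2'], ['bar', 'foo']))
print(list(new_cols))  # [('D2', 'bar'), ('D2', 'foo')]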