How can one calculate cumulative standard deviation of groups with varying lengths?
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df.groupby('A')['B'].nunique() gives bar: 2, foo: 3
...but...
df.groupby('A')['C', 'D'].rolling(df.groupby('A')['B'].nunique(), min_periods=2).std()
...gives...
ValueError: window must be an integer
I think you could use expanding (new in pandas 0.18) to get a rolling window that grows with the size of the group, first setting B as the index and sorting:
df.set_index('B').sort_index().groupby('A')[['C', 'D']].expanding(2).std()
C D
A B
bar one NaN NaN
two 0.174318 0.039794
foo one NaN NaN
one 1.395085 1.364566
three 1.010592 1.029694
three 0.986744 0.957615
two 0.854773 0.876763
two 1.048024 0.807519
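If you would rather keep the result aligned with the original row order instead of re-indexing by B, a minimal sketch (assuming a recent pandas, where column selection on a groupby uses a list, [['C', 'D']]):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

# expanding std per group of A, keyed by the original row index rather than by B
out = df.groupby('A')[['C', 'D']].expanding(2).std()
out = out.reset_index(level='A', drop=True).sort_index()  # aligned with df's rows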
I'd like to filter my dataset by picking rows that are between two values (dynamically defined as quantiles) for each group. Concretely, I have a dataset like
import pandas as pd
df = pd.DataFrame({'day': ['one', 'one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two'],
'weather': ['rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'rain', 'rain', 'sun', 'rain'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
I'd like to select the rows where the values are between the 0.1 and 0.9 quantiles for each day and each weather. I can calculate the quantiles via
df.groupby(['day', 'weather']).quantile([0.1, .9])
But then I feel stuck. Joining the resulting dataset with the original one is wasteful (the original dataset can be quite big), and I am wondering if there is something along the lines of
df.groupby(['day', 'weather']).select('value', between=[0.1, 0.9])
Transform value with quantile
g = df.groupby(['day', 'weather'])['value']
df[df['value'].between(g.transform('quantile', 0.1), g.transform('quantile', 0.9))]
day weather value
1 one rain 2
4 one sun 5
8 two rain 9
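If the string-dispatch form transform('quantile', 0.1) looks opaque, an equivalent sketch with explicit lambdas (slower, but arguably clearer), reusing g from above:

lo = g.transform(lambda s: s.quantile(0.1))  # per-group 0.1 quantile, broadcast to each row
hi = g.transform(lambda s: s.quantile(0.9))  # per-group 0.9 quantile, broadcast to each row
df[df['value'].between(lo, hi)]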
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df2 = pd.DataFrame(np.random.randn(8, 4), index=arrays)
The DataFrame I have is df2. Now I want to select all the rows of 'foo' ('one' and 'two'), but only the 'one' row of 'bar'. This seems very easy, but I have tried multiple things without success.
df2.loc['bar':('foo','one')]
This produces a similar result, but it includes the rows of 'baz' that I don't want.
df2.loc[idx['foo','bar'],idx['one','two'], :]
This is also similar, but it includes the ('bar', 'two') row, which I don't want.
It would be great if anybody could help or has some tips for handling the MultiIndex!
In a single line, the simplest way IMO is to build an expression with query (ilevel_0 and ilevel_1 refer to the first and second levels of the unnamed index):
df2.query("ilevel_0 == 'foo' or (ilevel_0 == 'bar' and ilevel_1 == 'one')")
0 1 2 3
bar one 0.249768 0.619312 1.851270 -0.593451
foo one 0.770139 -2.205407 0.359475 -0.754134
two -1.109005 -0.802934 0.874133 0.135057
Otherwise, using more conventional means, you may consider
pd.concat([df2.loc[['foo']], df2.loc[[('bar', 'one')]]])
0 1 2 3
foo one 0.770139 -2.205407 0.359475 -0.754134
two -1.109005 -0.802934 0.874133 0.135057
bar one 0.249768 0.619312 1.851270 -0.593451
This has two parts:
df2.loc[['foo']]
0 1 2 3
foo one 0.770139 -2.205407 0.359475 -0.754134
two -1.109005 -0.802934 0.874133 0.135057
and,
df2.loc[[('bar', 'one')]]
0 1 2 3
bar one 0.249768 0.619312 1.85127 -0.593451
The extra brackets around each key (passing a list to .loc) prevent the index level from being dropped during the selection.
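Another option, not from the original answer but a sketch that some may find more readable, is to build a boolean mask from the index levels and select in one go:

lvl0 = df2.index.get_level_values(0)
lvl1 = df2.index.get_level_values(1)
mask = (lvl0 == 'foo') | ((lvl0 == 'bar') & (lvl1 == 'one'))
df2[mask]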
I want to make a pivot from a DataFrame with multiple duplicates in 'index' and 'column', where the values are always equal whenever the 'index'/'column' pair is duplicated.
df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
... "bar": ['A', 'A', 'B', 'C'],
... "baz": [1, 1, 3, 4]})
But when I try
df.pivot(index='foo', columns='bar', values='baz')
I get:
ValueError: Index contains duplicate entries, cannot reshape
Try this:
df1 = df[~df.duplicated()].pivot(index='foo', columns='bar', values='baz')
print(df1)
bar A B C
foo
one 1.0 NaN NaN
two NaN 3.0 4.0
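Since the values are guaranteed to be equal whenever the ('foo', 'bar') pair repeats, pivot_table with a trivial aggregation should give the same result; a sketch:

# collapses the duplicates by taking the first (identical) value per cell
df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='first')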
Could this be a bug? When I use describe() or std() on a groupby object, I get different answers.
import pandas as pd
import numpy as np
import random as rnd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df.head()
df[['C','D']].groupby(['C'],as_index=False).describe()
# This line gives me the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.
df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives me the standard deviation of 'C' as 0, 1. I think this is wrong.
It makes sense. In the second case, you only compute the std of column D.
How? That's just how the groupby works. You:
1. slice on C and D
2. groupby on C
3. call GroupBy.std
At step 3, you did not specify any column, so std was assumed to be computed on the column that was not the grouper... aka, column D.
As for why you see C with 0, 1... that's because you specified as_index=False, so the C column is inserted with values coming from the original DataFrame... which in this case are 0 and 1.
Run this and it'll become clear.
df[['C','D']].groupby(['C']).std()
D
C
0 0.998201
1 NaN
When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,
df[['C','D']].groupby(['C'])[['C', 'D']].std()
C D
C
0 0.0 0.998201
1 NaN NaN
This is exactly what describe gives, and what you're looking for.
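If you want to see the correspondence directly, here is a small sketch (assuming the df defined in the question) that pulls only the 'std' rows out of describe:

desc = df[['C', 'D']].groupby(['C'])[['C', 'D']].describe()
desc.xs('std', axis=1, level=1)  # same numbers as the .std() call above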
My friend mukherjees and I have done some more trials with this one and decided that there really is an issue with std(). You can see in the following link how we show that "std() is not the same as .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report:
https://github.com/pandas-dev/pandas/issues/10355
Even with std(), you will get a zero standard deviation for C within each group. I just added a seed to your code to make it reproducible. I am not sure what the issue is:
import pandas as pd
import numpy as np
import random as rnd
np.random.seed(1987)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df
df[['C','D']].groupby(['C'],as_index=False).describe()
df[['C','D']].groupby(['C'],as_index=False).std()
To dig deeper, if you look at the source code of describe for groupby, which builds on DataFrame.describe:
def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)
The code above shows that, for its 'std' entry, describe simply reports the result of series.std().
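As a side note on the std-vs-np.std comparison above, the usual source of confusion is the ddof default: pandas' Series.std uses ddof=1 (sample standard deviation), while np.std uses ddof=0. A tiny sketch:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
print(s.std())             # 1.0      (ddof=1, sample std)
print(np.std(s))           # ~0.8165  (ddof=0, population std)
print(np.std(s, ddof=1))   # 1.0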
I'm loading a csv file, which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns date, textA and textB, but apply "sum" to numberA and "min" to numberB.
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions to two different columns?
I.e. sum(numberA) and min(numberB).
The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})

grouped = df.groupby('A')
print(grouped.agg({
    'number A': 'sum',
    'number B': 'min'}))
yields
number B number A
A
bar 2 9
foo 0 19
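For what it's worth, in newer pandas (0.25+) the same thing can be written with named aggregation, which also lets you pick the output column names (the names below are just examples); a sketch reusing grouped from above:

grouped.agg(number_A_sum=('number A', 'sum'),
            number_B_min=('number B', 'min'))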
The example also shows that pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
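For instance, a quick sketch that makes hidden characters visible:

# repr() exposes non-printing characters such as '\xa0'
for name in df.columns:
    print(repr(name))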