Could this be a bug? When I use describe() or std() on a groupby object, I get different answers.
import pandas as pd
import numpy as np
import random as rnd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df.head()
df[['C','D']].groupby(['C'], as_index=False).describe()
# This gives the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.
df[['C','D']].groupby(['C'], as_index=False).std()
# This gives the standard deviation of 'C' as 0, 1. I think this is wrong.
It makes sense. In the second case, you only compute the std of column D.
How? That's just how groupby works. You:

1. slice on C and D
2. groupby on C
3. call GroupBy.std

At step 3, you did not specify any column, so std was computed on the column that was not the grouper... aka, column D.
As for why you see C with 0, 1... that's because you specified as_index=False, so the C column is inserted with values coming from the original DataFrame... which in this case are 0 and 1.
Run this and it'll become clear.
df[['C','D']].groupby(['C']).std()

          D
C
0  0.998201
1       NaN
When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,
df[['C','D']].groupby(['C'])[['C', 'D']].std()

     C         D
C
0  0.0  0.998201
1  NaN       NaN
Which is exactly what describe gives, and what you're looking for.
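For completeness, with as_index=False the grouper comes back as an ordinary column (output reconstructed from the values above):

df[['C','D']].groupby(['C'], as_index=False).std()

   C         D
0  0  0.998201
1  1       NaN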
My friend mukherjees and I ran a few more trials with this one and concluded that there really is an issue with std(). You can see in the following link how we show that "std() is not the same as .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report:
https://github.com/pandas-dev/pandas/issues/10355
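For reference, here is a minimal version of that comparison on illustrative data; whether the two lines agree may depend on your pandas version, which is exactly what the linked issue is about:

import numpy as np
import pandas as pd

df = pd.DataFrame({'C': [0, 0, 0, 1, 1],
                   'D': [0.2, 1.4, 2.5, 3.1, 4.8]})
g = df.groupby('C')['D']

print(g.std())                  # pandas' sample std (ddof=1 by default)
print(g.apply(np.std, ddof=1))  # NumPy sample std, for comparison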
Even with std(), you get a zero standard deviation for C within each group. I have just added a seed to your code to make it reproducible. I am not sure what the issue is:
import pandas as pd
import numpy as np
import random as rnd

np.random.seed(1987)  # note: seed must be called, not assigned with np.random.seed = 1987
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df
df[['C','D']].groupby(['C'],as_index=False).describe()
df[['C','D']].groupby(['C'],as_index=False).std()
To dig deeper, look at the source code of describe for groupby, which it inherits from DataFrame.describe:
def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)
The code above shows that describe simply reports the result of std(), so the two should agree.
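A quick check of that claim on a plain Series (illustrative, any data will do):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(8))
# describe() fills its 'std' entry with the same Series.std() call
assert s.describe()['std'] == s.std()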
I am going to try to express this problem in the most general way possible. Suppose I have a pandas dataframe with multiple columns ['A', 'B', 'C', 'D'].
For each unique value in 'A', I need to get the following ratio: the number of times 'B' == x, divided by the number of times 'B' == y, when 'C' == q OR p...
I'm sorry, but I don't know how to express this pythonically.
Sample data:
df = pd.DataFrame({'A': ['foo', 'zar', 'zar', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'tar', 'foo', 'foo'],
                   'B': ['one', 'two', 'four', 'three', 'one', 'two',
                         'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(11),
                   'D': np.random.randn(11)})
I need something like the following. For each unique value i in 'A', I need the ratio of the number of times 'B' == 'one' over the number of times 'B' == 'two' when 'C' > 2.
So, an output would be something like:
foo = 0.75
I multiplied np.random.randn(11) by 10 so that the C > 2 constraint can actually be met, since np.random.randn rarely produces values that large. The following code produces what you want in steps; feel free to condense. Also, it was ambiguous whether the C > 2 constraint applies to both the numerator and the denominator or just the denominator; I assumed just the denominator. If you need it applied to the numerator as well, add the C > 2 filter to the n variable too. For this df, the ratio comes out as inf when a division by 0 occurs and nan when 0 is divided by 0.
for i in df.A.unique():
    # print unique value
    print(f"Unique Val: {i}")
    # numerator: number of times B == 'one' for this value of A
    print("Numerator:")
    n = (df[df.A == i].B == 'one').sum()
    print(n)
    # denominator: number of times B == 'two' for this value of A, with C > 2
    print("Denominator:")
    d = (df[(df.A == i) & (df.C > 2)].B == 'two').sum()
    print(d)
    # ratio
    print("Ratio:")
    r = n / d
    print(r, "\n")
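If you want to condense the loop, one possible vectorised variant (my own sketch, keeping the same denominator-only C > 2 assumption):

# numerator: count of B == 'one' per value of A
num = (df.B == 'one').groupby(df.A).sum()
# denominator: count of B == 'two' with C > 2 per value of A
den = ((df.B == 'two') & (df.C > 2)).groupby(df.A).sum()
print(num / den)  # inf on x/0, nan on 0/0, as in the loop above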
How can one calculate the cumulative standard deviation of groups with varying lengths?
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df.groupby('A')['B'].nunique() gives bar: 2, foo: 3
...but...
df.groupby('A')['C', 'D'].rolling(df.groupby('A')['B'].nunique(), min_periods=2).std()
...gives...
ValueError: window must be an integer
I think you could use expanding (new in pandas 0.18) to get a rolling window that grows with the size of the group, first setting B as the index and sorting:
df.set_index('B').sort_index().groupby(['A'])[['C', 'D']].expanding(2).std()
                  C         D
A   B
bar one         NaN       NaN
    two    0.174318  0.039794
foo one         NaN       NaN
    one    1.395085  1.364566
    three  1.010592  1.029694
    three  0.986744  0.957615
    two    0.854773  0.876763
    two    1.048024  0.807519
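If you don't need B in the index, newer pandas also lets you apply the expanding window directly on the groupby; a sketch of the same idea, using the df from the question:

df.groupby('A')[['C', 'D']].expanding(2).std()  # cumulative std per group, at least 2 rows per window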
In Python, how best to combine all rows of each column in a multi-column DataFrame into one column, separated by a ' | ' separator, including null values?
import pandas as pd

html = 'https://en.wikipedia.org/wiki/Visa_requirements_for_Norwegian_citizens'
df = pd.read_html(html, header=0)
df = df[1]
df.to_csv('norway.csv')
From this (the scraped table) to this (a single pipe-joined row per column); screenshots in the original post.
df = pd.DataFrame([
    {'A': 'x', 'B': 2, 'C': None},
    {'A': None, 'B': 2, 'C': 1},
    {'A': 'y', 'B': None, 'C': None},
])
pd.DataFrame(df.fillna('').apply(lambda x: '|'.join(x.astype(str)), axis=0)).transpose()
I believe you need to replace missing values with fillna if necessary, convert the values to strings with astype, and apply join. That returns a Series, so for a one-column DataFrame add to_frame and transpose:
df = df.fillna(' ').astype(str).apply('|'.join).to_frame().T
print(df)

                      Country Allowed_stay       Visa_requirement
0  Albania|Afganistan|Andorra     30|30|60  visa free| |visa free
Or use a list comprehension with the DataFrame constructor:

L = ['|'.join(df[x].fillna(' ').astype(str)) for x in df]
df1 = pd.DataFrame([L], columns=df.columns)
print(df1)

                      Country Allowed_stay       Visa_requirement
0  Albania|Afganistan|Andorra     30|30|60  visa free| |visa free
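One small note: the question asked for ' | ' with surrounding spaces as the separator; if you want that literally, swap the join string:

df = df.fillna(' ').astype(str).apply(' | '.join).to_frame().T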
I'm using pandas 0.16.0 & numpy 1.9.2
I did the following to add a calculated field (column) in the pivot table
Set up dataframe as follows,
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24),
                   'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)] +
                        [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
Pivoted the data frame as follows,
df1 = df.pivot_table(values=['D'],index=['A'],columns=['C'],aggfunc=np.sum,margins=False)
Tried adding a calculated field as follows, but I get an error (see below),
df1['D2'] = df1['D'] * 2
Error,
ValueError: Wrong number of items passed 2, placement implies 1
This is because you have a Hierarchical Index (i.e. MultiIndex) as columns in your 'pivot table' dataframe.
If you print out the result of df1['D'] * 2 you will notice that you get two columns:
C        bar     foo
A
one   -3.163 -10.478
three -2.988   1.418
two   -2.218   3.405
So to put it back to df1 you need to provide two columns to assign it to:
df1[[('D2','bar'), ('D2','foo')]] = df1['D'] * 2
Which yields:
           D               D2
C        bar    foo      bar     foo
A
one   -1.581 -5.239   -3.163 -10.478
three -1.494  0.709   -2.988   1.418
two   -1.109  1.703   -2.218   3.405
A more generalized approach:
new_cols = pd.MultiIndex.from_product([['D2'], df1['D'].columns])  # 'D2' must be wrapped in a list, or it is iterated character by character
df1[new_cols] = df1['D'] * 2
You can find more info on how to deal with MultiIndex in the docs
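A self-contained sketch of that generalized pattern on a small made-up pivot table (the .to_numpy() call is my own defensive choice, to avoid any column-alignment surprises across pandas versions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'bar'],
                   'D': [1.0, 2.0, 3.0, 4.0]})
df1 = df.pivot_table(values=['D'], index=['A'], columns=['C'], aggfunc='sum')

# pair the new top-level label 'D2' with every existing sub-column of 'D'
new_cols = pd.MultiIndex.from_product([['D2'], df1['D'].columns])
df1[new_cols] = (df1['D'] * 2).to_numpy()
print(df1)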
I'm loading a csv file, which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns: date, textA and textB - but want to apply "sum" to numberA, but "min" to numberB.
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions to two different columns, i.e. sum(numberA) and min(numberB)?
The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'number A': np.arange(8),
'number B': np.arange(8) * 2})
grouped = df.groupby('A')
print(grouped.agg({
'number A': 'sum',
'number B': 'min'}))
yields
     number B  number A
A
bar         2         9
foo         0        19
This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
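As a side note, pandas 0.25+ also supports named aggregation, which gives the same per-column control with explicit output names; a sketch using the same df as above:

print(df.groupby('A').agg(sum_A=('number A', 'sum'),
                          min_B=('number B', 'min')))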