Multiple condition grouping and counting in pandas

I am going to try to express this problem in the most general way possible. Suppose I have a pandas dataframe with multiple columns ['A', 'B', 'C', 'D'].
For each unique value in 'A', I need to get the following ratio: the number of times 'B' == x, divided by the number of times 'B' == y, when 'C' == q OR p...
I'm sorry, but I don't know how to express this pythonically.
Sample data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'zar', 'zar', 'bar', 'foo', 'bar', 'foo', 'bar', 'tar', 'foo', 'foo'],
                   'B': ['one', 'two', 'four', 'three', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(11), 'D': np.random.randn(11)})
I need something like the following. For each unique value i in 'A', I need the ratio of the number of times 'B' == 'one' over the number of times 'B' == 'two' when 'C' > 2.
So, an output would be something like:
foo = 0.75

I multiplied np.random.randn(11) by 10 so that the C > 2 constraint can actually be satisfied, since np.random.randn(11) mostly returns values between -3 and 3. The following code produces what you want in explicit steps; feel free to condense it. It was ambiguous whether the C > 2 constraint applies to both the numerator and the denominator or just the denominator; I assumed just the denominator. If you need it applied to the numerator as well, add the same (df.C > 2) condition to the n variable. Also note that the ratios returned for this df are inf when a divide by 0 occurs and nan when 0 is divided by 0.
for i in df.A.unique():
    # print unique value
    print(f"Unique Val: {i}")
    # numerator: rows for this value of A where B == 'one'
    print("Numerator:")
    n = (df[df.A == i].B == 'one').sum()
    print(n)
    # denominator: rows for this value of A where C > 2 and B == 'two'
    print("Denominator:")
    d = (df[(df.A == i) & (df.C > 2)].B == 'two').sum()
    print(d)
    # ratio
    print("Ratio:")
    r = n / d
    print(r, "\n")
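If you want this condensed into a single groupby instead of an explicit loop, a possible sketch (same assumption as above: the C > 2 filter only applies to the denominator) is:
def one_two_ratio(sub):
    # number of rows where B == 'one' (no C filter, per the assumption above)
    num = (sub.B == 'one').sum()
    # number of rows where B == 'two' and C > 2
    den = ((sub.B == 'two') & (sub.C > 2)).sum()
    return num / den  # inf or nan when the denominator is 0

ratios = df.groupby('A').apply(one_two_ratio)
print(ratios)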

Related

MultiIndex: advanced indexing - how to select different parts of DataFrame?

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df2 = pd.DataFrame(np.random.randn(8, 4), index=arrays)
The DataFrame I have is df2. Now I want to select all the rows of 'foo' ('one' and 'two'), but only the 'one' row of 'bar'. This seems very easy, but I have tried multiple things without success.
df2.loc['bar':('foo','one')]
produces a similar frame, but it also includes the rows of 'baz' that I don't want.
df2.loc[idx['foo','bar'], idx['one','two'], :]
(with idx = pd.IndexSlice) is also close, but it includes an extra 'two' row that I don't want.
Would be great if anybody could help and has some tips for handling the multiIndex!
In a single line, the simplest way IMO is to build an expression with query, referring to the unnamed index levels as ilevel_0 and ilevel_1:
df2.query("ilevel_0 == 'foo' or (ilevel_0 == 'bar' and ilevel_1 == 'one')")
0 1 2 3
bar one 0.249768 0.619312 1.851270 -0.593451
foo one 0.770139 -2.205407 0.359475 -0.754134
two -1.109005 -0.802934 0.874133 0.135057
Otherwise, using more conventional means, you may consider
pd.concat([df2.loc[['foo']], df2.loc[[('bar', 'one')]]])
0 1 2 3
foo one 0.770139 -2.205407 0.359475 -0.754134
two -1.109005 -0.802934 0.874133 0.135057
bar one 0.249768 0.619312 1.851270 -0.593451
Which has two parts:
df2.loc[['foo']]
0 1 2 3
foo one 0.770139 -2.205407 0.359475 -0.754134
two -1.109005 -0.802934 0.874133 0.135057
and,
df2.loc[[('bar', 'one')]]
0 1 2 3
bar one 0.249768 0.619312 1.85127 -0.593451
The extra brackets around each key are there to prevent the level from being dropped during the selection.
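If you prefer not to use query, a boolean mask built from the index levels works too; a small sketch using get_level_values (not from the original answer):
lvl0 = df2.index.get_level_values(0)
lvl1 = df2.index.get_level_values(1)
# keep every 'foo' row, plus only the ('bar', 'one') row
mask = (lvl0 == 'foo') | ((lvl0 == 'bar') & (lvl1 == 'one'))
print(df2[mask])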

How to count nulls in a group rowwise in pandas DataFrame

Following this topic https://stackoverflow.com/questions/19384532/how-to-count-number-of-rows-per-group-and-other-statistics-in-pandas-group-by I'd like to add one more stat: a count of null values (a.k.a. NaN) in the DataFrame:
import pandas as pd
import numpy as np

tdf = pd.DataFrame(columns=['indicator', 'v1', 'v2', 'v3', 'v4'],
                   data=[['A', '3', np.nan, '4', np.nan],
                         ['A', '3', '4', '4', np.nan],
                         ['B', np.nan, np.nan, np.nan, np.nan],
                         ['B', '1', None, np.nan, None],
                         ['C', '9', '7', '4', '0']])
I'd like to use something like this:
tdf.groupby('indicator').agg({'indicator': ['count']})
but with the addition of nulls counter to have it in separate column, like:
tdf.groupby('indicator').agg({'indicator': ['count', 'isnull']})
Now I get this error: AttributeError: Cannot access callable attribute 'isnull' of 'SeriesGroupBy' objects, try using the 'apply' method
How can I use pd.isnull() here, or something with its functionality?
Expected output would be:
indicator nulls
count count
indicator
A 2 3
B 2 7
C 1 0
Note that np.nan and None are counted the same way here.
First set_index on 'indicator', count the missing values in each row with isnull().sum(axis=1), and then aggregate that per-row count with 'count' and 'sum':
df = tdf.set_index('indicator').isnull().sum(axis=1).groupby(level=0).agg(['count','sum'])
print (df)
count sum
indicator
A 2 3
B 2 7
C 1 0
Detail:
print (tdf.set_index('indicator').isnull().sum(axis=1))
indicator
A 2
A 1
B 4
B 3
C 0
dtype: int64
Another solution is to use a custom function with GroupBy.apply:
def func(x):
    a = len(x)
    b = x.isnull().values.sum()
    return pd.Series([a, b], index=['indicator count', 'nulls count'])
df = tdf.set_index('indicator').groupby('indicator').apply(func)
print (df)
indicator count nulls count
indicator
A 2 3
B 2 7
C 1 0
I've found an almost satisfying answer myself (cons: a bit too complicated). In R, for example, I'd use rowSums on the is.na(df) matrix; this is essentially the same approach here, just with more code, unfortunately.
def count_nulls_rowwise_by_group(tdf, group):
    cdf = pd.concat([tdf[group], pd.isnull(tdf).sum(axis=1).rename('nulls')], axis=1)
    return cdf.groupby(group).agg({group: 'count', 'nulls': 'sum'}).rename(index=str, columns={group: 'count'})

count_nulls_rowwise_by_group(tdf, 'indicator')
gives:
count nulls
indicator
A 2 3
B 2 7
C 1 0
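On pandas 0.25 or newer, named aggregation makes the same per-row null count a bit more compact; a small sketch of the same idea:
nulls_per_row = tdf.set_index('indicator').isna().sum(axis=1)
print(nulls_per_row.groupby(level=0).agg(count='count', nulls='sum'))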

std() groupby Pandas issue

Could this be a bug? When I use describe() and std() on a groupby object, I get different answers:
import pandas as pd
import numpy as np
import random as rnd
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : 1*(np.random.randn(8)>0.5),
                   'D' : np.random.randn(8)})
df.head()
df[['C','D']].groupby(['C'],as_index=False).describe()
# this line gives the standard deviation of 'C' as 0 and 0. Within each group the value of C is constant, so that makes sense.
df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives the standard deviation of 'C' as 0 and 1. I think this is wrong.
It makes sense. In the second case, you only compute the std of column D.
How? That's just how groupby works. You
1. slice on C and D,
2. group by C,
3. call GroupBy.std.
At step 3, you did not specify any column, so std was computed on the column that was not the grouper, i.e. column D.
As for why you see C with 0, 1: that's because you specified as_index=False, so the C column is inserted with values coming from the original DataFrame, which in this case are 0 and 1.
Run this and it'll become clear.
df[['C','D']].groupby(['C']).std()
D
C
0 0.998201
1 NaN
When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,
df[['C','D']].groupby(['C'])[['C', 'D']].std()
C D
C
0 0.0 0.998201
1 NaN NaN
Which is exactly what describe gives, and what you're looking for.
My friend mukherjees and I have run some more trials with this one and concluded that there really is an issue with std(). We were able to show that "std() is not the same as .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report:
https://github.com/pandas-dev/pandas/issues/10355
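As an aside (not from the original answer), one common source of such mismatches is the default degrees of freedom: np.std uses ddof=0 while pandas' std uses ddof=1. A small illustration with arbitrary values:
s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.std())           # pandas default ddof=1 (sample std)     -> 1.2909944...
print(np.std(s.values))  # numpy default ddof=0 (population std)  -> 1.1180339...
print(s.std(ddof=0))     # matches numpy once ddof is aligned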
Even with std(), you will get a zero standard deviation for C within each group. I just added a seed to your code to make it reproducible. I am not sure what the issue is:
import pandas as pd
import numpy as np
import random as rnd
np.random.seed(1987)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : 1*(np.random.randn(8)>0.5),
'D' : np.random.randn(8)})
df
df[['C','D']].groupby(['C'],as_index=False).describe()
df[['C','D']].groupby(['C'],as_index=False).std()
To dig a bit deeper: if you look at the source code of describe for groupby, which inherits from DataFrame.describe,
def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)
The code above shows that describe simply reports the result of std().

Extract row with maximum value in DataFrameGroupBy

Newbie trying to break my addiction to Excel. I have a data set of paid invoices with the vendor, the country where it was paid, and the amount. I want to know, for each vendor, in which country they have the greatest invoice amount and what percentage of their total business is in that country. Using this data set, I want the result to be:
Desired output
import pandas as pd
import numpy as np
df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
'Pct' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
CoCntry = df.groupby(['Company', 'Country'])
CoCntry.aggregate(np.sum)
After looking at multiple examples, including Extract row with max value, Getting max value using groupby, and Python: Getting the row which has the max value in groups using groupby, I've gotten as far as creating a DataFrameGroupBy summarizing the invoice data by country. I'm struggling with how to find the max row, after which I must figure out how to calculate the percentage. Advice welcome.
You can use transform to get a Series Pct of the summed values per group at the first level, Company. Then filter the aggregated frame down to the row with the max value per group using idxmax, and finally divide the Amount column by Pct. First assign the aggregated sums to CoCntry (the last line of the question's code computes them but does not keep them):
CoCntry = CoCntry.aggregate(np.sum)
g = CoCntry.groupby(level='Company')['Amount']
Pct = g.transform('sum')
print (Pct)
Company Country
bar one 25
three 25
two 25
foo one 28
three 28
two 28
Name: Amount, dtype: int64
CoCntry = CoCntry.loc[g.idxmax()]
print (CoCntry)
Amount Pct
Company Country
bar one 11 0
foo two 11 0
CoCntry.Pct = CoCntry.Amount.div(Pct)
print (CoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
A similar alternative solution:
CoCntry = df.groupby(['Company', 'Country']).Amount.sum()
print (CoCntry)
Company Country
bar one 11
three 4
two 10
foo one 9
three 8
two 11
Name: Amount, dtype: int64
g = CoCntry.groupby(level='Company')
Pct = g.sum()
print (Pct)
Company
bar 25
foo 28
Name: Amount, dtype: int64
maxCoCntry = CoCntry.loc[g.idxmax()].to_frame()
maxCoCntry['Pct'] = maxCoCntry.Amount.div(Pct, level=0)
print (maxCoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
setup
df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
})
solution
# sum total invoice per country per company
comp_by_country = df.groupby(['Company', 'Country']).Amount.sum()
# sum total invoice per company
comp_totals = df.groupby('Company').Amount.sum()
# percent of per company per country invoice relative to company
comp_by_country_pct = comp_by_country.div(comp_totals).rename('Pct')
answer to OP question
Which 'Country' has the greatest total invoice for each 'Company', and what percentage of that company's total business does it represent?
comp_by_country_pct.loc[
comp_by_country_pct.groupby(level=0).idxmax()
].reset_index()
Company Country Pct
0 bar one 0.440000
1 foo two 0.392857
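For reference, the same result can be reached in a single chain with sort_values and head instead of idxmax; a possible sketch using the setup above:
sums = df.groupby(['Company', 'Country']).Amount.sum()
pct = (sums / sums.groupby(level='Company').transform('sum')).rename('Pct')
# keep only the largest share per company
top = pct.sort_values(ascending=False).groupby(level='Company').head(1)
print(top.reset_index())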

How do I use pandas to add a calculated column in a pivot table?

I'm using pandas 0.16.0 & numpy 1.9.2
I did the following to add a calculated field (column) in the pivot table
Set up the dataframe as follows:
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24),
                   'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)]
                      + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
Pivoted the data frame as follows,
df1 = df.pivot_table(values=['D'],index=['A'],columns=['C'],aggfunc=np.sum,margins=False)
Then I tried adding a calculated field as follows, but I get an error (see below):
df1['D2'] = df1['D'] * 2
Error,
ValueError: Wrong number of items passed 2, placement implies 1
This is because you have a Hierarchical Index (i.e. MultiIndex) as columns in your 'pivot table' dataframe.
If you print out the result of df1['D'] * 2 you will notice that you get two columns:
C bar foo
A
one -3.163 -10.478
three -2.988 1.418
two -2.218 3.405
So to put it back to df1 you need to provide two columns to assign it to:
df1[[('D2','bar'), ('D2','foo')]] = df1['D'] * 2
Which yields:
D D2
C bar foo bar foo
A
one -1.581 -5.239 -3.163 -10.478
three -1.494 0.709 -2.988 1.418
two -1.109 1.703 -2.218 3.405
A more generalized approach:
new_cols = pd.MultiIndex.from_product([['D2'], df1.D.columns])
df1[new_cols] = df1.D * 2
You can find more info on how to deal with MultiIndex in the docs
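As a further option (not from the original answer), pd.concat with a dict key can add the new top column level without building the MultiIndex by hand:
# wrap the doubled block under a new top-level label 'D2',
# then glue it onto the pivot table column-wise
d2 = pd.concat({'D2': df1['D'] * 2}, axis=1)
df1 = pd.concat([df1, d2], axis=1)
print(df1)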