I'd like to filter my dataset by picking rows that are between two values (dynamically defined as quantiles) for each group. Concretely, I have a dataset like
import pandas as pd
df = pd.DataFrame({'day': ['one', 'one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two'],
                   'weather': ['rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'rain', 'rain', 'sun', 'rain'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
I'd like to select the rows where the values are between the 0.1 and 0.9 quantiles for each day and each weather. I can calculate the quantiles via
df.groupby(['day', 'weather']).quantile([0.1, 0.9])
But then I feel stuck. Joining the resulting dataset with the original one is wasteful (the original dataset can be quite big), and I am wondering if there is something along the lines of
df.groupby(['day', 'weather']).select('value', between=[0.1, 0.9])
Transform value with quantile:
g = df.groupby(['day', 'weather'])['value']
df[df['value'].between(g.transform('quantile', 0.1), g.transform('quantile', 0.9))]
day weather value
1 one rain 2
4 one sun 5
8 two rain 9
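The same selection can also be written with lambda transforms, which generalizes to arbitrary per-group logic; a minimal sketch, reusing df and g from above (the lo/hi variables are just illustrative):
lo, hi = 0.1, 0.9  # hypothetical cutoffs held in variables
lower = g.transform(lambda s: s.quantile(lo))  # per-group lower bound, aligned to df
upper = g.transform(lambda s: s.quantile(hi))  # per-group upper bound, aligned to df
print(df[df['value'].between(lower, upper)])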
New to Pandas/Python, I have managed to make an index like the one below:
MultiIndex([( 1, 1, 4324),
( 1, 2, 8000),
( 1, 3, 8545),
( 1, 4, 8544),
( 1, 5, 7542),
(12, 30, 7854),
(12, 31, 7511)],
names=['month', 'day', 'count'], length=366)
I'm struggling to find out how I can store the first number in one list (the 1-12 one), the second number in another list (1-31 values), and the third number in a third, separate list (scores 0-9000).
I am trying to build a heatmap that is month × day on the axes, using count as the values, and failing horribly! I am assuming I have to separate month, day and count into separate lists to make the heat map?
data1 = pd.read_csv("a2data/Data1.csv")
data2 = pd.read_csv("a2data/Data2.csv")
merged_df = pd.concat([data1, data2])
merged_df.set_index(['month', 'day'], inplace=True)
merged_df.sort_index(inplace=True)
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().reset_index()
merged_df2.set_index(['month', 'day', 'count'], inplace=True)
# struggling here to separate out month, day and count in order to make a heatmap
Are you looking for:
# let's start here
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean()
# use sns
import seaborn as sns
sns.heatmap(merged_df2.unstack('day'))
Output: a month × day heatmap.
Or you can use plt:
import numpy as np
import matplotlib.pyplot as plt

merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().unstack('day')
plt.imshow(merged_df2)
plt.xticks(np.arange(merged_df2.shape[1]), merged_df2.columns)
plt.yticks(np.arange(merged_df2.shape[0]), merged_df2.index)
plt.show()
which gives a similar heatmap.
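That said, if you do want the three levels as separate lists, Index.get_level_values does it directly; a small sketch using a shortened stand-in for the index shown in the question:
import pandas as pd

# idx stands in for the month/day/count MultiIndex from the question
idx = pd.MultiIndex.from_tuples([(1, 1, 4324), (1, 2, 8000), (12, 31, 7511)],
                                names=['month', 'day', 'count'])
months = idx.get_level_values('month').tolist()  # first numbers (1-12)
days = idx.get_level_values('day').tolist()      # second numbers (1-31)
counts = idx.get_level_values('count').tolist()  # third numbers (the scores)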
How can one calculate cumulative standard deviation of groups with varying lengths?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df.groupby('A')['B'].nunique() gives bar: 2, foo: 3
...but...
df.groupby('A')[['C', 'D']].rolling(df.groupby('A')['B'].nunique(), min_periods=2).std()
...gives...
ValueError: window must be an integer
I think you could use expanding (new since pandas 0.18) to get a rolling window that expands with the size of the group, first setting B as the index and sorting:
df.set_index('B').sort_index().groupby(['A'])[['C', 'D']].expanding(2).std()
C D
A B
bar one NaN NaN
two 0.174318 0.039794
foo one NaN NaN
one 1.395085 1.364566
three 1.010592 1.029694
three 0.986744 0.957615
two 0.854773 0.876763
two 1.048024 0.807519
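For what it's worth, recent pandas versions also let you call expanding directly on the GroupBy object; a sketch of the same computation, assuming a version with grouped-expanding support (sorting by B first reproduces the ordering above):
# Expanding window per group without the manual set_index step
df.sort_values('B').groupby('A')[['C', 'D']].expanding(min_periods=2).std()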
Could this be a bug? When I use describe() or std() on a groupby object, I get different answers.
import pandas as pd
import numpy as np
import random as rnd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df.head()
df[['C','D']].groupby(['C'],as_index=False).describe()
# this line gives me the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.
df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives me the standard deviation of 'C' as 0, 1. I think this is wrong.
It makes sense. In the second case, you only compute the std of column D.
How? That's just how the groupby works. You
slice on C and D
groupby on C
call GroupBy.std
At step 3, you did not specify any column, so std was assumed to be computed on the column that was not the grouper... aka, column D.
As for why you see C with 0, 1... that's because you specified as_index=False, so the C column is inserted with values coming from the original DataFrame... which in this case is 0, 1.
Run this and it'll become clear.
df[['C','D']].groupby(['C']).std()
D
C
0 0.998201
1 NaN
When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,
df[['C','D']].groupby(['C'])[['C', 'D']].std()
C D
C
0 0.0 0.998201
1 NaN NaN
Which is exactly what describe gives, and what you're looking for.
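To see the as_index=False effect in isolation, here is a sketch on the same df:
# With as_index=False the grouper C is re-inserted as an ordinary column,
# carrying the original group labels 0 and 1 rather than a computed std
df[['C', 'D']].groupby(['C'], as_index=False).std()
#    C         D
# 0  0  0.998201
# 1  1       NaN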
My friend mukherjees and I have run some more trials with this one and decided that there really is an issue with std(). You can see in the following link how we show that "std() is not the same as .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report:
https://github.com/pandas-dev/pandas/issues/10355
Even with std(), you will get a zero standard deviation for C within each group. I just added a seed to your code to make it replicable. I am not sure what the issue is:
import pandas as pd
import numpy as np
import random as rnd
np.random.seed(1987)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': 1*(np.random.randn(8) > 0.5),
                   'D': np.random.randn(8)})
df
df[['C','D']].groupby(['C'],as_index=False).describe()
df[['C','D']].groupby(['C'],as_index=False).std()
To dig deeper, look at the source code of describe for groupby, which inherits from DataFrame.describe:
def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)
The code above shows that describe simply reports the result of std().
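A quick check (a sketch, assuming the seeded df above) confirms that describe's std column agrees with GroupBy.std once the same column is targeted:
g = df.groupby('C')['D']
print(g.describe()['std'])  # std of D per group, as reported by describe
print(g.std())              # same numbers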
Newbie trying to break my addiction to Excel. I have a data set of paid invoices with the vendor and the country where each was paid, along with the amount. For each vendor, I want to know the country in which they have the greatest invoice amount and what percentage of their total business is in that country. Using this data set, I want the result to be:
Desired output
import pandas as pd
import numpy as np
df = pd.DataFrame({'Company': ['bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
                   'Country': ['two', 'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
                   'Amount': [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
                   'Pct': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
CoCntry = df.groupby(['Company', 'Country'])
CoCntry = CoCntry.aggregate(np.sum)
After looking at multiple examples, including Extract row with max value, Getting max value using groupby, and Python: Getting the row which has the max value in groups using groupby, I've gotten as far as creating a DataFrameGroupBy summarizing the invoice data by country. I'm struggling with how to find the max row, after which I must figure out how to calculate the percentage. Advice welcome.
You can use transform to build a Series Pct of the Amount sums per group, grouping by the first index level, Company. Then filter the DataFrame down to the max row per group with idxmax, and finally divide the Amount column by Pct:
g = CoCntry.groupby(level='Company')['Amount']
Pct = g.transform('sum')
print (Pct)
Company Country
bar one 25
three 25
two 25
foo one 28
three 28
two 28
Name: Amount, dtype: int64
CoCntry = CoCntry.loc[g.idxmax()]
print (CoCntry)
Amount Pct
Company Country
bar one 11 0
foo two 11 0
CoCntry.Pct = CoCntry.Amount.div(Pct)
print (CoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
A similar alternative solution:
CoCntry = df.groupby(['Company', 'Country']).Amount.sum()
print (CoCntry)
Company Country
bar one 11
three 4
two 10
foo one 9
three 8
two 11
Name: Amount, dtype: int64
g = CoCntry.groupby(level='Company')
Pct = g.sum()
print (Pct)
Company
bar 25
foo 28
Name: Amount, dtype: int64
maxCoCntry = CoCntry.loc[g.idxmax()].to_frame()
maxCoCntry['Pct'] = maxCoCntry.Amount.div(Pct, level=0)
print (maxCoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
setup
df = pd.DataFrame({'Company': ['bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
                   'Country': ['two', 'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
                   'Amount': [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
                   })
solution
# sum total invoice per country per company
comp_by_country = df.groupby(['Company', 'Country']).Amount.sum()
# sum total invoice per company
comp_totals = df.groupby('Company').Amount.sum()
# percent of per company per country invoice relative to company
comp_by_country_pct = comp_by_country.div(comp_totals).rename('Pct')
answer to OP question
Which 'Country' has the greatest total invoice for each 'Company', and what percentage of that company's total business does it represent?
comp_by_country_pct.loc[
comp_by_country_pct.groupby(level=0).idxmax()
].reset_index()
Company Country Pct
0 bar one 0.440000
1 foo two 0.392857
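A variant of that last selection step, assuming comp_by_country_pct from above and a pandas recent enough to have Series.droplevel (0.24+): nlargest per group avoids the idxmax lookup, at the cost of a duplicated index level to drop:
# Top country per company via nlargest; the group key is prepended as an
# extra index level, so drop it afterwards
top = comp_by_country_pct.groupby(level='Company').nlargest(1).droplevel(0)
print(top.reset_index())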
I'm loading a csv file, which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns date, textA and textB, but apply sum to numberA and min to numberB.
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions to two different columns, i.e. sum(numberA), min(numberB).
The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})
grouped = df.groupby('A')
print(grouped.agg({
'number A': 'sum',
'number B': 'min'}))
yields
number B number A
A
bar 2 9
foo 0 19
The example above also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
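If you suspect such a character, one way to normalize the headers is a single cleaning pass; a sketch:
# Replace no-break spaces with ordinary spaces and strip stray whitespace
df.columns = df.columns.str.replace('\xa0', ' ').str.strip()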