Newbie trying to break my addiction to Excel. I have a data set of paid invoices with the vendor, the country where each invoice was paid, and the amount. For each vendor, I want to know in which country they have the greatest invoice amount and what percentage of their total business that country represents. Using the data set below, the desired output is:
  Company Country  Amount       Pct
0     bar     one      11  0.440000
1     foo     two      11  0.392857
My setup so far:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Company': ['bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
                   'Country': ['two', 'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
                   'Amount': [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
                   'Pct': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})

CoCntry = df.groupby(['Company', 'Country'])
CoCntry.aggregate(np.sum)
After looking at multiple examples, including Extract row with max value, Getting max value using groupby, and Python: Getting the Row which has the max value in groups using groupby, I've gotten as far as creating a DataFrameGroupBy summarizing the invoice data by country. I'm struggling with how to find the max row, after which I must figure out how to calculate the percentage. Advice welcome.
You can use transform to get a Series Pct of summed values per group at the first index level, Company. Then filter the DataFrame to the max row per group with idxmax, and finally divide the Amount column by Pct:
CoCntry = df.groupby(['Company', 'Country']).sum()  # aggregate first; the question's CoCntry was left un-aggregated
g = CoCntry.groupby(level='Company')['Amount']
Pct = g.transform('sum')
print (Pct)
Company Country
bar one 25
three 25
two 25
foo one 28
three 28
two 28
Name: Amount, dtype: int64
CoCntry = CoCntry.loc[g.idxmax()]
print (CoCntry)
Amount Pct
Company Country
bar one 11 0
foo two 11 0
CoCntry['Pct'] = CoCntry['Amount'].div(Pct)
print (CoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
Another, similar solution:
CoCntry = df.groupby(['Company', 'Country']).Amount.sum()
print (CoCntry)
Company Country
bar one 11
three 4
two 10
foo one 9
three 8
two 11
Name: Amount, dtype: int64
g = CoCntry.groupby(level='Company')
Pct = g.sum()
print (Pct)
Company
bar 25
foo 28
Name: Amount, dtype: int64
maxCoCntry = CoCntry.loc[g.idxmax()].to_frame()
maxCoCntry['Pct'] = maxCoCntry.Amount.div(Pct, level=0)
print (maxCoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
setup
df = pd.DataFrame({'Company': ['bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
                   'Country': ['two', 'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
                   'Amount': [4, 2, 2, 6, 4, 5, 6, 7, 8, 9]})
solution
# sum total invoice per country per company
comp_by_country = df.groupby(['Company', 'Country']).Amount.sum()
# sum total invoice per company
comp_totals = df.groupby('Company').Amount.sum()
# percent of per company per country invoice relative to company
comp_by_country_pct = comp_by_country.div(comp_totals).rename('Pct')
answer to OP question
Which 'Country' has the greatest total invoice amount for each 'Company', and what percentage of that company's total business does it represent?
comp_by_country_pct.loc[
    comp_by_country_pct.groupby(level=0).idxmax()
].reset_index()
Company Country Pct
0 bar one 0.440000
1 foo two 0.392857
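If you also want the winning Amount next to the percentage, one option (a sketch reusing the comp_by_country and comp_by_country_pct objects built above) is to concatenate the two Series before the idxmax lookup:
# combine summed amounts and percentages, then take each company's best country
both = pd.concat([comp_by_country, comp_by_country_pct], axis=1)
both.loc[comp_by_country.groupby(level=0).idxmax()].reset_index()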
Related
I'd like to filter my dataset by picking rows that are between two values (dynamically defined as quantiles) for each group. Concretely, I have a dataset like
import pandas as pd
df = pd.DataFrame({'day': ['one', 'one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two'],
                   'weather': ['rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'rain', 'rain', 'sun', 'rain'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
I'd like to select the rows where the values are between the 0.1 and 0.9 quantiles for each day and each weather. I can calculate the quantiles via
df.groupby(['day', 'weather']).quantile([0.1, .9])
But then I feel stuck. Joining the resulting dataset with the original one seems wasteful (the original dataset can be quite big), and I am wondering if there is something along the lines of
df.groupby(['day', 'weather']).select('value', between=[0.1, 0.9])
Transform value with quantile
g = df.groupby(['day', 'weather'])['value']
df[df['value'].between(g.transform('quantile', 0.1), g.transform('quantile', 0.9))]
day weather value
1 one rain 2
4 one sun 5
8 two rain 9
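If you need this pattern in several places, it can be wrapped in a small helper. The function name and signature below are hypothetical, not a pandas API:
def select_between_quantiles(df, by, col, lo=0.1, hi=0.9):
    # keep rows whose col value lies between the per-group lo/hi quantiles
    g = df.groupby(by)[col]
    return df[df[col].between(g.transform('quantile', lo),
                              g.transform('quantile', hi))]

select_between_quantiles(df, ['day', 'weather'], 'value')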
I would like to merge n data frames based on certain variables (external to the data frame).
Let me clarify the problem referring to an example.
We have two dataframes detailing the height and age of certain members of a population.
On top, we are given one array per data frame, containing one value per property (so array length = number of columns with numerical value in the data frame).
Consider the following two data frames
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                    'Age': [3, 8, 4, 2, 5],
                    'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D'],
                    'Age': [4, 6, 4],
                    'Height': [3, 9, 2]})
which look like
( Name Age Height
0 A 3 7
1 B 8 2
2 C 4 1
3 D 2 4
4 E 5 9,
Name Age Height
0 A 4 3
1 B 6 9
2 D 4 2)
As mentioned, we also have two arrays, say
array1 = np.array([ 1, 5])
array2 = np.array([2, 3])
To make the example concrete, let us say each array contains the year in which the property was measured.
The output should be constructed as follows:
if an individual appears only in one dataframe, its properties are taken from said dataframe
if an individual appears in more than one data frame, then for each property take the value from the data frame whose associated array has the higher corresponding entry. So, for property i, compare array1[[i]] and array2[[i]], and take the property values from df1 if array1[[i]] > array2[[i]], and vice versa.
In the context of the example, the rule translates to: take the property value that was measured most recently, when several measurements are available.
The output given the example data frames should look like
Name Age Height
0 A 4 7
1 B 6 2
2 C 4 1
3 D 4 4
4 E 5 9
Indeed, for the first property, "Age", since array1[[0]] < array2[[0]], values are taken from the second dataframe for the individuals available there (A, B, D). The remaining values come from the first dataframe.
For the second property, "Height", since array1[[1]] > array2[[1]], values come from the first dataframe, which already describes all the individuals.
At the moment I have a solution based on looping over properties, but it is rather convoluted; I am wondering if any Pandas expert out there could help me towards an elegant solution.
Thanks for your support.
Your question is a bit confusing: array indexes start from 0 so I think in your example it should be [[0]] and [[1]] instead of [[1]] and [[2]].
You can first concatenate your dataframes to have all names listed, then loop over your columns and update the values where the corresponding array is greater (I added a Z row to df2 to show new rows are being added):
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                    'Age': [3, 8, 4, 2, 5],
                    'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D', 'Z'],
                    'Age': [4, 6, 4, 8],
                    'Height': [3, 9, 2, 7]})
array1 = np.array([1, 5])
array2 = np.array([2, 3])

df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)

# keep all of df1, then append the rows of df2 whose Name is not already in df1
df3 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])

for i, col in enumerate(df1.columns):
    # compare scalars, not one-element arrays
    if array2[i] > array1[i]:
        df3[col].update(df2[col])
print(df3)
Note: you have to set Name as the index so that update aligns the right rows.
Output:
Age Height
Name
A 4 7
B 6 2
C 4 1
D 4 4
E 5 9
Z 8 7
If you have more than two dataframes in a list, you'll have to store your arrays in a list as well and iterate over the dataframe list while keeping track of the highest array values seen so far, as in the sketch below.
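A minimal sketch of that generalization, assuming all frames share the same columns and already use Name as the index; the dfs and arrays names are illustrative:
dfs = [df1, df2]            # any number of frames, Name set as index
arrays = [array1, array2]   # one array of measurement years per frame

# start from the first frame, then append names not seen yet
merged = dfs[0].copy()
for other in dfs[1:]:
    merged = pd.concat([merged, other[~other.index.isin(merged.index)]])

# per column, remember the highest array value seen so far and update from its frame
best = np.array(arrays[0], dtype=float)
for other, arr in zip(dfs[1:], arrays[1:]):
    for i, col in enumerate(merged.columns):
        if arr[i] > best[i]:
            merged[col].update(other[col])
            best[i] = arr[i]
print(merged)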
I have a dataframe with multiple columns and rows. One column, say 'name', has several rows with names, the same name used multiple times. Other columns, say 'x', 'y', 'z', 'zz', have values. I want to group by name, get the mean of each column (x, y, z, zz) for each name, and then plot the result on a bar chart.
pandas.DataFrame.groupby is an important data-wrangling tool. Let's first make a dummy Pandas data frame.
df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
"x": [2, 3, 4, 5, 6, 7],
"y": [5, -3, 10, 34, 1, 54],
"z": [10.6, 99.9, 546.23, 34.12, 65.04, -74.29]})
>>>
name x y z
0 John 2 5 10.60
1 Sansa 3 -3 99.90
2 Bran 4 10 546.23
3 John 5 34 34.12
4 Sansa 6 1 65.04
5 Bran 7 54 -74.29
We can use a column label to group the data (here the label is "name"). Explicitly naming the by parameter can be omitted (cf. df.groupby("name")).
df.groupby(by = "name").mean().plot(kind = "bar")
which gives us a nice bar graph.
Transposing the group-by result using T (as also suggested by anky) yields a different visualization. We can also pass a dictionary as the by parameter to determine the groups; by can also be a function, a Pandas Series, or an ndarray (see the sketch after the next example).
df.groupby(by = {1: "Sansa", 2: "Bran"}).mean().T.plot(kind = "bar")
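For instance, a callable passed as by is applied to each index label. A small sketch; the even/odd grouping is purely illustrative:
# group rows by whether their index label is even or odd
df.groupby(by=lambda idx: 'even' if idx % 2 == 0 else 'odd').mean(numeric_only=True)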
I am going to try to express this problem in the most general way possible. Suppose I have a pandas dataframe with multiple columns ['A', 'B', 'C', 'D'].
For each unique value in 'A', I need to get the following ratio: the number of times 'B' == x, divided by the number of times 'B' == y, when 'C' == q OR p...
I'm sorry, but I don't know how to express this pythonically.
Sample data:
df = pd.DataFrame({'A': ['foo', 'zar', 'zar', 'bar', 'foo', 'bar', 'foo', 'bar', 'tar', 'foo', 'foo'],
                   'B': ['one', 'two', 'four', 'three', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(11),
                   'D': np.random.randn(11)})
I need something like the following. For each unique value i in 'A', I need the ratio of the number of times 'B' == 'one' over the number of times 'B' == 'two' when 'C' > 2.
So, an output would be something like:
foo = 0.75
I multiplied np.random.randn(11) by 10 so that the C > 2 constraint can actually select rows, since standard-normal draws rarely exceed 2. The following code produces what you want in steps; feel free to condense it. Also, it was ambiguous whether the C > 2 constraint applies to both the numerator and the denominator or just the denominator; I assumed just the denominator. If it should apply to the numerator as well, add the same C > 2 filter to the n variable. Finally, with this df the ratio is inf when a divide-by-zero occurs and nan when 0 is divided by 0.
for i in df.A.unique():
    # print unique value
    print(f"Unique Val: {i}")
    # print numerator
    print("Numerator:")
    n = (df[df.A == i].B == 'one').sum()
    print(n)
    # print denominator (combine both masks instead of chaining boolean indexers)
    print("Denominator:")
    d = (df[(df.A == i) & (df.C > 2)].B == 'two').sum()
    print(d)
    # print ratio
    print("Ratio:")
    r = n / d
    print(r, "\n")
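A loop-free alternative, sketched under the same assumption that the C > 2 filter applies only to the denominator:
# numerator: per value of A, count of rows with B == 'one'
num = df[df.B == 'one'].groupby('A').size()
# denominator: per value of A, count of rows with B == 'two' and C > 2
den = df[(df.B == 'two') & (df.C > 2)].groupby('A').size()
# division aligns on A; groups missing on either side come out as NaN rather than inf
print(num.div(den))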
I'm using pandas 0.16.0 & numpy 1.9.2
I did the following to add a calculated field (column) in the pivot table
Set up dataframe as follows,
import datetime

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24),
                   'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)]
                      + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
Pivoted the data frame as follows,
df1 = df.pivot_table(values=['D'],index=['A'],columns=['C'],aggfunc=np.sum,margins=False)
Tried adding a calculated field as follows, but I get an error (see below),
df1['D2'] = df1['D'] * 2
Error,
ValueError: Wrong number of items passed 2, placement implies 1
This is because you have a hierarchical index (i.e. a MultiIndex) for the columns of your 'pivot table' dataframe.
If you print out the result of df1['D'] * 2 you will notice that you get two columns:
C bar foo
A
one -3.163 -10.478
three -2.988 1.418
two -2.218 3.405
So to put it back into df1 you need to provide two columns to assign it to:
df1[[('D2','bar'), ('D2','foo')]] = df1['D'] * 2
Which yields:
D D2
C bar foo bar foo
A
one -1.581 -5.239 -3.163 -10.478
three -1.494 0.709 -2.988 1.418
two -1.109 1.703 -2.218 3.405
A more generalized approach:
new_cols = pd.MultiIndex.from_product([['D2'], df1.D.columns])  # each argument level must be list-like
df1[new_cols] = df1.D * 2
You can find more info on how to deal with a MultiIndex in the pandas docs.
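Another way to attach the new top-level label, shown here as a sketch rather than as part of the answer above, is to concatenate column-wise with keys:
# build the D2 block under its own top-level label, then concat column-wise
d2 = pd.concat([df1['D'] * 2], axis=1, keys=['D2'])
df1 = pd.concat([df1, d2], axis=1)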