Get the Pandas groupby agg output columns - pandas

Here is my code
import pandas as pd
df = pd.DataFrame()
df['country'] = ['UK', 'UK', 'USA', 'USA', 'USA']
df['name'] = ['United Kingdom', 'United Kingdom', 'United States', 'United States', 'United States']
df['year'] = [1, 2, 1, 2, 3]
df['x'] = [100, 125, 200, 225, 250]
print(df.groupby(['country', 'name']).agg({'x':['mean', 'count']}))
The output I get is
x
mean count
country name
UK United Kingdom 112.5 2
USA United States 225.0 3
But I need a result as a list of rows
[['UK','United Kingdom',112.5,2],...]
or columns
[['UK', 'USA'],['United Kingdom','United States'],[112.5,225],[2,3]]
The name column can consist of an arbitrary number of words, e.g. Kingdom of the Netherlands.
Thank you

Convert the MultiIndex to columns with the as_index=False parameter, then convert the DataFrame to a NumPy array and finally to a list:
print(df.groupby(['country', 'name'], as_index=False).agg({'x':['mean', 'count']}).to_numpy().tolist())
[['UK', 'United Kingdom', 112.5, 2], ['USA', 'United States', 225.0, 3]]
For the second output, transpose the DataFrame first:
print(df.groupby(['country', 'name'], as_index=False).agg({'x':['mean', 'count']}).T.to_numpy().tolist())
[['UK', 'USA'], ['United Kingdom', 'United States'], [112.5, 225.0], [2, 3]]
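As a variation on the answer above, here is a minimal sketch that uses named aggregation so the result has flat column names. The output names mean and count are my own choice, not something from the question. Building the column lists one column at a time also keeps each column's own dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['UK', 'UK', 'USA', 'USA', 'USA'],
    'name': ['United Kingdom', 'United Kingdom', 'United States',
             'United States', 'United States'],
    'year': [1, 2, 1, 2, 3],
    'x': [100, 125, 200, 225, 250],
})

# Named aggregation: the keyword names ("mean", "count") become flat column
# names, so no MultiIndex appears on the columns.
result = df.groupby(['country', 'name'], as_index=False).agg(
    mean=('x', 'mean'),
    count=('x', 'count'),
)

# One inner list per row
rows = result.to_numpy().tolist()

# One inner list per column
columns = [result[col].tolist() for col in result.columns]

print(rows)
print(columns)
```

Because the names are plain strings with no whitespace splitting involved, multi-word values such as Kingdom of the Netherlands should pass through unchanged.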

Related

Inputting first and last name to output a value in Pandas Dataframe

I am trying to create an input function that returns a value for the corresponding first and last name.
For this example I'd like to be able to enter "Emily" and "Bell" and return "attempts: 2"
Here's my code so far:
import pandas as pd
import numpy as np
data = {
'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'lastname': ['Thompson','Wu', 'Downs','Hunter','Bell','Cisneros', 'Becker', 'Sims', 'Gallegos', 'Horne'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no',
'yes', 'yes', 'no', 'no', 'yes']
}
data
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
fname = input()
lname = input()
print(f"{fname} {lname}'s number of attempts: {???}")
I thought there would be specific documentation for this, but I can't find any in the pandas DataFrame documentation. I am assuming it's pretty simple, but I can't find it.
fname = input()
lname = input()
# use loc to filter the row and then capture the value from attempts columns
print(f"{fname} {lname}'s number of attempts:{df.loc[df['name'].eq(fname) & df['lastname'].eq(lname)]['attempts'].squeeze()}")
Emily
Bell
Emily Bell's number of attempts:2
Alternatively, to avoid mismatches due to case:
fname = input().lower()
lname = input().lower()
print(f"{fname} {lname}'s number of attempts:{df.loc[(df['name'].str.lower() == fname) & (df['lastname'].str.lower() == lname)]['attempts'].squeeze()}")
emily
BELL
emily bell's number of attempts:2
Try this:
df[(df['name'] == fname) & (df['lastname'] == lname)]['attempts'].squeeze()
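One thing to watch with squeeze(): if no row matches, it returns an empty Series rather than a value. Below is a rough sketch of a small helper that handles this case. The function name lookup_attempts and the choice to return None when nothing matches are my own assumptions, not part of the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'lastname': ['Thompson', 'Wu', 'Downs', 'Hunter', 'Bell',
                 'Cisneros', 'Becker', 'Sims', 'Gallegos', 'Horne'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
}, index=list('abcdefghij'))


def lookup_attempts(frame, fname, lname):
    """Hypothetical helper: return attempts for the first matching person, or None."""
    # Case-insensitive comparison on both name columns
    mask = (frame['name'].str.lower() == fname.lower()) & \
           (frame['lastname'].str.lower() == lname.lower())
    matches = frame.loc[mask, 'attempts']
    if matches.empty:
        return None
    # If several rows match, this sketch simply takes the first one
    return int(matches.iloc[0])


print(lookup_attempts(df, 'Emily', 'Bell'))
print(lookup_attempts(df, 'Nobody', 'Here'))
```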

How to return a list into a dataframe based on matching index of other column

I have two data frames: one with a single column of lists (built from a NumPy array), and another with two columns. I am trying to use the elements in the first data frame (df) as index labels into df2 to get two columns, o1 and o2. Any input would be appreciated. Please note that the string 'A1' appears twice in column o1 of df2, and as you can see in my desired output data frame, the duplicates are removed in column o1.
import numpy as np
import pandas as pd
# dtype=object is required because the inner lists have different lengths
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({ 'A': array_1, 'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1','A1','C1','F1']],
'o2': [[15, 18, 19], [19, 20, 8], [17,18,19,8]] })
# Note: row 0 of df contains indices 0 and 2, which both map to 'A1' in df2; the output keeps only one 'A1'.
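I believe you can explode df and use the result to look up rows in df2, then join back to df. Note that set does not preserve order, so the elements may come out in a different order than in your desired output: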
s = df['A'].explode()
df_output= df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]
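If the order of the elements matters, one possible tweak is to deduplicate with dict.fromkeys instead of set, since dicts keep insertion order. This is only a sketch under that assumption. It builds the A column from plain Python lists rather than the NumPy array in the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]]})
df2 = pd.DataFrame({
    'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'],
    'o2': [15, 17, 18, 19, 20, 7, 8],
})

# One row per list element; the index still points at the original row of df
s = df['A'].explode().astype(int)

# Look up the matching rows of df2 by label
looked_up = df2.loc[s.to_numpy()]

# Group by the original row of df and drop duplicates while keeping order
grouped = looked_up.groupby(s.index.to_numpy()).agg(
    lambda col: list(dict.fromkeys(col))
)

df_output = df.join(grouped)
print(df_output)
```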

How to reshape an unstacked Pandas data frame to "long" form before passing it to a plotting function

I'm trying to make a simple bar plot displaying ratios using the Plotly px.bar() function.
I have the following data set:
test_df = pd.DataFrame({'Manufacturer':['Ford', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
'Metric':['Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty'],
'Sector':['Germany', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA'],
'Value':[45000, 70000, 90000, 65000, 40000, 65000, 63000, 2700, 4400, 3400, 3000, 4700, 5700, 1500, 2000, 2500, 1300, 2000, 2450],
'City': ['Frankfurt', 'Bremen', 'Berlin', 'Hamburg', 'New York', 'Chicago', 'Los Angeles', 'Dresden', 'Munich', 'Cologne', 'Miami', 'Atlanta', 'Phoenix', 'Nuremberg', 'Dusseldorf', 'Leipzig', 'Houston', 'San Diego', 'San Francisco']
})
I reset the index and create a pivot table, as follows:
temp_table = test_df.reset_index().pivot_table(values = 'Value', index = ['Manufacturer', 'Metric', 'Sector'], aggfunc='sum')
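Then, I create two new Series (the pivot table keeps Manufacturer, Metric and Sector in its index, so I reset the index before calling set_index):
s1 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Orders'").Value
s2 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Sales'").Value
Then, I divide one by the other and unstack the result: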
s1.div(s2).unstack()
Which gives me:
Sector          Germany        USA
Manufacturer
BMW           19.117647  11.052632
Ford          42.592593  13.333333
Mercedes      20.454545  13.829787
I'd like to be able to make a bar plot using the data above, with Manufacturer on the x-axis and bars colored by Sector.
To do so, I think I need the data to be in the following long form:
Manufacturer Sector Ratio
BMW Germany 19.117647
Ford Germany 42.592593
Mercedes Germany 20.454545
BMW USA 11.052632
Ford USA 13.333333
Mercedes USA 13.829787
Question: how would I reshape the unstacked data above such that I would be able to pass it to the Plotly px.bar() function, which requires the following for the x-axis and y-axis arguments:
x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to position marks along the x axis in cartesian coordinates. Either x or y can optionally be a list of column references or array_likes, in which case the data will be treated as if it were ‘wide’ rather than ‘long’.
Thanks in advance!
Just skip the unstack and reset the index instead:
df_out=s1.div(s2).reset_index()
This should give you the bar chart you have up there.
test_df.groupby(['Manufacturer', 'Sector'])['Value'].sum().unstack('Sector').plot.bar()
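To get the long form with a column actually named Ratio, here is a rough sketch. It uses a smaller frame holding the already summed Orders and Sales totals from the question, and the variable names totals and ratio_long are my own. The plotting call is left as a comment because it assumes Plotly is installed.

```python
import pandas as pd

# Pre-summed Orders and Sales values per manufacturer and sector
totals_df = pd.DataFrame({
    'Manufacturer': ['Ford', 'Mercedes', 'BMW'] * 4,
    'Sector': (['Germany'] * 3 + ['USA'] * 3) * 2,
    'Metric': ['Orders'] * 6 + ['Sales'] * 6,
    'Value': [115000, 90000, 65000, 40000, 65000, 63000,
              2700, 4400, 3400, 3000, 4700, 5700],
})

# One column per metric, indexed by (Manufacturer, Sector)
totals = (totals_df
          .groupby(['Manufacturer', 'Sector', 'Metric'])['Value']
          .sum()
          .unstack('Metric'))

# Dividing two aligned Series and resetting the index gives the long form;
# name='Ratio' sets the name of the value column
ratio_long = (totals['Orders'] / totals['Sales']).reset_index(name='Ratio')
print(ratio_long)

# With Plotly installed, this long frame could then be plotted with:
# import plotly.express as px
# px.bar(ratio_long, x='Manufacturer', y='Ratio', color='Sector', barmode='group')
```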

Restructuring dataframe in Python (pandas)

Here is the format of the original data:
data_01 = {'State': ['New York', 'California'],
'May_01_A': [1, 2],
'May_01_B': [3, 4],
'May_02_A': [5, 6],
'May_02_B': [7, 8],}
df_01 = pd.DataFrame(data_01)
I would like to restructure it like this:
data_02 = {'Date': ['May_01', 'May_01', 'May_02', 'May_02'],
'State': ['New York', 'California', 'New York', 'California'],
'Obs_A': [1, 2, 5, 6],
'Obs_B': [3, 4, 7, 8],}
df_02 = pd.DataFrame(data_02)
Any advice would be welcome. Thanks!
You can use pd.wide_to_long, then unstack and stack to move the dates into rows:
s=pd.wide_to_long(df_01,['May_01','May_02'],i='State',j='Date',suffix='\\w+',sep='_').unstack(1).stack(0).reset_index()
Date State level_1 A B
0 New York May_01 1 3
1 New York May_02 5 7
2 California May_01 2 4
3 California May_02 6 8
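If the list of date stubs is not known in advance, another option is to melt first and split the column names on the last underscore. This is a sketch that assumes every value column follows the Date_Suffix pattern and that the pandas version is recent enough for pivot to accept a list for index. The intermediate names key and Obs are made up for illustration.

```python
import pandas as pd

df_01 = pd.DataFrame({
    'State': ['New York', 'California'],
    'May_01_A': [1, 2],
    'May_01_B': [3, 4],
    'May_02_A': [5, 6],
    'May_02_B': [7, 8],
})

# Wide to long: one row per (State, original column name)
long = df_01.melt(id_vars='State', var_name='key', value_name='value')

# Split "May_01_A" into "May_01" and "A" on the last underscore
long[['Date', 'Obs']] = long['key'].str.rsplit('_', n=1, expand=True)

# Spread the A/B suffixes back out into columns
df_02 = (long
         .pivot(index=['Date', 'State'], columns='Obs', values='value')
         .add_prefix('Obs_')
         .reset_index())
df_02.columns.name = None
print(df_02)
```

The rows come out sorted by Date and then State, which may differ from the order in the original frame.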

How do I use String Categories as a Feature in a Tensor? (if possible in tensorflow js)

I have the following categories in a dataset, where Region and Years will be used to predict the Salary
Regions: ['Europe', 'North America', 'South America', 'Asia', 'Africa']
Data sample:
{region: 'Asia', years: 5, salary: 1000}
{region: 'Asia', years: 3, salary: 700}
{region: 'Asia', years: 1, salary: 300}
{region: 'Europe', years: 5, salary: 3000}
I would like to use region and years as Xs and salary as Ys.
I've tried to convert the regions with tf.oneHot, but I can't figure out how to use the result together with "years", since tf.oneHot returns another tensor.
indices = tf.tensor1d([0, 1, 2, 3, 4], 'int32');
oneHot = tf.oneHot(indices, 5);
oneHot result -> [[1, 0, 0, 0, 0],...]
xs = tf.tensor2d([[?, 5], [?, 3], [?, 1], [?, 5]]); //[region, years]
ys = tf.tensor1d([1000, 700, 300, 3000]); //[salary]
tf.concat can be used to join the one-hot tensor with the years column along axis 1:
const indices = tf.tensor1d([0, 1, 2, 3, 4], 'int32');
const oneHot = tf.oneHot(indices, 5);
const xs_add = tf.tensor([5, 3, 1, 5, 4]).reshape([5, 1])
const xs = tf.concat([oneHot, xs_add], 1)
xs.print()