How to reshape an unstacked Pandas data frame to "long" form before passing it to a plotting function - pandas

I'm trying to make a simple bar plot displaying ratios using the Plotly px.bar() function.
I have the following data set:
test_df = pd.DataFrame({'Manufacturer':['Ford', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
'Metric':['Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty'],
'Sector':['Germany', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA'],
'Value':[45000, 70000, 90000, 65000, 40000, 65000, 63000, 2700, 4400, 3400, 3000, 4700, 5700, 1500, 2000, 2500, 1300, 2000, 2450],
'City': ['Frankfurt', 'Bremen', 'Berlin', 'Hamburg', 'New York', 'Chicago', 'Los Angeles', 'Dresden', 'Munich', 'Cologne', 'Miami', 'Atlanta', 'Phoenix', 'Nuremberg', 'Dusseldorf', 'Leipzig', 'Houston', 'San Diego', 'San Francisco']
})
I reset the index and create a pivot table, as follows:
temp_table = test_df.reset_index().pivot_table(values = 'Value', index = ['Manufacturer', 'Metric', 'Sector'], aggfunc='sum')
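Then, I create two new Series (the pivot table's index has to be reset first, so that Manufacturer, Metric and Sector are ordinary columns again):
s1 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Orders'").Value
s2 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Sales'").Value
Then, I divide one by the other and unstack the result:
s1.div(s2).unstack()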
Which gives me:
Sector          Germany        USA
Manufacturer
BMW           19.117647  11.052632
Ford          42.592593  13.333333
Mercedes      20.454545  13.829787
I'd like to make a bar plot from the data above, with Manufacturer on the x-axis and the bars colored by Sector.
To do so, I think I need the data to be in the following long form:
Manufacturer Sector Ratio
BMW Germany 19.117647
Ford Germany 42.592593
Mercedes Germany 20.454545
BMW USA 11.052632
Ford USA 13.333333
Mercedes USA 13.829787
Question: how would I reshape the unstacked data above such that I would be able to pass it to the Plotly px.bar() function, which requires the following for the x-axis and y-axis arguments:
x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to position marks along the x axis in cartesian coordinates. Either x or y can optionally be a list of column references or array_likes, in which case the data will be treated as if it were ‘wide’ rather than ‘long’.
Thanks in advance!

Just skip the unstack and reset the index instead:
df_out=s1.div(s2).reset_index()
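As a rough end-to-end sketch of this answer: the Series returned by s1.div(s2) is already in long form, so resetting the index yields one row per Manufacturer/Sector pair. The snippet uses a cut-down copy of the question's data (Orders and Sales rows only) and computes the totals with a groupby rather than the pivot table. The column name Ratio and the barmode='group' setting are assumptions, not part of the original answer. The Plotly call is left commented out so the data preparation runs without Plotly installed.

```python
import pandas as pd

# Cut-down copy of the question's data: only the Orders and Sales rows,
# and only the columns needed for the ratio.
test_df = pd.DataFrame({
    'Manufacturer': ['Ford', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW',
                     'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
    'Metric': ['Orders'] * 7 + ['Sales'] * 6,
    'Sector': ['Germany'] * 4 + ['USA'] * 3 + ['Germany'] * 3 + ['USA'] * 3,
    'Value': [45000, 70000, 90000, 65000, 40000, 65000, 63000,
              2700, 4400, 3400, 3000, 4700, 5700],
})

# Sum per (Metric, Manufacturer, Sector); selecting one Metric leaves a
# Series indexed by (Manufacturer, Sector).
totals = test_df.groupby(['Metric', 'Manufacturer', 'Sector'])['Value'].sum()
s1 = totals.loc['Orders']
s2 = totals.loc['Sales']

# No unstack: reset_index turns the two index levels into ordinary columns.
# name='Ratio' is an illustrative choice for the value column.
df_out = s1.div(s2).reset_index(name='Ratio')
print(df_out)

# With Plotly installed, the long frame can be passed straight to px.bar:
# import plotly.express as px
# fig = px.bar(df_out, x='Manufacturer', y='Ratio', color='Sector', barmode='group')
# fig.show()
```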

This should give you the bar chart you have up there.
test_df.groupby(['Manufacturer', 'Sector'])['Value'].sum().unstack('Sector').plot.bar()
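If the data has already been unstacked into the wide layout shown in the question, it can also be brought back to long form. The sketch below rebuilds that wide frame by hand from the printed values and shows two standard pandas routes, stack() and melt(). The column name Ratio is an assumption chosen to match the layout the question asks for.

```python
import pandas as pd

# Wide frame as printed in the question (values copied from its output).
wide = pd.DataFrame(
    {'Germany': [19.117647, 42.592593, 20.454545],
     'USA': [11.052632, 13.333333, 13.829787]},
    index=pd.Index(['BMW', 'Ford', 'Mercedes'], name='Manufacturer'),
)
wide.columns.name = 'Sector'

# Route 1: stack() moves the column level back into the index,
# then reset_index() turns both index levels into columns.
long_stack = wide.stack().reset_index(name='Ratio')

# Route 2: melt() after turning the index into a regular column.
long_melt = wide.reset_index().melt(
    id_vars='Manufacturer', var_name='Sector', value_name='Ratio'
)

print(long_stack)
print(long_melt)
```

Both results have the columns Manufacturer, Sector and Ratio; they differ only in row order (stack() groups by manufacturer, melt() groups by sector).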

Related

Get the Pandas groupby agg output columns

Here is my code
import pandas as pd
df = pd.DataFrame()
df['country'] = ['UK', 'UK', 'USA', 'USA', 'USA']
df['name'] = ['United Kingdom', 'United Kingdom', 'United States', 'United States', 'United States']
df['year'] = [1, 2, 1, 2, 3]
df['x'] = [100, 125, 200, 225, 250]
print(df.groupby(['country', 'name']).agg({'x':['mean', 'count']}))
The output I get is:
                            x
                         mean count
country name
UK      United Kingdom  112.5     2
USA     United States   225.0     3
But I need a result as a list of rows
[['UK','United Kingdom',112.5,2],...]
or columns
[['UK', 'USA'],['United Kingdom','United States'],[112.5,225],[2,3]]
The name column can consist of an arbitrary number of words, e.g. Kingdom of the Netherlands.
Thank you
Convert the MultiIndex to columns with the as_index=False parameter, then convert the DataFrame to a NumPy array and finally to a list:
print(df.groupby(['country', 'name'], as_index=False).agg({'x':['mean', 'count']}).to_numpy().tolist())
[['UK', 'United Kingdom', 112.5, 2], ['USA', 'United States', 225.0, 3]]
For the second output, transpose the result first:
print(df.groupby(['country', 'name'], as_index=False).agg({'x':['mean', 'count']}).T.to_numpy().tolist())
[['UK', 'USA'], ['United Kingdom', 'United States'], [112.5, 225.0], [2, 3]]
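A possible variation, sketched here as an alternative rather than as the answer's method: named aggregation gives flat column names instead of a two-level header, and reset_index() moves the group keys back into columns. The output column names mean and count and the variable names out, rows and cols are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['UK', 'UK', 'USA', 'USA', 'USA'],
    'name': ['United Kingdom', 'United Kingdom',
             'United States', 'United States', 'United States'],
    'x': [100, 125, 200, 225, 250],
})

# Named aggregation: each keyword becomes a flat output column.
out = (
    df.groupby(['country', 'name'])
      .agg(mean=('x', 'mean'), count=('x', 'count'))
      .reset_index()
)

# List of rows.
rows = out.to_numpy().tolist()

# List of columns; going column by column keeps each column's own dtype.
cols = [out[c].tolist() for c in out.columns]

print(rows)
print(cols)
```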

Pandas: How to filter rows in dataframe which is not equal to the combination of columns in other dataframe?

Below are the two dataframes. I am trying to keep only the rows of df_2 whose (Name_1, Name_2) combination does not appear in df_count. How can I achieve this?
import pandas as pd
df_1 = pd.DataFrame({'Name_1':['tom', 'jack', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack', 'tom', 'jack'],
'Name_2':['sam', 'sam', 'ruby', 'sam','sam', 'jack', 'ruby', 'sam','ruby', 'sam']})
df_count = df_1.groupby(['Name_1','Name_2']).size().reset_index().rename(columns={0:'count'}).sort_values(['count'], ascending = False)
df_count = df_count.head(2)
df_count = df_count[['Name_1','Name_2']]
df_2 = pd.DataFrame({'Name_1':['tom', 'nick', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack'],
'Name_2':['sam', 'mike', 'ruby', 'sam', 'sam', 'jack', 'ruby', 'sam'],
'Salary':[200, 500, 1000, 7000, 100, 300, 1200, 900],
'Currency':['AUD', 'CAD', 'JPY', 'USD', 'GBP', 'CAD', 'INR', 'USD']})
pd.merge(df_2,df_count, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
Output:
Name_1 Name_2 Salary Currency
0 tom sam 200 AUD
1 tom sam 100 GBP
2 nick mike 500 CAD
7 nick jack 300 CAD
Answer taken from here.
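The same anti-join idea can be sketched with a left merge, which keeps the rows of df_2 in their original order. In this sketch df_count is written out by hand as the two most frequent pairs from the question's df_1 (jack/sam and tom/ruby), and the variable names merged and result are illustrative.

```python
import pandas as pd

df_2 = pd.DataFrame({
    'Name_1': ['tom', 'nick', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack'],
    'Name_2': ['sam', 'mike', 'ruby', 'sam', 'sam', 'jack', 'ruby', 'sam'],
    'Salary': [200, 500, 1000, 7000, 100, 300, 1200, 900],
    'Currency': ['AUD', 'CAD', 'JPY', 'USD', 'GBP', 'CAD', 'INR', 'USD'],
})

# The two most frequent (Name_1, Name_2) pairs in the question's df_1,
# entered directly instead of recomputing them.
df_count = pd.DataFrame({'Name_1': ['jack', 'tom'], 'Name_2': ['sam', 'ruby']})

# indicator=True adds a _merge column that records where each row came from.
merged = df_2.merge(df_count, on=['Name_1', 'Name_2'], how='left', indicator=True)

# 'left_only' marks rows of df_2 with no matching pair in df_count.
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```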

How to correctly use map and use np.where to replace values in a column

This is the city column of my dataframe, df.city:
array(['la', 'hollywood', 'pasadena', 'los angeles', 'new york',
'studio city', 'venice', 'santa monica', 'mar vista',
'beverly hills', 'w. hollywood', 'encino', 'st. boyle hts .',
'westlake village', 'westwood', 'west la', 'chinatown',
'monterey park', 'rancho park', 'redondo beach', 'long beach',
'marina del rey', 'culver city', 'burbank', 'century city',
'malibu', 'seal beach', 'northridge', 'st. hermosa beach'],
dtype=object)
I want the strings containing 'la' or 'hollywood' to be converted to 'los angeles'. How can I do this? I was using np.where(condition, x, y), but its third argument (y) tripped me up.
To replace the rest of the cities, I made this dictionary:
cities={'studio city':'los angeles', 'santa monica':'los angeles', 'mar vista':'los angeles', 'beverly hills':'los angeles', 'encino':'los angeles', 'st. boyle hts .':'los angeles', 'westwood':'los angeles', 'chinatown':'los angeles', 'monterey park':'los angeles', 'rancho park':'los angeles', 'redondo beach':'los angeles', 'century city':'los angeles', 'marina del rey':'los angeles', 'malibu':'los angeles', 'seal beach':'los angeles', 'northridge':'los angeles','st. hermosa beach':'los angeles'}
When I use df.city.map(cities), it maps the values present in the dictionary and replaces the others, such as 'los angeles', with NaN.
How can I go about cleaning this column of my dataframe?
You could use np.where like this:
df['city'] = np.where((df['city'].str.contains('la'))| (df['city'].str.contains('hollywood')), 'los angeles', df['city'])
The third argument is just the original column.
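A small sketch of how the two steps could be combined, using a shortened city list and a one-entry dictionary for illustration. Note that str.contains('la') is a substring match, so it would also convert 'westlake village'. The whole-word regular expression below is an assumption about what the question intends, not part of the answer. Chaining fillna() after map() is one way to keep values that are missing from the dictionary instead of turning them into NaN.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative version of the city column and dictionary.
df = pd.DataFrame({'city': ['la', 'hollywood', 'w. hollywood', 'west la',
                            'westlake village', 'studio city', 'new york']})
cities = {'studio city': 'los angeles'}

# \b anchors the match to whole words, so 'westlake' does not count as 'la'.
whole_word = df['city'].str.contains(r'\b(?:la|hollywood)\b', regex=True)

# Third argument of np.where: the original column, for rows that do not match.
city = pd.Series(np.where(whole_word, 'los angeles', df['city']), index=df.index)

# map() returns NaN for keys missing from the dict; fillna() restores them.
df['city'] = city.map(cities).fillna(city)
print(df)
```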

Controlling decimal precision after resetting index of unstacked Pandas data frame

My data is as follows:
test_df = pd.DataFrame({'Manufacturer':['Ford', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
'Metric':['Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty'],
'Sector':['Germany', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA'],
'Value':[45000, 70000, 90000, 65000, 40000, 65000, 63000, 2700, 4400, 3400, 3000, 4700, 5700, 1500, 2000, 2500, 1300, 2000, 2450],
'City': ['Frankfurt', 'Bremen', 'Berlin', 'Hamburg', 'New York', 'Chicago', 'Los Angeles', 'Dresden', 'Munich', 'Cologne', 'Miami', 'Atlanta', 'Phoenix', 'Nuremberg', 'Dusseldorf', 'Leipzig', 'Houston', 'San Diego', 'San Francisco']
})
I reset the index and create a pivot table, as follows:
temp_table = test_df.reset_index().pivot_table(values = 'Value', index = ['Manufacturer', 'Metric', 'Sector'], aggfunc='sum')
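Then, I create two new Series (the pivot table's index has to be reset first, so that Manufacturer, Metric and Sector are ordinary columns again):
s1 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Orders'").Value
s2 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Sales'").Value
Then, I divide one by the other and unstack the result:
s1.div(s2).unstack()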
Which gives me:
Sector          Germany        USA
Manufacturer
BMW           19.117647  11.052632
Ford          42.592593  13.333333
Mercedes      20.454545  13.829787
Then, I reset the index:
df_out = s1.div(s2).reset_index()
Which gives me:
Manufacturer Sector Value
0 BMW Germany 19.117647
1 BMW USA 11.052632
2 Ford Germany 42.592593
3 Ford USA 13.333333
4 Mercedes Germany 20.454545
5 Mercedes USA 13.829787
I would like to be able to round the Value column to 2 decimal places.
I tried to use the round() function, as follows:
df_out['Value'].round(2)
But, this doesn't seem to affect the values when I call df_out again.
What is the best way to control the decimal precision in this case?
Thanks!
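A likely explanation, sketched below: Series.round() returns a new Series and leaves the original untouched, so the result has to be assigned back to the column. The frame is rebuilt by hand from the values printed above, and the variable before is only there to show that the unassigned call changes nothing.

```python
import pandas as pd

# df_out rebuilt from the values printed in the question.
df_out = pd.DataFrame({
    'Manufacturer': ['BMW', 'BMW', 'Ford', 'Ford', 'Mercedes', 'Mercedes'],
    'Sector': ['Germany', 'USA', 'Germany', 'USA', 'Germany', 'USA'],
    'Value': [19.117647, 11.052632, 42.592593, 13.333333, 20.454545, 13.829787],
})

# round() returns a new Series; without assignment df_out is unchanged.
df_out['Value'].round(2)
before = df_out['Value'].iloc[0]   # still 19.117647

# Assigning the result back stores the rounded values.
df_out['Value'] = df_out['Value'].round(2)
print(df_out)

# To change only how values are displayed, not what is stored:
# pd.set_option('display.precision', 2)
```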

How do I use count and group by as a condition for another table

So to preface, I'm a first-year comp sci student and we've only just started on SQL, so forgive me if the solution seems obvious.
We were given a database for Zoo, which has tables for Animals, Keepers, and a link entity (if that's the right word) for care roles, connecting the two.
(Schema below)
CREATE TABLE Animal (ID VARCHAR(6) PRIMARY KEY, Name VARCHAR(10), Species VARCHAR(20),
Age SMALLINT, Sex VARCHAR(1), Weight SMALLINT, F_ID VARCHAR(6), M_ID VARCHAR(6));
CREATE TABLE Keeper (Staff_ID VARCHAR(6) PRIMARY KEY, Keeper_Name VARCHAR(20), Specialisation VARCHAR(20));
CREATE TABLE Care_Role (ID VARCHAR(6), Staff_ID VARCHAR(6), Role VARCHAR(10), PRIMARY KEY (ID, Staff_ID));
Now the task we've been given is to work out which Keepers have been caring for more than 10 animals of the same species using the following data:
INSERT INTO Animal VALUES
('11', 'Horace', 'Marmoset', 99, 'M', 5, '2','1'),
('12', 'sghgdht', 'Marmoset', 42, 'M', 3, '2','1'),
('13', 'xgnyn', 'Marmoset', 37, 'F', 3, '1','11'),
('14', 'sbfdfbng', 'Marmoset', 12, 'F', 3, '1','11'),
('15', 'fdghd', 'Marmoset', 12, 'M', 3, '1','11'),
('16', 'Fred', 'Marmoset', 6, 'M', 3, '15','1'),
('17', 'Mary', 'Marmoset', 3, 'F', 3, '8','14'),
('18', 'Jane', 'Marmoset', 5, 'F', 3, '7','13'),
('19', 'dfgjtjt', 'Marmoset', 5, 'M', 3, '16','17'),
('20', 'Eric', 'Marmoset', 5, 'M', 3, '12','13'),
('21', 'tukyufyu', 'Marmoset', 5, 'M', 3, '12','73'),
('31', 'hgndghmd', 'Giraffe', 99, 'M', 5, '201','1'),
('32', 'sghgdht', 'Giraffe', 42, 'M', 3, '201','1'),
('33', 'xgnyn', 'Giraffe', 37, 'F', 3, '111','1'),
('34', 'sbfdfbng', 'Giraffe', 12, 'F', 3, '111','1'),
('35', 'fdghd', 'Giraffe', 12, 'M', 3, '111','6'),
('36', 'Fred', 'Lion', 6, 'M', 3, '151','111'),
('37', 'Mary', 'Lion', 3, 'F', 3, '81','114'),
('38', 'Jane', 'Lion', 5, 'F', 3, '71','113'),
('39', 'Kingsly', 'Lion', 9, 'M', 3, '161','117'),
('40', 'Eric', 'Lion', 11, 'M', 3, '121','113'),
('41', 'tukyufyu', 'Lion', 2, 'M', 3, '121','173'),
('61', 'hgndghmd', 'Elephant', 6, 'F', 225, '201','111'),
('62', 'sghgdht', 'Elephant', 10, 'F', 230, '201','111'),
('63', 'xgnyn', 'Elephant', 5, 'F', 300, '111','121'),
('64', 'sbfdfbng', 'Elephant', 11, 'F', 173, '111','121'),
('65', 'fdghd', 'Elephant', 12, 'F', 231, '111','666'),
('66', 'Fred', 'Elephant', 17, 'F', 333, '151','147'),
('67', 'Mary', 'Elephant', 3, 'F', 272, '81','148'),
('68', 'Jane', 'Elephant', 8, 'F', 47, '71','136'),
('69', 'dfgjtjt', 'Elephant', 9, 'F', 131, '161','172'),
('70', 'Eric', 'Elephant', 10, 'F', 333, '121','136'),
('71', 'tukyufyu', 'Elephant', 7, 'M', 114, '121','731');
INSERT INTO Keeper VALUES
('1', 'Roger', 'tdfhuihiu'),
('2', 'Sidra', 'rgegegtnrty'),
('3', 'Amit', 'ergetetnt'),
('4', 'Lucia', 'dvojivhwivih');
INSERT INTO Care_Role VALUES
('32', '1', 'feeding'),
('32', '2', 'washing'),
('61', '1', 'feeding'),
('62', '1', 'feeding'),
('63', '1', 'feeding'),
('64', '1', 'feeding'),
('65', '1', 'feeding'),
('66', '1', 'feeding'),
('67', '1', 'feeding'),
('68', '1', 'feeding'),
('69', '1', 'feeding'),
('70', '1', 'feeding'),
('71', '1', 'feeding'),
('11', '4', 'feeding'),
('12', '4', 'feeding'),
('13', '4', 'feeding'),
('14', '4', 'feeding'),
('15', '4', 'feeding'),
('16', '4', 'feeding'),
('17', '4', 'feeding'),
('18', '4', 'feeding'),
('19', '4', 'feeding'),
('20', '4', 'feeding'),
('21', '4', 'feeding');
So far what I've managed to come up with is this:
SELECT Keeper.Keeper_Name, Animal.Species, COUNT(Animal.Species)
FROM Keeper
JOIN Care_Role
ON Keeper.Staff_ID = Care_Role.Staff_ID
JOIN Animal
ON Care_Role.ID = Animal.ID
GROUP BY Animal.Species
But this returns more than just the name (which is all I want), and it shows everyone who has looked after animals rather than just those who have looked after 10 or more. Does anyone have any ideas on how to fix this? Many thanks!
Your query should be returning an error, because Keeper.Keeper_name is not in the GROUP BY. You have made a good attempt. A reasonable way to start the query is:
SELECT k.Keeper_Name, a.Species, COUNT(*)
FROM Keeper k JOIN
Care_Role cr
ON k.Staff_ID = cr.Staff_ID JOIN
Animal a
ON cr.ID = a.ID
GROUP BY k.Keeper_Name, a.Species;
This will return the number of animals of a given species that each keeper cares for.
Note the following:
Table aliases are abbreviations for the table.
All column names are qualified.
This uses the shorthand of COUNT(*) instead of counting some particular column.
Your question adds an additional condition about 10 animals. You can fit that in using a HAVING clause.
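One way the HAVING clause might be fitted in, sketched with Python's built-in sqlite3 module so it can be run as-is. The tables are reduced to the columns the query needs, and the rows are generated from the ID ranges in the question's INSERT statements. The threshold COUNT(*) > 10 follows the task wording ("more than 10"), and the alias n is an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Reduced schema: only the columns the query touches.
conn.executescript("""
CREATE TABLE Animal (ID VARCHAR(6) PRIMARY KEY, Species VARCHAR(20));
CREATE TABLE Keeper (Staff_ID VARCHAR(6) PRIMARY KEY, Keeper_Name VARCHAR(20));
CREATE TABLE Care_Role (ID VARCHAR(6), Staff_ID VARCHAR(6), Role VARCHAR(10),
                        PRIMARY KEY (ID, Staff_ID));
""")

# Same IDs and species as the question's data.
animals = (
    [(str(i), 'Marmoset') for i in range(11, 22)]
    + [(str(i), 'Giraffe') for i in range(31, 36)]
    + [(str(i), 'Lion') for i in range(36, 42)]
    + [(str(i), 'Elephant') for i in range(61, 72)]
)
keepers = [('1', 'Roger'), ('2', 'Sidra'), ('3', 'Amit'), ('4', 'Lucia')]
care_roles = (
    [('32', '1', 'feeding'), ('32', '2', 'washing')]
    + [(str(i), '1', 'feeding') for i in range(61, 72)]
    + [(str(i), '4', 'feeding') for i in range(11, 22)]
)
conn.executemany('INSERT INTO Animal VALUES (?, ?)', animals)
conn.executemany('INSERT INTO Keeper VALUES (?, ?)', keepers)
conn.executemany('INSERT INTO Care_Role VALUES (?, ?, ?)', care_roles)

# HAVING filters the groups after aggregation; WHERE could not see COUNT(*).
query = """
SELECT k.Keeper_Name, a.Species, COUNT(*) AS n
FROM Keeper k
JOIN Care_Role cr ON k.Staff_ID = cr.Staff_ID
JOIN Animal a ON cr.ID = a.ID
GROUP BY k.Keeper_Name, a.Species
HAVING COUNT(*) > 10
ORDER BY k.Keeper_Name;
"""
result = conn.execute(query).fetchall()
print(result)
```

If only the names are wanted, the SELECT list can be cut down to k.Keeper_Name (with DISTINCT, in case a keeper qualifies for more than one species) while keeping the same GROUP BY and HAVING.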