Here is my code
import pandas as pd
df = pd.DataFrame()
df['country'] = ['UK', 'UK', 'USA', 'USA', 'USA']
df['name'] = ['United Kingdom', 'United Kingdom', 'United States', 'United States', 'United States']
df['year'] = [1, 2, 1, 2, 3]
df['x'] = [100, 125, 200, 225, 250]
print(df.groupby(['country', 'name']).agg({'x':['mean', 'count']}))
The output I get is
                            x
                         mean count
country name
UK      United Kingdom  112.5     2
USA     United States   225.0     3
But I need a result as a list of rows
[['UK','United Kingdom',112.5,2],...]
or columns
[['UK', 'USA'],['United Kingdom','United States'],[112.5,225],[2,3]]
The name column can consist of an arbitrary number of words, e.g. Kingdom of the Netherlands.
Thank you
Pass as_index=False to groupby so that the group keys come back as regular columns instead of a MultiIndex, then convert the DataFrame to a NumPy array and finally to a list:
print(df.groupby(['country', 'name'], as_index=False).agg({'x':['mean', 'count']}).to_numpy().tolist())
[['UK', 'United Kingdom', 112.5, 2], ['USA', 'United States', 225.0, 3]]
For the second output, transpose the DataFrame before converting:
print(df.groupby(['country', 'name'], as_index=False).agg({'x':['mean', 'count']}).T.to_numpy().tolist())
[['UK', 'USA'], ['United Kingdom', 'United States'], [112.5, 225.0], [2, 3]]
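As a variation on the same idea, here is a minimal sketch that uses named aggregation so the result has flat column names. The names summary, x_mean, x_count, rows and columns are illustrative choices and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['UK', 'UK', 'USA', 'USA', 'USA'],
    'name': ['United Kingdom', 'United Kingdom', 'United States',
             'United States', 'United States'],
    'year': [1, 2, 1, 2, 3],
    'x': [100, 125, 200, 225, 250],
})

# Named aggregation keeps the columns flat (no MultiIndex on the columns).
# x_mean and x_count are hypothetical output names.
summary = df.groupby(['country', 'name'], as_index=False).agg(
    x_mean=('x', 'mean'),
    x_count=('x', 'count'),
)

# One inner list per row.
rows = summary.to_numpy().tolist()

# One inner list per column.
columns = [summary[col].tolist() for col in summary.columns]

print(rows)
print(columns)
```

Building the column lists from each column separately avoids transposing a DataFrame of mixed dtypes.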
Below are my dataframes. I am trying to keep only the rows of df_2 whose (Name_1, Name_2) combination does not appear in df_count. How can I achieve this?
import pandas as pd
df_1 = pd.DataFrame({'Name_1':['tom', 'jack', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack', 'tom', 'jack'],
'Name_2':['sam', 'sam', 'ruby', 'sam','sam', 'jack', 'ruby', 'sam','ruby', 'sam']})
df_count = df_1.groupby(['Name_1','Name_2']).size().reset_index().rename(columns={0:'count'}).sort_values(['count'], ascending = False)
df_count = df_count.head(2)
df_count = df_count[['Name_1','Name_2']]
df_2 = pd.DataFrame({'Name_1':['tom', 'nick', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack'],
'Name_2':['sam', 'mike', 'ruby', 'sam', 'sam', 'jack', 'ruby', 'sam'],
'Salary':[200, 500, 1000, 7000, 100, 300, 1200, 900],
'Currency':['AUD', 'CAD', 'JPY', 'USD', 'GBP', 'CAD', 'INR', 'USD']})
pd.merge(df_2,df_count, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
Output:
  Name_1 Name_2  Salary Currency
0    tom    sam     200      AUD
1    tom    sam     100      GBP
2   nick   mike     500      CAD
7   nick   jack     300      CAD
Answer taken from here.
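The same filter can also be sketched as a left merge, which keeps every row of df_2 in its original order and only marks which ones found a partner in df_count. This is a sketch under assumptions: df_count is hard-coded here to the two combinations that the groupby above produces (jack/sam and tom/ruby), and the names merged and result are illustrative.

```python
import pandas as pd

# Hard-coded stand-in for the df_count built in the question (top two pairs).
df_count = pd.DataFrame({'Name_1': ['jack', 'tom'],
                         'Name_2': ['sam', 'ruby']})

df_2 = pd.DataFrame({
    'Name_1': ['tom', 'nick', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack'],
    'Name_2': ['sam', 'mike', 'ruby', 'sam', 'sam', 'jack', 'ruby', 'sam'],
    'Salary': [200, 500, 1000, 7000, 100, 300, 1200, 900],
    'Currency': ['AUD', 'CAD', 'JPY', 'USD', 'GBP', 'CAD', 'INR', 'USD'],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge.
merged = df_2.merge(df_count, on=['Name_1', 'Name_2'],
                    how='left', indicator=True)

# Keep only the rows of df_2 with no matching pair in df_count.
result = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```

Because the pairs in df_count are unique, the left merge does not duplicate any row of df_2.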
This is the city column of my dataframe, df.city:
array(['la', 'hollywood', 'pasadena', 'los angeles', 'new york',
'studio city', 'venice', 'santa monica', 'mar vista',
'beverly hills', 'w. hollywood', 'encino', 'st. boyle hts .',
'westlake village', 'westwood', 'west la', 'chinatown',
'monterey park', 'rancho park', 'redondo beach', 'long beach',
'marina del rey', 'culver city', 'burbank', 'century city',
'malibu', 'seal beach', 'northridge', 'st. hermosa beach'],
dtype=object)
I want the strings containing 'la' or 'hollywood' to be converted to 'los angeles'. How can I do this? I was using np.where(condition, x, y), but I could not work out what to pass as its third argument (y).
To replace the rest of the cities, I made this dictionary:
cities={'studio city':'los angeles', 'santa monica':'los angeles', 'mar vista':'los angeles', 'beverly hills':'los angeles', 'encino':'los angeles', 'st. boyle hts .':'los angeles', 'westwood':'los angeles', 'chinatown':'los angeles', 'monterey park':'los angeles', 'rancho park':'los angeles', 'redondo beach':'los angeles', 'century city':'los angeles', 'marina del rey':'los angeles', 'malibu':'los angeles', 'seal beach':'los angeles', 'northridge':'los angeles','st. hermosa beach':'los angeles'}
When I use df.city.map(cities), it maps the values present in the dictionary and replaces all the others, such as 'los angeles', with NaN.
How can I go about cleaning this column of my dataframe?
You could use np.where like this:
import numpy as np
df['city'] = np.where(df['city'].str.contains('la') | df['city'].str.contains('hollywood'), 'los angeles', df['city'])
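The third argument is the value used wherever the condition is False; passing the original column leaves those rows unchanged. Note that str.contains('la') matches any string containing those two letters, so 'westlake village' is converted as well.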
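For the dictionary part of the question, one possible approach is to fall back to the original value wherever map returns NaN. The sketch below uses a shortened sample of the city values and a one-entry dictionary. Treating 'la' as a whole word (so that 'west la' matches but 'westlake village' does not) is an assumption about the intent, not something the question states.

```python
import pandas as pd

# Shortened sample of the column; the real data has more values.
city = pd.Series(['la', 'hollywood', 'w. hollywood', 'west la',
                  'westlake village', 'studio city', 'new york'])

# Shortened stand-in for the cities dictionary from the question.
cities = {'studio city': 'los angeles'}

# Assumption: 'la' should match only as a whole word, hence the \b anchors.
is_la = city.str.contains(r'\bla\b|hollywood', regex=True)

# Replace the matching rows and keep everything else as it was.
cleaned = city.mask(is_la, 'los angeles')

# map() yields NaN for keys missing from the dictionary;
# fillna() restores the original value in those positions.
cleaned = cleaned.map(cities).fillna(cleaned)

print(cleaned.tolist())
```

Series.replace(cities) would be another option for the dictionary step, since it leaves values without a matching key untouched.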
My data is as follows:
test_df = pd.DataFrame({'Manufacturer':['Ford', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
'Metric':['Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty'],
'Sector':['Germany', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA'],
'Value':[45000, 70000, 90000, 65000, 40000, 65000, 63000, 2700, 4400, 3400, 3000, 4700, 5700, 1500, 2000, 2500, 1300, 2000, 2450],
'City': ['Frankfurt', 'Bremen', 'Berlin', 'Hamburg', 'New York', 'Chicago', 'Los Angeles', 'Dresden', 'Munich', 'Cologne', 'Miami', 'Atlanta', 'Phoenix', 'Nuremberg', 'Dusseldorf', 'Leipzig', 'Houston', 'San Diego', 'San Francisco']
})
I reset the index and create a pivot table, as follows:
temp_table = test_df.reset_index().pivot_table(values = 'Value', index = ['Manufacturer', 'Metric', 'Sector'], aggfunc='sum')
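Then, I create two new Series (the pivot table keeps Manufacturer, Metric and Sector in its index, so I reset the index first):
s1 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Orders'").Value
s2 = temp_table.reset_index().set_index(['Manufacturer','Sector']).query("Metric=='Sales'").Value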
Then, I divide one Series by the other and unstack the result:
s1.div(s2).unstack()
Which gives me:
Sector          Germany        USA
Manufacturer
BMW           19.117647  11.052632
Ford          42.592593  13.333333
Mercedes      20.454545  13.829787
Then, I reset the index of the divided Series instead of unstacking it:
df_out = s1.div(s2).reset_index()
Which gives me:
  Manufacturer  Sector      Value
0          BMW Germany  19.117647
1          BMW     USA  11.052632
2         Ford Germany  42.592593
3         Ford     USA  13.333333
4     Mercedes Germany  20.454545
5     Mercedes     USA  13.829787
I would like to be able to round the Value column to 2 decimal places.
I tried to use the round() function, as follows:
df_out['Value'].round(2)
But this doesn't seem to change the values when I display df_out again.
What is the best way to control the decimal precision in this case?
Thanks!
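A likely explanation is that Series.round returns a new Series and leaves the original untouched, so the result has to be assigned back. A minimal sketch, using a two-row stand-in for df_out rather than the full table:

```python
import pandas as pd

# Two-row stand-in for the df_out shown above.
df_out = pd.DataFrame({
    'Manufacturer': ['BMW', 'Ford'],
    'Sector': ['Germany', 'Germany'],
    'Value': [19.117647, 42.592593],
})

# This returns a rounded copy; df_out itself is not modified.
rounded_copy = df_out['Value'].round(2)

# Assigning the result back stores the rounded values in the frame.
df_out['Value'] = df_out['Value'].round(2)

print(df_out)
```

If only the display should change and the stored values should keep full precision, pandas also offers the display.precision option (set through pd.set_option).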
To preface, I'm a first-year comp sci student and we've only just started on SQL, so forgive me if the solution seems obvious.
We were given a database for a zoo, which has tables for Animals and Keepers, plus a link entity (if that's the right word) for care roles that connects the two.
(Schema below)
CREATE TABLE Animal (ID VARCHAR(6) PRIMARY KEY, Name VARCHAR(10), Species VARCHAR(20),
Age SMALLINT, Sex VARCHAR(1), Weight SMALLINT, F_ID VARCHAR(6), M_ID VARCHAR(6));
CREATE TABLE Keeper (Staff_ID VARCHAR(6) PRIMARY KEY, Keeper_Name VARCHAR(20), Specialisation VARCHAR(20));
CREATE TABLE Care_Role (ID VARCHAR(6), Staff_ID VARCHAR(6), Role VARCHAR(10), PRIMARY KEY (ID, Staff_ID));
Now the task we've been given is to work out which Keepers have been caring for more than 10 animals of the same species using the following data:
INSERT INTO Animal VALUES
('11', 'Horace', 'Marmoset', 99, 'M', 5, '2','1'),
('12', 'sghgdht', 'Marmoset', 42, 'M', 3, '2','1'),
('13', 'xgnyn', 'Marmoset', 37, 'F', 3, '1','11'),
('14', 'sbfdfbng', 'Marmoset', 12, 'F', 3, '1','11'),
('15', 'fdghd', 'Marmoset', 12, 'M', 3, '1','11'),
('16', 'Fred', 'Marmoset', 6, 'M', 3, '15','1'),
('17', 'Mary', 'Marmoset', 3, 'F', 3, '8','14'),
('18', 'Jane', 'Marmoset', 5, 'F', 3, '7','13'),
('19', 'dfgjtjt', 'Marmoset', 5, 'M', 3, '16','17'),
('20', 'Eric', 'Marmoset', 5, 'M', 3, '12','13'),
('21', 'tukyufyu', 'Marmoset', 5, 'M', 3, '12','73'),
('31', 'hgndghmd', 'Giraffe', 99, 'M', 5, '201','1'),
('32', 'sghgdht', 'Giraffe', 42, 'M', 3, '201','1'),
('33', 'xgnyn', 'Giraffe', 37, 'F', 3, '111','1'),
('34', 'sbfdfbng', 'Giraffe', 12, 'F', 3, '111','1'),
('35', 'fdghd', 'Giraffe', 12, 'M', 3, '111','6'),
('36', 'Fred', 'Lion', 6, 'M', 3, '151','111'),
('37', 'Mary', 'Lion', 3, 'F', 3, '81','114'),
('38', 'Jane', 'Lion', 5, 'F', 3, '71','113'),
('39', 'Kingsly', 'Lion', 9, 'M', 3, '161','117'),
('40', 'Eric', 'Lion', 11, 'M', 3, '121','113'),
('41', 'tukyufyu', 'Lion', 2, 'M', 3, '121','173'),
('61', 'hgndghmd', 'Elephant', 6, 'F', 225, '201','111'),
('62', 'sghgdht', 'Elephant', 10, 'F', 230, '201','111'),
('63', 'xgnyn', 'Elephant', 5, 'F', 300, '111','121'),
('64', 'sbfdfbng', 'Elephant', 11, 'F', 173, '111','121'),
('65', 'fdghd', 'Elephant', 12, 'F', 231, '111','666'),
('66', 'Fred', 'Elephant', 17, 'F', 333, '151','147'),
('67', 'Mary', 'Elephant', 3, 'F', 272, '81','148'),
('68', 'Jane', 'Elephant', 8, 'F', 47, '71','136'),
('69', 'dfgjtjt', 'Elephant', 9, 'F', 131, '161','172'),
('70', 'Eric', 'Elephant', 10, 'F', 333, '121','136'),
('71', 'tukyufyu', 'Elephant', 7, 'M', 114, '121','731');
INSERT INTO Keeper VALUES
('1', 'Roger', 'tdfhuihiu'),
('2', 'Sidra', 'rgegegtnrty'),
('3', 'Amit', 'ergetetnt'),
('4', 'Lucia', 'dvojivhwivih');
INSERT INTO Care_Role VALUES
('32', '1', 'feeding'),
('32', '2', 'washing'),
('61', '1', 'feeding'),
('62', '1', 'feeding'),
('63', '1', 'feeding'),
('64', '1', 'feeding'),
('65', '1', 'feeding'),
('66', '1', 'feeding'),
('67', '1', 'feeding'),
('68', '1', 'feeding'),
('69', '1', 'feeding'),
('70', '1', 'feeding'),
('71', '1', 'feeding'),
('11', '4', 'feeding'),
('12', '4', 'feeding'),
('13', '4', 'feeding'),
('14', '4', 'feeding'),
('15', '4', 'feeding'),
('16', '4', 'feeding'),
('17', '4', 'feeding'),
('18', '4', 'feeding'),
('19', '4', 'feeding'),
('20', '4', 'feeding'),
('21', '4', 'feeding');
So far what I've managed to come up with is this:
SELECT Keeper.Keeper_Name, Animal.Species, COUNT(Animal.Species)
FROM Keeper
JOIN Care_Role
ON Keeper.Staff_ID = Care_Role.Staff_ID
JOIN Animal
ON Care_Role.ID = Animal.ID
GROUP BY Animal.Species
But this returns more than just the name (which is all I want), and it shows everyone who has looked after any animals rather than only those who have looked after more than 10 of the same species. Does anyone have any ideas on how to fix this? Many thanks!
In most databases your query should return an error, because Keeper.Keeper_Name is in the SELECT but not in the GROUP BY. You have made a good attempt, though. A reasonable way to start the query is:
SELECT k.Keeper_Name, a.Species, COUNT(*)
FROM Keeper k JOIN
Care_Role cr
ON k.Staff_ID = cr.Staff_ID JOIN
Animal a
ON cr.ID = a.ID
GROUP BY k.Keeper_Name, a.Species;
This will return the number of animals of a given species that each keeper cares for.
Note the following:
The table aliases (k, cr, a) are abbreviations for the table names.
All column names are qualified.
This uses the shorthand of COUNT(*) instead of counting some particular column.
Your question adds an additional condition about 10 animals. You can fit that in using a HAVING clause.
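The HAVING step can be tried out with a small script. This is a sketch under assumptions: it uses Python's built-in sqlite3 module (the question does not say which database is in use), trims the tables to the columns the query needs, and generates rows that mirror the IDs in the question's data instead of copying every INSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed versions of the tables from the question (assumption: only the
# columns used by the query are kept).
conn.executescript("""
CREATE TABLE Animal (ID VARCHAR(6) PRIMARY KEY, Species VARCHAR(20));
CREATE TABLE Keeper (Staff_ID VARCHAR(6) PRIMARY KEY, Keeper_Name VARCHAR(20));
CREATE TABLE Care_Role (ID VARCHAR(6), Staff_ID VARCHAR(6), Role VARCHAR(10),
                        PRIMARY KEY (ID, Staff_ID));
""")

# Animal IDs follow the ranges used in the question's INSERT statements.
animals = ([(str(i), 'Marmoset') for i in range(11, 22)]
           + [(str(i), 'Giraffe') for i in range(31, 36)]
           + [(str(i), 'Elephant') for i in range(61, 72)])
keepers = [('1', 'Roger'), ('2', 'Sidra'), ('3', 'Amit'), ('4', 'Lucia')]
care_roles = ([('32', '1', 'feeding'), ('32', '2', 'washing')]
              + [(str(i), '1', 'feeding') for i in range(61, 72)]
              + [(str(i), '4', 'feeding') for i in range(11, 22)])

conn.executemany("INSERT INTO Animal VALUES (?, ?)", animals)
conn.executemany("INSERT INTO Keeper VALUES (?, ?)", keepers)
conn.executemany("INSERT INTO Care_Role VALUES (?, ?, ?)", care_roles)

# HAVING filters the groups after aggregation; WHERE could not do this
# because the count does not exist until the rows have been grouped.
query = """
SELECT k.Keeper_Name, a.Species, COUNT(*) AS num_animals
FROM Keeper k
JOIN Care_Role cr ON k.Staff_ID = cr.Staff_ID
JOIN Animal a ON cr.ID = a.ID
GROUP BY k.Keeper_Name, a.Species
HAVING COUNT(*) > 10
ORDER BY k.Keeper_Name;
"""
result_rows = conn.execute(query).fetchall()
for row in result_rows:
    print(row)
```

To return only the names, the SELECT list can be reduced to k.Keeper_Name while the GROUP BY and HAVING stay as they are. Grouping by k.Staff_ID as well would keep two keepers who share a name apart.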