using 'groupby.count' with agg - pandas

df.head()
           Populous      Continents
Australia  2.331602e+07  Australia
Brazil     2.059153e+08  South America
Canada     3.523986e+07  North America
China      1.367645e+09  Asia
France     6.383735e+07  Europe
Above are the first 5 entries of my dataframe.
I want to group them by Continents and then run some statistical analysis. I want to create a new dataframe whose columns are the Avg, Sum and STD of each group's Populous, plus the count of countries in each group.
new_df = df.groupby('Continents')['Populous'].agg({'Avg': np.average, 'Sum': np.sum, 'STD': np.std}) takes care of three columns, but I don't know how to get the count in there. I tried including 'Size': count within the agg call, but it resulted in an error.
Thank you.

You might also find this useful:
df.groupby('Continents').Populous.describe().unstack()
Also see this answer if you want more stats.

You can use 'Size': len or 'Size': 'count' for this to work. However, as @DSM pointed out, len counts missing values whereas 'count' doesn't.
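To make the difference between the two counts concrete, here is a minimal sketch. It assumes a reasonably recent pandas in which keyword-style (named) aggregation is available, and it uses the string aggregators 'mean', 'sum', 'std', 'count' and 'size' rather than the NumPy functions from the question. The "Atlantis" row is made up purely so that one group contains a missing value.

```python
import numpy as np
import pandas as pd

# First five rows are from the question; "Atlantis" is a hypothetical
# extra row with a missing population, added only for illustration.
df = pd.DataFrame(
    {
        "Populous": [2.331602e07, 2.059153e08, 3.523986e07,
                     1.367645e09, 6.383735e07, np.nan],
        "Continents": ["Australia", "South America", "North America",
                       "Asia", "Europe", "Europe"],
    },
    index=["Australia", "Brazil", "Canada", "China", "France", "Atlantis"],
)

# One output column per keyword: the keyword is the new column name,
# the string is the aggregation applied to each group's Populous.
new_df = df.groupby("Continents")["Populous"].agg(
    Avg="mean",
    Sum="sum",
    STD="std",
    Count="count",  # non-missing values only
    Size="size",    # all rows in the group, missing values included
)

print(new_df)
```

With this data the Europe group has Count 1 and Size 2, which is the len versus 'count' distinction mentioned above.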

Related

How to pivot the table containing each value in the output row in SQL

I can't resolve this issue. I read the documentation for the PIVOT() function and tried to use it, and I searched for an answer but didn't find one.
The main problem is that PIVOT() has to include an aggregation function, but I don't need one: I only need to pivot the table without any aggregation.
The source table:
COUNTRY  LEVEL   NUMBER
Germany  High    22
Germany  Medium  5
Germany  Low     3
Italy    High    43
Italy    Medium  21
Italy    Low     8
Canada   High    9
Canada   Medium  3
Canada   Low     13
I'd like the output table to look like this:
COUNTRY  High  Medium  Low
Germany  22    5       3
Italy    43    21      8
Canada   9     3       13
Can anybody help me?
How can I do that without an aggregation function, or with one that still returns all the values? For example, if I use min() or max(), I get only the max or min value and the other cells would be empty.
Why do you think that using min/max would leave empty cells? As there is only one value for each country/level combination, using min or max effectively just picks that one value.
Obviously, if your source data had more than one record for each combination of country/level then you'd need to decide how to deal with it.
This SQL seems to work fine:
select *
from COUNTRY_INFO
pivot(max(NUMBER) for LEVEL in ('High', 'Medium', 'Low'))
as p
order by country;
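The same point can be checked outside the database. The sketch below is a pandas analogue, not the SQL PIVOT itself: it rebuilds the source table from the question and pivots it once with max as the aggregate and once with DataFrame.pivot, which does no aggregation at all. The variable names are my own.

```python
import pandas as pd

# Source table from the question.
source = pd.DataFrame(
    {
        "COUNTRY": ["Germany"] * 3 + ["Italy"] * 3 + ["Canada"] * 3,
        "LEVEL": ["High", "Medium", "Low"] * 3,
        "NUMBER": [22, 5, 3, 43, 21, 8, 9, 3, 13],
    }
)

levels = ["High", "Medium", "Low"]

# Pivot with an aggregate: each COUNTRY/LEVEL pair holds one value,
# so max simply picks that value and no cell is left empty.
with_max = source.pivot_table(
    index="COUNTRY", columns="LEVEL", values="NUMBER", aggfunc="max"
)[levels]

# Pivot without any aggregate: this only works because each
# COUNTRY/LEVEL pair is unique (duplicates would raise an error).
without_agg = source.pivot(
    index="COUNTRY", columns="LEVEL", values="NUMBER"
)[levels]

print(with_max)
print(without_agg)
```

Both results hold the same nine values. If the source had more than one row per country/level pair, the second call would fail and the first would keep only the maximum, which is the decision mentioned above.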

Adding in missing Country Codes into a dataset (GDP Dataset)

I have downloaded a dataset that has countries, their codes and their GDP by year in 4 columns (5 if you include the unique row number on the far left). I noticed, however, that some country codes are missing, and was wondering if anyone could tell me how to get those codes and add them in, probably from a separate dataset, I imagine. You can see this in the pictures I posted. The second picture shows the missing country code data. Thanks.
Your country codes look like ISO 3166-1, which are only defined for countries and not for the larger entities such as « East Asia » and « Western Offshoots ».
You could roll your own for these entities, see ISO country codes glossary:
User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters [...] AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available.
I think the easiest is to prefix them all with X so you know easily that they are your own codes. Then use the 2 next letters for initials:
East Asia: XEA
Western Offshoots: XWO
etc.
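If there are many such entities, the naming scheme above can be automated. This is only a sketch of that scheme: the function name is hypothetical, and the fallback for single-word names (first two letters) is my own assumption, not part of ISO 3166-1. Collisions between entities with the same initials would still need handling by hand.

```python
def user_assigned_code(entity_name: str) -> str:
    """Build an 'X' + two-letter code for an entity without an ISO code.

    Hypothetical helper: uses the initials of the first two words, or
    the first two letters when the name is a single word (an assumption).
    """
    words = entity_name.split()
    if len(words) >= 2:
        letters = words[0][0] + words[1][0]
    else:
        letters = words[0][:2]
    return ("X" + letters).upper()


for name in ["East Asia", "Western Offshoots"]:
    print(name, "->", user_assigned_code(name))
```

Every code produced this way starts with X followed by two letters, so it falls in the XAA to XZZ user-assigned range quoted above.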

Needing Clarity on SQL Join Query

Having some trouble understanding this query, particularly the WHERE in the subquery. I don't really get what it is accomplishing. Any help would be appreciated. Thanks
# Find the largest country (by area) in each continent. Show the continent,
# name, and area.
SELECT continent, name, area
FROM countries AS a
WHERE area = (
    SELECT MAX(area)
    FROM countries AS b
    WHERE a.continent = b.continent
)
Consider the following subset of the countries data:
Continent Country Area
North America USA 3718691
North America Canada 3855081
North America Mexico 761602
Europe France 211208
Europe Germany 137846
Europe UK 94525
Europe Italy 116305
This is a correlated query that behaves as follows:
1. Reads the first row returned by the outer query (North America, USA, 3718691).
2. Runs the subquery, which correlates to a.continent (North America) and returns 3855081, the maximum area in North America.
3. Evaluates the WHERE equality, checking whether 3855081 matches the area on the row we're working on.
4. It doesn't match, so the next row in the outer query is read and we start over at step 1, this time working on the second row.
5. Repeats for all rows in the outer query.
When we're looking at rows 2 and 4 (Canada and France), the comparison in step 3 will match, so those rows will be returned by the query.
You can check the results by using this data in your countries table and running the query.
Note that this can be a very poor way to determine the country with the maximum area per continent because, evaluated as written, it repeats the subquery for every country. Using my sample data, it determines the maximum area for North America 3 times and the maximum area for Europe 4 times.
Since you asked in your comment, I would write this query as follows:
SELECT a.continent, a.name, a.area
FROM countries AS a
INNER JOIN (
    SELECT continent, MAX(area) AS max_area
    FROM countries
    GROUP BY continent
) AS b ON a.continent = b.continent
WHERE a.area = b.max_area
In this version of the query, the maximum for each continent is only determined once. The original query was written to illustrate correlated queries and it's important to understand them. Correlated queries can often be used to resolve complex logic.
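To check that both versions return the same rows, the sample data can be loaded into an in-memory SQLite database from Python. This is just a quick test harness under the assumption that SQLite is an acceptable stand-in for whatever database the exercise uses; the table and column names follow the query in the question.

```python
import sqlite3

# Sample rows from the answer: (continent, name, area).
rows = [
    ("North America", "USA", 3718691),
    ("North America", "Canada", 3855081),
    ("North America", "Mexico", 761602),
    ("Europe", "France", 211208),
    ("Europe", "Germany", 137846),
    ("Europe", "UK", 94525),
    ("Europe", "Italy", 116305),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (continent TEXT, name TEXT, area INTEGER)")
conn.executemany("INSERT INTO countries VALUES (?, ?, ?)", rows)

# Version 1: correlated subquery, evaluated per outer row.
correlated_sql = """
SELECT continent, name, area
FROM countries AS a
WHERE area = (
    SELECT MAX(area)
    FROM countries AS b
    WHERE a.continent = b.continent
)
"""

# Version 2: join against one pre-aggregated row per continent.
join_sql = """
SELECT a.continent, a.name, a.area
FROM countries AS a
INNER JOIN (
    SELECT continent, MAX(area) AS max_area
    FROM countries
    GROUP BY continent
) AS b ON a.continent = b.continent
WHERE a.area = b.max_area
"""

correlated_result = sorted(conn.execute(correlated_sql).fetchall())
join_result = sorted(conn.execute(join_sql).fetchall())

print(correlated_result)
print(join_result)
```

Both queries return Canada for North America and France for Europe with this data.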
The subquery is finding the maximum area for countries. Which countries? All countries that match the continent of the country in the outer query.
So, for each country it gets the area of the largest country on the same continent.
The WHERE clause then says "are the two areas the same -- the maximum area and the area of this country?". It chooses only countries that have the maximum area.

Find average GDP according to continent in SQL

I have 2 tables: Economics (land_code, gdp) and Continent (Land_code, Cont, Percentage). I need to create a query that calculates the average GDP per continent. If a country lies in several continents at the same time, we should also take into account the percentage of its GDP that belongs to each continent. As I understand it, if Egypt has a GDP of 100, then 90 belongs to Africa and 10 to Asia. How can I implement this calculation?
Obviously,
select Economics.land_code, Economics.gdp * Continent.Percentage
from ...
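The arithmetic can be tried out in pandas before writing the SQL. This sketch makes several assumptions that are not in the question: Percentage is stored as 0 to 100 (hence the division by 100), "average" means the plain mean of the per-country shares within each continent, and every row other than Egypt's GDP of 100 with its 90/10 split is invented for illustration. Column names are lowercased for convenience.

```python
import pandas as pd

# Only Egypt's values come from the question; "FR" and "ZA" rows are made up.
economics = pd.DataFrame(
    {"land_code": ["EG", "FR", "ZA"], "gdp": [100.0, 200.0, 300.0]}
)
continent = pd.DataFrame(
    {
        "land_code": ["EG", "EG", "FR", "ZA"],
        "cont": ["Africa", "Asia", "Europe", "Africa"],
        "percentage": [90, 10, 100, 100],  # assumed to be on a 0-100 scale
    }
)

# Join the two tables on the country code, one row per country/continent pair.
merged = economics.merge(continent, on="land_code")

# Share of each country's GDP that belongs to the continent on that row.
merged["gdp_share"] = merged["gdp"] * merged["percentage"] / 100

# Average of those shares per continent.
avg_gdp = merged.groupby("cont")["gdp_share"].mean()

print(avg_gdp)
```

Here Egypt contributes 90 to Africa and 10 to Asia, so Africa averages (90 + 300) / 2 = 195. In SQL the same shape would be a join on land_code followed by GROUP BY on the continent column.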

DAX for Grouping

I have following data set as data Model.
Country City AssetCount
USA Newyork 50
USA Washington 40
USA California 30
India Bangalore 100
India Delhi 50
India Bombay 30
I want to show a single row containing the sum of AssetCount at country level and at city level.
There are two slicers, one for Country and one for City, as below:
Country  City
USA      Newyork
India    Washington
         California
         Bangalore
         Delhi
         Bombay
So when I select India as the country, it should show the sum of AssetCount at country (India) level.
In the same way, when I select Delhi as the city, it should show AssetCount at city (Delhi) level.
India  Delhi
180    50
Is this possible in PowerPivot using DAX?
Related content from their question on MSDN
Actually, your solution is not working. I created the hierarchy Country --> City and kept it in Rows. When I select a particular Country and City, it shows this:
Row Labels AssetCount
USA 40
Washington 40
Grand Total 40
But I want
USA Washington
120 40
or maybe like
USA 120
Washington 40
I have tried some aggregate functions like below:
=SUMX(VALUES(Query[City]),CALCULATE(SUM(Query[AssetCount])))
=CALCULATE(SUM(Query[AssetCount]),SUMMARIZE('Query',Query[City]))
Here Query is the table in the Data Model, and City can be replaced by Country.
Neither is working.
So is showing such counts on the same row possible or not?
Sounds like you are just getting started with Power Pivot. You might browse through the links on this page for more help.
I took the data you provided and pasted it into Excel.
Selected the data and clicked Add to Data Model and checked the box for My Data Has Headers.
I made sure the AssetCount Column had a data type of whole number. Then clicked the Pivot Table button and created a pivot table on my existing spreadsheet.
I put AssetCount in the values and made sure it was set to Sum in the Field Value Settings.
I selected my pivot table and then went to the Analyze tab under PivotTable Tools and clicked the Insert Slicer button.
I selected both Country and City as slicers.
This gives your desired result.
If you want the two numbers in a row, that's pretty straightforward. Keep in mind that all those slicers do is put filters on the pivot table.
Therefore, to get your city result, you could use either an implicit measure or an explicit measure that simply sums up AssetCount.
For the country result, you'd want to override the city filter like this:
=CALCULATE(SUM(Query[AssetCount]),ALL(Query[City]))
If you also need the country and city names there, it gets a bit tricky.
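For anyone who finds the filter logic easier to read as plain code, here is a pandas analogue of the two measures. It is not DAX and does not run in Power Pivot; it only mimics the idea that the city number respects both slicers while the country number ignores the city filter. The function name is hypothetical.

```python
import pandas as pd

# Data model from the question.
query = pd.DataFrame(
    {
        "Country": ["USA", "USA", "USA", "India", "India", "India"],
        "City": ["Newyork", "Washington", "California",
                 "Bangalore", "Delhi", "Bombay"],
        "AssetCount": [50, 40, 30, 100, 50, 30],
    }
)


def country_and_city_totals(country: str, city: str) -> tuple[int, int]:
    """Return (country total, city total) for one slicer selection."""
    # Country total: keep the country filter, drop the city filter
    # (the role ALL(Query[City]) plays in the measure above).
    country_rows = query[query["Country"] == country]
    # City total: both filters applied.
    city_rows = country_rows[country_rows["City"] == city]
    return int(country_rows["AssetCount"].sum()), int(city_rows["AssetCount"].sum())


print(country_and_city_totals("India", "Delhi"))
print(country_and_city_totals("USA", "Washington"))
```

These selections give 180 and 50 for India/Delhi and 120 and 40 for USA/Washington, matching the rows asked for in the question.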