.agg on a group inside a groupby object? - pandas

Sorry if this has been asked before, I couldn't find it.
I have a census population dataframe that contains the population of each county in the US.
The relevant part of df looks like:
+----+--------+---------+----------------------------+---------------+
| | REGION | STNAME | CTYNAME | CENSUS2010POP |
+----+--------+---------+----------------------------+---------------+
| 1 | 3 | Alabama | Autauga County | 54571 |
+----+--------+---------+----------------------------+---------------+
| 2 | 3 | Alabama | Baldwin County | 182265 |
+----+--------+---------+----------------------------+---------------+
| 69 | 4 | Alaska | Aleutians East Borough | 3141 |
+----+--------+---------+----------------------------+---------------+
| 70 | 4 | Alaska | Aleutians West Census Area | 5561 |
+----+--------+---------+----------------------------+---------------+
How can I get the np.std of the states' populations (sum of counties' populations) for each of the four US regions, without modifying the df?

You can use transform to broadcast each state's std back onto its rows:
df['std_col'] = df.groupby('STNAME')['CENSUS2010POP'].transform("std")
IIUC, if you want the sum of the counties' populations per state first, you do:
state_pop = df.groupby('STNAME')['CENSUS2010POP'].sum()
np.std(state_pop)

You can also chain two groupbys and use the standard deviation method std() directly (note that pandas' std() defaults to ddof=1, whereas np.std defaults to ddof=0):
new_df = df.groupby(['REGION', 'STNAME'])[['CENSUS2010POP']].sum().groupby('REGION').std()
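Putting the pieces together, here is a minimal runnable sketch of the two-step approach (made-up numbers in the same shape as the question's df; ddof=0 is passed so the result matches np.std's default):

```python
import pandas as pd

# Made-up numbers in the same shape as the question's df (assumption: the
# real data has many states per region).
df = pd.DataFrame({
    "REGION": [3, 3, 4, 4, 4, 4],
    "STNAME": ["Alabama", "Alabama", "Alaska", "Alaska", "Arizona", "Arizona"],
    "CTYNAME": ["Autauga County", "Baldwin County",
                "Aleutians East Borough", "Aleutians West Census Area",
                "Apache County", "Cochise County"],
    "CENSUS2010POP": [54571, 182265, 3141, 5561, 100, 200],
})

# Step 1: each state's population is the sum of its counties' populations.
state_pop = df.groupby(["REGION", "STNAME"])["CENSUS2010POP"].sum()

# Step 2: std of those state totals within each region (ddof=0 matches np.std).
region_std = state_pop.groupby(level="REGION").std(ddof=0)
print(region_std)
```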

Related

How to pivot columns so they turn into rows using PySpark or pandas?

I have a dataframe that looks like the one below, but with hundreds of rows. I need to pivot it, so that each column after Region becomes a row, like in the other table below.
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|city |city_tier | city_classification | Region | Jan-2022-orders | Feb-2022-orders | Mar-2022-orders|
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|new york | large | alpha | NE | 100000 |195000 | 237000 |
|los angeles | large | alpha | W | 330000 |400000 | 580000 |
I need to pivot it using PySpark, so I end up with something like this:
+--------------+----------+---------------------+----------+-----------+---------+
|city |city_tier | city_classification | Region | month | orders |
+--------------+----------+---------------------+----------+-----------+---------+
|new york | large | alpha | NE | Jan-2022 | 100000 |
|new york      | large    | alpha               | NE       | Feb-2022  | 195000  |
|new york | large | alpha | NE | Mar-2022 | 237000 |
|los angeles | large | alpha | W | Jan-2022 | 330000 |
|los angeles   | large    | alpha               | W        | Feb-2022  | 400000  |
|los angeles | large | alpha | W | Mar-2022 | 580000 |
P.S.: A solution using pandas would work too.
In pandas:
df.melt(df.columns[:4], var_name = 'month', value_name = 'orders')
          city city_tier city_classification Region            month  orders
0     new york     large               alpha     NE  Jan-2022-orders  100000
1  los angeles     large               alpha      W  Jan-2022-orders  330000
2     new york     large               alpha     NE  Feb-2022-orders  195000
3  los angeles     large               alpha      W  Feb-2022-orders  400000
4     new york     large               alpha     NE  Mar-2022-orders  237000
5  los angeles     large               alpha      W  Mar-2022-orders  580000
or even
df.melt(['city', 'city_tier', 'city_classification', 'Region'],
var_name = 'month', value_name = 'orders')
          city city_tier city_classification Region            month  orders
0     new york     large               alpha     NE  Jan-2022-orders  100000
1  los angeles     large               alpha      W  Jan-2022-orders  330000
2     new york     large               alpha     NE  Feb-2022-orders  195000
3  los angeles     large               alpha      W  Feb-2022-orders  400000
4     new york     large               alpha     NE  Mar-2022-orders  237000
5  los angeles     large               alpha      W  Mar-2022-orders  580000
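Either way, the month labels keep the "-orders" suffix after melting. A small follow-up step can strip it (a sketch, rebuilding the question's frame inline so it is self-contained):

```python
import pandas as pd

# Rebuild the question's frame inline for a self-contained example.
df = pd.DataFrame({
    "city": ["new york", "los angeles"],
    "city_tier": ["large", "large"],
    "city_classification": ["alpha", "alpha"],
    "Region": ["NE", "W"],
    "Jan-2022-orders": [100000, 330000],
    "Feb-2022-orders": [195000, 400000],
    "Mar-2022-orders": [237000, 580000],
})

long_df = df.melt(list(df.columns[:4]), var_name="month", value_name="orders")
# Drop the "-orders" suffix so month reads "Jan-2022" instead of "Jan-2022-orders".
long_df["month"] = long_df["month"].str.replace("-orders", "", regex=False)
print(long_df)
```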
In PySpark, using your current example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('new york', 'large', 'alpha', 'NE', 100000, 195000, 237000),
     ('los angeles', 'large', 'alpha', 'W', 330000, 400000, 580000)],
    ['city', 'city_tier', 'city_classification', 'Region',
     'Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders']
)
df2 = df.select(
    'city', 'city_tier', 'city_classification', 'Region',
    F.expr("stack(3, 'Jan-2022', `Jan-2022-orders`, 'Feb-2022', `Feb-2022-orders`, 'Mar-2022', `Mar-2022-orders`) as (month, orders)")
)
df2.show()
# +-----------+---------+-------------------+------+--------+------+
# | city|city_tier|city_classification|Region| month|orders|
# +-----------+---------+-------------------+------+--------+------+
# | new york| large| alpha| NE|Jan-2022|100000|
# |   new york|    large|              alpha|    NE|Feb-2022|195000|
# | new york| large| alpha| NE|Mar-2022|237000|
# |los angeles| large| alpha| W|Jan-2022|330000|
# |los angeles|    large|              alpha|     W|Feb-2022|400000|
# |los angeles| large| alpha| W|Mar-2022|580000|
# +-----------+---------+-------------------+------+--------+------+
The function that enables this is stack. It has no DataFrame API, so you need expr to access it.
BTW, this is not pivoting; it's the opposite: unpivoting.

compare two columns in PostgreSQL show only highest value

This is my table
I'm trying to find which urban area has the highest girls-to-boys ratio.
Thank you in advance for your help.
| urban | allgirls | allboys |
| :---- | :------: | :-----: |
| Ran | 100 | 120 |
| Ran | 110 | 105 |
| dhanr | 80 | 73 |
| dhanr | 140 | 80 |
| mohan | 180 | 73 |
| mohan | 25 | 26 |
This is the query I used, but I did not get the expected results
SELECT urban, Max(allboys) as high_girls, Max(allgirls) as high_boys
FROM table_urban
GROUP BY urban
Expected results
| urban | allgirls | allboys |
| :---- | :------: | :-----: |
| dhanr | 220 | 153 |
First of all, your expected result doesn't seem correct, because the girls-to-boys ratio is highest in "mohan", not in "dhanr" (that is, if what you are really looking for is the highest ratio and not the highest number of girls).
You need to group and sum first, then compute the ratio (divide one sum by the other) and take the top row.
select foo.urban as urban,
       foo.girls::numeric / foo.boys as ratio  -- cast to numeric to avoid integer division
from (
    SELECT urban, SUM(allboys) as boys, SUM(allgirls) as girls
    FROM table_urban
    GROUP BY urban
) as foo
order by ratio desc
limit 1
SELECT urban, SUM(allgirls) girls, SUM(allboys) boys
FROM table_urban
GROUP BY urban
ORDER BY SUM(allgirls)::numeric / SUM(allboys) DESC  -- highest girls-to-boys ratio first
LIMIT 1
Note that PostgreSQL doesn't allow output aliases inside ORDER BY expressions, so the SUMs are repeated there; the ::numeric cast avoids integer division.
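To sanity-check the logic, here is the same two-step query replicated with Python's stdlib sqlite3 (the dialect differs slightly from PostgreSQL: CAST(... AS REAL) instead of ::numeric; data copied from the question):

```python
import sqlite3

# In-memory table with the question's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_urban (urban TEXT, allgirls INTEGER, allboys INTEGER)")
conn.executemany(
    "INSERT INTO table_urban VALUES (?, ?, ?)",
    [("Ran", 100, 120), ("Ran", 110, 105),
     ("dhanr", 80, 73), ("dhanr", 140, 80),
     ("mohan", 180, 73), ("mohan", 25, 26)],
)

# Group, sum, then order by the girls-to-boys ratio (cast avoids integer division).
row = conn.execute(
    """
    SELECT urban, SUM(allgirls) AS girls, SUM(allboys) AS boys
    FROM table_urban
    GROUP BY urban
    ORDER BY CAST(SUM(allgirls) AS REAL) / SUM(allboys) DESC
    LIMIT 1
    """
).fetchone()
print(row)  # ('mohan', 205, 99)
```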

Updating a column in PL/SQL

(Using PL/SQL anonymous program block)
I have a table tblROUTE2 of Mexican state highways:
+-----------------+------------+---------+----------+----------+----------------------------+-----------+--------+
| TYPE | ADMN_CLASS | TOLL_RD | RTE_NUM1 | RTE_NUM2 | STATEROUTE | LENGTH_KM | STATE |
+-----------------+------------+---------+----------+----------+----------------------------+-----------+--------+
| Paved Undivided | Federal | N | 81 | | Tamaulipas Federal Hwy 81 | 124.551 | NULL |
| Paved Undivided | Federal | N | 130 | | Hidalgo Federal Hwy 130 | 76.347 | NULL |
| Paved Undivided | Federal | N | 130 | | Mexico Federal Hwy 130 | 68.028 | NULL |
+-----------------+------------+---------+----------+----------+----------------------------+-----------+--------+
and tblSTATE2 of Mexican states:
+------+-----------------------+---------+-----------+
| CODE | NAME | POP1990 | AREA_SQMI |
+------+-----------------------+---------+-----------+
| MX02 | Baja California Norte | 1660855 | 28002.325 |
| MX03 | Baja California Sur | 317764 | 27898.191 |
| MX18 | Nayarit | 824643 | 10547.762 |
+------+-----------------------+---------+-----------+
I need to update the STATE field in tblROUTE2 with the CODE found in tblSTATE2, based on the route name in tblROUTE2. Basically, I need to take the first word or two of the STATEROUTE field (the part before the string 'Federal'; some state names are two words) and match it against the NAME field in tblSTATE2. Then, since each state name is paired with a CODE, write those codes into the STATE field of tblROUTE2.
I have started some code:
DECLARE
    state_code tblROUTE2.STATE%TYPE;
    state_name tblSTATE2.NAME%TYPE;
BEGIN
    SELECT STATE, NAME
    INTO state_code
    FROM tblROUTE2 r, tblSTATE2 s
    WHERE STATEROUTE LIKE '%Federal';
END;
As well, I will need to remove the state name from the route name; for example, the STATEROUTE string 'Tamaulipas Federal Hwy' becomes 'Federal Hwy'. I have started some code, but I'm not sure it's right:
UPDATE tblROUTE2
SET STATEROUTE = TRIM(LEADING FROM 'Federal');
Using a MERGE update:
MERGE INTO tblROUTE2 A
USING (
    SELECT CODE, NAME FROM tblSTATE2
) B
ON (
    UPPER(SUBSTR(A.STATEROUTE, 1, INSTR(UPPER(A.STATEROUTE), 'FEDERAL') - 2)) = UPPER(B.NAME)
)
WHEN MATCHED THEN UPDATE
SET A.STATE = B.CODE;
In a fiddle I replicated your tables and added an additional record whose STATEROUTE matches one of the records in NAME. Although the fiddle returned an error, I ran the statement in my own Oracle DB and the record was updated correctly.
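The SUBSTR/INSTR arithmetic in the ON clause can be illustrated in plain Python (sample route string taken from the question's table; Python's find is 0-based while Oracle's INSTR is 1-based, so the offsets differ by one):

```python
# Oracle: SUBSTR(STATEROUTE, 1, INSTR(UPPER(STATEROUTE), 'FEDERAL') - 2)
# keeps everything before the space preceding 'Federal'.
route = "Tamaulipas Federal Hwy 81"
idx = route.upper().find("FEDERAL")  # 0-based position of 'Federal'
state_part = route[: idx - 1]        # "Tamaulipas" -> matched against tblSTATE2.NAME
remainder = route[idx:]              # "Federal Hwy 81" -> the follow-up trim question
print(state_part, "|", remainder)
```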

Why is Mondrian MDX Query creating massive result set when properties are added?

I have a relatively simple MDX query that creates over 10,000 rows in its result set (most of them empty), but the SQL it generates returns a relatively small result set. Here is the MDX query:
SELECT
NON EMPTY CrossJoin({([employmentDate.yearQuarterMonth].[2012]:[employmentDate.yearQuarterMonth].[2014])}, {[Measures].[headCount]}) ON COLUMNS,
NON EMPTY {[residenceLocation].[iso_region].Members} ON ROWS
FROM workforce
This is an excerpt of what it returns:
|                          |            | 2012                 | 2013                 | 2014                 |
|                          |            | %{measure.headCount} | %{measure.headCount} | %{measure.headCount} |
+--------------------------+------------+----------------------+----------------------+----------------------+
| Germany                  | #null      | 138                  | 241                  | 238                  |
| France                   | #null      | 49                   | 40                   | 66                   |
| United Kingdom           | #null      | 46                   | 20                   | 33                   |
| Japan                    | #null      | 67                   | 135                  | 140                  |
| Russian Federation       | #null      | 84                   | 105                  | 78                   |
| United States of America | California | 38                   | 43                   | 36                   |
|                          |            | 38                   | 43                   | 36                   |
|                          |            | 38                   | 43                   | 36                   |
|                          |            | 38                   | 43                   | 36                   |
|                          |            | 38                   | 43                   | 36                   |
|                          |            | 38                   | 43                   | 36                   |
|                          |            | 38                   | 43                   | 36                   |
You can see that it generates the states like California, but repeats that statistic over and over with empty region names.
This is the SQL being generated, and it returns only 39 rows:
select
"TIME"."YEAR" as "Year",
"REGIONS_1"."ISO_COUNTRY_CODE" as "Country",
"REGIONS_1"."ISO_REGION" as "Region",
sum("EMPLOYMENT"."HEADCOUNT") as "Headcount"
from
"TIME" "TIME",
"EMPLOYMENT" "EMPLOYMENT",
"REGIONS" "REGIONS_1"
where
"EMPLOYMENT"."EMPLOYMENT_DATE_ID" = "TIME"."ID"
and
"TIME"."YEAR" in (2012, 2013, 2014)
and
"EMPLOYMENT"."RESIDENCE_REGION_ID" = "REGIONS_1"."ID"
group by
"TIME"."YEAR",
"REGIONS_1"."ISO_COUNTRY_CODE",
"REGIONS_1"."ISO_REGION";
Then this SQL is generated for the properties being loaded, and it returns over 20,000 rows:
select
"REGIONS_1"."ISO_COUNTRY_CODE" as "Country Code",
"REGIONS_1"."COUNTRY_NAME" as "Country",
"REGIONS_1"."ISO_COUNTRY_CODE" as "Country Code",
"REGIONS_1"."LATITUDE" as "Latitude",
"REGIONS_1"."LONGITUDE" as "Longitude",
"REGIONS_1"."ISO_REGION" as "Region",
"REGIONS_1"."REGION_NAME" as "Region Name",
"REGIONS_1"."STATE_FIPS" as "State FIPS",
"REGIONS_1"."LATITUDE" as "Latitude",
"REGIONS_1"."LONGITUDE" as "Longitude"
from
"REGIONS" "REGIONS_1"
group by
"REGIONS_1"."ISO_COUNTRY_CODE",
"REGIONS_1"."COUNTRY_NAME",
"REGIONS_1"."LATITUDE",
"REGIONS_1"."LONGITUDE",
"REGIONS_1"."ISO_REGION",
"REGIONS_1"."REGION_NAME",
"REGIONS_1"."STATE_FIPS"
order by
"REGIONS_1"."ISO_COUNTRY_CODE" ASC NULLS LAST,
"REGIONS_1"."ISO_REGION" ASC NULLS LAST;
So I'm not sure why Mondrian balloons the result like this when it creates the CellSet. The Region table is composed of these columns (among others):
Country,
Region (ISO Code),
State_FIPS,
Postal_Code,
Latitude,
Longitude,
County,
MSA,
CBSA, etc.
It's a fairly low level Regional data. The Dimension's hierarchy looks like this:
iso_country
iso_region
county
I should add that it also happens, though not to this degree, when using [residenceLocation].[country], where USA is blown out with several empty rows. That adds maybe 10-20 extra rows, not 10,000, but I think it's the same problem in both cases.
Update: I figured out where the extra rows are coming from. When I add Latitude and Longitude as properties to Country, Region, and County, the rows begin to explode. Take them away and it's fine. So is there a way to add properties like these, which can differ between rows, without affecting the returned CellSet?

How to add column with the value of another dimension?

I apologize if the title does not make sense. I am trying to do something that is probably simple, but I have not been able to figure it out, and I'm not sure how to search for the answer. I have the following MDX query:
SELECT
event_count ON 0,
TOPCOUNT(name.children, 10, event_count) ON 1
FROM
events
which returns something like this:
| | event_count |
+---------------+-------------+
| P Davis | 123 |
| J Davis | 123 |
| A Brown | 120 |
| K Thompson | 119 |
| R White | 119 |
| M Wilson | 118 |
| D Harris | 118 |
| R Thompson | 116 |
| Z Williams | 115 |
| X Smith | 114 |
I need to include an additional column (gender). Gender is not a metric. It's just another dimension on the data. For instance, consider this query:
SELECT
gender.children ON 0,
TOPCOUNT(name.children, 10, event_count) ON 1
FROM
events
But this is not what I want! :(
| | female | male | unknown |
+--------------+--------+------+---------+
| P Davis | | | 123 |
| J Davis | | 123 | |
| A Brown | | 120 | |
| K Thompson | | 119 | |
| R White | 119 | | |
| M Wilson | | | 118 |
| D Harris | | | 118 |
| R Thompson | | | 116 |
| Z Williams | | | 115 |
| X Smith | | | 114 |
Nice try, but I just want three columns: name, event_count, and gender. How hard can it be?
Obviously this reflects lack of understanding about MDX on my part. Any pointers to quality introductory material would be appreciated.
It's important to understand that in MDX you are building sets of members on each axis, and not specifying column names like a tabular rowset. You are describing a 2-dimensional grid of results, not a linear rowset. If you imagine each dimension as a table, the member set is the set of unique values from a single column in that table.
When you choose a Measure as the member (as in your first example), it looks as if you're selecting from a table, so it's easy to misunderstand. When you choose a Dimension, you get many members, and a cross-join between the rows and columns (which is sparse in this case because the names and genders are 1-to-1).
So, you could crossjoin these two dimensions on a single axis, and then filter out the null cells:
SELECT
event_count ON 0,
TOPCOUNT(
NonEmptyCrossJoin(name.children, gender.children),
10,
event_count) ON 1
FROM
events
That should give you a result with a single column (event_count) and 10 rows, where each row is labeled by the tuple (name, gender).
I hope that sets you on the right path; please feel free to ask if you want me to clarify anything.
For general introductory material, I think the book "MDX Solutions" is a good place to start:
http://www.amazon.ca/MDX-Solutions-Microsoft-Analysis-Services/dp/0471748080/
For online material, you can have a look at this gentle introduction, which presents the main MDX concepts.