Pandas pivot table: How to sort top n rows in the pivot table by category and generate a new dataframe?

import pandas as pd

df = pd.DataFrame({'country': ['AUD','CAD','IND','JPY','AUD','CHY','IND','KRL','SRI','KRW','CAD'],
                   'area': ['N','S','N','E','W','S','NE','N','S','SE','N'],
                   'gdp': [349,65,60,88,75,100,200,250,150,210,160],
                   'income': [7000,2300,5000,1000,550,1000,2060,2750,1450,2610,1650],
                   'expense': [500,300,700,600,500,900,206,275,1405,210,150]})
df = (df.pivot_table(index=['country','area'], values=['gdp'], aggfunc='sum')
        .sort_values(by=['gdp'], ascending=False)
        .head(5))
With the above approach I am unable to see the top 5 countries by gdp. My expected output is below (I mocked it up in MS Excel to get a feel for it). Please suggest.
new_df =
country  area  gdp  expense  income
AUD            424     1000    7550
         N     349      500    7000
         W      75      500     550
IND            260      906    7060
         N      60      700    5000
         NE    200      206    2060
KRL            250      275    2750
         N     250      275    2750
CAD            225      450    3950
         N     160      150    1650
         S      65      300    2300
KRW            210      210    2610
         SE    210      210    2610
new_df =
country  gdp  expense  income  area
AUD      424     1000    7550  N, W
IND      260      906    7060  N, NE
KRL      250      275    2750  N
CAD      225      450    3950  N, S
KRW      210      210    2610  SE

Use:
(df.groupby('country', as_index=False)
   .agg({'gdp': 'sum', 'area': ','.join})
   .sort_values(by='gdp', ascending=False)
   .head(5))
Output:
  country  gdp  area
0     AUD  424   N,W
3     IND  260  N,NE
5     KRL  250     N
1     CAD  225   S,N
6     KRW  210    SE

You don't need a pivot for this; it can be done with a plain groupby:
df.groupby('country').agg({'gdp': 'sum',
                           'area': ','.join}).sort_values('gdp', ascending=False).head(5)
Output:
country   gdp  area
AUD       424   N,W
IND       260  N,NE
KRL       250     N
CAD       225   S,N
KRW       210    SE
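
If you also need the expense and income sums shown in the expected output, the same groupby extends to those columns. A minimal sketch, run against the original df from the question (i.e. before the pivot_table reassignment); the joined-area separator is a guess from the expected output:

# Sum the numeric columns and join the areas, one row per country.
new_df = (df.groupby('country', as_index=False)
            .agg({'gdp': 'sum', 'expense': 'sum', 'income': 'sum', 'area': ', '.join})
            .sort_values('gdp', ascending=False)
            .head(5))

This reproduces the second expected output: AUD 424/1000/7550 with areas "N, W", down to KRW 210/210/2610 with "SE".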

Related

SQL to find related rows in a loop in ANSI SQL or Snowflake SQL

I have a requirement where I need to link all related customer IDs and assign a unified customer ID to all the related CUST_IDs.
For example, for the data below:
INPUT DATA
PK_ID CUST_ID_1 CUST_ID_2 CUST_ID_3
1 123 456 567
2 898 567 780
3 999 780 111
4 111 222 333
Based on CUST_ID_1/CUST_ID_2/CUST_ID_3, I need to link all the rows and assign a unified ID to each of them.
OUTPUT DATA
Unified ID CUST_ID_1 CUST_ID_2 CUST_ID_3
1000 123 456 567
1000 898 567 780
1000 999 780 111
1000 111 222 333
I tried a self join, but the number of joins needed isn't fixed. Is there a function or ANSI SQL feature that can help with this?
What I have tried:
CREATE TEMP TABLE TBL_TEMP AS (
    SELECT A.PK_ID
    FROM TBL A
    LEFT JOIN TBL B
        ON A.CUST_ID_1 = B.CUST_ID_1
        AND A.PK_ID <> B.PK_ID
);

UPDATE TBL
SET UNIFIED_ID = SEQ_UNIF_ID.nextval
FROM TBL_TEMP
WHERE TBL.PK_ID = TBL_TEMP.PK_ID;
I would have to write this update for each column, multiple times.
If you are OK with gaps in the sequence, the following is what I can come up with for now.
update cust_temp a
set unified_id = t.unified_id
from (
    select
        case
            when (select count(*) from cust_temp b
                  where arrays_overlap(array_construct(a.cust_id_1, a.cust_id_2, a.cust_id_3),
                                       array_construct(b.cust_id_1, b.cust_id_2, b.cust_id_3))) > 1  -- match across the data set
                then 1000        -- same value for common rows
            else ts.nextval      -- sequence for non-common rows
        end unified_id,
        a.cust_id_1, a.cust_id_2, a.cust_id_3
    from cust_temp a, table(getnextval(SEQ_UNIF_ID)) ts
) t
where t.cust_id_1 = a.cust_id_1
  and t.cust_id_2 = a.cust_id_2
  and t.cust_id_3 = a.cust_id_3;
Updated data set:
select * from cust_temp;

UNIFIED_ID  CUST_ID_1  CUST_ID_2  CUST_ID_3
1000        123        456        567
1000        898        567        780
1000        111        222        333
20000       100        200        300
1000        999        780        111
1000        234        123        901
23000       260        360        460
24000       160        560        760
Original data set:
select * from cust_temp;

UNIFIED_ID  CUST_ID_1  CUST_ID_2  CUST_ID_3
NULL        123        456        567
NULL        898        567        780
NULL        111        222        333
NULL        100        200        300
NULL        999        780        111
NULL        234        123        901
NULL        260        360        460
NULL        160        560        760
The arrays_overlap logic is thanks to @Simeon.
The following procedure can be used:
EXECUTE IMMEDIATE $$
DECLARE
    duplicate number;
    x number;
BEGIN
    duplicate := (select count(cnt) from
                     (select a.unified_id, count(*) cnt
                      from cust_temp a, cust_temp b
                      where arrays_overlap(array_construct(a.cust_id_1, a.cust_id_2, a.cust_id_3),
                                           array_construct(b.cust_id_1, b.cust_id_2, b.cust_id_3))
                        and a.cust_id_1 != b.cust_id_1
                        and a.cust_id_2 != b.cust_id_2
                        and a.cust_id_3 != b.cust_id_3
                      group by a.unified_id)
                  where cnt > 1);
    FOR x IN 1 TO duplicate DO
        update cust_temp a
        set a.unified_id = (select min(b.unified_id) uid from cust_temp b
                            where arrays_overlap(array_construct(a.cust_id_1, a.cust_id_2, a.cust_id_3),
                                                 array_construct(b.cust_id_1, b.cust_id_2, b.cust_id_3)));
    END FOR;
END;
$$;
This produces the following output data set:

UNIFIED_ID  CUST_ID_1  CUST_ID_2  CUST_ID_3
1000        100        200        300
2000        123        456        567
2000        898        567        780
2000        111        222        333
2000        999        780        111
2000        234        123        901
7000        260        360        460
8000        160        560        760
8000        186        160        766
For an input data set of:

UNIFIED_ID  CUST_ID_1  CUST_ID_2  CUST_ID_3
1000        100        200        300
2000        123        456        567
3000        898        567        780
4000        111        222        333
5000        999        780        111
6000        234        123        901
7000        260        360        460
8000        160        560        760
9000        186        160        766
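
Under the hood this is a connected-components problem: each row is a node, and two rows are linked when they share any customer ID. For illustration only (Python, not Snowflake), here is a minimal union-find sketch over the question's input rows; the helper names and the 1000-per-group numbering are mine:

# Union-find over rows that share any customer ID.
rows = [(123, 456, 567), (898, 567, 780), (999, 780, 111), (111, 222, 333)]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Link every ID in a row to the row's first ID.
for row in rows:
    for cust_id in row[1:]:
        union(cust_id, row[0])

# Assign one unified ID per component, starting at 1000.
unified, next_id = {}, 1000
for row in rows:
    root = find(row[0])
    if root not in unified:
        unified[root] = next_id
        next_id += 1000
    print(unified[root], *row)

All four example rows end up in one component (567 links rows 1-2, 780 links rows 2-3, 111 links rows 3-4), so every row prints unified ID 1000, matching the expected output.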

How to select a value based on multiple criteria

I'm trying to select some values based on some proprietary data, and I just changed the variables to reference house prices.
I am trying to get the total offers for houses where they were sold at the bid or at the ask price, with offers under 15 and offers * sale price less than 5,000,000.
I then want to get the total number of offers for each neighborhood on each day, but instead I'm getting the total offers across each neighborhood (n1 + n2 + n3 + n4 + n5) across all dates and the total offers in the dataset across all dates.
My current query is this:
SELECT DISTINCT(neighborhood),
       DATE(date_of_sale),
       (SELECT SUM(offers)
        FROM `big_query.a_table_name.houseprices`
        WHERE ((offers * accepted_sale_price < 5000000)
               AND (offers < 15)
               AND (house_bid = sale_price OR house_ask = sale_price))) AS bid_ask_off,
       (SELECT SUM(offers)
        FROM `big_query.a_table_name.houseprices`) AS total_offers,
FROM `big_query.a_table_name.houseprices`
GROUP BY neighborhood, DATE(date_of_sale)
LIMIT 100
I'm expecting a result where each neighborhood appears once per date (with the date repeated throughout as d1, d2, d3, etc.), but instead I'm receiving the dataset-wide totals described above. (The expected and actual results were images in the original post.)
I'm aware that there are some inherent problems with what I'm trying to select / group, but I'm not sure what to google or what tutorials to look at in order to perform this operation.
It's querying quite a bit of data, and I want to keep costs down, as I've already racked up a smallish bill on queries.
Any help or advice would be greatly appreciated, and I hope I've provided enough information.
Here is a sample dataframe.
neighborhood date_of_sale offers accepted_sale_price house_bid house_ask
bronx 4/1/2022 3 323 320 323
manhattan 4/1/2022 4 244 230 244
manhattan 4/1/2022 8 856 856 900
queens 4/1/2022 15 110 110 135
brooklyn 4/2/2022 12 115 100 115
manhattan 4/2/2022 9 255 255 275
bronx 4/2/2022 6 330 300 330
queens 4/2/2022 10 405 395 405
brooklyn 4/2/2022 4 254 254 265
staten_island 4/3/2022 2 442 430 442
staten_island 4/3/2022 13 195 195 225
bronx 4/3/2022 4 650 650 690
manhattan 4/3/2022 2 286 266 286
manhattan 4/3/2022 6 356 356 400
staten_island 4/4/2022 4 361 361 401
staten_island 4/4/2022 5 348 348 399
bronx 4/4/2022 8 397 340 397
manhattan 4/4/2022 9 333 333 394
manhattan 4/4/2022 11 392 325 392
I think this is what you need.
As we group by neighbourhood, we do not need DISTINCT.
We take sum(offers) for total_offers directly from the table, and bids from a sub-query that we join to, so that it is grouped by neighbourhood.
SELECT
h.neighborhood,
DATE(h.date_of_sale) AS date_,
s.bids AS bid_ask_off,
SUM(h.offers) AS total_offers,
FROM
`big_query.a_table_name.houseprices` h
LEFT JOIN
(SELECT
neighborhood,
SUM(offers) AS bids
FROM
`big_query.a_table_name.houseprices`
WHERE offers * accepted_sale_price < 5000000
AND offers < 15
AND (house_bid = sale_price OR
house_ask = sale_price)
GROUP BY neighborhood) s
ON h.neighborhood = s.neighborhood
GROUP BY
h.neighborhood,
DATE(date_of_sale),
s.bids
LIMIT 100;
Or the following, which changes the initial query more but may be closer to what you need.
SELECT
h.neighborhood,
DATE(h.date_of_sale) AS date_,
s.bids AS bid_ask_off,
SUM(h.offers) AS total_offers,
FROM
`big_query.a_table_name.houseprices` h
LEFT JOIN
(SELECT
date_of_sale dos,
neighborhood,
SUM(offers) AS bids
FROM
`big_query.a_table_name.houseprices`
WHERE offers * accepted_sale_price < 5000000
AND offers < 15
AND (house_bid = sale_price OR
house_ask = sale_price)
GROUP BY
neighborhood,
date_of_sale) s
ON h.neighborhood = s.neighborhood
AND h.date_of_sale = s.dos
GROUP BY
h.neighborhood,
DATE(date_of_sale),
s.bids
LIMIT 100;
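
As a side note, the grouped conditional sum is easy to sanity-check locally before spending more BigQuery quota. A pandas sketch of the same logic over a few of the sample rows (I assume accepted_sale_price plays the role of sale_price, since the sample data has no separate sale_price column):

import pandas as pd

df = pd.DataFrame({  # first five sample rows from the question
    'neighborhood': ['bronx', 'manhattan', 'manhattan', 'queens', 'brooklyn'],
    'date_of_sale': ['4/1/2022', '4/1/2022', '4/1/2022', '4/1/2022', '4/2/2022'],
    'offers': [3, 4, 8, 15, 12],
    'accepted_sale_price': [323, 244, 856, 110, 115],
    'house_bid': [320, 230, 856, 110, 100],
    'house_ask': [323, 244, 900, 135, 115]})

# Rows that count as "sold at bid or ask" under the question's filters.
qualifies = ((df['offers'] * df['accepted_sale_price'] < 5_000_000)
             & (df['offers'] < 15)
             & ((df['house_bid'] == df['accepted_sale_price'])
                | (df['house_ask'] == df['accepted_sale_price'])))

# Qualifying offers summed per neighborhood and day, next to the plain total.
out = (df.assign(bid_ask_off=df['offers'].where(qualifies, 0))
         .groupby(['neighborhood', 'date_of_sale'], as_index=False)
         .agg(bid_ask_off=('bid_ask_off', 'sum'), total_offers=('offers', 'sum')))
print(out)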

SQL iterative / recursive CTE with conditions (subtract from previous rows)

I have this query calculating how many products I have to produce to serve my pending orders and the components I need to produce them.
select
    l.codart as SKU,     --final product
    e.codartc as Component,     --piece of final product
    e.unicompo,     --components needed for each SKU
    l1.SKU_pending - s.SKU_STOCK as "SKU to produce",
    s2.C_STOCK as "Component stock",
    s2.C_STOCK - sum((l1.SKU_pending - s.SKU_STOCK) * e.unicompo)
        over (partition by e.codartc order by l.codart) as "Component stock after producing"
from linepedi l     --table with sales orders
left join escandallo e on e.codartp = l.codart     --table with SKU components
inner join (select l1.codart, sum(l1.unidades - l1.uniservida - l1.unianulada) as "SKU_pending"
            from linepedi l1     --subquery so the pending-units calculation isn't repeated
            where (l1.unidades - l1.uniservida - l1.unianulada) > 0
            group by l1.codart) l1 on l1.CODART = l.codart
left join (select s.codart, sum(s.unidades) as "SKU_STOCK"
           from __STOCKALMART s
           group by s.codart) s on s.codart = l.codart
left join (select s.codart, sum(s.unidades) as "C_STOCK"
           from __STOCKALMART s
           group by s.codart) s2 on s2.codart = e.codartc
where l1.SKU_pending - s.SKU_STOCK > 0
group by l.codart, e.codartc, e.unicompo, l1.SKU_pending, s.SKU_STOCK, s2.C_STOCK
order by l.codart
The query returns this table:
SKU     Component  unicompo  SKU to produce  Component stock  Component stock after producing
20611   286        1         50              2021             1971
20611   329        1         50              2759             2709
20611   ARTZD031   1         50              643              593
220178  ARTZD027   1         384             477              93
220178  SICBB005   1         384             845              461
220178  265        1         384             894              510
220185  265        1         200             894              310
220185  SICBB005   1         200             845              261
220185  ARTZD028   1         200             71               -129
220192  ARTZD029   1         200             364              164
220192  SICBB005   1         200             845              61
220192  265        1         200             894              110
When "Component stock after producing" would drop below 0, I don't want it to subtract the full "SKU to produce" quantity, but only the minimum component stock available for that SKU, while "saving" this reduced value for the next time I need the same component. I think I would need to make an iteration with conditionals.
This is what I'd like to accomplish:
SKU     Component  unicompo  SKU to produce  Component stock  Component stock after producing
20611   286        1         50              2021             1971
20611   329        1         50              2759             2709
20611   ARTZD031   1         50              643              593
220178  ARTZD027   1         384             477              93
220178  SICBB005   1         384             845              461
220178  265        1         384             894              510
220185  265        1         200             894              439
220185  SICBB005   1         200             845              390
220185  ARTZD028   1         200             71               0
220192  ARTZD029   1         200             364              164
220192  SICBB005   1         200             845              190
220192  265        1         200             894              239
I've been reading some articles and I feel like it might be done with a recursive CTE, but I don't really know how, since I didn't find any example similar to mine.
How can I achieve this? Any help will be appreciated. Thank you very much.
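
A recursive CTE can express this, but the rule is easier to see imperatively first: walk the SKUs in order, keep a running balance per component, cap each SKU's production at what its scarcest component still allows, and carry the reduced balances forward. A Python sketch of that rule over the 220178/220185/220192 rows from the table above (data hard-coded for illustration; names are mine):

# Greedy allocation with carry-forward, per the desired output above.
sku_components = {                 # SKU -> [(component, units per SKU)]
    '220178': [('ARTZD027', 1), ('SICBB005', 1), ('265', 1)],
    '220185': [('265', 1), ('SICBB005', 1), ('ARTZD028', 1)],
    '220192': [('ARTZD029', 1), ('SICBB005', 1), ('265', 1)]}
to_produce = {'220178': 384, '220185': 200, '220192': 200}
stock = {'ARTZD027': 477, 'SICBB005': 845, '265': 894,
         'ARTZD028': 71, 'ARTZD029': 364}

for sku, components in sku_components.items():
    # The scarcest component caps how many units can actually be built.
    can_build = min(to_produce[sku],
                    *(stock[c] // per_unit for c, per_unit in components))
    for c, per_unit in components:
        stock[c] -= can_build * per_unit   # carry the reduced stock forward
        print(sku, c, to_produce[sku], stock[c])

This reproduces the desired "Component stock after producing" column: SKU 220185 is capped at 71 units by ARTZD028, so component 265 ends at 439 and SICBB005 at 390, and SKU 220192 then draws from those carried-forward balances (265 ends at 239, SICBB005 at 190).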

How to calculate the growth rate of a time series variable in Python pandas

I have data in time series format like:
date value
1-1-2013 100
1-2-2013 200
1-3-2013 300
1-4-2013 400
1-5-2013 500
1-6-2013 600
1-7-2013 700
1-8-2013 650
1-9-2013 450
1-10-2013 350
1-11-2013 250
1-12-2013 150
Use Series.pct_change:
In [458]: df['growth rate'] = df.value.pct_change()
In [459]: df
Out[459]:
date value growth rate
0 1-1-2013 100 NaN
1 1-2-2013 200 1.000000
2 1-3-2013 300 0.500000
3 1-4-2013 400 0.333333
4 1-5-2013 500 0.250000
5 1-6-2013 600 0.200000
6 1-7-2013 700 0.166667
7 1-8-2013 650 -0.071429
8 1-9-2013 450 -0.307692
9 1-10-2013 350 -0.222222
10 1-11-2013 250 -0.285714
11 1-12-2013 150 -0.400000
Or, if you want to show it in percent, multiply by 100:
In [480]: df['growth rate'] = df.value.pct_change().mul(100)
In [481]: df
Out[481]:
date value growth rate
0 1-1-2013 100 NaN
1 1-2-2013 200 100.000000
2 1-3-2013 300 50.000000
3 1-4-2013 400 33.333333
4 1-5-2013 500 25.000000
5 1-6-2013 600 20.000000
6 1-7-2013 700 16.666667
7 1-8-2013 650 -7.142857
8 1-9-2013 450 -30.769231
9 1-10-2013 350 -22.222222
10 1-11-2013 250 -28.571429
11 1-12-2013 150 -40.000000
Growth rate as single number for each year
df['col'] = df.groupby(['Year'])['col2'].pct_change(periods=1) * 100
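
Note that pct_change still yields one value per row (reset within each year). If you literally want a single growth number per year, one option is to compare each year's last value to its first; a sketch, assuming a Year column and a value column like the ones above:

# One figure per year: percent change from the year's first value to its last.
yearly_growth = (df.groupby('Year')['value']
                   .agg(lambda s: (s.iloc[-1] / s.iloc[0] - 1) * 100))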

ID rows containing values greater than corresponding values based on a criteria from another row

I have a grouped dataframe. I have created a flag that identifies whether values in a row are less than the group maximums. This works fine. However, I want to unflag rows where the value in a third column is greater than the corresponding value in that same column within each group. I have a feeling there should be an elegant, pythonic way to do this, but I can't figure it out.
The flag shown in the code compares the maximum tour_duration within each hh_id to the corresponding value of "comp_expr" and, if the maximum is less, assigns 0 to the flag column (otherwise 1). However, I also want the flag to be 0 if min(arrivaltime) for a subgroup tour_id > max(arrivaltime) of the tour_id whose tour_duration is maximum within each hh_id. For example, in the given data, tour_id 16300 has the highest tour_duration, but tour_id 16200 has min arrivaltime 1080, which is > max(arrivaltime) of tour_id 16300 (960). So the flag for all tour_id 16200 rows should be 0.
Kindly assist.
import pandas as pd
import numpy as np
stops_data = pd.DataFrame({
    'hh_id': [20044]*11 + [20122]*13,
    'tour_id': [16300,16300,16100,16100,16100,16100,16200,16200,16200,16000,16000,
                38100,38100,37900,37900,37900,38000,38000,38000,38000,38000,38000,37800,37800],
    'arrivaltime': [360,960,900,900,900,960,1080,1140,1140,420,840,
                    300,960,780,720,960,1080,1080,1080,1080,1140,1140,480,900],
    'tour_duration': [600,600,60,60,60,60,60,60,60,420,420,
                      660,660,240,240,240,60,60,60,60,60,60,420,420],
    'comp_expr': [1350,1350,268,268,268,268,406,406,406,974,974,
                  1568,1568,606,606,606,298,298,298,298,298,298,840,840]})

stops_data['flag'] = np.where(stops_data.groupby('hh_id')['tour_duration']
                              .transform('max') < stops_data['comp_expr'], 0, 1)
This is my current output: [image in original post: current dataset and output]
This is my desired output, please see the flag column: [image in original post: desired output, with the changed flag values in bold]
>>> stops_data.loc[stops_data.tour_id
.isin(stops_data.loc[stops_data.loc[stops_data
.groupby(['hh_id','tour_id'])['arrivaltime'].idxmin()]
.groupby('hh_id')['arrivaltime'].idxmax()]['tour_id']), 'flag'] = 0
>>> stops_data
hh_id tour_id arrivaltime tour_duration comp_expr flag
0 20044 16300 360 600 1350 0
1 20044 16300 960 600 1350 0
2 20044 16100 900 60 268 1
3 20044 16100 900 60 268 1
4 20044 16100 900 60 268 1
5 20044 16100 960 60 268 1
6 20044 16200 1080 60 406 0
7 20044 16200 1140 60 406 0
8 20044 16200 1140 60 406 0
9 20044 16000 420 420 974 0
10 20044 16000 840 420 974 0
11 20122 38100 300 660 1568 0
12 20122 38100 960 660 1568 0
13 20122 37900 780 240 606 1
14 20122 37900 720 240 606 1
15 20122 37900 960 240 606 1
16 20122 38000 1080 60 298 0
17 20122 38000 1080 60 298 0
18 20122 38000 1080 60 298 0
19 20122 38000 1080 60 298 0
20 20122 38000 1140 60 298 0
21 20122 38000 1140 60 298 0
22 20122 37800 480 420 840 0
23 20122 37800 900 420 840 0
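
For readability, the chained .loc expression above can be unpacked into steps; the intermediate names are mine, the logic is identical:

# 1) For each (hh_id, tour_id), keep the row with the earliest arrival time.
first_rows = stops_data.loc[
    stops_data.groupby(['hh_id', 'tour_id'])['arrivaltime'].idxmin()]

# 2) Within each household, find the tour whose earliest arrival is latest.
latest_tours = stops_data.loc[
    first_rows.groupby('hh_id')['arrivaltime'].idxmax(), 'tour_id']

# 3) Unflag every row belonging to those tours.
stops_data.loc[stops_data['tour_id'].isin(latest_tours), 'flag'] = 0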