How to sum conditionally in pandas

I am trying to populate a DataFrame using the result of a calculation performed on a different DataFrame.
These calculations should be run on one series when conditions are met in two separate series.
Here is what I have tried.
I have built a DataFrame, rswcapacity, on which the calculations should be run, then created another DataFrame, annualcapacity, where I would like the results of the conditional calculations to be stored.
#First DataFrame
import pandas as pd
import numpy as np

d = {'technology': ['EAF', 'EAF', 'EAF', 'BOF', 'BOF', 'BOF'],
     'equip_detail1': [150, 130, 100, 200, 200, 150],
     'equip_number': [1, 2, 3, 1, 2, 3],
     'capacity_actual': [2400, 2080, 1600, 3200, 3200, 2400],
     'start_year': [1992, 1993, 1994, 1989, 1990, 1991],
     'closure_year': ['', 2002, '', '', 2001, 2011]}
rswcapacity = pd.DataFrame(data=d)
rswcapacity['closure_year'].replace('', np.nan, inplace=True)
#Second DataFrame
annualcapacity = pd.DataFrame(columns=['years', 'capacity'])
years = list(range(1980, 2021))  # 1980 through 2020
annualcapacity['years'] = years
#Neither of the attempts below yields the desired results:
for y in years:
    annualcapacity['capacity'].append(rswcapacity['capacity_actual'].apply(lambda x : x['capacity_actual'].sum() (x['start_year'] >= y & (x['closure_year'] <= y | x['closure_year'].isnull()))).sum())
annualcapacity
#other attempt:
for y in years:
if (rswcapacity['start_year'] >= y).any() & ((rswcapacity['closure_year'].isnull()).any() | (rswcapacity['closure_year'] <= y).any()):
annualcapacity['capacity'].append(rswcapacity['capacity_actual'].sum())
annualcapacity
The result I would like to obtain is a sum performed for every year.
For instance:
1985 should return NaN, as 1985 is smaller than any of the years in start_year.
1992 should return 14880, as 1992 is larger than any start_year and smaller than any closure_year.
2001 should return 7200, as it is larger than all start_year values and larger than all closure_year values.
Instead, all my attempts only return NaN across the list of years.
There is something wrong with how I am setting the conditions, but I have not managed to work out what.
Any insight much appreciated!

You can do this as follows:
# start with an empty dataframe for the summed capacity
# with int32 as type of the year and float32 as type for the capacity
annualcapacity = pd.DataFrame({'years': pd.Series(dtype='int32'), 'capacity': pd.Series(dtype='float32')})
# use your list of years (defined above)
years = list(range(1980, 2021))
for y in years:
    # create a sum for each year
    indexer = (rswcapacity['start_year'] <= y) & (rswcapacity['closure_year'].isnull() | (rswcapacity['closure_year'] >= y))
    capa = rswcapacity.loc[indexer, 'capacity_actual'].sum()
    # and append it to the result frame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    annualcapacity = pd.concat([annualcapacity, pd.DataFrame({'years': [y], 'capacity': [capa]})], ignore_index=True)
annualcapacity
The result looks like this:
years capacity
0 1980 0.0
1 1981 0.0
2 1982 0.0
3 1983 0.0
4 1984 0.0
5 1985 0.0
6 1986 0.0
7 1987 0.0
8 1988 0.0
9 1989 3200.0
10 1990 6400.0
11 1991 8800.0
12 1992 11200.0
13 1993 13280.0
14 1994 14880.0
15 1995 14880.0
16 1996 14880.0
17 1997 14880.0
18 1998 14880.0
19 1999 14880.0
20 2000 14880.0
21 2001 14880.0
22 2002 11680.0
23 2003 9600.0
24 2004 9600.0
25 2005 9600.0
26 2006 9600.0
27 2007 9600.0
28 2008 9600.0
29 2009 9600.0
30 2010 9600.0
31 2011 9600.0
32 2012 7200.0
33 2013 7200.0
34 2014 7200.0
35 2015 7200.0
36 2016 7200.0
37 2017 7200.0
38 2018 7200.0
39 2019 7200.0
40 2020 7200.0
Note: the sums are always numeric, so if there is no capacity for a year, the value is 0.0 instead of NaN. If you need NaN for some reason, you can replace the zeros with the line shown below.
The second point is that I switched your condition,
(rswcapacity['start_year'] >= y) & ((rswcapacity['closure_year'].isnull()) | (rswcapacity['closure_year'] <= y))
so >= became <= (and <= became >=), because I assume you want to sum all capacities that were available in that year, right?
So if you need NaN entries instead of 0.0 when no capacity is available at all, you can do that as follows:
annualcapacity.loc[annualcapacity['capacity'] == 0, 'capacity'] = np.nan
For this, you need import numpy as np at the top of your script (already included in the setup above).
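As a side note, the same sums can be computed without growing the DataFrame inside the loop. A minimal sketch, assuming the rswcapacity and years defined above:
# compute one sum per year with a list comprehension, then build the frame once
capacity = [rswcapacity.loc[(rswcapacity['start_year'] <= y)
                            & (rswcapacity['closure_year'].isnull()
                               | (rswcapacity['closure_year'] >= y)),
                            'capacity_actual'].sum()
            for y in years]
annualcapacity = pd.DataFrame({'years': years, 'capacity': capacity})
This avoids repeated concatenation, which copies the whole frame on every iteration.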

Related

Pivot table with Pandas

I have a small issue trying to do a simple pivot with pandas. One column has values that are entered more than once, each with a different value in a second column and a year in a third column. What I want to do is get a sum of the second column per year, using the values of the first column as rows.
import pandas as pd
year = 2022
base = pd.read_csv("Database.csv")
raw_monthly = pd.read_csv("Monthly.csv")
raw_types = pd.read_csv("types.csv")
monthly = raw_monthly.assign(Year=year)
ty = raw_types[['cparty', 'sales']]
typ = ty.rename(columns={"Sales": "sales"})
type = typ.assign(Year=year)
fin = pd.concat([base, monthly, type])
fin.drop(fin.tail(1).index, inplace=True)
currentYear = fin.loc[fin['Year'] == 2022]
final = pd.pivot_table(currentYear, index=['cparty', 'sales'], values='sales', aggfunc='sum')
With the above, I am getting this result, but what I want is to have
the 2 sales values of '3' for 2022 summed into a single value, so later I can also break it down by year. Any help appreciated!
Edit: The issue seems to come from the fact that the 3 CSVs are concatenated into a single dataframe. Doing the 3->1 CSV conversion manually in Excel and then trying to use the Groupby answer works as intended, but it does not work if I try to automatically combine the 3 CSVs into 1 using
fin = pd.concat([base, monthly, type])
The 3 CSVs look like this.
Base looks like this:
cparty sales year
0 50969 -146602.14 2016
1 51056 -104626.62 2016
2 51129 -101742.99 2016
3 51036 -81801.84 2016
4 51649 -35992.60 2016
monthly looks like this, missing the year
cparty sales
0 818243 -330,052.47
1 82827 -178,630.85
2 508637 -156,369.87
3 29253 -104,028.30
4 596037 -95,312.07
type is like this.
cparty sales
0 582454 -16,056.46
1 597321 24,336.16
2 567172 20,736.78
3 614070 18,590.45
4 5601295 -3,661.46
What I am attempting to do is add a new column for the last 2 to have the Year set as 2022, so that later I can do the groupby per year. When I try to concat the 3 CSVs, it breaks down.
Suppose cparty is a categorical variable.
# create sales and retail dataframes with year
df = pd.DataFrame({
'year':[2022, 2022, 2018, 2019, 2020, 2021, 2022, 2022, 2022, 2021, 2019, 2018],
'cparty':['cparty1', 'cparty1', 'cparty1', 'cparty2', 'cparty2', 'cparty2', 'cparty2', 'cparty3', 'cparty4', 'cparty4', 'cparty4', 'cparty4'],
'sales':[230, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
})
df
###
year cparty sales
0 2022 cparty1 230
1 2022 cparty1 100
2 2018 cparty1 200
3 2019 cparty2 300
4 2020 cparty2 400
5 2021 cparty2 500
6 2022 cparty2 600
7 2022 cparty3 700
8 2022 cparty4 800
9 2021 cparty4 900
10 2019 cparty4 1000
11 2018 cparty4 1100
output = df.groupby(['year','cparty']).sum()
output
###
sales
year cparty
2018 cparty1 200
cparty4 1100
2019 cparty2 300
cparty4 1000
2020 cparty2 400
2021 cparty2 500
cparty4 900
2022 cparty1 330
cparty2 600
cparty3 700
cparty4 800
Filter by year
final = output.query('year == 2022')
final
###
sales
year cparty
2022 cparty1 330
cparty2 600
cparty3 700
cparty4 800
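For completeness, a sketch of the same aggregation written with pivot_table, closer to the original attempt; the key difference from the question's code is that sales appears only in values, not also in index:
output = pd.pivot_table(df, index=['year', 'cparty'], values='sales', aggfunc='sum')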
Have figured out the issue.
result = res.groupby(['Year', 'cparty']).sum()
output = result.query('Year == 2022')
output
##
sales
Year cparty
2022 3 -20409.04
4 12064.34
5 9656.64
8081 51588.55
8099 5625.22
... ...
Baron's groupby method was the way to go. The issue is that it only works if I have all the data in 1 CSV from the beginning. I was trying to add the year manually for the 2 new CSVs that I concat to the base, setting Year = 2022. The errors come when I concat the 3 different CSVs. If I don't add the year = 2022, it works, giving this:
cparty sales Year
87174 3 -3.89 2022.0
27 3 -20,405.15 NaN
If I do .fillna(2022), then it won't work as expected.
C:\Users\user\AppData\Local\Temp/ipykernel_14252/1015456002.py:32: FutureWarning: Dropping invalid columns in DataFrameGroupBy.add is deprecated. In a future version, a TypeError will be raised. Before calling .add, select only columns which should be valid for the function.
result = fin.groupby(['Year', 'cparty']).sum()
cparty sales Year
87174 3 -3.89 2022.0
27 3 -20,405.15 2022.0
It adds the year but does not do the sum, so I never get the single row 'cparty' 3, 'sales' -20,409.04, Year 2022.
Any feedback appreciated.
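The FutureWarning above hints that sales is non-numeric: the displayed values such as -20,405.15 contain thousands separators, which pandas reads as strings from the CSVs, so groupby drops or mishandles the column when summing. A minimal cleanup sketch, assuming the fin frame built earlier:
# strip thousands separators so 'sales' becomes numeric; unparsable entries become NaN
fin['sales'] = pd.to_numeric(fin['sales'].astype(str).str.replace(',', ''), errors='coerce')
fin['Year'] = fin['Year'].fillna(2022).astype(int)
result = fin.groupby(['Year', 'cparty'])['sales'].sum()
With a numeric sales column, the duplicate cparty rows should collapse into one summed value per year.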

Pandas Avoid Multidimensional Key Error Comparing 2 Dataframes

I am stuck on a multidimensional key ValueError. I have a dataframe that looks like this:
year RMSE index cyear Corr_to_CY
0 2000 0.279795 5 1997 0.997975
1 2011 0.299011 2 1994 0.997792
2 2003 0.368341 1 1993 0.977143
3 2013 0.377902 23 2015 0.824441
4 1999 0.41495 10 2002 0.804633
5 1997 0.435813 8 2000 0.752724
6 2018 0.491003 24 2016 0.703359
7 2002 0.505771 3 1995 0.684926
8 2009 0.529308 17 2009 0.580481
9 2015 0.584146 27 2019 0.556555
10 2004 0.620946 26 2018 0.500790
11 2016 0.659388 22 2014 0.443543
12 1993 0.700942 19 2011 0.431615
13 2006 0.748086 11 2003 0.375111
14 2007 0.766675 21 2013 0.323143
15 2020 0.827913 12 2004 0.149202
16 2014 0.884109 7 1999 0.002438
17 2012 0.900184 0 1992 -0.351615
18 1995 0.919482 28 2020 -0.448915
19 1992 0.930512 20 2012 -0.563762
20 2001 0.967834 18 2010 -0.613170
21 2019 1.00497 9 2001 -0.677590
22 2005 1.00885 13 2005 -0.695690
23 2010 1.159125 14 2006 -0.843122
24 2017 1.173262 15 2007 -0.931034
25 1994 1.179737 6 1998 -0.939697
26 2008 1.212915 25 2017 -0.981626
27 1996 1.308853 16 2008 -0.985893
28 1998 1.396771 4 1996 -0.999990
I have selected the rows where 'Corr_to_CY' >= 0.70 and returned the values of the 'cyear' column into a new df called 'cyears'. I need to use this as an index to find the year and RMSE values where the 'year' column is in the cyears df. This is my best attempt, and I get the ValueError: cannot index with multidimensional key. Do I need to change the index df "cyears" to something else (series, list, etc.) for this to work? Thank you, and here is my code that produces the error:
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
cyears = cyears.to_frame()
result = comp.loc[comp['year'] == cyears,'RMSE']
ValueError: Cannot index with multidimensional key
The error occurs because cyears was converted to a DataFrame, which is a two-dimensional key that .loc cannot use. You can use the isin method instead:
import pandas as pd
# Sample creation
import io
comp = pd.read_csv(io.StringIO('year,RMSE,index,cyear,Corr_to_CY\n2000,0.279795,5,1997,0.997975\n2011,0.299011,2,1994,0.997792\n2003,0.368341,1,1993,0.977143\n2013,0.377902,23,2015,0.824441\n1999,0.41495,10,2002,0.804633\n1997,0.435813,8,2000,0.752724\n2018,0.491003,24,2016,0.703359\n2002,0.505771,3,1995,0.684926\n2009,0.529308,17,2009,0.580481\n2015,0.584146,27,2019,0.556555\n2004,0.620946,26,2018,0.500790\n2016,0.659388,22,2014,0.443543\n1993,0.700942,19,2011,0.431615\n2006,0.748086,11,2003,0.375111\n2007,0.766675,21,2013,0.323143\n2020,0.827913,12,2004,0.149202\n2014,0.884109,7,1999,0.002438\n2012,0.900184,0,1992,-0.351615\n1995,0.919482,28,2020,-0.448915\n1992,0.930512,20,2012,-0.563762\n2001,0.967834,18,2010,-0.613170\n2019,1.00497,9,2001,-0.677590\n2005,1.00885,13,2005,-0.695690\n2010,1.159125,14,2006,-0.843122\n2017,1.173262,15,2007,-0.931034\n1994,1.179737,6,1998,-0.939697\n2008,1.212915,25,2017,-0.981626\n1996,1.308853,16,2008,-0.985893\n1998,1.396771,4,1996,-0.999990\n'))
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
result = comp.loc[comp['year'].isin(cyears),'RMSE']
If you want to keep cyears as pandas DataFrame instead of Series, try the following:
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7, ['cyear']]
result = comp.loc[comp['year'].isin(cyears.cyear),'RMSE']
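With the sample frame above, both variants pick the seven rows whose Corr_to_CY is at least 0.7 and then the matching years, so the result should look roughly like this:
print(result)
# 0     0.279795
# 5     0.435813
# 7     0.505771
# 9     0.584146
# 11    0.659388
# 12    0.700942
# 25    1.179737
# Name: RMSE, dtype: float64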

line plot by month and year using FacetGrid

I have this dataset:
year  month  victims
2016  7   4869
2016  8   4817
2016  9   4900
2016  10  4873
2016  11  4461
2016  12  4908
2017  1   4717
2017  2   4489
2017  3   4733
2017  4   4549
2017  5   4928
2017  6   4767
2017  7   4713
2017  8   4992
2017  9   4885
2017  10  5049
2017  11  4861
2017  12  4667
...
I want to plot the number of victims by year and month.
I used this code:
import seaborn as sb
import matplotlib.pyplot as plt

sb.lineplot(data=number_victims, x='month', y='victims', hue='year')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='month', title_fontsize=12)
This is the result:
I tried to use FacetGrid to get a better view of each year.
This is my code:
g = sb.FacetGrid(number_victims, col="year", margin_titles=True, col_wrap=3, height=3, aspect=2)
g.map_dataframe(sb.lineplot, x="month", y="victims")
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('The overall trend of victims number by month and year');
And this is the result:
So I have 2 questions:
How do I sort the months in the first graph (so they start from 1 and go to 12)?
Why are the month labels in the second graph only from 1 to 10? And in 2016 it is supposed to start from month #7, not #1; how can I fix that?
Thank you so much.
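A likely cause of both symptoms, though it is not confirmed in the thread, is that month is not stored as a plain integer, so it sorts lexically and each facet only shows the categories present in it. A minimal sketch, assuming the number_victims frame above:
# cast month to int so the x axis orders 1..12 numerically
number_victims['month'] = number_victims['month'].astype(int)
g = sb.FacetGrid(number_victims, col="year", margin_titles=True, col_wrap=3, height=3, aspect=2)
g.map_dataframe(sb.lineplot, x="month", y="victims")
g.set(xticks=range(1, 13))  # label every month; years without data for a month simply show a gap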

Calculate the percentage change between two rows into a view in Postgres

I have a table with yearly values for 200+ countries. For a graphical representation, I'd like to get the percentage change between two specific years, 1990 and 2013.
The table looks a bit like this:
id_country year value
886 2002 161.348
886 2003 161.348
886 2004 176.016
886 2005 176.016
886 2006 179.683
886 2007 183.35
886 2008 201.685
886 2009 227.354
886 2010 234.688
886 2011 245.689
886 2012 293.36
886 2013 440.04
620 1990 40.337
620 1991 1056.096
620 1992 1151.438
620 1993 1389.793
620 1994 1584.144
620 1995 1631.815
620 1996 1749.159
620 1997 1796.83
620 1998 1906.84
620 1999 1664.818
620 2000 1642.816
620 2001 2016.85
620 2002 1760.16
620 2003 1873.837
620 2004 1961.845
620 2005 2310.21
620 2006 2328.545
620 2007 2361.548
620 2008 3329.636
620 2009 3069.279
620 2010 3098.615
620 2011 2823.59
620 2012 3373.64
620 2013 2948.268
I thought the best way would be to produce a VIEW with the id_country, which calculates that difference. But I have no clue what that query would look like. It must SELECT all countries, and then divide the value for year = 2013 by the value for year = 1990 for each of these countries.
It could get more complicated, as there are multiple variables in that table (represented by additional columns) which would need to be filtered by those additional column values, like id_source = 1 or id_source = 2, or id_sector = 1 or id_sector = 2.
Any help is very much appreciated!
One way, probably fastest:
CREATE VIEW pct_2013_1990 AS
SELECT id_country
, (sum(value) FILTER (WHERE year = 2013) * 100)
/ NULLIF(sum(value) FILTER (WHERE year = 1990), 0) AS pct
FROM tbl
WHERE year IN (1990, 2013)
AND id_source = 1 -- ??
GROUP BY id_country
-- ORDER BY ???
This assumes you have a value > 0 for every country in year 1990, else you get a division by zero. I defend against that with NULLIF in the example. The result is NULL in this case.
pct is the percentage for the 2013 value as compared to 1990. To get the percentage change, you would subtract 100 from it. Not sure what you need exactly.
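For example, with the sample data above, country 620 would get pct = 100 * 2948.268 / 40.337, which is roughly 7309, i.e. a change of about +7209 % between 1990 and 2013.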
You might use round() to reduce fractional digits.
The aggregate FILTER clause was introduced with Postgres 9.4:
How can I simplify this game statistics query?
In older versions you can substitute with CASE expressions.
You could use a set-returning function instead and parameterize the years to make it work for any set of years.
CREATE FUNCTION f_pct_calc(year1 integer, year2 integer)
RETURNS TABLE(id_country int, pct numeric) AS
$func$
SELECT t.id_country
, (sum(t.value) FILTER (WHERE year = $2) * 100)
/ NULLIF(sum(t.value) FILTER (WHERE year = $1), 0) AS pct
FROM tbl t
WHERE t.year IN ($1, $2)
AND t.id_source = 1 -- ??
GROUP BY t.id_country
-- ORDER BY ???
$func$ LANGUAGE sql STABLE;
Call:
SELECT * FROM f_pct_calc(1990, 2013);

Sum of Previous Yr

I have a simple query which does the below:
SELECT
    B.WEEK_DT AS WEEK_DT,
    SUM(A.PROFIT) AS PROFIT
FROM CUSTOMERS A
INNER JOIN WEEK_TABLE B
    ON A.WEEK_ID = B.WEEK_ID
GROUP BY B.WEEK_DT
Now, I want to extend this query to get the sum of profit for all of year 2013. That is, the above gives me values at the weekly level, and I also want a separate column, 2013_Profit, which sums up all weeks of the previous year.
WEEK_DT is in the format mm-dd-yyyy.
Also, we have an offset in the week table, if that helps:
WK_OFFSET  WK_DT
-13 February 22, 2014
-12 March 1, 2014
-11 March 8, 2014
-10 March 15, 2014
-9 March 22, 2014
-8 March 29, 2014
-7 April 5, 2014
-6 April 12, 2014
-5 April 19, 2014
-4 April 26, 2014
-3 May 3, 2014
-2 May 10, 2014
-1 May 17, 2014
Please let me know how I can get another column for each customer which gives the sum of the previous year's profits.
Something like the below:
Customer Curr_WK_Profit Prev_YR_Profit
AAA 10 520
BBB 20 1040
CCC 30 1560