Pandas: Group by two columns to get sum of another column - pandas

I looked through most of the previously asked questions but was not able to find an answer to mine:
I have the following data frame:
id year month score num_attempts
0 483625 2010 01 50 1
1 967799 2009 03 50 1
2 213473 2005 09 100 1
3 498110 2010 12 60 1
5 187243 2010 01 100 1
6 508311 2005 10 15 1
7 486688 2005 10 50 1
8 212550 2005 10 500 1
10 136701 2005 09 25 1
11 471651 2010 01 50 1
I want to get the following data frame:
year month sum_score sum_num_attempts
2009 03 50 1
2005 09 125 2
2010 12 60 1
2010 01 200 3
2005 10 565 3
Here is what I tried:
sum_df = df.groupby(by=['year','month'])['score'].sum()
But this doesn't look efficient or correct. If more than one column needs to be aggregated, this seems like a very expensive call; for example, if I have another column, num_attempts, and want to sum it by year and month in the same way as score.

This should be an efficient way:
sum_df = df.groupby(['year','month']).agg({'score': 'sum', 'num_attempts': 'sum'})
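A minimal sketch of that call on the sample data from the question. The rename() to sum_score / sum_num_attempts and the reset_index() are additions that are not part of the answer above; they are included only to match the column layout the question asks for. Storing month as a string is an assumption made to keep the leading zeros.

```python
import pandas as pd

# Sample rows copied from the question; month is a string (assumption) to keep "01", "09", ...
df = pd.DataFrame({
    "id": [483625, 967799, 213473, 498110, 187243,
           508311, 486688, 212550, 136701, 471651],
    "year": [2010, 2009, 2005, 2010, 2010, 2005, 2005, 2005, 2005, 2010],
    "month": ["01", "03", "09", "12", "01", "10", "10", "10", "09", "01"],
    "score": [50, 50, 100, 60, 100, 15, 50, 500, 25, 50],
    "num_attempts": [1] * 10,
})

sum_df = (
    df.groupby(["year", "month"])
      .agg({"score": "sum", "num_attempts": "sum"})   # one pass, two aggregated columns
      .rename(columns={"score": "sum_score", "num_attempts": "sum_num_attempts"})
      .reset_index()                                   # turn the group keys back into columns
)
print(sum_df)
```

The dict passed to agg() sums only the listed columns, so id is left out. Selecting the columns first, as in df.groupby(['year', 'month'])[['score', 'num_attempts']].sum(), is an equivalent spelling.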

Related

How to create one variable conditional on other variables in R

I have a very big data-frame like the following:
Region year prate
1 2005 24
1 2006 17
1 2007 56
2 2005 13
2 2006 65
2 2007 43
3 2005 91
3 2006 65
3 2007 12
.....
I want to create a new variable called prate07 that equals prate for rows where year is 2007 and 0 for all other years. Something like the following:
Region year prate prate07
1 2005 24 0
1 2006 17 0
1 2007 56 56
2 2005 13 0
2 2006 65 0
2 2007 43 43
3 2005 91 0
3 2006 65 0
3 2007 12 12
.....
Could someone please help me find the code for it?
Thanks in advance for the help.
I used the following code, but it does not work:
library(tidyverse)
dat2 <- dat %>%
mutate(group2 = str_c("p_rate", year), prate07 = prate) %>%
spread(group2, prate07, fill = 0)

pandas - return Month containing Max value for each year

I have a dataframe like:
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 8
2019 5 87
2019 6 1
I'd like the dataframe to return the Month and Value for each year where the value is the maximum:
year month value
2017 1 100
2018 3 88
2019 5 87
I've attempted something like df = df.groupby(["Year", "Month"])["Value"].max(); however, it returns the full data set because each Year/Month pair is unique (I believe).
You can get the index where the top Value occurs with .groupby(...).idxmax() and use that to index into the original dataframe:
In [28]: df.loc[df.groupby("Year")["Value"].idxmax()]
Out[28]:
Year Month Value
0 2017 1 100
3 2018 3 88
5 2019 5 87
Here is a solution that also handles the possibility of duplicates (several rows tied for the maximum in the same year):
m = df.groupby('Year')['Value'].transform('max') == df['Value']
dfmax = df.loc[m]
Full example:
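import io
import pandas as pd
data = '''\
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 88
2019 5 87
2019 6 1'''
# pd.compat.StringIO is not available in recent pandas versions; io.StringIO does the same job
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+')
# True for every row whose Value equals its year's maximum, so ties are all kept
m = df.groupby('Year')['Value'].transform('max') == df['Value']
print(df[m])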
Year Month Value
0 2017 1 100
3 2018 3 88
4 2018 4 88
5 2019 5 87

SQL Method to report next period value for this period

This may already be answered, but I can't figure out the correct search terms for what I need. We store values by Year / Period for the Beginning of Month (BOM). The BOM for one month is the same value as End of Month (EOM) for the previous month. I need a way to report this as such.
So 2018-02 BOM = 2018-01 EOM.
I thought I might be able to use something simple, but it does not account for the month/year wrap at 12 months as those fields are numerical.
select yr as YEAR, (pd-1) as PERIOD, sum(BOM) as EOM
from Table1
where type = '3'
group by yr, pd
order by yr desc, pd desc
This works for the middle months, but not for January, which becomes 2018-0 instead of 2017-12.
Example Data
Yr Pd Type BOM
18 02 3 100
18 02 3 100
18 02 2 200
18 02 2 100
18 01 3 100
18 01 3 100
18 01 2 200
18 01 2 100
18 01 3 100
18 01 2 300
17 12 3 100
17 12 3 200
17 12 2 300
17 12 3 200
17 12 2 100
17 11 3 300
17 11 2 400
17 11 3 400
17 11 2 100
So the results I am looking for would be:
Yr Pd EOM
18 01 200
17 12 300
17 11 500
17 10 700
I'm working in System iNavigator currently, but hoping to move this into an externally connected Excel query at some point.
Your DB2 database should support CASE WHEN, which can be used to compute the previous period's year and period number, with period 1 wrapping back to period 12 of the prior year.
For example:
select
CASE WHEN pd = 1 THEN yr - 1 ELSE yr END as Yr,
CASE WHEN pd = 1 THEN 12 ELSE pd - 1 END as Pd,
SUM(BOM) as EOM
from Table1
where type = '3'
group by yr, pd
order by yr desc, pd desc
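As a rough check of the CASE WHEN approach, the sketch below runs an equivalent query against the example data using Python's built-in sqlite3 module. SQLite stands in for DB2 here, and the lowercase table and column names, the TEXT type for the type column, and the eom_yr / eom_pd aliases are assumptions chosen for the illustration (distinct aliases avoid any ambiguity with the yr and pd columns in GROUP BY and ORDER BY).

```python
import sqlite3

# Example rows from the question: (yr, pd, type, bom)
rows = [
    (18, 2, "3", 100), (18, 2, "3", 100), (18, 2, "2", 200), (18, 2, "2", 100),
    (18, 1, "3", 100), (18, 1, "3", 100), (18, 1, "2", 200), (18, 1, "2", 100),
    (18, 1, "3", 100), (18, 1, "2", 300),
    (17, 12, "3", 100), (17, 12, "3", 200), (17, 12, "2", 300),
    (17, 12, "3", 200), (17, 12, "2", 100),
    (17, 11, "3", 300), (17, 11, "2", 400), (17, 11, "3", 400), (17, 11, "2", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (yr INTEGER, pd INTEGER, type TEXT, bom INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)

query = """
SELECT CASE WHEN pd = 1 THEN yr - 1 ELSE yr END     AS eom_yr,  -- January rolls back a year
       CASE WHEN pd = 1 THEN 12     ELSE pd - 1 END AS eom_pd,  -- and becomes period 12
       SUM(bom)                                     AS eom
FROM table1
WHERE type = '3'
GROUP BY yr, pd
ORDER BY yr DESC, pd DESC
"""
result = conn.execute(query).fetchall()
for eom_yr, eom_pd, eom in result:
    print(f"{eom_yr} {eom_pd:02d} {eom}")
```

Grouping by the original yr and pd is enough, because each (yr, pd) pair maps to exactly one shifted pair.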

Add column value to next column in SQL

My SQL table is:
Week Year Applications
1 2017 0
2 2017 10
3 2017 20
4 2017 50
5 2017 0
1 2018 10
2 2018 0
3 2018 40
4 2018 50
5 2018 10
And I want a SQL query which gives the output below:
Week Year Applications
1 2017 0
2 2017 10
3 2017 30
4 2017 80
5 2017 80
1 2018 10
2 2018 10
3 2018 50
4 2018 100
5 2018 110
Can anyone help me write this query?
You could use SUM() OVER to get cumulative sum:
SELECT *, SUM(Applications) OVER(PARTITION BY Year ORDER BY Week)
FROM tab
It looks like you want a cumulative sum:
select week, year,
sum(applications) over (partition by year order by week) as cumulative_applications
from t;
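For cross-checking the expected output outside the database, the same running total can be sketched in pandas. This is only an illustration on the sample rows from the question, not part of either answer; the column name cumulative_applications is borrowed from the second query.

```python
import pandas as pd

df = pd.DataFrame({
    "Week": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "Year": [2017] * 5 + [2018] * 5,
    "Applications": [0, 10, 20, 50, 0, 10, 0, 40, 50, 10],
})

# Sort so the running total follows week order within each year,
# mirroring ORDER BY Week inside the window.
df = df.sort_values(["Year", "Week"])

# groupby(...).cumsum() restarts the total for every Year, like PARTITION BY Year.
df["cumulative_applications"] = df.groupby("Year")["Applications"].cumsum()
print(df)
```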

Count and where conditions lead to performance issues?

I am working on a table with a million rows. The table looks like this:
Departement year Candidate Spent Saved
Electrical 2013 A 50 50
Electrical 2013 B 25 50
Electrical 2013 C 11 50
Electrical 2013 D 25 0
Electrical 2013 Dt 86 50
Electrical 2014 AA 50 50
Electrical 2014 BB 25 0
Electrical 2014 CH 11 50
Electrical 2014 DG 25 0
Electrical 2014 DH 0 50
Computers 2013 Ax 50 50
Computers 2013 Bc 25 50
Computers 2013 Cx 11 50
Computers 2013 Dx 25 0
Computers 2013 Dx 86 50
I am looking for output like the following:
Departement year NoOfCandidates NoOfCandidatesWith50$save NoOfCandidatesWith0$save
Electrical 2013 5 4 1
Electrical 2014 5 3 2
Computers 2013 5 4 1
I am using a #TEMP table for each count with its own WHERE condition and then left outer joining them all at the end, so it takes a long time.
Is there a way to make this perform better for the table above?
Thanks in advance.
You want to do this as a single aggregation query. There is no need for temporary tables:
select department, year, count(*) as NumCandidates,
sum(case when saved = 50 then 1 else 0 end) as NumCandidatesWith50Save,
sum(case when saved = 0 then 1 else 0 end) as NumCandidatesWith00Save
from table t
group by department, year
order by 1, 2;
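The single-pass conditional aggregation can be tried on the sample rows with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for the asker's database, and the table name candidates, the column spelling department, and the snake_case aliases are made up for the illustration.

```python
import sqlite3

# Sample rows from the question: (department, year, candidate, spent, saved)
rows = [
    ("Electrical", 2013, "A", 50, 50), ("Electrical", 2013, "B", 25, 50),
    ("Electrical", 2013, "C", 11, 50), ("Electrical", 2013, "D", 25, 0),
    ("Electrical", 2013, "Dt", 86, 50),
    ("Electrical", 2014, "AA", 50, 50), ("Electrical", 2014, "BB", 25, 0),
    ("Electrical", 2014, "CH", 11, 50), ("Electrical", 2014, "DG", 25, 0),
    ("Electrical", 2014, "DH", 0, 50),
    ("Computers", 2013, "Ax", 50, 50), ("Computers", 2013, "Bc", 25, 50),
    ("Computers", 2013, "Cx", 11, 50), ("Computers", 2013, "Dx", 25, 0),
    ("Computers", 2013, "Dx", 86, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidates "
    "(department TEXT, year INTEGER, candidate TEXT, spent INTEGER, saved INTEGER)"
)
conn.executemany("INSERT INTO candidates VALUES (?, ?, ?, ?, ?)", rows)

# One scan of the table: each CASE contributes 1 only for rows matching its condition.
query = """
SELECT department, year,
       COUNT(*)                                    AS num_candidates,
       SUM(CASE WHEN saved = 50 THEN 1 ELSE 0 END) AS num_with_50_save,
       SUM(CASE WHEN saved = 0  THEN 1 ELSE 0 END) AS num_with_0_save
FROM candidates
GROUP BY department, year
ORDER BY department, year
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because every count comes out of the same GROUP BY, no temporary tables or joins are needed; adding another condition is one more SUM(CASE ...) column.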