Cumulative Differences Per Group in R/SQL

I have this dataset in a SQL Server database that I access from R:
name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
I am trying to add two columns:
Column 1: the number of missing years between successive rows for each person.
Column 2: the cumulative sum of the missing years for each person.
For example - the first instance of each person will be 0, and then:
# note: in this specific example that I have created, "missing_years" is the same as "cumulative_missing_years"
name year missing_years cumulative_missing_years
1 john 2010 0 0
2 john 2011 0 0
3 john 2013 1 1
4 jack 2015 0 0
5 jack 2018 3 3
6 henry 2010 0 0
7 henry 2011 0 0
8 henry 2012 0 0
I think this can be done with a "grouped cumulative difference" and "grouped cumulative sums":
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
final <- my_data %>%
  group_by(name) %>%
  arrange(year) %>%
  mutate(missing_years = year - lag(year, default = first(year))) %>%
  mutate(cumulative_missing_years = cumsum(missing_years))
But I am not sure if I am doing this correctly.
Ideally, I am looking for an SQL approach or an R approach (e.g., via dbplyr) that can be used to interact with the dataset.
Can someone please suggest an approach for doing this?
Thank you!

Using the data in the Note at the end, perform a left self join to get the next year for the same name, then subtract and take the cumulative sum.
library(sqldf)
sqldf("select a.*,
coalesce(min(b.year) - a.year - 1, 0) as missing,
sum(coalesce(min(b.year) - a.year - 1, 0)) over
(partition by a.name order by a.year) as sum
from DF a
left join DF b on a.name = b.name and a.year < b.year
group by a.name, a.year
order by a.name, a.year")
giving:
name year missing sum
1 henry 2010 0 0
2 henry 2011 0 0
3 henry 2012 0 0
4 jack 2015 2 2
5 jack 2018 0 2
6 john 2010 0 0
7 john 2011 1 1
8 john 2013 0 1
Note
Lines <- "name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
"
DF <- read.table(text = Lines)
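If the SQLite that ships with RSQLite is recent enough (3.25+) to support window functions, a variant of the same idea that avoids the self join is to use lag directly. This is an untested sketch; note it attributes each gap to the later year, matching the expected output in the question:
library(sqldf)
sqldf("select name, year, missing,
         sum(missing) over (partition by name order by year) as sum
       from (select name, year,
                    coalesce(year - lag(year) over (partition by name order by year) - 1, 0) as missing
             from DF)
       order by name, year")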

I hope this helps:
library(dplyr)
name <- c(rep("John", 3), rep("jack", 2), rep("henry", 3))
year <- c(2010, 2011, 2013, 2015, 2018, 2010, 2011, 2012)
dt <- data.frame(name = name, year = year)
# first group the data by name, then order by year, then mutate
dt <- dt %>%
  group_by(name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(mis_yr = if_else(is.na(year - lag(year, n = 1L) - 1), 0,
                          year - lag(year, n = 1L) - 1),
         cum_yr = cumsum(mis_yr)) %>%
  ungroup()
Here is the outcome:
name year mis_yr cum_yr
<chr> <dbl> <dbl> <dbl>
1 henry 2010 0 0
2 henry 2011 0 0
3 henry 2012 0 0
4 jack 2015 0 0
5 jack 2018 2 2
6 John 2010 0 0
7 John 2011 0 0
8 John 2013 1 1
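Since the question also asks about dbplyr, here is a rough, untested sketch of the same logic pushed down to the database; it assumes con is a DBI connection and the data lives in a table called my_data (names taken from the question). dbplyr translates lag() and cumsum() into LAG and SUM ... OVER window functions:
library(dplyr)
library(dbplyr)
tbl(con, "my_data") %>%
  group_by(name) %>%
  window_order(year) %>%  # ordering used by the window functions
  mutate(missing_years = coalesce(year - lag(year) - 1, 0)) %>%
  mutate(cumulative_missing_years = cumsum(missing_years)) %>%
  ungroup() %>%
  collect()  # pull the result back into R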

Related

Merge rows and convert a string in row value to a user-defined one when condition related to other columns is matched

Assuming I'm dealing with this dataframe:
ID  Qualified  Year  Amount A  Amount B
1   No         2020  0         150
1   No         2019  0         100
1   Yes        2019  10        15
1   No         2018  0         100
1   Yes        2018  10        150
2   Yes        2020  0         200
2   No         2017  0         100
... ...        ...   ...       ...
My desired output should be like this:
ID  Qualified  Year  Amount A  Amount B
1   No         2020  0         150
1   Partial    2019  10        115
1   Partial    2018  10        250
2   Yes        2020  0         200
2   No         2017  0         100
... ...        ...   ...       ...
As you can see, the Qualified column gets a new merged value (Yes & No -> Partial) and the amounts are summed, under this condition: a year within an ID includes both Yes and No in the Qualified column.
I don't know how to approach this. Could anyone suggest a methodology?
You can use the functions agg() and groupby() to perform this operation.
agg() allows you to use not only common aggregation functions (such as sum, mean, etc.) but also custom defined functions.
I would do as follows:
def agg_qualify(x):
    values = x.unique()
    # more than one distinct value (both Yes and No) -> Partial
    if len(values) > 1:
        return 'Partial'
    return values[0]

df.groupby(['ID', 'Year']).agg({
    'Qualified': agg_qualify,
    'Amount A': 'sum',
    'Amount B': 'sum',
}).reset_index()
Output:
ID Year Qualified Amount A Amount B
0 1 2018 Partial 10 250.0
1 1 2019 Partial 10 115.0
2 1 2020 No 0 150.0
3 2 2020 Yes 0 200.0

Create time variable based on binary variable using Stata

My dataset contains a year, ID, and binary value variable.
ID  Year  Value
1   2000  0
1   2001  0
1   2002  1
1   2003  1
1   2004  1
1   2005  1
Using Stata, I would like to create a new variable "YearValue" that holds, for each ID, the value of "Year" from the row where "Value" first turned 1.
ID  Year  Value  YearValue
1   2000  0      2002
1   2001  0      2002
1   2002  1      2002
1   2003  1      2002
1   2004  1      2002
1   2005  1      2002
Thank you for your help!
egen wanted = min(cond(Value == 1, Year, .)), by(ID)
In cond(Value == 1, Year, .), Year is returned where Value is 1 and missing otherwise, so min() picks the earliest such year within each ID. See https://www.stata-journal.com/article.html?article=dm0055 (especially Section 9) for this technique in context.
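For readers following along in R (the language of the main question), a roughly equivalent dplyr sketch, assuming the data frame is called df; note that min() would return Inf (with a warning) for any ID that never has Value == 1:
library(dplyr)
df %>%
  group_by(ID) %>%
  # earliest Year in each ID where Value is 1
  mutate(YearValue = min(Year[Value == 1])) %>%
  ungroup()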

Name-Specific Variability Calculations Pandas

I'm trying to calculate variability statistics from two DataFrames: one with current data and one with average data for each month. Suppose I have a df "DF1" that looks like this:
Name year month output
0 A 1991 1 10864.8
1 A 1997 2 11168.5
2 B 1994 1 6769.2
3 B 1998 2 3137.91
4 B 2002 3 4965.21
and a df called "DF2" that contains monthly averages from multiple years such as:
Name month output_average
0 A 1 11785.199
1 A 2 8973.991
2 B 1 8874.113
3 B 2 6132.176667
4 B 3 3018.768
and I need a new df, call it "DF3", that should look like this, with the calculation specific to each change in the "Name" column and each change in "month":
Name year month Variability
0 A 1991 1 -0.078097875
1 A 1997 2 0.24454103
2 B 1994 1 -0.237197002
3 B 1998 2 -0.488287737
4 B 2002 3 0.644782
I have tried options like the one below, but I get errors about a duplicated axis or key errors:
DF3['variability'] =
((DF1.output / DF2.set_index('month')['output_average'].reindex(DF1['name']).values) - 1)
Thank you for your help in learning row calculations in Python, coming from MATLAB!
When matching on two columns, it is better to use merge instead of set_index:
df3 = df1.merge(df2, on=['Name','month'], how='left')
df3['variability'] = df3['output']/df3['output_average'] - 1
Output:
Name year month output output_average variability
0 A 1991 1 10864.80 11785.199000 -0.078098
1 A 1997 2 11168.50 8973.991000 0.244541
2 B 1994 1 6769.20 8874.113000 -0.237197
3 B 1998 2 3137.91 6132.176667 -0.488288
4 B 2002 3 4965.21 3018.768000 0.644780

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header = 0) for file in f]) # f is a list of csv paths; merge all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
          inplace = True) # drop rows that have no data (NaN) in start_hour and variable
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
from pandas.tseries.offsets import MonthEnd  # needed for the month-end offset below

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
[1]: https://i.stack.imgur.com/ueVrG.png
I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31

remove outliers by group in sql

In my SQL Server table, I must delete outliers for each group separately. Here are my columns:
select
customer,
sku,
stuff,
action,
acnumber,
year
from
mytable
Sample data:
customer sku year stuff action
-----------------------------------
1 1 2 2017 10 0
2 1 2 2017 20 1
3 1 3 2017 30 0
4 1 3 2017 40 1
5 2 4 2017 50 0
6 2 4 2017 60 1
7 2 5 2017 70 0
8 2 5 2017 80 1
9 1 2 2018 10 0
10 1 2 2018 20 1
11 1 3 2018 30 0
12 1 3 2018 40 1
13 2 4 2018 50 0
14 2 4 2018 60 1
15 2 5 2018 70 0
16 2 5 2018 80 1
I must delete outliers from the stuff variable, but separately for each customer+sku+year group.
Everything below the 25th percentile and above the 75th percentile should be considered an outlier, and this principle must be respected within each group.
How can I clean the dataset for further work?
Note that this dataset also contains the variable action (it takes values 0 and 1). It is not a grouping variable, but outliers must be deleted only for the ZERO (0) category of the action variable.
In R this would be done as:
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
new <- remove_outliers(vpg$stuff)
vpg <- cbind(new, vpg)
Something like this, maybe. Window functions such as PERCENT_RANK cannot be used in a WHERE clause, so rank the rows in a CTE first, restricted to the action = 0 rows, and then delete through the CTE:
WITH ranked AS (
    SELECT *,
           PERCENT_RANK() OVER (PARTITION BY customer, sku, [year]
                                ORDER BY stuff) AS pr
    FROM mytable
    WHERE [action] = 0
)
DELETE FROM ranked
WHERE pr < 0.25 OR pr > 0.75;
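On the R side, note that remove_outliers above is applied to the whole column at once. Here is a dplyr sketch of the grouped version, assuming the data frame is called vpg as in the question; the quantile bounds here are computed from all rows of each group, so filter to action == 0 first if the bounds should come only from those rows:
library(dplyr)
vpg_clean <- vpg %>%
  group_by(customer, sku, year) %>%
  # blank out outlying stuff values, but only on rows where action is 0
  mutate(stuff = if_else(action == 0, remove_outliers(stuff), stuff)) %>%
  ungroup()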