I'm running into trouble trying to replicate SQL window functions in R, in particular when creating rolling sums over a specified number of prior months.
While the sqldf package in R allows for data manipulation, it doesn't seem to support window functions.
I have some mock data in R:
library(data.table)
set.seed(10)
data_1 <- data.table(Cust_ID = c(1,1,1,1,2,2,2,2,3,3,3,3),
                     Month = c(4,3,2,1,4,3,2,1,4,3,2,1),
                     StatusCode = LETTERS[4:6],
                     SalesValue = round(runif(12, 50, 1500)))
Cust_ID Month StatusCode SalesValue
1 4 D 786
1 3 E 495
1 2 F 669
1 1 D 1055
2 4 E 173
2 3 F 377
2 2 D 448
2 1 E 445
3 4 F 943
3 3 D 673
3 2 E 995
3 1 F 873
For each row, I would like to create a cumulative sum of values pertaining to the customer (Cust_ID), for the prior 2 months (not including the current month).
This would mean that for each customer, rows with Months 1 & 2 should be NA (given there aren't 2 preceding months), Month 3 should contain the summed SalesValue of Months 1 & 2 for that customer, and Month 4 the summed SalesValue of Months 2 & 3.
In SQL, I would use syntax similar to the following:
SUM(SalesValue) OVER (PARTITION BY Cust_ID ORDER BY Month DESC ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) AS PAST_3Y_SALES
Is there a way to achieve this in R - ideally using data.table (for efficiency)? Any guidance would be much appreciated.
PS Note: this is mock data; in my 'real' data customers have different data volumes - i.e. some customers have 5 months' worth of data, others have >36 months' worth, etc.
Since the OP has used data.table, here is a solution using RcppRoll::roll_sumr within the scope of data.table:
library(data.table)
library(RcppRoll)
# Order on 'Cust_ID' and 'Month'
setkeyv(data_1,c("Cust_ID","Month"))
data_1[, Sum_prev := shift(roll_sumr(SalesValue, n = 2)), by = Cust_ID]
data_1
# Cust_ID Month StatusCode SalesValue Sum_prev
# 1: 1 1 D 1055 NA
# 2: 1 2 F 669 NA
# 3: 1 3 E 495 1724
# 4: 1 4 D 786 1164
# 5: 2 1 E 445 NA
# 6: 2 2 D 448 NA
# 7: 2 3 F 377 893
# 8: 2 4 E 173 825
# 9: 3 1 F 873 NA
# 10: 3 2 E 995 NA
# 11: 3 3 D 673 1868
# 12: 3 4 F 943 1668
The approach is to first calculate a rolling sum with a width of 2, and then lag it with data.table::shift so that each row holds the sum of the previous 2 rows.
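As an aside, recent versions of data.table (1.12.0+) ship a rolling-sum function of their own, frollsum, so the same result can be had without RcppRoll; a minimal sketch, assuming the same ordering as above:
library(data.table)
setkeyv(data_1, c("Cust_ID", "Month"))
# frollsum() is right-aligned by default, so shift() lags the rolling sum
# by one row, giving each row the total of the previous 2 months per customer
data_1[, Sum_prev := shift(frollsum(SalesValue, 2)), by = Cust_ID]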
Here is a solution using dplyr:
library(dplyr)
library(zoo)
as.data.frame(data_1) %>%
  group_by(Cust_ID) %>%
  arrange(Cust_ID, Month) %>%
  mutate(Sum_prev = rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA))
# A tibble: 12 x 5
# Groups: Cust_ID [3]
Cust_ID Month StatusCode SalesValue Sum_prev
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 D 1055 NA
2 1 2 F 669 NA
3 1 3 E 495 1724
4 1 4 D 786 1164
5 2 1 E 445 NA
6 2 2 D 448 NA
7 2 3 F 377 893
8 2 4 E 173 825
9 3 1 F 873 NA
10 3 2 E 995 NA
11 3 3 D 673 1868
12 3 4 F 943 1668
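A zoo-free variant of the same idea uses dplyr::lag directly (a minimal sketch; it mirrors the shift-based data.table answer further below):
library(dplyr)
as.data.frame(data_1) %>%
  group_by(Cust_ID) %>%
  arrange(Cust_ID, Month) %>%
  # NA until two prior months exist, since lag() fills with NA
  mutate(Sum_prev = lag(SalesValue, 1) + lag(SalesValue, 2)) %>%
  ungroup()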
Using data.table:
library(data.table)
library(zoo)
# Option 1: order first, then use negative offsets (the previous 2 rows)
#dt <- data_1[order(Cust_ID, Month)]
#dt[, Sum_prev := rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA), by = Cust_ID][]
# Option 2: without chaining -- this relies on the raw data being in descending
# Month order, so the *next* 2 rows (positive offsets) are the previous 2 months
data_1[, Sum_prev := rollapplyr(SalesValue, list((1:2)), sum, fill = NA), by = Cust_ID][order(Cust_ID, Month)]
Cust_ID Month StatusCode SalesValue Sum_prev
1: 1 1 D 1055 NA
2: 1 2 F 669 NA
3: 1 3 E 495 1724
4: 1 4 D 786 1164
5: 2 1 E 445 NA
6: 2 2 D 448 NA
7: 2 3 F 377 893
8: 2 4 E 173 825
9: 3 1 F 873 NA
10: 3 2 E 995 NA
11: 3 3 D 673 1868
12: 3 4 F 943 1668
A data.table solution:
# sort the data first if the Month column is not ordered for any Cust_ID
data_1 <- data_1[order(Cust_ID, Month)]
# sum up the values of the two previous Months for each Cust_ID
data_1[, rsum := shift(SalesValue, 1) + shift(SalesValue, 2), by = Cust_ID]
# Cust_ID Month StatusCode SalesValue rsum
# 1: 1 1 D 1055 NA
# 2: 1 2 F 669 NA
# 3: 1 3 E 495 1724
# 4: 1 4 D 786 1164
# 5: 2 1 E 445 NA
# 6: 2 2 D 448 NA
# 7: 2 3 F 377 893
# 8: 2 4 E 173 825
# 9: 3 1 F 873 NA
# 10: 3 2 E 995 NA
# 11: 3 3 D 673 1868
# 12: 3 4 F 943 1668
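If the window ever needs to be wider than two months, adding shift terms by hand gets tedious; a sketch of a generalization using Reduce, with the window width k as a hypothetical parameter:
library(data.table)
k <- 2  # number of prior months to sum
data_1 <- data_1[order(Cust_ID, Month)]
# shift(x, 1:k) returns a list of k lagged copies; Reduce(`+`, ...) adds them
# element-wise, so any row without k full prior months comes out NA
data_1[, rsum := Reduce(`+`, shift(SalesValue, 1:k)), by = Cust_ID]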
1) sqldf/RPostgreSQL You can use window functions with a PostgreSQL backend and your code (slightly modified to work) within R like this (where data_1 is a data frame in your workspace).
library(RPostgreSQL)
library(sqldf)
sql <- 'select *, SUM("SalesValue") OVER (PARTITION BY "Cust_ID"
ORDER BY "Month" DESC
ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES
from "data_1"'
sqldf(sql)
giving:
Cust_ID Month StatusCode SalesValue past_3y_sales
1 1 4 D 786 NA
2 1 3 E 495 786
3 1 2 F 669 1281
4 1 1 D 1055 1164
5 2 4 E 173 NA
6 2 3 F 377 173
7 2 2 D 448 550
8 2 1 E 445 825
9 3 4 F 943 NA
10 3 3 D 673 943
11 3 2 E 995 1616
12 3 1 F 873 1668
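Note that this requires a running PostgreSQL server that sqldf can reach; connection details can be supplied through sqldf's RPostgreSQL options, shown here with placeholder values that you would replace with your own:
library(RPostgreSQL)
library(sqldf)
# placeholder credentials -- adjust to your server
options(sqldf.RPostgreSQL.user = "postgres",
        sqldf.RPostgreSQL.password = "password",
        sqldf.RPostgreSQL.dbname = "test",
        sqldf.RPostgreSQL.host = "localhost",
        sqldf.RPostgreSQL.port = 5432)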
2) data.table/rollapply
Alternatively, use data.table with rollapply, specifying the width as offsets using list(-2:-1).
The code below has been written to correspond to the SQL in the question. If you wanted, instead, to have two NAs for each Cust_ID rather than one, and to sum previous months with months in ascending order (not descending as in the question's SQL), then change -Month to Month in the setorder statement and remove the partial = TRUE argument in rollapply; see the sketch after the output below.
library(data.table)
library(zoo)
setorder(data_1, Cust_ID, -Month)
roll <- function(x) rollapply(x, list(-2:-1), sum, partial = TRUE, fill = NA)
data_1[, past_3y_sales := roll(SalesValue), by = Cust_ID]
giving:
> data_1
Cust_ID Month StatusCode SalesValue past_3y_sales
1: 1 4 D 786 NA
2: 1 3 E 495 786
3: 1 2 F 669 1281
4: 1 1 D 1055 1164
5: 2 4 E 173 NA
6: 2 3 F 377 173
7: 2 2 D 448 550
8: 2 1 E 445 825
9: 3 4 F 943 NA
10: 3 3 D 673 943
11: 3 2 E 995 1616
12: 3 1 F 873 1668
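For reference, a sketch of that ascending-order variant; without partial = TRUE, the first two rows per Cust_ID get NA, matching the other answers:
library(data.table)
library(zoo)
setorder(data_1, Cust_ID, Month)  # ascending Month
# rows lacking two full prior months are filled with NA
roll <- function(x) rollapply(x, list(-2:-1), sum, fill = NA)
data_1[, past_3y_sales := roll(SalesValue), by = Cust_ID]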
I had a similar problem, but the solutions above didn't help me. My data, data_1, was:
CIF_ID LEAD_RESULT
10000009 1
10000009 0
10000025 0
10000025 0
10000055 0
And I needed to sum LEAD_RESULT by CIF_ID.
I did the following using library(data.table):
dt <- data.table::as.data.table(data_1)
dt <- dt[, group_sum := sum(LEAD_RESULT), by = "CIF_ID"][]
dt
Result:
CIF_ID LEAD_RESULT group_sum
10000009 1 1
10000009 0 1
10000025 0 0
10000025 0 0
10000055 0 0
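For comparison, a minimal dplyr sketch of the same grouped sum, assuming the same data_1:
library(dplyr)
data_1 %>%
  group_by(CIF_ID) %>%
  mutate(group_sum = sum(LEAD_RESULT)) %>%
  ungroup()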
Sample Dataset (Note that each combination of Col_A and Col_B is unique):
import pandas as pd
d = {'Col_A': [1,2,3,4,5,6,9,9,10,11,11,12,12,12,12,12,12,13,13],
'Col_B': ['A','K','E','E','H','A','J','A','L','A','B','A','J','C','D','E','A','J','L'],
'Value':[180,120,35,654,789,34,567,21,235,83,234,648,654,234,873,248,45,67,94]
}
df = pd.DataFrame(data=d)
The requirement is to generate a table showing, for each Col_B category and each Col_A within it, the count and the total amount, with a subtotal per Col_B category, and with the Col_B categories in descending order of their total amount.
This is what I have so far:
df.groupby(['Col_B','Col_A']).agg(['count','sum'])
The output would look like this. However, I'd like to add subtotals for each Col_B category and rank those subtotals of the categories in descending order so that it fulfills the requirement of getting each Col_B's amount.
Thanks in advance, everyone!
The expected result is not entirely clear to me, but is this what you are looking for?
# per-(Col_B, Col_A) count and sum of Value (the data has no 'Amount' column)
piv = df.groupby(['Col_B', 'Col_A'])['Value'].agg(['count','sum']).reset_index()
# subtotal rows per Col_B, labelled 'Total' in Col_A
tot = piv.groupby('Col_B', as_index=False).sum().assign(Col_A='Total')
# ordered categorical so Col_B sorts by its subtotal
cat = pd.CategoricalDtype(tot.sort_values('sum')['Col_B'], ordered=True)
out = pd.concat([piv, tot]).astype({'Col_B': cat}) \
    .sort_values('Col_B', ascending=False, kind='mergesort') \
    .set_index(['Col_B', 'Col_A'])
>>> out
count sum
Col_B Col_A
J 9 1 567
12 1 654
13 1 67
Total 3 1288
A 1 1 180
6 1 34
9 1 21
11 1 83
12 2 693
Total 6 1011
E 3 1 35
4 1 654
12 1 248
Total 3 937
D 12 1 873
Total 1 873
H 5 1 789
Total 1 789
L 10 1 235
13 1 94
Total 2 329
C 12 1 234
Total 1 234
B 11 1 234
Total 1 234
K 2 1 120
Total 1 120
I have a data frame as shown below:
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
13 4 F 10-Oct-18 500
14 4 M 10-Jan-19 200
15 4 F 10-Jun-19 600
16 2 M 29-Mar-18 100
17 2 M 29-Apr-18 100
18 2 F 29-Dec-18 500
F=Failure
M=Maintenance
Then I sorted the data based on ID and Date using the code below.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
Then I want to filter the IDs having more than one failure with at least one maintenance in between.
The expected DF is shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 F 2018-06-22 600
3 1 M 2018-08-22 150
4 1 F 2019-03-22 750
5 2 F 2018-01-29 500
6 2 M 2018-03-29 100
7 2 M 2018-04-29 100
8 2 F 2018-12-29 500
10 4 F 2018-10-10 500
11 4 M 2019-01-10 200
12 4 F 2019-06-10 600
The logic used to get the above DF is as follows.
Let the above DF be sl9.
Select IDs having more than one F and at least one M in between them.
Remove the row if the first status for an ID is M.
Remove the row if the last status for an ID is M.
If there are two consecutive F rows for an ID, ignore the first F row.
Then I ran the code below to calculate the duration.
sl9['Date'] = pd.to_datetime(sl9['Date'])
sl9['D'] = sl9.groupby('ID')['Date'].diff().dt.days
ID Status Date Cost D
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 F 2018-06-22 600 335.00
3 1 M 2018-08-22 150 61.00
4 1 F 2019-03-22 750 212.00
5 2 F 2018-01-29 500 nan
6 2 M 2018-03-29 100 59.00
7 2 M 2018-04-29 100 31.00
8 2 F 2018-12-29 500 244.00
10 4 F 2018-10-10 500 nan
11 4 M 2019-01-10 200 92.00
12 4 F 2019-06-10 600 151.00
From the above DF, I want to create a DF as below.
ID Total_Duration No_of_F No_of_M
1 638 3 2
2 334 2 2
4 243 2 2
I tried the following code:
df1 = sl9.groupby('ID', sort=False)["D"].sum().reset_index(name='Total_Duration')
and the output is shown below:
ID Total_Duration
0 1 638.00
1 2 334.00
2 4 243.00
The idea is to create new columns for each mask for easier debugging, because the solution is complicated:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
# remove M groups that are the first or last group per ID
m1 = df['Status'].eq('M')
df['g'] = df['Status'].ne(df.groupby('ID')['Status'].shift()).cumsum()
df['f'] = df.groupby('ID')['g'].transform('first').eq(df['g']) & m1
df['l'] = df.groupby('ID')['g'].transform('last').eq(df['g']) & m1
df1 = df[~(df['f'] | df['l'])].copy()
# count the number of M and F rows per ID and compare with ge (>=)
df1['noM'] = df1['Status'].eq('M').groupby(df1['ID']).transform('size').ge(1)
df1['noF'] = df1['Status'].eq('F').groupby(df1['ID']).transform('size').ge(2)
# keep non-FF rows, removing the first F of each consecutive F-F pair
df1['dupF'] = ~df.groupby('ID')['Status'].shift(-1).eq(df['Status']) | df1['Status'].eq('M')
df1 = df1[df1['noM'] & df1['noF'] & df1['dupF']]
df1 = df1.drop(['g','f','l','noM','noF','dupF'], axis=1)
print (df1)
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
7 1 F 2018-06-22 600
9 1 M 2018-08-22 150
10 1 F 2019-03-22 750
6 2 F 2018-01-29 500
16 2 M 2018-03-29 100
17 2 M 2018-04-29 100
18 2 F 2018-12-29 500
13 4 F 2018-10-10 500
14 4 M 2019-01-10 200
15 4 F 2019-06-10 600
And then:
# difference in days between consecutive events per ID
df1['D'] = df1.groupby('ID')['Date'].diff().dt.days
# aggregate total duration per ID
df2 = df1.groupby('ID')['D'].sum().astype(int).to_frame('Total_Duration')
# count statuses per ID with crosstab
df3 = pd.crosstab(df1['ID'], df1['Status']).add_prefix('No_of_')
# join together
df4 = df2.join(df3).reset_index()
print (df4)
ID Total_Duration No_of_F No_of_M
0 1 638 3 2
1 2 334 2 2
2 4 243 2 1
I am working on a dataset where I need to write a query for the requirement below, in either R or sqldf. I want to learn to write it in both languages (SQL and R); kindly help.
The requirement is: print Variable "a" from the table when the Total_Scores of id 34 at Rank 3 is GREATER THAN the Total_Scores of id 34 at Rank 4; otherwise print Variable "b".
The same comparison applies to each id and its pair of Ranks.
id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6
I have tried to write a SQL CASE statement and I am stuck; could you please help to complete the query?
"select id ,Rank ,
CASE
WHEN (select Total_Scores from table where id == 34 and Rank == 3) > (select Total_Scores from table where id == 34 and Rank == 4)
THEN "Variable is )
The final output should be:
id Rank Variable Total_Scores
34 3 a 11
126 4 d 18
190 4 f 10
388 3 g 20
401 3 i 15
536 3 p 15
You seem to want the row with the highest score for each id. A canonical way to write this in SQL uses row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by score desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per id, even when the scores are tied. If you want all rows in that case, use rank() instead of row_number().
An alternative method can have better performance with an index on (id, score):
select t.*
from t
where t.score = (select max(t2.score) from t t2 where t2.id = t.id);
You can try this.
SELECT T.* FROM (
SELECT id,
MAX(Total_Scores) Max_Total_Scores
FROM MyTable
GROUP BY id
HAVING MAX(Total_Scores) > MIN(Total_Scores) ) AS MX
INNER JOIN MyTable T ON MX.id = T.id AND MX.Max_Total_Scores = T.Total_Scores
ORDER BY id
In R:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(Total_Scores == max(Total_Scores)) %>%
  filter(n() == 1) %>%   # drop ids where the max is tied
  ungroup()
# A tibble: 6 x 4
id Rank Variable Total_Scores
<int> <int> <chr> <int>
1 34 3 a 11
2 126 4 d 18
3 190 4 f 10
4 388 3 g 20
5 401 3 i 15
6 536 3 p 15
Data
df <- read.table(text="
id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6
",header=T, stringsAsFactors = F)
Assuming that what you want is to get the subset of rows whose Total_Scores is largest for each id, here are two approaches.
The question did not discuss how to deal with ties. There is one id in the example that has a tie, but there is no output corresponding to it, which I assume was not intended: either both rows should have been output, or one of them. In the solutions below, (1) will give one of the tied rows arbitrarily if there are duplicates, whereas (2) will give both.
1) sqldf
If you use max in an SQLite select, it will automatically select the other variables from the same row, so:
library(sqldf)
sqldf("select id, Rank, Variable, max(Total_Scores) Total_Scores
from DF
group by id")
giving:
id Rank Variable Total_Scores
1 34 3 a 11
2 126 4 d 18
3 190 4 f 10
4 388 3 g 20
5 401 3 i 15
6 476 3 y 11
7 536 3 p 15
2) base R In base R, we can use ave and subset like this:
subset(DF, ave(Total_Scores, id, FUN = function(x) x == max(x)) > 0)
giving:
id Rank Variable Total_Scores
1 34 3 a 11
4 126 4 d 18
6 190 4 f 10
7 388 3 g 20
9 401 3 i 15
11 476 3 y 11
12 476 4 z 11
13 536 3 p 15
Note
The input in reproducible form:
Lines <- "id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Given a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to update the values of an existing all-zeros dataframe df with the values of exex.EXEX, where the exex.I field gives the index and the exex.J field gives the column? Is there a way to update the data by specifying names instead of row positions? If the name fields change, the row positions would differ and could lead to erroneous results.
I currently get it by:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
# note: DataFrame.set_value is deprecated in newer pandas (use .at/.iat)
for index, row in exex.iterrows():
    df.set_value(row[1], row[2], row[0])
Assign to df.values:
# labels start at 1, so subtract 1 to get positional indices
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
I have a dataset containing a series whose numbering has gaps, and I want to replace the values by consecutive indices. The second column contains the same numbers as the first column, but in a different order.
here's an example:
>>> df
ind u v d
0 5 7 151
1 7 20 151
2 8 40 151
3 20 5 151
This should turn into:
>>>df
ind u v d
0 1 2 151
1 2 4 151
2 3 5 151
3 4 1 151
I reindexed the values in column 'u' by creating a new column:
>>>df['new_index'] = range(1, len(df) + 1)
but how do I now replace the values of the second column so they refer to the same indices?
Thanks for any advice!
You can use Series.rank, but you first need to create a Series with unstack, and at the end create a DataFrame with unstack again:
df[['u','v']] = df[['u','v']].unstack().rank(method='dense').astype(int).unstack(0)
print (df)
u v d
ind
0 1 2 151
1 2 4 151
2 3 5 151
3 4 1 151
If you use only DataFrame.rank, the output in v is different, because each column is then ranked separately:
df[['u','v']] = df[['u','v']].rank(method='dense').astype(int)
print (df)
u v d
ind
0 1 2 151
1 2 3 151
2 3 4 151
3 4 1 151