I have a pandas (v0.23.4) DataFrame with a MultiIndex ('date', 'class'):
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
4 7
In '2019-06-30', class 3 is missing because there is no data for it.
What I want is to add class 3 to the MultiIndex, with a zero value in the Col_values column, automatically.
Use DataFrame.unstack with fill_value=0 followed by DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print(df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
Another solution is to use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print(df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
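Both answers can be checked end to end. The sketch below reconstructs the sample data from the question (the construction code is an assumption, since the question only shows the printed frame) and verifies that the missing ('2019-06-30', 3) entry is filled with 0:

```python
import pandas as pd

# Reconstruct the sample data from the question
idx = pd.MultiIndex.from_tuples(
    [('2019-04-30', 0), ('2019-04-30', 1), ('2019-04-30', 2), ('2019-04-30', 3), ('2019-04-30', 4),
     ('2019-05-31', 0), ('2019-05-31', 1), ('2019-05-31', 2), ('2019-05-31', 3), ('2019-05-31', 4),
     ('2019-06-30', 0), ('2019-06-30', 1), ('2019-06-30', 2), ('2019-06-30', 4)],
    names=['date', 'class'])
df = pd.DataFrame({'Col_values': [324, 6874, 44, 5, 15,
                                  393, 6534, 64, 1, 22,
                                  325, 5899, 48, 7]}, index=idx)

# Approach 1: the unstack/stack round trip fills the hole with 0
filled = df.unstack(fill_value=0).stack()

# Approach 2: reindex against the full cartesian product of the levels
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
filled2 = df.reindex(mux, fill_value=0)
```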
Here is my first dataframe, df1:
269 270 271 346
0 1 153.00 2.14 1
1 1 153.21 3.89 2
2 1 153.90 2.02 1
3 1 154.18 3.02 1
4 1 154.47 2.30 1
5 1 154.66 2.73 1
6 1 155.35 2.82 1
7 1 155.70 2.32 1
8 1 220.00 15.50 1
9 0 152.64 1.44 1
10 0 152.04 2.20 1
11 0 150.48 1.59 1
12 0 149.88 1.73 1
13 0 129.00 0.01 1
Here is my second dataframe, df2:
269 270 271 346
0 0 149.88 2.0 1
I would like the row at index 12 to be removed, because it has the same values as df2 in columns '269' and '270'.
I hope the solutions below match your requirement.
Using anti_join from dplyr
library(dplyr)
anti_join(df1, df2, by = c("269", "270"))
Using the %in% operator (backticks are needed because the column names start with a digit; note that %in% tests each column independently, which matches the paired check here only because df2 has a single row):
df1[!(df1$`269` %in% df2$`269` & df1$`270` %in% df2$`270`), ]
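For comparison, if df1 and df2 are pandas DataFrames (the printed frames in the question look like pandas output), the same anti-join can be sketched with merge and its indicator flag. The frames below are a hypothetical subset reconstructed from the question:

```python
import pandas as pd

# Hypothetical subset of the question's frames
df1 = pd.DataFrame({'269': [1, 1, 0, 0],
                    '270': [153.00, 153.21, 149.88, 129.00],
                    '271': [2.14, 3.89, 1.73, 0.01],
                    '346': [1, 2, 1, 1]})
df2 = pd.DataFrame({'269': [0], '270': [149.88], '271': [2.0], '346': [1]})

# Anti-join on the two key columns: keep only rows of df1 with no match in df2
merged = df1.merge(df2[['269', '270']], on=['269', '270'],
                   how='left', indicator=True)
result = df1[merged['_merge'].eq('left_only').to_numpy()]
```

Unlike the independent %in% checks, merge matches the two key columns as a pair, so it stays correct when df2 has several rows.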
I'm running into trouble trying to replicate SQL window functions in R, in particular when creating sums over a specified number of prior months.
While the sqldf package in R allows for data manipulation, it doesn't seem to support window functions.
I have some mock data in R
library(data.table)
set.seed(10)
data_1 <- data.table(Cust_ID = c(1,1,1,1,2,2,2,2,3,3,3,3),Month=c(4,3,2,1,4,3,2,1,4,3,2,1),
StatusCode=LETTERS[4:6],SalesValue=round(runif(12,50,1500)))
Cust_ID Month StatusCode SalesValue
1 4 D 786
1 3 E 495
1 2 F 669
1 1 D 1055
2 4 E 173
2 3 F 377
2 2 D 448
2 1 E 445
3 4 F 943
3 3 D 673
3 2 E 995
3 1 F 873
For each row, I would like to create a cumulative sum of values pertaining to the customer (Cust_ID), for the prior 2 months (not including the current month).
This would mean that for each customer, rows with Months 1 & 2 should be null (given there aren't 2 preceding months), Month 3 should contain summed SalesValue of Months 1 & 2 for that customer, and Month 4 should contain summed Sales Value for Month 2 & 3.
In SQL, I would use syntax similar to the following: SUM(SalesValue) OVER (PARTITION BY Cust_ID ORDER BY MONTH DESC ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES
Is there a way to achieve this in R - ideally using data.table (for efficiency)? Any guidance would be much appreciated.
PS Note: this is mock data, in my 'real' data customers have different data volumes - i.e. some customers have 5 months worth of data, others have >36 months worth of data, etc.
Since the OP has used data.table, here is a solution using RcppRoll::roll_sumr within data.table:
library(data.table)
library(RcppRoll)
# Order on 'Cust_ID' and 'Month'
setkeyv(data_1,c("Cust_ID","Month"))
data_1[, Sum_prev:=shift(roll_sumr(SalesValue, n=2)), by=Cust_ID]
data_1
# Cust_ID Month StatusCode SalesValue Sum_prev
# 1: 1 1 D 1055 NA
# 2: 1 2 F 669 NA
# 3: 1 3 E 495 1724
# 4: 1 4 D 786 1164
# 5: 2 1 E 445 NA
# 6: 2 2 D 448 NA
# 7: 2 3 F 377 893
# 8: 2 4 E 173 825
# 9: 3 1 F 873 NA
# 10: 3 2 E 995 NA
# 11: 3 3 D 673 1868
# 12: 3 4 F 943 1668
The approach is to first calculate the rolling sum with width 2, then take the previous value with data.table::shift, so that the current row holds the sum of the previous two rows.
Here is a solution using dplyr
library(dplyr)
library(zoo)
as.data.frame(data_1) %>% group_by(Cust_ID) %>% arrange(Cust_ID, Month) %>%
mutate(Sum_prev =rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA))
# A tibble: 12 x 5
# Groups: Cust_ID [3]
Cust_ID Month StatusCode SalesValue Sum_prev
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 D 1055 NA
2 1 2 F 669 NA
3 1 3 E 495 1724
4 1 4 D 786 1164
5 2 1 E 445 NA
6 2 2 D 448 NA
7 2 3 F 377 893
8 2 4 E 173 825
9 3 1 F 873 NA
10 3 2 E 995 NA
11 3 3 D 673 1868
12 3 4 F 943 1668
Using data.table:
library(data.table)
library(zoo)
# dt <- data_1[order(Cust_ID, Month)]
# dt[, Sum_prev := rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA), by = Cust_ID][]
# OR without chaining (the positive offsets list(1:2) look "forward",
# which works here only because data_1 is stored in descending Month order):
data_1[, Sum_prev := rollapplyr(SalesValue, list(1:2), sum, fill = NA), by = Cust_ID][order(Cust_ID, Month)]
Cust_ID Month StatusCode SalesValue Sum_prev
1: 1 1 D 1055 NA
2: 1 2 F 669 NA
3: 1 3 E 495 1724
4: 1 4 D 786 1164
5: 2 1 E 445 NA
6: 2 2 D 448 NA
7: 2 3 F 377 893
8: 2 4 E 173 825
9: 3 1 F 873 NA
10: 3 2 E 995 NA
11: 3 3 D 673 1868
12: 3 4 F 943 1668
A data.table solution:
# sort the data first if the Month column is not ordered for any Cust_ID
data_1 <- data_1[order(Cust_ID, Month)]
# sum up the value of two previous Month for each Cust_ID
data_1[, rsum := shift(SalesValue, 1) + shift(SalesValue, 2), by = Cust_ID]
# Cust_ID Month StatusCode SalesValue rsum
# 1: 1 1 D 1055 NA
# 2: 1 2 F 669 NA
# 3: 1 3 E 495 1724
# 4: 1 4 D 786 1164
# 5: 2 1 E 445 NA
# 6: 2 2 D 448 NA
# 7: 2 3 F 377 893
# 8: 2 4 E 173 825
# 9: 3 1 F 873 NA
# 10: 3 2 E 995 NA
# 11: 3 3 D 673 1868
# 12: 3 4 F 943 1668
1) sqldf/RPostgreSQL You can use window functions with a PostgreSQL backend and your code (slightly modified to work) within R like this (where data_1 is a data frame in your workspace).
library(RPostgreSQL)
library(sqldf)
sql <- 'select *, SUM("SalesValue") OVER (PARTITION BY "Cust_ID"
ORDER BY "Month" DESC
ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES
from "data_1"'
sqldf(sql)
giving:
Cust_ID Month StatusCode SalesValue past_3y_sales
1 1 4 D 786 NA
2 1 3 E 495 786
3 1 2 F 669 1281
4 1 1 D 1055 1164
5 2 4 E 173 NA
6 2 3 F 377 173
7 2 2 D 448 550
8 2 1 E 445 825
9 3 4 F 943 NA
10 3 3 D 673 943
11 3 2 E 995 1616
12 3 1 F 873 1668
2) data.table/rollapply
Alternately use data.table with rollapply specifying the width as offsets using list(-2:-1).
The code below has been written to correspond to the SQL code in the question. If you instead wanted two NAs for each Cust_ID rather than one, and to sum previous months with months in ascending order (not descending as specified in the question's SQL), change -Month to Month in the setorder statement and remove the partial=TRUE argument in rollapply.
library(data.table)
library(zoo)
setorder(data_1, Cust_ID, -Month)
roll <- function(x) rollapply(x, list(-2:-1), sum, partial = TRUE, fill = NA)
data_1[, past_3y_sales := roll(SalesValue), by = Cust_ID]
giving:
> data_1
Cust_ID Month StatusCode SalesValue past_3y_sales
1: 1 4 D 786 NA
2: 1 3 E 495 786
3: 1 2 F 669 1281
4: 1 1 D 1055 1164
5: 2 4 E 173 NA
6: 2 3 F 377 173
7: 2 2 D 448 550
8: 2 1 E 445 825
9: 3 4 F 943 NA
10: 3 3 D 673 943
11: 3 2 E 995 1616
12: 3 1 F 873 1668
I had a similar problem, but the solutions above didn't help me. My data was data_1:
CIF_ID LEAD_RESULT
10000009 1
10000009 0
10000025 0
10000025 0
10000055 0
And I needed to sum LEAD_RESULT by CIF_ID.
I did the following using the data.table package:
dt <- data.table::as.data.table(data_1)
dt <- dt[, group_sum := sum(LEAD_RESULT), by = "CIF_ID"][]
dt
Result:
CIF_ID LEAD_RESULT group_sum
10000009 1 1
10000009 0 1
10000025 0 0
10000025 0 0
10000055 0 0
Having a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to update an existing DataFrame df of zeros with the values exex.EXEX, where exex.I gives the index and exex.J the column? And is there a way to update the data by specifying the label names instead of the row positions? If the label fields change, the positional row index would be different and could lead to an erroneous result.
I get it by:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex12.iterrows():
    df.at[row[1], row[2]] = row[0]  # df.set_value was removed in pandas 1.0; .at is the replacement
Assign directly to df.values, shifting the 1-based labels to 0-based positions:
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
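To address the "by name instead of row index" part of the question, Index.get_indexer can translate the I/J labels into positions first, so the code no longer hard-codes the - 1 offset and keeps working if the labels are not simply 1..n. A sketch with a hypothetical 5x5 frame and a subset of exex:

```python
import numpy as np
import pandas as pd

# Hypothetical subset of exex from the question
exex = pd.DataFrame({'EXEX': [702, 3112, 1360, 702],
                     'I':    [2, 2, 2, 3],
                     'J':    [3, 4, 5, 2]})

index = range(1, 6)
columns = range(1, 6)
df = pd.DataFrame(0, index=index, columns=columns)

# Map the I/J labels to integer positions via the index/columns themselves
rows = df.index.get_indexer(exex['I'])
cols = df.columns.get_indexer(exex['J'])

# Fill a plain array and rebuild the frame, avoiding view-vs-copy pitfalls
arr = df.to_numpy()
arr[rows, cols] = exex['EXEX'].to_numpy()
df = pd.DataFrame(arr, index=index, columns=columns)
```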
What is the best way to multiply all the columns of a Pandas DataFrame by a column vector stored in a Series? I used to do this in Matlab with repmat(), which doesn't exist in Pandas. I can use np.tile(), but it looks ugly to convert the data structure back and forth each time.
Thanks.
What's wrong with
result = dataframe.mul(series, axis=0)
?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mul.html#pandas.DataFrame.mul
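A minimal sketch of this approach with toy data (not the asker's): mul with axis=0 aligns the Series on the row index and multiplies every column by it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.).reshape(3, 2), columns=['a', 'b'])
ser = pd.Series([10, 20, 30])

# Multiply every column of df by the column vector, aligned on the row index
result = df.mul(ser, axis=0)
```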
This can be accomplished quite simply with the DataFrame method apply.
In[1]: import pandas as pd; import numpy as np
In[2]: df = pd.DataFrame(np.arange(40.).reshape((8, 5)), columns=list('abcde')); df
Out[2]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
5 25 26 27 28 29
6 30 31 32 33 34
7 35 36 37 38 39
In[3]: ser = pd.Series(np.arange(8) * 10); ser
Out[3]:
0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
Now that we have our DataFrame and Series we need a function to pass to apply.
In[4]: func = lambda x: np.asarray(x) * np.asarray(ser)
We can pass this to df.apply and we are good to go
In[5]: df.apply(func)
Out[5]:
a b c d e
0 0 0 0 0 0
1 50 60 70 80 90
2 200 220 240 260 280
3 450 480 510 540 570
4 800 840 880 920 960
5 1250 1300 1350 1400 1450
6 1800 1860 1920 1980 2040
7 2450 2520 2590 2660 2730
df.apply acts column-wise by default, but it can also act row-wise by passing axis=1 as an argument to apply.
In[6]: ser2 = pd.Series(np.arange(5) *5); ser2
Out[6]:
0 0
1 5
2 10
3 15
4 20
In[7]: func2 = lambda x: np.asarray(x) * np.asarray(ser2)
In[8]: df.apply(func2, axis=1)
Out[8]:
a b c d e
0 0 5 20 45 80
1 0 30 70 120 180
2 0 55 120 195 280
3 0 80 170 270 380
4 0 105 220 345 480
5 0 130 270 420 580
6 0 155 320 495 680
7 0 180 370 570 780
This could be done more concisely by defining the anonymous function inside apply:
In[9]: df.apply(lambda x: np.asarray(x) * np.asarray(ser))
Out[9]:
a b c d e
0 0 0 0 0 0
1 50 60 70 80 90
2 200 220 240 260 280
3 450 480 510 540 570
4 800 840 880 920 960
5 1250 1300 1350 1400 1450
6 1800 1860 1920 1980 2040
7 2450 2520 2590 2660 2730
In[10]: df.apply(lambda x: np.asarray(x) * np.asarray(ser2), axis=1)
Out[10]:
a b c d e
0 0 5 20 45 80
1 0 30 70 120 180
2 0 55 120 195 280
3 0 80 170 270 380
4 0 105 220 345 480
5 0 130 270 420 580
6 0 155 320 495 680
7 0 180 370 570 780
Why not create your own dataframe tile function (DataFrame.append was removed in pandas 2.0, so this uses concat):
import pandas
def tile_df(df, n, m):
    # tile m copies side by side, then stack n copies of that vertically
    wide = pandas.concat([df] * m, axis=1, ignore_index=True)
    return pandas.concat([wide] * n, axis=0, ignore_index=True)
Example:
df = pandas.DataFrame([[1,2],[3,4]])
tile_df(df, 2, 3)
# 0 1 2 3 4 5
# 0 1 2 1 2 1 2
# 1 3 4 3 4 3 4
# 2 1 2 1 2 1 2
# 3 3 4 3 4 3 4
However, the docs note: "DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix." Which presumably should be interpreted as "use numpy if you are doing lots of matrix stuff".
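Following that hint from the docs, the numpy round trip the asker found ugly is still the shortest route; a sketch assuming default integer labels are acceptable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
# Tile 2 times vertically and 3 times horizontally, like tile_df(df, 2, 3)
tiled = pd.DataFrame(np.tile(df.to_numpy(), (2, 3)))
```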