Sql query based on R values in datatable - sql

I have a few blank values in one of the columns of my dataset. I need to make an SQL query to the database for only the few ids (86 of them) whose rows contain a missing value.
I have in mind something like this (not just pasting the ids into the IN statement):
SELECT x.id,
x.sent
FROM x
WHERE x.id IN [my R vector with id]

glue_sql will expand vectors into a comma-separated list if you use * after the variable name in the glue expression.
glue::glue_sql("
SELECT x.id,
x.sent
FROM x
WHERE x.id IN ({idVector*})
", .con = con)
con is a DBI connection to your database (as returned by DBI::dbConnect()).

Some conjecture about your setup: the real data is in #sourcemt (a temp table; the "#" is purely local to my database, ignore it).
sourcemt <- mt <- transform(mtcars, id = seq_len(nrow(mtcars)))
DBI::dbWriteTable(con, "#sourcemt", sourcemt)
mt <- head(mt)
mt$cyl[c(1,3,4)] <- NA
mt
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb id
# Mazda RX4         21.0  NA  160 110 3.90 2.620 16.46  0  1    4    4  1
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2
# Datsun 710        22.8  NA  108  93 3.85 2.320 18.61  1  1    4    1  3
# Hornet 4 Drive    21.4  NA  258 110 3.08 3.215 19.44  1  0    3    1  4
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6
We want to fix the missing cyl values.
miss_ids <- mt$id[is.na(mt$cyl)]
miss_ids
# [1] 1 3 4
(qmarks <- paste(rep("?", length(miss_ids)), collapse = ","))
# [1] "?,?,?"
fixmt <- DBI::dbGetQuery(con, sprintf("select id, cyl from #sourcemt where id in (%s)", qmarks), params = miss_ids)
fixmt
# id cyl
# 1 1 6
# 2 3 4
# 3 4 6
(FYI, the qmarks part uses parameter binding for safe querying, without paste-ing or sprintf-ing the data into the query itself. See discussions of parameterized queries for more on this.)
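For comparison (not from the original answer), the same placeholder-binding idea in Python's sqlite3 module, with a toy table standing in for #sourcemt:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sourcemt (id INTEGER, cyl REAL)")
con.executemany("INSERT INTO sourcemt VALUES (?, ?)", [(1, 6), (2, 6), (3, 4), (4, 6)])

miss_ids = [1, 3, 4]
qmarks = ",".join("?" * len(miss_ids))  # "?,?,?"
rows = con.execute(
    "SELECT id, cyl FROM sourcemt WHERE id IN (%s)" % qmarks,  # only placeholders are interpolated
    miss_ids,                                                  # the values themselves are bound
).fetchall()
rows
# [(1, 6.0), (3, 4.0), (4, 6.0)]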
From here, we just need to merge them and coalesce the missing values back in. (Both methods below should capture the output back into mt.)
dplyr
library(dplyr)
mt %>%
  left_join(fixmt, by = "id", suffix = c("", ".y")) %>%
  mutate(cyl = coalesce(cyl, cyl.y)) %>%
  select(-cyl.y)
# mpg cyl disp hp drat wt qsec vs am gear carb id
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
# 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
# 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6
base R
merge(mt, fixmt, by = "id", all.x = TRUE, suffixes = c("", ".y")) |>
  transform(cyl = ifelse(is.na(cyl), cyl.y, cyl), cyl.y = NULL)
# id mpg cyl disp hp drat wt qsec vs am gear carb
# 1 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# 4 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# 5 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# 6 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
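Not from the original answer, but for pandas readers the same merge-and-coalesce pattern looks like this (a sketch on a minimal stand-in for mt and fixmt):
import numpy as np
import pandas as pd

mt = pd.DataFrame({"id": [1, 2, 3, 4], "cyl": [np.nan, 6, np.nan, np.nan]})
fixmt = pd.DataFrame({"id": [1, 3, 4], "cyl_fix": [6, 4, 6]})

out = mt.merge(fixmt, on="id", how="left")       # left join on id
out["cyl"] = out["cyl"].fillna(out["cyl_fix"])   # coalesce: keep cyl, fall back to cyl_fix
out = out.drop(columns="cyl_fix")
out
#    id  cyl
# 0   1  6.0
# 1   2  6.0
# 2   3  4.0
# 3   4  6.0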

Related

MultiIndex: add zero values if missing in pandas dataframe

I have a pandas (v0.23.4) dataframe with a MultiIndex ('date', 'class').
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
4 7
In '2019-06-30', class 3 is missing because there is no data.
What I want is to add class 3 to the MultiIndex with a zero value in the Col_values column, automatically.
Use DataFrame.unstack with fill_value=0, then DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
Another solution is to use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7

Delete row of a dataframe by condition

Here is my first dataframe, df1:
269 270 271 346
0 1 153.00 2.14 1
1 1 153.21 3.89 2
2 1 153.90 2.02 1
3 1 154.18 3.02 1
4 1 154.47 2.30 1
5 1 154.66 2.73 1
6 1 155.35 2.82 1
7 1 155.70 2.32 1
8 1 220.00 15.50 1
9 0 152.64 1.44 1
10 0 152.04 2.20 1
11 0 150.48 1.59 1
12 0 149.88 1.73 1
13 0 129.00 0.01 1
Here is my second dataframe, df2:
269 270 271 346
0 0 149.88 2.0 1
I would like the row at index 12 to be removed, because it has the same values in columns '269' and '270'.
Hopefully the solutions below match your requirement.
Using anti_join from dplyr
library(dplyr)
anti_join(df1, df2, by = c("269", "270"))
Using the %in% operator (note that numeric column names need backticks in R; this works here because df2 has a single row, since %in% checks each column independently rather than as pairs):
df1[!(df1$`269` %in% df2$`269` & df1$`270` %in% df2$`270`), ]
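Since the frames in the question are printed like pandas output, here is the equivalent anti-join in pandas (a sketch, not from the original answer; it assumes the column names are the strings shown and uses only the last two rows of df1):
import pandas as pd

df1 = pd.DataFrame({"269": [0, 0], "270": [149.88, 129.00],
                    "271": [1.73, 0.01], "346": [1, 1]})
df2 = pd.DataFrame({"269": [0], "270": [149.88], "271": [2.0], "346": [1]})

# anti-join: keep rows of df1 that have no match in df2 on ('269', '270')
merged = df1.merge(df2[["269", "270"]], on=["269", "270"], how="left", indicator=True)
merged[merged["_merge"] == "left_only"].drop(columns="_merge")
#    269    270   271  346
# 1    0  129.0  0.01    1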

R equivalent of SQL SUM OVER PARTITION BY ROWS PRECEDING

I'm running into trouble trying to replicate SQL window functions in R, in particular with relation to creating sum totals that specify the number of prior months I want to sum.
While the sqldf package in R allows for data manipulation, it doesn't seem to support window functions.
I have some mock data in R
library(data.table)
set.seed(10)
data_1 <- data.table(Cust_ID = c(1,1,1,1,2,2,2,2,3,3,3,3),
                     Month = c(4,3,2,1,4,3,2,1,4,3,2,1),
                     StatusCode = LETTERS[4:6],
                     SalesValue = round(runif(12, 50, 1500)))
Cust_ID Month StatusCode SalesValue
1 4 D 786
1 3 E 495
1 2 F 669
1 1 D 1055
2 4 E 173
2 3 F 377
2 2 D 448
2 1 E 445
3 4 F 943
3 3 D 673
3 2 E 995
3 1 F 873
For each row, I would like to create a cumulative sum of values pertaining to the customer (Cust_ID), for the prior 2 months (not including the current month).
This would mean that for each customer, rows with Months 1 & 2 should be null (given there aren't 2 preceding months), Month 3 should contain summed SalesValue of Months 1 & 2 for that customer, and Month 4 should contain summed Sales Value for Month 2 & 3.
In SQL, I would use syntax similar to the following: SUM(SalesValue) OVER (PARTITION BY Cust_ID ORDER BY MONTH DESC ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES
Is there a way to achieve this in R, ideally using data.table (for efficiency)? Any guidance would be much appreciated.
PS Note: this is mock data, in my 'real' data customers have different data volumes - i.e. some customers have 5 months worth of data, others have >36 months worth of data, etc.
Since the OP has used data.table, here is a solution using RcppRoll::roll_sumr within data.table:
library(data.table)
library(RcppRoll)
# Order on 'Cust_ID' and 'Month'
setkeyv(data_1,c("Cust_ID","Month"))
data_1[, Sum_prev:=shift(roll_sumr(SalesValue, n=2)), by=Cust_ID]
data_1
# Cust_ID Month StatusCode SalesValue Sum_prev
# 1: 1 1 D 1055 NA
# 2: 1 2 F 669 NA
# 3: 1 3 E 495 1724
# 4: 1 4 D 786 1164
# 5: 2 1 E 445 NA
# 6: 2 2 D 448 NA
# 7: 2 3 F 377 893
# 8: 2 4 E 173 825
# 9: 3 1 F 873 NA
# 10: 3 2 E 995 NA
# 11: 3 3 D 673 1868
# 12: 3 4 F 943 1668
The approach is to first calculate a rolling sum with width 2 and then take the previous value using data.table::shift, so that the current row holds the sum of the previous two rows.
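For comparison (not from the original answer), the same roll-then-shift idea in pandas, sketched on Cust_ID 1's sales in ascending Month order:
import pandas as pd

sales = pd.Series([1055, 669, 495, 786])   # Cust_ID 1, Months 1..4
sales.rolling(2).sum().shift()             # rolling sum of width 2, then lag one row
# 0       NaN
# 1       NaN
# 2    1724.0
# 3    1164.0
# dtype: float64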
Here is a solution using dplyr
library(dplyr)
library(zoo)
as.data.frame(data_1) %>%
  group_by(Cust_ID) %>%
  arrange(Cust_ID, Month) %>%
  mutate(Sum_prev = rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA))
# A tibble: 12 x 5
# Groups: Cust_ID [3]
Cust_ID Month StatusCode SalesValue Sum_prev
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 D 1055 NA
2 1 2 F 669 NA
3 1 3 E 495 1724
4 1 4 D 786 1164
5 2 1 E 445 NA
6 2 2 D 448 NA
7 2 3 F 377 893
8 2 4 E 173 825
9 3 1 F 873 NA
10 3 2 E 995 NA
11 3 3 D 673 1868
12 3 4 F 943 1668
Using data.table:
library(data.table)
library(zoo)
# dt <- data_1[order(Cust_ID, Month)]
# dt[, Sum_prev := rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA), by = Cust_ID][]
# OR, without chaining: data_1 as created has Month in descending order, so the
# positive offsets 1:2 point at the two previous months
data_1[, Sum_prev := rollapplyr(SalesValue, list(1:2), sum, fill = NA), by = Cust_ID][order(Cust_ID, Month)]
Cust_ID Month StatusCode SalesValue Sum_prev
1: 1 1 D 1055 NA
2: 1 2 F 669 NA
3: 1 3 E 495 1724
4: 1 4 D 786 1164
5: 2 1 E 445 NA
6: 2 2 D 448 NA
7: 2 3 F 377 893
8: 2 4 E 173 825
9: 3 1 F 873 NA
10: 3 2 E 995 NA
11: 3 3 D 673 1868
12: 3 4 F 943 1668
A data.table solution:
# sort the data first if the Month column is not ordered for any Cust_ID
data_1 <- data_1[order(Cust_ID, Month)]
# sum the values of the two previous Months for each Cust_ID
data_1[, rsum := shift(SalesValue, 1) + shift(SalesValue, 2), by = Cust_ID]
# Cust_ID Month StatusCode SalesValue rsum
# 1: 1 1 D 1055 NA
# 2: 1 2 F 669 NA
# 3: 1 3 E 495 1724
# 4: 1 4 D 786 1164
# 5: 2 1 E 445 NA
# 6: 2 2 D 448 NA
# 7: 2 3 F 377 893
# 8: 2 4 E 173 825
# 9: 3 1 F 873 NA
# 10: 3 2 E 995 NA
# 11: 3 3 D 673 1868
# 12: 3 4 F 943 1668
1) sqldf/RPostgreSQL. You can use windowing functions with a PostgreSQL backend and your code (slightly modified to work) within R like this (where data_1 is a data frame in your workspace).
library(RPostgreSQL)
library(sqldf)
sql <- 'select *, SUM("SalesValue") OVER (PARTITION BY "Cust_ID"
ORDER BY "Month" DESC
ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES
from "data_1"'
sqldf(sql)
giving:
Cust_ID Month StatusCode SalesValue past_3y_sales
1 1 4 D 786 NA
2 1 3 E 495 786
3 1 2 F 669 1281
4 1 1 D 1055 1164
5 2 4 E 173 NA
6 2 3 F 377 173
7 2 2 D 448 550
8 2 1 E 445 825
9 3 4 F 943 NA
10 3 3 D 673 943
11 3 2 E 995 1616
12 3 1 F 873 1668
2) data.table/rollapply
Alternately, use data.table with rollapply, specifying the width as offsets via list(-2:-1).
The code below is written to correspond to the SQL in the question. If you instead want two NAs per Cust_ID rather than one, and want to sum previous months with months in ascending order (not descending as in the question's SQL), change -Month to Month in the setorder statement and remove the partial = TRUE argument from rollapply.
library(data.table)
library(zoo)
setorder(data_1, Cust_ID, -Month)
roll <- function(x) rollapply(x, list(-2:-1), sum, partial = TRUE, fill = NA)
data_1[, past_3y_sales := roll(SalesValue), by = Cust_ID]
giving:
> data_1
Cust_ID Month StatusCode SalesValue past_3y_sales
1: 1 4 D 786 NA
2: 1 3 E 495 786
3: 1 2 F 669 1281
4: 1 1 D 1055 1164
5: 2 4 E 173 NA
6: 2 3 F 377 173
7: 2 2 D 448 550
8: 2 1 E 445 825
9: 3 4 F 943 NA
10: 3 3 D 673 943
11: 3 2 E 995 1616
12: 3 1 F 873 1668
I had a similar problem, but the solutions above didn't help me. My data was data_1:
CIF_ID LEAD_RESULT
10000009 1
10000009 0
10000025 0
10000025 0
10000055 0
I needed to sum LEAD_RESULT by CIF_ID, and did the following with library(data.table):
dt <- data.table::as.data.table(data_1)
dt <- dt[, group_sum := sum(LEAD_RESULT), by = "CIF_ID"][]
dt
Result:
CIF_ID LEAD_RESULT group_sum
10000009 1 1
10000009 0 1
10000025 0 0
10000025 0 0
10000055 0 0
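Not from the original answer, but for pandas readers the counterpart of this grouped sum (SQL's SUM(...) OVER (PARTITION BY ...)) is groupby().transform:
import pandas as pd

data_1 = pd.DataFrame({"CIF_ID": [10000009, 10000009, 10000025, 10000025, 10000055],
                       "LEAD_RESULT": [1, 0, 0, 0, 0]})
data_1["group_sum"] = data_1.groupby("CIF_ID")["LEAD_RESULT"].transform("sum")
data_1
#      CIF_ID  LEAD_RESULT  group_sum
# 0  10000009            1          1
# 1  10000009            0          1
# 2  10000025            0          0
# 3  10000025            0          0
# 4  10000055            0          0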

Set value from another dataframe

Having a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to update the values of an existing dataframe df of zeros with the values exex.EXEX, where the exex.I field gives the row index and the exex.J field gives the column? Is there a way to update the data by specifying the name instead of the row index? If the name fields change, the row index would be different and could lead to an erroneous result.
I currently get it by:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex12.iterrows():
    df.set_value(row[1], row[2], row[0])  # note: set_value is deprecated; df.at[row[1], row[2]] = row[0] is the modern equivalent
Assign directly to df.values:
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
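One label-based alternative (a sketch, not from the original answer) that addresses the "update by name" concern: pivot exex so that I becomes the index and J the columns, then let DataFrame.update align on labels rather than positions. Shown on a small stand-in for exex and df:
import pandas as pd

exex = pd.DataFrame({"EXEX": [702, 3112, 702], "I": [2, 2, 3], "J": [3, 4, 2]})
df = pd.DataFrame(0, index=range(1, 6), columns=range(1, 6))

upd = exex.pivot(index="I", columns="J", values="EXEX")
df.update(upd)   # overwrites df cells wherever upd has a non-NA value, matching by label
df.astype(int)
#    1    2    3     4  5
# 1  0    0    0     0  0
# 2  0    0  702  3112  0
# 3  0  702    0     0  0
# 4  0    0    0     0  0
# 5  0    0    0     0  0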

pandas dataframe multiply with a series [duplicate]

This question already has answers here:
How do I operate on a DataFrame with a Series for every column?
(3 answers)
Closed 4 years ago.
What is the best way to multiply all the columns of a Pandas DataFrame by a column vector stored in a Series? I used to do this in Matlab with repmat(), which doesn't exist in Pandas. I can use np.tile(), but it looks ugly to convert the data structure back and forth each time.
Thanks.
What's wrong with
result = dataframe.mul(series, axis=0)
?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mul.html#pandas.DataFrame.mul
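A minimal demonstration of that broadcasting (values invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.).reshape(3, 2), columns=["a", "b"])
ser = pd.Series([10, 100, 1000])

df.mul(ser, axis=0)   # each row of df is scaled by the matching element of ser
#         a       b
# 0     0.0    10.0
# 1   200.0   300.0
# 2  4000.0  5000.0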
This can be accomplished quite simply with the DataFrame method apply.
In[1]: import pandas as pd; import numpy as np
In[2]: df = pd.DataFrame(np.arange(40.).reshape((8, 5)), columns=list('abcde')); df
Out[2]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
5 25 26 27 28 29
6 30 31 32 33 34
7 35 36 37 38 39
In[3]: ser = pd.Series(np.arange(8) * 10); ser
Out[3]:
0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
Now that we have our DataFrame and Series we need a function to pass to apply.
In[4]: func = lambda x: np.asarray(x) * np.asarray(ser)
We can pass this to df.apply and we are good to go
In[5]: df.apply(func)
Out[5]:
a b c d e
0 0 0 0 0 0
1 50 60 70 80 90
2 200 220 240 260 280
3 450 480 510 540 570
4 800 840 880 920 960
5 1250 1300 1350 1400 1450
6 1800 1860 1920 1980 2040
7 2450 2520 2590 2660 2730
df.apply acts column-wise by default, but it can also act row-wise by passing axis=1 as an argument to apply.
In[6]: ser2 = pd.Series(np.arange(5) *5); ser2
Out[6]:
0 0
1 5
2 10
3 15
4 20
In[7]: func2 = lambda x: np.asarray(x) * np.asarray(ser2)
In[8]: df.apply(func2, axis=1)
Out[8]:
a b c d e
0 0 5 20 45 80
1 0 30 70 120 180
2 0 55 120 195 280
3 0 80 170 270 380
4 0 105 220 345 480
5 0 130 270 420 580
6 0 155 320 495 680
7 0 180 370 570 780
This could be done more concisely by defining the anonymous function inside apply:
In[9]: df.apply(lambda x: np.asarray(x) * np.asarray(ser))
Out[9]:
a b c d e
0 0 0 0 0 0
1 50 60 70 80 90
2 200 220 240 260 280
3 450 480 510 540 570
4 800 840 880 920 960
5 1250 1300 1350 1400 1450
6 1800 1860 1920 1980 2040
7 2450 2520 2590 2660 2730
In[10]: df.apply(lambda x: np.asarray(x) * np.asarray(ser2), axis=1)
Out[10]:
a b c d e
0 0 5 20 45 80
1 0 30 70 120 180
2 0 55 120 195 280
3 0 80 170 270 380
4 0 105 220 345 480
5 0 130 270 420 580
6 0 155 320 495 680
7 0 180 370 570 780
Why not create your own dataframe tile function:
import pandas as pd

def tile_df(df, n, m):
    # tile df n times vertically and m times horizontally
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
    wide = pd.concat([df] * m, axis=1, ignore_index=True)
    return pd.concat([wide] * n, axis=0, ignore_index=True)
Example:
df = pd.DataFrame([[1, 2], [3, 4]])
tile_df(df, 2, 3)
# 0 1 2 3 4 5
# 0 1 2 1 2 1 2
# 1 3 4 3 4 3 4
# 2 1 2 1 2 1 2
# 3 3 4 3 4 3 4
However, the docs note: "DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix." Which presumably should be interpreted as "use numpy if you are doing lots of matrix stuff".
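Following that advice, the np.tile route the asker mentioned is a short round-trip (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
pd.DataFrame(np.tile(df.to_numpy(), (2, 3)))   # 2 copies down, 3 copies across
#    0  1  2  3  4  5
# 0  1  2  1  2  1  2
# 1  3  4  3  4  3  4
# 2  1  2  1  2  1  2
# 3  3  4  3  4  3  4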