dplyr to data.table to speed up execution time - dataframe

I am currently dealing with a moderately large dataframe called d.mkt (> 2M rows and 12 columns). As dplyr is too slow when applying summarise() combined with group_by_at(), I am trying to write an equivalent statement using data.table to speed up the summarise part. The situation is a bit special in that the dataframe is grouped by every column except the ones being summarised, e.g. X %>% select(-id) %>% group_by_at(vars(-x, -y, -z, -t)) %>% summarise(x = sum(x), y = sum(y), z = sum(z), t = sum(t)) %>% ungroup().
With that in mind, below is my current attempt, which keeps failing with this error: keyby or by has length (1,1,1,1). Could someone please let me know how to fix this?
dplyr's code
d.mkt <- d.mkt %>%
  left_join(codes, by = c('rte_cd', 'cd')) %>%
  mutate(is_valid = replace_na(is_valid, FALSE),
         rte_cd = ifelse(is_valid, rte_cd, 'RC'),
         rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
  select(-is_valid) %>%
  group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
  summarise(c_rv = sum(as.numeric(c_rv)),
            g_rv = sum(as.numeric(g_rv)),
            h_rv = sum(as.numeric(h_rv)),
            rn = sum(as.numeric(rn))) %>%
  ungroup()
My attempt at translating the above
d.mkt <- as.data.table(d.mkt)
d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
               `:=` (is.valid = replace_na(is_valid, FALSE),
                     rte_cd = ifelse(is_valid, rte_cd, 'RC00'),
                     rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
d.mkt <- d.mkt[, -"is.valid", with = FALSE]
d.mkt <- d.mkt[, .(c_rv = sum(c_rv), g_rv = sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)),
               by = .('prop', 'date')]  # <-- Error here already; also, how do we ungroup a `data.table`?

Close. Some suggestions/answers.
If you're shifting to data.table for speed, I suggest using fcoalesce and fifelse in lieu of replace_na and ifelse; minor.
The canonical way to remove is_valid is d.mkt[, is_valid := NULL].
Grouping can be done with a setdiff. In data.table there is no need to "ungroup": each [-call uses its own grouping. (That said, if you have multiple chained [-operations that use the same grouping, it can be useful to store that grouping as a variable, perhaps index it, and/or combine the whole [-chain into a single call. This is prone to lots of benchmarking discussion outside the scope of what we have here.)
Since all of your summary stats are the same, we can lapply(.SD, ..) this for a little readability improvement.
This might work:
library(data.table)
setDT(codes) # or using `as.data.table(codes)` below instead
setDT(d.mkt) # ditto
tmp <- codes[d.mkt, on = .(rte_cd, cd)] %>%
  .[, c("is_valid", "rte_cd", "rte_dsc") :=
      .(fcoalesce(is_valid, FALSE),
        fifelse(fcoalesce(is_valid, FALSE), rte_cd, "RC"),
        fifelse(fcoalesce(is_valid, FALSE), rte_dsc, "SKIPPED")) ]
tmp[, is_valid := NULL ]
cols <- c("c_rv", "g_rv", "h_rv", "rn")
tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
    by = setdiff(names(tmp), cols), .SDcols = cols]
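As for the error itself: by = .('prop', 'date') quotes the names inside .(), so data.table sees a list of two length-1 character constants instead of the prop and date columns, which is what the "keyby or by has length (1,1,1,1)" message complains about. A minimal sketch on toy columns named like the ones in the question:
library(data.table)
dt <- data.table(prop = c('a', 'a', 'b'), date = c(1, 1, 2), x = 1:3)
dt[, .(x = sum(x)), by = .(prop, date)]      # unquoted names inside .()
dt[, .(x = sum(x)), by = c('prop', 'date')]  # or a plain character vector
# by = .('prop', 'date') fails: a list of strings, not a list of columns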


How to add a vector to a table in backend using dbplyr (R)

I created a table from a data source using tbl(). I need to add a column containing 1:nrow() to my dataset, and I tried different methods without success. My code is as below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
It doesn't add the column ID to my dataset; it only adds the column step.
Using as.data.frame() it works, but it is very slow.
Do you have any ideas? Thanks in advance.
For this case, you can use row_number().
library(dbplyr)
library(DBI)
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add in the row
tbl(con, "mtcars") %>%
mutate(ID = row_number())
dbDisconnect(con)
I found the answer. It is to use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric:
df1 <- df1 %>% mutate(ID = as.numeric(row_number(a_column)), step = 1)  # a_column is a placeholder for the column to order by
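One caveat: on a database backend, row_number() is a window function, and row numbers are only well defined given an ordering. If your backend complains about a missing window ordering, dbplyr's window_order() supplies one explicitly; a minimal sketch reusing the mtcars example above (the choice of mpg as the ordering column is just for illustration):
library(dplyr)
library(dbplyr)
tbl(con, "mtcars") %>%
  window_order(mpg) %>%                  # explicit ORDER BY for the window function
  mutate(ID = as.numeric(row_number()))  # renders as ROW_NUMBER() OVER (ORDER BY `mpg`)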

NA_integer_ as a result of a sum

I need some help with a mystery I'm dealing with. Sorry if it is a beginner question, but I'm just starting to work with R.
I'm analyzing coronavirus cases in the world using the coronavirus package.
When I try to compute the total registered cases, the value I get from this new data subset is always NA_integer_. I honestly don't know how to get out of this situation. Here is the code I've written so far:
if (!require(pacman)) install.packages("pacman")
library(pacman)
pacman::p_load(readr, ggplot2, dplyr, coronavirus, knitr)

total_cases <- coronavirus_dataset %>%
  group_by(type, date) %>%
  summarise(diary_cases = sum(cases)) %>%
  arrange(date) %>%
  mutate(full_cases = cumsum(diary_cases))

total_cases_all = sum(coronavirus_dataset$cases)  # <-- here is the problem: total_cases_all
                                                  #     appears in the environment as NA_integer_

percent_cases <- coronavirus_dataset %>%
  group_by(type) %>%
  summarise(total_cases = sum(cases)) %>%
  mutate(percent_total = round(total_cases / total_cases_all * 100, digits = 2)) %>%
  arrange(desc(total_cases))
Thanks in advance.
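For what it's worth, sum() on an integer vector returns NA_integer_ in two common situations: the vector contains NA values, or the total overflows R's 32-bit integer range. Either could explain the result above (I can't tell which without the data); both are easy to check with a self-contained sketch:
x <- c(1L, 2L, NA)
sum(x)               # NA_integer_: any NA in the input makes the sum NA
sum(x, na.rm = TRUE) # 3: drop the NAs first

y <- c(.Machine$integer.max, 1L)
sum(y)               # NA_integer_ with an "integer overflow" warning
sum(as.numeric(y))   # 2147483648: accumulate in double precision instead
So, depending on the cause, sum(coronavirus_dataset$cases, na.rm = TRUE) or sum(as.numeric(coronavirus_dataset$cases)) should give a real number instead of NA.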

Dropping rows containing NA with dbplyr

Here is how I run some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)
drv <- dbDriver('Postgres')
con <- dbConnect(drv, dbname = 'mydb', port = 5432, user = 'postgres')
table1 <- tbl(con, 'table1')
table2 <- tbl(con, 'table2')
table3 <- tbl(con, 'table3')
table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
I want to drop the rows which include NA and then collect my final table, but I couldn't find anything helpful that works with dbplyr queries.
I tried to pipe drop_na() from tidyr and some base functions (complete.cases(), etc.). Can you suggest anything that would achieve this? Piping an SQL clause (like WHERE FOO IS NOT NULL) into the dbplyr query is also welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df = data.frame(my_num = c(1,2,3))
df = tbl_lazy(df, con = simulate_mssql())
output = df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated SQL gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1,2,3), c2 = c(9,8,7))
df = tbl_lazy(df, con = simulate_mssql())
colnames = c("c1","c2")
conditions = paste0("!is.na(",colnames,")")
output = df %>%
  filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
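As an aside, if you are on a reasonably recent stack (if_all() appeared in dplyr 1.0.4, and dbplyr has translated it since around version 2.1.0 — worth verifying for your versions), the multi-column case no longer needs rlang at all:
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7))
df = tbl_lazy(df, con = simulate_mssql())
df %>%
  filter(if_all(c(c1, c2), ~ !is.na(.x))) %>%  # every listed column must be non-NULL
  show_query()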
Well, actually I still didn't get a satisfying solution. What I wanted to do is drop the rows containing NA from within R, without typing an SQL query; I think dbplyr doesn't support this yet.
So I wrote a simple little piece of code to make it work:
main_query <- table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))
colnames <- main_query %>% colnames()
query1 <- main_query %>% sql_render() %>% paste('WHERE')
query2 <- ''
for (i in colnames) {
  if (i == tail(colnames, 1)) {
    query2 <- paste(query2, i, 'IS NOT NULL')
  } else {
    query2 <- paste(query2, i, 'IS NOT NULL AND')
  }
}
desiredTable <- dbGetQuery(con, paste(query1, query2))
Yeah, I know it doesn't seem magical but maybe someone can make use of it.
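For what it's worth, the loop can be collapsed into one paste() call, since paste(..., collapse = ' AND ') builds the chain of conditions directly; same behaviour, assuming main_query and con from above:
# Build "col1 IS NOT NULL AND col2 IS NOT NULL AND ..." in one step
conditions <- paste(colnames(main_query), 'IS NOT NULL', collapse = ' AND ')
desiredTable <- dbGetQuery(con, paste(sql_render(main_query), 'WHERE', conditions))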

Julia DataFrame: Create new column sum of col values :x by :y

I have a DataFrame of x and y occurrences. I would like to count how often each occurrence happens in the DataFrame and what percentage of the :y occurrences that combination represents. I have the first part down now, thanks to a previous question.
using DataFrames
mydf = DataFrame(y = rand('a':'h', 1000), x = rand('i':'p', 1000))
mydfsum = by(mydf, [:x, :y], df -> DataFrame(n = length(df[:x])))
This successfully creates a column that counts how often each value of :x occurs with each value of :y. Now I need to be able to generate a new column that counts how often each value of :y occurs. I could next create a new DataFrame using:
mydfsumy = by(mydf, [:y], df -> DataFrame(ny = length(df[:x])))
Join the DataFrames together.
mydfsum = join(mydfsum, mydfsumy, on = :y)
And create the percentage :yp column
mydfsum[:yp] = mydfsum[:n] ./ mydfsum[:ny]
But this seems like a clunky workaround for a common data-management problem. In R I would do all of this in one line using dplyr:
mydf %>% group_by(x, y) %>% summarize(n = n()) %>% group_by(y) %>% mutate(yp = n / sum(n))
You can do it in one line:
mydfsum = by(mydf, :y, df -> by(df, :x, dd -> DataFrame(n = size(dd,1), yp = size(dd,1)/size(df,1))))
or, if that becomes hard to read, you can use the do notation for anonymous functions:
mydfsum = by(mydf, :y) do df
    by(df, :x) do dd
        DataFrame(n = size(dd, 1), yp = size(dd, 1) / size(df, 1))
    end
end
What you are doing in R is actually a first by on both x and y, and then mutating a column of the output. You can do that here too, but you need to have created the column first. Below I initialize the yp column with zeros and then modify it in place with another by.
mydfsum = by(mydf,[:x,:y], df -> DataFrame(n = size(df,1), yp = 0.))
by(mydfsum, :y, df -> (df[:yp] = df[:n]/sum(df[:n])))
For more advanced data manipulation you may want to take a look at Query.jl

Understanding the sparklyr sql_render() function

There is a sql_render() function which translates dplyr code to SQL, but I cannot understand the resulting SQL code.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris <- copy_to(sc, iris, 'iris')
k <- iris %>%
  filter(Sepal_Length > 3) %>%
  filter(Sepal_Width > 3) %>%
  select(Petal_Length, Petal_Width, Species)
sql_render(k)
SELECT Petal_Length AS Petal_Length, Petal_Width AS Petal_Width, Species AS Species
FROM (SELECT *
FROM (SELECT *
FROM iris
WHERE (Sepal_Length > 3.0)) hezmcfppjh
WHERE (Sepal_Width > 3.0)) exwivyezte
What is the 'hezmcfppjh' and 'exwivyezte' ?
hezmcfppjh and exwivyezte are randomly generated subquery aliases that dplyr can use to reference specific parts of the query.
In this case they are unused, but in other operations the alias is needed: joins, renames, and other operations that require name disambiguation.
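To see the aliases actually doing work, render a join, where the two sides of the query need distinct names. A small sketch with plain dbplyr, which uses the same translation machinery and needs no Spark connection (the alias names vary by dbplyr version, e.g. LHS/RHS):
library(dplyr)
library(dbplyr)
df <- tbl_lazy(data.frame(id = 1:3, v = c('a', 'b', 'c')), con = simulate_mssql())
# Self-join: the generated SQL aliases the two sides so that `id` and `v`
# can be disambiguated.
df %>% inner_join(df, by = 'id') %>% show_query()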