percentile_cont in bigrquery - sql

I would like to get the 20th percentile of a column in BigQuery using dplyr syntax in bigrquery, but I keep getting an error. Here is a reproducible example:
library(bigrquery)
library(dplyr)
library(DBI)
billing <- YOUR_BILLING_INFO
con <- dbConnect(
bigrquery::bigquery(),
project = "publicdata",
dataset = "samples",
billing = billing
)
natality <- tbl(con, "natality")
natality %>%
filter(year %in% c(1969, 1970)) %>%
group_by(year) %>%
summarise(percentile_20 = percentile_cont(weight_pounds, 0.2))
I get the following error:
Error: Analytic function PERCENTILE_CONT cannot be called without an OVER clause at [1:16] [invalidQuery]
However, it is not clear how to include an OVER clause here. How can I get the 20th percentile with dplyr syntax?
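One possible workaround, sketched under assumptions rather than offered as a confirmed answer: BigQuery exposes PERCENTILE_CONT only as an analytic (window) function, so inside summarise() an aggregate such as APPROX_QUANTILES can be passed through verbatim with sql(). The example uses a lazy_frame() with a simulated connection only to inspect the generated SQL; natality_stub and its values are hypothetical, and the result is an approximate percentile, not the interpolated one.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for tbl(con, "natality"); no BigQuery connection is
# needed to look at the SQL that dbplyr generates.
natality_stub <- lazy_frame(
  year = c(1969L, 1970L),
  weight_pounds = c(7.1, 6.5),
  con = simulate_dbi()
)

query <- natality_stub %>%
  filter(year %in% c(1969L, 1970L)) %>%
  group_by(year) %>%
  # sql() passes the fragment through untranslated. In BigQuery,
  # APPROX_QUANTILES(x, 100) returns 101 boundaries, so OFFSET(20)
  # is approximately the 20th percentile.
  summarise(percentile_20 = sql("APPROX_QUANTILES(weight_pounds, 100)[OFFSET(20)]"))

show_query(query)
```

If the exact interpolated value is needed, the same sql() trick could in principle carry a full window expression such as PERCENTILE_CONT(weight_pounds, 0.2) OVER (PARTITION BY year) inside mutate(), followed by distinct(); that variant is untested here.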

Related

How to add a vector to a table in backend using dbplyr (R)

I created a table from a data source using tbl(). I need to add a column including 1:nrow() to my dataset and tried different methods but I didn't succeed. My code is as below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
It doesn't add column ID to my dataset and only adds column step.
Using as.data.frame() works, but it is very slow.
Do you have any ideas? Thanks in advance.
For this case, you can use row_number().
library(dplyr) # provides tbl(), mutate() and %>%
library(dbplyr)
library(DBI)
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add in the row
tbl(con, "mtcars") %>%
mutate(ID = row_number())
dbDisconnect(con)
I found the answer: use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric. Replace some_column with the column that defines the row order:
df1 <- df1 %>% mutate(ID = as.numeric(row_number(some_column)), step = 1)
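As a minimal sketch of the two answers combined, the example below numbers rows in an explicit order with window_order() and converts the ID with as.numeric(). It assumes an in-memory SQLite database holding mtcars; the choice of mpg and wt as ordering columns is arbitrary and only for illustration.

```r
library(dplyr)
library(dbplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

with_id <- tbl(con, "mtcars") %>%
  # window_order() fixes the ordering used by row_number(); without it the
  # database is free to number rows in any order.
  window_order(desc(mpg), wt) %>%
  mutate(ID = as.numeric(row_number()), step = 1) %>%
  collect()

head(with_id[, c("mpg", "wt", "ID", "step")])

dbDisconnect(con)
```

Everything up to collect() runs in the database, so no rows are pulled into R just to build the ID column.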

Translate SQL statement in R code using dplyr

I need help translating an SQL statement for this dataset, https://www.kaggle.com/datasets/hugomathien/soccer , into R code using dplyr.
The SQL statement is :
SELECT Match.date ,Team.team_long_name, Team.team_short_name ,Match.home_team_goal
FROM Team JOIN Match
ON Match.home_team_api_id = Team.team_api_id
WHERE Match.match_api_id = 492476;
The R code that I have tried is:
con <- DBI::dbConnect(RSQLite::SQLite(), "data/database.sqlite")
library(tidyverse)
library(DBI)
match<-tbl(con,"Match")
team<-tbl(con,"Team")
table_4.2<-match %>%
filter(match_api_id=492476) %>%
select(date,home_team_goal,home_team_api_id) %>%
left_join(team)
and I get this error:
Error in dplyr::common_by():
! by required, because the data sources have no common variables.
Run rlang::last_error() to see where the error occurred.
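Use the following code. The tables have no column name in common, so by must map the key of the left table (team) to the key of the right table (match); inner_join() mirrors the SQL JOIN:
library(tidyverse)
team %>%
inner_join(match, by = c(team_api_id = "home_team_api_id")) %>%
filter(match_api_id == 492476) %>%
select(date, team_long_name, team_short_name, home_team_goal)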
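To make the direction of the by mapping concrete, here is a small self-contained sketch on local tibbles. The two tables and all their values are made up for illustration; only the column names follow the question. The same verbs should apply unchanged to tbl(con, "Match") and tbl(con, "Team").

```r
library(dplyr)

# Tiny made-up stand-ins for the Match and Team tables (illustrative values only)
match <- tibble(
  match_api_id = c(492476L, 492477L),
  date = c("2008-08-17", "2008-08-16"),
  home_team_api_id = c(10L, 20L),
  home_team_goal = c(1L, 0L)
)
team <- tibble(
  team_api_id = c(10L, 20L),
  team_long_name = c("Team A", "Team B"),
  team_short_name = c("TMA", "TMB")
)

result <- match %>%
  filter(match_api_id == 492476L) %>%
  # name on the left = column in match, value on the right = column in team
  inner_join(team, by = c("home_team_api_id" = "team_api_id")) %>%
  select(date, team_long_name, team_short_name, home_team_goal)

result
```

Filtering before the join keeps the joined set small; starting from team instead only requires swapping the two sides of the by mapping.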

Sort by one variable, group by another, and select first row in SQL Query in R

I need to apply a procedure in SQL that is easy for me in R but has been really tortuous in SQL.
I need to sort the data from highest to lowest by two variables, group by another variable, and select the first item in each group.
Below is the code that I am trying to translate from R to SQL. Unfortunately, dbplyr throws an error when translating it: Error: first() is only available in a windowed (mutate()) context
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2
mtcars2 %>%
arrange(-mpg,-disp) %>%
group_by(cyl) %>%
summarise(hp = first(hp)) %>%
show_query()
It seems to me that the DISTINCT ON function could help me.
Thanks for your help.
Maybe the following?
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
arrange(-mpg,-disp) %>%
group_by(cyl) %>%
mutate(hp = first(hp)) %>%
select(cyl, hp) %>%
distinct() %>%
show_query()
#> <SQL>
#> SELECT DISTINCT `cyl`, FIRST_VALUE(`hp`) OVER (PARTITION BY `cyl` ORDER BY -`mpg`, -`disp`) AS `hp`
#> FROM `mtcars`
#> ORDER BY -`mpg`, -`disp`
See: https://github.com/tidyverse/dbplyr/issues/129
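An alternative sketch, assuming a database with window function support (SQLite 3.25 or later here): rank the rows within each group with row_number() and keep rank 1. This is a swapped-in technique, not the approach from the linked issue, and window_order() is used so the ordering is stated explicitly instead of relying on a preceding arrange().

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

first_per_cyl <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  # order rows inside each cyl group: highest mpg first, then highest disp
  window_order(desc(mpg), desc(disp)) %>%
  # row_number() becomes ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
  filter(row_number() == 1L) %>%
  ungroup() %>%
  select(cyl, hp)

show_query(first_per_cyl)
result <- collect(first_per_cyl)

DBI::dbDisconnect(con)
```

Unlike the FIRST_VALUE plus DISTINCT version, this keeps exactly one whole row per group, so other columns of that row can be selected as well.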

Dropping rows containing NA with dbplyr

Here is how I ran some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)
drv <- dbDriver('Postgres')
con <- dbConnect(drv,dbname='mydb',port=5432,user='postgres')
table1 <- tbl(con,'table1')
table2 <- tbl(con,'table2')
table3 <- tbl(con,'table3')
table1 %>% mutate(year=as.integer64(year)) %>% left_join(table2,by=c('id'='id')) %>%
left_join(table3,by=c('year'='year'))
I want to drop the rows that contain NA and then collect the final table, but I couldn't find anything that works with dbplyr queries.
I tried piping to drop_na() from tidyr and to some base functions (complete.cases() etc.). Can you suggest a way to do this? Appending an SQL fragment (like WHERE FOO IS NOT NULL) to the dbplyr query is also welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df = data.frame(my_num = c(1,2,3))
df = tbl_lazy(df, con = simulate_mssql())
output = df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated SQL gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1,2,3), c2 = c(9,8,7))
df = tbl_lazy(df, con = simulate_mssql())
colnames = c("c1","c2")
conditions = paste0("!is.na(",colnames,")")
output = df %>%
filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
Actually, I still don't have a satisfying solution. What I wanted was to drop the rows containing NA from within R, without typing the SQL conditions by hand; I think dbplyr doesn't support this directly yet.
So I wrote a short piece of code to do it:
main_query<-table1 %>% mutate(year=as.integer64(year)) %>% left_join(table2,by=c('id'='id')) %>%
left_join(table3,by=c('year'='year'))
colnames <- main_query %>% colnames
query1 <- main_query %>% sql_render %>% paste('WHERE')
# build "col1 IS NOT NULL AND col2 IS NOT NULL AND ..." in one step
query2 <- paste(colnames, 'IS NOT NULL', collapse = ' AND ')
desiredTable <- dbGetQuery(con,paste(query1,query2))
Yeah, I know it doesn't seem magical but maybe someone can make use of it.
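A sketch of how the same goal might be reached without pasting SQL strings: build one !is.na() expression per column from colnames() of the lazy table and splice them into filter(). The demo table and its values are made up for illustration, and an in-memory SQLite database stands in for the Postgres connection.

```r
library(dplyr)
library(dbplyr)
library(rlang)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Illustrative table with missing values in both columns
DBI::dbWriteTable(
  con, "demo",
  data.frame(c1 = c(1, NA, 3, 4), c2 = c("a", "b", NA, "d"))
)

demo <- tbl(con, "demo")

# One !is.na(<column>) expression for every column of the lazy table
not_null <- lapply(syms(colnames(demo)), function(col) expr(!is.na(!!col)))

complete_rows <- demo %>%
  filter(!!!not_null) %>%   # conditions are combined with AND
  collect()

DBI::dbDisconnect(con)
```

Because the conditions are ordinary dplyr expressions, dbplyr handles quoting of the column names, which the string-pasting version leaves to the caller.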

Converting a table into a data frame - R Studio

I have the following table, which I need to convert to a data frame, and I get the error below, which I can't figure out. My main goal is to get a value from a particular column in the table. Viewing the table works fine. Thanks
library(RODBC)
library(odbc)
library(dplyr)
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "MSIGS75\\SQLEXPRESS",
Database = "Players")
dbListTables(con)
table <- tbl(con, "playersData")
View(tbl(con, "playersData"))
tableDF <- as.data.frame(table)
Error
Error in as.data.frame.default(table) : cannot coerce class ‘"function"’ to a data.frame
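We can use collect() to pull the remote table into R. The error message shows that table still referred to base R's table() function when as.data.frame() ran, so check that the tbl() assignment succeeded (a different variable name avoids the clash). Selecting the column before collect() means only the needed data is transferred:
library(dbplyr)
library(dplyr)
yourcolumn <- "some column name"
yourindex <- 5 # row 5
table %>%
select(all_of(yourcolumn)) %>%
collect() %>%
as.data.frame() %>%
slice(yourindex)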
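For a version that can be run without SQL Server, here is a minimal sketch using an in-memory SQLite database. The column names name and score and all values are hypothetical, since the structure of playersData is not shown in the question.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Hypothetical contents for playersData
DBI::dbWriteTable(
  con, "playersData",
  data.frame(name = c("Ann", "Bob", "Cy"), score = c(10, 20, 30))
)

# Lazy reference to the remote table; the name avoids shadowing base::table()
players <- tbl(con, "playersData")

players_df <- players %>%
  select(name, score) %>%   # still translated to SQL
  collect() %>%             # now the rows are fetched into R
  as.data.frame()

# A single value from a particular column and row
value <- players_df[2, "score"]

DBI::dbDisconnect(con)
```

Once collect() has run, players_df is an ordinary data frame and stays usable after the connection is closed.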