Select statement with WHERE condition doesn't work in R - sql

I want to know why passing a parameter into an SQL query doesn't work.
See the sample of my code below:
Vector <- c(123, 436, 765)
for (i in 1:length(Vector)) {
  Result <- sqldf("select * from DF where studentID =", Vector[i])
  print(Result)
}
Note that in my DF, studentID has data type integer.
Thanks

I prefer plain R, but if you want to use SQL-like operations on a data.frame you should consider using dplyr, which is pretty much mainstream nowadays.
Here is a working example of an operation like the one you mentioned.
The pipe %>% simply takes what is on the left and passes it as the first argument to the function on its right, which makes nested calls easier to read.
library(dplyr)
iris
iris <- tbl_df(iris)
iris %>% filter(Sepal.Length<=5) %>% select(Species)
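For completeness, the original sqldf approach also works once the query is built as a single string, since sqldf() takes one complete SQL statement rather than extra arguments. A minimal sketch, assuming DF and its integer studentID column as in the question:
library(sqldf)
ids <- c(123, 436, 765)
for (id in ids) {
  # build the full query string; sqldf() does not append additional arguments
  result <- sqldf(paste0("select * from DF where studentID = ", id))
  print(result)
}
Alternatively, gsubfn-style interpolation avoids the paste: fn$sqldf("select * from DF where studentID = $id").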

Related

Dropping rows containing NA with dbplyr

Here is how I ran some SQL queries with dbplyr:
library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)
drv <- dbDriver('Postgres')
con <- dbConnect(drv,dbname='mydb',port=5432,user='postgres')
table1 <- tbl(con,'table1')
table2 <- tbl(con,'table2')
table3 <- tbl(con,'table3')
table1 %>% mutate(year=as.integer64(year)) %>% left_join(table2,by=c('id'='id')) %>%
left_join(table3,by=c('year'='year'))
I want to drop rows that contain NA and then collect my final table, but I couldn't find anything helpful that works on dbplyr queries.
I tried piping drop_na() from tidyr and some other base functions (complete.cases(), etc.). Can you suggest anything to achieve this? Piping an SQL clause (like WHERE FOO IS NOT NULL) into the dbplyr query is also welcome.
Thanks in advance.
Try using !is.na(col_name) as part of a filter:
library(dplyr)
library(dbplyr)
df = data.frame(my_num = c(1,2,3))
df = tbl_lazy(df, con = simulate_mssql())
output = df %>% filter(!is.na(my_num))
Calling show_query(output) to check the generated SQL gives:
<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))
The extra brackets are part of how dbplyr does its translation.
If you want to do this for multiple columns, try the following approach based on this answer:
library(rlang)
library(dplyr)
library(dbplyr)
df = data.frame(c1 = c(1,2,3), c2 = c(9,8,7))
df = tbl_lazy(df, con = simulate_mssql())
colnames = c("c1","c2")
conditions = paste0("!is.na(",colnames,")")
output = df %>%
filter(!!!parse_exprs(conditions))
Calling show_query(output) shows both columns appear in the generated query:
<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
Well, actually I still haven't found a fully satisfying solution. What I wanted was to drop rows containing NA from within R, without typing an SQL query; I think dbplyr doesn't support this yet.
So I wrote a small and simple piece of code to do it:
main_query <- table1 %>% mutate(year=as.integer64(year)) %>% left_join(table2,by=c('id'='id')) %>%
  left_join(table3,by=c('year'='year'))
colnames <- main_query %>% colnames
# render the lazy query to SQL and start appending a WHERE clause
query1 <- main_query %>% sql_render %>% paste('WHERE')
# build "col1 IS NOT NULL AND col2 IS NOT NULL AND ..." over all columns
query2 <- ''
for (i in colnames) {
  if (i == tail(colnames, 1)) {
    query2 <- paste(query2, i, 'IS NOT NULL')
  } else {
    query2 <- paste(query2, i, 'IS NOT NULL AND')
  }
}
desiredTable <- dbGetQuery(con, paste(query1, query2))
Yeah, I know it isn't magical, but maybe someone can make use of it.
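A newer alternative keeps the whole thing in R without building SQL strings. This is a sketch assuming dplyr >= 1.0 and a dbplyr release recent enough to translate if_all():
library(dplyr)
library(dbplyr)
df = tbl_lazy(data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7)), con = simulate_mssql())
# keep only rows where every column is non-NA; dbplyr translates this into IS NULL checks
output = df %>% filter(if_all(everything(), ~ !is.na(.x)))
show_query(output)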

SQL query output as a dataframe in R

I am very new to R. I wanted to use an SQL query to get data with R. (I am using Athena, but I think this doesn't matter.)
con <- dbConnect(RAthena::athena(),
s3_staging_dir = 's3://bla/bla/'
)
df <- tbl(con, sql("SELECT * FROM db.my_data"))
My problem is that df is not a data frame, so when I do names(df), for example, I don't get the columns (as I would in Python) but 'src' 'ops' instead. I don't know what these mean (they are not columns). How can I convert df so that it is a data frame in R?
You can use the function dbGetQuery from the package DBI.
From the documentation:
Description: Returns the result of a query as a data frame. dbGetQuery() comes with a default implementation (which should work with most backends) that calls dbSendQuery(), then dbFetch(), ensuring that the result is always freed by dbClearResult().
Implementation for your example:
df <- dbGetQuery(con, "SELECT * FROM db.my_data")
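Alternatively, you can keep the lazy table that tbl() returns ('src' and 'ops' are simply the internals of that lazy object) and materialise it with dplyr::collect(). A sketch using the same connection and table as above:
library(dplyr)
df <- tbl(con, sql("SELECT * FROM db.my_data")) %>% collect()
names(df)  # now returns the actual column names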

Understanding sparklyr's sql_render() function

There is a sql_render function that translates dplyr code to SQL,
but I cannot understand the resulting SQL code.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris <- copy_to(sc, iris, 'iris')
k = iris %>% filter(Sepal_Length > 3) %>% filter(Sepal_Width > 3) %>%
select(Petal_Length, Petal_Width, Species)
sql_render(k)
SELECT Petal_Length AS Petal_Length, Petal_Width AS Petal_Width, Species AS Species
FROM (SELECT *
FROM (SELECT *
FROM iris
WHERE (Sepal_Length > 3.0)) hezmcfppjh
WHERE (Sepal_Width > 3.0)) exwivyezte
What are 'hezmcfppjh' and 'exwivyezte'?
hezmcfppjh and exwivyezte are randomly generated query names that dplyr could use to reference specific parts of the subquery.
In this case they are unused aliases, but in other operations the alias can matter: joins, renames, and other operations that require name disambiguation.
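To see a case where aliases actually matter, render a self-join: the two copies of the table must get distinct names. A sketch on the same iris table (the exact alias names vary by dplyr/dbplyr version):
j <- iris %>% inner_join(iris, by = 'Species')
sql_render(j)
# the generated SQL refers to the two sides through aliases (e.g. `LHS` / `RHS`),
# the same mechanism behind the random subquery names above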

Selecting every Nth column using SQLDF or read.csv.sql

I am rather new to using SQL statements, and am having a little trouble using them to select the desired columns from a large table and pulling them into R.
I want to take a csv file and read selected columns into R, in particular every 9th and 10th column. In R, something like:
read.csv.sql("myfile.csv", sql = "select * from file [EVERY 9th and 10th COLUMN]")
My trawl of the internet suggests that selecting every nth row could be done with an SQL statement using MOD, something like this (please correct me if I am wrong):
"SELECT *
FROM file
WHERE (ROWID,0) IN (SELECT ROWID, MOD(ROWNUM,9) OR MOD(ROWNUM,10)"
Is there a way to make this work for columns? Thanks in advance.
read.csv would be adequate for this:
# determine number of columns
DF1 <- read.csv(myfile, nrows = 1)
nc <- ncol(DF1)
# create a list nc long where unwanted columns are NULL and wanted are NA
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length.out = nc)
# read in
DF <- read.csv(myfile, colClasses = colClasses)
To use sqldf instead, replace the last line with these:
nms <- names(DF1)
vars <- toString(nms[is.na(colClasses)])
DF <- fn$read.csv.sql(myfile, "select $vars from file")
UPDATE: switched to read.csv.sql
UPDATE 2: correction.
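As a quick sanity check of the colClasses pattern (a sketch; the 20-column count is made up for illustration):
nc <- 20  # hypothetical column count
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length.out = nc)
which(is.na(colClasses))  # 9 10 19 20 - the columns that will be read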

R equivalent of SELECT DISTINCT on two or more fields/variables

Say I have a dataframe df with two or more columns, is there an easy way to use unique() or other R function to create a subset of unique combinations of two or more columns?
I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.
It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset:
uniques <- as.data.frame(ftable(df$var1, df$var2))
unique works on data.frame, so unique(df[c("var1","var2")]) should be what you want.
Another option is distinct from dplyr package:
df %>% distinct(var1, var2) # or distinct(df, var1, var2)
Note:
For older versions of dplyr (< 0.5.0, 2016-06-24) distinct required additional step
df %>% select(var1, var2) %>% distinct
(or oldish way distinct(select(df, var1, var2))).
@Marek's answer is obviously correct, but may be outdated. The current dplyr version (0.7.4) allows for even simpler code:
Simply use:
df %>% distinct(var1, var2)
If you want to keep all columns, add
df %>% distinct(var1, var2, .keep_all = TRUE)
To keep all other variables in df, use this:
unique_rows <- !duplicated(df[c("var1","var2")])
unique.df <- df[unique_rows,]
Another, less recommended, method is using row.names() (see David's comment below):
unique_rows <- row.names(unique(df[c("var1","var2")]))
unique.df <- df[unique_rows,]
In addition to the answers above, here is the data.table version:
setDT(df)
unique_dt = unique(df, by = c('var1', 'var2'))
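For a side-by-side check, each approach returns the unique (var1, var2) combinations, though they differ in which columns they keep. The toy data frame here is made up for illustration:
library(dplyr)
library(data.table)
df <- data.frame(var1 = c(1, 1, 2), var2 = c("a", "a", "b"), other = 1:3)
unique(df[c("var1", "var2")])                      # base R: just the two columns
df %>% distinct(var1, var2)                        # dplyr: drops `other` unless .keep_all = TRUE
unique(as.data.table(df), by = c("var1", "var2"))  # data.table: keeps all columns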