Remove Duplicate Columns via sqlQuery() - sql

I am working with the R programming language. Suppose I have the following data frame:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
head(testframe)
age height height2 gender gender2
1 18 76.1 76.1 M M
2 19 77.0 77.0 F F
3 20 78.1 78.1 M M
4 21 78.2 78.2 M M
5 22 78.8 78.8 F F
6 23 79.7 79.7 F F
If I want to remove columns with different names but identical values, I can use the following line of code:
no_dup = testframe[!duplicated(as.list(testframe))]
head(no_dup)
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
My Question: Suppose the data frame is not located in the global environment - is it possible to pass the above line of code through a sqlQuery() command? For example:
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
#not sure if this is correct?
sample_query = sqlQuery(con, "testframe[!duplicated(as.list(testframe))]")
Can someone please show me how to do this?
Thanks!

This does all the substantive processing on the SQL side and only does name manipulation on the R side. The database is not downloaded to R.
The first pipeline inputs the names (we have hard coded the names in Names but you can retrieve them from the database if necessary) and returns an SQL statement, sql1, that when run against your database will produce a one line data frame from the data base that has a column for each pair of variables in testframe and whose value is the number of unequal values.
We then run sql1 using sqldf for reproducibility but you can replace that with an appropriate call to sqlQuery.
The second pipeline then uses numDF to generate one or more SQL statements in character vector sql2 to drop the duplicated columns, i.e. those for which there are zero unequal values You can then run those SQL statements against your database.
We used sqldf with SQLite for reproducibility but you can replace the calls to sqldf with an appropriately modified call to sqlQuery, e.g. sqlQuery(con, sql1) where con is the connection you have previously defined.
It is likely that whatever database system you are using accepts the same SQL but if not there may be small changes needed in the code to generate SQL accepted by whatever it is you are using.
library(magrittr)
library(sqldf)
Names <- c("age", "height", "height2", "gender", "gender2")
sql1 <- Names %>%
{ toString(sprintf("sum(%s)", combn(., 2, paste, collapse = "!="))) } %>%
paste("select", ., "from testframe")
numDF <- sqldf(sq11) # replace with call to your database
sql2 <- numDF %>%
Filter(Negate(c), .) %>%
names %>%
sub(".*!=(.*.).", "alter table testframe drop \\1", .)
# Just run the sql2 part against your db, not select * ... part.
# The select * ... downloads table for demo purposes only.
sqldf(c(sql2, "select * from testframe")) # replace
## age height gender
## 1 18 76.1 M
## 2 19 77.0 F
## 3 20 78.1 M
## 4 21 78.2 M
## ...snip...
Note that sql1 and sql2 are the following. sql1 is a single sql select statement and sql2 is a vector of sql alter statements, one statement per column to drop. If your data base allows ALTER to drop multiple columns at once you might be able to simplify that but SQLite only allows one at a time.
sql1
## [1] "select sum(age!=height), sum(age!=height2), sum(age!=gender), sum(age!=gender2), sum(height!=height2), sum(height!=gender), sum(height!=gender2), sum(height2!=gender), sum(height2!=gender2), sum(gender!=gender2) from testframe"
sql2
## [1] "alter table testframe drop height2" "alter table testframe drop gender2"

Related

Select all columns except one in R sqldf package

Is there a way in R using the sqldf package to select all columns except one?
Your call to sqldf based on some query should return a data frame, where each DF column corresponds to one of the columns appearing in the select clause of your SQL query. Consider the following example:
sql <- "SELECT * FROM yourTable WHERE <some conditions>"
df <- sqldf(sql)
drop <- c("some_column")
df <- df[, !(names(df) %in% drop)]
Note in the above I am doing a SELECT * to fetch all columns in the table (what I assume is your use case). I then subset off a column some_column from the resulting data frame.
Note that doing this from SQL directly generally is not possible. That is, once you do SELECT *, the cat is out of the bag, and you end up with all columns.
1) SQLite Using the default SQLite backend, suppose we want to return the first 3 rows of all columns in mtcars except for the cyl column. First create a comma separated string, sel, of all such column names and then use fn$sqldf to allow string interpolation referring to it in the SQL statement as $sel. Add verbose=TRUE argument to sqldf if you want to see the SQL statement that was generated.
library(sqldf)
sel <- toString(setdiff(names(mtcars), "cyl"))
fn$sqldf("select $sel from mtcars limit 3")
giving:
mpg disp hp drat wt qsec vs am gear carb
1 21.0 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 108 93 3.85 2.320 18.61 1 1 4 1
2) H2 The H2 backend supports alter table ... drop column ... so we can write the following. Since alter does not return anything we add a select which returns the altered table.
library(RH2)
library(sqldf)
sqldf(c("alter table mtcars drop column cyl",
"select * from mtcars limit 3"))

sqldf R Error create a table

I am doing some experiments with SQL in R using the sqldf package.
I am trying to test some commands to check the output, in particular I am trying to create tables.
Here the code:
sqldf("CREATE TABLE tbl1 AS
SELECT cut
FROM diamonds")
Very simple code, however I get this error
sqldf("CREATE TABLE tbl1 AS
+ SELECT cut
+ FROM diamonds")
data frame with 0 columns and 0 rows
Warning message:
In result_fetch(res#ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries
Why is it saying the the table create as 0 columns and 0 rows?
Can someone help?
That is a warning, not an error. The warning is caused by a backward incompatibility in recent versions of RSQLite. You can ignore it since it works anyways.
The sqldf statement that is shown in the question
creates an empty database
uploads the diamonds data frame to a table of the same name in that database
runs the create statement which creates a second table tbl1 in the database
returns nothing (actually a 0 column 0 row data frame) since a create statement has no value
destroys the database
When using sqldf you don't need create statements. It automatically creates a table in the backend database for any data frame referenced in your sql statement so the following sqldf statement
sqldf("select * from diamonds")
will
create an empty database
upload diamonds to it
run the select statement
return the result of the select statement as a data frame
destroy the database
You can use the verbose=TRUE argument to see the individual calls to the lower level RSQLite (or other backend database if you specify a different backend):
sqldf("select * from diamonds limit 3", verbose = TRUE)
giving:
sqldf: library(RSQLite)
sqldf: m <- dbDriver("SQLite")
sqldf: connection <- dbConnect(m, dbname = ":memory:")
sqldf: initExtension(connection)
sqldf: dbWriteTable(connection, 'diamonds', diamonds, row.names = FALSE)
sqldf: dbGetQuery(connection, 'select * from diamonds limit 3')
sqldf: dbDisconnect(connection)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
Suggest you thoroughly review help("sqldf") as well as the info on the sqldf github home page

in R, read special columns with read.csv.sql

I am trying to read a big csv file. Indeed, I want select a subset using a special column which name is Race Color. Reading the file via read.csv, I have the head
library(sqldf)
df <- read.csv(file = 'df.txt', header = T, sep = ";")
head(df)
id Region Race Color ....
1 1 1
2 1 1
3 2 1
4 3 2
5 4 1
6 4 1
I would like to use read.csv.sql for selecting a subset of df without use the read.csv file. For example, I want all the people with Race Color equal to 1.
Using read.csv.sql, I have something like
>df <- read.csv.sql("df.txt", sql = "select * from file where Race Color = 1", sep=";", header=T, eol="\n")
but I have the following error
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: near "Color": syntax error
Trying
>df <- read.csv.sql("df.txt", sql = "select * from file where 'Race Color' = 1", sep=";", header=T, eol="\n")
I have df with zero rows.
Any solution?
R automatically adds a . to column names with a space on reading in the data to make Race.Color, but a . has a special meaning in sql, so that will screw things up.
There is a built in method in sqldf using square brackets ([Race.Color]) to explicitly name columns we can use so that we don't run into that problem. You can also use escaped quotes : \"Race.Color\"
This should work:
library(sqldf)
read.csv.sql("test.csv", sql = "select * from file where [Race.Color] = 1", sep=";", header=T, eol="\n")

SAS how to calculate a variable that is the mean of the values of all the other observations for each observation?

For example, the dataset a is
id x
1 15
2 25
3 35
4 45
I want to add a column y to dataset a, y being the average of x excluding the current id.
so y_1 = (x_2+x_3+x_4)/3 = (25+35+45)/3.
Easiest way to do it without SQL is to add the mean and the n to each row (use PROC MEANS, then merge on the values), and then use math to remove the current value. IE, if x_mean=(15+25+35+45)/4 = 30, and x=15, then
x_mean_others = ((30*4)-15)/(4-1) = 105/3 = 35
Alternateively, in SQL, you can calculate it on the fly with the same idea.
proc sql;
create table want as
select x, (mean(x)*n(x) - x)/(n(x)-1) as y
from have H
;
quit;
This takes advantage of SAS's automatic remerging, in something like SQL Server you'd need a WITH clause to make this work I imagine.

Read only n-th column of a text file which has no header with R and sqldf

I have a similiar problem like this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (table of 150rows, >500,000 columns, space separated, filled with numeric data and only a 32 bit system available). This file has no header, therefore the code in the thread above didn't work and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like that, but any results with fread or read.table are also ok:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
You can use substr to specify the start and end position of the columns you want to read in if they are fixed width:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.