In R, read special columns with read.csv.sql

I am trying to read a big csv file. In fact, I want to select a subset using a special column whose name is Race Color. Reading the file via read.csv, I get the following head:
library(sqldf)
df <- read.csv(file = 'df.txt', header = T, sep = ";")
head(df)
id Region Race.Color ....
 1      1          1
 2      1          1
 3      2          1
 4      3          2
 5      4          1
 6      4          1
I would like to use read.csv.sql to select a subset of df without reading the whole file via read.csv. For example, I want all the people with Race Color equal to 1.
Using read.csv.sql, I have something like
> df <- read.csv.sql("df.txt", sql = "select * from file where Race Color = 1", sep=";", header=T, eol="\n")
but I have the following error
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: near "Color": syntax error
Trying
> df <- read.csv.sql("df.txt", sql = "select * from file where 'Race Color' = 1", sep=";", header=T, eol="\n")
I get a df with zero rows.
Any solution?

On reading in the data, R automatically converts the space in the column name to a dot, giving Race.Color, but a dot has a special meaning in SQL, so that name will break the query. (Single quotes don't help either: in SQL they create a string literal, so 'Race Color' = 1 compares a constant string with 1 and matches nothing, which is why you got zero rows.)
sqldf has a built-in way around this: square brackets ([Race.Color]) name a column explicitly so we don't run into that problem. You can also use escaped quotes: \"Race.Color\".
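You can see the renaming that read.csv applies, since it runs column names through make.names under its default check.names = TRUE:
make.names("Race Color")
## [1] "Race.Color"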
This should work:
library(sqldf)
read.csv.sql("test.csv", sql = "select * from file where [Race.Color] = 1", sep=";", header=T, eol="\n")


Remove Duplicate Columns via sqlQuery()

I am working with the R programming language. Suppose I have the following data frame:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
head(testframe)
  age height height2 gender gender2
1  18   76.1    76.1      M       M
2  19   77.0    77.0      F       F
3  20   78.1    78.1      M       M
4  21   78.2    78.2      M       M
5  22   78.8    78.8      F       F
6  23   79.7    79.7      F       F
If I want to remove columns with different names but identical values, I can use the following line of code:
no_dup = testframe[!duplicated(as.list(testframe))]
head(no_dup)
  age height gender
1  18   76.1      M
2  19   77.0      F
3  20   78.1      M
4  21   78.2      M
5  22   78.8      F
6  23   79.7      F
My Question: Suppose the data frame is not located in the global environment - is it possible to pass the above line of code through a sqlQuery() command? For example:
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
#not sure if this is correct?
sample_query = sqlQuery(con, "testframe[!duplicated(as.list(testframe))]")
Can someone please show me how to do this?
Thanks!
This does all the substantive processing on the SQL side and only does name manipulation on the R side; the database is not downloaded to R.
The first pipeline takes the column names (hard coded in Names here, but you can retrieve them from the database if necessary) and returns an SQL statement, sql1, which when run against your database produces a one-row data frame with a column for each pair of variables in testframe, whose value is the number of unequal values in that pair.
We then run sql1 using sqldf for reproducibility, but you can replace that with an appropriate call to sqlQuery.
The second pipeline then uses numDF to generate one or more SQL statements, in the character vector sql2, that drop the duplicated columns, i.e. those pairs for which there are zero unequal values. You can then run those SQL statements against your database.
We used sqldf with SQLite for reproducibility, but you can replace the calls to sqldf with appropriately modified calls to sqlQuery, e.g. sqlQuery(con, sql1), where con is the connection you have previously defined.
Whatever database system you are using likely accepts the same SQL, but if not, small changes may be needed in the code that generates it.
library(magrittr)
library(sqldf)

Names <- c("age", "height", "height2", "gender", "gender2")

# one sum(...) term per pair of columns, counting rows on which the pair differs
sql1 <- Names %>%
  { toString(sprintf("sum(%s)", combn(., 2, paste, collapse = "!="))) } %>%
  paste("select", ., "from testframe")

numDF <- sqldf(sql1)  # replace with call to your database

# keep the pairs with zero differences and turn each into an ALTER statement
# (the regex extracts the second column name of each pair)
sql2 <- numDF %>%
  Filter(Negate(c), .) %>%
  names %>%
  sub(".*!=(.*.).", "alter table testframe drop \\1", .)

# Just run the sql2 part against your db, not the select * ... part.
# The select * ... downloads the table for demo purposes only.
sqldf(c(sql2, "select * from testframe"))  # replace
##   age height gender
## 1  18   76.1      M
## 2  19   77.0      F
## 3  20   78.1      M
## 4  21   78.2      M
## ...snip...
Note that sql1 and sql2 are shown below. sql1 is a single SQL select statement and sql2 is a vector of SQL alter statements, one statement per column to drop. If your database allows ALTER to drop multiple columns at once you might be able to simplify that, but SQLite only allows one at a time.
sql1
## [1] "select sum(age!=height), sum(age!=height2), sum(age!=gender), sum(age!=gender2), sum(height!=height2), sum(height!=gender), sum(height!=gender2), sum(height2!=gender), sum(height2!=gender2), sum(gender!=gender2) from testframe"
sql2
## [1] "alter table testframe drop height2" "alter table testframe drop gender2"

Sqldf in R - error with first column names

Whenever I use read.csv.sql I cannot select from the first column, and any output from the code places an unusual character sequence (Ã¯..) at the beginning of the first column's name.
So suppose I create a df.csv file in Excel that looks something like this:
df = data.frame(
a = 1,
b = 2,
c = 3,
d = 4)
Then if I use sqldf to query the csv which is in my working directory I get the following error:
> read.csv.sql("df.csv", sql = "select * from file where a == 1")
Error in result_create(conn@ptr, statement) : no such column: a
If I query a different column than the first, I get a result but with the output of the unusual characters as seen below
df <- read.csv.sql("df.csv", sql = "select * from file where b == 2")
View(df)
Any idea how to prevent these characters from being added to the first column name?
The problem is presumably that the file is larger than R can handle, so you only want to read a subset of rows into R, but the condition you need to filter by refers to the first column, whose name is messed up and therefore unusable.
Here are two alternative approaches. The first involves a bit more code but has the advantage that it is 100% R. The second is only one statement but additionally makes use of an external utility.
1) skip header Read the file in skipping over the header. That will cause the columns to be labelled V1, V2, etc. and use V1 in the condition.
# write out a test file - BOD is a data frame that comes with R
write.csv(BOD, "BOD.csv", row.names = FALSE, quote = FALSE)
# read file skipping over header
DF <- read.csv.sql("BOD.csv", "select * from file where V1 < 3",
skip = 1, header = FALSE)
# read in header, assign it to DF and fix first column
hdr <- read.csv.sql("BOD.csv", "select * from file limit 0")
names(DF) <- names(hdr)
names(DF)[1] <- "TIME" # suppose we want TIME instead of Time
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
2) filter Another way to proceed is to use the filter= argument. Here we assume we know that the column name ends in ime but that unknown characters precede it. This assumes sed is available and on your path; if you are on Windows, install Rtools to get sed. The quoting might need to be changed depending on your shell.
When trying this on Windows I noticed that sed from Rtools changed the line endings so below we specified eol= to ensure correct processing. You may not need that.
DF <- read.csv.sql("BOD.csv", "select * from file where TIME < 3",
filter = 'sed -e "1s/.*ime,/TIME,/"', eol = "\n")
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
So I figured it out by reading through the above comments.
I'm on a Windows 10 machine using Excel for Office 365. The special characters go away if I change how I save the file, from "CSV UTF-8 (Comma Delimited)" to plain "CSV (Comma delimited)". (The UTF-8 variant writes a byte-order mark at the start of the file, which ends up glued to the first column's name.)
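If re-saving the file is not an option, base R's read.csv can also consume the byte-order mark itself via the documented "UTF-8-BOM" encoding; note this reads the whole file, so it only helps when the data fits in memory:
# "UTF-8-BOM" tells R to strip the byte order mark, leaving the first name clean
df <- read.csv("df.csv", fileEncoding = "UTF-8-BOM")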

R: Why is numeric data treated like non-numeric data?

My R script is:
aa <- data.frame("1","3","1.5","2.1")
mean(aa)
Then I get:
[1] NA
Warning message: In mean.default(aa) : argument is not numeric or logical: returning NA
Can this be solved without removing the quotation marks or changing data.frame to something else?
For a data.frame, if you want the mean value of each column, you can use the colMeans function:
aa <- data.frame("1","3","1.5","2.1", stringsAsFactors = FALSE)
aa <- sapply(aa, as.numeric)
colMeans(aa)
mean(colMeans(aa)) # if you want average across all columns
You must create the data.frame using stringsAsFactors = FALSE; otherwise each character value becomes a separate one-level factor, whose numeric representation is 1 regardless of the value. (Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so this only bites on older versions.)
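To see the pitfall, here is a minimal sketch with the factor behaviour forced on explicitly:
aa <- data.frame("1", "3", "1.5", "2.1", stringsAsFactors = TRUE)
sapply(aa, as.numeric)                               # 1 1 1 1 -- the factor codes, not the values
sapply(aa, function(x) as.numeric(as.character(x)))  # 1.0 3.0 1.5 2.1 -- the actual values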

Reformat wide Excel table to more SQL-friendly structure

I have a very wide Excel sheet, from Column A - DIE (about 2500 columns wide), of survey data. Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have a wide sheet with one row per response and one column per question, and I want to convert it to a long, SQL-friendly layout with one row per response/question pair.
What options do I have to make this change, either in Excel (prior to upload) or SQL (while circumventing the 1024-column limit)?
I have had to do this quite a bit. My solution was to write a Python script that would un-crosstab a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but don't know why it wouldn't work.
R has a function for exactly this in one of its libraries (gather, from tidyr). You can also connect to, read from, and write to a database with R. I would suggest downloading R and RStudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
  id question_1 question_2
1  1          1          2
2  2          0          0
3  3          1          2
Code to reshape the data from wide to long:
df2 <- gather(df, key = question, value = values, -id)
df2
Output:
  id   question values
1  1 question_1      1
2  2 question_1      0
3  3 question_1      1
4  1 question_2      2
5  2 question_2      0
6  3 question_2      2
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
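Note that gather is now superseded; in current versions of tidyr the equivalent call is pivot_longer, which would produce the same long table:
df2 <- pivot_longer(df, cols = -id, names_to = "question", values_to = "values")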
Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv

with open('Responses.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)  # capture current field headers
    newHeaders = ['ResponseID', 'Question', 'Response']  # establish new header names
    with open('PythonOut.csv', 'w') as outputfile:
        writer = csv.writer(outputfile, dialect='excel', lineterminator='\n')
        writer.writerow(newHeaders)  # write new headers to output
        QuestionHeaders = headers[1:len(headers)]  # slice the question headers from the original header list
        for row in reader:
            questionCount = 0  # counter to loop through each question (column) for every response (row)
            while questionCount <= len(QuestionHeaders) - 1:
                newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
                writer.writerow(newRow)
                questionCount += 1

Read only n-th column of a text file which has no header with R and sqldf

I have a similar problem to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, with only a 32-bit system available). This file has no header, so the code in the thread above didn't work, and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like this, but any solution with fread or read.table is also OK:
MyConnection <- file("path/file.txt")
df <- sqldf("select column 1 100 1000 235612 from MyConnection",
            file.format = list(header = FALSE, sep = " "))
You can use substr to specify the start position and length of the columns you want to read if they are fixed width:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. (Note that in the third row the leading zero of 09876 is gone: the single input column was read as numeric, so the value became 9876 before substr was applied.) The "select" statement can easily be constructed with paste if you know the column starting positions and widths.
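For instance, a minimal sketch of that construction (the positions and widths here are made up for illustration):
starts <- c(1, 100, 1000)  # assumed starting positions of the wanted columns
widths <- c(5, 5, 5)       # assumed widths
cols <- sprintf("substr(V1, %d, %d) var%d", starts, widths, seq_along(starts))
sql <- paste("select", toString(cols), "from myfile")
sql
## [1] "select substr(V1, 1, 5) var1, substr(V1, 100, 5) var2, substr(V1, 1000, 5) var3 from myfile"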