I have this data set:
var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)
sample_data = data.frame(var_1, var_2, var_3)
I broke this data set into groups of 100 - thus creating 10 mini datasets:
list_of_dfs <- split(
sample_data, (seq(nrow(sample_data))-1) %/% 100
)
table_names <- paste0("sample_", 1:10)
I know want to upload these 10 mini datasets onto an SQL server :
library(odbc)
library(DBI)
library(RODBC)
library(purrr)
#establish connection
map2(
table_names,
list_of_dfs,
function(x,y) dbWriteTable(connection, x, y)
)
The problem is, that one of these mini datasets (e.g. sample_6) is not being accepted by the SQL server and gives this error:
Error in result_insert_dataframe(rs#prt, values): nanodbc/nanodbc.cpp:1587 : HY008 : Operation canceled
This means that "sample_1", "sample_2", "sample_3", "sample_4", "sample_5" were all successfully uploaded - but since "sample_6" was rejected, "sample_7", "sample_8", "sample_9" and "sample_10".
Is there a way to "override" this error and ensure that if one of these "sample_i" data sets are rejected, the computer will skip this data set and attempt to upload the remaining datasets?
If I were to do this manually, I could just "force" R to skip over the problem data set. For example, imagine if "sample_2" was causing the problem:
dbWriteTable(my_connection, SQL("sample_1"), sample_1)
dbWriteTable(my_connection, SQL("sample_2"), sample_2)
Error in result_insert_dataframe(rs#prt, values): nanodbc/nanodbc.cpp:1587 : HY008 : Operation canceled
dbWriteTable(my_connection, SQL("sample_3"), sample_3)
In the above code, "sample_1" and "sample_3" are successfully uploaded even though "sample_2" was causing a problem.
Is it possible to override these errors when "bulk-uploading" the datasets?
Thank you!
Related
So in a database Y I have a table X with more than 400 million observations. Then I have a KEY.csv file with IDs, that I want to use for filtering the data (small data set, ca. 50k unique IDs). If I had unlimited memory, I would do something like this:
require(RODBC)
require(dplyr)
db <- odbcConnect('Y',uid = "123",pwd = '123')
df <- sqlQuery(db,'SELECT * from X')
close(db)
keys <- read.csv('KEY.csv')
df_final <- df %>% filter(ID %in% KEY$ID)
My issue is, that I don't have the rights to upload the KEY.csv file to the database Y and do the filtering there. Would it be somehow possible to do the filtering in the query, while referencing the file loaded in R memory? And then write this filtered X table directly to a database I have access? I think even after filtering it R might not be able to keep it in the memory.
I could also try to do this in Python, however don't have much experience in that language.
I dont know how many keys you have but maybe you can try to use the build_sql() function to use the keys inside the query.
I dont use RODBC, I think you should use odbc and DBI (https://db.rstudio.com).
library(dbplyr) # dbplyr not dplyr
library(DBI)
library(odbc)
# Get keys first
keys = read.csv('KEY.csv')
db = dbConnect('Y',uid = "123",pwd = '123') # the name of function changes in odbc
# write your query (dbplyr)
sql_query = build_sql('SELECT * from X
where X.key IN ', keys, con = db)
df = dbGetQuery(db,sql_query) # the name of function changes in odbc
Whenever I use read.csv.sql I cannot select from the first column with and any output from the code places an unusual character (A(tilde)-..) at the begging of the first column's name.
So suppose I create a df.csv file in in Excel that looks something like this
df = data.frame(
a = 1,
b = 2,
c = 3,
d = 4)
Then if I use sqldf to query the csv which is in my working directory I get the following error:
> read.csv.sql("df.csv", sql = "select * from file where a == 1")
Error in result_create(conn#ptr, statement) : no such column: a
If I query a different column than the first, I get a result but with the output of the unusual characters as seen below
df <- read.csv.sql("df.csv", sql = "select * from file where b == 2")
View(df)
Any idea how to prevent these characters from being added to the first column name?
The problem is presumably that you have a file that is larger than R can handle and so only want to read a subset of rows into R and specifying the condition to filter it by involves referring to the first column whose name is messed up so you can't use it.
Here are two alternative approaches. The first one involves a bit more code but has the advantage that it is 100% R. The second one is only one statement and also uses R but additionally makes use an of an external utility.
1) skip header Read the file in skipping over the header. That will cause the columns to be labelled V1, V2, etc. and use V1 in the condition.
# write out a test file - BOD is a data frame that comes with R
write.csv(BOD, "BOD.csv", row.names = FALSE, quote = FALSE)
# read file skipping over header
DF <- read.csv.sql("BOD.csv", "select * from file where V1 < 3",
skip = 1, header = FALSE)
# read in header, assign it to DF and fix first column
hdr <- read.csv.sql("BOD.csv", "select * from file limit 0")
names(DF) <- names(hdr)
names(DF)[1] <- "TIME" # suppose we want TIME instead of Time
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
2) filter Another way to proceed is to use the filter= argument. Here we assume we know that the end of the column name is ime but there are other characters prior to that that we don't know. This assumes that sed is available and on your path. If you are on Windows install Rtools to get sed. The quoting might need to be changed depending on your shell.
When trying this on Windows I noticed that sed from Rtools changed the line endings so below we specified eol= to ensure correct processing. You may not need that.
DF <- read.csv.sql("BOD.csv", "select * from file where TIME < 3",
filter = 'sed -e "1s/.*ime,/TIME,/"' , eol = "\n")
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
So I figured it out by reading through the above comments.
I'm on a Windows 10 machine using Excel for Office 365. The special characters will go away by changing how I saved the file from a "CSV UTF-8 (Comma Delimited)" to just "CSV (Comma delimited)".
I am using read.csv.sql to conditionally read in data (my data set is extremely large so this was the solution I chose to filter it and reduce it in size prior to reading the data in). I was running into memory issues by reading in the full data and then filtering it so that is why it is important that I use the conditional read so that the subset is read in, versus the full data set.
Here is a small data set so my problem can be reproduced:
write.csv(iris, "iris.csv", row.names = F)
library(sqldf)
csvFile <- "iris.csv"
I am finding that the notation you have to use is extremely awkward using read.csv.sql the following is the how I am reading in the file:
# Step 1 (Assume these values are coming from UI)
spec <- 'setosa'
petwd <- 0.2
# Add quotes and make comma-separated:
spec <- toString(sprintf("'%s'", spec))
petwd <- toString(sprintf("'%s'", petwd))
# Step 2 - Conditionally read in the data, store in 'd'
d <- fn$read.csv.sql(csvFile, sql='select * from file where
"Species" in ($spec)'
and "Petal.Width" in ($petwd)',
filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
My main problem is that if any of the values above (from UI) are null then it won't read in the data properly, because this chunk of code is all hard coded.
I would like to change this into: Step 1 - check which values are null and do not filter off of them, then filter using read.csv.sql for all non-null values on corresponding columns.
Note: I am reusing the code from this similar question within this question.
UPDATE
I want to clear up what I am asking. This is what I am trying to do:
If a field, say spec comes through as NA (meaning the user did not pick input) then I want it to filter as such (default to spec == EVERY SPEC):
# Step 2 - Conditionally read in the data, store in 'd'
d <- fn$read.csv.sql(csvFile, sql='select * from file where
"Petal.Width" in ($petwd)',
filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
Since spec is NA, if you try to filter/read in a file matching spec == NA it will read in an empty data set since there are no NA values in my data, hence breaking the code and program. Hope this clears it up more.
There are several problems:
some of the simplifications provided in the link in the question were not followed.
spec is a scalar so one can just use '$spec'
petwd is a numeric scalar and SQL does not require quotes around numbers so just use $petwd
the question states you want to handle empty fields but not how so we have used csvfix to map them to -1 and also strip off quotes. (Alternately let them enter and do it in R. Empty numerics will come through as 0 and empty character fields will come through as zero length character fields.)
you can use [...] in place of "..." in SQL
The code below worked for me in both Windows and Ubuntu Linux with the bash shell.
library(sqldf)
spec <- 'setosa'
petwd <- 0.2
d <- fn$read.csv.sql(
"iris.csv",
sql = "select * from file where [Species] = '$spec' and [Petal.Width] = $petwd",
verbose = TRUE,
filter = 'csvfix map -smq -fv "" -tv -1'
)
Update
Regarding the update at the end of the question it was clarified that the NA could be in spec as opposed to being in the data being read in and that if spec is NA then the condition involving spec should be regarded as TRUE. In that case just expand the SQL where condition to handle that as follows.
spec <- NA
petwd <- 0.2
d <- fn$read.csv.sql(
"iris.csv",
sql = "select * from file
where ('$spec' == 'NA' or [Species] = '$spec') and [Petal.Width] = $petwd",
verbose = TRUE,
filter = 'csvfix echo -smq'
)
The above will return all rows for which Petal.Width is 0.2 .
I load several databasefiles (SQLite) and subject them to a simple query:
library("RSQLite")
drv <- dbDriver ("SQLite")
get.wa2 <- function(file){
con <- dbConnect (drv, dbname = file)
table <- dbGetQuery (con, "Select data3 from data where data2 like 'xxx' ")
return(table)
}
database.files<- dir(database.path)
database.files <- database.files[grep(".db$",database.files, perl = T)] ### select only database files
count.wa <- sapply(database.files,get.wa2)
I run into problems since my files are randomly corrupted, or wiped.. appearing as 0 byte in filesize.
Am I doing something wrong and should I be closing connections after each query. What is best practice here?
Try an additional logical qualifier vector:
database.files<- dir(database.path)
database.files <- database.files[grep(".db$",database.files, perl = T) &
file.info(database.files)[,"size"] > 0 ]
?file.info
If the error occurs as a result of processing, then you need to look at ?try and the various error handling capacities that R provides.
Iam using fdrtool for my pvalues but i have an error which is :
Error in if (max(x) > 1 | min(x) < 0) stop("input p-values must all be in the range 0 to 1!") : missing value where TRUE/FALSE needed
The p value are not less than 0,greater than 1.
The range of p value are [1,0]. the code is :
n=40000
pval1<-vector(length=n)
pval1[1:n]= pv1list[["Pvalue"]]
fdr<-fdrtool(pval1,statistic="pvalue")
I ran your code without problem (although I can't reproduce it because I don't have the object "pvlist").
Since you're having a missing value error, my guess is that you're having problems reading the csv file into R. I recommend the "read.table" function since from my experience it usually reads in data from a csv file without errors:
pvlist<- read.table("c:/pvslit.csv", header=TRUE,
sep=",", row.names="id")
And now you want to check the number of rows and missingness:
nrow(pvlist) # is this what you expect?
nrow(na.omit(pvlist)) # how many non-missing rows are there?
Additionally you want to make sure that your "p-value" column is not a character or factor:
str(pvlist) # examining the structure of the dataframe
pvlist[,2] <- as.numeric(pvlist[,2]) # assuming the 2nd column is the pvalue
In short, you most likely have a problem with reading in the data or the class of the data in the dataframe.