R: How to replace single columns of a table in database - sql

Does anyone know a workaround for replacing a single column of a table in an MS SQL Server database?
The point is that, due to memory restrictions, I want to avoid pulling the entire table into my R session. Instead of dbWriteTable(), can one (over)write only a single column?
An appropriate query that could be sent to the database via dbSendQuery() and that takes the new column to be transferred into account would also help.
To make things clearer, here is some code describing what I am looking for:
# pull "someColumn" and "identifier" from "databaseTable"
dplyr::tbl(odbcConnection, dbplyr::in_schema("schemaName", "databaseTable")) %>%
  dplyr::select(identifier, someColumn) %>%
  dplyr::collect() -> data2beUpdated
# update "someColumn" according to a "lookupTable" in the R session that also has the "identifier"
data2beUpdated$newColumn <- lookupTable[match(data2beUpdated$identifier, lookupTable$identifier), "newValues"]
apply(data2beUpdated,
      1,
      FUN = function(x) { ifelse(
        is.na(x["newColumn"]),
        x["someColumn"],
        x["newColumn"]
      ) # ifelse
      } # fun
) -> data2beUpdated$someColumn
# now write back the data in the way I DON'T want to do it,
# since pulling the entire table costs runtime
dplyr::tbl(odbcConnection, dbplyr::in_schema("schemaName", "databaseTable")) %>%
  dplyr::collect() -> df
df$someColumn <- data2beUpdated$someColumn
DBI::dbWriteTable(odbcConnection, "databaseTable", df, overwrite = TRUE)
Any suggestions?!
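One possible workaround, sketched below (my own illustration, not something from the post): write only the identifier and the updated values to a small staging table, then let the server update the target column in place with a T-SQL UPDATE ... JOIN, so the full table never has to be pulled into R. The staging table name "columnUpdateStaging" is made up.
# write only the key and the new values to a (hypothetical) staging table
DBI::dbWriteTable(odbcConnection,
                  DBI::Id(schema = "schemaName", table = "columnUpdateStaging"),
                  data2beUpdated[, c("identifier", "someColumn")],
                  overwrite = TRUE)
# update the target column in place on the server
DBI::dbExecute(odbcConnection, "
  UPDATE t
  SET t.someColumn = s.someColumn
  FROM schemaName.databaseTable AS t
  INNER JOIN schemaName.columnUpdateStaging AS s
    ON t.identifier = s.identifier
")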

Related

R: Summary of SQL Tables

I am working with the R programming language.
Normally, when I want to get the summary of a table, I can use something like the "str()" function or the "summary()" function:
str(my_table)
summary(my_table)
However, now I am trying to do this with tables on a server.
For instance, I am trying to get the summaries of variable types for a specific table (e.g. "my_table") on a server. I found a very indirect way to do this:
#load libraries
library(odbc)
library(RODBC)
library(DBI)
#establish a connection and name it as "dbhandle"
rs <- dbSendQuery(dbhandle, 'select * from my_table limit 1')
dbColumnInfo(rs)
My Question: Is there a more "direct" way to do this? For example, can I get information about each column (e.g. whether the column is integer, character, date, etc.) in a table without first sending the query and then requesting the information? Can I do this directly?
Thanks!
You could try using fetch() from "RMySQL" to turn the result of your SQL query into an R object (e.g. a data frame):
library(RMySQL)
rs <- dbSendQuery(dbhandle, 'select * from my_table limit 1')
# Get the results from MySQL into R
my_table = fetch(rs, n=-1)
# clear result
dbClearResult(rs)
rm(rs)
Then use the functions you describe.
str(my_table)
summary(my_table)
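A slightly more direct variant (my own suggestion, not part of the answer above) is DBI::dbGetQuery(), which sends the query, fetches the result and clears it in a single call:
library(DBI)
# one call instead of dbSendQuery() + fetch() + dbClearResult()
my_table <- dbGetQuery(dbhandle, 'select * from my_table limit 1')
str(my_table)
summary(my_table)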

Create SQL table from parquet files

I am using R to handle large datasets (the largest data frame is about 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and store them in a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow=10))
# Save as parquet file
parquet_file <- tempfile(fileext = ".parquet")
write_parquet(test, parquet_file)
# Load main table (lazily, without pulling the data into memory)
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = parquet_file, memory = FALSE, overwrite = TRUE)
# Save into SQL table
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "schema", table = "table"),
                  value = test)
Is it possible to create a SQL table without loading the parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow=10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)
# Upload table using bulk insert (paste0 avoids stray spaces inside the quoted path)
dbExecute(connection,
          paste0("
            BULK INSERT [database].[schema].[table]
            FROM '", gsub('\\\\', '/', f), "' WITH (FORMAT = 'PARQUET');
          ")
)
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify snappy compression in the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
Note that the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you are likely looking for OPENROWSET combined with CREATE TABLE AS [select statement].
# column_defs is a named list/vector giving the SQL type for each column (defined below)
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
          paste0("CREATE TABLE MySqlTable
                  AS
                  SELECT *
                  FROM OPENROWSET(
                    BULK '", f, "', FORMAT = 'PARQUET'
                  ) WITH (
                    ", column_definition, "
                  );
                 ")
)
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to SQL data types is available on the T-SQL documentation page (note that two very necessary translations, Date and POSIXlt, are not present). Once again a disclaimer: my time in T-SQL did not get as far as BULK INSERT or similar.
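For illustration, with the toy data frame above (two numeric columns X1 and X2), column_defs could look like this; the SQL types are my own assumption:
# named vector: R column name -> SQL data type
column_defs <- c(X1 = "FLOAT", X2 = "FLOAT")
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
column_definition
#> [1] "X1 FLOAT, X2 FLOAT"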

Can you name dbplyr's simulated lazy tables?

dbplyr has some very useful-looking simulation functions so you can write queries while not connected to any real database, but I can't seem to get actual table names into any of the queries I write that way. All their names are just `df`, and I can't seem to modify them afterward. In fact, I don't see `df` anywhere in the query object or its attributes (it doesn't have any), so now I have no idea how dbplyr processes table names at all.
MWE:
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(purrr, warn.conflicts = FALSE)
query <- tbl_lazy(df = mtcars)
query %$% names(ops)
#> [1] "x" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
# The actual data frame is stored in the object under the name `x`, but
# renaming it has no effect, unsurprisingly (since it wasn't named `df`
# anyway)
query %<>% modify_at("ops", set_names, "mtcars", "vars")
query %$% names(ops)
#> [1] "mtcars" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`
My use case, by the way, is that I need to run SQL queries in another system with actual server access, so I'd like to have R scripts that produce SQL syntax that's ready to run in that system, even though R can't connect to it. Making an empty dummy database with the structure of the real thing (table & column names, column types, but no rows) is an option, but, obviously, it'd be simpler to just use these free-form simulations, iff the SQL can be generated ready to cut and paste. (lazy_frame() looked more appropriate for such non-existent tables, but, guess what, it's really just a wrapper for tbl_lazy(tibble()), so, same exact `df` name problem.)
Created on 2019-12-12 by the reprex package (v0.3.0)
I am not aware of any way to rename the simulated tables. According to the documentation, the important point of the simulate_* functions is to test database translation without actually connecting to a database.
When connected to a remote table, dbplyr uses the database, schema, and table name defined using tbl(). It also fetches the column names. Because of this, I would recommend developing in an environment where R can connect to the database. Consider the following:
# simulated
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
df_sim %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM `df`
# actual
df = tbl(database_connection_object, 'db_table_name')
df %>% head(5) %>% show_query()
# output
<SQL>
SELECT TOP(5) col1, col2, col3
FROM "database"."db_table_name"
Not only does df get replaced by the table name, but the * in the simulated query is replaced by column names in the second query.
One option you might consider if it is important to generate SQL scripts via simulation is converting to text, replacing, and converting back. For example:
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)
# write query out to file
writeLines(query, "file.sql")
# OR create a remote connection
remote_table = tbl(db_connection, sql(query))
remote_table %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM [db].[schema].[table]
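If you prefer not to rely on as.character(), dbplyr also provides remote_query(), which returns the rendered SQL of a lazy table. This is my own suggestion, so check it against your dbplyr version:
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% dbplyr::remote_query() %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)
writeLines(query, "file.sql")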

How to pass data.frame into SQL "IN" condition using R?

I am reading a list of values from a CSV file in R and trying to pass the values into the IN condition of a SQL query (dbGetQuery). Can someone help me out with this?
library(rJava)
library(RJDBC)
library(dbplyr)
library(tibble)
library(DBI)
library(RODBC)
library(data.table)
jdbcDriver <- JDBC("oracle.jdbc.OracleDriver",classPath="C://Users/********/Oracle_JDBC/ojdbc6.jar")
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:Rahul#//Host/DB", "User_name", "Password")
## Setting working directory for the data
setwd("C:\\Users\\**********\\Desktop")
## reading csv file into data frame
pii<-read.csv("sample.csv")
pii
PII_ID
S0094-5765(17)31236-5
S0094-5765(17)31420-0
S0094-5765(17)31508-4
S0094-5765(17)31522-9
S0094-5765(17)30772-5
S0094-5765(17)30773-7
PII_ID1<-dbplyr::build_sql(pii$PII_ID)
PII_ID1
<SQL> ('S0094-5765(17)31236-5', 'S0094-5765(17)31420-0', 'S0094-5765(17)31508-4', 'S0094-5765(17)31522-9', 'S0094-5765(17)30772-5', 'S0094-5765(17)30773-7')
Data<-dbGetQuery(jdbcConnection, "SELECT ARTICLE_ID FROM JRBI_OWNER.JRBI_ARTICLE_DIM WHERE PII_ID in ?",(PII_ID1))
Expected:
ARTICLE_ID
12345
23456
12356
14567
13456
Actual result:
[1] ARTICLE_ID
<0 rows> (or 0-length row.names)
The SQL code you pass in the second argument to dbGetQuery is just a text string, hence you can construct this using paste or equivalents.
You are after something like the following:
in_clause <- paste0("('", paste0(pii$PII_ID, collapse = "', '"), "')")
sql_text <- paste0("SELECT ARTICLE_ID
FROM JRBI_OWNER.JRBI_ARTICLE_DIM
WHERE PII_ID IN ", in_clause)
data <- dbGetQuery(jdbcConnection, sql_text)
However, the exact form of the first paste0 depends on the format of PII_ID (I have assumed it is text) and how this format is represented in sql (I have assumed single quotes).
Be sure to check sql_text is valid SQL before passing it to dbGetQuery.
IMPORTANT: This approach is only suitable when pii contains a small number of values (I recommend fewer than 10). If pii contains a large number of values your query will be very large and will run very slowly. If you have many values in pii then a better approach would be a join or semi-join as per #nicola's comment.
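For completeness, a minimal sketch of that join approach (my own illustration; the staging table name PII_LOOKUP is made up, and I assume your connection supports dbWriteTable):
# upload the small set of IDs to a staging table, then join on the server
DBI::dbWriteTable(jdbcConnection, "PII_LOOKUP", pii, overwrite = TRUE)
data <- DBI::dbGetQuery(jdbcConnection, "
  SELECT a.ARTICLE_ID
  FROM JRBI_OWNER.JRBI_ARTICLE_DIM a
  INNER JOIN PII_LOOKUP p ON a.PII_ID = p.PII_ID
")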

update an SQL table via R sqlSave

I have a data frame in R with 3 columns; using sqlSave I can easily create a table in an SQL database:
channel <- odbcConnect("JWPMICOMP")
sqlSave(channel, dbdata, tablename = "ManagerNav", rownames = FALSE, append = TRUE, varTypes = c(DateNav = "datetime"))
odbcClose(channel)
This data frame contains information about Managers (Name, Nav and Date) and is updated every day with new values for the current date; old values might also need to be updated in case of errors.
How can I accomplish this task in R?
I tried to use sqlUpdate but it returns the following error:
> sqlUpdate(channel, dbdata, tablename = "ManagerNav")
Error in sqlUpdate(channel, dbdata, tablename = "ManagerNav") :
cannot update ‘ManagerNav’ without unique column
When you create a table "the white shark-way" (see documentation), it does not get a primary index; it is just plain columns, often of the wrong type. Usually I use your approach to get the column names right, but after that you should go into your database and assign a primary index as well as correct column widths and types.
After that, sqlUpdate() might work; I say might, because I have given up on sqlUpdate() (there are too many caveats) and use sqlQuery(..., paste("UPDATE ...")) for the real work, as sketched below.
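A minimal sketch of what that can look like (my own illustration, not part of the original answer; the key columns Name and DateNav are assumptions based on the question):
# one-time: give the table a primary key so single rows can be addressed
sqlQuery(channel, "ALTER TABLE ManagerNav ADD CONSTRAINT PK_ManagerNav PRIMARY KEY (Name, DateNav);")
# then update row by row with plain SQL
for (i in seq_len(nrow(dbdata))) {
  sqlQuery(channel, paste0(
    "UPDATE ManagerNav SET Nav = ", dbdata$Nav[i],
    " WHERE Name = '", dbdata$Name[i], "'",
    " AND DateNav = '", dbdata$DateNav[i], "'"
  ))
}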
What I would do for this is the following.
Solution 1
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = c("ManagerNav"))
Solution 2
Lcolumns <- names(dbdata)
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = Lcolumns)
The index argument specifies which column(s) uniquely identify the rows that R is going to update.
Hope this helps!
If none of the other solutions work and your data is not that big, I'd suggest using sqlQuery() and looping through your data frame.
one_row_of_your_df <- function(i) {
  sql_query <- paste0(
    "INSERT INTO your_table_name (column_name1, column_name2, column_name3) VALUES ",
    "(",
    "'", your_dataframe[i, 1], "', ",
    "'", your_dataframe[i, 2], "', ",
    "'", your_dataframe[i, 3], "'",
    ")"
  )
  return(sql_query)
}
This function is Exasol-specific; it is pretty similar to MySQL, but not identical, so small changes could be necessary.
Then use a simple for loop like this one:
for (i in 1:nrow(your_dataframe)) {
  sqlQuery(your_connection, one_row_of_your_df(i))
}
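One caveat with building INSERT statements by pasting values (my own note, not part of the answer): any single quote inside a value must be doubled, otherwise the generated SQL is invalid. For example:
# double embedded single quotes before pasting values into SQL string literals
escape_sql <- function(x) gsub("'", "''", x, fixed = TRUE)
# e.g. use escape_sql(your_dataframe[i, 1]) inside one_row_of_your_df()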