# Create dummy data (rbind on character vectors gives a one-column matrix)
dataframe1 <- rbind("a", "b", "c")
# Create variable list
variablelist <- unique(dataframe1[,1])
# Loop through adding VARCHAR and commas
sql_var_list <- NULL
for (char in seq_along(variablelist)) {
  sql_var_list <- paste(sql_var_list, variablelist[char], " VARCHAR,", sep = "")
}
# Remove the final comma
sql_var_list <- substr(sql_var_list, 1, nchar(sql_var_list)-1)
# Create structure table SQL string
create_structure_table <- paste("CREATE TABLE (", sql_var_list, ")", sep = "")
In the above code I'm using the contents of a column in an R dataframe to build a structure table that will later be created in Redshift. The code works, but my method seems a bit untidy; as I'm new to R, can someone suggest a better approach?
You can just use a combination of paste and paste0, as suggested by David Arenburg in the comments:
paste0("CREATE TABLE (",
paste(dataframe1, collapse = " VARCHAR, "),
" VARCHAR)")
# [1] "CREATE TABLE (a VARCHAR, b VARCHAR, c VARCHAR)"
I have the following dataset (this is just a small part of it):
Right now each "product_id" corresponds to an "order_id".
I have to create a new column with the "product_id" for each "order_id_OK".
The majority of elements of "order_id_OK" are also in "order_id", but in a different order.
So the objective is to have a column where each "product_id" corresponds to the row of "order_id_OK" rather than of "order_id".
Right now I'm trying to set up a for loop:
l = []
for i in df["order_id_OK"]:
    for j in df["order_id"]:
        if i == j:
            for x in df["product_id"]:
                l.append(x)
Any idea?
You can merge your dataframe with itself; the output will be a dataframe where data['order_id'][j] == data['order_id_OK'][i] (i and j have the same meaning as in your for loops).
merged_data=data.merge(data, left_on=['order_id'], right_on=['order_id_OK'], how='inner')
In the merged data you will find the new columns 'order_id_OK_y' and 'product_id_x', which correspond to your desired output.
I am attempting to create a small, training database for a package that I am writing. I am using the following code to create the database:
library(tidyverse)
library(DBI)
dat <- data.frame(name = rep("Clyde", 100),
                  DOB = sample(x = seq(as.POSIXct('1970/01/01'),
                                       as.POSIXct('1995/01/01'), by = "day"),
                               size = 100, replace = TRUE))
# Example using schemas with SQLite
train_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
## create tables in primary db
copy_to(dest = train_con, df = dat, name = "client_list", temporary = FALSE)
The above portion works fine. However, when I attempt to pull data from the database, I see that all dates have been converted to numeric.
train_con %>% tbl("client_list")
Can anybody tell me how to fix this? Thanks!
SQLite does not have a datetime type. In the absence of such a type, POSIXct objects are sent to the database as seconds since the UNIX epoch, and SQLite does not know that they are intended to represent date-times.
Either convert such columns yourself after you read them back into R, or else use a different database; nearly every other database supports a datetime type.
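For example, a minimal sketch of the convert-after-reading approach, using the client_list table from the question (assuming, as described above, that the stored values are seconds since the UNIX epoch):

library(dplyr)

client_list <- train_con %>%
  tbl("client_list") %>%
  collect()  # bring the rows into R as a local tibble

# DOB comes back as numeric seconds since 1970-01-01; convert it back
client_list$DOB <- as.POSIXct(client_list$DOB, origin = "1970-01-01", tz = "UTC")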
Whenever I use read.csv.sql I cannot select from the first column, and any output from the code places unusual characters (Ã..) at the beginning of the first column's name.
So suppose I create a df.csv file in Excel that looks something like this:
df = data.frame(a = 1,
                b = 2,
                c = 3,
                d = 4)
Then if I use sqldf to query the csv which is in my working directory I get the following error:
> read.csv.sql("df.csv", sql = "select * from file where a == 1")
Error in result_create(conn@ptr, statement) : no such column: a
If I query a different column than the first, I get a result, but the unusual characters appear in the output, as seen below:
df <- read.csv.sql("df.csv", sql = "select * from file where b == 2")
View(df)
Any idea how to prevent these characters from being added to the first column name?
The problem is presumably that you have a file larger than R can handle, so you only want to read a subset of its rows into R, but the condition you would filter by refers to the first column, whose name is messed up, so you can't use it.
Here are two alternative approaches. The first involves a bit more code but has the advantage that it is 100% R. The second is only one statement and also uses R, but additionally makes use of an external utility.
1) Skip header. Read the file in, skipping over the header. That will cause the columns to be labelled V1, V2, etc.; use V1 in the condition.
# write out a test file - BOD is a data frame that comes with R
write.csv(BOD, "BOD.csv", row.names = FALSE, quote = FALSE)
# read file skipping over header
DF <- read.csv.sql("BOD.csv", "select * from file where V1 < 3",
skip = 1, header = FALSE)
# read in header, assign it to DF and fix first column
hdr <- read.csv.sql("BOD.csv", "select * from file limit 0")
names(DF) <- names(hdr)
names(DF)[1] <- "TIME" # suppose we want TIME instead of Time
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
2) Filter. Another way to proceed is to use the filter= argument. Here we assume we know that the column name ends in ime but that there are unknown characters before that. This assumes that sed is available and on your path; if you are on Windows, install Rtools to get sed. The quoting might need to be changed depending on your shell.
When trying this on Windows I noticed that sed from Rtools changed the line endings, so below we specify eol= to ensure correct processing. You may not need that.
DF <- read.csv.sql("BOD.csv", "select * from file where TIME < 3",
filter = 'sed -e "1s/.*ime,/TIME,/"' , eol = "\n")
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
So I figured it out by reading through the above comments.
I'm on a Windows 10 machine using Excel for Office 365. The special characters go away if I change how I save the file, from "CSV UTF-8 (Comma Delimited)" to plain "CSV (Comma delimited)".
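For reference, those stray characters are the UTF-8 byte-order mark (BOM) that Excel writes at the start of "CSV UTF-8" files. A quick way to confirm this is to inspect the first bytes of the file:

# the UTF-8 BOM is the three bytes ef bb bf
readBin("df.csv", what = "raw", n = 3)
## [1] ef bb bf

Plain read.csv can strip it via fileEncoding = "UTF-8-BOM", but read.csv.sql loads the file directly into SQLite and does not take that argument, so the filter= approach above (or re-saving the file without the BOM) is the practical fix there.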
I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
  create table x as
  select a.*,
         b.*,
         case when a.first_column = b.column_first and
                   a.second_column = b.column_second
              then 1 else 0 end as matched_flag
  from table1 as a
  left join table2 as b
    on a.first_column = b.column_first
   and a.second_column = b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and give them partially overlapping row and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i')
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical vector the same length as its first argument (in this case, rownames(mat.a)) indicating whether the ith element of the first argument was found anywhere among the elements of the second argument. This means you have to be sensitive to how you order the arguments when you use %in% for indexing.
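For example, a quick sketch of that indexing, using the matrices defined above:

mat.a[row.match, , drop = FALSE]  # rows of mat.a whose names also appear in mat.b
mat.b[row.match, , drop = FALSE]  # the rows of mat.b with those same names
# drop = FALSE keeps the result a matrix even when only one row matches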
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, you do not need to be sensitive to how you order the arguments, because the number of row names of mat.a found in the row names of mat.b equals the number of row names of mat.b found in the row names of mat.a; that is, this usage of %in% is commutative.
I hope this helps!
You will want to use dataframe objects. These are like datasets in SAS. You can use cbind to put two dataframe objects together side by side. Then you can select rows based on conditions and set the flag accordingly. In the code below I did this twice: once to set the flag to 1 and once to set it to 0.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(a.first_column = c(1, 2, 3), a.second_column = c(4, 5, 6))
table2 <- data.frame(b.first_column = c(1, 3, 6), b.second_column = c(4, 5, 9))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when the first columns match, 0 otherwise)
x$matched_flag[x$a.first_column == x$b.first_column] <- 1
x$matched_flag[x$a.first_column != x$b.first_column] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.
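For instance, here is a sketch of a direct sqldf translation of the SAS step, using small hypothetical frames with the column names from the question:

library(sqldf)

# hypothetical example data with the column names from the SAS step
table1 <- data.frame(first_column = c(1, 2, 3), second_column = c(4, 5, 6))
table2 <- data.frame(column_first = c(1, 3, 6), column_second = c(4, 5, 9))

x <- sqldf("select a.*, b.*,
                   case when a.first_column = b.column_first and
                             a.second_column = b.column_second
                        then 1 else 0 end as matched_flag
            from table1 as a
            left join table2 as b
              on a.first_column = b.column_first
             and a.second_column = b.column_second")

Rows of table1 with no match get NULLs for the table2 columns and a matched_flag of 0, mirroring the SAS left join.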
I have a list of IDs in an R vector.
IDlist <- c(23, 232, 434, 35445)
I would like to write an RODBC sqlQuery with a clause stating something like
WHERE idname IN IDlist
Do I have to read the whole table and then merge it with the IDlist vector within R? Or how can I provide these values to the RODBC statement so that it recovers only the records I'm interested in?
Note: As the list is quite long, pasting individual values into the SQL statement, as in the answer below, won't do it.
You could always construct the statement using paste
IDlist <- c(23, 232, 434, 35445)
paste("WHERE idname IN (", paste(IDlist, collapse = ", "), ")")
#[1] "WHERE idname IN ( 23, 232, 434, 35445 )"
Clearly you would need to add more to this to construct your exact statement.
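Putting it together with RODBC, a minimal sketch (the DSN "mydsn" and the table name mytable are placeholders for your own connection details):

library(RODBC)

con <- odbcConnect("mydsn")  # placeholder data source name
query <- paste0("SELECT * FROM mytable WHERE idname IN (",
                paste(IDlist, collapse = ", "), ")")
result <- sqlQuery(con, query)
odbcClose(con)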
I put together a solution to a similar problem by combining the tips here and here and running in batches. Approximate code follows (retyped from an isolated machine):
#assuming you have a list of IDs you want to match in vIDs and an RODBC connection in mycon
#queries that don't change
q_create_tmp <- "create table #tmptbl (ID int)"
q_get_records <- "select * from mastertbl as X join #tmptbl as Y on (X.ID = Y.ID)"
q_del_tmp <- "drop table #tmptbl"
#initialize counters and storage
start_row <- 1
batch_size <- 1000
allresults <- data.frame()
while (start_row <= length(vIDs)) {
  end_row <- min(length(vIDs), start_row + batch_size - 1)
  # build "insert into #tmptbl (ID) values (..),(..)" for this batch
  q_fill_tmp <- sprintf("insert into #tmptbl (ID) values %s",
                        paste(sprintf("(%d)", vIDs[start_row:end_row]), collapse = ","))
  q_all <- list(q_create_tmp, q_fill_tmp, q_get_records, q_del_tmp)
  sqlOutput <- lapply(q_all, function(x) sqlQuery(mycon, x))
  allresults <- rbind(allresults, sqlOutput[[3]])
  start_row <- end_row + 1
}