update an SQL table via R sqlSave - sql

I have a data frame in R with 3 columns. Using sqlSave I can easily create a table in an SQL database:
channel <- odbcConnect("JWPMICOMP")
sqlSave(channel, dbdata, tablename = "ManagerNav", rownames = FALSE, append = TRUE, varTypes = c(DateNav = "datetime"))
odbcClose(channel)
This data frame contains information about managers (Name, Nav and Date), which is updated every day with new values for the current date; old values may also need to be corrected in case of errors.
How can I accomplish this task in R?
I tried to use sqlUpdate but it returns the following error:
> sqlUpdate(channel, dbdata, tablename = "ManagerNav")
Error in sqlUpdate(channel, dbdata, tablename = "ManagerNav") :
cannot update ‘ManagerNav’ without unique column

When you create a table "the white shark-way" (see the documentation), it does not get a primary key/index; it is just plain columns, often of the wrong type. Usually I use your approach to get the column names right, but after that you should go into your database and assign a primary key and correct column widths and types.
After that, sqlUpdate() might work; I say might, because I have given up on sqlUpdate() — there are too many caveats — and use sqlQuery(..., paste("UPDATE ...")) for the real work.
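For illustration, a minimal sketch of that sqlQuery()-based approach, assuming (from the question) a ManagerNav table with Name, Nav and DateNav columns and that a row is identified by manager name and date; adjust the names and the WHERE clause to your actual schema:
library(RODBC)

channel <- odbcConnect("JWPMICOMP")
for (i in seq_len(nrow(dbdata))) {
  # build one UPDATE statement per row of the data frame
  sqlQuery(channel, paste0(
    "UPDATE ManagerNav SET Nav = ", dbdata$Nav[i],
    " WHERE Name = '", dbdata$Name[i], "'",
    " AND DateNav = '", dbdata$DateNav[i], "'"
  ))
}
odbcClose(channel)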

What I would do for this is the following:
Solution 1
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = c("ManagerNav"))
Solution 2
Lcolumns <- names(dbdata)  # the column names of your data frame
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = Lcolumns)
The index argument specifies which column(s) R uses to identify the rows to update (i.e. the key columns).
Hope this helps!
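For example, if rows in ManagerNav are uniquely identified by the manager's name and the date (hypothetical column names Name and DateNav, taken from the question), the call would look like:
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = c("Name", "DateNav"))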

If none of the other solutions work and your data is not that big, I'd suggest using sqlQuery() and looping through your data frame.
one_row_of_your_df <- function(i) {
  sql_query <-
    paste0("INSERT INTO your_table_name (column_name1, column_name2, column_name3) VALUES ",
           "(",
           "'", your_dataframe[i, 1], "', ",
           "'", your_dataframe[i, 2], "', ",
           "'", your_dataframe[i, 3], "'",
           ")")
  return(sql_query)
}
This function is Exasol-specific; its SQL is pretty similar to MySQL, but not identical, so small changes may be necessary.
Then use a simple for loop like this one:
for (i in 1:nrow(your_dataframe)) {
  sqlQuery(your_connection, one_row_of_your_df(i))
}
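One caveat worth noting (my addition, not part of the original answer): because the values are pasted into the SQL string, any single quote inside a character value will break the statement. A small hedged helper to escape them before building the queries could look like this:
escape_quotes <- function(x) gsub("'", "''", x)  # double up single quotes inside values
# apply only to character columns, keeping the data frame structure intact
your_dataframe[] <- lapply(your_dataframe, function(col) if (is.character(col)) escape_quotes(col) else col)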

Related

replace .withColumn with a df.select

I am doing a basic transformation on my PySpark dataframe, but here I am using multiple .withColumn statements.
from pyspark.sql import functions as F

def trim_and_lower_col(col_name):
    return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))

df = (
    source_df.withColumn("browser", trim_and_lower_col("browser"))
    .withColumn("browser_type", trim_and_lower_col("browser_type"))
    .withColumn("domains", trim_and_lower_col("domains"))
)
I read that creating multiple withColumn statements isn't very efficient and I should use df.select() instead.
I tried this:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
df = (
    source_df.select([trim_and_lower_col(col).alias(col) for col in cols_to_transform] + source_df.columns)
)
but it gives me a duplicate column error
What else can I try?
The duplicate column error comes because you pass each transformed column twice in that list: once as your newly transformed column (through .alias) and once as the original column (by name in source_df.columns). This solution lets you use a single select statement, preserves the column order and avoids the duplication issue:
df = (
    source_df.select([trim_and_lower_col(col).alias(col) if col in cols_to_transform else col for col in source_df.columns])
)
Chaining many .withColumn calls does pose a problem, as the unresolved query plan can get pretty large and cause a StackOverflowError on the Spark driver during query plan optimisation. One good explanation of this problem is shared here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
You are naming your new columns with .alias(col).
That means they have the same name as the column you use to create the new one.
During creation (using .withColumn) this does not pose a problem. But as soon as you try to select, Spark does not know which column to pick.
You could fix it for example by giving the new columns a suffix:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
df = (
    source_df.select([trim_and_lower_col(col).alias(f"{col}_new") for col in cols_to_transform] + source_df.columns)
)
Another solution, which does pollute the DAG though, would be:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
for col in cols_to_transform:
    source_df = source_df.withColumn(col, trim_and_lower_col(col))
If you only have these few withColumn calls, keep using them. It's still more readable, and thus more maintainable and self-explanatory.
If you look into it, you'll see that Spark's advice to be careful with withColumn really applies when you have a very large number of them, say around 200.
Using select also makes your code more error-prone, since it's more complex to read.
Now, if you have many columns, I would define:
the list of columns to transform,
the list of columns to keep,
and then do the select:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
cols_to_keep = [c for c in df.columns if c not in cols_to_transform]
cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]
source_df.select(*cols_to_keep, *cols_transformed)
This gives you the same set of columns as the withColumn approach (with the kept columns first, followed by the transformed ones).

R: Summary of SQL Tables

I am working with the R programming language.
Normally, when I want to get the summary of a table, I can use something like the "str()" function or the "summary()" function:
str(my_table)
summary(my_table)
However, now I am trying to do this with tables on a server.
For instance, I am trying to get the summaries of variable types for a specific table (e.g. "my_table") on a server. I found a very indirect way to do this:
# load libraries
library(odbc)
library(RODBC)
library(DBI)
# establish a connection and name it "dbhandle"
rs <- dbSendQuery(dbhandle, 'select * from my_table limit 1')
dbColumnInfo(rs)
My Question: Is there a more "direct" way to do this? For example, can I get information about each column (e.g. whether the column is integer, character, date, etc.) in a table without first sending the query and then requesting the information? Can I do this directly?
Thanks!
You could try using fetch() from "RMySQL" to turn your SQL query result into an R object (e.g. a data frame):
library(RMySQL)
rs <- dbSendQuery(dbhandle, 'select * from my_table limit 1')
# Get the results from MySQL into R
my_table = fetch(rs, n=-1)
# clear result
dbClearResult(rs)
rm(rs)
Then use the functions you describe.
str(my_table)
summary(my_table)
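A slightly more direct alternative (my suggestion, using the standard DBI interface rather than the answer above) is dbGetQuery(), which sends the query, fetches the result and clears it in a single call:
library(DBI)
# one call instead of dbSendQuery() + fetch() + dbClearResult()
my_table <- dbGetQuery(dbhandle, 'select * from my_table limit 1')
str(my_table)
summary(my_table)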

Writing dataframe to Postgres database psycopg2

I am trying to write a pandas DataFrame to a Postgres database.
Code is as below:
dbConnection = psycopg2.connect(user = "user1", password = "user1", host = "localhost", port = "5432", database = "postgres")
dbConnection.set_isolation_level(0)
dbCursor = dbConnection.cursor()
dbCursor.execute("DROP DATABASE IF EXISTS FiguresUSA")
dbCursor.execute("CREATE DATABASE FiguresUSA")
dbCursor.execute("DROP TABLE IF EXISTS FiguresUSAByState")
dbCursor.execute("CREATE TABLE FiguresUSAByState(Index integer PRIMARY KEY, Province_State VARCHAR(50), NumberByState integer)");
for i in data_pandas.index:
    query = """
    INSERT into FiguresUSAByState(column1, column2, column3) values('%s',%s,%i);
    """ % (data_pandas['Index'], data_pandas['Province_State'], data_pandas['NumberByState'])
    dbCursor.execute(query)
When I run this, I get an error which just says: "Index". I know the problem is somewhere in my for loop; is that % notation correct? I am new to Postgres and don't see how that could be correct syntax. I know I can use to_sql, but I am trying out different techniques.
Print out of data_pandas is as below:
One slight possible anomaly is that there is an "index" column in the IDE view. Could this be the problem?
If you use pd.DataFrame.to_sql, you can supply the index_label parameter to use that as a column.
data_pandas.to_sql('FiguresUSAByState', con=dbConnection, index_label='Index')
If you would prefer to stick with the custom SQL and for loop you have, you will need to reset_index first.
for row in data_pandas.reset_index().to_dict('records'):
    query = """
    INSERT into FiguresUSAByState(index, Province_State, NumberByState) values(%i, '%s', %i);
    """ % (row['index'], row['Province_State'], row['NumberByState'])
    dbCursor.execute(query)
Note that the default name for the new column is index, uncapitalized, rather than Index.
In the insert statement:
query = """
        INSERT into FiguresUSAByState (column1, column2, column3) values ('%s', %s, %i);
        """ % (data_pandas['Index'], data_pandas['Province_State'], data_pandas['NumberByState'])
You have a '%s'; I think that is the problem, so remove the quotes.
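As a side note (my suggestion, not part of the original answers): psycopg2 can do the value substitution for you via parameter binding, which avoids both the quoting problem and SQL injection. A minimal sketch assuming the same table and column names as above:
for row in data_pandas.reset_index().to_dict('records'):
    # %s placeholders are filled in by psycopg2, with proper quoting
    dbCursor.execute(
        "INSERT INTO FiguresUSAByState (Index, Province_State, NumberByState) VALUES (%s, %s, %s)",
        (int(row['index']), str(row['Province_State']), int(row['NumberByState']))
        # the int()/str() casts guard against numpy scalar types that psycopg2 may not adapt
    )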

Putting output from sql query into another query using R environment

I am wondering what approach I should take to perform the action in the title. I am using an ODBC connection, and what I get from the first SQL query is around 40-50 rows in one column. What I want is to use this output as the values to search for in a second query.
How should I treat this? Like an array or separate variables? I still do not know R well, so I just need to know where to look.
Regards
------more explanation below----
I have a list of 40-50 numbers, 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and after these operations the characters are organized like this:
'c(1234567890, , 1234567890, 1234567890)'
and it almost looks fine and fits into my query, apart from the additional c character at the beginning and the missing apostrophes. I tried to use the gsub function but it did not work the way I want.
You may likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
  "SELECT mpg, qsec, gear, wt
   FROM mtcars
   WHERE gear IN (SELECT n_gear
                  FROM Gear
                  WHERE n_gear IN (3,4))"
)
Try something like this:
list <- c("try", "this")  # The output from your first query
sql_in <- paste0(list, collapse = "','")
The Output
paste("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
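Since the question mentions an ODBC connection, the assembled string can then be passed to RODBC's sqlQuery(); a sketch with placeholder connection, table and column names:
query <- paste0("select * from table where table.var in ('", sql_in, "')")
result <- sqlQuery(channel, query)  # 'channel' is your ODBC connection object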
If you have a space as the first or last character of a string, you can use this code:
list <- c(" first element is a space", "try", "this", "last element is a space ") # The output from your first query
Find a space at the first or last character:
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think you are expecting something like the code shown below:
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"

What to do with the error "Ambiguous Columns Defined"?

This is my SQL command:
select INCOME_TYPE_ID,
REGION_CODE,
FIN_YEAR_CODE,
PORTION_AMOUNT
from INCOME.INCOME_TYPE,
COMMON.REGION,
INCOME.RECEIVE_DOC_PORTION,
INCOME.ASSESS_ORDER_ITEM,
ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC,
ACCOUNTING.VOUCHER_ITEM,
ACCOUNTING.VOUCHER,
ACCOUNTING.FIN_YEAR
where INCOME.RECEIVE_DOC_PORTION.ASSESS_ORDER_ITEM_ID = INCOME.ASSESS_ORDER_ITEM.ASSESS_ORDER_ITEM_ID
and INCOME.ASSESS_ORDER_ITEM.INCOME_TYPE_ID=INCOME.INCOME_TYPE.INCOME_TYPE_ID
and INCOME.RECEIVE_DOC_PORTION.RECEIVE_DOC_PORTION_ID = ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC.RECEIVE_DOC_PORTION_ID
and ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC.VOUCHER_ITEM_ID = ACCOUNTING.VOUCHER_ITEM.VOUCHER_ITEM_ID
and ACCOUNTING.VOUCHER_ITEM.VOUCHER_ID = ACCOUNTING.VOUCHER.VOUCHER_ID
and ACCOUNTING.VOUCHER.REGION_CODE = COMMON.REGION.REGION_CODE
and ACCOUNTING.VOUCHER.FIN_YEAR_CODE = ACCOUNTING.FIN_YEAR.FIN_YEAR_CODE
and I got this error:
Ambiguous Columns Defined
I'm Using SQL Developer as Oracle client.
Apparently one (or more) of the column names in your select list exists in more than one table in the FROM list.
You need to prefix every column in the SELECT list with the table it comes from (it's also good practice to always do that, regardless of whether they are ambiguous).
Mention the table name before every column in the select query.
An ambiguous column means that a column referenced in the SELECT statement exists, with the same name, in more than one of the tables being joined.
Try this instead, prefixing all selected columns with their fully qualified names (as you have done elsewhere in your query):
select INCOME.INCOME_TYPE.INCOME_TYPE_ID,
COMMON.REGION.REGION_CODE,
ACCOUNTING.FIN_YEAR.FIN_YEAR_CODE,
ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC.PORTION_AMOUNT
from INCOME.INCOME_TYPE,
COMMON.REGION,
INCOME.RECEIVE_DOC_PORTION,
INCOME.ASSESS_ORDER_ITEM,
ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC,
ACCOUNTING.VOUCHER_ITEM,
ACCOUNTING.VOUCHER,
ACCOUNTING.FIN_YEAR
where INCOME.RECEIVE_DOC_PORTION.ASSESS_ORDER_ITEM_ID = INCOME.ASSESS_ORDER_ITEM.ASSESS_ORDER_ITEM_ID
and INCOME.ASSESS_ORDER_ITEM.INCOME_TYPE_ID = INCOME.INCOME_TYPE.INCOME_TYPE_ID
and INCOME.RECEIVE_DOC_PORTION.RECEIVE_DOC_PORTION_ID = ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC.RECEIVE_DOC_PORTION_ID
and ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC.VOUCHER_ITEM_ID = ACCOUNTING.VOUCHER_ITEM.VOUCHER_ITEM_ID
and ACCOUNTING.VOUCHER_ITEM.VOUCHER_ID = ACCOUNTING.VOUCHER.VOUCHER_ID
and ACCOUNTING.VOUCHER.REGION_CODE = COMMON.REGION.REGION_CODE
and ACCOUNTING.VOUCHER.FIN_YEAR_CODE = ACCOUNTING.FIN_YEAR.FIN_YEAR_CODE
I had to guess the fully qualified name for
ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC.PORTION_AMOUNT
It might be
INCOME.RECEIVE_DOC_PORTION.PORTION_AMOUNT
But you should be able to resolve that easily.
Hope it helps...
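As a readability note (my suggestion, not part of the original answers): table aliases keep the fully qualified names short. A sketch of the same query using hypothetical aliases, with PORTION_AMOUNT placed in VOUCHER_ITEM_RECEIVE_DOC as guessed above:
select it.INCOME_TYPE_ID,
       r.REGION_CODE,
       fy.FIN_YEAR_CODE,
       vird.PORTION_AMOUNT
from INCOME.INCOME_TYPE it,
     COMMON.REGION r,
     INCOME.RECEIVE_DOC_PORTION rdp,
     INCOME.ASSESS_ORDER_ITEM aoi,
     ACCOUNTING.VOUCHER_ITEM_RECEIVE_DOC vird,
     ACCOUNTING.VOUCHER_ITEM vi,
     ACCOUNTING.VOUCHER v,
     ACCOUNTING.FIN_YEAR fy
where rdp.ASSESS_ORDER_ITEM_ID = aoi.ASSESS_ORDER_ITEM_ID
  and aoi.INCOME_TYPE_ID = it.INCOME_TYPE_ID
  and rdp.RECEIVE_DOC_PORTION_ID = vird.RECEIVE_DOC_PORTION_ID
  and vird.VOUCHER_ITEM_ID = vi.VOUCHER_ITEM_ID
  and vi.VOUCHER_ID = v.VOUCHER_ID
  and v.REGION_CODE = r.REGION_CODE
  and v.FIN_YEAR_CODE = fy.FIN_YEAR_CODE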