R: Summary of SQL Tables

I am working with the R programming language.
Normally, when I want to get the summary of a table, I can use something like the "str()" function or the "summary()" function:
str(my_table)
summary(my_table)
However, now I am trying to do this with tables on a server.
For instance, I am trying to get the summaries of variable types for a specific table (e.g. "my_table") on a server. I found a very indirect way to do this:
# load libraries
library(odbc)
library(RODBC)
library(DBI)
# establish a connection and call it "dbhandle"
rs <- dbSendQuery(dbhandle, 'select * from my_table limit 1')
dbColumnInfo(rs)
My Question: Is there a more "direct" way to do this? For example, can I get information about each column (e.g. whether the column is integer, character, date, etc.) in a table without first sending the query and then requesting the information? Can I do this directly?
Thanks!

You could try using fetch() from "RMySQL" to turn the result of your SQL query into an R object (e.g. a data frame):
library(RMySQL)
rs <- dbSendQuery(dbhandle, 'select * from my_table limit 1')
# Get the results from MySQL into R
my_table = fetch(rs, n=-1)
# clear result
dbClearResult(rs)
rm(rs)
Then use the functions you describe.
str(my_table)
summary(my_table)
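If what you want is the column types as R sees them, a more direct one-step sketch (assuming dbhandle is a DBI-compatible connection) is to pull a single row with dbGetQuery() and inspect it:
library(DBI)
one_row <- dbGetQuery(dbhandle, "select * from my_table limit 1") # one row is enough to see the types
str(one_row)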

Related

How to dynamically build a select list from an API payload using PyPika

I have a JSON API payload containing a tablename and a columnlist. How do I build a SELECT query from it using pypika?
So far I have been able to use a string columnlist, but not to do advanced querying using functions, analytics, etc.
from pypika import Table, Query, functions as fn

def generate_sql(tablename, collist):
    table = Table(tablename)
    columns = [str(table) + '.' + each for each in collist]
    q = Query.from_(table).select(*columns)
    return q.get_sql(quote_char=None)

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print(generate_sql(tablename, collist))  # 1

table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print(q.get_sql(quote_char=None))  # 2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query as a string yourself; that defeats the whole purpose of pypika.
In your case, where the table name and the columns arrive as text in a JSON object, you can use * to unpack the values from collist and the obj[key] syntax to look up a table attribute by name from a string:
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
It can get much more complicated from this point. If you are only sending column names that you know belong to that table, the above solution is enough.
But if you have complex SQL expressions referencing SQL functions or even other tables, I suggest you rethink the decision to send them as JSON. You might end up building something as complex as pypika itself, like a custom parser. In that case your better option would be to change the format of your JSON response object.
If you know you only need to support a very limited set of capabilities, it could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or alias
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']

def parsed(cols):
    pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
    subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
    yield from (re.sub(pattern, subst, col) for col in cols)

table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print(q.get_sql(quote_char=None))
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

I've performed a JOIN using bigrquery and the dbGetQuery function. Now I'd like to query the temporary table I've created but can't connect

I'm afraid that if a bunch of folks start running my actual code I'll be billed for the queries, so my example code is for a fake database.
I've successfully established my connection to BigQuery:
con <- dbConnect(
  bigrquery::bigquery(),
  project = 'myproject',
  dataset = 'dataset',
  billing = 'myproject'
)
Then I performed a LEFT JOIN using the coalesce function:
dbGetQuery(con,
  "SELECT
    `myproject.dataset.table_1`.Pokemon,
    coalesce(`myproject.dataset.table_1`.Type_1, `myproject.dataset.table_2`.Type_1) AS Type_1,
    coalesce(`myproject.dataset.table_1`.Type_2, `myproject.dataset.table_2`.Type_2) AS Type_2,
    `myproject.dataset.table_1`.Total,
    `myproject.dataset.table_1`.HP,
    `myproject.dataset.table_1`.Attack,
    `myproject.dataset.table_1`.Special_Attack,
    `myproject.dataset.table_1`.Defense,
    `myproject.dataset.table_1`.Special_Defense,
    `myproject.dataset.table_1`.Speed
  FROM `myproject.dataset.table_1`
  LEFT JOIN `myproject.dataset.table_2`
    ON `myproject.dataset.table_1`.Pokemon = `myproject.dataset.table_2`.Pokemon
  ORDER BY `myproject.dataset.table_1`.ID;")
The JOIN produced the table I intended, and now I'd like to query that table, but like...where is it? How do I connect? Can I save it locally so that I can start working on my analysis in R? Even if I go to BigQuery, select the Project History tab, select the query I just ran in RStudio, and copy the Job ID for the temporary table, I still get the following error:
Error: Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Run `rlang::last_error()` to see where the error occurred.
And if I follow up:
> rlang::last_error()
<error/rlang_error>
Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Backtrace:
1. DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
2. DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
3. DBI:::.local(conn, statement, ...)
5. bigrquery::dbSendQuery(conn, statement, ...)
6. bigrquery:::BigQueryResult(conn, statement, ...)
7. bigrquery::bq_job_wait(job, quiet = conn@quiet)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Backtrace:
x
1. +-DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
2. \-DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
3. \-DBI:::.local(conn, statement, ...)
4. +-DBI::dbSendQuery(conn, statement, ...)
5. \-bigrquery::dbSendQuery(conn, statement, ...)
6. \-bigrquery:::BigQueryResult(conn, statement, ...)
7. \-bigrquery::bq_job_wait(job, quiet = conn@quiet)
Can someone please explain? Is it just that I can't query a temporary table with the bigrquery package?
From looking at the documentation, the problem might just be that you did not assign the results anywhere.
local_df = dbGetQuery(...
should take the results from your database query and copy them into local R memory. Take care, as there is no check on the size of the results, so it is easy to run out of memory when doing this.
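A minimal sketch of the assignment (the trivial query here is a stand-in for the JOIN above):
local_df <- dbGetQuery(con, "SELECT 1 AS x") # substitute the JOIN query here
class(local_df) # "data.frame": the rows now live in local R memory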
You have tagged the question with dbplyr, but it looks like you are just using the DBI package. If you want to write R and have it translated to SQL, you can do this using dbplyr. It would look something like this:
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(...) # your connection details here
remote_tbl1 <- tbl(con, from = "table_1")
remote_tbl2 <- tbl(con, from = "table_2")
new_remote_tbl <- remote_tbl1 %>%
  left_join(remote_tbl2, by = "Pokemon", suffix = c("", ".y")) %>%
  mutate(Type_1 = coalesce(Type_1, Type_1.y),
         Type_2 = coalesce(Type_2, Type_2.y)) %>%
  select(ID, Pokemon, Type_1, Type_2, ...) %>% # list your return columns
  arrange(ID)
When you use this approach, new_remote_tbl can be thought of as a new table in the database which you can query and manipulate further. (It is not actually a table - no data was saved to disk - but you can query it and interact with it as if it were, and the database will produce it for you on demand.)
There are some limitations of working with a remote table (the biggest is you are limited to commands that dbplyr can translate into SQL). When you want to copy the current remote table into local R memory, use collect:
local_df <- new_remote_tbl %>%
  collect()
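Two related dbplyr tools are worth knowing here (a sketch using new_remote_tbl from above; compute() support depends on the backend):
new_remote_tbl %>% show_query() # print the SQL dbplyr generated
materialized <- new_remote_tbl %>% compute() # materialize the result as a table server-side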

dbReadTable won't pull data but dbListFields will see correct fields

I am trying to pull data from a SQL database that I have access to. I can connect to the database, see the tables and get the fields associated with a given table, but cannot read a table into an R variable.
I'm working in R Studio, in case this makes a difference.
I have tried using online code snippets (new to R) and these work, except for the dbReadTable() examples. I have used both "Payments" and name="Payments" as the second argument, and both with and without "" quotes.
library(DBI)
con <- dbConnect(odbc::odbc(),
                 .connection_string = "Driver={SQL Server};Server=example_1234;Database=exampleDB;TrustedConnection=TRUE")
testing123 <- dbListFields(con, "Payments")
testing456 <- dbReadTable(con, "Payments")
I expect a connection to the database which is now named con. This works.
I expect testing123 to contain all the fields in "Payments". This also works.
I expect testing456 to be a data.frame copy of Payments. This produces:
Error: 'SELECT * FROM "Payments"'
nanodbc/nanodbc.cpp:1587: 42S02: [Microsoft][ODBC SQL Server Driver][SQL Server]Invalid object name 'Payments'.
Without quotes around Payments the error is slightly different - simply "Object "Payments" not found".
Any help much appreciated.
I suspect that it's because your table is in a different catalog or schema.
Rationale: DBI::dbListFields is doing select * from ... limit 0 (which is not correct syntax for SQL Server), but odbc::dbListFields is really calling a C++ function connection_sql_columns that is SQL Server specific. It might be permitting you to be a touch sloppy, in that it will find the table even if you do not specify the catalog and/or schema. This is why your dbListFields works. However, DBI::dbReadTable is really doing select * from ... under the hood (and odbc:: is not overriding it), so it does not allow you to omit the schema (and/or catalog).
First, find the specific table information for your case:
DBI::dbGetQuery(con, "select top 1 table_catalog, table_schema, table_name, column_name from information_schema.columns where table_name='Payments'")
# table_catalog table_schema table_name column_name
# 1 my_catalog dbo Payments Id
(I'm projecting what you'll find.)
From here, try one of the following until it works:
x <- DBI::dbReadTable(con, DBI::SQL("[Payments]")) # equivalent to the original
x <- DBI::dbReadTable(con, DBI::SQL("[dbo].[Payments]"))
x <- DBI::dbReadTable(con, DBI::SQL("[my_catalog].[dbo].[Payments]"))
My guess is that DBI::dbGetQuery(con, "select top 1 * from Payments") will not work, so for "regular queries" you'll need to use the same hierarchy of catalog.schema.table, such as one of
DBI::dbGetQuery(con, "select top 1 * from dbo.Payments")
DBI::dbGetQuery(con, "select top 1 * from [dbo].[Payments]")
DBI::dbGetQuery(con, "select top 1 * from [my_catalog].[dbo].[Payments]")
(The use of the [ and ] quoted-identifier brackets is often a personal preference, strictly required only in some corner cases.)
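As an alternative to hand-building bracketed names, DBI's Id() helper spells out the schema and table explicitly (a sketch, assuming the dbo schema found above):
x <- DBI::dbReadTable(con, DBI::Id(schema = "dbo", table = "Payments"))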
Try changing your con argument just slightly:
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "SQL Server",
                      Server = "example_1234",
                      Database = "exampleDB",
                      TrustedConnection = TRUE)
# read table to df
testing456 <- dbReadTable(con,"Payments")
# you can also use SQL queries directly, such as:
testing789 <- dbGetQuery(con, statement = "SELECT * FROM Payments WHERE ...")

Putting output from sql query into another query using R environment

I am wondering what approach I should take to perform the action in the title. I am using an ODBC connection, and what I get from the first SQL query is about 40-50 rows in one column. What I want is to use this output as the values to search for in a second query.
How should I treat this? As an array or as separate variables? I still do not know R well, so I just need to know where to look.
Regards
------more explanation below----
I have a list of 40-50 numbers, 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and the characters are organized like this after these operations:
'c(1234567890, , 1234567890, 1234567890)'
Almost everything looks fine and fits my query, apart from the extra c character at the beginning and the missing apostrophes. I tried the gsub function, but it did not work the way I want.
You can likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
is passed to the WHERE clause of the primary query. This is perfectly valid and allows your query to execute entirely in SQL, without any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
  "SELECT mpg, qsec, gear, wt
   FROM mtcars
   WHERE gear IN (SELECT n_gear
                  FROM Gear
                  WHERE n_gear IN (3,4))"
)
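The same shape works over a live DBI/ODBC connection as a single query (a sketch; con, target_table, lookup_table and the id column are placeholders, not names from the question):
result <- DBI::dbGetQuery(con, "
  SELECT t.*
  FROM target_table t
  WHERE t.id IN (SELECT id FROM lookup_table)
")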
Try something like this:
list <- c("try","this") # the output from your first query
sql_in <- paste0(list, collapse="','")
The output:
paste0("select * from table where table.var in ", paste("('", sql_in, "')", sep=''))
[1] "select * from table where table.var in ('try','this')"
If you have a space as the first or last character of a string, you can use this code:
list <- c(" first element is a space","try","this","last element is a space ") # the output from your first query
Find a space at the first or last character:
first_space <- substr(list, start = 1, stop = 1) == " "
last_space <- substr(list, start = nchar(list), stop = nchar(list)) == " "
Remove the spaces:
list[first_space] <- substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space] <- substr(list[last_space], start = 1, stop = nchar(list[last_space]) - 1)
sql_in <- paste0(list, collapse="','")
Your output:
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think you are expecting something like the code shown below:
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're trying to achieve here. In some use cases you can simply do it all with a join. But I have cases where I don't know the values for the test without doing some computation first. Then I do a separate query, having created a query string like this:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"

update an SQL table via R sqlSave

I have a data frame in R with 3 columns; using sqlSave I can easily create a table in an SQL database:
channel <- odbcConnect("JWPMICOMP")
sqlSave(channel, dbdata, tablename = "ManagerNav", rownames = FALSE, append = TRUE, varTypes = c(DateNav = "datetime"))
odbcClose(channel)
This data frame contains information about managers (Name, Nav and Date) which is updated every day with new values for the current date, and sometimes old values need to be corrected as well.
How can I accomplish this task in R?
I tried to use sqlUpdate but it returns the following error:
> sqlUpdate(channel, dbdata, tablename = "ManagerNav")
Error in sqlUpdate(channel, dbdata, tablename = "ManagerNav") :
cannot update ‘ManagerNav’ without unique column
When you create a table "the white shark-way" (see documentation), it does not get a primary index; it is just plain columns, often of the wrong type. Usually I use your approach to get the column names right, but after that you should go into your database and assign a primary index and correct the column widths and types.
After that, sqlUpdate() might work; I say might because I have given up on sqlUpdate(), there are too many caveats, and instead use sqlQuery(..., paste("UPDATE ...")) for the real work.
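A minimal sketch of that manual-UPDATE approach (the table and the Name/Nav/DateNav columns come from the question; using Name plus DateNav as the key is an assumption):
for (i in seq_len(nrow(dbdata))) {
  # build one UPDATE statement per row; assumes Name + DateNav identify a row
  qry <- paste0("UPDATE ManagerNav SET Nav = ", dbdata$Nav[i],
                " WHERE Name = '", dbdata$Name[i], "'",
                " AND DateNav = '", dbdata$DateNav[i], "'")
  sqlQuery(channel, qry)
}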
What I would do for this is the following:
Solution 1
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = c("ManagerNav"))
Solution 2
Lcolumns <- list(dbdata[0, ])
sqlUpdate(channel, dbdata, tablename = "ManagerNav", index = c(Lcolumns))
The index argument specifies which column(s) R uses as the unique key when matching rows to update.
Hope this helps!
If none of the other solutions work and your data is not that big, I'd suggest using sqlQuery() and looping through your data frame.
one_row_of_your_df <- function(i) {
  # build one INSERT statement from row i of the data frame
  sql_query <- paste0(
    "INSERT INTO your_table_name (column_name1, column_name2, column_name3) VALUES ",
    "(",
    "'", your_dataframe[i, 1], "', ",
    "'", your_dataframe[i, 2], "', ",
    "'", your_dataframe[i, 3], "'",
    ")"
  )
  return(sql_query)
}
This function is Exasol specific; Exasol's SQL is pretty similar to MySQL's, but not identical, so small changes may be necessary.
Then use a simple for loop like this one:
for (i in 1:nrow(your_dataframe)) {
  sqlQuery(your_connection, one_row_of_your_df(i))
}