R vs SQL - loading only some data from a database

I'm using this code to load only the ids that are in my df:
library(dplyr)

tbl(conn, "table") %>%
  filter(idvar %in% df$id) %>%
  select(var1, var2, var3) %>%
  collect()
The question is how to combine this with a join and additional criteria, as in the SQL below, while still loading only the matched ids. There are millions of ids in my db, but only hundreds in my df.
SELECT *
FROM table
LEFT JOIN table2 ON table2.id = table.id
WHERE date > '2010-01-01' AND column3 IS NOT NULL

Hope this little workaround helps you. I have tried a similar scenario and it worked for me.
Note: I didn't try using dplyr.
I used MySQL as the database, with DBI and pool as the R packages.
library(DBI)
library(pool)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "db_name",
  host = "host_name",
  username = "User_name",
  password = "password",
  port = 3306,
  unix.sock = "/path/to/mysqld/mysqld.sock"
)
In the line above I passed the MySQL socket path to unix.sock because I encountered a problem without it. To get the socket path:
mysql_config --socket (ubuntu)
users <- lapply(df$id, function(x) {
  dbGetQuery(pool, paste0(
    "SELECT * FROM table LEFT JOIN table2 ON table2.id = table.id
     WHERE table.user_id IN ('", x, "');"))
})
Edit the SQL query up to the WHERE clause according to your requirements. It fetches the results from the database as a list; process that list as needed.
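For what it's worth, the same thing can be kept entirely in dplyr/dbplyr, so that the id filter, the join, and the extra criteria are all translated into a single query and only the matched rows ever leave the server. A minimal sketch, assuming the same conn and df as in the question and that the column names match the SQL above:

library(dplyr)
library(dbplyr)

tbl(conn, "table") %>%
  filter(idvar %in% df$id) %>%                     # translated to IN (...) with the local ids
  left_join(tbl(conn, "table2"), by = "id") %>%    # LEFT JOIN runs on the server
  filter(date > "2010-01-01", !is.na(column3)) %>% # extra criteria, also server-side
  collect()                                        # only the matched rows come back to R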

Related

Re-Writing "Fuzzy Join" Functions from R to SQL

In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:
library(fuzzyjoin)
library(dplyr)
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
sample_query = sqlQuery(
  stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
    filter(date_1 >= date_2, date_1 <= date_3)
)
view(sample_query)
However, I do not think this is possible, because the function being used for the "fuzzy join" (stringdist_inner_join) is not supported.
I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R
My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?
Right now I can only execute "sample_query" locally; rewriting this function (stringdist_inner_join) so it runs on the database would let me perform the "sample_query" much faster.
Does anyone know if this is possible?
Note:
My data looks like this:
table_1 = data.frame(
  id1 = c("123 A", "123BB", "12 5", "12--5"),
  id2 = c("11", "12", "14", "13"),
  date_1 = c("2010-01-31", "2010-01-31", "2015-01-31", "2018-01-31")
)
table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(
  id1 = c("0123", "1233", "125 .", "125_"),
  id2 = c("111", "112", "14", "113"),
  date_2 = c("2009-01-31", "2010-01-31", "2010-01-31", "2010-01-31"),
  date_3 = c("2011-01-31", "2010-01-31", "2020-01-31", "2020-01-31")
)
table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
Based on your other post about this issue, the question of how to structure the SQL query was already solved:
SAS: Fuzzy Joins
select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3)
and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)
To get this to run in an R script, I would recommend using dbplyr and creating the query with tbl, so you can continue doing basic manipulation of it as if it were a data.frame and dbplyr will translate it into SQL (at least for basic commands). Everything is then combined into one query, and you eventually pull the data with the collect() function.
Edit: Just a note, the tbl command will start building a SQL statement and fetch column names, but it won't pull any data until you call collect(), at which point R sends the query to the server, the server runs the query, and the data comes back.
Keep this in mind, because if dbplyr can't translate something to SQL, it will assume it's a SQL command and try to send it anyway, so you won't see the error until you collect(). For example, str_detect from the stringr package isn't implemented in dbplyr, so dbplyr would send that command to the database, which would throw an error because it doesn't know what that is, but only after running collect(). Check out the dbplyr page linked above for details.
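To see this pass-through behavior without touching a real database, here is a small sketch using dbplyr's simulated connections; show_query() prints the generated SQL without executing anything, and the exact formatting varies by dbplyr version. le_dst is used here as an example of a function dbplyr does not know:

library(dplyr)
library(dbplyr)

lazy_tbl <- lazy_frame(id1 = "a", con = simulate_postgres())

# dbplyr has no translation for le_dst, so it is emitted verbatim;
# a real server would only raise an error once the query actually runs.
lazy_tbl %>%
  mutate(distance = le_dst(id1, "125")) %>%
  show_query()
#> <SQL> SELECT `df`.*, le_dst(`id1`, '125') AS `distance` FROM `df`  (approximate)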
library(DBI)
library(odbc)
library(dbplyr)
library(dplyr)

new_con <- dbConnect(
  odbc(),
  Driver = "ODBC Driver 17 for SQL Server",  # as an example
  Server = "Server name here",
  uid = "some_id",
  pwd = "abc"
)
sample_query <- dbplyr::tbl(
  new_con,
  dbplyr::sql(
    "select a.*, b.*
     from table_a a
     inner join table_b b
     on (a.date_1 between b.date_2 and b.date_3)"
  )
)
sample_data <- sample_query %>%
  filter(silly_example == TRUE) %>%
  collect()
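Before collecting, it can be worth printing the SQL that dbplyr has built up, so you catch untranslated pieces early:

# Inspect the combined query (the custom SQL plus the translated filter)
# before any data is pulled from the server.
sample_query %>%
  filter(silly_example == TRUE) %>%
  show_query()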
I agree with @Roger-123's approach. But here is a variation that might assist:
Assuming you are using remote connections to access the Netezza database, you could do this using dbplyr as follows:
remote_1 = tbl(con, "table_1_name")
remote_2 = tbl(con, "table_2_name")

# create dummy column
remote_1 = mutate(remote_1, ones = 1)
remote_2 = mutate(remote_2, ones = 1)

output = remote_1 %>%
  # cross join via the dummy column
  inner_join(remote_2, by = "ones", suffix = c("_1", "_2")) %>%
  # calculate Levenshtein distance between the two id1 columns
  mutate(distance = le_dst(id1_1, id1_2)) %>%
  # filter to close matches
  filter(distance <= 2)
Notes:
dbplyr does not allow for complex conditions in its joins. Hence we do the most general join possible and then filter.
If you also want to join by date, you can put the conditions into the inner_join if they are simple, or add another filter condition if they are complex (see the sketch after these notes).
le_dst is not an R function and there is no dbplyr translation for it, so dbplyr will pass it to the server as-is.
Netezza accepts two distance functions for text: le_dst and dle_dst. You can use whichever you please here.
The output is a query; it will act like a table, but it is generated/calculated on the fly. It has not been written to disk or loaded into R memory, so depending on your application you may want to store or save it.
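For instance, the date conditions from the original query can be added as one more server-side filter on the pipeline above. A sketch, assuming date_1, date_2, and date_3 are columns the server can compare directly:

matches <- output %>%
  # the BETWEEN condition from the original SQL, expressed as a filter
  filter(date_1 >= date_2, date_1 <= date_3)

# Nothing has run yet; pull the result into R (or compute() it server-side):
matches %>% collect()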

PySpark - Querying a Oracle Database Directly without loading it entirely beforehand

I am trying to query a database directly:
file_df.createOrReplaceTempView("file_contents")
QUERY = "SELECT * FROM TABLE1 INNER JOIN file_contents on TABLE1.ID = file_contents.ID"
df = sqlContext.read.format("jdbc").options(
    url=URL,
    driver=DRIVER,
    query=QUERY,
    user=USER,
    password=PASSWORD
).load()
TABLE1 is in the Oracle Database.
However, this code results in the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o343.load.
: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
How can I fix this? That is, I do not want to load the entire large database table; I want to query it directly and load only the rows that result from the inner join with the TempView file_contents.
You cannot do it without bringing both datasets to the same platform: the JDBC query option is executed entirely by Oracle, which knows nothing about the Spark temp view file_contents, hence the ORA-00942 error.
Option 1 - Preferred
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbcUrl = "jdbc:oracle:thin:@{0}:{1}/{2}".format("asdas", "1521", "asdasd")
connectionProperties = {
    "user": "asdasd",
    "password": "asdasda",
    "driver": "oracle.jdbc.driver.OracleDriver",
    "fetchsize": "100000"
}

pushdown_query = "(SELECT * FROM TABLE1) aliasname"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)

file_df.createOrReplaceTempView("file_contents")
df.createOrReplaceTempView("table1")

spark.sql("SELECT * FROM TABLE1 INNER JOIN file_contents ON TABLE1.ID = file_contents.ID")
Option 2
Create a temporary table on the Oracle end (say the table is filecontent), load file_df into it, and then extract what you need with a join:
file_df.write.format("jdbc").options(
    url=tgt_url,
    driver=tgtdriver,
    dbtable="filecontent",
    user=tgt_username,
    password=tgt_password
).mode("overwrite").option("truncate", "true").save()
Option 3 - If the file contents can be collected as a list and passed into an IN clause:
file_df_id = file_df.select("ID").rdd.flatMap(lambda x: x).collect()
query_param = ",".join(str(i) for i in file_df_id)
pushdown_query = f"(SELECT * FROM TABLE1 WHERE ID IN ({query_param})) query_temp"
print(pushdown_query)
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)

How to pipe SQL into R's dplyr?

I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct(), but it's not supported in SQL syntax. Anyway, this does indeed work:
dbGetQuery(database_name,
  "SELECT t.*
   FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
         FROM table_name t) t
   WHERE SEQNUM = 1;")
I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:
distinct.df <-
  left_join(sql_table_1, sql_table_2, by = "col5") %>%
  sql("SELECT t.*
       FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
             FROM table_name t) t
       WHERE SEQNUM = 1;")
So I dplyr::left_join() two SQL tables, then I want to look at distinct rows while keeping all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so, what would I use for table_name on the line FROM table_name t?
In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun . or sometimes the .data pronoun from rlang if I were in memory working in R without databases.
I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.
It looks like you want to combine custom SQL code with the auto-generated SQL from dbplyr. For this it is important to distinguish between:
DBI::db* commands, which execute the provided SQL on the database and return the result.
dbplyr translation, where you work with a remote connection to a table.
You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
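As a minimal illustration of the difference (a sketch, assuming an open DBI connection con and an existing table my_table):

library(DBI)
library(dplyr)

# DBI: the SQL string is executed immediately and the result returned to R
local_df <- dbGetQuery(con, "SELECT DISTINCT * FROM my_table")

# dbplyr: a lazy remote table; SQL is only generated and run at collect()
remote_tbl <- tbl(con, "my_table") %>% distinct()
local_df2 <- collect(remote_tbl)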
Reference examples that cover many of the different use cases
If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository (here). This includes:
a union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
a write_to_database function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute.
Automatic piping
dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which SQL translations are defined. As long as translations are defined, you can chain together many pipes (I have used 10 or more at once), with the (almost) only disadvantage being that the translated query becomes difficult for a human to read.
For example, consider the following:
library(dbplyr)
library(dplyr)

tmp_df = data.frame(col1 = c(1, 2, 3), col2 = c("a", "b", "c"))

df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())

df = left_join(df1, df2, by = "col1") %>%
  distinct()
When you then call show_query(df) R returns the following auto-generated SQL code:
SELECT DISTINCT *
FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) `dbplyr_002`
But not as nicely formatted. Note that the initial command (the left join) appears as a nested query, with the distinct in the outer query. Hence df is an R link to a remote database table defined by the above SQL query.
Creating custom SQL functions
You can pipe dbplyr into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.
custom_distinct <- function(df) {
  db_connection <- df$src$con
  sql_query <- build_sql(
    con = db_connection,
    "SELECT DISTINCT * FROM (\n",
    sql_render(df),
    ") AS nested_tbl"
  )
  return(tbl(db_connection, sql(sql_query)))
}
df = left_join(df1, df2, by = "col1") %>%
  custom_distinct()
When you then call show_query(df), R should return the following SQL code (I say 'should' because I cannot get this working with simulated SQL connections), though not as nicely formatted:
SELECT DISTINCT * FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) nested_tbl
As with the previous example, df is an R link to a remote database table defined by the above sql query.
Converting dbplyr to DBI
You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.
As another way of writing a distinct query:
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1")
custom_distinct2 = paste0(
  "SELECT DISTINCT * FROM (",
  as.character(sql_render(df)),
  ") AS nested_table"
)

# db_connection must be a real DBI connection here;
# simulated connections cannot execute queries.
local_table = dbGetQuery(db_connection, custom_distinct2)
Which will return a local R dataframe with the equivalent sql command as per the previous examples.
If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice; it is perhaps better to name the table explicitly.) Then, use dbGetQuery() as usual.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1) %>%
  summarize(c = sum(a * b, na.rm = TRUE))
lazy_query
#> # Source: lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed <-
  lazy_query %>%
  compute()
lazy_query_computed
#> # Source: table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#> c
#> <dbl>
#> 1 20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002
Created on 2020-01-01 by the reprex package (v0.3.0)
If your SQL dialect supports CTEs, you could also extract the query string and use this as part of a custom SQL, perhaps similarly to Simon's suggestion.
library(tidyverse)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1) %>%
  summarize(c = sum(a * b, na.rm = TRUE))

sql <-
  lazy_query %>%
  sql_render()

cte_sql <-
  paste0(
    "WITH my_result AS (", sql, ") ",
    "SELECT c + 1 AS d FROM my_result"
  )
cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"
DBI::dbGetQuery(
lazy_query$src$con,
cte_sql
)
#> d
#> 1 21
Created on 2020-01-01 by the reprex package (v0.3.0)

R equivalent of SQL update statement

I use the statement below to update a PostgreSQL db:
update users
set col1 = 'setup',
    col2 = 232
where username = 'rod';
Can anyone guide me on how to do something similar using R? I am not good at R.
Thanks in advance for the help.
Since you didn't provide any data, I've created some here.
users <- data.frame(
  username = c('rod', 'stewart', 'happy'),
  col1 = c(NA_character_, 'do', 'run'),
  col2 = c(111, 23, 145),
  stringsAsFactors = FALSE
)
To update using base R:
# use list() on the right-hand side so col1 stays character and col2 numeric
users[users$username == 'rod', c('col1', 'col2')] <- list('setup', 232)
If you prefer the more explicit syntax provided by the data.table package, you would execute:
library(data.table)
setDT(users)
users[username == 'rod', `:=`(col1 = 'setup', col2 = 232)]
To update your database through RPostgreSQL, you will first need to create a database connection, and then simply store your query in a string, e.g.
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = <your database name>, user = <user>, password = <password>)
statement <- "update <schema>.users set col1 = 'setup', col2 = 232 where username = 'rod';"
dbGetQuery(con, statement)
dbDisconnect(con)
Note: depending upon your PostgreSQL configs, you may also need to set your search path with dbGetQuery(con, 'set search_path = <schema>;').
I'm more familiar with RPostgres, so you may want to double-check the syntax and vignettes of the RPostgreSQL package.
EDIT: It seems RPostgreSQL prefers dbGetQuery for sending updates and commands, rather than dbSendQuery.
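For what it's worth, with a DBI backend that supports parameter binding (e.g. RPostgres), a parameterized statement avoids pasting values into the SQL string. A sketch, assuming an open connection con; the $1/$2/$3 placeholders are PostgreSQL-style, and RPostgreSQL's support for them may differ:

library(DBI)

# Values are bound by the driver rather than interpolated into the string
dbExecute(
  con,
  "UPDATE users SET col1 = $1, col2 = $2 WHERE username = $3",
  params = list("setup", 232, "rod")
)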

ROracle Errors When Trying to Use Bound Parameters

I'm using ROracle on a Win7 machine running the following R version:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 1.1
year 2014
month 07
day 10
svn rev 66115
language R
version.string R version 3.1.1 (2014-07-10)
nickname Sock it to Me
Eventually, I'm going to move the script to a *nix machine, cron it, and run it with RScript.
I want to do something similar to:
select * from tablename where 'thingy' in ('string1','string2')
This would return two rows with all columns in SQLDeveloper (or Toad, etc).
(Ultimately, I want to pull results from one DB into a single column of a data.frame, then use those results to loop through and pull results from a second db; but I also need to be able to do just this on its own.)
I'm following the documentation for ROracle from here.
I've also looked at this (which didn't get an answer):
Bound parameters in ROracle SELECT statements
When I attempt the query from ROracle, I get two different errors, depending on whether I try a dbGetQuery() or dbSendQuery().
As background, here are the versions, queries and data I'm using:
Driver name: Oracle (OCI)
Driver version: 1.1-11
Client version: 11.2.0.3.0
The connection information is standard:
library(ROracle)
ora <- dbDriver("Oracle")
dbcon <- dbConnect(ora, username = "username", password = "password", dbname = "dbnamefromTNS")
These two queries return the expected results:
rs_send <- dbSendQuery(dbcon, "select * from tablename where columname_A = 'thingy' and rownum <= 1000")
rs_get <- dbGetQuery(dbcon, "select * from tablename where columname_A = 'thingy' and rownum <= 1000")
That is to say, 1000 rows from tablename where 'thingy' exists in columnname_A.
I have a data.frame of one column, with two rows.
my.data = data.frame(RANDOM_STRING = as.character(c('string1', 'string2')))
and str(my.data) returns this:
str(my.data)
'data.frame': 2 obs. of 1 variable:
$ RANDOM_STRING: chr "string1" "string2"
My attempted queries are:
nope <- dbSendQuery(dbcon,
  "select * from tablename where column_A = 'thingy' and widget_name = :1",
  data = data.frame(widget_name = my.data$RANDOM_STRING))
Error in .oci.SendQuery(conn, statement, data = data, prefetch = prefetch, :
bind data does not match bind specification
and
not_this_either <- dbGetQuery(dbcon,
  "select * from tablename where column_A = 'thingy' and widget_name = :1",
  data = data.frame(widget_name = my.data$RANDOM_STRING))
which gives me an error of:
Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
bind data has too many rows
I'm guessing that my problem is in the data=(widget_name=my.data$RANDOM_STRING) part of the queries, but haven't been able to rubber duck my way through it.
Also, I'm very curious as to why I get two separate and different errors depending on whether the queries use the send (and fetch later) format or the get format.
If you like the tidyverse, there's a slightly more compact way to achieve the above using purrr:
library(ROracle)
library(purrr)

ora <- dbDriver("Oracle")
con <- dbConnect(ora, username = "username", password = "password", dbname = "yourdbnamefromTNSlist")

yourdatalist <- c(12345, 23456, 34567)

output <- map_df(yourdatalist, ~ dbGetQuery(con, "select * from YourTableNameHere where YOURCOLUMNNAME = :d", .x))
Figured it out.
It wasn't a problem with Oracle or ROracle (I'd suspected this) but with my R code.
I stumbled over the answer trying to solve another problem.
This answer about "dynamic strings" was the thing that got me moving towards a solution.
It doesn't fit exactly, but close enough to rubberduck my way to an answer from there.
The trick is to wrap the whole thing in a function and run ldply() (from the plyr package) on it:
library(ROracle)
library(plyr)

ora <- dbDriver("Oracle")
con <- dbConnect(ora, username = "username", password = "password", dbname = "yourdbnamefromTNSlist")

yourdatalist <- c(12345, 23456, 34567)

thisfinallyworks <- function(x) {
  dbGetQuery(con, "select * from YourTableNameHere where YOURCOLUMNNAME = :d", data = x)
}

ldply(yourdatalist, thisfinallyworks)
# row 1 of results where YOURCOLUMNNAME = 12345
# row 2 of results where YOURCOLUMNNAME = 23456
# row 3 of results where YOURCOLUMNNAME = 34567
# etc.
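For reference, the same pattern works without plyr, using base lapply plus do.call(rbind, ...); a sketch under the same assumptions as above, passing the bind value as a one-row data.frame (the answer above shows ROracle also accepts the bare scalar):

# Run one bound query per value and row-bind the pieces,
# equivalent to the ldply() call above.
results <- do.call(rbind, lapply(yourdatalist, function(x) {
  dbGetQuery(con,
             "select * from YourTableNameHere where YOURCOLUMNNAME = :d",
             data = data.frame(d = x))
}))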