DBI/Spark: how to store the result in a Spark Dataframe?

I am using sparklyr to run some analysis, but I am interested also in writing raw SQL queries using DBI.
I am able to run the following query
query <- "SELECT col1, FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)
but this returns the data into R (as an R data frame).
What I want instead is to keep the data inside Spark and store it in another Spark DataFrame for further interaction with sparklyr.
Any ideas?

The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it: if your query returns a large result set, it will overwhelm the Spark driver's memory and cause out-of-memory errors.
What's happening with sparklyr is the following: DBI runs the SQL command and returns an R DataFrame, which means it collects the data in order to materialize it in a regular R context.
Thus, if you only want to return small datasets, you don't need Spark for that in the first place.
So DBI isn't the solution for you; you ought to use regular SparkR, or sparklyr's lower-level invoke API, if you want to stick with R for this.
Here is an example of how you can run SQL through sparklyr and keep the result inside Spark:
sc %>%
  spark_session() %>%
  invoke("sql", "SELECT 1") %>%
  invoke("createTempView", "foo")

You may also do:
mydata_spark_df <- tbl(sc, sql("select * from mydata"))
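If you then want a named Spark DataFrame you can keep working with from sparklyr, one option (a minimal sketch; mydata and the "reports" name are placeholders) is to register or compute the lazy table rather than collect it:

library(sparklyr)
library(dplyr)
library(dbplyr)

# run the query lazily; nothing is pulled into R
reports_tbl <- tbl(sc, sql("SELECT col1 FROM mydata WHERE some_condition"))

# register it as a temp view so it stays addressable on the Spark side
reports_sdf <- sdf_register(reports_tbl, "reports")

# or force Spark to materialize it as a temporary table
reports_sdf <- compute(reports_tbl, name = "reports")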

Related

Conserving timestamps for Date/Time column when importing access sheets into R using RODBC

I have an access database (/access.mdb), one sheet of which ("dive") I am trying to import into R using the following code:
library(RODBC)
library(dplyr)

db <- odbcDriverConnect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};
                        DBQ=D:/folder/access.mdb")
data <- as_tibble(sqlFetch(db, "dive", rownames = TRUE)) %>%
  select("ref", "DE_DATE")
The table imports fine; however, the "DE_DATE" column, which is a Date/Time field in the database, comes through as a POSIXct object containing only the dates, not the timestamps. In the database the values are of the form dd/mm/yyyy HH:MM:SS (date and time separated by a space), which I think is where the issue is arising.
I've had a look into the sqlFetch function but can't see an immediately obvious way to manipulate individual columns when reading in a table. I'm not well versed in SQL, so I'm not sure how I would query this to ensure all the information in those cells is conserved. I would like to import the column with both dates and timestamps, in the same format as in the database.
Many thanks for any help.
Since you're using dplyr/tidyverse, why not go full DBI/dbplyr?
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
db <- dbConnect(odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=D:/folder/access.mdb")
data <- tbl(db, "dive") %>%
  select(ref, DE_DATE) %>%
  collect()
In my experience, DBI tends to get types right more often, and using dbplyr (since you're already using dplyr) has the additional advantage of not fetching data you don't use. In the example, only the columns ref and DE_DATE are fetched, whereas with RODBC all columns would be fetched and the unused ones discarded afterwards.
Note that the collect() call is what actually fetches the data; until then, the operations only compose an SQL statement and no data is transferred.
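If you want to check what dbplyr is about to send to Access before fetching anything, show_query() on the lazy table prints the generated SQL (a small sketch reusing the table and column names from the question):

dive_lazy <- tbl(db, "dive") %>%
  select(ref, DE_DATE)

show_query(dive_lazy)       # prints the SELECT statement dbplyr generated
data <- collect(dive_lazy)  # only now is the data pulled into R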

Is there a way to filter a SQL table based on a dataframe in R? [duplicate]

I have a rather large table in MS SQL Server (120 million rows) which I would like to query. I also have a dataframe in R with unique IDs that I would like to use as part of my query criteria. I am familiar with the dplyr package, but I'm not sure whether it's possible to have the R query execute on the MS SQL Server rather than bringing all the data into my laptop's memory (which would likely crash my laptop).
Of course, the other option is to load the dataframe into SQL Server as a table, which is what I am currently doing, but I would prefer not to.
Depending on what exactly you want to do, you may find value in the RODBCext package.
Let's say you want to pull columns from an MS SQL table for the IDs contained in a vector you have in R. You might try code like this:
library(RODBC)
library(RODBCext)
library(tidyverse)
dbconnect <- odbcDriverConnect('driver={SQL Server};
server=servername;database=dbname;trusted_connection=true')
v1 <- c(34,23,56,87,123,45)
qdf <- data_frame(idlist=v1)
sqlq <- "SELECT * FROM tablename WHERE idcol %in% ( ? )"
qr <- sqlExecute(dbconnect,sqlq,qdf,fetch=TRUE)
Basically, you put all the information you want to pass to the query into a dataframe. Think of its columns as the variables or parameters of your query: for each parameter you want one column. Then you write the parameterized query as a character string (with ? placeholders) and store it in a variable, and sqlExecute puts it all together by binding each row of the dataframe to the placeholders when it runs the query.
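Since the question mentions dplyr, an alternative sketch worth noting (hedged; tablename, idcol and the connection string are the same placeholders as above) is to let dbplyr translate a filter into an IN clause that runs on the SQL Server side, so only the matching rows come back:

library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc(), .connection_string =
  "driver={SQL Server};server=servername;database=dbname;trusted_connection=true")

v1 <- c(34, 23, 56, 87, 123, 45)

result <- tbl(con, "tablename") %>%
  filter(idcol %in% v1) %>%   # translated to WHERE idcol IN (...) on the server
  collect()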

Spark Dataframe from SQL Query

I'm attempting to use Apache Spark to load the results of a (large) SQL query with multiple joins and sub-selects into a Spark DataFrame, as discussed in Create Spark Dataframe from SQL Query.
Unfortunately, my attempts to do so result in an error from Parquet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unable to infer schema for Parquet. It must be specified manually.
I have seen information from Google implying that this error occurs when a DataFrame is empty. However, the query returns plenty of rows when run in DBeaver.
Here is an example query:
(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
AND STORE_NBR IN (999)
ORDER BY BUSINESS_DATE) as reports
My Spark code looks like this.
val reportsDataFrame = spark
  .read
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()

reportsDataFrame.show(10)
I read in the answer linked above that it is possible to run queries against an entire database using this method, in particular by setting the "dbtable" parameter to an aliased subquery when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone facing the same issue online, in particular the Parquet error resulting from running the query.
It seems the consensus is that one should not run SQL queries this way and should instead use Spark's many DataFrame methods to filter, group and aggregate data. However, it would be very valuable for us to be able to use raw SQL, even if it incurs a performance penalty.
A quick look at your code tells me you are missing .format("jdbc"):
val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()
This should work provided you have username and password set to connect to the database.
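If the credentials are not embedded in the JDBC URL, they can also be passed as options (a sketch; the user and password values here are placeholders):

val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .option("user", "db2User")          // placeholder credentials
  .option("password", "db2Password")
  .load()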
A good resource for learning more about JDBC sources in Spark: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Deletion/Updation of rows using Spark SQL

I want to delete an existing row and replace it with a new row.
Can we delete or update rows in a database using Spark SQL?
Spark SQL does not support UPDATE statements yet.
However, Hive does support UPDATE/DELETE statements (since version 0.14), but only on tables that support ACID transactions, as mentioned in the Hive documentation.
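For reference, a sketch of what that looks like in Hive (hedged: the table and column names are placeholders, and in Hive 0.14–2.x the table must be bucketed, stored as ORC, and flagged transactional for UPDATE/DELETE to be accepted):

-- Hive 0.14+, ACID/transactional table required
CREATE TABLE mytable (id INT, des STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE mytable SET des = 'b' WHERE id = 1;
DELETE FROM mytable WHERE id = 2;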
SparkR code:
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# create an R data frame
df <- data.frame(col = c("A", "A", "B", "B"), des = c("a", "b", "b", "c"))

# convert it to a Spark DataFrame and register it as a temp table
sdf <- createDataFrame(sqlContext, df)
registerTempTable(sdf, "sdf")

head(sql(sqlContext, "SQL QUERY"))
Try the corresponding SQL query there and execute it. I don't know whether it will support UPDATE statements.
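As a concrete (hypothetical) example of the placeholder above, a plain SELECT against the registered temp table would look like this; the UPDATE case is the part Spark SQL does not support:

head(sql(sqlContext, "SELECT des FROM sdf WHERE col = 'A'"))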

Convert sqlalchemy ORM query object to sql query for Pandas DataFrame

This question feels fiendishly simple but I haven't been able to find an answer.
I have an ORM query object, say
query_obj = session.query(Class1).join(Class2).filter(Class2.attr == 'state')
I can read it into a dataframe like so:
testdf = pd.read_sql(query_obj.statement, query_obj.session.bind)
But what I really want to do is use a traditional SQL query instead of the ORM:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
where query is a traditional SQL string. When I run this, I get an error along the lines of "query_obj is not executable". Does anyone know how to convert the ORM query to a traditional query? Also, how does one get the column names in after building the dataframe?
Context for why I'm doing this: I've set up an ORM layer on top of my database and am using it to query data into a Pandas DataFrame. It works, but it frequently maxes out my memory. I want to cut my in-memory overhead with some string folding (pass 3 outlined here: http://www.mobify.com/blog/sqlalchemy-memory-magic/). That requires (correct me if I'm wrong) not using the read_sql string and instead processing the query's results as raw tuples.
The long version is described in detail in the FAQ of sqlalchemy: http://sqlalchemy.readthedocs.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined
The short version is:
statement = query.statement
print(statement.compile(engine))
The result of this can be used in read_sql.
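Putting that together with pandas might look roughly like this (a sketch using the names from the question; compile_kwargs={"literal_binds": True} inlines the bound parameters so the string is self-contained):

import pandas as pd

statement = query_obj.statement
sql_str = str(statement.compile(engine, compile_kwargs={"literal_binds": True}))

testdf = pd.read_sql(sql_str, engine)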
This may be a later version of SQLAlchemy than when the question was posted, but simply printing the query object:
print(query)
outputs the SQL, which you can copy and paste back into your script.
Fiendishly simple indeed. Per Jori's link to the docs, it's just query_obj.statement to get the SQL query. So my code is:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj.statement)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
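To also answer the column-name part of the question, the result's keys() can be passed to the DataFrame constructor (a small addition to the snippet above, under the same assumptions):

with engine.connect() as connection:
    results = connection.execute(query_obj.statement)
    fetchall = results.fetchall()
    # results.keys() gives the column names of the result set
    dataframe = pd.DataFrame(fetchall, columns=results.keys())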