I want to delete an existing row and replace it with a new row.
Can we delete or update rows in a database using Spark SQL?
Spark SQL does not support UPDATE statements yet.
However, Hive does support UPDATE/DELETE statements (since version 0.14), but only on tables that support transactions, as mentioned in the Hive documentation.
SparkR code:
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
#create R data frame
df <- data.frame(col= c("A","A","B","B"),des= c("a","b","b","c"))
#converting to spark dataframe
sdf <- createDataFrame( sqlContext, df)
registerTempTable(sdf, "sdf")
head(sql(sqlContext, "SQL QUERY"))
Put the corresponding SQL query in place of "SQL QUERY" and execute it. I don't know whether it will support an UPDATE statement.
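If UPDATE turns out not to be available, one common workaround is to rebuild the table: select every row except the one being replaced and union in the new row. The sketch below shows that pattern on the same col/des data. It uses Python's built-in sqlite3 purely as a stand-in engine so it runs without a Spark cluster (an assumption, not something from the answers above), and the replacement row ('A', 'z') is hypothetical. The SELECT is ordinary SQL, but it has not been tested against Spark here.

```python
import sqlite3

# sqlite3 is only a stand-in engine so this sketch runs anywhere.
rows = [("A", "a"), ("A", "b"), ("B", "b"), ("B", "c")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sdf (col TEXT, des TEXT)")
conn.executemany("INSERT INTO sdf VALUES (?, ?)", rows)

# "Update" without UPDATE: keep every row except the old one, then add the new one.
# Replacing ('A', 'a') with ('A', 'z') is a hypothetical example.
rebuild_query = """
    SELECT col, des FROM sdf WHERE NOT (col = 'A' AND des = 'a')
    UNION ALL
    SELECT 'A' AS col, 'z' AS des
"""
updated_rows = sorted(conn.execute(rebuild_query).fetchall())
print(updated_rows)
conn.close()
```

In Spark the result of such a query would be a new DataFrame that you register or write out in place of the old one, rather than a table modified in place.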
Related
I have a rather large table in MS SQL Server (120 million rows) which I would like to query. I also have a dataframe in R with unique IDs that I would like to use as part of my query criteria. I am familiar with the dplyr package but not sure if it's possible to have the R query execute on the MS SQL server rather than bring all the data into my laptop's memory (which would likely crash my laptop).
Of course, the other option is to load the dataframe into SQL as a table, which is what I am currently doing, but I would prefer not to do this.
Depending on what exactly you want to do, you may find value in the RODBCext package.
Let's say you want to pull columns from an MS SQL table where the IDs are in a vector that you have in R. You might try code like this:
library(RODBC)
library(RODBCext)
library(tidyverse)
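# Keep the connection string on one line so no newline ends up inside it
dbconnect <- odbcDriverConnect('driver={SQL Server};server=servername;database=dbname;trusted_connection=true')
v1 <- c(34,23,56,87,123,45)
# One column per query parameter, one row per ID
qdf <- data.frame(idlist=v1)
# %in% is R syntax, not SQL; bind one ID per execution with "="
sqlq <- "SELECT * FROM tablename WHERE idcol = ?"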
qr <- sqlExecute(dbconnect,sqlq,qdf,fetch=TRUE)
Basically, you want to put all the info you want to pass to the query into a dataframe. Think of it like variables or parameters for your query; for each parameter you want a column in the dataframe. Then you write the query as a character string with a ? placeholder for each parameter and store it in a variable. You put it all together using the sqlExecute function, which runs the query once for each row of the dataframe.
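The same idea can be sketched in Python: keep the IDs in a list and bind them as parameters instead of pasting them into the SQL string. This is a minimal sketch that uses the built-in sqlite3 module as a stand-in for SQL Server (an assumption, so the example runs without a server); the table contents are made up for illustration. It generates one ? placeholder per ID so a single IN (...) query does the filtering on the database side.

```python
import sqlite3

# In-memory stand-in for the large server-side table (hypothetical contents).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (idcol INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 201)],
)

id_list = [34, 23, 56, 87, 123, 45]

# One "?" per ID; the values themselves are bound by the driver, never interpolated.
placeholders = ", ".join("?" for _ in id_list)
query = (
    f"SELECT idcol, val FROM tablename "
    f"WHERE idcol IN ({placeholders}) ORDER BY idcol"
)
matches = conn.execute(query, id_list).fetchall()
print(matches)
conn.close()
```

Most databases cap the number of bound parameters per statement, so a very long ID list would need to be sent in batches.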
I'm attempting to use Apache Spark in order to load the results of a (large) SQL query with multiple joins and sub-selects into a DataFrame from Spark as discussed in Create Spark Dataframe from SQL Query.
Unfortunately, my attempts to do so result in an error from Parquet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unable to infer schema for Parquet. It must be specified manually.
I have seen information from Google implying that this error occurs when a DataFrame is empty. However, the query returns plenty of rows in DBeaver.
Here is an example query:
(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
AND STORE_NBR IN (999)
ORDER BY BUSINESS_DATE) as reports
My Spark code looks like this.
val reportsDataFrame = spark
.read
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
reportsDataFrame.show(10)
I read in the previous answer that it is possible to run queries against an entire database using this method, in particular if you specify the "dbtable" parameter to be an aliased query when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone facing the same issue online, in particular the Parquet error that results from running the query.
It seems the consensus is that one should not run SQL queries this way and should instead use the DataFrame's many methods to filter, group by and aggregate data. However, it would be very valuable for us to be able to use raw SQL instead, even if it incurs a performance penalty.
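A quick look at your code tells me you are missing .format("jdbc"). Without an explicit format, Spark falls back to its default data source, which is Parquet, and that is why the error mentions Parquet even though you are reading from a database.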
val reportsDataFrame = spark
.read
.format("jdbc")
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
This should work, provided the username and password needed to connect to the database are set.
A good resource for learning more about JDBC sources in Spark: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
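For reference, here is a rough PySpark-flavoured sketch of the same fix. The helper names (as_dbtable, jdbc_options, load_query), the JDBC URL and the credentials are all hypothetical placeholders, not part of the original code. The point is that the query is wrapped in parentheses with an alias for "dbtable", and the format is set to "jdbc" explicitly before load().

```python
def as_dbtable(query, alias):
    """Wrap a SELECT so it can be passed as the JDBC "dbtable" option."""
    return f"({query.strip().rstrip(';')}) AS {alias}"


def jdbc_options(url, query, alias, user, password):
    """Collect the reader options in one dict (all values are placeholders)."""
    return {
        "url": url,
        "dbtable": as_dbtable(query, alias),
        "user": user,
        "password": password,
    }


def load_query(spark, options):
    """`spark` is assumed to be an existing pyspark.sql.SparkSession."""
    # Naming the format explicitly avoids the Parquet default.
    return spark.read.format("jdbc").options(**options).load()


# ORDER BY is left out of the subquery; sorting can be done on the DataFrame.
reports_query = """
SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
  AND STORE_NBR IN (999)
"""

options = jdbc_options(
    "jdbc:db2://host:50000/dbname",  # hypothetical URL
    reports_query,
    "reports",
    "db_user",      # hypothetical credentials
    "db_password",
)
print(options["dbtable"])
```

With a live session you would call load_query(spark, options) and then .show(10) on the result; nothing here touches Spark until that call.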
I am using sparklyr to run some analysis, but I am interested also in writing raw SQL queries using DBI.
I am able to run the following query
query <- "SELECT col1 FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)
but this returns the data into R (in a dataframe).
What I want instead is to keep the data inside Spark and store it in another Spark DataFrame for further interaction with sparklyr.
Any ideas?
The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it. If your query returns a huge amount of data, it will overwhelm the Spark driver's memory and cause out-of-memory errors.
What's happening with sparklyr is the following: DBI runs the SQL command and returns an R data frame, which means it collects the data to materialize it in a regular R context.
Thus, if you only want to use it to return a small dataset, you don't need Spark for that.
So DBI isn't the solution for you; you ought to use regular SparkR if you want to stick with R for that.
This is an example of how you can run SQL from sparklyr without collecting the result:
sc %>% spark_session %>%
invoke("sql", "SELECT 1") %>%
invoke("createTempView", "foo")
You may also do:
mydata_spark_df <- tbl(sc, sql("select * from mydata"))
This question feels fiendishly simple but I haven't been able to find an answer.
I have an ORM query object, say
query_obj = session.query(Class1).join(Class2).filter(Class2.attr == 'state')
I can read it into a dataframe like so:
testdf = pd.read_sql(query_obj.statement, query_obj.session.bind)
But what I really want to do is use a traditional SQL query instead of the ORM:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
Here the query passed to execute() should be a traditional SQL string. When I run this I get an error along the lines of "query_obj is not executable". Does anyone know how to convert the ORM query to a traditional query? Also, how does one get the column names in after building the dataframe?
Context why I'm doing this: I've set up an ORM layer on top of my database and am using it to query data into a Pandas DataFrame. It works, but it's frequently maxing out my memory. I want to cut my in-memory overhead with some string folding (pass 3 outlined here: http://www.mobify.com/blog/sqlalchemy-memory-magic/). That requires (and correct me if I'm wrong here) not using the read_sql string and instead processing the query's return as raw tuples.
The long version is described in detail in the FAQ of sqlalchemy: http://sqlalchemy.readthedocs.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined
The short version is:
statement = query.statement
print(statement.compile(engine))
The result of this can be used in read_sql.
This may require a version of SQLAlchemy newer than the one available when the question was posted, but
print(query)
outputs the query, which you can copy and paste back into your script.
Fiendishly simple indeed. Per Jori's link to the docs, it's just query_obj.statement to get the SQL query. So my code is:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj.statement)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results, taking column names from the result set
    dataframe = pd.DataFrame(fetchall, columns=list(results.keys()))
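To make the "raw tuples" idea from the question concrete, here is a minimal sketch of fetching rows as plain tuples, reading the column names from the cursor, and folding repeated strings into one shared object before building a DataFrame. It uses the built-in sqlite3 module as a stand-in for the real engine (an assumption, so it runs without SQLAlchemy or a database server). The table, its contents and the fold helper are hypothetical, and sys.intern is just one way to do the string folding.

```python
import sqlite3
import sys

# Hypothetical stand-in table with a highly repetitive string column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class2 (id INTEGER, attr TEXT)")
conn.executemany("INSERT INTO class2 VALUES (?, ?)", [(i, "state") for i in range(5)])

cursor = conn.execute(
    "SELECT id, attr FROM class2 WHERE attr = ? ORDER BY id", ("state",)
)

# DB-API cursors expose column names as the first item of each description entry.
columns = [d[0] for d in cursor.description]


def fold(value):
    """Return one shared object for equal strings; leave other values alone."""
    return sys.intern(value) if isinstance(value, str) else value


# Stream the rows instead of holding a second full copy from fetchall().
records = [tuple(fold(v) for v in row) for row in cursor]
conn.close()
print(columns, records[:2])
```

From here, pd.DataFrame(records, columns=columns) gives a DataFrame with proper column names.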
I am using the RODBC package in R to import and export data frames from a SQL Server database. Importing works without problems, but I don't know how to export the contents of a data frame into an existing SQL table.
I am trying to use the sqlQuery() function available in the package, but I am not sure how to insert multiple records into the table.
A sample showing how to insert the rows would be helpful.
I have ensured that the columns of my table and data frame are the same.
Use sqlSave with the append argument. See my code below:
sqlSave(uploaddbconnection, outputframe, tablename = "your_TableName",
        rownames = FALSE, append = TRUE)
This is my code using sqlSave(). I am using SQL Server 2008. conn is a connection that I created using odbcConnect():
#creating data to be saved in SQL Table
data_to_save<-cbind(scenario_1,scenario_2,scenario_3,scenario_4,store_num,future_date,Province,index)
#use sqlSave() rather than sqlQuery() for saving data into SQL Server
sqlSave(conn,data.frame(data_to_save),"CC_Forecast",safer=FALSE,append=TRUE)
Use the DBI package:
dbWriteTable(conn, "RESULTS", results2000, append = TRUE)
I would like to add to Yan's answer.
Before you use the sqlSave() function, make sure you change your default database to the one where your table is, especially if you want to write to an existing table!
See here: for how to set up ODBC connection and how to change default database.
After that, you can use these to dump data to SQL:
specificDB <- odbcConnect(dsn = 'name you set up in ODBC', uid = '***', pwd = '****')
sqlSave(specificDB, output_to_sql, tablename = 'a_table_in_specificDB', rownames = FALSE, append = TRUE)
close(specificDB)
It is slow, so be patient.
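For comparison, appending rows to an existing table comes down to a parameterized INSERT executed for each row, ideally inside a single transaction. The sketch below shows that in Python with the built-in sqlite3 module as a stand-in for SQL Server (an assumption, so it runs without an ODBC setup); the table and values are made up and only loosely mirror the column names used above.

```python
import sqlite3

# Existing table with one row already in it (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_table (store_num INTEGER, scenario_1 REAL)")
conn.execute("INSERT INTO a_table VALUES (1, 10.5)")

# Rows to append, one tuple per data frame row.
output_rows = [(2, 11.0), (3, 12.25), (4, 9.75)]

# The connection as a context manager commits on success and rolls back on error,
# so either all rows are appended or none are.
with conn:
    conn.executemany(
        "INSERT INTO a_table (store_num, scenario_1) VALUES (?, ?)",
        output_rows,
    )

row_count = conn.execute("SELECT COUNT(*) FROM a_table").fetchone()[0]
print(row_count)
conn.close()
```

Naming the target columns in the INSERT keeps the append working even if the table's column order differs from the data frame's.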