Does Apache Spark SQL support MERGE clause? - sql

Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?
MERGE INTO <table> USING (
  SELECT * FROM <table1>
) ON (<condition>)
WHEN MATCHED THEN UPDATE ...
  DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...

Spark does support the MERGE operation if you use Delta Lake as the storage format. The first step is to save the table in the Delta format, which provides transactional capabilities and support for DELETE/UPDATE/MERGE operations in Spark.
Python/Scala:
df.write.format("delta").save("/data/events")
SQL:
CREATE TABLE events (eventId long, ...) USING delta
Once the table exists, you can run your usual SQL Merge command:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
The same command is also available through the DeltaTable API in Python/Scala; in Scala:
import io.delta.tables._

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()
To support the Delta Lake format, you also need the delta package as a dependency in your Spark job:
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_x.xx</artifactId>
  <version>xxxx</version>
</dependency>
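If you build with sbt rather than Maven, the equivalent dependency would look roughly like the sketch below; the Scala and Delta versions are left as placeholders, matching the Maven snippet above, and should be picked to match your Spark version.
// sbt build definition: %% appends your Scala version to the artifact name
libraryDependencies += "io.delta" %% "delta-core" % "x.x.x"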
See https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge for more details.

As of Spark 3.0, there is a very clean way of doing the merge operation with a Delta table, using the MERGE INTO syntax directly in Spark SQL:
https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge
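As a rough sketch (not taken from the linked docs verbatim), this is how a Spark 3.0+ session is typically wired up for Delta so that the MERGE can be issued through plain Spark SQL; the table names events and updates are placeholders borrowed from the earlier answer.
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable the Delta SQL extension and catalog on Spark 3.0+
val spark = SparkSession.builder()
  .appName("delta-merge-example")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Assuming `events` and `updates` already exist as Delta tables
spark.sql("""
  MERGE INTO events
  USING updates
  ON events.eventId = updates.eventId
  WHEN MATCHED THEN UPDATE SET events.data = updates.data
  WHEN NOT MATCHED THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
""")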

It does not. As of now (this might change in the future), Spark doesn't support UPDATE, DELETE, or any other variant of record modification.
It can only overwrite existing storage (with different implementations depending on the source) or append with a plain INSERT.
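For completeness, a common workaround under those constraints is to rebuild the dataset rather than modify it in place. Below is a minimal sketch (not from the answer above), assuming a key column eventId and illustrative table names events and updates with the same schema.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("overwrite-upsert-sketch").getOrCreate()

val current = spark.table("events")   // existing data
val updates = spark.table("updates")  // new/changed rows

// Keep current rows whose key is not being updated, then add all the updates
val merged = current
  .join(updates, Seq("eventId"), "left_anti")
  .unionByName(updates)

// Write to a separate target and swap it in afterwards; overwriting the table
// you are reading from in the same job is not safe.
merged.write.mode("overwrite").saveAsTable("events_rebuilt")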

You can also write your own custom code. The code below does plain INSERTs over JDBC; you can edit it to issue a MERGE instead of an INSERT. Keep in mind that this is a computation-heavy operation.
import java.sql.{Connection, DriverManager, PreparedStatement}

df.rdd.coalesce(2).foreachPartition(partition => {
  val connectionProperties = brConnect.value
  val jdbcUrl  = connectionProperties.getProperty("jdbcurl")
  val user     = connectionProperties.getProperty("user")
  val password = connectionProperties.getProperty("password")
  val driver   = connectionProperties.getProperty("Driver")

  Class.forName(driver)
  val dbc: Connection = DriverManager.getConnection(jdbcUrl, user, password)
  dbc.setAutoCommit(false)

  val db_batchsize = 1000
  // Prepare the statement once per partition, not once per row
  val sqlString = "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?)"
  val pstmt: PreparedStatement = dbc.prepareStatement(sqlString)

  partition.grouped(db_batchsize).foreach(batch => {
    batch.foreach { row =>
      pstmt.setLong(1, row.getAs[Long]("id"))
      pstmt.setString(2, row.getAs[String]("fname"))
      pstmt.setString(3, row.getAs[String]("lname"))
      pstmt.setString(4, row.getAs[String]("userid"))
      pstmt.addBatch()        // queue the row
    }
    pstmt.executeBatch()      // execute one batch per group, not per row
    dbc.commit()
  })

  pstmt.close()
  dbc.close()
})

If you are working with Spark, maybe these answers could help you deal with the merge issue using DataFrames.
In any case, according to some Hortonworks documentation, the MERGE statement is supported in Apache Hive 0.14 and later.

There is an Apache project, Apache Iceberg, which provides a table format with row-level editing capabilities, including MERGE:
https://iceberg.apache.org/docs/latest/spark-writes/
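As a rough sketch of what that looks like from Spark (assuming an Iceberg catalog named my_catalog is already configured on the session and both tables already exist as Iceberg tables; all names here are placeholders):
// Iceberg supports the MERGE INTO syntax through Spark SQL on Spark 3.x
spark.sql("""
  MERGE INTO my_catalog.db.events AS t
  USING my_catalog.db.updates AS u
  ON t.eventId = u.eventId
  WHEN MATCHED THEN UPDATE SET t.data = u.data
  WHEN NOT MATCHED THEN INSERT *
""")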

Related

Push a SQL query to a server from JDBC connection which reads from multiple databases within that server

I'm pushing a query down to a server to read data into Databricks as below:
val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcHostname = ""
val jdbcPort = ...
val jdbcDatabase = ""
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"
// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")
Running display(table_names) would provide a list of tables for a given defined database. That is no issue; however, when trying to read and join tables from multiple databases on the same server, I haven't yet found a solution that works.
An example would be:
// define a query to be passed to database to display a result across many tables
val report1_query = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 == b.Field8) as report1"
// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_query, connectionProperties)
report1_results.createOrReplaceTempView("report1_results")
Any pointers appreciated wrt to restructuring this code (equivalent in Python would also be super helpful).
SQL Server uses 3-part naming like database.schema.table. This example comes from the SQL Server information_schema docs:
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';
To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.
SELECT a.Field1, b.Field2
FROM database_1.schema_1.table_1 AS a
LEFT JOIN database_2.schema_2.table_2 AS b
  ON a.Field4 = b.Field8
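A small sketch of how that three-part query can then be pushed down from Spark, reusing the jdbcUrl and connectionProperties already defined in the question; the schema and table names are placeholders.
// Wrap the cross-database query as a subquery so it runs on SQL Server
val report1Query =
  """(SELECT a.Field1, b.Field2
    |   FROM database_1.schema_1.table_1 AS a
    |   LEFT JOIN database_2.schema_2.table_2 AS b
    |     ON a.Field4 = b.Field8) AS report1""".stripMargin

val report1 = spark.read.jdbc(jdbcUrl, report1Query, connectionProperties)
report1.createOrReplaceTempView("report1_results")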

how to delete data from a delta file in databricks?

I want to delete data from a Delta file in Databricks.
I'm using these commands, for example:
PR=spark.read.format('delta').options(header=True).load('/mnt/landing/Base_Tables/EventHistory/')
PR.write.format("delta").mode('overwrite').saveAsTable('PR')
spark.sql('delete from PR where PR_Number=4600')
This deletes data from the table but not from the actual Delta file. I want to delete the data in the file without using a merge operation, because the join condition is not matching. Can anyone please help me resolve this issue?
Thanks
Please do remember: subqueries are not supported in DELETE in Delta.
Issue link: https://github.com/delta-io/delta/issues/730
From the documentation itself, an alternative is as follows.
For example:
DELETE FROM tdelta.productreferencedby_delta
WHERE id IN (SELECT KEY
FROM delta.productreferencedby_delta_dup_keys)
AND srcloaddate <= '2020-04-15'
This can be written as below in the case of Delta:
MERGE INTO delta.productreferencedby_delta AS d
using (SELECT KEY FROM tdatamodel_delta.productreferencedby_delta_dup_keys) AS k
ON d.id = k.KEY
AND d.srcloaddate <= '2020-04-15'
WHEN MATCHED THEN DELETE
Using the Spark SQL functions in Python, it would be:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

dt_path = "/mnt/landing/Base_Tables/EventHistory/"
my_dt = DeltaTable.forPath(spark, dt_path)
seq_keys = ["4600"]  # You could add here as many as you want
my_dt.delete(col("PR_Number").isin(seq_keys))
And in Scala:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val dt_path = "/mnt/landing/Base_Tables/EventHistory/"
val my_dt: DeltaTable = DeltaTable.forPath(spark, dt_path)
val seq_keys = Seq("4600") // You could add here as many as you want
my_dt.delete(col("PR_Number").isin(seq_keys: _*))
You can remove data that matches a predicate from a Delta table:
https://docs.delta.io/latest/delta-update.html#delete-from-a-table
It worked like this:
DELETE FROM delta.`/mnt/landing/Base_Tables/EventHistory/` WHERE PR_Number=4600

R equivalent of SQL update statement

I use the statement below to update a table in a PostgreSQL database:
update users
set col1='setup',
col2= 232
where username='rod';
Can anyone guide me on how to do something similar using R? I am not good at R.
Thanks in advance for the help.
Since you didn't provide any data, I've created some here.
users <- data.frame(username = c('rod','stewart','happy'), col1 = c(NA_character_,'do','run'), col2 = c(111,23,145), stringsAsFactors = FALSE)
To update using base R:
users[users$username == 'rod', c('col1','col2')] <- c('setup', 232)
If you prefer the more explicit syntax provided by the data.table package, you would execute:
library(data.table)
setDT(users)
users[username == 'rod', `:=`(col1 = 'setup', col2 = 232)]
To update your database through RPostgreSQL, you will first need to create a database connection, and then simply store your query in a string, e.g.
con <- dbConnect('PostgreSQL', dbname = <your database name>, user=<user>, password= <password>)
statement <- "update <schema>.users set col1='setup', col2= 232 where username='rod';"
dbGetQuery(con, statement)
dbDisconnect(con)
Note: depending on your PostgreSQL configuration, you may also need to set your search path with dbGetQuery(con, 'set search_path = <schema>;').
I'm more familiar with RPostgres, so you may want to double check the syntax and vignettes of the PostgreSQL package.
EDIT: It seems RPostgreSQL prefers dbGetQuery for sending updates and commands, rather than dbSendQuery.

Scio saveAsTypedBigQuery write to a partition for SCollection of Typed Big Query case class

I'm trying to write an SCollection to a partition in BigQuery using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is:
"Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used."
How can I write to a partition? I don't see any options to specify partitions via either saveAsTypedBigQuery method so I was trying the Legacy SQL table decorators.
See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to manually create the table. BQ IO cannot create a table and partition it.
Additionally, the "Table decorators cannot be used" part was a complete red herring; it's the alphanumeric part I was missing: ISO_LOCAL_DATE produces 2017-06-21 (with dashes), whereas BASIC_ISO_DATE produces 20170621, which satisfies the alphanumeric rule.
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)

Spark sql SQLContext

I'm trying to select data from an MSSQL database via SQLContext.sql in a Spark application.
The connection works, but I'm not able to select data from the table, because it always fails on the table name.
Here is my code:
val prop=new Properties()
val url2="jdbc:jtds:sqlserver://servername;instance=MSSQLSERVER;user=sa;password=Pass;"
prop.setProperty("user","username")
prop.setProperty("driver" , "net.sourceforge.jtds.jdbc.Driver")
prop.setProperty("password","mypassword")
val test=sqlContext.read.jdbc(url2,"[dbName].[dbo].[Table name]",prop)
sqlContext.sql("""
SELECT *
FROM 'dbName.dbo.Table name'
""")
I tried the table name without quotes (') and as [dbName].[dbo].[Table name], but I still get the same error:
Exception in thread "main" java.lang.RuntimeException: [3.14] failure:
``union'' expected but `.' found
dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.1" //%"provided"
// https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.6.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1" //%"provided"
I think the problem in your code is that the query you pass to sqlContext has no access to the original table in the source database. It only has access to the tables saved within the SQL context, for example with df.write.saveAsTable() or with df.registerTempTable() (df.createTempView in Spark 2+).
So, in your specific case, I can suggest a couple of options:
1) If you want the query to be executed on the source database with the exact syntax of your database's SQL, you can pass the query to the "dbtable" argument:
val query = "SELECT * FROM dbName.dbo.TableName"
val df = sqlContext.read.jdbc(url2, s"($query) AS subquery", prop)
df.show
Note that the query needs to be in parentheses, because it will be passed to a "FROM" clause, as specified in the docs:
dbtable: The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
2) If you don't need to run the query on the source database, you can just pass the table name and then create a temp view in the sqlContext:
val table = sqlContext.read.jdbc(url2, "dbName.dbo.TableName", prop)
table.registerTempTable("temp_table")
val df = sqlContext.sql("SELECT * FROM temp_table")
// or sqlContext.table("temp_table")
df.show()