how to delete data from a delta file in databricks? - sql

I want to delete data from a delta file in databricks.
I'm using these commands:
Ex:
PR=spark.read.format('delta').options(header=True).load('/mnt/landing/Base_Tables/EventHistory/')
PR.write.format("delta").mode('overwrite').saveAsTable('PR')
spark.sql('delete from PR where PR_Number=4600')
This deletes data from the table but not from the actual Delta file. I want to delete the data in the file without using a MERGE operation, because the join condition does not match. Can anyone please help me resolve this issue?
Thanks

Please do remember: subqueries are not supported in DELETE in Delta.
Issue link: https://github.com/delta-io/delta/issues/730
From the documentation itself, an alternative is as follows.
For example:
DELETE FROM tdelta.productreferencedby_delta
WHERE id IN (SELECT KEY
FROM delta.productreferencedby_delta_dup_keys)
AND srcloaddate <= '2020-04-15'
Can be written as below in the case of Delta:
MERGE INTO delta.productreferencedby_delta AS d
using (SELECT KEY FROM tdatamodel_delta.productreferencedby_delta_dup_keys) AS k
ON d.id = k.KEY
AND d.srcloaddate <= '2020-04-15'
WHEN MATCHED THEN DELETE

Using the Delta Lake API in Python, it would be:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

dt_path = "/mnt/landing/Base_Tables/EventHistory/"
my_dt = DeltaTable.forPath(spark, dt_path)
seq_keys = ["4600"]  # You could add here as many as you want
my_dt.delete(col("PR_Number").isin(seq_keys))
And in Scala:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val dt_path = "/mnt/landing/Base_Tables/EventHistory/"
val my_dt: DeltaTable = DeltaTable.forPath(spark, dt_path)
val seq_keys = Seq("4600") // You could add here as many as you want
my_dt.delete(col("PR_Number").isin(seq_keys: _*))

You can remove data that matches a predicate from a Delta table
https://docs.delta.io/latest/delta-update.html#delete-from-a-table

It worked like this:
delete from delta.`/mnt/landing/Base_Tables/EventHistory/` where PR_Number=4600
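Note that DELETE only removes the rows logically; the old data files stay on storage for time travel until they are vacuumed. If the data also has to disappear from the underlying files, a VACUUM can be run afterwards (a minimal sketch, assuming the default 7-day retention is acceptable):
-- Physically remove files no longer referenced by the table (168 hours is the default retention)
VACUUM delta.`/mnt/landing/Base_Tables/EventHistory/` RETAIN 168 HOURS;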

Related

Redshift - ERROR: Target table must be part of an equijoin predicate

I am trying to make an update on a temporary table I created in Redshift. The code I am trying to run goes like this:
UPDATE #new_emp
SET rpt_to_emp_id = CAST(ht.se_value AS INTEGER),
rpt_to_extrnl_email = ht.extrnl_email_addr,
rpt_to_fst_nm = ht.first_nm,
rpt_to_lst_nm = ht.last_nm,
rpt_to_mdl_init = ht.mdl_nm,
rpt_to_nm = ht.full_nm,
rpt_to_ssn = CAST(ht.ssn AS INTEGER)
FROM #new_emp,
(SELECT DISTINCT t.se_value,h.first_nm,h.last_nm,
h.mdl_nm,h.full_nm,h.ssn,h.extrnl_email_addr
FROM spec_hr.dtbl_translate_codes_dw t, spec_hr.emp_hron h
WHERE t.inf_name = 'system'
AND t.fld_name = 'HRONDirector'
AND h.foreign_emp_id = t.se_value
) ht
WHERE #new_emp.foreign_emp_id <> ht.se_value
AND (#new_emp.emp_status_cd <> 'T'
AND (#new_emp.ult_rpt_emp_id = #new_emp.foreign_emp_id
OR #new_emp.ult_rpt_emp_id = #new_emp.psoft_id
OR #new_emp.ult_rpt_emp_id IS NULL));
I've tried both with and without specifying the updated table in the FROM clause, but it keeps throwing this error:
ERROR: Target table must be part of an equijoin predicate
Any ideas why this is failing? Thank you!
Redshift needs an equality join condition to know which rows to update and with which values. Your join condition is "#new_emp.foreign_emp_id <> ht.se_value", which is an inequality; in Redshift speak, this is not "an equijoin predicate". You have a SET of "rpt_to_lst_nm = ht.last_nm", but if the only join condition is an inequality, which value of last_nm should Redshift put in the table?
To put it the other way around: you need to tell Redshift exactly which rows of the target table receive which values (an equijoin). The join condition you have doesn't meet this requirement.
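For illustration, this is the shape of an UPDATE that Redshift accepts. It is only a sketch: the equality on foreign_emp_id is an assumption, and the real key is whatever actually links a #new_emp row to its director record.
UPDATE #new_emp
SET rpt_to_nm = ht.full_nm,
    rpt_to_extrnl_email = ht.extrnl_email_addr
FROM (SELECT DISTINCT t.se_value, h.full_nm, h.extrnl_email_addr
      FROM spec_hr.dtbl_translate_codes_dw t
      JOIN spec_hr.emp_hron h ON h.foreign_emp_id = t.se_value
      WHERE t.inf_name = 'system'
        AND t.fld_name = 'HRONDirector') ht
WHERE #new_emp.foreign_emp_id = ht.se_value   -- equijoin predicate on the target table
  AND #new_emp.emp_status_cd <> 'T';          -- additional filters are still allowed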

Is it viable to build type 2 history in BigQuery without knowing primary keys?

I am importing a lot of mainframe extracts into BigQuery daily. Each extract is a full export of all the data available. I've been loading the data into BigQuery and then generating a type 2 history using a SQL MERGE statement, where I join on the primary keys of each table and compare all columns to find differences, closing outdated rows and inserting new/updated rows into a history table. This works fine.
A colleague made the case that we don't need to know the primary keys to do this, we can just treat all data columns together as the unique constraint. With that in mind I created a new MERGE statement, which seems to work just as well as the old one, but without having to know the primary keys, which makes a lot of things easier for us. Here is an example of the new query:
MERGE ods.kfir_history AS main USING (
SELECT FF_NR, FIRMA_PRODENH_TYPE, FIRMA_NAVN_1, FIRMA_NAVN_2, SE_NR, KONCERN_NR, FIRMA_STATUS_DATO, FIRMA_STATUS_DATO_NUL, FIRMA_STATUS, FIRMA_TYPE, ANTAL_ANSAT_DATO, ANTAL_ANSAT_DATO_NUL, ANTAL_ANSAT, ANTAL_ANSAT_NUL, ANTAL_ANSAT_KILDE, FIRMA_STIFTET_DATO, FIRMA_STIFTET_DATO_NUL, AS_REGISTRERET, MOMS_REGISTRERET, FAGLIG_FORENING, HEKTAR, RET_SBH, RET_TIMESTAMP, SUPL_FIRMA_NAVN, SUPL_FIRMA_NAVN_NUL, CVR_NR, P_NR from staging.kfir_new
) AS delta ON
IFNULL(main.FF_NR,'null') = IFNULL(delta.FF_NR,'null') AND
IFNULL(main.FIRMA_PRODENH_TYPE,'null') = IFNULL(delta.FIRMA_PRODENH_TYPE,'null') AND
IFNULL(main.FIRMA_NAVN_1,'null') = IFNULL(delta.FIRMA_NAVN_1,'null') AND
IFNULL(main.FIRMA_NAVN_2,'null') = IFNULL(delta.FIRMA_NAVN_2,'null') AND
IFNULL(main.SE_NR,0) = IFNULL(delta.SE_NR,0) AND
IFNULL(main.KONCERN_NR,'null') = IFNULL(delta.KONCERN_NR,'null') AND
IFNULL(main.FIRMA_STATUS_DATO,'null') = IFNULL(delta.FIRMA_STATUS_DATO,'null') AND
IFNULL(main.FIRMA_STATUS_DATO_NUL,'null')= IFNULL(delta.FIRMA_STATUS_DATO_NUL,'null') AND
IFNULL(main.FIRMA_STATUS,'null') = IFNULL(delta.FIRMA_STATUS,'null') AND
IFNULL(main.FIRMA_TYPE,'null') = IFNULL(delta.FIRMA_TYPE,'null') AND
IFNULL(main.ANTAL_ANSAT_DATO,'null') = IFNULL(delta.ANTAL_ANSAT_DATO,'null') AND
IFNULL(main.ANTAL_ANSAT_DATO_NUL,'null') = IFNULL(delta.ANTAL_ANSAT_DATO_NUL,'null') AND
IFNULL(main.ANTAL_ANSAT,0) = IFNULL(delta.ANTAL_ANSAT,0) AND
IFNULL(main.ANTAL_ANSAT_NUL,'null') = IFNULL(delta.ANTAL_ANSAT_NUL,'null') AND
IFNULL(main.ANTAL_ANSAT_KILDE,'null') = IFNULL(delta.ANTAL_ANSAT_KILDE,'null') AND
IFNULL(main.FIRMA_STIFTET_DATO,'null') = IFNULL(delta.FIRMA_STIFTET_DATO,'null') AND
IFNULL(main.FIRMA_STIFTET_DATO_NUL,'null') = IFNULL(delta.FIRMA_STIFTET_DATO_NUL,'null') AND
IFNULL(main.AS_REGISTRERET,0) = IFNULL(delta.AS_REGISTRERET,0) AND
IFNULL(main.MOMS_REGISTRERET,'null') = IFNULL(delta.MOMS_REGISTRERET,'null') AND
IFNULL(main.FAGLIG_FORENING,'null') = IFNULL(delta.FAGLIG_FORENING,'null') AND
IFNULL(main.HEKTAR,0) = IFNULL(delta.HEKTAR,0) AND
IFNULL(main.RET_SBH,'null') = IFNULL(delta.RET_SBH,'null') AND
IFNULL(main.RET_TIMESTAMP,'null') = IFNULL(delta.RET_TIMESTAMP,'null') AND
IFNULL(main.SUPL_FIRMA_NAVN,'null') = IFNULL(delta.SUPL_FIRMA_NAVN,'null') AND
IFNULL(main.SUPL_FIRMA_NAVN_NUL,'null') = IFNULL(delta.SUPL_FIRMA_NAVN_NUL,'null') AND
IFNULL(main.CVR_NR,0) = IFNULL(delta.CVR_NR,0) AND
IFNULL(main.P_NR,0) = IFNULL(delta.P_NR,0)
WHEN NOT MATCHED BY SOURCE AND main.SystemTimeEnd = "5999-12-31 23:59:59" --Close all updated records
THEN UPDATE SET SystemTimeEnd = delta.SystemTime, Lastupdated = current_timestamp, LastAction = 'U'
WHEN NOT MATCHED BY TARGET --insert all new and updated records
THEN INSERT (FF_NR, FIRMA_PRODENH_TYPE, FIRMA_NAVN_1, FIRMA_NAVN_2, SE_NR, KONCERN_NR, FIRMA_STATUS_DATO, FIRMA_STATUS_DATO_NUL, FIRMA_STATUS, FIRMA_TYPE, ANTAL_ANSAT_DATO, ANTAL_ANSAT_DATO_NUL, ANTAL_ANSAT, ANTAL_ANSAT_NUL, ANTAL_ANSAT_KILDE, FIRMA_STIFTET_DATO, FIRMA_STIFTET_DATO_NUL, AS_REGISTRERET, MOMS_REGISTRERET, FAGLIG_FORENING, HEKTAR, RET_SBH, RET_TIMESTAMP, SUPL_FIRMA_NAVN, SUPL_FIRMA_NAVN_NUL, CVR_NR, P_NR, SystemTime, SystemTimeEnd, Lastupdated, LastAction)
VALUES (delta.FF_NR, delta.FIRMA_PRODENH_TYPE, delta.FIRMA_NAVN_1, delta.FIRMA_NAVN_2, delta.SE_NR, delta.KONCERN_NR, delta.FIRMA_STATUS_DATO, delta.FIRMA_STATUS_DATO_NUL, delta.FIRMA_STATUS, delta.FIRMA_TYPE, delta.ANTAL_ANSAT_DATO, delta.ANTAL_ANSAT_DATO_NUL, delta.ANTAL_ANSAT, delta.ANTAL_ANSAT_NUL, delta.ANTAL_ANSAT_KILDE, delta.FIRMA_STIFTET_DATO, delta.FIRMA_STIFTET_DATO_NUL, delta.AS_REGISTRERET, delta.MOMS_REGISTRERET, delta.FAGLIG_FORENING, delta.HEKTAR, delta.RET_SBH, delta.RET_TIMESTAMP, delta.SUPL_FIRMA_NAVN, delta.SUPL_FIRMA_NAVN_NUL, delta.CVR_NR, delta.P_NR, delta.SystemTime, '5999-12-31 23:59:59', current_timestamp, 'I')
Can anybody tell me if there are any downsides to this approach? Considering BigQuery doesn't deal with primary keys at all, it would be nice for us to drop the concept from the solution completely.
As per my understanding, this should not be a problem, unless your conditions act on columns whose values are not unique while you need to update only that particular data point (a quick check for this is sketched below).
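A minimal sanity check for that situation, assuming the staging table name from the question: if the daily extract can contain two byte-identical rows, the all-columns join cannot tell them apart, and a query like this would surface them before the MERGE.
-- Sketch: list fully identical rows in the staging extract
SELECT TO_JSON_STRING(t) AS row_fingerprint, COUNT(*) AS copies
FROM staging.kfir_new AS t
GROUP BY row_fingerprint
HAVING COUNT(*) > 1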

Scio saveAsTypedBigQuery write to a partition for SCollection of Typed Big Query case class

I'm trying to write an SCollection to a partition in BigQuery using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is:
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.
How can I write to a partition? I don't see any options to specify partitions via either saveAsTypedBigQuery method, so I was trying the legacy SQL table decorators.
See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to create the table manually; the BigQuery IO cannot create a table and partition it.
Additionally, the "table decorators cannot be used" part of the error was a red herring; it was the alphanumeric requirement I was missing.
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)
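For completeness, one possible way to create the partitioned table up front is BigQuery DDL; this is only a sketch (the columns are placeholders for the actual Blah schema), and the bq CLI or the UI work just as well.
-- Ingestion-time partitioned table, so test.test$YYYYMMDD decorators become valid targets
CREATE TABLE test.test (
  some_field STRING,   -- placeholder columns; replace with the Blah schema
  some_value INT64
)
PARTITION BY _PARTITIONDATE;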

data processing in Pig, with tab-separated input

I am very new to Pig, so I am facing some issues while trying to perform very basic processing in Pig. I need to:
1- Load the file using Pig.
2- Write processing logic to filter records based on date. For example, the lines have 2 columns, col_1 and col_2 (assume the columns are chararray), and I need to get only the records which have a 1-day difference between col_1 and col_2.
3- Finally, store the filtered records in a Hive table.
Input file (tab separated):
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting is like below:
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
Not sure why.
Can someone please help me with how to parse a tab-separated file, how to convert that chararray to a date, and how to filter based on the day difference?
Thanks
Convert the columns to datetime objects using ToDate and use DaysBetween. This gives the difference; if the difference == 1, then filter. Finally, store the result into Hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
-- ISO 8601 timestamps parse without an explicit format pattern
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2),ToDate(col_1)) as day_diff;
C = FILTER B BY (day_diff == 1);
D = FOREACH C GENERATE col_1, col_2; -- drop the helper column before storing
STORE D INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();

update an SQL table via R sqlSave

I have a data frame in R with 3 columns; using sqlSave I can easily create a table in an SQL database:
channel <- odbcConnect("JWPMICOMP")
sqlSave(channel, dbdata, tablename = "ManagerNav", rownames = FALSE, append = TRUE, varTypes = c(DateNav = "datetime"))
odbcClose(channel)
This data frame contains information about managers (Name, Nav and Date) and is updated every day with new values for the current date; old values could also be updated in case of errors.
How can I accomplish this task in R?
I tried to use sqlUpdate but it returns the following error:
> sqlUpdate(channel, dbdata, tablename = "ManagerNav")
Error in sqlUpdate(channel, dbdata, tablename = "ManagerNav") :
cannot update ‘ManagerNav’ without unique column
When you create a table "the white shark-way" (see documentation), it does not get a primary index, but is just plain columns, often of the wrong type. Usually I use your approach to get the column names right, but after that you should go into your database and assign a primary index and correct column widths and types.
After that, sqlUpdate() might work; I say might, because I have given up on sqlUpdate() (there are too many caveats) and use sqlQuery(..., paste("UPDATE ...")) for the real work.
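For reference, the statement that paste() needs to build is an ordinary UPDATE keyed on whatever uniquely identifies a row; a minimal sketch, assuming Name plus DateNav identify a row in ManagerNav (the values are placeholders):
UPDATE ManagerNav
SET Nav = 123.45                -- placeholder value
WHERE Name = 'Some Manager'     -- assumed key column
  AND DateNav = '2017-06-21';   -- assumed key column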
What I would do for this is the following
Solution 1
sqlUpdate(channel, dbdata,tablename="ManagerNav", index=c("ManagerNav"))
Solution 2
Lcolumns <- list(dbdata[0,])
sqlUpdate(channel, dbdata,tablename="ManagerNav", index=c(Lcolumns))
index is used to specify which columns R uses to identify the rows to update.
Hope this helps!
If none of the other solutions work and your data is not that big, I'd suggest using sqlQuery() and looping through your data frame.
one_row_of_your_df <- function(i) {
  sql_query <-
    paste0("INSERT INTO your_table_name (column_name1, column_name2, column_name3) VALUES",
           "(",
           "'", your_dataframe[i, 1], "'", ",",
           "'", your_dataframe[i, 2], "'", ",",
           "'", your_dataframe[i, 3], "'",
           ")")
  return(sql_query)
}
This function is Exasol-specific; it is pretty similar to MySQL but not identical, so small changes could be necessary.
Then use a simple for loop like this one:
for (i in 1:nrow(your_dataframe)) {
  sqlQuery(your_connection, one_row_of_your_df(i))
}