Using UPDATE on a Delta table is changing the state of an intermediate DataFrame

I am facing the following situation.
Below are the steps I am using to transform my DataFrame:
val filteringRecordsToExpire = collectAllActiveRecords.join(collectingSrcSysIdsToExpire, Seq("trans_id"), "leftsemi")
filteringRecordsToExpire contains a few of the IDs which I need to mark invalid.
val expiredList = filteringRecordsToExpire.select("trans_id").distinct().collect()
expiredList.foreach(v => expireRecords(v(0).toString)) // here I am updating each record
Now I want to take those same IDs that I expired and re-insert them into the same table with some new values.
But I am getting an empty DataFrame after I perform the expire (which is basically an update of the existing table for those same IDs).
collectingSrcSysIdsToExpire - this DataFrame holds all the IDs which I then want to modify and INSERT into the table.
But in this process the whole DataFrame ends up empty.
I have tried persisting this DataFrame and also registering it as a temp table and using that, but nothing works.
Any help or suggestion would be appreciated. Thanks in advance.
-----------------------------solution----------------------------------
So here is how I solved this issue.
As suggested, I used MERGE INTO, which was a lot faster, and since I am using unique transaction IDs I didn't have any duplicate issues. Previously I was updating the table for those transaction IDs and then trying to use those same unique IDs with modified values in an INSERT INTO the same table.
As a solution, I first picked the distinct transaction IDs from my source and inserted them into the table with my updated values, then used that same list of transaction IDs to update the existing older entries in the table.
val filteringRecordsToExpire = delta.join(collectingSrcSysIdsToExpire, Seq("trans_id"), "leftsemi")
.distinct()
collectingSrcSysIdsToExpire.select(TargetTable.schema.map(f => col(f.name)): _*).write.insertInto(Table)
val sqlUpdateQry =
  s"""MERGE INTO TargetTable AS tgtTable
      USING expireSrsIds AS source
      ON tgtTable.trans_id = source.trans_id
      AND ...                -- a few more conditions
      WHEN MATCHED
      THEN UPDATE SET
      ...                    -- expiring the older entries
   """
So INSERT followed by UPDATE works when run sequentially, but UPDATE followed by INSERT does not.
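For completeness, both steps can likely be combined into a single atomic MERGE on the Delta table, which avoids the ordering problem altogether. The following is only a sketch under assumed names: a temp view updates(trans_id, payload) holding the new row versions, and a status column on TargetTable marking validity (none of these names come from my actual tables).

-- sketch only: assumed schema TargetTable(trans_id, payload, status), view updates(trans_id, payload)
MERGE INTO TargetTable AS tgt
USING (
  -- keyed copies: expire the currently valid target row when one exists
  SELECT u.trans_id AS merge_key, u.trans_id, u.payload FROM updates u
  UNION ALL
  -- unkeyed copies for ids that already have a valid row: these never match
  -- and therefore fall through to the INSERT branch as the new valid version
  SELECT NULL AS merge_key, u.trans_id, u.payload
  FROM updates u
  JOIN TargetTable t
    ON u.trans_id = t.trans_id AND t.status = 'VALID'
) src
ON tgt.trans_id = src.merge_key
WHEN MATCHED AND tgt.status = 'VALID' THEN
  UPDATE SET tgt.status = 'EXPIRED'
WHEN NOT MATCHED THEN
  INSERT (trans_id, payload, status)
  VALUES (src.trans_id, src.payload, 'VALID')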

The foreach by definition doesn't return any data - you can see from the API docs that its return type is Unit. I also don't recommend updating individual records - it will be too slow, as the data is rewritten for each record separately. Instead, use the MERGE operation, with something like this (essentially the Delta Lake Scala API):
sourceTable
  .as("source")
  .merge(
    dfUpdates.as("updates"),
    "source.id = updates.id")
  .whenMatched
  .updateExpr(
    Map(
      "status" -> "'expired'"
    ))
  .execute()
See the MERGE documentation for full details. Also, instead of updating records, you can delete them.

Related

Updating a column based on a condition in another column in the same table

Let me start by admitting that this is probably not the best engineering, but I have the following question/problem.
I want to add values to the column 'gc_stand'. I have data which connects 'gc_stand' to 'startnummer' (e.g. (5, 145), (78, 2), (125, 98), etc.).
So my question is how to update the 'gc_stand' column without entering the values manually (around 200 values), based on the connection between gc_stand and startnummer. I inserted the data for the first two columns (startnummer and rit_uitslag) the same way (insert instead of update).
I am thinking about something like:
update etappe_4
set gc_stand = ??
where startnummer = 'startnummer'
But where should I input my connected values then?
I have inserted the values by:
INSERT INTO etappe_1 (startnummer, rit_uitslag)
VALUES (1,5), (2,145), (3,32) etc etc
And now I want to fill the column gc_stand. It is not possible by inserting, because that would create new rows, so I guess I have to use UPDATE. But how?
It's a bit hard to make out what you are after, but I think you are looking for something like this:
update etappe_4
set gc_stand = etappe_1.rit_uitslag
from etappe_1
where etappe_1.startnummer = etappe_4.startnummer
Note that this will only work properly if startnummer is unique in both tables.
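If the (startnummer, gc_stand) pairs are not stored in any table yet, one possible variant (PostgreSQL-style syntax; the pair order is assumed, not taken from the question) is to supply them inline as a VALUES list:

update etappe_4
set gc_stand = v.gc_stand
from (values (5, 145), (78, 2), (125, 98)) as v(startnummer, gc_stand)
where etappe_4.startnummer = v.startnummer;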

Using merge to track changes in SQL

We're working on doing incremental updates in our database, using MERGE to update the most recently changed values. This works great; however, we also track changes when updating the whole table once a day.
Ideally, we would like to track everything that changed during the merge. Previously we would diff the old table and the new table against each other to see what was added and deleted; now we would like to get this from the merge instead.
Here is our condensed merge:
MERGE data_warehouse.dbo.data_table AS data_table
USING (
    SELECT
        id AS id,
        target_date,
        po
    FROM #exploded_data_table
    WHERE id IN (SELECT id FROM #updated_records)
) AS updates
ON (data_table.id = updates.id)
WHEN MATCHED AND (data_table.id IN (SELECT id FROM #updated_records)
                  AND (data_table.id <> updates.id OR
                       data_table.target_date <> updates.target_date OR
                       data_table.po <> updates.po))
THEN UPDATE SET
    id = updates.id,
    target_date = updates.target_date,
    po = updates.po
WHEN NOT MATCHED BY TARGET THEN INSERT
    (id,
     target_date,
     po)
    VALUES
    (updates.id,
     updates.target_date,
     updates.po)
Using the merge function, we can get the inserted and deleted items; however, we were wondering if it was possible to also get the items that were updated.
Our current code takes the merge information, puts it into a temp table, and pulls it from there as follows:
OUTPUT $action,DELETED.*,INSERTED.*;
Now this works, but it seems that adding UPDATED.* does not work. Ideally we would like to get the updated changes in there as well. Is it possible to get all three change actions in the output?
Thank you.
EDIT: After rereading some material I was originally looking at, I found that the answer seems to be here: Using the Output Clause with T-SQL Merge
I will have to do some further testing to get the output I'm looking for, but it seems updates should be labeled as such; I'm unsure why they weren't showing up that way previously.
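For reference, there is no UPDATED.* pseudo-table in T-SQL; updated rows come back from the OUTPUT clause with $action = 'UPDATE', carrying their old values in DELETED.* and their new values in INSERTED.*. A minimal sketch, assuming a #merge_changes temp table with matching columns (not part of the original post), appended to the end of the MERGE above before its terminating semicolon:

OUTPUT $action              AS merge_action,   -- 'INSERT', 'UPDATE' or 'DELETE'
       DELETED.id           AS old_id,
       DELETED.target_date  AS old_target_date,
       DELETED.po           AS old_po,
       INSERTED.id          AS new_id,
       INSERTED.target_date AS new_target_date,
       INSERTED.po          AS new_po
INTO #merge_changes;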

Google BQ - how to upsert existing data in tables?

I'm using the Python client library for loading data into BigQuery tables. I need to update some changed rows in those tables, but I couldn't figure out how to update them correctly. I want something similar to an UPSERT: insert a row only if it doesn't exist, otherwise update the existing row.
Is using a special checksum field in the tables (and comparing the checksum during loading) the right way? If there is a good approach, how do I solve this with the Python client? (As far as I know, it can't update existing data.)
Please explain what the best practice is.
BigQuery now supports MERGE, which can combine an INSERT and an UPDATE in one atomic operation, i.e. an UPSERT.
Using Mikhail's example tables, it would look like:
MERGE merge_example.table_data T
USING merge_example.table_changes S
ON T.id = S.id
WHEN MATCHED THEN
UPDATE SET value = s.value
WHEN NOT MATCHED THEN
INSERT (id, value) VALUES(id, value)
See here.
BigQuery is append-only by design and preference. That means you are better off letting duplicate rows for the same entity accumulate in the table and writing your queries to always read the most recent row.
Updating rows, as you would in transactional tables, is possible but with limitations. Your project can make up to 1,500 table operations per table per day. That's very limited, and their purpose is totally different. One operation can touch multiple rows, but it is still 1,500 operations per table per day. So if you want individual row updates, that doesn't work out, as it limits you to 1,500 such updates per day.
Since BQ is used as a data lake, you should just stream a new row every time the user, for example, updates their profile. From 20 saves you will end up with 20 rows for the same user. Later you can rematerialize your table to have unique rows by removing the duplicate data.
See this question for the latter: BigQuery - DELETE statement to remove duplicates
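As an illustration of that rematerialization step, here is a possible sketch (the table and column names - user_profiles, user_id, updated_at - are assumptions, not from the question) that rewrites the table keeping only the most recent row per user:

#standardSQL
CREATE OR REPLACE TABLE `yourproject.yourdataset.user_profiles` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM `yourproject.yourdataset.user_profiles`
)
WHERE rn = 1;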
BigQuery does not support UPSERT directly, but if you really need it, you can run an UPDATE and an INSERT one after another to achieve the same thing. See the simplified example below.
Assume you have two tables - one that holds your data (yourproject.yourdadtaset.table_data) and another (yourproject.yourdadtaset.table_changes) that contains the changes you want to apply to the first table.
(Sample rows of table_data and table_changes were shown as screenshots in the original answer.)
Now the queries below, run one after another, do the trick.
Update Query:
#standardSQL
UPDATE `yourproject.yourdadtaset.table_data` t
SET t.value = s.value
FROM `yourproject.yourdadtaset.table_changes` s
WHERE t.id = s.id
The result: the matching rows in table_data now carry the values from table_changes.
And now the INSERT query:
#standardSQL
INSERT `yourproject.yourdadtaset.table_data` (id, value)
SELECT id, value
FROM `yourproject.yourdadtaset.table_changes`
WHERE NOT id IN (SELECT id FROM `yourproject.yourdadtaset.table_data`)
The result: rows from table_changes whose ids were not yet in table_data are appended, and we are done here.
I hope the above example is simple and clear so you can apply it in your case.
I may be late to this, but you can perform an upsert in BigQuery using Dataflow/Apache Beam. You can do a CoGroupByKey to get the values sharing a common key from both data sources (one being the destination table) and update the data read from the destination BQ table. Finally, load the data in truncate mode. Hope this helps.
This way you avoid the quota limits in BigQuery and do all the updating in Dataflow.
Here is an example of it in Java; you should be able to easily convert it to Python:
// Each shares a common key ("K").
PCollection<KV<K, V1>> source = p.apply(...Read source...);
PCollection<KV<K, V2>> bigQuery = p.apply(BigQueryIO.readTableRows().from(...table-id...));
// Note: readTableRows() yields TableRow elements; they need to be keyed by K before the CoGroupByKey.
//You can also use read() instead of readTableRows() and fromQuery() instead of from() depending on your use-case.
// Create tuple tags for the value types in each collection.
final TupleTag<V1> t1 = new TupleTag<V1>();
final TupleTag<V2> t2 = new TupleTag<V2>();
//Merge collection values into a CoGbkResult collection
PCollection<KV<K, CoGbkResult>> coGbkResultCollection =
    KeyedPCollectionTuple.of(t1, source)
        .and(t2, bigQuery)
        .apply(CoGroupByKey.<K>create());
// Access results and do something.
PCollection<TableRow> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
        new DoFn<KV<K, CoGbkResult>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            KV<K, CoGbkResult> e = c.element();
            // Get all collection 1 (source) values for this key.
            Iterable<V1> pt1Vals = e.getValue().getAll(t1);
            // Get the collection 2 (BigQuery) value for this key.
            // This must be unique as you are upserting the table, hence getOnly().
            V2 pt2Val = e.getValue().getOnly(t2);
            // Branching logic (pseudocode):
            //  - no matching source values for the key: output the existing row (pt2Val);
            //  - no BigQuery value: output the latest/distinct value from pt1Vals;
            //  - both present: output the latest/distinct value from pt1Vals and drop pt2Val.
            c.output(elements); // `elements` stands for the TableRow built from the chosen value
          }
        }));
finalResultCollection.apply(BigQueryIO.writeTableRows()
.to("my-project:output.output_table")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

Data manipulation logic - should it be part of the fetch via DB link or an independent routine?

The question may not give sufficient insight, so here is the setup.
I have two DB instances, A and B, on the same server.
B reads data from several tables in A (A1, A2, A3, ...) via a DB link and maintains a history of the data in replicated tables (A1_ext, A2_ext, A3_ext; they have additional columns, say a status column). That is, if a new row is detected in A1, a row is created in A1_ext with status VALID; if a row is updated in A1, the existing row in A1_ext is set to INVALID and a new row with the latest data from A1 is created in A1_ext with status VALID.
For now the implemented logic is: read data from A1 via the DB link, check whether it exists in A1_ext, and if it does, delimit the existing row and create a new one.
Is that an efficient approach?
Or should it instead read all updated data from A1 and pull it in one go (bulk collect, say) into a new A1_stag table on the B instance, then run the update/insert logic against A1_ext?
The best I can come up with is something like the following:
-- Insert the new and changed records with status NEW
insert into A1_ext
with upsert as (
select id, val from A1@RemoteDB
minus
select id, val from A1_ext where status = 'VALID'
) select id, val, 'NEW' from upsert;
-- Update the old VALID records that have NEW records to INVALID
update A1_ext old
set status = 'INVALID'
where status = 'VALID'
and exists (select 1 from A1_ext new
where new.id = old.id
and new.status = 'NEW');
-- Update all NEW records to VALID
update A1_ext set status = 'VALID' where status = 'NEW';
Unfortunately the first query is going to do a full table scan on A1@RemoteDB and transmit all that data across the database link. That is possibly not a big deal when both DBs reside on the same server, but it could be a performance problem for large tables across a network. The MINUS operation will prune away the unchanged records after they have crossed the link but before they get into the *_EXT table. If you can reliably filter the source data down to just the new and updated records, that would help limit the amount of data crossing the DB link.
The 2nd and 3rd queries are just housekeeping to mark the updated records as invalid and the new data as valid.
If possible, keep this as pure SQL and avoid context switching between SQL and PL/SQL as much as possible.
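If a change marker is available on the source side, the first query can be restricted before the data crosses the link. A rough sketch, assuming a hypothetical last_modified column on A1 and a sync_log table on B that records the last successful pull (neither exists in the original question):

-- pull only rows changed since the last sync into the staging table on B,
-- then run the same insert/update logic against A1_stag instead of A1@RemoteDB
insert into A1_stag (id, val)
select id, val
from A1@RemoteDB
where last_modified > (select nvl(max(last_pull), date '1900-01-01') from sync_log);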

Multithreading with the trigger

I have written a trigger which transfers a record from the table members_new to members_old. The function of the trigger is to insert a record into members_old after an insert into members_new. So suppose a record gets inserted into members_new like:
nMmbID  nMmbName  nMmbAdd
1       Abhi      Bangalore
This record will then get inserted into members_old, which has the same table structure.
My trigger is like:
create trigger add_new_record
after
insert on members_new
for each row
INSERT INTO `test`.`members_old`
(
`nMmbID`,
`nMmbName`,
`nMmbAdd`
)
(
SELECT
`members_new`.`nMmbID`,
`members_new`.`nMmbName`,
`members_new`.`nMmbAdd`
FROM `test`.`members_new`
where nMmbID = (select max(nMmbID) from `test`.`members_new`)
-- reads the last record from members_new to stop duplication in members_old;
-- this should also reduce the chances of any error
)
This trigger is working for now, but my confusion is about what will happen if multiple insertions happen at the same instant.
Will it reduce performance?
Will I ever face a deadlock condition in any case, as my members_old has FKs?
If there is a better solution for this situation, please shed some light on it.
From the manual:
You can refer to columns in the subject table (the table associated with the trigger) by using the aliases OLD and NEW. OLD.col_name refers to a column of an existing row before it is updated or deleted. NEW.col_name refers to the column of a new row to be inserted or an existing row after it is updated.
create trigger add_new_record
after
insert on members_new
for each row
INSERT INTO `test`.`members_old`
SET
`nMmbID` = NEW.nMmbID,
`nMmbName` = NEW.nMmbName,
`nMmbAdd` = NEW.nMmbAdd;
And you will have no problems with deadlocks or anything like that. It should also be much faster, because you don't have to read the max value first (which is also unsafe and might lead to inconsistent data). Read about isolation levels and transactions if you're interested in why...
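As a quick illustration (the values here are made up), each insert now copies exactly the row that fired the trigger, regardless of concurrent inserts:

INSERT INTO `test`.`members_new` (nMmbID, nMmbName, nMmbAdd)
VALUES (2, 'Ravi', 'Mumbai');
-- the trigger copies (2, 'Ravi', 'Mumbai') into members_old using the NEW values,
-- even if another session inserts nMmbID = 3 at the same moment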