I have been working on SCD Type 2 and was unable to implement it in a way that covers every scenario. I built it in IICS, and I am finding it very difficult to handle all the possible cases. Below is the flow:
Src
---> Lkp (on src.id = tgt.id)
---> Exp: flag = IIF(ISNULL(tgt.surrogate_key), 'INSERT',
         IIF(NOT ISNULL(tgt.surrogate_key) AND MD5(other_non_key_cols) != tgt.md5, 'UPDATE'))
---> on flag = 'INSERT': insert into the target (works fine)
---> on flag = 'UPDATE': the row is passed to two instances of the same target table:
     in one, the new version of the record is inserted;
     in the other, the previously stored row for that id is updated, setting tgt_end_date = lkp_start_date and active_ind = 'N'.
This works, but not when I receive new updates containing the same records again (duplicates), or when I simply rerun the mapping: duplicates get inserted into the target table. The end-date handling also becomes unstable: when I insert multiple changes of the same record, all the active flags end up as 'Y', whereas I expect them all to be 'N' except the latest one in every run. Could anyone please help with this, even in SQL if you can express it that way?
If I’ve understood your question correctly, in each run you have multiple records in your source that are matching a single record in your target.
If that is the case, then you need to process your data so that you have a single source record per target record before you put the data through the SCD2 process.
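As a sketch of that pre-processing step (assuming each source record carries a business key and a change timestamp; the field names "id" and "updated_at" here are placeholders, not from your mapping), you could collapse the batch to one row per key before the SCD2 logic runs:

```python
# Collapse a batch of source records to a single (latest) record per business key,
# so the SCD2 logic only ever sees one change per key per run.

def latest_per_key(records, key="id", order="updated_at"):
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[order] > latest[k][order]:
            latest[k] = rec
    return list(latest.values())

batch = [
    {"id": 1, "updated_at": "2021-01-01", "name": "old"},
    {"id": 1, "updated_at": "2021-01-02", "name": "new"},
    {"id": 2, "updated_at": "2021-01-01", "name": "only"},
]
deduped = latest_per_key(batch)  # one row per id, keeping the newest version
```

In SQL terms this is the usual "keep the latest row per key" pattern (e.g. a window function ordered by the change timestamp); in IICS it could be a Sorter plus an Aggregator or Rank before the Lookup.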
We have a script that should run daily at 12 am via a GCP Cloud Function and Cloud Scheduler, sending data to a table in BigQuery.
Unfortunately, the cron job was sending the data every minute during that hour, which means the file was uploaded 60 times instead of only once.
The cron timer was * * 3 * * * instead of 00 3 * * *.
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we have depended on selecting the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
1. Navigate to the table (your_dataset.your_table) in the UI.
2. Click 'snapshot' and create a snapshot in case you make a mistake in the next part.
3. Run SELECT DISTINCT * FROM your_dataset.your_table in the UI.
4. Click 'save results' and select 'bigquery table', then save as a new table (e.g. your_dataset.your_table_deduplicated).
5. Navigate back to the old table and click the 'delete' button, then authorise the deletion.
6. Navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table).
7. Delete your_dataset.your_table_deduplicated.
This procedure will result in your replacing the current table with another with the same schema but without duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
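To see concretely what that query does, here is a minimal stand-in using SQLite (the BigQuery-specific CREATE OR REPLACE TABLE is emulated with a create-then-rename; the table and column contents are invented for the example):

```python
import sqlite3

# Build a table containing exact duplicate rows, as the repeated uploads would have.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO your_table VALUES (?, ?)",
                 [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (2, "b")])

# Equivalent of BigQuery's CREATE OR REPLACE TABLE ... AS SELECT DISTINCT ...
conn.execute("CREATE TABLE deduplicated AS SELECT DISTINCT * FROM your_table")
conn.execute("DROP TABLE your_table")
conn.execute("ALTER TABLE deduplicated RENAME TO your_table")

rows = conn.execute("SELECT * FROM your_table ORDER BY id").fetchall()
# Exact duplicates collapse to a single copy of each row.
```

Note that SELECT DISTINCT only removes rows that are identical in every column; since your duplicates came from re-uploading the same file, that should be exactly what you want.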
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it; if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
1. Prepare the new records you want to insert, which should have some unique, immutable ID field.
2. Collect the existing IDs: SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids.
3. Filter the new records, e.g. in Python: new_records = [record for record in prepared_records if record["id"] not in old_record_ids].
4. Upload only the records that don't exist yet.
This will prevent the sort of issues you have encountered here.
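Putting those steps together, a sketch of the pre-filter stage might look like the following (the some_unique_id field name and the upload callable are placeholders; in a real Cloud Function the existing IDs would come from a query via the BigQuery client library):

```python
def upload_new_records_only(prepared_records, existing_ids, upload):
    """Insert only records whose unique, immutable ID is not already present.

    prepared_records: list of dicts, each with a "some_unique_id" field (placeholder name)
    existing_ids: set of IDs already in the target table
    upload: callable performing the actual insert
    """
    new_records = [r for r in prepared_records
                   if r["some_unique_id"] not in existing_ids]
    if new_records:
        upload(new_records)
    return new_records

# Running the stage twice with the same input uploads nothing the second time,
# which is what makes the function safe to re-run (idempotent).
existing = {101, 102}
batch = [{"some_unique_id": 101}, {"some_unique_id": 103}]
uploaded = []
first = upload_new_records_only(batch, existing, uploaded.extend)
existing.update(r["some_unique_id"] for r in first)
second = upload_new_records_only(batch, existing, uploaded.extend)
```

With this in place, even a misconfigured cron firing 60 times would insert each record once.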
I have a single database table on a relational database. Data will be loaded into it. I then want to have multiple servers processing that data concurrently (I don't want to have only one server running at a time). E.g. each server will:
Query for a fixed number of rows
Do some work for each row retrieved
Update each row to show it has been processed
How do I ensure that each row is only processed once? Note that I don't want to pre-assign a row of data to a server; I'm designing for high availability, so the solution should keep running if one or more servers goes down.
The solution I've gone for so far is as follows:
The table has three columns: LOCKED_BY (VARCHAR), LOCKED_AT (TIMESTAMP) and PROCESSED (CHAR)
Each server starts by attempting to "pseudo-lock" some rows by doing:
UPDATE THE_TABLE
SET LOCKED_BY = $servername,
    LOCKED_AT = CURRENT_TIMESTAMP
WHERE (LOCKED_BY IS NULL OR (CURRENT_TIMESTAMP - LOCKED_AT > $timeout))
AND PROCESSED = 'N'
i.e. try to "pseudo-lock" rows that aren't locked already, or where the pseudo-lock has expired. Only do this for unprocessed rows.
More than one server may have attempted this at the same time. The current server needs to query to find out if it was successful in the "pseudo-lock":
SELECT * FROM THE_TABLE
WHERE LOCKED_BY = $server_name
AND PROCESSED = 'N'
If any rows are returned the server can process them.
Once the processing has been done the row is updated
UPDATE THE_TABLE SET PROCESSED = 'Y' WHERE PRIMARYKEYCOL = $pk
Note: the update statement should ideally limit the number of rows updated.
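The claim-then-verify flow above can be sketched end to end with SQLite standing in for the shared table (a single in-memory connection here, so this only illustrates the statement logic, not real concurrency; the batch limit is done via a rowid subquery because a plain UPDATE has no LIMIT in standard SQL). Databases that support SELECT ... FOR UPDATE SKIP LOCKED (e.g. PostgreSQL, Oracle, MySQL 8) offer a more direct route:

```python
import sqlite3

# In production each server would hold its own connection to the same database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE the_table (
    pk INTEGER PRIMARY KEY,
    payload TEXT,
    locked_by TEXT,
    locked_at TIMESTAMP,
    processed CHAR(1) DEFAULT 'N')""")
conn.executemany("INSERT INTO the_table (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])

def claim_rows(conn, server, timeout_seconds=300, batch=2):
    # Step 1: pseudo-lock a bounded batch of unclaimed or expired rows.
    conn.execute("""
        UPDATE the_table
        SET locked_by = :server, locked_at = CURRENT_TIMESTAMP
        WHERE pk IN (
            SELECT pk FROM the_table
            WHERE (locked_by IS NULL
                   OR (julianday('now') - julianday(locked_at)) * 86400 > :timeout)
              AND processed = 'N'
            LIMIT :batch)""",
        {"server": server, "timeout": timeout_seconds, "batch": batch})
    # Step 2: read back the rows this server actually won.
    return conn.execute(
        "SELECT pk FROM the_table WHERE locked_by = ? AND processed = 'N'",
        (server,)).fetchall()

claimed = claim_rows(conn, "server-1")
for (pk,) in claimed:
    # ... do the work for this row ...
    conn.execute("UPDATE the_table SET processed = 'Y' WHERE pk = ?", (pk,))
```

The timeout check is what keeps the system available: if a server dies mid-batch, its pseudo-locks expire and another server can claim the rows.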
If you are open to changing platform then I would suggest moving to a modern, cloud-based solution like Snowflake. This will do what you want but in the background and by default - so you don't need to know what it's doing or how it's doing it (unless you want to).
This may come across as patronising, which is not my intention, but what you are attempting (in the way you are attempting it) is very complex. If you don't already know how to do it, then someone telling you how is unlikely to give you the skills and experience you need to implement it successfully.
I'm new to .net.
I have a task at hand for my .net web project, using entity framework.
The task is: to delete a certain row in the Products table in the db, using a webpage interface.
What I have right now: a webpage that displays all the rows in that table, with a button on the side of each row (which doesn't have any logic behind it yet) that is supposed to be the delete button.
How I plan to delete the row: when the button is clicked, it should set all the contents of that row's columns to NULL and save it in the db.
The question is: Is that a viable solution for deleting a row? Could there be any problems in the future if I use that method for deleting the row? Are there maybe better solutions for that task?
I suggest that you set, for instance, the idProduct to -1 (assuming that the ids are always positive), since you don't want to delete it directly, and handle the actual deletion separately in the db by using a trigger that fires when you update the product table:
CREATE TRIGGER tr_onUpdate_delete
ON product
AFTER UPDATE
AS
IF (SELECT idProduct FROM inserted) = -1
BEGIN
    DELETE FROM product WHERE idProduct = (SELECT idProduct FROM deleted)
END
Please note that I assume you are updating only one row at a time (the inserted and deleted pseudo-tables can otherwise contain multiple rows).
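As a minimal runnable illustration of the same idea, here is a SQLite version (SQLite triggers fire once per affected row, so the single-row assumption goes away; the table and column names follow the answer above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (idProduct INTEGER, name TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "widget"), (2, "gadget")])

# When a row's id is updated to the sentinel value -1, delete that row.
conn.execute("""
    CREATE TRIGGER tr_onUpdate_delete
    AFTER UPDATE ON product
    WHEN NEW.idProduct = -1
    BEGIN
        DELETE FROM product WHERE rowid = NEW.rowid;
    END""")

# "Deleting" product 1 by updating its id to the sentinel value.
conn.execute("UPDATE product SET idProduct = -1 WHERE idProduct = 1")
remaining = conn.execute("SELECT idProduct FROM product").fetchall()
```

That said, in Entity Framework itself the more idiomatic route is simply removing the entity and saving changes, which issues a real DELETE; the sentinel-plus-trigger approach adds indirection that future maintainers may not expect.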
I have seen some other posts about this, but have not found an answer that permanently works.
We have a table, and I had to add two columns to it. In order to do so, I dropped the table and recreated it. But since it was an external table, it did not drop the associated data files. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02. But only 2021-01-02 is in the control file. So when I am loading that date, it gets re-run with the new columns and everything is fine. But 2021-01-01 is still there, but with a different schema.
This is no issue in Hive, as it seems to default to resolve by name, not position. But Impala resolves by position, so the new columns throw it off.
If I have a table that before had the columns c1,c2,c3, and now have the additional columns c4,c5, if I try to run a query such as
select * from my_table where c5 is null limit 3;
This will give an incompatible parquet schema error in Impala (but Hive is fine, it would just have null for c4 and c5 for the date 2021-01-01).
If I run the command set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then the above query again, it is fine. But I would have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; at the beginning of each session, which is not ideal.
From searching online, I have come up with a few solutions:
- Drop all data files when creating the new table and just start loading from scratch (I think we want to keep the old files).
- Re-load each date (this might not be ideal, as there could be many, many dates that would have to be re-loaded and overwritten).
- Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it).
Are there any other solutions to have it so I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; each time I want to use this table in Impala?
This is a very strange thing and I don't know why it is happening. I have created a mapping that transforms the data via an Expression and loads the data into the target (a file) based on a lookup on that same target.
Source table
#CompanyName
Acne Lmtd
Acne Ltd
N/A
None
Abc Ltd
Abc Ltd
X
Mapping
Source
-> Exp (trim, etc.)
-> Lookup (source.company_name = tgt.company_name), return port: CompId
-> Filter (ISNULL(CompId))
-> Target (CompId via Sequence Generator, CompName)
The above mapping logic inserts duplicate company names as well: the two Abc Ltd records in the source are repeated in the target. I don't know why. I have debugged it, and the filter condition evaluates to true (CompId is null) even when the record has already been inserted into the target.
I also thought it might be a lookup cache issue, so I enabled the dynamic cache as well, but got the same result. It should have worked like the SQL query:
SELECT company_id
FROM lkptarget
WHERE company_name IN (SELECT company_name FROM source)
Therefore, for Abc Ltd the filter condition should have resulted in false (ISNULL(company_id) = false), but it is evaluating to true. How do I get unique records via a lookup, without using DISTINCT?
Note: the lookup used is already a dynamic lookup.
It was in fact a dynamic cache issue: NewLookupRow gets assigned a value of 0 on duplicates, so I added the condition ISNULL(COMPANYID) AND NEWLOOKUPROW = 1 to the filter, and that did work.
The Lookup transformation has no way to know what happens in later transformations in the mapping. It can't see results in the target itself, because the Lookup cache is loaded once, at the beginning of the mapping, using a separate connection to the database. Even if you disable caching (which would mean one query per Lookup input row), data is not committed immediately while writing to the target, and so is not visible to other connections.
That's the reason to use a dynamic Lookup cache, which works by adding new rows to the Lookup cache as they come through. However, in your case there is a catch: the company_id is created after the Lookup (which is the right place to do so), so it can't be added to the Lookup cache.
I think you could configure the Lookup so that :
- You activate the options Dynamic Lookup Cache, Update Else Insert and Insert Else Update
- You use the company_name to make the comparison between source data and Lookup data
- You create a fake field company_id with value 0 before the Lookup and associate it with the corresponding Lookup field
- You check the checkbox Disable in comparison for the company_id field
You can then use the predefined field NewLookupRow (it appears when you check the Dynamic Lookup Cache option) which should have a value of 1 for new rows or 2 for existing rows with updates (0 for identical rows)
The Lookup should now output NewLookupRow = 1 for the first Abc Ltd and then NewLookupRow = 0 for the second. The filter just after the Lookup should have a condition like NewLookupRow = 1.
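To make the NewLookupRow behaviour concrete, here is a small Python model of what the dynamic cache does per row (a simplified sketch, not the actual Informatica implementation; the comparison is on company_name only, as it would be with company_id disabled in comparison):

```python
# Simplified model of a dynamic Lookup cache keyed on company_name.
# NewLookupRow: 1 = row inserted into the cache, 2 = existing row updated,
# 0 = identical row already in the cache.

def dynamic_lookup(rows, compare_field="company_name"):
    cache = {}
    results = []
    for row in rows:
        key = row[compare_field]
        if key not in cache:
            cache[key] = row
            results.append((row, 1))   # new row -> NewLookupRow = 1
        elif cache[key] != row:
            cache[key] = row
            results.append((row, 2))   # changed row -> NewLookupRow = 2
        else:
            results.append((row, 0))   # duplicate -> NewLookupRow = 0
    return results

source = [{"company_name": "Abc Ltd"}, {"company_name": "Abc Ltd"},
          {"company_name": "X"}]
flags = [flag for _, flag in dynamic_lookup(source)]
```

With a filter of NewLookupRow = 1, only the first Abc Ltd and X pass through to the target, which is the deduplication you were after.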
For more details you can have a look at the Informatica documentation :
https://docs.informatica.com/data-integration/data-services/10-2/developer-transformation-guide/dynamic-lookup-cache.html