How do I modify the Upsert key for a Snowflake output in Upsolver SQLake

Is it possible to add an Upsert key to an existing output or change the Upsert key for an existing output without creating a new output?
This is all very hypothetical right now! Since this is new data and a lot of the schemas have been changing during design phases, I just want to get ahead of any problems we might have.

You won't be able to update the Upsert keys for an existing output in Upsolver. The best way to handle this is to create a new Snowflake table (or delete the existing data and keep the same table), duplicate your output, change your Upsert key, and replay.

Related

SCD Type-2 Using INSERT strategy

I am using the Data Vault 2.0 model. Here we are not supposed to use UPDATE on a table.
Normally in an RDBMS, we implement SCD-2 using an UPDATE & INSERT strategy.
But in my case, I can use only an INSERT strategy.
My source is a Kafka input that is loaded into Snowflake (in Avro format), and I'm flattening and loading the same data into another table in relational format.
My question is: can anyone please help me implement SCD Type-2 logic without using UPDATE? I have to do this with an INSERT-only strategy.
I need to keep the history as well. For example, if I get rec-1 in interval-a, and in interval-b I get 2 updates for rec-1, then I need to load all three records into my target, with the last record that came in being the latest version.
Your satellite key needs to consist of the Hash Key + the Load Date Timestamp. Each individual version will then have a new primary key and can therefore be inserted into the satellite. By querying the latest timestamp, you will get the latest version.
You can find a sample script here: https://simplesqlbi.home.blog/2019/06/29/part-3-data-vault-for-beginners/
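As a minimal sketch of the same idea (the table and column names here are made up, and Snowflake syntax is assumed since that is where the Kafka data lands), an insert-only satellite and a "latest version" query could look like this:

    -- Hypothetical satellite: every change is a new row, never an UPDATE.
    CREATE TABLE sat_customer (
        customer_hk     VARCHAR(64)   NOT NULL,  -- hash key from the hub
        load_dts        TIMESTAMP_NTZ NOT NULL,  -- load date timestamp
        record_source   VARCHAR(100),
        customer_name   VARCHAR(200),
        customer_status VARCHAR(50),
        PRIMARY KEY (customer_hk, load_dts)      -- hash key + load timestamp
    );

    -- Each new version (rec-1 in interval-a, both updates in interval-b, ...) is simply inserted.
    INSERT INTO sat_customer (customer_hk, load_dts, record_source, customer_name, customer_status)
    SELECT src.customer_hk, CURRENT_TIMESTAMP(), 'kafka', src.customer_name, src.customer_status
    FROM stg_customer AS src;

    -- Latest version per key: the row with the most recent load timestamp.
    SELECT *
    FROM sat_customer
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_hk ORDER BY load_dts DESC) = 1;

Note that Snowflake does not enforce the declared PRIMARY KEY; it is included here only to document the grain of the satellite.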

In MS-Access is there any way to bypass/override/easily temporarily disable data validation of an INSERT SQL statement

We have many tables with lots of fields that have data validation on them. When running an SQL INSERT statement to append imported data to these tables, in the event of a data validation error it fails with the extremely helpful error message "27 records could not be added due to validation rule violations".
Tracking down these violations is very time consuming, so I am wondering if there is a way to do ANY of the following:
1) Find out the record and field(s) that caused the validation error programmatically
2) Override data validation during import so that people can afterwards look through the table to find the issues highlighted
3) Disable and then re-enable validation for the same effect (short of looping through all columns, disabling data validation while saving the settings, then after the insert looping through again and re-enabling it for each column).
4) Some other suggestion I haven't thought of because it is past 10:00 pm here.
Any help would be very much appreciated.
No, Access does not allow validation rules to be disabled or overridden.
A separate import table is the best alternative that would facilitate possibilities #1 and #2.
Copy the destination table structure to a new table. You can do this with Copy (Ctrl+C) and Paste (Ctrl+V) within the navigation panel.
Alter the copied table definition by removing all validation rules and constraints.
Pay special attention to the primary key for the destination table.
If the primary key is an AutoNumber column, then go ahead and keep this column in the import table as the primary key. Just be aware that this value will be an arbitrary, temporary number used only for import row identification. Eventually the destination table will have a new unique value assigned.
If the primary key is composed of meaningful data columns, I suggest removing the primary key and unique constraint. Instead, add an additional AutoNumber primary key column just for importing.
If there is no primary key, you probably want to consider adding one and reviewing proper table normalization. As before, at least add an AutoNumber primary key as already described.
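If you prefer to script the import table rather than copy/paste it, a rough Jet SQL sketch could look like the following (the table and columns are invented for illustration):

    CREATE TABLE tblOrders_Import (
        ImportID    COUNTER CONSTRAINT pk_import PRIMARY KEY,  -- AutoNumber used only to identify import rows
        OrderNumber TEXT(20),
        OrderDate   DATETIME,
        Amount      CURRENCY
    );

Note that no validation rules or extra constraints are defined, which is the whole point of the staging table.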
Define an import process. Some or all of this could be automated via code / macros.
Clear import table of existing data.
Import external data into the import table.
Copy rows from the import table to the permanent table.
The easiest method is probably an SQL INSERT INTO statement; within the Access Query designer, an Append Query is the same thing as the INSERT INTO statement (a sketch of this, plus a follow-up comparison query, appears after these steps).
Alternative methods are to use VBA to loop through rows via RecordSet objects.
Run one or more comparison queries to detect and report rows which failed the copy operation.
This assumes that the data has a unique field or fields that can be used to distinguish rows from each other, both for the imported data as well as existing data in permanent table.
If the data does NOT have identifying fields, the only viable option is probably to loop through a RecordSet object in VBA code, since a RecordSet can be used to select and operate on individual rows, whereas SQL statements operate on entire sets of data and rely on unique values to distinguish between rows of data.
Aside from problems verifying imported data, one should probably question the general usefulness and/or validity of the data rows if they have no individual identity.
For any import row which fails the copy operation, do one or more of the following, iteratively improving the process and building a set of validations (queries and/or code) that you can automate.
Manually inspect data and try to determine failure reasons.
Write a validation query or set of validation queries which identify import rows that do not satisfy validation rules or constraints of the permanent table.
Possibly write additional queries which "massage" data into the permanent table, or which report on bad data to be fixed manually or researched further.
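As an illustration of the copy and comparison steps above (table and column names are made up; adjust to your schema), the Access SQL might look roughly like this:

    -- Copy rows from the unvalidated import table into the permanent table.
    -- Rows that violate validation rules are rejected; the rest are appended.
    INSERT INTO tblOrders (OrderNumber, OrderDate, Amount)
    SELECT OrderNumber, OrderDate, Amount
    FROM tblOrders_Import;

    -- Report import rows that did not make it into the permanent table,
    -- assuming OrderNumber uniquely identifies a row in both tables.
    SELECT i.*
    FROM tblOrders_Import AS i
    LEFT JOIN tblOrders AS o ON i.OrderNumber = o.OrderNumber
    WHERE o.OrderNumber IS NULL;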

Eliminate duplicates automatically from table

The table will be getting new data every day from the source system, and I want the duplicates to be deleted automatically as soon as new data gets loaded into the table.
Is this possible in BigQuery?
I tried to create a view named sites_view in BigQuery with the below query
SELECT DISTINCT * FROM prd.sites
but the duplicates are not getting deleted automatically.
Below is for BigQuery:
Duplicates will not be deleted automatically - there is no such functionality in BigQuery.
You should have some process to make this happen as frequently as you need, or use views.
BigQuery is based on an append-only kind of design, so it accepts all the data.
This is one of the reasons there are no primary/unique key constraints on it, so you can't prevent duplicates from entering the table.
So, you have to have a process like:
1.) Create a new table without duplicates from your original table.
(You can use DISTINCT/ROW_NUMBER() for doing this; see the sketch after these steps.)
2.) Drop original table.
3.) Rename new table with original table name.
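A rough sketch of that process in BigQuery standard SQL (prd.sites comes from the question; prd.sites_dedup is a made-up staging name, and full-row duplicates with groupable column types are assumed - otherwise use ROW_NUMBER() over the business key instead of DISTINCT):

    -- 1.) Create a new table without duplicates from the original table.
    CREATE TABLE prd.sites_dedup AS
    SELECT DISTINCT * FROM prd.sites;

    -- 2.) Drop the original table.
    DROP TABLE prd.sites;

    -- 3.) Recreate the table under its original name and remove the staging copy.
    CREATE TABLE prd.sites AS
    SELECT * FROM prd.sites_dedup;

    DROP TABLE prd.sites_dedup;

If you want this to happen shortly after each daily load, you could run it from whatever orchestrates the load, or as a BigQuery scheduled query.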
Let me know if this information helps.

How do I carry old ID keys forward into a new database structure?

Given a project I'm working on, we have an old database structure we're migrating data from into a new database structure, and we need to preserve the old keys for a few tables for backwards compatibility with some existing application functionality.
Currently, there are two approaches we are considering for addressing this need:
Create an extra nullable field for each table and insert the old key into that new field
Create companion table(s) that contain the old and new key mappings
Note: new data will not generate old ID keys, so in approach #1 the nullable field will increasingly contain nulls as new records are added.
Which approach is better for a cleaner database design, and data management long-term?
Do you see any issues with either approach, and if so, what issues?
Is there a #3 approach that I haven't thought of yet?
You mention SQL, but is it SQL Server?
If SQL Server, look into SET IDENTITY_INSERT. This allows you to explicitly insert values into the auto-increment (identity) column instead of that column being in a protected mode.
However, I believe that if you explicitly include the PK in the insert statement with its value, it will respect that and save the original key in the original column you are hoping to retain without having to force yet another column for backward compatibility purposes.
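A minimal T-SQL sketch of that idea (OldDb, dbo.Customer, and the columns are all hypothetical names):

    -- Allow explicit values to be inserted into the identity (auto-increment) column.
    SET IDENTITY_INSERT dbo.Customer ON;

    -- Carry the old key forward as the primary key of the new table.
    -- An explicit column list is required while IDENTITY_INSERT is ON.
    INSERT INTO dbo.Customer (CustomerId, CustomerName, CreatedDate)
    SELECT CustomerId, CustomerName, CreatedDate
    FROM OldDb.dbo.Customer;

    -- Turn identity protection back on when finished.
    SET IDENTITY_INSERT dbo.Customer OFF;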

updating primary key of master and child tables for large tables

I have a fairly huge database with a master table that has a single GUID column (generated by a custom GUID-like algorithm) as the primary key, and 8 child tables that have foreign key relationships with this GUID column. All the tables have approximately 3-8 million records. None of these tables have any BLOB/CLOB/TEXT or any other fancy data types, just normal numbers, varchars, dates, and timestamps (about 15-45 columns in each table). No partitions or other indexes other than the primary and foreign keys.
Now, the custom GUID algorithm has changed and though there are no collisions I would like to migrate all the old data to use GUIDs generated using the new algorithm. No other columns need to be changed. Number one priority is data integrity and performance is secondary.
Some of the possible solutions that I could think of were (as you will probably notice they all revolve around one idea only)
add new column ngu_id and populate with new gu_id; disable constraints; update child tables with ngu_id as gu_id; rename ngu_id -> gu_id; re-enable constraints
read one master record and its dependent child records from child tables; insert into the same table with new gu_id; remove all records with old gu_ids
drop constraints; add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids; re-enable constraints
add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids
create new column ngu_ids on all master and child tables; create foreign key constraints on ngu_id columns; add update trigger to the master table to cascade values to child tables; insert new gu_id values into ngu_id column; remove old foreign key constraints based on gu_id; remove gu_id column and rename ngu_id to gu_id; recreate constraints if necessary;
use on update cascade if available?
My questions are:
Is there a better way? (Can't bury my head in the sand, gotta do this)
What is the most suitable way to do this? (I've to do this in Oracle, SQL server and mysql4 so, vendor-specific hacks are welcome)
What are the typical points of failure for such an exercise and how to minimize them?
If you are with me so far, thank you and hope you can help :)
Your ideas should work. The first is probably the one I would use. Some cautions and things to think about when doing this:
Do not do this unless you have a current backup.
I would leave both values in the main table. That way if you ever have to figure out from some old paperwork which record you need to access, you can do it.
Take the database down for maintenance while you do this and put it in single user mode. The very last thing you need while doing something like this is a user attempting to make changes while you are in midstream. Of course, the first action once in single user mode is the above-mentioned backup. You probably should schedule the downtime for some time when the usage is lightest.
Test on dev first! This should also give you an idea as to how long you will need to close production for. Also, you can try several methods to see which is the fastest.
Be sure to communicate in advance to users that the database will be going down at the scheduled time for maintenance and when they can expect to have it be available again. Make sure the timing is ok. It really makes people mad when they plan to stay late to run the quarterly reports and the database is not available and they didn't know it.
Since there is a fairly large number of records, you might want to run the updates of the child tables in batches (one reason not to use cascading updates). This can be faster than trying to update 5 million records with one update. However, don't try to update one record at a time or you will still be here next year doing this task.
Drop indexes on the GUID field in all the tables and recreate after you are done. This should improve the performance of the change.
Create a new table with the old and the new PK values in it (a sketch of these steps appears below). Place unique constraints on both columns to ensure you haven't broken anything so far.
Disable constraints.
Run updates against all the tables to modify the old value to the new value.
Enable the PK, then enable the FKs.
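A rough sketch of those steps, shown here in SQL Server syntax with placeholder table, column, and constraint names (Oracle and MySQL would use their own ALTER TABLE / UPDATE-join forms):

    -- Mapping of old GUIDs to new GUIDs, with both sides kept unique.
    CREATE TABLE guid_map (
        old_guid VARCHAR(64) NOT NULL UNIQUE,
        new_guid VARCHAR(64) NOT NULL UNIQUE
    );

    -- Disable the foreign key on a child table while the keys are rewritten.
    ALTER TABLE child1 NOCHECK CONSTRAINT fk_child1_master;

    -- Rewrite the key in the master table from the mapping.
    UPDATE m
    SET    gu_id = g.new_guid
    FROM   master_table AS m
    JOIN   guid_map     AS g ON g.old_guid = m.gu_id;

    -- Same idea for each child table.
    UPDATE c
    SET    gu_id = g.new_guid
    FROM   child1   AS c
    JOIN   guid_map AS g ON g.old_guid = c.gu_id;

    -- Re-enable and re-validate the constraint afterwards.
    ALTER TABLE child1 WITH CHECK CHECK CONSTRAINT fk_child1_master;

Per the earlier caution about very large tables, each child-table UPDATE could also be wrapped in a batching loop (e.g. UPDATE TOP (n) ... in SQL Server) rather than run as one statement.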
It's difficult to say what the "best" or "most suitable" approach is as you have not described what you are looking for in a solution. For example, do the tables need to be available for query while you are migrating to new IDs? Do they need to be available for concurrent modification? Is it important to complete the migration as fast as possible? Is it important to minimize the space used for migration?
Having said that, I would prefer #1 over your other ideas, assuming they all met your requirements.
Anything that involves a trigger to update the child tables seems error-prone and overcomplicated, and likely will not perform as well as #1.
Is it safe to assume that new IDs will never collide with old IDs? If not, solutions based on updating the IDs one at a time will have to worry about collisions -- this will get messy in a hurry.
Have you considered using CREATE TABLE AS SELECT (CTAS) to populate new tables with the new IDs? You'll be making a copy of your existing tables and this will require additional space, however it is likely to be faster than updating the existing tables in place. The idea is: (i) use CTAS to create new tables with new IDs in place of the old, (ii) create indexes and constraints as appropriate on the new tables, (iii) drop the old tables, (iv) rename the new tables to the old names.
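For example, in Oracle the CTAS route might look roughly like this (names are placeholders; guid_map is a made-up old-to-new mapping table like the one sketched in the previous answer):

    -- (i) Build the replacement tables with the new GUIDs substituted in.
    CREATE TABLE master_new AS
    SELECT g.new_guid AS gu_id,
           m.col1, m.col2            -- all remaining columns carried over unchanged
    FROM   master_table m
    JOIN   guid_map     g ON g.old_guid = m.gu_id;

    CREATE TABLE child1_new AS
    SELECT g.new_guid AS gu_id,
           c.col1, c.col2
    FROM   child1   c
    JOIN   guid_map g ON g.old_guid = c.gu_id;

    -- (ii) Create primary keys, foreign keys, and indexes on the *_new tables here.

    -- (iii) Drop the old tables (children first, so their foreign keys go away).
    DROP TABLE child1;
    DROP TABLE master_table;

    -- (iv) Rename the new tables to the original names.
    ALTER TABLE master_new RENAME TO master_table;
    ALTER TABLE child1_new RENAME TO child1;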
In fact, it depends on your RDBMS.
Using Oracle, the simplest choice is to make all of the foreign key constraints "deferred" (checked on commit), perform the updates in a single transaction, then commit.
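A minimal sketch of that approach in Oracle, assuming the same made-up guid_map mapping table and placeholder names (note that an existing non-deferrable constraint has to be dropped and recreated as DEFERRABLE first):

    -- Recreate the foreign key as deferrable so it is only checked at commit time.
    ALTER TABLE child1 DROP CONSTRAINT fk_child1_master;
    ALTER TABLE child1 ADD CONSTRAINT fk_child1_master
        FOREIGN KEY (gu_id) REFERENCES master_table (gu_id)
        DEFERRABLE INITIALLY IMMEDIATE;

    -- Defer all deferrable constraints, then rewrite parent and children in one transaction.
    SET CONSTRAINTS ALL DEFERRED;

    UPDATE master_table m
    SET    m.gu_id = (SELECT g.new_guid FROM guid_map g WHERE g.old_guid = m.gu_id)
    WHERE  EXISTS   (SELECT 1 FROM guid_map g WHERE g.old_guid = m.gu_id);

    UPDATE child1 c
    SET    c.gu_id = (SELECT g.new_guid FROM guid_map g WHERE g.old_guid = c.gu_id)
    WHERE  EXISTS   (SELECT 1 FROM guid_map g WHERE g.old_guid = c.gu_id);

    COMMIT;   -- the deferred foreign key is checked here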