How to deal with data structure changes when performing a full historical data load? - sql

I'm dealing with a SQL Server database which contains a column "defined data" with JSON data in it (and some other simple columns). The data builds up over time, right now we have about 8 million rows.
The data from this db is periodically read by an ETL system which then reads the JSON data in the "defined data" column and maps the data to a new SQL Server table based on the columns names contained in the JSON data.
This SQL Server table is prone to changes, meaning that about every 4 months additional columns are needed or column names change. Whenever this SQL Server table changes its data structure, a new version is introduced, which also forces the JSON data structure to change.
However, the ETL system should still be able to load all historical (JSON) data from the SQL Server database, regardless of the changing version throughout time. How can I make this work, taking into consideration version changes of the SQL Server tables and the JSON data?
!example]1
So in this example my question is:
How can I ensure that I can load both client 20 and 21 into one SQL Server table without getting errors because the JSON data structure is not reflecting version 2 in the case of historical data?
Given the size of the SQL Server database, it doesn't seem like an option to update all historical JSON data according to the latest version (in this example that would mean adding "AssetType" for the 01-01-2021 data and filling it in with NULL).
Many, many thanks in advance!

First I would check if json fields exist in the table as column names by looking them up in the information schema. If not exists then alter table add column.

How can I ensure that I can load both client 20 and 21 into one SQL Server table without getting errors because the JSON data structure is not reflecting version 2 in the case of historical data?
You maintain 2 separate tables. A Raw/Staging/Bronze table that has the same schema as the source, and a Cleansed/Warehouse/Silver table that has the desired schema for reporting. If you have multiple separate sources, you may have separate Raw tables.
Periodically you enhance the schema of the Cleansed table to add new data that has appeared in the Raw table.

Related

How to sync/update a database connection from MS Access to SQL Server

Problem:
I need to get data sets from CSV files into SQL Server Express (SSMS v17.6) as efficiently as possible. The data sets update daily into the same CSV files on my local hard drive. Currently using MS Access 2010 (v14.0) as a middleman to aggregate the CSV files into linked tables.
Using the solutions below, the data transfers perfectly into SQL Server and does exactly what I want. But I cannot figure out how to refresh/update/sync the data at the end of each day with the newly added CSV data without having to re-import the entire data set each time.
Solutions:
Upsizing Wizard in MS Access - This works best in transferring all the tables perfectly to SQL Server databases. I cannot figure out how to update the tables though without deleting and repeating the same steps each day. None of the solutions or links that I have tried have panned out.
SQL Server Import/Export Wizard - This works fine also in getting the data over to SSMS one time. But I also cannot figure out how to update/sync this data with the new tables. Another issue is that choosing Microsoft Access as the data source through this method requires a .mdb file. The latest MS Access file formats are .accdb files so I have to save the database in an older .mdb version in order to export it to SQL Server.
Constraints:
I have no loyalty towards MS Access. I really am just looking for the most efficient way to get these CSV files consistently into a format where I can perform SQL queries on them. From all I have read, MS Access seems like the best way to do that.
I also have limited coding knowledge so more advanced VBA/C++ solutions will probably go over my head.
TLDR:
Trying to get several different daily updating local CSV files into a program where I can run SQL queries on them without having to do a full delete and re-import each day. Currently using MS Access 2010 to SQL Server Express (SSMS v17.6) which fulfills my needs, but does not update daily with the new data without re-importing everything.
Thank you!
You can use a staging table strategy to solve this problem.
When it's time to perform the daily update, import all of the data into one or more staging tables. Execute SQL statement to insert rows that exist in the imported data but not in the base data into the base data; similarly, delete rows from the base data that don't exist in the imported data; similarly, update base data rows that have changed values in the imported data.
Use your data dependencies to determine in which order tables should be modified.
I would run all deletes first, then inserts, and finally all updates.
This should be a fun challenge!
EDIT
You said:
I need to get data sets from CSV files into SQL Server Express (SSMS
v17.6) as efficiently as possible.
The most efficient way to put data into SQL Server tables is using SQL Bulk Copy. This can be implemented from the command line, an SSIS job, or through ADO.Net via any .Net language.
You state:
But I cannot figure out how to refresh/update/sync the data at the end
of each day with the newly added CSV data without having to re-import
the entire data set each time.
It seems you have two choices:
Toss the old data and replace it with the new data
Modify the old data so that it comes into alignment with the new data
In order to do number 1 above, you'd simply replace all the existing data with the new data, which you've already said you don't want to do, or at least you don't think you can do this efficiently. In order to do number 2 above, you have to compare the old data with the new data. In order to compare two sets of data, both sets of data have to be accessible wherever the comparison is to take place. So, you could perform the comparison in SQL Server, but the new data will need to be loaded into the database for comparison purposes. You can then purge the staging table after the process completes.
In thinking further about your issue, it seems the underlying issue is:
I really am just looking for the most efficient way to get these CSV
files consistently into a format where I can perform SQL queries on
them.
There exist applications built specifically to allow you to query this type of data.
You may want to have a look at Log Parser Lizard or Splunk. These are great tools for querying and digging into data hidden inside flat data files.
An Append Query is able to incrementally add additional new records to an existing table. However the question is whether your starting point data set (CSV) is just new records or whether that data set includes records already in the table.
This is a classic dilemma that needs to be managed in the Append Query set up.
If the CSV includes prior records - then you have to establish the 'new records' data sub set inside the CSV and append just those. For instance if you have a sequencing field then you can use a > logic from the existing table max. If that is not there then one would need to do a NOT compare of the table data with the csv data to identify which csv records are not already in the table.
You state you seek something 'more efficient' - but in truth there is nothing more efficient than a wholesale delete of all records and write of all records. Most of the time one can't do that - but if you can I would just stick with it.

Pentaho update/insert

I am trying to have a setup in Pentaho where :
My source data is in MySQL DB and target database is Amazon redshift.
I want to have incremental loads on Redshift database table, based on the last updated timestamp from MySQL DB table.
Primary key is student ID.
Can I implement this using update/insert in Pentaho ?
Insert/Update step in Pentaho Data Integration serves the purpose of inserting the row if it doesn't exist in the destination table or updating it if it's already there. It has nothing to do with incremental loads, but if your loads should be inserting or updating the record based on some Change Data Capture mechanism then this is the right step at the end of the process.
For example you could go one of two ways:
If you have a CDC then limit the data at Table Input for MySQL since you already know the last time a record has been modified (last load)
If you don't have a CDC and you are comparing entire tables then go for joining the sets to produce rows that has changed and then perform a load (slower solution)

Using SSIS at run-time how do you create a destination table in SQL Server based off a source table structure from Oracle?

Using SSIS at run-time how do you create a destination table in SQL Server (with the right number of fields, names, data types etc.) solely based off a source table structure from Oracle? The reason for asking is my Oracle tables often changes structure (fields added, data types changed etc.) and my package breaks if there is a mismatch say between the number of source and destination table fields. I need to automate this as I am scheduling packages to run as jobs in SSMS. I have looked everywhere and tried everything but cannot find a suitable solution to this problem....however I'm sure it is possible...somehow.

Replace part of oracle database with data in excel

I have a table with 103 columns and more than 100,000 records. I was told that Oracle Data base has bad data in few of the columns(four columns) and in few of the records(20,000 records approx) and was told to modify data. Also I was given a query where I should modify data(bad data). With the help of query, I exported all the bad data into excel and modified using macros.
How to replace existing bad data in Oracle data base to data which I have in excel(good data). Is it possible to modify data in some part of the database using SQL query? I mean to say 4 out of 103 columns and 20k out of 100k records has to be modified without affecting already existing good data in Oracle database.
I am using SQL developer and Oracle 11g
My query to retrieve bad data
select e.id_number
, e.gender_code
, e.pref_mail_name
, e.prefix
,e.first_name
,e.last_name
,e.spouse_name
,e.spouse_id_number
,e.pref_jnt_mail_name1
,e.pref_jnt_mail_name2
from advance.entity e
where e.person_or_org = 'P'
and ascii(e.spouse_name) <> 32
or ascii(e.spouse_id_number) <> 32"
Note: Moreover I am not changing any primary key or secondary key data. Bad data is in other columns.
You can start by importing all corrected data (which now resides in your Excel file), to a temporary table. See Load Excel data sheet to Oracle database for more details.
The it should be a simple task to write one or more UPDATE statements. If you are new to Oracle syntax, Tech on the Net has some nice examples.
Also, do not forget to backup the original table to a temporary table before you make any changes there. That way, if you mess up the data repair, you can start over.

Using SSIS to create new Database from two separate databases

I am new to SSIS.I got the task have according to the scenario as explained.
Scenario:
I have two databases A and B on different machines and have around 25 tables and 20 columns with relationships and dependencies. My task is to create a database C with selected no of tables and in each table I don't require all the columns but selected some. Conditions to be met are that the relationships should be intact and created automatically in new database.
What I have done:
I have created a package using the transfer SQL Server object task to transfer the tables and relationships.
then I have manually edited the columns that are not required
and then I transferred the data using the data source and destination
My question is: can I achieve all these things in one package? Also after I have transferred the data how can I schedule the package to just transfer the recently inserted rows in the database to the new database?
Please help me
thanks in advance
You can schedule the package by using a SQL Server Agent Job - one of the options for a job step is run SSIS package.
With regard to transferring new rows, I would either:
Track your current "position" in another table, assumes you have either an ascending key or a time stamp column - load the current position into an SSIS variable, use this variable in the WHERE statement of your data source queries.
Transfer all data across into "dump" copies of each table (no relationships/keys etc required just the same schema) & use a T-SQL MERGE statement to load new rows in, then truncate "dump" tables.
Hope this makes sense - its a bit difficult to get across in writing.