So I'm working on migrating some data to a new server. In the new server, each entry in the MAIN table is assigned a new GUID when the transfer takes place. A few other tables must be migrated, and their records must link to the GUID in the MAIN table. Example...
MAIN (Worksheet) table:
WorksheetID --- GUID
1245677903 --- 1

Accident table:
AccidentID --- WorksheetID --- GUID
12121412 --- 1245677903 --- 1
The GUID is used more for versioning purposes, but my question is this: in SSIS, is there any way to pull the Worksheet's GUID from the destination database and assign it directly to the entries in the Accident table? Or do I have to just dump the data into the source DB and run some scripts to get everything nicely referenced? Any help would be greatly appreciated.
There's always the Lookup transformation. No need for sorts to use it.
I seem to have answered a few questions on using the Lookup transformation:
ssis lookup with derived columns?
Excel Source as Lookup Transformation Connection
Using SSIS, How do I find the cities with the largest population?
SSIS LookUp is not dealing with NULLs like the docs say it should
You should be able to do this using a Merge Join Transformation. You might need to pass your data through a Sort Transformation before merging.
Your inputs to the Merge Join Transformation will be the Accident table from your source data and the Worksheet table from your destination database. Just do the merge join on the WorksheetID.
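If it helps to see it as plain SQL, here is a rough T-SQL sketch of the join that either the Lookup or the Merge Join performs logically. The database and table names are just assumptions based on the example above:

-- Hypothetical database/table names taken from the example above.
-- Pull the GUID already assigned in the destination MAIN (Worksheet) table
-- onto each Accident row, matching on WorksheetID.
SELECT  a.AccidentID,
        a.WorksheetID,
        w.GUID
FROM    SourceDB.dbo.Accident  AS a
JOIN    DestDB.dbo.Worksheet   AS w
    ON  w.WorksheetID = a.WorksheetID;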
I have 9M records, and we need to do the following operations:
Daily we receive the entire file of 9M records, about 150GB in size.
It is a truncate-and-load into Snowflake: every day the entire 9M records are deleted and reloaded.
We want to send only an incremental load to Snowflake. Meaning that:
For example, out of the 9 million records, only about 0.5 million change (0.1M inserts, 0.3M deletes, and 0.2M updates). How can we compare the files, extract only the delta file, and load it to Snowflake? How can this be done cost-effectively and quickly with AWS-native tools, loading to S3?
P.S. The data doesn't have any date column. It is a pretty old design, written in 2012, and we need to optimize it. The file format is fixed width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify them?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle, but you do not have an incremental key in the table to identify the incremental rows. You have two options.
Work with your data owners and put in the effort to identify the incremental key. There needs to be one; people are sometimes too lazy to put in this effort. This is the most optimal option.
If you cannot, then look for a CDC (change data capture) solution like GoldenGate.
A CDC stage comes by default in DataStage.
Using the CDC stage in combination with a Transformer stage is the best approach to identify new rows, changed rows, and rows for deletion.
You need to identify the column(s) that make a row unique; doing CDC on all columns is not recommended, because a DataStage job with a CDC stage consumes more resources the more change columns you add to it.
Work with your BA to identify the column(s) that make a row unique in the data.
I had a similar problem to yours. In my case, there is no primary key and no date column to identify the difference, so what I did was use AWS Athena (managed Presto) to calculate the difference between the source and the destination. Below is the process:
Copy the source data to S3.
Create a source table in Athena pointing at the data copied from the source.
Create a destination table in Athena pointing at the destination data.
Now use SQL in Athena to find the difference. As I did not have either a primary key or a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use it to find the difference as well, and create a result table with a column that says "insert/update/delete".
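If you did have a primary key, a rough Athena (Presto) sketch of that classification could look like the following. The "id" key and the col1/col2 comparison columns are made up for illustration:

-- Hypothetical column names: "id" is the primary key, col1/col2 stand in
-- for whatever non-key columns you want to compare.
CREATE TABLE delta AS
SELECT
    COALESCE(s.id, d.id) AS id,
    CASE
        WHEN d.id IS NULL THEN 'insert'   -- only in the new source file
        WHEN s.id IS NULL THEN 'delete'   -- only in the destination
        ELSE 'update'                     -- in both, but values differ
    END AS change_type
FROM table_source s
FULL OUTER JOIN table_destination d
    ON s.id = d.id
WHERE d.id IS NULL
   OR s.id IS NULL
   OR s.col1 <> d.col1
   OR s.col2 <> d.col2;   -- add NULL-safe comparisons if these columns are nullable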
This option is AWS native and therefore cheap as well: Athena costs $5 per TB scanned. Also, with this method, do not forget to write file rotation scripts to keep your S3 costs down.
I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is a better way to compare that would improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in
IMHO, comparing the entire dataset just to load new data is not performant.
Option 1:
Instead, you can create a google-bigquery partitioned table with a partition column, load the data by partition, and when loading new data check whether it falls into the same partition (a minimal DDL sketch follows after the doc links below).
Hitting partition-level data in Hive or BigQuery is much more efficient than selecting the entire table and comparing it in Spark.
The same is applicable for Hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
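As a rough sketch of what Option 1 could look like in BigQuery (the dataset, table, and column names here are made up):

-- Hypothetical dataset/table/columns; load each day's data into its own
-- partition so only that partition needs to be scanned when comparing.
CREATE TABLE my_dataset.my_table (
  load_date DATE,      -- partition column
  id        STRING,
  payload   STRING
)
PARTITION BY load_date;

-- Compare the incoming batch against a single partition instead of the
-- whole table:
SELECT id
FROM my_dataset.my_table
WHERE load_date = CURRENT_DATE();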
Option 2:
Another alternative with Google BigQuery is the MERGE statement. If your requirement is to merge the data without doing the comparison yourself, you can go ahead with MERGE; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement for each change to the target table.
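A minimal BigQuery MERGE sketch, with made-up table and column names, covering all three operations in one atomic statement:

-- Hypothetical tables: "staging" holds the incoming data, "target" is the
-- aggregated table. One statement performs insert, update, and delete.
MERGE my_dataset.target AS T
USING my_dataset.staging AS S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET value = S.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (S.id, S.value)
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;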
There are many ways this problem can be solved; one of the less expensive, performant, and scalable ways is to use a datastore on the file system to determine which data is truly new.
As data comes in for the first time, write it to two places: the database and a file (say in S3). If data is already in the database, then you need to initialize the local/S3 file with the table data.
As data comes in from the second time onwards, check whether it is new based on its presence in the local/S3 file.
Mark delta data as new or updated, and export it to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate this file to keep the data within that time range.
You can also bucket and partition this data. You can use Delta Lake to maintain it too.
One downside is that whenever the database is updated, this file may need to be updated as well, depending on whether the relevant data changed. You can maintain a marker column on the database table to signify the sync date, and index that column. Read changed records based on this column and update the file/Delta Lake copy.
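A rough SQL sketch of that marker idea, with made-up table, column, and parameter names (exact syntax varies by database):

-- Hypothetical: a last_modified marker column, indexed, so the sync job can
-- read only rows changed since the previous run and refresh the file copy.
ALTER TABLE my_table ADD COLUMN last_modified TIMESTAMP;
CREATE INDEX idx_my_table_last_modified ON my_table (last_modified);

SELECT *
FROM my_table
WHERE last_modified > :last_sync_time;  -- timestamp of the previous successful sync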
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website corresponds to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add a partition column and clustering columns as well (BigQuery allows up to four clustering columns). Eventually you could even use yearly suffixed tables. This way you have several dimensions available, so rematerialization only needs to scan a limited number of rows.
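A rough DDL sketch of such a layout (the column names are invented for illustration):

-- Hypothetical event table: raw JSON payload plus a partition column and
-- four clustering columns (BigQuery's maximum) to limit the rows scanned.
CREATE TABLE my_dataset.events (
  event_ts   TIMESTAMP,
  event_date DATE,        -- partition column
  event_type STRING,      -- clustering column 1
  page       STRING,      -- clustering column 2
  form_id    STRING,      -- clustering column 3
  user_id    STRING,      -- clustering column 4
  payload    STRING       -- the raw JSON
)
PARTITION BY event_date
CLUSTER BY event_type, page, form_id, user_id;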
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, process them there, and write to BigQuery with a new schema. This layer would be able to create tables on the fly with the schema you code in your engine.
Btw, you can remove columns: that's rematerialization, where you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
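For example, a rematerialization that drops a deprecated column and removes duplicate rows in one pass might look like this (names invented; note it rewrites, and therefore rescans, the whole table):

-- Hypothetical: rewrite the table without a deprecated column and without
-- duplicate rows.
CREATE OR REPLACE TABLE my_dataset.events AS
SELECT DISTINCT * EXCEPT (deprecated_form_field)
FROM my_dataset.events;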
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The Dataflow steps would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destination and write them to a file per event type, using some windowing based on your use case (how frequently you observe such issues).
Read that file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
I have found this question and answer already, and together they provide most of the answer to my problem, in a format entirely intelligible to a complete novice!
I have an additional query to hopefully fill in some of the gaps in both my process and my understanding:
I have a series of XML files each containing all the information for a single person, instead of a single CSV containing multiple people, as the salient input. This information requires splitting and manipulation, so I have multiple streams in my Kettle transformation to reflect this (and will have a 'for each file' loop set up in the parent Job to handle multiple files).
To use the method outlined in the selected answer to import my processed data from a person file into a database, do I need to recombine my many processed datastreams into several streams within which all the data to be joined resides, or is there an alternate way of approaching this?
i.e. If I have datastreams A, B, C, D & E in my Kettle transformation, and within my database A is joined to B & C, and D to E, do I necessarily need to combine streams A, B & C into one stream and D & E into another?
Thanks in advance
Update: I have now solved this issue. Doing so involved (as with many problems) coming at it from a different direction, but I'm adding my solution here to possibly aid a similarly unskilled Kettle user:
I used the operator outlined in the linked question (combined insert/update) to insert/update the data into my 'outermost' tables for each file (streamed using the StAX operator), my schema being much akin to a spider's web, with a nodal 'core' table of unique information for each file, branching off into tables where I aim to store single incidences of data common to many files, e.g. country.
The data streams in my transformation now being in the database, I join streams incrementally (e.g. place+city+country streams after processing and insertion are related to generate a location entity in a location table) producing the requisite relationships. This eventually results in producing join tables from e.g. my location table to the 'core' table, such that all the disparate data for a given XML file is appropriately connected.
New to SSIS (2k12).
Importing a csv file containing any new or changed PO Lines. My db has a PO master, and a POLine child, so if the PO is new, I need to insert a row into the master before loading up the child(ren). I may have half a dozen children in the POLineDetail import.
To create a Master, I have to match up the ProjectNbr from the tblProjects table to get the ProjectID, and similarly with the Vendor (VendorName and VendorID). I can do this in T-SQL, but I'm not sure how best to do it using SSIS. What's the strategy?
You just need to use the lookup transformation on the data flow task and route the unmatched records to the no match output. The no match output will be records that do not exist and need to be inserted, which you would attach to a destination transformation.
It sounds like the first step that's needed is to load the data into a staging table so that you can work with the data. From there you can use the Lookup Transformations in SSIS to do the matching to populate your master data based on your mentioned criteria. You could also use the same lookup transformation with the CSV as the source without going into the table, but I like to stage the data so that there is an opportunity to do any additional cleanup that's needed. Either way though, the lookup transformation would provide the functionality that you're looking for.
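If it helps to see what the staging + lookup approach boils down to, here is a rough T-SQL sketch. The staging, master, and vendor table/column names are assumptions based on the question:

-- Hypothetical staging/master table and column names. This is the logic the
-- Lookup transformation's "no match" path implements: resolve ProjectID and
-- VendorID from their lookup tables, then insert a PO master row only if one
-- doesn't already exist.
INSERT INTO dbo.POMaster (PONumber, ProjectID, VendorID)
SELECT DISTINCT s.PONumber, p.ProjectID, v.VendorID
FROM   dbo.StagingPOLines AS s
JOIN   dbo.tblProjects    AS p ON p.ProjectNbr = s.ProjectNbr
JOIN   dbo.tblVendors     AS v ON v.VendorName = s.VendorName
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.POMaster AS m
                   WHERE m.PONumber = s.PONumber);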