I have a CSV file with the data and the database in which I need to build the star schema.
However, the CSV file doesn't have the IDs of the dimension tables (the primary keys), which means I only get those IDs after inserting the data into the dimension tables (the ID is an auto-increment value).
This means that first I need to load the data into the dimensions, and after that I need to read the dimension tables (to find out the IDs) and the remaining data from the CSV file, and load all of that into the fact table.
To load the data into the dimensions, I made this Transformation and it works perfectly.
The problem is getting the IDs from the dimension tables (and, simultaneously, the remaining data from the CSV file) and loading all of that into the fact table.
I don't know if it is even possible to do all this in a single Transformation.
Any suggestions?
I would really appreciate any help you could provide. (A sketch of the correct Transformation would be great)
This is possible in one job, not in one transformation.
Create a job and add two transformations to it.
The first transformation loads the dimension tables; the second transformation loads the fact table.
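In SQL terms, the second (fact-loading) transformation boils down to looking up each dimension's surrogate key by its natural/business key from the CSV and then inserting into the fact table; in PDI this is typically done with a "Database lookup" or "Dimension lookup/update" step per dimension followed by a "Table output" step. A minimal sketch with made-up table and column names:
INSERT INTO fact_sales (date_id, product_id, customer_id, quantity, amount)
SELECT d.date_id, p.product_id, c.customer_id, s.quantity, s.amount
FROM csv_staging s
JOIN dim_date     d ON d.full_date     = s.sale_date       -- natural key -> surrogate key
JOIN dim_product  p ON p.product_code  = s.product_code
JOIN dim_customer c ON c.customer_code = s.customer_code;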
I'm looking for a data validation technique between layers.
Here is the data flow:
Source (RDBMS) > flat file (stage) > AVRO/JSON (final destination) on Azure.
The problem is that there could be multiple flat files (partitions) for a single table at each stage, and potentially even more partitions at the destination.
The plan is to create a SQL table with a bunch of columns, but I'm not sure how to handle partitions and multiple job loads.
Here is the basic table idea:
Data validation table: dt_validation
JobId | tblname | RC_RDBMS | RC_FF | RC_AVRO | Job_run_date | Partition_1 | Partition_2
(RC = row count, FF = flat file)
Note: the idea is that each time data passes through a layer, I get the row count (RC) and insert/update this table.
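For reference, the table sketched above as DDL might look like this (the types are assumptions on my part):
CREATE TABLE dt_validation (
    JobId         INT,
    tblname       VARCHAR(128),
    RC_RDBMS      BIGINT,        -- row count at the source RDBMS
    RC_FF         BIGINT,        -- row count across the flat files for this table
    RC_AVRO       BIGINT,        -- row count in the AVRO/JSON destination
    Job_run_date  DATETIME,
    Partition_1   VARCHAR(64),
    Partition_2   VARCHAR(64)
);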
Does the above table design work for multiple partitions and multiple loads/jobs in a single day?
I need suggestions on what my table should look like, considering partitions and multiple loads in a single day.
We have 9M records and need to do the following:
Daily we receive the entire file of 9M records, about 150 GB in size.
It is a truncate-and-load into Snowflake: every day we delete the entire 9M records and reload them.
We want to send only an incremental load to Snowflake. Meaning that:
For example, out of the 9 million records, only about 0.5 million would change (0.1M inserts, 0.3M deletes, and 0.2M updates). How can we compare the files, extract only the delta file, and load it into Snowflake? How do we do this cost-effectively and quickly with AWS-native tools, landing the output in S3?
P.S. The data doesn't have any date column. It is a pretty old design, written in 2012, that we need to optimize. The file format is fixed width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify them?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle, but you do not have an incremental key in the table to identify the changed rows. You have two options.
Work with your data owners and put in the effort to identify an incremental key. There needs to be one; people are sometimes just reluctant to put in this effort. This is the most optimal option (see the sketch after this list).
If you cannot, then look for a CDC (change data capture) solution like GoldenGate.
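As a sketch of the first option, assuming a hypothetical LAST_UPDATED column turns out to be the incremental key, the extract becomes a simple watermark query:
SELECT *
FROM   source_table
WHERE  last_updated > :last_successful_extract_ts;   -- watermark saved from the previous run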
The CDC stage comes by default in DataStage.
Using the CDC stage in combination with a Transformer stage is the best approach to identify new rows, changed rows, and rows for deletion.
You need to identify the column(s) that make a row unique; doing CDC with all columns is not recommended, because a DataStage job with a CDC stage consumes more resources the more change columns you add to the stage.
Work with your BA to identify the column(s) that make a row unique in the data.
I had a problem similar to yours. In my case, there was no primary key and no date column to identify the differences, so I used AWS Athena (managed Presto) to calculate the difference between the source and the destination. Below is the process:
Copy the source data to S3.
Create a source table in Athena pointing to the data copied from the source.
Create a destination table in Athena pointing to the destination data.
Now use SQL in Athena to find the difference. As I had neither a primary key nor a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use it to find the differences as well and create a result table with a column that says "insert/update/delete"; a sketch of that variant follows.
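Assuming a hypothetical primary key column id, with col1/col2 standing in for the non-key columns (NULL-safe comparison omitted for brevity):
SELECT
  COALESCE(s.id, d.id) AS id,
  CASE
    WHEN d.id IS NULL THEN 'insert'   -- present only in the new source file
    WHEN s.id IS NULL THEN 'delete'   -- present only in the current destination
    ELSE 'update'                     -- present in both, but with different values
  END AS change_type
FROM table_source s
FULL OUTER JOIN table_destination d ON s.id = d.id
WHERE d.id IS NULL
   OR s.id IS NULL
   OR s.col1 <> d.col1
   OR s.col2 <> d.col2;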
This option is AWS-native and cheap as well, since Athena costs $5 per TB scanned. Also, with this method, do not forget to write file rotation scripts to cut down your S3 costs.
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website corresponds to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables that have hundreds of columns, many of them deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is storing everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and one for the JSON data). Then the batch jobs we have running every 10 minutes would perform the joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
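As a rough sketch of what one of those 10-minute aggregation jobs could look like against a two-column raw table (table, column, and JSON field names are invented):
SELECT
  TIMESTAMP_TRUNC(event_ts, HOUR)             AS event_hour,
  JSON_EXTRACT_SCALAR(raw_json, '$.form_id')  AS form_id,
  COUNT(*)                                    AS submissions
FROM mydataset.raw_events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
GROUP BY event_hour, form_id;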
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add one partitioning column and up to four clustering columns as well (four is BigQuery's limit). You could eventually even use yearly suffixed tables. This way you have several dimensions to filter on, so rematerialization only scans a limited number of rows.
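As a concrete sketch of that layout (all names invented):
CREATE TABLE mydataset.raw_events (
  event_ts   TIMESTAMP,
  event_type STRING,
  user_id    STRING,
  page       STRING,
  raw_json   STRING      -- the JSON payload
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id, page;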
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to Pub/Sub or Dataflow, process them there, and write to BigQuery with the new schema. That pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, where you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
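For example, dropping a deprecated column by rewriting the table (this rescans the whole table, and partitioning/clustering has to be re-specified in the DDL if you want to keep it; column name is hypothetical):
CREATE OR REPLACE TABLE mydataset.raw_events AS
SELECT * EXCEPT (deprecated_column)
FROM mydataset.raw_events;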
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types), and you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations output and write them to a file per event type, with some windowing based on your use case (i.e. how frequently you observe such failures).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
New to SSIS (2012).
I'm importing a CSV file containing any new or changed PO lines. My DB has a PO master and a POLine child, so if the PO is new, I need to insert a row into the master before loading up the child(ren). I may have half a dozen children in the POLineDetail import.
To create a master row, I have to match the ProjectNbr against the tblProjects table to get the ProjectID, and similarly with the vendor (VendorName and VendorID...). I can do this in T-SQL, but I'm not sure how best to do it using SSIS. What's the strategy?
You just need to use the Lookup transformation in the data flow task and route the unmatched records to the no-match output. The no-match output will contain the records that do not exist yet and need to be inserted, and you would connect it to a destination component.
It sounds like the first step needed is to load the data into a staging table so that you can work with it. From there you can use the Lookup transformation in SSIS to do the matching and populate your master data based on the criteria you mentioned. You could also use the same Lookup transformation with the CSV as the source, without going through a staging table, but I like to stage the data so that there is an opportunity to do any additional cleanup that's needed. Either way, the Lookup transformation provides the functionality you're looking for.
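For reference, the set-based T-SQL that the Lookup pattern mirrors, with the staging, master, and vendor table/column names guessed from the question:
INSERT INTO dbo.POMaster (PONumber, ProjectID, VendorID)
SELECT DISTINCT s.PONumber, p.ProjectID, v.VendorID
FROM dbo.stg_POLineDetail s
JOIN dbo.tblProjects p ON p.ProjectNbr = s.ProjectNbr
JOIN dbo.tblVendors  v ON v.VendorName = s.VendorName
WHERE NOT EXISTS (SELECT 1 FROM dbo.POMaster m WHERE m.PONumber = s.PONumber);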
I support an Access database whose primary data source is a text file imported daily. The fields in the text file vary. When the fields change, it impacts multiple tables within my Access database, so I maintain them dynamically by importing the data into a temp table and aligning the columns and data types of the rest with a "master structure" table. It contains the field names, data types, lengths, and so on. This master table will be changed every time the text file changes, but this approach seems to fit my users' needs.
My question: would it be better to systematically delete/replace my existing Access tables daily, or to systematically alter the tables to match the changes in the data? Are there any performance or size repercussions to deleting and replacing tables on a daily basis with VBA and SQL?
There are no lasting size repercussions for deleting and replacing Access tables using VBA, provided you compact the database regularly.
This operation could, however, have a performance hit, but it will be negligible if the number of tables is small.
So the answer to your question depends on the size and number of your tables.
If the number is large, then altering the tables gives you a noticeable performance advantage over deleting and replacing them.
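For illustration, both approaches expressed as Access SQL that could be run from VBA (e.g. via CurrentDb.Execute); the table and field names are made up.
Delete and replace:
DROP TABLE tblDailyData;
SELECT * INTO tblDailyData FROM tblTempImport;
Alter in place:
ALTER TABLE tblDailyData ADD COLUMN NewField TEXT(50);
ALTER TABLE tblDailyData ALTER COLUMN ExistingField DOUBLE;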