How do I upsert into a data warehouse from multiple sources - sql

I am in the process of building a new data warehouse. On this occasion the warehouse needs to support incremental nightly updates. Pretty standard stuff really.
Previously, when building a data warehouse, I've either used Created/Updated date columns to drive inserts and updates, or used the primary key of the source table, stored it in the warehouse as the natural key, and used HashBytes to compare row data.
However, this particular warehouse will be fed from multiple SQL tables, which means multiple primary keys and/or multiple Created/Updated dates.
What is a typical design pattern for dealing with this situation?
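For reference, a minimal sketch of the single-source pattern described above (natural key plus HashBytes row comparison) as a T-SQL MERGE; the table and column names (stg.Customer, dw.DimCustomer, CustomerId, RowHash) are hypothetical:

    -- Assumes dw.DimCustomer stores the natural key (CustomerId) and a RowHash
    -- column computed over the tracked attributes.
    MERGE dw.DimCustomer AS tgt
    USING (
        SELECT CustomerId,
               Name,
               Email,
               HASHBYTES('SHA2_256', CONCAT(Name, '|', Email)) AS RowHash
        FROM stg.Customer
    ) AS src
        ON tgt.CustomerId = src.CustomerId
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET Name = src.Name,
                   Email = src.Email,
                   RowHash = src.RowHash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Name, Email, RowHash)
        VALUES (src.CustomerId, src.Name, src.Email, src.RowHash);

The open question above is how to extend this when a single warehouse row is built from several source tables, each with its own key and timestamps.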

Related

How to handle a large dimension in BigQuery

I have a dimension table in my current warehouse (Netezza) which has 10 million records and which is being updated on a daily basis.
We are planning to migrate to BigQuery. Should we keep this dimension table as it is in BigQuery?
How can we redesign this large dimension in BigQuery?
Because BigQuery is not designed for frequent updates, it's not that easy to implement a mutable dimension table. The proper answer depends on your use case.
But here are some alternatives:
Have an append-only dimension table with an "UpdatedAt" field. Then use a window function to get the latest version (you can even create a view that exposes only the latest version; see the sketch after this list).
Truncate and reload the dimension table daily with the latest version of your data.
Create an external table based on GCS / Big Table / Cloud SQL, and have the dimensions updated there.
Save your dimension table in a separate database, and use Cloud Dataflow to perform the join
Save the dimension data together with the fact table (Yes, there will be a lot of duplications, but sometimes it's worth the cost)
Simply update the dimension table whenever there is a change (there are limits on how much of that you can do).
All of these approaches have drawbacks. The solution can even be a mix of more than one approach.
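For the first alternative, a minimal sketch in BigQuery standard SQL, assuming a hypothetical mydataset.dim_customer table with customer_id and updated_at columns:

    -- Append-only dimension: a view that exposes only the latest version of each row.
    CREATE OR REPLACE VIEW mydataset.dim_customer_current AS
    SELECT * EXCEPT (rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
      FROM mydataset.dim_customer
    )
    WHERE rn = 1;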

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables that have hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform the joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table and, to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add one partitioning column and up to four clustering columns as well (a sketch follows below). Eventually you could even use yearly suffixed tables. This way you have five or six dimensions you can use to scan only a limited number of rows for rematerialization.
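A minimal sketch of such a table in BigQuery DDL; the column names (event_date, event_type, user_id, page, payload) are hypothetical, and the clustering columns should be whichever fields you filter on most:

    -- Timestamp + raw JSON payload, plus a partitioning column and a few
    -- clustering columns so rematerialization scans only the relevant slices.
    CREATE TABLE mydataset.raw_events
    (
      event_ts    TIMESTAMP,
      event_date  DATE,        -- partitioning column
      event_type  STRING,      -- clustering columns
      user_id     STRING,
      page        STRING,
      payload     STRING       -- the JSON blob
    )
    PARTITION BY event_date
    CLUSTER BY event_type, user_id, page;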
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, then process them there and write to BigQuery with the new schema. This pipeline could create tables on the fly with the schema you code into your engine.
By the way, you can remove columns; that's rematerialization: you can rewrite the same table with a query. You can rematerialize to remove duplicate rows as well (a sketch follows below).
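A minimal sketch of that rematerialization, assuming a hypothetical deprecated_field column in the mydataset.raw_events table from the sketch above:

    -- Rewrite the table without the deprecated column and without duplicate rows.
    -- If the table is partitioned/clustered, repeat the PARTITION BY / CLUSTER BY
    -- clauses here so the replacement keeps the same layout.
    CREATE OR REPLACE TABLE mydataset.raw_events
    PARTITION BY event_date
    CLUSTER BY event_type, user_id, page
    AS
    SELECT DISTINCT * EXCEPT (deprecated_field)
    FROM mydataset.raw_events;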
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The Dataflow pipeline would look like this:
Read the event/JSON from Pub/Sub.
Flatten the events and filter on the columns you want to insert into the BQ table.
With Dynamic Destinations you can insert the data into the respective tables (if you have events of various types); Dynamic Destinations also lets you specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from Dynamic Destinations and write them to a file per event type, using some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.

Sqlite database - attribute as table vs column?

I'm planning on making a SQLite database to hold data acquired from numerous sensors. The data would be simple things like date added, volume in mL, temperature, etc.
Would it be good practice to create one table per sensor? Or am I better off creating a column for the sensor name instead and putting everything under one table? I plan to query data from sensor(s) based on the date added attribute.
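For what it's worth, a minimal sketch of the single-table option described above, in SQLite, with hypothetical column names (sensor_name, date_added, volume_ml, temperature) and an index to support querying by sensor and date added:

    -- One table for all sensors; the sensor is just another column.
    CREATE TABLE sensor_reading (
        id          INTEGER PRIMARY KEY,
        sensor_name TEXT NOT NULL,
        date_added  TEXT NOT NULL,   -- ISO-8601 text is the usual SQLite convention
        volume_ml   REAL,
        temperature REAL
    );

    -- Supports "readings for sensor X in a date range" queries.
    CREATE INDEX idx_reading_sensor_date ON sensor_reading (sensor_name, date_added);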

Is there a term to describe tables where only the last value should be used?

In my database there are tables where each row represents an entity, and there are other tables where the same entity can appear multiple times, but only the latest entry is the valid one.
For example, I have a table customer where each row represents a customer, and another table customer_membership_status where the same customer can be referenced multiple times, but only the last record for each customer is supposed to be used. Data is never updated in customer_membership_status, only inserted.
Is there a term to describe this pattern? I'm asking because I would like to quickly and easily explain the intended use of the table to others.
Probably the best terms would be CQRS and event sourcing:
Using the stream of events as the write store, rather than the actual data at a point in time, avoids update conflicts on a single aggregate and maximizes performance and scalability
Event Sourcing pattern
Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data. The store acts as the system of record and can be used to materialize the domain objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the business domain, while improving performance, scalability, and responsiveness. It can also provide consistency for transactional data, and maintain full audit trails and history that can enable compensating actions.
EDIT:
After a closer look, you may also want to read about SCD (Slowly Changing Dimension) Type 2.
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
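A minimal, T-SQL-flavored SCD Type 2 sketch for the customer example, with hypothetical table and column names (dim_customer, customer_sk, valid_from, valid_to, is_current); each change inserts a new row with its own surrogate key and validity range:

    -- One row per version of a customer; the current version is flagged.
    CREATE TABLE dim_customer (
        customer_sk  INT          NOT NULL,  -- surrogate key, one per version
        customer_id  INT          NOT NULL,  -- natural key from the source system
        membership   VARCHAR(20)  NOT NULL,
        valid_from   DATETIME     NOT NULL,
        valid_to     DATETIME     NULL,      -- NULL means "still current"
        is_current   BIT          NOT NULL
    );

    -- Latest version of each customer:
    SELECT * FROM dim_customer WHERE is_current = 1;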
Temporal table. It's a table where a timestamp / version attribute is part of a key. The temporal / version attribute allows you to identify which is the latest row for each customer.
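And a minimal sketch of reading such a table, assuming customer_membership_status has hypothetical customer_id, status, and recorded_at columns; a window function picks the latest row per customer:

    -- Only the most recent membership status per customer is considered valid.
    SELECT customer_id, status, recorded_at
    FROM (
        SELECT customer_id,
               status,
               recorded_at,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY recorded_at DESC) AS rn
        FROM customer_membership_status
    ) AS latest
    WHERE rn = 1;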

SQL Server 2012 - Column Store indexes - Reporting Solution

We (the team) are in the process of putting together an audit reporting solution for a huge online transactional website.
Our auditing solution is to enable CDC on the source tables, track every change that happens on the objects, and grab and push those changes into destination tables for reporting.
As of now we have one-to-one tables between source and destination.
There will be only inserts in the destination, no updates or deletes.
So, over time, the audit tables will grow larger than the actual source tables, as they keep a history of changes.
My plan is to flatten the destination tables into fewer tables based on subject/module, enable columnstore indexes, and then use those for reporting.
Are there any suggestions on the above approach, or is there an alternative?
I would recommend that you rather keep the table structure in a single table and have a look at Partitioned Tables and Indexes:
SQL Server supports table and index partitioning. The data of partitioned tables and indexes is divided into units that can be spread across more than one filegroup in a database. The data is partitioned horizontally, so that groups of rows are mapped into individual partitions. All partitions of a single index or table must reside in the same database. The table or index is treated as a single logical entity when queries or updates are performed on the data.
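A minimal T-SQL sketch of such a partitioned, insert-only audit table with a columnstore index; all names (PF_AuditByMonth, PS_AuditByMonth, dbo.AuditSales) and the column list are hypothetical:

    -- Partition function/scheme: one partition per month of change date.
    CREATE PARTITION FUNCTION PF_AuditByMonth (date)
    AS RANGE RIGHT FOR VALUES ('2020-01-01', '2020-02-01', '2020-03-01');

    CREATE PARTITION SCHEME PS_AuditByMonth
    AS PARTITION PF_AuditByMonth ALL TO ([PRIMARY]);

    -- Insert-only audit table, partitioned on the change date.
    CREATE TABLE dbo.AuditSales
    (
        AuditId    bigint IDENTITY(1,1) NOT NULL,
        SourceKey  int           NOT NULL,
        ChangeDate date          NOT NULL,
        Operation  char(1)       NOT NULL,  -- I / U / D from CDC
        Amount     decimal(18,2) NULL
    ) ON PS_AuditByMonth (ChangeDate);

    -- SQL Server 2012 supports only nonclustered columnstore indexes, and
    -- creating one makes the table read-only, so it is typically dropped and
    -- rebuilt around the nightly load, or new data is brought in by switching
    -- in freshly loaded partitions.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AuditSales
    ON dbo.AuditSales (SourceKey, ChangeDate, Operation, Amount);

Partition elimination on the change date plus columnstore compression tends to keep reporting queries manageable even as the audit tables outgrow the source tables.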