Data warehouse: data update strategy with BigQuery - SQL

We have an MIS that stores all the information about Customers, Accounts, Transactions, etc. We are building a data warehouse with BigQuery.
I am pretty new to this topic. Should we:
1. extract ALL of the customers' latest information every day and append it to a BigQuery table with a timestamp, or
2. extract only the customers' information that was updated on that day?
The first solution uses a lot of storage, takes time to upload the data, and produces lots of duplicates, but it keeps my queries simple. For the second solution, given a specific date, how can I get the latest record as of that day?
The same applies to Account data. Here is a simplified Account table with only 4 fields:
AccountId, CustomerId, AccountBalance, Date
If I need to build a daily report or chart of a group of customers' AccountBalance, I need to know the balance of each account on every specific date. So should I extract every account record every day, even if it is the same as the previous day, or should I extract an account only when its balance has changed?
What is the best solution, or what would you suggest? I prefer the second one because there are no duplicates, but how do I construct the query in BigQuery, and will performance be an issue?
What else should I consider? Any recommendations for further reading?

When designing a DWH you need to start from the business questions and translate them into KPIs, measures, dimensions, etc.
When you have those in place...
you choose the technology based on some of the following questions (and many more):
Who are your users? At what frequency and resolution do they consume the data? What are your data sources? Are they structured? What are the data volumes? What is your data quality? How often does your data structure change? etc.
When choosing the technology you need to think about the following: ETL, DB, scheduling, backup, UI, permissions management, etc.
After you have all of those defined... the data schema design is pretty straightforward and is derived from "the purpose of the DWH" and your technology's limits.
You have pointed out some of the points to consider, but the answer depends on your needs... and is not tied to a specific DB technology.
I am afraid your question is too general to be answered without a deep understanding of your needs.
Referring to your comment below:
How reliable is your source data? Are you interested in analyzing trends or just snapshots? Does your source system allow "select all" operations? What are the data volumes? What resources does your source allow for extraction (locks, bandwidth, etc.)?
If you just need a daily snapshot of the current balance, and there are no limits imposed by your source system,
it is much simpler to run a daily snapshot.
That way you don't need to manage "increments", handle data-integrity issues, reconcile discrepancies between systems, etc. However, this approach might have an undesired impact on your source system and on your network costs...
If you do have resource limits and you choose the incremental ETL approach, you can either create a "changes log" table and query it, using row_number() to find the latest record per account, or construct a copy of the source accounts table by merging each day's changes into the existing table. Both options are sketched below.
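For illustration, here is a minimal sketch in BigQuery standard SQL. The project, dataset, and table names (account_changes, account_updates, accounts_current) are assumptions; the columns follow the simplified Account table above.

-- Option 1: query the "changes log" table for the latest record per account
-- as of a given report date.
SELECT AccountId, CustomerId, AccountBalance, Date
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY AccountId ORDER BY Date DESC) AS rn
  FROM `my_project.my_dataset.account_changes`
  WHERE Date <= DATE '2024-01-31'  -- the report date
) AS latest
WHERE rn = 1;

-- Option 2: maintain a current copy of the accounts table by merging each
-- day's extract (loaded into the assumed staging table account_updates).
MERGE `my_project.my_dataset.accounts_current` AS t
USING `my_project.my_dataset.account_updates` AS s
ON t.AccountId = s.AccountId
WHEN MATCHED THEN
  UPDATE SET AccountBalance = s.AccountBalance, Date = s.Date
WHEN NOT MATCHED THEN
  INSERT (AccountId, CustomerId, AccountBalance, Date)
  VALUES (s.AccountId, s.CustomerId, s.AccountBalance, s.Date);

If the changes-log table is partitioned on Date, the row_number() query only scans partitions up to the report date, so performance is usually not an issue at these volumes.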
Each approach has its own trade-offs in simplicity, cost, and resource consumption...
Hope this helps

Related

What is the best way to structure this database?

So I am in the process of building a database from my client's data. Each month they create roughly 25 CSVs, which are unique in their topic and attributes, but they all have one thing in common: a registration number.
The registration number is the only common variable across all of these CSVs.
My task is to move all of this into a database, for which I am leaning towards Postgres (if anyone believes NoSQL would be best for this then please shout out!).
The big problem is structuring this within the database. Should I create one table per month that houses all the data, with column 1 being the registration number and columns 2-200 being the attributes? Or should I put all the CSVs into Postgres as they are and then join them later?
I'm struggling to get my head around how to structure this when there will be monthly updates to every registration, and we don't want to destroy historical data - we want to keep it for future benchmarks.
I hope this makes sense - I welcome all suggestions!
Thank you.
There are some ways in which your question is too broad and asks for an opinion (SQL vs. NoSQL).
However, the gist of the question is whether you should load your data one month at a time or into a well-developed data model. Definitely the latter.
My recommendation is the following.
First, design the data model around how the data needs to be stored in the database, rather than how it is being provided. There may be one table per CSV file. I would be a bit surprised, though. Data often wants to be restructured.
Second, design the archive framework for the CSV files.
You should archive all the incoming files in a nice directory structure with files from each month. This structure should be able to accommodate multiple uploads per month, either for all the files or some of them. Mistakes happen and you want to be sure the input data is available.
Third, copy (this is the Postgres command) the data into staging tables. This is the beginning of the monthly process.
Fourth, process the data -- including doing validation checks to load it into your data model.
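As a rough sketch of steps three and four, assume one of the monthly CSVs is called topic_a.csv and has just two attributes; all table, column, and path names below are made up for illustration.

-- Staging table mirrors the raw CSV layout.
CREATE TABLE staging_topic_a (
    registration_number text,
    attribute_1         text,
    attribute_2         text
);

-- Load the raw file (COPY reads a server-side path; use \copy in psql for a client-side file).
COPY staging_topic_a FROM '/archive/2024-01/topic_a.csv' WITH (FORMAT csv, HEADER true);

-- Validate and move into the modeled table, keeping history by load month.
INSERT INTO topic_a_history (registration_number, attribute_1, attribute_2, load_month)
SELECT registration_number, attribute_1, attribute_2, DATE '2024-01-01'
FROM staging_topic_a
WHERE registration_number IS NOT NULL;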
There may be tweaks to this process, based on questions such as:
Does the data need to be available 24/7 even during the upload process?
Does a validation failure in one part of the data prevent uploading any data?
Are SQL checks (referential integrity and check constraints) sufficient for validating the data?
Do you need to be able to "rollback" the system to any particular update?
These are just questions that can guide your implementation. They are not intended to be answered here.

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably be taken when using this field, especially in a production setting. Experimental fields could have bugs, etc. Things to be careful of, off the top of my head:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron; we create tables a year in advance, or on some projects 5 years in advance. You may wonder why and when we do so.
We do this when the schema is changed by us, the developers. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script deletes all the empty tables in the future and simply recreates them. We didn't complicate our lives with a cron, as we know the exact moment the schema changes - the deploy - and there is no disadvantage to creating tables that far in advance. On SaaS-based systems we also do this per tenant, when a user is created or closes their account.
This way we don't need a cron; we just need to know that the deploy has to perform this additional step when the schema changes.
As for not losing streaming inserts while you do maintenance on your tables, you need to address that in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd, that queues all the rows into a tube, with a worker that later pushes them to BigQuery. You may already have this to cover the case where the BigQuery API responds with an error and you need to retry; it's easy to do with a simple message queue. You would then rely on this retry phase whenever you stop or rename a table for a while. The streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not only helps you retry failed BigQuery calls, but also gives you a maintenance window.
You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created.
Very likely the App Engine app won't go over the free quotas, and it has a 99.95% uptime SLO, so the cron will practically never go down.
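If that job is allowed to run queries, the per-day table check can also be expressed as idempotent DDL in standard SQL; the project, dataset, table name, and schema below are assumptions for illustration.

-- Safe to run repeatedly; does nothing if the table already exists.
CREATE TABLE IF NOT EXISTS `my_project.my_dataset.events_20240131` (
  event_time  TIMESTAMP,
  metric      STRING,
  value       FLOAT64
);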

Amazon data warehouse architecture and design

I have some flight booking data in a SQL Server table, with bookings for passengers.
The query below shows all the tables involved, along with the joins:
"SELECT distinct * FROM
Booking B
JOIN BookingPassenger BP
ON B.BookingId = BP.BookingId
JOIN PassengerJourneyLeg PJL
ON PJL.PassengerId = BP.PassengerId
JOIN InventoryLeg IL
ON IL.InventoryLegId = PJL.InventoryLegId
join passengerjourneysegment ps
on ps.PassengerId= BP.PassengerId
WHERE IL.departuredate = '2014/03/26' and il.flightnumber = 123
AND B.CreatedDate < '2014/03/22'"
Now the Revenue department needs this data in a data warehouse so that they can compute a booking curve for each flight on any day, for all flights on any day, or within specific date ranges. Currently they do this via Excel, which fetches the data through SQL, but it is very time-consuming and does not give real-time data. Later they want to gather data from our corporate booking website and manage customer profiles in this data warehouse, which will be our main analytical platform. I am new to data warehousing and am learning and researching how to implement an effective data warehouse to meet their needs.
Can someone advise me on how I should collect the data? Should I upload it into DynamoDB or S3, and what is the best way to do that, both as a one-time job and as a recurring job?
The later aim of this data warehouse is to plot all information related to the PNR: flight revenue by day, by class, by subclass, by event, etc.
In a later phase, every time a user interacts with our website I want to store that in Redshift. So when should I write the files to S3 or DynamoDB, and how many? For example, if I write a file to S3 on each user event, I will end up with hundreds of files, which does not seem like a good solution. What about introducing RDS or DynamoDB to store each transaction? Or is it possible to let server log files capture the user interactions on the website and record the actual events (booking, cancellation, etc.) in RDS or DynamoDB?
What are the best practices? What might be the best design in my specific scenario, and can someone clarify how it could be implemented?
What are the best practices for having reports over 1-5 TB of data come back in a few minutes or seconds while avoiding duplication and latency?
Also, can someone suggest how to keep maintenance easy and costs effective, on par with the best solutions?
I would really appreciate any help, links, or suggestions on data warehousing and on the Amazon technologies (Redshift, S3, DynamoDB) specific to my requirements.
There are a lot of questions here, and I suspect some of them have already been answered given the elapsed time. Anyway, let me explain my thoughts on this.
Later they want to gather data from our corporate booking website and manage customer profiles in this data warehouse, which will be our main analytical platform
Establishing a staging-area database is a good idea - a "database for drafts". You can create simple tables to deal with this data.
Can someone advise me on how I should collect the data? Should I upload it into DynamoDB or S3, and what is the best way to do that, both as a one-time job and as a recurring job?
A good path is to use an ETL tool to collect the data. I like Pentaho CE and its PDI.
In a later phase, every time a user interacts with our website I want to store that in Redshift. So when should I write the files to S3 or DynamoDB, and how many? For example, if I write a file to S3 on each user event, I will end up with hundreds of files, which does not seem like a good solution. What about introducing RDS or DynamoDB to store each transaction? Or is it possible to let server log files capture the user interactions on the website and record the actual events (booking, cancellation, etc.) in RDS or DynamoDB?
I prefer the last idea. Store your server logs and, from time to time, copy them into the staging database, keeping as much log history in the staging area as you can for statistical purposes. In my opinion this is the most common practice.
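As a minimal sketch of that log-to-staging copy, assuming the logs are periodically dumped to S3 as CSV and the staging area is Redshift (the bucket, table, column, and IAM role names are made up for illustration):

-- Staging table for raw web events.
CREATE TABLE staging_web_events (
    event_time   timestamp,
    session_id   varchar(64),
    event_type   varchar(32),
    pnr          varchar(16)
);

-- Bulk-load a day's worth of log files from S3.
COPY staging_web_events
FROM 's3://my-booking-logs/web/2014/03/26/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
IGNOREHEADER 1;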
What are the best practices? What might be the best design in my specific scenario, and can someone clarify how it could be implemented?
Take a look at this answer in this discussion: https://stackoverflow.com/a/2015115/2249963 (I just didn't want to copy and paste it as if it were my answer, so let's just say I agree with it :) )
What are the best practices for having reports over 1-5 TB of data come back in a few minutes or seconds while avoiding duplication and latency?
There are many ways to do this; basically, duplication (pre-aggregation) is used, and latency is the counterpart of performance. You really need to decide how fresh the data has to be when you analyse it. The previous day, also called D-1, is the usual choice.
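For example, a nightly pre-aggregation supporting the booking-curve report could look roughly like this; the target table is an assumption, and the joins follow the query in the question:

-- How many bookings existed for each flight and departure date, per booking date.
INSERT INTO booking_curve_daily (flight_number, departure_date, booking_date, bookings)
SELECT IL.FlightNumber,
       IL.DepartureDate,
       CAST(B.CreatedDate AS date) AS booking_date,
       COUNT(DISTINCT B.BookingId) AS bookings
FROM Booking B
JOIN BookingPassenger BP ON B.BookingId = BP.BookingId
JOIN PassengerJourneyLeg PJL ON PJL.PassengerId = BP.PassengerId
JOIN InventoryLeg IL ON IL.InventoryLegId = PJL.InventoryLegId
GROUP BY IL.FlightNumber, IL.DepartureDate, CAST(B.CreatedDate AS date);

The booking curve for a given flight is then a running total over booking_date.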
Also, can someone suggest how to keep maintenance easy and costs effective, on par with the best solutions?
Do a good design of the model and keep things simple, as with everything in IT.
I hope this helps.

SQL: Joins vs Denormalization (lots of data)

I know variations of this question have been asked before, but my case may be a little different :-)
So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank. (These attributes are all integers, if it matters.)
I need to be able to quickly get answers to two queries:
get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
get the sum (and maybe also the average) of the values of events from users with a certain profile
Also, the data is generated by multiple customers, who, in turn, can have multiple source_ids.
Access pattern: the data will mostly be written by collector processes, but when queried (infrequently, by the web UI) it has to respond quickly.
I expect LOTS of data, certainly more than a single table or a single server can handle.
I am thinking about grouping events into separate tables per day (that is, 'events_20111011'). I also want to prefix the table name with the customer id and source id, so that data is isolated, can be trivially discarded (purging old data), and can be moved around relatively easily (distributing load to other machines).
This way, every such table will have a limited number of rows, let's say 10M tops.
So, the question is: what do I do with the users' attributes?
Option 1, normalized: store them in a separate table and reference them from the event tables.
(pro) No repetition of data.
(con) Joins, which are expensive (or so I've heard).
(con) This requires the user table and the event tables to be on the same server.
Option 2, redundant: store the user attributes in the event tables and index them.
(pro) Easier load balancing (self-contained tables can be moved around).
(pro) Simpler (faster?) queries.
(con) Lots of disk space and memory used for the repeated user attributes and the corresponding indexes.
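To make the two options concrete, here is a rough sketch of the first query ("number of events from males aged 18-25 from Moscow") under each option; the table and column names, and the integer encodings for gender and city, are illustrative.

-- Option 1 (normalized): join the per-day event table to the users table.
SELECT COUNT(*)     AS event_count,
       SUM(e.value) AS value_sum
FROM events_20111011 e
JOIN users u ON u.id = e.user_id
WHERE u.gender = 1              -- assumed encoding for male
  AND u.age BETWEEN 18 AND 25
  AND u.city_id = 1234;         -- assumed id for Moscow, Russia

-- Option 2 (redundant): the same filter runs on the event table alone, no join.
SELECT COUNT(*)   AS event_count,
       SUM(value) AS value_sum
FROM events_20111011
WHERE gender = 1
  AND age BETWEEN 18 AND 25
  AND city_id = 1234;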
Your design should be normalized; your physical schema may end up denormalized for performance reasons.
Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system is available for the kinds of queries that would put a heavy load on the transactional system.
Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.
In most cases nightly updates are fine for reporting systems, but it depends on your hours of operation and other factors what works best. I find most 8-5 businesses have more than enough time in the evening to update a reporting system.
Use an OLAP / data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't be on real-time data, but it is usually worth it for the performance trade-off.
Also, if you are using SQL Server Enterprise, I wouldn't roll my own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL Server that do that for you automatically.
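As a rough sketch of such an aggregate fact table (the names and grain are assumptions; it pre-computes exactly the two profile queries from the question):

-- One row per day / customer / source / user-profile combination.
CREATE TABLE fact_events_daily (
    event_date   date,
    customer_id  int,
    source_id    int,
    gender       int,
    age          int,
    city_id      int,
    country_id   int,
    event_count  bigint,
    value_sum    bigint
);

-- Refreshed nightly from the normalized tables (events and users are assumed names).
INSERT INTO fact_events_daily
SELECT CAST(e.created_at AS date),
       e.customer_id, e.source_id,
       u.gender, u.age, u.city_id, u.country_id,
       COUNT(*), SUM(e.value)
FROM events e
JOIN users u ON u.id = e.user_id
GROUP BY CAST(e.created_at AS date),
         e.customer_id, e.source_id,
         u.gender, u.age, u.city_id, u.country_id;

Both report queries then become simple sums over this much smaller table.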
Please normalize.
Use partitions and indexing to balance the load.