Amazon data warehouse architecture and design - amazon-s3

I have some flight booking data in SQL Server tables, with bookings for passengers.
The query below shows all the tables involved, along with the joins:
"SELECT distinct * FROM
Booking B
JOIN BookingPassenger BP
ON B.BookingId = BP.BookingId
JOIN PassengerJourneyLeg PJL
ON PJL.PassengerId = BP.PassengerId
JOIN InventoryLeg IL
ON IL.InventoryLegId = PJL.InventoryLegId
join passengerjourneysegment ps
on ps.PassengerId= BP.PassengerId
WHERE IL.departuredate = '2014/03/26' and il.flightnumber = 123
AND B.CreatedDate < '2014/03/22'"
Now the Revenue department needs this data to be put into a data warehouse so that they can compute a booking curve for each flight on any day, for all flights on any day, or within specific date ranges. Currently they do this via Excel, which fetches the data through SQL, but it is very time consuming and does not give real-time data. Later they want to gather data from our corporate booking website and manage customer profiles in this data warehouse, which will be our main analytical platform. I am new to data warehousing and am learning and researching how to implement an effective data warehouse to meet their needs.
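To make the requirement concrete: the booking curve is essentially cumulative bookings counted by days before departure. Against the tables above it comes down to something like this rough sketch (simplified, and assuming B.CreatedDate and IL.DepartureDate are the relevant dates):

    SELECT IL.FlightNumber,
           IL.DepartureDate,
           DATEDIFF(day, B.CreatedDate, IL.DepartureDate) AS DaysBeforeDeparture,
           COUNT(DISTINCT BP.PassengerId)                 AS PassengersBooked
    FROM Booking B
    JOIN BookingPassenger BP     ON B.BookingId = BP.BookingId
    JOIN PassengerJourneyLeg PJL ON PJL.PassengerId = BP.PassengerId
    JOIN InventoryLeg IL         ON IL.InventoryLegId = PJL.InventoryLegId
    WHERE IL.FlightNumber = 123
      AND IL.DepartureDate = '2014/03/26'
    GROUP BY IL.FlightNumber, IL.DepartureDate,
             DATEDIFF(day, B.CreatedDate, IL.DepartureDate)
    ORDER BY DaysBeforeDeparture DESC;
    -- A running SUM() OVER (ORDER BY DaysBeforeDeparture DESC) on top of this
    -- gives the cumulative curve.

This is the kind of query the warehouse needs to answer quickly for any flight or date range.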
Can someone help me with how I should collect the data? Should I upload it into DynamoDB or S3, and what is the best way to do that, both as a one-time job and as a recurring job?
The later aim of this data warehouse will be to plot all the information related to the PNR: flight revenue by day, by class, by subclass, by event, etc.
In a later phase, every time a user interacts with our website, I want to store that in Redshift. So when should I write the files to S3 or DynamoDB, and how many? For example, if I write a file to S3 on each user event, I will end up with hundreds of files, which does not seem like a good solution. What about introducing RDS or DynamoDB to store each transaction? Or is it possible to let the server log files capture the user interactions on the website, and have each event (booking, cancellation, etc.) recorded in RDS or DynamoDB?
What are the best practices? What might be the best design in my specific scenario? Also, could someone please give more clarity on how that can be implemented?
What are the best practices to have reports over 1-5 TB of data come back in a few minutes or seconds, while avoiding duplication and latency?
Also, can someone suggest how to keep maintenance easy and costs down, while staying on par with the best solutions?
I will really appreciate any help, links, or suggestions on data warehousing and the Amazon technologies (Redshift, S3, DynamoDB) relevant to my requirements.

There are a lot of questions here; I suspect some of them have already been answered given the elapsed time. Anyway, let me explain my thoughts.
Later they want to gather data from our corporate booking website and manage customer profiles in this data warehouse, which will be our main analytical platform
Establishing a staging-area database is a good idea, a "database for drafts". You can create simple tables to deal with this data.
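For example, a staging table for the booking extract above might look roughly like this (names and types are only illustrative):

    -- A "database for drafts": raw extracts land here before being reshaped
    -- into the warehouse's fact and dimension tables.
    CREATE TABLE staging_booking_extract (
        booking_id       BIGINT,
        passenger_id     BIGINT,
        inventory_leg_id BIGINT,
        flight_number    INT,
        departure_date   DATE,
        booking_created  TIMESTAMP,
        extracted_at     TIMESTAMP   -- when this row was pulled from the source
    );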
Can someone help me with how I should collect the data? Should I upload it into DynamoDB or S3, and what is the best way to do that, both as a one-time job and as a recurring job?
A good path is to use an ETL tool to collect the data. I like Pentaho CE and its PDI.
In a later phase, every time a user interacts with our website, I want to store that in Redshift. So when should I write the files to S3 or DynamoDB, and how many? For example, if I write a file to S3 on each user event, I will end up with hundreds of files, which does not seem like a good solution. What about introducing RDS or DynamoDB to store each transaction? Or is it possible to let the server log files capture the user interactions on the website, and have each event (booking, cancellation, etc.) recorded in RDS or DynamoDB?
I prefer the last idea. Store your server logs and, from time to time, copy them to the staging database, keeping as much log history as you can in the staging area for statistical purposes. In my opinion this is the most common practice.
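If the staging area lives in Redshift and the web server logs are shipped to S3, that periodic copy can be a scheduled COPY job; a sketch, with the bucket, IAM role, and table name as placeholders:

    -- Recurring job: bulk-load the latest batch of gzipped JSON log files from S3.
    COPY staging_web_events
    FROM 's3://my-booking-logs/2014/03/26/'                       -- placeholder bucket/prefix
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'    -- placeholder role
    FORMAT AS JSON 'auto'
    GZIP
    TIMEFORMAT 'auto';

Loading many files in one COPY like this also avoids the one-file-per-event problem mentioned in the question.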
What are the best practices? What might be the best design in my specific scenario? Also, could someone please give more clarity on how that can be implemented?
Take a look at this answer in this discussion: https://stackoverflow.com/a/2015115/2249963 (I just didn't want to copy and paste it as if it were my answer; I agree with it :) )
What are the best practices to have reports over 1-5 TB of data come back in a few minutes or seconds, while avoiding duplication and latency?
There are many ways to do this; basically, duplication is used, and latency is the trade-off for performance. You really need to decide how fresh the data has to be for your analysis. The previous day, also called D-1, is usual.
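As a concrete illustration of that duplication: a D-1 batch can rebuild a small summary table that the reports read instead of scanning the raw bookings. A sketch, where fact_booking and its columns are only illustrative names:

    -- Rebuilt nightly by the D-1 batch; reports query this instead of the detail rows.
    DROP TABLE IF EXISTS rpt_bookings_by_flight_day;
    CREATE TABLE rpt_bookings_by_flight_day AS
    SELECT flight_number,
           departure_date,
           booking_date,
           COUNT(*)  AS bookings,
           SUM(fare) AS revenue
    FROM fact_booking
    GROUP BY flight_number, departure_date, booking_date;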
Also, can someone suggest how to keep maintenance easy and costs down, while staying on par with the best solutions?
Do a good design for the model and keep things simple, as with everything in IT.
I hope this helps.

Related

What is the best way to structure this database?

So I am in the process of building a database from my client's data. Each month they create roughly 25 CSVs, which are unique in their topic and attributes, but they all have one thing in common: a registration number.
The registration number is the only common variable across all of these CSVs.
My task is to move all of this into a database, for which I am leaning towards Postgres (if anyone believes NoSQL would be better for this, please shout out!).
The big problem: structuring this within a database. Should I create one table per month that houses all the data, with column 1 being the registration number and columns 2-200 being the attributes? Or should I put all the CSVs into Postgres as they are and then join them later?
I'm struggling to get my head around how to structure this when there will be monthly updates to every registration, and we don't want to destroy historical data - we want to keep it for future benchmarks.
I hope this makes sense - I welcome all suggestions!
Thank you.
In some ways your question is too broad and asks for an opinion (SQL vs. NoSQL).
However, the gist of the question is whether you should load your data as one table per month or into a well-developed data model. Definitely the latter.
My recommendation is the following.
First, design the data model around how the data needs to be stored in the database, rather than how it is being provided. There may be one table per CSV file. I would be a bit surprised, though. Data often wants to be restructured.
Second, design the archive framework for the CSV files.
You should archive all the incoming files in a nice directory structure with files from each month. This structure should be able to accommodate multiple uploads per month, either for all the files or some of them. Mistakes happen and you want to be sure the input data is available.
Third, COPY (this is the Postgres command) the data into staging tables; this is the beginning of the monthly process (see the sketch after this list).
Fourth, process the data -- including doing validation checks to load it into your data model.
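A minimal sketch of steps three and four in Postgres (the file path, table, and column names are illustrative):

    -- Step three: bulk-load one monthly CSV into a staging table.
    COPY staging_topic_a (registration_number, attr_1, attr_2)
    FROM '/archive/2014-01/topic_a.csv'
    WITH (FORMAT csv, HEADER true);

    -- Step four: validate and move the rows into the modeled table, keeping history.
    INSERT INTO topic_a_history (registration_number, attr_1, attr_2, load_month)
    SELECT registration_number, attr_1, attr_2, DATE '2014-01-01'
    FROM staging_topic_a
    WHERE registration_number IS NOT NULL;   -- example validation check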
There may be tweaks to this process, based on questions such as:
Does the data need to be available 24/7 even during the upload process?
Does a validation failure in one part of the data prevent uploading any data?
Are SQL checks (referential integrity and check constraints) sufficient for validating the data?
Do you need to be able to "rollback" the system to any particular update?
These are just questions that can guide your implementation. They are not intended to be answered here.

Google CloudSQL or BigQuery for Big Data Actively Updated Every Second

So now I'm currently using Google CloudSQL for my needs.
I'm collecting data from user activities. Every day the number of rows in my table increases by around 9-15 million, and the table is updated every second. The data includes several main parameters like user location (latitude, longitude), timestamp, user activities, conversations and more.
I need to constantly pull a lot of insights from these user activities, like "how many users between latitude-longitude A and latitude-longitude B used my app per hour over the last 30 days?".
Because my table grows bigger every day, it's hard to manage the performance of SELECT queries on it. (I have already added indexes, especially for the most commonly used parameters.)
All my inserts, selects, updates and so on are executed from an API that I wrote in PHP.
So my question is: would I get much more benefit if I used Google BigQuery for my needs?
If yes, how can I do this? Isn't Google BigQuery (forgive me if I'm wrong) designed to be used for static data rather than constantly updated data? How can I connect my CloudSQL data to BigQuery in real time?
Which is better: optimizing my table in CloudSQL to speed up the SELECT queries, or using BigQuery (if possible)?
I'm also open to other alternatives or suggestions to optimize my CloudSQL performance :)
Thank you
Sounds like BigQuery would be far better suited to your use case. I can think of a good solution:
Migrate existing data from CloudSQL to BigQuery.
Stream events directly to BigQuery (using an async queue).
Use time-partitioned tables in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.
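A sketch of what that could look like in BigQuery standard SQL; the dataset, table, and column names are assumptions based on the fields described in the question:

    -- Time-partitioned events table; streamed rows land in the current day's partition.
    CREATE TABLE mydataset.user_events (
      user_id    STRING,
      event_time TIMESTAMP,
      latitude   FLOAT64,
      longitude  FLOAT64,
      activity   STRING
    )
    PARTITION BY DATE(event_time);

    -- "How many users in a bounding box used the app, per hour, over the last 30 days?"
    SELECT TIMESTAMP_TRUNC(event_time, HOUR) AS hour,
           COUNT(DISTINCT user_id)           AS users
    FROM mydataset.user_events
    WHERE DATE(event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
      AND latitude  BETWEEN 40.70 AND 40.80      -- bounding box corners A and B (placeholders)
      AND longitude BETWEEN -74.05 AND -73.95
    GROUP BY hour
    ORDER BY hour;

The filter on the partitioning column keeps each query scanning only the last 30 days of partitions rather than the whole table.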

Options for replicating a large amount of data to many subscribers

I'm using transactional replication and have run into a problem. The problem is that there's a table that stores Inventory QTY Per Site data, and it's huge; it represents 90% of the total data. Each subscriber is located at a different site, and really only needs the inventory quantities for the site it's located at. The initial snapshot takes a really long time because of this big table, so I'm looking for some solutions.
Here are some possible solutions I've thought of so far:
Initialize snapshots from a backup. This would be perfect, except for when a subscriber needs to be reinitialized. The sites are in a country that has unreliable internet - so the chance of a subscriber going offline for a long period is pretty high.
Compress snapshot - according to the documentation they recommend a reliable network, which isn't available.
Filter the big table by site - in other words, only copy the data for the subscriber's site. I looked this up and apparently this is only available for merge replication? Maybe switching to merge replication would be a better option so that I can use a parameterized row filter (see the sketch after this list)?
Don't replicate the big table; copy it with a job instead - on demand or based on a schedule. However, I don't think there's a way to enable the table for transactional replication after copying the data like this, so it would get out of sync until you ran the job again.
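From what I've read, a parameterized row filter in merge replication is a per-subscriber filter clause on the article, usually based on HOST_NAME(); something like this (the publication name and the SiteCode column are made up):

    -- Each subscriber would only receive the rows for its own site.
    EXEC sp_addmergearticle
        @publication         = N'InventoryPub',               -- made-up publication name
        @article             = N'InventoryQtyPerSite',        -- the big table
        @source_object       = N'InventoryQtyPerSite',
        @subset_filterclause = N'[SiteCode] = HOST_NAME()';   -- assumed site column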
Is anyone aware of other possible solutions to this? I'm open to any suggestions, even switching to merge replication if that would solve the problem.

Data warehouse data update strategy with BigQuery

We have an MIS that stores all the information about customers, accounts, transactions, etc. We are building a data warehouse with BigQuery.
I am pretty new to this topic. Should we
1. every day extract ALL customers' latest information and append it to a BigQuery table with a timestamp, or
2. only extract the information for customers that was updated on that day?
The first solution uses a lot of storage, takes time to upload the data, and produces lots of duplicates, but it is very easy for me to query. For the second solution, given a specific date, how can I get the latest record as of that day?
It's similar for account data. Here is an example of a simplified Account table, with only four fields:
AccountId, CustomerId, AccountBalance, Date
If I need to build a report or chart of a group of customers' AccountBalance every day, I need to know the balance of each account on every specific date. So should I extract each account record every day, even if it's the same as the previous day, or should I only extract an account when its balance has changed?
What is the best solution, or what would you suggest? I prefer the second one because there are no duplicates, but how would I construct the query in BigQuery, and will performance be an issue?
What else should I consider? Any recommendation for me to read?
When designing a DWH you need to start from the business questions and translate them into KPIs, measures, dimensions, etc.
When you have those in place...
you choose the technology based on some of the following questions (and many more):
Who are your users? At what frequency and resolution do they consume the data? What are your data sources? Are they structured? What are the data volumes? What is your data quality like? How often does your data structure change? Etc.
When choosing the technology you need to think of the following: ETL, DB, scheduling, backup, UI, permissions management, etc.
After you have all of those defined, the data schema design is pretty straightforward and is derived from the purpose of the DWH and your technology's limits.
You have pointed out some of the points to consider, but the answer depends on your needs and is not tied to a specific DB technology.
I am afraid your question is too general to be answered without deep understanding of your needs.
Referring to your comment below:
How reliable is your source data? Are you interested in analyzing trends or just snapshots? Does your source system allow "select all" operations? What are the data volumes? What resources does your source allow for extraction (locks, bandwidth, etc.)?
If you just need a daily snapshot of the current balance, and there are no limits imposed by your source system, it would be much simpler to run a daily snapshot.
This way you don't need to manage "increments", handle data integrity issues, system discrepancies, etc. However, this approach might have an undesired impact on your source system and on your network costs.
If you do have resource limits and you choose the incremental ETL approach, you can either:
create a "changes log" table and query it, using row_number() to find the latest record per account (see the sketch below), or
construct a copy of the source Accounts table, merging the changes into the existing table every day.
Each approach has its own trade-offs in simplicity, cost, and resource consumption.
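For the changes-log option, the latest record per account as of a given date could be pulled with something like this in BigQuery (using the simplified Account fields above; the table name is illustrative):

    -- Latest balance per account as of a chosen date, from a change-log table
    -- that only stores a row on the days when the balance changed.
    SELECT AccountId, CustomerId, AccountBalance, Date
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY AccountId ORDER BY Date DESC) AS rn
      FROM mydataset.AccountChanges
      WHERE Date <= DATE '2014-03-26'
    ) AS t
    WHERE rn = 1;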
Hope this helps

What database solution would you suggest for competitive online ticket sales

Can you please give me a database design suggestion?
I want to sell tickets for events, but the problem is that the database can become a bottleneck when many users want to buy tickets for the same event simultaneously.
If I keep a counter of tickets left for each event, there will be many updates on this field (locking), but I can easily find out how many tickets are left.
If I generate the tickets for each event in advance, it will be hard to know how many tickets are left.
Maybe it would be better if each event used a separate database (if the requests for that event are expected to be high)?
Maybe reservations should also be an asynchronous operation?
Do I have to use a relational database (MySQL, Postgres) or a non-relational database (MongoDB)?
I'm planning to use AWS EC2 servers, so I can run more servers if I need them.
I heard that "relational databases don't scale", but I think I need one because of the transactions and data consistency I will need when working with a fixed number of tickets. Am I right or not?
Do you know of any resources on the internet for these kinds of topics?
If you sell 100,000 tickets in 5 minutes, you need a database that can handle at least 333 transactions per second. Almost any RDBMS on recent hardware can handle this amount of traffic.
Unless you have a suboptimal database schema and/or SQL, but that's another problem.
First things first: when it comes to selling stuff (e-commerce), you really do need transactional support. This basically excludes NoSQL solutions like MongoDB or Cassandra.
So you must use a database that supports transactions. MySQL does, but not in every storage engine: make sure to use InnoDB and not MyISAM.
Of course many popular databases support transactions, so it's up to you which one to choose.
Why transactions? Because you need to complete a bunch of database updates and you must be sure that they all succeed as one atomic operation. For example:
1) Make sure a ticket is available.
2) Reduce the number of available tickets by one.
3) Process the credit card and get approval.
4) Record the purchase details in the database.
If any of these operations fails, you must roll back the previous updates. For example, if the credit card is declined, you should roll back the decrement of available tickets.
And the database will lock the relevant rows for you, so there is no chance that, between step 1 and step 2, someone else tries to purchase a ticket while the count of available tickets has not yet been decreased. Without that lock it would be possible for only one ticket to be left but sold to two people, because the second purchase started between step 1 and step 2 of the first transaction.
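A minimal sketch of steps 1-4 with row locking in MySQL/InnoDB (table and column names are illustrative; the credit-card call happens in application code at the step marked below):

    START TRANSACTION;

    -- Step 1: lock the event row and check availability.
    SELECT tickets_left
    FROM events
    WHERE event_id = 42
    FOR UPDATE;                 -- blocks competing purchases of this event's tickets

    -- Step 2: reserve one ticket (the application first checks tickets_left > 0).
    UPDATE events
    SET tickets_left = tickets_left - 1
    WHERE event_id = 42;

    -- Step 3: charge the card in application code; if it is declined, ROLLBACK.

    -- Step 4: record the purchase.
    INSERT INTO purchase (event_id, customer_id, amount)
    VALUES (42, 1001, 59.00);

    COMMIT;   -- or ROLLBACK on any failure, which undoes the decrement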
It's essential that you understand this before you start programming an e-commerce project.
Check out this question regarding releasing inventory.
I don't think you'll run into the limits of a relational database system. You need one that handles transactions, however. As I recommended to the poster in the referenced question, you should be able to handle reserved tickets that affect inventory vs tickets on orders where the purchaser bails before the transaction is completed.
Your question seems broader than database design.
First of all, a relational database will scale perfectly well for this. You may need to consider a web services layer which will provide the actual ticket brokering to the end users. Here you will be able to manage things in a cached manner, independent of the actual database design. However, you need to think through the appropriate steps for data insertion, update, and select in order to optimize your performance.
The first step would be to construct a well-normalized relational model to hold your information.
Second, build a web service interface to interact with the data model.
Then put that behind a user interface and stress test it with many simultaneous transactions.
My bet is that you will then need to rework your web services layer iteratively until you are happy, but your (well-normalized) database will not be causing you any bottleneck issues.