Google Cloud Platform architecture - google-bigquery

A simple question:
Is the data processed by Google BigQuery stored on Google Cloud Storage and merely segmented for BigQuery's purposes, or does BigQuery have its own storage mechanism?
I'm trying to learn the architecture. The diagrams show arrows pointing back and forth between the services, but they don't say where BigQuery's own storage sits in the architecture.
Thanks.

From BigQuery under the hood:
Colossus - Distributed Storage
BigQuery relies on Colossus, Google’s latest generation distributed
file system. Each Google datacenter has its own Colossus cluster, and
each Colossus cluster has enough disks to give every BigQuery user
thousands of dedicated disks at a time. Colossus also handles
replication, recovery (when disks crash) and distributed management
(so there is no single point of failure). Colossus is fast enough to
allow BigQuery to provide similar performance to many in-memory
databases, but leveraging much cheaper yet highly parallelized,
scalable, durable and performant infrastructure.
BigQuery leverages the ColumnIO columnar storage format and
compression algorithm to store data in Colossus in the most optimal
way for reading large amounts of structured data. Colossus allows
BigQuery users to scale to dozens of Petabytes in storage seamlessly,
without paying the penalty of attaching much more expensive compute
resources — typical with most traditional databases.
The part about ColumnIO is outdated (BigQuery now uses the Capacitor format), but the rest is still relevant.

Related

cloud vs HDFS for Big Data staging area

What are the advantages and disadvantages of using HDFS as a staging area for roughly 20 TB of big data?
Which is the better persistent staging layer: should we use HDFS, or should we opt for S3 in the cloud? Kindly share your expertise.
My findings:
HDFS was designed to store massive amounts of data and support batch workloads (OLAP), whereas Cassandra was designed for online transactional use cases (OLTP).
The current recommendation for server density is 1 TB/node on spinning disks and 3 TB/node when using SSDs.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density, and there are a few JIRA tickets aimed at improving server density further.
There is currently a limit on server density in Cassandra because of:
Repair. With an eventually consistent database, repair is mandatory to re-sync data after failures. The more data you have on one server, the longer repair takes (more precisely, the longer it takes to compute the Merkle tree, a binary tree of digests). This issue is mostly solved by incremental repair, introduced in Cassandra 2.1.
Compaction. With an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of obsolete or deleted data. The more data you have on one node, the longer compaction takes. There are some solutions to address this, mainly the newer DateTieredCompactionStrategy, which has tuning knobs to stop compacting data after a time threshold (see the sketch after this list). A few people run DateTiered compaction in production with densities up to 10 TB/node.
Node rebuild. If one node crashes and is completely lost, you'll need to rebuild it by streaming data from the other replicas. The higher the node density, the longer the rebuild takes.
Load distribution. The more data you have on a node, the higher the load average (heavy disk I/O and CPU usage). This greatly impacts node latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch job that takes 10 hours to complete, it is critical for a real-time database/application subject to a tight SLA.
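Regarding the compaction point above, here is a minimal sketch of what tuning DateTieredCompactionStrategy looks like, using the Python cassandra-driver. The keyspace, table, and threshold values are made up; the option names follow the Cassandra 2.1/3.x documentation.

```python
# Hypothetical sketch: tuning DateTieredCompactionStrategy on a time-series table.
# Keyspace/table names and threshold values are illustrative only.
# base_time_seconds sizes the initial time window; max_sstable_age_days stops
# compacting SSTables older than that many days.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("staging_ks")  # assumes this keyspace already exists

session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_events (
        sensor_id text,
        event_time timestamp,
        payload blob,
        PRIMARY KEY (sensor_id, event_time)
    ) WITH compaction = {
        'class': 'DateTieredCompactionStrategy',
        'base_time_seconds': '3600',
        'max_sstable_age_days': '10'
    }
""")

cluster.shutdown()
```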
My doubt: is S3 better, or HDFS?
Firstly, I think you are mistaking Cassandra for an HDFS-like system, which it is not. I also don't think you should be comparing Cassandra and HDFS at all; they serve opposite use cases.
Cassandra is used when you have a high throughput of writes and comparatively limited reads. It is very difficult to run map-reduce operations on Cassandra, as you are limited by partition and clustering keys.
HDFS is mainly used for map-reduce jobs where you upload files in a pre-defined format and want to run analytical queries on any column, whether or not it is a partitioning key.
Regarding the doubt "is S3 better, or HDFS?":
S3 is essentially an HDFS-like storage system hosted in the cloud, so I read this as asking whether cloud-hosted or local HDFS is better. It depends on your use case, but S3 gives you the advantage of almost infinite scalability. If your data is in S3 you can use AWS EMR to run your map-reduce jobs, and EMR provides a high level of monitoring. These things are harder to do if you are running a local HDFS cluster.
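To illustrate the staging workflow, here is a minimal sketch using boto3; the bucket, key, and local path are made-up placeholders, and an EMR/Spark job would later read the staged files via an s3:// path.

```python
# Hypothetical sketch: staging a local extract in S3 so an EMR/Spark job can read it.
# Bucket name, key prefix, and local path are placeholders, not real resources.
import boto3

s3 = boto3.client("s3")

BUCKET = "my-staging-bucket"          # assumed to already exist
KEY = "staging/2024/01/orders.parquet"

# Multipart upload is handled automatically for large files.
s3.upload_file("/data/exports/orders.parquet", BUCKET, KEY)

# An EMR/Spark job could then read the staged data directly, e.g.:
#   spark.read.parquet("s3://my-staging-bucket/staging/2024/01/")
print(f"Staged s3://{BUCKET}/{KEY}")
```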
Here is a good tutorial that you should read.

Why does BigQuery have its own storage?

BigQuery (BQ) has its own storage system, which is completely separate from Google Cloud Storage (GCS).
My question is: why doesn't BQ directly process data stored on GCS, like Hive does on Hadoop? What is the benefit and necessity of this design?
That is because BigQuery uses a column-oriented storage format and has background processes that constantly check whether the data is stored in the optimal way. The data is therefore managed by BigQuery (that's why it has its own storage), and only the highest layer is exposed to the user.
See this article for more details:
When you load bits into BigQuery, the service takes on the full
responsibility of managing that data, and only exposing the logical
database primitives to you
BigQuery gains several benefits from having its own separate storage.
For one, BigQuery is able to constantly optimize the storage of its data by moving and reordering it on the disks it is stored on, and by adding more disks and repeating the process as the database grows.
BigQuery also uses a separate compute layer to query the storage layer, allowing storage to scale while requiring less overall hardware to run the queries. This lets BigQuery call on more processing power as it needs it, without leaving hardware idle when no queries against a particular database are being executed.
For a more in-depth explanation of BigQuery's structure and optimizations, you can check out this article I wrote for The Data School.
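To make the storage/compute split concrete, here is a minimal sketch using the google-cloud-bigquery Python client that contrasts loading data into BigQuery's managed storage with querying the same files left in GCS through an external table; the project, dataset, table, and bucket names are all hypothetical.

```python
# Hypothetical sketch: BigQuery-managed storage vs. querying data left in GCS.
# Project/dataset/table/bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Path A: load the files into BigQuery's own managed storage.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.csv",
    "my-project.analytics.events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()  # data is now managed and re-optimized by BigQuery itself

# Path B: define an external table that reads the same GCS files at query time.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/events/*.csv"]
external_config.autodetect = True

table = bigquery.Table("my-project.analytics.events_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Queries against the native table benefit from BigQuery's own storage layout;
# queries against the external table scan the raw files sitting in GCS.
rows = client.query("SELECT COUNT(*) AS n FROM `my-project.analytics.events`").result()
print(list(rows))
```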

Hazelcast vs Redis vs S3

I am currently evaluating the fastest possible caching solution among the technologies in question. We know that Redis and Hazelcast are caching solutions by their very intent and definition (and there is a clear Stack Overflow thread on redis vs hazelcast), whereas AWS S3 may not be a caching solution, but it is nevertheless a storage and retrieval service, and it supports SQL-style queries as well, which in my opinion qualifies it for the race too. Considering this, are there any thoughts on comparing the three based on speed, data volumes, etc.?
Hazelcast also provides SQL-like capabilities: you can run queries to fetch data as a result set. Technology-wise, Hazelcast/Redis and S3 are fundamentally different: the latter is a disk-bound data store, and disk-bound stores are known to be significantly slower than their in-memory counterparts.
To put things in perspective: S3, or any other disk-bound data store, cannot match the performance of accessing data from an in-memory store.
However, it is also common practice to run Hazelcast on top of a disk-bound data store to get a performance boost. In that kind of architecture, your applications only ever interact with Hazelcast, and you can use Hazelcast's tooling to keep the cached data in sync with the underlying store.
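As an illustration of that layered pattern, here is a minimal cache-aside sketch using the hazelcast-python-client and boto3. Note that this is not Hazelcast's built-in MapStore write-through mechanism (which is configured on the member/Java side); the cluster address, bucket, and object key are hypothetical.

```python
# Hypothetical cache-aside sketch: serve from Hazelcast, fall back to S3 on a miss.
# Cluster address, bucket, and object key are placeholders.
import boto3
import hazelcast

hz = hazelcast.HazelcastClient(cluster_members=["127.0.0.1:5701"])
cache = hz.get_map("s3-object-cache").blocking()
s3 = boto3.client("s3")

BUCKET = "my-data-bucket"


def get_object(key: str) -> bytes:
    """Return the object body, preferring the in-memory cache over S3."""
    cached = cache.get(key)
    if cached is not None:
        return cached

    # Cache miss: read from the disk-bound store and populate the cache.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    cache.put(key, body)
    return body


if __name__ == "__main__":
    payload = get_object("reports/latest.json")
    print(len(payload), "bytes")
    hz.shutdown()
```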

Data Consolidation for ETL pipeline

I am currently planning to consolidate several data sources in one place for later analysis.
Currently I have data sources (databases) such as:
MSSQL
MySQL
MongoDB
Postgres
Cassandra will be used for analytics in a big data pipeline. What is the best way to migrate each of these sources to a Cassandra cluster?
I would highly recommend using NiFi for this use case. Some of the benefits I can outline right away:
Built-in "processors" are available for reading data from all the listed sources and writing to Cassandra (the sketch below shows the kind of hand-rolled code these processors replace).
Very high throughput with low latency.
Rapid data acquisition pipeline development without writing a lot of code.
Ability to do "Change Data Capture" very easily later in your project, if needed.
Provides a highly concurrent model without a developer having to worry about the typical complexities of concurrency.
Is inherently asynchronous, which allows for very high throughput and natural buffering even as processing and flow rates fluctuate.
The resource-constrained connections make critical functions such as back-pressure and pressure release very natural and intuitive.
The points at which data enters and exits the system, as well as how it flows through, are well understood and easily tracked.
And biggest of all, OPEN SOURCE.
You can refer to the Apache NiFi homepage for more information.
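For comparison, here is a rough sketch of the kind of hand-written extract-and-load code that NiFi processors such as ExecuteSQL and PutCassandraQL would replace for one of the listed sources; connection settings, table names, and columns are hypothetical.

```python
# Hypothetical sketch: moving rows from one source (Postgres) into Cassandra by hand.
# Connection settings, table names, and columns are placeholders.
import psycopg2
from cassandra.cluster import Cluster

# Source: Postgres
pg = psycopg2.connect(host="pg-host", dbname="sales", user="etl", password="secret")
pg_cur = pg.cursor()
pg_cur.execute("SELECT order_id, customer_id, total, created_at FROM orders")

# Destination: Cassandra
cluster = Cluster(["cassandra-host"])
session = cluster.connect("analytics")  # assumes keyspace and table already exist
insert = session.prepare(
    "INSERT INTO orders_by_customer (customer_id, created_at, order_id, total) "
    "VALUES (?, ?, ?, ?)"
)

while True:
    batch = pg_cur.fetchmany(1000)
    if not batch:
        break
    for order_id, customer_id, total, created_at in batch:
        session.execute(insert, (customer_id, created_at, order_id, total))

pg.close()
cluster.shutdown()
```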
Hope that helps!

Combining Relational and Document based "Databases"

I am developing a system that is all about media archiving, searching, uploading, distributing and thus about handling BLOBs.
I am currently trying to find the best way to handle the BLOBs. I have limited resources for high-end servers with lots of memory and huge disks, but I can access a large array of medium-performance off-the-shelf computers and hook them up to the Internet.
Therefore I decided not to store the BLOBs in a central relational database, because in the worst case I would end up with one very heavy database instance, possibly on a single average machine. Not an option.
Storing the BLOBs as files directly on the filesystem and keeping their paths in the database is also somewhat ugly, and distribution would have to be managed manually, with me keeping track of the different copies myself. I don't even want to get close to that.
I looked at CouchDB and I really like its peer-to-peer replication design. This would allow me to run a distributed cluster of machines across the Internet, which implies:
Low cost Hardware
Distribution for Redundancy and Failover out of the box
Lightweight REST Interface
So if I got it right, one could summarize it like this: a cloud-like API on a self-managed, distributed, replicated system.
The rest of the system does the normal stuff any average web application does: handling sessions, security, users, searching and the like. For this part I still want to use a relational data model. (CouchDB states that it is not a replacement for relational databases.)
So I would keep all the standard data, including the BLOBs' metadata, in the relational database, but the BLOBs themselves in CouchDB.
Do you see a problem with this approach? Am I missing something important? Can you think of better solutions?
Thank you!
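That split can be sketched quite directly. Below is a minimal, hypothetical example that stores each BLOB as a CouchDB attachment and its metadata in a relational table (SQLite here purely for brevity); the CouchDB URL, database name, and schema are assumptions, and a real setup would add authentication and error handling.

```python
# Hypothetical sketch: BLOB in CouchDB as an attachment, metadata in a relational table.
# The CouchDB URL, database name, and schema are placeholders.
import sqlite3
import uuid

import requests

COUCH = "http://localhost:5984/media"   # assumes the 'media' database exists

# Relational side: metadata only.
db = sqlite3.connect("metadata.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS media (
        id TEXT PRIMARY KEY,
        filename TEXT,
        content_type TEXT,
        size_bytes INTEGER
    )
""")


def store_blob(path: str, content_type: str) -> str:
    doc_id = uuid.uuid4().hex
    with open(path, "rb") as f:
        blob = f.read()

    # 1. Create the CouchDB document, then attach the BLOB to it.
    rev = requests.put(f"{COUCH}/{doc_id}", json={"filename": path}).json()["rev"]
    requests.put(
        f"{COUCH}/{doc_id}/content",
        params={"rev": rev},
        data=blob,
        headers={"Content-Type": content_type},
    ).raise_for_status()

    # 2. Record the metadata relationally.
    db.execute(
        "INSERT INTO media (id, filename, content_type, size_bytes) VALUES (?, ?, ?, ?)",
        (doc_id, path, content_type, len(blob)),
    )
    db.commit()
    return doc_id


if __name__ == "__main__":
    print(store_blob("holiday.jpg", "image/jpeg"))
```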
You could try Amazon's SimpleDB and S3 together with SimpleJPA. SimpleJPA is a JPA implementation on top of SimpleDB; it uses SimpleDB for the relational structure and S3 to store the BLOBs.
Take a look at MongoDB; it supports storing binary data in an efficient format and is incredibly fast.
No problem with that approach; I have done a design very similar to that one. You may also want to take a peek at HBase as an alternative to CouchDB, and at the Adaptive Object-Model architectural pattern as a way to manage your data and metadata.