Does Google BigQuery use different nodes for storing data and for computing queries? I have seen that Amazon Redshift uses nodes that do both, while Snowflake has a patent-pending architecture that separates the storage and compute layers.
Thank you for the help.
Yes, the storage and compute layers are separate with Google BigQuery. You don't need to manage either layer with BigQuery since it is considered a serverless architecture. Storage automatically scales with the data you put into it, and compute automatically scales with the needs of the query.
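As a small illustration of that serverless model, here is a minimal sketch using the BigQuery Python client (the project ID is hypothetical; the public Shakespeare sample table is real): you only submit a query, and BigQuery allocates the compute behind the scenes.

```python
from google.cloud import bigquery

# No cluster to create or size: just a client bound to your project.
client = bigquery.Client(project="my-project")  # hypothetical project ID

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

# BigQuery provisions the compute for this job automatically and releases it afterwards.
for row in client.query(query).result():
    print(row.word, row.total)
```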
If you are interested in learning more about how BigQuery works, you can check out the Google Research paper on Dremel.
Related
We have large volumes (10 to 400 billion records) of raw data in BigQuery tables. We need to process this data into star schema tables (probably in a different BigQuery dataset) that can then be accessed by AtScale.
We need the pros and cons of the two options below:
1. Write complex SQL within BigQuery that reads data from the source dataset and loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with the BigQuery connectors on Dataproc and then load the data into the BigQuery target dataset.
Our transformations involve joining multiple tables at different granularities, using analytic functions to get the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite it in GCP (BigQuery or Dataproc).
I went successfully with option 1: BigQuery is very capable of running very complex transformations in SQL, and on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data back and forth out of BigQuery. When running BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all the raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.
Around a year ago I wrote a POC comparing Dataflow with a series of BigQuery SQL jobs orchestrated by a potens.io workflow, which allows parallelizing SQL at scale.
It took me a good month to write the Dataflow job in Java, with 200+ data points and complex transformations, and the debugging capability was terrible at the time.
It took a week to do the same with a series of SQL statements in potens.io, using a Cloud Function for Windowed Tables and parallelization with clustered transient tables.
I know there have been a bunch of improvements in Cloud Dataflow since then, but at the time the Dataflow job did fine only at the millions scale and never completed with billions of records as input (the main reason being the shuffle cardinality, which was just under billions of records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion records. The easy debugging and troubleshooting with potens.io helped a lot too.
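For reference, a minimal sketch (using the BigQuery Python client; all project, dataset and table names are hypothetical) of how an option 1 transformation can be run: the SQL joins the raw tables and writes the result straight into a target table in the dataset consumed by AtScale, so the data never leaves BigQuery.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical target table in the star schema dataset consumed by AtScale.
target = bigquery.TableReference.from_string("my-project.star_schema.fact_orders")

job_config = bigquery.QueryJobConfig(
    destination=target,
    write_disposition="WRITE_TRUNCATE",  # or WRITE_APPEND for incremental loads
)

sql = """
    SELECT
      o.order_id,
      o.order_date,
      c.customer_key,
      SUM(o.amount) AS total_amount
    FROM `my-project.raw.orders` AS o
    JOIN `my-project.raw.customers` AS c USING (customer_id)
    GROUP BY o.order_id, o.order_date, c.customer_key
"""

# The query runs entirely on BigQuery's compute layer and lands in the target dataset.
client.query(sql, job_config=job_config).result()
```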
Both BigQuery and Dataproc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformations would you like to do on your data?
Both tools can perform complex transformations, but consider that PySpark gives you the processing capability of a full programming language, while BigQuery gives you SQL transformations and some scripting structures. If SQL and simple scripting can handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or you think you'll need to build extra transformation features in the future, PySpark may be the better option. You can find the BigQuery scripting reference here.
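To make the PySpark route concrete, here is a minimal, hypothetical sketch using the spark-bigquery connector on Dataproc (all project, dataset, table and bucket names are placeholders): it reads two raw tables, joins them, and writes the result back to a BigQuery target dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-build").getOrCreate()

# Read raw tables through the spark-bigquery connector (names are hypothetical).
orders = (spark.read.format("bigquery")
          .option("table", "my-project.raw.orders")
          .load())
customers = (spark.read.format("bigquery")
             .option("table", "my-project.raw.customers")
             .load())

# Arbitrary example transformation: join and aggregate into a fact table.
fact = (orders.join(customers, "customer_id")
        .groupBy("customer_key", "order_date")
        .sum("amount")
        .withColumnRenamed("sum(amount)", "total_amount"))

# Write the result into the BigQuery target dataset via a staging GCS bucket.
(fact.write.format("bigquery")
 .option("table", "my-project.star_schema.fact_orders")
 .option("temporaryGcsBucket", "my-staging-bucket")
 .mode("overwrite")
 .save())
```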
Pricing
BigQuery and Dataproc have different pricing models. With BigQuery you need to be concerned about how much data your queries process, while with Dataproc you have to think about your cluster's size and VM configuration, how long the cluster runs, and some other settings. You can find the pricing reference for BigQuery here and for Dataproc here. You can also simulate the pricing in the Google Cloud Platform Pricing Calculator.
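Purely to illustrate the difference between the two models (every number below is a made-up placeholder, not a real price; use the pricing pages and the calculator linked above for actual figures):

```python
# All figures are hypothetical placeholders for illustration only.
BQ_ON_DEMAND_PER_TB = 5.0        # assumed cost per TB scanned by a query
DATAPROC_CLUSTER_PER_HOUR = 2.0  # assumed hourly cost of the whole cluster

tb_scanned_per_run = 3.0         # data each transformation run scans in BigQuery
cluster_hours_per_run = 4.0      # time the equivalent Spark job keeps the cluster busy

bigquery_cost = tb_scanned_per_run * BQ_ON_DEMAND_PER_TB            # pay per data scanned
dataproc_cost = cluster_hours_per_run * DATAPROC_CLUSTER_PER_HOUR   # pay per cluster time

print(f"BigQuery (on-demand): ${bigquery_cost:.2f} per run")
print(f"Dataproc:             ${dataproc_cost:.2f} per run")
```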
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope this information helps you.
BigQuery (BQ) has its own storage system, which is completely separate from Google Cloud Storage (GCS).
My question is: why doesn't BQ directly process data stored on GCS, the way Hadoop Hive does? What are the benefits and the necessity of this design?
That is because BigQuery uses a column-oriented storage system and has background processes that constantly check whether the data is stored in the optimal way. The data is therefore managed by BigQuery (which is why it has its own storage), and only the highest layer is exposed to the user.
See this article for more details:
When you load bits into BigQuery, the service takes on the full
responsibility of managing that data, and only exposing the logical
database primitives to you
BigQuery gains several benefits from having its own separate storage.
For one, BigQuery can constantly optimize the storage of its data by moving and reordering it on the disks it is stored on, and by adding more disks and repeating the process as the database grows larger and larger.
BigQuery also uses a separate compute layer to query the storage layer, allowing the storage layer to scale while requiring less overall hardware to run the queries. This gives BigQuery the ability to call on more processing power as it needs it, without leaving hardware sitting idle when no queries against a particular database are being executed.
For a more in-depth explanation of BigQuery's structure and optimizations, you can check out this article I wrote for The Data School.
Context:
My text input pipeline currently consists of two main parts:
I. Complex text preprocessing and exporting of tf.SequenceExamples to TFRecords (custom tokenization, vocabulary creation, statistics calculation, normalization and much more, over the full dataset as well as per individual example). This is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that does quite a bit of processing during training, too (string_split into characters, table lookups, bucketing, conditional filtering, etc.).
The original dataset is spread across multiple locations (BigQuery, GCS, RDS, ...).
Problem:
The problem is that as the production dataset grows rapidly (several terabytes), it is not feasible to recreate TFRecord files for each possible data configuration (part I has a lot of hyperparameters), as each would have an enormous size of hundreds of terabytes. Not to mention that tf.Dataset reading speed surprisingly slows down when the tf.SequenceExamples or TFRecords grow in size.
There are quite a few possible solutions:
Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark
tf.contrib.cloud.BigQueryReader
, but none of them seems to fully fulfill the following requirements:
Streaming and processing data on the fly from BigQuery, GCS, RDS, ..., as in part I.
Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II.
Fast and reliable for both training and inference.
(optional) Being able to pre-calculate some full pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case with Cloud Composer (the GCP-managed integration of Airflow).
Airflow provides operators that let you orchestrate a pipeline with a script.
In your case you can use the dataflow_operator to spin up the Dataflow job when you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the Python operator or the Bash operator to monitor the pipeline and pre-calculate statistics.
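A minimal DAG sketch of that layout (module paths and arguments follow the older contrib-era operators named above and may differ in newer Airflow releases; all table, bucket and pipeline names are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from airflow.operators.python_operator import PythonOperator


def compute_statistics(**context):
    # Hypothetical placeholder: pre-calculate full-pass statistics here.
    pass


with DAG("text_preprocessing",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:

    # Materialize the slice of raw data to process from BigQuery.
    extract = BigQueryOperator(
        task_id="extract_raw_text",
        sql="SELECT * FROM `my-project.raw.documents`",  # hypothetical table
        destination_dataset_table="my-project.staging.documents",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )

    # Launch the Dataflow (Apache Beam) preprocessing job that writes TFRecords.
    preprocess = DataFlowPythonOperator(
        task_id="preprocess_to_tfrecords",
        py_file="gs://my-bucket/pipelines/preprocess.py",  # hypothetical pipeline
        options={"input": "my-project.staging.documents",
                 "output": "gs://my-bucket/tfrecords/"},
    )

    stats = PythonOperator(
        task_id="compute_statistics",
        python_callable=compute_statistics,
        provide_context=True,
    )

    extract >> preprocess >> stats
```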
A simple question:
Is the data that is processed via Google BigQuery stored on Google Cloud Storage and just segmented for GBQ purposes, or does Google BigQuery have its own storage mechanism?
I'm trying to learn the architecture, and I see arrows pointing back and forth between the two, but it doesn't say where GBQ's storage sits in the architecture.
Thanks.
From BigQuery under the hood:
Colossus - Distributed Storage
BigQuery relies on Colossus, Google’s latest generation distributed
file system. Each Google datacenter has its own Colossus cluster, and
each Colossus cluster has enough disks to give every BigQuery user
thousands of dedicated disks at a time. Colossus also handles
replication, recovery (when disks crash) and distributed management
(so there is no single point of failure). Colossus is fast enough to
allow BigQuery to provide similar performance to many in-memory
databases, but leveraging much cheaper yet highly parallelized,
scalable, durable and performant infrastructure.
BigQuery leverages the ColumnIO columnar storage format and
compression algorithm to store data in Colossus in the most optimal
way for reading large amounts of structured data. Colossus allows
BigQuery users to scale to dozens of Petabytes in storage seamlessly,
without paying the penalty of attaching much more expensive compute
resources — typical with most traditional databases.
The part about ColumnIO is outdated (BigQuery now uses the Capacitor format), but the rest is still relevant.
I want to store product images in a Google BigQuery database so that I can display these images in my reports.
Is there any way to achieve it?
You can store binary data in BigQuery by base64-encoding it and storing it as a string.
That said, make sure you consider the caveats in Incognito's answer. Storing images in BigQuery may not be the most efficient use of your query dollar, since you'll pay to query the entire column, which will likely contain a large amount of data and will therefore make your query much more expensive. You might consider Google Cloud Storage or Google Cloud Datastore as better alternatives for storing binary data for direct lookup.
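A minimal sketch of the base64-as-string approach using the google-cloud-bigquery client (the table and column names are hypothetical, and the target table is assumed to already exist with STRING columns):

```python
import base64

from google.cloud import bigquery

client = bigquery.Client()

# Read and base64-encode the image so it can be stored as a STRING value.
with open("product.png", "rb") as image_file:
    encoded = base64.b64encode(image_file.read()).decode("ascii")

rows = [{"product_id": "sku-123", "image_b64": encoded}]

# Stream the row into a hypothetical table with (product_id STRING, image_b64 STRING).
errors = client.insert_rows_json("my-project.catalog.product_images", rows)
if errors:
    print("Insert failed:", errors)
```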
I don't think it is possible to store images in Google storage in a way that can be processed by Google BigQuery.
Also, as per the official Google BigQuery documentation, BigQuery supports the CSV and JSON data formats, and the supported data types include:
Data types
Your data can include strings, integers, floats, booleans,
nested/repeated records and timestamps
Also, since BigQuery is for analyzing huge datasets, why would you store anything other than the data you actually want to process and harness for gathering more information? Plus, you also pay for storage, and storing images obviously takes a lot of space on the server, so cutting them out would save costs too.