BigQuery Datawarehouse design? - google-bigquery

In a typical HDFS environment for a data warehouse, I have seen several stages in which the data is staged and transformed, as below. I am trying to design a system on Google Cloud Platform where I can perform all these transformations. Please help.
HDFS::
Landing Zone -> Stage 1 Zone -> Stage 2 Zone
Landing Zone - holds the raw data
Stage 1 Zone - the raw data from the Landing Zone is transformed, converted to a different data format and/or denormalized, and stored in Stage 1
Stage 2 Zone - data from Stage 1 is updated on a transaction table, say HBase. If it is just time-period data, it stays in an HDFS-based Hive table
Then reporting happens from Stage 2 (there could also be multiple zones in between for transformation)
My thought process of implementation in Google Cloud::
Landing (Google Cloud Storage) -> Stage 1 (BigQuery - hosts all time-based data) -> Stage 2 (BigQuery for time-based data / Bigtable for transactional data accessed by key)
My questions are below:
a) Does this implementation look realistic? I am planning to use Dataflow to read and load data between these zones. What would be a better design, if anyone has implemented one before to build a warehouse?
b) How effective is it to use Dataflow to read from BigQuery and then update Bigtable? I have seen some Dataflow connectors for Bigtable updates here.
c) Can JSON be used as the primary data format, since BigQuery supports it?

There's a solution that may fit your scenario. I would load the data to Cloud Storage, read and transform it with Dataflow, then either write it back to Cloud Storage to be loaded into BigQuery afterwards and/or write directly to Bigtable with the Dataflow connector you mentioned.
As I mentioned, you could send your transformed data to both databases from Dataflow. Keep in mind that BigQuery and Bigtable are both good for analytics; however, Bigtable offers low-latency reads and writes, while BigQuery has higher latency because it runs query jobs to gather the data.
Yes, that's a good idea, as you can load your JSON data from Cloud Storage into BigQuery directly.
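As a quick illustration of point c): BigQuery's JSON load format is newline-delimited JSON (one object per line, no outer array), so records usually need a small serialization step before being staged on Cloud Storage. A minimal stdlib-only sketch:

```python
import io
import json

def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON, the format
    BigQuery expects for JSON load jobs (one object per line, no outer array)."""
    buf = io.StringIO()
    for record in records:
        buf.write(json.dumps(record, separators=(",", ":")))
        buf.write("\n")
    return buf.getvalue()

rows = [
    {"id": 1, "event": "login", "ts": "2020-01-01T00:00:00Z"},
    {"id": 2, "event": "click", "ts": "2020-01-01T00:00:05Z"},
]
payload = to_ndjson(rows)  # upload this payload to Cloud Storage, then load
```

The resulting file can then be loaded with a JSON load job, or produced per-element inside a Dataflow transform.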

Related

Load batch CSV Files from Cloud Storage to BigQuery and append on same table

I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped to the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage to BigQuery, with a scheduling option to load the recent files from Cloud Storage and append the data to the same table in BigQuery.
Please help me to set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery: you leave the data in Cloud Storage and ask BigQuery to read it directly from there. You don't duplicate the data (and pay less for storage), but queries are slower (they have to load the data from less performant storage and parse the CSV on the fly) and every query processes all the files. You also can't use BigQuery's advanced features such as partitioning, clustering and others.
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event driven: catch the event that notifies you of a new file on Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern. You can find a code sample here.
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about one per minute).
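A minimal sketch of that event-driven Cloud Function, assuming the google-cloud-bigquery client library; the dataset/table names and the "only load CSVs, append to one table" convention are hypothetical placeholders for your own setup:

```python
def load_request(bucket, name, dataset="raw_data", table="events"):
    """Turn a GCS finalize event's bucket/object pair into load parameters.
    Dataset and table names here are hypothetical examples."""
    if not name.endswith(".csv"):
        return None  # ignore non-CSV objects
    return {
        "source_uri": f"gs://{bucket}/{name}",
        "destination": f"{dataset}.{table}",
        "write_disposition": "WRITE_APPEND",  # append to the same table
    }

def gcs_event_handler(event, context):
    """Cloud Functions entry point for a google.storage.object.finalize event."""
    req = load_request(event["bucket"], event["name"])
    if req is None:
        return
    # Import lazily so the pure helper above can be tested without GCP libs.
    from google.cloud import bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,      # assumes the CSVs have a header row
        autodetect=True,          # or pass an explicit schema
        write_disposition=req["write_disposition"],
    )
    client.load_table_from_uri(
        req["source_uri"], req["destination"], job_config=job_config
    ).result()  # wait for the load job to finish
```

Each new object then triggers one load job, which is what makes the 1,500-loads-per-day-per-table quota worth keeping in mind.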

Is there (still) an advantage to staging data on Google Cloud Storage before loading into BigQuery?

I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest to first upload this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having the data in Cloud Storage is a big advantage during development. In my case I often create a BigQuery table from data in Cloud Storage multiple times until I have tuned everything: schema, model, partitioning, resolving errors, etc. It would be really time consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to remove the BQ table when not used and import it again when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you save your local storage
less likely to fail during table creation (loading from local storage can hit networking issues, computer issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch your data often, e.g. once per month, you can lower the price by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
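Regarding the 4 GB limit on a single compressed JSON load file mentioned in the question: one option is to pre-split the newline-delimited JSON before uploading to Cloud Storage, so each staged file stays under quota. A rough stdlib-only sketch (it budgets uncompressed bytes for simplicity, so a conservative target is advisable since the quota applies to compressed size):

```python
# BigQuery's limit for one compressed JSON load file is ~4 GB; use a smaller
# target in practice to leave headroom.
MAX_BYTES = 4 * 1024**3

def split_ndjson(lines, max_bytes=MAX_BYTES):
    """Group newline-delimited JSON lines into chunks, each under max_bytes.
    Returns a list of chunks, where each chunk is a list of lines."""
    chunks, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(current)
    return chunks

# Tiny demo with a shrunken budget: ten 8-byte lines, 20 bytes per chunk.
chunks = split_ndjson(['{"a":1}'] * 10, max_bytes=20)  # -> 5 chunks of 2 lines
```

Each chunk would then be compressed and uploaded as its own object, giving BigQuery several load files instead of one oversized one.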

Hive data transformations in real time?

I have the following data pipeline:
A process writes messages to Kafka
A Spark structured streaming application is listening for new Kafka messages and writes them as they are to HDFS
A batch Hive job runs on an hourly basis, reads the newly ingested messages from HDFS and, via some medium-complexity INSERT INTO statements, populates some tables (I do not have materialized views available). EDIT: Essentially, after my Hive job I have as a result Table1 storing the raw data, then another table Table2 = fun1(Table1), then Table3 = fun2(Table2), then Table4 = join(Table2, Table3), etc. Each fun is a selection or an aggregation.
A Tableau dashboard visualizes the data I wrote.
As you can see, step 3 makes my pipeline not real time.
What can you suggest in order to make my pipeline fully real time? EDIT: I'd like to have Table1, ..., TableN updated in real time!
Using Hive with Spark Streaming is not recommended at all, since the purpose of Spark Streaming is low latency, while Hive introduces the highest latency possible (OLAP): at the backend it executes MR/Tez jobs (depending on hive.execution.engine).
Recommendation: use Spark Streaming with a low-latency DB like HBase or Phoenix.
Solution: develop a Spark Streaming job with Kafka as the source and use a custom sink to write the data into HBase/Phoenix.
Introducing HDFS obviously isn't real time. MemSQL or Druid/Imply offer much more real-time ingestion from Kafka.
You need historical data to perform roll ups and aggregations. Tableau may cache datasets, but it doesn't store persistently itself. You therefore need some storage, and you've chosen to use HDFS rather than a database.
Note: Hive / Presto can read directly from Kafka, so you don't really even need Spark.
If you want to do rolling aggregates from Kafka and make them queryable, KSQL could be used instead, or you can write your own Kafka Streams solution.
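To make the idea of rolling aggregates concrete, here is a plain-Python simulation of the tumbling-window count per key that a KSQL windowed query or a Kafka Streams topology would maintain incrementally as events arrive (the event shape here is a hypothetical `(timestamp, key)` pair, not any library's API):

```python
from collections import defaultdict

def windowed_counts(events, window_seconds=60):
    """Tumbling-window count per key: each event lands in the window that
    starts at the nearest multiple of window_seconds at or before its
    timestamp. Events are (timestamp_seconds, key) pairs."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        counts[(key, window_start)] += 1
    return dict(counts)

events = [(0, "login"), (10, "login"), (65, "login"), (70, "click")]
result = windowed_counts(events)
# {('login', 0): 2, ('login', 60): 1, ('click', 60): 1}
```

A streaming engine keeps exactly this kind of state table updated per record, which is what makes the aggregates queryable in (near) real time instead of waiting for an hourly batch.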

How can I load data from BigQuery to Spanner?

I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programmatically? We already use Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it there, that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
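The step between reading BigQuery rows and writing Spanner mutations in such a pipeline is essentially a per-row reshape: Spanner's insert_or_update mutation takes parallel lists of columns and values. A hedged sketch of that transform (the key column name is an assumption; in Dataflow this would be the body of the map between the BigQuery read and the Spanner write):

```python
def to_spanner_mutation(row, key_column="user_id"):
    """Shape a BigQuery result row (a dict) into the (columns, values) pair
    a Cloud Spanner insert_or_update mutation expects. key_column is a
    hypothetical primary key; rows missing it are rejected early rather
    than failing inside the Spanner write."""
    if row.get(key_column) is None:
        raise ValueError(f"row is missing primary key {key_column!r}")
    columns = sorted(row)             # stable column order across rows
    values = [row[c] for c in columns]
    return columns, values

cols, vals = to_spanner_mutation({"user_id": 1, "score": 5})
```

For the "switchover" problem, one common workaround (since Spanner has no table rename) is to load into a versioned table name and have readers look up the current version, but that is an application-level convention, not a Spanner feature.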

ETL on Google Cloud - (Dataflow vs. Spring Batch) -> BigQuery

I am considering BigQuery for my data warehouse requirement. Right now, I have my data in Google Cloud (Cloud SQL and Bigtable). I have exposed REST APIs to retrieve data from both. Now, I would like to retrieve data from these APIs, do the ETL and load the data into BigQuery. I am evaluating two ETL options (a daily job over hourly data) right now:
Use Java Spring Batch to create a microservice, with Kubernetes as the deployment environment. Will it scale?
Use Cloud DataFlow for ETL
Then use BigQuery batch insert API (for initial load) and streaming insert API (for incremental load when new data available in source) to load BigQuery denormalized schema.
Please let me know your opinions.
Without knowing your data volumes (specifically how much new or diff data you have per day) and how you are doing paging with your REST APIs, here is my guidance...
If you go down the path of using Spring Batch, you are more than likely going to have to come up with your own sharding mechanism: how will you divide up REST calls to instantiate your Spring services? You will also be in the Kubernetes management space and will have to handle retries with the streaming API to BQ.
If you go down the Dataflow route, you will have to write some transform code to call your REST API and perform the paging to populate your PCollection destined for BQ. With the recent addition of Dataflow templates you could create a pipeline that is triggered every N hours and parameterize your REST call(s) to just pull data ?since=latestCall. From there you could execute BigQuery writes. I recommend doing this in batch mode as 1) it will scale better if you have millions of rows and 2) it will be less cumbersome to manage (during non-active times).
Since Cloud Dataflow has built-in retry logic for BigQuery and provides consistency across all input and output collections, my vote is for Dataflow in this case.
How big are your REST call results in record count?
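The ?since=latestCall paging idea above can be sketched as a plain-Python loop; `fetch_page` is a stand-in for the real HTTP call, and its signature and the `updated_at` field are assumptions about your API:

```python
def fetch_incremental(fetch_page, since):
    """Pull all records newer than `since` from a paged REST API.
    fetch_page(since, page) must return (records, has_more); each record
    is assumed to carry an 'updated_at' watermark field. Returns the
    records plus the new watermark to persist for the next run."""
    records, page = [], 0
    while True:
        batch, has_more = fetch_page(since, page)
        records.extend(batch)
        if not has_more:
            break
        page += 1
    # Carry the old watermark forward if nothing new arrived.
    latest = max((r["updated_at"] for r in records), default=since)
    return records, latest

# Stub standing in for the real REST call (hypothetical two-page response).
def fake_page(since, page):
    pages = [([{"updated_at": 1}, {"updated_at": 2}], True),
             ([{"updated_at": 3}], False)]
    return pages[page]

records, latest = fetch_incremental(fake_page, since=0)
```

In a Dataflow transform the same loop would run inside the DoFn that populates the PCollection, with `latest` persisted somewhere durable between pipeline runs.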