Data processing - BigQuery vs Data Proc+BigQuery - google-bigquery

We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We need to process this data and convert it into star schema tables (probably in a different BigQuery dataset) that can then be accessed by AtScale.
We need the pros and cons of the two options below:
1. Write complex SQL within BigQuery that reads data from the source dataset and then loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with BigQuery connectors from Dataproc and then load the data into the BigQuery target dataset.
Our transformations involve joining multiple tables at different granularities, using analytic functions to get the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite this processing logic in GCP (BigQuery or Dataproc).

I went successfully with option 1. BigQuery is very capable of running very complex transformations in SQL, and on top of that you can run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data back and forth to BigQuery. When you run BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all the raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.
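As a rough illustration of running a transformation incrementally, here is a minimal sketch that builds one INSERT ... SELECT per day. It uses a plain timestamp filter rather than the time range decorators mentioned above, and the table names, the `event_ts` column, and the helper name are all hypothetical.

```python
from datetime import date, timedelta


def incremental_insert_sql(source, target, ts_column, day):
    """Build an INSERT ... SELECT that processes a single day of raw data.

    The half-open range [day, day + 1) avoids double-counting rows that
    fall exactly on midnight.
    """
    next_day = day + timedelta(days=1)
    return (
        f"INSERT INTO `{target}`\n"
        f"SELECT *\n"
        f"FROM `{source}`\n"
        f"WHERE {ts_column} >= TIMESTAMP('{day.isoformat()}')\n"
        f"  AND {ts_column} < TIMESTAMP('{next_day.isoformat()}')"
    )


# Hypothetical dataset, table, and column names, for illustration only.
sql = incremental_insert_sql(
    "raw.events", "star.fact_events", "event_ts", date(2020, 1, 31)
)
print(sql)
```

Each generated statement touches only one day's slice, so a failed day can be rerun without reprocessing the full history.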
Around a year ago, I wrote a POC comparing DataFlow with a series of BigQuery SQL jobs orchestrated by a potens.io workflow, which allows SQL parallelization at scale.
It took me a good month to write the DataFlow pipeline in Java, with 200+ data points and complex transformations, and with terrible debugging capability at the time.
It took a week to do the same using a series of SQL statements with potens.io, utilizing Cloud Function for Windowed Tables and parallelization with clustering of transient tables.
I know there have been a bunch of improvements in Cloud DataFlow since then, but at the time DataFlow did fine only at the scale of millions and never completed with billions of input records (the main reason being shuffle cardinality, which went a little under billions of records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion records. Ease of debugging and troubleshooting with potens.io helped a lot too.

Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformations would you like to perform on your data?
Both tools can perform complex transformations, but consider that PySpark gives you the processing capability of a full programming language, while BigQuery gives you SQL transformations and some scripting structures. If SQL and simple scripting structures alone can handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or if you think you'll need to build extra features involving transformations in the future, PySpark may be a better option. See the BigQuery scripting reference for details.
Pricing
BigQuery and DataProc have different pricing systems. In BigQuery you need to be concerned with how much data your queries process, while in DataProc you need to be concerned with your cluster's size and VM configuration, how long your cluster runs, and some other settings. See the pricing references for BigQuery and for DataProc. You can also simulate the pricing in the Google Cloud Platform Pricing Calculator.
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
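To make the two pricing models concrete, here is a minimal back-of-the-envelope sketch. It assumes on-demand BigQuery billing by data scanned, and Dataproc billing as VM time plus a per-vCPU service fee. Every rate and workload figure below is a placeholder, not a current price, so take real numbers from the pricing pages or the calculator.

```python
def bigquery_on_demand_cost(tb_scanned, usd_per_tb):
    """Cost of on-demand queries: you pay for the data your queries scan."""
    return tb_scanned * usd_per_tb


def dataproc_cluster_cost(nodes, hours, vm_usd_per_hour,
                          vcpus_per_node, fee_usd_per_vcpu_hour):
    """Cost of a cluster: VM time plus a per-vCPU service fee.

    Storage, network, and other charges are deliberately left out.
    """
    vm_cost = nodes * hours * vm_usd_per_hour
    service_fee = nodes * vcpus_per_node * hours * fee_usd_per_vcpu_hour
    return vm_cost + service_fee


# Placeholder rates and workload sizes, for illustration only.
bq = bigquery_on_demand_cost(tb_scanned=40, usd_per_tb=5.0)
dp = dataproc_cluster_cost(nodes=10, hours=2, vm_usd_per_hour=0.5,
                           vcpus_per_node=8, fee_usd_per_vcpu_hour=0.01)
print(f"BigQuery on-demand: ${bq:.2f}, Dataproc cluster: ${dp:.2f}")
```

The point of the sketch is the shape of each formula: BigQuery cost scales with bytes scanned, while Dataproc cost scales with cluster size and running time, regardless of how much data the job reads.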
I hope this information helps you.

Related

Best approach for BigQuery data transformations

I already have terabytes of data stored in BigQuery and I'm looking to perform heavy data transformations on it.
Considering COSTS and PERFORMANCE, what is the best approach you would suggest for performing these transformations, given that the data will be used in BigQuery afterwards?
I'm considering a few options:
1. Read raw data from DataFlow and then load the transformed data back into BigQuery?
2. Do it directly from BigQuery?
Any ideas about how to proceed with this?
I wrote down some of the most important things about performance; among them you will find considerations regarding your question about using DataFlow.
Best practices considering performance:
Choosing file format:
BigQuery supports a wide variety of file formats for data ingestion. Some are naturally faster than others. When optimizing for load speed, prefer the AVRO file format, which is a binary, row-based format that can be split and then read in parallel by multiple workers.
Loading data from compressed files, specifically CSV and JSON, is slower than loading data in any other format. The reason is that Gzip compression is not splittable, so BQ has to take the file, load it onto a slot, decompress it, and only then parallelize the load.
**FASTER**
Avro (Compressed)
Avro (Uncompressed)
Parquet/ORC
CSV
JSON
CSV (Compressed)
JSON (Compressed)
**SLOWER**
ELT / ETL:
After loading data into BQ, you can think about transformations (ELT or ETL). So in general, you want to prefer ELT over ETL where possible. BQ is very scalable and can handle large transformations on a ton of data. ELT is also quite a bit simpler, because you could just write some SQL queries, transform some data and then move data around between tables, and not have to worry about managing a separate ETL application.
Raw and staging tables:
Once you have started loading data into BQ, you will generally want to leverage raw and staging tables within your warehouse before publishing to reporting tables. The raw table essentially contains the full daily extract, or a full load of the data being loaded. The staging table is basically your change data capture table: you can use queries or DML to merge the raw data into your staging table and keep a full history of all the data that has been inserted. Finally, your reporting tables are the ones that you publish to your users.
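The merge step from raw into staging can be sketched as a small helper that generates a MERGE statement. This is an upsert variant under assumed names (the tables, the `customer_id` key, and the columns are hypothetical); to keep a full history instead, you would drop the WHEN MATCHED clause and only insert.

```python
def build_merge_sql(staging, raw, key, columns):
    """Build a MERGE that upserts rows from the raw table into staging.

    Rows whose key already exists in staging are updated in place;
    rows with a new key are inserted.
    """
    all_columns = [key] + list(columns)
    set_clause = ", ".join(f"{c} = S.{c}" for c in columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"S.{c}" for c in all_columns)
    return (
        f"MERGE `{staging}` T\n"
        f"USING `{raw}` S\n"
        f"ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical table and column names, for illustration only.
print(build_merge_sql("dw.staging_customers", "dw.raw_customers",
                      "customer_id", ["name", "updated_at"]))
```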
Speeding up pipelines using DataFlow:
When you're getting into streaming loads or really complex batch loads (ones that don't fit cleanly into SQL), you can leverage DataFlow or DataFusion to speed up those pipelines and do more complex work on that data. If you're starting with streaming, I recommend using the DataFlow templates. Google provides them for loading data from multiple different places and moving data around. You can find all of these templates in the DataFlow UI, under the Create Job from Template button.
And if you find that it mostly fits your use case, but want to make one slight modification, all those templates are also open sourced (so you can go to repo, modify the code to fit your needs).
Partitioning:
Partitioning in BQ physically splits your data on disk, based on ingestion time or on a column within your data. This lets you efficiently query only the parts of the table you want, which provides huge cost and performance benefits, especially on large fact tables. Whenever you have a fact table or temporal table, use a partition column on your date dimension.
Cluster Frequently Accessed Fields:
Clustering allows you to physically order data within a partition. You can cluster by one or multiple keys. This provides massive performance benefits when used properly.
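Partitioning and clustering are both declared when the table is created. As a minimal sketch, the helper below generates such a DDL statement; the table, column names, and helper name are hypothetical, and the check reflects BigQuery's limit of four clustering columns.

```python
def partitioned_clustered_ddl(table, columns, partition_column, cluster_columns):
    """Build a CREATE TABLE statement partitioned by day and clustered by keys.

    `columns` is a list of (name, type) pairs; `partition_column` is assumed
    to be a TIMESTAMP column, hence the DATE(...) wrapper.
    """
    if len(cluster_columns) > 4:
        raise ValueError("BigQuery allows at most four clustering columns")
    cols = ",\n  ".join(f"{name} {type_}" for name, type_ in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical fact table, for illustration only.
print(partitioned_clustered_ddl(
    "star.fact_sales",
    [("sale_ts", "TIMESTAMP"), ("customer_id", "INT64"), ("amount", "NUMERIC")],
    partition_column="sale_ts",
    cluster_columns=["customer_id"],
))
```

Queries that filter on the partition column and the clustering keys can then skip the partitions and blocks they do not need.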
BQ reservations:
Reservations allow you to reserve slots and assign projects to those reservations, so you can allocate more or fewer resources to certain types of queries.
You can find best practices for saving costs in the official documentation.
I hope it helps you.
According to this Google Cloud documentation, the following questions should be asked when choosing between DataFlow and BigQuery for ELT.
Although the data is small and can quickly be uploaded by using the BigQuery UI, for the purpose of this tutorial you can also use Dataflow for ETL. Use Dataflow for ETL into BigQuery instead of the BigQuery UI when you are performing massive joins, that is, from around 500-5000 columns of more than 10 TB of data, with the following goals:
You want to clean or transform your data as it's loaded into BigQuery, instead of storing it and joining afterwards. As a result, this approach also has lower storage requirements because data is only stored in BigQuery in its joined and transformed state.
You plan to do custom data cleansing (which cannot be simply achieved with SQL).
You plan to combine the data with data outside of the OLTP, such as logs or remotely accessed data, during the loading process.
You plan to automate testing and deployment of data-loading logic using continuous integration or continuous deployment (CI/CD).
You anticipate gradual iteration, enhancement, and improvement of the ETL process over time.
You plan to add data incrementally, as opposed to performing a one-time ETL.

Apache Ignite analogue of Spark vector UDF and distributed compute in general

I have been using Spark from Python for some time now with success; however, we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark, but it is a little restrictive at the moment.
As for Ignite, on the surface it seems like a decent alternative. It has good .NET support, clustering ability, and the ability to distribute compute across the grid.
However, I was wondering if it can really replace Spark in our use case. What we need is a distributed way to perform data frame type operations. In particular, a lot of our Python code was implemented using Pandas UDFs, and we let Spark worry about the data transfer and the merging of results.
If I wanted to use Ignite, where our data is really more like a table (typically CSV sourced) than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data in the same way Spark does, especially in the sense that the results of the calculations just become 1..n more columns in the dataframe, without having to collect all the results back to the main program?
You can load your structured data (CSV) to Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
This lets you run distributed SQL queries over the data, with index support. Spark can also work with structured data using SQL, but it has no indexes. Indexes can significantly improve the performance of your SQL operations.
If you already have a working solution built on Spark data frames, you can also keep the same logic and use the Ignite integration with Spark instead:
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all data stored in Ignite SQL tables and do SQL requests and other operations using Spark.
Here you can see an example of how to load CSV data into Ignite using a Spark DF and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs Scala/Python code)
you can run it easily on the GUI if needed
fast without having to really reason about partitioning, caching and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow is often like :
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something ?
Is there any con in using BigQuery this way instead of Spark ?
Thanks
A con I can see is the additional time required by the Hadoop cluster to create and finish the job. By making a direct request to BigQuery, this extra time can be decreased.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ Client Libraries and separate your current tasks:
BigQuery Client Libraries. They are optimized to connect to BQ. There is a QuickStart, and you can use different programming languages such as Python or Java, among others.
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ, you can use the Dataproc-BQ connector. While this connector is installed in Dataproc by default, you can also install it on-premises so that you can continue running your SparkML jobs with BQ data. Just in case it helps, you might want to consider using GCP services like AutoML, BQ ML, AI Platform Notebooks, etc., which are specialized services for Machine Learning and AI.
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I will summarize my view of the pros and cons of one system against the other. I admit that your environment could be different, so something I consider a pro might not be one for you.
Pros of Spark:
better testing of the code: it is simpler to build unit tests and run them with mocked data and classes than to try to do this with BigQuery
it's possible to use SQL (SparkSQL) for operations and even combine operations over different data sources (DB, files, BQ)
we have JSON files in a format that BigQuery does not accept and cannot parse (even though the files are valid JSON)
possible to implement naturally more complicated logic for some cases, for example, traversing arrays in nested fields and other complicated calculations
better custom monitoring is possible, when we need to check specific metrics in the pipeline we can send related metrics (StatsD, etc.) easier
more natural for CI/CD processes
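The testing point can be illustrated without a cluster. A minimal sketch, assuming the transformation logic is kept in plain Python functions that a Spark job would then apply per record (for example through `rdd.flatMap`); the record layout and function names are hypothetical.

```python
def explode_items(order):
    """Turn one nested order record into one flat row per item."""
    return [
        {
            "order_id": order["order_id"],
            "sku": item["sku"],
            "amount": item["qty"] * item["price"],
        }
        for item in order.get("items", [])
    ]


def test_explode_items():
    """Unit test with mocked data: no cluster or warehouse needed."""
    mocked_order = {
        "order_id": 1,
        "items": [
            {"sku": "A", "qty": 2, "price": 5},
            {"sku": "B", "qty": 1, "price": 7},
        ],
    }
    rows = explode_items(mocked_order)
    assert [r["sku"] for r in rows] == ["A", "B"]
    assert [r["amount"] for r in rows] == [10, 7]
    # A record without the nested array yields no rows.
    assert explode_items({"order_id": 2}) == []


test_explode_items()
```

Because the function only takes and returns plain dictionaries, the same test runs locally and in a CI pipeline.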
Pros of BigQuery (all with a note: if all data is available):
simplicity of SQL, when all data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scenes, very performant
With both approaches it's possible to quickly check the result in a GUI. For example, Jupyter Notebook lets you run PySpark instantly. I cannot add any notes about ML-related traits, though.

Can I use the same programming language in BigQuery and Google Cloud Dataflow?

I want to use the same function for parsing events in two different technologies: Google BigQuery and DataFlow. Is there a language I can do this in? If not, is Google planning to support one any time soon?
Background: Some of this parsing is complex (e.g., applying custom URL extraction rules, extracting information out of the user agent) but it's not computationally expensive and doesn't involve joining the events to any other large look-up tables. Because the parsing can be complex, I want to write my parsing logic in only one language and run it wherever I need it: sometimes in BigQuery, sometimes in other environments like DataFlow. I want to avoid writing the same complex parsers/extractors in different languages because of the bugs and inconsistencies that can result from that.
I know BigQuery supports JavaScript UDFs. Is there a clean way to run JavaScript on Google Cloud DataFlow? Will BigQuery someday support UDFs in some other language?
We tend to use Java to drive BigQuery jobs and parse their resulting data, and we do the same in Dataflow as well.
Likewise, you have leeway in how much SQL you write by hand vs. auto-generate from the code base, and in how much you lean on BigQuery vs. Dataflow.
(We have found that, with our larger amounts of data, there is a big benefit to offloading as much of the initial grouping/filtering as possible into BigQuery before pulling the data into Dataflow.)

Pros & cons of BigQuery vs. Amazon Redshift [closed]

Closed 7 years ago.
Comparing Google BigQuery vs. Amazon Redshift shows that both can meet the same set of requirements and differ mostly in their cost plans. It seems that Redshift is more complex to configure (defining keys and optimization work), whereas Google BigQuery perhaps has an issue with joining tables.
Is there a pros & cons list of Google BigQuery vs. Amazon Redshift?
I posted this comparison on reddit. Quickly enough a long term RedShift practitioner came to comment on my statements. Please see https://www.reddit.com/r/bigdata/comments/3jnam1/whats_your_preference_for_running_jobs_in_the_aws/cur518e for the full conversation.
Sizing your cluster:
Redshift will ask you to choose a number of CPUs, RAM, HD, etc. and to turn them on.
BigQuery doesn't care. Use it whenever you want, no provisioning needed.
Hourly costs when doing nothing:
Redshift will ask you to pay per hour of each of these servers running, even when you are doing nothing.
When idle BigQuery only charges you $0.02 per month per GB stored. 2 cents per month per GB, that's it.
Speed of queries:
Redshift performance is limited by the number of CPUs you are paying for.
BigQuery transparently brings in as many resources as needed to run your query in seconds.
Indexing:
Redshift will ask you to index (correction: distribute) your data under certain criteria, and you'll only be able to run fast queries based on this index.
BigQuery has no indexes. Every operation is fast.
Vacuuming:
Redshift requires periodic maintenance and 'vacuum' operations that last hours. You are paying for each of these server hours.
BigQuery does not. Forget about 'vacuuming'.
Data partitioning and distributing:
Redshift requires you to think about how to distribute data within your servers to keep performance up - optimization that works only for certain queries.
BigQuery does not. Just run whatever query you want.
Streaming live data:
Impossible(?) with Redshift.
BigQuery easily handles ingesting up to 100,000 rows per second per table.
Growing your cluster:
If you have more data or more concurrent users, scaling up will be painful with Redshift.
BigQuery will just work.
Multi zone:
You want a multi-zone Redshift for availability and data integrity? Painful.
BigQuery is multi-zoned by default.
To try BigQuery you don't need a credit card or any setup time. Just try it (quick instructions to try BigQuery).
When you are ready to put your own data into BigQuery, just copy your newline-delimited JSON logs to Google Cloud Storage and import them.
See this in depth guide to data warehouse pricing on the cloud:
Understanding Cloud Pricing Part 3.2 - More Data Warehouses
Amazon Redshift is a standard SQL database (based on Postgres) with MPP features that allow it to scale. These features also require you to conform your data model somewhat to get the best performance. It supports a large amount of the SQL standard and most tools that can speak to Postgres can use it unchanged.
BigQuery is not a database, in the sense that it doesn't use standard SQL and doesn't provide JDBC/ODBC connectivity. It's a unique service with its own API and interfaces. It provides limited support for SQL queries, but most users interact with it via custom code (Java, Python, etc.). Some 3rd party tools have added support for BigQuery, but existing tools will not work without modification.
tl;dr - Redshift is better for interacting with existing tools and using complex SQL. BigQuery is better for custom coded interactions and teams who dislike SQL.
UPDATE 2017-04-17 - Here's a much more up to date summary of the cost and speed differences (wrapped in a sales pitch so YMMV). TL;DR - Redshift is usually faster and will be cheaper if you query the data somewhat regularly. http://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery
UPDATE - Since I keep getting down votes on this (🤷‍♂️) here's an up-to-date response to the items in the other answer:
Sizing your cluster:
Redshift allows you to tailor your costs to your usage. If you want the fastest possible queries choose SSD nodes and if you want the lowest possible cost per GB choose HDD nodes. Start small and add nodes whenever you want.
Hourly costs when doing nothing:
Redshift keeps your cluster ready for queries, can respond in milliseconds (result cache) and it provides a simple, predictable monthly bill.
For example, even if some script accidentally runs 10,000 giant queries over the weekend your Redshift bill will not increase at all.
Speed of queries:
Redshift performance is absolutely best in class and gets faster all the time. 3-5x faster in the last 6 months.
Indexing:
Redshift has no indexes. It allows you to define sort keys to optimize performance from fast to insanely fast.
Vacuuming:
Redshift now automatically runs routine maintenance such as ANALYZE and VACUUM DELETE when your cluster has free resources.
Data partitioning and distributing:
Redshift never requires distribution. It allows you to define distribution keys which can make even huge joins very fast.
{Ask competitors about join performance…}
Streaming live data:
Redshift has 2 choices:
Stream real-time data into Redshift using Amazon Kinesis Firehose.
Skip ingestion altogether by querying your real-time data on S3 as soon as it lands (and at high speeds) using Redshift Spectrum external tables.
Growing your cluster:
Redshift can elastically resize most clusters in a few minutes.
Multi zone:
Redshift seamlessly replaces any failed hardware and continuously backs up your data, including across regions if desired.