Can we do windowing on batch processing? - batch-processing

The Apache Beam documentation talks about windowing with bounded PCollections:
https://beam.apache.org/documentation/programming-guide/#windowing
If we have bounded data, then all the elements belong to a single GLOBAL window.
In which use cases would we need to apply windowing on batch processing?

Bounded data does not have a notion of event time or watermarks. So for batch pipelines, windows are nothing but one part of a multipart key that can be used at grouping time.
A possible use case would be to get the list of unique users per hour from a corpus of one day of data. However, this can also be done by applying your own key (the hour) and grouping.
Beam has a unified batch and streaming model with similar APIs for both, hence the windowing concept is available in batch as well as streaming.
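For illustration, here is a minimal sketch (Beam Python SDK) of that hourly use case applied to a bounded source; the file name and its user_id,timestamp CSV layout are hypothetical placeholders.

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    with beam.Pipeline() as p:
        _ = (
            p
            | "Read" >> beam.io.ReadFromText("events.csv")              # bounded source
            | "Parse" >> beam.Map(lambda line: line.split(","))         # -> [user_id, unix_ts]
            | "Timestamp" >> beam.Map(lambda f: TimestampedValue(f[0], float(f[1])))
            | "HourlyWindows" >> beam.WindowInto(FixedWindows(60 * 60)) # 1-hour fixed windows
            | "Distinct" >> beam.Distinct()                             # unique users per window
            | "Write" >> beam.io.WriteToText("unique_users_per_hour")
        )

The same result could be produced with a composite (hour, user) key and a GroupByKey, as the answer notes; windowing just lets that grouping be expressed declaratively.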

Related

Data processing - BigQuery vs Data Proc+BigQuery

We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We need to process this data into star schema tables (probably in a different dataset in BigQuery), which can then be accessed by AtScale.
We need the pros and cons of the two options below:
1. Write complex SQL within BigQuery which reads data from the source dataset and then loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with the BigQuery connectors from Dataproc and then load the data into the BigQuery target dataset.
Our transformations involve joining multiple tables at different granularities, using analytic functions to get the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite this processing logic in GCP (BigQuery or Dataproc).
I went successfully with option 1: BigQuery is very capable of running very complex transformations in SQL, and on top of that you can run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data back and forth out of BigQuery. When running BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all the raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
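As a hedged illustration of the "series of SQL" approach (not the answerer's actual code), here is a minimal sketch that runs one incremental transformation step with the google-cloud-bigquery Python client; the project, dataset, table and column names and the date filter are hypothetical placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT user_id, DATE(event_ts) AS event_date, COUNT(*) AS events
        FROM `my_project.raw.events`
        WHERE DATE(event_ts) = @run_date
        GROUP BY user_id, event_date
    """
    job_config = bigquery.QueryJobConfig(
        destination="my_project.star_schema.fact_daily_events",   # target dataset used by AtScale
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND, # incremental: append one day at a time
        query_parameters=[bigquery.ScalarQueryParameter("run_date", "DATE", "2020-01-01")],
    )
    client.query(sql, job_config=job_config).result()  # block until the job finishes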
We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.
Around a year ago, I wrote a POC comparing Dataflow with a series of BigQuery SQL jobs orchestrated by a potens.io workflow, which allows SQL parallelization at scale.
It took me a good month to write the Dataflow job in Java, with 200+ data points and complex transformations, and with terrible debugging capability at the time.
It took a week to do the same using a series of SQL statements with potens.io, using a Cloud Function for windowed tables and parallelization with clustered transient tables.
I know there have been a bunch of improvements in Cloud Dataflow since then, but at the time Dataflow did fine only at the million scale and never completed on a billions-of-records input (the main reason being shuffle cardinality, which got to a little under a billion records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion records. Debugging and the ease of troubleshooting with potens.io helped a lot too.
Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformations would you like to perform on your data?
Both tools can do complex transformations, but you have to consider that PySpark gives you the processing capability of a full programming language, while BigQuery gives you SQL transformations and some scripting structures. If SQL and simple scripting structures can handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or you think you'll need to build some extra transformation features in the future, PySpark may be a better option. You can find the BigQuery scripting reference here.
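For a sense of what option 2 looks like, here is a minimal PySpark sketch using the spark-bigquery connector on Dataproc; the project, dataset, table, column and bucket names are hypothetical placeholders, not part of the original answer.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("raw-to-star-schema").getOrCreate()

    # Read the raw table straight from BigQuery.
    raw = (spark.read.format("bigquery")
           .option("table", "my_project.raw.events")
           .load())

    # Example transformation: aggregate raw events into a fact table.
    fact = (raw
            .groupBy("user_id", F.to_date("event_ts").alias("event_date"))
            .agg(F.count("*").alias("events")))

    # Write back to the target dataset; the connector stages data in GCS.
    (fact.write.format("bigquery")
         .option("table", "my_project.star_schema.fact_daily_events")
         .option("temporaryGcsBucket", "my-temp-bucket")
         .mode("overwrite")
         .save())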
Pricing
BigQuery and DataProc have different pricing systems. While in BigQuery you need to be concerned about how much data your queries process, in DataProc you need to be concerned about your cluster's size and VM configuration, how long your cluster will be running, and some other settings. You can find the pricing reference for BigQuery here and for DataProc here. Also, you can simulate the pricing in the Google Cloud Platform Pricing Calculator.
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope this information helps you.

DataWeave vs Java Performance

I have a requirement to iterate over close to a million records. The current code is written in DataWeave with filter and ordering logic. However, I'm seeing a performance issue. I was thinking of converting this DataWeave logic to Java using the Java Component to see if that improves performance.
How can I improve the performance of my code?
DataWeave has some performance issues if you are using global functions or 'p()' functions.
If you have any such functions in your DWL, avoid them.
Since you are processing a high volume of records, if the records can be processed independently, you can use the scatter-gather pattern and leverage asynchronous processing of the records. You can further tune the performance by configuring the thread pool in which you are doing the transformation/filtering logic.
For the implementation of the scatter-gather pattern, you can refer to this link.
The ordering logic that you have implemented in DataWeave can be moved to a custom aggregator, where you can reorder records as per your custom logic.
If nothing helps, consider using the Java 8 Streams API in your custom Java component to both filter and order the records.
DataWeave is best at what it does. It doesn't matter how many records you are willing to process through your app; the primary constraints are not with DataWeave, but with the app memory and vCores assigned. You should consider pausing the processing of records if it's over a million, and do your processing in chunks/batches at regular intervals with a reasonable time delay.
From what I have tested, any app running on 0.1 vCores and 1 worker will usually hit the Mule Health Monitor, eventually leading to a crash, if run continuously for 15 hours or more.
A good rule of thumb is to never let system resource usage or CPU go beyond 70%.
Note: it is strongly recommended not to use Mule Java Components for complex, repetitive, high-load executions.

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the docs but I could not find a method for it. I want to do cross-validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied to data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
Both are sketched below.
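A minimal sketch of both transformations on a toy dataset (the fold count and the range dataset are placeholders, not from the original answer):

    import tensorflow as tf

    dataset = tf.data.Dataset.range(100)

    # (1) shard: keep every num_shards-th element, starting at a given index.
    num_shards = 5
    folds = [dataset.shard(num_shards=num_shards, index=i) for i in range(num_shards)]
    # For cross-validation, fold i can serve as the validation set while the
    # remaining folds are concatenated into the training set.

    # (2) window (TF >= 1.12): a dataset of nested datasets of up to 20 elements each.
    windows = dataset.window(size=20)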
I am afraid you cannot. The Dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole; in that regard it might be a bit of a misnomer.
Also, even if you could, this would probably be a bad idea. You would rather have this train/test split done once and for all:
it lets you review those sets offline;
if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset).
See also a related question about how to split a set into training & testing in tensorflow.
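If the split does have to happen inside the pipeline, one way to keep it stable (an added suggestion, not part of the answers above) is to assign each file to a set by hashing its name, so adding data never moves existing samples; the glob pattern and 80/20 threshold are hypothetical:

    import tensorflow as tf

    files = tf.data.Dataset.list_files("data/*.tfrecord", shuffle=False)

    def is_train(filename):
        # Stable fingerprint of the filename -> bucket in [0, 100).
        return tf.strings.to_hash_bucket_fast(filename, 100) < 80

    train_files = files.filter(is_train)
    test_files = files.filter(lambda f: tf.logical_not(is_train(f)))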

Optimal data streaming and processing solution for enormous datasets into tf.data.Dataset

Context:
My text input pipeline currently consists of two main parts:
I. A complex text preprocessing step that exports tf.SequenceExamples to tfrecords (custom tokenization, vocabulary creation, statistics calculation, normalization and much more, over the full dataset as well as per individual example). This is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that does quite a bit of processing during training, too (string_split into characters, table lookups, bucketing, conditional filtering, etc.).
Original Dataset is present across multiple locations (BigQuery, GCS, RDS, ...).
Problem:
The problem is that as the production dataset increases rapidly (several terabytes), it is not feasible to recreate the tfrecords files for each possible data configuration (part I has a lot of hyperparameters), as each would have an enormous size of hundreds of terabytes. Not to mention that tf.Dataset reading speed surprisingly slows down as tf.SequenceExamples or tfrecords grow in size.
There are quite a few possible solutions:
Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark
tf.contrib.cloud.BigQueryReader
, but none of them seems to fully fulfill my requirements:
Streaming and processing data on the fly from BigQuery, GCS, RDS, ... as in part I.
Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II.
Fast and reliable for both training and inference.
(optional) Being able to pre-calculate some full pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case using Cloud Composer (the GCP integration of Airflow).
Airflow provides operators which let you orchestrate a pipeline with a script.
In your case you can use the dataflow_operator to spin up the Dataflow job when you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the python operator or the bash operator to monitor and pre-calculate statistics.
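A minimal sketch of what such a DAG could look like; the import paths match the Airflow 1.10-era contrib modules named above (they moved to the google provider package in Airflow 2), and every table, bucket and file name is a hypothetical placeholder.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator
    from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
    from airflow.operators.python_operator import PythonOperator

    def precalculate_statistics():
        # Placeholder for full-pass statistics (vocabulary, normalization stats, ...).
        pass

    with DAG("text_preprocessing",
             schedule_interval="@daily",
             start_date=datetime(2019, 1, 1),
             catchup=False) as dag:

        export_raw = BigQueryOperator(
            task_id="export_raw_slice",
            sql="SELECT * FROM `my_project.raw.documents` WHERE _PARTITIONDATE = '{{ ds }}'",
            use_legacy_sql=False,
            destination_dataset_table="my_project.staging.documents_{{ ds_nodash }}",
            write_disposition="WRITE_TRUNCATE",
        )

        preprocess = DataFlowPythonOperator(
            task_id="preprocess_to_tfrecords",
            py_file="gs://my-bucket/pipelines/preprocess.py",  # the Beam pipeline from part I
            options={"input_table": "my_project.staging.documents",
                     "output": "gs://my-bucket/tfrecords/"},
        )

        stats = PythonOperator(
            task_id="precalculate_statistics",
            python_callable=precalculate_statistics,
        )

        export_raw >> preprocess >> stats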

Windowing functions in Dataflow and Big Query

I am looking at analysing streaming data (web events).
Is there a good rule of thumb to help me determine if I should
Perform Grouping and Aggregation in Dataflow and write the output
or
Use Dataflow to stream into Big Query and possibly use a range decorator to limit data / use a windowing function for partitions and aggregate via SQL.
Looking at the examples in the documentation and this article
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
Classic Batch Programming, Hourly Team Scores, All-time User Scores, User Behaviour Analysis feel like they are straightforward to create via SQL (given "created" and "write" timestamps are recorded)
For the spam filtering example, I can see the limitations of using BQ if this is applied on a per-event streaming basis.
The semantics of Dataflow and BQ seem to overlap in terms of GroupBy, Join, Combine and Windowing, and BQ supports streaming inserts with availability within seconds, well short enough for hour-level aggregation.
Is there something fundamental I have not understood? Or is there a case that streaming into BigQuery and then querying will start to become unreliable?
Thank you
Chris
(Apologies if this question is a bit vague - happy to be redirected to a better place to ask)
Whether one chooses to perform grouping and aggregation in Dataflow or using BigQuery operations (after having ingested the data with Dataflow) depends on the application logic and on what consumes the output. For example, sessions and sliding windows are both hard to express in SQL, while Dataflow supports arbitrary processing such as triggered estimates. Another thing to consider is that it may be easier to express the computation logic in an imperative programming language than in SQL.
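To make the session example concrete, here is a minimal Beam (Python SDK) sketch of per-user session windows over web events; the toy in-memory events and the 10-minute gap are hypothetical placeholders.

    import apache_beam as beam
    from apache_beam.transforms.window import Sessions, TimestampedValue

    # Toy (user_id, unix_timestamp) events.
    raw = [("alice", 0), ("alice", 300), ("alice", 5000), ("bob", 60)]

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(raw)
            | "Timestamp" >> beam.Map(lambda kv: TimestampedValue((kv[0], 1), kv[1]))
            | "Sessionize" >> beam.WindowInto(Sessions(gap_size=10 * 60))  # 10-minute gap
            | "EventsPerSession" >> beam.combiners.Count.PerKey()
            | beam.Map(print)  # e.g. ('alice', 2), ('alice', 1), ('bob', 1)
        )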
The below does not necessarily answer your exact question, but rather adds yet another aspect to consider:
1. If you are building a process that is supposed to power your infrastructure, Dataflow might be a good choice. Of course, you are bound by your tech team's resources.
2. If you plan for ad-hoc and self-serve activity by non-technical personnel (technical personnel are of course not excluded here either), you can focus on employing BigQuery's query features (including windowing functions) and make sure you have good, real, working examples that the rest of your company can use as templates to start leveraging the power of BigQuery and GCP in general. This has proved to work great! Domain experts can now answer questions like the ones you listed by themselves, without tech people in between. Quality and timing are much better in this scenario!
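As an illustration of that self-serve pattern (the table, columns and project name are hypothetical placeholders), an hourly aggregation with an analytic (window) function can be run straight from the BigQuery Python client:

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        WITH hourly AS (
          SELECT
            TIMESTAMP_TRUNC(event_ts, HOUR) AS hour,
            COUNT(DISTINCT user_id) AS unique_users
          FROM `my_project.analytics.web_events`
          GROUP BY hour
        )
        SELECT
          hour,
          unique_users,
          SUM(unique_users) OVER (ORDER BY hour) AS running_total  -- analytic (window) function
        FROM hourly
        ORDER BY hour
    """
    for row in client.query(sql).result():
        print(row.hour, row.unique_users, row.running_total)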