Using 1 Dataflow Job (Apache Beam Pipeline) to aggregate data on different window periods, and write them to different column families in BigTable - optimization

I am trying to optimize my Apache Beam pipeline on Google Cloud Platform Dataflow.
Background information: I am trying to read streaming data from PubSub messages and aggregate it based on 3 time windows: 1 min, 5 min, and 60 min. The aggregations consist of summing, averaging, finding the maximum or minimum, etc. For example, for all data collected from 12:00 to 12:01, I want to aggregate it and write the output into BigTable's 1-min column family. And for all data collected from 12:00 to 12:05, I want to similarly aggregate it and write the output into BigTable's 5-min column family. The same goes for 60 min.
My current approach is to run 3 separate Dataflow jobs (i.e. 3 separate Beam pipelines), each with a different window duration (1 min, 5 min, and 60 min). See https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html. The outputs of all 3 Dataflow jobs are written to the same BigTable, but to different column families. Other than that, the 3 jobs perform the same functions and aggregations on the data.
However, this seems very computationally and cost inefficient, as the 3 jobs are essentially doing the same work; the only differences are the window duration and the output column family.
Some challenges and limitations we faced were that, from the Apache Beam documentation, it seems we are unable to create multiple windows of different periods in a single Dataflow job. Also, when we write the final data into BigTable, we have to define the table, column family, column, and rowkey. And unfortunately, the column family is a fixed property (i.e. it cannot be redefined or changed based on the window period).
Hence, I am wondering if there is a way to use only 1 Dataflow job (i.e. 1 Apache Beam pipeline) that fulfils the objective of this project: aggregating data over different window periods and writing the results to different column families of the same BigTable.
I was considering a split-stream approach: first window by 1 min, then split into 3 streams (one writing to BigTable at the 1-min interval, another for the 5-min aggregation, and another for the 60-min aggregation). However, the problem is that we are working with streaming data and not batch data. A sketch of this idea is below.
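For what it's worth, here is a minimal, untested sketch of how a single pipeline could branch one PubSub read into three windowed aggregations (Beam Java SDK 2.x; streaming/runner options omitted). ParseFn, the "key,value" message format, the rowkey/qualifier scheme, and all project/instance/table names are assumptions for illustration. The relevant detail is that with BigtableIO the column family is set on each individual Mutation rather than on the sink, so each branch can target its own family in the same table:

```java
// Untested sketch: one pipeline, one PubSub read, three windowed branches.
import java.util.Collections;
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class MultiWindowPipeline {

  // Hypothetical parser: turns a "key,value" message into a KV pair.
  static class ParseFn extends DoFn<String, KV<String, Double>> {
    @ProcessElement
    public void process(ProcessContext c) {
      String[] parts = c.element().split(",");
      c.output(KV.of(parts[0], Double.parseDouble(parts[1])));
    }
  }

  // Builds a SetCell mutation; the column family is a constructor argument,
  // so each windowed branch writes to its own family in the same table.
  static class ToMutationFn
      extends DoFn<KV<String, Double>, KV<ByteString, Iterable<Mutation>>> {
    private final String family;

    ToMutationFn(String family) {
      this.family = family;
    }

    @ProcessElement
    public void process(ProcessContext c) {
      Mutation m = Mutation.newBuilder()
          .setSetCell(Mutation.SetCell.newBuilder()
              .setFamilyName(family)
              .setColumnQualifier(ByteString.copyFromUtf8("sum"))
              .setValue(ByteString.copyFromUtf8(c.element().getValue().toString())))
          .build();
      Iterable<Mutation> mutations = Collections.singletonList(m);
      c.output(KV.of(ByteString.copyFromUtf8(c.element().getKey()), mutations));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(); // streaming/runner options omitted
    PCollection<KV<String, Double>> parsed = p
        .apply("ReadPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
        .apply("Parse", ParDo.of(new ParseFn()));

    // Branch the same parsed stream three times; only the window duration
    // and the target column family differ between branches.
    for (int minutes : new int[] {1, 5, 60}) {
      parsed
          .apply(minutes + "minWindow",
              Window.into(FixedWindows.of(Duration.standardMinutes(minutes))))
          .apply(minutes + "minSum", Sum.doublesPerKey())
          .apply(minutes + "minToMutation",
              ParDo.of(new ToMutationFn("cf_" + minutes + "min")))
          .apply(minutes + "minWrite", BigtableIO.write()
              .withProjectId("my-project")
              .withInstanceId("my-instance")
              .withTableId("my-table"));
    }
    p.run();
  }
}
```

The design point is that the expensive read/parse happens once, and only the cheap window/combine/write stages are triplicated; since the column family travels on each Mutation, it is not a fixed property of the sink.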
Thank you

Related

How to use the output as another input?

The case: we have a 1-day aggregation window which sums the total value received from Event Hub (1, 2, 3, ... a value is sent each minute), and we set the output to a blob called 1dayresult. Now we want to use that blob data as the input for another, 1-week aggregation: each week we want to read the data from the blob and do the calculation. So can we set the 1-day result blob as the input for the 1-week aggregation? We know we can set the window unit to 7 days, but we think that will slow down performance, because if we make the 1-day result blob the input, we only need 7 values, whereas if we use the 7-day window, we will get more than 7*24*60 values and then do the calculation. We also want to have a monthly aggregation, but the maximum size of the window is 7 days. So how can we achieve this?
You can use the WITH statement to "chain" multiple subqueries together, where the next subquery uses the output of the previous one as input. Take a look at this doc.
However, as you noted, sometimes it can be more efficient to persist intermediate output results in blob storage or another Event Hub. You can define the same storage location as both an input and an output, so one subquery can be writing output while another is reading it.
The maximum window size in Azure Stream Analytics is indeed 7 days. For bigger-window computations involving larger amounts of historic data, it might be better to use a product like Azure Data Factory.

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (compressed CSV) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform a transformation on 2 of the fields, and write the rows to BigQuery.
The transformation involves performing 3 regex operations (set to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads a day's worth of files from GCS (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex nor CPU-intensive, i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load them into BigQuery using the command line/console. You'd probably save some money on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M per minute. At 31M rows of input, that would take ~5 minutes of just flat-out writes (31M / 6M per minute ≈ 5.2 minutes). When you add back the discrete processing time per element and then the synchronization time (read from GCS -> dispatch -> ...) of the graph, this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net, increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment, along the lines of the sketch below. :-)
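A rough, untested sketch of that experiment, written against the Beam 2.x SDK for illustration. The glob pattern, table names, shard count, and CsvToTableRowFn are assumptions, and the per-shard tables are assumed to already exist (hence CREATE_NEVER):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ShardedLoad {

  // Placeholder for the real transformation (the regex mapping would go here).
  static class CsvToTableRowFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void process(ProcessContext c) {
      String[] f = c.element().split(",");
      c.output(new TableRow().set("field1", f[0]).set("field2", f[1]));
    }
  }

  public static void main(String[] args) {
    int shard = Integer.parseInt(args[0]); // run X copies of this, one per shard
    Pipeline p = Pipeline.create();
    p.apply("ReadShard", TextIO.read()
            .from("gs://my-bucket/logs/2015-01-*-shard" + shard + "*.csv.gz"))
        .apply("Transform", ParDo.of(new CsvToTableRowFn()))
        .apply("Write", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.logs_shard_" + shard)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    p.run();
  }
}
```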
Make sense?

Lambda Architecture Modelling Issue

I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices.
In most cases (averages etc.) it seems to fit my requirements. However, I am stuck trying to model a specific use case. In short...
Each device has a device_id. Every device emits 1 event per second. Each event has an event_id ranging from {0-->10}.
An event_id of 0 indicates START & an event_id of 10 indicates END
All the events between START & END should be grouped into one single group (event_group).
This will produce event_group tuples, e.g. (0,2,2,2,5,10), (0,4,2,7,...,5,10), (0,10).
An event_group might be short, i.e. 10 minutes, or very long, say 3 hours.
According to Lambda Architecture these events transmitted by every device are my "Master Data Set".
Currently, the events are sent to HDFS & Storm using Kafka (Camus, Kafka Spout).
In the streaming process I group by device_id, and use Redis to maintain a set of incoming events in memory, based on a key which is generated each time an event with event_id=0 arrives. Something like the sketch below.
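For concreteness, a minimal sketch of that Redis bookkeeping (my interpretation, using the Jedis client; the key names and the handoff step are made up):

```java
import java.util.List;
import java.util.UUID;
import redis.clients.jedis.Jedis;

public class EventGroupTracker {
  private final Jedis jedis = new Jedis("localhost");

  public void onEvent(String deviceId, int eventId) {
    String pointerKey = "current_group:" + deviceId;
    if (eventId == 0) {
      // START: open a fresh group for this device
      jedis.set(pointerKey, "group:" + deviceId + ":" + UUID.randomUUID());
    }
    String groupKey = jedis.get(pointerKey);
    if (groupKey == null) {
      return; // event arrived outside any open group
    }
    jedis.rpush(groupKey, String.valueOf(eventId));
    if (eventId == 10) {
      // END: the group is complete; hand it off and clean up
      List<String> group = jedis.lrange(groupKey, 0, -1);
      // ... write `group` to the serving layer (e.g. HBase) here
      jedis.del(groupKey);
      jedis.del(pointerKey);
    }
  }
}
```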
The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these event_groups?
Using Hive I can group the tuples in the same manner. However, each file will also contain "broken" event_groups:
(0,2,2,3) previous computation (file)
(4,3,) previous computation (file)
(5,6,7,8,10) current computation (file)
so I need to merge them, based on device_id, into (0,2,2,3,4,3,5,6,7,8,10) (spanning multiple files).
Is a Lambda Architecture a fit for this scenario? Or should the streaming process be the only source of truth, i.e. write to HBase and HDFS itself? Won't this affect the overall latency?
As far as I understand your process, I don't see any issue, as the principle of the Lambda Architecture is to regularly re-process all your data in batch mode.
(By the way, not all your data, but a time frame, usually larger than the speed-layer window.)
If you choose a large enough time window for your batch mode (let's say your aggregation window + 3 hours, in order to include even the longest event groups), your MapReduce program will be able to compute all your event groups for the desired aggregation window, whichever files the distinct events are stored in (Hadoop shuffle magic!).
The underlying files are not part of the problem, but the time windows used to select data to process are.
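To illustrate the shuffle point, here is a minimal, untested sketch of such a batch job. The "device_id,timestamp,event_id" line format is an assumption, and for very large groups a secondary sort would be more idiomatic than sorting in the reducer's memory:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EventGroupJob {

  // Key every event by device_id: the shuffle then delivers all events for
  // one device to one reducer, regardless of which HDFS file they came from.
  public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(","); // device_id,timestamp,event_id
      ctx.write(new Text(f[0]), new Text(f[1] + "," + f[2]));
    }
  }

  // Sort one device's events by timestamp, then cut groups at event_id 0
  // (START) and 10 (END).
  public static class EventGroupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text deviceId, Iterable<Text> events, Context ctx)
        throws IOException, InterruptedException {
      List<String[]> sorted = new ArrayList<>();
      for (Text t : events) {
        sorted.add(t.toString().split(","));
      }
      sorted.sort((a, b) -> Long.compare(Long.parseLong(a[0]), Long.parseLong(b[0])));

      StringBuilder group = new StringBuilder();
      for (String[] e : sorted) {
        int eventId = Integer.parseInt(e[1]);
        if (eventId == 0) {
          group.setLength(0); // START: open a new group
        }
        group.append(group.length() == 0 ? "" : ",").append(eventId);
        if (eventId == 10) {
          ctx.write(deviceId, new Text(group.toString())); // END: emit the group
          group.setLength(0);
        }
      }
    }
  }
}
```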

How to create in-house funnel analytics?

I want to create in-house funnel analysis infrastructure.
All the user activity feed information would be written to a database / DW of choice and then, when I dynamically define a funnel I want to be able to select the count of sessions for each stage in the funnel.
I can't find an example of creating such a thing anywhere. Some people say I should use Hadoop and MapReduce for this but I couldn't find any examples online.
Your MapReduce is pretty simple:
The mapper reads a session row from the log file; its output is (stage-id, 1).
Set the number of reducers equal to the number of stages.
The reducer sums the values for each stage, as in the WordCount example (which is the "Hello World" of Hadoop - https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0). A minimal sketch follows this list.
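A WordCount-style sketch of that job (the tab-separated "session_id, stage_id" log format is an assumption):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FunnelCount {

  // Emits (stage_id, 1) for every session row in the log.
  public static class StageMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t"); // assumed: session_id \t stage_id
      ctx.write(new Text(fields[1]), ONE);
    }
  }

  // Sums the 1s per stage, exactly as in WordCount.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text stage, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(stage, new IntWritable(sum));
    }
  }
}
```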
You will have to set up a Hadoop cluster (or use Elastic Map Reduce on Amazon).
To define the funnel dynamically, you can use Hadoop's DistributedCache feature. To see results you will have to wait for the MapReduce job to finish (a minimum of dozens of seconds, or minutes in the case of Amazon's Elastic MapReduce; the time depends on the amount of data and the size of your cluster).
Another solution that may give you results faster - use a database: select stage, count(distinct session_id) from mylogs group by stage;
If you have too much data to quickly execute that query (it does a full table scan; HDD transfer rate is about 50-150 MB/sec, so the math is simple), then you can use a distributed analytic database that runs over HDFS (the distributed file system of Hadoop).
In this case your options are (I list here open-source projects only):
Apache Hive (based on Hadoop's MapReduce, but if you convert your data to Hive's ORC format, you will get results much faster).
Cloudera's Impala - not based on MapReduce, can return your results in seconds. For fastest results convert your data to Parquet format.
Shark/Spark - in-memory distributed database.

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table, and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map up to that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing, and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data flow will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has already been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does/can Hadoop take account of this?
I'm aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or by setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. That is, you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder and catch all of the output sets, as in the sketch below.
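A hedged, untested sketch of that map-only step with a dated output folder. MyMapper stands in for the real map function, and the paths/dates are made up:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyStep {

  static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text doc, Context ctx)
        throws IOException, InterruptedException {
      // placeholder: your real Map function over a document goes here
      ctx.write(new Text("doc-key"), doc);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only step");
    job.setJarByClass(MapOnlyStep.class);
    job.setMapperClass(MyMapper.class);
    job.setNumReduceTasks(0); // map-only: mapper output is written directly
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("new_documents/20091108"));
    // one dated folder per batch; reduce later over my_mapper_output/*
    FileOutputFormat.setOutputPath(job, new Path("my_mapper_output/20091108"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```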
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When the time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs, either in HBase or simply HDFS.
map (reduce_function) on (map_outputs); store into HBase.
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date: if you record the timestamps of successful summary runs somewhere, and open the filter on inputs dated later than the last successful summary, you'll save some significant scanning time.
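To make the timestamp idea concrete, a minimal sketch using the HBase client's time-range scan (the table name and the bookkeeping for lastSuccessfulRun are assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class IncrementalScan {
  public static void main(String[] args) throws IOException {
    long lastSuccessfulRun = Long.parseLong(args[0]); // epoch millis of last summary
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("input_table"))) {
      Scan scan = new Scan();
      scan.setTimeRange(lastSuccessfulRun, Long.MAX_VALUE); // only newer cells
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          // feed `row` to the map step; only unprocessed rows are scanned
        }
      }
    }
  }
}
```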
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability