How to batch process data from Google Pub/Sub to Cloud Storage using Dataflow? - google-bigquery

I'm building a Change Data Capture pipeline that reads data from a MySQL database and creates a replica in BigQuery. I'll be pushing the changes to Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have been able to figure out how to stream the changes, but I need to run batch processing for a few tables in my database.
Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load that data into BigQuery? I want a batch job because a streaming job costs more.

Thanks for the clarification.
First, when you use Pub/Sub in Dataflow (the Beam framework), it is only possible in streaming mode:
Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.
If your process doesn't need to be real time, you can skip Dataflow and save money. You can use Cloud Functions or Cloud Run for the process I propose (App Engine too if you want, but it's not my first recommendation).
In both cases, create a process (Cloud Run or Cloud Function) that is triggered periodically (every week?) by Cloud Scheduler.
Solution 1
Connect your process to the pull subscription
Every time you read a message (or a chunk of messages, for example 1000), stream-write them into BigQuery. However, streaming writes are not free in BigQuery ($0.05 per GB).
Loop until the queue is empty. Set the timeout to the max value (9 minutes with Cloud Functions, 15 minutes with Cloud Run) to prevent any timeout issue. A sketch of this loop is shown below.
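As an illustration only, here is a minimal Python sketch of that loop, assuming a hypothetical pull subscription and target table and using the google-cloud-pubsub and google-cloud-bigquery clients (all names are placeholders):

```python
# Minimal sketch of Solution 1: pull in chunks, stream-insert, ack, repeat.
import json
from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-project"                         # placeholder
SUBSCRIPTION = "cdc-pull-sub"                     # placeholder: a pull subscription
TABLE_ID = "my-project.my_dataset.my_table"       # placeholder: target table

subscriber = pubsub_v1.SubscriberClient()
bq = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)

while True:
    # Synchronous pull of up to 1000 messages per iteration.
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 1000}
    )
    if not response.received_messages:
        break  # nothing left, stop before the function/service times out

    rows = [json.loads(m.message.data) for m in response.received_messages]
    errors = bq.insert_rows_json(TABLE_ID, rows)  # streaming insert (~$0.05/GB)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

    # Ack only after the rows are safely written.
    subscriber.acknowledge(
        request={
            "subscription": sub_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```

Note that a synchronous pull may return fewer messages than requested (or none at all) even while the backlog is not empty, so a real implementation would typically retry a few empty pulls before exiting.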
Solution 2
Connect your process to the pull subscription
Read a chunk of messages (for example 1000) and keep them in memory (in an array).
Loop until the queue is empty. Set the timeout to the max value (9 minutes with Cloud Functions, 15 minutes with Cloud Run) to prevent any timeout issue. Also set the memory to the max value (2 GB) to prevent out-of-memory crashes.
Create a load job into BigQuery from your in-memory data array. Here the load job is free, but you are limited to 1,000 load jobs per day and per table.
However, this solution can fail if your app plus the data is larger than the max memory value. An alternative is to write a file to GCS every, say, 1 million rows (depending on the size and memory footprint of each row). Name the files with a unique prefix, for example the current date (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then create a load job, not from data in memory, but from the data in GCS, with a wildcard in the file name (gs://myBucket/YYYYMMDD-tempFile*). That way, all the files matching the prefix will be loaded.
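To illustrate the wildcard load step, here is a minimal sketch with the Python BigQuery client; the bucket, dataset and table names are placeholders, and the temp files are assumed to be newline-delimited JSON:

```python
# Sketch of the wildcard load job: every gs://myBucket/YYYYMMDD-tempFile* file
# is loaded in a single (free) load job. Names are placeholders.
import datetime
from google.cloud import bigquery

bq = bigquery.Client()
prefix = datetime.date.today().strftime("%Y%m%d")
uri = f"gs://myBucket/{prefix}-tempFile*"          # wildcard matches every chunk file
table_id = "my-project.my_dataset.my_table"        # placeholder: target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = bq.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for completion; load jobs are free but capped per table per day
```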
Recommendation: Pub/Sub messages are kept for up to 7 days in a subscription. I recommend triggering your process at least every 3 days so you have time to react and debug before messages are deleted from the subscription.
Personal experience: streaming writes into BigQuery are cheap for a low volume of data. For a few cents, I recommend you consider the first solution if you can pay for it. The code and the management are smaller/easier!

Related

How to enrich events using a very large database with Azure Stream Analytics?

I'm in the process of analyzing Azure Stream Analytics to replace a stream processing solution based on NiFi with some REST microservices.
One step is the enrichment of sensor data from a very large database of sensors (>120 GB).
Is it possible with Azure Stream Analytics? I tried with a very small subset of the data (60 MB) and couldn't even get it to run.
The job logs give me warnings that memory usage is too high. I tried scaling to 36 streaming units to see if it was even possible, to no avail.
What strategies do I have to make it work?
If I deterministically (via a hash function) partition the input stream into N partitions by ID and then partition the database using the same hash function (so that an ID on the stream and the same ID in the database land in the same partition), can I make this work? Do I need to create several separate Stream Analytics jobs to be able to do that?
I suppose I can use 5 GB chunks, but I could not get it to work with an ADLS Gen2 data lake. Does it really only work with Azure SQL?
Stream Analytics supports reference datasets of up to 5 GB. Please note that large reference datasets come with the downside of making job/node restarts very slow (up to 20 minutes for the reference data to be distributed; restarts may be user-initiated, caused by service updates, or due to various errors).
If you can downsize those 120 GB to 5 GB (scoping only the columns and rows you need, converting to types that are smaller in size), then you should be able to run that workload. Sadly, we don't support partitioned reference data yet. This means that, as of now, if you have to use ASA and can't reduce those 120 GB, you will have to deploy one distinct job for each subset of stream/reference data.
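This is not ASA-specific, but the deterministic partitioning idea from the question can be sketched as follows; the number of partitions and the ID format are assumptions, and the key point is that the same stable hash must be applied when splitting both the stream and the reference data so that matching IDs land in the same job:

```python
# Sketch of deterministic partitioning (names and N are assumptions).
import hashlib

N_PARTITIONS = 24  # e.g. 24 jobs, each holding <= 5 GB of reference data

def partition_for(sensor_id: str, n_partitions: int = N_PARTITIONS) -> int:
    # hashlib gives a stable hash; Python's built-in hash() is not stable across processes.
    digest = hashlib.md5(sensor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

# The same function routes stream events and slices the reference database.
print(partition_for("sensor-000123"))  # always the same value for this ID
```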
Now, I'm surprised you couldn't get 60 MB of reference data to run; if you have details on what exactly went wrong, I'm happy to provide guidance.

Cloud Pub/Sub to BigQuery through Dataflow SQL

I would like to understand how a Dataflow pipeline works.
In my case, I have something published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages that come through is in the thousands, so my publisher client has batch settings of 1000 messages, 1 MB and 10 seconds of latency.
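For reference, batch settings like those are configured roughly as follows with the Python publisher client (the project and topic names are placeholders; the values mirror the ones above):

```python
# Sketch of publisher-side batching: messages are buffered until 1000 messages,
# 1 MB of payload, or 10 seconds of latency, whichever comes first.
from google.cloud import pubsub_v1

batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1000,
    max_bytes=1 * 1024 * 1024,
    max_latency=10,
)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

futures = [
    publisher.publish(topic_path, data=f'{{"id": {i}}}'.encode("utf-8"))
    for i in range(5000)
]
for f in futures:
    f.result()  # wait so buffered batches are flushed before the process exits
```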
The question is: when messages are published in a batch as stated above, does Dataflow SQL take the messages in the batch and write them to BigQuery all in one go, or does it write one message at a time?
On the other hand, is there any benefit of one over the other?
Please comment if any other details are required. Thanks.
Dataflow SQL is just a way to define, with SQL syntax, an Apache Beam pipeline, and to run it on Dataflow.
Because the source is Pub/Sub, a streaming pipeline is created from your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as soon as possible.
Keep in mind that Dataflow streaming never scales to 0, and therefore you will always pay for one or more VMs to keep your pipeline up and running.
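To make that concrete, the job created from a Dataflow SQL statement is roughly equivalent to a hand-written streaming Beam pipeline like the sketch below (topic, table and schema are placeholders; this shows only the programmatic counterpart, not the SQL itself):

```python
# Rough hand-written equivalent of what a Dataflow SQL job sets up.
# Run locally with the DirectRunner, or pass Dataflow options to run on Dataflow.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # a Pub/Sub source implies streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        | "Parse" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```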

PubSub topic with binary data to BigQuery

I expect to have thousands of sensors sending telemetry data at 10 FPS with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data into BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
The question is, what's my best alternative?
I've thought about a Cloud Run service running an Express app to accept the data from Pub/Sub, using a global variable to accumulate around 500 rows in RAM, then dumping them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from the accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method, as you have already mentioned. The insert() method streams one row at a time, irrespective of the accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python and Go (in preview) client libraries.
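As a minimal sketch of the streaming-insert path, here is the same pattern with the Python client (the asker's Node.js insert() call is the analogous API; the table name and row fields are placeholders):

```python
# Sketch of accumulating rows and flushing them with a streaming insert.
from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "my-project.telemetry.frames"  # placeholder

buffer = []  # rows built from incoming Pub/Sub messages

def flush(rows):
    """Flush the accumulated rows with a single insert call."""
    errors = bq.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"streaming insert failed: {errors}")  # handle/retry as needed

# Example usage:
buffer.append({"device_id": "sensor-1", "ts": "2024-01-01T00:00:00Z", "payload": "base64..."})
if len(buffer) >= 500:
    flush(buffer)
    buffer.clear()
```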
Batch Ingestion
If your requirement is to load large, bounded data sets that don't have to be processed in real time, prefer batch loading. BigQuery batch load jobs are free: you only pay for storing and querying the data, not for loading it. Refer to the quotas and limits for batch load jobs here. Some more key points on batch load jobs have been quoted from this article below.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on the performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all of the data or none of it. Queries never scan partial data.

Sending data from website to BigQuery using Pub/Sub and Cloud Functions

Here's what I'm trying to accomplish
A visitor lands on my website
Javascript collects some information and sends a hit
The hit is processed and inserted into BigQuery
And here's how I have planned to solve it
The hit is sent to Cloud Functions HTTP trigger (using Ajax)
Cloud Functions sends a message to Pub/Sub
Pub/Sub sends data to another Cloud Function using a Pub/Sub trigger
The second Cloud Function processes the hit into a BigQuery row and inserts it into BigQuery (see the sketch below)
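A minimal sketch of that second function, written as a Pub/Sub-triggered (background) Cloud Function in Python; the table name and hit fields are assumptions:

```python
# Second Cloud Function: triggered by Pub/Sub, inserts the hit into BigQuery.
import base64
import json
from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "my-project.analytics.hits"  # placeholder

def process_hit(event, context):
    """Background Cloud Function triggered by a Pub/Sub message."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    hit = json.loads(payload)

    row = {
        "page": hit.get("page"),
        "user_agent": hit.get("user_agent"),
        "timestamp": hit.get("timestamp"),
    }
    errors = bq.insert_rows_json(TABLE_ID, [row])  # streaming insert
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```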
Is there a simpler way to solve this?
Some other details to take into account
There are around 1 million hits a day
Don't want to use Cloud Dataflow because it inflates the costs
Can't (probably) skip Pub/Sub because some hits are sent when a person is leaving the site and the request might not have enough time to process everything.
You can use BigQuery streaming inserts; this is less expensive and you avoid hitting the load-job quota of 1,000 per table per day.
Another option, if you don't mind the data taking a long time to load, is to store all the info in a Cloud Storage bucket and then load all the data with a transfer. You can schedule it so that the data is uploaded daily. This solution is aimed at a batch environment in which you store all the info on one side and then transfer it to the final destination. If you only want streaming, the solution you mentioned is fine.
It's up to you to choose the option that better fits your specific usage.

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible buffering interval (15 minutes or 128 MB), so I end up with 96 data files daily, but I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function is invoked when Firehose streams a new file into S3. The function reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file (if it already exists with previous daily data; otherwise it creates a new one) from the destination bucket, decodes both response bodies to strings, concatenates them, and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that when the aggregated file reaches 150+ MB, the Lambda function hits its ~1500 MB memory limit while reading the two files, and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will grow exponentially in the future. It is strange to me that Lambda has such low limits and that they are already reached with such small files.
What are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or some kind of scheduled job, for example one scheduled daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this (see the sketch after this list).
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
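As a rough illustration of the injector/transformer step mentioned above (not a complete pipeline), a PySpark job could read the small Firehose objects and rewrite them as Parquet; bucket names, prefixes and the partitioning scheme are placeholders:

```python
# Minimal PySpark sketch: read one delivery interval's small Firehose objects
# and rewrite them as Parquet for efficient downstream reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("firehose-to-parquet").getOrCreate()

# Read the JSON objects Firehose wrote for one hour (placeholder path).
raw = spark.read.json("s3://my-firehose-bucket/2024/01/01/12/*")

# Write a columnar, splittable format that Spark/EMR can scan quickly.
raw.write.mode("append").parquet("s3://my-analytics-bucket/parquet/dt=2024-01-01/")
```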
Of course, if it is not possible to keep Spark data ready (but not queryable) for a reasonable amount of money, this may not be an option. It may also be that injecting small chunks of data is extremely time-consuming, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps, you can use multipart uploads. For comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You may create a Lambda function that is invoked only once a day using Scheduled Events, and in your Lambda function you should use Upload Part - Copy, which does not need to download your files to the Lambda function. There is already an example of this in this thread.
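For reference, the Upload Part - Copy approach looks roughly like this with boto3 (bucket and key names are placeholders; note that every part except the last must be at least 5 MB):

```python
# Sketch of server-side concatenation with a multipart upload and UploadPartCopy.
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-aggregates"                      # placeholder
DEST_KEY = "daily/2024-01-01.json"                 # placeholder
source_keys = ["firehose/2024/01/01/part-001",     # placeholders: the day's 96 files
               "firehose/2024/01/01/part-002"]

mpu = s3.create_multipart_upload(Bucket=DEST_BUCKET, Key=DEST_KEY)
parts = []
for i, key in enumerate(source_keys, start=1):
    resp = s3.upload_part_copy(
        Bucket=DEST_BUCKET,
        Key=DEST_KEY,
        UploadId=mpu["UploadId"],
        PartNumber=i,
        CopySource={"Bucket": "my-firehose-bucket", "Key": key},
    )
    parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": i})

s3.complete_multipart_upload(
    Bucket=DEST_BUCKET,
    Key=DEST_KEY,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```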