Efficient way yo sink Pub/Sub messages to BigQuery using Cloud Functions

Efficient way yo sink Pub/Sub messages to BigQuery using Cloud Functions - google-bigquery

One of the recommended ways of using Cloud Functions is to invoke them with Pub/Sub Push subscription to write Pub/Sub messages to BigQuery. I have very tiny messages and I have a lot of them.
I'm paying around ~$80 for 60.000.000 messages written to GBQ. The configuration of my GCF is 128 MB, 1 sec timeout.
Source code:
exports['write-to-gbq'] = async (event, context) => {
const message = JSON.parse(Buffer.from(event.data, 'base64').toString())
const dataset = bigquery.dataset('dataset')
const table = dataset.table('table')
await table.insert({...message, created_at: new BigQueryDatetime(String(new Date().toISOString()))})
};
Is there a way cut the cost 10+ folds? Compute Engine instance seems to be way less expensive, but maybe I'm doing something wrong?

You can use Cloud Run. You can trigger it with a PubSub push subscription, but you can process several concurrent message on the same instance, compared to Cloud Functions with accept only 1 message at a time.
And you are charged for the running time of your instance (charged on CPU time and Memory time). I wrote (a quite old and slightly out of date) article on that.
Now you can have up to 250 concurrent request in GA and up to 1000 concurrent request in Preview per instance. You can speed up your process and decrease your costs.

Since August 2022, you can use BigQuery Subscription to write pubsub messages into BQ https://cloud.google.com/pubsub/docs/bigquery
You can find announcement here https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics

You may use a dataflow template for this purpose as shown below screenshot

Related

Kafka Streams Fault Tolerance with Offset Management in Parellel

Description :
I have one Kafka Stream application which is consuming from a topic.
The events are coming at high volumes.
KafkaStream will consume the events as a terminal operation and club the events in a bunch say 1000 events and writes it to AWS S3.
I have threads that are writing to s3 in parallel after consuming events from Kafka topic.
Not using kafka-connector-s3 due to some business application logics and processings.
Problem ::
I want the application to be fault-tolerant don't want to loose messages.
--> CRASH SCENARIO
Suppose the application has 10 threads all are running and trying to put the events in S3, and a crash happens, in that case, since the KafkaStream has ( enable.auto.commit = false )and we cannot commit the offset manually and all the threads have consumed messages from Kafka topic.
In this case, KafkaStreams has already committed the offset after reading but it could not have processed the events to S3.
I need a mechanism so that I can be sure of that what was the last offset till the events get written to the S3 file successfully.
And In crash scenarios, how should I deal with this and how to manage the Kafka offsets in Kafka Streams as I am using say 10 threads. What if some failed to write to s3 and some are passed. How do I ensure the ordering of offset getting successfully processed to s3 or not?
Let me know if I am not clear to describe my problem statement.
Thanks!

I can assure you that enable.auto.commit is set to false in Kafka Streams. The Javadocs at https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/StreamsConfig.html state
"enable.auto.commit" (false) - Streams client will always disable/turn off auto committing
You are right that Kafka Streams will automatically commit in more or less regular intervals. However, Kafka Streams waits until records are processed before committing the corresponding offsets. That means you would at least get at-least-once guarantees and not lose messages.
As far as I understand your application, your terminal processor does not block until the records are sent to S3. That means, Kafka Streams cannot know when the sending is completed. Kafka Streams just sees that the terminal processor completed its processing and then -- if the commit interval elapsed -- it commits the offsets.
You say
Not using kafka-connector-s3 due to some business application logics and processings.
Could you put the business application logic in the Kafka Streams application, write the results to a Kafka topic with operator to(), and then use the kafka-connector-s3 to send the messages in that topic to S3?
I am not a connect expert, but I guess that would make sure that messages are not lost and would make your implementation simpler.

Using kafka-stream ,you could aggragate 5000 messages from source topic to one big message and send the big one to another topic like middle_topic. You need another proceccor source from the middle_topic and sink to s3 using s3-connector.

How to increase the topic consumer throughput by using `Function<Flux<String>, Flux<String>>`?

I have a service based on webFlux and will consume then produce message from a kafka topic.
My code is just like this
#Bean
public Function<Flux<String>, Flux<String>> reactiveUpperCase() {
return flux -> flux.map(val -> val.toUpperCase());
}
What I found is when I have 2 instance, I could consume 750 message per 30 minutes, but my CPU is never higher than 10%.
As time goes by, the lag keeps increasing, so I'm wondering how could I increase the consumer throughput.
From the documents, the concurrency doesn't take effect for reactive, link
Does any one know how could I increase the throughput without adding more instance?

As I'm using kotlin, what I found is using Flow.flatMapMerge(parallelCount)

Creating multiple subs on same topic to implement load sharing (pub/sub)

I spent almost a day on google pub sub documentation to create a small app. I am thinking of switching from rabbitMQ to google pub/sub. Here is my question:
I have an app that push messages to a topic (T). I wanted to do load sharing via subscribers. So I created 3 subscribers to T. I have kept the name of all 3 subs same (S), so that I don't get same message 3 times.
I have 2 issues:
There is no where I console I see 3 same subscribers to T. It shows 1
If I try to start all 3 instances of subscribers at same time. I get "A service error has occurred.". Error disappeared if I start in sequential manner.
Lastly, Is google serious about pub/sub ? Looking at the documentations and public participations, I am not sure if I should switch to google pub/sub.
Thanks,

In pub/sub, each subscription gets a copy of every message. So to load balance handling message, you don't want 3 different subscriptions, but a single subscription that distributes messages to 3 workers.
If you are using pull delivery, simply create a single subscription (as a one-time action when you set up the system), and have each worker pull from the same subscription.
If you are using push delivery, have a single subscription pushing to a single endpoint that provides load balancing (e.g. push to a HTTP load balancer with multiple instances in a backend service
Google is serious about Pub/Sub, it is deeply integrated into many products (GCS, BigQuery, Dataflow, Stackdriver, Cloud Functions etc) and Google uses it internally.

As per documentation on GCP,https://cloud.google.com/pubsub/architecture.
Load balanced subscribers are possible, but all of them have to use same subscription. Don't have any code sample or POC ready but working on same.

Handling of pubsub subscribers for distributed longrunning tasks

I am evaluating the use of using pubsub for long-running tasks such as video transcoding, where a particular transcode may take between 2-10 minutes. Is pubsub a good approach for such a task distribution? For example, let's say I have five servers:
- publisher1
- publisher2
- publisher3
- publisher4
- publisher5
And a topic called "videos". Would it be possible to spread out the messages equally across those five servers? What about when servers are added or removed? What would be a good approach to doing this, or is pubsub not the right tool for something like this?

This does sound like a reasonable use case for pubsub. Specifically, if you use a pull subscriber, you can configure flow control settings to have at most one outstanding message to your server, and configure the max ack extension period (in java) to be a reasonable upper bound of your processing time. This api is described here http://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/pubsub/v1/package-summary.html
This should effectively load balance across your servers by default if you use the same subscriber id for all jobs. If a server is added and backlog exists, it will receive a new entry. If a server is removed, it will no longer be sent messages. If it removed while processing or crashes, the message it was working on will be resent to another server.
One concern however is that pubsub has a limit of 10MB per message. You might consider instead putting the data itself in a google cloud storage bucket. Cloud storage can publish the file location to a pubsub topic when an upload is complete. https://cloud.google.com/storage/docs/pubsub-notifications

Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

I want to store IoT event data in Google Cloud Storage, which will be used as my data lake. But doing a PUT call for every event is too costly, therefore I want to append into a file, and then do a PUT call per hour. What is a way of doing this without losing data in case a node in my message processing service goes down?
Because if my processing service ACKs the message, the message will no longer be in Google Pub/Sub, but also not in Google Cloud Storage yet, and at that moment if that processing node goes down, I would have lost the data.
My desired usage is similar to this post that talks about using AWS Kinesis Firehose to batch messages before PUTing into S3, but even Kinesis Firehose's max batch interval is only 900 seconds (or 128MB):
https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/

If you want to continuously receive messages from your subscription, then you would need to hold off acking the messages until you have successfully written them to Google Cloud Storage. The latest client libraries in Google Cloud Pub/Sub will automatically extend the ack deadline of messages for you in the background if you haven't acked them.
Alternatively, what if you just start your subscriber every hour for some portion of time? Every hour, you could start up your subscriber, receive messages, batch them together, do a single write to Cloud Storage, and ack all of the messages. To determine when to stop your subscriber for the current batch, you could either keep it up for a certain length of time or you could monitor the num_undelivered_messages attribute via Stackdriver to determine when you have consumed most of the outstanding messages.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Efficient way yo sink Pub/Sub messages to BigQuery using Cloud Functions - google-bigquery

Since August 2022, you can use BigQuery Subscription to write pubsub messages into BQ https://cloud.google.com/pubsub/docs/bigquery You can find announcement here https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics

You may use a dataflow template for this purpose as shown below screenshot

Related

Kafka Streams Fault Tolerance with Offset Management in Parellel

How to increase the topic consumer throughput by using `Function<Flux<String>, Flux<String>>`?

Creating multiple subs on same topic to implement load sharing (pub/sub)

Handling of pubsub subscribers for distributed longrunning tasks

Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

Categories

Resources