GCP - Message from PubSub to BigQuery

GCP - Message from PubSub to BigQuery - google-bigquery

I need to get the data from my pubsub message and insert into bigquery.
What I have:
const topicName = "-----topic-name-----";
const data = JSON.stringify({ foo: "bar" });
// Imports the Google Cloud client library
const { PubSub } = require("#google-cloud/pubsub");
// Creates a client; cache this for further use
const pubSubClient = new PubSub();
async function publishMessageWithCustomAttributes() {
// Publishes the message as a string, e.g. "Hello, world!" or JSON.stringify(someObject)
const dataBuffer = Buffer.from(data);
// Add two custom attributes, origin and username, to the message
const customAttributes = {
origin: "nodejs-sample",
username: "gcp",
};
const messageId = await pubSubClient
.topic(topicName)
.publish(dataBuffer, customAttributes);
console.log(`Message ${messageId} published.`);
}
publishMessageWithCustomAttributes().catch(console.error);
I need to get the data/attributes from this message and query in BigQuery, anyone can help me?
Thaks in advance!

In fact, there is 2 solutions to consume the messages: either a message per message, or in bulk.
Firstly, before going in detail, and because you will perform BigQuery calls (or Facebook API calls), you will spend a lot of the processing time to wait the API response.
Message per Message
If you have an acceptable volume of message, you can perform a message per message processing. You have 2 solutions here:
You can handle each message with Cloud Functions. Set the minimal amount of memory to the functions (128Mb) to limit the CPU cost and thus the global cost. Indeed, because you will wait a lot, don't spend expensive CPU cost to do nothing! Ok, you will process slowly the data when they will be there but, it's a tradeoff.
Create Cloud Function on the topic, or a Push Subscription to call a HTTP triggered Cloud Functions
You can also handle request concurrently with Cloud Run. Cloud Run can handle up to 250 requests concurrently (in preview), and because you will wait a lot, it's perfectly suitable. If you need more CPU and memory, you can increase these value to 4CPU and 8Gb of memory. It's my preferred solution.
Bulk processing is possible if you are able to easily manage multi-cpu multi-(light)thread development. It's easy in Go. Concurrency in Node is also easy (await/async) but I don't know if it's multi-cpu capable or only single-cpu. Anyway, the principle is the following
Create a pull subscription on PubSub topic
Create a Cloud Run (better for multi-cpu, but also work with App Engine or Cloud Functions) that will listen the pull subscription for a while (let's say 10 minutes)
For each message pulled, an async process is performed: get the data/attribute, make the call to BigQuery, ack the message
After the timeout of the pull connexion, close the message listening, finish the current message processing and exit gracefully (return 200 HTTP code)
Create a Cloud Scheduler that call every 10 minutes the Cloud Run service. Set the timeout to 15 minutes and discard retries.
Deploy the Cloud Run service with a timeout of 15 minutes.
This solution offers a better message throughput processing (you can process more than 250 message per Cloud Run service), but don't have a real advantage because you are limited by the API call latency.
EDIT 1
Code sample
// For pubsunb triggered function
exports.logMessageTopic = (message, context) => {
console.log("Message Content")
console.log(Buffer.from(message.data, 'base64').toString())
console.log("Attribute list")
for (let key in message.attributes) {
console.log(key + " -> " + message.attributes[key]);
};
};
// For push subscription
exports.logMessagePush = (req, res) => {
console.log("Message Content")
console.log(Buffer.from(req.body.message.data, 'base64').toString())
console.log("Attribute list")
for (let key in req.body.message.attributes) {
console.log(key + " -> " + req.body.message.attributes[key]);
};
};

Related

Akka.NET with persistence dropping messages when CPU in under high pressure?

I make some performance testing of my PoC. What I saw is my actor is not receiving all messages that are sent to him and the performance is very low. I sent around 150k messages to my app, and it causes a peak on my processor to reach 100% utilization. But when I stop sending requests 2/3 of messages are not delivered to the actor. Here is a simple metrics from app insights:
To prove I have almost the same number of event persistent in mongo that my actor received messages.
Secondly, performance of processing messages is very disappointing. I get around 300 messages per second.
I know Akka.NET message delivery is at most once by default but I don't get any error saying that message were dropped.
Here is code:
Cluster shard registration:
services.AddSingleton<ValueCoordinatorProvider>(provider =>
{
var shardRegion = ClusterSharding.Get(_actorSystem).Start(
typeName: "values-actor",
entityProps: _actorSystem.DI().Props<ValueActor>(),
settings: ClusterShardingSettings.Create(_actorSystem),
messageExtractor: new ValueShardMsgRouter());
return () => shardRegion;
});
Controller:
[ApiController]
[Route("api/[controller]")]
public class ValueController : ControllerBase
{
private readonly IActorRef _valueCoordinator;
public ValueController(ValueCoordinatorProvider valueCoordinatorProvider)
{
_valueCoordinator = valuenCoordinatorProvider();
}
[HttpPost]
public Task<IActionResult> PostAsync(Message message)
{
_valueCoordinator.Tell(message);
return Task.FromResult((IActionResult)Ok());
}
}
Actor:
public class ValueActor : ReceivePersistentActor
{
public override string PersistenceId { get; }
private decimal _currentValue;
public ValueActor()
{
PersistenceId = Context.Self.Path.Name;
Command<Message>(Handle);
}
private void Handle(Message message)
{
Context.IncrementMessagesReceived();
var accepted = new ValueAccepted(message.ValueId, message.Value);
Persist(accepted, valueAccepted =>
{
_currentValue = valueAccepted.BidValue;
});
}
}
Message router.
public sealed class ValueShardMsgRouter : HashCodeMessageExtractor
{
public const int DefaultShardCount = 1_000_000_000;
public ValueShardMsgRouter() : this(DefaultShardCount)
{
}
public ValueShardMsgRouter(int maxNumberOfShards) : base(maxNumberOfShards)
{
}
public override string EntityId(object message)
{
return message switch
{
IWithValueId valueMsg => valueMsg.ValueId,
_ => null
};
}
}
akka.conf
akka {
stdout-loglevel = ERROR
loglevel = ERROR
actor {
debug {
unhandled = on
}
provider = cluster
serializers {
hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
}
serialization-bindings {
"System.Object" = hyperion
}
deployment {
/valuesRouter {
router = consistent-hashing-group
routees.paths = ["/values"]
cluster {
enabled = on
}
}
}
}
remote {
dot-netty.tcp {
hostname = "desktop-j45ou76"
port = 5054
}
}
cluster {
seed-nodes = ["akka.tcp://valuessystem#desktop-j45ou76:5054"]
}
persistence {
journal {
plugin = "akka.persistence.journal.mongodb"
mongodb {
class = "Akka.Persistence.MongoDb.Journal.MongoDbJournal, Akka.Persistence.MongoDb"
connection-string = "mongodb://localhost:27017/akkanet"
auto-initialize = off
plugin-dispatcher = "akka.actor.default-dispatcher"
collection = "EventJournal"
metadata-collection = "Metadata"
legacy-serialization = off
}
}
snapshot-store {
plugin = "akka.persistence.snapshot-store.mongodb"
mongodb {
class = "Akka.Persistence.MongoDb.Snapshot.MongoDbSnapshotStore, Akka.Persistence.MongoDb"
connection-string = "mongodb://localhost:27017/akkanet"
auto-initialize = off
plugin-dispatcher = "akka.actor.default-dispatcher"
collection = "SnapshotStore"
legacy-serialization = off
}
}
}
}

So there are two issues going on here: actor performance and missing messages.
It's not clear from your writeup, but I'm going to make an assumption: 100% of these messages are going to a single actor.
Actor Performance
The end-to-end throughput of a single actor depends on:
The amount of work it takes to route the message to the actor (i.e. through the sharding system, hierarchy, over the network, etc)
The amount of time it takes the actor to process a single message, as this determines the rate at which a mailbox can be emptied; and
Any flow control that affects which messages can be processed when - i.e. if an actor uses stashing and behavior switching, the amount of time an actor spends stashing messages while waiting for its state to change will have a cumulative impact on the end-to-end processing time for all stashed messages.
You will have poor performance due to item 3 on this list. The design that you are implementing calls Persist and blocks the actor from doing any additional processing until the message is successfully persisted. All other messages sent to the actor are stashed internally until the previous one is successfully persisted.
Akka.Persistence offers four options for persisting messages from the point of view of a single actor:
Persist - highest consistency (no other messages can be processed until persistence is confirmed), lowest performance;
PersistAsync - lower consistency, much higher performance. Doesn't wait for the message to be persisted before processing the next message in the mailbox. Allows multiple messages from a single persistent actor to be processed concurrently in-flight - the order in which those events are persisted will be preserved (because they're sent to the internal Akka.Persistence journal IActorRef in that order) but the actor will continue to process additional messages before the persisted ones are confirmed. This means you probably have to modify your actor's in-memory state before you call PersistAsync and not after the fact.
PersistAll - high consistency, but batches multiple persistent events at once. Same ordering and control flow semantics as Persist - but you're just persisting an array of messages together.
PersistAllAsync - highest performance. Same semantics as PersistAsync but it's an atomic batch of messages in an array being persisted together.
To get an idea as to how the performance characteristics of Akka.Persistence changes with each of these methods, take a look at the detailed benchmark data the Akka.NET organization has put together around Akka.Persistence.Linq2Db, the new high performance RDBMS Akka.Persistence library: https://github.com/akkadotnet/Akka.Persistence.Linq2Db#performance - it's a difference between 15,000 per second and 250 per second on SQL; the write performance is likely even higher in a system like MongoDB.
One of the key properties of Akka.Persistence is that it intentionally routes all of the persistence commands through a set of centralized "journal" and "snapshot" actors on each node in a cluster - so messages from multiple persistent actors can be batched together across a small number of concurrent database connections. There are many users running hundreds of thousands of persistent actors simultaneously - if each actor had their own unique connection to the database it would melt even the most robustly vertically scaled database instances on Earth. This connection pooling / sharing is why the individual persistent actors rely on flow control.
You'll see similar performance using any persistent actor framework (i.e. Orleans, Service Fabric) because they all employ a similar design for the same reasons Akka.NET does.
To improve your performance, you will need to either batch received messages together and persist them in a group with PersistAll (think of this as de-bouncing) or use asynchronous persistence semantics using PersistAsync.
You'll also see better aggregate performance if you spread your workload out across many concurrent actors with different entity ids - that way you can benefit from actor concurrency and parallelism.
Missing Messages
There could be any number of reasons why this might occur - most often it's going to be the result of:
Actors being terminated (not the same as restarting) and dumping all of their messages into the DeadLetter collection;
Network disruptions resulting in dropped connections - this can happen when nodes are sitting at 100% CPU - messages that are queued for delivery at the time can be dropped; and
The Akka.Persistence journal receiving timeouts back from the database will result in persistent actors terminating themselves due to loss of consistency.
You should look for the following in your logs:
DeadLetter warnings / counts
OpenCircuitBreakerExceptions coming from Akka.Persistence
You'll usually see both of those appear together - I suspect that's what is happening to your system. The other possibility could be Akka.Remote throwing DisassociationExceptions, which I would also look for.
You can fix the Akka.Remote issues by changing the heartbeat values for the Akka.Cluster failure-detector in configuration https://getakka.net/articles/configuration/akka.cluster.html:
akka.cluster.failure-detector {
# FQCN of the failure detector implementation.
# It must implement akka.remote.FailureDetector and have
# a public constructor with a com.typesafe.config.Config and
# akka.actor.EventStream parameter.
implementation-class = "Akka.Remote.PhiAccrualFailureDetector, Akka.Remote"
# How often keep-alive heartbeat messages should be sent to each connection.
heartbeat-interval = 1 s
# Defines the failure detector threshold.
# A low threshold is prone to generate many wrong suspicions but ensures
# a quick detection in the event of a real crash. Conversely, a high
# threshold generates fewer mistakes but needs more time to detect
# actual crashes.
threshold = 8.0
# Number of the samples of inter-heartbeat arrival times to adaptively
# calculate the failure timeout for connections.
max-sample-size = 1000
# Minimum standard deviation to use for the normal distribution in
# AccrualFailureDetector. Too low standard deviation might result in
# too much sensitivity for sudden, but normal, deviations in heartbeat
# inter arrival times.
min-std-deviation = 100 ms
# Number of potentially lost/delayed heartbeats that will be
# accepted before considering it to be an anomaly.
# This margin is important to be able to survive sudden, occasional,
# pauses in heartbeat arrivals, due to for example garbage collect or
# network drop.
acceptable-heartbeat-pause = 3 s
# Number of member nodes that each member will send heartbeat messages to,
# i.e. each node will be monitored by this number of other nodes.
monitored-by-nr-of-members = 9
# After the heartbeat request has been sent the first failure detection
# will start after this period, even though no heartbeat mesage has
# been received.
expected-response-after = 1 s
}
Bump the acceptable-heartbeat-pause = 3 s value to something larger like 10,20,30 if needed.
Sharding Configuration
One last thing I want to point out with your code - the shard count is way too high. You should have about ~10 shards per node. Reduce it to something reasonable.

Why Alamofire is using dispatch_sync() function when creating a dataTask?

The code below is from source code of Alamofire
let queue = dispatch_queue_create(nil, DISPATCH_QUEUE_SERIAL)
public func request(URLRequest: URLRequestConvertible) -> Request {
var dataTask: NSURLSessionDataTask!
dispatch_sync(queue) { dataTask = self.session.dataTaskWithRequest(URLRequest.URLRequest) }
let request = Request(session: session, task: dataTask)
self.delegate[request.delegate.task] = request.delegate
if startRequestsImmediately {
request.resume()
}
return request
}
It seems like every time it creates a dataTask, it dispatch that creating process to a serial queue. Would this measure protect the program from any kind of multi-thread trap?
I can't figure out what's the difference without that queue.

The reason why we implemented that check is due to Alamofire Issue #393. We were seeing duplicate task identifiers without the serial queue when creating data and upload tasks in parallel from multiple threads. It appears that Apple has a thread safety issue when incrementing the task identifiers. Therefore in Alamofire, we eliminate the issue by creating the tasks on a serial queue.
Cheers. 🍻

SpringAMQP delay

I'm having trouble to identify a way to delay message level in SpringAMQP.
I call a Webservice if the service is not available or if it throws some exception I store all the requests into RabbitMQ queue and i keep retry the service call until it executes successfully. If the service keeps throwing an error or its not available the rabbitMQ listener keeps looping.( Meaning Listener retrieves the message and make service call if any error it re-queue the message)
I restricted the looping until X hours using MessagePostProcessor however i wanted to enable delay on message level and every time it tries to access the service. For example 1st try 3000ms delay and second time 6000ms so on until i try x number of time.
It would be great if you provide a few examples.
Could you please provide me some idea on this?

Well, it isn't possible the way you do that.
Message re-queuing is fully similar to transaction rallback, where the system returns to the state before an exception. So, definitely you can't modify a message to return to the queue.
Probably you have to take a look into Spring Retry project for the same reason and poll message from the queue only once and retries in memory until successful answer or retry policy exhausting. In the end you can just drop message from the queue or move it into DLQ.
See more info in the Reference Manual.

I added CustomeMessage delay exchange
#Bean
CustomExchange delayExchange() {
Map<String, Object> args = new HashMap<>();
args.put("x-delayed-type", "direct");
return new CustomExchange("delayed-exchange", "x-delayed-message", true, false, args);
}
Added MessagePostProcessor
if (message.getMessageProperties().getHeaders().get("x-delay") == null) {
message.getMessageProperties().setHeader("x-delay", 10000);
} else {
Integer integer = (Integer) message.getMessageProperties().getHeaders().get("x-delay");
if (integer < 60000) {
integer = integer + 10000;
message.getMessageProperties().setHeader("x-delay", integer);
}
}
First time it delays 30 seconds and adds 10seconds each time till it reaches 600 seconds.This should be configurable.
And finally send the message to
rabbitTemplate.convertAndSend("delayed-exchange", queueName,message, rabbitMQMessagePostProcessor);

RabbitMQ Wait for a message with a timeout

I'd like to send a message to a RabbitMQ server and then wait for a reply message (on a "reply-to" queue). Of course, I don't want to wait forever in case the application processing these messages is down - there needs to be a timeout. It sounds like a very basic task, yet I can't find a way to do this. I've now run into this problem with Java API.

The RabbitMQ Java client library now supports a timeout argument to its QueueConsumer.nextDelivery() method.
For instance, the RPC tutorial uses the following code:
channel.basicPublish("", requestQueueName, props, message.getBytes());
while (true) {
QueueingConsumer.Delivery delivery = consumer.nextDelivery();
if (delivery.getProperties().getCorrelationId().equals(corrId)) {
response = new String(delivery.getBody());
break;
}
}
Now, you can use consumer.nextDelivery(1000) to wait for maximum one second. If the timeout is reached, the method returns null.
channel.basicPublish("", requestQueueName, props, message.getBytes());
while (true) {
// Use a timeout of 1000 milliseconds
QueueingConsumer.Delivery delivery = consumer.nextDelivery(1000);
// Test if delivery is null, meaning the timeout was reached.
if (delivery != null &&
delivery.getProperties().getCorrelationId().equals(corrId)) {
response = new String(delivery.getBody());
break;
}
}

com.rabbitmq.client.QueueingConsumer has a nextDelivery(long timeout) method, which will do what you want. However, this has been deprecated.
Writing your own timeout isn't so hard, although it may be better to have an ongoing thread and a list of in-time identifiers, rather than adding and removing consumers and associated timeout threads all the time.
Edit to add: Noticed the date on this after replying!

There is similar question. Although it's answers doesn't use java, maybe you can get some hints.
Wait for a single RabbitMQ message with a timeout

I approached this problem using C# by creating an object to keep track of the response to a particular message. It sets up a unique reply queue for a message, and subscribes to it. If the response is not received in a specified timeframe, a countdown timer cancels the subscription, which deletes the queue. Separately, I have methods that can be synchronous from my main thread (uses a semaphore) or asynchronous (uses a callback) to utilize this functionality.
Basically, the implementation looks like this:
//Synchronous case:
//Throws TimeoutException if timeout happens
var msg = messageClient.SendAndWait(theMessage);
//Asynchronous case
//myCallback receives an exception message if there is a timeout
messageClient.SendAndCallback(theMessage, myCallback);

Twitter stream api with agents in F#

From Don Syme blog (http://blogs.msdn.com/b/dsyme/archive/2010/01/10/async-and-parallel-design-patterns-in-f-reporting-progress-with-events-plus-twitter-sample.aspx) I tried to implement a twitter stream listener. My goal is to follow the guidance of the twitter api documentation which says "that tweets should often be saved or queued before processing when building a high-reliability system".
So my code needs to have two components:
A queue that piles up and processes each status/tweet json
Something to read the twitter stream that dumps to the queue the tweet in json strings
I choose the following:
An agent to which I post each tweet, that decodes the json, and dumps it to database
A simple http webrequest
I also would like to dump into a text file any error from inserting in the database. ( I will probably switch to a supervisor agent for all the errors).
Two problems:
is my strategy here any good ? If I understand correctly, the agent behaves like a smart queue and processes its messages asynchronously ( if it has 10 guys on its queue it will process a bunch of them at time, instead of waiting for the 1 st one to finish then the 2nd etc...), correct ?
According to Don Syme's post everything before the while is Isolated so the StreamWriter and the database dump are Isolated. But because I need this, I never close my database connection... ?
The code looks something like:
let dumpToDatabase databaseName =
//opens databse connection
fun tweet -> inserts tweet in database
type Agent<'T> = MailboxProcessor<'T>
let agentDump =
Agent.Start(fun (inbox: MailboxProcessor<string>) ->
async{
use w2 = new StreamWriter(#"\Errors.txt")
let dumpError =fun (error:string) -> w2.WriteLine( error )
let dumpTweet = dumpToDatabase "stream"
while true do
let! msg = inbox.Receive()
try
let tw = decode msg
dumpTweet tw
with
| :? MySql.Data.MySqlClient.MySqlException as ex ->
dumpError (msg+ex.ToString() )
| _ as ex -> ()
}
)
let filter_url = "http://stream.twitter.com/1/statuses/filter.json"
let parameters = "track=RT&"
let stream_url = filter_url
let stream = twitterStream MyCredentials stream_url parameters
while true do
agentDump.Post(stream.ReadLine())
Thanks a lot !
Edit of code with processor agent:
let dumpToDatabase (tweets:tweet list)=
bulk insert of tweets in database
let agentProcessor =
Agent.Start(fun (inbox: MailboxProcessor<string list>) ->
async{
while true do
let! msg = inbox.Receive()
try
msg
|> List.map(decode)
|> dumpToDatabase
with
| _ as ex -> Console.WriteLine("Processor "+ex.ToString()))
}
)
let agentDump =
Agent.Start(fun (inbox: MailboxProcessor<string>) ->
let rec loop messageList count = async{
try
let! newMsg = inbox.Receive()
let newMsgList = newMsg::messageList
if count = 10 then
agentProcessor.Post( newMsgList )
return! loop [] 0
else
return! loop newMsgList (count+1)
with
| _ as ex -> Console.WriteLine("Dump "+ex.ToString())
}
loop [] 0)
let filter_url = "http://stream.twitter.com/1/statuses/filter.json"
let parameters = "track=RT&"
let stream_url = filter_url
let stream = twitterStream MyCredentials stream_url parameters
while true do
agentDump.Post(stream.ReadLine())

I think that the best way to describe agent is that it is is a running process that keeps some state and can communicate with other agents (or web pages or database). When writing agent-based application, you can often use multiple agents that send messages to each other.
I think that the idea to create an agent that reads tweets from the web and stores them in a database is a good choice (though you could also keep the tweets in memory as the state of the agent).
I wouldn't keep the database connection open all the time - MSSQL (and MySQL likely too) implements connection pooling, so it will not close the connection automatically when you release it. This means that it is safer and similarly efficient to reopen the connection each time you need to write data to the database.
Unless you expect to receive a large number of error messages, I would probably do the same for file stream as well (when writing, you can open it, so that new content is added to the end).
The way queue of F# agents work is that it processes messages one by one (in your example, you're waiting for a message using inbox.Receive(). When the queue contains multiple messages, you'll get them one by one (in a loop).
If you wanted to process multiple messages at once, you could write an agent that waits for, say, 10 messages and then sends them as a list to another agent (which would then perform bulk-processing).
You can also specify timeout parameter to the Receive method, so you could wait for at most 10 messages as long as they all arrive within one second - this way, you can quite elegantly implement bulk processing that doesn't hold messages for a long time.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas