Apache Beam KafkaIO offset management to external data stores

I am trying to read from multiple Kafka brokers using KafkaIO on Apache Beam. The default option for offset management is to store offsets in Kafka itself (ZooKeeper is no longer used for this as of Kafka 0.9). With this setup, when I restart the job/pipeline, I get duplicate and missing records.
From what I have read, the best way to handle this is to manage offsets in an external data store. Is it possible to do this with the current version of Apache Beam and KafkaIO? I am using version 2.2.0 right now.
Also, after reading from Kafka, I write the records to BigQuery. Is there a KafkaIO setting that commits a message only after I have inserted it into BigQuery? I can only find the auto-commit setting right now.

In Dataflow, you can update a job rather than restarting it from scratch. The new job resumes from the last checkpointed state, ensuring exactly-once processing. This works for the KafkaIO source as well. The auto-commit option in the Kafka consumer configuration helps, but it is not atomic with Dataflow's internal state, which means a restarted job might see a small fraction of duplicate or missing messages.
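To illustrate the ordering the question asks about (advance the offset only after the sink write has succeeded), here is a minimal, library-free sketch. It does not use the KafkaIO or BigQuery APIs; `FakeSink`, `process_batch`, and the row-id scheme are hypothetical names invented for the illustration. Because a crash between the write and the commit leads to replayed records, the sketch assumes the sink deduplicates on a deterministic id built from topic, partition, and offset.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, int, str]  # (topic, partition, offset, payload)


class FakeSink:
    """Hypothetical stand-in for the destination table; dedupes on a row id."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def write(self, record: Record) -> None:
        topic, partition, offset, payload = record
        row_id = f"{topic}-{partition}-{offset}"  # deterministic, so replays overwrite
        self.rows[row_id] = payload


def process_batch(records: List[Record], sink: FakeSink,
                  committed: Dict[Tuple[str, int], int]) -> None:
    """Write each record first, then advance the externally stored offset."""
    for record in records:
        topic, partition, offset, _ = record
        if offset < committed.get((topic, partition), 0):
            continue  # already written and committed in an earlier run
        sink.write(record)
        committed[(topic, partition)] = offset + 1  # next offset to read


sink = FakeSink()
committed: Dict[Tuple[str, int], int] = {}
process_batch([("events", 0, 0, "a"), ("events", 0, 1, "b")], sink, committed)
# Simulate a restart that replays the last record plus one new record.
process_batch([("events", 0, 1, "b"), ("events", 0, 2, "c")], sink, committed)
```

This gives at-least-once delivery with an idempotent sink, which is a different technique from the checkpoint-based exactly-once behaviour described in the answer.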

Related

Replaying Kafka events stored in S3

I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (e.g. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/GB/mo. or less) after seven days using the S3 Sink Connector, and eventually keep moving them down the chain of S3 storage levels.
However, I don't understand how, if I need to replay a topic from the beginning to restore a service, Kafka would get that data back on demand from S3. I know I can use a source connector, but it seems that is only for setting up a new topic, not for pulling data back into an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back to. But you may want to refer to the storage configuration properties regarding topics.dir and how it relates to the topic.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
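A rough sketch of that "write some code" option is below. It deliberately avoids any real S3 or Kafka client: `replay_events` is a hypothetical helper, the objects are passed in as `(key, body)` pairs, and `send` is whatever callable wraps your producer. It also assumes the stored objects are newline-delimited JSON and that their keys sort chronologically, neither of which is stated in the question.

```python
import json
from typing import Callable, Iterable, List, Tuple


def replay_events(objects: Iterable[Tuple[str, str]],
                  send: Callable[[str, bytes], None],
                  topic: str) -> int:
    """Send every stored event to `topic`, oldest object first.

    Assumes each body is newline-delimited JSON and that object keys
    sort in the order the events were originally written.
    """
    sent = 0
    for _key, body in sorted(objects, key=lambda pair: pair[0]):
        for line in body.splitlines():
            if not line.strip():
                continue  # skip blank lines between events
            event = json.loads(line)  # fail fast on a corrupt record
            send(topic, json.dumps(event).encode("utf-8"))
            sent += 1
    return sent


# Usage with in-memory stand-ins for the bucket listing and the producer.
archived = [
    ("orders/part-0002.json", '{"id": 3}\n'),
    ("orders/part-0001.json", '{"id": 1}\n{"id": 2}\n'),
]
outbox: List[Tuple[str, bytes]] = []
count = replay_events(archived, lambda t, v: outbox.append((t, v)), "orders-replay")
```

Replaying into a separate topic (here the made-up name `orders-replay`) keeps the restored history from interleaving with live traffic; whether that suits your consumers is a design choice.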
Keep in mind, for your recovery cost calculations, that reads from the colder S3 storage tiers cost progressively more.
You may also want to follow the development of Kafka's native tiered storage support (or, similarly, look at Apache Pulsar as an alternative).

Schedule task to load BigQuery table into Apache Ignite

I have a use case where we need to periodically load a BigQuery table into a cache and support SQL queries from there. I've been researching Apache Ignite and think it could be a good fit for our use case. The only thing that isn't clear to me yet is how to get auto-load from BigQuery. By "auto-load" I mean keeping Apache Ignite updated with the BigQuery table data and making this updating transparent to applications. In most cases, our BigQuery tables are updated by other scheduled jobs/queries at intervals from 5 minutes to 1 month.
I'm new to Ignite, and I guess my questions are as follows:
Is this a feature supported in Ignite already? (I couldn't find any)
Or are there any existing plugins already? (I couldn't find any)
How do I implement the auto-load cache for BigQuery using Ignite?
You can do this once with Cache Store / loadCache(), but doing this every few minutes is infeasible. You may wish to design a BigQuery streamer to Apache Ignite, if it supports pushing of deltas.
If Google BigQuery doesn't open its changelog files to CDC tools, then find another way to capture those updates and stream them to Ignite via its IgniteDataStreamer API. There should be a way to capture the changes with some pub/sub mechanism.
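One possible way to "capture those updates differently", if no change feed is available, is to diff two successive snapshots of the table and push only the difference to the cache. This is an assumption on my part rather than something either answer proposes, and the sketch below touches neither BigQuery nor Ignite: `diff_snapshots` is a hypothetical helper, and rows are assumed to be keyed by a primary key.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def diff_snapshots(old: Dict[str, Row],
                   new: Dict[str, Row]) -> Tuple[Dict[str, Row], List[str]]:
    """Return (rows to upsert, keys to delete) needed to turn `old` into `new`."""
    upserts = {key: row for key, row in new.items() if old.get(key) != row}
    deletes = [key for key in old if key not in new]
    return upserts, deletes


previous = {"1": {"name": "a"}, "2": {"name": "b"}, "4": {"name": "d"}}
current = {"1": {"name": "a"}, "2": {"name": "B"}, "3": {"name": "c"}}
upserts, deletes = diff_snapshots(previous, current)
# `upserts` would be fed to the streamer; `deletes` removed from the cache.
```

This only pays off when the table is small enough to snapshot on every refresh interval; for large tables a real change stream is the better fit.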

Aerospike cluster rebalancing causing errors

When adding a new node to an Aerospike cluster, a rebalance happens for the new node. For large data sets this takes time, and some requests to the new node fail until the rebalance is complete. The only solution I could figure out is to retry the request until it gets the data.
Is there a better way?
I don't think it is possible to keep the node out of the cluster for requests until it's done replicating, because it is also the master for one of the partitions.
If you are performing batch-reads, there is an improvement in 3.6.0. While the cluster is in-flux, if the client directs the read transaction to Node_A, but the partition containing the record has been moved to Node_B, Node_A proxies the request to Node_B.
Is that what you are doing?
You should not be in a position where the client cannot connect to the cluster, or it cannot complete a transaction.
I know that SO frowns on this, but can you provide more detail about the failures? What kinds of transactions are you performing? What versions are you using?
I hope this helps,
-DM
Requests shouldn't be failing; the new node will proxy to the node that currently has the data.
Prior to Aerospike 3.6.0, batch read requests were the exception. I suspect this is your problem.
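For reference, the client-side retry workaround described in the question might look roughly like the sketch below. It is generic Python, not the Aerospike client API: `retry`, `flaky_read`, and the use of `TimeoutError` as the retryable error are all assumptions made for the illustration.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          attempts: int = 5,
          base_delay: float = 0.05,
          retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying with exponential backoff on `retry_on` errors."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Usage: a read that fails twice before succeeding.
calls = {"n": 0}


def flaky_read() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("partition still migrating")
    return "record"


delays: List[float] = []
result = retry(flaky_read, sleep=delays.append)  # record delays instead of sleeping
```

Bounding the number of attempts matters here: if the failures come from something other than the rebalance, an unbounded loop would hide the real problem.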

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
"With fewer larger nodes, in case of a node failure the impact to the application will be greater"
Thank you in advance for your time :)
By default, when data is written to Couchbase, the client returns success as soon as the data is written to one node's memory. After that, Couchbase saves it to disk and replicates it.
If you want to ensure that data is persisted to disk, most client libraries have functions that allow you to do that. With the help of those functions you can also ensure that data is replicated to another node. This function is called observe.
When one node goes down, it should be failed over. Couchbase Server can do that automatically when the auto-failover timeout is set in the server settings. For example, if you have a 3-node cluster, the stored data has 2 replicas, and one node goes down, you will not lose data. If a second node fails, you still will not lose all data: it will be available on the last node.
If the node that was master goes down and is failed over, another live node becomes master. In your client you point to all servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you have 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication), and manually check server availability with HA proxies or something else. That way you get only one IP to connect to (the proxy's IP), which will automatically get data from a live server.
Fortunately, Couchbase is a good fit for HA systems.
Let me explain in a few sentences how it works. Suppose you have a 5-node cluster. The application, using the client API/SDK, is always aware of the topology of the cluster (and of any change in the topology).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose which node it should be written to. So the client selects the node using a CRC32 hash and writes to that node. Then the cluster asynchronously copies 1 or more replicas to the other nodes (depending on your configuration).
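The client-side node selection described above can be sketched as follows. This is a simplified illustration, not the actual Couchbase SDK algorithm: the partition count, the plain `crc32 % N` mapping, the round-robin partition map, and all function names are assumptions. A real client receives its partition map from the cluster and post-processes the CRC32 value differently.

```python
import zlib
from typing import List

NUM_PARTITIONS = 1024  # assumed partition count for this sketch


def partition_for(key: str) -> int:
    """Map a document key to a partition using a CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def build_partition_map(nodes: List[str]) -> List[str]:
    """Assign partitions to nodes round-robin (a real cluster publishes this map)."""
    return [nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)]


def node_for(key: str, partition_map: List[str]) -> str:
    """Return the node holding the active copy of `key`."""
    return partition_map[partition_for(key)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
partition_map = build_partition_map(nodes)
owner = node_for("user::42", partition_map)
```

Because the key hashes to a partition rather than directly to a node, a failover only has to change which node owns the affected partitions; the key-to-partition step stays the same.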
Couchbase has only 1 active copy of a document at a time, so it is easy to be consistent: applications get and set from this active document.
In case of failure, the server has some work to do. Once the failure is discovered (automatically or by a monitoring system), a "failover" occurs. This means that the replicas are promoted to active and it is now possible to work as before. Usually you then rebalance the cluster to spread the load properly.
The sentence you are quoting simply says that the fewer nodes you have, the bigger the impact of a failure/rebalance, since you have to route the same number of requests to a smaller number of nodes. Fortunately, you do not lose data ;)
You can find some very detailed information about this way of working on the Couchbase CTO's blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I work as a developer evangelist at Couchbase.

logging apache2 to mongodb: apache hook? something out there?

I just found a great blog post at http://simonwillison.net/2009/Aug/26/logging/ stating the following:
"MongoDB is fantastic for logging". Sounds tempting... high performance inserts, JSON structured records and capped collections if you only want to keep the past X entries. If you care about older historic data but still want to preserve space you could run periodic jobs to roll up log entries in to summarised records. It shouldn’t be too hard to write a command-line script that hooks in to Apache’s logging directive and writes records to MongoDB.
Is there anything out there already? Is anyone already using Apache logging with MongoDB?
A simple solution is to set Apache to write its access logs to a Perl script, which then does the needed work such as parsing, inserting into Mongo, and so on.
@Alexander, you don't need to have Apache block on IO. Write your logger/Perl script so it uses a message queue + threading. Apache sends the log line to the Perl script, which then inserts the message into a queue held in memory. Another thread reads the queue and does the actual work. We do this on our 1 billion+ views/month cache servers and it works without fail.
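The queue + worker-thread pattern described above can be sketched like this. The answer uses Perl; this is a Python illustration of the same idea, with hypothetical names (`run_logger`, `store`) and a plain list standing in for the MongoDB insert. In a real deployment `lines` would be the script's standard input, fed by Apache.

```python
import queue
import threading
from typing import Callable, Iterable, List

_STOP = object()  # sentinel telling the worker to finish


def run_logger(lines: Iterable[str], store: Callable[[str], None]) -> None:
    """Read log lines quickly and hand them to a background worker."""
    pending: "queue.Queue[object]" = queue.Queue()

    def worker() -> None:
        # Does the slow part (parsing, database insert) off the reading thread.
        while True:
            item = pending.get()
            if item is _STOP:
                break
            store(str(item).rstrip("\n"))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    for line in lines:
        pending.put(line)  # cheap in-memory hand-off, so the reader never waits on IO
    pending.put(_STOP)
    thread.join()  # drain the queue before exiting


saved: List[str] = []
run_logger(["GET /a 200\n", "GET /b 404\n"], saved.append)
```

The queue here is unbounded, so a database outage would make memory grow; a bounded queue with a drop or spill policy is one way to handle that, at the cost of possibly losing log lines.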
A relatively recent option is to use Flume to collect the logs and use the MongoDB sink plugin for Flume to write the events to MongoDB.