ELK Stack and Redis. Can I stop the DB being emptied?

I have a running live system that uses a Redis DB and an old ELK stack, and I am creating a new version. What I want is to use the input section of my new Logstash to read the data from the old Redis DB, but in my tests when I do this I seem to drain the data from it. I do not want to modify the current Logstash or live pipeline implementation in any way (e.g. add a second output to the live Logstash config).
LIVE Data -> Redis -> Logstash -> ES -> Kibana
               |                          :
           read only           compare old with new
               |                          :
               v                          v
          New Logstash  ->  New ES  ->  New Kibana
I feel I am missing something about the relationship between logstash and redis. I was hoping to simply duplicate the redis read in my new logstash config and validate that the pipeline behaves the same as the old one before I go live with it, but if I am deleting this data rather than duplicating it I'm going to seriously upset the monitoring team!
How can I prevent my new logstash from draining the logs from redis?

I found a solution without answering this question: I used a Redis replica, e.g.
LIVE Data -> Redis -> Logstash -> ES -> Kibana
               |                          :
           read only                   compare
               |                          :
               v                          v
         Redis Replica -> New Logstash -> New ES -> New Kibana
The only issue was that Logstash pulled the data out of the old Redis DB so quickly that the new Logstash was usually unable to read it from the replica (only bits got through): the master propagated the LPOP to the replica and the data was removed there as well. This was not too bad, as I could stop the Logstash on the old system, let the new system fill its ES database, then re-enable the old Logstash, which caught up very quickly. Both systems then had the same set of data for the same period and I could compare.
I have not worked out a way to run these two systems at the same time, but this is good enough for my purposes, as it does not change the config of the old system; it just interrupts it for a while.
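For what it's worth, the behaviour that confused me seems to come from the fact that the Logstash redis input treats the list as a work queue: it pops entries off the list, so any additional reader competes for the same elements rather than receiving a copy, and the pop is propagated to replicas as a delete. A minimal redis-py sketch (host, port and the "logstash" key name are assumptions) shows the queue semantics:

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local test instance

# Simulate the live pipeline pushing three log events onto the list.
for event in ('{"msg": "a"}', '{"msg": "b"}', '{"msg": "c"}'):
    r.rpush("logstash", event)

# Two "Logstash" readers popping from the same list: each event is
# delivered to exactly one of them, and the list ends up empty.
old_reader = [r.lpop("logstash")]
new_reader = [r.lpop("logstash"), r.lpop("logstash")]

print(old_reader)          # e.g. [b'{"msg": "a"}']
print(new_reader)          # the remaining two events
print(r.llen("logstash"))  # 0 -- the list has been drained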

Related

Do we need to reload the cache when the Redis server is restarted?

I am new to the Redis cache, which we are using in my Node API application. On start-up of this application, we set the values in the cache. Do we need to reload the values when the Redis server restarts? Please help with this.
Thanks in advance
Redis has a configuration option that writes the contents of the database to disk and loads the data from disk back into the database when it restarts. The details of this option are in the docs here: https://redis.io/topics/persistence
If you need some data to always be present in Redis, then you'll either need to enable the persistence described above, or do something like this in your app:
# when retrieving something from the Redis cache
if (item_is_in_cache('my_key')) {           # inexpensive operation
    retrieve_item_from_cache('my_key');     # inexpensive operation
} else {
    store_important_data_in_cache();        # expensive operation
    retrieve_item_from_cache('my_key');     # the data is now back in the cache
}
What this pseudo-code does is first check whether the required data is in the cache, and retrieve it if it is. Checking and retrieving data from a Redis cache is an inexpensive operation, meaning the resources required are low. If the data isn't in the cache (i.e., the Redis server recently restarted), we have to put the important data back into the cache. This can be an expensive operation (more resources used than checking/retrieving data) depending on the amount of data required.
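If it helps, here is a rough Python equivalent of that pseudo-code using redis-py; the key name, TTL and the load_important_data_from_db() function are placeholders, and your Node app would do the same thing with its Redis client:

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_important_data_from_db():
    # Placeholder for the expensive operation (DB query, computation, ...).
    return {"plans": ["basic", "pro"], "version": 42}

def get_important_data():
    cached = r.get("important_data")           # inexpensive check/retrieve
    if cached is not None:
        return json.loads(cached)
    data = load_important_data_from_db()       # expensive; runs after a Redis restart
    r.set("important_data", json.dumps(data), ex=3600)  # cache for an hour
    return data

print(get_important_data())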

Apache beam KafkaIO offset management to external data stores

I am trying to read from multiple Kafka brokers using KafkaIO on Apache Beam. The default option for offset management is the Kafka partition itself (ZooKeeper is no longer used for this from Kafka 0.9 onwards). With this setup, when I restart the job/pipeline, there are issues with duplicate and missing records.
From what I have read, the best way to handle this is to manage offsets in an external data store. Is it possible to do this with the current version of Apache Beam and KafkaIO? I am using version 2.2.0 right now.
After reading from Kafka, I will write the data to BigQuery. Is there a setting in KafkaIO where I can commit a message only after I have inserted it into BigQuery? I can only find the auto-commit setting right now.
In Dataflow, you can update a job rather than restarting it from scratch. The new job resumes from the last checkpointed state, ensuring exactly-once processing. This works for the KafkaIO source as well. The auto-commit option in the Kafka consumer configuration helps, but it is not atomic with Dataflow's internal state, which means a restarted job might see a small fraction of duplicate or missing messages.
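Outside of Beam, the commit-only-after-a-successful-write pattern you are asking about looks roughly like the sketch below. This uses a plain Kafka consumer (kafka-python) and the BigQuery client rather than KafkaIO, so treat it purely as an illustration of the idea; the topic, brokers and table name are placeholders:

import json
from kafka import KafkaConsumer          # pip install kafka-python
from google.cloud import bigquery        # pip install google-cloud-bigquery

consumer = KafkaConsumer(
    "events",                            # placeholder topic
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    group_id="bq-loader",
    enable_auto_commit=False,            # commit manually, only after the write
)
bq = bigquery.Client()
table = "my_project.my_dataset.events"   # placeholder table

for message in consumer:
    row = json.loads(message.value)
    errors = bq.insert_rows_json(table, [row])   # streaming insert
    if not errors:
        consumer.commit()   # offset advances only after a successful insert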

Can I push data directly to Kibana's data store when using the ELK stack?

(a very similar question has been asked but has no answers)
I have a job processor (Node.js) that takes in a couple of fields, runs a query and some data manipulation on the result, then sends the final result to a RabbitMQ queue. I have logging set up with Bunyan.
Now we'd like to log the results. A typical record in this log would look like:
{
  "queryTime": 1460135319890,
  "transID": "d5822210-8f87-4327-b43c-957b1ff96306",
  "customerID": "AF67879",
  "processingTime": 2345,
  "queryStartDate": "1/1/2016",
  "queryEndDate": "1/5/2016",
  "numRecords": 20,
  "docLength": 67868
}
The org has an existing ELK stack set up. I've got enough experience with Redis that it would be very simple to just push the data that I want out to the Redis instance in the ELK stack. Seems a lot easier than setting up logstash and messing around with its config.
I'd like to be able to visualize the customerID, processingTime and numRecords fields (to start). Is Kibana the right choice for this? Can I push data directly to it instead of messing around with LogStash?
Kibana doesn't have a datastore of its own; it relies on Elasticsearch, i.e. Kibana uses the data stored in Elasticsearch to provide visualizations.
Hence you cannot push Redis logs directly into Kibana, bypassing Elasticsearch. To parse logs from Redis you need to use Logstash and Elasticsearch to push in your data.
Approach: use Logstash and create a Logstash configuration file in which the input section uses the redis plugin and the output section uses the elasticsearch plugin.
Kibana is an open-source tool that fits well with what you want to achieve and with your organization's existing setup.
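To make the point concrete: whether or not the record travels through Redis and Logstash, it has to end up as a document in Elasticsearch before Kibana can chart customerID, processingTime and numRecords. A minimal sketch with the official Python client (the host, index name and the 7.x-style body= argument are assumptions about your setup):

from datetime import datetime
from elasticsearch import Elasticsearch   # assumed 7.x-style client

es = Elasticsearch("http://localhost:9200")   # the cluster behind Kibana

doc = {
    "@timestamp": datetime.utcnow().isoformat(),
    "transID": "d5822210-8f87-4327-b43c-957b1ff96306",
    "customerID": "AF67879",
    "processingTime": 2345,
    "numRecords": 20,
    "docLength": 67868,
}

# Index the record; Kibana can then visualize the fields from this index.
es.index(index="job-logs-2016.04.08", body=doc)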
We provide native support for node.js so you could push data directly, bypassing Logstash.
(disclosure - I'm an evangelist at Logz.io)

Can Infinispan be forced to fully replicate to a new cluster member

Looking through the Infinispan getting started guide it states [When in replication mode]
Infinispan only replicates data to nodes which are already in the
cluster. If a node is added to the cluster after an entry is added, it
won’t be replicated there.
I read this as: any cluster member will always be ignorant of any data that existed in the cluster before it became a member.
Is there a way to force Infinispan to replicate all existing data to a new cluster member?
I see two options currently but I'm hoping I can just get Infinispan to do the work.
1. Use a distributed cache and live with the increase in access times inherent in the model, but this at least leaves Infinispan to handle its own state.
2. Create a Listener that listens for a new cache member joining and iterates through the existing data, pushing it into the new member. Unfortunately this would in effect cause every entry to be replicated out to the existing cluster members again. I don't think this option will fly.
This information sounds misleading or outdated. When a node joins a cluster, a rebalance process is initiated, and if you query for data during the rebalance, before that data has been delivered to the new node, the entry is fetched via a remote RPC.

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the
application will be greater
Thank you in advance for your time :)
By default, when data is written into Couchbase, the client returns success as soon as the data has been written to one node's memory. After that, Couchbase persists it to disk and replicates it.
If you want to ensure that data has been persisted to disk, most client libraries have functions that allow you to do that. With those functions you can also ensure that the data has been replicated to another node. This mechanism is called observe.
When one node goes down, it should be failed over. Couchbase Server can do that automatically when the auto-failover timeout is set in the server settings. For example, if you have a 3-node cluster, the stored data has 2 replicas, and one node goes down, you will not lose data. If a second node fails, you still will not lose all the data; it will be available on the last node.
If a node that was the master for a key goes down and is failed over, another live node becomes the master for it. Your client is pointed at all the servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you only have 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication), and check server availability yourself with HAProxy or something similar. That way you get a single IP to connect to (the proxy's), which will automatically serve data from the live server.
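As a rough illustration of the observe-style durability checks mentioned above: with the legacy Couchbase Python SDK 2.x this was expressed through persist_to/replicate_to arguments (the bucket name and document are placeholders; newer SDKs expose this as durability levels instead):

from couchbase.bucket import Bucket   # legacy 2.x Python SDK

bucket = Bucket("couchbase://localhost/default")   # placeholder connection string

# Return only once the write has been persisted to disk on the active node
# and replicated to at least one replica (observe-based durability check).
bucket.upsert(
    "user::1001",
    {"name": "Alice", "plan": "pro"},
    persist_to=1,
    replicate_to=1,
)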
Couchbase is in fact a good system for HA requirements.
Let me explain in a few sentences how it works; suppose you have a 5-node cluster. The application, using the client API/SDK, is always aware of the topology of the cluster (and of any change to the topology).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose which node the document should be written to. The client selects the node using a CRC32 hash of the key and writes to that node. Then, asynchronously, the cluster copies 1 or more replicas to the other nodes (depending on your configuration).
Couchbase has only one active copy of a document at a time, so it is easy to stay consistent: applications get and set against this active copy.
In case of failure the server has some work to do: once the failure is discovered (automatically or by a monitoring system), a failover occurs. This means that the replicas are promoted to active and it is now possible to work as before. Usually you then rebalance the cluster to distribute the data properly again.
The sentence you are quoting simply says that the fewer nodes you have, the bigger the impact of a failure/rebalance will be, since the same number of requests has to be routed to a smaller number of nodes. Happily, you do not lose data ;)
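To illustrate the key-hashing idea, here is a toy sketch; it is not Couchbase's actual vBucket algorithm (which maps a CRC32 of the key onto 1024 vBuckets and then onto nodes via the cluster map), just the general principle that client and server agree on a deterministic key-to-node function:

import zlib

nodes = ["node1", "node2", "node3", "node4", "node5"]   # the 5-node cluster from the example

def pick_node(key: str) -> str:
    # Client and server use the same deterministic hash, so they always
    # agree on which node holds the active copy of a given key.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

print(pick_node("user::1001"))   # the same key always maps to the same node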
You can find some very detailed information about this way of working on the Couchbase CTO's blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I work as a developer evangelist at Couchbase.