I am trying to solve a real-time analytics problem. I receive streaming data, process it with Kafka and Storm, and finally write it to Redis. Now I would like to pull all the data stored in Redis back into Storm for further computation, and this has to be repeated every minute: every minute, all the values in Redis have to be read and recomputed. I do not know if this is the right way to solve my problem, but I need some kind of cache. Do you have any recommendations?
Thank you in advance.
Regards
You can use Druid instead. It ingests values from Kafka and can use Storm to insert them; it is a column-oriented store designed specifically for real-time analytics. Redis is fast, but you can't cover all analytical requirements with it: even simple GROUP BY or ORDER BY queries require you to write your own implementation logic, whereas Druid is built to serve exactly this purpose.
http://druid.io/
Hope this helps.
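If you do stay on the Kafka/Storm/Redis path, one common way to handle the once-a-minute recomputation is Storm's tick tuples rather than an external scheduler. The sketch below only illustrates that idea; it assumes the Redis values are plain strings and uses the Jedis client, both of which are my assumptions rather than details from the question.

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.TupleUtils;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanResult;

// Sketch of a bolt that re-reads Redis once a minute using Storm's tick tuples.
public class RedisReplayBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient Jedis jedis;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.jedis = new Jedis("localhost", 6379);   // placeholder connection details
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);   // send this bolt a tick every minute
        return conf;
    }

    @Override
    public void execute(Tuple tuple) {
        if (TupleUtils.isTick(tuple)) {
            // Once a minute, walk all keys with SCAN and re-emit their values downstream.
            String cursor = "0";
            do {
                ScanResult<String> page = jedis.scan(cursor);
                for (String key : page.getResult()) {
                    collector.emit(new Values(key, jedis.get(key)));
                }
                cursor = page.getCursor();
            } while (!"0".equals(cursor));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("key", "value"));
    }
}

SCAN is used instead of KEYS so the once-a-minute sweep does not block Redis.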
What should I use: cache.put(key, value) or cache.query("INSERT INTO Table ...")?
If you have properly configured queryable fields for your cache, you can use either way to insert data into it:
The Key-Value API (cache.put).
SqlFieldsQuery (cache.query with an INSERT statement).
Also, if you would like to load a large amount of data, you can use the Data Streamer, which automatically buffers the data and groups it into batches for better performance. A short sketch of all three follows.
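For illustration, here is a minimal sketch of the three insert paths against a hypothetical Person cache; the cache name, value class, and field are made up for this example.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

// Hypothetical Person cache used only to illustrate the three insert paths above.
public class IgniteInsertSketch {
    public static class Person {
        @QuerySqlField
        private String name;
        public Person(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("Person");
            cfg.setIndexedTypes(Long.class, Person.class);   // makes the cache queryable via SQL
            IgniteCache<Long, Person> cache = ignite.getOrCreateCache(cfg);

            // 1. Key-Value API
            cache.put(1L, new Person("Alice"));

            // 2. SQL INSERT through SqlFieldsQuery
            cache.query(new SqlFieldsQuery(
                    "INSERT INTO Person(_key, name) VALUES (?, ?)").setArgs(2L, "Bob")).getAll();

            // 3. Data streamer for bulk loads: buffers entries and sends them in batches
            try (IgniteDataStreamer<Long, Person> streamer = ignite.dataStreamer("Person")) {
                for (long i = 3; i <= 10_000; i++) {
                    streamer.addData(i, new Person("person-" + i));
                }
            }
        }
    }
}

The SQL INSERT path only works because setIndexedTypes registers Person for SQL; if no queryable fields are configured, only the key-value and streamer paths are available.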
Either. Or both.
One of the strengths of Ignite is that it is truly multi-model: the same data is accessible via different interfaces. If you migrate a legacy app from an RDBMS, you'll use SQL. If you have something simple and don't care about the schema or queries, you'll use key-value.
In my experience, non-trivial systems based on Apache Ignite tend to use different kinds of access simultaneously. A perfectly normal example of an app:
Use key-value to insert the data from an upstream source
Use SQL to read and write data in batch processing and analytics
Use Compute with both SQL and key-value inside the tasks to do colocated processing and fast analytics
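As a small illustration of the first two points, and assuming the same hypothetical Person cache as in the sketch above, a value written through the Key-Value API is immediately visible to SQL:

import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class MultiModelSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Same hypothetical queryable Person cache as in the previous sketch.
            CacheConfiguration<Long, IgniteInsertSketch.Person> cfg =
                    new CacheConfiguration<Long, IgniteInsertSketch.Person>("Person")
                            .setIndexedTypes(Long.class, IgniteInsertSketch.Person.class);
            IgniteCache<Long, IgniteInsertSketch.Person> cache = ignite.getOrCreateCache(cfg);

            // Key-value write, e.g. from an upstream ingestion path.
            cache.put(42L, new IgniteInsertSketch.Person("Carol"));

            // SQL read over the very same data, e.g. for batch processing or analytics.
            List<List<?>> rows = cache.query(new SqlFieldsQuery(
                    "SELECT _key, name FROM Person WHERE name = ?").setArgs("Carol")).getAll();

            for (List<?> row : rows) {
                System.out.println(row.get(0) + " -> " + row.get(1));   // prints: 42 -> Carol
            }
        }
    }
}

Compute tasks can then be sent to the nodes that own particular keys (ignite.compute().affinityRun(...)) so the processing stays colocated with the data.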
We have an analytical ETL that stores model results in a Snowflake table (two columns: user-id, score).
We need to use that info in our low-latency service, and Snowflake is not suitable for that latency.
I thought about storing that table in a Redis collection.
I would like some ideas on how to keep Redis in sync with the table.
Any other solution for the latency problem is also welcome.
Well, it depends on how frequently your Snowflake data is updated, what process is updating it (Snowplow or some external tool you can hook into), and what latency you are prepared to accept between the Snowflake data changing and Redis having the values.
You could add a task to export the changes to S3 and then have a Lambda watching the bucket/folder push the changes into Redis.
You could have the tool that loads the changes also pull them out and push them into Redis (we did a form of this).
You could have something poll the Snowflake data and push changes into Redis (this seems like the worst idea). Polling the main table sounds bad, but you could also use a multi-table insert/merge so that when you update the main table you also insert into a changes table or stream, and read from that in your Redis sync. A rough sketch of that last option follows.
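To make the "pull the changes and push them into Redis" idea concrete, here is a rough sketch of a sync job that reads a hypothetical USER_SCORE_CHANGES table (for example, one fed by a multi-table insert or a Snowflake stream) over JDBC and writes the rows into a Redis hash. The table, connection details, and key names are all placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import redis.clients.jedis.Jedis;

public class SnowflakeToRedisSync {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "SYNC_USER");
        props.put("password", "<password>");     // placeholder credentials
        props.put("db", "ANALYTICS");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "SYNC_WH");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Jedis jedis = new Jedis("localhost", 6379);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT USER_ID, SCORE FROM USER_SCORE_CHANGES")) {

            // Write each changed row into a Redis hash keyed by user id.
            while (rs.next()) {
                jedis.hset("user:scores", rs.getString("USER_ID"), rs.getString("SCORE"));
            }
        }
        // In practice you would run this on a schedule (or from the loader itself)
        // and mark or delete the consumed change rows so they are not re-applied.
    }
}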
I've done some projects with Redis and MongoDB, but I'm not comfortable with them at all. I'm currently using MongoDB for storing player data and Redis for temporary and sorted data. I'd like to use Redis more in my projects.
My questions
Should I use Redis more for persistent data? I'd also like to know about one case in particular: if I build a feature that bans players from the game server, is Redis a good option for it?
What are the best use cases for Redis?
As I mentioned above, I use MongoDB for storing player data and an in-memory map for caching their information while they're online. From what I know, Redis is one of the best NoSQL databases for caching. Should I use Redis for caching player data?
If you have any other ideas on the topic, I'd like to hear them in detail.
Should I use Redis more for persistent data?
Redis is much more than a cache: it acts as the main database in many enterprises, and it also supports several persistence methods, such as RDB snapshots and the AOF log.
If I build a feature that bans players from the game server, is Redis a good option for it?
Redis supports a nice set of plugins (modules); one of them is RedisBloom, which is especially well suited for fast membership checks.
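For the ban-list case specifically, a plain Redis set is usually enough. Below is a minimal sketch with the Jedis client (the key name is made up); RedisBloom's BF.ADD/BF.EXISTS would be an alternative if the set grows very large and a small false-positive rate is acceptable.

import redis.clients.jedis.Jedis;

// Ban list backed by a plain Redis set. With persistence enabled in redis.conf
// (RDB snapshots and/or appendonly yes), the set survives restarts, so Redis
// can hold this kind of data as a primary store.
public class BanList {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void ban(String playerUuid) {
        jedis.sadd("banned:players", playerUuid);
    }

    public void unban(String playerUuid) {
        jedis.srem("banned:players", playerUuid);
    }

    public boolean isBanned(String playerUuid) {
        return jedis.sismember("banned:players", playerUuid);
    }
}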
I was recently asked this system design question in an interview:
Let's suppose an application allows users to share articles from 3rd-party sites with their connections. Assume all share actions go through a common code path on the app site (served by multiple servers in geographically diverse colos). Design a system to aggregate, in near-real time, the N most shared articles over the last five minutes, last hour, and last day. Assume the number of unique shared articles per day is between 1M and 10M.
So I came up with the following components:
Existing service tier that handles share events
Aggregation service
Data Store
Some Transport mechanism to send notifications of share events to aggregation service
Then I started talking about how data from the existing service tier that handles share events would get to the aggregation servers. A possible solution was to use a messaging queue like Kafka here.
The interviewer asked why I chose Kafka and how it would work: what topics I would create and how many partitions they would have. Since I was confused, I couldn't answer properly. Basically, he was trying to probe my understanding of point-to-point vs. publish-subscribe and push vs. pull models.
Then I started talking about how the aggregation service operates. One solution I gave was to keep a collection of counters for each shared URL, in 5-minute buckets covering the last 24 hours (288 buckets per URL). As each share event happens, increment the current bucket and recompute the 5-minute, hour, and day totals, updating the Top-N lists as necessary. As each newly shared URL comes in, evict any URLs that haven't been updated in 24 hours. I think all of this can be done on a single machine.
The interviewer asked: can this all be done on one machine? Can maintaining 1M-10M tracked shares be done on one machine? If not, how would you partition? What happens if it crashes, and how will you recover? Basically, I was confused about how the aggregation service would actually work here: how it gets data from Kafka and what it actually does with that data.
For the data store part, I don't think we need a persistent data store here, so I suggested we could use Redis with partitioning and redundancy.
The interviewer asked how I would partition and add redundancy here, how the Redis instances would be updated from the overall flow, and how the data in Redis would be structured. I was confused about this as well. I told him that we could write the output from the aggregation service to these Redis instances.
There were a few things I was not able to answer because I am confused about how the entire flow would work. Can someone help me understand how to design a system like this in a distributed fashion, and what I should have answered to the interviewer's questions?
The intention of these questions is not to arrive at the ultimate answer to the problem, but to check the competence and thought process of the interviewee. There is no point in panicking when facing tough follow-up questions; they are meant to guide you or give the interviewee a hint.
I will try to share one plausible answer for this problem. Assume I have a distributed persistent store like Cassandra, and I maintain the sharing status at any moment in that Cassandra infrastructure. In front of the persistence layer I maintain a Redis cluster for LRU caching, with buckets for the last 5 minutes, 1 hour, and 1 day; eviction is configured using key expiry. My aggregator service then only needs to touch the minimal data present in the Redis LRU cache. A high-throughput distributed Kafka cluster pumps data from the share handler; Kafka feeds the data to the Redis cluster and from there to Cassandra. To keep the output near real time, the Kafka cluster's throughput has to keep up with the incoming share rate.
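To make the bucketed-counter idea concrete, here is a minimal sketch of the Redis side using one sorted set per 5-minute window (the key names, window sizes, and use of the Jedis client are my assumptions, not part of the original design): every share event increments the current bucket, and the last N buckets are unioned to get the top articles for the 5-minute, 1-hour, or 1-day window.

import redis.clients.jedis.Jedis;

public class ShareAggregator {
    private static final long BUCKET_SECONDS = 300;      // 5-minute buckets
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Called for every share event coming off Kafka.
    public void recordShare(String articleUrl) {
        long bucket = (System.currentTimeMillis() / 1000) / BUCKET_SECONDS;
        String key = "shares:" + bucket;                  // one sorted set per 5-minute window
        jedis.zincrby(key, 1, articleUrl);                // count shares per article in this window
        jedis.expire(key, 25 * 60 * 60);                  // let buckets age out after ~24 hours
    }

    // Merge the last `windowBuckets` buckets and return the top-N article URLs.
    public Iterable<String> topN(int windowBuckets, int topN) {
        long current = (System.currentTimeMillis() / 1000) / BUCKET_SECONDS;
        String[] keys = new String[windowBuckets];
        for (int i = 0; i < windowBuckets; i++) {
            keys[i] = "shares:" + (current - i);
        }
        String dest = "topn:last" + windowBuckets;        // temporary union of the window's buckets
        jedis.zunionstore(dest, keys);
        jedis.expire(dest, 60);                           // recomputed every minute or so
        return jedis.zrevrange(dest, 0, topN - 1);        // highest scores first
    }
}

If the Kafka topic is keyed by article URL, each aggregator instance owns a disjoint subset of URLs, which is one reasonable answer to the partitioning question; a crashed instance can be replaced and its counters rebuilt by replaying its partitions from a recent offset.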
I am new to NiFi, and advice is welcome.
We get data sent in from external sources as many small records. I am thinking of pulling those records into NiFi via RabbitMQ. I'd like to "spool" or "batch" those records into larger groups (perhaps based on some index in the records), and when a group reaches a certain size threshold, write it out to S3.
How to best accomplish this in NiFi? Any other suggestions?
Thanks, Gary
RabbitMQ is based on AMQP, and NiFi provides a processor for AMQP called ConsumeAMQP. The processor's additional-details documentation has notes specific to RabbitMQ. Configure the processor according to that documentation and you are good to go.
For the second part, you need to use the PutS3Object processor, where you will be able to define the size thresholds.
This should be achievable... I don't know that much about RabbitMQ, but assuming it supports a JMS interface, you could probably use NiFi's ConsumeJMS processor, followed by MergeContent to merge records until your threshold is reached, and then PutS3Object to write to S3.
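Putting both answers together, a rough outline of the flow could look like the following. The processors are real NiFi processors, but the property values are placeholders, and the MergeContent settings shown are only one way to express the size threshold; double-check the exact property names against your NiFi version's documentation.

ConsumeAMQP (Host Name, Port, Queue, User Name, Password for your RabbitMQ broker)
  -> MergeContent (Merge Strategy: Bin-Packing Algorithm;
                   Correlation Attribute Name: the record index attribute, if you group by index;
                   Minimum Group Size / Maximum Group Size: your size threshold;
                   Max Bin Age: an upper bound so small groups still get flushed)
  -> PutS3Object (Bucket, Object Key, Region, and credentials)

The Correlation Attribute Name setting is what lets MergeContent keep records with the same index together, which matches the "grouping based on some index in the records" requirement from the question.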