Open-source stat server? - sql

I've been looking for an open-source stat server that supports the following requirements:
A local proxy that aggregates hundreds of stats per second and sends them out to a central cluster (or single server) every 10 seconds or so. The application will make blocking network calls to the proxy to record stats within the code, rather than writing to disk and having another process come and read the logs.
The central server responds to queries that ask for aggregates in REALTIME (sub-second response): stats per 5-minute interval, hour, day, month, year. Optional: support for rolling time windows (e.g. 1 hour back from now).
Tagging per stat metric. Each stat name will have different attributes, such as the hostname the stat is coming from.
Monotonically increasing stats (stats that increase forever, i.e. a total count).
I understand it is fairly straightforward to write your own (a table per day, aggregate into lower-granularity tables based on policy, then drop them per TTL; this can be done on NoSQL, e.g. hash sets in Redis keyed on time bucket), but I am surprised there isn't one readily available, given that this is a standard use case. OpenTSDB is a close candidate (though it doesn't provide the local proxy), but it doesn't support monotonically increasing stats.
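For concreteness, the "roll your own" scheme above can be sketched in a few lines. This is a minimal illustration, not an existing project; the key names and retention numbers are my own assumptions, and `r` is assumed to be a redis-py-style client:

```python
import time

BUCKET_SECONDS = 300          # 5-minute granularity
RETENTION_SECONDS = 86_400    # drop raw buckets after a day (the TTL policy)

def bucket_key(ts, size=BUCKET_SECONDS):
    """Key of the Redis hash holding counters for the bucket containing ts."""
    return f"stats:{int(ts // size) * size}"

def record(r, metric, value=1, ts=None):
    """HINCRBY into the current bucket's hash; EXPIRE enforces the TTL."""
    key = bucket_key(ts if ts is not None else time.time())
    pipe = r.pipeline()
    pipe.hincrby(key, metric, value)  # counters only ever grow within a bucket
    pipe.expire(key, RETENTION_SECONDS)
    pipe.execute()
```

A "last hour" query then sums HGET over the 12 buckets covering the window; the coarser hour/day rollups would be built the same way with a larger `size`.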
Any suggestions or pointers?

Have a look at statsd; it's a really cool project that does more or less what you want. Your app fires UDP packets at a central node (you state a sample percentage you want to actually send, to avoid overloading; we use about 10%), and the central server aggregates the labeled data. It then uses Graphite to generate the actual reports.
https://github.com/etsy/statsd
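To make the UDP/sampling idea concrete, here is a minimal sketch of a statsd-style counter client. The wire format ("name:value|c|@rate") follows statsd's documented counter convention; the address and metric name in the usage comment are placeholders:

```python
import random
import socket

def format_counter(name, value=1, rate=1.0):
    """Render a statsd counter line, e.g. 'share.events:1|c|@0.1'."""
    line = f"{name}:{value}|c"
    if rate < 1.0:
        line += f"|@{rate}"   # tell the server what fraction was sampled
    return line

def incr(sock, addr, name, value=1, rate=1.0):
    """Fire-and-forget: send the counter only if this event survives sampling."""
    if rate < 1.0 and random.random() > rate:
        return
    sock.sendto(format_counter(name, value, rate).encode("ascii"), addr)

# usage (assumed address; 8125 is statsd's conventional port):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# incr(sock, ("127.0.0.1", 8125), "share.events", rate=0.1)
```

Because the send is UDP, the app never blocks on the stat server being slow or down, which addresses the "blocking network calls" concern in the question.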

Related

Upper limit on number of redis streams consumer groups?

We are looking at using Redis streams as a cluster-wide messaging bus, where each node in the cluster has a unique id. The idea is that each node, when spawned, creates a consumer group with that unique id on a central Redis stream, to guarantee that each node in the cluster gets a copy of every message. In an orchestrated environment, cluster nodes will be spawned and removed on the fly, each having a unique id. Over time I can see this resulting in hundreds or even thousands of old/unused consumer groups all subscribed to the same Redis stream.
My question is this - is there an upper limit to the number of consumer groups that redis can handle and does a large number of (unused) consumer groups have any real processing cost? It seems that a consumer group is just a pointer stored in redis that points to the last read entry in the stream, and is only accessed when a consumer of the group does a ranged XREADGROUP. That would lead me to assume (without diving into Redis code) that the number of consumer groups really does not matter, save for the small amount of RAM that the consumer groups pointers would eat up.
Now, I understand we should be smarter and a node should delete its own consumer groups when it is being killed or we should be cleaning this up on a scheduled basis, but if a consumer group is just a record in redis, I am not sure it is worth the effort - at least at the MVP stage of development.
TL;DR:
Is my understanding correct, that there is no practical limit on the number of consumer groups for a given stream and that they have no processing cost unless used?
Your understanding is correct, there's no practical limit to the number of CGs and these do not impact the operational performance.
That said, other than the wasted RAM (which could become significant, depending on the number of consumers in the group and PEL entries), this will add time complexity to invocations of XINFO STREAM ... FULL and XINFO GROUPS as these list the CGs. Once you have a non-trivial number of CGs, every call to these would become slow (and block the server while it is executing).
Therefore, I'd still recommend implementing some type of "garbage collection" for the "stale" CGs, perhaps as soon as the MVP is done. Like any computing resource (e.g. disk space, network, mutexes...) and given there are no free lunches, CGs need to be managed as well.
P.S. IIUC, you're planning to use a single consumer in each group, and have each CG/consumer correspond to a node in your app's cluster. If that is the case, I'm not sure that you need CGs and you can use the simpler XREAD (instead of XREADGROUP) while keeping the last ID locally in the node.
OTOH, assuming I'm missing something and that there's a real need for this use pattern, I'd imagine Redis being able to support it better by offering some form of expiry for idle groups.
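A minimal sketch of that garbage collection, assuming a redis-py client and an illustrative policy (a group is stale when its name no longer matches a live node id and it has no pending entries):

```python
def stale_groups(groups, live_ids):
    """Pick groups that are no longer owned by a live node and have no PEL entries.

    `groups` is the parsed output of XINFO GROUPS: a list of dicts with at
    least 'name' and 'pending' keys. `live_ids` is the set of node ids
    currently in the cluster.
    """
    return [g["name"] for g in groups
            if g["name"] not in live_ids and g["pending"] == 0]

def collect(r, stream, live_ids):
    """XINFO GROUPS to list CGs, then XGROUP DESTROY each stale one."""
    groups = r.xinfo_groups(stream)
    for name in stale_groups(groups, live_ids):
        r.xgroup_destroy(stream, name)
```

Run on a schedule (or on node shutdown), this keeps both the RAM overhead and the XINFO cost bounded.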

Baselining internal network traffic (corporate)

We are collecting network traffic from switches using Zeek in the form of ‘connection logs’. The connection logs are then stored in Elasticsearch indices via filebeat. Each connection log is a tuple with the following fields: (source_ip, destination_ip, port, protocol, network_bytes, duration) There are more fields, but let’s just consider the above fields for simplicity for now. We get 200 million such logs every hour for internal traffic. (Zeek allows us to identify internal traffic through a field.) We have about 200,000 active IP addresses.
What we want to do is digest all these logs and create a graph where each node is an IP address, and an edge (directed, source → destination) represents traffic between two IP addresses. There will be one unique edge for each distinct (port, protocol) tuple. The edge will have properties: average duration, average bytes transferred, and a histogram of log counts by hour of day.
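For concreteness, the per-edge rollup amounts to something like this single-pass aggregation (field names follow the tuple above; `hour` is assumed to be pre-extracted from each log's timestamp):

```python
from collections import defaultdict

def new_edge():
    # one accumulator per (source_ip, destination_ip, port, protocol) edge
    return {"logs": 0, "bytes": 0, "duration": 0.0, "by_hour": [0] * 24}

def aggregate(logs):
    """logs: iterable of dicts with the connection-log fields plus 'hour'."""
    edges = defaultdict(new_edge)
    for log in logs:
        key = (log["source_ip"], log["destination_ip"],
               log["port"], log["protocol"])
        e = edges[key]
        e["logs"] += 1
        e["bytes"] += log["network_bytes"]
        e["duration"] += log["duration"]
        e["by_hour"][log["hour"]] += 1
    # averages are derived once at the end, so the hot loop is only additions
    for e in edges.values():
        e["avg_bytes"] = e["bytes"] / e["logs"]
        e["avg_duration"] = e["duration"] / e["logs"]
    return dict(edges)
```

The accumulators are all additive, so the work shards cleanly by edge key across workers or time windows, which is the property a streaming pipeline would exploit.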
I have tried using Elasticsearch’s aggregation and also the newer Transform technique. While both work in theory, and I have tested them successfully on a very small subset of IP addresses, the processes simply cannot keep up for our entire internal traffic. E.g. digesting 1 hour of logs (about 200M logs) using Transform takes about 3 hours.
My question is:
Is post-processing Elasticsearch data the right approach to making this graph? Or is there some product that we can use upstream to do this job? Someone suggested looking into ntopng, but I did not find this specific use case in their product description. (Not sure if it is relevant, but we use ntop's PF_RING product as a front end for Zeek.) Are there other products that do the job out of the box? Thanks.
What problems or root causes are you attempting to elicit with a graph of Zeek east-west traffic?
Seems that a more-tailored use case, such as a specific type of authentication, or even a larger problem set such as endpoint access expansion might be a better use of storage, compute, memory, and your other valuable time and resources, no?
Even if you did want to correlate or group on Zeek data, try to normalize it to OSSEM, and there would be no reason to, say, collect tuple when you can collect community-id instead. You could correlate Zeek in the large to Suricata in the small. Perhaps a better data architecture would be VAST.
Kibana, in its latest iterations, does have Graph, and even older versions can leverage the third-party kbn_network plugin. I could see you hitting a wall with 200k active IP addresses and Elasticsearch aggregations, or even with summary indexes.
Many orgs will build data architectures beyond the simple Serving layer provided by Elasticsearch. What I have heard of would be a Kappa architecture streaming into the graph database directly, such as dgraph, and perhaps just those edges of the graph available from a Serving layer.
There are other ways of asking questions from IP address data, such as the ML options in AWS SageMaker IP Insights or the Apache Spot project.
Additionally, I'm a huge fan of getting the right data only as the situation arises, although in an automated way, so that the puzzle pieces bubble up for me and I can simply lock them into place. If I were working with Zeek data especially, I could leverage a platform such as SecurityOnion and its orchestrated Playbook engine to kick off other tasks for me, such as querying out with one of the Velocidex tools, or even cross-correlating using the built-in Sigma sources.

System design to aggregate in near-real time, the N most shared articles over the last five minutes, last hour and last day?

I was recently asked this system design question in an interview:
Let's suppose an application allows users to share articles from 3rd party sites with their connections. Assume all share actions go through a common code path on the app site (served by multiple servers in geographically diverse colos). Design a system to aggregate, in near-real time, the N most shared articles over the last five minutes, last hour and last day. Assume the number of unique shared articles per day is between 1M and 10M.
So I came up with the following components:
Existing service tier that handles share events
Aggregation service
Data Store
Some Transport mechanism to send notifications of share events to aggregation service
Now I started talking about how data from the existing service tier that handles share events would get to the aggregation servers. A possible solution was to use a messaging queue like Kafka here.
The interviewer asked me why I chose Kafka here and how Kafka would work: what topics I would create and how many partitions they would have. Since I was confused, I couldn't answer properly. Basically he was trying to gauge my understanding of point-to-point vs. publish-subscribe, or push vs. pull models.
Now I started talking about how the Aggregation service operates. One solution I gave was to keep a collection of counters for each shared URL by 5-minute bucket for the last 24 hours (288 buckets per URL). As each share event happens, increment the current bucket and recompute the 5-minute, hour, and day totals. Update the Top-N lists as necessary. As each newly shared URL comes in, push out any URLs that haven't been updated in 24 hours. Now I think all this can be done on a single machine.
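The bucketed counter just described can be sketched as a ring of 5-minute slots covering 24 hours (24 × 12 = 288). This is an illustrative single-machine sketch; the class and method names are my own:

```python
BUCKET = 300                          # 5-minute buckets
BUCKETS_PER_DAY = 86_400 // BUCKET    # 288 slots covering 24 hours

def bucket_index(ts):
    return (ts // BUCKET) % BUCKETS_PER_DAY

class UrlCounter:
    """Per-URL share counts over a rolling 24-hour window."""

    def __init__(self):
        self.ring = [0] * BUCKETS_PER_DAY
        self.stamps = [None] * BUCKETS_PER_DAY  # bucket start held by each slot

    def incr(self, ts, n=1):
        i = bucket_index(ts)
        start = (ts // BUCKET) * BUCKET
        if self.stamps[i] != start:   # slot still holds yesterday's bucket: reset
            self.ring[i], self.stamps[i] = 0, start
        self.ring[i] += n

    def window_total(self, now, seconds):
        """Sum buckets whose start falls inside [now - seconds, now]."""
        lo = now - seconds
        return sum(c for c, s in zip(self.ring, self.stamps)
                   if s is not None and lo <= s <= now)
```

Sharding this across machines would mean partitioning URLs (e.g. by hash) so each partition maintains its own counters and Top-N, with a final merge of the partial Top-N lists.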
The interviewer asked me: can this all be done on one machine? Can maintenance of 1M-10M tracked shares be done on one machine? If not, how would you partition? What happens if it crashes, and how will you recover? Basically I was confused about how the Aggregation service would actually work here: how it gets data from Kafka and what it is actually going to do with that data.
Now, for the data store part, I don't think we need a persistent data store here, so I suggested we could use Redis with partitioning and redundancy.
The interviewer asked me: how will you partition and have redundancy here? How will the Redis instances get updated by the overall flow, and how will Redis be structured? I was confused about this as well. I told him that we could write the output from the Aggregation service to these Redis instances.
There were a few things I was not able to answer, since I am confused about how the entire flow would work. Can someone help me understand how we can design a system like this in a distributed fashion? And what should I have answered to the questions the interviewer asked me?
The intention of these questions is not to get the ultimate answer to the problem, but to check the competence and thought process of the interviewee. There is no point panicking while answering this kind of question when facing tough follow-ups; the intention of the follow-up questions is to guide you or give you a hint.
I will try to share one probable answer to this problem. Assume I have a distributed persistent store like Cassandra, and that I maintain the sharing state at any moment in that Cassandra infrastructure. I would maintain a Redis cluster ahead of the persistence layer for LRU caching, with buckets for 5 minutes, 1 hour and a day, and eviction configured via expiry. Now my aggregator service only needs to address the minimal data present within the Redis LRU cache. Set up a high-throughput distributed Kafka cluster to pump data from the share handler; Kafka feeds the data into the Redis cluster and from there to Cassandra. To keep the output near real time, the Kafka cluster's throughput has to be kept matched to the incoming share rate.

Graphite: Carbon Aggregator dropping data?

I'm working on an alerting solution that uses Logstash to stream AWS CloudFront logs from an S3 bucket into Graphite after doing some minor processing.
Since multiple events with the same timestamp can occur (multiple events within a second), I elected to use Carbon Aggregator to count these events per second.
The problem I'm facing is that the aggregated whisper database seems to be dropping data. The normal whisper file sees all of it, but of course it cannot account for more than 1 event per second.
I'm running this setup in docker on an EC2 instance, which isn't hitting any sort of limit (CPU, Mem, Network, Disk).
I've checked every log I could find in the docker instances and checked docker logs, however nothing jumps out.
I've set the logstash output to display the lines on stdout (not missing any) and to send them to graphite on port 2023, which is set to be the line-by-line receiver for Carbon Aggregator:
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
aggregation-rules.conf is set to a very simple count per second:
test.<user>.total1s (1) = count test.<user>.total
storage-schemas.conf:
[default]
pattern = .*
retentions = 1s:24h
Happy to share more of my configuration as you request it.
I've hit a brick wall with this; I've been trying so many different things, but I'm not able to see all the data in the aggregated whisper DB.
Any help is very much appreciated.
Carbon aggregator isn't designed to do what you are trying to do. For that use-case you'd want to use statsd to count the events per second.
https://github.com/etsy/statsd/blob/master/docs/metric_types.md#counting
Carbon aggregator is meant to aggregate across different series; for each point it sees on the input, it quantizes the timestamp before any aggregation happens, so you are still only going to get a single value per second with the aggregator. statsd will take any number of counter increments and total them up each interval.
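The difference can be illustrated with two toy flush functions: one totals every event in the interval (statsd-style), the other quantizes each point into its timestamp slot so same-second events collapse (carbon-style). Both are illustrative models, not real statsd/carbon code:

```python
from collections import defaultdict

def statsd_count(events, interval=1):
    """events: (timestamp, metric) pairs. Totals *every* event in each
    interval, like a statsd counter flush."""
    totals = defaultdict(int)
    for ts, metric in events:
        totals[(int(ts // interval) * interval, metric)] += 1
    return dict(totals)

def quantize_last_write(events, interval=1):
    """Carbon-style: each point overwrites the slot for its quantized
    timestamp, so N same-second events collapse to one stored value."""
    slots = {}
    for ts, metric in events:
        slots[(int(ts // interval) * interval, metric)] = 1
    return slots
```

Two events landing in the same second survive as a count of 2 in the first model but collapse to a single value in the second, which is exactly the "dropped data" symptom described in the question.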

Is it possible to get sub-1-second latency with transactional replication?

Our database architecture consists of two SQL Server 2005 servers, each with an instance of the same database structure: one for all reads, and one for all writes. We use transactional replication to keep the read database up to date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be in the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds in even the simplest of cases, which is unsatisfactory. By a simplest case, I mean updating a single value in a single row in a single table on the write db and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
Edit
Thanks to all for the advice and insight - I believe that the latency periods we are experiencing are normal; we were misled by our db hosting company as to what latency times to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is possible for a Subscriber to be a few seconds behind the Publisher. With a latency of only a few seconds, the Subscriber can easily be used as a reporting server, offloading expensive user queries and reporting from the Publisher to the Subscriber.
In the following scenario (using the Customer table shown later in this section) the Subscriber was only four seconds behind the Publisher. Even more impressive, 60 percent of the time it had a latency of two seconds or less. The time is measured from when the record was inserted or updated at the Publisher until it was actually written to the subscribing database.
I would say it's definitely possible.
I would look at:
Your network
Run ping commands between the two servers and see if there are any issues
If the servers are next to each other you should have < 1 ms.
Bottlenecks on the server
This could be network traffic (volume)
e.g. network cards not being configured for 1 Gb/sec
Anti-virus or other things
Do some analysis on some queries and see if you can identify indexes or locking which might be a problem
See if any of the selects on the read database might be blocking the writes.
Add WITH (NOLOCK), and see if this makes a difference on one or two queries you're analyzing.
Essentially you have a complicated system which you have a problem with, you need to determine which component is the problem and fix it.
Transactional replication is probably best if the reports/selects you need to run must be up to date. If they don't, you could look at log shipping, although that would add some downtime with each import.
For data/log files, make sure they're on separate drives so that performance is maximized.
Something to remember about transaction replication is that a single update now requires several operations to happen for that change to occur.
First you update the source table.
Next the log reader sees the change and writes it to the distribution database.
Next the distribution agent sees the new entry in the distribution database and reads that change, then runs the correct stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers you'll probably see that they are running in just a few milliseconds. However it is the lag time while waiting for the log reader and distribution agent to see that they need to do something which is going to kill you.
If you truly need sub-second processing time, then you will want to look into writing your own processing engine to handle data moving from one server to another. I would recommend using SQL Service Broker for this, as that way everything stays native to SQL Server and no third-party code has to be written.
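As an aside, the "simplest case" measurement from the question can be automated with a heartbeat probe: insert a marker row on the publisher and poll the subscriber until it appears. This is a hedged sketch using generic DB-API connections (e.g. pyodbc against each server); the `dbo.ReplProbe` table is an assumed helper table included in the publication, not part of replication itself:

```python
import time
import uuid

def measure_latency(pub, sub, timeout=30.0, poll=0.05):
    """pub/sub are DB-API connections. Returns seconds until the marker
    row replicates to the subscriber, or None on timeout."""
    marker = str(uuid.uuid4())
    cur = pub.cursor()
    cur.execute("INSERT INTO dbo.ReplProbe (Marker) VALUES (?)", (marker,))
    pub.commit()
    started = time.monotonic()
    deadline = started + timeout
    scur = sub.cursor()
    while time.monotonic() < deadline:
        # poll until the distribution agent has applied the insert
        scur.execute("SELECT 1 FROM dbo.ReplProbe WHERE Marker = ?", (marker,))
        if scur.fetchone():
            return time.monotonic() - started
        time.sleep(poll)
    return None
```

Run periodically, this gives a latency distribution (e.g. the "60 percent under two seconds" figure quoted above) rather than a single anecdotal measurement.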