Baselining internal network traffic (corporate) - log-analysis

We are collecting network traffic from switches using Zeek in the form of ‘connection logs’. The connection logs are then stored in Elasticsearch indices via Filebeat. Each connection log is a tuple with the following fields: (source_ip, destination_ip, port, protocol, network_bytes, duration). There are more fields, but let’s just consider the ones above for simplicity for now. We get 200 million such logs every hour for internal traffic. (Zeek allows us to identify internal traffic through a field.) We have about 200,000 active IP addresses.
What we want to do is digest all these logs and create a graph where each node is an IP address and a directed edge (source → destination) represents traffic between two IP addresses. There will be one unique edge for each distinct (port, protocol) tuple. The edge will have properties: average duration, average bytes transferred, and a histogram of log counts by hour of day.
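For concreteness, here is a minimal sketch (plain Python, independent of Elasticsearch) of the aggregation being described; the `hour` field and the sum/count bookkeeping are my assumptions about how the averages would be maintained:

    from collections import defaultdict

    # One accumulator per directed edge, keyed by (src, dst, port, proto).
    # Sums and counts are kept separately so partial results can be merged.
    edges = defaultdict(lambda: {
        "bytes_sum": 0,
        "duration_sum": 0.0,
        "count": 0,
        "hourly": [0] * 24,  # log-count histogram by hour of day
    })

    def ingest(log):
        """log: dict with source_ip, destination_ip, port, protocol,
        network_bytes, duration, and hour (0-23, from the timestamp)."""
        key = (log["source_ip"], log["destination_ip"],
               log["port"], log["protocol"])
        e = edges[key]
        e["bytes_sum"] += log["network_bytes"]
        e["duration_sum"] += log["duration"]
        e["count"] += 1
        e["hourly"][log["hour"]] += 1

    # Edge properties on read-out: avg_bytes = bytes_sum / count,
    # avg_duration = duration_sum / count.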
I have tried using Elasticsearch’s aggregation and also the newer Transform technique. While both work in theory, and I have tested them successfully on a very small subset of IP addresses, the processes simply cannot keep up with our entire internal traffic. E.g. digesting 1 hour of logs (about 200M logs) using Transform takes about 3 hours.
My question is:
Is post-processing Elasticsearch data the right approach to building this graph? Or is there some product that we can use upstream to do this job? Someone suggested looking into ntopng, but I did not find this specific use case in their product description. (Not sure if it is relevant, but we use ntop’s PF_RING product as a front end for Zeek.) Are there other products that do the job out of the box? Thanks.

What problems or root causes are you attempting to elicit with a graph of Zeek east-west traffic?
It seems that a more tailored use case, such as a specific type of authentication, or even a larger problem set such as endpoint access expansion, might be a better use of storage, compute, memory, and your other valuable time and resources, no?
Even if you did want to correlate or group on Zeek data, try to normalize it to OSSEM, and there would be no reason to, say, collect the whole tuple when you can collect community-id instead. You could correlate Zeek in the large to Suricata in the small. Perhaps a better data architecture would be VAST.
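If you do go the community-id route, Corelight's pycommunityid package computes the same flow hash Zeek can emit; a small sketch (the addresses and ports are invented):

    import communityid

    cid = communityid.CommunityID()

    # A TCP flow tuple; Zeek's conn.log fields map onto these arguments.
    tpl = communityid.FlowTuple.make_tcp("192.168.1.10", "10.0.0.5", 49152, 443)

    print(cid.calc(tpl))  # prints an ID of the form "1:<base64>"

Because the endpoints are ordered before hashing, both directions of a flow collapse onto one identifier, which is what makes it a better grouping key than the raw tuple.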
Kibana, in its latest iterations, does have Graph, and even older versions can leverage the third-party kbn_network plugin. I could see you hitting a wall with 200k active IP addresses and Elasticsearch aggregations or even summary indexes.
Many orgs will build data architectures beyond the simple Serving layer provided by Elasticsearch. What I have heard of is a Kappa architecture streaming into a graph database directly, such as Dgraph, with perhaps just those edges of the graph made available from a Serving layer.
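As a very rough illustration of the streaming side, here is what writing one aggregated edge to Dgraph could look like with the pydgraph client (the endpoint and the node/edge schema are pure assumptions):

    import pydgraph

    # Assumed local Dgraph alpha endpoint; adjust for your cluster.
    stub = pydgraph.DgraphClientStub("localhost:9080")
    client = pydgraph.DgraphClient(stub)

    def upsert_edge(src_ip, dst_ip, port, proto, avg_bytes):
        """Write a source node with one outgoing connection edge
        (hypothetical predicate names)."""
        txn = client.txn()
        try:
            txn.mutate(set_obj={
                "ip": src_ip,
                "connected_to": [{
                    "ip": dst_ip,
                    "port": port,
                    "protocol": proto,
                    "avg_bytes": avg_bytes,
                }],
            })
            txn.commit()
        finally:
            txn.discard()

In a real Kappa setup this function would be the sink of the stream processor, called per already-aggregated edge update rather than per raw log.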
There are other ways of asking questions from IP address data, such as the ML options in AWS SageMaker IP Insights or the Apache Spot project.
Additionally, I'm a huge fan of getting the right data only as the situation arises, although in an automated way so that the puzzle pieces bubble up for me and I can simply lock them into place. If I were working with Zeek data especially, I could leverage a platform such as Security Onion and its orchestrated Playbook engine to kick off other tasks for me, such as querying out with one of the Velocidex tools, or even cross-correlating using the built-in Sigma sources.

Related

Controlling and monitoring use of BI Engine Reservations

With the new beta BI Engine Reservations, I've noticed some queries speed up, but others remain unaffected. Will it be possible
- to monitor how the reservation is being used?
- to have some control over how the reservation is used?
When it comes to control, I've seen no indication that you'll have any—the system decides what the most efficient mechanism is (BI Engine, query cache, etc.) and then allocates accordingly. Also, the size of your reservation, usage, and age are factored into what is added and subsequently removed from the BI Engine reservation.
While that may seem frustrating, it's also the selling point: zero-config, automatic acceleration of your dashboards. As Google iterates quickly on these products, I would expect some controls to find their way in eventually.
As a workaround, you could use a separate project for data you want to ensure has access to the full reservation (since BI Engine is project-level).
As was mentioned elsewhere, there are a handful of metrics that can be viewed using Stackdriver logging (if you enable it). These are all high-level metrics, and are listed in the documentation:
Reservation Total Bytes
Reservation Used Bytes
Inflight Requests
Request Count
Request Execution Times
These won't likely give you a lot of the information you're looking for, but can be monitored for patterns.
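If you do enable it, pulling those series programmatically is straightforward with the Cloud Monitoring client; a sketch (the project ID is a placeholder, and the exact metric type string is an assumption, so check the BI Engine docs for the real identifier):

    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval({
        "start_time": {"seconds": now - 3600},  # last hour
        "end_time": {"seconds": now},
    })

    results = client.list_time_series(request={
        "name": "projects/my-project",  # placeholder project
        # Assumed identifier for "Reservation Used Bytes":
        "filter": 'metric.type = "bigquery.googleapis.com/reservation/used_bytes"',
        "interval": interval,
    })
    for series in results:
        for point in series.points:
            print(point.interval.end_time, point.value)

Graphing used bytes against total bytes over a day or two should at least reveal whether the reservation is saturated.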
You can also use Elasticsearch and Logstash for monitoring and implementing a security environment. The way it works is simple and near real time.

Need HA, multi-datacenter, memory based, key-value solution

New to this space, and I have been attempting to narrow down the candidates but seem to be spinning my wheels.
I am storing String -> String lookups. The key and value are likely under 15 bytes each: a userid -> the server it is connected to.
We have multiple data centers
Need ability to read/write locally in low ms range.
I tried writing to a remote Redis and it seems bound by network latency, which is too slow. This rules out simple master-slave (M-S) solutions.
Need HA both for node failure and entire data center failure. Implies replication
Prefer a solution that can batch multiple reads into one call (MGET).
A few hundred thousand SET per second and a few million GET per second
Free for commercial use (we could pay, so for instance Riak is possible)
I only need the solution to support GET, SET, DELETE [MGET preferred]; see the sketch after the shortlist below.
Active-Active / Multi-Master / Clusters with replication: I am only vaguely familiar with these terms and their tradeoffs.
http://blog.nahurst.com/visual-guide-to-nosql-systems
In CAP theorem terms I probably want an AP solution, but I flip-flop on that a lot. It is further confusing that a typical CP solution can be augmented with replication layers such as Dynomite/Dynamo/Twemproxy/etc.
My shortlist but open to any solution.
Redis Cluster
Redis+Sentinel with Dynomite
Memcached with Dynomite
Voldemort
Any idea how I should go about finding a solution for my needs, or does anyone have a solution in mind?
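Whatever backend wins, the access pattern itself is tiny; for reference, a redis-py sketch of the three required operations plus MGET (the host name is invented; with Redis Cluster you would construct redis.RedisCluster instead):

    import redis

    r = redis.Redis(host="dc1-cache.example.internal", port=6379)

    r.set("user:1001", "server-a")             # SET: userid -> server
    print(r.mget(["user:1001", "user:1002"]))  # MGET: batched reads
    r.delete("user:1001")                      # DELETE

Note that in a cluster, MGET keys may hash to different slots, which some cluster implementations restrict, so that requirement is worth testing early.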

Protocol for remote logging of temperature, gas/electricity consumption

So, I'm managing a series of rented holiday homes, which all have dynamic IP, ADSL Internet connections.
We've wanted to keep track of a few types of data, e.g. per-room electricity usage, hot water temperature, thermostat setting, gas usage, network bandwidth usage, etc etc, and keep these centrally so we can perform analytics and graph them in real-time.
I'm comfortable building the hardware required to log these variables every 1-5 seconds and get them into e.g. a Raspberry Pi, but I'm wondering what kind of framework would be suitable for transferring and storing the data on the server side.
My initial thought was something like SNMP, but a) this doesn't seem designed for non-network uses, b) it's not very secure, and c) I'm looking for something agent-to-server (so I don't have to know the IP of the agent, and it'll also traverse NAT, so I can have multiple devices logging different things on the same network.)
My second thought was something using a REST API, but making potentially hundreds of API calls per second via different TCP connections seems a bit wasteful.
I came across Cubism, but this seems to have the same disadvantages as some sort of REST API; there's a lot of redundant data transmitted on every connection if I were to send the data every 5 seconds per sensor.
Names like AMQP and MQTT come up, though none of these seem particularly suited (natively) to travelling over the public Internet without configuring VPNs etc.
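For reference, here's what the agent side would look like with MQTT over TLS (paho-mqtt 1.x API; the broker host, topic, and credentials are invented). The device dials out over one long-lived TCP connection, so NAT traversal comes for free:

    import json
    import paho.mqtt.client as mqtt

    client = mqtt.Client(client_id="holiday-home-7")  # hypothetical agent ID
    client.tls_set()                                  # TLS for the public Internet
    client.username_pw_set("home7", "not-a-real-password")
    client.connect("telemetry.example.com", 8883)
    client.loop_start()

    # One small JSON payload per reading, every 1-5 seconds per sensor.
    client.publish("homes/7/hot_water_temp",
                   json.dumps({"ts": 1571234567, "celsius": 54.2}),
                   qos=1)

Whether that counts as "natively suited to the public Internet" is a judgment call, but TLS plus per-device credentials is the usual pattern for running MQTT without a VPN.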
Thoughts?
[This doesn't seem like a particularly niche problem, now I think about it - weather logging, share price, etc etc... although this is probably a smaller interval]
I have a geospatial/environmental monitoring background and can tell you something about two major standards which are used today in environmental/infrastructural (electricity and water supply networks) monitoring sensor networks.
Proprietary one: most sensors simply store time-series measurements in their own local data format. A server process polls every sensor from time to time to gather the time-series data (in most cases via a simple GPRS uplink), transforms it into an exchange format, and then stores it in a centralized database where you can work with the data. One of the industry-leading companies is Kisters AG, with their exchange format ZRXP. So this is simply storing time-series data in an ASCII format (i.e. ZRXP) and importing it into a database by polling the sensor over any connection.
Open Geospatial standard: Sensor Observation Service (SOS) and SensorML, which I think fit your needs better, because these are web-service specifications, whereas the proprietary stuff above is a complete system solution built by one vendor. There is a nearly ready-to-use Java reference implementation of SOS provided by 52 North, which should be easily runnable on a Pi. Although the SOS specification has a very strong geospatial background, that does not mean it can't be adapted for your purpose, I think. At least SensorML should give you some ideas.

High-speed data acquisition using REST services

We need to develop a high-speed REST-based WCF service, which will be used for updating 2000 data points, each data point changing every 25 ms. Is it possible to implement such high-speed data acquisition using WCF?
Using WCF, yes. I'm not sure REST is the best architectural style for the type of problem you are trying to solve. I also wonder whether HTTP is appropriate.
Having said that, you might want to look into CoRE, which is an IETF effort to apply REST in highly constrained environments like data acquisition.
Here is how I am understanding your question: you expect new data values every 25 ms, i.e. 40 times per second. There are 2000 discrete data values in one device, which means the telemetry flow from each device is around 80,000 values per second (2000 × 40). You also have multiple devices, so your throughput will go higher than this, e.g. 800,000 updates per second for 10 devices.
In this scenario, I wouldn't expect the service layer to be a constraint, for the simple reason that it is always possible to scale up the service layer by adding more hosts to receive messages and load balancing between them. Where I would be concerned is any place where all transactions must be processed within the same domain. For example, is all this data winding up in one relational database? In that case you may have a problem with transaction throughput.
Another area that seems problematic in your architecture is the device itself. Is one device going to be capable of gathering and sending out values at 80 kHz? Here is where the REST protocol may have too high an overhead. So it is a device constraint, not a server constraint, that might drive you to find a more efficient protocol. This may be a case where writing a custom protocol directly against the socket is warranted, but that depends on your device.
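To make the overhead comparison concrete, here is a sketch of such a custom framing over a plain TCP socket (the endpoint is invented and the frame layout is just an example, not a spec): a batch of 2000 float32 readings is about 8 KB on the wire, versus per-request HTTP headers 40 times a second.

    import socket
    import struct
    import time

    def send_batch(sock, device_id, values):
        """Frame: length prefix (u32), then device_id (u32),
        timestamp_ms (u64), count (u32), and count float32 values."""
        payload = struct.pack("<IQI", device_id,
                              int(time.time() * 1000), len(values))
        payload += struct.pack(f"<{len(values)}f", *values)
        sock.sendall(struct.pack("<I", len(payload)) + payload)

    sock = socket.create_connection(("acquisition.example.com", 9000))
    send_batch(sock, device_id=1, values=[0.0] * 2000)  # one 25 ms batch

Whether the device can actually fill one such batch every 25 ms is exactly the constraint in question.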

BIND 9.7: when several named processes are running, how do I tell which process is providing the service?

For example, I execute "sudo named" several times, so there are several named processes running. When I use "pidof named", I get several pids.
I want to calculate the CPU usage rate of the BIND process, so I need to get some parameters from "/proc/pid/stat", which means I need the pid of the named process that is really providing the domain resolution service.
What's the difference between the named process which is providing the service and the others? Could you give me a detailed explanation?
Thanks very much~
(It's my first time using Stack Overflow and asking questions in English, so please ignore any syntax errors.)
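For the measurement step itself, once the single service pid is known (see the answers below), the arithmetic from /proc/<pid>/stat is small; a sketch using only the standard library (field positions per proc(5)):

    import os
    import subprocess
    import time

    def cpu_seconds(pid):
        """utime + stime (fields 14 and 15 of /proc/<pid>/stat),
        converted to seconds. The comm field is stripped first so a
        process name containing spaces cannot shift the fields."""
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        fields = stat[stat.rindex(")") + 2:].split()  # field 3 onward
        utime, stime = int(fields[11]), int(fields[12])
        return (utime + stime) / os.sysconf("SC_CLK_TCK")

    # Assumes a single named, as the answer below recommends.
    pid = int(subprocess.check_output(["pidof", "named"]).split()[0])
    before = cpu_seconds(pid)
    time.sleep(5)
    print(f"CPU usage over 5 s: {(cpu_seconds(pid) - before) / 5:.2%}")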
There should be just one named running; the scripts that manage the service ensure that. You shouldn't start it like that; you should use whatever your distribution uses to start it, probably something along the lines of service bind start (that is probably a RedHat-ism), or /etc/rc.d/bind start (for bog-standard SysVinit).
I was responsible for DNS for quite some time here. Some tips:
DNS is a very critical service, configure and monitor with extreme care. Do read up on setting up and managing this, don't go ahead until you are absolutely clear.
Get somebody as a backup for the case that you aren't available, and make sure they understand the previous point.
DNS isn't CPU intensive (OK, with signed domains and that newfangled stuff this might have changed); it is memory intensive (and network intensive, or at least sensitive to delays). Our main DNS server was running for months at a time, and clocked up some half hour of CPU time during that kind of period IIRC.
Separate your master server (responsible for the domain(s)) from the servers queried by clients (caching servers). There have been vulnerabilities where malformed questions, or "answers" to questions that hadn't been asked, poisoned the database.
The master server will have all the domain information in RAM; make sure you have got enough of it.
Make sure all machines under your jurisdiction use the same caching server. Running more than one makes no sense; it defeats the purpose of the cache.
The caching servers collect immense amounts of data over time. This data rarely is performance critical, so configure them with plenty of swap space to accommodate overflows.
BIND creates as many worker threads (which some tools list as separate named processes) as you have CPUs:
man named:
-n #cpus
Create #cpus worker threads to take advantage of multiple CPUs. If not specified, named will try to determine the number of CPUs present and create one thread per CPU. If it is unable to determine the number of CPUs, a single worker thread will be created.
External source:
https://unix.stackexchange.com/questions/140986/multiple-named-processes-for-bind9-in-debian