Performance Indicators for Load Testing?

Currently I am working my way into the topic of load and performance testing. In our planning, however, the customer now wants us to name indicators for the load and performance test. This is where I am personally out of my depth. What exactly are the performance indicators within a load and performance test?

You can separate the performance indicators into client-side and server-side indicators:
1. Client-Side Indicators (JMeter Dashboard):
Average Response Time
Minimum Response Time
Maximum Response Time
90th Percentile
95th Percentile
99th Percentile
Throughput
Network Bytes Sent
Network Bytes Received
Error % and the different types of errors received
Response Time Over Time
Active Threads Over Time
Latencies Over Time
Connect Time Over Time
Hits Per Second
Response Codes Per Second
Transactions Per Second
Total Transactions Per Second etc.
You can also obtain Composite Graphs for better understanding.
2. Server-Side Indicators:
CPU Utilization
Memory Utilization
Disk Details
Filesystem Details
Network Traffic Details
Network Socket
Network Netstat
Network TCP
Network UDP
Network ICMP etc.
3. Component-Level Monitoring:
Language-specific metrics (Java, .NET, Python, etc.)
Database Server
Web Server
Application Server
Broker Statistics
Load Balancers etc.
Just to name a few.
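Most of the client-side numbers above can also be recomputed offline from the JTL results file JMeter writes, which is handy for cross-checking a dashboard. A minimal sketch in Python, assuming the default CSV column names (timeStamp, elapsed, success, bytes) and an example file name:

# Rough sketch: derive a few client-side indicators from a JMeter JTL
# file saved as CSV. Column names follow JMeter's default CSV output;
# adjust them to your jmeter.save.saveservice settings.
import csv
import statistics

elapsed, timestamps, errors, bytes_recv = [], [], 0, 0
with open("results.jtl", newline="") as f:          # example path
    for row in csv.DictReader(f):
        elapsed.append(int(row["elapsed"]))
        timestamps.append(int(row["timeStamp"]))
        bytes_recv += int(row.get("bytes", 0))
        if row["success"] != "true":
            errors += 1

duration_s = (max(timestamps) - min(timestamps)) / 1000.0
q = statistics.quantiles(elapsed, n=100)            # cut points 1..99
print("average (ms)        :", round(statistics.mean(elapsed), 1))
print("min/max (ms)        :", min(elapsed), "/", max(elapsed))
print("90th/95th/99th (ms) :", q[89], q[94], q[98])
print("throughput (req/s)  :", round(len(elapsed) / duration_s, 2))
print("error %             :", round(100.0 * errors / len(elapsed), 2))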

Related

Baselining internal network traffic (corporate)

We are collecting network traffic from switches using Zeek in the form of ‘connection logs’. The connection logs are then stored in Elasticsearch indices via filebeat. Each connection log is a tuple with the following fields: (source_ip, destination_ip, port, protocol, network_bytes, duration) There are more fields, but let’s just consider the above fields for simplicity for now. We get 200 million such logs every hour for internal traffic. (Zeek allows us to identify internal traffic through a field.) We have about 200,000 active IP addresses.
What we want to do is digest all these logs and create a graph where each node is an IP address, and a directed edge (source → destination) represents traffic between two IP addresses. There will be one unique edge for each distinct (port, protocol) tuple. The edge will have properties: average duration, average bytes transferred, and a histogram of log counts by hour of day.
I have tried using Elasticsearch’s aggregation and also the newer Transform technique. While both work in theory, and I have tested them successfully on a very small subset of IP addresses, the processes simply cannot keep up for our entire internal traffic. E.g. digesting 1 hour of logs (about 200M logs) using Transform takes about 3 hours.
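For concreteness, the per-edge roll-up I am trying to produce looks roughly like the sketch below (pandas over a small CSV export, purely to illustrate the target schema; the real pipeline is Elasticsearch aggregations/Transforms, and the ts column used for the hour-of-day histogram is an assumption):

# Illustrative only: the edge aggregation described above, run over a
# small CSV export of connection logs. Column names mirror the tuple
# fields listed earlier; "ts" (epoch seconds) is assumed.
import pandas as pd

logs = pd.read_csv("conn_sample.csv")               # example file name
logs["hour"] = pd.to_datetime(logs["ts"], unit="s").dt.hour

edge_keys = ["source_ip", "destination_ip", "port", "protocol"]
edges = (logs.groupby(edge_keys)
             .agg(avg_duration=("duration", "mean"),
                  avg_bytes=("network_bytes", "mean"),
                  log_count=("duration", "size"))
             .reset_index())

# per-edge histogram of log counts by hour of day (24 columns)
hourly = logs.groupby(edge_keys + ["hour"]).size().unstack(fill_value=0)
print(edges.head())
print(hourly.head())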
My question is:
Is post-processing Elasticsearch data the right approach to making this graph? Or is there some product that we can use upstream to do this job? Someone suggested looking into ntopng, but I did not find this specific use case in their product description. (Not sure if it is relevant, but we use ntop's PF_RING product as a frontend for Zeek.) Are there other products that do the job out of the box? Thanks.
What problems or root causes are you attempting to elicit with a graph of Zeek east-west traffic?
It seems that a more tailored use case, such as a specific type of authentication, or even a larger problem set such as endpoint access expansion, might be a better use of storage, compute, memory, and the rest of your valuable time and resources, no?
Even if you did want to correlate or group on Zeek data, try to normalize it to OSSEM, and there would be no reason to, say, collect the tuple when you can collect community-id instead. You could correlate Zeek in the large to Suricata in the small. Perhaps a better data architecture would be VAST.
Kibana, in its latest iterations, does have Graph, and even older versions can leverage the third-party kbn_network plugin. I could see you hitting a wall with 200k active IP addresses and Elasticsearch aggregations or even summary indexes.
Many orgs will build data architectures beyond the simple Serving layer provided by Elasticsearch. What I have heard of is a Kappa architecture streaming directly into a graph database such as Dgraph, with perhaps just those edges of the graph made available from a Serving layer.
There are other ways of asking questions of IP address data, such as the ML options in AWS SageMaker IP Insights or the Apache Spot project.
Additionally, I'm a huge fan of getting the right data only as the situation arises, although in an automated way so that the puzzle pieces bubble up for me and I can simply lock them into place. If I were working with Zeek data especially, I could leverage a platform such as SecurityOnion and its orchestrated Playbook engine to kick off other tasks for me, such as querying out with one of the Velocidex tools, or even cross-correlating using the built-in Sigma sources.

AWS CloudWatch "Latency" metric and JMeter "Average" metric in Summary Report for API performance testing

While load/performance testing an API behind an ELB in AWS using JMeter, I see
an AWS CloudWatch Latency metric of 10 ms (seems good) and a JMeter Summary Report Average metric of 3000 ms (seems bad).
The API returns 1 MB of JSON data. I don't understand why there is such a large difference between the numbers, and is this API performance acceptable
if the SLA calls for a 100 ms API response time?
You are looking into different metrics:
Latency: JMeter measures the latency from just before sending the request to just after the first response has been received.
Elapsed time: JMeter measures the elapsed time from just before sending the request to just after the last response has been received.
So Latency is included in the response time: it is the so-called Time To First Byte, while Elapsed Time is the Time To Last Byte. My recommendation is to stick to what JMeter reports so you won't be confused by metrics coming from different sources; JMeter is at least open source, so you can be confident about how the metrics are calculated.
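To see how a 10 ms time to first byte and a ~3000 ms elapsed time can both be true for a 1 MB response, it helps to estimate the body-transfer portion separately. A back-of-the-envelope sketch (the bandwidth figures are assumptions for illustration only):

# Rough estimate of how long downloading a ~1 MB JSON body adds on top
# of the time to first byte, at a few assumed effective bandwidths
# between the JMeter machine and the ELB.
PAYLOAD_BYTES = 1 * 1024 * 1024      # ~1 MB response body
TTFB_MS = 10                         # the "good looking" latency figure

for mbit_per_s in (3, 10, 100):
    transfer_ms = PAYLOAD_BYTES * 8 / (mbit_per_s * 1_000_000) * 1000
    print(f"{mbit_per_s:>3} Mbit/s -> elapsed ~ {TTFB_MS + transfer_ms:.0f} ms")

At roughly 3 Mbit/s of effective throughput the body alone accounts for close to 3 seconds, so a large gap between the two metrics is entirely plausible.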
If response time of 3 seconds is too high you can start looking into the reasons for this which could be:
Your API server is simply overloaded; check CPU, RAM, network, and disk usage using, for example, the aforementioned Amazon CloudWatch or the JMeter PerfMon Plugin.
Your application configuration might not be ready for high loads. The defaults of most web/application/database servers are suitable for application development and debugging only (the same applies to JMeter), so most probably you will need to tune the infrastructure.
Your application uses non-optimal algorithms. Use profiler tools to inspect where it spends time, what are the "heaviest" methods, how long database calls last, etc.
Also, if your application is behind the ELB, JMeter can cache the IP address of one of the entry nodes so that all your requests hit only one host. To avoid this situation, add a DNS Cache Manager to your Test Plan.
References:
JMeter Glossary
JMeter Best Practices
The DNS Cache Manager: The Right Way To Test Load Balanced Apps

Capacity test on Apache WebServer

I was trying to do a capacity test on an Apache web server, but there are some results I can't understand: according to the theory of capacity planning, I should see three different regions on the plot of throughput in/out.
In the first region the expected result is the line y=x, meaning that the web server can follow my requests and reply to all with the code 200-OK (Thus, the throughput I request is equal to the throughput I get).
In the second region the expected result is the line y=k, where k is that throughput that indicates the saturation of the web server (Thus, the throughput I get can't go further k).
In the third region the expected result is a curve that goes from k to zero, showing the degradation of the web server, which, because of memory leaks or CPU saturation, starts to reject requests.
I have tried to replicate the experiment with a virtual machine running an instance of Apache as the server and the physical machine running an instance of Apache JMeter as the client. The result I get shows only the first two regions: even if I request a very large number of samples per second as throughput, I always get the saturation value.
Why can't I get the server to go down, even when the CPU is 0% idle and only about 10 MB of memory remains? Or is this the correct behavior and my hypothesis was incorrect?
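For reference, the theoretical shape I was expecting looks roughly like the sketch below (invented numbers, purely to illustrate the three regions, not measured data):

# Illustrative plot of the three textbook regions of a throughput
# in/out curve: linear (y = x), saturation (y = k) and degradation.
# k and the decay slope are made up just to show the shape.
import numpy as np
import matplotlib.pyplot as plt

k = 500                                    # assumed saturation (req/s)
offered = np.linspace(0, 1500, 300)        # requested throughput
delivered = np.where(
    offered <= k, offered,                                # region 1: y = x
    np.where(offered <= 2 * k, k,                         # region 2: y = k
             np.maximum(0, k - (offered - 2 * k))))       # region 3: decay

plt.plot(offered, delivered)
plt.xlabel("offered load (req/s)")
plt.ylabel("delivered throughput (req/s)")
plt.title("Expected capacity regions (illustrative)")
plt.show()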
Thank you in advance.

Opensource stat server?

I've been looking for an opensource stat server that supports the following requirements:
Local proxy to aggregate hundreds of stats per second and send them out to a central cluster (or single server) every 10 seconds or so. The application will make blocking network calls to the proxy from within the code rather than writing out to disk and having another process come and read the logs.
The central server responds to queries that ask for aggregates in real time (sub-second response) over stats per 5-minute interval, hour, day, month, and year. Optional: support rolling time windows (e.g. 1 hour back from now).
Tagging per stat metric. Each stat name will have different attributes such as the hostname this stat is coming from.
Monotonically increasing stats (stats that increase forever, i.e. total count)
I understand it is fairly straightforward to write your own (a table per day, aggregate lower-granularity tables based on policy, then drop them per TTL; this can be done on NoSQL, e.g. hashes on Redis keyed by time bucket, roughly as sketched below), but I am surprised that there isn't one readily available given that it is a standard use case. OpenTSDB is a close candidate (it doesn't provide the local proxy) but doesn't support monotonically increasing stats.
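To make the "write your own" idea concrete, the sketch below is roughly what I have in mind (redis-py; the key scheme, bucket size, and TTL are just examples):

# Sketch of the home-grown approach: one Redis hash per time bucket,
# HINCRBY for monotonically increasing counters, and a TTL so old
# buckets age out. Key layout and sizes are examples only.
import time
import redis

r = redis.Redis()                          # assumes a local Redis instance

def record(metric, value=1, tags=None, bucket_s=300, ttl_s=7 * 86400):
    # add `value` to `metric` in the current 5-minute bucket
    bucket = int(time.time()) // bucket_s * bucket_s
    key = f"stats:{bucket_s}:{bucket}"
    field = metric
    if tags:
        field += "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    r.hincrby(key, field, value)
    r.expire(key, ttl_s)

record("requests.total", 1, tags={"host": "web01"})   # hypothetical metric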
Any suggestions or pointers?
Have a look at statsd; it's a really cool project that does more or less what you want. Your app fires UDP packets at a central node (you state a sample percentage you want to actually send, to avoid overloading; we use about 10%), and the central server aggregates the labeled data. It then uses Graphite to generate the actual reports.
https://github.com/etsy/statsd
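The wire protocol is simple enough to exercise by hand if you want to see what the server receives; a minimal fire-and-forget counter in Python (host, port, and the 10% sample rate are examples):

# Minimal statsd-style counter over UDP: "<name>:<value>|c|@<sample rate>".
# UDP is fire-and-forget, so a down statsd node never blocks the app.
import random
import socket

STATSD_ADDR = ("statsd.example.com", 8125)   # example host/port
SAMPLE_RATE = 0.1                            # send ~10% of events

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name, value=1):
    if random.random() < SAMPLE_RATE:
        sock.sendto(f"{name}:{value}|c|@{SAMPLE_RATE}".encode("ascii"),
                    STATSD_ADDR)

incr("app.requests")                         # hypothetical metric name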

Detecting Connection Speed / Bandwidth in .net/WCF

I'm writing both client and server code using WCF, where I need to know the "perceived" bandwidth of traffic between the client and server. I could use ping statistics to gather this information separately, but I wonder if there is a way to configure the channel stack in WCF so that the same statistics can be gathered simultaneously while performing my web service invocations. This would be particularly useful in cases where ICMP is disabled (e.g. ping won't work).
In short, while making my regular business-related web service calls (REST calls to be precise), is there a way to collect connection speed data implicitly?
Certainly I could time the web service round trip, compared to the size of data used in the round-trip, to give me an idea of throughput - but I won't know how much of that perceived bandwidth was network related, or simply due to server-processing latency. I could perhaps solve that by having the server send back a time delta, representing server latency, so that the client can compute the actual network traffic time. If a more sophisticated approach is not available, that might be my answer...
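The measurement I have in mind would look roughly like the sketch below (a plain HTTP client rather than the WCF channel stack, and the X-Server-Processing-Ms header is hypothetical; my service would have to emit it):

# Sketch of the idea above: time the whole round trip, subtract a
# server-reported processing delta, and treat the remainder as network
# time for a perceived-throughput estimate.
import time
import requests

URL = "https://example.com/api/report"        # placeholder endpoint

start = time.perf_counter()
resp = requests.get(URL, timeout=30)
round_trip_s = time.perf_counter() - start

server_ms = float(resp.headers.get("X-Server-Processing-Ms", 0))  # hypothetical header
network_s = max(round_trip_s - server_ms / 1000.0, 1e-6)

payload_bits = len(resp.content) * 8
print(f"perceived bandwidth ~ {payload_bits / network_s / 1e6:.2f} Mbit/s")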
ICMP was not created with connection-speed statistics in mind; it only tells you whether a valid connection can be made between two hosts.
My best guess is that the amount of data sent in those REST calls or ICMP traffic is not enough to calculate a perceived connection speed / bandwidth.
If you calculate from these metrics, you will get bandwidth figures that are either very high or very low; think of the copy-progress estimate in Windows XP. You need a constant and substantial amount of data to be sent in order to calculate valid throughput statistics.