Why are throughput and latency inversely proportional on pub/sub systems? - rabbitmq

When reading a paper (not free) comparing Kafka and RabbitMQ, I came across the following (emphasis mine):
Latency. In any transport architecture, latency of a packet/message is determined by the serial pipeline (i.e., sequence of processing steps) that it passes through. Latency can only be reduced by pipelining the packet transport over resources that can work concurrently on the same packet in a series architecture (multiple processing cores, master DMA engines in case of disk or network access, …). It is not influenced by scaling out resources in parallel.
Throughput. Throughput of a transport architecture is the number of packets (or, alternatively, bytes) per time unit that can be transported between producers and consumers. Contrary to latency, throughput can easily be enhanced by adding additional resources in parallel.
For a simple pipeline, throughput and latency are inversely proportional.
Why is it so? Isn't that the contrary of saying that "(latency) is not influenced by scaling out resources in parallel"? If I add more machines to increase the throughput, how is the latency reduced?

Let's examine the scenario of a highway, and for purposes of discussion we'll use I-66 in the Washington, DC metro. This highway experiences rush hour delays each morning amounting to about 40-60 minutes of additional travel time. This is because the throughput of the road is constrained. As a result, latency for a single car increases.
The general theory behind this is known as Little's Law. It states that the average number of customers (or, in this case, cars) in a system (i.e., the highway) equals the average arrival rate multiplied by the average time a customer spends in the system. Expressed algebraically, L = λW, or equivalently W = L / λ.
The practical implication is that, given an increase in the number of cars L, such as what happens around rush hour, and given constant throughput of the highway λ (Virginia got a little creative and figured out how to dynamically convert a shoulder into a traffic lane, but it wasn't very effective), what results is an increase in the time W it takes to travel a defined distance. For a fixed distance, the speed of a car is proportional to the inverse of W.
It is clear that, by Little's Law, throughput λ is inversely proportional to latency (time) W for a constant number of cars L.
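As a small worked example (with made-up numbers), holding the number of cars L constant and halving the throughput λ doubles the time W each car spends on the highway:

cars_on_road = 1200                      # L: average number of cars on the highway

for throughput in (40.0, 20.0):          # lambda: cars exiting per minute
    time_on_road = cars_on_road / throughput   # W = L / lambda, in minutes
    print(f"throughput {throughput:>4.0f} cars/min -> {time_on_road:.0f} min on the highway")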

Related

Cost of deploying a TensorFlow model in GCP?

I'm thinking of deploying a TensorFlow model using Vertex AI in GCP. I am almost sure that the cost will be directly related to the number of queries per second (QPS) because I am going to use automatic scaling. I also know that the type of machine (with GPU, TPU, etc.) will have an impact on the cost.
Do you have any estimation about the cost versus the number of queries per second?
How does the type of virtual machine change this cost?
The type of model is for object detection.
Autoscaling depends on the CPU and GPU utilization which directly correlates to the QPS, as you have said. To estimate the cost based on the QPS, you can deploy a custom prediction container to a Compute Engine instance directly, then benchmark the instance by making prediction calls until the VM hits 90+ percent CPU utilization (consider GPU utilization if configured). Do this multiple times for different machine types, and determine the "QPS per cost per hour" of different machine types. You can re-run these experiments while benchmarking latency to find the ideal cost per QPS per your latency targets for your specific custom prediction container. For more information about choosing the ideal machine for your workload, refer to this documentation.
For your second question, as per the Vertex AI pricing documentation (for model deployment), cost estimation is done based on the node hours. A node hour represents the time a virtual machine spends running your prediction job or waiting in a ready state to handle prediction or explanation requests. Each type of VM offered has a specific pricing per node hour depending on the number of cores and the amount of memory. Using a VM with more resources will cost more per node hour and vice versa. To choose an ideal VM for your deployment, please follow the steps given in the first paragraph which will help you find a good trade off between cost and performance.
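To make the trade-off concrete, here is a rough sketch of the arithmetic described above; the machine types, QPS figures, and per-node-hour prices are hypothetical placeholders, to be replaced with your own benchmark results and the current Vertex AI pricing for your region:

# Hypothetical benchmark results:
# machine type -> (max sustainable QPS near 90% utilization, USD per node hour)
benchmarks = {
    "n1-standard-4":         (30.0, 0.22),
    "n1-standard-8 + 1x T4": (180.0, 0.75),
}

target_qps = 300.0

for machine, (qps_per_node, price_per_hour) in benchmarks.items():
    nodes_needed = -(-target_qps // qps_per_node)       # ceiling division
    hourly_cost = nodes_needed * price_per_hour
    print(f"{machine}: {int(nodes_needed)} node(s), ~${hourly_cost:.2f}/hour, "
          f"~${hourly_cost / target_qps:.4f} per QPS per hour")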

Performance Counter for DRAM Per-Rank Memory Access

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. I need to retrieve the number of accesses to each DRAM rank, over time, to estimate its power consumption. Based on page 261 of the chipset documentation (i.e., Datasheet, volume 2 (M- and H-processor lines)), I could use the 32-bit value in the MSR_DRAM_ENERGY_STATUS register as a DRAM energy estimation. But I need rank-level energy estimates. I could also use core and offcore DRAM access performance counters to estimate power consumption, but, as mentioned before, I need per-rank statistics. Besides that, they report whole-system stats, while energy is calculated per-rank. They also do not report many DRAM accesses.
Therefore, IMC counters (which are uncore counters) should be the ideal choice. Perf does not support per-rank counters. I tried to use PCM-Memory to access IMC counter information, but /sys/bus/event_source/devices/uncore_imc is not mounted by the kernel (the version is 5.0.0-37-generic) and the tool does not detect the CPU. I tried to access the uncore performance counters manually. Whole-system DRAM access counters are documented here (they were not documented in the above-mentioned chipset manual). I can retrieve total DRAM read and write accesses using these counters, but there is no information about channel- or rank-level access stats. How can I find the offset associated with these counters? Should I use trial and error?
P.S.: This question is also asked at Intel Software Tuning, Performance Optimization & Platform Monitoring Forum.
The MSR_DRAM_ENERGY_STATUS register always reports an estimate of the energy consumed by all memory channels. There is no easy way to break it down into per-rank energy. This register reports a highly accurate estimate on Haswell.
The 5.0.0-37-generic kernel is an Ubuntu kernel and does support the uncore_imc/data_reads/ and uncore_imc/data_writes/ events on Haswell, which represent a data read CAS command and a data write CAS command from the IMC, respectively. A full cache-line read and a full cache-line write each cause a single burst 64-byte transaction on the memory bus to a single rank. A partial read is also executed as a single full-line read on the bus, but a partial write may require a full-line read followed by a full-line write due to restrictions in the protocol. Partial writes are generally negligible.
The uncore_imc/data_reads/ and uncore_imc/data_writes/ events occur for requests targeting DRAM memory generated by any unit, not just cores. These names are given by perf and they correspond to UNC_IMC_DRAM_DATA_READS and UNC_IMC_DRAM_DATA_WRITES, respectively, which are mentioned in the Intel article you've cited. The other three events mentioned there allow you to count requests (not CAS commands!) for each of the three possible sources separately (GT, IA, and IO). You won't find them listed under /sys/bus/event_source/devices/uncore_imc/events on your old kernel. They are supported in perf starting with mainline kernel v5.9-rc2.
By the way, PCM does support these events as well, and it uses them to report read and write bandwidth over all channels, but you should use the tool pcm.x, not pcm-memory.x, which only works on server processors.
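Since each of these events corresponds to one 64-byte CAS transaction, converting counter deltas into bandwidth is simple arithmetic; a minimal sketch with made-up counter values:

CACHE_LINE_BYTES = 64

reads, writes = 150_000_000, 60_000_000     # hypothetical data_reads/data_writes deltas
interval_s = 1.0                            # measurement interval in seconds

bandwidth_gb_s = (reads + writes) * CACHE_LINE_BYTES / interval_s / 1e9
print(f"~{bandwidth_gb_s:.2f} GB/s of DRAM traffic over all channels")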
A Haswell H-processor line processor has a single on-die memory controller with two DDR3L 64-bit channels. Each channel can contain zero, one, or two DIMMs with a total capacity of up to 32 GB over all channels. Moreover, each DIMM can contain up to two ranks, so a single channel can contain anywhere between zero and four ranks. The i7-4720HQ is a high-end mobile processor, so you're probably on a laptop with 8 GB or 16 GB of memory. If the memory topology was not changed since purchase, it probably has only two 4 GB or 8 GB DIMMs, one in each channel, with one remaining free slot per channel available for expansion if desired by the user. This means that there are either one or two ranks per channel.
You can approximate the number of accesses to each rank given the knowledge of how physical addresses are mapped to ranks. If each channel is populated with a single rank DIMM of the same capacity, the mapping is simple on your processor. Bit 6 of the physical address (i.e., the seventh bit) determines which channel, and therefore which rank, a request is mapped to. You can collect a set of samples of physical addresses of requests at the IMC by running perf record on MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM with the --phys-data option. Obviously this set of samples may only be representative of core-originated retired loads that reach the IMC, which are a small subset of all requests at the IMC.
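A minimal sketch of that approximation, assuming one single-rank DIMM per channel (so that bit 6 of the physical address selects the channel, and hence the rank) and assuming you have already extracted the sampled physical addresses, e.g. from perf script output:

def classify_by_rank(phys_addrs):
    counts = {0: 0, 1: 0}
    for addr in phys_addrs:
        rank = (addr >> 6) & 1        # bit 6 selects the channel/rank under this topology
        counts[rank] += 1
    return counts

samples = [0x12345640, 0x12345680, 0x123456C0]   # hypothetical sampled physical addresses
print(classify_by_rank(samples))                  # {0: 1, 1: 2}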
It appears to me that you want to measure the number of memory accesses per rank in order to estimate the per-rank energy from the total DRAM energy, but this is not trivial at all, for the following reasons:
Not all CAS commands are of the same energy cost. Precharge and activate commands are not counted by any event and may consume significant energy, especially with high row buffer miss rates.
Even if there are zero requests in the IMC, as long as there is at least one active core, the memory channels are powered and do consume energy.
The amount of time it takes to process a request of the same type and to the same address may vary depending on surrounding requests, due to timing delays required by rank-to-rank turnarounds and read-write switching.
Despite all of that, I imagine it may be possible to build a good model of upper and lower bounds on per-rank energy given a representative estimate of the number of requests to each rank (as discussed above).
The bottom line is that there is no easy way to get the luxury of per-rank counting like on server processors.

What is difference between upload speed and upload throughput?

I am trying to simulate different network speeds using Selenium.
Maybe I'm missing the point of the question but:
"Bandwidth and throughput have to do with speed, but what's the difference? To be brief, bandwidth is the theoretical speed of data on the network, whereas throughput is the actual speed of data on the network."
Pretty much: bandwidth is what your ISP will market to you, but your throughput is what you'll actually get on your side, in terms of speed. Throughput will almost always be lower than the marketed/advertised bandwidth.
source:
https://study.com/academy/lesson/bandwidth-vs-throughput.html#:~:text=Lesson%20summary,fast%20data%20is%20being%20sent.&text=Bandwidth%20refers%20to%20the%20theoretical,data%20on%20your%20network%20travels.
The term upload speed, in a broader sense, refers to the internet speed you need for uploading (and downloading) data. Bandwidth and throughput are the two major indicators of that speed:
Bandwidth is the theoretical speed of data on the network.
Throughput is the actual speed of data on the network.
Bandwidth
In essence, bandwidth refers to the maximum amount of data you can get from point A to point B in a specific amount of time. These days, when dealing with computers, bandwidth refers to how many bits of information we can theoretically transmit in a specific amount of time, expressed for example in bits per second, e.g. Kbps (kilobits per second) or Mbps (megabits per second).
Throughput
Throughput can only be as high as the bandwidth allows, and in practice it is less than that, because factors like latency (delays), jitter (irregularities in the signal), and error rate (actual mistakes during transmission) reduce the overall throughput.
I think what you are looking for is the method to do it. set_network_conditions() sets Chromium network emulation settings (Chrome/Chromium driver):

from selenium import webdriver

driver = webdriver.Chrome()
driver.set_network_conditions(
    offline=False,
    latency=5,                        # additional latency (ms)
    download_throughput=500 * 1024,   # maximal download throughput
    upload_throughput=500 * 1024)     # maximal upload throughput
Note: 'throughput' can be used to set both (for download and upload).
Source

maximum safe CPU utilization for embedded system

What is the maximum safe CPU utilization for an embedded system running critical applications? We are measuring performance with top. Is 50-75% safe?
Real-Time embedded systems are designed to meet Real-Time constraints, for example:
Voltage acquisition and processing every 500 us (let's say sensor monitoring).
Audio buffer processing every 5.8 ms (4ms processing).
Serial Command Acknowledgement within 3ms.
Thanks to a Real-Time Operating System (RTOS), which is "preemptive" (the scheduler can suspend a task to execute one with a higher priority), you can meet those constraints even at 100% CPU usage: the CPU will execute the high-priority task and then resume whatever it was doing.
But this does not mean you will meet the constraints no matter what; a few tips:
High-priority tasks' execution must be as short as possible (by calculating execution time / occurrence you can estimate CPU usage; see the sketch at the end of this answer).
If estimated CPU usage is too high, look for code optimization, hardware equivalents (hardware CRC, DMA, ...), or a second microprocessor.
Stress test your device and measure if your real-time constraints are met.
For the previous example:
Audio processing should be the lowest priority
Serial Acknowledgement/Voltage Acquisition the highest
Stress test can be done by issuing Serial Commands and checking for missed audio buffers, missed analog voltage events and so on. You can also vary CPU clock frequency: your device might meet constraints at much lower clock frequencies, reducing power consumption.
To answer your question, 50-75% and even 100% CPU usage is safe as long as you meet your real-time constraints, but bear in mind that if you want to add functionality later on, you will not have much room for that at 98% CPU usage.
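As a rough sketch of the execution-time/occurrence estimate mentioned in the tips above, using the example periods given earlier; the per-task execution times are assumed values for illustration only:

tasks = {
    # task: (assumed worst-case execution time in us, period in us)
    "voltage_acquisition": (50, 500),
    "audio_processing":    (4000, 5800),   # 4 ms of processing every 5.8 ms
    "serial_ack":          (200, 3000),
}

for name, (c, t) in tasks.items():
    print(f"{name}: {c / t:.1%}")
total_utilization = sum(c / t for c, t in tasks.values())
print(f"estimated total CPU utilization: {total_utilization:.1%}")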
In rate-monotonic scheduling, mathematical analysis determines that real-time tasks (that is, tasks with specific real-time deadlines) are schedulable when the utilisation is below about 70% (and priorities are appropriately assigned). If you have accurate statistics for all tasks and they are deterministic, this can be as high as 85% and still guarantee schedulability.
Note however that the utilisation applies only to tasks with hard-real-time deadlines. Background tasks may utilise the remaining CPU time all the time without missing deadlines.
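The "about 70%" figure comes from the Liu & Layland bound for rate-monotonic scheduling: n periodic tasks are guaranteed schedulable if total utilisation stays below n(2^(1/n) - 1), which tends to ln 2 (about 69.3%) as n grows. A quick check:

import math

for n in (1, 2, 3, 5, 10):
    bound = n * (2 ** (1 / n) - 1)
    print(f"{n:>2} tasks: guaranteed schedulable below {bound:.1%}")
print(f"limit as n grows: {math.log(2):.1%}")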
Assuming by "CPU utilization" you are referring to time spent executing code other than an idle loop, in a system with a preemptive priority based scheduler, then ... it depends.
First of all there is a measurement problem: if your utilisation sampling period is sufficiently short, it will often switch between 100% and zero, whereas if it were very long you would get a good average, but you would not know whether utilisation was high for long enough to starve a task of lower priority than the one running, to the extent that it might miss its deadlines. It is kind of the wrong question, because any practical utilisation sampling rate will typically be much longer than the shortest deadline, so it is at best a qualitative rather than a quantitative measure. It does not tell you much that is useful in critical cases.
Secondly there is the issue of what you are measuring. While it is common for an RTOS to have a means of measuring CPU utilisation, it measures utilisation by all tasks, including those that have no deadlines.
If the utilisation becomes 100% in the lowest priority task, for example, there is no harm to schedulability: the low-priority task is in this sense no different from the idle loop. It may have consequences for power consumption in systems that normally enter a low-power mode in the idle loop.
If a higher-priority task takes 100% CPU utilisation such that a deadline for a lower-priority task is missed, then your system will fail (in the sense that deadlines will be missed; the consequences are application specific).
Simply measuring CPU utilisation is insufficient, but if your CPU is at 100% utilisation and that utilisation is not simply some background task in the lowest priority thread, it is probably not schedulable: there will be stuff not getting done. The consequences are of course application dependent.
Whilst having a low-priority background thread consume 100% CPU may do no harm, it does render the ability to measure CPU utilisation entirely useless. If the task is preempted and, for some reason, the higher-priority task takes 100%, you may have no way of detecting the issue, and the background task (and any tasks lower than the preempting task) will not get done. It is better therefore to ensure that you have some idle time so that you can detect abnormal scheduling behaviour (if you have no other means to do so).
One common solution to the non-yielding background task problem is to perform such tasks in the idle loop. Many RTOS allow you to insert idle-loop hooks to do this. The background task is then preemptable but not included in the utilisation measurement. In that case of course you cannot then have a low-priority task that does not yield, because your idle-loop does stuff.
When assigning priorities, the task with the shortest execution time should have the highest priority. Moreover the execution time should be more deterministic in higher priority tasks - that is if it takes 100us to run it should within some reasonable bounds always take that long. If some processing might be variable such that it takes say 100us most of the time and must occasionally do something that requires say 100ms; then the 100ms process should be passed off to some lower priority task (or the priority of the task temporarily lowered, but that pattern may be hard to manage and predict, and may cause subsequent deadlines or events to be missed).
So if you are a bit vague about your task periods and deadlines, a good rule of thumb is to keep it below 70%, but not to include non-real-time background tasks in the measurement.

How to leverage blocks/grid and threads/block?

I'm trying to accelerate this database search application with CUDA, and I'm working on running a core algorithm in parallel with CUDA.
In one test, I ran the algorithm in parallel across a digital sequence of size 5000 with 500 blocks per grid and 100 threads per block and came back with a run time of roughly 500 ms.
Then I increased the size of the digital sequence to 8192 with 128 blocks per grid and 64 threads per block and somehow came back with a result of 350 ms to run the algorithm.
This would indicate that how many blocks and threads used and how they're related does impact performance.
My question is how to decide the number of blocks/grid and threads/block?
Below I have my GPU specs from a standard device query program:
You should test it because it depends on your particular kernel. One thing you must aim for is to make the number of threads per block a multiple of the number of threads in a warp. After that you can aim for high occupancy of each SM, but that is not always synonymous with higher performance. It has been shown that sometimes lower occupancy can give better performance. Memory-bound kernels usually benefit more from higher occupancy to hide memory latency; compute-bound kernels not so much. Testing the various configurations is your best bet.
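As a minimal sketch of such testing (assuming a CUDA-capable GPU and Numba installed; the kernel and array size are placeholders, not your search algorithm), sweeping a few threads-per-block values and timing each launch:

import math
import time
import numpy as np
from numba import cuda

@cuda.jit
def scale(data, factor):
    i = cuda.grid(1)                 # global thread index
    if i < data.size:
        data[i] *= factor

n = 8192
dev = cuda.to_device(np.arange(n, dtype=np.float32))

for threads_per_block in (32, 64, 128, 256, 512):     # multiples of the 32-thread warp
    blocks_per_grid = math.ceil(n / threads_per_block)
    scale[blocks_per_grid, threads_per_block](dev, 2.0)   # warm-up / JIT compile
    cuda.synchronize()
    t0 = time.perf_counter()
    scale[blocks_per_grid, threads_per_block](dev, 2.0)
    cuda.synchronize()
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    print(f"{threads_per_block:>3} threads/block, {blocks_per_grid:>4} blocks: {elapsed_ms:.3f} ms")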