Maximum number of nodes in NS-3 - ns-3

What is the maximum number of nodes in the most recent version of NS-3? I have found some (older) sources saying that the limit is 6000 nodes. Is this still the case? If so, what is the main reason this limit cannot be increased?

There is no limit such as the 6000 that you cite. The practical limit is based on how much memory you have in your simulation server. Even then, the limit can be extended by the use of distributed (parallel) simulation if the simulation scenario allows it. The largest ns-3 simulations on a supercomputer approached a network size of one billion nodes (see https://dl.acm.org/doi/abs/10.1145/2756509.2756525).

Related

Thousands of REDIS Sorted Sets VS millions of Simple Sets

I have come up with two options for solving a problem I have with AWS ElastiCache (Redis).
I was able to find all the differences between these two approaches in terms of time complexity (Big O) and so on.
However, there is one question that still bothers me:
Is there any difference for REDIS cluster (in memory consumption, CPU or any other resources) to handle:
500K larger Sorted Sets (https://redis.io/commands#sorted_set) containing ~100K elements each
48 million smaller Simple Sets (https://redis.io/commands#set) containing ~500 elements each
?
Thanks in advance for the help :)
You are comparing two different data types, so it is best to benchmark both and compare their memory consumption with INFO MEMORY. I assume the entries have the same length in both cases.
If you use the set-max-intset-entries config directive and stay within its limit when adding to each set (let's say 512), then your memory consumption will be lower than with your first option (given the same value lengths and the same total number of entries). Note that this compact encoding applies only to sets whose members are all integers. But it doesn't come for free.
The documentation states that
This is completely transparent from the point of view of the user and API. Since this is a CPU / memory trade off it is possible to tune the maximum number of elements and maximum element size for special encoded types using the following redis.conf directives.

InfluxDB max available expiration and performance concerns

I am building my metrics on InfluxDB. I want to keep the data forever, so my retention policy is set to inf and my shard retention policy is set to 100 years (the max I could set).
My main concern is performance degrading as this data accumulates. My series count will not exceed 100000 (as advised for low server specs).
Will there be an impact on memory use from indexing? More specifically, I mean the memory used by InfluxDB regardless of any actions such as queries or continuous queries.
Also, in case there is a problem with performance, is it possible to back up only the data that is about to be deleted?
Based on the InfluxDB hardware sizing guidelines, under moderate load a single-node InfluxDB deployment on a server with 6 CPU cores and 8-32 GB of RAM can handle 250k writes per second and about 25 queries per second. These numbers will definitely meet your requirements. You can also achieve better performance by adding CPU and RAM.
Note: if the scale of your work grows in the future, you can also use a continuous query to down-sample old data, or export part of the data to a backup file.

Designing a service for scale. Number of servers needed

Suppose that I need to design a web service. To keep it simple, assume that I use LAMP (Linux-Apache-MySQL-PHP).
I know that I will serve exactly N user requests per second. The requests are basically simple CRUD operations to the database, no file uploads or complex calculations.
Suppose that each request takes M ms to execute and K MB of memory on my server, which has G GB of RAM.
How many such servers do I need? Is it just N * K / G?
The reasonable value for M is 200ms. What is the reasonable value for K?
Do we need to take CPU power into account in this question?
Any additional considerations?
What you're doing is a good back-of-the-envelope approximation, but by no means should you use your thought exercise as a definitive guide for scaling your service.
That is because no service will exhibit the type of constant behavior you describe (blame it on unpredictable peripheral I/O, garbage collection, external factors, user input, etc.).
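For reference, the back-of-the-envelope estimate itself can be sketched as follows. This is only a rough, memory-only model, not a sizing method: it assumes K and G are in megabytes and gigabytes, uses Little's law to get the number of requests in flight, and introduces a hypothetical `headroom` parameter for the share of RAM actually available to request handling. The input values are purely illustrative. Note that the request duration M enters the estimate, because it determines how many requests are being processed at the same time.

```python
import math

def estimate_servers(n_rps, m_ms, k_mb, g_gb, headroom=0.5):
    """Rough, memory-only estimate of the number of servers needed.

    n_rps: requests per second (N)
    m_ms:  time per request in milliseconds (M)
    k_mb:  memory per in-flight request in MB (K)
    g_gb:  RAM per server in GB (G)
    headroom: assumed fraction of RAM usable for requests (hypothetical).
    """
    # Little's law: average requests in flight = arrival rate * time in system.
    in_flight = n_rps * (m_ms / 1000.0)
    # Memory needed to hold all in-flight requests, in MB.
    memory_needed_mb = in_flight * k_mb
    # Memory assumed usable per server, in MB.
    usable_mb = g_gb * 1024 * headroom
    return max(1, math.ceil(memory_needed_mb / usable_mb))

# Illustrative inputs only: 1000 req/s, 200 ms each, 20 MB each, 8 GB servers.
print(estimate_servers(n_rps=1000, m_ms=200, k_mb=20, g_gb=8))
```

The model ignores CPU, network, disk, and database limits entirely, which is one reason measured numbers should replace it as soon as they are available.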
The correct approach is to perform scale and load testing. After you've written your service, start to load test it and note its performance characteristics. If you do things right you should reach a point where your configuration maxes out one of: the CPU, the network throughput, the memory, or disk I/O. If none of these is maxed out and you still hit a limit, then the bottleneck is one of your upstream dependencies (your database etc.).
Once you've reached your peak it will tell you how many requests per second you can handle at peak.
You will also notice that in most cases peak performance is not sustainable: your setup may be able to burst for short periods of time handling many more requests per second than under sustained load.
After you get the numbers for a single server, you can start to vary in two ways:
test with different hardware configurations (add more RAM if you're memory bound, add a better CPU if you're CPU bound, etc)
test with multiple servers; start adding servers and see how your service scales horizontally
Ideally your service should scale linearly as you add servers but you will likely find that the performance curve is not linear.
Get your numbers, tweak your design. Rinse. Repeat.
There is no substitute and no magic formula.

OpenCL optimization and apparent PCI bus limitations?

I'm writing a program using JOGL/OpenCL to utilize the GPU. I have code that kicks in when we work with large data sizes, which is supposed to detect the available memory on the GPU. If there is insufficient memory on the GPU to process the entire calculation at once, it breaks the process up into sub-processes of X frames each, which use less than the maximum GPU global memory to store.
I had expected that using the maximum possible value of X would give me the largest speed-up by minimizing the number of kernels used. Instead I found that using a smaller group (X/2 or X/4) gives me better speeds. I'm trying to figure out why breaking the GPU processing into smaller groups, rather than having the GPU process the maximum amount it can handle at one time, gives me a speed increase, and how I can work out what the best value of X is.
My current tests have been running on a GPU kernel which uses very little processing power (both kernels decimate output by selecting part of the input and returning it). However, I am fairly certain the same effects occur when I activate all kernels, which do a larger degree of processing on the value before returning.
The short answer is, it's complicated. There are many factors at play. These include (but are not limited to):
Amount of local memory you are using.
Amount of private memory you are using.
A limit on the max number of work groups the Streaming Multiprocessor is able to handle at once.
Exceeding register limits, causing memory access slow-down.
And many more...
I recommend you check out the following link:
http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
In particular, check out section 5.3. Dynamic Partitioning of SM Resources. This text is meant to be general purpose, but uses CUDA for its examples. However, the concepts still apply just the same to OpenCL.
This text comes from the following book:
http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1314279939&sr=8-2
For what it's worth, I found this book to be very informative. It will give you a deeper understanding of the hardware that will allow you to answer questions like this.
PCI-e is full duplex (bi-directional). I think that means you can write as you read. In which case, if you're doing very little processing, you may be seeing a gain because you're overlapping reads with writes.
Consider a total size of N. In one work unit you do:
write N
process N
read N
total time proportional to: process N, transfer 2N
if you split this in two with parallel read/write you can get:
write N/2
process N/2
read N/2 and write N/2
process N/2
read N/2
total time proportional to: process N, transfer 3N/2 (saving N/2 transfer time)
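The same counting can be extended to more than two chunks. The sketch below is a toy model only: it assumes equal-sized chunks, that each read can fully overlap with the next chunk's write, and that kernel launch and other per-chunk overheads are zero. None of these assumptions is guaranteed on real hardware, and the function name is made up for illustration.

```python
def transfer_units(total_size, chunks):
    """Transfer time (in units of data moved) when the work is split into
    equal chunks and each read overlaps with the next chunk's write."""
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    chunk = total_size / chunks
    # One initial write, (chunks - 1) overlapped read+write steps,
    # and one final read: (chunks + 1) chunk-sized transfers in total.
    return chunk * (1 + (chunks - 1) + 1)

# With N = 1.0: one chunk costs 2N, two chunks cost 3N/2, and so on.
for k in (1, 2, 4, 8):
    print(k, transfer_units(1.0, k))
```

In this model the transfer cost approaches N as the number of chunks grows, but in practice per-chunk overhead grows too, so the best chunk size has to be found by measurement.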

Algorithmically suggest best node to perform demanding computation

At work we perform demanding numerical computations.
We have a network of several Linux boxes with different processing capabilities. At any given time, there can be anywhere from zero to dozens of people connected to a given box.
I created a script to measure the MFLOPS (Millions of Floating-Point Operations per Second) using the Linpack Benchmark; it also provides the number of cores and memory.
I would like to use this information together with the load average (obtained using the uptime command) to suggest the best computer for performing a demanding computation. In other words, it's 3:00pm; I have a meeting in two hours; I need to run a demanding process: what node will get me the answer fastest?
I envision a script which will output a suggestion along the lines of:
SUGGESTED HOSTS (IN ORDER OF PREFERENCE)
HOST1.MYNETWORK
HOST2.MYNETWORK
HOST3.MYNETWORK
Such a suggestion should favor fast computers (high MFLOPS) if the load average is low and, as the load average increases for a given node, it should favor available nodes instead (i.e., I'd rather run on a slower computer with no users than on an eight-core with forty dudes logged in).
How should I prioritize? What algorithm (rationale) would you use? Again, what I have is:
Load Average (1min, 5min, 15min)
MFLOPS measure
Number of users logged in
RAM (installed and available)
Number of cores (important to normalize the load average)
Any thoughts? Thanks!
You don't have enough data to make a well-informed decision. It sounds as though the scheduling is very volatile: "At any given time, there can be anywhere from zero to dozens of people connected to a given box." So the current load does not necessarily reflect the future load of the machines.
To properly assess which hosts someone should use to minimize computation time, you would need to know when the current jobs will terminate. If a powerful machine is about to finish most of its jobs, it would be a good candidate even though it currently has a high load.
If you want to guess purely based on the current situation, you can do a weighted calculation to find out which hosts have the most MFLOPS available.
MFLOPS available = host's MFLOPS + (number of logical processors - load average)
Sort the hosts by MFLOPS available and suggest them in a descending order.
This formula assumes that the MFLOPS of a host is linearly related to its load average. This might not be exactly true, but it's probably fairly close.
I would favor the most recent load average since it's closer to the current/future situation, whereas jobs from 15 minutes ago might have completed by now.
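A minimal sketch of such a ranking, under one stated assumption: it uses a normalized, multiplicative reading of "MFLOPS available", scaling each host's MFLOPS by the fraction of its logical processors that the 1-minute load average leaves idle. That reading is an interpretation chosen for illustration and is not the formula exactly as written above. The dictionary keys and all host figures are made up.

```python
def available_mflops(mflops, logical_cpus, load_avg):
    """Estimate idle capacity: MFLOPS scaled by the idle processor fraction."""
    # Clamp to [0, 1] so an overloaded host scores 0 instead of going negative.
    idle_fraction = max(0.0, min(1.0, (logical_cpus - load_avg) / logical_cpus))
    return mflops * idle_fraction

def rank_hosts(hosts):
    """Sort hosts (dicts with hypothetical keys name, mflops, cpus, load1)
    by estimated available MFLOPS, best first."""
    return sorted(
        hosts,
        key=lambda h: available_mflops(h["mflops"], h["cpus"], h["load1"]),
        reverse=True,
    )

# Illustrative data only; real values would come from Linpack and uptime.
hosts = [
    {"name": "HOST1.MYNETWORK", "mflops": 8000.0, "cpus": 8, "load1": 7.5},
    {"name": "HOST2.MYNETWORK", "mflops": 3000.0, "cpus": 4, "load1": 0.2},
    {"name": "HOST3.MYNETWORK", "mflops": 5000.0, "cpus": 4, "load1": 2.0},
]

print("SUGGESTED HOSTS (IN ORDER OF PREFERENCE)")
for host in rank_hosts(hosts):
    print(host["name"])
```

With these made-up numbers the nearly idle slow host ranks ahead of the fast but heavily loaded one, which is the behavior the question asks for.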
Have you considered a distributed approach to computation? Not all computations can be broken up such that more than one CPU can work on them. But perhaps your problem space can benefit from some parallelization. Have a look at Hadoop.
You don't need to know FLOPS. The parallel computing center I go to uses Beowulf modules and has a script for this for sure.
PDC operates leading-edge, high-performance computers on a national level. PDC offers easily accessible computational resources that primarily cater to the ...