In Figure 1 below, throughput (KB/sec) is plotted for writing new files of various sizes with different record sizes. File sizes vary from 64KB to 4GB. Record sizes vary from 4KB to 16MB. The file and record sizes are the two horizontal axes and the throughput is plotted on the vertical axis.
I need to understand the causes of the following two observations:
(1) Why are there 2 plateaus: one from file size 128KB to 8MB, another from 64MB to 1GB?
(2) Why are there 2 peaks (sweet spots) in the range of file-size 256KB and 2MB?
The system has 8GB RAM. I am sure many other parameters would be required to explain it properly, but can any probable inference still be drawn from the plot?
I would be very suspicious about the following factors:
Other processes interfering
Throughput results are volatile. This is what I would suspect first. What was your measurement methodology?
If I were to perform such a measurement, I would try the following:
Measure throughput for each scenario multiple times either rebooting the system in between or running some "clean-up" process to make sure one test run doesn't interfere with another one.
Try to select scenarios in a random pattern.
Having multiple results for each scenario, throw away the outliers on both ends (or on the slow end) and take an average (or median) of the rest.
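The methodology above can be sketched roughly as follows. This is only an illustration: `measure` is a hypothetical callable standing in for one real test run (including whatever clean-up you do between runs), and the trim width and repeat count are arbitrary assumptions.

```python
import random
import statistics
from collections import defaultdict


def trimmed_mean(samples, trim=1):
    """Drop the `trim` lowest and `trim` highest samples, then average the rest."""
    ordered = sorted(samples)
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return statistics.mean(ordered)


def run_benchmark(scenarios, measure, repeats=5, seed=None):
    """Run every scenario `repeats` times in a shuffled order.

    `measure(scenario)` is a hypothetical callable: it performs one test run
    and returns the throughput in KB/sec.
    """
    rng = random.Random(seed)
    schedule = [s for s in scenarios for _ in range(repeats)]
    rng.shuffle(schedule)  # random order, so neighbouring runs do not bias each other
    results = defaultdict(list)
    for scenario in schedule:
        results[scenario].append(measure(scenario))
    return {
        s: {"trimmed_mean": trimmed_mean(v), "median": statistics.median(v)}
        for s, v in results.items()
    }
```

Each scenario here would be something like a (file size, record size) pair; comparing the trimmed mean with the median gives a quick hint of how volatile the runs were.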
I have come up with two options for solving a problem I have with AWS ElastiCache (Redis).
I was able to find the differences between these two approaches in terms of time complexity (Big O) and other aspects.
However, there is one question that still bothers me:
Is there any difference for a Redis cluster (in memory consumption, CPU or any other resources) between handling:
500K larger Sorted Sets (https://redis.io/commands#sorted_set) containing ~100K elements each
48M smaller Simple Sets (https://redis.io/commands#set) containing ~500 elements each
?
Thanks in advance for the help :)
You are comparing two different data types, so it is better to benchmark both and compare their memory consumption with INFO MEMORY. I assume the entries have the same length in both cases.
If you set the set-max-intset-entries config directive and stay within its limit while adding to each set (let's say 512), then your memory consumption will be lower than with your first option (given the same value lengths and the same total number of entries). Note that the intset encoding applies only to sets made up entirely of integers. But it doesn't come for free.
The documentation states that
This is completely transparent from the point of view of the user and API. Since this is a CPU / memory trade off it is possible to tune the maximum number of elements and maximum element size for special encoded types using the following redis.conf directives.
From your experience, what would be an ideal size for a .tfrecord file that would work best across a wide variety of devices (hard disk, SSD, NVMe) and storage locations (local machine, HPC cluster with network mounts)?
If I get slower performance on a technically more powerful computer in the cloud than on my local PC, could the size of the tfrecord dataset be the root cause of the bottleneck?
Thanks
The official TensorFlow performance guide recommends ~100MB (https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/):
Reading large numbers of small files significantly impacts I/O
performance. One approach to get maximum I/O throughput is to
preprocess input data into larger (~100MB) TFRecord files. For smaller
data sets (200MB-1GB), the best approach is often to load the entire
data set into memory.
Currently (19-09-2020) Google recommends the following rule of thumb:
"In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching. For example, say you have X GBs of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10N files, as long as ~X/(10N) is 10+ MBs (and ideally 100+ MBs). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits."
Source: https://www.tensorflow.org/tutorials/load_data/tfrecord
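As a rough illustration of that rule of thumb, the sketch below picks a shard count for X bytes of data and N hosts. The function name and the choice to cap the count at the 10 MB minimum are my own assumptions, not part of any TensorFlow API.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def suggest_shard_count(total_bytes, num_hosts,
                        files_per_host=10, min_shard_bytes=10 * MB):
    """Return (shard_count, approx_shard_bytes) following the quoted rule of thumb.

    Aim for ~10 files per host, but use fewer shards if that would push
    the shard size below `min_shard_bytes`.
    """
    target = files_per_host * num_hosts          # ~10N files
    max_by_size = total_bytes // min_shard_bytes  # most shards that stay >= 10 MB
    count = max(1, min(target, max_by_size))
    return count, total_bytes // count


# 100 GB on 4 hosts: 40 shards of ~2.5 GB each
print(suggest_shard_count(100 * GB, 4))
# 200 MB on 4 hosts: 40 shards would be only 5 MB each, so fall back to 20
print(suggest_shard_count(200 * MB, 4))
```

Since the quote says "at least" 10 files per host, a large dataset can be split further (for example towards ~100 MB shards) without breaking the rule.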
There are a couple of threads discussing the scalability of OptaPlanner, and I am wondering what the recommended approach is for dealing with very large datasets with millions of rows.
As discussed in this blog post, I am already using heuristics (Simulated Annealing + Tabu Search). The search space of the cloud balancing problem is c^p, but the feasible space is unknown/NP-complete.
http://www.optaplanner.org/blog/2014/03/27/IsTheSearchSpaceOfAnOptimizationProblemReallyThatBig.html
The problem I am trying to solve is similar to cloud balancing. The main difference is in the input data: besides a list of computers and a list of processes, there is also a big two-dimensional 'score list/table' that holds the score for each possible combination and needs to be loaded into memory.
In other words, apart from the constraints between computers and processes that the planning needs to satisfy, different valid combinations yield different scores, and the higher the score the better.
It's a simple problem, but with hundreds of computers, 100k+ processes and a score table of a million+ combinations, it needs a lot of memory. Even though I could increase the heap size, the planning could become very slow, as the steps are sorted with custom planning variable/entity comparator classes.
A straightforward solution is to divide the dataset into smaller subsets, run each of them individually and then combine the results, so that multiple machines can run at the same time, each using multiple threads. The biggest drawback of this approach is that the result produced is far from optimal.
Are there any better solutions?
The MachineReassignment example also has a big "score combination" matrix. OptaPlanner doesn't care about that - those are just problem facts and the DRL quickly matches the combination(s) that is picked for an assignment. The Solver.solve() causes no big memory consumption or performance impact.
However, loading the problem in your code (before calling Solver.solve()) does cause huge memory consumption. Understand that if n = 20k, then n² = 400m, and an int takes up 4 bytes, so for 20 000 elements that matrix is 1.6 GB in its most efficient uncompressed form, int[][] (both in Java and C++!). So for 20k reserve 2 GB RAM, for 40k reserve 8 GB RAM, and for 80k reserve 32 GB RAM. That scales badly.
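The arithmetic behind those figures can be checked with a few lines. This sketch counts only the raw 4-byte payload of an n × n matrix; it ignores any per-array or per-object overhead, which is why the reserve figures above are rounded up.

```python
def int_matrix_bytes(n, bytes_per_int=4):
    """Raw payload of an n x n matrix of fixed-size ints, in bytes."""
    return n * n * bytes_per_int


for n in (20_000, 40_000, 80_000):
    print(f"n = {n:>6}: {int_matrix_bytes(n) / 1e9:.1f} GB")
```

Doubling n quadruples the memory, which is the "scales badly" part.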
As for dealing with these big problems, I use combinations of techniques such as Nearby Selection (see my blog article on that), Partitioned Search (what you described; it will be supported out of the box in 7, but I've implemented it for customers in a CustomPhase), Limited Selection Construction Heuristics (need to research that further; the plumbing is there, usually overkill), ... Partitioned Search does indeed exclude optimal solutions, but above 10k planning entities the trade-off of quality versus time taken clearly favors Partitioned Search given a reasonable solving time (minutes/hours/days instead of millennia). The trick is to keep each partition big enough, above 1k entities (hence the use of Nearby Selection). Score calculation speed also matters a lot, of course.
I have two questions:
Is it better to give a kernel too much work or too little? Let's say I want to calculate a difference image with only 4 GPU cores. Should each pixel of my image be calculated independently by one thread, or should one thread calculate a whole line of my image? I don't know which solution is better optimized. I already vectorized the first option (which was implemented), but I only gained a few ms, which is not very significant.
My second question is about the execution costs of a kernel. I know how to measure any OpenCL command queue task (copy, write, read, kernel...) but I think there is a time taken by the host to load the kernel to the GPU cores. Is there any way to evaluate it?
Baptiste
(1)
Typically you'd process a single item in a kernel. If you process multiple items, you need to do them in the right order to ensure coalesced memory access or you'll be slower than doing a single item (the solution to this is to process a column per work item instead of a row).
Another reason why working on multiple items can be slower is that you might leave compute units idle. For example, if you process scanlines on a 1000x1000 image with 700 compute units, the 1000 scanlines will be processed as a first chunk of 700 work items and then a second chunk of only 300 work items (leaving 400 compute units idle).
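To make the idle-unit argument concrete, here is a deliberately simplified model: it assumes every work item takes the same time and that each unit runs one work item at a time. Real GPUs schedule more flexibly, so treat the numbers as an illustration only.

```python
import math


def schedule_waves(work_items, units):
    """Return (waves, idle_units_in_last_wave, utilization) under the simple model."""
    waves = math.ceil(work_items / units)
    idle_last = waves * units - work_items
    utilization = work_items / (waves * units)
    return waves, idle_last, utilization


# One work item per scanline of a 1000x1000 image on 700 units
print(schedule_waves(1000, 700))       # 2 waves, 400 units idle in the second
# One work item per pixel of the same image
print(schedule_waves(1_000_000, 700))  # 1429 waves, 300 units idle in the last
```

With one work item per scanline roughly 29% of the available slots go unused, while with one work item per pixel the waste is negligible.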
A case where you want to do lots of work in a single kernel is if you're using shared local memory. For example, if you load a look-up table (LUT) into SLM, you should use it for an entire scanline or image.
(2)
I'm sure this is a non-zero amount of time but it is negligible. Kernel code is pretty small. The driver handles moving it to the GPU, and also handles pushing parameter data onto the GPU. Both are very fast, and likely happen while other kernels are running, so are "free".
I am indexing large amounts of data into DynamoDB and experimenting with batch writing to increase actual throughput (i.e. make indexing faster). Here's a block of code (this is the original source):
    def do_batch_write(items, conn, table):
        batch_list = conn.new_batch_write_list()
        batch_list.add_batch(table, puts=items)
        while True:
            response = conn.batch_write_item(batch_list)
            unprocessed = response.get('UnprocessedItems', None)
            if not unprocessed:
                break
            # identify unprocessed items and retry batch writing
I am using boto version 2.8.0. I get an exception if items has more than 25 elements. Is there a way to increase this limit? Also, I noticed that sometimes, even if items is shorter, it cannot process all of them in a single try. But there does not seem to be correlation between how often this happens, or how many elements are left unprocessed after a try, and the original length of items. Is there a way to avoid this and write everything in one try? Now, the ultimate goal is to make processing faster, not just avoid repeats, so sleeping for a long period of time between successive tries is not an option.
Thx
From the documentation:
"The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB."
The reason some items are not processed is probably that you are exceeding the provisioned throughput of your table. Do you have other write operations being performed on the table at the same time? Have you tried increasing the write throughput on your table to see if more items are processed?
I'm not aware of any way of increasing the limit of 25 items per request but you could try asking on the AWS Forums or through your support channel.
I think the best way to get maximum throughput is to increase the write capacity units as high as you can and to parallelize the batch write operations across several threads or processes.
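A minimal sketch of that approach: split the items into chunks of at most 25, send the chunks from several threads, and retry only the unprocessed items with a short, capped backoff. Note that `write_batch` is a hypothetical callable, not a boto function; it stands for whatever code sends one batch and returns the list of items that came back unprocessed. The worker count and delays are arbitrary assumptions.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH = 25  # per-request limit quoted from the documentation above


def chunked(items, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_with_retry(batch, write_batch, max_attempts=8,
                     base_delay=0.05, max_delay=1.0):
    """Send one batch, resending only what comes back unprocessed.

    `write_batch(batch)` is hypothetical: it sends one request and returns
    the list of unprocessed items (empty when everything was written).
    """
    pending = list(batch)
    for attempt in range(max_attempts):
        pending = write_batch(pending)
        if not pending:
            return []
        # short randomized backoff, capped so we never sleep for long
        time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    return pending  # items still unwritten after all attempts


def write_all(items, write_batch, workers=4):
    """Write all items in parallel batches; return any that never succeeded."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leftovers = pool.map(lambda b: write_with_retry(b, write_batch),
                             chunked(items))
        return [item for rest in leftovers for item in rest]
```

If the table's write capacity is the bottleneck, adding workers will mostly produce more unprocessed items, so this only helps up to the provisioned throughput.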
From my experience, there is little to be gained in trying to optimize your write throughput using either batch write or multithreading. Batch write saves a little network time, and multithreading saves close to nothing as the item size limitation is quite low and the bottleneck is very often DDB throttling your request.
So (like it or not) increasing your Write Capacity in DynamoDB is the way to go.
Ah, like garnaat said, latency inside the region is often really different (like from 15ms to 250ms) from inter-region or outside AWS.
Increasing the write capacity is not the only thing that makes it faster.
If your hash key diversity is poor, you can still get throughput errors even if you increase your write capacity.
Throughput errors depend on your hit map, i.e. how your writes are distributed across hash key values.
Example: if your hash key is a number between 1 and 10, and you have 10 records with hash values 1 to 10 but another 10k records with value 10, then you will get many throughput errors even after increasing your write capacity.
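One quick way to spot this kind of skew before writing is to count how the items are distributed over hash key values. The helper below is just an illustration (the function name is made up), using the numbers from the example above.

```python
from collections import Counter


def hottest_key_share(hash_keys):
    """Return (most frequent hash key, fraction of all items that use it)."""
    counts = Counter(hash_keys)
    key, hits = counts.most_common(1)[0]
    return key, hits / len(hash_keys)


# 10 records with hash values 1..10, plus 10k more records with value 10
keys = list(range(1, 11)) + [10] * 10_000
key, share = hottest_key_share(keys)
print(key, f"{share:.2%}")
```

A share close to 100% for a single key means nearly all writes hit the same hash value, which is the situation described above.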