Thousands of Redis Sorted Sets vs. millions of simple Sets

I have narrowed it down to two options for solving a problem I have with AWS ElastiCache (Redis).
I was able to find all the differences between these two approaches in terms of time complexity (Big O) and so on.
However, there is one question that still bothers me:
Is there any difference for a Redis cluster (in memory consumption, CPU or any other resources) between handling:
500K larger Sorted Sets (containing ~100K elements each)
48 million smaller simple Sets (containing ~500 elements each)
Thanks in advance for the help :)

You are comparing two different data types, so it is best to benchmark both and compare their memory consumption with INFO memory. I assume the entries have the same length in both cases.
If you set the set-max-intset-entries config and stay within its limit while adding to each set (let's say 512), then your memory consumption will be lower than with your first option (assuming the same value lengths and the same total number of entries). Note that this compact encoding only applies to sets whose members are all integers. But it doesn't come for free.
The documentation states:
This is completely transparent from the point of view of the user and API. Since this is a CPU / memory trade off it is possible to tune the maximum number of elements and maximum element size for special encoded types using the following redis.conf directives.
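As a rough sanity check on the numbers in the question, the sketch below (Python, purely arithmetic) compares the total element counts of the two layouts and checks each collection size against the compact-encoding thresholds. The thresholds of 512 for set-max-intset-entries and 128 for zset-max-ziplist-entries are assumed to be the stock redis.conf defaults; the constant and function names are made up for illustration, and your ElastiCache parameter group may use different values.

```python
# Sizes taken from the question.
SORTED_SET_COUNT = 500_000
SORTED_SET_SIZE = 100_000
SIMPLE_SET_COUNT = 48_000_000
SIMPLE_SET_SIZE = 500

# Assumed stock defaults; check your own configuration before relying on them.
SET_MAX_INTSET_ENTRIES = 512
ZSET_MAX_ZIPLIST_ENTRIES = 128


def total_elements(collections: int, elements_each: int) -> int:
    """Total number of elements stored across all collections."""
    return collections * elements_each


def fits_compact_encoding(elements_each: int, limit: int) -> bool:
    """True if a single collection stays within the compact-encoding limit."""
    return elements_each <= limit


zset_total = total_elements(SORTED_SET_COUNT, SORTED_SET_SIZE)
set_total = total_elements(SIMPLE_SET_COUNT, SIMPLE_SET_SIZE)

print("sorted set elements:", zset_total)
print("simple set elements:", set_total)
print("sorted sets compact:", fits_compact_encoding(SORTED_SET_SIZE, ZSET_MAX_ZIPLIST_ENTRIES))
print("simple sets compact:", fits_compact_encoding(SIMPLE_SET_SIZE, SET_MAX_INTSET_ENTRIES))
```

Under these assumptions the two layouts do not hold the same number of elements (50 billion vs. 24 billion), so a like-for-like memory comparison has to account for that. Only the 500-element sets stay within a compact-encoding limit, and only if their members are integers.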


Optimizing Redis cluster nodes

I understand that a Redis cluster has 16384 slots in total, distributed across the nodes. So if I have a key like entity:user:userID (like user:1234) whose value is a serialized user object, and my application has 500k+ users, the keys should be distributed evenly across the slots. We currently have 6 nodes in total (3 masters and 3 slaves), and we keep wondering when we should add 2 more nodes for a total of 8. We also write the cache data to disk, and we sometimes get latency warnings when persisting to disk. I'd assume that with more nodes there is less data to persist per node, and thus better performance and resource usage. But aside from disk I/O, is there a definitive performance measurement that tells us when we should start adding nodes?
If your limiting factor is disk I/O for replication, using SSDs can drastically improve performance.
Two additional signs that it is time to scale out include server load and used memory for your nodes. There are others, but these two are simple to reason about.
If your limiting factor is processing power on the nodes (e.g. server load) because of a natural increase in requests, scaling out will help distribute the load across more nodes. If one node is consistently higher than the others, this could indicate a hot partition, which is a separate problem to solve.
If your limiting factor is total storage capacity (e.g. used memory) because of a natural increase in data stored in your cache, scaling out will help grow the total storage capacity of your cluster. If you have a very large dataset and the set of keys used on a regular basis is small, technologies such as Redis on Flash by Redis Labs may be applicable.

Optimize memory usage of very large HashMap

I need to preprocess data from OpenStreetMap. The first step is to store a bunch of nodes (more than 200 million) from an unprocessed.pbf file (Europe, ~21GB). For that I'm using a HashMap. After importing the data into the map, my program checks whether each node fulfills some conditions. If not, the node is removed from the map. Afterwards, each remaining node in the map is written into a new processed.pbf file.
The problem is that this program uses more than 100GB of RAM. I want to optimize the memory usage.
I've read that I should adjust the initial capacity and load factor of the HashMap if it holds many entries. Now I'm asking myself what the best values for those two parameters are.
I've also seen that memory usage rises more slowly with the Oracle JDK JVM (1.8) than with the OpenJDK JVM (1.8). Are there any settings I can use with the OpenJDK JVM to minimize memory usage?
Thanks for your help.
If you don't provide the load factor and initial size, there will be a lot of collisions in the HashMap when searching for a key.
Generally, for the
default load factor = 0.75, we provide an
initial size = ((number of entries) / loadFactor) + 1
This makes the code more efficient: the HashMap has more space to store the data, which reduces the collisions that occur inside it when searching for a key.
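To make the formula concrete for the 200 million nodes in the question, here is a small Python sketch of the arithmetic. The helper names are hypothetical, and the power-of-two rounding is an assumption about how java.util.HashMap sizes its internal table rather than something stated in this thread.

```python
def initial_capacity(expected_entries: int, load_factor: float = 0.75) -> int:
    """Initial size from the formula above: (entries / loadFactor) + 1."""
    return int(expected_entries / load_factor) + 1


def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (assumed to mirror HashMap's table sizing)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


requested = initial_capacity(200_000_000)
table_size = next_power_of_two(requested)

print("requested initial size:", requested)
print("table size after rounding:", table_size)
```

Passing a value like this to the HashMap constructor mainly avoids repeated resizing while the map is filled. It does not shrink the per-entry memory cost, so it is only one part of reducing the 100GB footprint.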

Optaplanner - large datasets with millions of rows

There are a couple of threads discussing the scalability of OptaPlanner, and I am wondering what the recommended approach is for dealing with very large datasets with millions of rows.
As this blog post discusses, I am already using heuristics (Simulated Annealing + Tabu Search). The search space of the cloud balancing problem is c^p, but the feasible space is unknown/NP-complete.
The problem I am trying to solve is similar to cloud balancing. The main difference is in the input data: besides a list of computers and a list of processes, there is also a big two-dimensional 'score list/table' holding the score for each possible combination, which needs to be loaded into memory.
In other words, apart from the constraints between computers and processes that the planning needs to satisfy, different valid combinations yield different scores, and the higher the score the better.
It's a simple problem, but with hundreds of computers, 100k+ processes and a score table of a million+ combinations, it needs a lot of memory. Even though I could allocate more memory to increase the heap size, the planning could become very slow, as the steps are sorted with custom planning variable/entity comparator classes.
A straightforward solution is to divide the dataset into smaller subsets, run each of them individually and then combine the results, so that multiple machines can run at the same time, each using multiple threads. The biggest drawback of this approach is that the result is far from optimal.
Are there any better solutions?
The MachineReassignment example also has a big "score combination" matrix. OptaPlanner doesn't care about that - those are just problem facts, and the DRL quickly matches the combination(s) that are picked for an assignment. Solver.solve() itself causes no big memory consumption or performance impact.
However, loading the problem in your code (before calling Solver.solve()) does cause a huge memory consumption: understand that if n = 20k, then n² = 400m, and an int takes up 4 bytes, so for 20 000 elements that matrix is 1.6 GB in its most efficient uncompressed form, int[][] (both in Java and C++!). So for 20k reserve 2 GB RAM, for 40k reserve 8 GB RAM, and for 80k reserve 32 GB RAM. That scales badly.
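The arithmetic behind those figures can be reproduced with a short Python sketch. It counts only the raw 4-byte cells of an n × n matrix and ignores per-array object overhead, so the real footprint is somewhat higher; the function name is made up for illustration.

```python
BYTES_PER_INT = 4  # size of one int cell, as stated above


def matrix_bytes(n: int, bytes_per_cell: int = BYTES_PER_INT) -> int:
    """Raw size in bytes of a dense n x n matrix of fixed-size cells."""
    return n * n * bytes_per_cell


for n in (20_000, 40_000, 80_000):
    print(f"n={n}: {matrix_bytes(n) / 1e9:.1f} GB")
```

Doubling n quadruples the memory, which is why the suggested reservations (2 GB, 8 GB, 32 GB) grow by a factor of four at each step while leaving some headroom over the raw size.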
As for dealing with these big problems, I use combinations of techniques such as Nearby Selection (see my blog article on that), Partitioned Search (what you described; it will be supported out of the box in 7, but I've implemented it for customers in a CustomPhase), Limited Selection Construction Heuristics (need to research that further; the plumbing is there, usually overkill), ... Partitioned Search does indeed exclude optimal solutions, but above 10k planning entities the trade-off between quality and time taken clearly favors Partitioned Search given a reasonable solving time (minutes/hours/days instead of millennia). The trick is to keep each partition big enough, above 1k entities (hence the use of Nearby Selection). Score calculation speed also matters a lot, of course.

Why does Redis Cluster only have 16384 slots?

In my opinion, as the number of keys grows, 'hash conflicts' will occur more and more frequently. I don't know whether keys in the same slot are stored in a singly linked list; if so, won't read performance be affected, especially for stale records?
The answer from antirez, the author of Redis, is below.
The reason is:
Normal heartbeat packets carry the full configuration of a node, that can be replaced in an idempotent way with the old in order to update an old config. This means they contain the slots configuration for a node, in raw form, that uses 2k of space with 16k slots, but would use a prohibitive 8k of space using 65k slots.
At the same time it is unlikely that Redis Cluster would scale to more than 1000 master nodes because of other design tradeoffs.
So 16k was in the right range to ensure enough slots per master with a max of 1000 masters, but a small enough number to propagate the slot configuration as a raw bitmap easily. Note that in small clusters the bitmap would be hard to compress because when N is small the bitmap would have slots/N bits set that is a large percentage of bits set.
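The 2k and 8k figures follow directly from storing one bit per slot. A minimal Python sketch of that arithmetic (the function name is hypothetical, and the 1000-master figure is the upper bound quoted above):

```python
def slot_bitmap_bytes(slot_count: int) -> int:
    """Bytes needed for a raw bitmap with one bit per slot."""
    return (slot_count + 7) // 8


MAX_MASTERS = 1000  # upper bound mentioned in the quoted answer

print("16384 slots:", slot_bitmap_bytes(16384), "bytes")
print("65536 slots:", slot_bitmap_bytes(65536), "bytes")
print("slots per master at 1000 masters:", 16384 // MAX_MASTERS)
```

With 16384 slots the bitmap is 2048 bytes per heartbeat, and even at 1000 masters each master still owns around 16 slots.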
These "slots" are merely a unit of distribution across shards. You're not going to have 16K shard servers in a cluster, but the slots are granular enough to allow some degree of weighted load distribution. (For example, if you start with four shards on one type of hardware and choose to introduce two more with a more powerful profile, you could make the new servers targets for twice as many slots as the existing servers and thus achieve a more even relative utilization of your capacity.)
I'm just summarizing the gist of how they're used. For details read the Redis Cluster Specification.
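The weighted example above (four existing shards plus two new ones taking twice as many slots) works out as follows. This is only a sketch of the arithmetic in Python; allocate_slots is a hypothetical helper, not part of Redis or its tooling, and it says nothing about how slots are actually migrated.

```python
from typing import List

TOTAL_SLOTS = 16384


def allocate_slots(weights: List[int], total_slots: int = TOTAL_SLOTS) -> List[int]:
    """Split total_slots proportionally to weights; leftovers go to the first shards."""
    total_weight = sum(weights)
    shares = [total_slots * w // total_weight for w in weights]
    leftover = total_slots - sum(shares)
    for i in range(leftover):
        shares[i % len(shares)] += 1
    return shares


# Four original shards with weight 1, two more powerful shards with weight 2.
print(allocate_slots([1, 1, 1, 1, 2, 2]))
```

Each original shard ends up with 2048 slots and each new shard with 4096, which only works because 16384 slots are fine-grained enough to be divided this way.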

How to optimize indexing on Elasticsearch?

I am trying to understand how indexing can be optimized on Elasticsearch. Let me clarify my needs:
I have two indices right now, let's say indexA and indexB (the two indices are approximately the same size).
I have 6 machines dedicated to Elasticsearch (with exactly the same hardware).
The most important part of my Elasticsearch usage is writing, since I do heavy writing in real time.
So my question is: how can I optimize write operations using those 6 machines?
Should I split the machines into two groups, e.g. 3 machines for indexA and 3 machines for indexB?
Should I use all 6 machines to index both indexA and indexB?
What else should I pay attention to in order to optimize write operations?
Thank you in advance
It depends, but let me point you in a direction based on your problem statement, which led me to the following assumptions:
you want to do more write operations (and are not worried about search performance)
both indices are in the same cluster
more machines may be added in the future
For better indexing performance, the first thing you may want is a single shard per index (unless you are using routing). But since you have 6 servers, a single shard would waste resources, so you can assign 3 shards each to indexA and indexB. That suits the current scenario, but it is recommended to have ~10 shards (for future scalability, and depending on your data size).
Turn off replicas (if possible, since index requests wait for the replicas to respond before returning). In a production environment, though, it is highly recommended to have at least one replica for high availability.
Set the refresh interval to "-1", or at least to a larger value, say "30m". (You will lose NRT search if you do so, but as you mentioned, your concern is indexing.)
Turn off index warmers if you have any.
Avoid using "doc_values" in your field mappings. (Although they reduce the memory footprint at search time, they increase indexing time because field values are prepared during indexing.)
If possible (i.e. if they are not required), disable "norms" in your mapping.
Lastly, read this.
Word of caution: some of the approaches above will impact your search performance.
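As a rough illustration of the shard, replica and refresh suggestions above, the Python sketch below only builds the JSON bodies; it does not talk to a cluster. The setting names number_of_shards, number_of_replicas and refresh_interval are standard index settings, while the variable names and the "restore" values (one replica, 1s refresh) are assumptions chosen for the example. Sending these bodies to the index-creation and _settings endpoints is left out.

```python
import json

# Applied when the index is created; the primary shard count
# cannot be changed in place afterwards.
create_settings = {
    "settings": {
        "index": {
            "number_of_shards": 3,
            "number_of_replicas": 0,
        }
    }
}

# Dynamic setting to apply while heavy writing is in progress:
# "-1" disables periodic refresh entirely.
heavy_write_settings = {"index": {"refresh_interval": "-1"}}

# Assumed values to restore once search freshness and availability matter again.
restore_settings = {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}

for name, body in (
    ("create", create_settings),
    ("heavy write", heavy_write_settings),
    ("restore", restore_settings),
):
    print(name, json.dumps(body))
```

Keeping the write-time and restore-time settings as separate bodies makes it easier to switch replicas and refresh back on, which matters given the caution above about search performance and availability.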