Store sparse voxel trees in Geode or GemFire

Is it possible to store binary data in Geode or GemFire?
In particular, I would like to store binary structures of sparse voxel octrees and retrieve them using a 3D coordinate.
If yes, is it possible to create a client in C++?

Without knowing much about voxel structures, I can't comment on the best way to do this. Geode/GemFire is designed to store binary data. With a C++ client, you should use Geode's PDX serialization. If it's a simple matter of retrieving a structure given a known coordinate, Geode should work well.
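To make the "binary value keyed by a coordinate" shape concrete, here is a minimal standard-C++ sketch. It deliberately avoids the Geode native client API; the key format and the OctreeNode layout are illustrative assumptions, not anything Geode prescribes.

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical flat octree node; a real sparse voxel octree will differ.
struct OctreeNode {
    uint8_t  childMask;   // which of the 8 children exist
    uint32_t firstChild;  // index of the first child in the node array
};

// Encode a 3D coordinate as a region key. Geode keys can be strings,
// so a fixed-format string is one simple, portable option.
std::string coordinateKey(int32_t x, int32_t y, int32_t z) {
    return std::to_string(x) + ":" + std::to_string(y) + ":" + std::to_string(z);
}

// Serialize the node array into a flat byte buffer (ignoring padding and
// endianness concerns for brevity). This blob is the binary value you store.
std::vector<uint8_t> serializeOctree(const std::vector<OctreeNode>& nodes) {
    std::vector<uint8_t> bytes(nodes.size() * sizeof(OctreeNode));
    std::memcpy(bytes.data(), nodes.data(), bytes.size());
    return bytes;
}

The string key and the byte blob are then what you would put into and get from a Geode region; PDX serialization, as mentioned above, is the usual route if you want the servers to understand the structure rather than treat it as opaque bytes.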

Related

Is it better to open or to read large matrices in Julia?

I'm in the process of switching over to Julia from other programming languages, and one of the things Julia will let you hang yourself on is memory. I think this is probably a good thing: a language where you actually have to think about some amount of memory management forces the coder to write more efficient code. This is in contrast to something like R, where you can seemingly load datasets that are larger than the allocated memory. Of course, you can't actually do that, so I wonder how R gets around that problem.
Part of what I've done in other programming languages is work on large tabular datasets, often converted into an R data frame or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to use readline("my_file.txt") to access data, or is it better to use open("my_file.txt", "r")? If possible, wouldn't it be better to access a large dataset all at once for speed? Or would it be better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
While it may not be relevant to your use case, the mechanisms mentioned in @Przemyslaw Szufel's answer are used in other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request adding optional memory-mapping where applicable.
In R you cannot have a data frame larger than memory. There is no magical buffering mechanism. However, when running R-based analytics you could use the disk.frame package for that.
Similarly, in Julia, if you want to process data frames larger than memory you need to use an appropriate package. The most reasonable and natural option in the Julia ecosystem is JuliaDB.
If you want a more low-level solution, have a look at:
Mmap, which provides memory-mapped I/O and directly addresses the issue of conveniently handling data too large to fit into memory (see the sketch after this answer)
SharedArrays, which offers a disk-mapped array whose implementation is based on Mmap.
In conclusion: if your data is data-frame based, try JuliaDB; otherwise have a look at Mmap and SharedArrays (look at the filename parameter).
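For what it's worth, on POSIX systems the mechanism underneath Julia's Mmap standard library is the operating system's mmap call. A minimal C++/POSIX sketch of the same idea, assuming a placeholder file of raw double values that may be larger than RAM:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Placeholder file; assume it holds raw doubles and may exceed RAM.
    const char* path = "my_big_matrix.bin";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file; pages are faulted in lazily, so the file can be
    // far larger than physical memory.
    void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const double* values = static_cast<const double*>(addr);
    std::printf("first value: %f\n", values[0]);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}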

Single object of 2GB size (maybe more in the future) to store in Redis Cache

We are planning to implement a distributed cache (Redis) for our application. We have data stored in a map of around 2GB in size, and it is a single object. Currently it is stored in context scope; similarly, we have plenty of other objects stored in context scope.
Now we are planning to move all of this context data into Redis. The map takes a large amount of memory, and we would have to store it as a single key-value object.
Is Redis suitable for this requirement? And which data type is appropriate for storing this data in Redis?
Please suggest a way to implement this.
So, you didn't finish the discussion in the other question and started a new one? 2GB is A LOT. Suppose you have a 1Gb/s link between your servers: you need 16 seconds just to transfer the raw data. Add protocol costs, add deserialization costs, and you're at 20 seconds now. These are hardware limitations. Of course you may get a 10Gb/s link, or even multiplex it for 20Gb/s, but is that the way? The real solution is to break this data into parts and perform only partial updates.
To the topic: if you do store it as a single value, use the String (basic) type; there are no other options, since the other types are complex structures and you need just one value.
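If you follow the "break the data into parts" advice above instead, one possible sketch uses the hiredis C client to store the map as a Redis hash, so individual entries can be written and read separately rather than as one 2GB string. The host, port, hash name, and sample entries below are placeholders.

#include <hiredis/hiredis.h>
#include <cstdio>
#include <map>
#include <string>

int main() {
    // Placeholder for the application's big in-memory map.
    std::map<std::string, std::string> bigMap = {
        {"entry:1", "value-1"},
        {"entry:2", "value-2"},
    };

    redisContext* c = redisConnect("127.0.0.1", 6379);
    if (c == nullptr || c->err) { std::fprintf(stderr, "connect failed\n"); return 1; }

    // Store each entry as a field of one hash instead of a single huge string,
    // so reads and updates touch only the entries that changed.
    for (const auto& [field, value] : bigMap) {
        redisReply* r = static_cast<redisReply*>(redisCommand(
            c, "HSET app:bigmap %s %b", field.c_str(), value.data(), value.size()));
        if (r != nullptr) freeReplyObject(r);
    }

    // Partial read: fetch just one entry.
    redisReply* r = static_cast<redisReply*>(
        redisCommand(c, "HGET app:bigmap %s", "entry:1"));
    if (r != nullptr && r->type == REDIS_REPLY_STRING)
        std::printf("entry:1 -> %s\n", r->str);
    if (r != nullptr) freeReplyObject(r);

    redisFree(c);
    return 0;
}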

Apache Ignite 2.x BinaryObject deserialization performance

I'm observing two orders of magnitude performance difference scanning a local off-heap cache between binary and deserialized mode (200k/sec vs 2k/sec). Have not profiled it with tools yet.
Is the default reflection-based binary codec recommended for production, or is there a better one?
What's the best source to read for description of the binary layout (the official documentation is missing that)?
Or, in the most generic form: what's the expected data retrieval performance with an Ignite scan query, and how do I achieve it?
Since version 2.0.0, Ignite stores all data in off-heap memory, so it's expected that BinaryObjects work faster: a BinaryObject doesn't deserialize your objects into classes but works directly with the bytes.
So yes, it's recommended to use BinaryObjects where possible for performance's sake.
Read the following doc; it explains how to use BinaryObjects:
https://apacheignite.readme.io/docs/binary-marshaller

Redis vs RocksDB

I have read about Redis and RocksDB, but I don't get the advantages of Redis over RocksDB.
I know that Redis is all in-memory, while RocksDB is in-memory and uses flash storage. If all the data fits in memory, which one should I choose? Do they have the same performance? Does Redis scale linearly with the number of CPUs? I guess there are other differences that I don't get.
I have a dataset which fits in memory and I was going to choose Redis, but it seems that RocksDB offers me the same, and if one day the dataset grows too much I wouldn't have to worry about memory.
They have nothing in common. You are trying to compare apples and oranges here.
Redis is a remote in-memory data store (similar to memcached). It is a server. A single Redis instance is very efficient, but totally non-scalable (regarding CPU). A Redis cluster is scalable (regarding CPU).
RocksDB is an embedded key/value store (similar to BerkeleyDB or more exactly LevelDB). It is a library, supporting multi-threading and a persistence based on log-structured merge trees.
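To illustrate the "library, not server" point: with RocksDB you link the engine into your process and call it directly, roughly like the sketch below (the database path is a placeholder).

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <iostream>
#include <string>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    // The database is just files in a directory, managed inside this process.
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example-rocksdb", &db);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    // Put/Get go straight to the in-process storage engine: no network hop.
    s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "key1", &value);
    if (s.ok()) std::cout << "key1 -> " << value << "\n";

    delete db;
    return 0;
}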
While Didier Spezia's answer is correct in his distinction between the two projects, they are linked by a project called LedisDB. LedisDB is an abstraction layer written in Go that implements much of the Redis API on top of storage engines like RocksDB. In many cases you can use the same Redis client library directly with LedisDB, making it almost a drop-in replacement for Redis in certain situations. Redis is obviously faster, but as the OP mentioned in the question, the main benefit of using RocksDB is that your dataset is not limited to the amount of available memory. I find that useful not because I'm processing super-large datasets, but because RAM is expensive and you can get more mileage out of smaller virtual servers.
Redis, in general, has more functionality than RocksDB. It can natively understand the semantics of complex data structures such as lists and sets. RocksDB, in contrast, looks at the stored values as a blob of data. If you want to do any further processing, you need to bring the data to your program and process it there (in other words, you can't delegate the processing to the database engine, i.e. RocksDB).
RocksDB only runs on a single server. Redis has a clustered version (Redis Cluster, part of open-source Redis since version 3.0).
Redis is built for in-memory computation; it also supports backing the data up to persistent storage, but the main use cases are in-memory ones. RocksDB, by contrast, is usually used for persisting data and in most cases stores the data on a persistent medium.
RocksDB has better multi-threaded support (especially for reads; writes still suffer from concurrent access).
Many memcached deployments use Redis (the protocol used is memcached, but the underlying server is Redis). This doesn't use most of Redis's functionality, but it is one case where Redis and RocksDB function similarly (as a KVS, though still in a different context: Redis-based memcached is a cache, while RocksDB is a database, though not an enterprise-grade one).
@Guille If you know that the hot data (the data fetched frequently) is time-stamp based, then RocksDB would be a smart choice, but do optimize the fallback path with bloom filters. If your hot data is random, then go for Redis. Using RocksDB entirely in memory is generally not recommended; log-structured databases like RocksDB are specifically optimized for SSD and flash storage. So my recommendation would be to understand the use case and pick a DB for that particular use case.
Redis is a distributed, in-memory data store, whereas RocksDB is an embedded key-value store and is not distributed.
Both are Key-Value Stores, so they have something in common.
As others mentioned, RocksDB is embedded (as a library), while Redis is a standalone server. Moreover, Redis can be sharded.
RocksDB               | Redis
persisted on disk     | stored in memory
strictly serializable | eventually consistent
sorted collections    | no sorting
vertical scaling      | horizontal scaling
If you don't need horizontal scaling, RocksDB is often a superior choice. Some people would assume that an in-memory store would be strictly faster than a persistent one, but it is not always true. Embedded storage doesn't have networking bottlenecks, which matters greatly in practice, especially for vertical scaling on bigger machines.
If you need to serve RocksDB over a network or need high-level language bindings, the most efficient approach would be to use the UKV project. It, however, also supports other embedded stores as engines and provides higher-level functionality, such as Graph collections, similar to RedisGraph, and Document collections, like RedisJSON.

Is there a max limit for objects per process?

In C++, the map class is very convenient. Instead of going for a separate database, I want to store all the rows as objects and create a map object for the columns to search on. I am concerned about the maximum number of objects a process can handle. And is using a map to retrieve an object among, say, 10 million objects (if Linux permits) a good choice? I'm not worried about persisting the data.
What you are looking for is std::map::max_size, quoting from the reference:
...reflects the theoretical limit on the size of the container. At runtime, the size of the container may be limited to a value smaller than max_size() by the amount of RAM available.
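A quick way to see that figure on your own platform (the exact number depends on the implementation and the element type):

#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<int, std::string> m;
    // Theoretical container limit; actual capacity is bounded by available RAM.
    std::cout << "max_size: " << m.max_size() << "\n";
    return 0;
}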
No, there is no maximum number of objects per process. Objects (as in, C++ objects) are an abstraction which the OS is unaware of. The only meaningful limit in this regard is the amount of memory used.
You can completely fill your RAM with as many map entries as it takes, I promise.
As you can see in the reference documentation, map::max_size() will tell you the limit.
This should be 2^31 - 1 on x86 hardware/OS and 2^64 - 1 on amd64 hardware with a 64-bit OS.
Possibly additional information here.
An object is a programming-language concept; the process itself is not aware of objects. With enough RAM, you can allocate as many objects as you like in your program.
About your second question: which data structure you choose depends on the problem you want to solve. A map is a suitable data structure for quickly accessing objects, testing existence, and so on, but it does not preserve insertion order (std::map keeps its elements sorted by key).
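For the retrieval question, here is a rough sketch of what a lookup among 10 million objects looks like with std::map. The row count and payload are arbitrary; std::map::find is O(log n), and std::unordered_map would give average O(1) lookups if key ordering is not needed.

#include <chrono>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<long long, std::string> rows;
    const long long n = 10'000'000;  // arbitrary row count; expect roughly a
                                     // gigabyte of RAM for this many entries

    for (long long i = 0; i < n; ++i)
        rows.emplace(i, "row-" + std::to_string(i));

    auto start = std::chrono::steady_clock::now();
    auto it = rows.find(n / 2);      // O(log n) lookup
    auto stop = std::chrono::steady_clock::now();

    if (it != rows.end())
        std::cout << it->second << " found in "
                  << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()
                  << " ns\n";
    return 0;
}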