O(1)-complexity Python structure for storing data in memory or on disk depending on the number of items

At the moment I'm working on this library:
https://pypi.org/project/daffi/
It is supposed to be a multiprocess RPC communication framework with the ability to execute sync or async tasks on remotes.
The communication involves a server that temporarily stores some message metadata about the receiver/transmitter processes and keeps it until the message is returned to the process that sent it.
Generally speaking, keeping 1k or 10k metadata items in memory is not a problem: they are small, and communication is typically fast, so I have never even seen the server hold that many items.
But I'd like to be protected against such cases and reduce memory consumption on the server side when they happen.
So my question: would you suggest any library or algorithm for storing metadata in a dict-like object with the ability to store items on disk under certain conditions?
The criteria are the following:
Let's say the number of items in the dict is less than 1k. In this case it acts like a regular dict and stores items in RAM.
If the number of items becomes greater than 1k, it starts serializing them and storing them on disk, with the ability to retrieve them by key and deserialize them.
If, after a spike, the number of items returns to normal (< 1k), it goes back to normal dict-like behavior.
Speed is very important, so I'd like to keep O(1) complexity if possible.
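For illustration, here is a minimal sketch of the behavior described in the criteria above, using only the standard library (`sqlite3` plus `pickle`). The class name `SpillDict`, the `threshold` parameter, the table layout, and the assumption that keys are strings are all hypothetical choices, not part of daffi. Lookups that hit RAM stay O(1) on average; lookups that hit disk go through SQLite's B-tree index, so they are O(log n) rather than strictly O(1).

```python
import os
import pickle
import sqlite3
import tempfile
from collections.abc import MutableMapping


class SpillDict(MutableMapping):
    """Dict-like store: up to `threshold` items in RAM, overflow pickled to SQLite.

    Assumption: keys are strings (e.g. message ids); values are picklable.
    """

    def __init__(self, threshold=1000, path=None):
        self.threshold = threshold
        self._ram = {}
        self._spilled = 0  # number of items currently on disk
        # Hypothetical default location; pass `path` to control it.
        self._path = path or os.path.join(tempfile.mkdtemp(), "spill.sqlite")
        # isolation_level=None puts the connection in autocommit mode.
        self._db = sqlite3.connect(self._path, isolation_level=None)
        self._db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

    def _on_disk(self, key):
        if self._spilled == 0:  # fast path: nothing spilled, skip the query
            return False
        row = self._db.execute("SELECT 1 FROM kv WHERE k = ?", (key,)).fetchone()
        return row is not None

    def __setitem__(self, key, value):
        has_room = len(self._ram) < self.threshold
        if key in self._ram or (has_room and not self._on_disk(key)):
            self._ram[key] = value
            return
        if not self._on_disk(key):
            self._spilled += 1
        self._db.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
            (key, pickle.dumps(value)),
        )

    def __getitem__(self, key):
        if key in self._ram:
            return self._ram[key]
        if self._spilled:
            row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
            if row is not None:
                return pickle.loads(row[0])
        raise KeyError(key)

    def __delitem__(self, key):
        if key in self._ram:
            del self._ram[key]
            return
        cur = self._db.execute("DELETE FROM kv WHERE k = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)
        self._spilled -= 1

    def __iter__(self):
        yield from list(self._ram)
        for (k,) in self._db.execute("SELECT k FROM kv").fetchall():
            yield k

    def __len__(self):
        return len(self._ram) + self._spilled
```

In this sketch spilled items are not promoted back into RAM; they simply drain from disk as they are deleted, after which the object behaves like a plain dict again. The connection is also used from a single thread only, so a multi-threaded server would need its own locking.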

Related

Cons of using MemoryCache as a temporary copy of DB table

I have a site where you can list your car for sale. There is a list and a map with filtering on car types and other car specifications. My idea was to cache the cars table and filter on the cached copy when a user searches for a car on the website. Currently, especially when zooming in or out on the map, every such action triggers an HTTP request that queries the database, which can be slow and heavy on the server.
As an experiment with 1 000 items, I have cached the map data (trimmed data with only basic info) and it's working fine. I was thinking of keeping what is basically a copy of the cars table, with all needed joins applied, in Memory Cache and using that instead of querying the DB on every request for both the list and the map. I would have a cron job run every 5 minutes (the data can change, but updates don't have to be immediate) to refresh Memory Cache with the latest cars data from the DB.
What would be the cons of this approach in the long term, for example when storing 100 000 records? Besides the server needing more RAM, would there be any concerns about the scalability or usability of this approach? Would it be better to use Redis instead?
I do have a "search as you type" service in place now, but I don't really need that functionality because the filtering is pretty exact. I added it more as a caching server, but I think I would be better off just using Memory Cache until there is a real need for that kind of service.
Thank you

Since memory isn’t infinite, we need to limit the number of items stored in the In-Memory cache.
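As a rough illustration of limiting the number of items in an in-memory cache, here is a minimal least-recently-used sketch. It is written in Python rather than against the .NET MemoryCache API, and the class name `BoundedCache` and the `max_items` parameter are made up for the example.

```python
from collections import OrderedDict


class BoundedCache:
    """In-process key-value cache that evicts the least recently used item."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_items=2)
cache.set("car:1", {"make": "A"})
cache.set("car:2", {"make": "B"})
cache.get("car:1")                 # touch car:1 so car:2 becomes the oldest
cache.set("car:3", {"make": "C"})  # evicts car:2
```

Eviction by recency is only one possible policy; expiring entries by age would fit the 5-minute refresh described in the question just as well.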
MemoryCache VS Redis
MemoryCache
MemoryCache is embedded in the process, hence it can only be used as a plain key-value store from that process.
Redis
Redis is a remote data structure server. It is certainly slower than just storing the data in local memory.
In short, MemoryCache runs inside the web server process of the current application, so it is limited by the resources of that web server. On the same hardware it will of course be very fast. Its main disadvantage is that the stored data cannot be shared with other applications.
Redis is not as fast as MemoryCache, because the data is read from a remote server rather than directly from local memory, but it offers high reliability and high scalability.
Related Post:
1. How to update redis after updating database?
2. how to keep caching up to date
3. How can MySQL update data in real time in redis cache?

Where are sql results stored in a gui client?

Suppose I have a dataset that contains 100B rows and I run a SELECT * SQL query on the table without a limit, and let's suppose the client doesn't impose a limit on top of it either.
As the query runs, the client usually loads the results incrementally into the interface. However, the dataset is much too large to fit on my local machine. What actually happens while it says "Running query..."? Is the data loaded directly into program memory? Is the data saved to something like a tmp file that is memory mapped (I would think not), or what is the most common way to 'display' the results here? And finally, what would happen once my local memory limit is exceeded: would the program just hang or crash?
I know this is a slightly abstract question, but mainly I'm asking how a SQL result set is usually 'loaded' in order to display the results to a user in a user interface.

There may not be a "usual" answer. Different applications are likely to take different approaches depending on the trade-offs they want to make.
The simplest approach is for the client to fetch the first N rows (you tagged this for Oracle SQL Developer where the default N is 50). If you then scroll down in the results, the client will fetch the next N rows. The client keeps the results it has already fetched in memory. If you try to fetch more data than the client machine has memory available (and, of course, the client may have been configured to have virtual memory larger than the physical memory available), the application either crashes or generates some sort of error. Note that depending on the specific implementation, the data could be cached either by the ODBC/JDBC/etc. driver or by the actual application code.
If there is some reason for the client to expect that it would be beneficial to display gigabytes worth of data to a human (or if crashing or erroring out is particularly problematic), the client might write results to a file rather than keeping them in memory. That doesn't seem particularly common in a GUI IDE but I don't use a terribly large number of different GUIs.
Other options are possible (but probably not worth implementing in an application that is supposed to provide results to a human who isn't going to scroll through billions of results). Under the covers, the application or driver could cache a key (in Oracle, normally the ROWID) for the previously returned data rather than the entire row and then re-fetch that data if the user tries to scroll back to the top. The application could discard data that you had already fetched and throw an error if you tried to scroll back from row 1 billion to row 1. Or it could silently re-execute the query if you wanted to go back to the first row.
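The "fetch the first N rows, then the next N on scroll" approach described above can be sketched with the standard DB-API method `cursor.fetchmany`. This example uses an in-memory SQLite table as a stand-in for the real database; the table name `t`, the helper `fetch_in_batches`, and the batch size are illustrative assumptions, not how any particular GUI client is implemented.

```python
import sqlite3


def fetch_in_batches(cursor, batch_size=50):
    """Yield lists of at most `batch_size` rows until the result set is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Stand-in data set: 120 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(120)])

cur = conn.execute("SELECT * FROM t")
batches = fetch_in_batches(cur)

first_page = next(batches)   # what the grid shows right after "Running query..."
loaded = list(first_page)    # the client keeps everything fetched so far in memory
for page in batches:         # each "scroll down" pulls the next batch
    loaded.extend(page)
```

Because `loaded` only ever grows, memory use grows with how far the user scrolls, which is the failure mode described in the answer.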

Max value size for Redis

I've been trying to make a replay system. Basically, when a player moves, the system saves their data (movements, location, animation, etc.) into a JSON file. At the end of the recording, the JSON file may be over 50 MB. I'd like to save this data into Redis with an expiry (24-48 hours).
My questions are:
Is it bad to save over 50 MB into Redis with an expiry?
How many values of over 50 MB can Redis handle without performance loss?
If players make 500 recordings in 48 hours, would that be bad for Redis?
How many milliseconds does it take to get 50 MB of data from Redis on an average VDS/VPS?

Storing a large object (in terms of size) is not good practice. You may read it from here. One of the problems is the network: you need to send a 50 MB payload to the Redis server in a single call. Also, if you save everything as one big object, then to retrieve or update a single field or element you need to get 50 MB back from the server, parse it, update the field, and send everything back to the server. That's a serious problem in terms of network usage.
Instead of Redis strings, you may prefer sorted sets or lists, depending on your use case. If you are going to store events with timestamps and fetch the range of events between two timestamps, then sorted sets may be an ideal solution for you. They are good for pagination, etc. One crucial drawback is that the complexity of adding a new element is O(log(N)).
Lists may also be a good fit for your case. You may use LPUSH/RPUSH to add new events to your list, and since Redis lists are implemented with linked lists, adding an element to the beginning or to the end of the list costs the same, O(1), which is great.
Whenever an event happens, you call either ZADD or RPUSH/LPUSH to send the event to Redis. If you need to query the events, you may use the available commands such as ZRANGEBYSCORE or LRANGE, depending on your choice.
While designing your keys, you may use an identifier such as the user id, just like you mentioned in the comments. With lists or sorted sets you will not have the problems you would have with strings. Which one is most suitable for you depends on your read/write patterns and business rules.
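A minimal sketch of the sorted-set variant, assuming the redis-py client (whose `zadd`, `zrangebyscore`, and `expire` methods map to the commands named above). The key scheme `replay:<user id>`, the function names, and the event layout are hypothetical. The functions take the client as a parameter, so nothing here connects to a server by itself.

```python
import json


def replay_key(user_id):
    # Hypothetical key naming scheme: one sorted set per user.
    return f"replay:{user_id}"


def record_event(client, user_id, timestamp, event, ttl_seconds=48 * 3600):
    """Store one small event, scored by its timestamp (ZADD), and refresh the TTL."""
    # The timestamp is embedded in the member so that two otherwise identical
    # events do not collapse into one: sorted-set members are unique.
    payload = json.dumps({"ts": timestamp, **event}, sort_keys=True)
    key = replay_key(user_id)
    client.zadd(key, {payload: timestamp})
    client.expire(key, ttl_seconds)


def events_between(client, user_id, start, end):
    """Fetch only the events whose timestamp lies in [start, end] (ZRANGEBYSCORE)."""
    members = client.zrangebyscore(replay_key(user_id), start, end)
    return [json.loads(m) for m in members]
```

With redis-py installed and a server running, `client` would be something like `redis.Redis()`. Each call then moves one small event over the network instead of the whole 50 MB document.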
Here are some useful links to read:
Redis data types intro
Redis data types
Redis labs documentation about data types

Aerospike: Device Overload Error when size of map is too big

We got a "device overload" error after the program had run successfully in production for a few months. We found that some maps are very large and may contain more than 1,000 entries.
After inspecting the source code, I found that the cause of "device overload" is that the write queue exceeds its limit, and the length of the write queue is related to processing efficiency.
So I checked the "particle_map" file, and I suspect that the whole map will be rewritten even if we just want to insert one pair of KV into the map.
But I am not so sure about this. Any advice?

So I checked the "particle_map" file, and I suspect that the whole map will be rewritten even if we just want to insert one pair of KV into the map.
You are correct. When using persistence, Aerospike does not update records in place. Each update/insert is buffered into an in-memory write-block which, when full, is queued to be written to disk. This queue allows for short bursts that exceed your disk's max IO, but if the burst is sustained for too long the server will begin to fail the writes with the 'device overload' error you have mentioned. How far behind the disk is allowed to get is controlled by the max-write-cache namespace storage-engine parameter.
You can find more about our storage layer at https://www.aerospike.com/docs/architecture/index.html.

What is a real world use for ConcurrentBag<T>?

A ConcurrentBag will allow multiple threads to add and remove items from the bag. It is possible that a thread will add an item to the bag and then end up taking that same item right back out. It says that the ConcurrentBag is unordered, but how unordered is it? On a single thread, the bag acts like a Stack. Does unordered mean "not like a linked list"?
What is a real world use for ConcurrentBag?

Because there is no ordering, the ConcurrentBag has a performance advantage over ConcurrentStack/Queue. Microsoft implemented it on top of thread-local storage, so every thread that adds items does so in its own space. When a thread retrieves items, they come from its local storage. Only when that is empty does the thread steal an item from another thread's storage. So instead of a simple list, a ConcurrentBag is a distributed list of items. It is almost lock-free and should scale better under high concurrency.
Unfortunately, in .NET 4.0 there was a performance issue (fixed in 4.5), see
http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0
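To make the "local storage first, steal only when empty" idea concrete, here is a toy Python model. It is not Microsoft's implementation: the class name `ToyBag` is made up, a single global lock is used for simplicity (the real ConcurrentBag is described above as almost lock-free), and stealing from the opposite end of the other thread's storage is an assumption of this sketch.

```python
import threading
from collections import deque


class ToyBag:
    """Toy model of a bag with per-thread storage and stealing."""

    def __init__(self):
        self._lock = threading.Lock()  # simplification: one lock for everything
        self._local = {}               # thread id -> that thread's own deque

    def add(self, item):
        with self._lock:
            self._local.setdefault(threading.get_ident(), deque()).append(item)

    def try_take(self):
        """Return (True, item), or (False, None) when the whole bag is empty."""
        me = threading.get_ident()
        with self._lock:
            mine = self._local.get(me)
            if mine:
                # Own storage first, newest item first: stack-like on one thread.
                return True, mine.pop()
            for tid, other in self._local.items():
                if tid != me and other:
                    # Own storage is empty: steal the oldest item of another thread.
                    return True, other.popleft()
        return False, None


bag = ToyBag()
for i in (1, 2, 3):
    bag.add(i)
own = bag.try_take()  # taken from this thread's own storage

stolen = []
t = threading.Thread(target=lambda: stolen.append(bag.try_take()))
t.start()
t.join()              # the helper thread had nothing of its own, so it stole
```

This also mirrors the observation in the question that, on a single thread, the bag behaves like a stack.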

Bags are really useful for tracking instance counts. For example, if you want to keep a record of which hosts you're servicing web requests for, you can add their IP to the bag when you start servicing the request, and remove it when done.
Using a bag will allow you to tell at a glance which IPs you're currently servicing. It will also let you quickly query whether you're servicing a given IP address.
If you use a set for this rather than a bag, then having multiple concurrent requests from the same IP address will mess up your record-keeping.
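The difference between a bag and a set for this kind of bookkeeping can be shown with Python's `collections.Counter`, which acts as a multiset. This is only an analogy for the idea, not the .NET API; the function names are hypothetical, and the sketch is single-threaded, so real concurrent use would need locking.

```python
from collections import Counter

active = Counter()  # bag: IP address -> number of requests in flight


def start_request(ip):
    active[ip] += 1


def finish_request(ip):
    active[ip] -= 1
    if active[ip] <= 0:
        del active[ip]  # drop the entry once no request from this IP remains


# Two concurrent requests from the same IP, one of which finishes.
start_request("10.0.0.1")
start_request("10.0.0.1")
finish_request("10.0.0.1")
still_serving_bag = "10.0.0.1" in active  # the second request is still tracked

# The same sequence with a set loses track of the second request.
active_set = set()
active_set.add("10.0.0.1")
active_set.add("10.0.0.1")
active_set.discard("10.0.0.1")
still_serving_set = "10.0.0.1" in active_set
```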

Anything where you just need to keep track of what's there and don't need random access or guaranteed order. If you have a thread that adds items to process, and a thread that removes items in order to process them, a concurrent bag would work well if you don't care that they're processed in FIFO order.

Thanks to @Chris Jester-Young I came up with a good, real-world scenario that actually applies to a project I'm working on.
Find - Process - Store
Find - threads 1 & 2 are set to find or scrape data (file system, web, etc). These results are stored in ConcurrentBag1.
Process - threads 3 & 4 are set to take out of ConcurrentBag1, clean/transform/process the data and then store the results in ConcurrentBag2.
Store - thread 5 is set to gather results from ConcurrentBag2 and store the results in SQL.
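The Find - Process - Store flow above could look roughly like this in Python. It is a sketch under several assumptions: `queue.Queue` stands in for the two bags (it is FIFO, which is stricter than a bag requires), the "processing" is a placeholder string cleanup, and the store step collects into a list instead of writing to SQL. All function names are made up.

```python
import queue
import threading

DONE = object()  # sentinel telling a processing thread to stop


def find(out_q, sources):
    """Find: push raw results into the first container."""
    for item in sources:
        out_q.put(item)


def process(in_q, out_q):
    """Process: take raw items, clean them, push into the second container."""
    while True:
        item = in_q.get()
        if item is DONE:
            break
        out_q.put(item.strip().upper())  # placeholder clean/transform step


def run_pipeline(sources):
    found, processed = queue.Queue(), queue.Queue()
    # Threads 1 & 2 find, threads 3 & 4 process.
    finders = [
        threading.Thread(target=find, args=(found, chunk))
        for chunk in (sources[0::2], sources[1::2])
    ]
    workers = [
        threading.Thread(target=process, args=(found, processed))
        for _ in range(2)
    ]
    for t in finders + workers:
        t.start()
    for t in finders:
        t.join()
    for _ in workers:
        found.put(DONE)  # one sentinel per processing thread
    for t in workers:
        t.join()
    # Store: here the calling thread plays the role of thread 5.
    stored = []
    while not processed.empty():
        stored.append(processed.get())
    return stored


results = run_pipeline([" a ", "b", " c"])
```

The order of `results` is not guaranteed, which matches the point that a bag suits work where processing order does not matter.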
