Is a SortedList or SortedDictionary the best way to efficiently store and retrieve data by the nearest key? - optimization

I am working on a project where I'm storing an object inside a SortedList<float, myData> with timestamps as the key so that I can use the built-in binary search to retrieve the data (or two sets if the lower and higher bounds are equally distant from the key) that was stored closest to any given timestamp. For context, insertion and retrieval occur every 2 ms on average (it's essentially for a game replay/rewind function; I'm just storing the current state of entities). While it works, it's not very efficient with its O(n) insertion. I know SortedDictionary has O(log n) insertion, but no built-in binary search.
Is there a better option for my situation that you more experienced folks may know of? The ultimate goal of what I'm looking for is a way to do the following:
1. Save data in sequence of timestamp
2. Get data flanking any given timestamp (I lerp between for interval values)
3. "Replay" data starting from a given timestamp (so #2 in rapid succession)
4. Clear any data from before or after a given timestamp
From what I've seen and understood, SortedList wins for #2 (and #3 by proxy), SortedDictionary is best at #1, and I'd have to roll my own BinarySearch for it to even do #2/#3.
At the moment each entity handles its own SortedList, though I wonder if making a sort of RecordManager with a single structure to handle all entities might be more performant.
Any help or suggestions would be great!
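
For illustration only, here is a minimal Python sketch of the four requirements above. It assumes timestamps are always recorded in non-decreasing order (which the "save data in sequence" requirement suggests), so a plain list with an O(1) append plus a binary search is enough. The `Timeline` class and its method names are hypothetical, not taken from any library.

```python
from bisect import bisect_left, bisect_right


class Timeline:
    """Append-only record of (timestamp, state) pairs kept in timestamp order."""

    def __init__(self):
        self.times = []   # ascending timestamps
        self.states = []  # states[i] was recorded at times[i]

    def record(self, t, state):
        # Assumption: timestamps arrive in order, so appending keeps the list sorted.
        if self.times and t < self.times[-1]:
            raise ValueError("timestamps must be recorded in order")
        self.times.append(t)
        self.states.append(state)

    def flanking(self, t):
        """Return the entries at-or-before and at-or-after t (None if missing)."""
        i = bisect_left(self.times, t)  # first index with times[i] >= t
        after = (self.times[i], self.states[i]) if i < len(self.times) else None
        if after is not None and after[0] == t:
            return after, after  # exact hit: both bounds are the same entry
        before = (self.times[i - 1], self.states[i - 1]) if i > 0 else None
        return before, after

    def clear_before(self, t):
        # Drop every entry strictly older than t.
        i = bisect_left(self.times, t)
        del self.times[:i]
        del self.states[:i]

    def clear_after(self, t):
        # Drop every entry strictly newer than t.
        i = bisect_right(self.times, t)
        del self.times[i:]
        del self.states[i:]
```

Replaying (#3) is then just walking forward from the index that `bisect_left` returns. One caveat: deleting from the front of a list is still O(n), so frequent `clear_before` calls may call for a ring buffer or a start offset instead.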

Related

Best way of storing an array in an SQL database?

For an Android Launcher (Home Screen) app project I want to implement a feature called "Sort by usage". This will sort by the launch count of an app within a user-settable timeframe.
The current idea for the implementation is to store an array of Unix epoch timestamps, one for each launch.
Additionally, it'll store a counter caching the current number of launches within the selected timeframe, incremented with every launch. Of course, this would regularly have to be rebuilt as time passes, but only every few hours or at least x percent of the selected timeframe, so computations definitely wouldn't run as often as without the counter, since this information is required every time any app entries on screen need to get sorted. I'm not quite sure, though, whether it matters in any way during actual use.
I am now unsure how to store the timestamp array inside the SQL database. As there is a table holding one record with information about each launcher entry, I thought about the following options:
1. Store the array of Unix epochs in serialized form (maybe a JSON array) in one field of the entry's record
2. Create a separate table for launch times with
   a. each record starting with an id associated with an entry followed by all launch times, one for each field
   b. each record a combination of entry id and one launch time
   These options would obviously have the advantage of storing the timestamp using an appropriate type
I probably didn't quite understand why you need a second piece of data for your launch counter - the fact you saved a timestamp already means a launch - why not just count timestamps? Less updating, less record locking, more concurrency.
Now, let's say you've got a separate table with timestamps in a classic one to many setting.
Pros of this setup - you never need to update anything - just keep inserting. You can easily cluster your table by timestamp, run a filter on your timeframe and issue a group by and count rows. The client then will get the numbers and sort by count (I believe it's generally better to not sort in SQL). Cons - you need a join to parent table and probably need to get your indexes right.
Alternatively you store timestamps in a blob text (JSON, CSV, whatever) with your main records. This definitely means you'll have to update your records a lot, which potentially opens you up to locking issues. Then, I'm not entirely sure what you'll have to do to get your final launch counts - you read all entities, deserialise all timestamps, filter by timeframe and then count? It does feel a bit more convoluted in your case.
I don't think there's such a thing as a "best" way. You have to consider pros and cons. From what I gather, you might be better off with the classic SQL approach unless there's something I didn't catch that outweighs my points above.
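
To make the one-to-many setup concrete, here is a rough sketch using Python's built-in `sqlite3` module with an in-memory database. The table names (`entry`, `launch`), column names, and sample rows are made up for illustration; the real schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE launch (
        entry_id    INTEGER NOT NULL REFERENCES entry(id),
        launched_at INTEGER NOT NULL  -- Unix epoch seconds
    );
    CREATE INDEX launch_by_time ON launch (launched_at, entry_id);
""")
conn.executemany("INSERT INTO entry VALUES (?, ?)",
                 [(1, "Mail"), (2, "Maps"), (3, "Camera")])
# Recording a launch is a plain insert; nothing is ever updated.
conn.executemany("INSERT INTO launch VALUES (?, ?)",
                 [(1, 1000), (1, 2000), (1, 3000), (2, 2500), (2, 2600), (3, 500)])


def launch_counts(conn, since):
    """Count launches per entry at or after `since`, most used first."""
    rows = conn.execute("""
        SELECT e.label, COUNT(l.entry_id)
        FROM entry AS e
        LEFT JOIN launch AS l
               ON l.entry_id = e.id AND l.launched_at >= ?
        GROUP BY e.id, e.label
    """, (since,)).fetchall()
    # Sort on the client: highest count first, label as a tie-breaker.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


print(launch_counts(conn, 1500))
```

The `LEFT JOIN` keeps entries with zero launches in the result, so they still show up (at the bottom) when sorting.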

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Election Commission public data source API, which has a unique record identifier called "sub_id" that is a 19-digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19-digit integer could theoretically range anywhere from 0000000000000000001 to 9999999999999999999. Now I know that the data source in question does not actually have nearly 10 quintillion records, so the IDs are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the IDs are date-based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306436 different Redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54.) I could probably work up a tiny algorithm that, given a 19-digit ID, divides by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you BF.CREATE your filter, you pass to BF.ADD an 'item' to be inserted. This item can be as long as you want. The filter uses hash functions and modulus to fit it to the filter size. When you want to check if the item was already checked, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
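
To show the "hash functions and modulus" idea in isolation, here is a toy Bloom filter in pure Python. This is not how RedisBloom is implemented and not its API; the class, the SHA-256 double-hashing scheme, and the sizes are all assumptions chosen for a short example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest (double hashing),
        # then reduce each with a modulus so it fits the filter size.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(num_bits=8192, num_hashes=5)
seen.add(4123120171499720404)
print(seen.might_contain(4123120171499720404))  # True: added items are always found
```

Note that memory use depends on the filter size you pick, not on how large the IDs are, which is why a 19-digit key space is not a problem here.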
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires 1250 petabytes at least. This makes it impractical (atm) to store it in memory.
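
The arithmetic behind that figure, plus the split-key mapping the question describes, can be checked in a few lines of Python. The 2^32-bit limit per bitmap comes from the question; the function name is made up.

```python
BITMAP_BITS = 2 ** 32   # maximum bits in one Redis bitmap, as quoted in the question
ID_SPACE = 10 ** 19     # number of possible 19-digit values, including 0


def bitmap_slot(sub_id):
    """Map a sub_id to (index of the split bitmap, bit offset inside it)."""
    return divmod(sub_id, BITMAP_BITS)


bitmaps_needed = -(-ID_SPACE // BITMAP_BITS)  # ceiling division
total_bytes = ID_SPACE // 8                   # one bit per possible ID
petabytes = total_bytes / 10 ** 15

print(bitmaps_needed)                    # 2328306437
print(petabytes)                         # 1250.0
print(bitmap_slot(4123120171499720404))  # (bitmap index, offset) for the largest ID on file
```

Each full bitmap is 2^32 / 8 bytes = 512 MiB, so even touching a small fraction of those 2.3 billion keys gets expensive quickly.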
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the IDs are not sequential and are very spread out, keeping track of which ones you have processed with a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point to the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple set in Redis may be enough.

Plone - ZODB catalog query sort_on multiple indexes?

I have a ZODB catalog query with a start and end date. I want to sort the result on end_date first and then start_date second.
Sorting on either end_date or start_date works fine.
I tried with a tuple (start_date,end_date), but with no luck.
Is there a way to achieve this, or does one have to employ some custom logic afterwards?
The generalized answer ought to be post-hoc-sort of your entire result set of catalog brains, use zope.sequencesort (via PyPI, but already shipped with Plone) or similar.
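
As a rough illustration of the post-hoc sort using only the standard library (not zope.sequencesort): sorting on a tuple key gives end_date first, then start_date. The `Brain` namedtuple below is just a stand-in for real catalog brains, and it assumes `start_date` and `end_date` are available as attributes (metadata columns) on them.

```python
from collections import namedtuple
from datetime import date

# Stand-in for catalog brains; real brains come from the catalog query.
Brain = namedtuple("Brain", ["Title", "start_date", "end_date"])


def sort_brains(brains):
    """Sort by end_date first, then by start_date to break ties."""
    return sorted(brains, key=lambda b: (b.end_date, b.start_date))


results = [
    Brain("A", date(2024, 1, 5), date(2024, 2, 1)),
    Brain("B", date(2024, 1, 1), date(2024, 2, 1)),
    Brain("C", date(2024, 1, 10), date(2024, 1, 20)),
]
print([b.Title for b in sort_brains(results)])  # ['C', 'B', 'A']
```

Keep in mind that this materializes the whole result set, which is exactly the cost the optimizations below try to avoid.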
The more complex answer is a rabbit-hole of optimizations that you should only go down if you know you need to and know what you are doing:
Make sure when you do sort the brains that your user gets a sticky session to the same instance, at least for cache-affinity to get the same catalog indexes and brains (metadata);
You might want to cache across requests (thread-global) a unique session id, and a sequence of catalog RID (integer) values for your entire sorted request, should you expect the user to come back and need in subsequent batches. Of course, RIDs need to be re-constituted into ZCatalog's lazy-sequences of brains, and this requires some know-how (or reading the source).
Finally, for large result sets (many thousands), I would suggest that it is reasonable to make application-specific compromises that approximate correctness by post-hoc sorting of the current batch through to the end of the n batches after it, where n is inversely proportional to len(site.portal_catalog.uniqueValuesFor(indexnamehere)). For a large set of results, the correctness of an approximated secondary sort is high for high variability and low for low variability (many items sharing the same secondary value, such that the count is much larger than the batch size, can make this frustrating).
Do not optimize as such unless you are dealing with particularly large result sets.
It should go without saying: if you do optimize, you need to verify that you are actually getting a superior result (profile and benchmark). If you cannot justify investing the time to do this, you cannot justify optimizing.

What is the conventional way to store objects in a sorted set in redis?

What is the most convenient/fast way to implement a sorted set in Redis where the values are objects, not just strings?
Should I just store object id's in the sorted set and then query every one of them individually by its key or is there a way that I can store them directly in the sorted set, i.e. must the value be a string?
It depends on your needs. If you need to share this data with other zsets/structures and want to write the value only once for every change, you can put an id as the zset value and add a hash to store the object. However, this implies making additional queries when you read data from the zset (one zrange + n hgetall for n values in the zset), but writing and synchronising the value between many structures is cheap (only updating the hash corresponding to the value).
But if it is "self-contained", with no or few accesses outside the zset, you can serialize to a chosen format (JSON, MESSAGEPACK, KRYO...) your object and then store it as the value of your zset entry. This way, you will have better performance when you read from the zset (only 1 query with O(log(N)+M), it is actually pretty good, probably the best you can get), but maybe you will have to duplicate the value in other zsets / structures if you need to read / write this value outside, which also implies maintaining synchronisation by hand on the value.
Redis has good documentation on performance of each command, so check what queries you would write and calculate the total cost, so that you can make a good comparison of these two options.
Also, don't forget that redis comes with optimistic locking, so if you need pessimistic (because of contention for instance) you will have to do it by hand and/or using lua scripts. If you need a lot of sync, the first option seems better (less performance on read, but still good, less queries and complexity on writes), but if you have values that don't change a lot and memory space is not a problem, the second option will provide better performance on reads (you can duplicate the value in redis, synchronize the values periodically for instance).
Short answer: Yes, everything must be stored as a string.
Longer answer: you can serialize your object into any format of your choosing (Redis strings are binary-safe). Most people choose MsgPack or JSON because they are compact and serializers are available in just about any language.

SQL Read/Write efficiency

Is there any difference in the performance of read and write operations in SQL? Using Linq to SQL in an ASP.NET MVC application, I often update many values in one of my tables in single posts (during this process, many posts of this type will come in rapidly from the user, although the user is unable to submit new data until the previous update is complete). My current implementation is to loop through the input (a list of the current values for each row), and write them to the field (nullable int). I wonder if there would be any performance difference if instead I read the current db value, and only wrote if it has changed. Most of these operations change the values for roughly 1/4 to 2/3 of the rows, some change fewer, and few change more than 2/3 of the rows.
I don't know much about the comparative speeds of these operations (or if there is even any difference). Is there any benefit to be gained from doing this? If so, what table sizes would benefit the most/not benefit at all, and would there be any percentage of the rows changing that would be a threshold for this improvement?
It's always faster to read.
A write is actually always a read followed by a write.
SQL needs to know which row to write to, which involves reading either an index or the table itself in a seek or scan operation, then writing to the appropriate row.
Writing also needs to update any applicable indexes. Depending on the circumstance, the index may get "updated" even when the data doesn't change.
As a very general rule, it's a good idea only to modify the data that needs to be changed.
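
A minimal sketch of "read first, write only what changed", using Python's `sqlite3` with an in-memory table rather than Linq to SQL. The `item` table and `score` column are invented for the example, and whether this beats writing every row depends on your indexes and data, so measure before committing to it.

```python
import sqlite3


def update_changed(conn, new_values):
    """Write only rows whose stored value differs; return how many were written.

    new_values maps row id -> new value (None for NULL).
    """
    # One read for all current values instead of one read per row.
    current = dict(conn.execute("SELECT id, score FROM item"))
    changed = [(value, row_id)
               for row_id, value in new_values.items()
               if row_id in current and current[row_id] != value]
    conn.executemany("UPDATE item SET score = ? WHERE id = ?", changed)
    conn.commit()
    return len(changed)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(1, 10), (2, None), (3, 30), (4, 40)])

# Rows 1 and 4 are unchanged, so only rows 2 and 3 are written.
print(update_changed(conn, {1: 10, 2: 5, 3: None, 4: 40}))  # 2
```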