Is there a NO KEY or auto-increment strategy in Aerospike that I can use for unique keys?

We are trying to implement a logging or statistics feature with Aerospike. We have logged-in users and anonymous users making queries to our main database, and we want to store every request that is made.
Our best approach so far is to store records with the UserID as the key and the query keywords as a list, like this:
{
  Key: 'alacret'
  Bins: {
    searches: [
      "something to search 1",
      "something to search 2",
      "something to search 3",
      ...
    ]
  }
}
As the application architect, reviewing this, I see several performance/design pitfalls:
1) Retrieving and storing are two operations; getting the whole list, appending, and putting it back again seems inefficient or suboptimal.
2) Doing two operations means I have to wrap both in a transaction to prevent race conditions, which I think would kill Aerospike's performance.
3) The documentation states that Lists are data structures for size-bounded data, so if I understand correctly this is not going to scale well, especially for anonymous users, whose lists would grow without bound.
As an alternative, I'm proposing to move the userID into a bin and generate a key that prevents race conditions, so the save stays a single operation rather than several wrapped in a transaction.
So, what I'm looking for are opinions and validation.
Greetings

You can append to the list or prepend to it. You can also bound it by trimming, if beyond a certain limit you don't care to keep older search items, i.e. you only want to store, say, the 100 most recent items in the userID's search list. You can do the append and the trim, then read back the updated list, all under one record lock. If you are storing on disk, the record size is limited to 1 MB including all overhead; you can store much larger records if the data lives only in RAM (storage-engine memory). Does that suit your application's needs?
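For concreteness, here is a minimal sketch of that pattern with the Aerospike Java client, assuming a hypothetical namespace test, set user_searches, and bin searches: it prepends the newest term, keeps only the 100 most recent entries, and reads the updated list back, all in a single operate() call.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;

public class SearchLog {
    public static void main(String[] args) {
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            // Hypothetical namespace/set names -- adjust to your cluster.
            Key key = new Key("test", "user_searches", "alacret");

            // Prepend the newest term, trim to the 100 most recent entries,
            // and read the result back: one round trip, one record lock.
            Record record = client.operate(null, key,
                ListOperation.insert("searches", 0, Value.get("something to search 4")),
                ListOperation.trim("searches", 0, 100),
                Operation.get("searches"));

            System.out.println(record.getList("searches"));
        }
    }
}

Because all three operations run inside one operate() call, the server applies them atomically to the record, so there is no read-modify-write race to worry about.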

Related

Sorting in application vs sorting in DB

When querying for the top N results, I can ask the DB to sort the results OR I can sort them myself.
I have read a lot about the performance and memory advantages the DB has over in-app sorting. However, assuming I write optimal sorting code, isn't the performance the same in both options?
Both are using the same CPU, both can allocate threads and both can allocate more space in memory to perform the sort.
All the answers I found on the subject are more or less the same, saying
"just let the DB do it, it will do it better than you", or
"the rule of thumb is do anything in the DB unless a specific need arises such as complex sorts..."
So, why choose DB sorting over in-app sorting (besides saving network bandwidth by not fetching millions of table entries to sort)?
With an app sort you need to transfer all the data; with a database sort you only need to transfer N rows!
The database already implements the most efficient sort algorithms.
If an index already exists, the DBMS can return the top N without sorting the data at all.
Edit:
If your dataset is very small, it can be stored in memory client-side and then sorted by the app. This can be a good solution if you need to reorder the data without re-fetching it from the DB.
In any other case, use the DB sort.
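As a minimal illustration of the "transfer only N rows" point, here is a JDBC sketch against a hypothetical results(id, score) table (the connection URL and names are placeholders): only the ten requested rows cross the network, and with an index on score the DBMS can often serve them without a full sort.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TopN {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL; the table and columns are hypothetical.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, score FROM results ORDER BY score DESC LIMIT ?")) {
            ps.setInt(1, 10); // ask the database for the top 10 only
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Only these 10 rows are ever transferred to the application.
                    System.out.println(rs.getLong("id") + " " + rs.getDouble("score"));
                }
            }
        }
    }
}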

Correct modeling in Redis for writing single entity but querying multiple

I'm trying to move data that currently lives in a SQL DB into Redis, in order to gain much higher throughput, because this is a very high-throughput workload. I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender.
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each fetching 20 rows.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates - I would use Hashes.
But because of the 10% of Reads I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record in a blob associated with a key (i.e. the string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash datatype). All write operations can be done in one roundtrip (even if they are partial). Several read operations can also be done in one roundtrip, provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU-consuming than using MGET, but not by much. In terms of latency, it should not be significantly different, unless you execute hundreds of thousands of them per second on the Redis instance.
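Here is a minimal Jedis sketch of both layouts, using hypothetical user:<id> keys: option 1 writes each record as one blob and reads a batch with MGET; option 2 keeps a hash per user for partial updates and pipelines the HGETALL calls so a 20-row read is still one round trip.

import java.util.List;
import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class UserStore {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Option 1: whole record as a blob (no partial updates possible).
            jedis.set("user:1", "{\"name\":\"Ann\",\"phone\":\"555-0100\",\"gender\":\"F\"}");
            jedis.set("user:2", "{\"name\":\"Bob\",\"phone\":\"555-0101\",\"gender\":\"M\"}");
            List<String> blobs = jedis.mget("user:1", "user:2"); // one round trip for the batch

            // Option 2: hash per user (partial updates possible).
            jedis.hset("user:h:1", "phone", "555-0200"); // partial write, one round trip
            Pipeline p = jedis.pipelined();              // batch the reads
            Response<Map<String, String>> u1 = p.hgetAll("user:h:1");
            Response<Map<String, String>> u2 = p.hgetAll("user:h:2");
            p.sync();                                    // single round trip for all HGETALLs
            System.out.println(blobs + " " + u1.get() + " " + u2.get());
        }
    }
}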

Should I be accessing Map Reduce output as a Mongoid Document?

So my map-reduce operation sums up a list of micro payments into a lump sum that I owe a particular user. The user_id ends up being the _id. I also store an array of ids of the micro payments that need to be paid. The output goes into a permanent collection called payments.
The output looks like this for one document
{ "_id" : ObjectId("4f48855606164f4765000004"), "value" : { "payment" : "5.0", "conversions" : [ ObjectId("4f5bd23baa113e964700000e") ] } }
I'd kind of like to track these payments, so I was thinking about just building a Mongoid document around the payments collection. I kind of know it can be done, but I haven't really seen anyone doing it, which makes me think there must be a better way.
Also, one problem with this approach is that I'm making the payments every month, so the _id being the user_id is going to conflict. Additionally, I think there is a possible transaction problem, because I need to update the micro payments to a different state so I know never to pay them again; and what happens if one of the payments fails? These states change via state_machine, if that makes any difference.
Should I be accessing Map Reduce output as a Mongoid Document?
Sure, you can definitely do this. That's kind of the reason M/R output goes to a collection rather than just "some file".
Also one problem with this approach is I'm making the payments every month so the _id being the user_id is going to conflict.
So clearly, the output of your M/R is important data. Do not leave this data in a collection that could be "hammered" by a future M/R. Instead, rename the collection you have created, or run a for loop that manually appends the data to a "keeper" collection.
In the "keeper" collection change the _id to something like _id: { uid: ObjectId('...'), month: "201203" }. You may also want to "fan out" the values field into several fields. And you will need to add a field for transaction ID.
Also remember that MongoDB uses "fire & forget" writes by default. These are low safety. You have financial data, so ensure that you are following all of the best practices for availability and data safety:
Journaling On
Replica Sets (with secondary data center)
Ensure that all writes to this collection/db are done with w: majority and journal: true. This will slow down DB throughput on this operation as these writes can take a few hundred milliseconds.
Database passwords
Non-standard MongoDB port, IP white-listing (usual DB security)
what happens if one of the payments fails?
This is a non-trivial problem and far too complicated to explain here. Instead, see this document on two-phase commit with MongoDB.
Note that two-phase commit requires MongoDB's findAndModify command. You will have to learn how to handle this with Mongoid.
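Tying the compound _id and write-safety points together, here is a minimal sketch with the MongoDB Java driver (the question uses Mongoid, so treat this purely as an illustration; the database name, collection name, and state field are hypothetical): one document per user per month, written with a majority, journaled write concern.

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import java.util.Arrays;
import org.bson.Document;
import org.bson.types.ObjectId;

public class PaymentsKeeper {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> payments = client.getDatabase("billing")
                    .getCollection("payments_keeper")
                    .withWriteConcern(WriteConcern.MAJORITY.withJournal(true));

            // Compound _id: one document per user per month, so monthly runs never collide.
            Document id = new Document("uid", new ObjectId("4f48855606164f4765000004"))
                    .append("month", "201203");

            payments.insertOne(new Document("_id", id)
                    .append("payment", "5.0")
                    .append("conversions",
                            Arrays.asList(new ObjectId("4f5bd23baa113e964700000e")))
                    .append("state", "pending")); // updated as the two-phase commit progresses
        }
    }
}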

LIST alternative in redis

From redis.io:
The main features of Redis Lists from the point of view of time complexity is the support for constant time insertion and deletion of elements near the head and tail, even with many millions of inserted items. Accessing elements is very fast near the extremes of the list but is slow if you try accessing the middle of a very big list, as it is an O(N) operation.
What is the LIST alternative when the data volume is very high and writes are less frequent than reads?
This is something I'd definitely benchmark before doing, but if you're really hitting a performance issue accessing items in the middle of the list, there are a couple of alternatives that really depend on your use case.
Don't make a list that big; age out/trim pieces that don't matter any more.
Memoize hot sections of the list. If a particular paginated range is being requested much more often than others, make that its own list. Check whether it already exists, and if it doesn't, create a subset of your list covering that paginated range.
Bucket your list from the beginning into "manageable sizes" (for whatever your definition of manageable is). If a list is purely additive (no removal from the list), you could use the modulus of an item's index as part of the key so that your list is stored in smaller buckets. Ex: key = "your_key_name_" + index % 100000
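As a rough sketch of the bucketing idea in Jedis: the key prefix and bucket size below are placeholders, and it uses integer division of the item's index rather than the modulus shown above, so that consecutive items land in the same bucket and a page can be read from one short list.

import redis.clients.jedis.Jedis;

public class BucketedList {
    // Placeholder bucket size; tune it to whatever "manageable" means for your data.
    private static final long BUCKET_SIZE = 100_000;

    // Derive the bucket key from the item's global index (integer division, not modulus).
    static String bucketKey(String baseKey, long index) {
        return baseKey + "_" + (index / BUCKET_SIZE);
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long globalIndex = 1_234_567;
            // Append the item to its bucket; each bucket stays a small list.
            jedis.rpush(bucketKey("your_key_name", globalIndex), "item-" + globalIndex);
            // Read a page from "the middle" of the logical list by touching only one bucket.
            System.out.println(
                    jedis.lrange(bucketKey("your_key_name", globalIndex), 0, 49));
        }
    }
}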

Lucene Indexing

I would like to use Lucene to index a table in an existing database. I have been thinking the process goes something like this:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and if a hit is found for a search, query the data base to get the relevant information out
I realize there may not be a one-size-fits-all answer...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger, and probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase the index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put real stress on the GC;
stored fields are likely to impact reader reopen time.
The final answer, of course, is that it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can judge whether it actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing too much data to the inverted index.
Likewise, you can judge whether a column will be queried by key-value lookup. If not, you needn't use Store.YES to store the data.
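As a sketch of option 2 using the current Lucene field classes (TextField/StringField, which replaced the older Store/Index flags), with a hypothetical index path and column names: the huge column is analyzed so it is searchable but not stored, while the primary key is stored so a hit can be resolved against the database.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class RowIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/table-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            // Primary key: not analyzed, stored, so a hit can be mapped back to the DB row.
            doc.add(new StringField("id", "42", Field.Store.YES));
            // Small columns: analyzed and stored.
            doc.add(new TextField("title", "some short title", Field.Store.YES));
            // The huge column: analyzed (searchable) but NOT stored -- fetch the
            // full text from the database when a search hits this document.
            doc.add(new TextField("body", "very large text ...", Field.Store.NO));

            writer.addDocument(doc);
        }
    }
}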