Optimizing 4 Redis sorted sets

I'm currently using Redis as a time sorted index for my Mongo database. My mongo document looks something like this:
{
    _id : MongoId,
    title : "Title",
    publishDate : new Date().getTime(),
    attachments : [
        {
            title : "Title",
            type : "VIDEO"
        },
        {
            title : "Title",
            type : "PHOTO"
        },
        {
            title : "Title",
            type : "LINK"
        }
    ]
}
The attachments array can contain 0 to 3 items (no attachment; only a link, only a video, or only a photo; 2 of different types; or 3 of different types).
Each time a document is added to Mongo it's automatically added to a sorted set in Redis.
The publishDate (unix timestamp) is used as the score and MongoId is used as the sorted set member. The key has the name "monitor:latest:feed"
If the attachments array contains a link, I add the same member to a sorted set with the key "monitor:latest:feed:links". The same goes for videos and photos.
The result is that I have 4 Redis sorted sets, each ordered by time.
The sets are:
"monitor:latest:feed" that contains all of the document ids in mongodb.
"monitor:latest:feed:links" contains all the document ids that have an attachment with the type link.
"monitor:latest:feed:photos" contains all the document ids that have an attachment with the type photo.
"monitor:latest:feed:video" contains all the document ids that have an attachment with the type video.
Is there a way I could remove the last 3 sorted sets, or use some other data structure that takes less memory in Redis, keeping in mind that I would still need the members sorted by time somehow?

Related

Is it correct to do 1-to-1 mapping in Update API request param

There is a need for me to do bulk update of user details.
Let the object details have the following fields,
User First Name
User ID
User Last Name
User Email ID
User Country
An admin can upload updated user data through a CSV file. Values with mismatching data need to be updated. The most probable request format for this bulk update will look like this (Method 1):
"data" : {
"userArray" : [
{
"id" : 2343565432,
"f_name" : "David",
"email" : "david#testmail.com"
},
{
"id" : 2344354351,
"country" : "United States",
}
.
.
.
]
}
Method 2: I would send the details as pairs of parallel arrays, each entry containing a list of user ids and the corresponding values for a single field:
"data" : {
"userArray" : [
{
"ids" : [23234323432, 4543543543, 45654543543],
"country" : ["United States", "Israel", "Mexico"]
},
{
"ids" : [2323432334543, 567676565],
"email" : ["groove#drivein.com", "zara#foobar.com"]
},
.
.
.
]
}
In method 1, I need to query the database for every user update, so the number of queries grows with the number of users edited. In contrast, if I use method 2, I query the database only once per param (I add the array to the query and, in a single query, get all rows whose user id is present in that array), and then I can update each row with its respective details.
However, most of the update APIs I see on the internet take params in the format of method 1, which gives good readability. What would be the advantage of going with method 1 rather than method 2? (Method 2 saves some query time when the number of users is large, which can improve performance.)
I almost always see it done in the method 1 style.
With that said, I don't understand why your DB performance depends on the way the input data is structured; that's just the way information gets into your code.
You can have the client send the data as method 1 and then shim it into method 2 on the backend if that helps you structure the DB queries better.
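For example, a minimal sketch of that shim (function and field names are made up, not from the question):

// Regroup a Method 1 payload so each field can be updated in bulk.
// Input:  [{ id, f_name, email, ... }, ...]
// Output: { email: { "david@testmail.com": [id, ...] }, country: { ... }, ... }
function regroupByField(userArray) {
    const byField = {};
    for (const { id, ...fields } of userArray) {
        for (const [field, value] of Object.entries(fields)) {
            byField[field] = byField[field] || {};
            (byField[field][value] = byField[field][value] || []).push(id);
        }
    }
    return byField;
}

// Usage: one bulk UPDATE per (field, value) pair instead of one query per user.
const grouped = regroupByField([
    { id: 2343565432, f_name: "David", email: "david@testmail.com" },
    { id: 2344354351, country: "United States" },
]);
// grouped.country["United States"] -> [2344354351]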

How to add a custom field in the significant terms aggregation bucket

I am using the significant terms aggregation in ElasticSearch and I am wondering if it is possible to add a custom field to each element in the bucket. Currently, the bucket element looks like this:
{
    "key" : "Q",
    "doc_count" : 4,
    "score" : 4.731818103557571,
    "bg_count" : 22
},
By default there are 4 fields, and I want to add something here that is computed from the documents belonging to the bucket (4 documents in the example above).
I found that there is a way to customize the score itself, but that's not what I want. What I want is to add a new custom field that is calculated from the documents having the same key.
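For reference, nesting a sub-aggregation under the significant terms aggregation is one way to attach a per-bucket value computed from that bucket's documents; a sketch of the request body (the aggregation and field names here are made up):

const body = {
    aggs: {
        significant_keys: {
            significant_terms: { field: "key" },
            aggs: {
                // computed from the documents that fall into each bucket
                avg_rating: { avg: { field: "rating" } }
            }
        }
    }
};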

In RavenDB can I retrieve all document Id's for a document type

My Scenario:
I have a few thousand documents that I want to alter (rename & add properties). I have written a PatchRequest to alter a document, but this takes a document Id.
I'm looking for a way to get a list of document Ids for all documents of a specific type. Any ideas?
If possible I'd like to avoid retrieving the document from the server.
I have written a PatchRequest to alter the document but this takes a document Id.
No, .Patch takes a document ID, not the PatchRequest.
Since you want to update a whole swath of documents, you'll want to use the .UpdateByIndex method:
documentStore.DatabaseCommands.UpdateByIndex("IndexName",
    new IndexQuery { Query = "Title:RavenDB" },
    new[]
    {
        new PatchRequest
        {
            Type = PatchCommandType.Add,
            Name = "Comments",
            Value = "New automatic comment we added programmatically"
        }
    }, allowStale: false);
This will allow you to patch all documents matching an index. That index can be whatever you please.
For more information, see Set-Based Operations in the Raven docs.

MongoDB Update / Upsert Question - Schema Related

I have a problem representing data in MongoDB. I was using this schema design, where the combination of date and word is unique:
{'date': '2-1-2011',
 'word': 'word1',
 'users': ['user1', 'user2', 'user3', 'user4']}

{'date': '1-1-2011',
 'word': 'word2',
 'users': ['user1', 'user2']}
There are a fixed number of dates, approximately 200; potentially 100k+ words for each date; and 100k+ users.
I inserted records with an algorithm like so:
while records exist:
    message, user, date = pop a record off a list
    words = set(tokenise(message))
    for word in words:
        collection1.insert({'date': date, 'word': word}, {'user': user})
        collection2.insert('something similar')
        collection3.insert('something similar again')
        collection4.insert('something similar again')
However, this schema resulted in extremely large collections and terrible performance. I am inserting different information into each of the four collections, so it amounts to an extremely large number of operations on the database.
I'm considering representing the data in a format like so, where the words and users arrays are sets.
{'date': '26-6-2011',
 'words': {
     'word1': ['user1', 'user2'],
     'word2': ['user1'],
     'word3': ['user1', 'user2', 'user3']
 }}
The idea behind this was to cut down on the number of database operations, so that for each loop of the algorithm I perform just one update per collection. However, I am unsure how to perform an update / upsert on this, because with each loop of the algorithm I may need to insert a new word, a new user, or both.
Could anyone recommend either a way to update this document, or could anyone suggest an alternative schema?
Thanks
Upsert is well suited for dynamically extending documents. Unfortunately, I have only found it to work properly if you have an atomic modifier operation in your update object, like the $addToSet here (mongo shell code):
db.words is empty; add the first document for a given date with an upsert:
var query = { 'date' : 'date1' }
var update = { $addToSet: { 'words.word1' : 'user1' } }
db.words.update(query,update,true,false)
Check the object:
db.words.find();
{ "_id" : ObjectId("4e3bd4eccf7604a2180c4905"), "date" : "date1", "words" : { "word1" : [ "user1" ] } }
Now add some more users to the first word, and another word, in one update:
var update = { $addToSet: { 'words.word1' : { $each : ['user2', 'user4', 'user5'] }, 'words.word2': 'user3' } }
db.words.update(query,update,true,false)
Again, check the object:
db.words.find()
{ "_id" : ObjectId("4e3bd7e9cf7604a2180c4907"), "date" : "date1", "words" : { "word1" : [ "user1", "user2", "user4", "user5" ], "word2" : [ "user3" ] } }
I'm using MongoDB to insert 105 million records with ~10 attributes each. Instead of updating this dataset with changes, I just delete and re-insert everything. I found this method to be faster than individually touching each row to see if it was one that I needed to update. You will get better insert speeds if you create JSON-formatted text files and use MongoDB's mongoimport tool:
Format your data into JSON text files (one file per collection).
Run mongoimport on each file and specify the collection you want it inserted into.
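A minimal sketch of that flow (file, database, and collection names here are examples only):

const fs = require("fs");

const docs = [
    { date: "26-6-2011", words: { word1: ["user1", "user2"] } },
    { date: "27-6-2011", words: { word2: ["user1"] } },
];

// mongoimport expects one JSON document per line by default.
fs.writeFileSync("words.json", docs.map(d => JSON.stringify(d)).join("\n"));

// Then, from the shell:
//   mongoimport --db mydb --collection words --file words.json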

What are the resources or tools used to manage temporal data in key-value stores?

I'm considering using MongoDB or CouchDB on a project that needs to maintain historical records. But I'm not sure how difficult it will be to store historical data in these databases.
For example, in his book "Developing Time-Oriented Database Applications in SQL," Richard Snodgrass points out tools for retrieving the state of data as of a particular instant, and he points out how to create schemas that allow for robust data manipulation (i.e. data manipulation that makes invalid data entry difficult).
Are there tools or libraries out there that make it easier to query, manipulate, or define temporal/historical structures for key-value stores?
edit:
Note that from what I hear, the 'version' data that CouchDB stores is erased during normal use, and since I would need to maintain historical data, I don't think that's a viable solution.
P.S. Here's a similar question that was never answered: key-value-store-for-time-series-data
There are a couple of options if you want to store the data in MongoDB. You could just store each version as a separate document; then you can query to get the object at a certain time, the object at all times, objects over ranges of time, etc. Each document would look something like:
{
    object : whatever,
    date : new Date()
}
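A query for the version that was current at some time t might then look like this (a sketch; the collection name and the field used to identify the object are assumed):

// Find the latest version stored at or before time t.
db.versions.find({ "object._id" : someId, date : { $lte : t } })
           .sort({ date : -1 })
           .limit(1)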
You could store all the versions of a document in the document itself, as mikeal suggested, using updates to push the object itself into a history array. In Mongo, this would look like:
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
// make changes to obj
...
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
A cooler (I think) and more space-efficient way, although less time-efficient, might be to store a history in the object itself about what changed in the object at each time. Then you could replay the history to build the object at a certain time. For instance, you could have:
{
    object : startingObj,
    history : [
        { date : d1, addField : { x : 3 } },
        { date : d2, changeField : { z : 7 } },
        { date : d3, removeField : "x" },
        ...
    ]
}
Then, if you wanted to see what the object looked like between time d2 and d3, you could take the startingObj, add the field x with the value 3, set the field z to the value of 7, and that would be the object at that time.
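A replay helper might look roughly like this (a sketch assuming the addField / changeField / removeField entry shapes above):

// Rebuild the object as it was at time t by replaying the history entries.
function objectAt(doc, t) {
    var obj = Object.assign({}, doc.object);
    doc.history.forEach(function (entry) {
        if (entry.date > t) return;                    // skip changes made after t
        if (entry.addField)    Object.assign(obj, entry.addField);
        if (entry.changeField) Object.assign(obj, entry.changeField);
        if (entry.removeField) delete obj[entry.removeField];
    });
    return obj;
}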
Whenever the object changed, you could atomically push actions to the history array:
db.foo.update({object : startingObj}, {$push : {history : {date : new Date(), removeField : "x"}}})
Yes, in CouchDB the revisions of a document are there for replication and are usually lost during compaction. I think UbuntuOne did something to keep them around longer but I'm not sure exactly what they did.
I have a document that I need the historical data on and this is what I do.
In CouchDB I have an _update function. The document has a "history" attribute, which is an array. Each time I call the _update function to update the document, I append the current document (minus the history attribute) to the history array, then I update the document with the changes in the request body. This way I have the entire revision history of the document.
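Roughly, the _update handler looks like this (a sketch, not the exact code; the changes are assumed to arrive as a JSON request body):

function (doc, req) {
    var changes = JSON.parse(req.body);
    if (!doc) {
        // first save: create the document with an empty history
        doc = { _id: req.uuid, history: [] };
    } else {
        // snapshot the current state (minus the history attribute) before applying changes
        var snapshot = {};
        for (var k in doc) {
            if (k !== "history") snapshot[k] = doc[k];
        }
        doc.history.push(snapshot);
    }
    for (var key in changes) {
        doc[key] = changes[key];
    }
    return [doc, "ok"];
}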
This is a little heavy for large documents; there are some JavaScript diff tools I was investigating, and I was thinking about only storing the diff between the documents, but I haven't done that yet.
http://wiki.apache.org/couchdb/How_to_intercept_document_updates_and_perform_additional_server-side_processing
Hope that helps.
I can't speak for MongoDB, but for CouchDB it all really hinges on how you write your views.
I don't know the specifics of what you need but if you have a unique id for a document throughout its lifetime and store a timestamp in that document then you have everything you need for robust querying of that document.
For instance:
document structure:
{ "docid" : "doc1", "ts" : <unix epoch> ...<set of key value pairs> }
map function:
function (doc) {
    if (doc.docid && doc.ts) {
        emit([doc.docid, doc.ts], doc);
    }
}
The view will now output each doc and its revisions in historical order like so:
["doc1", 1234567], ["doc1", 1234568], ["doc2", 1234567], ["doc2", 1234568]
You can use view collation and start_key or end_key to restrict the returned documents.
start_key=["doc1", 1] end_key=["doc1", 9999999999999]
will return all historical copies of doc1
start_key=["doc2", 1234567] end_key=["doc2", 123456715]
will return all historical copies of doc2 between 1234567 and 123456715 unix epoch times.
see ViewCollation for more details