John Nunemaker has a blog post with some nice tips about Mongo ObjectIds -- http://mongotips.com/b/a-few-objectid-tricks/ -- in particular I was interested in the tip about generation_time. He suggests it's not necessary to explicitly store the created_at time in Mongo documents because you can always pull it from the ID, which caught my attention. The problem is that I can't figure out how to generate Mongo queries in MongoMapper to find documents based on creation time when all I have is the ID.
If I store a key :created_at as part of the document I can do a query in mongomapper to get all documents created since Dec 1st like this:
Foo.where(:created_at.gt=>Time.parse("2011-12-01"))
(which maps to:
{:created_at => {"$gt" => Thu Dec 01 06:00:00 UTC 2011}})
I can't figure out how to make the equivalent query using the ObjectId. I imagine it'd look something like the following, though obviously generation_time is a Ruby method; is there an equivalent I can use on the ObjectId in the context of a Mongo query?
Foo.where('$where'=>"this.id.generation_time > new Date('2011-12-01')")
{$where: "this.id.generation_time > new Date('2011-12-01')"}
One further question: if I forgo storing separate timestamps, will I lose the timestamp metadata if I dump and restore my database using mongodump? Are there recommended backup/restore techniques that preserve ObjectIds?
That is JavaScript code which would be run in the shell, but generation_time is a MongoMapper method, so it doesn't make sense in the code you have.
In Rails you would get the creation time by saying something like
created_at = self.id.generation_time.in_time_zone(Time.zone)
Where self refers to an instance of Foo.
And you would query by saying
Foo.where(:_id => {'$gte' => BSON::ObjectId.from_time(created_at)}).count
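Putting the pieces together, a minimal sketch, assuming MongoMapper's symbol operator syntax (:_id.gte) works on _id the way it does on other keys:

require 'bson'

# Build a synthetic ObjectId whose embedded timestamp is Dec 1st, 2011.
# ObjectIds start with a 4-byte big-endian creation timestamp, which is
# why a range comparison on _id doubles as a range comparison on time.
boundary = BSON::ObjectId.from_time(Time.utc(2011, 12, 1))

# All documents created on or after that moment:
Foo.where(:_id.gte => boundary)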
Why bother, though? The hassle isn't worth it; just store the time.
Regarding backup/restore techniques: unless you are manually reading and re-inserting documents, mongodump/mongorestore and similar tools will preserve the ObjectIds, so you have nothing to worry about there.
New to Sequel and SQL in general, so bear with me. I'm using Sequel's many_through_many plugin, and I retrieve resources that are indirectly associated with particular tasks through groups, via a groups_tasks join table and a groups_resources join table. When I query task.resources on a Task dataset I get resource objects in Ruby, like so:
>> [#<Resource @values={:id=>2, :group_id=>nil, :display_name=>"some_name"}>, #<Resource @values={:id=>3, :group_id=>nil, :display_name=>"some_other_name"}>]
Now, I want to be able to add a new instance variable, schedule, to these resource objects and do work on it in Ruby. However, every time I query task.resources for each task, Sequel brings the resources into Ruby as different Resource objects each time (which makes sense), despite their being the same records in the database:
>>
"T3"
#<Resource:0x007fd4ca0c6fd8>
#<Resource:0x007fd4ca0c6920>
#<Resource:0x007fd4ca0c60d8>
#<Resource:0x007fd4ca0c57a0>
"T1"
#<Resource:0x007fd4ca0a4c08>
#<Resource:0x007fd4ca097f58>
#<Resource:0x007fd4ca097b48>
"T2"
#<Resource:0x007fd4ca085ba0>
#<Resource:0x007fd4ca0850d8>
I had wanted to just put a setter in class Resource and do resource.schedule = Schedule.new, but since they're all different objects, each resource is going to end up with a ton of different schedules. What's the most straightforward way to manipulate these resource objects client-side while maintaining the task associations that I query from the server?
If I am understanding your question correctly, you want to retrieve Resource objects and then manipulate some attribute named schedule. I am not very familiar with Sequel, but looking over the docs it seems to work similarly to ActiveRecord.
Set up your instance variable (I imagine using something like attr_accessor :schedule on the Resource class).
Store the records in a variable; that way you will be working with the same instances each time, rather than the new instances Sequel returns. A sketch of both steps follows.
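Here is a minimal sketch of both steps, assuming a Sequel::Model-backed Resource and that caching by primary key is acceptable; the schedule accessor, the cache, and the Schedule class are additions of mine, not Sequel API:

class Resource < Sequel::Model
  attr_accessor :schedule # plain Ruby attribute, not persisted by Sequel
end

# Cache each resource by primary key so every task shares one instance.
resource_cache = {}

tasks.each do |task|
  task.resources.each do |r|
    cached = (resource_cache[r.id] ||= r)
    cached.schedule ||= Schedule.new
    # ... do your scheduling work against `cached`, not `r` ...
  end
end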
I've been playing with Couchbase and I'm trying to find best ways to model relationships.
belongsTo: This is fairly easy. When I have Posts and Comments, I can have the following structure in a comment:
Comment:
  id: 1
  parent: (this is where I store the id of the post)
hasMany: This seemed pretty easy at first. Assuming I have Posts and Users, and users can like a Post, I had the following structure:
Posts:
  id: 1
  likedBy: [
    'user-id-1',
    'user-id-2'
  ]
This works if I have maybe a thousand likes, but as the number of likes increases it gets slower and slower, and I have to lock the document.
My first solution was to use a view, but a view is not real-time, even though it is adequate for most queries; there is always a delay for indexing.
Then I thought about using a relational database just to store the relationships. I think this might be a pretty good choice, but I would like to know if there is something I'm missing.
For the comments I might use something like this, but instead of "SomeEventType" and the date-time stamp it has in the blog post, I would use the ID of the post itself. This way you get the counter object for that post, which gives you the upper bound of the array of comments. Then you can iterate through that list, use pagination, or do a bulk get for all of them. Since this would interact purely with the Data Service, it would meet your consistency and real-time needs.
For the number of likes, you could use a counter object. For recording which users like a post or comment, you could store that in a separate object, or maybe have an index object like the one in your question, per user? Let me think more about this one.
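To make the counter idea concrete, here is a hedged Ruby sketch. The client calls (Couchbase.connect, incr, set) follow the classic couchbase 1.x gem, and the key names are mine, so treat both as assumptions and check your client's docs:

require 'couchbase'

# (gem API assumed from the classic couchbase 1.x client)
bucket = Couchbase.connect(:bucket => 'default')

# One counter document per post; incr is atomic on the data service,
# so there is no need to lock and rewrite a growing likedBy array.
def like!(bucket, post_id, user_id)
  count = bucket.incr("post:#{post_id}:likes", 1, :initial => 0)
  # Record *who* liked it in a small per-user document instead of one
  # ever-growing array on the post itself.
  bucket.set("like:#{post_id}:#{user_id}", 'at' => Time.now.to_i)
  count
end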
A migration contains the following:
last_service_id = nil
Service.find_by_sql("select
service_id,
registrations.regulator_given_id,
registrations.regulator_id
from
registrations
order by
service_id, updated_at desc").each do |s|
this_service_id = s["service_id"]
if this_service_id != last_service_id
Service.find(this_service_id).update_attributes!(:regulator_id => s["regulator_id"],
:regulator_given_id => s["regulator_given_id"])
last_service_id = this_service_id
end
end
and it is eating up memory, to the point where it will not run within the 512MB allowed on Heroku (the registrations table has 60,000 rows). Is there a known problem? A workaround? A fix in a later version of Rails?
Thanks in advance
Edit following request to clarify:
That is all the relevant source -- the rest of the migration creates the two new columns that are being populated. The situation is that I have data about services from multiple sources (regulators of the services) in the registrations table. I have decided to 'promote' some of the data (the prime regulator's regulator_id and regulator_given_key) into the services table to speed up certain queries.
This will load all 60,000 items in one go and keep those 60,000 AR objects around, which will consume a fair amount of memory. Rails does provide a find_each method for breaking a query like that into chunks of 1000 objects at a time, but it doesn't allow you to specify an ordering as you do.
You're probably best off implementing your own paging scheme. Using limit/offset is a possibility; however, large OFFSET values are usually inefficient, because the database server has to generate a bunch of results that it then discards.
An alternative is to add conditions to your query that ensure you don't return already-processed items, for example specifying that service_id be greater than the previously returned values. This is more complicated when some items compare equal in that ordering. With both of these paging-type schemes you probably need to think about what happens if a row gets inserted into your registrations table while you are processing (probably not a problem with migrations, assuming you run them with access to the site disabled).
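A rough sketch of that keyset-style paging, assuming Rails 3-style relation chaining (as in the answer further down) and an arbitrary batch size of 1000:

last_service_id = nil

loop do
  scope = Registration.select('service_id, regulator_id, regulator_given_id')
                      .order('service_id, updated_at DESC')
                      .limit(1000)
  scope = scope.where('service_id > ?', last_service_id) if last_service_id
  rows = scope.all
  break if rows.empty?

  # The newest registration for each service comes first (updated_at DESC),
  # so the first row seen for a given service_id is the one to promote.
  rows.group_by(&:service_id).each do |service_id, regs|
    newest = regs.first
    Service.find(service_id).update_attributes!(
      :regulator_id       => newest.regulator_id,
      :regulator_given_id => newest.regulator_given_id)
  end

  # Keyset paging: remember the boundary instead of using OFFSET.
  last_service_id = rows.last.service_id
end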
(Note: OP reports this didn't work)
Try something like this:
previous = nil
Registration.select('service_id, regulator_id, regulator_given_id')
.order('service_id, updated_at DESC')
.each do |r|
if previous != r.service_id
service = Service.find r.service_id
service.update_attributes(:regulator_id => r.regulator_id, :regulator_given_id => r.regulator_given_id)
previous = r.service_id
end
end
This is a kind of hacky way of getting the most recent record from registrations -- there's undoubtedly a better way to do it with DISTINCT or GROUP BY in SQL, all in a single query, which would not only be a lot faster but also more elegant. But this is just a migration, right? And I didn't promise elegant. I also am not sure it will work and resolve the problem, but I think so :-)
The key change is that instead of using raw SQL, this uses Arel, meaning (I think) the update operation is performed once on each associated record as Arel returns them. With SQL, you return them all and store them in an array, then update them all. I also don't think it's necessary to use the .select(...) clause.
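For what it's worth, a hedged sketch of the single-statement approach mentioned above, using correlated subqueries (MySQL-flavored SQL; verify it against your schema before trusting it in a migration):

# One UPDATE: for each service, copy the regulator columns from its most
# recently updated registration.
ActiveRecord::Base.connection.execute(<<-SQL)
  UPDATE services s
  SET s.regulator_id = (
        SELECT r.regulator_id FROM registrations r
        WHERE r.service_id = s.id
        ORDER BY r.updated_at DESC LIMIT 1),
      s.regulator_given_id = (
        SELECT r.regulator_given_id FROM registrations r
        WHERE r.service_id = s.id
        ORDER BY r.updated_at DESC LIMIT 1)
SQL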
Very interested in the result, so let me know if it works!
Rails 2.3.4
I have searched google, and have not found an answer to my dilemma.
For this discussion, I have two models. Users and Entries. Users can have many Entries (one for each day).
Entries have values and sent_at dates.
I want to query and display the average value of entries for a user BY DAY OF WEEK. So if a user has entered values for, say, the past 3 weeks, I want to show the average value for Sundays, Mondays, etc. In MySQL, it is simple:
SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = ? GROUP BY 1
That query will return between 0 and 7 records, depending upon how many days a user has had at least one entry.
I've looked at find_by_sql, but while I am searching Entry, I don't want to return an Entry object; instead, I need an array of up to 7 days and averages...
Also, I am concerned a bit about the performance of this, as we would like to load this into the user model when a user logs in, so that it can be displayed on their dashboard. Any advice/pointers are welcome. I am relatively new to Rails.
You can query the database directly, no need to use an actual ActiveRecord object. For example:
ActiveRecord::Base.connection.execute "SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at);"
This will give you a Mysql::Result or Mysql2::Result that you can then iterate with each to view your results.
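A sketch of consuming that result, assuming the mysql2 adapter, whose result rows are hashes by default (the older mysql adapter yields arrays of values instead):

rows = ActiveRecord::Base.connection.execute(
  "SELECT DAYOFWEEK(sent_at) AS day, AVG(value) AS average
   FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at)")

averages = {}
rows.each do |row|
  # DAYOFWEEK: 1 = Sunday .. 7 = Saturday
  averages[row['day']] = row['average'].to_f
end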
As for caching, I would recommend using memcached, but any other Rails caching strategy will work as well. The nice benefit of memcached is that you can have your cache expire after a certain amount of time. For example:
result = Rails.cache.fetch("user/#{user.id}/averages", :expires_in => 1.day) do
  # Your SQL query and results go here
end
This would put your results into memcached for one day under the key "user/#{user.id}/averages" (note the double quotes; interpolation doesn't happen in single-quoted strings). For example, if you were the user with id 10, your averages would be in memcached under 'user/10/averages', and the next time you went to perform this query (within the same day) the cached version would be used instead of actually hitting the database.
Untested, but something like this should work:
@user.entries.select('DAYOFWEEK(sent_at) as day, AVG(value) as average').group('1').all
NOTE: When you use select to specify columns explicitly, the returned objects are read-only. Rails can't reliably determine which columns can and can't be modified. In this case, you probably wouldn't try to modify the selected columns, but you can't modify your sent_at or value columns through the resulting objects either.
Check out the ActiveRecord Querying Guide for a breakdown of what's going on here in a fairly newb-friendly format. Oh, and if that query doesn't work, please post back so others that may stumble across this can see that (and I can possibly update).
Since that won't work due to entries returning an array, we can try using join instead:
User.where(:id => params[:id]).joins(:entries).select('...').group('1').all
Again, I don't know if this will work. Usually you can specify where after joins, but I haven't seen select combined in there. A tricky bit here is that the select is probably going to eliminate returning any data about the user at all. It might make more sense to eschew the find_by_* methods in favor of writing a method in the Entry model that just calls your query with select_all (docs) and skips the association mapping.
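A sketch of that last suggestion, as a class method on Entry (the method name is mine); select_all returns plain hashes, so no read-only model objects are instantiated:

class Entry < ActiveRecord::Base
  # Returns an array of hashes like { "day" => "1", "average" => "3.5" }.
  def self.daily_averages_for(user_id)
    connection.select_all(sanitize_sql_array([
      "SELECT DAYOFWEEK(sent_at) AS day, AVG(value) AS average
       FROM entries WHERE user_id = ? GROUP BY DAYOFWEEK(sent_at)",
      user_id]))
  end
end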
I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple: assume a directory tool which reads the current hard drive. When a directory is found, it should either be created or, if already present, updated.
First, let's focus only on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
    // Look the directory up by its full path; null means it is not stored yet.
    var result = DatabaseController.Session
        .CreateCriteria(typeof (DatabaseDirectory))
        .Add(Restrictions.Eq("FullName", dI.FullName))
        .List<DatabaseDirectory>().FirstOrDefault();
    if (result == null)
    {
        // Not in the database yet -- build a fresh entity from the file system.
        result = new DatabaseDirectory
        {
            CreationTime = dI.CreationTime,
            Existing = dI.Exists,
            Extension = dI.Extension,
            FullName = dI.FullName,
            LastAccessTime = dI.LastAccessTime,
            LastWriteTime = dI.LastWriteTime,
            Name = dI.Name
        };
    }
    return result;
}
Is this the way to go regarding:
Speed
Separation of Concerns
What comes to mind is the following: a scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan and look them up that way. On the other hand, this may not be suitable for large sets of data, like files (which will number 600,000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (for other reasons). So the real question is: what is the fastest way to check for the existence of an object where the ID is not known, just a property of that object?
And batching might be possible, by selecting a big group with something like "starts with C:\Testfiles\", but the problem then remains: how do I know in advance how big this set will be? I can't select "max 1000" and check in this buffered dictionary, because I might "hit next to the searched dir"... I hope this problem is clear. The most important part is: is buffering really affecting performance this much? If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which will be OK even if there are 1,000,000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
// Directories found on disk, keyed by full path.
Dictionary<string, DirectoryInfo> directories;

// One round trip: fetch the paths that already exist in the database
// (DatabaseDirectoryDTO needs a matching constructor and import mapping).
session.CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
As far as separation of concerns goes, just keep the operation at the repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) and handles the process of updating the database. That way your higher application logic doesn't have to worry about how it all works, and it won't be affected should you find an even faster way to do it in the future.