I have a listing of ~10,000 apps and I'd like to order them by certain columns, but I want to give certain columns more "weight" than others.
For instance, each app has overall_ratings and current_ratings. If the app has a lot of overall_ratings, that's worth 1.5, but the number of current_ratings would be worth, say 2, since the number of current_ratings shows the app is active and currently popular.
Right now there are probably 4-6 of these variables I want to take into account.
So, how can I pull that off? In the query itself? After the fact using just Ruby (remember, there are over 10,000 rows that would need to be processed here)? Something else?
This is a Rails 3.2 app.
Sorting 10,000 objects in plain Ruby doesn't seem like a good idea, especially if you just want the first 10 or so.
You can try to put your math formula in the query (using the order method from Active Record).
However, my favourite approach would be to create a float attribute to store the score and update that value with a before_save method.
I would read about dirty attributes so that you only recompute the score when one of your criteria has changed.
You may also create a rake task that re-scores your current objects.
This way you would keep the scoring functionality in Ruby (you could test it easily) and you could add an index to your float attribute so database queries have better performance.
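A minimal sketch of what such a scoring function could look like in plain Ruby, so it can be tested without a database. The weights are the ones from the question; WEIGHTS and score are hypothetical names, and the apps are plain hashes standing in for model objects. In the model, the result would be written to the float attribute from the before_save callback.

```ruby
# Hypothetical weights, taken from the question: overall ratings count 1.5,
# current ratings count 2. More criteria would be added as further entries.
WEIGHTS = {
  :overall_ratings => 1.5,
  :current_ratings => 2.0
}

# Weighted sum of the criteria; a missing criterion counts as 0.
def score(app)
  WEIGHTS.inject(0.0) { |sum, (attr, weight)| sum + weight * app.fetch(attr, 0) }
end

# Plain hashes stand in for App records here (an assumption for the sketch).
apps = [
  { :name => "A", :overall_ratings => 100, :current_ratings => 5 },
  { :name => "B", :overall_ratings => 10,  :current_ratings => 80 }
]

# Highest score first.
ranked = apps.sort_by { |app| -score(app) }
ranked.each { |app| puts "#{app[:name]}: #{score(app)}" }
```

With the score stored in an indexed column, the ordering itself would be left to the database rather than done in Ruby as above.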
One approach would be to let the DB do this work for you with a query like the following (I cannot really test it, since I lack the DB schema):
ActiveRecord::Base.connection.execute("SELECT *,
(1.5*(SELECT COUNT(*) FROM overall_ratings
WHERE app_id = a.id) +
2*(SELECT COUNT(*) FROM current_ratings
WHERE app_id = a.id))
AS rating FROM apps a
WHERE true HAVING rating > 3 ORDER BY rating DESC")
The idea is to count, with the subqueries, the overall and current ratings found for a specific app id, weight each count as desired, and sum them.
I have this huge database of records that have been created over the past 5 or so years. I'm thinking it would be cool (and edifying) to try to create some time categories/segments for these records, the unit could be week or month or something like that, something to use for a graph.
Anyway, I need to develop a query that, given a datetime attr for each record in the table, would return all the records with a datetime falling in between X and Y (June 1, 2011 & June 7, 2011, for example).
I'm not good at using the time helpers yet and could not find any sufficiently similar questions on SO or elsewhere.
Solutions that use subjective increments like "week" or "month" that rails can understand would be strongly appreciated. I know how tricky the calendar can get in programming. Or I could just use some lowest common denominator (day) and do an extremely fine graph.
Client.where(:created_at => X..Y)
Source: Ruby on Rails Guides
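For the "week" and "month" increments asked about, here is a minimal sketch using only Ruby's standard Date class. week_range and month_range are hypothetical helper names; the resulting range would be passed as X..Y to the where call above.

```ruby
require 'date'

# Hypothetical helper: the 7-day window starting at `start`, inclusive.
def week_range(start)
  start..(start + 6)
end

# Hypothetical helper: first through last day of the month containing `date`.
# A day of -1 means "last day of the month" for Date.new.
def month_range(date)
  Date.new(date.year, date.month, 1)..Date.new(date.year, date.month, -1)
end

week  = week_range(Date.new(2011, 6, 1))
month = month_range(Date.new(2011, 6, 15))

puts "#{week.first} to #{week.last}"
puts "#{month.first} to #{month.last}"

# Assumed usage in the app: Client.where(:created_at => week)
```

One thing to check when the column is a datetime: a range ending on a bare date may stop at midnight of that day, so the upper bound might need to be the end of the last day instead (for example via ActiveSupport's Time#end_of_day).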
I have User model with many fields and I would like to display a
table as a matrix of 2 of those fields:
- created_at
- type
For the created_at I simply used a group_by as so:
User.where(:type => "blabla").all.group_by { |item|
  item.created_at.strftime("%Y-%m-%d")
}.sort.each do |creation_date, users|
This gives me a nice array of all the users per creation_date, so the
lines on my table are ok. However I want to display multiple lines,
each representing the sub selection of the users per type.
So for the moment, I am performing one request per line (per type,
simply replacing the "blabla").
For the moment it's OK because I have
just a few types, but this number will soon increase a lot, and at
that point this will not be efficient, I am afraid.
Any suggestion on how I could achieve my expected results ?
Thanks,
Alex
The general answer here is to perform a Map / Reduce. Generally, you do not perform the map-reduce in real time due to performance constraints. Instead you run the map-reduce on a schedule and query against the results directly.
Here's a primer on map-reduce for Ruby. Here's another example using Mongoid specifically.
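To make the idea concrete at small scale, here is a sketch in plain Ruby rather than in a real map-reduce framework: each record is mapped to a (day, type) key and the reduce step is a count. Row and matrix are hypothetical names, and the Struct stands in for the User model. It builds the whole date-by-type matrix in one pass instead of one request per type.

```ruby
# Stand-in for the User model (an assumption for this sketch).
Row = Struct.new(:type, :created_at)

# Returns { "YYYY-MM-DD" => { type => count } }.
def matrix(rows)
  counts = Hash.new { |hash, day| hash[day] = Hash.new(0) }
  rows.each do |row|
    day = row.created_at.strftime("%Y-%m-%d")  # map: record -> (day, type)
    counts[day][row.type] += 1                 # reduce: count per key
  end
  counts
end

rows = [
  Row.new("admin", Time.utc(2011, 6, 1, 9)),
  Row.new("guest", Time.utc(2011, 6, 1, 17)),
  Row.new("guest", Time.utc(2011, 6, 2, 8)),
  Row.new("guest", Time.utc(2011, 6, 2, 9))
]

table = matrix(rows)
table.sort.each { |day, by_type| puts "#{day}: #{by_type.inspect}" }
```

Types with no users on a given day read as 0, so every table line can iterate over the same list of types.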
Rails 2.3.4
I have searched google, and have not found an answer to my dilemma.
For this discussion, I have two models. Users and Entries. Users can have many Entries (one for each day).
Entries have values and sent_at dates.
I want to query and display the average value of entries for a user BY DAY OF WEEK. So if a user has entered values for, say, the past 3 weeks, I want to show the average value for Sundays, Mondays, etc. In MySQL, it is simple:
SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = ? GROUP BY 1
That query will return between 0 and 7 records, depending upon how many days a user has had at least one entry.
I've looked at find_by_sql, but while I am searching Entry, I don't want to return an Entry object; instead, I need an array of up to 7 days and averages...
Also, I am concerned a bit about the performance of this, as we would like to load this to the user model when a user logs in, so that it can be displayed on their dashboard. Any advice/pointers are welcome. I am relatively new to Rails.
You can query the database directly, no need to use an actual ActiveRecord object. For example:
ActiveRecord::Base.connection.execute "SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at);"
This will give you a Mysql::Result or Mysql2::Result, depending on your adapter. The result is enumerable, so you can call each on it to read your rows.
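A small sketch of turning those rows into something the dashboard can display. It assumes each row arrives as a [day, average] pair (the exact row shape depends on the adapter) and relies on MySQL's DAYOFWEEK() returning 1 for Sunday through 7 for Saturday. averages_by_day_name is a hypothetical helper name.

```ruby
require 'date'

# Returns seven [day_name, average] pairs, Sunday first.
# Days with no entries get nil instead of being left out.
def averages_by_day_name(rows)
  by_number = {}
  rows.each { |day, average| by_number[day.to_i] = average.to_f }
  Date::DAYNAMES.each_with_index.map do |name, index|
    [name, by_number[index + 1]]  # DAYOFWEEK is 1-based, DAYNAMES is 0-based
  end
end

# Sample rows as they might come back from the query (made-up values).
rows = [[1, "3.5"], [2, "4.25"], [7, "2.0"]]

averages = averages_by_day_name(rows)
averages.each { |name, average| puts "#{name}: #{average.inspect}" }
```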
As for caching, I would recommend using memcached, but any other rails caching strategy will work as well. The nice benefit of memcached is that you can have your cache expire after a certain amount of time. For example:
result = Rails.cache.fetch("user/#{user.id}/averages", :expires_in => 1.day) do
# Your sql query and results go here
end
This would put your results into memcached for one day under the key 'user/<id>/averages'. For example, if you were the user with id 10, your averages would be stored under 'user/10/averages', and the next time you performed this query (within the same day) the cached version would be used instead of hitting the database.
Untested, but something like this should work:
@user.entries.select('DAYOFWEEK(sent_at) as day, AVG(value) as average').group('1').all
NOTE: When you use select to specify columns explicitly, the returned objects are read-only. Rails can't reliably determine which columns can and can't be modified. In this case, you probably wouldn't try to modify the selected columns, but you can't modify your sent_at or value columns through the resulting objects either.
Check out the ActiveRecord Querying Guide for a breakdown of what's going on here in a fairly newb-friendly format. Oh, and if that query doesn't work, please post back so others that may stumble across this can see that (and I can possibly update).
Since that won't work due to entries returning an array, we can try using join instead:
User.where(:id => params[:id]).joins(:entries).select('...').group('1').all
Again, I don't know if this will work. Usually you can specify where after joins, but I haven't seen select combined in there. A tricky bit here is that the select is probably going to eliminate returning any data about the user at all. It might make more sense just to eschew find_by_* methods in favor of writing a method in the Entry model that just calls your query with select_all (docs) and skips the association mapping.
I have been searching all over the web and I have no clue.
Suppose you have to build a dashboard in the admin area of your Rails app and you want to have the number of subscriptions per day.
Suppose that you are using SQLite3 for development, MySQL for production (pretty standard setup)
Basically, there are two options :
1) Retrieve all rows from the database using Subscriber.all and aggregate by day in the Rails app using the Enumerable.group_by :
@subscribers = Subscriber.all
@subscriptions_per_day = @subscribers.group_by { |s| s.created_at.beginning_of_day }
I think this is a really bad idea. Retrieving all rows from the database can be acceptable for a small application, but it will not scale at all. Database aggregate and date functions to the rescue !
2) Run a SQL query in the database using aggregate and date functions :
Subscriber.select('STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions').group('day')
Which will run this SQL query :
SELECT STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions
FROM subscribers
GROUP BY day
Much better. Now aggregates are done in the database which is optimized for this kind of task, and only one row per day is returned from the database to the Rails app.
... but wait... now the app has to go live in my production env which uses MySQL !
Replace STRFTIME() with DATE_FORMAT().
What if tomorrow I switch to PostgreSQL ?
Replace DATE_FORMAT() with DATE_TRUNC().
I like to develop with SQLite. Simple and easy.
I also like the idea that Rails is database agnostic.
But why doesn't Rails provide a way to translate SQL functions that do the exact same thing, but have different syntax in each RDBMS (this difference is really stupid, but hey, it's too late to complain about it) ?
I can't believe that I find so few answers on the Web for such a basic feature of a Rails app : count the subscriptions per day, month or year.
Tell me I'm missing something :)
EDIT
It's been a few years since I posted this question.
Experience has shown that I should use the same DB for dev and prod. So I now consider the database agnostic requirement irrelevant.
Dev/prod parity FTW.
I ended up writing my own gem. Check it out and feel free to contribute:
https://github.com/lakim/sql_funk
It allows you to make calls like:
Subscriber.count_by("created_at", :group_by => "day")
You speak of some pretty difficult problems that Rails, unfortunately, completely overlooks. The ActiveRecord::Calculations docs are written like they're all you ever need, but databases can do much more advanced things. As Donal Fellows mentioned in his comment, the problem is much trickier than it seems.
I've developed a Rails application over the last two years that makes heavy use of aggregation, and I've tried a few different approaches to the problem. I unfortunately don't have the luxury of ignoring things like daylight saving time on the grounds that the statistics are "only trends". The calculations I generate are tested by my customers to exact specifications.
To expand upon the problem a bit, I think you'll find that your current solution of grouping by dates is inadequate. It seems like a natural option to use STRFTIME. The primary problem is that it doesn't let you group by arbitrary time periods. If you want to do aggregation by year, month, day, hour, and/or minute, STRFTIME will work fine. If not, you'll find yourself looking for another solution. Another huge problem is that of aggregation upon aggregation. Say, for example, you want to group by month, but you want to do it starting from the 15th of every month. How would you do it using STRFTIME? You'd have to group by each day, and then by month, and then somehow account for the starting offset of the 15th day of each month. The final straw is that grouping by STRFTIME necessitates grouping by a string value, which you'll find very slow when performing aggregation upon aggregation.
The most performant and best designed solution I've come to is one based upon integer time periods. Here is an excerpt from one of my mysql queries:
SELECT
field1, field2, field3,
CEIL((UNIX_TIMESTAMP(CONVERT_TZ(date, '+0:00', @@session.time_zone)) + :begin_offset) / :time_interval) AS time_period
FROM
some_table
GROUP BY
time_period
In this case, :time_interval is the number of seconds in the grouping period (e.g. 86400 for daily) and :begin_offset is the number of seconds to offset the period start. The CONVERT_TZ() business accounts for the way mysql interprets dates. Mysql always assumes that the date field is in the mysql local time zone. But because I store times in UTC, I must convert it from UTC to the session time zone if I want the UNIX_TIMESTAMP() function to give me a correct response. The time period ends up being an integer that describes the number of time intervals since the start of unix time. This solution is much more flexible because it lets you group by arbitrary periods and doesn't require aggregation upon aggregation.
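The same arithmetic can be sketched in plain Ruby to see how timestamps fall into integer periods. time_period is a hypothetical helper name that mirrors only the CEIL(...) expression; the time zone conversion done by CONVERT_TZ() is not modeled, and the sample times are already UTC.

```ruby
# Mirrors CEIL((unix_timestamp + begin_offset) / time_interval).
# time_interval and begin_offset are in seconds.
def time_period(time, time_interval, begin_offset = 0)
  ((time.to_i + begin_offset) / time_interval.to_f).ceil
end

DAY = 86_400

midnight   = Time.utc(2011, 6, 1, 0, 0, 0)
morning    = Time.utc(2011, 6, 1, 0, 0, 1)
late_night = Time.utc(2011, 6, 1, 23, 59, 59)
next_day   = Time.utc(2011, 6, 2, 0, 0, 1)

# Timestamps within the same day share a period number.
puts time_period(morning, DAY) == time_period(late_night, DAY)
# The following day gets the next integer.
puts time_period(next_day, DAY) - time_period(morning, DAY)
# Because of CEIL, a timestamp exactly on a boundary belongs to the
# period that ends there, not the one that starts there.
puts time_period(morning, DAY) - time_period(midnight, DAY)
```

Grouping on this integer is what lets the interval be arbitrary: a 6-hour grouping is just a time_interval of 21600, and shifting where periods start only means changing begin_offset.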
Now, to get to my real point. For a robust solution, I'd recommend that you consider not using Rails at all to generate these queries. The biggest issue is that the performance characteristics and subtleties of aggregation are different across the databases. You might find one design that works well in your development environment but not in production, or vice-versa. You'll jump through a lot of hoops to get Rails to play nicely with both databases in query construction.
Instead I'd recommend that you generate database-specific views in your chosen database and bring those along to the correct environment. Try to model the view as you would any other ActiveRecord table (id's and all), and of course make the fields in the view identical across databases. Because these statistics are read-only queries, you can use a model to back them and pretend like they're full-fledged tables. Just raise an exception if somebody tries to save, create, update, or destroy.
Not only will you get simplified model management by doing things the Rails way, you'll also find that you can write units tests for your aggregation features in ways you wouldn't dream of in pure SQL. And if you decide to switch databases, you'll have to rewrite those views, but your tests will tell you where you're wrong, and make life so much easier.
I just released a gem that allows you to do this easily with MySQL. https://github.com/ankane/groupdate
You should really try to run MySQL in development, too. Your development and production environments should be as close as possible - less of a chance for something to work on development and totally break production.
If db agnosticism is what you're after, I can think of a couple of options:
Create a new field (we'll call it day_str) for the Subscriber that stores either the formatted date or a timestamp and use ActiveRecord.count:
daily_subscriber_counts = Subscriber.count(:group => "day_str")
The trade-off is of course a slightly larger record size, but this would all but eliminate performance worries.
You could also, depending on how granular the data that's being visualized is, just call .count several times with the date set as desired...
daily_subscriber_counts = {}
((Date.today - 7)..Date.today).each do |d|
  daily_subscriber_counts[d] = Subscriber.count(:conditions => ["created_at >= ? AND created_at < ?", d.to_time, (d+1).to_time])
end
This could also be customized to account for varying granularities (per month, per year, per day, per hour). It's not the most efficient solution if you want to group all of your subscribers by day (I haven't had a chance to run it either), but I would imagine you'd want to group by month, day, or hour if you're viewing a year's, a month's, or a day's worth of data respectively.
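A sketch of how the window boundaries for different granularities could be generated with Ruby's standard Date class. daily_windows and monthly_windows are hypothetical helper names; each [from, to) pair would be fed into the created_at >= ? AND created_at < ? conditions of one count call.

```ruby
require 'date'

# Hypothetical helper: one [from, to) pair per day, covering the `days`
# days up to and including `last_day`.
def daily_windows(last_day, days)
  (0...days).map do |i|
    from = last_day - (days - 1 - i)
    [from, from + 1]
  end
end

# Hypothetical helper: one [from, to) pair per calendar month, ending with
# the month that starts on `last_month_start` (expected to be day 1).
def monthly_windows(last_month_start, months)
  (0...months).map do |i|
    from = last_month_start << (months - 1 - i)  # Date#<< steps back by months
    [from, from >> 1]                            # Date#>> steps forward
  end
end

daily_windows(Date.new(2011, 6, 7), 7).each { |from, to| puts "#{from} .. #{to}" }
monthly_windows(Date.new(2011, 3, 1), 3).each { |from, to| puts "#{from} .. #{to}" }
```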
If you're willing to commit to mysql and sqlite you could use...
daily_subscriber_counts = Subscriber.count(:group => "date(created_at)")
...as they share similar date() functions.
I'd refine/expand PBaumann's answer slightly, and include a Dates table in your database. You'd need a join in your query:
SELECT D.DateText AS Day, COUNT(*) AS Subscriptions
FROM subscribers AS S
INNER JOIN Dates AS D ON S.created_at = D.Date
GROUP BY D.DateText
...but you'd have a nicely-formatted value available without calling any functions. With a PK on Dates.Date, you can merge join and it should be very fast.
If you have an international audience, you could use DateTextUS, DateTextGB, DateTextGer, etc., but obviously this would not be a perfect solution.
Another option: cast the date to text on the database side using CONVERT(), which is ANSI and may be available across databases; I'm too lazy to confirm that right now.
Here's how I do it:
I have a class Stat which allows storing raw events.
(Code is from the first few weeks I started coding in Ruby so excuse some of it :-))
class Stat < ActiveRecord::Base
belongs_to :statable, :polymorphic => true
attr_accessible :statable_id, :statable_type, :statable_stattype_id, :source_url, :referral_url, :temp_user_guid
# you can replace this with a cron job for better performance
# the reason I have it here is because I care about real-time stats
after_save :aggregate
def aggregate
aggregateinterval(1.hour)
#aggregateinterval(10.minutes)
end
# will aggregate an interval with the following properties:
# take t = 1.hour as an example
# it's 5:21 pm now, it will aggregate everything between 5 and 6
# and put them in the interval with start time 5:00 pm and 6:00 pm for today's date
# if you wish to create a cron job for this, you can specify the start time, and t
def aggregateinterval(t=1.hour)
aggregated_stat = AggregatedStat.where('start_time = ? and end_time = ? and statable_id = ? and statable_type = ? and statable_stattype_id = ?', Time.now.utc.floor(t), Time.now.utc.floor(t) + t, self.statable_id, self.statable_type, self.statable_stattype_id)
if (aggregated_stat.nil? || aggregated_stat.empty?)
aggregated_stat = AggregatedStat.new
else
aggregated_stat = aggregated_stat.first
end
aggregated_stat.statable_id = self.statable_id
aggregated_stat.statable_type = self.statable_type
aggregated_stat.statable_stattype_id = self.statable_stattype_id
aggregated_stat.start_time = Time.now.utc.floor(t)
aggregated_stat.end_time = Time.now.utc.floor(t) + t
# in minutes
aggregated_stat.interval_size = t / 60
if (!aggregated_stat.count)
aggregated_stat.count = 0
end
aggregated_stat.count = aggregated_stat.count + 1
aggregated_stat.save
end
end
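The class above calls floor(t) on a Time with an interval, which is presumably an extension defined elsewhere in the author's app and not shown here. As a sketch of what such a helper might do, assuming the interval is given as a plain number of seconds, floor_time below is a hypothetical stand-in:

```ruby
# Hypothetical helper: round a time down to the start of the interval it
# falls in. interval_seconds is a plain Integer (3600 for one hour).
def floor_time(time, interval_seconds)
  Time.at((time.to_i / interval_seconds) * interval_seconds).utc
end

now        = Time.utc(2011, 6, 1, 17, 21, 30)  # "5:21 pm"
start_time = floor_time(now, 3600)             # start of the 5 pm hour
end_time   = start_time + 3600                 # start of the 6 pm hour

puts start_time
puts end_time
```

Integer division does the rounding here, so the buckets line up with multiples of the interval since the Unix epoch in UTC.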
And here's the AggregatedStat class:
class AggregatedStat < ActiveRecord::Base
belongs_to :statable, :polymorphic => true
attr_accessible :statable_id, :statable_type, :statable_stattype_id, :start_time, :end_time
end
Every statable item that gets added to the db has a statable_type and a statable_stattype_id and some other generic stat data. The statable_type and statable_stattype_id are for the polymorphic classes and can hold values like (the string) "User" and 1, which means you're storing stats about User number 1.
You can add more columns and have mappers in the code extract the right columns when you need them. Creating multiple tables makes it harder to manage.
In the code above, StatableStattypes is just a table that contains "events" you'd like to log... I use a table because prior experience taught me that I don't want to look for what type of stats a number in the database refers to.
class StatableStattype < ActiveRecord::Base
attr_accessible :name, :description
has_many :stats
end
Now go to the classes you'd like to have some stats for and do the following:
class User < ActiveRecord::Base
# first line isn't too useful except for testing
has_many :stats, :as => :statable, :dependent => :destroy
has_many :aggregated_stats, :as => :statable, :dependent => :destroy
end
You can then query the aggregated stats for a certain User (or Location in the example below) with this code:
Location.first.aggregated_stats.where("start_time > ?", DateTime.now - 8.month)