I have a MongoDB database that contains a large amount of data without a highly consistent schema. It is used for doing Google Analytics-style interaction tracking with our applications. I need to gather some output covering a whole month, but I'm struggling with the performance of the query, and I don't really know MongoDB very well at all.
The only way I can get results out is by restricting the timespan I am querying within to one day at a time, using the _timestamp field which I believe is indexed by default (I might be wrong).
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-01T00:00:00.000Z"),$lt:ISODate("2019-09-02T00:00:00.000Z")}}); // Day 1..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-02T00:00:00.000Z"),$lt:ISODate("2019-09-03T00:00:00.000Z")}}); // Day 2..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-03T00:00:00.000Z"),$lt:ISODate("2019-09-04T00:00:00.000Z")}}); // Day 3..
This works 'fine', but I'd rather be able to SQL union those separate queries together - but then I guess I'd still end up timing out.
Ideally I'd end up with each of those queries executing separately, with the result set being appended to each time and returned at the end.
I might be better off writing a simple application to do this.
Help me Obi-Wan Kenobi, you're my only hope.
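A minimal sketch of the "simple application" idea, in plain Ruby: split the month into half-open day windows (start inclusive, end exclusive), run one query per window, and append the results. The names day_windows and collect_month are hypothetical, and the block passed to collect_month stands in for the real MongoDB call, which is not shown here.

```ruby
require 'date'

# Build one half-open [from, to) window per day of the given month, so that
# no instant is queried twice and no day is skipped.
def day_windows(year, month)
  first = Date.new(year, month, 1)
  last  = Date.new(year, month, -1) # -1 means the last day of the month
  (first..last).map do |day|
    from = Time.utc(day.year, day.month, day.day)
    { from: from, to: from + 86_400 }
  end
end

# Run the caller's query once per window and concatenate the results.
# The block is an assumption: it should return an array of documents for
# _timestamp >= window[:from] and _timestamp < window[:to].
def collect_month(year, month)
  day_windows(year, month).flat_map { |window| yield(window) }
end
```

Each window is small enough to stay under the timeout that the single-day queries already respect, and the caller still gets one combined array at the end.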
I have a listing of ~10,000 apps and I'd like to order them by certain columns, but I want to give certain columns more "weight" than others.
For instance, each app has overall_ratings and current_ratings. If the app has a lot of overall_ratings, that's worth 1.5, but the number of current_ratings would be worth, say 2, since the number of current_ratings shows the app is active and currently popular.
Right now there are probably 4-6 of these variables I want to take into account.
So, how can I pull that off? In the query itself? After the fact using just Ruby (remember, there are over 10,000 rows that would need to be processed here)? Something else?
This is a Rails 3.2 app.
Sorting 10000 objects in plain Ruby doesn't seem like a good idea, especially if you just want the first 10 or so.
You can try to put your math formula in the query (using the order method from Active Record).
However, my favourite approach would be to create a float attribute to store the score and update that value with a before_save method.
I would read about dirty attributes so you only perform this scoring when one of your criteria is updated.
You may also create a rake task that re-scores your current objects.
This way you would keep the scoring functionality in Ruby (you could test it easily) and you could add an index to your float attribute so database queries have better performance.
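As a rough sketch of that idea, the scoring can live in two small plain-Ruby functions that are easy to test on their own. The WEIGHTS table uses the two weights mentioned in the question; the names weighted_score and rescore? are hypothetical, and the remaining criteria would be added to the hash.

```ruby
# Assumed weights: 1.5 for overall_ratings and 2 for current_ratings, as in
# the question. Other criteria would be added here.
WEIGHTS = { overall_ratings: 1.5, current_ratings: 2.0 }.freeze

# Compute the weighted score from a hash of column values.
# Columns missing from the hash count as zero.
def weighted_score(attrs)
  WEIGHTS.sum { |column, weight| weight * attrs.fetch(column, 0) }
end

# Decide whether a save needs to recompute the score, given the list of
# changed column names (strings or symbols).
def rescore?(changed_columns)
  (changed_columns.map(&:to_sym) & WEIGHTS.keys).any?
end
```

In a model, a before_save callback could call rescore? with the list of changed attributes and, if it returns true, store weighted_score in the float column; the exact wiring depends on the app and is not shown here.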
One attempt would be to let the DB do this work for you with a query like the following (I cannot really test it because I lack the DB schema):
ActiveRecord::Base.connection.execute("SELECT *,
(1.5*(SELECT COUNT(*) FROM overall_ratings
WHERE app_id = a.id) +
2*(SELECT COUNT(*) FROM current_ratings
WHERE app_id = a.id))
AS rating FROM apps a
WHERE true HAVING rating > 3 ORDER BY rating desc")
The idea is to use the subqueries to count the overall and current ratings for a specific app id, weight each count as desired, and sum them.
I have a simple query:
Select Count(p.Group_ID)
From Player_Source P
Inner Join Feature_Group_Xref X On P.Group_Id=X.Group_Id
where x.feature_name ='Try this site'
which spits out the current number of people in a specific test group at the current moment in time.
If I wanted to see what this number was, say, on 9/10/12 instead, could I add something in to the query to time phase this information as the database had it 2 days ago?
No. If you want to store historical information, you will need to incorporate that into your schema. For example, you might extend Feature_Group_Xref to add the columns Effective_Start_Timestamp and Effective_End_Timestamp; to find which groups currently have a given feature, you would write AND Effective_End_Timestamp > CURRENT_TIMESTAMP() (or AND Effective_End_Timestamp IS NULL, depending how you want to define the column), but to find which groups had a given feature at a specific time, you would write AND ... BETWEEN Effective_Start_Timestamp AND Effective_End_Timestamp (or AND Effective_Start_Timestamp < ... AND (Effective_End_Timestamp > ... OR Effective_End_Timestamp IS NULL)).
Wikipedia has a good article on various schema designs that people use to tackle this sort of problem: see http://en.wikipedia.org/wiki/Slowly_changing_dimension.
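To make the point-in-time condition concrete, here is a small in-memory sketch in Ruby. The Membership struct and both method names are hypothetical; it assumes the convention where Effective_End_Timestamp is NULL (nil here) while a row is still current.

```ruby
# Hypothetical row shape for Feature_Group_Xref with the two extra columns.
# effective_end is nil while the row is still in effect.
Membership = Struct.new(:group_id, :feature_name, :effective_start, :effective_end)

# True if the row was in effect at the given instant: it had started, and it
# had either not ended yet or ended strictly after that instant.
def active_at?(row, at)
  row.effective_start <= at && (row.effective_end.nil? || row.effective_end > at)
end

# Group ids that had the feature at the given instant.
def groups_with_feature(rows, feature_name, at)
  rows.select { |r| r.feature_name == feature_name && active_at?(r, at) }
      .map(&:group_id)
end
```

The same predicate answers both "now" and "as of 9/10/12"; only the instant passed in changes.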
It depends...
It is at least theoretically possible that you could use flashback query
Select Count(p.Group_ID)
From Player_Source as of timestamp( date '2012-09-10' ) P
Join Feature_Group_Xref as of timestamp( date '2012-09-10' ) X
On P.Group_Id=X.Group_Id
where x.feature_name ='Try this site'
This requires, though, that you have the privileges necessary to do a flashback query and that there is enough UNDO for Oracle to apply to be able to get back to the state those tables were in at midnight two days ago. It is unlikely that the database is configured to retain that much UNDO though it is generally possible. This query would also work if you happen to be using Oracle Total Recall.
More likely, though, you will need to modify your schema definition so that you are storing historical information that you can then query as of a point in time. There are a variety of ways to accomplish this: adding effective and expiration date columns to the table, as @ruakh suggests, is one of the more popular options. Which option(s) are appropriate in your particular case will depend on a variety of factors including how much history you want to retain, how frequently data changes, etc.
A migration contains the following:
last_service_id = nil
Service.find_by_sql("select
service_id,
registrations.regulator_given_id,
registrations.regulator_id
from
registrations
order by
service_id, updated_at desc").each do |s|
this_service_id = s["service_id"]
if this_service_id != last_service_id
Service.find(this_service_id).update_attributes!(:regulator_id => s["regulator_id"],
:regulator_given_id => s["regulator_given_id"])
last_service_id = this_service_id
end
end
and it is eating up memory, to the point where it will not run in the 512MB allowed in Heroku (the registrations table has 60,000 items). Is there a known problem? Workaround? Fix in a later version of Rails?
Thanks in advance
Edit following request to clarify:
That is all the relevant source - the rest of the migration creates the two new columns that are being populated. The situation is that I have data about services from multiple sources (regulators of the services) in the registrations table. I have decided to 'promote' some of the data ([prime]regulator_id and [prime]regulator_given_key) into the services table for the prime regulators to speed up certain queries.
This will load all 60000 items in one go and keep those 60000 AR objects around, which will consume a fair amount of memory. Rails does provide a find_each method for breaking down a query like that into chunks of 1000 objects at a time, but it doesn't allow you to specify an ordering as you do.
You're probably best off implementing your own paging scheme. Using limit/offset is a possibility; however, large OFFSET values are usually inefficient because the database server has to generate a bunch of results that it then discards.
An alternative is to add conditions to your query that ensure you don't return already-processed items, for example by requiring that service_id be greater than the last value you processed (the query orders by service_id ascending). This is more complicated if some items compare as equal. With both of these paging schemes you probably need to think about what happens if a row gets inserted into your registrations table while you are processing it (probably not a problem with migrations, assuming you run them with access to the site disabled).
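The condition-based paging idea can be sketched in plain Ruby over an in-memory array. The method name each_latest_registration and the hash-shaped rows are assumptions for illustration; in a real migration each page would come from a query along the lines of WHERE service_id > :last_id ORDER BY service_id, updated_at DESC LIMIT :batch_size rather than from a sorted array.

```ruby
# Yield the most recently updated row for each service_id, reading the rows
# one page at a time. The in-memory sort and select stand in for the
# per-page database query described above.
def each_latest_registration(rows, batch_size: 1000)
  sorted = rows.sort_by { |r| [r[:service_id], -r[:updated_at].to_f] }
  last_id = nil
  loop do
    # Only fetch rows for services that have not been handled yet.
    page = sorted.select { |r| last_id.nil? || r[:service_id] > last_id }
                 .first(batch_size)
    break if page.empty?

    page.each do |row|
      # Later rows for the same service are older, so skip them.
      next if row[:service_id] == last_id

      yield row
      last_id = row[:service_id]
    end
  end
end
```

Because only the newest row per service matters here, a page that ends in the middle of a service's rows is harmless: the next page starts at the next service_id, which sidesteps the tie problem for this particular migration.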
(Note: OP reports this didn't work)
Try something like this:
previous = nil
Registration.select('service_id, regulator_id, regulator_given_id')
.order('service_id, updated_at DESC')
.each do |r|
if previous != r.service_id
service = Service.find r.service_id
service.update_attributes(:regulator_id => r.regulator_id, :regulator_given_id => r.regulator_given_id)
previous = r.service_id
end
end
This is a kind of hacky way of getting the most recent record from regulators -- there's undoubtedly a better way to do it with DISTINCT or GROUP BY in SQL all in a single query, which would not only be a lot faster, but also more elegant. But this is just a migration, right? And I didn't promise elegant. I also am not sure it will work and resolve the problem, but I think so :-)
The key change is that instead of using SQL, this uses AREL, meaning (I think) the update operation is performed once on each associated record as AREL returns them. With SQL, you return them all and store in an array, then update them all. I also don't think it's necessary to use the .select(...) clause.
Very interested in the result, so let me know if it works!
I have been searching all over the web and I have no clue.
Suppose you have to build a dashboard in the admin area of your Rails app and you want to have the number of subscriptions per day.
Suppose that you are using SQLite3 for development, MySQL for production (pretty standard setup)
Basically, there are two options :
1) Retrieve all rows from the database using Subscriber.all and aggregate by day in the Rails app using the Enumerable.group_by :
@subscribers = Subscriber.all
@subscriptions_per_day = @subscribers.group_by { |s| s.created_at.beginning_of_day }
I think this is a really bad idea. Retrieving all rows from the database can be acceptable for a small application, but it will not scale at all. Database aggregate and date functions to the rescue !
2) Run a SQL query in the database using aggregate and date functions :
Subscriber.select('STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions').group('day')
Which will result in this SQL query:
SELECT STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions
FROM subscribers
GROUP BY day
Much better. Now aggregates are done in the database which is optimized for this kind of task, and only one row per day is returned from the database to the Rails app.
... but wait... now the app has to go live in my production env which uses MySQL !
Replace STRFTIME() with DATE_FORMAT().
What if tomorrow I switch to PostgreSQL ?
Replace DATE_FORMAT() with DATE_TRUNC().
I like to develop with SQLite. Simple and easy.
I also like the idea that Rails is database agnostic.
But why doesn't Rails provide a way to translate SQL functions that do the exact same thing but have different syntax in each RDBMS (this difference is really stupid, but hey, it's too late to complain about it)?
I can't believe that I find so few answers on the Web for such a basic feature of a Rails app : count the subscriptions per day, month or year.
Tell me I'm missing something :)
EDIT
It's been a few years since I posted this question.
Experience has shown that I should use the same DB for dev and prod. So I now consider the database agnostic requirement irrelevant.
Dev/prod parity FTW.
I ended up writing my own gem. Check it out and feel free to contribute:
https://github.com/lakim/sql_funk
It allows you to make calls like:
Subscriber.count_by("created_at", :group_by => "day")
You speak of some pretty difficult problems that Rails, unfortunately, completely overlooks. The ActiveRecord::Calculations docs are written like they're all you ever need, but databases can do much more advanced things. As Donal Fellows mentioned in his comment, the problem is much trickier than it seems.
I've developed a Rails application over the last two years that makes heavy use of aggregation, and I've tried a few different approaches to the problem. I unfortunately don't have the luxury of ignoring things like daylight saving time because the statistics are "only trends". The calculations I generate are tested by my customers to exact specifications.
To expand upon the problem a bit, I think you'll find that your current solution of grouping by dates is inadequate. It seems like a natural option to use STRFTIME. The primary problem is that it doesn't let you group by arbitrary time periods. If you want to do aggregation by year, month, day, hour, and/or minute, STRFTIME will work fine. If not, you'll find yourself looking for another solution. Another huge problem is that of aggregation upon aggregation. Say, for example, you want to group by month, but you want to do it starting from the 15th of every month. How would you do it using STRFTIME? You'd have to group by each day, and then by month, but then somehow account for the starting offset of the 15th day of each month. The final straw is that grouping by STRFTIME necessitates grouping by a string value, which you'll find very slow when performing aggregation upon aggregation.
The most performant and best designed solution I've come to is one based upon integer time periods. Here is an excerpt from one of my MySQL queries:
SELECT
field1, field2, field3,
CEIL((UNIX_TIMESTAMP(CONVERT_TZ(date, '+0:00', @@session.time_zone)) + :begin_offset) / :time_interval) AS time_period
FROM
some_table
GROUP BY
time_period
In this case, :time_interval is the number of seconds in the grouping period (e.g. 86400 for daily) and :begin_offset is the number of seconds to offset the period start. The CONVERT_TZ() business accounts for the way MySQL interprets dates. MySQL always assumes that the date field is in the MySQL local time zone. But because I store times in UTC, I must convert it from UTC to the session time zone if I want the UNIX_TIMESTAMP() function to give me a correct response. The time period ends up being an integer that describes the number of time intervals since the start of Unix time. This solution is much more flexible because it lets you group by arbitrary periods and doesn't require aggregation upon aggregation.
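The same arithmetic can be written in a few lines of Ruby, which may help when checking the SQL against known inputs. This is only a sketch of the CEIL expression: the method name time_period is hypothetical, and it deliberately leaves out the time zone conversion by assuming the Time values passed in are already correct instants.

```ruby
# Integer bucket for a timestamp: the number of intervals (rounded up) since
# the Unix epoch, after shifting by begin_offset seconds.
# time_interval and begin_offset are in seconds, as in the SQL above.
def time_period(time, time_interval, begin_offset = 0)
  ((time.to_i + begin_offset) / time_interval.to_f).ceil
end

# Count events per bucket, keyed by the integer period.
def count_by_period(times, time_interval, begin_offset = 0)
  times.each_with_object(Hash.new(0)) do |t, counts|
    counts[time_period(t, time_interval, begin_offset)] += 1
  end
end
```

Note that, because of the rounding up, a timestamp that falls exactly on an interval boundary lands in the earlier period, which matches the behavior of CEIL in the query.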
Now, to get to my real point. For a robust solution, I'd recommend that you consider not using Rails at all to generate these queries. The biggest issue is that the performance characteristics and subtleties of aggregation are different across the databases. You might find one design that works well in your development environment but not in production, or vice-versa. You'll jump through a lot of hoops to get Rails to play nicely with both databases in query construction.
Instead I'd recommend that you generate database-specific views in your chosen database and bring those along to the correct environment. Try to model the view as you would any other ActiveRecord table (id's and all), and of course make the fields in the view identical across databases. Because these statistics are read-only queries, you can use a model to back them and pretend like they're full-fledged tables. Just raise an exception if somebody tries to save, create, update, or destroy.
Not only will you get simplified model management by doing things the Rails way, you'll also find that you can write unit tests for your aggregation features in ways you wouldn't dream of in pure SQL. And if you decide to switch databases, you'll have to rewrite those views, but your tests will tell you where you're wrong, and make life so much easier.
I just released a gem that allows you to do this easily with MySQL. https://github.com/ankane/groupdate
You should really try to run MySQL in development, too. Your development and production environments should be as close as possible - less of a chance for something to work on development and totally break production.
If db agnosticism is what you're after, I can think of a couple of options:
Create a new field (we'll call it day_str) for the Subscriber that stores either the formatted date or a timestamp and use ActiveRecord.count:
daily_subscriber_counts = Subscriber.count(:group => "day_str")
The trade-off is of course a slightly larger record size, but this would all but eliminate performance worries.
You could also, depending on how granular the data that's being visualized is, just call .count several times with the date set as desired...
daily_subscriber_counts = {}
((Date.today - 7)..Date.today).each do |d|
daily_subscriber_counts[d] = Subscriber.count(:conditions => ["created_at >= ? AND created_at < ?", d.to_time, (d+1).to_time])
end
This could also be customized to account for varying granularities (per month, per year, per day, per hour). It's not the most efficient solution if you want to group by day across all of your subscribers (I haven't had a chance to run it either), but I would imagine you'd want to group by month, day, or hour if you're viewing a year's, a month's, or a day's worth of data respectively.
If you're willing to commit to mysql and sqlite you could use...
daily_subscriber_counts = Subscriber.count(:group => "date(created_at)")
...as they share similar date() functions.
I'd refine/expand PBaumann's answer slightly, and include a Dates table in your database. You'd need a join in your query:
SELECT D.DateText AS Day, COUNT(*) AS Subscriptions
FROM subscribers AS S
INNER JOIN Dates AS D ON S.created_at = D.Date
GROUP BY D.DateText
...but you'd have a nicely-formatted value available without calling any functions. With a PK on Dates.Date, you can merge join and it should be very fast.
If you have an international audience, you could use DateTextUS, DateTextGB, DateTextGer, etc., but obviously this would not be a perfect solution.
Another option: cast the date to text on the database side using CONVERT(), which is ANSI and may be available across databases; I'm too lazy to confirm that right now.
Here's how I do it:
I have a class Stat which allows storing raw events.
(Code is from the first few weeks I started coding in Ruby so excuse some of it :-))
class Stat < ActiveRecord::Base
belongs_to :statable, :polymorphic => true
attr_accessible :statable_id, :statable_type, :statable_stattype_id, :source_url, :referral_url, :temp_user_guid
# you can replace this with a cron job for better performance
# the reason I have it here is because I care about real-time stats
after_save :aggregate
def aggregate
aggregateinterval(1.hour)
#aggregateinterval(10.minutes)
end
# will aggregate an interval with the following properties:
# take t = 1.hour as an example
# it's 5:21 pm now, it will aggregate everything between 5 and 6
# and put them in the interval with start time 5:00 pm and 6:00 pm for today's date
# if you wish to create a cron job for this, you can specify the start time, and t
def aggregateinterval(t=1.hour)
aggregated_stat = AggregatedStat.where('start_time = ? and end_time = ? and statable_id = ? and statable_type = ? and statable_stattype_id = ?', Time.now.utc.floor(t), Time.now.utc.floor(t) + t, self.statable_id, self.statable_type, self.statable_stattype_id)
if (aggregated_stat.nil? || aggregated_stat.empty?)
aggregated_stat = AggregatedStat.new
else
aggregated_stat = aggregated_stat.first
end
aggregated_stat.statable_id = self.statable_id
aggregated_stat.statable_type = self.statable_type
aggregated_stat.statable_stattype_id = self.statable_stattype_id
aggregated_stat.start_time = Time.now.utc.floor(t)
aggregated_stat.end_time = Time.now.utc.floor(t) + t
# in minutes
aggregated_stat.interval_size = t / 60
if (!aggregated_stat.count)
aggregated_stat.count = 0
end
aggregated_stat.count = aggregated_stat.count + 1
aggregated_stat.save
end
end
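The Stat code above calls Time.now.utc.floor(t), which presumably relies on a helper that is not shown. A minimal stand-in, assuming the interval is given in whole seconds and buckets are aligned to the Unix epoch in UTC, could look like this (floor_time is a hypothetical name):

```ruby
# Round a time down to the start of its interval, e.g. 17:21:30 becomes
# 17:00:00 for a 3600-second interval. Buckets are aligned to the Unix epoch
# and the result is returned in UTC.
def floor_time(time, interval_seconds)
  Time.at((time.to_i / interval_seconds) * interval_seconds).utc
end

# Start and end of the bucket that contains the given time.
def interval_bounds(time, interval_seconds)
  start_time = floor_time(time, interval_seconds)
  [start_time, start_time + interval_seconds]
end
```

With ActiveSupport durations such as 1.hour, calling to_i on the duration first gives the whole-second interval this sketch expects.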
And here's the AggregatedStat class:
class AggregatedStat < ActiveRecord::Base
belongs_to :statable, :polymorphic => true
attr_accessible :statable_id, :statable_type, :statable_stattype_id, :start_time, :end_time
end
Every statable item that gets added to the db has a statable_type and a statable_stattype_id and some other generic stat data. The statable_type and statable_stattype_id are for the polymorphic classes and can hold values like (the string) "User" and 1, which means you're storing stats about User number 1.
You can add more columns and have mappers in the code extract the right columns when you need them. Creating multiple tables makes it harder to manage.
In the code above, StatableStattypes is just a table that contains "events" you'd like to log... I use a table because prior experience taught me that I don't want to look for what type of stats a number in the database refers to.
class StatableStattype < ActiveRecord::Base
attr_accessible :name, :description
has_many :stats
end
Now go to the classes you'd like to have some stats for and do the following:
class User < ActiveRecord::Base
# first line isn't too useful except for testing
has_many :stats, :as => :statable, :dependent => :destroy
has_many :aggregated_stats, :as => :statable, :dependent => :destroy
end
You can then query the aggregated stats for a certain User (or Location in the example below) with this code:
Location.first.aggregated_stats.where("start_time > ?", DateTime.now - 8.month)