sqlalchemy: paginate does not return the expected number of elements

I am using flask-sqlalchemy together with a sqlite database. I am trying to get all votes cast before date1:
sub_query = models.VoteList.query.filter(models.VoteList.vote_datetime < date1)
sub_query = sub_query.filter(models.VoteList.group_id == selected_group.id)
sub_query = sub_query.filter(models.VoteList.user_id == g.user.id)
sub_query = sub_query.subquery()
old_votes = models.Papers.query.join(sub_query, sub_query.c.arxiv_id == models.Papers.arxiv_id).paginate(1, 4, False)
where the database model for VoteList looks like this
class VoteList(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
    group_id = db.Column(db.Integer, db.ForeignKey('groups.id'))
    arxiv_id = db.Column(db.String(1000), db.ForeignKey('papers.arxiv_id'))
    vote_datetime = db.Column(db.DateTime)
    group = db.relationship("Groups", backref=db.backref('vote_list', lazy='dynamic'))
    user = db.relationship("User", backref=db.backref('votes', lazy='dynamic'), foreign_keys=[user_id])
    def __repr__(self):
        return '<VoteList %r>' % (self.id)
I made sure that the 'old_votes' selection above has 20 elements. If I use .all() instead of .paginate() I get the expected 20 results.
Since I used a max results value of 4 in the example above, I would expect old_votes.items to have 4 elements, but it has only 2. If I increase the max results value, the number of elements also increases, but it always stays below the max results value. Does paginate mess something up here?
any ideas?
thanks
carl
EDIT
I noticed that it works fine if I apply the paginate() function on add_columns(). So if I add (for no good reason) a column with
old_votes = models.Papers.query.join(sub_query, sub_query.c.arxiv_id == models.Papers.arxiv_id)
old_votes = old_votes.add_columns(sub_query.c.vote_datetime).paginate(page, VOTES_PER_PAGE, False)
it works fine. But since I don't need that column, I would still like to know what goes wrong in my example above.

It looks to me as though the 4 rows returned by the query (after filtering) correspond to 4 different rows of the VoteList table, but they link to only 2 different Papers models. When model instances are created, duplicates are filtered out, so you get fewer rows. When you add a column from the subquery, the results are tuples of (Papers, vote_datetime), and in that case no duplicates are removed.
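To make this concrete, here is a minimal sketch using only the standard library's sqlite3 module rather than SQLAlchemy. The table layout and sample data are assumptions made for illustration, and the dict.fromkeys step only imitates the ORM collapsing rows that map to the same entity. It is not flask-sqlalchemy's actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (arxiv_id TEXT PRIMARY KEY);
    CREATE TABLE vote_list (id INTEGER PRIMARY KEY, arxiv_id TEXT);
    INSERT INTO papers VALUES ('p1'), ('p2'), ('p3');
    -- two votes each for p1 and p2, one vote for p3 (hypothetical data)
    INSERT INTO vote_list (arxiv_id) VALUES ('p1'), ('p1'), ('p2'), ('p2'), ('p3');
""")

# LIMIT 4 is applied to the joined rows, not to distinct papers.
rows = conn.execute("""
    SELECT papers.arxiv_id
    FROM papers
    JOIN vote_list ON vote_list.arxiv_id = papers.arxiv_id
    ORDER BY vote_list.id
    LIMIT 4
""").fetchall()

# Imitate the ORM keeping one object per primary key.
unique_papers = list(dict.fromkeys(row[0] for row in rows))

print(len(rows))       # 4 joined rows came back from the database
print(unique_papers)   # ['p1', 'p2'] -> only 2 distinct papers on the "page"
```

The database honours the limit of 4, but the limit counts joined rows, so the page ends up with fewer distinct Papers objects.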

I encountered the same issue and I applied van's answer but it did not work. However I agree with van's explanation so I added .distinct() to the query like this:
old_votes = models.Papers.query.distinct().join(sub_query, sub_query.c.arxiv_id == models.Papers.arxiv_id).paginate(1, 4, False)
It worked as I expected.
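A likely reason .distinct() helps is that SQL applies DISTINCT before LIMIT, so the limit counts distinct papers instead of joined rows. The sketch below shows this in plain sqlite3. The schema and data are hypothetical and stand in for the Papers and VoteList models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (arxiv_id TEXT PRIMARY KEY);
    CREATE TABLE vote_list (id INTEGER PRIMARY KEY, arxiv_id TEXT);
    INSERT INTO papers VALUES ('p1'), ('p2'), ('p3');
    INSERT INTO vote_list (arxiv_id) VALUES ('p1'), ('p1'), ('p2'), ('p2'), ('p3');
""")

base = """
    SELECT {distinct} papers.arxiv_id
    FROM papers
    JOIN vote_list ON vote_list.arxiv_id = papers.arxiv_id
    ORDER BY papers.arxiv_id
    LIMIT 2
"""

# Without DISTINCT the two rows of the page are the same paper twice.
plain_page = [r[0] for r in conn.execute(base.format(distinct=""))]

# With DISTINCT duplicates are removed first, then the limit is applied.
distinct_page = [r[0] for r in conn.execute(base.format(distinct="DISTINCT"))]

print(plain_page)     # ['p1', 'p1']
print(distinct_page)  # ['p1', 'p2']
```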

Related

Understanding odd results when paginating double outerjoins

I'm working with Flask and SQLAlchemy and I have stumbled on behavior I do not understand. Building a query with .outerjoin() and .paginate() worked well until today, when I created a query with two chained outer joins, as in the simplified example code below (db is a reference to SQLAlchemy). Class First has a one-to-many relation with Second, and Second is one-to-one with Third.
For testing purposes I have prepared three search queries. The first two, search_1 and search_2, work well all the time. But search_3 works only until there are two Second records related to the same First record. When more than one Second is related to a First, the query mostly returns a lower number of records (but not as low as when using .join instead of .outerjoin), and in some cases even a higher number than there are records in the First table. What's strange is that the number of records changes even with a different sort order (always by columns of the First model).
class First(db.Model):
    __tablename__ = 'first'
    id = db.Column(db.Integer, primary_key=True)
    date_create = db.Column(db.DateTime, default=datetime.utcnow)
    date_update = db.Column(db.DateTime)
class Second(db.Model):
    __tablename__ = 'second'
    id = db.Column(db.Integer, primary_key=True)
    first_id = db.Column(db.Integer, db.ForeignKey('first.id'))
    third_id = db.Column(db.Integer, db.ForeignKey('third.id'))
class Third(db.Model):
    __tablename__ = 'third'
    id = db.Column(db.Integer, primary_key=True)
    date_create = db.Column(db.DateTime, default=datetime.utcnow)
# prepare base query to reuse later
outer_base = First.query \
    .outerjoin(Second, Second.first_id == First.id) \
    .outerjoin(Third, Third.id == Second.third_id) \
    .order_by(First.id.asc())
# works well
search_1 = First.query.order_by(First.id.asc()).paginate(1, 10, False)
# works well
search_2 = outer_base.all()
# odd as hell...
search_3 = outer_base.paginate(1, 10, False)
I just want to be able to filter First records by a value from Third whenever a relation has been created through the Second table. Can anyone please explain what I am missing? Maybe the double outerjoin can be done differently so that it works with pagination?

How to select each model which has the maximum value of an attribute for any given value of another attribute?

I have a Work model with a video_id, a user_id and some other simple fields. I need to display the last 12 works on the page, but only take 1 per user. Currently I'm trying to do it like this:
def self.latest_works_one_per_user(video_id=nil)
  scope = self.includes(:user, :video)
  scope = video_id ? scope.where(video_id: video_id) : scope.where.not(video_id: nil)
  scope = scope.order(created_at: :desc)
  # two separate arrays (user_ids = works = [] would make both names point at one array)
  user_ids, works = [], []
  scope.each do |work|
    next if user_ids.include? work.user_id
    user_ids << work.user_id
    works << work
    break if works.size == 12
  end
  works
end
But I'm damn sure there is a more elegant and faster way of doing it especially when the number of works gets bigger.
Here's a solution that should work for any SQL database with minimal adjustment. Whether one thinks it's elegant or not depends on how much you enjoy SQL.
def self.latest_works_one_per_user(video_id=nil)
  scope = includes(:user, :video)
  scope = video_id ? scope.where(video_id: video_id) : scope.where.not(video_id: nil)
  scope.
    joins("join (select user_id, max(created_at) created_at
           from works group by user_id) most_recent
           on works.user_id = most_recent.user_id and
           works.created_at = most_recent.created_at").
    order(created_at: :desc).limit(12)
end
It only works if the combination of user_id and created_at is unique, however. If that combination isn't unique you'll get more than 12 rows.
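The join-against-a-grouped-subquery idea can be tried outside Rails. The following is a minimal sketch in Python with the standard sqlite3 module. The works table and its rows are made up for illustration and only mirror the columns mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE works (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    INSERT INTO works (user_id, created_at) VALUES
        (1, '2020-01-01'), (1, '2020-01-05'),
        (2, '2020-01-03'),
        (3, '2020-01-02'), (3, '2020-01-04');
""")

# The subquery finds each user's newest timestamp; joining back on
# (user_id, created_at) keeps only the matching row per user.
latest = conn.execute("""
    SELECT works.id, works.user_id, works.created_at
    FROM works
    JOIN (SELECT user_id, MAX(created_at) AS created_at
          FROM works
          GROUP BY user_id) AS most_recent
      ON works.user_id = most_recent.user_id
     AND works.created_at = most_recent.created_at
    ORDER BY works.created_at DESC
    LIMIT 12
""").fetchall()

for row in latest:
    print(row)
```

As noted above, if two works of the same user share the same created_at, both rows match the join and the user appears twice.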
It can be done more simply in MySQL. The MySQL solution doesn't work in Postgres, and I don't know a better solution in Postgres, although I'm sure there is one.

sqlalchemy join with sum and count of grouped rows

Hi, I am working on a little prediction game in Flask with flask-sqlalchemy. I have a User model:
class User(db.Model, UserMixin):
    id = db.Column(db.Integer, primary_key=True)
    nick = db.Column(db.String(255), unique=True)
    bets = relationship('Bet', backref=backref("user"))
and my Bet model
class Bet(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    uid = db.Column(db.Integer, db.ForeignKey('user.id'))
    matchid = db.Column(db.Integer, db.ForeignKey('match.id'))
    points = db.Column(db.Integer)
Neither is the full class, but it should do for the question. A user can gather points for predicting the match outcome and gets a different number of points for predicting the exact outcome, the winner or the difference.
I now want to have a list of the top users, for which I have to sum up the points. I'm doing that via
toplist = db.session.query(User.nick, func.sum(Bet.points)).\
    join(User.bets).group_by(Bet.uid).order_by(func.sum(Bet.points).desc()).all()
This works quite well. Now there may be the case that two players have the same sum of points. In this case the number of correct predictions (each rewarded with 3 points) would decide the winner. I can get this list with
tophits = db.session.query(User.nick, func.count(Bet.points)).\
    join(User.bets).filter_by(points=3).all()
They both work well, but I think there has to be a way to combine both queries and get a table with username, points and "hitcount". I've done that before in SQL, but I am not that familiar with SQLAlchemy and have tied my brain in knots. How can I get both queries in one?
In the query for tophits just replace the COUNT/filter_by construct with equivalent SUM(CASE(..)) without filter so that the WHERE clause for both is the same. The code below should do it:
from sqlalchemy import case, func
total_points = func.sum(Bet.points).label("total_points")
# the value-to-result mapping is passed positionally; newer SQLAlchemy releases no longer accept whens= as a keyword
total_hits = func.sum(case({3: 1}, value=Bet.points, else_=0)).label("total_hits")
q = (session.query(
        User.nick,
        total_points,
        total_hits,
    )
    .join(User.bets)
    .group_by(User.nick)
    .order_by(total_points.desc())
    .order_by(total_hits.desc())
)
Note that I changed the group_by clause to use a column that is in the SELECT list, as some database engines might complain otherwise. You do not strictly need to do this.
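For reference, this is roughly the SQL that the SUM(CASE ...) approach boils down to, shown as a small sketch with the standard sqlite3 module. The table names users and bet and the sample rows are assumptions for illustration. The real tables come from the models in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, nick TEXT UNIQUE);
    CREATE TABLE bet (id INTEGER PRIMARY KEY, uid INTEGER, points INTEGER);
    INSERT INTO users (id, nick) VALUES (1, 'ann'), (2, 'bob');
    -- both users reach 6 points, but ann has two exact hits and bob one
    INSERT INTO bet (uid, points) VALUES (1, 3), (1, 3), (2, 3), (2, 2), (2, 1);
""")

toplist = conn.execute("""
    SELECT users.nick,
           SUM(bet.points) AS total_points,
           SUM(CASE bet.points WHEN 3 THEN 1 ELSE 0 END) AS total_hits
    FROM users
    JOIN bet ON bet.uid = users.id
    GROUP BY users.nick
    ORDER BY total_points DESC, total_hits DESC
""").fetchall()

print(toplist)  # [('ann', 6, 2), ('bob', 6, 1)]
```

Because the hit count is computed with CASE inside the aggregate instead of a WHERE filter, both totals come from the same set of rows, and the hit count can serve as the tie-breaker in ORDER BY.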

Issues with DISTINCT when used in conjunction with ORDER

I am trying to construct a site which ranks performances for a selection of athletes in a particular event. I previously posted a question which received a few good responses, and these helped me identify the key problem with my current code.
I have 2 models - Athlete and Result (Athlete HAS MANY Results)
Each athlete can have a number of recorded times for a particular event. I want to identify the quickest time for each athlete and rank these quickest times across all athletes.
I use the following code:
<% @filtered_names = Result.where(:event_name => params[:justevent]).joins(:athlete).order('performance_time_hours ASC').order('performance_time_mins ASC').order('performance_time_secs ASC').order('performance_time_msecs ASC') %>
This successfully ranks ALL the results across ALL athletes for the event (i.e. one athlete can appear a number of times in different places depending on the times they have recorded).
I now wish to pull out just the best result for each athlete and include it in the rankings. I can select the time corresponding to the best result using:
<% @currentathleteperformance = Result.where(:event_name => params[:justevent]).where(:athlete_id => filtered_name.athlete_id).order('performance_time_hours ASC').order('performance_time_mins ASC').order('performance_time_secs ASC').order('performance_time_msecs ASC').first() %>
However, my problem comes when I try to identify the distinct athlete names listed in @filtered_names. I tried using <% @filtered_names = @filtered_names.select('distinct athlete_id') %> but this doesn't behave how I expected it to, and on occasion it gets the rankings in the wrong order.
I have discovered that, as it stands, my code essentially looks for a difference between the distinct athletes' results, starting with the hours and progressing through mins, secs and msecs. As soon as it finds a difference between the results of the distinct athletes, it orders them accordingly.
For example, if I have 2 athletes:
Time for Athlete 1 = 0:0:10:5
Time for Athlete 2 = 0:0:10:3
This will yield the order Athlete 2, Athlete 1.
However, if I have:
Time for Athlete 1 = 0:0:10:5
Time for Athlete 2 = 0:0:10:3
Time for Athlete 2 = 0:1:11:5
Then the order is given as Athlete 1, Athlete 2, as the first difference is in the mins digit and Athlete 2 is slower...
Can anyone suggest a way to get around this problem and essentially go down the entries in @filtered_names, pulling out each name the first time it appears (i.e. keeping the names in the order they first appear in @filtered_names)?
Thanks for your time
If you're on Ruby 1.9.2+, you can use Array#uniq and pass a block specifying how to determine uniqueness. For example:
@unique_results = @filtered_names.uniq { |result| result.athlete_id }
That should return only one result per athlete, and that one result should be the first in the array, which in turn will be the quickest time since you've already ordered the results.
One caveat: @filtered_names might still be an ActiveRecord::Relation, which has its own #uniq method. You may first need to call #all to return an Array of the results:
@unique_results = @filtered_names.all.uniq { ... }
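The same "keep the first occurrence per athlete" idea, sketched in Python for readers who want to see the mechanics spelled out. The helper name unique_by and the sample records are hypothetical. The times are the three from the question converted to milliseconds and already sorted fastest first.

```python
def unique_by(items, key):
    """Return items with duplicates (by key) dropped, keeping first occurrences in order."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique

# Results already ordered by time, quickest first.
results = [
    {"athlete_id": 2, "time_msecs": 10003},  # 0:0:10:3
    {"athlete_id": 1, "time_msecs": 10005},  # 0:0:10:5
    {"athlete_id": 2, "time_msecs": 71005},  # 0:1:11:5
]

best = unique_by(results, key=lambda r: r["athlete_id"])
print([r["athlete_id"] for r in best])  # [2, 1]
```

Since the input is sorted by time, the first row seen for each athlete is that athlete's quickest result, and the original ranking order is preserved.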
You should use the DB to find each athlete's best time, not Ruby code. Add a new column to the results table called total_time_in_msecs and set its value every time a Result is saved.
class Result < ActiveRecord::Base
  before_save :init_data
  def init_data
    self.total_time_in_msecs = performance_time_hours * MSEC_IN_HOUR +
      performance_time_mins * MSEC_IN_MIN +
      performance_time_secs * MSEC_IN_SEC +
      performance_time_msecs
  end
  MSEC_IN_SEC = 1000
  MSEC_IN_MIN = 60 * MSEC_IN_SEC
  MSEC_IN_HOUR = 60 * MSEC_IN_MIN
end
Now you can write your query as follows:
# the quickest time is the smallest value, so aggregate with min and sort ascending
athletes = Athlete.joins(:results).
  select("athletes.id,athletes.name,min(results.total_time_in_msecs) best_time").
  where("results.event_name = ?", params[:justevent]).
  group("athletes.id, athletes.name").
  order("best_time ASC")
athletes.first.best_time # prints a number
Write a simple helper to break the number down into time parts:
def human_time time_in_msecs
  "%d:%02d:%02d:%03d" %
    [Result::MSEC_IN_HOUR, Result::MSEC_IN_MIN,
     Result::MSEC_IN_SEC, 1 ].map do |interval|
      r = time_in_msecs/interval
      time_in_msecs = time_in_msecs % interval
      r
    end
end
Use the helper in your views to display the broken-down time.
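The conversion in both directions is plain integer arithmetic. Here is a small Python sketch of it, using divmod. The function names total_msecs and human_time are illustrative only, not part of any framework.

```python
MSEC_IN_SEC = 1000
MSEC_IN_MIN = 60 * MSEC_IN_SEC
MSEC_IN_HOUR = 60 * MSEC_IN_MIN

def total_msecs(hours, mins, secs, msecs):
    """Collapse the four time parts into one sortable number of milliseconds."""
    return hours * MSEC_IN_HOUR + mins * MSEC_IN_MIN + secs * MSEC_IN_SEC + msecs

def human_time(time_in_msecs):
    """Split a millisecond total back into h:mm:ss:mmm for display."""
    hours, rest = divmod(time_in_msecs, MSEC_IN_HOUR)
    mins, rest = divmod(rest, MSEC_IN_MIN)
    secs, msecs = divmod(rest, MSEC_IN_SEC)
    return "%d:%02d:%02d:%03d" % (hours, mins, secs, msecs)

slow = total_msecs(0, 1, 11, 5)
print(slow)              # 71005
print(human_time(slow))  # 0:01:11:005
```

Storing one number per result means a single ORDER BY or MIN on that column is enough, instead of comparing hours, minutes, seconds and milliseconds column by column.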

Raven DB Count Queries

I need to get a count of the documents in a particular collection.
There is an existing index Raven/DocumentCollections that stores the Count and Name of the collection paired with the actual documents belonging to the collection. I'd like to pick up the count from this index if possible.
Here is the Map-Reduce of the Raven/DocumentCollections index :
from doc in docs
let Name = doc["@metadata"]["Raven-Entity-Name"]
where Name != null
select new { Name , Count = 1}
from result in results
group result by result.Name into g
select new { Name = g.Key, Count = g.Sum(x=>x.Count) }
On a side note, var Count = DocumentSession.Query<Post>().Count(); always returns 0 for me, even though there are clearly 500-odd documents in my DB and at least 50 of them have "Raven-Entity-Name" set to "Posts" in their metadata. I have no idea why this Count query keeps returning 0. The Raven logs show this when the Count is done:
Request # 106: GET - 0 ms - TestStore - 200 - /indexes/dynamic/Posts?query=&start=0&pageSize=1&aggregation=None
For anyone still looking for the answer (this question was posted in 2011), the appropriate way to do this now is:
var numPosts = session.Query<Post>().Count();
To get the results from the index, you can use:
session.Query<Collection>("Raven/DocumentCollections")
    .Where(x => x.Name == "Posts")
    .FirstOrDefault();
That will give you the result you want.