speed up sqlalchemy orm dynamic relationship slicing with negative indices

I have the following SQLA models and relationships. I am logging a measurement for each channel every second, so there are lots of measurements in the DB.
class Channel( Model ) :
    __tablename__ = 'channel'
    id = Column( Integer, primary_key=True )
    #! --- Relationships ---
    measurements = relationship( 'Measurement', back_populates='channel', lazy='dynamic' )

class Measurement( Model ) :
    __tablename__ = 'measurement'
    id = Column( Integer, primary_key=True )
    channel_id = Column( Integer, ForeignKey( 'channel.id' ), nullable=False )  # FK behind the relationship
    timestamp = Column( DateTime, nullable=False )
    value = Column( Float, nullable=False )
    #! --- Relationships ---
    channel = relationship( 'Channel', back_populates='measurements', uselist=False )
If I want to get the latest measurement, I can get it via the ORM by slicing with a negative index.
channel.measurements[-1]
However, it is very, very slow!
I can filter the relationship query further with .filter() and .order_by() etc. to get what I want, but I like using the ORM (why have it otherwise?).
I noticed that if I slice with a positive index it is fast (similar to the explicit SQLA queries mentioned above).
channel.measurements[0]
I changed the relationship to keep measurements in reverse order, and that seems to work in conjunction with using a zero index.
measurements = relationship( 'Measurement', back_populates='channel', lazy='dynamic', order_by='Measurement.id.desc()' )
So, why is negative index slicing so slow?
Is it a bug in SQLAlchemy? I would have thought it would be smart enough to emit the correct SQL to get only the latest item from the DB.
Is there something else I need to do to keep the measurements sorted in natural order, use negative index slicing, and get the same speed as the other methods?

You haven't given any ordering, so it has to load all of the objects into a list, and then get the last one.
If you add the echo=True parameter (typically on create_engine()), you can see the difference in the queries.
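A minimal sketch of that setup (the database URL here is just a placeholder):
from sqlalchemy import create_engine

# echo=True logs every SQL statement the engine emits
engine = create_engine('postgresql://user:pass@localhost/mydb', echo=True)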
For measurements[0], it selects just one (LIMIT 1) of the measurements matching the channel:
SELECT measurement.id AS measurement_id, measurement.ts AS measurement_ts,
measurement.value AS measurement_value,
measurement.channel_id AS measurement_channel_id
FROM measurement
WHERE %(param_1)s = measurement.channel_id
LIMIT %(param_2)s
{'param_1': 6, 'param_2': 1}
For measurements[-1], it selects all of the measurements matching the channel. Since you haven't specified an order, the database returns the rows in whatever order it chooses (possibly the primary key on measurement, but there is no guarantee):
SELECT measurement.id AS measurement_id, measurement.ts AS measurement_ts,
measurement.value AS measurement_value,
measurement.channel_id AS measurement_channel_id
FROM measurement
WHERE %(param_1)s = measurement.channel_id
{'param_1': 6}
If you want just the latest Measurement, select it and order by the timestamp field; you probably want indexes on channel_id and your timestamp field:
db.session.query(Measurement)\
    .filter(Measurement.channel_id == channel_id)\
    .order_by(Measurement.ts.desc())\
    .limit(1)\
    .first()
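A minimal sketch of such an index, assuming the Measurement model from the question (the echoed SQL above calls the column ts, the question's model calls it timestamp; use whichever name you actually have). A composite index on (channel_id, timestamp) lets the database satisfy the filter, the descending order, and the LIMIT 1 in a single index lookup:
from sqlalchemy import Index

# inside the Measurement class; hypothetical index name, created by
# Model.metadata.create_all() or your migration tool
__table_args__ = (
    Index( 'ix_measurement_channel_timestamp', 'channel_id', 'timestamp' ),
)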

It seems the answer is that efficient slicing of relationship collections with negative indices is not supported in SQLA. In fact, there is a kludgy attempt at it in the code, but it is going to be removed from SQLA as it wasn't carefully thought out.
https://github.com/sqlalchemy/sqlalchemy/issues/5605
I have solved my issue by implementing a hybrid property that returns me the latest measurement, rather than slicing the relationship collection directly.
@hybrid_property
def latest_measurement( self ) -> 'Measurement' :
    """
    Hybrid property that returns the latest measurement for the channel.
    """
    measurement = self.measurements.order_by( Measurement.id.desc() ).first()
    return measurement
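Usage is then plain attribute access (a sketch, assuming an open Session and the models above):
channel = session.query( Channel ).first()   # any Channel instance
latest = channel.latest_measurement          # emits SELECT ... ORDER BY id DESC LIMIT 1
if latest is not None :
    print( latest.timestamp, latest.value )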

Related

Efficient way to select one from each category - Rails

I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
@exercises = BODYPARTS.map do |bp|
  Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
@exercises = Exercise.all
BODYPARTS.map do |bp|
  @exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale:
Queries: 0.072902 0.020728 0.093630 ( 0.088008)
Select: 0.000962 0.000225 0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072 0.000008 0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do (thanks to PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
.order('bodypart, random()')
Postgres' DISTINCT ON is very handy, and performance is typically great, too - for many distinct bodyparts with few rows each. But for only a few distinct values of bodypart with many rows each (a big table - and your use case) there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM   unnest(enum_range(null::bodypart)) b(bodypart)
CROSS  JOIN LATERAL (
   SELECT *
   FROM   exercises
   WHERE  bodypart = b.bodypart
   -- ORDER BY ???  -- for a deterministic pick
   LIMIT  1  -- arbitrary pick!
   ) e;
Assuming that bodypart is the name of the enum as well as the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row for each bodypart, a simple index on exercises(bodypart) does the job. But you can have a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause and almost the same performance.
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?

Search efficiently for records matching given set of properties/attributes and their values (exact match, less than, greater than)

It is a fairly simple problem to describe. However, I could not come up with any reasonable solution, so the solution may or may not be so easy to cook up. Here is the problem:
Let there be many records describing some objects. For example:
{
id : 1,
kind : cat,
weight : 25 lb,
color : red
age : 10,
fluffiness : 98
attitude : grumpy
}
{
id : 2,
kind : robot,
chassis : aluminum,
year : 2015,
hardware : intel curie,
battery : 5000,
bat-life : 168,
weight : 0.5 lb,
}
{
id : 3,
kind : lightsaber,
color : red,
type : single blade,
power : 1000,
weight : 25 lb,
creator : Darth Vader
}
Attributes are not pre-specified so an object could be described using any attribute-value pairs.
If there are 1 000 000 records/objects there could easily be 100 000 different attributes.
My goal is to efficiently search through the data structure(s) containing all the records and, if possible, quickly come up with an answer as to which records match the given conditions.
For example, a search query could be: Find all cats that weigh more than 20, are older than 9, are fluffier than 98, are red, and whose attitude is "grumpy".
We can assume that there could be an infinite number of records and an infinite number of attributes, but any search query contains no more than 20 numerical (lt, gt) clauses.
One possible implementation using SQL/MySQL I could think of was using fulltext indexes.
For example, I could store non-numeric attributes as "kind_cat color_red attitude_grumpy", search through them to narrow the result set, and then scan the table containing numeric attributes for matches. It seems, however (I am not sure at this point), that gt/lt searches might be costly in general using this strategy (I would have to do at least N joins for N numerical clauses).
I thought of MongoDB while thinking about the problem, but although MongoDB naturally allows me to store key-value pairs, searching by some fields (not all) means that I would have to create indexes that contain all keys in all possible orders/permutations (and this is impossible).
Can this be done efficiently (maybe in logarithmic time??) using MySQL or any other dbms? - If not, is there a data structure (maybe some multi-dimensional tree?) and algorithm that allows executing this kind of search efficiently on a large scale (considering both time and space complexity)?
If it isn't possible to solve the problem defined this way, are there any heuristic approaches that solve it without sacrificing too much?
If I get it right, you're thinking of something like:
create table t
( id int not null
, kind varchar(...) not null
, key varchar(...) not null
, val varchar(...) not null
, primary key (id, kind, key) );
There are several problems with this approach; you can google for EAV to find out more. One example is that you will have to cast val to the appropriate type when doing comparisons ('2' > '10').
That said, an index like:
create unique index ix1 on t (kind, key, val, id)
will reduce the pain slightly, but the design won't scale well, and with 1E6 rows and 1E5 attributes the performance will be far from good. Your example query would look something like:
select a.id
from ( select id
       from ( select id, val
              from t
              where kind = 'cat'
                and key = 'weight'
            ) as w
       where cast(val as int) > 20
     ) as a
join ( select id
       from ( select id, val
              from t
              where kind = 'cat'
                and key = 'age'
            ) as g
       where cast(val as int) > 9
     ) as b
  on a.id = b.id
join ( select id
       from ( select id, val
              from t
              where kind = 'cat'
                and key = 'fluffy'
            ) as f
       where cast(val as int) > 98
     ) as c
  on a.id = c.id
join ...
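To see how that query grows with the number of numeric clauses (one self-join per clause, which is exactly the cost the question worried about), here is a hedged Python sketch that generates it from a list of clauses. The table and column names mirror the answer; it is illustrative only, not a hardened query builder:
def build_query(kind, clauses):
    """clauses: list of (attribute, operator, value); operator must be '>' or '<'."""
    selects, params = [], []
    for attr, op, value in clauses:
        assert op in ('>', '<')        # never interpolate an untrusted operator
        selects.append(
            "select id from ( select id, val from t"
            " where kind = %s and key = %s ) as x"
            " where cast(val as int) " + op + " %s")
        params += [kind, attr, value]
    sql = "select q0.id from ( " + selects[0] + " ) as q0"
    for i, sub in enumerate(selects[1:], start=1):
        sql += " join ( " + sub + " ) as q%d on q0.id = q%d.id" % (i, i)
    return sql, params

# run with cursor.execute(sql, params) on a DB-API driver that uses %s placeholders
sql, params = build_query('cat', [('weight', '>', 20), ('age', '>', 9), ('fluffy', '>', 98)])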

What is the use case that makes EAVT index preferable to EATV?

From what I understand, EATV (which Datomic does not have) would be a great fit for as-of queries. On the other hand, I see no use-case for EAVT.
This is analogous to row/primary key access. From the docs: "The EAVT index provides efficient access to everything about a given entity. Conceptually this is very similar to row access style in a SQL database, except that entities can possess arbitrary attributes rather than being limited to a predefined set of columns."
The immutable time/history side of Datomic is a motivating use case for it, but in general, it's still optimized around typical database operations, e.g. looking up an entity's attributes and their values.
Update:
Datomic stores datoms (in segments) in the index tree. So you navigate to a particular E's segment using the tree and then retrieve the datoms about that E in the segment, which are EAVT datoms. From your comment, I believe you're thinking of this as the navigation of more b-tree like structures at each step, which is incorrect. Once you've navigated to the E, you are accessing a leaf segment of (sorted) datoms.
You are not looking for a single value at a specific point in time. You are looking for a set of values up to a specific point in time T. History is on a per value basis (not attribute basis).
For example, assert X, retract X then assert X again. These are 3 distinct facts over 3 distinct transactions. You need to compute that X was added, then removed and then possibly added again at some point.
You can do this with SQL:
create table Datoms (
E bigint not null,
A bigint not null,
V varbinary(1536) not null,
T bigint not null,
Op bit not null --assert/retract
)
select E, A, V
from Datoms
where E = 1 and T <= 42
group by E, A, V
having 0 < sum(case Op when 1 then +1 else -1 end)
The fifth component Op of the datom tells you whether the value is asserted (1) or retracted (0). By summing over this value (as +1/-1) we arrive at either 1 or 0.
Asserting the same value twice does nothing, and you always retract the old value before you assert a new value. The last part is a prerequisite for the algorithm to work out this nicely.
With an EAVT index, this is a very efficient query and it's quite elegant. You can build a basic Datomic-like system in just 150 lines of SQL like this. It is the same pattern repeated for any permutation of EAVT index that you want.
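To make that concrete, here is a small self-contained sketch of the same query using Python's sqlite3 module (the table mirrors the answer's DDL, with text standing in for varbinary; the sample facts are the assert/retract/assert example above):
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table Datoms (E integer, A integer, V text, T integer, Op integer)')

# Entity 1, attribute 10: assert 'X' at T=1, retract it at T=2, assert it again at T=3.
con.executemany('insert into Datoms values (?, ?, ?, ?, ?)',
                [(1, 10, 'X', 1, 1),
                 (1, 10, 'X', 2, 0),
                 (1, 10, 'X', 3, 1)])

AS_OF = """
    select E, A, V
    from Datoms
    where E = ? and T <= ?
    group by E, A, V
    having 0 < sum(case Op when 1 then +1 else -1 end)
"""
print(con.execute(AS_OF, (1, 2)).fetchall())  # as of T=2: []              (the value was retracted)
print(con.execute(AS_OF, (1, 3)).fetchall())  # as of T=3: [(1, 10, 'X')]  (asserted again)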

Redis zrevrangebyscore, sorting other than lexicographical order

I have implemented a leaderboard using sorted sets in Redis. I want users with the same score to be ordered chronologically, i.e., the user who came first should be ranked higher. Currently Redis orders members with equal scores lexicographically. Is there a way to override that? Mobile numbers are being used as members in the sorted set.
One solution I thought of is prefixing the mobile numbers with a timestamp and maintaining a hash that maps mobile number to timestamp.
$redis.hset('mobile_time', '1234567890', "#{Time.now.strftime('%Y%m%d%H%M%S')}")
pref = $redis.hget('mobile_time', '1234567890')
$redis.zadd('myleaderboard', 400, "#{pref}1234567890")
That way I can get the rank for a given user at any instant by adding the prefix from the hash.
This is not exactly what I want, though - it returns the opposite ordering: a user who comes earlier will be placed below a user who comes later (both with the same score).
Key for user1 = 2012101219532301234567890, score: 400
Key for user2 = 2012101219532609313123523, score: 400 (3 seconds later)
If I use zrevrangebyscore, user2 will be placed higher than user1.
However, there's a way to get the desired rank:
users_with_higher_score_count = $redis.zcount("mysset", "(400", "+inf")
users_with_same_score = $redis.zrangebyscore("mysset", "400", "400")
Now I have the list users_with_same_score in the correct order, and by looking at the index I can calculate the rank of the user.
To get the leaderboard, I can fetch members in intervals of 50 and order them in Ruby code, but that doesn't seem like a good way to do it.
I want to know if there's a better approach, or any improvements that can be made to the solution I proposed.
Thanks in advance for your help.
P.S. Scores are in multiples of 50
The score in a sorted set supports double precision floating point numbers, so possibly a better solution would be to store the redis score as highscore.timestamp
e.g. (pseudocode)
highscore = 100
timestamp = now()
redis.zadd('myleaderboard', highscore + '.' + timestamp, playerId)
This would mean that multiple players who achieved the same high score will also be sorted based on the time they achieved that high score, as per the following:
For player 1...
redis.zadd('myleaderboard', '100.1362345366', "Charles")
For player 2...
redis.zadd('myleaderboard', '100.1362345399', "Babbage")
See this question for more detail: Unique scoring for redis leaderboard
The external weights feature of the sort command is your saviour here
SORT mylist BY weight_*
http://redis.io/commands/sort
If you are displaying the leaderboard in descending order of score, then I don't think the above solution will work. Instead of just appending the timestamp to the score, you should append Long.MAX_VALUE - System.nanoTime(). So your final score code should look like this:
highscore = 100
timestamp = Long.MAX_VALUE - System.nanoTime();
redis.zadd('myleaderboard', highscore + '.' + timestamp, playerId);
Now you will get the correct order when you call redis.zrevrange('myleaderboard', startIndex, endIndex)
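For reference, a minimal redis-py sketch of the same inversion idea, using a fixed future cutoff in whole seconds instead of Long.MAX_VALUE nanoseconds so the tie-breaker stays within a double's precision (all names and values here are illustrative):
import time
import redis

r = redis.Redis()
CUTOFF = 10_000_000_000          # seconds; any instant safely in the future works as the inversion point

def add_score(player_id, highscore):
    # Earlier timestamps give a LARGER fraction, so among equal highscores the
    # earlier player sorts first under ZREVRANGE. With very large highscores the
    # fraction loses resolution (see the next question for that limitation).
    tie_breaker = (CUTOFF - int(time.time())) / CUTOFF      # a value in (0, 1)
    r.zadd('myleaderboard', {player_id: highscore + tie_breaker})

add_score('Charles', 100)
time.sleep(1)
add_score('Babbage', 100)                                   # same score, one second later
print(r.zrevrange('myleaderboard', 0, 9))                   # [b'Charles', b'Babbage']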

How to create a sorted set with "field1 desc, field2 asc" order in Redis?

I am trying to build leaderboards in Redis and be able to get top X scores and retrieve a rank of user Y.
Sorted sets in Redis look like an easy fit except for one problem - I need scores to be sorted not only by actual score, but also by date (so whoever got the same score earlier will be on top). The SQL query would be:
select * from scores order by score desc, date asc
Running zrevrange on a sorted set in Redis uses something like:
select * from scores order by score desc, key desc
Which would put users with lexicographically bigger keys above.
One solution I can think of is making some manipulations with a score field inside a sorted set to produce a combined number that consists of a score and a timestamp.
For example for a score 555 with a timestamp 111222333 the final score could be something like 555.111222333 which would put newer scores above older ones (not exactly what I need but could be adjusted further).
This would work, but only on small numbers, as a score in a sorted set has only 16 significant digits, so 10 of them will be wasted on a timestamp right away leaving not much room for an actual score.
Any ideas how to make a sorted set arrange values in a correct order? I would really want an end result to be a sorted set (to easily retrieve user's rank), even if it requires some temporary structures and sorts to build such set.
Actually, all my previous answers are terrible. Disregard all my previous answers (although I'm going to leave them around for the benefit of others).
This is how you should actually do it:
Store only the scores in the zset
Separately store a list of each time a player achieved that score.
For example:
score_key = <whatever unique key you want to use for this score>
redis('ZADD scores-sorted %s %s' %(score, score))
redis('RPUSH score-%s %s' %(score, score_key))
Then to read the scores:
top_score_keys = []
for score in redis('ZRANGE scores-sorted 0 10'):
    for score_key in redis('LRANGE score-%s 0 -1' %(score, )):
        top_score_keys.append(score_key)
Obviously you'd want to do some optimizations there (ex, only reading hunks of the score- list, instead of reading the entire thing).
But this is definitely the way to do it.
User rank would be straightforward: for each user, keep track of their high score:
redis('SET highscores-%s %s' %(user_id, user_high_score))
Then determine their rank using:
user_high_score = redis('GET highscores-%s' %(user_id, ))
score_rank = int(redis('ZSCORE scores-sorted %s' %(user_high_score, )))
score_rank += int(redis('LINDEX score-%s' %(user_high_score, )))
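For what it's worth, a minimal redis-py sketch of the store-and-read half of this pattern (key names follow the pseudocode above; the rank bookkeeping is left out):
import redis

r = redis.Redis()

def record_score(score, score_key):
    r.zadd('scores-sorted', {str(score): score})       # one zset entry per distinct score
    r.rpush('score-%d' % score, score_key)             # players who hit that score, in order achieved

def top_score_keys(n=10):
    keys = []
    # Highest scores first; within a score, earliest achiever first.
    for score in r.zrevrange('scores-sorted', 0, n - 1):
        keys.extend(r.lrange(b'score-' + score, 0, -1))
    return keys

record_score(400, 'alice')
record_score(400, 'bob')        # same score, achieved later -> listed after alice
record_score(450, 'carol')
print(top_score_keys())         # [b'carol', b'alice', b'bob']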
It's not really the perfect solution, but if you make a custom epoch that would be closer to the current time, then you would need less digits to represent it.
For instance if you use January 1, 2012 for your epoch you would (currently) only need 8 digits to represent the timestamp.
Here's an example in ruby:
(Time.now - Time.new(2012,01,01,0,0,0)).to_i
This would give you about 3 years before the timestamp would require 9 digits, at which time you could perform some maintenance to move the custom epoch forward again.
I would however love to hear if anyone has a better idea, since I have the exact same problem.
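The same idea in Python, for comparison (2012-01-01 UTC as the made-up custom epoch):
from datetime import datetime, timezone
import time

# Made-up custom epoch: 2012-01-01 00:00:00 UTC.
CUSTOM_EPOCH = datetime(2012, 1, 1, tzinfo=timezone.utc).timestamp()

def small_timestamp():
    # ~8 digits for roughly the first three years after the custom epoch
    return int(time.time() - CUSTOM_EPOCH)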
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
Instead of using a timestamp in the score, you could use a global counter. For example:
score_key = <whatever unique key you want to use for this score>
score_number = redis('INCR global-score-counter')
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
And to sort them in descending order, pick a large score count (1<<24, say), use that as the initial value of global-score-counter, then use DECR instead of INCR.
(this would also apply if you are using a timestamp)
Alternately, if you are really, incredibly worried about the number of players, you could use a per-score counter:
score_key = <whatever unique key you want to use for this score>
score_number = redis('HINCRBY score-counter %s 1' %(score, ))
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
A couple thoughts:
You could make some assumptions about the timestamps to make them smaller. For example, instead of storing Unix timestamps, you could store "number of minutes since May 13, 2012" (for example). In exchange for seven significant digits, this would let you store times for the next 19 years.
Similarly, you could reduce the number of significant digits in the scores. For example, if you expect scores to be in the 7-digit range, you could divide them by 10, 100, or 1000 when storing them in the sorted list, then use the results of the sorted list to access the actual scores, sorting those at the application level.
For example, using both of the above (in potentially buggy pseudo-code):
score_small = int(score / 1000)
time_small = int((time - 1336942269) / 60)
score_key = uuid()
redis('SET full-score-%s "%s %s"' %(score_key, score, time))
redis('ZADD sorted-scores %s.%s %s' %(score_small, time_small, score_key))
Then to load them (approximately):
top_scores = []
for score_key in redis('ZRANGE sorted-scores 0 10'):
    score_str, time_str = redis('GET full-score-%s' %(score_key, )).split(" ")
    top_scores.append((int(score_str), int(time_str)))
top_scores.sort()
This operation could even be done entirely inside Redis (avoid the network overhead of the O(n) GET operations) using the EVAL command (although I don't know enough Lua to confidently provide an example implementation).
Finally, if you expect a truly huge range of scores (for example, you expect that there will be a large number of scores below 10,000 and an equally large number of scores over 1,000,000), then you could use two sorted sets: scores-below-100000 and scores-above-100000.