I am trying to find a complement using ActiveRecord and/or SQL.
I have a collection of 'annotations' that each have two relevant fields:
session_datum_id which corresponds to the user who performed the
annotation. Null means it has not yet been done.
post_id which represents the post the annotation is 'about'. Cannot
be null.
There are potentially multiple annotations per post_id.
I would like to efficiently find an annotation that satisfies two constraints:
session_datum_id is null. This means this particular annotation hasn't already been performed.
the session_datum passed in as an arg has not already performed another annotation with the same post_id.
Here is a very naive version which does a join outside the DB. It finds all annotations this user has already performed and removes those post_ids from the exhaustive list of annotations that still need to be performed. It then picks at random from the resulting list:
def self.random_empty_unseen(session_datum)
  mine = where('session_datum_id = ?', session_datum).to_a
  eligible = where('session_datum_id IS NULL').to_a
  mine.each do |i|
    # Drop any unperformed annotation whose post this user has already annotated.
    eligible.reject! { |j| i.post_id == j.post_id }
  end
  eligible[rand(eligible.count)]
end
As the list of annotations gets large, this will bog down terribly. I can imagine a probabilistic algorithm where we select an eligible annotation at random and then check whether the user has already annotated its post (retrying if so), but there are degenerate cases where that won't work (a large number of annotations where the user has already performed all but one of them).
Is there a closed form query for this, perhaps using NOT EXISTS?
SELECT a1.*
FROM annotations AS a1
JOIN annotations AS a2
  ON a1.post_id = a2.post_id
WHERE a2.session_datum_id = session_datum
  AND a1.session_datum_id IS NULL
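For what it's worth, here is a hedged, untested sketch of the NOT EXISTS form. The join above selects exactly the rows to exclude, so negating it with NOT EXISTS yields the complement. RANDOM() is the PostgreSQL/SQLite spelling (MySQL uses RAND()), and :session_datum_id stands for the bound parameter:

SELECT a1.*
FROM annotations AS a1
WHERE a1.session_datum_id IS NULL
  AND NOT EXISTS (
        SELECT 1
        FROM annotations AS a2
        WHERE a2.post_id = a1.post_id
          AND a2.session_datum_id = :session_datum_id
      )
ORDER BY RANDOM()
LIMIT 1;

From ActiveRecord this could be run via find_by_sql, or the NOT EXISTS fragment could be attached to a relation with where(...), so the whole complement is computed inside the database.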
A and B are adjacent runs on a stack, with A being the bottom and smaller run. (If B were smaller, merge_hi would be performing the merge, but the same question applies there as well.) I have been trying to figure out why the last element of A MUST be bigger than the last element of B, because I don't see how the run decomposition (or the rest of the algorithm) would ensure that condition. Also, in the same function the code seems to suggest that the first element of B is always smaller than the first element of A, which I also don't understand; I'm guessing the answer to this question is tied to the answer to the first one.
In short, galloping is the reason. That's why we didn't see it at first: I thought gallop_{left,right} were called only from merge_{lo,hi}, but that isn't true. gallop_... is also called from merge_at, before merge_{lo,hi} are called, in order to answer "Where does b start in a?" and "Where does a end in b?". Those calls (and the subsequent code) change ssb (and its length nb), and also ssa and its length na, such that the invariant in the title is satisfied.
The point is that A and B are not "adjacent runs on the run stack" as found in the original list to be sorted. Before merge_{lo,hi} is called, they are trimmed so that the condition in the title holds: prior to A being merged with B, the elements of B greater than the last element of A are explicitly left out of consideration. In other words, "Must also have that ssa.keys[na-1] belongs at the end of the merge" is true not because of some special property of ssa.keys[na-1], but because we have defined "the end of the merge" in such a way. :-)
That also means that when you implement Timsort yourself and leave out the galloping, you must also leave out the optimization being talked about here; otherwise the code won't work correctly.
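A minimal sketch of that pre-merge trimming, in Python, with bisect standing in for the galloping search (the real implementation gallops; this is only meant to show where the invariant comes from):

import bisect

def trim_runs(a, b):
    # a and b are adjacent ascending runs, a to the left of b.
    # Where does b[0] start in a? Everything in a before that point is
    # already <= b[0], hence already in its final position.
    k = bisect.bisect_right(a, b[0])
    a = a[k:]
    if not a:
        return a, b  # all of a was <= b[0]; nothing left to merge
    # Where does a[-1] end in b? Everything in b from that point on is
    # already >= a[-1], hence already in its final position.
    nb = bisect.bisect_left(b, a[-1])
    b = b[:nb]
    # The invariant now holds by construction: a[-1] is greater than
    # every remaining element of b, so it belongs at the end of the merge.
    return a, b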
I have an armor table that has three fields to identify individual designs: make, model and version.
I have to implement a search feature for our software that lets a user search armors according to various criteria, among which their design.
Now, the users' idea of a design is a single string that contains make, model and version concatenated, so the entry is that single string. Let's say they want to look up the specifications for make FH, model TT, version 27, they'll think of (and type) "FHTT27".
We use an IQueryOver object, upon which we add successive conditions according to the criteria. For the design, our code is
z_quoQuery = z_quoQuery.And(armor => armor.make + armor.model + armor.version == z_strDesign);
This raises an InvalidOperationException: "variable 'armor' of type 'IArmor' referenced from scope '', but it is not defined".
This is described as a bug here: https://github.com/mbdavid/LiteDB/issues/637
After a lot of trial and error, it seems that syntaxes that don't reference the armor variable first raise that exception.
Obviously, I have to go another route at least for now, but after searching for some time I can't seem to find how. I thought of using something like
z_quoQuery = z_quoQuery.And(armor => armor.make == z_strDesign.Substring(0, 2))
    .And(armor => armor.model == z_strDesign.Substring(2, 2))
    .And(armor => armor.version == z_strDesign.Substring(4, 2));
Unfortunately, the fields are liable to have variable lengths. For instance, another set of values for make, model, and version might be, respectively, "NGI", "928", and "RX", which the code above would parse wrong. So I can't bypass the difficulty that way. Nor can I use a RegEx.
Neither can I make up a property in my Armor class that would concatenate all three properties, since it cannot be converted to SQL by NHibernate.
Does someone have an idea of how to do this?
Maybe I should use an explicit SQL condition here, but how would it mix with other conditions?
It seems you can use Projections.Concat to solve your issue:
z_quoQuery = z_quoQuery.And(armor => Projections.Concat(armor.make, armor.model, armor.version) == z_strDesign);
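Since Projections.Concat should translate into the SQL concatenation operator, the comparison runs inside the database and composes with further And() conditions as usual. A hedged, untested sketch; armor.weight and z_maxWeight are made-up names, only there to show the chaining:

z_quoQuery = z_quoQuery
    .And(armor => Projections.Concat(armor.make, armor.model, armor.version) == z_strDesign)
    .And(armor => armor.weight < z_maxWeight); // hypothetical additional criterion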
I have a data set which includes a number of nodes, all labeled claim, which can have various properties (named P1, P2, etc., through P2000). Currently, each claim node can have only one of these properties, and each property has a value, which can be of different types (i.e. P1 may be a string, P2 a float, P3 an integer, etc.). I also need to be able to look up the nodes by any property (i.e. "find all nodes with P3 which equals 42").
I have modeled this as nodes having a property named value and a label matching the P property. I then define a schema index on the label claim and the property value. The lookup would then look something like:
MATCH (n:P569:claim) WHERE n.value = 42 RETURN n
My first question is: is it OK to have such an index? Are mixed-type indexes allowed?
The second question is that the lookup above works (though I'm not sure whether it uses the index or not), but the following doesn't; note the label order is switched:
neo4j-sh (?)$ MATCH (n:claim:P569) WHERE n.value>0 RETURN n;
IncomparableValuesException: Don't know how to compare that. Left: "113" (String); Right: 0 (Long)
P569 properties are all numeric, but there are string properties from other P-values, one of which is "113". Somehow, even though I said the label should be both claim and P569, the "113" value is still included in the comparison, even though it has no P569 label:
neo4j-sh (?)$ MATCH (n:claim) WHERE n.value ="113" RETURN LABELS(n);
+-------------------+
| LABELS(n) |
+-------------------+
| ["claim","P1036"] |
| ["claim","P902"] |
+-------------------+
What is wrong here? Why does it work with one label order but not the other? Can this data model be improved?
Let me at least try to side-step your question: there's another way you could model this that would resolve at least some of your problems.
You're encoding the property name as a label. Perhaps you want to do that to speed up looking up a subset of nodes where that property applies; still it seems like you're causing a lot of difficulty by shoe-horning incomparable data values all into the same property named "value".
What if, in addition to using these labels, each property was named the same as the label? I.e.:
CREATE (n:P569:claim { P569: 42});
You still get your label lookups, but by segregating the property names, you can guarantee that the query planner will never accidentally compare incomparable values in the way it builds an execution plan. Your query for this node would then be:
MATCH (n:P569:claim) WHERE n.P569 > 5 AND n.P569 < 50 RETURN n;
Note that if you know the right label to use, then you're guaranteed to know the right property name to use. By using properties with different names, if you're loading your data in such a way that P569s are always integers, you can't end up with the incomparable situation you have. (I think that's happening because of the particular way Cypher executes that query.)
A possible downside here is that if you have to index all of those properties, it could be a lot of indexes, but still might be something to consider.
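For reference, a per-property schema index under this model might look like the following (Neo4j 2.x-era syntax, matching the neo4j-sh output above; one such statement per property you actually query by):

CREATE INDEX ON :claim(P569);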
I think it makes sense to take a step back and think about what you actually want to achieve: why do you have those 2000 properties in the first place, and how could you model them differently in a graph?
Also make sure to just leave off properties you don't need and use coalesce() to provide the default.
Given two models:
class Post(models.Model):
    ...

class Comment(models.Model):
    post = models.ForeignKey(Post)
    ...
Is there a good way to get the last 3 Comments on a set of Posts (i.e. in a single roundtrip to the DB instead of once per post)? A naive implementation to show what I mean:
for post in Post.objects.filter(id__in=some_post_ids):
    post.latest_comments = list(Comment.objects.filter(post=post).order_by('-id')[:3])
Given some_post_ids == [1, 2], the above will result in 3 queries:
[{'sql': 'SELECT "myapp_post"."id" FROM "myapp_post" WHERE "myapp_post"."id" IN (1, 2)', 'time': '0.001'},
{'sql': 'SELECT "myapp_comment"."id", "myapp_comment"."post_id" FROM "myapp_comment" WHERE "myapp_comment"."post_id" = 1 LIMIT 3', 'time': '0.001'},
{'sql': 'SELECT "myapp_comment"."id", "myapp_comment"."post_id" FROM "myapp_comment" WHERE "myapp_comment"."post_id" = 2 LIMIT 3', 'time': '0.001'}]
From Django's docs:
Slicing. As explained in Limiting QuerySets, a QuerySet can be sliced, using Python’s array-slicing syntax. Slicing an unevaluated QuerySet usually returns another unevaluated QuerySet, but Django will execute the database query if you use the “step” parameter of slice syntax, and will return a list. Slicing a QuerySet that has been evaluated (partially or fully) also returns a list.
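Concretely, slicing alone doesn't touch the database; evaluation does. An illustrative sketch:

qs = Comment.objects.order_by('-id')[:3]  # still lazy: no query has run yet
latest = list(qs)                         # evaluation: one SELECT ... LIMIT 3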
Your naive implementation is reasonable, and each sliced queryset in it costs only one DB query. However, don't call list on it; I believe that will cause the DB to be hit immediately (although it should still only be a single query). The queryset is already iterable and there really shouldn't be any need to call list. More on calling list from the same doc page:
list(). Force evaluation of a QuerySet by calling list() on it. For example:
entry_list = list(Entry.objects.all())
Be warned, though, that this could have a large memory overhead, because Django will load each element of the list into memory. In contrast, iterating over a QuerySet will take advantage of your database to load data and instantiate objects only as you need them.
UPDATE:
With your added explanation I believe the following should work (however, it's untested so report back!):
comments = Comment.objects.filter(post__in=some_post_ids).order_by('-id')
Admittedly it doesn't do the limit of 3 comments per post; I'm sure that's possible but I can't think of the syntax off the top of my head. Also, remember you can always run a raw query on any model to get better optimisation, e.g. Comment.objects.raw("SELECT ...;").
Given the information here on the "select top N from group" problem, if your IN clause will cover a small number of posts, it may just be cheaper to either (a) do the multiple queries or (b) select all comments for the posts and then filter in Python, as sketched below. I'd suggest (a) if it's a small number of posts with lots of comments, and (b) if there will be relatively few comments per post.
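A hedged, untested sketch of option (b): one query fetches all comments for the posts (newest first), and Python keeps the three newest per post, so the whole thing costs two queries regardless of how many posts there are. Field names are taken from the question:

from collections import defaultdict

# Query 1: all comments on the relevant posts, newest first.
comments = Comment.objects.filter(post__in=some_post_ids).order_by('-id')

latest_by_post = defaultdict(list)
for comment in comments:
    bucket = latest_by_post[comment.post_id]
    if len(bucket) < 3:  # keep only the three newest per post
        bucket.append(comment)

# Query 2: the posts themselves.
for post in Post.objects.filter(id__in=some_post_ids):
    post.latest_comments = latest_by_post[post.id]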
I am processing a bunch of data and I haven't coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:
SELECT body, COUNT(body) AS dup_count
FROM comments
GROUP BY body
HAVING (COUNT(body) > 1)
And I get back a list of duplicates. Looking into this, I find that these duplicates have multiple hashes. The shortest comment string is "[deleted]", so let's use that as an example. In my database there are nine instances of a comment being "[deleted]", and they produce hashes of both 1169143752200809218 and 1738115474508091027. The hash starting with 116 is found 6 times and the one starting with 173 is found 3 times. But when I run it in IRB, I get the following:
a = '[deleted]'.hash # => 811866697208321010
Here is the code I'm using to produce the hash:
def comment_and_hash(chunk)
  comment = chunk.at_xpath('*/span[@class="comment"]').text  # extract the comment text
  hash = comment.hash
  return comment, hash
end
I've confirmed that I don't touch comment anywhere else in my code. Here is my datamapper class.
class Comment
  include DataMapper::Resource
  property :uid,    Serial
  property :author, String
  property :date,   Date
  property :body,   Text
  property :arank,  Float
  property :srank,  Float
  property :parent, Integer # UID of the parent comment, or blank for top-level comments
  property :value,  Integer # hash, used to prevent duplicates from occurring
end
Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?
Which value is the correct value assuming my string consists of "[deleted]"?
Is there a way I could have different strings inside Ruby that SQL would see as the same string? That seems to be the most plausible explanation for why this is occurring, but I'm really shooting in the dark.
If you run
ruby -e "puts '[deleted]'.hash"
several times, you will notice that the value is different on each run. In fact, the hash value only stays constant as long as your Ruby process is alive. The reason for this is that String#hash is seeded with a random value: rb_str_hash (the implementing C function) uses rb_hash_start, which uses a random seed that gets initialized every time Ruby is spawned.
You could use a CRC such as Zlib.crc32 for your purposes, or you may want to use one of the message digests from OpenSSL::Digest, although the latter is overkill: for duplicate detection you probably won't need the security properties.
I use the following to create a String#hash alternative that is consistent across time and processes:
require 'zlib'

def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end
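Unlike String#hash, the result does not vary between processes (exact value omitted here):

generate_id('[deleted]')  # => same integer on every run, in every process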
Ruby intentionally makes String#hash produce different values in different sessions: Why is Ruby String.hash inconsistent across machines?