ORDER BY RANDOM - With or without replacement? - sql

When using something like SELECT * FROM Object ORDER BY RANDOM() LIMIT 200 to randomly sample 200 objects out of a table, is the sampling done with or without replacement? I am speculating it is with replacement, but I don't know for sure, and I have not found any documentation about this. I am using SQLite, but I don't think the implementation there differs from other databases.

First a random value is assigned to each row, then the rows are sorted by that value and the topmost 200 are selected. So the sampling is done without replacement: it is impossible for the same row to be selected twice.
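You can also check this empirically. A minimal sketch using Python's sqlite3, reusing the Object table name from the question:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Object (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO Object (id) VALUES (?)",
                 [(i,) for i in range(1000)])

# Sample 200 rows. If sampling were done with replacement, repeated
# runs would occasionally show duplicate ids in the result.
rows = conn.execute(
    "SELECT id FROM Object ORDER BY RANDOM() LIMIT 200"
).fetchall()

ids = [r[0] for r in rows]
print(len(ids), len(set(ids)))  # always "200 200": no duplicates

The two counts always match, which is exactly what sampling without replacement means.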

Related

sdiff - limit the result set to X items

I want to get the diff of two sets in redis, but I don't need to return the entire array, just 10 items for example. Is there any way to limit the results?
I was thinking something like this:
SDIFF set1 set2 LIMIT 10
If not, are there any other options to achieve this in a performant way, considering that set1 can be millions of objects and set2 is much, much smaller (hundreds)?
More info on what you want to achieve would be helpful. Something like this might require you to duplicate your data, though I don't know if that's something you want.
An option is chunking them:
1. Create a set with a uniquely generated id that can hold a max of 10 items.
2. Create a sorted set like so:
zadd(key, timestamp, chunkid)
where the timestamp is a unix time and the chunkid is the key that connects to the set. The key can be whatever name you would like, or it can also be a uniquely generated id.
3. Use ZRANGE to grab a specific one.
4. Repeat steps 1-3 for the second set.
Once you have one result from each of your sorted sets ("zset"), you can do your SDIFF using the chunkids.
Note that there are advantages and disadvantages to doing this, like more connection consumption (if calling from a client) and, obviously, a little more processing. It will help immensely, though, if you put this in a Lua script.
Hope this helps, or at least gives you an idea of how to model your data. If this is critical data, you might need an automated script of some sort to move your data around to meet the modeling requirement.
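A rough sketch of the chunking idea using the redis-py client; the key scheme (chunk:<uuid>, set1:index) is made up for illustration:

import time
import uuid
import redis

r = redis.Redis()

def add_in_chunks(index_key, items, chunk_size=10):
    """Split items into sets of at most chunk_size members, and index
    each chunk id in a sorted set keyed by a unix timestamp."""
    for i in range(0, len(items), chunk_size):
        chunk_id = f"chunk:{uuid.uuid4()}"          # hypothetical key scheme
        r.sadd(chunk_id, *items[i:i + chunk_size])
        r.zadd(index_key, {chunk_id: time.time()})

add_in_chunks("set1:index", [f"a{i}" for i in range(25)])
add_in_chunks("set2:index", ["a0", "a1", "x"])

# Grab one chunk id from each index with ZRANGE, then diff just those chunks.
chunk1 = r.zrange("set1:index", 0, 0)[0]
chunk2 = r.zrange("set2:index", 0, 0)[0]
print(r.sdiff(chunk1, chunk2))   # at most 10 items, never the full diff

Note the caveat above still applies: diffing two small chunks is not the same as taking the first 10 items of the full SDIFF, so this only works if your data model tolerates chunk-level diffs.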

jsFiddle API to get row count of user's fiddles

So, I had a nice thing going on jsFiddle where I listed all my fiddles on one page:
jsfiddle.net/show
However, they have been changing things slowly this year, and I've already had to make some changes to keep it running. The newest change is rather annoying. Of course, I like to see ALL my fiddles at once; it makes it easier to just hit ctrl+f and find what I might be looking for, but they've made that hard to do now. I used to be able to just set the limit to 99999 and see everything, but now it appears I can't go past how many I actually have (186 atm).
I tried using a start/limit solution, but when it got to the last 10|50 (I tried start={x}&limit=10 and start={x}&limit=50) it would die, namely because the last pull had to be an exact count. For example, I have 186, so with the by-10s solution it would die at start=180&limit=10.
I've searched the API docs but can't seem to find a row count or anything of that manner. Anyone know of a good, feasible solution that won't have me overloading their servers doing constant single-row checks?
I'm having the same problem as you are. Then I checked the docs (Displaying user’s fiddles - Result) and found out that if you include the callback=Api parameter, an additional overallResultSetCount field is included in the JSON response. I checked your fiddles and currently you have a total of 229 public fiddles.
The solution I can think of will force you to make only two requests. The first request's parameters don't matter as long as you have callback=Api. Then you send the second request, in which your limit will be the overallResultSetCount value.
Edit:
It's not in the documentation; however, I think the result set is limited to 200 entries only (hence your start/limit running from 0 to 199). I tried to query past the 200 range but I get an Error 500. I couldn't find another user whose fiddle count is more than 200 (most of the usernames I tested have fewer than 100 fiddles, like zalun, oskar, and rpflorence).
Based on this new observation, you can update your script like this:
1. I have tested that if the total fiddle count is less than 200, adding the start=0&limit=199 parameter will return all the fiddles. Hence, you can add that parameter to your initial call.
2. Check if your total result set is more than 200. If yes, update your parameters to reflect the range for the remaining result set (in this case, start=199&limit=229) and add the new result set to your old result set. Else, show/print the result set you initially got from your first query.
3. Repeat steps 1 and 2 if your total count reaches 400, 600, etc. (any multiple of 200).
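Putting it together, a sketch with Python's requests; the endpoint and the "list" field name follow the old jsFiddle API docs and may have changed since, and the username is hypothetical:

import json
import requests

BASE = "http://jsfiddle.net/api/user/someuser/demo/list.json"

def fetch(start, limit):
    # callback=Api wraps the response in JSONP, Api({...}); strip the wrapper.
    text = requests.get(BASE, params={"callback": "Api",
                                      "start": start, "limit": limit}).text
    return json.loads(text[text.index("(") + 1 : text.rindex(")")])

first = fetch(0, 199)                     # step 1: enough on its own if count < 200
total = first["overallResultSetCount"]    # present because of callback=Api
fiddles = first["list"]                   # field name per the old docs

# Step 2: page through any remainder. limit is treated as an absolute
# end index here, following the start=199&limit=229 example above.
start = 199
while len(fiddles) < total:
    fiddles += fetch(start, min(start + 200, total))["list"]
    start += 200

print(len(fiddles), "of", total)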

Getting maximum value of field in solr

I'd like to boost my query by the item's view count, using something like view_count / max_view_count to measure how the item's view count relates to the biggest view count in the index. I know how to boost the results with a function query, but how can I easily get the maximum view count? If anybody could provide an example it would be very helpful...
There aren't any aggregate functions in Solr in the way you might be thinking about them from SQL. The easiest way to do it is a two-step process:
1. Get the max value via an appropriate query with a sort.
2. Use it with the max() function.
So, something like:
q=*:*&sort=view_count desc&rows=1&fl=view_count
...to get an item with the max view_count, which you record somewhere, and then
q=whatever&bq=div(view_count, max(the_max_view_count, 1))
Note that the max() function there isn't doing an aggregate max; it just takes the maximum of the max-view-count you pass in and 1 (to avoid divide-by-zero errors).
If you have a multiValued field (which you can't sort on) you could also use the StatsComponent to get the max. Either way, you would probably want to do this once, not for every query (say, every night at midnight, or whenever your data set settles down).
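In code, the two-step flow might look like this (a sketch using Python's requests; the Solr URL is a placeholder, and the queries mirror the ones above):

import requests

SOLR = "http://localhost:8983/solr/select"   # placeholder Solr URL

# Step 1: one sorted row is enough to find the current maximum.
top = requests.get(SOLR, params={
    "q": "*:*", "sort": "view_count desc",
    "rows": 1, "fl": "view_count", "wt": "json",
}).json()
max_view_count = top["response"]["docs"][0]["view_count"]

# Step 2: plug it into the boost query; max(..., 1) guards against zero.
results = requests.get(SOLR, params={
    "q": "whatever",
    "bq": "div(view_count, max(%d, 1))" % max_view_count,
    "wt": "json",
}).json()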
You can add just:
&stats=true&stats.field=view_count
You will see brief statistics for the specified field. More documentation here
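Reading the max back out of the stats response might look like this sketch (placeholder URL; the field path follows the StatsComponent's standard JSON layout):

import requests

resp = requests.get("http://localhost:8983/solr/select", params={
    "q": "*:*", "rows": 0, "wt": "json",
    "stats": "true", "stats.field": "view_count",
}).json()

# The StatsComponent reports min, max, sum, mean, etc. per field.
print(resp["stats"]["stats_fields"]["view_count"]["max"])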

Youtube API problem - when searching for playlists, start-index does not work past 100

I have been trying to get the full list of playlists matching a certain keyword. I have discovered, however, that using a start-index past 100 brings back the same set of results as using start-index=1. It does not matter what the max-results parameter is; the results are still the same. The total results count, however, is way above 100, so it cannot be that the query matched only 100 results.
What might the problem be? Is it a quota of some sort or any other authentication restriction?
As an example, the query below brings back the same result set whether you use start-index=1, start-index=101, or start-index=201, etc.:
http://gdata.youtube.com/feeds/api/playlists/snippets?q=%22Jan+Smit+Laura%22&max-results=50&start-index=1&v=2
Any idea will be much appreciated!
Regards
Christo
I made an interface for my site, and the way I avoided this problem was to query for a large number of results, then store them. Let your web page then break up the results and present them however is needed.
For example, if someone wants to search over 100 videos, do the search and collect the results, but only present the first group, say 10. Then when the person wants to see the next ten, serve them from the list you stored rather than doing a new query.
Not only does this make paging faster, but it cuts down on the constant queries to the YouTube database.
Hope this makes sense and helps.
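As a sketch in Python: the feed URL is the one from the question (the long-retired gdata v2 API), and the in-memory dict stands in for whatever storage you actually use:

import requests

FEED = "http://gdata.youtube.com/feeds/api/playlists/snippets"
_cache = {}   # query -> stored entries, fetched once

def search_page(query, page, page_size=10):
    """Fetch one big batch from the API, then page from the stored copy."""
    if query not in _cache:
        resp = requests.get(FEED, params={
            "q": query, "max-results": 50,    # v2 caps max-results at 50
            "start-index": 1, "v": 2, "alt": "json",
        }).json()
        _cache[query] = resp["feed"].get("entry", [])
    start = page * page_size
    return _cache[query][start:start + page_size]

first_ten = search_page('"Jan Smit Laura"', page=0)
next_ten = search_page('"Jan Smit Laura"', page=1)   # served locally, no new API call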

When Does Django Perform the Database Lookup?

From the following code:
dvdList = Dvd.objects.filter(title=someDvdTitle)[:10]
for dvd in dvdList:
    result = "Title: " + dvd.title + " # " + str(dvd.price) + "."
When does Django do the lookup? Maybe it's just paranoia, but it seems that if I comment out the for loop, it returns a lot quicker. Is the first line setting up a filter and then the for loop executing it, or am I completely muddled up? What actually happens with those lines of code?
EDIT:
What would happen if I limited the objects.filter to '1000' and then implemented a counter in the for loop that broke out of it after 10 iterations. Would that effectively only get 10 values or 1000?
Django querysets are evaluated lazily, so yes, the query won't actually be executed until you try and get values out of it (as you're doing in the for loop).
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print e.headline
...(snip)...
See When Querysets are evaluated.
Per your edit:
If you limited the filter to 1000 and then implemented a counter in the for loop that broke out of it after 10 iterations, you'd hit the database for all 1000 rows. Django has no way of knowing ahead of time exactly what you're going to do with the QuerySet; it just knows that you want some data out of it, so it evaluates the query it has built up.
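To make this concrete, here's a sketch assuming a configured Django project with the Dvd model from the question (the myapp module is hypothetical, and connection.queries is only populated when DEBUG=True):

from django.db import connection
from myapp.models import Dvd    # hypothetical app module

qs = Dvd.objects.filter(title="Alien")[:10]
print(qs.query)                 # SELECT ... LIMIT 10 -- built, not yet executed

for dvd in qs:                  # iterating is what actually hits the database
    print(dvd.title)

# Slicing to 1000 and breaking out early still sends LIMIT 1000 to the DB:
for i, dvd in enumerate(Dvd.objects.filter(title="Alien")[:1000]):
    if i >= 10:
        break
print(connection.queries[-1]["sql"])    # "... LIMIT 1000" (needs DEBUG=True)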
It may also be good to evaluate everything at once using list() or some other way of forcing evaluation of the query. I find it boosts performance sometimes (you don't pay for a DB round trip every time).
Find more info about when Django evaluates querysets here.