subquery using DISTINCT and LIMIT - sql

In SQLite, when I do
SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0;
the data returned are 100 rows with (the first) 100 distinct values of idvar in myTable. That's exactly what I expected.
Now, when I do
SELECT *
FROM myTable
WHERE idvar IN (SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0);
I would expect to get all the data from myTable corresponding to those 100 distinct values of idvar (so the result could have more than 100 rows if there is more than one row per idvar). What I get, however, is all the data for however many distinct values of idvar it takes to return roughly 100 rows. I don't understand why.
Thoughts? How should I build a query that returns what I expected?
Context
I have a 50GB table, and I need to do some calculations using R. Since I can't possibly load that much data into R for memory reasons, I want to work in chunks. It is important, however, that each chunk contains all the rows for a given level of idvar. That's why I use OFFSET and LIMIT in the query, while trying to make sure that it returns all rows for those levels of idvar.

I'm not sure about SQLite, but in other SQL variants an unordered LIMIT query is not guaranteed to return the same result every time. So you should also include an ORDER BY in there.
But a better idea may be to do a separate query at the beginning to read all of the distinct IDs into R, then split those into batches of 100, and then do a separate query for each batch. That should be clearer, faster and easier to debug.
Edit: example R code. Let's say you have 100k distinct IDs in the variable ids.
for (i in 1:1000) {
  # IDs for batch i: positions (i-1)*100+1 through i*100
  tmp.ids <- ids[((i - 1) * 100 + 1):(i * 100)]
  query <- paste0("SELECT * FROM myTable WHERE idvar IN (",
                  paste0(tmp.ids, collapse = ", "),
                  ")")
  res <- dbSendQuery(con, query)
  # fetch results, etc.
  dbClearResult(res)
}
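For comparison, the same batching idea can be sketched in Python with the standard sqlite3 module. This is only an illustration under assumptions: the helper name iter_id_batches, the value column and the tiny in-memory table are made up for the demo, and only the myTable / idvar names come from the question.

```python
import sqlite3

def iter_id_batches(conn, batch_size=100):
    """Yield all rows of myTable for successive batches of distinct idvar values."""
    # Read the distinct ids once, in a stable order.
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT idvar FROM myTable ORDER BY idvar")]
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One "?" placeholder per id, so values are bound rather than pasted in.
        placeholders = ", ".join("?" * len(batch))
        query = f"SELECT * FROM myTable WHERE idvar IN ({placeholders})"
        yield conn.execute(query, batch).fetchall()

# Tiny in-memory demo (hypothetical data): 5 ids with 2 rows each, 2 ids per batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (idvar INTEGER, value TEXT)")
conn.executemany("INSERT INTO myTable VALUES (?, ?)",
                 [(i, f"row{j}") for i in range(1, 6) for j in range(2)])
batches = list(iter_id_batches(conn, batch_size=2))
```

Each batch holds every row for its ids, so no level of idvar is split across chunks.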


Rails Active Record - Get an evenly distributed arbitrary number of records

I want to have queries on my rails App that return an evenly distributed arbitrary number of records.
What I mean is, if my query returns 100 records and I want only 4, it should return the records 100, 75, 50 and 25.
If I want 5, it should return 100, 80, 60, 40, 20.
I know I could do it manipulating an array after the result, but my question is, is there a way to do it directly with ActiveRecord or even SQL?
If you're on postgresql, mssql or oracle, you can use the row_number() function but it looks like you'll need two nested queries, which might be more complicated than you want. See Return row of every n'th record for an example of how to construct it.
From there, what you're trying to do might look like:
MyModel.find_by_sql("
  select * from my_models where id in (
    SELECT t.id
    FROM
    (
      SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rownum
      FROM my_models
      WHERE [YOUR CONDITIONS HERE]
    ) AS t
    WHERE t.rownum % n = 0
  )")
It's really hard to tell from here how expensive that query might be. If I might offer some advice, I would really suggest proving that you have a performance problem doing it the Rails way, before getting into trying an optimization like this.
You can avoid having to fetch and deserialize all the records you'd skip, by using the ids method of an AREL query object to get only the ids like so:
class MyModel < ActiveRecord::Base
  class << self
    # Returns every n-th record of get_query, loading only the ids first.
    def get_each_n_of_query(n)
      all_ids = get_query.ids
      ids = (0...all_ids.length).select { |x| x % n == n - 1 }.map { |y| all_ids[y] }
      where(id: ids)
    end

    def get_query
      where(foo: 'bar', ...)
    end
  end
end
Credit to the answers at How do you select every nth item in an array? for how to divide the list of ids.
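The index arithmetic for picking every n-th id is language-independent. As a minimal sketch in Python (the function name evenly_spaced is hypothetical, and deriving the stride from the desired count is an assumption about what the question wants):

```python
def evenly_spaced(ids, count):
    """Pick up to `count` ids from `ids` at a fixed stride."""
    if count <= 0 or not ids:
        return []
    # Stride between picks; at least 1 so short lists still work.
    step = max(len(ids) // count, 1)
    # Start at index step-1 so each pick is the last item of its stride.
    return ids[step - 1::step][-count:]

record_ids = list(range(1, 101))  # stand-in for the ids returned by the query
four = evenly_spaced(record_ids, 4)
five = evenly_spaced(record_ids, 5)
```

With 100 ids this selects 25, 50, 75, 100 for a count of 4 and 20, 40, 60, 80, 100 for a count of 5, matching the examples in the question apart from the order.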
I may have misread the question, but I was trying to solve a similar question and used:
SELECT DISTINCT [column]
FROM [table]
ORDER BY [column] [DESC|ASC]
OFFSET @Row-1 ROWS --pass the starting row in @Row, e.g. 10 to start at the 10th row
FETCH NEXT @Rows ROWS ONLY --the number of rows you want to see

SQL COUNT - greater than some number without having to get the exact count?

There's a thread at https://github.com/amatsuda/kaminari/issues/545 talking about a problem with a Ruby pagination gem when it encounters large tables.
When the number of records is large, the pagination will display something like:
[1][2][3][4][5][6][7][8][9][10][...][end]
This can incur performance penalties when the number of records is huge, because getting an exact count of, say, 50M+ records will take time. However, all that's needed to know in this case is that the count is greater than the number of pages to show * number of records per page.
Is there a faster SQL operation than getting the exact COUNT, which would merely assert that the COUNT is greater than some value x?
You could try with
SQL Server:
SELECT COUNT(*) FROM (SELECT TOP 1000 * FROM MyTable) X
MySQL:
SELECT COUNT(*) FROM (SELECT * FROM MyTable LIMIT 1000) X
With a little luck, the SQL Server/MySQL will optimize this query. Clearly instead of 1000 you should put the maximum number of pages you want * the number of rows per page.
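The same capped-count idea can be tried out with Python's sqlite3 module. This is a sketch only: the helper name count_exceeds and the my_table demo table are assumptions, and whether the database actually stops scanning early depends on its optimizer.

```python
import sqlite3

def count_exceeds(conn, table, threshold):
    """Return True if `table` has more than `threshold` rows."""
    # The table name is interpolated directly, so it must come from trusted code.
    # The inner LIMIT caps the scan at threshold + 1 rows.
    sql = f"SELECT COUNT(*) FROM (SELECT 1 FROM {table} LIMIT ?)"
    (capped,) = conn.execute(sql, (threshold + 1,)).fetchone()
    return capped > threshold

# Hypothetical demo table with 250 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO my_table (id) VALUES (?)", [(i,) for i in range(250)])
enough_for_ten_pages = count_exceeds(conn, "my_table", 10 * 20)
```

Here the threshold is pages to show * records per page (10 * 20 in the demo), as described in the question.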

How can I get a specific chunk of results?

Is it possible to retrieve a specific range of results? I know how to do TOP x but the result I will retrieve is WAY too big and will time out. I was hoping to be able to pick say the first 10,000 results then the next 10,000 and so on. Is this possible?
WITH Q AS (
SELECT ROW_NUMBER() OVER (ORDER BY ...some column) AS N, ...other columns
FROM ...some table
) SELECT * FROM Q WHERE N BETWEEN 1 AND 10000;
Read more about ROW_NUMBER() here: http://msdn.microsoft.com/en-us/library/ms186734.aspx
Practically all SQL DB implementations have a way of specifying the starting row to return, as well as the number of rows.
For example, in both mysql and postgres it looks like:
SELECT ...
ORDER BY something -- not required, but highly recommended
LIMIT 100 -- only get 100 rows
OFFSET 500; -- start at row 500
Note that normally you would include an ORDER BY to make sure your chunks are consistent.
MS SQL Server (being a "pretend" DB) doesn't support OFFSET directly in versions before 2012 (later versions have OFFSET ... FETCH), but it can be coded using ROW_NUMBER() - see this SO post for more detail.
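To illustrate the LIMIT/OFFSET pattern above end to end, here is a small sketch using Python's sqlite3 module, which accepts the same LIMIT ... OFFSET syntax. The items table, its columns and the iter_chunks helper are hypothetical names chosen for the demo.

```python
import sqlite3

def iter_chunks(conn, chunk_size):
    """Yield successive chunks of rows, ordered by id so chunks are consistent."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM items ORDER BY id LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break  # ran past the last row
        yield rows
        offset += chunk_size

# Hypothetical demo data: 25 rows, read 10 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"p{i}") for i in range(1, 26)])
chunks = list(iter_chunks(conn, 10))
```

Each chunk is a separate query, so a long-running single result set is avoided, at the cost of the database skipping `offset` rows every time.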

postgresql: offset + limit gets to be very slow

I have a table tmp_drop_ids with one column, id, and 3.3 million entries. I want to iterate over the table, doing something with every 200 entries. I have this code:
LIMIT = 200
for offset in xrange(0, drop_count+LIMIT, LIMIT):
    print "Making tmp table with ids %s to %s/%s" % (offset, offset+LIMIT, drop_count)
    query = """DROP TABLE IF EXISTS tmp_cur_drop_ids; CREATE TABLE tmp_cur_drop_ids AS
    SELECT id FROM tmp_drop_ids ORDER BY id OFFSET %s LIMIT %s;""" % (offset, LIMIT)
    cursor.execute(query)
This runs fine, at first, (~0.15s to generate the tmp table), but it will slow down occasionally, e.g. around 300k tickets it started taking 11-12 seconds to generate this tmp table, and again around 400k. It basically seems unreliable.
I will use those ids in other queries so I figured the best place to have them was in a tmp table. Is there any better way to iterate through results like this?
Use a cursor instead. Using OFFSET and LIMIT is pretty expensive, because pg has to execute the query, then process and skip OFFSET rows. OFFSET means "skip rows", and that is expensive.
cursor documentation
A cursor allows iteration over the result of one query.
BEGIN;
DECLARE C CURSOR FOR SELECT * FROM big_table;
FETCH 300 FROM C; -- get 300 rows
FETCH 300 FROM C; -- get 300 rows
...
COMMIT;
You can probably use a server-side cursor without an explicit DECLARE statement, just through the support in psycopg (search for the section about server-side cursors).
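The "one query, fetched in batches" idea can be sketched with the DB-API fetchmany call. The example below runs against sqlite3 purely as a stand-in so that it is self-contained; it is not a Postgres server-side cursor. With psycopg2 the analogous step would be a named cursor, e.g. conn.cursor(name="c"). The iter_batches helper and the demo table contents are assumptions.

```python
import sqlite3

def iter_batches(cursor, batch_size):
    """Yield lists of up to batch_size rows from an already executed cursor."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break  # the query result is exhausted
        yield rows

# Hypothetical demo data: 7 rows fetched 3 at a time from a single query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO big_table (id) VALUES (?)", [(i,) for i in range(1, 8)])
cursor = conn.execute("SELECT id FROM big_table ORDER BY id")
batches = list(iter_batches(cursor, 3))
```

The query is executed once and the cursor keeps its position, so no rows are skipped and re-read between batches.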
If your id's are indexed you can use "limit" with ">", for example in python-like pseudocode:
limit = 200
max_processed_id = -1
query("create table tmp_cur_drop_ids(id int)")
while True:
    query("truncate tmp_cur_drop_ids")
    # take the next `limit` ids above the highest id handled so far
    query("insert into tmp_cur_drop_ids(id)"
          + " select id from tmp_drop_ids"
          + " where id>%d order by id limit %d" % (max_processed_id, limit))
    max_processed_id = query("select max(id) from tmp_cur_drop_ids")
    if max_processed_id is None:
        break
    process_tmp_cur_drop_ids()
query("drop table tmp_cur_drop_ids")
This way Postgres can use an index for your query.
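A runnable variant of that "id greater than the last one seen" loop, sketched against sqlite3 so it needs no server. The iter_keyset_batches name and the demo ids are made up, and the temporary table from the pseudocode is left out to keep the example short.

```python
import sqlite3

def iter_keyset_batches(conn, limit=200):
    """Yield batches of ids, each query starting after the last id already seen."""
    last_id = -1  # assumes ids are non-negative, as in the pseudocode
    while True:
        rows = conn.execute(
            "SELECT id FROM tmp_drop_ids WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, limit),
        ).fetchall()
        if not rows:
            break
        yield [r[0] for r in rows]
        last_id = rows[-1][0]  # highest id in this batch

# Hypothetical demo data: ids 5, 10, ..., 35 read 3 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_drop_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tmp_drop_ids (id) VALUES (?)",
                 [(i,) for i in range(5, 40, 5)])
id_batches = list(iter_keyset_batches(conn, limit=3))
```

Because every query seeks directly to `id > last_id`, the cost per batch should stay roughly constant instead of growing with the offset.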

Sql query to get a non-contiguous subset of results

I'm writing a web application that should show very large results on a search query.
Say some queries will return 10.000 items.
I'd like to show those to users paginated; no problem so far: each page will be the result of a query with an appropriate LIMIT statement.
But I'd like to show clues about results in each page of the paginated query: some data from the first item and some from the last.
This means that, for example, with a result of 10.000 items and a page size of 50 items, if the user asked for the first page I will need:
the first 50 items (the page requested by the user)
items 51 and 100 (the first and last of the second page)
items 101 and 150
etc
For efficiency reasons I want to avoid one query per row.
[edit] I also would prefer not downloading 10.000 results if I only need 50 + 10000/50*2 = 400
The question is: is there a single query I can issue to the RDBMS (mysql, by the way, but I'd prefer a cross-db solution) that will return only the data I need?
I can't use server side cursor, because not all dbs support it and I want my app to be database-agnostic.
Just for fun, here is the MSSQL version of it.
declare @pageSize as int; set @pageSize = 10;
declare @pageIndex as int; set @pageIndex = 0; /* first page */
WITH x AS
(
  select
    ROW_NUMBER() OVER (ORDER BY (created) ASC) AS RowNumber,
    *
  from table
)
SELECT * FROM x
WHERE
  /* every row of the requested page */
  ((RowNumber <= (@pageIndex+1)*@pageSize) AND (RowNumber >= @pageIndex*@pageSize+1))
  OR
  RowNumber % @pageSize = 1 /* first row of each page */
  OR
  RowNumber % @pageSize = 0 /* last row of each page */
Note, that an ORDER BY is provided in the over clause.
Also note that if you have a gazillion rows, your result set will still have millions. You need to cap the number of result rows for practical reasons.
I have no idea how this could be solved in generic SQL. (My bet: no way. Even simple paging cannot be solved without DB-specific operators.)
UPDATE: I completely misread the initial question. You can do this using UNION and the LIMIT clause in MySQL, although it might be what you meant by "one query per row". The syntax would be like:
(select FOO from BAZ limit 50)
union
(select FOO from BAZ limit 50, 1)
union
(select FOO from BAZ limit 99, 1)
union
(select FOO from BAZ limit 100, 1)
union
(select FOO from BAZ limit 149, 1)
and so on and so forth. Since you're using UNION, you'll only need one roundtrip to the database. I'm not sure how MySQL will treat the various SELECT statements, though. It should be able to recognize that they are essentially the same query and use a cached query plan, but I don't work with MySQL enough to know if that's a reasonable expectation for its optimizer.
Obviously, to build this query in a general fashion, you'll first need to run a count query so you can calculate what your offsets will be.
This is definitely not a tractable problem for standard SQL, since the paging logic requires nonstandard features.
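The offset calculation mentioned above (after the count query) is simple arithmetic. As a rough sketch, with clue_limits being a hypothetical helper name, the (offset, row_count) pairs for the LIMIT clauses could be computed like this:

```python
def clue_limits(total, page_size, page_index):
    """Return (offset, row_count) pairs: the requested page, then the first
    and last row of every other page."""
    limits = [(page_index * page_size, page_size)]
    page_count = -(-total // page_size)  # ceiling division
    for page in range(page_count):
        if page == page_index:
            continue  # already covered by the full page above
        first = page * page_size
        last = min(first + page_size, total) - 1  # last page may be short
        limits.append((first, 1))
        if last != first:
            limits.append((last, 1))
    return limits

# 150 results, 50 per page, first page requested.
limits = clue_limits(150, 50, 0)
```

For 150 results this yields offsets 0 (50 rows), then 50, 99, 100 and 149 (one row each), which are the same values used in the UNION example. The offsets only refer to the same rows across the sub-selects if every sub-select uses the same ORDER BY.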