PostgreSQL: OFFSET + LIMIT gets to be very slow

I have a table tmp_drop_ids with one column, id, and 3.3 million entries. I want to iterate over the table, processing 200 entries at a time. I have this code:
LIMIT = 200

for offset in xrange(0, drop_count + LIMIT, LIMIT):
    print "Making tmp table with ids %s to %s/%s" % (offset, offset + LIMIT, drop_count)
    query = """DROP TABLE IF EXISTS tmp_cur_drop_ids;
               CREATE TABLE tmp_cur_drop_ids AS
               SELECT id FROM tmp_drop_ids ORDER BY id OFFSET %s LIMIT %s;""" % (offset, LIMIT)
    cursor.execute(query)
This runs fine at first (~0.15 s to generate the tmp table), but it slows down periodically; e.g. around 300k tickets it started taking 11-12 seconds to generate this tmp table, and again around 400k. It basically seems unreliable.
I will use those ids in other queries so I figured the best place to have them was in a tmp table. Is there any better way to iterate through results like this?

Use a cursor instead. Using OFFSET and LIMIT is pretty expensive, because pg has to execute the query, then process and skip OFFSET rows. OFFSET is essentially "skip rows", and that is expensive.
cursor documentation
A cursor allows iteration over one query.
BEGIN
DECLARE C CURSOR FOR SELECT * FROM big_table;
FETCH 300 FROM C; -- get 300 rows
FETCH 300 FROM C; -- get 300 rows
...
COMMIT;
You can probably use a server-side cursor without explicitly using the DECLARE statement, just with the support in psycopg (see its section about server-side cursors).
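For example, a minimal sketch of what that could look like with a psycopg2 named (server-side) cursor; the connection string and the process() callback are placeholders:
import psycopg2

conn = psycopg2.connect("dbname=mydb")              # placeholder connection string
cur = conn.cursor(name="drop_ids_cur")              # giving a name makes it a server-side cursor
cur.execute("SELECT id FROM tmp_drop_ids ORDER BY id")
while True:
    batch = cur.fetchmany(200)                      # next 200 ids, streamed from the server
    if not batch:
        break
    process(batch)                                  # placeholder for the per-batch work
cur.close()
conn.commit()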

If your ids are indexed you can use LIMIT with a ">" condition instead of OFFSET, for example (using the same cursor as in the question):
LIMIT = 200
max_processed_id = -1

cursor.execute("CREATE TABLE tmp_cur_drop_ids (id int)")
while True:
    cursor.execute("TRUNCATE tmp_cur_drop_ids")
    cursor.execute("INSERT INTO tmp_cur_drop_ids (id)"
                   " SELECT id FROM tmp_drop_ids"
                   " WHERE id > %s ORDER BY id LIMIT %s",
                   (max_processed_id, LIMIT))
    cursor.execute("SELECT max(id) FROM tmp_cur_drop_ids")
    max_processed_id = cursor.fetchone()[0]
    if max_processed_id is None:
        break
    process_tmp_cur_drop_ids()
cursor.execute("DROP TABLE tmp_cur_drop_ids")
This way Postgres can use the index on id for your query.
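If tmp_drop_ids does not have an index on id yet, it would need one for this to work well; a sketch, reusing the question's cursor:
cursor.execute("CREATE INDEX tmp_drop_ids_id_idx ON tmp_drop_ids (id)")
cursor.execute("ANALYZE tmp_drop_ids")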

Related

Possible to get total row count in SQL Server before paginating?

I have the following query in SQL Server:
select * from sales
This returns about a billion rows, and I need to process the data. I would like to show the progress while the data is being processed using something like the following:
res = conn.execute('select * from sales s1')
total_rows = ?   # how to get this without another query?
processed_rows = 0
step_size = 10000

while True:
    data = res.fetchmany(step_size)
    if not data:
        break
    processed_rows += step_size
    print "Progress: %s / %s" % (processed_rows, total_rows)
Is there a way to get the total number of rows in a SQL query without running another query (or doing an operation such as len(res.fetchall()), which will add a lot of overhead to the above)?
Note: I'm not interested in paginating here (which is what should be done). This is more a question to see if it's possible to get the TOTAL ROW COUNT in a query in SQL Server BEFORE paginating/processing the data.
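One way this is sometimes done (a sketch only, reusing the question's conn object and assuming the driver returns plain row tuples) is to let SQL Server attach the total to every row with a window aggregate, so the very first batch already carries it. Note that COUNT(*) OVER () means the server has to determine the full count before it can stream rows, which is not free on a billion-row result:
res = conn.execute('select s1.*, count(*) over () as total_rows from sales s1')

total_rows = None
processed_rows = 0
step_size = 10000

while True:
    data = res.fetchmany(step_size)
    if not data:
        break
    if total_rows is None:
        total_rows = data[0][-1]        # total_rows is the last column of every row
    processed_rows += len(data)
    print "Progress: %s / %s" % (processed_rows, total_rows)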

Get COUNT() result within time limit

Is there a way in PostgreSQL to abort execution of a COUNT(*) statement and return its current result?
I would like to run:
SELECT COUNT(*) FROM table WHERE something=x;
Some queries are completed in almost no time, but some take quite a lot of time. I would like to have:
if the statement completes within the time limit, it returns the final result,
else it aborts execution but returns the current result.
It would be nice to get an exit status as well (whether it finished execution or was aborted).
I found the statement_timeout setting, but it doesn't return any result, it just aborts.
You can easily instruct Postgres to count up to a given LIMIT - a maximum number of rows, not an elapsed time:
SELECT count(*)
FROM  (
   SELECT 1 FROM tbl
   WHERE  something = 'x'
   LIMIT  100000  -- stop counting at 100k
   ) sub;
If count() takes a very long time, you either have huge tables or some other problems with your setup. Either way, an estimated count may be good enough for your purpose:
Fast way to discover the row count of a table in PostgreSQL
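For reference, a sketch of that estimated-count idea: read the planner's estimate from the catalog instead of counting. This estimates the whole table, not the WHERE-filtered count; psycopg2, the connection string and the table name tbl are assumptions here.
import psycopg2

conn = psycopg2.connect("dbname=mydb")              # placeholder connection string
cur = conn.cursor()
# reltuples is the planner's row estimate, maintained by VACUUM / ANALYZE / autovacuum
cur.execute("SELECT reltuples::bigint FROM pg_class WHERE oid = 'tbl'::regclass")
estimate = cur.fetchone()[0]
print("~%d rows in tbl" % estimate)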
It is not possible per se to stop counting after a maximum elapsed time. You could partition the count with the above technique and check the elapsed time after every step. But this adds a lot of overhead. Skipping rows with OFFSET is not that much cheaper than counting them. I don't think I would use it. Just as proof of concept:
DO
$do$
DECLARE
   _partition bigint := 100000;  -- size of count partition
   _timeout   timestamptz := clock_timestamp() + interval '1s';  -- max time allowed
   _round     int := 0;
   _round_ct  bigint;
BEGIN
   LOOP
      SELECT count(*)
      FROM  (
         SELECT 1 FROM tbl
         WHERE  something = 'x'
         LIMIT  _partition
         OFFSET _partition * _round
         ) sub
      INTO  _round_ct;

      IF _round_ct < _partition THEN
         RAISE NOTICE 'count: %; status: complete', _partition * _round + _round_ct;
         RETURN;
      ELSIF clock_timestamp() > _timeout THEN
         RAISE NOTICE 'count: %; status: timeout', _partition * _round + _round_ct;
         RETURN;
      END IF;

      _round := _round + 1;
   END LOOP;
END
$do$;
You could wrap this in a plpgsql function and pass parameters. Even make it work for any given table / column with EXECUTE ...
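As a rough illustration of that idea, here is the same partitioned count moved to the client side in Python instead of a plpgsql function. This is a sketch only: psycopg2 is assumed, and the table / column names are interpolated into the SQL purely for illustration, so don't use it with untrusted input.
import time

def capped_count(conn, table, column, value, partition=100000, timeout=1.0):
    """Count rows where column = value, giving up with a partial count after timeout seconds."""
    sql = ("SELECT count(*) FROM ("
           "  SELECT 1 FROM {0} WHERE {1} = %s"
           "  LIMIT %s OFFSET %s) sub").format(table, column)
    deadline = time.time() + timeout
    total, round_no = 0, 0
    cur = conn.cursor()
    while True:
        cur.execute(sql, (value, partition, partition * round_no))
        round_ct = cur.fetchone()[0]
        total += round_ct
        if round_ct < partition:
            return total, "complete"    # counted everything
        if time.time() > deadline:
            return total, "timeout"     # partial count only
        round_no += 1

# e.g.: count, status = capped_count(conn, "tbl", "something", "x")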
If you have an ID column with few gaps, the technique would make a lot more sense. You could partition by ID with a lot less overhead ...
I don't believe you will ever get a result set with a count until the query completes and makes it visible to the end user, aka you. Such are the fundamental rules of an ACID database. By initiating a SELECT command you're asking for a snapshot of the number of rows at that moment in time.
You would probably be better off looking at the issue from another angle: investigate why some queries take a long time by running EXPLAIN on them and examining the results.

subquery using DISTINCT and LIMIT

In SQLite, when I do
SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0;
the data returned are 100 rows with (the first) 100 distinct values of idvar in myTable. That's exactly what I expected.
Now, when I do
SELECT *
FROM myTable
WHERE idvar IN (SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0);
I would expect to get all the data from myTable corresponding to those 100 distinct values of idvar (so potentially more than 100 rows, if there is more than one row per idvar). What I get, however, is all the data for however many distinct values of idvar add up to roughly 100 rows. I don't understand why.
Thoughts? How should I build a query that returns what I expected?
Context
I have a 50GB table, and I need to do some calculations using R. Since I can't possibly load that much data into R for memory reasons, I want to work in chunks. It is important, however, that each chunk contains all the rows for a given level of idvar. That's why I use OFFSET and LIMIT in the query, as well as trying to make sure that it returns all rows for each level of idvar.
I'm not sure about SQLite, but in other SQL variants an un-ordered LIMIT query is not guaranteed to return the same result every time. So you should also include an ORDER BY in there.
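Applied to the original query, that would look something like this (a sketch using Python's built-in sqlite3 module; the database path is a placeholder):
import sqlite3

conn = sqlite3.connect("mydata.db")                 # placeholder path
rows = conn.execute("""
    SELECT *
    FROM myTable
    WHERE idvar IN (SELECT DISTINCT idvar
                    FROM myTable
                    ORDER BY idvar
                    LIMIT 100 OFFSET 0)
""").fetchall()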
But a better idea may be to do a separate query at the beginning to read all of the distinct IDs into R, then split those into batches of 100 and do a separate query for each batch. That should be clearer, faster and easier to debug.
Edit: example R code. Let's say you have 100k distinct IDs in the variable ids.
for (i in 1:1000) {
  tmp.ids <- ids[((i - 1) * 100 + 1) : (i * 100)]
  query <- paste0("SELECT * FROM myTable WHERE idvar IN (",
                  paste0(tmp.ids, collapse = ", "),
                  ")")
  res <- dbSendQuery(con, query)
  chunk <- dbFetch(res)   # fetch results, process them, etc.
  dbClearResult(res)
}

How to update a counter for a resultset

I'm creating something similar to an advertising system.
I would like to show, for example, 5 ads (5 records) from a given database table.
So I execute something like
SELECT * FROM mytable
ORDER BY view_counter ASC
LIMIT 5
OK, it works.
But how can I contextually update view_counter (a counter of the number of times each ad has been shown), ideally with a single SQL statement?
And, if I don't ask too much, is it possible to save the "position" in which my records are returned?
For example, my SQL returns
- record F (pos. 1)
- record X (pos. 2)
- record Z (pos. 3)
And save in a field "Average_Position" the ... average of the positions?
Thanks in advance.
Regards
How can I contextually update view_counter (a counter of the number of times each ad has been shown), maybe with a single SQL statement?
That's usually something handled by analytic/rank/windowing functions, which MySQL doesn't currently support. But you can use the following query to get the output you want:
SELECT *,
       @rownum := @rownum + 1 AS rank
FROM mytable
JOIN (SELECT @rownum := 0) r   -- start at 0 so the first row gets rank 1
ORDER BY view_counter ASC
LIMIT 5
You'd get output like:
description | rank
------------+-----
record F    | 1
record X    | 2
record Z    | 3
If I don't ask too much, is it possible to save the "position" in which my records are returned?
I don't recommend doing this, because it means the data needs to be updated every time there's a change. On other databases I'd recommend using a view so the calculation is made only when the view is used, but MySQL doesn't support variable use in views.
There is an alternative means of getting the rank value using a subselect - this link is for SQL Server, but there's nothing in the solution that is SQL Server specific.
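For illustration only, a correlated-subquery version of the rank (my own sketch, not the linked solution; it assumes an id primary key for breaking ties and a generic DB-API cursor):
cursor.execute("""
    SELECT t.*,
           (SELECT COUNT(*)
            FROM mytable t2
            WHERE t2.view_counter < t.view_counter
               OR (t2.view_counter = t.view_counter AND t2.id <= t.id)) AS rank_pos
    FROM mytable t
    ORDER BY t.view_counter ASC
    LIMIT 5
""")
rows = cursor.fetchall()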
You could do something like this, but it is pretty ugly and I would not recommend it (see below for my actual suggestion about how to handle this issue).
Create a dummy_field tinyint field, sum_position int field and average_position decimal field and run the following few statements within the same connection (I am usually very much against MySQL stored procedures, but in this case it could be useful to store this in a SP).
SET @updated_ads := '';
SET @current_position := 0;

UPDATE mytable
SET view_counter = view_counter + 1,
    dummy_field = (SELECT @updated_ads := CONCAT(@updated_ads, id, "\t", ad_text, "\r\n")) * 0, /* I added *0 for saving it as tinyint in dummy_field */
    sum_position = sum_position + (@current_position := @current_position + 1),
    average_position = sum_position / view_counter
ORDER BY view_counter DESC
LIMIT 5;

SELECT @updated_ads;
Then parse the result string in your code using the delimiters you used (I used \r\n as a row delimiter and \t as the field delimiter).
What I actually suggest you to do is:
Query for selected ads.
Write a log file with the selected ads and positions.
Write a job to process the log file and update view_counter, average_position and sum_position fields in batch.
Thanks for your answer. I solved it by simply executing the same SELECT query (with exactly the same WHERE, ORDER BY and LIMIT clauses) but, instead of SELECT, I used UPDATE.
Yes, there's some overhead, but it's a simple solution.
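For anyone reading along, that approach would look roughly like this in MySQL, which allows ORDER BY and LIMIT on single-table UPDATEs. A sketch only: the cursor/connection objects are assumptions, and the rows touched by the UPDATE can differ from those returned by the earlier SELECT if other sessions change view_counter in between.
cursor.execute("""
    UPDATE mytable
    SET view_counter = view_counter + 1
    ORDER BY view_counter ASC
    LIMIT 5
""")
connection.commit()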

SQL query to get a non-contiguous subset of results

I'm writing a web application that should show very large results on a search query.
Say some queries will return 10,000 items.
I'd like to show those to users paginated; no problem so far: each page will be the result of a query with an appropriate LIMIT statement.
But I'd like to show clues about results in each page of the paginated query: some data from the first item and some from the last.
This means that, for example, with a result of 10,000 items and a page size of 50 items, if the user asked for the first page I will need:
the first 50 items (the page requested by the user)
items 51 and 100 (the first and last of the second page)
items 101 and 150
etc
For efficiency reasons I want to avoid one query per row.
[edit] I also would prefer not to download 10,000 results if I only need 50 + 10,000/50*2 = 450
The question is: is there a single query I can issue to the RDBMS (mysql, by the way, but I'd prefer a cross-db solution) that will return only the data I need?
I can't use server-side cursors, because not all DBs support them and I want my app to be database-agnostic.
Just for fun, here is the MSSQL version of it.
declare @pageSize as int; set @pageSize = 10;
declare @pageIndex as int; set @pageIndex = 0; /* first page */

WITH x AS
(
    select
        ROW_NUMBER() OVER (ORDER BY (created) ASC) AS RowNumber,
        *
    from table
)
SELECT * FROM x
WHERE
    ((RowNumber <= (@pageIndex + 1) * @pageSize) AND (RowNumber >= @pageIndex * @pageSize + 1))
    OR RowNumber % @pageSize = 1   -- first row of each page
    OR RowNumber % @pageSize = 0   -- last row of each page
Note that an ORDER BY is provided in the OVER clause.
Also note that if you have a gazillion rows, your result set will still have millions of rows, so you need to cap the number of result rows for practical reasons.
I have no idea how this could be solved in generic SQL. (My bet: no way. Even simple paging cannot be done without DB-specific operators.)
UPDATE: I completely misread the initial question. You can do this using UNION and the LIMIT clause in MySQL, although it might be what you meant by "one query per row". The syntax would be like:
select FOO from BAZ limit 50
union
select FOO from BAZ limit 50, 1
union
select FOO from BAZ limit 99, 1
union
select FOO from BAZ limit 100, 1
union
select FOO from BAZ limit 149, 1
and so on and so forth. Since you're using UNION, you'll only need one roundtrip to the database. I'm not sure how MySQL will treat the various SELECT statements, though. It should be able to recognize that they are essentially the same query and use a cached query plan, but I don't work with MySQL enough to know if that's a reasonable expectation for its optimizer.
Obviously, to build this query in a general fashion, you'll first need to run a count query so you can calculate what your offsets will be.
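A sketch of that, building the statement in Python (the cursor object and the FOO/BAZ names are placeholders; an ORDER BY is added to each branch so the offsets are deterministic, and UNION ALL is used so identical-looking rows are not collapsed):
PAGE_SIZE = 50

cursor.execute("SELECT COUNT(*) FROM BAZ")
total_rows = cursor.fetchone()[0]

parts = ["(SELECT FOO FROM BAZ ORDER BY FOO LIMIT %d)" % PAGE_SIZE]
for page_start in range(PAGE_SIZE, total_rows, PAGE_SIZE):
    first = page_start                                    # 0-based offset of the page's first row
    last = min(page_start + PAGE_SIZE, total_rows) - 1    # 0-based offset of the page's last row
    parts.append("(SELECT FOO FROM BAZ ORDER BY FOO LIMIT %d, 1)" % first)
    if last > first:
        parts.append("(SELECT FOO FROM BAZ ORDER BY FOO LIMIT %d, 1)" % last)

cursor.execute("\nUNION ALL\n".join(parts))
clue_rows = cursor.fetchall()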
This is definitely not a tractable problem for standard SQL, since the paging logic requires nonstandard features.