I have the following query in SQL Server:
select * from sales
This returns about a billion rows, and I need to process the data. I would like to show the progress while the data is being processed using something like the following:
res = conn.execute('select * from sales s1')
total_rows = ?  # <-- this is what I need
processed_rows = 0
step_size = 10000
while True:
    data = res.fetchmany(step_size)
    if not data:
        break
    processed_rows += len(data)
    print "Progress: %s / %s" % (processed_rows, total_rows)
Is there a way to get the total number of rows in a SQL query without running another query (or doing an operation such as len(res.fetchall()), which will add a lot of overhead to the above)?
Note: I'm not interested in paginating here (which is what should be done). This is more a question to see if it's possible to get the TOTAL ROW COUNT in a query in SQL Server BEFORE paginating/processing the data.
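One option, if your SQL Server version supports window functions (2005+), is COUNT(*) OVER (), which attaches the total row count to every row of the result, so the first fetched batch already tells you the total. The sketch below demonstrates the idea against an in-memory SQLite database (3.25+) with synthetic data, since that is easy to run; the query shape is the same in T-SQL. Note the server still has to count the full result before emitting the first row, so this is not free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(25)])

# COUNT(*) OVER () repeats the total row count on every row,
# so the first batch already carries the grand total.
res = conn.execute("SELECT *, COUNT(*) OVER () AS total_rows FROM sales")

step_size = 10
processed_rows = 0
total_rows = None
while True:
    data = res.fetchmany(step_size)
    if not data:
        break
    if total_rows is None:
        total_rows = data[0][-1]  # last column of any row holds the total
    processed_rows += len(data)
    print("Progress: %s / %s" % (processed_rows, total_rows))
```

With 25 demo rows and a step size of 10 this prints progress 10/25, 20/25, 25/25.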
Related
I have an existing database from which I need to extract a single record that contains a total of 10 GB of data. I have tried to load the data with
conn = sqlite(databaseFile, 'readonly')
GetResult = [
'SELECT result1, result2 ...... FROM Result '...
'WHERE ResultID IN ......'
];
Data = fetch(conn, GetResult)
With this query, memory usage grows until all 16 GB of RAM are exhausted, and then the software crashes.
I also tried to limit the result with
'LIMIT 10000'
at the end of the query and page through the results by offset. This works, but it takes about 3 hours (extrapolated from 20 individual batches) to get all the results. (The database cannot be changed.)
Maybe one of you has an idea how to get the data faster, or in a single query.
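The fetch call above pulls the whole result set into memory at once. If a streaming cursor is available in your environment, a single query fetched in fixed-size chunks avoids both the memory blow-up and the repeated-scan cost of OFFSET paging. A minimal sketch of that pattern using Python's sqlite3 module (the table name and columns follow the question; the data here is synthetic, since I don't have the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Result (ResultID INTEGER, result1 REAL, result2 REAL)")
conn.executemany("INSERT INTO Result VALUES (?, ?, ?)",
                 [(i % 5, float(i), float(i * 2)) for i in range(1000)])

# One query, streamed in chunks: a single cursor stays open, so there is
# no OFFSET re-scan and the full result set never sits in memory at once.
cur = conn.execute(
    "SELECT result1, result2 FROM Result WHERE ResultID IN (1, 2, 3)")

rows_seen = 0
while True:
    chunk = cur.fetchmany(10000)
    if not chunk:
        break
    rows_seen += len(chunk)  # process each chunk here instead of counting
```

The same streaming idea should carry over to whichever database interface your tooling provides, as long as it exposes an incremental fetch.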
I have a table called projection_chart that contains these columns:
PERCENTAGE PLANNED CLEARED PROJECTED TOTAL
I need to execute a SQL query that sets the PROJECTED column to PERCENTAGE percent of PLANNED, and TOTAL to PERCENTAGE percent of CLEARED, for every row of the table where media = digital.
UPDATE projection_chart
SET PROJECTED = PLANNED * PERCENTAGE / 100,
TOTAL = CLEARED * PERCENTAGE / 100
WHERE media = 'Digital'
Can I use the table's column names in arithmetic expressions like this? In my head it needs to loop through each row and perform those calculations; is this possible? If you have any suggestions or better syntax, I'll be forever thankful.
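A single UPDATE like the one above is set-based: the engine applies the arithmetic to every matching row, with no explicit loop needed. A runnable sketch with Python's sqlite3 and two synthetic rows (the column values are made up for illustration; note the single quotes around the string literal 'Digital', since double quotes denote identifiers in standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE projection_chart
                (media TEXT, PERCENTAGE REAL, PLANNED REAL,
                 CLEARED REAL, PROJECTED REAL, TOTAL REAL)""")
conn.execute("INSERT INTO projection_chart VALUES ('Digital', 50, 200, 80, 0, 0)")
conn.execute("INSERT INTO projection_chart VALUES ('Print',   50, 200, 80, 0, 0)")

# One set-based UPDATE touches every row matching the WHERE clause;
# no row-by-row loop is required.
conn.execute("""UPDATE projection_chart
                SET PROJECTED = PLANNED * PERCENTAGE / 100,
                    TOTAL     = CLEARED * PERCENTAGE / 100
                WHERE media = 'Digital'""")

rows = conn.execute(
    "SELECT media, PROJECTED, TOTAL FROM projection_chart ORDER BY media"
).fetchall()
```

Only the 'Digital' row is recalculated (PROJECTED = 200 * 50 / 100 = 100, TOTAL = 80 * 50 / 100 = 40); the 'Print' row is untouched.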
There's a thread at https://github.com/amatsuda/kaminari/issues/545 talking about a problem with a Ruby pagination gem when it encounters large tables.
When the number of records is large, the pagination will display something like:
[1][2][3][4][5][6][7][8][9][10][...][end]
This can incur performance penalties when the number of records is huge, because getting an exact count of, say, 50M+ records takes time. However, all that needs to be known in this case is whether the count is greater than (number of pages to show) * (number of records per page).
Is there a faster SQL operation than getting the exact COUNT, which would merely assert that the COUNT is greater than some value x?
You could try with
SQL Server:
SELECT COUNT(*) FROM (SELECT TOP 1000 * FROM MyTable) X
MySQL:
SELECT COUNT(*) FROM (SELECT * FROM MyTable LIMIT 1000) X
With a little luck, SQL Server/MySQL will optimize this query. Instead of 1000, put (the maximum number of pages you want) * (the number of rows per page).
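The same trick can be demonstrated in SQLite, which shares MySQL's LIMIT syntax. In the sketch below (synthetic table, for illustration), the inner query stops scanning after LIMIT rows, so the cost of the count stays bounded no matter how large the table grows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (id INTEGER)")
conn.executemany("INSERT INTO MyTable VALUES (?)", [(i,) for i in range(5000)])

# Count at most 1000 rows: the subquery halts at the LIMIT, so the
# result is min(actual_count, 1000) at bounded cost.
(capped,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM MyTable LIMIT 1000) X").fetchone()
```

With 5000 rows in the table, capped comes back as 1000, which is all the pagination logic needs to know.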
In SQLite, when I do
SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0;
the data returned are 100 rows with (the first) 100 distinct values of idvar in myTable. That's exactly what I expected.
Now, when I do
SELECT *
FROM myTable
WHERE idvar IN (SELECT DISTINCT idvar
FROM myTable
LIMIT 100
OFFSET 0);
I would expect to get all the rows from myTable corresponding to those 100 distinct values of idvar (so potentially more than 100 rows, if there is more than one row per idvar). What I actually get, however, is roughly 100 rows in total, covering only however many distinct idvar values fit into them. I don't understand why.
Thoughts? How should I build a query that returns what I expected?
Context
I have a 50GB table, and I need to do some calculations using R. Since I can't possibly load that much data into R for memory reasons, I want to work in chunks. It is important, however, that each chunk contains all the rows for a given level of idvar. That's why I use OFFSET and LIMIT in the query, and try to make sure it returns all rows for each level of idvar.
I'm not sure about SQLite, but in other SQL variants the result of an unordered LIMIT query is not guaranteed to be the same every time. So you should also include an ORDER BY in there.
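With an ORDER BY inside the subquery, the IN form behaves as expected: the LIMIT then selects a well-defined set of distinct values, and the outer query returns every row for those values. A small runnable illustration with sqlite3 (10 synthetic idvar levels, 3 rows each; batch size shrunk to 4 to keep the demo small):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (idvar INTEGER, val TEXT)")
# 10 distinct idvar values, 3 rows each
conn.executemany("INSERT INTO myTable VALUES (?, ?)",
                 [(i, "row%d" % j) for i in range(10) for j in range(3)])

# ORDER BY in the subquery pins down *which* distinct values the LIMIT
# keeps, so consecutive batches are stable and non-overlapping.
rows = conn.execute("""
    SELECT * FROM myTable
    WHERE idvar IN (SELECT DISTINCT idvar FROM myTable
                    ORDER BY idvar LIMIT 4 OFFSET 0)
""").fetchall()
```

Here the subquery selects idvar values 0 through 3, and the outer query returns all 12 of their rows, not just 4.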
But a better idea may be to run a separate query at the beginning to read all of the distinct IDs into R, split those into batches of 100, and then run a separate query for each batch. That should be clearer, faster, and easier to debug.
Edit: example R code. Let's say you have 100k distinct IDs in the variable ids.
for (i in 1:1000) {
  tmp.ids <- ids[((i - 1) * 100 + 1) : (i * 100)]
  query <- paste0("SELECT * FROM myTable WHERE idvar IN (",
                  paste0(tmp.ids, collapse = ", "),
                  ")")
  res <- dbSendQuery(con, query)
  chunk <- dbFetch(res)   # fetch the results, process the chunk here
  dbClearResult(res)
}
I have a table tmp_drop_ids with one column, id, and 3.3 million entries. I want to iterate over the table, doing something with every 200 entries. I have this code:
LIMIT = 200
for offset in xrange(0, drop_count + LIMIT, LIMIT):
    print "Making tmp table with ids %s to %s/%s" % (offset, offset + LIMIT, drop_count)
    query = """DROP TABLE IF EXISTS tmp_cur_drop_ids;
               CREATE TABLE tmp_cur_drop_ids AS
               SELECT id FROM tmp_drop_ids ORDER BY id OFFSET %s LIMIT %s;""" % (offset, LIMIT)
    cursor.execute(query)
This runs fine at first (~0.15 s to generate the tmp table), but it slows down occasionally; e.g. around 300k rows in, it started taking 11-12 seconds to generate this tmp table, and again around 400k. It basically seems unreliable.
I will use those ids in other queries so I figured the best place to have them was in a tmp table. Is there any better way to iterate through results like this?
Use a cursor instead. OFFSET and LIMIT are pretty expensive, because pg has to execute the query, then process and skip OFFSET rows. OFFSET means "skip rows", and that is expensive.
cursor documentation
A cursor allows iteration over a single query.
BEGIN
DECLARE C CURSOR FOR SELECT * FROM big_table;
FETCH 300 FROM C; -- get 300 rows
FETCH 300 FROM C; -- get 300 rows
...
COMMIT;
You can probably use a server-side cursor without an explicit DECLARE statement, using psycopg's built-in support (see the section about server-side cursors in its documentation).
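In psycopg2, passing a name to conn.cursor() is what makes it a server-side cursor; the library then issues DECLARE/FETCH under the hood. A sketch of the pattern (the connection string and cursor name are placeholders, and this obviously needs a running PostgreSQL server):

```python
import psycopg2

# Hypothetical connection parameters -- adjust for your setup.
conn = psycopg2.connect("dbname=mydb")

# A *named* cursor is server-side: rows stream to the client in batches
# of itersize instead of the whole result being materialized at once.
cur = conn.cursor(name="drop_ids_cursor")
cur.itersize = 300  # rows fetched per network round trip

cur.execute("SELECT id FROM tmp_drop_ids ORDER BY id")
for (row_id,) in cur:
    pass  # process each id here

cur.close()
conn.commit()
```

Named cursors must live inside a transaction, which is why the commit comes only after iteration finishes.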
If your ids are indexed, you can combine LIMIT with a ">" condition; for example, in Python-like pseudocode:
limit = 200
max_processed_id = -1
query("create table tmp_cur_drop_ids(id int)")
while True:
    query("truncate tmp_cur_drop_ids")
    query("insert into tmp_cur_drop_ids(id)"
          " select id from tmp_drop_ids"
          " where id > %d order by id limit %d" % (max_processed_id, limit))
    max_processed_id = query("select max(id) from tmp_cur_drop_ids")
    if max_processed_id is None:
        break
    process_tmp_cur_drop_ids()
query("drop table tmp_cur_drop_ids")
This way Postgres can use index for your query.
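This pattern is often called keyset pagination: each batch resumes from the last id seen, so the "id > ?" predicate is satisfied directly by the index and no rows are skipped or re-scanned. A runnable version of the loop, sketched with SQLite instead of Postgres so it is self-contained (1000 synthetic ids, batches of 200):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_drop_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tmp_drop_ids VALUES (?)",
                 [(i,) for i in range(1000)])

# Keyset pagination: resume from the last id seen rather than skipping
# an ever-growing OFFSET, so each batch costs the same.
limit = 200
last_id = -1
batches = 0
total = 0
while True:
    rows = conn.execute(
        "SELECT id FROM tmp_drop_ids WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit)).fetchall()
    if not rows:
        break
    batches += 1
    total += len(rows)
    last_id = rows[-1][0]  # process the batch here, then advance the key
```

Unlike the OFFSET version in the question, batch N here is no more expensive than batch 1, which is why the slowdowns around 300k-400k rows disappear.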