Django - Iterating over Raw Query is slow

I have a query which uses a window function. I am using a raw query to filter on that annotated field, since Django doesn't allow filtering against a window expression (at least not in the version I am using).
So it would look something like this (simplified):
from django.db.models import F, Window
from django.db.models.functions import RowNumber

# Returns 440k rows
user_files = Files.objects.filter(file__deleted=False).filter(
    user__terminated__gte=today
).annotate(
    row_number=Window(
        expression=RowNumber(),
        partition_by=[F("user")],
        order_by=[F("creation_date").desc()],
    )
)
I am basically trying to get the latest non-deleted file for each user whose account is not terminated.
Afterwards, I use the following raw query to get what I want:
# Returns 9k rows
sql, params = user_files.query.sql_with_params()
latest_user_files = Files.objects.raw(
    f"select * from ({sql}) sq where row_number = 1", params
)
If I run these queries directly in the database, they finish quickly (about 300 ms). But as soon as I try to iterate over the result, or even just print it, execution takes very long: anywhere from 100 to 200 seconds, even though the query itself takes a little less than half a second. Is there anything I am missing? Is the extra row_number field in the raw query an issue?
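One way to narrow down where the time goes (query execution versus fetching and model instantiation) is to time a plain cursor fetch of the same SQL alongside the RawQuerySet evaluation. A minimal sketch, reusing sql, params and latest_user_files from above; the timing harness is illustrative, not part of the original code:
import time
from django.db import connection

# Time the raw fetch alone: no model instances are created here.
start = time.time()
with connection.cursor() as cursor:
    cursor.execute(f"select * from ({sql}) sq where row_number = 1", params)
    rows = cursor.fetchall()
print(f"plain cursor: {len(rows)} rows in {time.time() - start:.2f}s")

# Time the full RawQuerySet evaluation, including model instantiation.
start = time.time()
files = list(latest_user_files)
print(f"RawQuerySet: {len(files)} objects in {time.time() - start:.2f}s")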
Thank you for any hint/answers.
(Using Django 3.2 and Python 3.9)

Related

What is 'showString' in Spark UI? Are the code blocks below different?

(1)
df.createOrReplaceTempView("dftable")
sqldf = spark.sql("SELECT COUNT(*) FROM dftable")
sqldf.show()
(2)
df.createOrReplaceTempView("dftable")
sqldf = spark.sql("SELECT * FROM dftable")
sqldf.count()
What is the difference between the two code blocks above? (1) takes 20 seconds to run while (2) takes only 5 seconds. The only difference I was able to notice is that in their corresponding stages, there is something like "showString at NativeMethodAccessorImpl.java:0" in (1), while (2) has "count at NativeMethodAccessorImpl.java:0". I attached their corresponding stages too. Stage 46 is for (1) and Stage 43 is for (2).
https://i.stack.imgur.com/Vcf9A.png
https://i.stack.imgur.com/pGOxC.png
As you already mentioned in your question, the first snippet uses .show() while the second uses .count(), and those are two completely different actions. That's why you see different Spark jobs (the blue boxes) when you check their execution DAGs. Since they trigger different actions, and some actions are more expensive than others (especially actions that bring data back to the driver node, such as .collect()), you can't expect them to take the same time.
Back to your example: the reason .count() is faster than .show() here is that .count() is distributed, so the final count is just the sum of the per-partition counts computed on the driver, while .show() needs to fetch the rows you requested (20 by default) back to the driver.
You can try other expensive actions, such as .collect(), to see how much time and resources different actions require.
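As a rough illustration of how the choice of action affects wall-clock time, here is a minimal sketch (the DataFrame is a synthetic stand-in, not the question's data; timings will vary with cluster and data size):
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # synthetic stand-in for the question's df
df.createOrReplaceTempView("dftable")
sqldf = spark.sql("SELECT * FROM dftable")

# Each action triggers its own job; cost grows with how much data must
# reach the driver: count returns one number, show fetches 20 rows,
# collect pulls the entire result set.
for label, action in [("count", sqldf.count),
                      ("show", sqldf.show),
                      ("collect", sqldf.collect)]:
    start = time.time()
    action()
    print(f"{label}: {time.time() - start:.2f}s")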

Query fast with literal but slow with variable

I am using TypeORM for SQL Server in my application. When I run a native query like connection.query("select * from user where id = 1"), the performance is really good: it takes less than 0.5 seconds.
If we use the findOne or QueryBuilder method, it takes around 5 seconds to get a response.
On further debugging, we found that passing the value directly into the query like this,
return getConnection("vehicle").createQueryBuilder()
    .select("vehicle")
    .from(Vehicle, "vehicle")
    .where("vehicle.id = '" + id + "'")
    .getOne();
is faster than
return getConnection("vehicle").createQueryBuilder()
    .select("vehicle")
    .from(Vehicle, "vehicle")
    .where("vehicle.id = :id", { id: id })
    .getOne();
Is there any optimization we can do to fix the issue with the parameterized query?
I don't know TypeORM, but the difference seems clear to me. In one case you query the database for the whole table and filter it locally, and in the other you send the filter to the database, so it filters the data before sending it back to the client.
Depending on the size of the table, this has a big impact. Consider picking one record out of 10 million: just the time to transfer the data to the local machine is 10 million times longer.
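The effect described above can be reproduced in any client language. Here is a minimal sketch using Python's sqlite3 module (not TypeORM or SQL Server; the table and sizes are invented) that contrasts filtering in the database with fetching everything and filtering client-side:
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO user (id, name) VALUES (?, ?)",
                 ((i, f"user{i}") for i in range(1_000_000)))

# Server-side filter: the database uses the primary key index and
# returns a single row to the client.
start = time.time()
row = conn.execute("SELECT * FROM user WHERE id = ?", (999_999,)).fetchone()
print(f"filtered in the database: {time.time() - start:.4f}s")

# Client-side filter: every row crosses the driver boundary and is
# discarded in Python, which is what makes it slow.
start = time.time()
rows = [r for r in conn.execute("SELECT * FROM user") if r[0] == 999_999]
print(f"filtered in the client:   {time.time() - start:.4f}s")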

SQL - When data are transferred

I need to get a large amount of data from a remote database. The idea is to do a sort of pagination, like this:
1. Select a first block of data:
SELECT * FROM TABLE LIMIT 1,10000
2. Process that block:
while ($row = mysql_fetch_array($result)) {
    // do something
}
3. Get the next block,
and so on.
Assuming 10000 is an acceptable block size for my system, suppose I have 30000 records to get: I perform 3 calls to the remote system.
But my question is: when executing a SELECT, is the result set transmitted and then stored locally, so that each fetch is a local operation? Or is the result set kept on the remote system, with records coming over one by one on each fetch? Because if the second scenario is the real one, I am not performing 3 calls but 30000, and that is not what I want.
I hope I have explained it well; thanks for any help.
First, it's highly recommended to use MySQLi or PDO instead of the deprecated mysql_* functions:
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default, with both the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when the query is executed, but this can be changed to load results on demand as rows are retrieved, if needed or desired.
mysql
- mysql_query() buffers the entire result set in PHP's memory.
- mysql_unbuffered_query() only retrieves data from the database as rows are requested.
mysqli
- mysqli::query(): the $resultmode parameter determines the behaviour. The default, MYSQLI_STORE_RESULT, causes the entire result set to be transferred to PHP's memory, while MYSQLI_USE_RESULT causes rows to be retrieved from the server as they are requested.
PDO
- With the MySQL driver, queries are buffered by default; the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute controls whether the result set is buffered in PHP's memory or streamed as rows are fetched with PDO::fetch().
- To retrieve all data from the result set into a PHP array at once, use PDO::fetchAll().
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.
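For comparison, the same buffered-versus-unbuffered distinction exists in other languages' MySQL drivers. A minimal sketch with Python's PyMySQL (connection details and table name are placeholders):
import pymysql
import pymysql.cursors

# Placeholder credentials; substitute your own.
conn = pymysql.connect(host="localhost", user="app",
                       password="secret", database="mydb")

# Buffered (the default cursor): the whole result set is transferred
# into client memory when execute() returns; iteration is then local.
with conn.cursor() as cur:
    cur.execute("SELECT * FROM big_table")
    for row in cur:
        pass  # rows are already in memory

# Unbuffered (SSCursor): rows stay on the server and stream over the
# wire as you iterate, keeping client memory flat.
with conn.cursor(pymysql.cursors.SSCursor) as cur:
    cur.execute("SELECT * FROM big_table")
    for row in cur:
        pass  # each iteration pulls data from the server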
You would be performing 3 calls, not 30000. That's for sure.
Each batch of 10000 results is produced on the server by one of the 3 queries. Your while loop iterates through a set of data that has already been returned by MySQL (that's why you don't end up with 30000 queries).
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
    // do something with $row
}
Anything you do inside the while loop with $row operates on data that has already been fetched by your initial query.
Hope this answers your question.
According to the documentation for mysql_fetch_array(), all the data is fetched first and then you go through it locally.
From the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition, it seems this function is deprecated, so you might want to use one of the alternatives suggested there.

Need for long and dynamic select query/view sqlite

I need to generate a long SELECT query with potentially thousands of WHERE conditions, like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? OR ...) AND ...
I initially started building a class to make this more bearable, but have since stopped to wonder whether it will work well. This query is going to be hammering a table of potentially tens of millions of rows, joined with 2 more tables of thousands of rows each.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temporary view so I could easily port my existing code base; the point is that I want to filter the data down for analysis based on parameters selected in a GUI. How poorly will a view perform in this scenario?
2.) Can SQLite even parse a query with thousands of bind values?
3.) Isn't there a framework that can make generating this query easier, other than plain string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory, and then write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally as the user scrolls, so blocking a draw while calling query.next() 100,000 times might cause an annoying delay. Ideally I don't want to wait more than 30 ms for the next row that satisfies everything.
Edit:
New issue: it has come to my attention that SQLite3 is limited to 999 bind values (host parameters) by a compile-time setting.
So it seems the only ways to accomplish what I had originally intended are to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slowly SQLite3 will parse all of it), or
2.) Use the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (updating the index variable and re-querying repeatedly).
I did end up writing a wrapper around the QSqlQuery object that walks the table using index > variable with a LIMIT clause, fetching one block at a time.
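For reference, a minimal sketch of the same keyset-pagination idea using Python's sqlite3 (the actual wrapper is around Qt's QSqlQuery, which isn't shown in the question; table and column names here are invented):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (idx INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO data (idx, value) VALUES (?, ?)",
                 ((i, i * 0.5) for i in range(100_000)))

def walk(conn, batch_size=10_000):
    """Yield all rows block by block, using idx > last_seen instead of OFFSET."""
    last_idx = -1
    while True:
        rows = conn.execute(
            "SELECT idx, value FROM data WHERE idx > ? ORDER BY idx LIMIT ?",
            (last_idx, batch_size),
        ).fetchall()
        if not rows:
            break
        yield from rows
        last_idx = rows[-1][0]  # resume just past the last row seen

for idx, value in walk(conn):
    pass  # filter or draw incrementally here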
Consider dumping the joined results without filters (denormalized) into a flat file and indexing it with FastBit, a bitmap index engine.

When Does Django Perform the Database Lookup?

From the following code:
dvdList = Dvd.objects.filter(title=someDvdTitle)[:10]
for dvd in dvdList:
    # str() guards against price being a non-string type such as Decimal
    result = "Title: " + dvd.title + " # " + str(dvd.price) + "."
When does Django do the lookup? Maybe it's just paranoia, but it seems that if I comment out the for loop, the code returns a lot quicker. Is the first line just setting up a filter and the for loop executing it, or am I completely muddled up? What actually happens with those lines of code?
EDIT:
What would happen if I limited the objects.filter to '1000' and then implemented a counter in the for loop that broke out of it after 10 iterations? Would that effectively fetch only 10 values, or 1000?
Django querysets are evaluated lazily, so yes, the query won't actually be executed until you try to get values out of it (as you're doing in the for loop).
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print e.headline
...(snip)...
See When QuerySets are evaluated.
Per your edit:
If you limited the filter to 1000 and then implemented a counter in the for loop that broke out after 10 iterations, you'd still hit the database for all 1000 rows. Django has no way of knowing ahead of time exactly what you're going to do with the QuerySet; it just knows you want some data out of it, so it evaluates the query it has built up.
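A short sketch of the difference, reusing the question's names (Dvd and someDvdTitle are from the question):
# Slicing before evaluation puts LIMIT 10 into the SQL itself,
# so only 10 rows ever leave the database:
for dvd in Dvd.objects.filter(title=someDvdTitle)[:10]:
    print(dvd.title)

# Breaking out of the loop early does not reduce the transfer: the
# queryset is evaluated in full on the first iteration, so all 1000
# rows are fetched before the break takes effect.
for i, dvd in enumerate(Dvd.objects.filter(title=someDvdTitle)[:1000]):
    if i >= 10:
        break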
It may also be good to evaluate everything at once using list() or any other way of forcing evaluation of the query. I find that this sometimes boosts performance (you don't pay for a database round-trip every time).
You can find more about when Django evaluates querysets in the documentation.
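For instance, again with the question's model:
# Evaluates immediately and hits the database exactly once;
# reusing dvd_list afterwards triggers no further queries.
dvd_list = list(Dvd.objects.filter(title=someDvdTitle)[:10])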