SQL - When is the data transferred?

I need to get a large amount of data from a remote database. The idea is to do a sort of pagination, like this:
1. Select a first block of data:
SELECT * FROM TABLE LIMIT 1,10000
2. Process that block:
while(mysql_fetch_array()...){
    //do something
}
3. Get the next block
and so on.
Assuming 10000 is an acceptable block size for my system, let's suppose I have 30000 records to get: I perform 3 calls to the remote system.
But my question is: when executing a SELECT, is the result set transmitted and then stored somewhere locally, so that each fetch is local? Or is the result set kept on the remote system, with records coming over one by one on each fetch? Because if the second scenario is the real one, I don't perform 3 calls but 30000, and that is not what I want.
I hope I explained myself clearly, thanks for the help.

First, it's highly recommended to use MySQLi or PDO instead of the deprecated mysql_* functions:
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default with the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when executing the query, but this can be changed to load results on demand as rows are retrieved if needed or desired.
mysql
mysql_query() buffers the entire result set in PHP's memory
mysql_unbuffered_query() only retrieves data from the database as rows are requested
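For completeness, a minimal sketch of the question's loop using the legacy API's on-demand mode (the mysql_* functions are deprecated; this is shown only because the question uses them):
$res = mysql_unbuffered_query('SELECT * FROM TABLE');
while ($row = mysql_fetch_array($res)) {
    // rows arrive from the server as they are fetched,
    // not all at once before the loop starts
}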
mysqli
mysqli::query()
The $resultmode parameter determines behaviour.
The default value of MYSQLI_STORE_RESULT causes the entire result set to be transferred to PHP's memory, but using MYSQLI_USE_RESULT will cause the rows to be retrieved as they are requested.
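A minimal sketch of an unbuffered mysqli query (connection details and table name are placeholders):
// MYSQLI_USE_RESULT: rows are pulled from the server as you iterate,
// instead of the whole result set being copied into PHP's memory first.
$mysqli = new mysqli('remote-host', 'user', 'password', 'dbname');
$result = $mysqli->query('SELECT * FROM big_table', MYSQLI_USE_RESULT);
while ($row = $result->fetch_assoc()) {
    // process $row
}
// An unbuffered result must be consumed/freed before issuing another
// query on the same connection.
$result->free();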
PDO by default will load data as needed when using PDO::query() or PDO::prepare() to execute the query and retrieving results with PDO::fetch().
To retrieve all data from the result set into a PHP array, you can use PDO::fetchAll()
Prepared statements can also use the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY constant, though PDO::fetchAll() is recommended.
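A comparable sketch with PDO, explicitly requesting unbuffered results via that attribute (connection details and table name are placeholders):
$pdo = new PDO('mysql:host=remote-host;dbname=mydb', 'user', 'password');
// With buffered queries switched off, rows are streamed from the server
// as fetch() is called instead of being copied into PHP's memory up front.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query('SELECT * FROM big_table');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process $row
}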
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.

You would be performing 3 calls, not 30,000. That's for sure.
Each batch of 10,000 results is produced on the server (one batch per query, so 3 queries in total). Your while loop iterates through a set of data that has already been returned by MySQL (that's why you don't end up with 30,000 queries).
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
    //do something with $row
}
Anything you do inside the while loop by making use of $row has to do with already-fetched data from your initial query.
Hope this answers your question.

According to the documentation for mysql_fetch_array, all the data is fetched from the server first, and then you go through it.
From the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition, it seems this function is deprecated, so you might want to use one of the alternatives suggested there.

Related

Query fast with literal but slow with variable

I am using TypeORM for SQL Server in my application. When I pass a native query like connection.query(`select * from user where id = 1`), the performance is really good and it takes less than 0.5 seconds.
If we use the findOne or QueryBuilder method, it takes around 5 seconds to get a response.
On further debugging, we found that passing the value directly into the query like this,
return getConnection("vehicle").createQueryBuilder()
    .select("vehicle")
    .from(Vehicle, "vehicle")
    .where("vehicle.id='" + id + "'").getOne();
is faster than
return getConnection("vehicle").createQueryBuilder()
    .select("vehicle")
    .from(Vehicle, "vehicle")
    .where("vehicle.id =:id", {id:id}).getOne();
Is there any optimization we can do to fix the issue with the parameterized query?
I don't know TypeORM, but the difference seems clear to me. In one case you query the database for the whole table and filter it locally, and in the other you send the filter to the database and it filters the data before sending it back to the client.
Depending on the size of the table this has a big impact. Consider picking one record out of 10 million: just the time to transfer the data to the local machine is roughly 10 million times longer.
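In SQL terms, the difference described above is roughly the following (whether the filter travels with the query or is applied afterwards in application code):
-- Filter applied by the database: only the matching row crosses the network.
SELECT * FROM vehicle WHERE id = @id;
-- Filter applied locally: every row is transferred, then discarded client-side.
SELECT * FROM vehicle;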

How to count rows of finished BigQuery job using node.js client library

I would like to get the row count of a job that was run using:
bigquery.startQuery(options)
The naive way of doing this, would be to stream the result (e.g. using):
job.getQueryResultsStream()
And count one by one. This obviously isn't very efficient, especially for large results. Another way I thought of is using the metadata of the job:
job.on('complete', function(metadata) {...});
Where I could kind of "reverse engineer" the response, to get the query plan, and see the number of written rows in the last step. I could find that in:
statistics.query.queryPlan[statistics.query.queryPlan.length - 1].recordsWritten
While a sample of different queries convinced me that this might work, it feels like a "hack", and it's difficult to say how robust it will be. It seems like I might need to handle different cases (failed queries, etc.).
EDIT: Another option suggested below is "SELECT COUNT"ing the temp table created by the original query (available in the job metadata). While this absolutely is a straightforward way to get the result I'm looking for, it has the disadvantage of requiring another roundtrip to query the BigQuery service, which costs several seconds. It is a 0 "bytes billed" query (counting a full table uses table metadata only), but it seems redundant when the job "knows" how many rows it has written to the output.
Is there a straightforward and "correct" way to get this count from the job object, without a roundtrip to BQ service? Perhaps a field I missed / misinterpreted, or a function in the job object that returns this?
Any query job has a destination table - even when you do not explicitly set one, the result is still saved in a so-called anonymous table that you can in turn query to get the count of output rows. So the simple extra query below will work (note - the names are just an example):
SELECT COUNT(1)
FROM `yourProject._0511743a77ca76c1b55482d7cb1f8e91ac5c7b36.anon17286defe54b5c07ba6810a71abfdba6388ac4e0`
The actual destination table to use can be retrieved from the configuration.query.destinationTable property of the job.
job.on('complete', function(metadata) {
    console.log(metadata.statistics.query.numDmlAffectedRows)
});
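Putting the two pieces together, a rough node.js sketch (the destination table reference comes from the job's own metadata; exact option names may vary between client library versions, and error handling is kept minimal):
job.on('complete', function(metadata) {
    // The anonymous table the query results were written to.
    var dest = metadata.configuration.query.destinationTable;
    var countSql = 'SELECT COUNT(1) AS cnt FROM `' +
        dest.projectId + '.' + dest.datasetId + '.' + dest.tableId + '`';
    // Extra roundtrip, but counting a full table is metadata-only (0 bytes billed).
    bigquery.query({ query: countSql, useLegacySql: false }, function(err, rows) {
        if (err) return console.error(err);
        console.log('row count:', rows[0].cnt);
    });
});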

Need for long and dynamic select query/view sqlite

I have a need to generate a long select query of potentially thousands of where conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND....
I initially started building a class to make this more bearable, but have since stopped to wonder if this will work well. This query is going to be hammering a table of potentially 10s of millions of rows joined with 2 more tables with thousands of rows.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temporary view so I could easily carry over the existing code base. The point here is that I want to filter the data I have down for analysis based on parameters selected in a GUI, so how poorly will a view perform in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally as the user scrolls, so waiting to draw while calling query.next() 100,000 times might cause an annoying delay. Ideally I don't want to have to wait on the next row that satisfies everything for more than 30ms at a time.
edit:
New issue: it came to my attention that sqlite3 is limited to 999 bind values (host parameters) at compile time.
So it seems as if the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all of that data inside sqlite3 will be)
or
2.) Do the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly).
I did end up writing a wrapper around the QSqlQuery object that uses index > variable and a limit to allow "walking" the table.
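For reference, that "walking" approach boils down to a keyset-style query that is re-issued with the last index seen in the previous batch (table and column names are placeholders):
SELECT *
FROM joined_results        -- placeholder for the joined/filtered source
WHERE idx > ?              -- last idx value from the previous batch
ORDER BY idx
LIMIT ?                    -- batch size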
Consider dumping the joined results without filters (denormalized) into a flat file and indexing it with FastBit, a bitmap index engine.

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (the db is optimized and performs this very quickly), but it is a bit too much for Python to handle - there is a long string referenced in each row, storing the URLs for thumbnails.
I only really need three fields from each row, but if all the fields are included it suddenly consumes about 5 kB/row, which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
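Applied to the line from the question, a minimal sketch (the three field names are placeholders for the ones actually needed):
# Only these columns are selected, so the long thumbnail-URL string is
# never transferred into Python memory.
l2 = Photograph.objects.filter(**(movie.get_selectors())).values('id', 'title', 'taken_at')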
Check out the QuerySet method only(). When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields of your objects until you try to access them.
If you have to deal with ForeignKeys that must also be pre-fetched, then check out select_related as well.
The two links above to the Django documentation have good examples that should clarify their use.
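A minimal sketch of the only()/select_related() approach (field and relation names are placeholders):
# only() defers every column except the listed ones; touching a deferred
# field later costs an extra query per object, so list everything you need.
qs = Photograph.objects.filter(**(movie.get_selectors())).only('id', 'title', 'taken_at')
# If a ForeignKey (here a hypothetical 'album' relation) is accessed in the
# loop, pull it in with the same query to avoid per-row lookups.
qs = qs.select_related('album')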
Take a look at the Django Debug Toolbar; it comes with a debugsqlshell management command that lets you see the SQL queries being generated, along with the time taken, as you play around with your models in a Django/Python shell.

groovy sql eachRow and rows method

I am new to grails and groovy.
Can anyone please explain to me the difference between these two groovy sql methods
sql.eachRow
sql.rows
Also, which is more efficient?
I am working on an application that retrieves data from the database (the result set is very large) and writes it to a CSV file or returns it in JSON format.
I was wondering which of the two methods mentioned above to use to get the process done faster and more efficiently.
Can anyone please explain to me the difference between these two groovy sql methods sql.eachRow sql.rows
It's difficult to tell exactly which 2 methods you're referring to because there are a large number of overloaded versions of each method. However, in all cases, eachRow returns nothing
void eachRow(String sql, Closure closure)
whereas rows returns a list of rows
List rows(String sql)
So if you use eachRow, the closure passed in as the second parameter should handle each row, e.g.
sql.eachRow("select * from PERSON where lastname = 'murphy'") { row ->
    println "$row.firstname"
}
whereas if you use rows the rows are returned, and therefore should be handled by the caller, e.g.
rows("select * from PERSON where lastname = 'murphy'").each {row ->
println "$row.firstname"
}
Also, which is more efficient?
This question is almost unanswerable. Even if I had implemented these methods myself there's no way of knowing which one will perform better for you because I don't know
what hardware you're using
what JVM you're targeting
what version of Groovy you're using
what parameters you'll be passing
whether this method is a bottleneck for your application's performance
or any of the other factors that influence a method's performance that cannot be determined from the source code alone. The only way you can get a useful answer to the question of which method is more efficient for you is by measuring the performance of each.
Despite everything I've said above, I would be amazed if the performance difference between these two was in any way significant, so if I were you, I would choose whichever one you find more convenient. If you find later on that this method is a performance bottleneck, try using the other one instead (but I'll bet you a dollar to a dime it makes no difference).
If we set aside minor syntax differences, there is one difference that seems important. Let's consider
sql.rows("select * from my_table").each { row -> doIt(row) }
vs
sql.eachRow("select * from my_table") { row -> doIt(row) }
The first one opens a connection, retrieves the results, closes the connection and returns them. Now you can iterate over the results while the connection is released. The drawback is that you now have the entire result list in memory, which in some cases might be a lot.
eachRow, on the other hand, opens a connection and, while keeping it open, executes your closure for each row. If your closure operates on the database and requires another connection, your code will consume two connections from the pool at the same time. The connection used by eachRow is released after it iterates through all the resulting rows. Also, if you don't perform any database operations but the closure takes a while to execute, you will be blocking one database connection until eachRow completes.
I am not 100% sure, but it's possible that eachRow allows you not to keep all the resulting rows in memory and instead access them through a cursor - this may depend on the database driver.
If you don't perform any database operations inside your closure, the closure executes quickly, and the result list is big enough to impact memory, then I'd go for eachRow. If you do perform DB operations inside the closure, or each closure call takes significant time while the result list is manageable, then go for rows.
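For the CSV case from the question, a minimal eachRow sketch that streams rows straight to a file without holding the whole result set in memory (assuming sql is an existing groovy.sql.Sql instance; the file name, query, columns and naive CSV formatting are placeholders):
// Each row is written and discarded, so memory use stays flat even for
// very large result sets.
new File('export.csv').withWriter { writer ->
    sql.eachRow('select id, name, created from my_table') { row ->
        writer.writeLine("${row.id},${row.name},${row.created}")
    }
}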
They differ in signature only - both support result set paging, so both will be efficient. Use whichever fits your code.