DBI/SQL performance on SELECT via different methods?

I've been looking around and haven't been able to find a good answer to this question. Is there a real performance difference between the following DBI methods?
fetchrow_arrayref vs
selectall_arrayref vs
fetchall_arrayref
It probably doesn't matter when making a single SELECT call that returns a smallish (<50 records) result set, but what about making multiple SELECT statements in a row? What if the result sets are huge (i.e. thousands of records)?
Thanks.

The key question to ask yourself is whether you need to keep all the returned rows around in memory.
If you do then you can let the DBI fetch them all for you - it will be faster than writing the equivalent code yourself.
If you don't need to keep all the rows in memory, in other words, if you can process each row in turn, then using fetchrow_arrayref in a loop will typically be much faster.
The reason is that the DBI goes to great lengths to reuse the memory buffers for each row. That can be a significant saving in effort. You can see the effect on this slide, although the examples don't directly match your question. You can also see from that slide the importance of benchmarking. It can be hard to tell where the balance lies.
If you can work on a per-row basis, then binding columns can yield a useful performance gain by reducing the work of accessing the values in the fetched rows.
You also asked about "huge" results. (I doubt "thousands of records" would be a problem on modern machines unless the rows themselves were very 'large'.) Clearly processing a row at a time is preferable for very large result sets. Note that some databases default to streaming all the results to the driver which then buffers them in a compact form and returns the rows one by one as your perl code (or a DBI method) fetches them. Again, benchmark and test for yourself.
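As a rough illustration of the row-at-a-time versus fetch-everything trade-off, here is a minimal sketch in Python using the standard sqlite3 module rather than Perl DBI. The table `t` and its contents are made up for the example. Iterating over the cursor plays the role of a fetchrow_arrayref loop, and fetchall() plays the role of fetchall_arrayref. Both give the same answer; how they compare on time and memory depends on the driver and the data, so benchmark your own case.

```python
import sqlite3

# Hypothetical in-memory table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO t (value) VALUES (?)", [(i,) for i in range(10_000)])


def total_streaming(conn):
    """Process each row in turn; the Python code keeps only one row at a time."""
    total = 0
    for (value,) in conn.execute("SELECT value FROM t"):
        total += value
    return total


def total_materialized(conn):
    """Fetch every row into a list first, then process the list."""
    rows = conn.execute("SELECT value FROM t").fetchall()
    return sum(value for (value,) in rows)
```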
If you need more help, the dbi-users mailing list is a good place to ask. You don't need to subscribe.

The difference between fetchrow* and fetchall* is the location of the loop code in the call stack. That is, fetchall* implies the fetch loop, while fetchrow* implies that you will write your own loop.
The difference between the fetch* and select* methods is that one requires you to manually prepare() and execute() the query, while the other does that for you. The time differences will come from how efficient your code is compared to DBI's.
My reading has shown that the main differences between methods are between *_arrayref and *_hashref, where the *_hashref methods are slower due to the need to look up the hash key names in the database's metadata.
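To make the positional versus name-keyed distinction concrete, here is a small sketch in Python's sqlite3 module (an analogy, not DBI itself; the `users` table is invented for the example). Tuple rows correspond loosely to the *_arrayref methods, and building a dict per row corresponds loosely to *_hashref. In this sketch the extra work is the per-row dict construction.

```python
import sqlite3

# Hypothetical table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

# Positional rows: each row is a plain tuple (closest to *_arrayref).
positional = conn.execute("SELECT id, email FROM users").fetchall()

# Name-keyed rows: build a dict per row from the cursor's column names
# (closest to *_hashref). The extra dict per row is the added cost here.
cur = conn.execute("SELECT id, email FROM users")
names = [col[0] for col in cur.description]
keyed = [dict(zip(names, row)) for row in cur]
```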

Related

BigQuery - how to think about query optimisation when coming from a SQL Server background

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I benchmarked after clustering on a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query caching. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception, but with SQL Server/SQLite/Informix I'm used to seeing an immediate, significant and consistent improvement. Am I misunderstanding clustering (I know it's not exactly like an index, but I expect it to work in a similar way), or could it just be that I've somehow been 'unlucky' with the optimisations?
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINS eliminate as many records as possible, use 'where' clauses that focus on indexed columns, etc. etc.
I feel I haven't got the control to optimise like in a RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I'm yet to read something that makes me think "Aha, now I get it!".
It is absolutely a paradigm shift in how you think. You're right: you have hardly any control over execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery—often the things you do trying to make a query faster actually have a negative impact, because things like table scans are well optimized and there are internals that you can impact (like restructuring a query in a way that seems more optimal, but forces additional shuffles to disk for parallelization).
Some of the biggest areas where our team HAS seen greatly improved performance are as follows:
Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results).
Use partitioning/clustering on large datasets when possible. I know you mention this, just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clustering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning once the criteria have been satisfied (so it doesn't help as much on low-cardinality values).
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter out more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use the CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like something's in the works for that in the BQ world, since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
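For readers new to the nested/repeated schema idea from the first point in the list above, the following plain-Python sketch mimics what CROSS JOIN UNNEST() does conceptually. It does not talk to BigQuery, and the `orders`/`items` names are hypothetical. Each parent record carries its child rows with it, and unnesting yields one flat row per parent/child pair.

```python
# Hypothetical order records with a repeated "items" field stored alongside the parent.
orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "A", "qty": 5}]},
    {"order_id": 3, "items": []},
]


def unnest_items(orders):
    """Yield one flat row per (order, item) pair, like CROSS JOIN UNNEST(items)."""
    for order in orders:
        for item in order["items"]:
            yield {"order_id": order["order_id"], **item}


flat = list(unnest_items(orders))
```

As with a cross join against an empty array, the order with no items contributes no rows to the flattened result.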
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

Why is queried data returned in a result set and not in an array?

Why do most programming languages use the concept of result set when returning data from a database? Why aren't the results returned directly in a more common, immediately usable structure like an array? Why is this extra layer between querying and being able to use the results better or necessary?
An array is just a container of data. A result set is a much more powerful abstraction that encapsulates a very complex interaction between the database server and the client program making the data retrieval request.
"Immediately usable"... that's very naive. Yes, of course, often you just want the data, and often everything goes well and a result set object may seem a bit of a hindrance. But you should stop a moment and think of the complexity that's behind that data retrieval you are executing.
Data fetching
The first and most important consideration is that an array is a static structure that contains all data of all rows. While that might seem like a good solution for small queries, I assure you it is not in most cases. It assumes that fetching all data will require little time and memory, which is not always the case.
RDBMS return one row at a time... that's how things work usually. That way they can serve many clients... you can also cancel your data retrieval... or the RDBMS can take you down if you are hogging too many resources.
The result set handles the complexity of fetching one row or a page of rows or all the rows from the back-end, maybe caching the result internally. It does then allow the program access to just one row of data at a time, adding methods to navigate back and forth, without having to think what is happening behind the scenes. That is not for you to know usually, but there are many optimizations and gotchas.
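A minimal sketch of that idea, using Python's sqlite3 module as a stand-in for a real client/server driver (the `events` table and the `iter_rows` helper are invented for the example): the caller sees one row at a time, while the helper pulls rows from the cursor a page at a time and stops fetching as soon as the caller loses interest.

```python
import sqlite3


def iter_rows(cursor, page_size=100):
    """Yield rows one at a time while fetching them from the cursor in pages."""
    while True:
        page = cursor.fetchmany(page_size)
        if not page:
            return
        yield from page


# Hypothetical table with 250 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO events (id) VALUES (?)", [(i,) for i in range(1, 251)])

cursor = conn.execute("SELECT id FROM events ORDER BY id")
first_three = []
for (event_id,) in iter_rows(cursor, page_size=100):
    first_three.append(event_id)
    if len(first_three) == 3:
        break  # stop early; later pages are never requested by this loop
```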
Unidirectional queries
Some queries on some RDBMS are more efficient if executed unidirectionally. That is you tell the server you will never need to lookup a row of data you have already fetched. But result set objects can often cache this data internally and allow the program to navigate back to it (without disturbing the server).
Updatable queries
Some RDBMS support SELECT FOR UPDATE. Result set objects can often allow the program to modify the fetched data and then handle internally all the operations necessary to reflect those updates on the underlying database... and in many languages this is possible even if the RDBMS does not support SELECT FOR UPDATE.
Better handling of exceptions
When you ask for data, if things go well you get a stream of data that can fit in an array... if things go wrong, you get a stream of information that requires a different structure to be handled. A result set object can provide the client program with that structured information... and can maybe also provide a way of recovering.
I'm adding some more info on cursors, even though it is less relevant to this question. Fetching rows from the server is done through the use of a CURSOR. It typically involves 4 steps (DECLARE the cursor, OPEN it, use it to FETCH data, then CLOSE it). Declaring and opening a CURSOR allocates resources on the server which are used to remember what that specific client is asking for and what data has already been returned. FETCHing allows you to navigate the result set and retrieve another row of data (not necessarily the next row). Closing the cursor tells the server you are done with that request and allows it to deallocate those resources.
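The four steps map loosely onto a client-side cursor object. The sketch below uses Python's sqlite3 module purely as an illustration: SQLite runs in-process, so there is no separate server holding resources, and the mapping to DECLARE/OPEN is approximate. The `numbers` table is made up.

```python
import sqlite3

# Hypothetical table with the values 1..5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1, 6)])

cur = conn.cursor()                              # roughly DECLARE
cur.execute("SELECT n FROM numbers ORDER BY n")  # roughly OPEN
first = cur.fetchone()                           # FETCH one row
second = cur.fetchone()                          # FETCH the next row
cur.close()                                      # CLOSE: the remaining rows are never read
```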
Because arrays require all memory to be allocated at once, and all results to be pulled immediately. You might want to stream through terabytes of data. Or you might want to stop pulling results and abort the query mid-way.
Also note, that the way a specific API exposes query results is arbitrary. You could write yourself an API that exposes the data as an array to you. This is a design choice that the creator of the API has.

Stored Procedure VS. F#

For most SP-taught developers, there is no choice to make between LINQ and stored procedures/functions. That may be true.
However, there is a fork in the road nowadays. Before I spend too much time on the syntax of F#, I would like more input about where the strengths (and weaknesses) of F# lie.
How will F# perform on this topic (against SP)?
F# has to communicate with the database in some way: through a Linq2Sql/Entity application layer, or directly through AnyDbConnection. Nothing new there. But F# has the power of parallelism and less overhead in its work (Functional Programming with C#/F#). F# is also efficient as a layer between data and machine, much as C# is strong as a layer between human and machine.
Would I really still let the DB server handle a request for recurring nodes, or just fetch plain data into F# and handle it there, encapsulated nicely as an object method call from C#?
Would a stored procedure still be the best option for scanning 50 million records to find orphans, or rows matching a criterion that applies to 0.5% of the result?
Would an SP or function still be best for a simple task such as finding the next parent node?
Would an SP still be best for collecting a million records and returning calculated sums and/or periods?
Wouldn't a single F# DLL, built fully on the single responsibility principle, be of more use than stored procedures hooked up inside a SQL Server? There are pros and cons, of course. But what are they?
Stored procedures are not magically super-fast. Often, they're actually rather slow.
Many people will downvote this answer providing anecdotal evidence that a stored procedure once made an application faster overall. However, all of those examples that I've actually seen code for indicate that they totally rethought some bad SQL to package it as an SP. I submit that the discipline of repackaging bad SQL into a procedure helped more than the SP itself.
Most of your points can't be evaluated without a measured benchmark.
I suggest that you do the following.
Write it in F#.
Measure it.
If it's too slow for your production application, then try some stored procedures to see if it's faster. If it's fast enough for your production application, then you have your answer, F# worked for you. For your application. For your data. For your architecture.
There's no "general" answer. Although my benchmarks for some particular kinds of queries indicate that the SP engine is pretty slow compared with Java. F# will probably be faster than the SP engine also.
The important thing is to make sure that the database -- if it's going to be "pure" data -- is already optimized so that queries like your scan of 50 million records for orphans, or for a criterion matching 0.5% of the result, retrieve the rows as quickly as possible. This often involves tweaking buffers and array sizes and other elements of the database-to-F# connection. This usually means that you want a more direct connection so that you can adjust the sizes.
Databases are efficient for certain tasks (e.g. when they can use an index to search for a specified row), but probably won't be any faster than F# if you need to process all rows and update them (in the database) or calculate some new result based on all the data.
As S. Lott suggests, the best option is to try implementing what you need in F# and you'll find out. Parallelism can give you some performance benefits, especially if you're doing some computationally heavy calculations. However, you may still want to store the data in databases, load it and process it in F# (I believe this is how F# was used by adCenter at Microsoft).
Possibly the most important note is that databases give you various guarantees about the consistency of the data - no matter what happens, you'll still end up with consistent state. Implementing this yourself may be tricky (e.g. when updating data), but you need to consider whether you need it or not.
You ask this:
Would a stored procedure still be the best option for scanning 50 million records to find orphans, or rows matching a criterion that applies to 0.5% of the result?
I take your question to mean 'I have this data in SQL Server. Should I query it in SQL or in client code (F# in this case)?' Queries like this should absolutely be performed in SQL if at all possible. If you're doing it in F#, you're transferring those 50 million rows to the client just to do some aggregation/lookups.
I hope I understood your question correctly.
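To make the server-versus-client point concrete, here is a small sketch using Python's sqlite3 module in place of SQL Server and F# (the `parent`/`child` tables and both function names are hypothetical). Both functions find the same orphans. The first lets the database do the join and returns only the matching ids. The second pulls every row from both tables and filters in application code, which is the pattern that becomes expensive at 50 million rows.

```python
import sqlite3

# Hypothetical schema: child 12 points at a parent that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent (id) VALUES (1), (2);
    INSERT INTO child (id, parent_id) VALUES (10, 1), (11, 2), (12, 99), (13, 1);
""")


def orphans_in_sql(conn):
    """Let the database find orphans; only matching ids cross the connection."""
    query = """
        SELECT c.id
        FROM child AS c
        LEFT JOIN parent AS p ON p.id = c.parent_id
        WHERE p.id IS NULL
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(query)]


def orphans_in_client(conn):
    """Pull every row from both tables, then filter in application code."""
    parent_ids = {row[0] for row in conn.execute("SELECT id FROM parent")}
    children = conn.execute("SELECT id, parent_id FROM child ORDER BY id").fetchall()
    return [child_id for child_id, parent_id in children if parent_id not in parent_ids]
```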
As I understand it, an SP just means you call some precompiled execution plan, and you can call it through an API instead of pushing a string to the server. These two save on the order of milliseconds, nowhere near a second. For larger queries that difference is negligible. They're good for high-frequency/high-throughput stuff (and of course for encapsulating complex logic, but that doesn't seem to apply here).
Because an SP uses a precompiled plan, it can indeed be slower than a normal query, because it no longer checks the statistics of the underlying data (the execution plan is already compiled). Since you mention a condition that applies to 0.5% of the rows, this could be important.
In the discussion of SP vs F# I would reword that to 'on the server' vs 'on the client'. If you're talking about higher data volumes (50M rows qualifies), my first choice would always be to 'put the mill where the wood is', which means executing on the server if possible. Only if there is some very complicated logic involved might you want to consider F#, but I don't think that applies. Even then I'd prefer to execute that on the server rather than first drag all those rows over the network (potentially slow).
GJ

BASIC Object-Relation Mapping question asked by a noob

I understand that, in the interest of efficiency, when you query the database you should only return the columns that are needed and no more.
But given that I like to use objects to store the query result, this leaves me in a dilemma:
If I only retrieve the column values that I need in a particular situation, I can only partially populate the object. This, I feel, leaves my object in a non-ideal state where only some of the properties and methods are available. Later, if a situation arises where I would like to reuse the object but find that the new situation requires a different but overlapping set of columns to be returned, I am faced with a choice.
Should I reuse the existing SQL and add to the list of selected columns the additional fields that are required by the new situation so that the same query and object mapping logic can be reused for both? Or should I create another method that results in the execution of a slightly different SQL which results in the populating of only those object properties that were returned by the 2nd query?
I strongly suspect that there is no magic answer and that the answer really "depends" on the situation, but I guess I am looking for general advice. In general, my approach has been either to return all columns from the queried table, or to add columns to the query as they are needed while reusing the same SQL (and mapping code), that is, until performance becomes a concern. In general, I find that unless you are retrieving a large number of rows - and I usually am not - the cost of adding additional columns to the output does not have a noticeable effect on performance, and that the savings in development time and the simplified API that result are a good trade-off.
But how do you deal with this situation when performance does become a factor? Do you create methods like
Employees.GetPersonalInfo
Employees.GetLittleMorePersonlInfoButMinusSalary
etc, etc etc
Or do you somehow end up creating an API where the user of your API has to specify which columns/properties he wants populated/returned, thereby adding complexity and making your API less friendly/easy to use?
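For concreteness, the second option might look something like the sketch below, written in Python with sqlite3. The `employees` table, the `ALLOWED_COLUMNS` tuple and the `get_employees` function are all hypothetical names, not part of any particular ORM. The caller may pass a subset of columns, and the default returns everything.

```python
import sqlite3

ALLOWED_COLUMNS = ("id", "name", "email", "salary")  # hypothetical Employee columns


def get_employees(conn, columns=ALLOWED_COLUMNS):
    """Return employees as dicts containing only the requested columns."""
    unknown = [c for c in columns if c not in ALLOWED_COLUMNS]
    if not columns or unknown:
        raise ValueError(f"invalid column selection: {list(columns)}")
    # Column names come from the whitelist above, so joining them is safe here.
    sql = f"SELECT {', '.join(columns)} FROM employees ORDER BY id"
    return [dict(zip(columns, row)) for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, email TEXT, salary INTEGER)")
conn.execute("INSERT INTO employees VALUES (1, 'Ann', 'ann@example.com', 50000)")

full = get_employees(conn)
partial = get_employees(conn, columns=("id", "name"))
```

The cost is visible in the signature: every caller now has to know which columns exist, which is the added complexity the question is worried about.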
Let's say you want to get Employee info. How many objects would typically be involved?
1) an Employee object
2) An Employees collection object containing one Employee object for each Employee row returned
3) An object, such as EmployeeQueries, that contains methods such as "GetHiredThisWeek", which returns an Employees collection of 0 or more records.
I realize all of this is very subjective, but I am looking for suggestions on what you have found works best for you.
I would say make your application correct first, then worry about performance in this case.
You could be optimizing away your queries only to realize you won't use that query anyway. Create the most generalized queries that your entire app can use, and then as you are confident things are working properly, look for problem areas if needed.
It is likely that you won't have a great need for huge performance up front. Some people say the lazy programmers are the best programmers. Don't over-complicate things up front, make a single Employee object.
If you find a need to optimize, you'll create a method/class, or however your ORM library does it. This should be an exception to the rule; only do it if you have reason to do so.
...the cost of adding additional columns to the output does not have a noticeable effect on performance...
Right. I don't quite understand what "new situation" could arise, but either way, it would be a much better idea (IMO) to get all the columns rather than run multiple queries. There isn't much of a performance penalty at all for getting more columns than you need (although the queries will take more RAM, but that shouldn't be a big issue; besides, hardware is cheap). Also, you'd save yourself quite a bit of development time.
As for the second part of your question, it's really up to you. As an example, Rails takes more of a "usability first, performance last" approach, but that may not be what you want. It just depends on your needs. If you're willing to sacrifice a little usability for performance, by all means, go for it. I would.
If you are using your objects in a "row at a time" CRUD-type application, then by all means copy all the columns into your object. The extra overhead is minimal, and your object becomes truly reusable for any program wanting row access to the table.
However, if your SQL is doing a complex join or returning a large set of rows, then request precisely and only what you want. Otherwise you pay two performance penalties: first, handling each column each time eats up CPU for no benefit; second, most DBMSs have a bag of tricks for optimising queries (such as index-only access) which can only be used if you specify precisely which columns you want.
There is no reuse issue in most of these cases, as scan/search processes tend to be very specific to a particular use case.
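The index-only access mentioned above can be observed directly in SQLite, used here only as a convenient example engine (other DBMSs expose the same idea through their own plan output). In this sketch, with a made-up `employees` table and a hypothetical `plan` helper, the query that names only the indexed column can be answered from the index alone, while SELECT * has to visit the table rows as well.

```python
import sqlite3

# Hypothetical table with an index on the dept column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, name TEXT, bio TEXT)")
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")


def plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Only the indexed column is requested: SQLite reports a covering index.
narrow_plan = plan(conn, "SELECT dept FROM employees WHERE dept = ?", ("sales",))

# All columns are requested: the index alone cannot satisfy the query.
wide_plan = plan(conn, "SELECT * FROM employees WHERE dept = ?", ("sales",))
```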

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, number 1 is the clear winner to me. If it were <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different where all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look at how you could optimize it using caching. ("Premature optimization is the root of all evil", as Donald Knuth put it.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
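As one way of examining a query plan, the sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. The `items` table and the `query_plan` helper are invented for the example, and every engine has its own equivalent command and output format. It shows the same WHERE query being planned as a full scan before an index exists and as an index search afterwards.

```python
import sqlite3

# Hypothetical table with no secondary index yet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, payload TEXT)")


def query_plan(conn, sql, params=()):
    """Return the detail text SQLite reports for how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


sql = "SELECT * FROM items WHERE category = ?"

plan_without_index = query_plan(conn, sql, ("books",))  # full table scan
conn.execute("CREATE INDEX idx_items_category ON items (category)")
plan_with_index = query_plan(conn, sql, ("books",))     # search via the new index
```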
First of all, let us dismiss #2. Searching tables is a data server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is done. If it's in the application code, then you have the same problem as with #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and do not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
If you do this:
SELECT * FROM users;
MySQL first has to look up which fields the table has before it can bring back the data you asked for.
If you do this instead:
SELECT id, email, password FROM users;
MySQL only has to fetch the data, since the fields are explicit.
As for limits: it is always best to query only the number of rows you need, no more, no less. More data means more time to transfer it.