Where are SQL results stored in a GUI client?

Suppose I have a dataset that contains 100B rows and I run a SELECT * SQL query against the table without a limit, and let's suppose the client doesn't impose a limit on top of it either --
As the query runs, the client usually loads the results incrementally into the interface. However, the dataset is much too large to fit on my local machine. What actually happens while it says "Running query..."? Is the data loaded directly into program memory? Is the data saved to something like a tmp file that is memory-mapped (I would think not), or what is the most common way to 'display' the results here? And then finally, what would happen once my local memory limit is exceeded -- would the program just hang or crash?
I know this is a slightly abstract question, but mainly I'm asking how a SQL result set is usually 'loaded' in order to display the results to a user in a user interface.

There may not be a "usual" answer. Different applications are likely to take different approaches depending on the trade-offs they want to make.
The simplest approach is for the client to fetch the first N rows (you tagged this for Oracle SQL Developer where the default N is 50). If you then scroll down in the results, the client will fetch the next N rows. The client keeps the results it has already fetched in memory. If you try to fetch more data than the client machine has memory available (and, of course, the client may have been configured to have virtual memory larger than the physical memory available), the application either crashes or generates some sort of error. Note that depending on the specific implementation, the data could be cached either by the ODBC/JDBC/etc. driver or by the actual application code.
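A minimal sketch of that fetch-N pattern over JDBC (the connection details and table name are placeholders; the fetch size of 50 mirrors SQL Developer's default mentioned above):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class ResultPager {
    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
        Statement stmt = conn.createStatement();

        // Pull 50 rows per network round trip.
        stmt.setFetchSize(50);
        ResultSet rs = stmt.executeQuery("SELECT * FROM big_table");

        // The client keeps every row fetched so far in memory;
        // a scroll event in the GUI just drives more next() calls.
        List<String> fetchedSoFar = new ArrayList<>();
        while (rs.next()) {
            fetchedSoFar.add(rs.getString(1));
            // On a 100B-row table this list eventually exhausts the
            // JVM heap and the loop dies with an OutOfMemoryError --
            // the "crash or error" case described above.
        }
    }
}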
If there is some reason for the client to expect that it would be beneficial to display gigabytes worth of data to a human (or if crashing or erroring out is particularly problematic), the client might write results to a file rather than keeping them in memory. That doesn't seem particularly common in a GUI IDE but I don't use a terribly large number of different GUIs.
Other options are possible (but probably not worth implementing in an application that is supposed to provide results to a human who isn't going to scroll through billions of results). Under the covers, the application or driver could cache a key (in Oracle, normally the ROWID) for the previously returned data rather than the entire row and then re-fetch that data if the user tries to scroll back to the top. The application could discard data that you had already fetched and throw an error if you tried to scroll back from row 1 billion to row 1. Or it could silently re-execute the query if you wanted to go back to the first row.
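A hedged sketch of that key-caching idea, assuming Oracle's ROWID pseudocolumn (the table and column names are invented for illustration):

import java.sql.*;

public class RowidRefetcher {
    // The client caches only each row's ROWID string (small) instead of
    // the full row, and re-fetches the data if the user scrolls back.
    static String refetchByRowid(Connection conn, String rowid) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT col1 FROM big_table WHERE ROWID = CHARTOROWID(?)")) {
            ps.setString(1, rowid);
            try (ResultSet rs = ps.executeQuery()) {
                // The row may have moved or been deleted since it was first seen.
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}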

Handling paging with changing sort orders

I'm creating a RESTful web service (in Golang) which pulls a set of rows from the database and returns it to a client (smartphone app or web application). The service needs to be able to provide paging. The only problem is this data is sorted on a regularly changing "computed" column (for example, the number of "thumbs up" or "thumbs down" a piece of content on a website has), so rows can jump around page numbers in between a client's request.
I've looked at a few PostgreSQL features that I could potentially use to help me solve this problem, but nothing really seems to be a very good solution.
Materialized Views: to hold "stale" data which is only updated every once in a while. This doesn't really solve the problem, as the data would still jump around if the user happens to be paging through the data when the Materialized View is updated.
Cursors: created for each client session and held between requests. This seems like it would be a nightmare if there are a lot of concurrent sessions at once (which there will be).
Does anybody have any suggestions on how to handle this, either on the client side or database side? Is there anything I can really do, or is an issue such as this normally just remedied by the clients consuming the data?
Edit: I should mention that the smartphone app is allowing users to view more pieces of data through "infinite scrolling", so it keeps track of its own list of data client-side.
This is a problem without a perfectly satisfactory solution because you're trying to combine essentially incompatible requirements:
Send only the required amount of data to the client on-demand, i.e. you can't download the whole dataset then paginate it client-side.
Minimise amount of per-client state that the server must keep track of, for scalability with large numbers of clients.
Maintain different state for each client.
This is a "pick any two" kind of situation. You have to compromise; accept that you can't keep each client's pagination state exactly right, accept that you have to download a big data set to the client, or accept that you have to use a huge amount of server resources to maintain client state.
There are variations within those that mix the various compromises, but that's what it all boils down to.
For example, some people will send the client some extra data, enough to satisfy most client requirements. If the client exceeds that, then it gets broken pagination.
Some systems will cache client state for a short period (with short-lived unlogged tables, tempfiles, or whatever), but expire it quickly, so if the client isn't constantly asking for fresh data it gets broken pagination.
Etc.
See also:
How to provide an API client with 1,000,000 database results?
Using "Cursors" for paging in PostgreSQL
Iterate over large external postgres db, manipulate rows, write output to rails postgres db
offset/limit performance optimization
If PostgreSQL count(*) is always slow how to paginate complex queries?
How to return sample row from database one by one
I'd probably implement a hybrid solution of some form, like:
Using a cursor, read and immediately send the first part of the data to the client.
Immediately fetch enough extra data from the cursor to satisfy 99% of clients' requirements. Store it to a fast, unsafe cache like memcached, Redis, BigMemory, EHCache, whatever under a key that'll let me retrieve it for later requests by the same client. Then close the cursor to free the DB resources.
Expire the cache on a least-recently-used basis, so if the client doesn't keep reading fast enough they have to go get a fresh set of data from the DB, and the pagination changes.
If the client wants more results than the vast majority of its peers, pagination will change at some point as you switch to reading directly from the DB rather than the cache, or generate a new, bigger cached dataset.
That way most clients won't notice pagination issues and you don't have to send vast amounts of data to most clients, but you won't melt your DB server. However, you need a big boofy cache to get away with this. Whether it's practical depends on whether your clients can cope with pagination breaking - if it's simply not acceptable to break pagination, then you're stuck with doing it DB-side with cursors, temp tables, copying the whole result set at first request, etc. It also depends on the data set size and how much data each client usually requires.
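A rough sketch of that hybrid in Java, assuming PostgreSQL and a Redis cache via Jedis (the key scheme, page sizes, and TTL are illustrative assumptions, not a tested design):

import java.sql.*;
import java.util.List;
import redis.clients.jedis.Jedis;

public class HybridPager {
    static final int PAGE = 50;      // rows per client page (illustrative)
    static final int PREFETCH = 20;  // pages cached per client (illustrative)

    // First request: read enough rows from a DB cursor to satisfy most
    // clients, stash them in Redis, then release the cursor.
    static void firstRequest(Connection conn, Jedis redis, String clientId)
            throws SQLException {
        conn.setAutoCommit(false);   // lets the PostgreSQL driver use a cursor
        try (Statement st = conn.createStatement()) {
            st.setFetchSize(PAGE);
            try (ResultSet rs = st.executeQuery(
                    "SELECT id, title FROM posts ORDER BY score DESC")) {
                String key = "pages:" + clientId;
                int rows = 0;
                while (rs.next() && rows++ < PAGE * PREFETCH) {
                    redis.rpush(key, rs.getString("id") + "|" + rs.getString("title"));
                }
                redis.expire(key, 300); // TTL stands in for an LRU policy
            }
        }
        conn.commit();               // closes the cursor, frees DB resources
    }

    // Later requests are served from the cache until it expires or runs out.
    static List<String> page(Jedis redis, String clientId, int pageNo) {
        return redis.lrange("pages:" + clientId,
                (long) pageNo * PAGE, (long) (pageNo + 1) * PAGE - 1);
    }
}

When the cached list is exhausted or the TTL fires, the client falls through to a fresh query, which is exactly the "pagination changes" compromise described above.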
I am not aware of a perfect solution for this problem. But if you want the user to have a stale view of the data, then a cursor is the way to go. The only tuning you can do is to store the data for just the first two pages in the cursor. Beyond that, you fetch it again.

Why does my SELECT query take so much longer to run on the web server than on the database itself?

I'm running the following setup:
Physical Server
Windows 2003 Standard Edition R2 SP2
IIS 6
ColdFusion 8
JDBC connection to iSeries AS400 using JT400 driver
I am running a simple SQL query against a file in the database:
SELECT
column1,
column2,
column3,
....
FROM LIB/MYFILE
No conditions.
The file has 81 columns - alphanumeric and numeric - and about 16,000 records.
When I run the query in the emulator using the STRSQL command, the query comes back immediately.
When I run the query on my Web Server, it takes about 30 seconds.
Why is this happening, and is there any way to reduce this time?
While I cannot address whatever overhead might be involved in your web server, I can say there are several other factors to consider:
This most likely has to do primarily with the differences between the way the two system interfaces work.
Your interactive STRSQL session will start displaying results as quickly as it receives the first few pages of data. You are able to page down through that initial data, but generally at some point you will see a status message at the bottom of the screen indicating that it is now getting more data.
I assume your web server is waiting until it receives the entire result set. It wants to get all the data as it is building the HTML page, before it sends the page. Thus you will naturally wait longer.
If this is not how your web server application works, then it is likely to be a JT400 JDBC Properties issue.
If you have overridden any default settings, make sure that those are appropriate.
In some situations the OPTIMIZATION_GOAL settings might be a factor. But if you are reading the table (aka physical file or PF) directly, in its physical sequence, without any index or key, then that might not apply here.
Your interactive STRSQL session will default to a setting of *FIRSTIO, meaning that the query is optimized for returning the first pages of data quickly, which corresponds to the way it works.
Your JDBC connection will default to a "query optimize goal" of "0", which will translate to an OPTIMIZATION_GOAL setting of *ALLIO, unless you are using extended dynamic packages. *ALLIO means the optimizer will try to minimize the time needed to return the entire result set, not just the first pages.
Or, perhaps first try simply adding FOR READ ONLY onto the end of your SELECT statement.
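As a sketch, both suggestions look like this (assuming the JT400 driver's "query optimize goal" and "naming" connection properties; check your driver version's documentation):

import java.sql.*;

public class As400FirstIo {
    public static void main(String[] args) throws SQLException {
        // "query optimize goal=1" requests *FIRSTIO (optimize for the first
        // pages of data); "naming=system" allows the LIB/MYFILE syntax.
        String url = "jdbc:as400://myas400;naming=system;query optimize goal=1";
        try (Connection conn = DriverManager.getConnection(url, "USER", "PASSWORD");
             Statement st = conn.createStatement();
             // FOR READ ONLY lets the driver use a cheaper read-only cursor.
             ResultSet rs = st.executeQuery(
                     "SELECT column1, column2, column3 FROM LIB/MYFILE FOR READ ONLY")) {
            while (rs.next()) {
                // Render rows as they arrive instead of buffering all 16,000.
            }
        }
    }
}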
Update: a more advanced solution
You may be able to bypass the delay caused by waiting for the entire result set as part of constructing the web page to be sent.
Send a web page out to the browser without any records, or limited records, but use AJAX code to load the remainder of the data behind the scenes.
Use large block fetches whenever feasible, to grab plenty of rows in one clip.
One thing you need to remember: the i saves the access paths it creates in the job in case they are needed again. This means that if you log out, log back in, and run your query, it will take longer to run the first time; the second time you run the query it'll be faster. When running queries in a web application, you may or may not be reusing a job, meaning the access paths have to be rebuilt.
If speed is important, I would:
Look into optimizing the query. I know there are better sources, but I can't find them right now.
Create a stored procedure. A stored procedure saves the access paths created.
With only 16000 rows and no WHERE or ORDER BY this thing should scream. Break the problem down to help diagnose where the bottleneck is. Go back to the IBM i, run your query in the SQL command line and then use the B, BOT or BOTTOM command to tell the database to show the last row. THAT will force the database to cough up the entire 16k result set, and give you a better idea of the raw performance on the IBM side. If that's poor, have the IBM administrators run Navigator and monitor the performance for you. It might be something unexpected, like the 'table' is really a view and the columns you are selecting might be user defined functions.
If the performance on the IBM side is OK, then look to what Cold Fusion is doing with the result set. Not being a CF programmer, I'm no help there. But generally, when I am tasked with solving multi-platform performance issues, the client side tends to consume the entire result set and then use program logic to choose what rows to display/work with. The server is MUCH faster than the client, and given the right hints, the database optimiser can make some very good decisions about how to get at those rows.

Disadvantages of SQL Cursor

I was studying cursors and I read somewhere that each time you fetch a row from a cursor, it results in a network round trip, whereas a normal SELECT query makes only one round trip however large the result set is.
Can anyone explain what that means? What do "network round trip" and "one round trip" mean, in detail, with some example? And when do we use a cursor and when do we use a while loop?
Unfortunately, that reference is incorrect.
A "normal SELECT" creates a cursor that the client fetch from. The mechanics are exactly the same as if you open and return a SYS_REFCURSOR (or any other mechanism for opening a cursor). In both cases, the client will fetch a number of rows over the network every time it requests data from the database. The client can control the number of rows that are fetched each time-- it would be exceptionally rare for the client to fetch 1 row or to fetch all the rows from a cursor in a single network round-trip.
What actually happens when a client application fetches from a cursor (no matter how the cursor is opened), the client application sends a request over the network for N rows (again, the client controls the value of N). The database sends the next N rows back to the client (generally, Oracle has to continue executing the query in order to determine the next N rows because Oracle does not generally materialize an entire result set). The client application does something with those N rows and then sends another request over the network for the next N rows and the pattern repeats.
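In JDBC terms, N is just the statement's fetch size; a small sketch against an Oracle connection (where, as far as I know, the driver's default fetch size is 10):

import java.sql.*;

public class FetchSizeDemo {
    static void scan(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.setFetchSize(100); // N = 100 rows per network round trip
            try (ResultSet rs = st.executeQuery(
                    "SELECT object_name FROM all_objects")) {
                int rows = 0;
                // Rows 1-100 arrive in the first round trip; calling
                // next() on row 101 silently triggers the second, and so on.
                while (rs.next()) {
                    rows++;
                }
                System.out.println(rows + " rows in roughly "
                        + (rows / 100 + 1) + " round trips");
            }
        }
    }
}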
In virtually all database systems, the application that uses the data and the DBMS that is responsible for storing and searching the data live on separate machines. They talk to each other over a network. Even when they are on the same machine, there is still effectively a network connection.
This matters because there is some time between when an application decides that it's ready to read data, when that request arrives over the network at the database server, when the database server actually produces the response, and when the response finally arrives over the network on the application side.
When you do a query for a whole set of data, you only pay this cost once. Although it may seem wasteful, it's in fact much more efficient to put the burden of holding on to all of the data on the application, because it's usually easier to give more resources to an application than to do the same on a database server.
When your application fetches data only one row at a time, the cost of the round trip between application and database is paid once per row. If you want to show the titles of 100 blog posts, then you're paying the cost of 100 round trips to the database for that one report. What's worse is that the database server has to somehow keep track of the partially completed result set. That usually means that resources that could be used for querying data for another request are instead being held for an application that hasn't yet asked for all of the data it originally requested.
The basic rule is to talk to the database only when you absolutely have to, and to make the interaction as short as possible. This means you only ask for the data you really need (have the database do as much filtering as possible, instead of doing it in the application) and accept all of the data as quickly as possible, so that the database can move on to another task.
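The blog-post example in code (hypothetical table and column names): the first version pays roughly 100 round trips, the second only one or a few:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class RoundTrips {
    // Row-at-a-time: one network round trip per title. Avoid this.
    static List<String> titlesOneByOne(Connection conn, List<Integer> ids)
            throws SQLException {
        List<String> titles = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT title FROM blog_posts WHERE id = ?")) {
            for (int id : ids) { // 100 ids => 100 round trips
                ps.setInt(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) titles.add(rs.getString(1));
                }
            }
        }
        return titles;
    }

    // Set-at-a-time: one query; the driver batches rows per round trip.
    static List<String> titlesInOneQuery(Connection conn) throws SQLException {
        List<String> titles = new ArrayList<>();
        try (Statement st = conn.createStatement();
             // FETCH FIRST is standard SQL; some databases spell it LIMIT.
             ResultSet rs = st.executeQuery(
                     "SELECT title FROM blog_posts ORDER BY posted_at DESC"
                             + " FETCH FIRST 100 ROWS ONLY")) {
            while (rs.next()) titles.add(rs.getString(1));
        }
        return titles;
    }
}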

Best practice for inserting and querying data from memory

We have an application that takes real-time data and inserts it into a database. It is online for 4.5 hours a day. We insert data second by second into 17 tables. The user at any time may query any table for the latest second's data and some records in the history...
Handling the feed and insertion is done using a C# console application...
Handling user requests is done through a WCF service...
We figured out that insertion is our bottleneck; most of the time is taken there. We invested a lot of time trying to fine-tune the tables and indices, yet the results were not satisfactory.
Assuming that we have sufficient memory, what is the best practice for inserting data into memory instead of the database? Currently we are using DataTables that are updated and inserted into every second.
A colleague of ours suggested another WCF service instead of a database between the feed-handler and the WCF user-requests-handler. The WCF mid-layer is supposed to be TCP-based and it keeps the data in its own memory. One may say that the feed handler could deal with user requests instead of having a middle layer between the two processes, but we want to separate things, so if the feed-handler crashes we still want to be able to provide the user with the current records.
We are limited in time, and we want to move everything to memory in a short period. Is having a WCF service in the middle of two processes a bad thing to do? I know that the requests add some overhead, but all three processes (feed-handler, in-memory database (WCF), user-request-handler (WCF)) are going to be on the same machine and bandwidth will not be that much of an issue.
Please assist!
I would look into creating a cache of the data (such that you can also reduce database selects), and invalidate data in the cache once it has been written to the database. This way, you can batch up calls to do a larger insert instead of many smaller ones, but keep the data in-memory such that the readers can read it. Actually, if you know when the data goes stale, you can avoid reading the database entirely and use it just as a backing store - this way, database performance will only affect how large your cache gets.
Invalidating data in the cache will be based on either whether it has been written to the database or whether it has gone stale, whichever comes last, not first.
The cache layer doesn't need to be complicated, however it should be multi-threaded to host the data and also save it in the background. This layer would sit just behind the WCF service, the connection medium, and the WCF service should be improved to contain the logic of the console app + the batching idea. Then the console app can just connect to WCF and throw results at it.
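The question's stack is C#, but the batching idea is language-agnostic; here is a minimal write-behind sketch in Java (the table, the one-second flush interval, and the error handling are illustrative; a real cache would also keep a readable copy for the query side, omitted here):

import java.sql.*;
import java.util.concurrent.*;

public class WriteBehindCache {
    // Incoming ticks are buffered in memory...
    private final ConcurrentLinkedQueue<double[]> pending = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    WriteBehindCache(Connection conn) {
        // ...while a background thread batches them to the database once per
        // second, turning many tiny INSERTs into one round trip.
        flusher.scheduleAtFixedRate(() -> flush(conn), 1, 1, TimeUnit.SECONDS);
    }

    void add(double price, double volume) {
        pending.add(new double[] { price, volume });
    }

    private void flush(Connection conn) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO ticks (price, volume) VALUES (?, ?)")) {
            double[] tick;
            while ((tick = pending.poll()) != null) {
                ps.setDouble(1, tick[0]);
                ps.setDouble(2, tick[1]);
                ps.addBatch();
            }
            ps.executeBatch(); // one batch per second instead of one per row
        } catch (SQLException e) {
            e.printStackTrace(); // real code would retry or re-queue the batch
        }
    }
}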
Update: the only other thing to say is invest in a profiler to see if you are introducing any performance issues in code that are being masked. Also, profile your database. You mention you need fast inserts and selects - unfortunately, they usually trade-off against each other...
What kind of database are you using? MySQL has a MEMORY storage engine which would seem to be suited to this sort of thing.
Are you using DataTable with DataAdapter? If so, I would recommend that you drop them completely. Insert your records directly using DBCommand. When users request reports, read data using DataReader, or populate DataTable objects using DataTable.Load (IDataReader).
Storing data in memory has the risk of losing data in case of crashes or power failures.

Low MySQL Table Cache Hit Rate

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything correct except for the table cache hit rate: no matter how high I raise it in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this so I don't really know what to look for in my queries/database structure to fix this.
The table cache defines the number of simultaneous file descriptors that MySQL has open. So the table cache hit rate will be affected by how many tables you have relative to your limit, as well as how frequently you re-reference tables (keeping in mind that it is counting not just a single connection, but simultaneous connections).
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to 100% hit rate unless you run FLUSH TABLES a lot ( as long as your table_cache is set higher than the number of typically simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
Sorry if the definition seemed a bit obnoxious. A specific cache often covers a lot of entities, and these are database-specific; you need to find out what is cached by the table cache first.
--edit: some investigation --
OK, it seems (from the reply to this post) that MySQL uses the table cache for the data structures used to represent a table. The data structures also (via encapsulation, or by having duplicate table entries for each table) represent a set of file descriptors open for the data files on the file system. The MyISAM engine uses one for the table and one for each index; additionally, each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file IO, it represents the low-level context of a particular file read or write.
I think you are either interpreting the values incorrectly or they need to be interpreted differently in this context. 284 is the number of active tables at the instant you took the snapshot, and the second value represents the number of times a table has been acquired since you started MySQL.
I would hazard a guess that you need to take multiple snapshots of this reading and see if the first value (active fd's at that instant) ever exceeds your cache size capacity.
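For instance, you can sample the counters yourself with MySQL's standard status variables (a sketch over JDBC; the connection details are placeholders):

import java.sql.*;

public class TableCacheSampler {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mysql", "root", "password");
        try (Statement st = conn.createStatement()) {
            for (int i = 0; i < 10; i++) { // several snapshots, as suggested above
                try (ResultSet rs = st.executeQuery(
                        "SHOW GLOBAL STATUS LIKE 'Open%tables'")) {
                    // Open_tables   = tables open right now (the "284 open")
                    // Opened_tables = cumulative opens since startup ("79k opened")
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " = " + rs.getString(2));
                    }
                }
                Thread.sleep(5000);
            }
        }
    }
}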
P.S. The kernel generally has an upper limit on the number of file descriptors it will allow each process to open -- so you might need to tune this if it is too low.