I was studying cursors and read somewhere that each time you fetch a row from a cursor, it results in a network round trip, whereas a normal SELECT query makes only one round trip however large the result set is.
Can anyone explain what that means? What do "network round trip" and "one round trip" mean in detail, with an example? And when should we use a cursor, and when a while loop?
Unfortunately, that reference is incorrect.
A "normal SELECT" creates a cursor that the client fetches from. The mechanics are exactly the same as if you open and return a SYS_REFCURSOR (or use any other mechanism for opening a cursor). In both cases, the client fetches a number of rows over the network every time it requests data from the database. The client can control the number of rows that are fetched each time; it would be exceptionally rare for the client to fetch 1 row at a time, or to fetch all the rows from a cursor in a single network round trip.
What actually happens when a client application fetches from a cursor (no matter how the cursor is opened) is that the client application sends a request over the network for N rows (again, the client controls the value of N). The database sends the next N rows back to the client (generally, Oracle has to continue executing the query in order to determine the next N rows, because Oracle does not generally materialize an entire result set). The client application does something with those N rows and then sends another request over the network for the next N rows, and the pattern repeats.
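The request-N-rows loop described above can be sketched as follows. This is only an illustration, not Oracle's client protocol: it uses Python's standard-library sqlite3 module, which runs in-process, so each `fetchmany()` call merely stands in for one network round trip. The `posts` table and the `fetch_in_batches` helper are hypothetical names.

```python
import sqlite3

def fetch_in_batches(conn, query, batch_size):
    """Yield lists of up to batch_size rows.

    Each fetchmany() call stands in for one client request
    for the next N rows (assumption: N == batch_size).
    """
    cur = conn.cursor()
    cur.execute(query)
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows

# Hypothetical demo data: 250 rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts (title) VALUES (?)",
                 [(f"post {i}",) for i in range(250)])

batches = list(fetch_in_batches(conn, "SELECT id, title FROM posts ORDER BY id", 100))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [100, 100, 50]
```

With N = 100, the 250 rows arrive in three requests; neither one row per request nor everything in a single request.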
In virtually all database systems, the application that uses the data and the DBMS that is responsible for storing and searching the data live on separate machines. They talk to each other over a network. Even when they are on the same machine, there is still effectively a network connection.
This matters because time passes at each step: between when an application decides that it's ready to read data and when that request arrives over the network at the database server, then while the database server produces the response, and again while the response travels over the network back to the application.
When you query for a whole set of data, you pay this cost only once. Although it may seem wasteful, it's in fact much more efficient to put the burden of holding on to all of the data on the application, because it's usually easier to give more resources to an application than to do the same for a database server.
When your application fetches data only one row at a time, the cost of the round trip between application and database is paid once per row. If you want to show the titles of 100 blog posts, you're paying the cost of 100 round trips to the database for that one report. What's worse is that the database server has to somehow keep track of the partially completed result set. That usually means that resources that could be used for querying data for another request are instead being retained on behalf of an application that hasn't yet fetched all of the data it originally asked for.
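To put rough numbers on the per-row cost, here is a back-of-the-envelope sketch. The 2 ms round-trip time is a made-up figure for illustration, and the model ignores query execution and transfer time.

```python
import math

def round_trips(total_rows, rows_per_fetch):
    """Number of requests needed to pull total_rows, rows_per_fetch at a time."""
    return math.ceil(total_rows / rows_per_fetch)

def latency_cost_ms(total_rows, rows_per_fetch, round_trip_ms):
    """Time spent purely on network round trips (hypothetical latency)."""
    return round_trips(total_rows, rows_per_fetch) * round_trip_ms

# 100 blog post titles, assuming 2 ms per round trip.
one_at_a_time = latency_cost_ms(100, 1, 2)
all_at_once = latency_cost_ms(100, 100, 2)
print(one_at_a_time)  # 200
print(all_at_once)    # 2
```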
The basic rule is to talk to the database only when you absolutely have to, and to make the interaction as short as possible. This means you ask only for the data you really need (have the database do as much filtering as possible, instead of doing it in the application) and accept all of the data as quickly as possible, so that the database can move on to another task.
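As a small illustration of letting the database do the filtering, the sketch below compares the two approaches on a hypothetical `posts` table. It uses the in-process sqlite3 module, so no real network is involved; the point is only how many rows would have to travel to the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, published INTEGER)")
# Hypothetical data: 1000 posts, every tenth one published.
conn.executemany(
    "INSERT INTO posts (title, published) VALUES (?, ?)",
    [(f"post {i}", int(i % 10 == 0)) for i in range(1000)])

# Filtering in the application: every row is transferred first.
all_rows = conn.execute(
    "SELECT title, published FROM posts ORDER BY id").fetchall()
app_filtered = [title for title, published in all_rows if published]

# Filtering in the database: only the matching rows are transferred.
db_filtered = [title for (title,) in conn.execute(
    "SELECT title FROM posts WHERE published = 1 ORDER BY id")]

print(len(all_rows), len(db_filtered))  # 1000 100
```

Both approaches produce the same 100 titles, but the first one moves ten times as many rows to get there.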
Suppose I have a dataset that contains 100B rows and I run a SELECT * SQL query against the table without a limit, and let's suppose the client doesn't impose a limit on top of it either.
While the query is running, the client usually loads the results incrementally into the interface. However, the dataset is much too large to fit onto my local machine. What actually happens when it is "Running query..."? Is the data loaded directly into program memory? Is the data saved to something like a tmp file that is memory mapped (I would think not), or what is the most common way to 'display' the results here? And finally, what would happen once my local memory limit is exceeded: would the program just hang or crash?
I know this is a slightly abstract question, but mainly I'm asking how a SQL result-set is usually 'loaded' in order to display the results to a user in a user interface.
There may not be a "usual" answer. Different applications are likely to take different approaches depending on the trade-offs they want to make.
The simplest approach is for the client to fetch the first N rows (you tagged this for Oracle SQL Developer where the default N is 50). If you then scroll down in the results, the client will fetch the next N rows. The client keeps the results it has already fetched in memory. If you try to fetch more data than the client machine has memory available (and, of course, the client may have been configured to have virtual memory larger than the physical memory available), the application either crashes or generates some sort of error. Note that depending on the specific implementation, the data could be cached either by the ODBC/JDBC/etc. driver or by the actual application code.
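A minimal sketch of that simplest approach is shown below. It is not how SQL Developer is implemented; `LazyResultView` is a hypothetical class, sqlite3 stands in for the real driver, and the page size of 50 simply mirrors the default mentioned above.

```python
import sqlite3

class LazyResultView:
    """Keeps every fetched row in memory and fetches more only on demand."""

    def __init__(self, cursor, page_size=50):
        self._cursor = cursor
        self._page_size = page_size
        self.rows = []          # grows without bound as the user scrolls
        self.exhausted = False
        self.fetch_more()       # show the first page immediately

    def fetch_more(self):
        """Fetch the next page (e.g. when the user scrolls down)."""
        if self.exhausted:
            return 0
        batch = self._cursor.fetchmany(self._page_size)
        if len(batch) < self._page_size:
            self.exhausted = True
        self.rows.extend(batch)
        return len(batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(120)])

view = LazyResultView(conn.execute("SELECT n FROM t ORDER BY n"))
first = len(view.rows)
view.fetch_more()
second = len(view.rows)
view.fetch_more()
print(first, second, len(view.rows), view.exhausted)  # 50 100 120 True
```

Because `rows` is never trimmed, scrolling far enough through a huge result would eventually exhaust memory, which is the failure mode described above.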
If there is some reason for the client to expect that it would be beneficial to display gigabytes worth of data to a human (or if crashing or erroring out is particularly problematic), the client might write results to a file rather than keeping them in memory. That doesn't seem particularly common in a GUI IDE but I don't use a terribly large number of different GUIs.
Other options are possible (but probably not worth implementing in an application that is supposed to provide results to a human who isn't going to scroll through billions of results). Under the covers, the application or driver could cache a key (in Oracle, normally the ROWID) for the previously returned data rather than the entire row and then re-fetch that data if the user tries to scroll back to the top. The application could discard data that you had already fetched and throw an error if you tried to scroll back from row 1 billion to row 1. Or it could silently re-execute the query if you wanted to go back to the first row.
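The key-caching idea can be sketched like this. SQLite's `rowid` is used here only as a stand-in for an Oracle ROWID (they are different mechanisms), and `refetch` is a hypothetical helper. Note that a re-fetched row may have changed, or been deleted, since it was first displayed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [(f"row {i}",) for i in range(10)])

# First pass: remember only the key of each row that was shown,
# not the (potentially large) row itself.
seen_keys = [rowid for (rowid,) in
             conn.execute("SELECT rowid FROM t ORDER BY rowid")]

def refetch(conn, key):
    """Re-read one row by its cached key when the user scrolls back."""
    return conn.execute(
        "SELECT payload FROM t WHERE rowid = ?", (key,)).fetchone()

first_row = refetch(conn, seen_keys[0])
print(first_row)  # ('row 0',)
```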
I need to take a snapshot of a table at a given time, and determine the difference between the snapshot and the current data. What is the most effective way to do that? Can it be done in pure SQL (MS SQL), or should my app server do that in Delphi code?
I'm using an app server that keeps track of these changes and transmits them over a Telnet protocol to any number of clients, whether on the same machine or not.
Because of the text protocol, I have to send only the differences between the tables, because it is impractical to send all the data (~10k records) every time something changes.
The apps involved are Swordfish (an Automatic Trading System, or ATS), not written by me, and the app server (Chef) and the client (Diner), both written by me. The ATS uses MS SQL as a layer for its API, so Chef sends data to and receives data from the MS SQL server, essentially controlling the ATS. The client communicates what it wants done to Chef, and then Chef talks to Swordfish through the DBMS and to the other Diners through Telnet.
Code is the most efficient way to do this, according to all the info that I could find on the web.
It may be possible to know with pure SQL what rows were added, but I could find nothing (in SQL) to detect changes to existing rows or row deletions, both of which I need to know about to keep my app server (which is aware of and synced with the SQL table) and my app clients synchronized.
Keeping an in-memory table of 10-15k records isn't that serious. A different error in my code (to do with TFDQuery) made me think that my "offline" or "in-memory" snapshot of the tables needed A LOT of memory: every SQL add command created its own instance of TFDQuery, requiring 30 MB per record, which leaked when the TFDQuery was destroyed. Now I create the TFDQuery instance once and reuse it for every record added, and my total memory usage stays at ~50 MB, which I have no problem with.
So, every time Service Broker detects a change in the dataset of the SQL table, I save the old dataset to an in-memory table and do 3 comparisons between the two datasets (the saved/old dataset and the current dataset, i.e. the newest version of the SQL table): 1. Scan for additions. 2. Scan for changes. 3. Scan for deletions. DONE :-)
Then it's a simple task of encoding the text for the Telnet protocol, and all my clients, my SQL server, and my app server are happily synced!
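The three scans (additions, changes, deletions) can be sketched as below. This is an illustration in Python rather than the Delphi code the answer refers to, and it assumes each snapshot is held as a dictionary mapping a primary key to the rest of the row; the sample data is made up.

```python
def diff_snapshots(old, new):
    """Compare two snapshots that map primary key -> row tuple.

    Returns (added, changed, deleted):
      added   - rows whose key exists only in the new snapshot
      changed - (old_row, new_row) pairs for keys present in both but differing
      deleted - rows whose key exists only in the old snapshot
    """
    added = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, changed, deleted

# Hypothetical snapshots of a small table.
old = {1: ("AAPL", 100), 2: ("MSFT", 50), 3: ("IBM", 10)}
new = {1: ("AAPL", 120), 2: ("MSFT", 50), 4: ("ORCL", 5)}

added, changed, deleted = diff_snapshots(old, new)
print(added)    # {4: ('ORCL', 5)}
print(changed)  # {1: (('AAPL', 100), ('AAPL', 120))}
print(deleted)  # {3: ('IBM', 10)}
```

Only these three small dictionaries need to be encoded and sent to clients, rather than the full ~10k records.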
I'm creating a RESTful web service (in Golang) which pulls a set of rows from the database and returns it to a client (smartphone app or web application). The service needs to be able to provide paging. The only problem is that this data is sorted on a regularly changing "computed" column (for example, the number of "thumbs up" or "thumbs down" a piece of content on a website has), so rows can jump around page numbers between a client's requests.
I've looked at a few PostgreSQL features that I could potentially use to help me solve this problem, but nothing really seems to be a very good solution.
Materialized Views: to hold "stale" data which is only updated every once in a while. This doesn't really solve the problem, as the data would still jump around if the user happens to be paging through the data when the Materialized View is updated.
Cursors: created for each client session and held between requests. This seems like it would be a nightmare if there are a lot of concurrent sessions at once (which there will be).
Does anybody have any suggestions on how to handle this, either on the client side or database side? Is there anything I can really do, or is an issue such as this normally just remedied by the clients consuming the data?
Edit: I should mention that the smartphone app allows users to view more pieces of data through "infinite scrolling", so it keeps track of its own list of data client-side.
This is a problem without a perfectly satisfactory solution because you're trying to combine essentially incompatible requirements:
Send only the required amount of data to the client on-demand, i.e. you can't download the whole dataset then paginate it client-side.
Minimise amount of per-client state that the server must keep track of, for scalability with large numbers of clients.
Maintain different state for each client
This is a "pick any two" kind of situation. You have to compromise; accept that you can't keep each client's pagination state exactly right, accept that you have to download a big data set to the client, or accept that you have to use a huge amount of server resources to maintain client state.
There are variations within those that mix the various compromises, but that's what it all boils down to.
For example, some people will send the client some extra data, enough to satisfy most client requirements. If the client exceeds that, then it gets broken pagination.
Some systems will cache client state for a short period (with short-lived unlogged tables, tempfiles, or whatever), but expire it quickly, so if the client isn't constantly asking for fresh data it gets broken pagination.
Etc.
See also:
How to provide an API client with 1,000,000 database results?
Using "Cursors" for paging in PostgreSQL
Iterate over large external postgres db, manipulate rows, write output to rails postgres db
offset/limit performance optimization
If PostgreSQL count(*) is always slow how to paginate complex queries?
How to return sample row from database one by one
I'd probably implement a hybrid solution of some form, like:
Using a cursor, read and immediately send the first part of the data to the client.
Immediately fetch enough extra data from the cursor to satisfy 99% of clients' requirements. Store it to a fast, unsafe cache like memcached, Redis, BigMemory, EHCache, whatever under a key that'll let me retrieve it for later requests by the same client. Then close the cursor to free the DB resources.
Expire the cache on a least-recently-used basis, so if the client doesn't keep reading fast enough they have to go get a fresh set of data from the DB, and the pagination changes.
If the client wants more results than the vast majority of its peers, pagination will change at some point as you switch to reading direct from the DB rather than the cache or generate a new bigger cached dataset.
That way most clients won't notice pagination issues and you don't have to send vast amounts of data to most clients, but you won't melt your DB server. However, you need a big boofy cache to get away with this. Whether it's practical depends on whether your clients can cope with pagination breaking: if it's simply not acceptable to break pagination, then you're stuck with doing it DB-side with cursors, temp tables, copying the whole result set at first request, etc. It also depends on the data set size and how much data each client usually requires.
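A rough sketch of the hybrid approach follows, written in Python rather than Go. Everything here is an assumption for illustration: `PageCache` is a tiny in-process LRU cache standing in for memcached or Redis, sqlite3 stands in for PostgreSQL, and the `content` table, page size, and prefetch depth are made up.

```python
import sqlite3
from collections import OrderedDict

class PageCache:
    """Tiny in-process LRU cache standing in for memcached/Redis."""

    def __init__(self, max_clients):
        self._max = max_clients
        self._data = OrderedDict()

    def put(self, client_id, rows):
        self._data[client_id] = rows
        self._data.move_to_end(client_id)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, client_id):
        rows = self._data.get(client_id)
        if rows is not None:
            self._data.move_to_end(client_id)
        return rows

ORDERED = "SELECT id, score FROM content ORDER BY score DESC, id"

def get_page(conn, cache, client_id, page, page_size=10, prefetch_pages=5):
    rows = cache.get(client_id)
    if rows is None:
        # Cache miss: snapshot enough rows for most clients, then
        # close the cursor to free database resources.
        cur = conn.execute(ORDERED)
        rows = cur.fetchmany(page_size * prefetch_pages)
        cur.close()
        cache.put(client_id, rows)
    start = page * page_size
    if start >= len(rows):
        # Beyond the cached snapshot: read live data, so pagination may shift.
        return conn.execute(ORDERED + " LIMIT ? OFFSET ?",
                            (page_size, start)).fetchall()
    return rows[start:start + page_size]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO content VALUES (?, ?)", [(i, i) for i in range(100)])

cache = PageCache(max_clients=1000)
page0 = get_page(conn, cache, "client-a", 0)
conn.execute("UPDATE content SET score = 1000 WHERE id = 0")  # a row jumps to the top
page1 = get_page(conn, cache, "client-a", 1)
print([r[0] for r in page0][:3], [r[0] for r in page1][:3])  # [99, 98, 97] [89, 88, 87]
```

Page 1 continues where page 0 left off even though a row jumped to the top in between, because both pages come from the same cached snapshot. A client that pages past the prefetched rows, or whose entry was evicted, sees the live ordering instead.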
I am not aware of a perfect solution for this problem. But if you want the user to have a stale view of the data, then a cursor is the way to go. The only tuning you can do is to store only the data for the first 2 pages in the cursor; beyond that, you fetch it again.
We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from making these requests from the web server to the database server. With as little data as this, you will not find performance to be an issue, and even if it becomes one as everything grows, there is a lot to be gained on the database side (indexes etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment, so again, a direct query would be best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Some examples, and why you need not worry about them:
The connection to the database server is down.
This is an issue for both methods, but with only one connection per day, a 1-in-10000 chance of failure might seem to favor the once-a-day method. But these issues should not come up very often, and if they do, you should be able to handle them (retry the request, show a message to the user). This is what an enormous number of websites do, so trust me when I say that this will not be an issue. Also, think of what it would mean if your daily update failed. That would present a bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to solve at a different level: use a caching system (non-CSV) on the database server, use a caching system on the web server, or fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is otherwise inaccessible.
Your second database (even on your web server) gets all updates from the master and is read-only; you can only query it for data.
Your web server cannot connect to the data warehouse, but can connect to your slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now there is no single moment at which you update the queried database (the master-slave replication keeps it up to date continuously), and no chance that queries from the web server put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only form, just ensure that the only thing the field accepts is a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file containing just the connection function should do fine in most cases, so that a user can't, say, open your webpage in an HTML viewer and see your database connection string.
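One way to make sure the field really is a date is sketched below. It goes slightly beyond the paragraph above by also binding all values as query parameters, which is the usual defence. The `prices` table, its column names, and the `prices_between` helper are hypothetical, and sqlite3 stands in for the warehouse database.

```python
import sqlite3
from datetime import date

def prices_between(conn, start_text, end_text, products):
    """Return rows for the given products within an inclusive date range."""
    # fromisoformat raises ValueError if the text is not an ISO date.
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    # One "?" per product; the values themselves are bound, never interpolated.
    placeholders = ", ".join("?" for _ in products)
    sql = ("SELECT price_date, product, price FROM prices "
           "WHERE price_date BETWEEN ? AND ? "
           f"AND product IN ({placeholders}) "
           "ORDER BY price_date, product")
    params = [start.isoformat(), end.isoformat(), *products]
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (price_date TEXT, product TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", [
    ("2024-01-02", "A", 1.0), ("2024-01-02", "B", 2.0),
    ("2024-01-03", "A", 1.5), ("2024-01-03", "B", 2.5),
])

rows = prices_between(conn, "2024-01-03", "2024-01-03", ["A", "B"])
print(rows)  # [('2024-01-03', 'A', 1.5), ('2024-01-03', 'B', 2.5)]
```

A malicious value such as `"2024-01-03'; DROP TABLE prices; --"` fails date parsing and never reaches the database.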
As for the CSV, I would have to say that querying the database per user request, especially if the feature is only used ~10 times weekly, would be much more efficient than the CSV. I regard the CSV as overkill because, again, you only have ~10 users attempting to get some information; exporting an updated CSV every day would be too much work for so little payoff.
EDIT:
Also, if an attack is a big concern, which really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this given your question as currently stated, but even with the best security an attack could happen. That mainly depends on whether attackers want the information you have.
We use SQL Server and have a WinForms application. In our product, the number of records sometimes exceeds 50000 in a single transaction, and we face performance issues there.
When we have a huge amount of data, we generally handle it in multiple database calls. In one of our import functions, we update the server in batches of 1000 rows. So if we have 5000 records, then while processing them (in a for loop) we update the first 1000 rows and then continue processing until we have the next 1000 rows to update. This performs better, but honestly I don't feel it is the best we can do in terms of performance.
But we have seen in other import/export functions that updating the database every 5000 rows gives good results compared to 1000. So we are facing a lot of confusion, and the code is not consistent across our applications.
Can anyone give me an idea of what makes this happen? You don't have sample data, a database schema, etc., and yes, I do agree. But are there any scenarios that should be considered when working with a database? And why do different batch sizes give us good results; is there something we are ignoring? I am not a database expert and am more of a .NET programmer. I will be happy to hear your suggestions.
Not sure if this is helpful, but our data generally contains employee details like payroll information, personal details, accrual benefits, compensation, etc. Data is fed from an Excel file, and we also generate a lot of data in our internal processes. Let me know if you need more information. Thanks!!
The more database calls you make, the more connection management you need (open the connection, use it, clean up and close it, whether you are using connection pooling, etc.). You're sending the same amount of data over the wire, but you are opening and closing the taps more often, which brings overhead.
The downside of larger batches is that the amount of data held in a transaction is greater.
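The batch-size trade-off can be made concrete with a small sketch. It is written in Python against sqlite3 rather than .NET against SQL Server, so it says nothing about actual timings; it only shows how the batch size determines the number of database calls and the size of each transaction. The `employees` table and helper names are hypothetical.

```python
import sqlite3

def chunks(rows, size):
    """Split a list of rows into consecutive batches of at most `size`."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def import_in_batches(conn, rows, batch_size):
    """Insert rows batch by batch; return the number of database calls made."""
    calls = 0
    for batch in chunks(rows, batch_size):
        # One transaction per batch: commit on success, roll back on error.
        with conn:
            conn.executemany(
                "INSERT INTO employees (id, name) VALUES (?, ?)", batch)
        calls += 1
    return calls

def fresh_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

rows = [(i, f"employee {i}") for i in range(5000)]
calls_1000 = import_in_batches(fresh_db(), rows, 1000)
calls_5000 = import_in_batches(fresh_db(), rows, 5000)
print(calls_1000, calls_5000)  # 5 1
```

Fewer, larger batches mean fewer calls but more rows held per transaction, so the best batch size is something to measure for each workload rather than a fixed number.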
However, if I may make a suggestion, you might want to consider achieving this in a different way: load all the data into the database as fast as possible (into interim tables where the constraints are deactivated and with transaction management turned off, if possible) and then let the database carry out the task of checking and validating the data.
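A minimal sketch of the load-then-validate idea follows. It uses sqlite3 instead of SQL Server, and the staging table, validation rules, and sample rows are all hypothetical. The shape is what matters: a constraint-free staging table is loaded first, and a single set-based statement then moves only the valid rows into the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: everything is text, no constraints.
    CREATE TABLE staging_employees (id TEXT, name TEXT, salary TEXT);
    -- Target table: typed and constrained.
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        salary REAL NOT NULL
    );
""")

# Hypothetical raw import data, e.g. as read from a spreadsheet.
raw = [
    ("1", "Ann", "5000"),
    ("2", "", "4200"),             # invalid: empty name
    ("3", "Raj", "not a number"),  # invalid: salary is not numeric
    ("4", "Li", "6100"),
]

with conn:
    # Step 1: bulk load without any checks.
    conn.executemany("INSERT INTO staging_employees VALUES (?, ?, ?)", raw)
    # Step 2: let the database validate and convert in one statement.
    conn.execute("""
        INSERT INTO employees (id, name, salary)
        SELECT CAST(id AS INTEGER), name, CAST(salary AS REAL)
        FROM staging_employees
        WHERE name <> ''
          AND salary GLOB '[0-9]*'
          AND salary NOT GLOB '*[^0-9.]*'
    """)

loaded = conn.execute(
    "SELECT id, name, salary FROM employees ORDER BY id").fetchall()
print(loaded)  # [(1, 'Ann', 5000.0), (4, 'Li', 6100.0)]
```

The rejected rows remain in the staging table, where they can be reported back to the user.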
Since you are using SQL Server, you can just turn on SQL Profiler, define an appropriate event filter, and watch what happens under different loads.