We have a Spring application. We regularly have to execute several SQL queries against a view exposed to us by the client.
In one scenario our queries work fine, but a count(*) over the same queries fails. It returns
org.springframework.dao.RecoverableDataAccessException - StatementCallback;
IO Error: Socket read timed out; nested exception is java.sql.SQLRecoverableException: IO Error: Socket read timed out]
We asked the client to increase the oracle.jdbc.ReadTimeout property.
They have instead offered to expose a materialized view.
Can a materialized view help in situations like these (where count queries lead to timeouts)?
How can materialized views be leveraged to improve query performance?
A materialized view is a great solution to your problem. Materialized views store the results of queries in a table, and can significantly improve performance. Your client seems to be doing you a huge favor, as they will be responsible for maintaining the objects that support the query.
The only potential downside depends on how they implement the materialized view. If they create a fast-refresh materialized view, it will automatically store the correct result after every change to the data. But there are many limitations to fast-refresh materialized views, and most likely your client will provide a complete refresh materialized view, which must have a schedule. If they provide a complete refresh materialized view, make sure the application can work with old data.
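To make the stale-data point concrete, here is a minimal sketch of what a complete-refresh materialized view does. It is an emulation only: SQLite has no materialized views, so a plain table plays that role, and the names orders, orders_count_mv, complete_refresh and cached_count are hypothetical, not anything from the client's schema.

```python
import sqlite3

# Hypothetical source table standing in for the client's view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)", [("open",)] * 3)

# Plain table playing the role of the materialized view: it stores the
# query result instead of recomputing it on every read.
conn.execute("CREATE TABLE orders_count_mv (row_count INTEGER)")


def complete_refresh(db):
    """Rebuild the stored result from scratch, like a scheduled complete refresh."""
    with db:  # commit both statements together
        db.execute("DELETE FROM orders_count_mv")
        db.execute("INSERT INTO orders_count_mv SELECT COUNT(*) FROM orders")


def cached_count(db):
    """Cheap read of the stored result; the base table is not scanned."""
    return db.execute("SELECT row_count FROM orders_count_mv").fetchone()[0]


complete_refresh(conn)
count_after_refresh = cached_count(conn)

# A new row arrives, but the stored result stays stale until the next refresh.
conn.execute("INSERT INTO orders (status) VALUES ('open')")
stale_count = cached_count(conn)

complete_refresh(conn)
fresh_count = cached_count(conn)
```

Between the insert and the second refresh the stored count lags behind the base table, which is exactly the window the application has to tolerate with a scheduled complete refresh.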
(Of course, the database timeout settings may still be inappropriate. There could be a bad profile, a bad sqlnet.ora parameter, a bad setting for resource manager, an ORA-600 bug, etc. You might want to find out the specific reason why your query timed out. Not that I think the client is trying to hide things from you; a terrible DBA would have just said, "tough luck, fix your stupid query". The fact that you're being offered a materialized view is a good sign that they are really trying to solve the problem.)
Related
I'm creating a RESTful web service (in Golang) which pulls a set of rows from the database and returns it to a client (smartphone app or web application). The service needs to be able to provide paging. The only problem is that this data is sorted on a regularly changing "computed" column (for example, the number of "thumbs up" or "thumbs down" a piece of content on a website has), so rows can jump around page numbers between a client's requests.
I've looked at a few PostgreSQL features that I could potentially use to help me solve this problem, but nothing really seems to be a very good solution.
Materialized Views: to hold "stale" data which is only updated every once in a while. This doesn't really solve the problem, as the data would still jump around if the user happens to be paging through the data when the Materialized View is updated.
Cursors: created for each client session and held between requests. This seems like it would be a nightmare if there are a lot of concurrent sessions at once (which there will be).
Does anybody have any suggestions on how to handle this, either on the client side or database side? Is there anything I can really do, or is an issue such as this normally just remedied by the clients consuming the data?
Edit: I should mention that the smartphone app is allowing users to view more pieces of data through "infinite scrolling", so it keeps track of its own list of data client-side.
This is a problem without a perfectly satisfactory solution because you're trying to combine essentially incompatible requirements:
Send only the required amount of data to the client on-demand, i.e. you can't download the whole dataset then paginate it client-side.
Minimise amount of per-client state that the server must keep track of, for scalability with large numbers of clients.
Maintain different state for each client
This is a "pick any two" kind of situation. You have to compromise; accept that you can't keep each client's pagination state exactly right, accept that you have to download a big data set to the client, or accept that you have to use a huge amount of server resources to maintain client state.
There are variations within those that mix the various compromises, but that's what it all boils down to.
For example, some people will send the client some extra data, enough to satisfy most client requirements. If the client exceeds that, then it gets broken pagination.
Some systems will cache client state for a short period (with short-lived unlogged tables, tempfiles, or whatever), but expire it quickly, so if the client isn't constantly asking for fresh data it gets broken pagination.
Etc.
See also:
How to provide an API client with 1,000,000 database results?
Using "Cursors" for paging in PostgreSQL
Iterate over large external postgres db, manipulate rows, write output to rails postgres db
offset/limit performance optimization
If PostgreSQL count(*) is always slow how to paginate complex queries?
How to return sample row from database one by one
I'd probably implement a hybrid solution of some form, like:
Using a cursor, read and immediately send the first part of the data to the client.
Immediately fetch enough extra data from the cursor to satisfy 99% of clients' requirements. Store it to a fast, unsafe cache like memcached, Redis, BigMemory, EHCache, whatever under a key that'll let me retrieve it for later requests by the same client. Then close the cursor to free the DB resources.
Expire the cache on a least-recently-used basis, so if the client doesn't keep reading fast enough they have to go get a fresh set of data from the DB, and the pagination changes.
If the client wants more results than the vast majority of its peers, pagination will change at some point as you switch to reading direct from the DB rather than the cache or generate a new bigger cached dataset.
That way most clients won't notice pagination issues and you don't have to send vast amounts of data to most clients, and you won't melt your DB server either. However, you need a big boofy cache to get away with this. Whether it's practical depends on whether your clients can cope with pagination breaking; if it's simply not acceptable to break pagination, then you're stuck with doing it DB-side with cursors, temp tables, copying the whole result set at first request, etc. It also depends on the data set size and how much data each client usually requires.
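The hybrid approach above can be sketched roughly as follows. This is an in-process illustration under stated assumptions, not a production design: SnapshotCache, first_page, next_page and the fetch_rows callable are hypothetical names, an OrderedDict stands in for memcached/Redis, and a Python list stands in for the database ordering.

```python
from collections import OrderedDict


class SnapshotCache:
    """Small LRU cache holding one pre-fetched result snapshot per client."""

    def __init__(self, max_clients, prefetch_rows):
        self.max_clients = max_clients
        self.prefetch_rows = prefetch_rows
        self._snapshots = OrderedDict()  # client_id -> list of rows

    def first_page(self, client_id, limit, fetch_rows):
        """Read extra rows once, cache them, and return only the first page."""
        rows = fetch_rows(self.prefetch_rows)
        self._snapshots[client_id] = rows
        self._snapshots.move_to_end(client_id)  # mark as most recently used
        while len(self._snapshots) > self.max_clients:
            self._snapshots.popitem(last=False)  # evict least recently used
        return rows[:limit]

    def next_page(self, client_id, offset, limit, fetch_rows):
        """Return (rows, from_cache) for a later page."""
        rows = self._snapshots.get(client_id)
        if rows is not None and offset + limit <= len(rows):
            self._snapshots.move_to_end(client_id)
            return rows[offset:offset + limit], True
        # Snapshot evicted or too small: read live data, so the ordering
        # may differ from what the client saw on earlier pages.
        return fetch_rows(offset + limit)[offset:], False


# Hypothetical data source: a list whose order changes between requests.
ranking = ["a", "b", "c", "d", "e", "f"]


def fetch_rows(n):
    return list(ranking[:n])


cache = SnapshotCache(max_clients=1, prefetch_rows=4)
page1 = cache.first_page("alice", 2, fetch_rows)
ranking.reverse()  # the "computed" sort order changes under the client
page2, page2_cached = cache.next_page("alice", 2, 2, fetch_rows)
page3, page3_cached = cache.next_page("alice", 4, 2, fetch_rows)
```

Here page2 is served from the snapshot and stays consistent with page1, while page3 falls past the prefetched rows, is read live, and repeats rows the client has already seen. That is the "pagination changes at some point" trade-off described above.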
I am not aware of a perfect solution for this problem. But if you want the user to have a stale view of the data, then a cursor is the way to go. The only tuning you can do is to keep only the data for the first two pages in the cursor. Beyond that, you fetch it again.
Is there a way to make an Oracle materialized view refresh itself automatically when there are changes to the tables used in the materialized view? What Refresh Mode and Refresh Method should I use? What options should I choose in SQL Developer?
Thank you in advance
Yes, you can define a Materialized View with ON COMMIT, e.g.:
CREATE MATERIALIZED VIEW sales_mv
BUILD IMMEDIATE
REFRESH FAST ON COMMIT
AS SELECT t.calendar_year, p.prod_id ... FROM ...
In this case the MV is refreshed after every commit, provided the committed transaction changed a master table, of course.
Since the refresh is done after each commit, it is strongly recommended to use FAST REFRESH rather than COMPLETE, which would take too long.
You have several restrictions and pre-conditions in order to use FAST REFRESH, check Oracle documentation: CREATE MATERIALIZED VIEW, FAST Clause for details.
I don't think there's any way to 'automatically' replicate the changes to the m.view right after they are made. But there are ways to use FAST (incremental) refresh on demand; you'd only have to schedule a job for the m.view or an m.view group to do the refresh. You can also use an m.view log to keep track of all the DML and then have it propagated to the m.view with a fast refresh on a remote database through the db link.
If you need the changes to be replicated as soon as they are made, then I recommend using GoldenGate or Streams (if you don't want to license GG). Just beware that Oracle discontinued support for Streams in favor of GoldenGate, so if you have any issues, you're on your own. But anyway, it's a pretty solid replication tool, once you get the hang of it.
I am creating an application that allows users to construct complex SELECT statements. The SQL that is generated cannot be trusted, and is totally arbitrary.
I need a way to execute the untrusted SQL in relative safety. My plan is to create a database user who only has SELECT privileges on the relevant schemas and tables. The untrusted SQL would be executed as that user.
What could possibly go wrong with that? :)
If we assume postgres itself does not have critical vulnerabilities, the user could do a bunch of cross joins and overload the database. That could be mitigated with a session timeout.
I feel like there is a lot more that could go wrong, but I'm having trouble coming up with a list.
EDIT:
Based on the comments/answers so far, I should note that the number of people using this tool at any given time will be very near 0.
SELECT queries cannot change anything in the database. The lack of DBA privileges guarantees that no global settings can be changed. So overload is truly the only concern.
Overload can be the result of overly complex queries or of too many simple queries.
Overly complex queries can be ruled out by setting statement_timeout in postgresql.conf.
Receiving a flood of simple queries can be avoided too. Firstly, you can set a limit on parallel connections per user (ALTER USER ... WITH CONNECTION LIMIT). And if you have some interface program between the user and PostgreSQL, you can additionally (1) add some extra wait after each query completes, and (2) introduce a CAPTCHA to avoid automated DoS attacks.
ADDITION: PostgreSQL's public system functions give many possible attack vectors. They can be called like select pg_advisory_lock(1), and every user has the privilege to call them. So you should restrict access to them. A good option is to create a whitelist of all "callable words" or, more precisely, identifiers that may be followed by (. Then reject all queries that include a call-like construct identifier ( whose identifier is not in the whitelist.
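As a rough illustration of the whitelist idea, here is a minimal sketch of such a check in the interface program. Everything in it is an assumption: ALLOWED_CALLS is a hypothetical whitelist, and the regular expression is not a SQL parser, so quoted identifiers such as "pg_advisory_lock"(1) or a comment placed between the name and the parenthesis would slip past it. A real filter would need a proper parser, plus revoked function privileges as a second line of defence.

```python
import re

# Hypothetical whitelist; a real one would list every function the tool
# needs, plus keywords that may legally precede "(" such as "in" or "exists".
ALLOWED_CALLS = {"count", "sum", "min", "max", "avg", "lower", "upper", "coalesce"}

# An unquoted identifier followed by "(" (optional whitespace in between).
CALL_PATTERN = re.compile(r"([A-Za-z_][A-Za-z0-9_$]*)\s*\(")


def disallowed_calls(sql):
    """Return the call-like identifiers in sql that are not whitelisted."""
    found = {name.lower() for name in CALL_PATTERN.findall(sql)}
    return found - ALLOWED_CALLS


def is_acceptable(sql):
    """Accept the query only if every call-like identifier is whitelisted."""
    return not disallowed_calls(sql)


rejected = disallowed_calls("select pg_advisory_lock(1)")
accepted = is_acceptable("SELECT COUNT(*) FROM t WHERE lower(name) = 'x'")
```

The check is deliberately conservative: anything that merely looks like a call, even inside a string literal, is rejected unless its name is whitelisted.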
Things that come to mind, in addition to having the user SELECT-only and revoking privileges on functions:
Read-only transaction. When a transaction is started by BEGIN READ ONLY, or SET TRANSACTION READ ONLY as its first instruction, it cannot write anything, independently of the user permissions.
At the client side, if you want to restrict it to one SELECT, better use a SQL submission function that does not accept several queries bundled into one. For instance, the swiss-knife PQexec method of the libpq API does accept such queries and so does every driver function that is built on top of it, like PHP's pg_query.
http://sqlfiddle.com/ is a service dedicated to running arbitrary SQL statements, which may be seen as a proof of concept that it's doable without being hacked or DDoS'ed all day long.
The problem with this is that I'm not sure whether the SQL itself will continue to run in the background after a session timeout (I can't really find much evidence either way via Google and haven't had any real experience attempting it myself). If you're limiting users to just SELECT access, I think this is about the worst that could happen, though. The real issue would be what happens if you got a hundred users trying to do complex cross joins. Whether or not a session timeout drops the query, it'll put a really heavy load on the database (it could very easily be enough to pull the database down entirely).
The only way (from my point of view) to protect yourself against DoS on the main server with crafted queries is to set up a read-only replica of the Postgres DB and a special limited user on this replica DB. This way the main Postgres server won't be affected by queries on the replica.
You will also get a hot standby / continuous replication DB in case the main DB fails for some reason.
What is the best way to paginate an FTS query? LIMIT and OFFSET spring to mind. However, I am concerned that by using LIMIT and OFFSET I'd be running the same query over and over (i.e., once for page 1, another time for page 2, etc.).
Will PostgreSQL be smart enough to transparently cache the query result, and thus satisfy the subsequent pagination queries from a cache? If not, how do I paginate efficiently?
edit
The database is for single-user desktop analytics. But I still want to know what the best way would be if this were a live OLTP application. I have addressed the problem in the past with SQL Server by creating an ordered set of document IDs and caching the query parameters against the IDs in a separate table, then clearing the cache every few hours (so as to allow new documents to enter the result set).
Perhaps this approach is viable for postgres. But still I wanna know the mechanics present in the database and how best to leverage them. If I were a DB developer I'd enable the query-response cache to work with the FTS system.
A server-side SQL cursor can be effectively used for this if a client session can be tied to a specific db connection that stays open during the entire session. This is because cursors cannot be shared between different connections. But if it's a desktop app with a unique connection per running instance, that's fine.
The doc for DECLARE CURSOR explains how the resultset is going to be materialized when the cursor is declared WITH HOLD in a committed transaction.
Locking shouldn't be a concern at all. Should the data be modified while the cursor is already materialized, it wouldn't affect the reader nor block the writer.
Other than that, there is no implicit query cache in PostgreSQL. The LIMIT/OFFSET technique implies a new execution of the query for each page, which may be as slow as the initial query depending on the complexity of the execution plan and the effectiveness of the buffer cache and disk cache.
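For reference, the LIMIT/OFFSET technique looks roughly like this. It is a sketch under stated assumptions: SQLite stands in for PostgreSQL so the example is self-contained, the docs table and the fetch_page helper are hypothetical, and a plain ORDER BY id replaces the real full-text query.

```python
import sqlite3

# Hypothetical table standing in for the full-text search result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [(f"doc {i}",) for i in range(1, 8)],
)


def fetch_page(db, page, page_size):
    """Execute the query again for the requested page (pages are 1-based)."""
    offset = (page - 1) * page_size
    cur = db.execute(
        "SELECT id, title FROM docs ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()


page1 = fetch_page(conn, 1, 3)
page3 = fetch_page(conn, 3, 3)
```

Each call to fetch_page is an independent execution of the query; nothing from the first call is reused by the second, which is why the cost per page depends on the plan and on the buffer and disk caches.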
Well, to be honest, what you may want is for your query to return a live cursor that you can then reuse to fetch certain portions of the results that it (the cursor) represents. Now, I don't know if PostgreSQL supports this; MongoDB does, and I've tried going down that road, but it's not cool. For example: do you know how much time will pass between when a query is done and when a second page of results from that query is demanded? Can the cursor stay open for that amount of time? And if it can, what does that mean exactly? Will it block resources, such that if you have many lazy users who start queries but take a long time to navigate through pages, your server might be bogged down by locked cursors?
Honestly, I think redoing a paginated query each time someone asks for a certain page is OK. First of all, you'll be returning a small number of entries (no need to display more than 10-20 entries at a time) and that's gonna be pretty fast. Second, you should rather tune your server so that it executes frequent requests fast (add indexes, put it behind a Solr server if necessary, etc.) than let those queries run slowly and cache them.
Finally, if you really want to speed up full text searches, and have fancy indexes like case insensitive, prefix and suffix enabled, etc, you should take a look at Lucene or better yet Solr (which is Lucene on steroids) as an in-between search and indexing solution between your users and your persistence tier.
I am using Action Filter Attributes for logging user activity on certain actions that interact with a SQL database. Alternatively, I could log the activity in SQL tables using triggers that fire on each change to the tables. I would like to know which of these two methods is the better practice (performance-wise).
I think that the action filter is certainly the cleanest and best-practice approach, since it is in the application layer. Part of the benefit of being there is that it's managed code, and if something breaks you can easily locate the problem. There is also the benefit that all your code is in one spot.
Database triggers are a big no-no in many companies, since they have a habit of causing infinite loops when an unknowing programmer creates some logic that steps on the trigger over and over again, causing the database to fail. Some companies do allow triggers, but only when they are very well documented and very lightly used. Hope this helps.
Performance of logging depends greatly on the system architecture. If you have 3 load balanced web servers hitting one main database, triggers would have to handle all the load while Action Filters would split the load in three. In that scenario, Action Filters would be better.
In terms of best practices, I wouldn't use either of those approaches. I would set up Transactional Replication to another SQL server. This approach would run without impacting performance at all. The transaction log is already being generated and replication would just spin up a separate process that's reading that log.