Any way to monitor Postgresql query changes in realtime using LISTEN & NOTIFY (or NodeJS)? - sql

So I have a custom Postgresql query that retrieves all rows within a specified longitude latitude radius, like so:
SELECT *,
earth_distance(ll_to_earth($1,$2), ll_to_earth(lat, lng)) as distance_metres
FROM RoomsWithUsers
WHERE earth_box(ll_to_earth($1,$2), $3) #> ll_to_earth(lat, lng)
ORDER by distance_metres;
And in my Node server, I want to be notified every time the number of rows in this query changes. I have looked into using a Node library such as pg-live-query, but I would much rather use pg-pubsub, which works with the existing Postgres LISTEN/NOTIFY mechanism, in order to avoid unnecessary overhead. However, as far as I can tell, PostgreSQL TRIGGERs only fire on UPDATE/INSERT/DELETE operations and not on any specific queries themselves. Is there any way to accomplish what I'm trying to do?

You need to set up the right triggers that will call NOTIFY for all the clients that use LISTEN on the same channel.
It is difficult to advise how exactly you implement your NOTIFY logic inside the triggers, because it depends on the following:
How many clients is the message intended for?
How heavy/large is the query that's being evaluated?
Can the triggers know the logic of the query to evaluate it?
Based on the answers you might consider different approaches, which include, but are not limited to, the following options and their combinations:
execute the query/view when the outcome cannot be evaluated, and cache the result
provide smart notification, if the query's outcome can be evaluated
use the payload to pass the update details to the listeners (see the sketch after this list)
schedule query/view re-runs for late execution, if it is heavy
do entire notification as a separate job
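For instance, here is a minimal sketch of the payload option, assuming a hypothetical channel name (rooms_channel) and the RoomsWithUsers table from the question; the trigger sends the changed row as JSON so listeners can decide for themselves whether their query is affected:
-- Hypothetical trigger function: pushes the changed row as a JSON payload.
-- Note: NOTIFY payloads are limited to roughly 8 kB, so for wide rows you may
-- prefer to send only the primary key.
CREATE OR REPLACE FUNCTION notify_rooms_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify(
    'rooms_channel',
    json_build_object('op', TG_OP, 'row', row_to_json(COALESCE(NEW, OLD)))::text
  );
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rooms_change_notify
AFTER INSERT OR UPDATE OR DELETE ON RoomsWithUsers
FOR EACH ROW EXECUTE PROCEDURE notify_rooms_change();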
Certain scenarios can grow quite complex. For example, you may have a master client that can do the change, and multiple slaves that need to be notified. In this case the master executes the query, checks if the result has changed, and then calls a function in the PostgreSQL server to trigger notifications across all slaves.
So again, lots and lots of variations are possible, depending on specific requirements of the task at hand. In your case you do not provide enough details to offer any specific path, but the general guidelines above should help you.

Async & LISTEN/NOTIFY is the right way!
You can add trigger(s) on UPDATE/INSERT, execute your query in the body of the trigger, save the number of rows in a simple table, and call NOTIFY if the value has changed. If you need multiple parameter combinations in the query, you can create/destroy triggers from inside your program.
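A rough sketch of that idea, with hypothetical names (a room_counts helper table keyed by a saved-search id, and the lat/lng/radius baked into the trigger as placeholders); the trigger re-runs the count, stores it, and only notifies when it changes:
-- Hypothetical helper table holding the last known row count per saved search.
CREATE TABLE room_counts (
  search_id text PRIMARY KEY,
  row_count integer NOT NULL
);

CREATE OR REPLACE FUNCTION check_room_count() RETURNS trigger AS $$
DECLARE
  new_count integer;
  old_count integer;
BEGIN
  -- Re-run a simplified version of the radius query; lat/lng/radius are placeholders.
  SELECT count(*) INTO new_count
  FROM RoomsWithUsers
  WHERE earth_box(ll_to_earth(51.5, -0.1), 5000) @> ll_to_earth(lat, lng);

  SELECT row_count INTO old_count FROM room_counts WHERE search_id = 'london_5km';

  IF new_count IS DISTINCT FROM old_count THEN
    INSERT INTO room_counts (search_id, row_count)
    VALUES ('london_5km', new_count)
    ON CONFLICT (search_id) DO UPDATE SET row_count = EXCLUDED.row_count;
    PERFORM pg_notify('rooms_channel', new_count::text);
  END IF;

  RETURN NULL;  -- return value is ignored for AFTER ... FOR EACH STATEMENT triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rooms_count_check
AFTER INSERT OR UPDATE OR DELETE ON RoomsWithUsers
FOR EACH STATEMENT EXECUTE PROCEDURE check_room_count();
On the Node side, pg-pubsub (or a plain pg client issuing LISTEN rooms_channel) would then receive the new count as the notification payload.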

Related

detect cartesian product or other non sensible queries

I'm working on a product which gives users a lot of "flexibility" to create SQL, i.e. they can easily set up queries that can bring the system to its knees with over-inclusive WHERE clauses.
I would like to be able to warn users when this is potentially the case and I'm wondering if there is any known strategy for intelligently analysing queries which can be employed to this end?
I feel your pain. I've been tasked with something similar in the past. It's a constant struggle between users demanding all of the features and functionality of SQL while also complaining that it's too complicated, doesn't help them, and doesn't prevent them from doing stupid stuff.
Adding paging into the query won't stop bad queries from being executed, but it will reduce the damage. If you only show the first 50 records returned from SELECT * FROM UNIVERSE and provide the ability to page to the next 50 and so on and so forth, you can avoid out of memory issues and reduce the performance hit.
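As a rough illustration of the paging idea, using the standard OFFSET/FETCH syntax (SQL Server 2012+, Oracle 12c+, PostgreSQL) and hypothetical table/column names:
-- First page: rows 1-50 of the user's query.
SELECT *
FROM Universe
ORDER BY Id   -- a deterministic ORDER BY is needed for stable paging
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;

-- Second page: rows 51-100.
SELECT *
FROM Universe
ORDER BY Id
OFFSET 50 ROWS FETCH NEXT 50 ROWS ONLY;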
I don't know if it's appropriate for your data/business domain; but I forcefully add table joins when the user doesn't supply them. If the query contains TABLE A and TABLE B, A.ID needs to equal B.ID; I add it.
If you don't mind writing code that is specific to a database, I know you can get data about a query from the database (Explain Plan in Oracle - http://www.adp-gmbh.ch/ora/explainplan.html). You can execute the plan on their query first, and use the results of that to prompt or warn the user. But the details will vary depending on which DB you are working with.
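In Oracle, for example, the check might look roughly like this (the user query and the warning logic are hypothetical; the plan_table columns shown are standard):
-- Populate the plan table for the user's query without executing it.
EXPLAIN PLAN FOR
SELECT * FROM orders o, customers c;   -- hypothetical user query with a missing join

-- Inspect the plan: a MERGE JOIN CARTESIAN step or a huge estimated cardinality
-- is a good reason to warn the user before running the real query.
SELECT operation, options, cardinality, cost
FROM plan_table
ORDER BY id;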

Howto test SQL statements and ensure their quality?

I am working on some business intelligence reports these days. The data is fetched by ordinary SQL SELECT statements. The statements are getting more and more complex.
The data I want to report is partially business critical. So I would feel better if I could do something to prove the correctness and quality of the SQL statements.
I know there are some ways to do this for application-code. But what can I do to reach these goals at SQL level?
Thanks in advance.
I'm not aware of any SQL-level proof or QA you could do, since you are looking for the intent (semantics) of the query rather than syntactical correctness.
I would write a small test harness that takes the SQL statement and runs it on a known test database, then compare the result with an expected set of reference data (spreadsheet, simple CSV file, etc.).
For bonus points wrap this in a unit test and make it part of your continuous build process.
If you use a spreadsheet or CSV for the reference data it may be possible to walk through it with the business users to capture their requirements ahead of writing the SQL (i.e test-driven development).
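One way to express the comparison in SQL itself, assuming the reference CSV has been loaded into a hypothetical expected_report table: two EXCEPT checks (MINUS in Oracle) that should both come back empty if the statement is correct.
-- Rows the report produces that the reference data does not expect (should be empty):
SELECT customer_id, order_month, total_revenue FROM monthly_revenue_report
EXCEPT
SELECT customer_id, order_month, total_revenue FROM expected_report;

-- Rows the reference data expects that the report fails to produce (should also be empty):
SELECT customer_id, order_month, total_revenue FROM expected_report
EXCEPT
SELECT customer_id, order_month, total_revenue FROM monthly_revenue_report;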
Testing the correctness of the statements would require a detailed description of the logic that generated the report requirement, and then, independently of your SQL, the creation of appropriate test data sets constructed against the requirements, to ensure that the correct data, and only the correct data, is selected for each test case.
Constructing these cases for more and more complex conditions will get very difficult though - reporting is notorious for ever-changing requirements.
You could also consider capturing metrics such as the duration of running each statement. You could either do this at the application level, or by writing into an audit table at the beginning and end of each SQL statement. This is made easier if your statements are encapsulated in stored procedures, and can also be used to monitor who is calling the procedure, at what times of day and where from.
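A sketch of the audit-table idea in SQL Server syntax, wrapped around a hypothetical stored procedure (all names are illustrative):
CREATE TABLE report_audit (
    audit_id    int IDENTITY PRIMARY KEY,
    report_name varchar(100),
    called_by   varchar(100),
    started_at  datetime,
    finished_at datetime
);
GO

CREATE PROCEDURE run_monthly_revenue_report AS
BEGIN
    DECLARE @audit_id int;

    INSERT INTO report_audit (report_name, called_by, started_at)
    VALUES ('monthly_revenue', SUSER_SNAME(), GETDATE());
    SET @audit_id = SCOPE_IDENTITY();

    -- ... the actual report SELECT goes here ...

    UPDATE report_audit SET finished_at = GETDATE() WHERE audit_id = @audit_id;
END;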
Looking forward to reading answers on this one. It's simple to check if the statement works or not, either it runs or doesn't run. You can also check against the requirement, does it return these 14 columns in the specified order.
What is hard to check is whether the result set is the correct answer. When you have tables with millions of rows joined to other tables with millions of rows, you can't physically check everything to know what the results should be. I like the theory of running against a test database with known conditions, but building this, and accounting for the edge cases that might affect the data in production, is something that I think would be hard to tackle.
You can sometimes look at things in such a way as to tell you if things are right. Sometimes I add a small limiting WHERE clause to a report query to get a result set I can manually check (say, only the records for a few days). Then I go into the database tables individually and see if they match what I have. For instance, if I know there were 12 meetings for the client in that time period, is that the result I got? If I got 14, then one of the joins probably needs more limiting data (there are two records and I only want the latest one). If I got 10, I figure out what is eliminating the other two (usually a join that should be a left join, or a WHERE condition), and whether those two should be missing under the business rules I've been given.
Often when building, I return more columns than I actually need so that I can see the other data; when an unexpected value turns up, it may make you realize that you forgot to filter for something that you need to filter for.
I look at the number of results carefully as I go through and add joins and WHERE conditions. Did it go up or down, and if so, is that what I wanted?
If I have a place that is currently returning data that I can expect my users will compare this report to, I will look there. For instance, if they can search on the website for the available speakers and I'm doing an export of speaker data to the client, the totals between the two had better match or be explainable by different business rules.
When doing reporting, the users of the report often have a better idea of what the data should say than the developer. I always ask one of them to look at the QA data and confirm the report is correct. They will often say things like "what happened to project XYZ, it should be on this report". Then you know to look at that particular case.
One other thing you need to test is not just the correctness of the data, but performance. If you only test against a small test database, you may have a query that works, getting you the correct data, but which times out every time you try to run it on prod with the larger data set. So never test just against the limited data set, and if at all possible run load tests as well. You do not want a bad query to take down your prod system and have that be the first indicator that there is a problem.
It is rather hard to test whether a SQL statement is correct. One idea I have is to write a program that inserts semi-random data into the database. Since the program generates the data, the program can also do the calculation internally and produce the expected result. Run the SQL and compare whether the program produces the same result as the SQL.
If the results from the program and the SQL are different, then this test raises a flag that there may be a logic issue in either the SQL, the testing program, or both.
Writing the unit test program to calculate the result can be time consuming, but at least you have the program to do the validation.
Adding some news on this subject: I have found the project Q*cert (https://querycert.github.io/index.html) which, in its own words:
Q*cert is a query compiler: it takes some input query and generates code for execution. It can compile several source query languages, such as (subsets of) SQL and OQL. It can produce code for several target backends, such as Java, JavaScript, Cloudant and Spark.
Using Coq allows to prove properties on these languages, to verify the correctness of the translation from one to another, and to check that optimizations are correct
Also, you should take a look at Cosette (http://cosette.cs.washington.edu/):
Cosette is an automated prover for checking equivalences of SQL queries. It formalizes a substantial fragment of SQL in the Coq Proof Assistant and the Rosette symbolic virtual machine. It returns either a formal proof of equivalence or a counterexample for a pair of given queries.
While Q*cert seems better if you have a formal definition in Coq of what you are trying to build, Cosette would be better if you are trying to rewrite a query while ensuring you get the same result.

SQL Efficiency with Function

So I've got this database that helps organize information for academic conferences, but we need to know sometimes whether an item is "incomplete" - the rules behind what could make something incomplete are a bit complex, so I built them into a scalar function that just returns 1 if the item is complete and 0 otherwise.
The problem I'm running into is that when I call the function on a big table of data, it'll take about 1 minute to return the results. This is causing time-outs on the web site.
I don't think there's much I can do about the function itself. But I was wondering if anybody knows any techniques generally for these kinds of situations? What do you do when you have a big function like that that just has to be run sometimes on everything? Could I actually store the results of the function and then have it refreshed every now and then? Is there a good and efficient way to have it stored, but refresh it if the record is updated? I thought I could do that as a trigger or something, but if somebody ever runs a big update, it'll take forever.
Thanks,
Mike
If the function is deterministic you could add it as a computed column, and then index on it, which might improve your performance.
MSDN documentation.
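A rough sketch of that approach in SQL Server, with hypothetical names; it assumes the rules can be expressed as a schema-bound, deterministic function over the row's own columns (a UDF that does its own data access cannot be persisted or indexed this way):
-- Hypothetical deterministic, schema-bound completeness check over the row's columns.
CREATE FUNCTION dbo.IsItemComplete (@Title nvarchar(200), @SpeakerId int, @RoomId int)
RETURNS bit
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE
        WHEN @Title IS NULL OR @SpeakerId IS NULL OR @RoomId IS NULL THEN 0
        ELSE 1
    END;
END;
GO

-- Persisted computed column plus an index, so incomplete items can be found with a seek.
ALTER TABLE dbo.ConferenceItems
ADD IsComplete AS dbo.IsItemComplete(Title, SpeakerId, RoomId) PERSISTED;

CREATE INDEX IX_ConferenceItems_IsComplete ON dbo.ConferenceItems (IsComplete);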
The problem is that the function looks at an individual record and has logic such as "if this column is null" or "if that column is greater than 0". This logic is basically a black box to the query optimizer. There might be indexes on those fields it could use, but it has no way to know about them. It has to run this logic on every available record, rather than using the criteria to pare down the result set. In database parlance, we would say that the UDF is not sargable.
So what you want is some way to build your logic for incomplete conferences into a structure that the query optimizer can take better advantage of: match conditions to indexes and so forth. Off the top of my head, your options to do this include a view or a computed column.
Scalar UDFs in SQL Server perform very poorly at the moment. I only use them as a carefully planned last resort. There are probably ways to solve your problem using other techniques (even deeply nested views or inline TVF which build up all the rules and are re-joined) but it's hard to tell without seeing the requirements.
If your function is that inefficient, you'll have to deal with either out of date data, or slow results.
It sounds like you care more about performance, so like #cmsjr said, add the data to the table.
Also, create a cron job to refresh the results periodically. Perhaps add an updated column to your database table, and then the cron job only has to re-process those rows.
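A sketch of that incremental refresh, with hypothetical column names: the table keeps both the cached result and the time it was computed, and the scheduled job only re-evaluates rows modified since then.
-- UpdatedAt is assumed to be set by the application (or a trigger) whenever a row changes;
-- dbo.fnItemIsComplete stands in for your existing expensive scalar function.
UPDATE ConferenceItems
SET IsComplete            = dbo.fnItemIsComplete(ItemId),
    CompletenessCheckedAt = GETDATE()
WHERE CompletenessCheckedAt IS NULL
   OR UpdatedAt > CompletenessCheckedAt;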
One more thing: how complex is the function? Could you reduce its run time by pulling it out of SQL, perhaps writing it in a layer above the database layer?
I've encountered cases, in SQL Server 2000 at least, where a function performs terribly and just breaking that logic out and putting it into the query speeds things up tremendously. This is an edge case, but if you think the function is fine then you could try that. Otherwise I'd look to compute the column and store it, as others are suggesting.
Don't be so sure that you can't tune your function.
Typically, with a 'completeness' check, your worst time is when the record's actually complete. For everything else, you can abort early, so either test the cases that are fastest to compute first, or those that are most likely to cause the record to be flagged incomplete.
For bulk updates, you either have to just sit and wait, or come up with a system where you can run a less complete but faster check first, and then a more thorough check in the background.
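As a sketch of the early-abort idea (hypothetical T-SQL, with the checks ordered cheapest-first so most rows can be rejected before the expensive lookups run):
CREATE FUNCTION dbo.IsItemCompleteFast (@ItemId int)
RETURNS bit
AS
BEGIN
    -- Cheapest checks first: plain column tests on the item row itself.
    IF EXISTS (SELECT 1 FROM dbo.ConferenceItems
               WHERE ItemId = @ItemId AND (Title IS NULL OR RoomId IS NULL))
        RETURN 0;

    -- Next cheapest: does the item have at least one speaker?
    IF NOT EXISTS (SELECT 1 FROM dbo.ItemSpeakers WHERE ItemId = @ItemId)
        RETURN 0;

    -- The expensive cross-table validation only runs for rows that survived the above.
    IF NOT EXISTS (SELECT 1 FROM dbo.ItemSchedule s
                   JOIN dbo.Rooms r ON r.RoomId = s.RoomId
                   WHERE s.ItemId = @ItemId AND r.Capacity >= s.ExpectedAttendance)
        RETURN 0;

    RETURN 1;
END;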
As Cade Roux says, scalar functions are evil: they are interpreted for each row and as a result are a big problem where performance is concerned. If possible, use a table-valued function or a computed column.
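For reference, an inline table-valued function version of the same idea (hypothetical names); because it is a single SELECT, SQL Server can expand it into the calling query and use indexes, unlike a scalar UDF:
CREATE FUNCTION dbo.ItemCompleteness (@ItemId int)
RETURNS TABLE
AS
RETURN
    SELECT CASE
               WHEN i.Title IS NULL OR i.RoomId IS NULL THEN 0
               WHEN NOT EXISTS (SELECT 1 FROM dbo.ItemSpeakers s
                                WHERE s.ItemId = i.ItemId) THEN 0
               ELSE 1
           END AS IsComplete
    FROM dbo.ConferenceItems i
    WHERE i.ItemId = @ItemId;
It would then be used with CROSS APPLY, e.g. SELECT i.*, c.IsComplete FROM dbo.ConferenceItems i CROSS APPLY dbo.ItemCompleteness(i.ItemId) c;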

Sorting based on calculation with nhibernate - best practice

I need to do paging with the sort order based on a calculation. The calculation is similar to something like reddit's hotness algorithm in that it's dependent on time - time since post creation.
I'm wondering what the best practice for this would be. Whether to have this sort as a SQL function, or to run an update once an hour to calculate the whole table.
The table has hundreds of thousands of rows, and I'm using NHibernate, so this could cause problems for the scheduled full calculation.
Any advice?
It most likely will depend a lot on the load on your server. A few assumptions for my answer:
Your calculation is most likely not simple, but will take into account a variety of factors, including time elapsed since post
You are expecting at least reasonable growth in your site, meaning new data will be added to your table.
I would suggest your best bet would be to calculate and store your ranking value and, as Nuno G mentioned, retrieve it using an ORDER BY clause. As you note, there are likely to be some implications, two of which would be:
Scheduling Updates
Ensuring access to the table
As far as scheduling goes, you may be able to look at some ways of intelligently recalculating your value. For example, you may be able to identify when a calculation is likely to be altered (for example, if a dependent record is updated you might fire a trigger, adding the ID of your table to a queue for recalculation). You may also do the update in ranges, rather than on the full table.
You will also want to minimise any locking of your table whilst you are recalculating. There are a number of ways to do this, including setting your isolation levels (using MS SQL terminology). If you are really worried you could even perform your calculation externally (e.g. in a temp table) and then simply run an update of the values to your main table.
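A rough sketch of that store-and-recalculate pattern in SQL Server syntax (all table, column and function names are hypothetical); a trigger only queues IDs, and a scheduled job rescores the queued rows in a batch:
-- Queue of posts whose score needs recalculating.
CREATE TABLE HotnessQueue (PostId int PRIMARY KEY);
GO

-- When a dependent record (e.g. a vote) changes, queue the affected post.
CREATE TRIGGER trg_Votes_QueueRescore ON Votes
AFTER INSERT, UPDATE, DELETE
AS
    INSERT INTO HotnessQueue (PostId)
    SELECT PostId FROM inserted
    UNION
    SELECT PostId FROM deleted
    EXCEPT
    SELECT PostId FROM HotnessQueue;   -- simplification: ignores concurrent queue inserts
GO

-- Scheduled job: rescore only the queued rows, then clear the queue.
UPDATE p
SET    p.HotnessScore = dbo.CalculateHotness(p.PostId)   -- hypothetical scoring function
FROM   Posts p
JOIN   HotnessQueue q ON q.PostId = p.PostId;

DELETE FROM HotnessQueue;   -- in practice, delete only the rows you just processed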
As a final note, I would recommend looking into the paging options available to you - if you are talking about thousands of records, make sure that your mechanism determines the page you need on the SQL server, so that you are not returning thousands of rows to your application, as this will slow things down for you.
If you can perform the calculation using SQL, try using Hibernate to load the sorted collection by executing a SQLQuery, where your query includes an 'ORDER BY' expression.
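For example, with the precalculated score stored on the row, the SQL behind that query could look like this (hypothetical table and column names; OFFSET/FETCH needs SQL Server 2012+ or another database that supports it):
-- Page 3 of the feed, 25 posts per page, ordered by the stored hotness score.
SELECT PostId, Title, HotnessScore
FROM Posts
ORDER BY HotnessScore DESC, PostId
OFFSET 50 ROWS FETCH NEXT 25 ROWS ONLY;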

Date ranges in views - is this normal?

I recently started working at a company with an enormous "enterprisey" application. At my last job, I designed the database, but here we have a whole Database Architecture department that I'm not part of.
One of the stranger things in their database is that they have a bunch of views which, instead of having the user provide the date ranges they want to see, join with a (global temporary) table "TMP_PARM_RANG" with a start and end date. Every time the main app starts processing a request, the first thing it does is "DELETE FROM TMP_PARM_RANG;" and then an insert into it.
This seems like a bizarre way of doing things, and not very safe, but everybody else here seems ok with it. Is this normal, or is my uneasiness valid?
Update I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems. Also, there are literally dozens if not hundreds of views that all depend on TMP_PARM_RANG.
Do I understand this correctly?
There is a view like this:
SELECT * FROM some_table, tmp_parm_rang
WHERE some_table.date_column BETWEEN tmp_parm_rang.start_date AND tmp_parm_rang.end_date;
Then in some frontend a user inputs a date range, and the application does the following:
Deletes all existing rows from TMP_PARM_RANG
Inserts a new row into TMP_PARM_RANG with the user's values
Selects all rows from the view
I wonder if the changes to TMP_PARM_RANG are committed or rolled back, and if so when? Is it a temporary table or a normal table? Basically, depending on the answers to these questions, the process may not be safe for multiple users to execute in parallel. One hopes that if this were the case they would have already discovered that and addressed it, but who knows?
Even if it is done in a thread-safe way, making changes to the database for simple query operations doesn't make a lot of sense. These DELETEs and INSERTs are generating redo/undo (or whatever the equivalent is in a non-Oracle database) which is completely unnecessary.
A simple and more normal way of accomplishing the same goal would be to execute this query, binding the user's inputs to the query parameters:
SELECT * FROM some_table WHERE some_table.date_column BETWEEN ? AND ?;
If the database is Oracle, it's possibly a global temporary table; every session sees its own version of the table and inserts/deletes won't affect other users.
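For reference, this is roughly what such a definition looks like in Oracle (column names guessed from the description):
-- Each session sees only its own rows, so one user's DELETE/INSERT never affects another;
-- ON COMMIT PRESERVE ROWS keeps the rows for the lifetime of the session.
CREATE GLOBAL TEMPORARY TABLE tmp_parm_rang (
    start_date DATE,
    end_date   DATE
) ON COMMIT PRESERVE ROWS;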
There must be some business reason for this table. I've seen views with dates hardcoded that were actually a partitioned view, and they were using dates as the partitioning field. I've also seen joining on a table like this when dealing with daylight saving time - imagine a view that returned all activity which occurred during DST. And none of these things would ever delete from and insert into the table... that's just odd.
So either there is a deeper reason for this that needs to be dug out, or it's just something that at the time seemed like a good idea but why it was done that way has been lost as tribal knowledge.
Personally, I'm guessing that it would be a pretty strange occurrence. And from what you are saying, two methods calling the process at the same time could be very interesting.
Typically date ranges are done as filters on a view, and not driven by outside values stored in other tables.
The only justification I could see for this is if there was a multi-step process, that was only executed once at a time and the dates are needed for multiple operations, across multiple stored procedures.
I suppose it would let them support multiple ranges. For example, they can return all dates between 1/1/2008 and 1/1/2009 AND 1/1/2006 and 1/1/2007 to compare 2006 data to 2008 data (sketched below); you couldn't do that with a single pair of bound parameters. Also, I don't know how Oracle does its query plan caching for views, but perhaps it has something to do with that? With the date columns being checked as part of the view, the server could cache a plan that always assumes the dates will be checked.
Just throwing out some guesses here :)
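A sketch of that multi-range usage, assuming the view joins on the range table as shown earlier:
-- Two ranges at once: the view returns rows falling in either year.
DELETE FROM tmp_parm_rang;
INSERT INTO tmp_parm_rang (start_date, end_date) VALUES (DATE '2006-01-01', DATE '2007-01-01');
INSERT INTO tmp_parm_rang (start_date, end_date) VALUES (DATE '2008-01-01', DATE '2009-01-01');

SELECT * FROM some_view;   -- joins some_table to tmp_parm_rang on the BETWEEN condition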
Also, you wrote:
"I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems."
While that may guard against data consistency problems due to concurrency, it hurts when it comes to performance problems due to concurrency.
Do they also add one - in the application - to generate the next unique value for the primary key?
It seems that the concept of shared state eludes these folks, or the reason for the shared state eludes us.
That sounds like a pretty weird algorithm to me. I wonder how it handles concurrency - is it wrapped in a transaction?
Sounds to me like someone just wasn't sure how to write their WHERE clause.
The views are probably used as temp tables. In SQL Server we can use a table variable or a temp table (# / ##) for this purpose. Although creating views is not recommended by experts, I have created lots of them for my SSRS projects because the tables I am working on do not reference one another (no FKs, seriously!). I have to work around deficiencies in the database design; that's why I am using views a lot.
With the global temporary table (GTT) approach that you say is being used here, the method is certainly safe with regard to a multiuser system, so no problem there. If this is Oracle, then I'd want to check that the system is either using an appropriate level of dynamic sampling so that the GTT is joined appropriately, or that a call to DBMS_STATS is made to supply statistics on the GTT.
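For example, in Oracle (hypothetical owner and values), either let dynamic sampling look at the GTT at parse time, or set representative statistics on it explicitly:
-- Option 1: a dynamic sampling hint on the query that joins the GTT.
SELECT /*+ dynamic_sampling(t 2) */ *
FROM some_table s, tmp_parm_rang t
WHERE s.date_column BETWEEN t.start_date AND t.end_date;

-- Option 2: supply fixed statistics for the GTT once, so the optimizer
-- assumes a small range table.
BEGIN
  DBMS_STATS.SET_TABLE_STATS(
    ownname => 'APP_OWNER',      -- hypothetical schema
    tabname => 'TMP_PARM_RANG',
    numrows => 1,
    numblks => 1
  );
END;
/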