PostgreSQL performance monitoring tool - sql

I'm setting up a web application with a PostgreSQL back-end on FreeBSD. I'm looking for database performance optimization tools and techniques.

Database optimization is usually a combination of two things:
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the number of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online?" or "What are the latest posts by this user?") inside the application (if possible) or in an external - more efficient - datastore (memcached, Redis, etc.). If you have information which is very write-heavy (e.g. hit counters) and doesn't need ACID semantics, you can also think about moving it out of the Postgres database into a more efficient data store.
Optimizing query runtime is trickier - this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model, or changing the fundamental approach the application takes when working with the database. See, for example, the "Pagination done the Postgres way" talk by Markus Winand on how to rethink the concept of pagination to make it more efficient for the database.
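As a rough illustration of that idea, here is a hedged sketch of keyset ("seek") pagination versus OFFSET pagination; the posts table and its columns are hypothetical:

-- OFFSET pagination: the database still has to read and discard all the skipped rows.
SELECT id, title, created_at
FROM posts
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 10000;

-- Keyset pagination: remember the last row of the previous page and seek past it.
-- With an index on (created_at, id) only the rows that are actually returned get touched.
SELECT id, title, created_at
FROM posts
WHERE (created_at, id) < ('2013-01-01 00:00:00', 12345)
ORDER BY created_at DESC, id DESC
LIMIT 20;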
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgFouine, which has already been mentioned earlier in this discussion; it has since been superseded by pgBadger, which is written in a friendlier language, is much faster, and is more actively maintained.
Both pgFouine and pgBadger suffer from the fact that they need query logging enabled, which can cause a noticeable performance hit on the database or get you into disk space trouble - on top of the fact that parsing the log with the tool can take quite some time and won't give you up-to-date insight into what is going on in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database - pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality - tracking how often a given "normalized query" (the query string minus all expression literals) has been run and how long it took in total. Because the data is collected while the query actually runs, the overhead is very low; the measurable overhead was less than 5% in synthetic benchmarks.
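For example, a hedged query against pg_stat_statements to find the most expensive normalized queries (the extension must also be listed in shared_preload_libraries; newer Postgres versions name the column total_exec_time instead of total_time):

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query,
       calls,
       total_time         AS total_ms,   -- total_exec_time on newer versions
       total_time / calls AS avg_ms
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;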
Making sense of the data
The list of queries itself is very "dry" from an information perspective. A third extension, pg_statsinfo (along with pg_stats_reporter), tries to address this and offer a nicer representation of the data, but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem I started working on a commercial project which is focused on pg_stat_statements and pg_stat_plans and augments the collected information with lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres monitoring area, I also started compiling a list on the Postgres Wiki which is updated regularly.

pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.

I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember correctly, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1        # -1 is disabled, 0 logs all statements
                                        # and their durations, > 0 logs only
                                        # statements running at least this time.
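For reference, on newer PostgreSQL versions (9.4 and later) the same threshold can be set from SQL instead of editing postgresql.conf; the 500 ms value below is just an arbitrary example:

ALTER SYSTEM SET log_min_duration_statement = 500;  -- log statements slower than 500 ms
SELECT pg_reload_conf();                            -- apply without a restart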
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema simpler. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's one of only two programs I'm willing to run VMware on my Mac for.

Munin is quite simple yet effective for getting trends of how the database is evolving and performing over time. With the standard Munin kit you can, among other things, monitor the size of the database, the number of locks, the number of connections, sequential scans, the size of the transaction log and long-running queries.
It's easy to set up and get started with, and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
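If you do write your own plugin, it usually boils down to running a small query against the statistics views; a hedged sketch of the kind of queries involved ('mydb' is a placeholder):

SELECT count(*) FROM pg_stat_activity;   -- current connections
SELECT pg_database_size('mydb');         -- database size in bytes
SELECT count(*) FROM pg_locks;           -- locks currently held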

Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
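A minimal sketch of that workflow, assuming a hypothetical orders table:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- If the plan shows "Seq Scan on orders", an index may turn it into an index scan:
CREATE INDEX idx_orders_customer_id ON orders (customer_id);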
Other than that, I'm as interested in the answers to this question as you are.

Check out Lightning Admin; it has a GUI for capturing log statements. It's not perfect, but it works great for most needs. http://www.amsoftwaredesign.com

DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.

Related

BigQuery Testing, Debugging, and Design Patterns

We use BigQuery as the main data warehouse in our company.
We have gotten very proficient with SQL syntax and we write multi-page SQL queries with valid syntax to analyze our data.
The main problem we are struggling with is terrible logic mistakes in our queries. For example, it could be that a > should have been a >=, or that a join was treating NULL values the wrong way.
The effect is that we are getting wrong data out of BigQuery.
The logic within our data structure is so complicated ("what again was the definition of Customer Type ABC?") that it's terribly difficult to actually pull out anything usable. We estimate that up to 50% of the analytics we pull out of BigQuery are plain wrong.
Of course this is a problem that significantly hurts our bottom line and leads to wrong business decisions. It has gotten so bad that we are craving a normalized database structure that could at least be comprehended more easily.
My hope is that maybe we are just missing certain design patterns to properly use BigQuery. However, I find zero guidance about this online. The SQL we are using is so complex that I'm starting to think that although the syntax is correct, SQL was not made for this. What we are doing feels like fitting a complex program into a single function, which in turn becomes untestable and a nightmare to work with.
I would appreciate any input and guidance
I can empathize here. I don't think your issue is unique, and there isn't one best practice. I can tell you what we have done to help with these same issues.
We are a small team of analysts, and only have a couple TB of data to crunch daily so your mileage will vary with these tips depending on your situation.
We use DBT - https://www.getdbt.com/. It has a free command line version, or you can pay for DBT Cloud if you aren't confident with command line tools. It will help you go from pages-long SQL queries to smaller, digestible chunks that are easier to maintain.
It helps with 3 main use cases for us.
database normalization/summarization - you can easily write queries, have them depend on each other, and have them scheduled to run at a certain time, while DBT does a lot of the more complex data engineering tasks for you, such as making sure things run in the right order and that no circular references exist. This part of the tool helped us migrate away from pages-long SQL queries to smaller digestible chunks that are useful in multiple applications.
documentation: there is a documentation site built in, so you can document a column and write out the definition of 'customer' easily.
testing: we write loads of tests. We have a single, 100%-accepted answer for certain metrics. Any time we need to reference such a metric in other queries, or transform data to slice that metric by other dimensions, we write a test to make sure the new transformation matches back to the accepted answer (see the sketch below).
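A hedged sketch of such a check as a dbt "singular test" - a SQL file under tests/ that must return zero rows to pass. The model names (monthly_revenue, accepted_monthly_revenue) are hypothetical:

-- tests/monthly_revenue_matches_accepted.sql (hypothetical file name)
-- Any row returned here is a failure: the new model drifted from the accepted numbers.
SELECT n.month,
       n.total_revenue,
       a.total_revenue AS accepted_total
FROM {{ ref('monthly_revenue') }} AS n
JOIN {{ ref('accepted_monthly_revenue') }} AS a
  ON a.month = n.month
WHERE n.total_revenue <> a.total_revenue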
We have explored DBT; unfortunately we didn't have the bandwidth to support it at the company level. As an alternative we use Airflow to build and maintain datasets in BigQuery. We use the BigQuery operators to interface with BQ through Airflow. This helps us in the following ways:
Ability to build custom operators that can help with organizational level bells and whistles (integration with internal systems, data lifecycle management, lineage management etc.)
Ability to break down complex pieces of SQL into smaller manageable blocks that can be reused
Ability to incorporate testing in the process. You can build testing into your pipeline DAG or can build out separate DAGs of tests that can monitor your datasets and send out reports.
Ability to replay and recreate datasets
Ability to easily manage schema changes
I am sure there are other use cases where airflow helps, but these are some of the things that come to mind.

Database approach to use for Dynamic Form Data Collection which is suitable for good Reports and Searching

I am working on a project which involves collecting dynamic form data. These forms are user-defined (think SurveyMonkey) and thus a fixed schema cannot be defined for them. Data in the form of questions/answers would be retrieved for these forms and then stored in the database. Reporting/searching on these answers (filtering and aggregation) is of utmost importance. There are two approaches which are feasible.
Use a SQL database and store each field's data as a separate row. Reporting/searching is then done via SQL. My apprehension is that it would result in complicated joins for reporting.
Use a NoSQL database like MongoDB. This seems to be a perfect fit for storing the dynamic data since it is schema-less. However, I am not sure how good its reporting capabilities are.
It seems easier for target users to learn SQL than to define map/reduce queries. How easy would it be to build a UI for reporting/searching over MongoDB?
Simple things like: a list of users who gave a particular set of answers, how many such users there were over a period of time, etc.
Thanks,
Pulkit
It's already been mentioned in the comments, but I'll re-iterate that you should look at Mongo's map/reduce functionality for reporting and the aggregation framework.
Having done map/reduce in both Couch and Mongo I can say that they are very similar. It's definitely a barrier to entry for a developer that isn't familiar with it, but once you get a few working examples, it's not too bad.
Consider that Mongo can output a map/reduce job to a collection, which I've found to be really useful. This means you can schedule the jobs and run them periodically and output to a place that you can then report on. It's not that hard to create a framework that lets developers write simple Javascript map and reduce functions and then plug them in to be run on a schedule.
The aggregation framework is much easier to understand for a developer coming from SQL. There is still a learning curve, but not as bad as with map/reduce. It is much better suited to ad-hoc reporting queries, and there is nothing comparable in Couch.
You could maybe make a reporting UI that maps to the aggregation framework, but I wouldn't try to do something similar for map/reduce queries.

Strategies for Fixing Problems / Tweaking NHibernate Apps in Production

First off, I am not a DBA, but I do work in an environment where DBAs do tune/make changes in the production database from time to time in ways that do not cause the need for an application rebuild/redeployment. Usually these changes consist of reworking indexes, changing procs, and sometimes changing the table structure in minor ways (usually abstracted from the app via procs).
Obviously, a team should strive to catch performance problems with NHibernate before they get into production using things like NHProf, SQL Profiler, and load tests. That being said, are there certain strategies that can be used to allow some amount of tweaking once the code is built and running in production? Using stored procedures 100% of the time seems like it would allow the most flexibility for the DBAs, but obviously that would really kill the efficiency of NHibernate. From what I've read, updatable views (in SQL Server) don't really work that well with NHibernate either (this may or may not be true).
I've read quite a bit about NHibernate and experimented with it over the years, but I have never put it into practice in a production environment. I have yet to come across a set of "best practices" to allow for maximum tweaking once deployed.
As an NHibernate user, how are you and your team dealing with issues if they arise in production? My production environment is made up of ASP.NET apps and SQL server, but I don't think the answers need to be restricted to that platform.
I am in a similar position, and in order to keep our DBA happy, I did the following:
Wrote some of the queries in HQL, some others in SQL (especially the performance-sensitive ones)
Externalized those queries to files, one file per query.
When your app needs to execute one of these queries, it just loads the appropriate file, optionally runs it through a pre-processor, and executes it.
With this approach, the DBA could theoretically tweak the queries just by modifying those files. That's quite similar to having stored procedures.
In practice, it's up to you to decide if you'll really give the DBA access to those files (if you catch my drift...)
IMHO the DBA should just use the DBMS's profiling tools and report her findings back to the devs (as in "there's this query that is running 20 times/sec and does 10 joins - is that really necessary? Can it be cached? Do you really need all those joins? Can we denormalize this?" etc.).
I'm not in the deploy phase yet, but on my current project I've come up against this already and my solution presently has been to replace my queries with stored procs. As long as the shape of the data coming back from the DB remains the same it's not a big deal. Yeah you do lose some of that agility you enjoyed during development but I'm not sure it's as bad as it initially sounds. You'll have a code push when you first make the change of course, and then from that point it's just proc changes.
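To make that concrete, a hedged sketch with hypothetical object names: as long as the proc returns the same column names and types the original query did, nothing on the NHibernate side needs to change, and the DBA is free to tune the body later without an app redeployment.

CREATE PROCEDURE dbo.GetOrdersForCustomer
    @CustomerId INT
AS
BEGIN
    SET NOCOUNT ON;
    -- Same result shape as the query it replaces: OrderId, OrderDate, TotalAmount.
    SELECT OrderId, OrderDate, TotalAmount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
END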
You can use a profiler like NHProf to see the SQL queries being executed, so you can show them to a DBA. This tool can also detect some problems like the n+1 select issue.
Using a second-level cache can be useful: http://web.archive.org/web/20110514214657/http://blogs.hibernatingrhinos.com/nhibernate/archive/2008/11/09/first-and-second-level-caching-in-nhibernate.aspx

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company whose database system has grown over roughly the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do data calculations and prepare the data for the Reporting Team. This is done in multiple stages. Let me try to explain the process in a little more detail:
The Dataflow Team uses the data from the Entry database, copied to another server and another database via transactional replication (this data contains information from all clients). Then, once per hour, a self-written application checks for changed rows in the input tables (using a ChangedDate column) and calls a stored procedure for each output table, calculating new data from 1-N of the input tables. After that the data is copied to another database on another server, again using transactional replication. There another stored procedure is called to calculate additional new output tables; this stored procedure is started by a SQL job. From there the data is split into different databases, one per client. This copying is done by another self-written application using the .NET bulk copy command (filtering on the client). These client-specific databases are copied to different client-specific reporting databases on other servers via yet another self-written application, which compares the reporting database with the client-specific database to calculate the data difference. Only the differences are copied (because the reporting databases used to run on the client servers).
This whole process is orchestrated by another self-written application which controls, e.g., whether the transactional replications are finished before starting the job that calls the stored procedure, etc. Furthermore, the synchronisation between the different clients is also orchestrated here. The process can be graphically displayed by a self-written monitoring tool, which looks pretty complex as you can imagine...
The status of all these components is logged and can be viewed in another self-written application.
If new columns or tables are added, all these components have to be changed manually.
For deployment, installation instructions are written using MS Word. (Ca. 10 people work in this team.)
Reporting:
The Reporting Team created its own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people handle this scenario - I can't imagine that every company writes its own custom applications.
What are the actual options for doing fast calculations on databases (besides T-SQL)? I'm somehow missing the link here to the object-oriented programming I'm used to from my old company, but we never dealt with this much data, and maybe for fast calculations this is the way to do it... Or is it possible to create the algorithms and calculations using e.g. LINQ or BizTalk Server, maybe even in a graphical way? The question is just how to convert the existing meter-long stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to first learn how SQL works, and doing that in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it).
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million-row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQL SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand; otherwise, they would be working directly in SQL Server and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking: you are building a customized system to report on the data at hand. The only item that might be worth considering on this side is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get it done - it's hard to say more without knowing more about your organization and the exact nature of its operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might add is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and/or the column names. If so, you can probably use .NET combined with SQL-SMO (or DMO if using SQL Server 2000) to code-generate it.
Here's an example that I often see used to load a data warehouse (a SQL sketch follows the steps below).
Assuming some raw tables are loaded with the data from the source:
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
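A hedged T-SQL sketch of one such per-table query; the table and column names (src.Customer, dbo.dim_Customer, and the "columns that matter") are hypothetical, and brand-new source rows would need a separate insert:

-- 1. Changed rows from the source into a temp table.
SELECT s.CustomerID, s.Name, s.Region
INTO   #changed
FROM   src.Customer AS s
JOIN   dbo.dim_Customer AS d
  ON   d.CustomerID = s.CustomerID AND d.IsCurrent = 1
WHERE  s.Name <> d.Name OR s.Region <> d.Region;    -- the columns that matter

-- 2. Terminate the existing rows (or clone them into a history table instead).
UPDATE d
SET    d.IsCurrent = 0, d.ValidTo = GETDATE()
FROM   dbo.dim_Customer AS d
JOIN   #changed AS c ON c.CustomerID = d.CustomerID
WHERE  d.IsCurrent = 1;

-- 3. Insert the new versions.
INSERT INTO dbo.dim_Customer (CustomerID, Name, Region, ValidFrom, ValidTo, IsCurrent)
SELECT CustomerID, Name, Region, GETDATE(), NULL, 1
FROM   #changed;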
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
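For example, the column and key definitions can be read straight out of INFORMATION_SCHEMA and fed to the code generator; 'dim_Customer' below is a hypothetical table name:

SELECT c.COLUMN_NAME, c.DATA_TYPE, c.IS_NULLABLE
FROM   INFORMATION_SCHEMA.COLUMNS AS c
WHERE  c.TABLE_NAME = 'dim_Customer'
ORDER  BY c.ORDINAL_POSITION;

SELECT kcu.COLUMN_NAME
FROM   INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
JOIN   INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS kcu
  ON   kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
WHERE  tc.TABLE_NAME = 'dim_Customer'
  AND  tc.CONSTRAINT_TYPE = 'PRIMARY KEY';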
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....

Performance metrics on specific routines: any best practices?

I'd like to gather metrics on specific routines of my code to see where I can best optimize. Let's take a simple example and say that I have a "Class" database with multiple "Students." Let's say the current code calls the database for every student instead of grabbing them all at once in a batch. I'd like to see how long each trip to the database takes for every student row.
This is in C#, but I think it applies everywhere. Usually when I get curious as to a specific routine's performance, I'll create a DateTime object before it runs, run the routine, and then create another DateTime object after the call and take the milliseconds difference between the two to see how long it runs. Usually I just output this in the page's trace...so it's a bit lo-fi. Any best practices for this? I thought about being able to put the web app into some "diagnostic" mode and doing verbose logging/event log writing with whatever I'm after, but I wanted to see if the stackoverflow hive mind has a better idea.
For database queries, you have two small problems, both related to caching: the data cache and the statement cache.
If you run the query once, the statement is parsed, prepared, bound and executed. Data is fetched from files into cache.
When you execute the query a second time, the cache is used, and performance is often much, much better.
Which is the "real" performance number? First one or second one? Some folks say "worst case" is the real number, and we have to optimize that. Others say "typical case" and run the query twice, ignoring the first one. Others says "average" and run in 30 times, averaging them all. Other say "typical average", run the 31 times and average the last 30.
I suggest that the "last 30 of 31" is the most meaningful DB performance number. Don't sweat the things you can't control (parse, prepare, bind) times. Sweat the stuff you can control -- data structures, I/O loading, indexes, etc.
I use this method on occasion and find it to be fairly accurate. The problem is that in large applications with a fairly hefty amount of debugging logs, it can be a pain to search through the logs for this information. So I use external tools (I program in Java primarily, and use JProbe) which allow me to see average and total times for my methods, how much time is spent exclusively by a particular method (as opposed to the cumulative time spent by the method and any method it calls), as well as memory and resource allocations.
These tools can assist you in measuring the performance of entire applications, and if you are doing a significant amount of development in an area where performance is important, you may want to research the tools available and learn how to use one.
Sometimes the approach you take will give you the best look at your application's performance.
One thing I can recommend is to use System.Diagnostics.Stopwatch instead of DateTime:
DateTime is accurate only to about 16 ms, whereas Stopwatch is accurate down to the CPU tick.
But I recommend complementing it with custom performance counters for production, and running the app under a profiler during development.
There are some profilers available but, frankly, I think your approach is better. The profiler approach is overkill. Maybe the use of profilers is worth the trouble if you absolutely have no clue where the bottleneck is. I would rather spend a little time analyzing the problem up front and putting in a few strategic print statements than figure out how to instrument my app for profiling and then pore over gargantuan reports where every executable line of code is timed.
If you're working with .NET, then I'd recommend checking out the Stopwatch class. The times you get back from that are going to be much more accurate than an equivalent sample using DateTime.
I'd also recommend checking out ANTS Profiler for scenarios in which performance is exceptionally important.
It is worth considering investing in a good commercial profiler, particularly if you ever expect to have to do this a second time.
The one I use, JProfiler, works in the Java world and can attach to an already-running application, so no special instrumentation is required (at least with the more recent JVMs).
It very rapidly builds a sorted list of hotspots in your code, showing which methods your code is spending most of its time inside. It filters pretty intelligently by default, and allows you to tune the filtering further if required, meaning that you can ignore the detail of third party libraries, while picking out those of your methods which are taking all the time.
In addition, you get lots of other useful reports on what your code is doing. It paid for the cost of the licence in the time I saved the first time I used it; I didn't have to add lots of logging statements and construct a mechanism to analyse the output: the developers of the profiler had already done all of that for me.
I'm not associated with ej-technologies in any way other than being a very happy customer.
I use this method and I think it's very accurate.
I think you have a good approach. I recommend that you produce "machine friendly" records in the log file(s) so that you can parse them more easily. Something like CSV or other-delimited records that are consistently structured.