I am investigating some SQL statements in my Java application and want to know whether it is possible to measure the load of a particular statement. Usually I count the SQL statements and try to reduce their number, but that is not always the right approach: running two separate statements can sometimes be faster than running one statement that combines them. Is it possible for me to find out the hit/load of a particular SQL statement?
I am using Oracle and have Hibernate sitting between the DB and the Java layer.
Thanks
AWR and ASH reports will help you find the most significant SQL statements putting load on your database along different dimensions (CPU, elapsed time, I/O, etc.).
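If you just want a quick look before pulling a full report, a query against V$SQL (the cursor cache) can surface the heaviest statements; buffer_gets is the logical-I/O ("hit") figure you are asking about. This is only a sketch, and note that the AWR/ASH views themselves require the Diagnostics Pack license.

    -- Top 10 statements still in the shared pool, by total elapsed time.
    -- For history, query DBA_HIST_SQLSTAT (AWR) instead of V$SQL.
    SELECT *
      FROM (SELECT sql_id,
                   executions,
                   ROUND(elapsed_time / 1000000, 1) AS elapsed_sec,
                   ROUND(cpu_time / 1000000, 1)     AS cpu_sec,
                   buffer_gets,
                   disk_reads,
                   SUBSTR(sql_text, 1, 80)          AS sql_text
              FROM v$sql
             ORDER BY elapsed_time DESC)
     WHERE ROWNUM <= 10;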
Summary
I have a data analysis project that requires the use of a locally stored PostgreSQL database of fairly decent size for a desktop machine (~10 tables with up to ~90m rows and ~20 columns, summing to around 20gb worth of data).
I have some statistical models I want to run on this data using R. First, though, I need to manipulate the data a little to get it into the form I want. The basic manipulations are fairly straightforward SELECT and JOIN operations, but they take a few minutes on my machine because of the amount of data. I'll need to refer to the manipulated tables again and again during analysis, so I'd like to be able to save the results of these SELECT and JOIN operations for later use.
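For concreteness, the kind of thing I have in mind is materializing a joined result once inside Postgres and reusing it, roughly like this (the table and column names here are simplified placeholders, not my real schema):

    -- Persist the joined result server-side so later analyses can reuse it.
    CREATE TABLE analysis_base AS
    SELECT o.*, c.region, c.segment
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id;

    -- Index the columns the statistical queries will filter or join on.
    CREATE INDEX idx_analysis_base_region ON analysis_base (region);
    ANALYZE analysis_base;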
Question
Is it faster or computationally more efficient to
(a) execute the joins from R's DBI package, using, say, dbGetQuery, and saving the resulting dataframes on disk for later analysis
or
(b) do the joins and selects inside PGAdmin or DataGrip, save the result to a .csv file, and bring that into R?
What I've tried
I've tried three of the operations I need to do in both RStudio (as in a) and DataGrip (as in b) and timed them with a stopwatch. In two instances, the code seems to run faster inside the SQL environment in DataGrip, and in the third it's marginally faster in RStudio. I'm not sure why, other than the third operation working on smaller tables than the first two. No, I don't know how to benchmark code on either platform, and yes, that may be part of my issue. Nor do I know much about big-O notation, but that may not be relevant here.
I can include some more concrete code if it's helpful, but my question (it seems to me) is a little more theoretical. I'm basically asking if connecting to a SQL server on my local machine should be any different if I'm doing it in R versus doing it in some "proper" database environment. Are there bottlenecks in one and not the other?
Thanks in advance for any insight!
I have a J2EE application built on EclipseLink and running under Glassfish on Postgres. We're doing some performance analysis now.
I turned on pg logging on our build server and analyzed the output with pgfouine. Now that I have these charts and data from pgfouine, how should I interpret that to actually improve performance?
I think I want to find the most frequently used, but slower queries to get the most benefit. Reducing the number of frequently run queries (perhaps through caching) also seems like a sound approach.
Properly done indexing helps a lot. If a column appears in many WHERE clauses, it is a good candidate for an index.
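For example (table and column names here are hypothetical):

    -- Index a column that shows up in many WHERE clauses...
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);

    -- ...then confirm the planner actually uses it for a typical query.
    EXPLAIN ANALYZE
    SELECT * FROM orders WHERE customer_id = 42;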
I am currently addressing a situation where our web application receives at least a million requests every 30 seconds. These requests generate 3-5 million row inserts across 5 tables, which is a pretty heavy load to handle. We are currently using multi-threading to handle this (which is a bit faster, but we still cannot get good CPU throughput). The load will definitely increase in the future, and we will have to account for that too: six months from now we expect roughly double the current load, so I am looking for a possible new solution that is scalable and can easily accommodate any further increase.
With multi-threading, the whole debugging scenario has become quite complicated, and we sometimes have problems tracing issues.
FYI, we are already using the SQL Bulk Insert/Copy mentioned in this previous post:
Sql server 2008 - performance tuning features for insert large amount of data
However, I am looking for a more capable solution (and I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for the big picture of a concept that I could use, and I am sure I can take it further to an elegant solution :)
The solution should also make better use of threads and processes; I do not want my threads/processes to sit waiting to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert, but most of them will lead to some SQL operation. The application performs different types of transactions, and these lead to a lot of bulk SQL operations. I am more concerned with inserts and updates.
These operations do not need to be real time; a bit of lag is acceptable, although processing them in real time would be very helpful.
I think your problem is really about getting better CPU throughput, which will lead to better performance. I would look at something like asynchronous processing, where a thread never sits idle; you will probably have to maintain a queue in the form of a linked list or whatever other data structure suits your programming model.
The way this works is that your threads try to perform a given job immediately, and if anything stops them from doing it, they push that job onto the queue; queued items are then processed in the order the container/queue stores them.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
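Purely as a variation on the same idea (and since you asked for concepts rather than code, treat this as a rough sketch with made-up table and column names): if the queue ever needs to survive a process restart, it can live in a table instead of in memory, and worker threads can drain it in batches without blocking each other.

    -- A durable work queue as an alternative to an in-memory linked list.
    CREATE TABLE dbo.WorkQueue (
        Id         BIGINT IDENTITY(1, 1) PRIMARY KEY,
        Payload    NVARCHAR(MAX) NOT NULL,
        EnqueuedAt DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Each worker claims a batch; READPAST lets concurrent workers skip rows
    -- another worker has already locked, so nobody sits waiting.
    WITH batch AS (
        SELECT TOP (100) *
          FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
         ORDER BY Id
    )
    DELETE FROM batch
    OUTPUT deleted.Id, deleted.Payload;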
Let me know if this helps.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client, geography, or some other factor?
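For context, partitioning a SQL Server table looks roughly like this; the names, the date boundaries, and the monthly scheme are all made up for illustration.

    -- Spread inserts across monthly partitions.
    CREATE PARTITION FUNCTION pf_monthly (DATETIME2)
        AS RANGE RIGHT FOR VALUES ('2014-01-01', '2014-02-01', '2014-03-01');

    CREATE PARTITION SCHEME ps_monthly
        AS PARTITION pf_monthly ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Requests (
        RequestId  BIGINT IDENTITY(1, 1) NOT NULL,
        ReceivedAt DATETIME2 NOT NULL,
        Payload    NVARCHAR(4000) NULL,
        CONSTRAINT PK_Requests PRIMARY KEY CLUSTERED (ReceivedAt, RequestId)
    ) ON ps_monthly (ReceivedAt);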
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well; Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high-performance design, including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since your inserts/updates do not need to be real time, you might consider having two databases: one for reads and one for writes, similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update Database:
No indexes at all. This will help maximize insert/update performance.
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that moves data over to the read database at certain intervals. When I have seen this in the past, the data was usually moved over nightly, when few people were using the site. There are a number of options for moving the data over, but I would start by looking at SSIS; a rough sketch of the move itself appears after the list below.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
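Whether you use SSIS or plain T-SQL for the publication step, the nightly move boils down to something like the following sketch; the database, table, and watermark names are hypothetical, it assumes both databases live on the same instance, and updates to existing rows would need a MERGE rather than a plain INSERT.

    -- Copy rows changed since the last run from the write db to the read db.
    DECLARE @since DATETIME2 =
        (SELECT LastRunAt FROM ReadDb.dbo.SyncWatermark WHERE TableName = 'Orders');

    INSERT INTO ReadDb.dbo.Orders (OrderId, CustomerId, Amount, ModifiedAt)
    SELECT o.OrderId, o.CustomerId, o.Amount, o.ModifiedAt
      FROM WriteDb.dbo.Orders AS o
     WHERE o.ModifiedAt > @since;

    -- Record the new watermark for the next run.
    UPDATE ReadDb.dbo.SyncWatermark
       SET LastRunAt = SYSUTCDATETIME()
     WHERE TableName = 'Orders';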
I'm writing a web app which is using a mysql database. I want to show running time for a particular query, but I want it to be useful for other developers trying to do the same thing. The point is to give other developers an idea as to the cost of doing this query if they try the same web app pattern.
What is a good way to do this? I can run the query on mysql N times and average the results. I can modify the dataset I'm running on to provide expected, best, and worst case scenarios. Is any of that useful though for other developers? Is there some other way to go about this?
I see that MySQL Query Browser reports the time it took to run the query. Is that all that's needed to provide an accurate report?
I understand the same pattern will have different run times on different architectures.
Thanks
Determine the number of logical reads used by the query. This won't fluctuate like elapsed time will.
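In MySQL, the closest session-level equivalent is the Handler_* counters, which count row-level read operations. One common way to capture them for a single query (the SELECT here is just a placeholder; FLUSH STATUS needs the RELOAD privilege):

    FLUSH STATUS;   -- reset the session status counters

    -- Run the query you want to measure (placeholder query).
    SELECT c.name, COUNT(*)
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     GROUP BY c.name;

    -- Handler_read_* shows how many row reads the query performed,
    -- which stays stable across runs while elapsed time fluctuates.
    SHOW SESSION STATUS LIKE 'Handler_read%';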
I was researching CMSes to use and ran into a review of vBulletin 4.0, which uses about 200 queries on one page load.
I was then worried.
Further research led me to look at how many queries other sites are using, and I found that some forum software, such as Invision Power Board and PHPBB, uses as few as 6 or 8.
Currently, my site uses about 25 to 40 queries.
Should I be worried?
Don't be worried about number of queries.
Be worried about:
Pages loading too slowly
The SQL being too complicated to maintain.
Clarification:
SQL being too complicated can come from either too many queries OR a few queries that are very complicated (lots of joins and subqueries, etc.).
If you aim for something, aim for 3 reads and 1 write per HTTP hit.
While these are somewhat arbitrary numbers (they are actually taken from Advanced PHP Programming), they emphasize the ideas:
the number of SQL round trips should be low, certainly under 10, per HTTP call
there is a difference between reads and writes, and the ratio should favour reads; writes create contention
Also remember that not all reads are equal: the 3 reads should be highly optimized reads, not full table scans with 4-5 outer joins...
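A quick way to tell the difference is to look at the access path (MySQL shown here; the query itself is just a placeholder):

    -- Check how a hot read is executed.
    EXPLAIN
    SELECT t.title, COUNT(p.id) AS replies
      FROM threads t
      JOIN posts   p ON p.thread_id = t.id
     WHERE t.forum_id = 7
     GROUP BY t.id, t.title;

    -- In the output, type = ALL on a large table means a full table scan;
    -- aim for ref/range/eq_ref on the hot path and add indexes where needed.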
It depends. The more you hit the DB, the more load you have. If you need to display values from several different tables, you will probably need to run several queries. If you only have a couple of users and you know you're not going to have a lot of data, it probably doesn't matter.
Some things to consider:
Are you running the same query multiple times per page load? If you can reuse the result, do it.
Are you running a query per result of another query? If so, maybe let the DB do the join and do only one pull (see the sketch after this list).
If your page is slow from hitting the db too much, look at memcached.
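To illustrate the second point with made-up table names: instead of fetching a list of ids and then issuing one query per id, let the database join in a single round trip.

    -- Anti-pattern: SELECT id FROM users WHERE active = 1;
    --               then, for each id: SELECT * FROM orders WHERE user_id = ?;

    -- Better: one pull that lets the DB do the join.
    SELECT u.id, u.name, o.order_id, o.total
      FROM users u
      LEFT JOIN orders o ON o.user_id = u.id
     WHERE u.active = 1;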
You might try refactoring your code over time to decrease the number of round trips to the database. One way to do this could be to utilize caching. For example, data you need frequently can be loaded when the application is started and then grabbed from the cache when it is needed.
Another approach could be to denormalize your data into tables that are specifically designed to give you the data your site needs in fewer queries.
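As a rough illustration of the denormalization idea (the schema here is entirely hypothetical): a pre-joined summary table, rebuilt periodically, can turn several per-page queries into one.

    -- A read-optimized summary table built from the normalized tables.
    CREATE TABLE thread_summary AS
    SELECT t.id           AS thread_id,
           t.title,
           u.username     AS author,
           COUNT(p.id)    AS reply_count,
           MAX(p.created) AS last_post_at
      FROM threads t
      JOIN users u ON u.id = t.author_id
      LEFT JOIN posts p ON p.thread_id = t.id
     GROUP BY t.id, t.title, u.username;

    CREATE INDEX idx_thread_summary_last_post ON thread_summary (last_post_at);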
Also consider whether some of those queries (those you use to populate lookup values, for instance) can be cached. That way, if the same query is called on multiple pages, or each time you move from one group of records to another, the database isn't hit again to run exactly the same query. I remember one time we were trying to determine why the site was so slow when the stored proc that was running was very fast, and we found using Profiler that it was being sent over and over and over again when it didn't need to be.
You can cache all those queries with vBulletin. If you look at pbnation.com, they have over a million visitors a day and only around 3-4 queries per page load. Everything else is cached in memcached.