How many queries on one webpage is good for performance, especially if that page is a home page that is viewed many times?
And how about the following?
// Fetch every row from table a in database 1
$sql1 = mysql_query("SELECT * FROM a", $db1);
while ($row = mysql_fetch_assoc($sql1)) {
    // For each row of a, fetch the matching row from table b in database 2
    // (assuming b.aid references a.id)
    $sql2 = mysql_query("SELECT * FROM b WHERE aid='" . mysql_real_escape_string($row['id'], $db2) . "'", $db2);
    $a = mysql_fetch_assoc($sql2);
}
Is this good? Actually, I could combine $sql1 and $sql2 with an INNER JOIN, but the problem is that $sql1 queries data from database 1 and $sql2 queries data from database 2, and I use Parallels Plesk Panel, which doesn't allow me to add the same database user to multiple databases.
If I use this code on my website, is that OK? Or is there another way to do this?
Thanks...
Actually you have 2 questions in 1: a general one and a particular one. Both have obvious answers, in my opinion.
How many queries on one webpage is good for performance?
There is no direct connection between the number of queries and performance. Database setup, architecture, and tuning are what determine performance.
The number of queries should be dictated by the database architecture alone. Use as many queries as you need. Do not reduce the number of queries at any cost merely in pursuit of performance.
Is this good?
Does it matter if you have no choice?
And another, unspoken question:
Should I be concerned about this code snippet performance?
Should you?
Do you have any performance issues at the moment?
If not, why worry at all? And why worry about this particular snippet rather than any other one?
If yes, you have to profile your code first.
Then build your optimization strategy based on the profiling results. The answer may be the number of queries, or it may be proper indexing, clustering, or a server upgrade.
Do not shoot blindly. Take sensible steps.
I like to keep mine under 12.
In all seriousness though, that's pretty meaningless. If, hypothetically, there were a reason for you to have 800 queries in a page, then you could go ahead and do it. You'll probably find that the number of queries per page simply depends on what you're doing, though in normal circumstances I'd be surprised to see over 50 (though these days it can be hard to realise just how many you're doing if you abstract your DB calls away).
Slow queries matter more
I used to be frustrated at a certain PHP-based forum software which had 35 queries in a page and ran really slowly, but that was a long time ago, and I know now that the reason that particular installation ran slowly had nothing to do with having 35 queries in a page. Only one or two of those queries took most of the time; it just had a couple of really slow queries that were fixed by well-placed indexes.
I think that identifying and fixing slow queries should come before identifying and eliminating unnecessary queries, as it can potentially make a lot more difference.
Consider even that 20 fast queries might be significantly quicker than one slow query - number of queries does not necessarily relate to speed. Sometimes, you can reduce load and speed up a page by splitting a slow query into multiple queries.
Try caching
There are various ways to cache parts of your application which can really cut down on the number of queries you do, without reducing functionality. Libraries like memcached make this trivially easy these days and yet run really fast. This can also help improve performance a lot more than reducing the number of queries.
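For instance, here is a minimal sketch with the PHP Memcached extension; the server address, cache key, TTL, query, and connection details are all placeholders:
<?php
// Sketch: cache a query result in memcached so repeat page views skip the database.
// Server address, cache key, TTL and the query itself are placeholders.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$key = 'homepage:latest_posts';
$posts = $cache->get($key);

if ($posts === false) {
    // Cache miss: run the query once, then keep the result for 5 minutes
    $pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');
    $posts = $pdo->query('SELECT id, title FROM posts ORDER BY id DESC LIMIT 10')
                 ->fetchAll(PDO::FETCH_ASSOC);
    $cache->set($key, $posts, 300);
}
// Most page loads now use $posts straight from the cache.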
If queries are really unnecessary, and the performance really is making a difference, then remove/combine them
Just consider looking for slow queries and optimizing them, or caching their results, first.
Measure it.
For the specific case outlined above, I'd combine to a join if possible.
In general, multiple queries per request is pretty normal.
Many sites make tens of queries per request and are still fairly performant.
Use a load tester like Apache Bench (if you have Apache installed, run ab to see the parameters).
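For example (the URL is just a placeholder), a run like ab -n 1000 -c 50 http://example.com/ sends 1,000 requests with 50 concurrent connections and reports requests per second and the response-time distribution, which gives you a baseline to compare against after any change.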
I just had the same problem here.
The problem is that you run a query inside a loop. If table a has 10 rows, it makes 10 queries. If table a has 100 rows, it makes 100 queries. So the more rows table a has, the worse it gets.
The solution is to fetch the results into arrays and use the right foreach loops to display the same thing with only 2 queries. I found this site, which is really clear on this topic.
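For illustration, here is a minimal sketch of that two-query approach; the connection details, table names, and the a.id / b.aid columns are assumptions. It collects the ids from the first query, then fetches all matching rows from b in a single IN() query.
<?php
// Sketch: replace the query-in-a-loop with exactly two queries.
// Connection details, table names and the a.id / b.aid columns are assumptions.
$db1 = new PDO('mysql:host=localhost;dbname=database1;charset=utf8mb4', 'user1', 'pass1');
$db2 = new PDO('mysql:host=localhost;dbname=database2;charset=utf8mb4', 'user2', 'pass2');

// Query 1: all rows from a
$rowsA = $db1->query('SELECT * FROM a')->fetchAll(PDO::FETCH_ASSOC);
$ids   = array_column($rowsA, 'id');

$rowsB = [];
if ($ids) {
    // Query 2: every matching row from b at once
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $db2->prepare("SELECT * FROM b WHERE aid IN ($placeholders)");
    $stmt->execute($ids);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $b) {
        $rowsB[$b['aid']] = $b;   // index b rows by aid for easy lookup
    }
}

// Display: pair each a row with its b row without any further queries
foreach ($rowsA as $row) {
    $b = $rowsB[$row['id']] ?? null;
    // ... output $row and $b ...
}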
I'm currently researching a very large table (~100 million rows, 35 columns). It's currently stored in an SQL database, but the queries I'm running (and they're varied) run very, very slowly.
So I gather I should probably move to a NoSQL database. The questions are:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
Or should I stay with SQL and just fine-tune it?
A few more details: rows will not be added or removed; this is historical data, and all of the analysis will be done on that table. I plan to run various queries on it. The data is numerical.
I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.
Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But it sounds like you don't fully understand your performance issues in SQL, and without that understanding your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibility in querying in return for greater flexibility in schema management. You mentioned your queries are "various": if they are truly varied, or, more importantly, frequently changing, then moving to a NoSQL system can put you in a world of pain, especially if you are not familiar with the technology yet.
Bottom line: you aren't doing anything which is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve any of your problems, and will probably introduce new ones.
If you are running queries on columns that are not indexed, they will be very slow. You can add more indexes to speed them up. If your DB is static, this should work well.
One major speed up is the usage of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in such ways. But even MySQL has Cluster capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
So I guess the best first step would be to optimize the indexes on the table for your queries. Every column the query filters or aggregates on (comparisons, counts, etc.) should be indexed.
If that doesn't help, you are probably counting and calculating a lot, and you should use map-reduce jobs and a database that can handle them, like MongoDB: http://docs.mongodb.org/manual/aggregation/
I hope this helps.
I am making a quiz and I am expecting a lot of players for it. The quiz has a fixed number of questions. The quiz system can either fetch individual questions one by one from the MySQL database and display them to the user, or alternatively I can fetch all the questions when the user logs in and display them one by one. Can the second method significantly help in reducing the load on the server due to the large number of SQL queries? Here I am talking about 500-600 users simultaneously playing the game.
The only way to know for sure is to benchmark both methods.
For what it's worth, I've recently improved the throughput of a performance-critical piece of code ten-fold by replacing a large number of small SQL queries with a small number of large queries.
A well indexed database should have no problems at all with single queries, which is what I would prefer. Bonus: those small queries will remain in the server's query cache and thus be delivered way faster if requested again.
It depends on the numbers. Are you displaying all the questions, or are you picking a random 5 out of 1000?
If the number of questions is small, fetching them in a single query will be better.
If the number of questions is large, fetching them in a single query may use a lot of memory, so fetching them one-by-one will be better.
If the number of questions is small, it would make little difference either way. And if the number of questions is large, fetching in a single query is more efficient since you'll avoid most of the back-and-forth network latencies.
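For what it's worth, here is a rough sketch of the fetch-everything-at-login approach; the connection details, table and column names, and the session key are placeholders. The fixed question set is loaded with one query and kept in the session for the rest of the quiz.
<?php
// Sketch: one query per login instead of one query per question.
// Connection details, table/column names and the session key are placeholders.
session_start();

if (!isset($_SESSION['quiz_questions'])) {
    $pdo = new PDO('mysql:host=localhost;dbname=quiz;charset=utf8mb4', 'user', 'pass');
    $_SESSION['quiz_questions'] = $pdo
        ->query('SELECT id, question, option_a, option_b, option_c, option_d FROM questions ORDER BY id')
        ->fetchAll(PDO::FETCH_ASSOC);
}

// Show question number $n without touching the database again
$n = isset($_GET['q']) ? (int) $_GET['q'] : 0;
$question = $_SESSION['quiz_questions'][$n] ?? null;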
Besides the 15 SELECT queries, each user's answers are individual INSERT queries. The best case for the DB server is 2 batched queries per user: 1 SELECT and 1 INSERT.
In theory you may receive 500-600 inserts per second; in practice there will be fewer.
In any case, you should avoid a large number of queries, because there is overhead from insert locks.
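As a sketch of batching the writes (table and column names are placeholders, and $answers stands in for whatever the quiz collects), all of a user's answers can go into one multi-row INSERT:
<?php
// Sketch: one multi-row INSERT per user instead of one INSERT per answer.
// Table/column names are placeholders; $answers maps question id => chosen option.
$answers = [1 => 'a', 2 => 'c', 3 => 'b'];
$userId  = 42;

$pdo = new PDO('mysql:host=localhost;dbname=quiz;charset=utf8mb4', 'user', 'pass');

$placeholders = [];
$values = [];
foreach ($answers as $questionId => $choice) {
    $placeholders[] = '(?, ?, ?)';
    array_push($values, $userId, $questionId, $choice);
}

$sql = 'INSERT INTO answers (user_id, question_id, choice) VALUES ' . implode(', ', $placeholders);
$pdo->prepare($sql)->execute($values);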
I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned horizontally, with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum on the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the COUNT(DISTINCT) function. (A small sketch of the aggregate recombination follows below.)
Edit: I am getting so many nice suggestions, but none have addressed the method.
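To illustrate the aggregate-recombination step described above, here is a minimal sketch of the coordinator side (the per-node numbers are made up): AVG() is rebuilt from per-node SUM and COUNT values, and the closing comment notes why COUNT(DISTINCT) does not recombine the same way.
<?php
// Sketch of the coordinator step for AVG(): each node returns its own SUM and COUNT,
// and the coordinator divides the total sum by the total count.
// $nodeResults stands in for whatever the worker nodes actually return.
$nodeResults = [
    ['sum' => 1250.0, 'count' => 100],
    ['sum' =>  980.0, 'count' =>  80],
    ['sum' => 2210.0, 'count' => 170],
];

$totalSum = 0.0;
$totalCount = 0;
foreach ($nodeResults as $r) {
    $totalSum   += $r['sum'];
    $totalCount += $r['count'];
}

$average = $totalCount > 0 ? $totalSum / $totalCount : null;

// Note: COUNT(DISTINCT x) cannot be recombined from per-node counts alone, because
// the per-node distinct sets may overlap; the coordinator needs the values themselves.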
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your database is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short a time. It sounds like you need some higher-level data sets than your raw data, perhaps a data warehousing solution. Who knows; there's not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network-bound to the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data-processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
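As a rough illustration of that pre-summarizing idea (MySQL syntax; the table and column names and the existence of an hourly summary table are all assumptions), a rollup job might look like this:
<?php
// Sketch: roll raw hit rows up into an hourly summary table, then point reports
// at the summary instead of the raw data. All names here are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=analytics;charset=utf8mb4', 'user', 'pass');

// Summarize yesterday's hits into one row per hour.
// (Re-running this for the same day would need a DELETE or an upsert first.)
$pdo->exec("
    INSERT INTO hits_hourly (hour_start, hit_count)
    SELECT DATE_FORMAT(hit_time, '%Y-%m-%d %H:00:00') AS hour_start, COUNT(*)
    FROM hits
    WHERE hit_time >= CURDATE() - INTERVAL 1 DAY
      AND hit_time <  CURDATE()
    GROUP BY hour_start
");

// Reports now read hits_hourly, which is orders of magnitude smaller than hits.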
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only option, but did you first try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with a cube is that, once built, it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan, your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU-bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up the rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains from doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by re-aggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with PK/FK constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute (an approach I have used successfully on 10 million+ row tables).
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.
I was researching a CMS to use and ran into a review of vBulletin 4.0, which was using about 200 queries on one page load.
I was then worried.
Further research brought me to other sites to see how many queries they are using, and I found that some forum software, such as Invision Power Board and phpBB, uses as few as 6 or 8.
Currently, my site uses about 25 to 40 queries.
Should I be worried?
Don't be worried about number of queries.
Be worried about:
Pages loading too slowly
The SQL being too complicated to maintain.
Clarification:
SQL being too complicated can come from either too many queries or a few queries that are very complicated (lots of joins and subqueries, etc.).
If you aim for something, aim for 3 reads and 1 write per HTTP hit.
While these are somewhat arbitrary numbers (they are actually taken from Advanced PHP Programming), they emphasize the following ideas:
the number of SQL round trips per HTTP call should be low, certainly under 10
there is a difference between reads and writes, and the ratio should favour reads; writes create contention
Also remember that not all reads are equal: the 3 reads should be highly optimized reads, not full table scans with 4-5 outer joins...
It depends. The more you hit the DB, the more load you have. Just some things to look for: if you need to display values from several different tables, you will probably need to run several queries. If you only have a couple of users and you know you're not going to have lots of data, it probably doesn't matter.
Some things to consider:
Are you running the same query multiple times per page load? If you can reuse the result, do it.
Are you running a query-per-result of another query? If so, maybe allow the DB to do the join and only do one pull.
If your page is slow from hitting the db too much, look at memcached.
You might try re-factoring your code over time to decrease the number of round-trips to SQL Server. One way to do this could be to utilize caching. For example, data you need frequently can be loaded when the application is started, then grabbed from the cache when it is needed.
Another approach could be to de-normalize your data into tables that are specifically designed to give you the data your site needs in a fewer number of queries.
Also consider whether some of those queries (those you use to populate lookup values, for instance) can be cached. That way, if the same query is called on multiple pages, or each time you move from one group of records to another, the database isn't hit again to run exactly the same query. I remember one time we were trying to determine why the site was so slow when the stored proc that was running was very fast, and we found using Profiler that it was being sent over and over and over again when it didn't need to be.
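As a small sketch of that idea on the application side (the function, table, and column names, and the connection string, are placeholders), lookup queries can be memoized so the same SQL is not sent over and over within a request:
<?php
// Sketch: memoize lookup queries so the same SQL is not re-sent to the server.
// Function, table and column names are placeholders.
function getLookupValues(PDO $pdo, string $table): array
{
    static $cache = [];
    if (!isset($cache[$table])) {
        // Only whitelisted table names should ever reach this query
        $cache[$table] = $pdo->query("SELECT id, label FROM " . $table)
                             ->fetchAll(PDO::FETCH_KEY_PAIR);
    }
    return $cache[$table];
}

$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');

// Every call after the first returns the array without a database round-trip.
$statuses = getLookupValues($pdo, 'order_statuses');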
You can cache all those queries with vbulletin. If you look at pbnation.com they have over a million visitors a day and only around 3-4 queries per page load. Everything else is cached in memcached.
I have a stored proc that processes a large amount of data (about 5m rows in this example). The performance varies wildly. I've had the process running in as little as 15 minutes and seen it run for as long as 4 hours.
For maintenance, and in order to verify that the logic and processing is correct, we have the SP broken up into sections:
TRUNCATE and populate a work table (indexed) we can verify later with automated testing tools.
Join several tables together (including some of these work tables) to produce another work table
Repeat 1 and/or 2 until a final output is produced.
My concern is that this is a single SP and so gets an execution plan when it is first run (even WITH RECOMPILE). But at that time, the work tables (permanent tables in a Work schema) are empty.
I am concerned that, regardless of the indexing scheme, the execution plan will be poor.
I am considering breaking up the SP and calling separate SPs from within it so that they could take advantage of a re-evaluated execution plan after the data in the work tables is built. I have also seen reference to using EXEC to run dynamic SQL which, obviously might get a RECOMPILE also.
I'm still trying to get SHOWPLAN permissions, so I'm flying quite blind.
Are you able to determine whether there are any locking problems? Are you running the SP in sufficiently small transactions?
Breaking it up into subprocedures should have no benefit.
Somebody should be concerned about your productivity, working without basic optimization resources. That suggests there may be other possible unseen issues as well.
Grab the free copy of "Dissecting Execution Plan" in the link below and maybe you can pick up a tip or two from it that will give you some idea of what's really going on under the hood of your SP.
http://dbalink.wordpress.com/2008/08/08/dissecting-sql-server-execution-plans-free-ebook/
Are you sure that the variability you're seeing is caused by "bad" execution plans? This may be a cause, but there may be a number of other reasons:
"other" load on the db machine
when using different data, there may be "easy" and "hard" data
issues with having to allocate more memory/file storage
...
Have you tried running the SP with the same data a few times?
Also, in order to figure out what is causing the runtime variability, I'd try to do some detailed measuring to pin the problem down to a specific section of the code. (The easiest way would be to insert some log calls at various points in the SP.) Then try to explain why that section is slow (other than "5M rows" ;-)) and figure out a way to make it faster.
For now, I think there are a few questions to answer before going down the "splitting up the sp" route.
You're right it is quite difficult for you to get a clear picture of what is happening behind the scenes until you can get the "actual" execution plans from several executions of your overall process.
One point to consider, perhaps: are your work tables physical or temporary tables? If they are physical, you will get a performance gain by inserting new data into a new table without an index (i.e. a heap), and then building an index on it after all the data has been inserted.
Also, what is the purpose of your process? It sounds like you are moving quite a bit of data around, in which case you may wish to consider the use of partitioning. You can switch data in and out of your main table with relative ease.
Hope what I have detailed is clear but please feel free to pose further questions.
Cheers, John
In several cases I've seen, this kind of variation in execution times / query plans comes down to statistics. I would recommend some test runs of UPDATE STATISTICS against the tables you are using just before the process is run. This will both force a re-evaluation of the execution plan by SQL Server and, I suspect, give you more consistent results. Additionally, you may do well to see whether the differences in execution time correlate with re-indexing jobs by your DBAs. Perhaps you could also gather some index health statistics before each run.
If not, as other answerers have suggested, you are more likely suffering from locking and/or contention issues.
Good luck with it.
The only thing I can think of that an execution plan would do wrong when there's no data is to err on the side of using a table scan instead of an index, since table scans are super fast when the whole table fits into memory. Are there other negatives you're actually observing, or are sure are happening, because there's no data when the execution plan is created?
You can force usage of indexes in your query...
Seems to me like you might be going down the wrong path.
Is this an infeed or outfeed of some sort, or are you creating a report? If it is a feed, I would suggest changing the process to use SSIS, which should be able to move 5 million records very quickly.