I have a postgres table ("dist_mx") that indicates the distances between two points (geographic space). The points are defined in the "hex_0" and "hex_1" columns. The table will eventually be 10^7 to 10^8 rows. The table is structured as such:
One of the purposes of this table is to query the shortest distance from a list of points (1000s) to the points that correspond to locations of interest. For example, I want to know the shortest distance from each point to a grocery stores (we know how each grocery store corresponds to point ids).
I'm using a UNION statement to run the query. The OR statement is used because the order of the points is arbitrary (i.e., pairs aren't repeated in reverse order). See below:
SELECT MIN(distances) FROM dist_mx
WHERE ((point_id_0= '8829abb139fffff' AND point_id_1 IN ('8829abb555fffff', ...))
OR (point_id_1= '8829abb139fffff' AND point_id_0 IN ('8829abb555fffff', ...))
UNION
SELECT MIN(distances) FROM dist_mx
WHERE ((point_id_0= '8829abb469fffff' AND point_id_1 IN ('8829abb555fffff', ...))
OR (point_id_1= '8829abb469fffff' AND point_id_0 IN ('8829abb555fffff', ...))
...
The query seems to be working as intended but it is slow. It takes 20 minutes for the query to run on a list of ~4500 points. I have tried chunking the query so I only include 500 queries at a time (i.e., connected by the UNION statement), but this does not significantly change performance.
I'm relatively new to postgres so I am hoping that there is a fairly simple speedup (or a not fairly simple speedup)?
EDIT:
adding schema
Without seeing an explain analyze for your query, and also the whole query, I can't give specific advice. There's also probably a better way to write your query, but it's unclear what you're doing.
Here's some general advice.
The basic performance tool is indexes. Without indexes, Postgres must scan the whole table, probably repeatedly. See Use The Index, Luke for more.
A multi-column index on (point_id_0, point_id_1) will allow Postgres to quickly find the matching rows without having to scan the whole table.
create index dist_mx_points_idx on dist_mx(point_id_0, point_id_1)
That should help significantly.
One of the purposes of this table is to query the shortest distance from a list of points (1000s) to the points that correspond to locations of interest. For example, I want to know the shortest distance from each point to a grocery stores (we know how each grocery store corresponds to point ids).
Use PostGIS.
Other notes.
Don't store hex as a string, store it as a bigint and convert. This will take less space and is faster.
Don't store numbers as text, use an integer.
Don't store your points as two columns, use a single point column. Then you can use geometric operators. However, these are 2D calculations and only accurate for GIS over short distances.
Since you're doing GIS, don't do this by hand. Use PostGIS.
Related
I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs a while, even though sender_id is also indexed. So I figure, it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the Sender_id in my where clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?
In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so that is going to access less blocks than the second query which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
Whether or not Oracle choses to perform a full-table-scan on a query such as this is dependent on the number of rows returned and whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set-up by Orale and in theory it should be configured such that the transition between indexed and full-table scan queries should be smooth. In other words, at the transition point where your query is returning enough rows to just about make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on the parameter: reducing it might help: Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment where you say your query is returning 4M rows out of 10M-50M in the table. If it is 4M out of 10M there is no way an index will be of any use. Even with 4M out of 50M, it is still pretty certain that a full-table-scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better being kept outside the main table.
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.
SQL is an declarative language. This means, that you specify what you like not how.
Check your indexes primary and "normal" ones...
Short Intro:
When it is required to have a dozen nested calculating queries, is it more optimal to
A) Perform each operation separately (saving into a table for each result and then reading that table for the next query)
B) Have a large set of nested selects
Full Description:
I am trying to calculate some advanced forecasts from a series of input tables in SQL.
I am building around a dozen 'modules' that are separated into their own schema and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module is dumped into the same output table once completed.
Queries range from 7k-200k rows.
A single schema's/module's tables might look like this:
Input Table 1
Input Table 2
Input Table 3
Input Table 4
Calculation Query 1 Result Table
Calculation Query 2 Result Table
Calculation Query 3 Result Table
Calculation Query 4 Result Table
Calculation Query 5 Result Table
Calculation Query 6 Result Table
Final Output
Each calculation query uses the results of the previous (for the most part). The final output is the result of the final calculation query. Calculations are not very complex: partitioned max, basic formula (+,-,*,/) or SUM etcetera. Normally only 1-3 of these per calculation step and always on the same column.
The main reason this is split into multiple calculation queries (instead of one super-formula) is because each calculation joins the outputs in a different way and uses different input tables; also because some are based on previous row results. (Such as max partitions or Lag)
My requirements are as follows:
A procedure that calculates final output from step 1 and merges into Final Output.
A procedure that calculates up to the selected calculation query and merges into its respective results table (and stop). Consider this the 'overriding final'
I DONT need to store the calculation results of intermediate queries - only the final output or the 'overriding final' if selected.
My Problem:
I am trying to optimise the entire process - at this point it looks like it will take around 10-15 seconds. I want it to be 1 second - however I appreciate this is probably not possible.
What I have tried:
Firstly, I created a single procedure for each calculation query that Merged the results into its respective output table. Using this method, each calculation query must read from the database and then merge into its output.
I tried temp tables however I don't see why this would be optimal because I have existing tables for the calculation steps already - which are indexed with the next step in mind.
I then made an assumption that it would be faster to simply nest all the queries into one super-procedure or maybe even have a sequence of Table-Functions.
My Question:
However I ran into a thought that I could not find an answer for - which is the following:
Inserting results into a table on every calculation step might slow the process (especially as they are indexed with 2-4 columns); but at least the data will be indexed for the next step.
Nesting selects would save the effort of inserting data but these results wouldn't be indexed? Right? Or Wrong?
Are select results intelligently indexed? And given my scenario what advise would you give on how I approach this. Maybe I am missing something really simple.
Additional Info:
Most of my larger query results (150-200K) have 4 columns that need to be indexed.
All of my tables only have one column that needs calculating - the rest are indexed.
For Example:
ForecastID, Group, Year, Type, Sub-Type, Value
So I have to index Group, Year, Type and Sub-Type to Join multiple input tables and then calculate on the Value column.
I am telling you this in case having index-heavy tables influences your advice - I wont ask for help on optimizing indexes here due to the overwhelming quantity of advice already available and because it's a different question!
Query optimization is often more art than science, there are few hard and fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, Time to hit the high points.
Indexes effects on loading tables - Indexes have a similar performance impact on inserts as triggers. Unless you have a filtered index each insert will have to update every index on the table, so at three indexes you are looking at quadrupling the number of updates per insert. At one read per insert and a small table size of 200k (very doable for a table scan), for three indexes you are probably outside the butter zone for cost vs. benefit of having those indexes on your work tables.
Nesting results - Like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200k rows with smallish datatypes and a modern server you should be ok performance wise with nesting queries, so long as your only doing one result set at a time. Once again this varies based on your setup, if you are strapped for ram drop them into a temp table.
Joins - Another possible good reason to use temp tables/nested queries is to avoid excessively large joins. The first step in a join process is a full Cartesian join between the tables, which is then filtered based on the on and where clauses. The Join process is heavily optimized in all RDMS, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes, however when tables reach large sizes this can be a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again the butter zone between subsets and full table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.
It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A, persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
If you want to use approach B, or mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.
I know there are similar questions on StackOverflow, but after testing different indexes on my tables, I think I don't quite understand how indexes work and I'd like it if someone could explain the behavior I'm experiencing on my queries' performance.
I'm using this query as an example, I'm going to try to explain it in detail:
SELECT ss1.PlayerID, ss1.Name, ss1.Series, ss1.LanesNum, ss1.Date, ss1.LeagueName, ss1.Season FROM SeriesScores ss1
JOIN (SELECT Series, Gender, LanesNum, Bowlout, Season FROM SeriesScores
WHERE Gender = ? AND LanesNum = ? AND Series > -1 AND Bowlout = 'No' AND Season = '2011-2012'
ORDER BY Series DESC LIMIT 0,?) as ss2
USING(series, gender, lanesNum, bowlout, season)
ORDER BY ss1.Series DESC
This query is used to get the highest series bowled in a given season for each pair of lanes in a bowling center for both male and female players.
I'm joining the table on itself instead of using the MAX aggregate function because if there's a tie on a given pair of lanes, I want all the names to come up.
Basically, I join all the fields that match what the inner SELECT returns. That inner SELECT returns the top X players for a given gender and a given pair of lanes.
The USING part makes sure only the players that haven't bowled out, with the same gender, series, lanesNum and season as I'm looking for get selected. I then order them by highest series to lowest series.
This query is in a for loop, which gets run 12 times for men and 12 times for women (12 pair of lanes in the bowling center) with only the lanesNum and gender parameters changing.
I then put all the results in two different vectors in Java to display the results in an application (one vector for men, one for women).
Without any indexes whatsoever, it takes around 11 seconds to run everything including putting the results in a vector and all of that. (5.5 seconds for the 12 queries for men, same for women).
With an index on (gender, lanesNum, series), it takes 0.04 seconds for the whole thing, which is amazing, since that's a more than acceptable speed for my needs.
I used that index because those are all the most important fields I'm using in my WHERE clause, but I don't get why it speeds things up that much, because I tried other things and using some other indexes actually made my queries SLOWER by more than 100%. Also, I'm wondering if I would get an even faster query if I added "bowlout" and "season" to that index.
I wanted to try a single column index on series first and test performance. That's the index that made all of those queries take a total of 22 seconds.
I came to the conclusion that I don't understand where I should be using my indexes and when I should be using them on multiple fields, or using multiple indexes on single fields, etc. Also, I don't understand how using (the wrong) indexes can actually make performance worse.
Optimizing an index too aggresively for just one query runs the risk of slowing down other queries (and thus a real world application, or the next version of it). However, let us do exactly that as an exercise in analysing index performance.
Indexes influence query performance in multiple ways; their existence can actually completely change the algorithm that the database server will use to get to the data. A nice overview is here, but as your query is simple, and you actually have very few relevant indexes in your database (the one you see, and also automatically created indexes to support the primary keys of your tables) we can simplify the story greatly.
A good index makes it faster to cross reference the data between the tables. Ideally it contains columns in your USING and WHERE clauses, and enough of them to reference a unique row in its table most of the time. If it contains less, it may still be used by the database server, but the remaining rows will have to be visited one by one.
An great index does not only all that, but it also contains all data that you will be selecting from the table (yes, this makes sense when the two tables are actually the same physical table due to the self-join; the database server still processes as if it was two different tables, incidentally with the same data). The benefit of such a "fully covering index" is that the database server does not have to visit its table at all; all the columns are available in the index.
Order of columns in the index matters. It is especially essential that the leftmost column in the index appears in the USING clause, or WHERE clause; otherwise the index is pretty much unusable as matching data for a single lookup can appear in many locations in that index. It should also be highly selective (have many different values in the table). Do a few experiments now to see this first hand.
For this reason, the first choice index I'd suggest to you would be series, gender, lanesNum, bowlout; but yours is also a very good one for this query.
There is not much use in creating more than one index explicitly. There is basically no use for more than one of them during query execution, because your query is so simple. So the most useful one will supposedly win and all the others will be ignored.
To your last question: some people believe that superfluous indexes only slow down UPDATE, INSERT and DELETE statements (because these carry the overhead to update the indexes), but it is not that simple. As the database server considers multiple algorithms to compute your query (there are two logical tables to start from and automatic and explicit indexes to use, or not to use), it may choose the wrong plan: an index may look seductive without knowing the data distribution in the table, but be very counterproductive given the distribution.
There is actually a way to let the database server analyze the data and record some statistics that will greatly help it optimize your subsequent queries reasonably and probably to avoid any 22 second executions of your query (until you change your data so much that the statistics will no longer hold true). That is the ANALYZE command. Issue it every time after you change your indexes to see the subsequent sqlite performance at its best. In a production database, schedule ANALYZE to execute every night, so that your database does not gradually slow down over time, or abruptly after adding a harmless, useless index.
I have a MySQL SELECT query that calculates a distance using Pythagoras in the WHERE clause to restrict results to a certain radius.
I also use the exact same calculation in the ORDER BY clause to sort the results by smallest distance first.
Does MySQL calculate the distance twice (once for the WHERE, and again for the ORDER BY)?
If it does, how can I optimize the query so it is only calculated once (if possible at all)?
Does MySQL calculate the distance twice (once for the WHERE, and again for the ORDER BY)?
No, the calculation will not be performed twice if it is written in exactly the same way. However if your aim is to improve the performance of your application then you might want to look at the bigger picture rather than concentrating on this minor detail which could give you at most a factor of two difference. A more serious problem is that your query prevents efficient usage of indexes and will result in a full scan.
I would recommend that you change your database so that you use the geometry type and create a spatial index on your data. Then you can use MBRWithin to quickly find the points that lie inside the bounding box of your circle. Once you have found those points you can run your more expensive distance test on those points only. This approach will be significantly faster if your table is large and a typical search returns only a small fraction of the rows.
If you can't change the data model then you can still improve the performance by using a bounding box check first, for example WHERE x BETWEEN 10 AND 20 AND y BETWEEN 50 AND 60. The bounding box check will be able to use an index, but because R-Tree indexes are only supported on the geometry type you will have to use the standard B-Tree index which is not as efficient for this type of query (but still much better than what you are currently doing).
You could possibly select for it, put it in the HAVING clause and use it in the ORDER BY clause, then the calculation is certainly only done once, but I guess that would be slower, because it has to work with more data. The calculation itself is not that expensive.
It seems that all questions regarding this topic are very specific, and while I value specific examples, I'm interested in the basics of SQL optimization. I am very comfortable working in SQL, and have a background in hardware/low level software.
What I want is the tools both tangible software, and a method to look at the mysql databases I look at on a regular basis and know what the difference between orders of join statements and where statements.
I want to know why an index helps, like, exactly why. I want to know specifically what happens differently, and I want to know how I can actually look at what is happening. I don't need a tool that will breakdown every step of my SQL, I just want to be able to poke around and if someone can't tell me what column to index, I will be able to get out a sheet of paper and within some period of time be able to come up with the answers.
Databases are complicated, but they aren't THAT complicated, and there must be some great material out there for learning the basics so that you know how to find the answers to optimization problems you encounter, even if could hunt down the exact answer on a forum.
Please recommend some reading that is concise, intuitive, and not afraid to get down to the low level nuts and bolts. I prefer online free resources, but if a book recommendation demolishes the nail head it hits I'd consider accepting it.
You have to do a look up for every where condition and for every join...on condition. The two work the same.
Suppose we write
select name
from customer
where customerid=37;
Somehow the DBMS has to find the record or records with customerid=37. If there is no index, the only way to do this is to read every record in the table comparing the customerid to 37. Even when it finds one, it has no way of knowing there is only one, so it has to keep looking for others.
If you create an index on customerid, the DBMS has ways to search the index very quickly. It's not a sequential search, but, depending on the database, a binary search or some other efficient method. Exactly how doesn't matter, accept that it's much faster than sequential. The index then takes it directly to the appropriate record or records. Furthermore, if you specify that the index is "unique", then the database knows that there can only be one so it doesn't waste time looking for a second. (And the DBMS will prevent you from adding a second.)
Now consider this query:
select name
from customer
where city='Albany' and state='NY';
Now we have two conditions. If you have an index on only one of those fields, the DBMS will use that index to find a subset of the records, then sequentially search those. For example, if you have an index on state, the DBMS will quickly find the first record for NY, then sequentially search looking for city='Albany', and stop looking when it reaches the last record for NY.
If you have an index that includes both fields, i.e. "create index on customer (state, city)", then the DBMS can immediately zoom to the right records.
If you have two separate indexes, one on each field, the DBMS will have various rules that it applies to decide which index to use. Again, exactly how this is done depends on the particular DBMS you are using, but basically it tries to keep statistics on the total number of records, the number of different values, and the distribution of values. Then it will search those records sequentially for the ones that satisfy the other condition. In this case the DBMS would probably observe that there are many more cities than there are states, so by using the city index it can quickly zoom to the 'Albany' records. Then it will sequentially search these, checking the state of each against 'NY'. If you have records for Albany, California these will be skipped.
Every join requires some sort of look-up.
Say we write
select customer.name
from transaction
join customer on transaction.customerid=customer.customerid
where transaction.transactiondate='2010-07-04' and customer.type='Q';
Now the DBMS has to decide which table to read first, select the appropriate records from there, and then find the matching records in the other table.
If you had an index on transaction.transactiondate and customer.customerid, the best plan would likely be to find all the transactions with this date, and then for each of those find the customer with the matching customerid, and then verify that the customer has the right type.
If you don't have an index on customer.customerid, then the DBMS could quickly find the transaction, but then for each transaction it would have to sequentially search the customer table looking for a matching customerid. (This would likely be very slow.)
Suppose instead that the only indexes you have are on transaction.customerid and customer.type. Then the DBMS would likely use a completely different plan. It would probably scan the customer table for all customers with the correct type, then for each of these find all transactions for this customer, and sequentially search them for the right date.
The most important key to optimization is to figure out what indexes will really help and create those indexes. Extra, unused indexes are a burden on the database because it takes work to maintain them, and if they're never used this is wasted effort.
You can tell what indexes the DBMS will use for any given query with the EXPLAIN command. I use this all the time to determine if my queries are being optimized well or if I should be creating additional indexes. (Read the documentation on this command for an explanation of its output.)
Caveat: Remember that I said that the DBMS keeps statistics on the number of records and the number of different values and so on in each table. EXPLAIN may give you a completely different plan today than it gave yesterday if the data has changed. For example, if you have a query that joins two tables and one of these tables is very small while the other is large, it will be biased toward reading the small table first and then finding matching records in the large table. Adding records to a table can change which is larger, and thus lead the DBMS to change its plan. Thus, you should attempt to do EXPLAINS against a database with realistic data. Running against a test database with 5 records in each table is of far less value than running against a live database.
Well, there's much more that could be said, but I don't want to write a book here.
Let's say you're looking for a friend in another city. One way would be to go from door to door and ask whether this is the house you're looking for. Another way is to look at the map.
The index is the map to a table. It can tell the DB engine exactly where the thing you're looking for is. Thus, you index every column that you think you will have to search for, and leave out the columns that you are just reading data from, and never searching for.
Good technical reading about indices and about ORDER BY optimization. And if you want to see what exactly is happening, you want the EXPLAIN statement.
Don't think about optimizing databases. Think about optimizing queries.
Generally, you optimize one case at the expense of others. You just have to decide which cases you're interested in.
I don't know about MySql tools but in MS SqlServer you have a tool that shows all of the operations a query would take and how much of the processing time of the entire query would take.
Using this tool helped me to understand how queries are optimized by the query optimizer much more than I think any book could help because what the optimizer does is often not easy to understand. By tweaking the query and possibly the underlining database I could see how each change affected the query plan. There are certain key points in writing queries but to me it looks like you already have an idea of those so optimizing in your case is much more about this than any general rules. After a few years of db development I did look at a few books specifically aimed at database optimization on the SQL Server and found very little useful info.
Quick googling came up with this: http://www.mysql.com/products/enterprise/query.html which sounds like a similar tool.
This was of course on a query level, database level optimizations are again a different kettle of fish, but there you are looking at parameters such as how your database is divided on the hard drives etc. At least in SqlServer you can select to divide tables to different hdd's and even disk plates and this can have a big effect because the drives and drive heads can work in parallel. Another is how you can build your queries so that the database can run them in several threads and processors in parallel, but both of these issues again depend on the database engine and even version you are using.
[Caution: Most of this Answer does not apply to MySQL. I bring this up because the OP tagged the Question with mysql.]
"I'm interested particularly in how indices will affect joins"
As an example, I'll take the case of equijoin (SELECT FROM A,B WHERE A.x = B.y).
If there are no indexes at all (which is possible in theory but I think not in SQL), then basically the only way to compute the join is to take the entire table A and partition it over x, take the entire table y and partition it over y, then match the partitions, and finally for each pair of matching partitions compute the result rows. That's costly (or even outright impossible due to memory restrictions) for all but the smallest tables.
Same story if there do exist indexes on A and/or B, but not any of them has x resp. y as its first attribute.
If there does exist an index on x, but not on y (or conversely), then another possibility opens up : scan table B, for each row pick value y, lookup that value in the index and fetch the corresponding A rows to compute the join. Note that this still won't win you much if no other further restrictions apply (AND z = ...) - except in the case where there are only few matches between x and y values.
If ordered indexes (hash-based indexes are not ordered) exist on both x and y, then a third possibility opens up : do a matching scan on the indexes themselves (the indexes themselves are likely to be smaller than the tables themselves, so scanning the index itself will take a shorter time), and for the matching x/y values, compute the join of the corresponding rows.
That's the baseline. Variations arise for joins on x>y etc.