Performance question ...
I have a database of houses that have geolocation data (longitude & latitude).
What I want to do is find the best way to store the location data in MySQL (v5.0.24a, using the InnoDB engine) so that I can run a lot of queries returning all the home records that fall between x1 and x2 latitude and y1 and y2 longitude.
Right now, my database schema is
---------------------
Homes
---------------------
geolat - Float (10,6)
geolng - Float (10,6)
---------------------
And my query is:
SELECT ...
WHERE geolat BETWEEN x1 AND x2
AND geolng BETWEEN y1 AND y2
Is what I described above the best way to store the latitude and longitude data in MySQL, using Float(10,6) and separating out the longitude/latitude? If not, what is? Float, Decimal and even Spatial exist as data types.
Is this the best way to perform the SQL from a performance standpoint? If not, what is?
Does using a different MySQL database engine make sense?
UPDATE: Still Unanswered
I have 3 different answers below. One person says to use Float. One says to use INT. One says to use Spatial.
So I used MySQL's EXPLAIN statement to measure the SQL execution speed. It appears there is absolutely no difference in SQL execution (result set fetching) whether INT or FLOAT is used for the longitude and latitude data type.
It also appears that using the BETWEEN statement is SIGNIFICANTLY faster than using the ">" or "<" SQL operators. It's nearly 3x faster to use BETWEEN than to use the ">" and "<" operators.
With that being said, I am still uncertain what the performance impact would be if I used Spatial, since it's unclear to me whether it's supported by my version of MySQL (v5.0.24) ... as well as how to enable it if it is supported.
Any help would be greatly appreciated.
float(10,6) is just fine.
Any other convoluted storage schemes will require more translation in and out, and floating-point math is plenty fast.
I know you're asking about MySQL, but if spatial data is important to your business, you might want to reconsider. PostgreSQL + PostGIS are also free software, and they have a great reputation for managing spatial and geographic data efficiently. Many people use PostgreSQL only because of PostGIS.
I don't know much about the MySQL spatial system though, so perhaps it works well enough for your use-case.
The problem with using any other data type than "spatial" here is that your kind of "rectangular selection" can (usually, this depends on how bright your DBMS is - and MySQL certainly isn't generally the brightest) only be optimised in one single dimension.
The system can pick either the longitude index or the latitude index, and use that to reduce the set of rows to inspect. But after it has done that, there is a choice of: (a) fetching all found rows and scanning over those, testing for the "other dimension", or (b) doing a similar process on the "other dimension" and afterwards matching the two result sets to see which rows appear in both. This latter option may not be implemented as such in your particular DBMS engine.
Spatial indexes sort of do the latter "automatically", so I think it's safe to say that a spatial index will give the best performance in any case, but it may also be the case that it doesn't significantly outperform the other solutions, and that it's just not worth the bother. This depends on all sorts of things like the volume of and the distribution in your actual data etc. etc.
It is certainly true that float (tree) indexes are by necessity slower than integer indexes, because of the longer time it usually takes to execute '>' on floats than it does on integers. But I would be surprised if this effect were actually noticeable.
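For concreteness, here is a rough sketch of what the spatial route can look like in MySQL. Treat it as an assumption-laden sketch: in older versions such as 5.0.x the spatial extensions and SPATIAL indexes are only available on MyISAM tables, and the table name, column names and coordinates below are illustrative, not taken from the question.

-- Sketch only: assumes a MyISAM table named homes and that the spatial
-- extensions are compiled into your MySQL build.
ALTER TABLE homes ADD COLUMN location POINT;

-- Populate the point from the existing float columns (x = longitude, y = latitude).
UPDATE homes
   SET location = GeomFromText(CONCAT('POINT(', geolng, ' ', geolat, ')'));

ALTER TABLE homes MODIFY location POINT NOT NULL;
CREATE SPATIAL INDEX sp_homes_location ON homes (location);

-- Rectangle query: all homes inside a bounding box (example coordinates).
SELECT *
  FROM homes
 WHERE MBRContains(
         GeomFromText('POLYGON((-74.1 40.5, -74.1 40.8, -73.7 40.8, -73.7 40.5, -74.1 40.5))'),
         location);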
Google uses float(10,6) in their "Store locator" example. That's enough for me to go with that.
https://stackoverflow.com/a/5994082/1094271
Also, starting with MySQL 5.6.x, spatial extensions support is much better and comparable to PostGIS in features and performance.
I would store it as integers (int, 4 bytes) representing 1/1,000,000ths of a degree. That would give you a resolution of a few inches.
I don't think there is any intrinsic spatial datatype in MySQL.
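As a sketch of that integer suggestion (the _e6 column names and the bounds are just illustrative), the conversion and the range query could look like this:

-- Store coordinates as whole microdegrees (1/1,000,000th of a degree).
ALTER TABLE homes
  ADD COLUMN geolat_e6 INT,
  ADD COLUMN geolng_e6 INT;

UPDATE homes
   SET geolat_e6 = ROUND(geolat * 1000000),
       geolng_e6 = ROUND(geolng * 1000000);

-- The rectangle query is unchanged, only the bounds are now integers.
SELECT *
  FROM homes
 WHERE geolat_e6 BETWEEN 40500000 AND 40800000
   AND geolng_e6 BETWEEN -74100000 AND -73700000;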
Float (10,6)
Where is latitude or longitude 5555.123456?
Don't you mean Float(9,6) instead?
I have the exact same schema (float(10,6)) and query (selecting inside a rectangle), and I found that switching the db engine from InnoDB to MyISAM doubled the speed for a "point in rectangle look-up" in a table with 780,000 records.
Additionally, I converted all lng/lat values to cartesian integers (x,y) and created a two-column index on the x,y and my speed went from ~27 ms to 1.3 ms for the same look-up.
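A minimal sketch of that set-up, assuming the converted integer columns are called x and y (the names and bounds are made up for illustration):

-- Two-column index so the look-up can narrow on x first and then y.
CREATE INDEX idx_homes_x_y ON homes (x, y);

SELECT *
  FROM homes
 WHERE x BETWEEN 1000000 AND 2000000
   AND y BETWEEN 3000000 AND 4000000;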
It really depends on how you are using the data. But as a gross over-simplification of the facts, decimal is faster but less accurate in approximations. More info here:
http://msdn.microsoft.com/en-us/library/aa223970(SQL.80).aspx
Also, the standard for GPS coordinates is specified in ISO 6709:
http://en.wikipedia.org/wiki/ISO_6709
I know you have probably moved past this problem by now. I just wanted to add another approach to this question, in case someone is looking to store geolocation data.
You could encode the latitude and longitude information into a geohash. Geohashes are prefix-searchable to a required degree of precision, so it seems you can convert your query to a start and end prefix and do a prefix search with a LIKE query.
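A sketch of how the geohash approach might be queried, assuming a geohash column has been computed in the application (the prefix value is purely illustrative, and a search rectangle usually maps to a handful of such prefixes):

-- An ordinary B-tree index works here because LIKE 'prefix%' is a range scan.
CREATE INDEX idx_homes_geohash ON homes (geohash);

SELECT *
  FROM homes
 WHERE geohash LIKE 'dr5ru%';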
Related
Speaking with a friend of mine about DB structure, he says that for telephone numbers he creates integer attributes and casts them directly in the code that extracts data from the DB (adding the leading zero back to the number). Leaving aside that method, which could be questionable, I suggested he use a varchar field.
He says that using a varchar is less efficient because:
It takes more memory for information storage (and this is true)
It takes more "time" for ordering the field
I'm pretty confused, as I guess that an RDBMS, with all its optimizations, will do this sort in O(n log(n)) or something like that regardless of data type. Mining the internet for information, unfortunately, turned out to be useless.
Could someone help me understand whether what I'm saying here makes sense or not?
Thank you.
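For illustration (MySQL syntax, values made up), this is why the integer approach forces him to add the zero back in application code:

-- The leading zero of a phone number cannot survive an integer column.
SELECT CAST('0712345678' AS UNSIGNED);  -- returns 712345678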
RDBMSs use indexes to optimize ordering (sorting). Indexes are stored as B-trees on disk, and they grow larger as the indexed field(s) grow larger, so the number of disk IO operations increases with the size of the data. On the other hand, the O(n log(n)) is different for different types of data: the semantics of comparing strings and numerics (integers) are different, and comparing strings is more complicated than comparing integers.
I've worked with many databases over the last 20 years and have only run into this "interesting" type of implicit data conversion problem with SQL Server.
If I create a table with one smallint column, insert two rows with the values 1 and 2 into it, and then run the query "Select Avg(Column) From table", I get a truncated result instead of the 1.5 that I would get from pretty much any other DB on the planet, which would automatically upsize the datatype to store the entire result rather than truncating/rounding to the column's data type. Now I know I can cast my way around this for every possible scenario, but that's not a good dynamic solution, especially for data analytics with data analytic products... i.e. Cognos/MicroStrategy etc...
I am in data warehousing and have fact tables with millions of rows in them... I would love to store small columns and have proper aggregation results. My current approach to work around this nuance is to define the smallest quantifiable columns as Numeric(19,5) to account for all situations, even though these columns many times only store 1 or 0, for which a tinyint would be great but will not naturally aggregate well.
Is there not a directive that tells SQL Server to do what every other DB (Oracle/DB2/Informix/Access etc...) does? Which is to promote to a larger type, show the entire result and let me do what I want with it?
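To make the behaviour concrete, here is a minimal T-SQL repro (the temp table name is arbitrary):

-- AVG over an integer column returns an integer result in SQL Server.
CREATE TABLE #t (col SMALLINT);
INSERT INTO #t (col) VALUES (1);
INSERT INTO #t (col) VALUES (2);

SELECT AVG(col) FROM #t;   -- returns 1, not 1.5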
You could create views on the tables which cast the smallint or tinyint columns to float, and only publish these views to the users. This would keep the memory usage small. The conversion should be no overhead compared to other database systems, which must do the same thing if they use a different data type for aggregation.
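A minimal sketch of that idea, with hypothetical table and column names:

-- Publish a view that exposes the measure already cast to float.
CREATE VIEW dbo.fact_sales_v AS
SELECT sale_id,
       CAST(quantity AS FLOAT) AS quantity
  FROM dbo.fact_sales;

-- Aggregations against the view now keep the fractional part.
SELECT AVG(quantity) FROM dbo.fact_sales_v;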
While it might frustrate you, a lot of programming languages also behave this way with ints: 1 / 2 will spit out 0. See:
With c++ integers, does 1 divided by 2 reliably equal 0, and 3/2 = 1, 5/2 = 2 etc.?
It's a design quirk; it'd break a lot of things if they changed it. You're asking whether you can change a fairly fundamental way SQL Server behaves, and thus potentially break anyone else's code running on the server.
Simply put, no you can't.
And you're wrong that every other DB product promotes the type; Derby, for example, does the same thing as SQL Server:
http://docs.oracle.com/javadb/10.6.2.1/ref/rrefsqlj32693.html
In the Oracle docs they specifically warn you that AVG will return a float regardless of the original type. This is because every language has to make the choice, do I return the original type or the most precise answer? To stop overflows, a lot of languages chose the former to the constant frustration of programmers everywhere.
So in SQL Server, to get a float out, put a float in.
To the best of my knowledge, the fastest way would be to do an implicit cast: SELECT AVG(Field * 1.0). You could of course do an explicit cast the same way. As far as I know there is no way to tell SQL Server that you want integers converted to floats when you average them, and arguably that's actually correct behavior.
I first have to say that I really am a rookie in caching, so please do elaborate on any explanation and bear with me if my question is stupid.
I have a server with pretty limited resources, so I'm really interested in caching db-queries as effectively as I can. My issue is this:
I have a MySQL DB with a table for geolocations; it has lat and lng columns. I only indexed lat, since a query will always have both lat and lng and, to my understanding, only one index can be used effectively (?).
The queries vary a lot in their coordinates, like:
SELECT lat, lng
FROM geolocations
WHERE lat BETWEEN 123123123 AND 312412312
  AND lng BETWEEN 235124231 AND 34123124
where the long numbers that are the boundaries of the BETWEEN query are constantly changing. So IS there a smart way to cache this, so that the cache doesn't have to be a complete query match, but the values of previous BETWEEN queries can be held against a new one to save some DB resources?
I hope you get my question - if not please ask.
Thank you so much
Update 24/01/2011
Now that I've gotten some responses, I want to know what the most efficient way of querying would be:
1. Would the BETWEEN query with int values execute faster, or
2. would the radius calculation with point values execute faster?
If 1., what would the optimal index look like?
If your table is MyISAM you can use Point datatype (see this answer for more details)
If you are not willing or are not able to use spatial indexes, you should create two separate indexes:
CREATE INDEX ix_mytable_lat_lon ON mytable (lat, lon)
CREATE INDEX ix_mytable_lon_lat ON mytable (lon, lat)
In this case, MySQL can use an index merge intersection over these indexes, which is sometimes faster than mere filtering with a single index.
Even if it does not, it can pick a more selective index if there are two of those.
As for the caching, all pages read from the indexes are cached and reside in memory until they are overwritten with hotter data (if not all of the database fits into the cache).
This will save MySQL from having to read the data from disk.
MySQL is also able to cache whole result sets in memory; however, this requires the query to be repeated verbatim, with all parameters exactly the same.
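If in doubt, EXPLAIN will show which of the two indexes (or an index merge of both) the optimizer actually chose for a given query; for example, with the column names used above and made-up bounds:

EXPLAIN
SELECT lat, lon
  FROM mytable
 WHERE lat BETWEEN 10 AND 20
   AND lon BETWEEN 30 AND 40;
-- Check the key and Extra columns of the output to see whether a single
-- index or an index merge intersection was used.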
I think to do significantly better you'll need to characterize your data better. If you've got data that's uniformly distributed across longitude and latitude, with no correlation, and if your queries are similarly distributed and independent - you're stuck. But if your data or your queries cluster in interesting ways, you may find that you can introduce new columns that make at least some queries quicker. If most queries happen within some hard range, maybe you can set that data aside - add a flag, link it to some other table, even put the frequently-requested data into its own table. Can you tell us any more about the data?
Does the data type of a column make a difference to performance? And if so, why? I mean, is a tinyint faster to search than an int?
If so, what are the practical differences in performance?
Depending on the data types, yes, it does make a difference.
int vs. tinyint wouldn't make a noticeable difference in speed, but it would make a difference in data size. Assuming tinyint is 1 byte versus int being 4, that's 3 bytes saved every row. It adds up after a while.
Now, if it were int against varchar, then there would be a bit of a drop, as things like sorts would be much faster on integer values than on string values.
If it's a comparable type, and you're not very pressed for space, go with the one that's easier and more robust.
Theoretically, yes, a tinyint is faster than an int. But good database design and proper indexing has a far more substantial effect on performance, so I always use int for design simplicity.
I would venture that there are no practical performance differences in that case. Storage space is the more substantial factor, but even then, it's not much difference. The difference is perhaps 2 bytes? After 500,000 rows you've almost used an extra megabyte. Hopefully you aren't pinching megabytes if you are working with that much data.
Choosing the right data type can improve performance. In a lot of cases the practical difference might not be a lot but a bad choice can definitely have an impact. Imagine using a 1000 character char field instead of a varchar field when you are only going to be storing a string of a few characters. It's a bit of an extreme example but you would definitely be a lot better using a varchar. You would probably never notice a difference in performance between an int and a tinyint. Your overall database design (normalized tables, good indices, etc.) will have a far larger impact.
Of course, choosing the right datatypes always helps with faster execution.
Take a look at this article; it will surely help you out:
http://www.peachpit.com/articles/article.aspx?p=30885&seqNum=7
The performance consideration all depends on the scale of your model and usage. While the consideration for storage space in these modern times is almost a non-issue, you might need to think about performance:
Database engines tend to store data in chunks called pages. SQL Server has 8K pages, Oracle 2K and MySQL 16K by default (if I recall correctly) - not that big for any of these systems. Whenever you perform an operation on a bit of data (a field in a row), its entire page is fetched from the DB and put into memory. When your data is smaller (tinyint vs. int) you can fit more individual rows and data items into a page, so the likelihood of having to fetch more pages goes down and overall performance speeds up. So yes, using the smallest possible representation of your data will definitely have an impact on performance, because it allows the DB engine to be more efficient.
One way it can affect performance is by not requiring you to convert the data to the correct type in order to manipulate it. This is true when someone uses varchar, for instance, instead of a datetime datatype, and the values then have to be converted to do date math. It can also affect performance by giving a smaller record (which is why you shouldn't define everything at the max size), which affects how pages are stored and retrieved in the database.
Of course using the correct type of data can also help data integrity; you can't store a date that doesn't exist in a datetime field, but you can in a varchar field. If you use float instead of int then your values aren't restricted to integer values, etc. And speaking of float, it is generally bad to use if you intend to do math calculations, as you get rounding errors since it is not an exact type.
Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMSs use principles that were designed in the early '80s, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and keeps its NUMBER's as sets of centesimal digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2's and cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special thing, though: it was used for bulk loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing data and putting them into appropriate tables based on key name.
All values were loaded to the temporary VARCHAR2 column as is and then converted into NUMBER's and DATETIME's to be put into their columns.
Explicit data types are huge for efficiency, and storage. If they are implicit they have to be 'figured' out and therefore incur speed costs. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that having explicit types also on average incurs less storage space. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns, rather than having the "engine" automatically determine what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you; you specify it when you create the database - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be a numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things: data integrity (ensuring that no value 'A' and no value 1.5 can be entered in the column), consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), ...
Types take care of all those sorts of things for you.
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older family of databases also does this: multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type the data is. Searches are still (relatively) fast, and it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3" ?) or use some aggregate functions in reports (like sum() or avg()). Until then you are good with the text datatype :)
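A tiny illustration of the ordering point (MySQL syntax, table and column names made up):

CREATE TABLE items (price VARCHAR(10));
INSERT INTO items (price) VALUES ('200'), ('3');

SELECT price FROM items ORDER BY price;                     -- '200', '3' (lexicographic)
SELECT price FROM items ORDER BY CAST(price AS UNSIGNED);   -- 3, 200   (numeric)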
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though since I've never seen it implemented, I don't know what it would be like to use it. I can see that it would reduce the chance of errors in using values whose implementations happen to be the same when their conceptual domains are quite different. It might also help keep people from comparing cm and inches, for instance.
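For what it's worth, PostgreSQL ships a limited version of this idea with CREATE DOMAIN. A sketch (note that two domains over the same base type can still be compared, so this only approximates the distinct-type behaviour described in the book):

-- Domains attach a name and constraints to a base type.
CREATE DOMAIN height_cm AS NUMERIC(10,2) CHECK (VALUE > 0);
CREATE DOMAIN width_cm  AS NUMERIC(10,2) CHECK (VALUE > 0);

CREATE TABLE boxes (
    height height_cm,
    width  width_cm
);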
Constraint is perhaps the most important thing mentioned here. Data types exist for ensuring the correctness of your data, so you are sure you can manipulate it correctly. There are two ways we can store a date: in a date type, or as a string like "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893" or similar. Datatypes constrain that and define a canonical form for a date.
Furthermore, a datatype has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do it -- and you should!).
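To illustrate the MySQL point (the exact behaviour depends on the server's sql_mode, so treat this as a sketch with a made-up table):

CREATE TABLE events (d DATE);

-- With a permissive sql_mode, an impossible date is stored as zeros plus a warning:
INSERT INTO events (d) VALUES ('1983-02-30');   -- becomes 0000-00-00

-- With strict mode enabled, the same insert is rejected with an error:
SET sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO events (d) VALUES ('1983-02-30');   -- fails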
Data types will ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require column types to be defined so they can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some form of delimiter to retrieve the 5th column (if column widths were not fixed), the RDBMS can simply seek to the offset given by the combined size of columns 1-4 and read sizeOf(column5) bytes. Imagine how much quicker this would be on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of. Specify each column as a varchar(255) and decide what you want to do with it within the calling program. Or you can use a different database system that uses key-value pairs such as Redis.
A database is all about physical storage, and data types define this!