I'm storing myriad attributes from user-uploaded image files in a single table. The basic structure of the table is like this:
attrib_id | image_id | attrib_name | attrib_value
Alongside attributes like TAG and CAPTION, I'm storing LATITUDE and LONGITUDE of the image's location in the same manner. All the columns are of type varchar.
I'm trying to query for images associated with locations within a given bounding box - the inputs are upper and lower latitude, start and end longitude. The output of the query should be a list of image_ids that have a row with name=LATITUDE and value BETWEEN upper and lower latitude, as well as the same for longitude.
Since all the values are strings, and in the same columns, I don't really know where to start with this one.
While I'm willing to consider restructuring the table, my intuition tells me there's a way to accomplish this in SQL.
My database is currently MySQL, but I'm likely to switch over to Postgres in the future, so I would prefer non-vendor-specific solutions.
You can do something like this:
SELECT lat.image_id
FROM Table lat
INNER JOIN Table lon ON lat.image_id = lon.image_id
WHERE lat.attrib_name = 'LATITUDE'
  AND CAST(lat.attrib_value AS DECIMAL(9,6)) BETWEEN lower_lat AND upper_lat
  AND lon.attrib_name = 'LONGITUDE'
  AND CAST(lon.attrib_value AS DECIMAL(9,6)) BETWEEN start_lng AND end_lng
BETWEEN is standard SQL and is supported by both MySQL and Postgres (note that it expects the lower bound first, so it's lower AND upper). The CAST matters too: since attrib_value is a varchar, comparing it directly would use string ordering, where '9' sorts after '10'; casting to a numeric type gives correct comparisons, and CAST(... AS DECIMAL) works in both MySQL and Postgres.
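If you do decide to restructure, a minimal sketch would be to pull the coordinates out of the attribute table into their own typed columns; table and column names here are illustrative, and the syntax works in both MySQL and Postgres:

-- illustrative restructuring: one typed row of coordinates per image
CREATE TABLE image_locations (
    image_id  INT PRIMARY KEY,
    latitude  DECIMAL(9,6) NOT NULL,
    longitude DECIMAL(9,6) NOT NULL
);
CREATE INDEX idx_image_locations_lat_lng ON image_locations (latitude, longitude);

-- bounding-box query; the four placeholders are your inputs
SELECT image_id
FROM image_locations
WHERE latitude  BETWEEN lower_lat AND upper_lat
  AND longitude BETWEEN start_lng AND end_lng;

This keeps the attribute table for TAG, CAPTION, etc., while letting the bounding-box query use a plain numeric index instead of casting strings on every row.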
I have a ROUTES table which has columns SOURCE_AIRPORT and DESTINATION_AIRPORT and describes a particular route that an airplane would take to get from one to the other.
I have an AIRPORTS table which has columns LATITUDE and LONGITUDE describing an airport's geographic position.
I can join the two tables using columns which they both share: SOURCE_AIRPORT_ID and DESTINATION_AIRPORT_ID in the routes table, and IATA in the airports table (a three-letter code representing an airport, such as LHR for London Heathrow).
My question is, how can I write an SQL query using all of this information to find, for example, the longest route out of a particular airport such as LHR?
I believe I have to join the two tables, and for every row in the routes table where the source airport is LHR, look at the destination airport's latitude and longitude, calculate how far away that is from LHR, save that as a field called "distance", and then order the data by the highest distance first. But in terms of SQL syntax I'm at a loss.
This would probably have been a better question for the Mathematics Stack Exchange, but I'll provide some insight here. If you are relatively familiar with trigonometry, I'm sure you could understand the implementation given this resource: https://en.m.wikipedia.org/wiki/Haversine_formula. You are looking to compute the distance between two points on the surface of a sphere in terms of the distance across its surface (not a straight line; you can't travel through the Earth).
The page gives the haversine formula:
d = 2r · arcsin( √( sin²((φ2 − φ1)/2) + cos(φ1) · cos(φ2) · sin²((λ2 − λ1)/2) ) )
Where
• d is the distance between the two points along the surface of the sphere,
• r is the radius of the sphere (for Earth, roughly 6371 km),
• φ1, φ2 are the latitude of point 1 and latitude of point 2 (in radians),
• λ1, λ2 are the longitude of point 1 and longitude of point 2 (in radians).

If your data is in degrees, you can simply convert to radians by multiplying by π/180.
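In SQL this translates fairly directly. A minimal sketch against the ROUTES/AIRPORTS schema from the question, assuming the coordinates are stored as numeric degrees and your database provides RADIANS, ASIN, POWER and the like (MySQL and Postgres both do):

-- haversine distance in km from LHR to each destination, longest first
SELECT r.DESTINATION_AIRPORT_ID,
       2 * 6371 * ASIN(SQRT(
           POWER(SIN(RADIANS(d.LATITUDE - s.LATITUDE) / 2), 2) +
           COS(RADIANS(s.LATITUDE)) * COS(RADIANS(d.LATITUDE)) *
           POWER(SIN(RADIANS(d.LONGITUDE - s.LONGITUDE) / 2), 2)
       )) AS distance_km
FROM ROUTES r
JOIN AIRPORTS s ON s.IATA = r.SOURCE_AIRPORT_ID
JOIN AIRPORTS d ON d.IATA = r.DESTINATION_AIRPORT_ID
WHERE r.SOURCE_AIRPORT_ID = 'LHR'
ORDER BY distance_km DESC;

The first row returned is the longest route out of LHR.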
There is a formula called the great-circle distance formula to calculate the distance between two points. You can probably load it as a library for your operating system. Forget the haversine; our planet is not a perfect sphere.
If you use this value often, save it in your routes table.
I think you're about 90% of the way there in terms of the solution method. I'll add some additional detail regarding a potential SQL query to get your answer. There are two steps you need to take to calculate the distances. Step 1 is to create a table joining the ROUTES table to the AIRPORTS table to get the latitude/longitude for both the SOURCE_AIRPORT and DESTINATION_AIRPORT on the route. This might look something like this:
SELECT t1.*,
       CONVERT(FLOAT, t2.LATITUDE) AS SOURCE_LAT,
       CONVERT(FLOAT, t2.LONGITUDE) AS SOURCE_LONG,
       CONVERT(FLOAT, t3.LATITUDE) AS DEST_LAT,
       CONVERT(FLOAT, t3.LONGITUDE) AS DEST_LONG,
       0.00 AS DISTANCE_CALC
INTO ROUTE_CALCULATIONS
FROM ROUTES t1
LEFT OUTER JOIN AIRPORTS t2 ON t1.SOURCE_AIRPORT_ID = t2.IATA
LEFT OUTER JOIN AIRPORTS t3 ON t1.DESTINATION_AIRPORT_ID = t3.IATA;
The resulting output should create a new table titled ROUTE_CALCULATIONS made up of all the ROUTES columns, the longitude/latitude for both the SOURCE and DESTINATION airports, and a placeholder DISTANCE_CALC column with a value of 0.
Step 2 is calculating the distance. This should be a relatively straightforward calculation and update.
UPDATE ROUTE_CALCULATIONS
-- 3961 is roughly Earth's radius in miles; use POWER() rather than ^,
-- which is bitwise XOR in SQL Server
SET DISTANCE_CALC = 2 * 3961 * ASIN(SQRT(
        POWER(SIN(RADIANS((DEST_LAT - SOURCE_LAT) / 2)), 2) +
        COS(RADIANS(SOURCE_LAT)) * COS(RADIANS(DEST_LAT)) *
        POWER(SIN(RADIANS((DEST_LONG - SOURCE_LONG) / 2)), 2)));
And that should give the calculated distance in miles in the DISTANCE_CALC column for all routes seen in the data. From there you should be able to do whatever distance-related route analysis you want.
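For the original question (the longest route out of LHR), a final ordering query on the new table might look like the sketch below; TOP is SQL Server syntax, matching the CONVERT/INTO above (use LIMIT 1 on MySQL/Postgres), and it assumes SOURCE_AIRPORT_ID holds the IATA code, as the joins above do:

SELECT TOP 1 *
FROM ROUTE_CALCULATIONS
WHERE SOURCE_AIRPORT_ID = 'LHR'
ORDER BY DISTANCE_CALC DESC;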
I have a db structure that is vanilla Postgres:
CREATE TABLE IF NOT EXISTS locations (
    name text NOT NULL,
    lat double precision NOT NULL,
    lng double precision NOT NULL
);
CREATE INDEX ON locations(lat,lng);
When I want to find all locations in a bounding box where I have the lower-left and upper-right corners, I use the following query:
SELECT * FROM locations
WHERE lat >= min_lat
  AND lat <= max_lat
  AND lng >= min_lng
  AND lng <= max_lng;
Now, I want to generate a bounding box given a point and use the bounding box result in the locations query. I'm using the following PostGIS query to generate a bounding box:
SELECT ST_Extent(
         ST_Envelope(
           ST_Rotate(
             ST_Buffer(
               ST_GeomFromText('POINT (-87.6297982 41.8781136)', 4326)::GEOGRAPHY,
               160934)::GEOMETRY,
             0)));
Result: BOX(-89.568160053866 40.4285062983089,-85.6903925527536 43.3273499289221)
However, I'm not sure how to use the bounding box from the PostGIS query in the vanilla lat/lng Postgres query in one call. Any ideas on how to merge the two? Preferably such that the index is preserved.
If you want to get the bbox coordinates as separate values, you might want to take a look at ST_XMax, ST_YMax, ST_XMin, ST_YMin. The following CTE, which embeds your query, should give you an idea:
WITH j (geom) AS (
  SELECT ST_Extent(ST_Envelope(
           ST_Rotate(ST_Buffer(
             ST_GeomFromText('POINT(-87.6297982 41.8781136)',4326)::GEOGRAPHY,160934)::GEOMETRY,0)))
)
SELECT
  ST_XMax(geom), ST_YMax(geom),
  ST_XMin(geom), ST_YMin(geom)
FROM j;
st_xmax | st_ymax | st_xmin | st_ymin
-------------------+-----------------+-------------------+------------------
-85.6903925527536 | 43.327349928921 | -89.5681600538661 | 40.4285062983098
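To merge the two in one call, you can keep that CTE and compare the plain lat/lng columns against the box corners, so the planner can still consider the btree index on locations(lat, lng). A sketch based on the same query:

WITH j (geom) AS (
  SELECT ST_Extent(ST_Envelope(
           ST_Rotate(ST_Buffer(
             ST_GeomFromText('POINT(-87.6297982 41.8781136)',4326)::GEOGRAPHY,160934)::GEOMETRY,0)))
)
SELECT l.*
FROM locations l, j
WHERE l.lat BETWEEN ST_YMin(j.geom) AND ST_YMax(j.geom)
  AND l.lng BETWEEN ST_XMin(j.geom) AND ST_XMax(j.geom);

Since j produces a single row, this behaves like plugging the four numbers into your original query.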
Side note: Storing geometry values as plain numbers might look straightforward, but it is hardly ever the better choice, especially when dealing with polygons! So I would really suggest storing these values as geometry or geography, which might seem complex at first glance but definitely pays off in the long run.
This answer might shed some light on distance/containment queries involving polygons: Getting all Buildings in range of 5 miles from specified coordinates
I'm trying to have a go at learning about PostgreSQL and, in particular, its PostGIS extension and the geographic/spatial features it provides. I've loaded a PostgreSQL DB with a table that contains 30,000 records of latitude, longitude and a price value (for houses), and I want to start querying the DB to return all the rows that fall within a radius of X km of a particular latitude and longitude.
I've hit a brick wall as to how I might run this type of query: the documentation I've found online is quite limited, and I've found no similar attempts at this kind of query.
Some methods I've tried:
SELECT *
FROM house_prices
WHERE ST_DWithin( ST_MakePoint(53.3348279,-6.269547099999954)) <= radius_mi *
1609.34;
This prompts the following error:
ERROR: function st_dwithin(geometry) does not exist
Another attempt:
SELECT * FROM house_prices ST_DWithin( 53.3348279, -6.269547099999954, 5); <-- A latitude value, longitude value and 5 miles radius
This prompts the following error:
ERROR: syntax error at or near "53.3348279"
Could anyone point me in the right direction/ know of some documentation I could look at?
** Edit **
[Screenshot: structure and setup of the database and table in pgAdmin 4]
The first query has an invalid number of parameters. The function ST_DWithin expects at least two geometries and a distance (in the units of the geometries' spatial reference system, or in meters for geography values),
plus an optional boolean parameter indicating the use of a spheroid (see the documentation).
The second query is missing a WHERE clause and has the same problem as the first query.
Example from documentation:
SELECT s.gid, s.school_name
FROM schools s
LEFT JOIN hospitals h ON ST_DWithin(s.the_geom, h.the_geom, 3000)
WHERE h.gid IS NULL;
Perhaps something like this would be what you want to achieve (casting to geography so the distance is in meters; 5 miles ≈ 5 * 1609.34 m):
SELECT *
FROM house_prices h
WHERE ST_DWithin(ST_MakePoint(-6.2695, 53.3348)::geography, h.geom::geography, 5 * 1609.34)
Also pay attention to the order of the coordinate pair (x,y, i.e. longitude first), otherwise you might easily land in the sea with these coordinates ;-)
EDIT: Taking into account that there is no geometry column on the table and the points are stored in two separate columns, longitude and latitude:
SELECT *
FROM house_prices
WHERE ST_DWithin(ST_MakePoint(longitude, latitude)::geography,
                 ST_MakePoint(-6.2695, 53.3348)::geography,
                 5 * 1609.34)
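If this query will run often, you can avoid scanning all 30,000 rows with an expression index over the constructed geography; a possible sketch (the index name is illustrative):

CREATE INDEX house_prices_geog_idx
    ON house_prices
 USING gist ((ST_MakePoint(longitude, latitude)::geography));

ST_DWithin can then use the index to narrow down candidate rows before computing exact distances.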
I would like to save millions of locations into a Cassandra column family and then run range queries on this data.
For example:
Attributes: LocationName, latitude, longitude
Query: SELECT LocationName FROM ColumnFamily WHERE latitude > 10 AND latitude < 20 AND longitude > 30 AND longitude < 40;
What structure and indexes should I use so the query will be efficient?
Depending on the granularity you need in your queries (and the variability of that granularity), one way to handle this would be to slice up your map into a grid, where all your locations belong inside a grid square with a defined lat/lon bounding box. You can then do your initial query for grid square IDs, followed by locations inside those squares, with a representation something like this:
GridSquareLat {
    key: [very_coarse_lat_value] {
        [square_lat_boundary]: [GridSquareIDList]
        [square_lat_boundary]: [GridSquareIDList]
    }
    ...
}
GridSquareLon {
    key: [very_coarse_lon_value] {
        [square_lon_boundary]: [GridSquareIDList]
        [square_lon_boundary]: [GridSquareIDList]
    }
    ...
}
Location {
    key: [locationID] {
        GridSquareID: [GridSquareID] <-- put a secondary index on this col
        Lat: [exact_lat]
        Lon: [exact_lon]
        ...
    }
    ...
}
You can then give Cassandra the GridSquareLat/Lon keys representing the very coarse grain lat/lon values, along with a column slice range that will reduce the columns returned to only those squares within your boundaries. You'll get two lists, one of grid square IDs for lat and one for lon. The intersection of these lists will be the grid squares in your range.
To get the locations in these squares, query the Location CF, filtering on GridSquareID (using a secondary index, which will be efficient as long as your total grid square count is reasonable). You now have a reasonably sized list of locations with only a few very efficient queries, and you can easily reduce them to your exact list inside your application.
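For the final step, if the Location column family were mapped to CQL3, the per-square lookup might look like this sketch (table and column names are hypothetical):

-- hypothetical CQL3 mapping of the Location column family
CREATE TABLE location (
    location_id    text PRIMARY KEY,
    grid_square_id text,
    lat            double,
    lon            double
);
CREATE INDEX ON location (grid_square_id);

-- fetch everything in one grid square, then do the exact
-- lat/lon filtering in the application
SELECT location_id, lat, lon
FROM location
WHERE grid_square_id = 'sq_41_-88';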
Let's pretend you are going to grow into the billions (I'll cover the millions case below). If you were using something like PlayOrm on Cassandra (or you can do this yourself instead of using PlayOrm), you would need to partition by something. Let's say you choose to partition by longitude, so that anything >= 20 and < 30 is in partition 20 and anything >= 30 and < 40 is in partition 30. Then in PlayOrm you use its scalable SQL to write the same query you wrote, but you need to query the proper partitions, which in some cases would be multiple partitions unless you limit your result set size...
In PlayOrm, or in your data model, it would look like this (no other tables needed):
Location {
    key: [locationID] {
        LonBottom: [partitionKey]
        Lat: [exact_lat] <- #NoSqlIndexed
        Lon: [exact_lon] <- #NoSqlIndexed
        ...
    }
    ...
}
That said, if you are in the millions, you would not need partitions, so just remove the LonBottom column above and do no partitioning... of course, why use NoSQL at all? Millions is not that big, and an RDBMS can easily handle it.
If you want to do it yourself, in the millions case there are two rows, for Lat and Lon (wide-row pattern), that hold the indexed values of lat and lon to query. In the billions case, it would be two rows per partition, as each partition gets its own index; you don't want indices that are too large.
An indexing row is simple to create. Its row key is the index name, and each column name is a compound name of the longitude plus the row key of the location. There is NO value for each column, just the compound name (so that each column name is unique).
so your row might look like
longindex = 32.rowkey1, 32.rowkey45, 32.rowkey56, 33.rowkey87, 33.rowkey89
where 32 and 33 are longitudes and the row keys point to the locations.
My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table-valued function which takes two parameters (ZipCode & Miles) and returns a list of neighboring zip codes (excluding the searched zip code itself).
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However that ran for over 20 minutes with no results so I canceled it. I did try hardcoding in the zipcode and it immediately returned a list
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keeping in mind while I used "04" as an example here, I want to run the analysis for 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function, you probably have to calculate every single possible distance for each row; that is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zip code to every other zip code (and update it only once a month or so, when we added new zip codes). Once the distances are precalculated, all you have to do is run a query like:
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
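Building the precalculated table is a one-off cross join. A sketch, assuming a hypothetical lookup table zips(zip, lat, lon) and the same haversine math used elsewhere in this thread (T-SQL, to match the CROSS APPLY in the question):

-- zips(zip, lat, lon) is a hypothetical coordinate lookup table
SELECT z1.zip AS zip1,
       z2.zip AS zip2,
       2 * 3961 * ASIN(SQRT(
           POWER(SIN(RADIANS((z2.lat - z1.lat) / 2)), 2) +
           COS(RADIANS(z1.lat)) * COS(RADIANS(z2.lat)) *
           POWER(SIN(RADIANS((z2.lon - z1.lon) / 2)), 2))) AS distance
INTO zipprecalc
FROM zips z1
CROSS JOIN zips z2
WHERE z1.zip <> z2.zip;

The cross join is big but runs only once; after that, the lookup above is a simple seek if you index zipprecalc(zip1, distance).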
We have something similar and optimized it by only calculating the distance for other zip codes whose latitude is within a bounded range. So if you want other zips within #miles, you use a predicate like:
where latitude >= #targetLat - (#miles/69.2) and latitude <= #targetLat + (#miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude, because the distance represented by a degree of longitude varies between the equator and the poles.
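Put together, a sketch of the bounded-range approach, assuming a hypothetical zips(zip, lat, lon) table and T-SQL variables in place of the # placeholders above:

DECLARE @targetLat float = 40.7128,
        @targetLon float = -74.0060,
        @miles     float = 7;

SELECT z.zip
FROM zips z
WHERE z.lat BETWEEN @targetLat - (@miles / 69.2)   -- cheap band filter first
              AND @targetLat + (@miles / 69.2)
  AND 2 * 3961 * ASIN(SQRT(                        -- exact distance on the survivors
          POWER(SIN(RADIANS((z.lat - @targetLat) / 2)), 2) +
          COS(RADIANS(@targetLat)) * COS(RADIANS(z.lat)) *
          POWER(SIN(RADIANS((z.lon - @targetLon) / 2)), 2))) <= @miles;

With an index on zips(lat), the band filter cuts the candidate set down before the trigonometry runs.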
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro-like) into a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...
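For illustration, an inline (single-statement) form of such a udf might look like the sketch below; the lookup table tbl_zip_coords(zipnum, lat, lon) is hypothetical, and the distance math is the same haversine used elsewhere in this thread:

-- inline TVF: a single RETURN (SELECT ...), no BEGIN/END or table variable
CREATE FUNCTION dbo.udf_GetLongLatDist (@zip varchar(10), @miles float)
RETURNS TABLE
AS RETURN
(
    SELECT z2.zipnum
    FROM tbl_zip_coords z1
    JOIN tbl_zip_coords z2 ON z2.zipnum <> z1.zipnum
    WHERE z1.zipnum = @zip
      AND 2 * 3961 * ASIN(SQRT(
              POWER(SIN(RADIANS((z2.lat - z1.lat) / 2)), 2) +
              COS(RADIANS(z1.lat)) * COS(RADIANS(z2.lat)) *
              POWER(SIN(RADIANS((z2.lon - z1.lon) / 2)), 2))) <= @miles
);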