I've been working with GEORADIUS and the related GEO commands, but I'm curious whether there's a way to get the 100 closest members from a geo key (which Redis stores as a sorted set).
For instance... I don’t know how many members are geographically close to a known member, so I don’t know how large of a radius to use with GEORADIUS in order to get 100 of the closest members.
How would I do this?
Redis' GEO support is limited. There's no such command, and you need to do it on the client side:
Use 1 meter as the starting radius and check whether there are enough items.
If there are enough items, take the N closest items from the result.
If there are NOT enough items, double the radius and go back to step 1.
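A rough sketch of that loop in Python with redis-py (the key name 'places' and the member name are assumptions; note that GEORADIUSBYMEMBER includes the queried member itself in its results):

    import redis

    def closest_members(r, key, member, n, start_radius_m=1.0):
        # Grow the radius until GEORADIUSBYMEMBER returns at least n other members,
        # then keep the n closest (sort='ASC' returns results ordered by distance).
        radius = start_radius_m
        while True:
            hits = r.georadiusbymember(key, member, radius, unit='m',
                                       withdist=True, sort='ASC')
            others = [(m, d) for m, d in hits if m != member]
            if len(others) >= n or radius > 20_037_509:  # past half Earth's circumference: whole globe covered
                return others[:n]
            radius *= 2

    r = redis.Redis(decode_responses=True)
    print(closest_members(r, 'places', 'my-member', 100))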
JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, while other databases support only one measurement (pointm). Many complex queries such as KNN, Voronoi polygons, and convex hulls are also supported, and it is open source. The KNN (k nearest neighbor) query can help you here. It also supports vector shapes such as linestrings and polygons in addition to raster shapes.
I'm looking at the freely available Solar potential dataset on Google BigQuery that may be found here: https://bigquery.cloud.google.com/table/bigquery-public-data:sunroof_solar.solar_potential_by_censustract?pli=1&tab=schema
Each record on the table has the following border definitions:
lat_max - maximum latitude for that region
lat_min - minimum latitude for that region
lng_max - maximum longitude for that region
lng_min - minimum longitude for that region
Now I have a coordinate (lat/lng pair) and I would like to query to see whether or not that coordinate is within the above range. How do I do that with BQ Standard SQL?
I've seen the Geo Functions here: https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions
But I'm still not sure how to write this query.
Thanks!
Assuming the points are just latitude and longitude as numbers, why can't you just do a standard numerical comparison?
Note: The first link doesn't work without a google account, so I can't see the data.
But if you want to go spatial, I'd suggest you're going to need to take the border coordinates that you have and turn them into a polygon using one of ST_MAKEPOLYGON, ST_GEOGFROMGEOJSON, or ST_GEOGFROMTEXT. Then create a point from the coordinates you wish to test using ST_GEOGPOINT.
Now that you have two geographies, you can compare them using ST_INTERSECTS or ST_DISJOINT depending on which outcome you want to test for.
If you want to get fancy and see how far away from the border you are, you can use ST_DISTANCE.
Agree with Jonathan, just checking whether each of the lat/lon values is within the bounds is the simplest way to achieve it (unless there are any issues around the antimeridian, but most likely you can just ignore them).
If you do want to use Geography objects for that, you can construct Geography objects for these rectangles, using
ST_MakePolygon(ST_MakeLine(
[ST_GeogPoint(lon_min, lat_min), ST_GeogPoint(lon_max, lat_min),
ST_GeogPoint(lon_max, lat_max), ST_GeogPoint(lon_min, lat_max),
ST_GeogPoint(lon_min, lat_min)]))
And then check if the point is within a particular rectangle using
ST_Intersects(ST_GeogPoint(lon, lat), <polygon-above>)
But it will likely be slower and would not provide any benefit for this particular case.
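For completeness, a sketch of the plain bounds check run from Python with the BigQuery client (the table and column names follow the question; the test coordinate is made up):

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT *
    FROM `bigquery-public-data.sunroof_solar.solar_potential_by_censustract`
    WHERE @lat BETWEEN lat_min AND lat_max
      AND @lng BETWEEN lng_min AND lng_max
    """

    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("lat", "FLOAT64", 37.7749),   # made-up test point
        bigquery.ScalarQueryParameter("lng", "FLOAT64", -122.4194),
    ])

    for row in client.query(sql, job_config=job_config).result():
        print(row)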
I have products with different details in different attributes, and I need to develop an algorithm to find the products most similar to the one I'm searching for.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weights, etc. I then need to do a search where the most similar products are returned first. For example, if everything matches but a product's colors are only Black and White (not Brown), it's a better match than another product that is only Black (not White or Brown).
I'm open to suggestions as the project is just starting.
One approach I could take, for example, is to restrict each attribute (weight, color, size) to a limited set of options, so I can build a binary representation. Then I have something like this for each product:
Colors Weight Height Condition
00011011000 10110110 10001100 01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
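For example, that XOR-and-popcount check is a one-liner in Python (the packed values below are made up):

    def hamming(a: int, b: int) -> int:
        # Number of attribute bits that differ between two products.
        return bin(a ^ b).count("1")

    # Colors + Weight + Height + Condition packed into one integer each (made-up values).
    query     = 0b00011011000_10110110_10001100_01
    candidate = 0b00011001000_10110110_10001100_01

    print(hamming(query, candidate))  # 1 -> very similar; 0 would be an exact match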
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
A further question: it would also be awesome if I could use different weights for each attribute.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
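For example, a rough sketch of such a distance metric with per-attribute weights plus a linear scan (the attribute names, weights, and normalizing constants are placeholders):

    WEIGHTS = {"weight": 1.0, "height": 1.0, "colors": 2.0, "condition": 0.5}

    def distance(a: dict, b: dict) -> float:
        # Weighted sum of per-attribute differences; smaller means more similar.
        d = WEIGHTS["weight"] * abs(a["weight"] - b["weight"]) / 100.0
        d += WEIGHTS["height"] * abs(a["height"] - b["height"]) / 10.0
        d += WEIGHTS["colors"] * len(set(a["colors"]) ^ set(b["colors"]))
        d += WEIGHTS["condition"] * (a["condition"] != b["condition"])
        return d

    def most_similar(query: dict, products: list, k: int = 10) -> list:
        # Linear scan: compute the distance to every product and keep the k best.
        return sorted(products, key=lambda p: distance(query, p))[:k]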
What you want to do is a perfect use case for elasticsearch and other similar search oriented databases. I don't think you need to hack with bitmasks/etc.
You would typically maintain your primary data in your existing database (sql/cassandra/mongo/etc..anything works), and copy things that need searching to elasticsearch.
What you are talking about is very similar to BK-trees. A BK-tree is a search tree whose edges are weighted by some metric over its keys. Its most common use is string correction with Levenshtein or Damerau-Levenshtein distances. It is not a static data structure, so it supports future insertions of elements.
When you search for an exact element (or insert one), you walk the nodes of the tree, following the edge whose weight equals the distance between the node's key and your element. If you want to find similar objects, you follow every edge whose weight satisfies your distance constraint, which may mean descending into several subtrees simultaneously. (Perhaps it could even be done with A* to quickly find the single most similar object.)
A simple example of a BK-tree (from the second link below):
BOOK
/ \
/(1) \(4)
/ \
BOOKS CAKE
/ / \
/(2) /(1) \(2)
/ | |
BOO CAPE CART
Your metric should be Hamming distance (the number of differing bits between the binary representations of two objects).
BUT! Is it really a good idea to compare two integers by counting the bits that differ in their representations? With Hamming distance, HD(10000, 00000) == HD(10000, 10001), i.e. the distance between the numbers 16 and 0 is the same as between 16 and 17. Is that really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
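A minimal BK-tree sketch in Python using Hamming distance on the bitmask encodings from the question (the node layout and names are my own):

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    class BKTree:
        def __init__(self, distance=hamming):
            self.distance = distance
            self.root = None            # node = (key, {edge_weight: child_node})

        def add(self, key: int) -> None:
            if self.root is None:
                self.root = (key, {})
                return
            node = self.root
            while True:
                d = self.distance(key, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = (key, {})
                    return
                node = child

        def search(self, key: int, max_dist: int):
            # Return all stored keys within max_dist of `key`, sorted by distance.
            results, stack = [], [self.root] if self.root else []
            while stack:
                node_key, children = stack.pop()
                d = self.distance(key, node_key)
                if d <= max_dist:
                    results.append((d, node_key))
                # Triangle inequality: only edges in [d - max_dist, d + max_dist] can match.
                for w, child in children.items():
                    if d - max_dist <= w <= d + max_dist:
                        stack.append(child)
            return sorted(results)

    tree = BKTree()
    for p in (0b0001101, 0b0001100, 0b1111100, 0b0000000):
        tree.add(p)
    print(tree.search(0b0001101, max_dist=1))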
I'm trying to analyse data on cycle accidents in the UK to find statistical black spots. Here is an example of the data from another website: http://www.cycleinjury.co.uk/map
I am currently using SQLite to store ~100k lat/lon locations. I want to group nearby locations together. This task is called cluster analysis.
I would like to simplify the dataset by ignoring isolated incidents and instead only showing the origin of clusters where more than one accident has taken place in a small area.
There are 3 problems I need to overcome.
Performance - How do I ensure that finding nearby points is quick? Should I use SQLite's implementation of an R-tree, for example?
Chains - How do I avoid picking up chains of nearby points?
Density - How do I take cyclist population density into account? There is a far greater density of cyclists in London than, say, Bristol, so there appear to be a greater number of black spots in London.
I would like to avoid 'chain' scenarios like this:
Instead I would like to find clusters:
London screenshot (I hand drew some clusters)...
Bristol screenshot - Much lower density - the same program ran over this area might not find any blackspots if relative density was not taken into account.
Any pointers would be great!
Well, your problem description reads exactly like the DBSCAN clustering algorithm (Wikipedia). It avoids chain effects in the sense that it requires them to consist of at least minPts objects.
As for the differences in density across regions, that is what OPTICS (Wikipedia) is supposed to solve. You may need to use a different way of extracting clusters, though.
Well, ok, maybe not 100% - you probably want single hotspots, not areas that are "density connected". Thinking of an OPTICS plot, I figure you are only interested in small but deep valleys, not in large valleys. You could probably use the OPTICS plot and scan for local minima of "at least 10 accidents".
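If you don't want to implement DBSCAN yourself, here is a sketch with scikit-learn using a haversine metric so latitude/longitude are handled properly (the eps and min_samples values are placeholders you would tune):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # accidents: array of (lat, lon) in degrees; a made-up example here.
    accidents = np.array([[51.5074, -0.1278], [51.5075, -0.1279], [51.4545, -2.5879]])

    earth_radius_m = 6_371_000
    eps_m = 50                                 # cluster radius of ~50 metres (placeholder)
    db = DBSCAN(eps=eps_m / earth_radius_m,    # haversine works in radians
                min_samples=2,                 # placeholder; raise it to ignore isolated incidents
                metric="haversine")
    labels = db.fit_predict(np.radians(accidents))
    print(labels)                              # -1 marks noise, i.e. isolated incidents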
Update: Thanks for the pointer to the data set. It's really interesting. So I did not filter it down to cyclists, but right now I'm using all 1.2 million records with coordinates. I've fed them into ELKI for analysis, because it's really fast, and it can actually use the geodetic distance (i.e. on latitude and longitude) instead of Euclidean distance, to avoid bias. I've enabled the R*-tree index with STR bulk loading, because that is supposed to help get the runtime down a lot. I'm running OPTICS with Xi=.1, epsilon=1 (km) and minPts=100 (looking for large clusters only). Runtime was around 11 minutes, not too bad. The OPTICS plot of course would be 1.2 million pixels wide, so it's not really good for full visualization anymore. Given the huge threshold, it identified 18 clusters with 100-200 instances each. I'll try to visualize these clusters next. But definitely try a lower minPts for your experiments.
So here are the major clusters found:
51.690713 -0.045545 a crossing on A10 north of London just past M25
51.477804 -0.404462 "Waggoners Roundabout"
51.690713 -0.045545 "Halton Cross Roundabout" or the crossing south of it
51.436707 -0.499702 Fork of A30 and A308 Staines By-Pass
53.556186 -2.489059 M61 exit to A58, North-West of Manchester
55.170139 -1.532917 A189, North Seaton Roundabout
55.067229 -1.577334 A189 and A19, just south of this, a four lane roundabout.
51.570594 -0.096159 Manor House, Piccadilly Line
53.477601 -1.152863 M18 and A1(M)
53.091369 -0.789684 A1, A17 and A46, a complex construct with roundabouts on both sides of A1.
52.949281 -0.97896 A52 and A46
50.659544 -1.15251 Isle of Wight, Sandown.
...
Note, these are just random points taken from the clusters. It may be sensible to compute e.g. cluster center and radius instead, but I didn't do that. I just wanted to get a glimpse of that data set, and it looks interesting.
Here are some screenshots, with minPts=50, epsilon=0.1, xi=0.02:
Notice that with OPTICS, clusters can be hierarchical. Here is a detail:
First, your example is quite misleading. You have two different sets of data, and you don't control the data. If the data comes in a chain, then you will get a chain out.
This problem is not exactly suitable for a database. You'll have to write code or find a package that implements this algorithm on your platform.
There are many different clustering algorithms. One, k-means, is an iterative algorithm where you look for a fixed number of clusters. k-means requires a few complete scans of the data, and voila, you have your clusters. Indexes are not particularly helpful.
Another, which is usually appropriate on slightly smaller data sets, is hierarchical clustering -- you put the two closest things together, and then build the clusters. An index might be helpful here.
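As a sketch, both approaches are only a few lines with scikit-learn (the sample data and cluster counts below are placeholders, and plain Euclidean distance is used for brevity):

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    points = np.random.rand(1000, 2)          # made-up 2-D points standing in for (lat, lon)

    kmeans_labels = KMeans(n_clusters=20, n_init=10).fit_predict(points)
    hier_labels = AgglomerativeClustering(n_clusters=20).fit_predict(points)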
I recommend though that you peruse a site such as kdnuggets in order to see what software -- free and otherwise -- is available.
Does anyone know of a database structure such as this http://www.maxmind.com/app/geolitecity that is optimized for super-fast retrieval of longitude and latitude based on either a ZIP code or (City, State, Country) parameters?
MaxMind's database does not support any retrieval other than by IP, at least not to my knowledge. So if you know how to do this, preferably in Java, I'm all ears.
This should not be an SQL-type database, a CSV file, or a Google API solution. Those are just too slow, especially if you want to offer search results sorted by distance.
Paid solutions are also an option; the data structure doesn't have to be free.
I don't believe there is such a thing as a "fast" way to do this. I've built a geocoding API for Canadian postal codes, and the way we search is to keep two indexes of postal codes - one sorted by latitude and one sorted by longitude. You can do some spherical geometry and develop a bounding "box" that fits everything in a given radius, but you still have to go back and do a point-to-point distance measurement using Vincenty or Haversine or your algorithm of choice for the distance between your origin and each postal code you find.
With a world-wide database, your math gets complicated by the fact that you can cross meridians and the equator.
You'll want some kind of encoding scheme that lets you work in radians, since that is what most distance calculation heuristics require.
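A rough sketch of that bounding-box-then-precise-distance search in Python (the function names and constants are mine, and the box deliberately ignores meridian wrap-around for brevity):

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two points, in kilometres.
        r = 6371.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def bounding_box(lat, lon, radius_km):
        # Rough box that contains the radius; widens with latitude.
        dlat = math.degrees(radius_km / 6371.0)
        dlon = dlat / max(math.cos(math.radians(lat)), 1e-9)
        return lat - dlat, lat + dlat, lon - dlon, lon + dlon

    def within_radius(origin, candidates, radius_km):
        # Cheap rectangle pre-filter, then exact haversine check, sorted by distance.
        lat, lon = origin
        lat_lo, lat_hi, lon_lo, lon_hi = bounding_box(lat, lon, radius_km)
        hits = []
        for la, lo in candidates:
            if lat_lo <= la <= lat_hi and lon_lo <= lo <= lon_hi:
                d = haversine_km(lat, lon, la, lo)
                if d <= radius_km:
                    hits.append((d, (la, lo)))
        return sorted(hits)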
This can be done very quickly with any database engine that supports two-dimensional indexes (and MySQL supports multi-column indexes, as far as I know). It's simple: you use a 2-D index to limit your result set to a reasonable size extremely quickly, then you examine the result set with a high-precision calculation algorithm if you need to. Not hard, except that you may need to OR two lists together if they cross the 180/-180 longitude line.
Making a 2-D index is simple: index (latitude, longitude). That index only works for lookups on latitude or on (latitude, longitude) pairs; it won't work on longitude alone. If you want lookups by longitude on its own, add an additional index (longitude). I select out a rough bounding square and trim the corners if I care about them.
If you have a ZIP or city to start with, ZIP codes are just a 1-D index - no problem making that fast, just use index (zip). If your hard drive is too slow, get a solid-state drive to eliminate the seek times, or use a lot of RAM and cache the whole table. This is not a hard problem either way you want to go.
If that's not fast enough for you, using someone else's service won't help, because you'd add network overhead. You would have to hold your data directly in RAM/SSD and build your own 2-D/1-D indexing system if you need it (not hard). That route could probably beat SQL by a factor of 10 or so, because the SQL engine has a lot of overhead. I suppose someone might offer a service that runs on your own machine, but realistically that wouldn't beat SQL by very far either, because you still have to go through a bunch of hoops to make the request to their service. SQL with 2-D indexes and a solid-state drive will be damned fast; you shouldn't need to process the data yourself unless you are the post office, sorting 10,000 pieces of mail per second with one machine serving the data. Then you'll have to write your own data management routines.
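A minimal sketch of that plain-database route with SQLite from Python (the table and column names are made up, and the bounding-box values are placeholders):

    import sqlite3

    conn = sqlite3.connect("geo.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS places (zip TEXT, city TEXT, lat REAL, lon REAL);
    CREATE INDEX IF NOT EXISTS idx_places_latlon ON places (lat, lon);
    CREATE INDEX IF NOT EXISTS idx_places_zip ON places (zip);
    """)

    # Rough bounding-box pre-filter; do the precise distance math on the survivors.
    rows = conn.execute(
        "SELECT zip, lat, lon FROM places "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (51.0, 52.0, -1.0, 1.0),
    ).fetchall()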
Recently I was faced with this interview question (a K-means clustering design problem). The design I came up with did not meet the interviewer's expectations (to put it simply, I didn't get the job because I lost to another candidate on this design problem). I am wondering how many different/efficient/simple solutions the SO community can come up with (by doing this I am hoping to hone my skills):
To implement a simple algorithm to cluster people according to their weight and height. The
data set includes a list of people with their weights and heights like so:
Person      Weight (kg)   Height (inches)
Person 1    70            70
Person 2    75            80
Person 3    120           85
You can plot the data in two dimensions, with weight as one dimension and height as the other. Weight can range from a minimum of 50 kg to 150 kg. Height can range from a minimum of 38 inches to 90 inches.
Algorithm:
The algorithm (called K-means clustering), which clusters the data into K groups, goes as follows:
Start with K clusters. Each cluster is defined by its center point, which starts off as a random weight and a random height. Pick random numbers from within the corresponding ranges defined above.
For each person
Calculate distance to center of each cluster using formula
distance = sqrt(pow((wperson − wcenter), 2) + pow((hperson − hcenter), 2))
where wperson = weight of person,
hperson = height of person
wcenter = weight of cluster center point,
hcenter = height of cluster center point
Assign the person to the cluster with the shortest distance to center point of cluster
After end of step 2, you will end up with K clusters each assigned with a set of people
For each cluster, set the weight and height of the center point to the average of the
people in the cluster
wcenter = (sum of weight of each person in cluster)/(number of people in cluster)
hcenter = (sum of height of each person in cluster)/(number of people in cluster)
Repeat steps 2 to 5 for 1000 iterations, then print out the following information for each cluster.
weight and height of center of cluster.
list of people in cluster.
I am not looking for an implementation/solution but for a high-level design. Can you list the interfaces/classes, etc.?
I don't want to give my own solution now, but I will post it later in the day.
This is my attempt at the design. I only show the static diagram since the algorithm is pretty much laid out already. I plan to have a visitor for the representation of the clusters, which could allow different types of output (XML, strings, CSV, etc.). Maybe the visitor is overkill; if so, I'd just have something like a ToString method that could be overridden.
Note: the Cluster creates a CenterClusterItem in the SetCenter and FindNewCenter methods. The CenterClusterItem is not a PersonClusterItem; it just holds the same number of AClusterValues as a PersonClusterItem would (since the average isn't really a person).
Also, I forgot to make a method on the KCluster to begin the process, but that's implied.
Class Diagram http://img11.imageshack.us/img11/499/kcluster.png
Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:
instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)
don't restrict yourself to 2-dim data, use vectors
let the user define the number of clusters to be found
Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).
On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.
You can create the following classes:
Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.
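A rough Python skeleton following that breakdown (the class and method names mirror the answer; the bodies only restate the steps from the question and are a sketch, not a definitive implementation):

    import math
    import random

    class Person:
        def __init__(self, id, weight, height):
            self.id, self.weight, self.height = id, weight, height

        def calculate_distance(self, other):
            # Euclidean distance in (weight, height) space, as in the question.
            return math.sqrt((self.weight - other.weight) ** 2 +
                             (self.height - other.height) ** 2)

    class Cluster:
        def __init__(self, center):
            self.center = center          # a Person-like point (not a real person)
            self.people = []

        def calculate_center(self):
            if not self.people:
                return                    # keep the old center if the cluster is empty
            self.center = Person(None,
                                 sum(p.weight for p in self.people) / len(self.people),
                                 sum(p.height for p in self.people) / len(self.people))

    class KCluster:
        def __init__(self, k, people):
            self.people = people
            # Random centers drawn from the ranges given in the question (50-150 kg, 38-90 in).
            self.clusters = [Cluster(Person(None, random.uniform(50, 150),
                                            random.uniform(38, 90))) for _ in range(k)]

        def generate_clusters(self, iterations=1000):
            for _ in range(iterations):
                for c in self.clusters:
                    c.people = []
                for p in self.people:
                    min(self.clusters,
                        key=lambda c: p.calculate_distance(c.center)).people.append(p)
                for c in self.clusters:
                    c.calculate_center()
            return self.clusters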
I'm not sure what your question actually is; the steps you point out effectively define the algorithm you're talking about.
A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.
That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.
One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.
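In code, that stopping criterion might look like the compact, functional sketch below (standalone rather than tied to any of the class designs above; points and centers are (weight, height) tuples):

    def kmeans_until_stable(points, centers, max_iterations=1000):
        # Stop as soon as assignments stop changing, instead of always running 1000 iterations.
        previous = None
        for _ in range(max_iterations):
            labels = [min(range(len(centers)),
                          key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                        (p[1] - centers[i][1]) ** 2)
                      for p in points]
            if labels == previous:
                break
            previous = labels
            for i in range(len(centers)):
                members = [p for p, l in zip(points, labels) if l == i]
                if members:
                    centers[i] = (sum(m[0] for m in members) / len(members),
                                  sum(m[1] for m in members) / len(members))
        return labels, centers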