I have an oracle database (11g spatial) that includes a series of area polygons and water mains. I'm trying to attribute each of these mains to the area in which it is contained and for the most part this is straightforward enough (using the SDO_CONTAINS function) but I'm not sure how to deal with mains that straddle multiple polygons due to errors in digitisation.
In cases like this what I'd ideally like to do is attribute a main to an area polygon if the majority of it's length (>50%) is contained within onit. I know that I can use the SDO_RELATE function to determine every polygon that any given main interacts with, but I don't know how to then go about determining how much of it's length is contained within each area.
The principle is like this:
Correlate mains and areas. Assuming you have many mains and many areas, the most efficient approach is to use SDO_JOIN
For each couple (main/area) returned, compute their intersection (SDO_GEM.SDO_INTERSECTION) and measure the length of that intersection (SDO_GEOM.SDO_LENGTH).
From those results, retain the area for each main where the length is the maximum
If you want a full SQL example, allow me a bit of time to write that using sample data.
Related
I'm looking at the freely available Solar potential dataset on Google BigQuery that may be found here: https://bigquery.cloud.google.com/table/bigquery-public-data:sunroof_solar.solar_potential_by_censustract?pli=1&tab=schema
Each record on the table has the following border definitions:
lat_max - maximum latitude for that region
lat_min - minimum latitude for that region
lng_max - maximum longitude for that region
lng_min - minimum longitude for that region
Now I have a coordinate (lat/lng pair) and I would like to query to see whether or not that coordinate is within the above range. How do I do that with BQ Standard SQL?
I've seen the Geo Functions here: https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions
But I'm still not sure how to write this query.
Thanks!
Assuming the points are just latitude and longitude as numbers, why can't you just do a standard numerical comparison?
Note: The first link doesn't work without a google account, so I can't see the data.
But if you want to become spatial, I'd suggest you're going to need to take the border coordinates that you have and turn them into a polygon using one of: ST_MAKEPOLYGON, ST_GEOGFROMGEOJSON, or ST_GEOGFROMTEXT. Then create a point using the coords you wish to test ST_MAKEPOINT.
Now you have two geographies you can compare them both using ST_INTERSECTION or ST_DISJOINT depending on what outcome you want.
If you want to get fancy and see how far aware from the border you are (which I guess means more efficient?) you can use ST_DISTANCE.
Agree with Jonathan, just checking if each of the lat/lon value is within the bounds is simplest way to achieve it (unless there are any issues around antimeridian, but most likely you can just ignore them).
If you do want to use Geography objects for that, you can construct Geography objects for these rectangles, using
ST_MakePolygon(ST_MakeLine(
[ST_GeogPoint(lon_min, lat_min), ST_GeogPoint(lon_max, lat_min),
ST_GeogPoint(lon_max, lat_max), ST_GeogPoint(lon_min, lat_max),
ST_GeogPoint(lon_min, lat_min)]))
And then check if the point is within particular rectangle using
ST_Intersects(ST_GeogPoint(lon, lat), <polygon-above>)
But it will likely be slower and would not provide any benefit for this particular case.
I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream "Snug" my initial thought to this was to convert Sung and Snug to where each character equals a number then the sums would be the same, so even if they transverse a character, I'd be able to bucket these appropriately.
The other is where in one stream it comes in as "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these such that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets such that I can stop having so much noise in my primary task of finding where people are truly different people as opposed to a clerical error.
Any thoughts would be very appreciated.
A common measure for such distance is called Levenshtein distance (Wikipedia here). This measures the "edit" distance between two strings -- number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you there so much. One idea is to determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like. That is called soundex. You may be able to use the two together -- assuming your names are predominantly English (the soundex algorithm was invented by the US Census Bureau, so it would work best on names in America).
I have products with different details in different attributes and I need to develop an algorithm to find the most similar to the one I'm trying to find.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weight, etc. Then I need to do a search where the most similar return first. For example, if everything matches but the color is only Black and White but not Brown, it's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach, for example, I could do is restrict each attribute (weight, color, size) a limited set of option, so I can build a binary representation. So I have something like this for each product:
Colors Weight Height Condition
00011011000 10110110 10001100 01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: also if I could use different weights for each attribute, it would be awesome.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
What you want to do is a perfect use case for elasticsearch and other similar search oriented databases. I don't think you need to hack with bitmasks/etc.
You would typically maintain your primary data in your existing database (sql/cassandra/mongo/etc..anything works), and copy things that need searching to elasticsearch.
What are you talking about very similar to BK-trees. BK-tree constructs search tree with some metric associated with keys of this tree. Most common use of this tree is string corrections with Levenshtein or Damerau-Levenshtein distances. This is not static data structure, so it supports future insertions of elements.
When you search exact element (or insert element), you need to look through nodes of this tree and go to links with weight equal to distance between key of this node and your element. if you want to find similar objects, you need to go to several nodes simultaneously that supports your wishes of constrains of distances. (Maybe it can be even done with A* to fast finding one most similar object).
Simple example of BK-tree (from the second link)
BOOK
/ \
/(1) \(4)
/ \
BOOKS CAKE
/ / \
/(2) /(1) \(2)
/ | |
BOO CAPE CART
Your metric should be Hamming distance (count of differences between bit representations of two objects).
BUT! is it good to compare two integers as count of different bits in their representation? With Hamming distance HD(10000, 00000) == HD(10000, 10001). I.e. difference between numbers 16 and 0, and 16 and 17 is equal. Is it really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
Im trying to analyse data from cycle accidents in the UK to find statistical black spots. Here is the example of the data from another website. http://www.cycleinjury.co.uk/map
I am currently using SQLite to ~100k store lat / lon locations. I want to group nearby locations together. This task is called cluster analysis.
I would like simplify the dataset by ignoring isolated incidents and instead only showing the origin of clusters where more than one accident have taken place in a small area.
There are 3 problems I need to overcome.
Performance - How do I ensure finding nearby points is quick. Should I use SQLite's implementation of an R-Tree for example?
Chains - How do I avoid picking up chains of nearby points?
Density - How to take cycle population density into account? There is a far greater population density of cyclist in london then say Bristol, therefore there appears to be a greater number of backstops in London.
I would like to avoid 'chain' scenarios like this:
Instead I would like to find clusters:
London screenshot (I hand drew some clusters)...
Bristol screenshot - Much lower density - the same program ran over this area might not find any blackspots if relative density was not taken into account.
Any pointers would be great!
Well, your problem description reads exactly like the DBSCAN clustering algorithm (Wikipedia). It avoids chain effects in the sense that it requires them to be at least minPts objects.
As for the differences in densities across, that is what OPTICS (Wikipedia) is supposed do solve. You may need to use a different way of extracting clusters though.
Well, ok, maybe not 100% - you maybe want to have single hotspots, not areas that are "density connected". When thinking of an OPTICS plot, I figure you are only interested in small but deep valleys, not in large valleys. You could probably use the OPTICS plot an scan for local minima of "at least 10 accidents".
Update: Thanks for the pointer to the data set. It's really interesting. So I did not filter it down to cyclists, but right now I'm using all 1.2 million records with coordinates. I've fed them into ELKI for analysis, because it's really fast, and it actually can use the geodetic distance (i.e. on latitude and longitude) instead of Euclidean distance, to avoid bias. I've enabled the R*-tree index with STR bulk loading, because that is supposed to help to get the runtime down a lot. I'm running OPTICS with Xi=.1, epsilon=1 (km) and minPts=100 (looking for large clusters only). Runtime was around 11 Minutes, not too bad. The OPTICS plot of course would be 1.2 million pixels wide, so it's not really good for full visualization anymore. Given the huge threshold, it identified 18 clusters with 100-200 instances each. I'll try to visualize these clusters next. But definitely try a lower minPts for your experiments.
So here are the major clusters found:
51.690713 -0.045545 a crossing on A10 north of London just past M25
51.477804 -0.404462 "Waggoners Roundabout"
51.690713 -0.045545 "Halton Cross Roundabout" or the crossing south of it
51.436707 -0.499702 Fork of A30 and A308 Staines By-Pass
53.556186 -2.489059 M61 exit to A58, North-West of Manchester
55.170139 -1.532917 A189, North Seaton Roundabout
55.067229 -1.577334 A189 and A19, just south of this, a four lane roundabout.
51.570594 -0.096159 Manour House, Picadilly Line
53.477601 -1.152863 M18 and A1(M)
53.091369 -0.789684 A1, A17 and A46, a complex construct with roundabouts on both sides of A1.
52.949281 -0.97896 A52 and A46
50.659544 -1.15251 Isle of Wight, Sandown.
...
Note, these are just random points taken from the clusters. It may be sensible to compute e.g. cluster center and radius instead, but I didn't do that. I just wanted to get a glimpse of that data set, and it looks interesting.
Here are some screenshots, with minPts=50, epsilon=0.1, xi=0.02:
Notice that with OPTICS, clusters can be hierarchical. Here is a detail:
First, your example is quite misleading. You have two different sets of data, and you don't control the data. If it appears in a chain, then you will get a chain out.
This problem is not exactly suitable for a database. You'll have to write code or find a package that implements this algorithm on your platform.
There are many different clustering algorithms. One, k-means, is an iterative algorithm where you look for a fixed number of clusters. k-means requires a few complete scans of the data, and voila, you have your clusters. Indexes are not particularly helpful.
Another, which is usually appropriate on slightly smaller data sets, is hierarchical clustering -- you put the two closest things together, and then build the clusters. An index might be helpful here.
I recommend though that you peruse a site such as kdnuggets in order to see what software -- free and otherwise -- is available.
We have an application that has a database full of polygons (currently stored as points) that a .net app pulls out and checks if they overlap.
I occurred to me that it would be much nicer to convert these point arrays to polygon / polyline objects within the database and use sql to get a bool of weather they overlap or not.
I have seen different methods suggested to do this but non of the examples given were quite in-line with my needs.
I would be very happy to receive input from those kind enough to offer their experience.
Additional:
In response to questions: It is indeed 2D. and yes any crossover of the two is considered true. The polygons have n points and can be concave. The polygons will be saved as 1 per row (after data conversion task) as polygons (i.e. the polygon type .. it might be called something else spatial / geom my memory is not on my side right now)
You can use .STIntersection with .STAsText() to test for overlapping polygons. (I really hate the terminology Microsoft has used (or whoever set the standard terms). "Touching," in my mind, should be a test for whether or not two geometry/geography shapes overlap at all, not just share a border.)
Anyway....
If #RadiusGeom is a geometry representing a radius from a point, the following will return a list of any two polygons where an intersection (a geometry that represents the area where two geometries overlap) is not empty.
SELECT CT.ID AS CTID, CT.[Geom] AS CensusTractGeom
FROM CensusTracts CT
WHERE CT.[Geom].STIntersection(#RadiusGeom).STAsText() <> 'GEOMETRYCOLLECTION EMPTY'
If your geometry field is spatially indexed, this runs pretty quickly. I ran this on 66,000 US CT records in about 3 seconds. There may be a better way, but since no one else had an answer, this was my attempt at an answer for you. Hope it helps!
Calculate and store the bounding rectangle of each polygon in a set of new fields within the row which is associated with that polygon. (I assume you have one; if not, create one.) When your dotnet app has a polygon and is looking for overlapping polygons, it can fetch from the database only those polygons whose bounding rectangles overlap, using a relatively simple SQL SELECT statement. Those polygons should be relatively few, so this will be efficient. Then, your dotnet app can perform the finer polygon overlap calculations in order to determine which ones of those really overlap.
Okay, I got another idea, so I am posting it as a different answer. I think my previous answer with the bounding polygons probably has some merit on its own, even if it was to reduce the number of polygons fetched from the database by a small percentage, but this one is probably better.
MSSQL supports integration with the CLR since version 2005. This means that you can define your own data type in an assembly, register the assembly with MSSQL, and from that moment on MSSQL will be accepting your user-defined data type as a valid type for a column, and it will be invoking your assembly to perform operations with your user-defined data type.
An example article for this technique on the CodeProject: Creating User-Defined Data Types in SQL Server 2005
I have never used this mechanism, so I do not know details about it, but I presume that you should be able to either define a new operation on your data type, or perhaps overload some existing operation like "less-than", so that you can check if one polygon intersects another. This is likely to speed things up a lot.