How would you separate the two structures from the cluster of points?

I have a bunch of points in x,y that correspond to some physical processes. My goal is to extract and group the points based on the event/process they correspond to. The attached image shows an example of what the data looks like. By inspection you can clearly make out at least 2 curves that correspond to the processes I want. The data itself has a lot of noise and some false positive events.
I have already played around with DBSCAN and it doesn't quite work the way I want: because of the tight packing and the intersection, it either groups the curves together or produces small, broken-up groups.
Any help would be appreciated.
Image of cluster of points

Related

Appropriate data structure for storing data read from a large text file

I need to read data from a very large (~a million entries) text file and am trying to decide which data structure is most appropriate. Each entry in the file contains two integers that represent an edge in a directed graph (the tail and the head vertices), and the vast majority of vertices have at least one outgoing edge. My "naive" solution is to use a vector of vectors, so if the tail vertex was 1 and the head vertex was 2 I'd just do something like graph[1].push_back(2) to read in the entry "1 2". Once the graph is read in I'll be using Kosaraju's algorithm to compute the strongly-connected components, so I figure it will be handy to be able to access each element via the [] operator in constant time.
What are the "typical" choices in terms of data structures in a situation like this? Also, assuming the vector of vectors idea is a bad one, why is it bad? I guess the fact that the vector will need to resize itself will slow things down, but the number of edges/vertices isn't known until runtime so I'm not sure of a way around that.
Thanks
Do you know the number of vertices?
A vector of vectors isn't as bad an idea as you think, because you can resize the outer vector before reading the edges. That avoids copying the whole graph as it grows.
As far as I know, a vector of vectors is a good structure for a graph. It is often used in computer science olympiads.
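For illustration, here is a minimal sketch of the same adjacency-list idea in Python (a list of lists standing in for the C++ vector of vectors). The function names `parse_edges` and `build_adjacency` are made up for this example, and it assumes vertices are non-negative integers so they can be used directly as indices. If the vertex count is not known in advance, it is taken from the largest vertex id seen.

```python
def parse_edges(lines):
    """Yield (tail, head) pairs from lines such as "1 2"; other lines are skipped."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield int(parts[0]), int(parts[1])


def build_adjacency(edges, num_vertices=None):
    """Build graph[tail] -> list of heads, sizing the outer list once up front."""
    edges = list(edges)
    if num_vertices is None:
        # Assumption: vertex ids are small non-negative integers.
        num_vertices = 1 + max((max(t, h) for t, h in edges), default=-1)
    graph = [[] for _ in range(num_vertices)]
    for tail, head in edges:
        graph[tail].append(head)  # same role as graph[tail].push_back(head)
    return graph


graph = build_adjacency(parse_edges(["1 2", "2 3", "3 1", "3 4"]))
```

Indexing `graph[v]` is constant time, which is what the component search needs; the inner lists grow independently, so no edge is ever copied because some other vertex gained a neighbour.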

Optimize pattern of rotating holes for all combinations

Sort of a programming question, sort of a general logic question. Imagine a circular base with a pattern of circles:
And another circle, mounted above and able to rotate, with holes that expose the colored circles below:
There must be an optimal pattern of either the colored circles or the openings (or both) that will allow for all N possible combinations of colors... but I have no idea how to attack the problem! At this point, combinations of 2 seem probably the easiest and would be fine as a starting point (red/blue, red/green, red/white, etc).
I would imagine there will need to be gaps in the colors, unlike the example above. Any suggestions welcome!
Edit: clarified the question (hopefully!) thanks to feedback from Robert Harvey
For two holes, you could look for a perfect matching in a bipartite graph, each permutation described by two nodes, one in each partition. Nodes would be connected if they share one element, i.e. the (blue,red) node from the first partition connected to the (red,green) node of the second. The circles arranged in the same distance would allow for both of these patterns. A perfect matching in that graph would correspond to chains or cycles of permutations where two of them always share a single color. A bit like dominoes. If you had a set of cycles of the same length, you could interleave them to form the pattern on the lower disk. I'm not sure how easy it will be to obtain these same length cycles, though, and I also don't know how to generalize this to more than two elements in each permutation.

Is there a common/standard/accepted way to model GPS entities (waypoints, tracks)?

This question somewhat overlaps knowledge on geospatial information systems, but I think it belongs here rather than GIS.StackExchange.
There are a lot of applications around that deal with GPS data with very similar objects, most of them defined by the GPX standard. These objects would be collections of routes, tracks, waypoints, and so on. Some important programs, like GoogleMaps, serialize more or less the same entities in KML format. There are a lot of other mapping applications online (ridewithgps, strava, runkeeper, to name a few) which treat this kind of data in a different way, yet allow for more or less equivalent "operations" with the data. Examples of these operations are:
Direct manipulation of tracks/trackpoints with the mouse (including drawing over a map);
Merging and splitting based on time and/or distance;
Replacing GPS-collected elevation with DEM/SRTM elevation;
Calculating properties of part of a track (total ascent, average speed, distance, time elapsed);
There are some small libraries (like GpxPy) that try to model these objects AND THEIR METHODS, in a way that would ideally allow for an encapsulated, possibly language-independent Library/API.
The fact is: this problem has been around long enough to allow for a "common accepted standard" to emerge, hasn't it? On the other hand, most GIS software is very professionally oriented towards geospatial analyses, topographic and cartographic applications, while the typical trip-logging and trip-planning applications seem to be more consumer-hobbyist oriented, which might explain the rather scattered way the different projects/apps treat and model the problem.
Thus, considering everything said, the question is: Is there, at present or being planned, a standard way to model canonically, in an object-oriented way, the most used GPS/tracklog entities and their canonical attributes and methods?
There is the GPX schema and it is very close to what I imagine, but it only contains objects and attributes, not methods.
Any information will be very much appreciated, thanks!!
As far as I know, there is no standard library, interface, or even set of established best practices when it comes to storing/manipulating/processing "route" data. We have put a lot of effort into these problems at Ride with GPS and I know the same could be said by the other sites that solve related problems. I wish there was a standard, and would love to work with someone on one.
GPX is OK and appears to be a sort-of standard... at least until you start processing GPX files and discover everyone has simultaneously added their own custom extensions to the format to deal with data like heart rate, cadence, power, etc. Also, there isn't a standard way of associating a route point with a track point. Your "bread crumb trail" of the route is represented as a series of trkpt elements, and course points (e.g. "turn left onto 4th street") are represented in a separate series of rtept elements. Ideally you want to associate a given course point with a specific track point, rather than just giving the course point a latitude and longitude. If your path does several loops over the same streets, it can introduce some ambiguity in where the course points should be attached along the route.
KML and Garmin's TCX format are similar to GPX, with their own pros and cons. In the end these formats really only serve the purpose of transferring the data between programs. They do not address the issue of how to represent the data in your program, or what type of operations can be performed on the data.
We store our track data as an array of objects, with keys corresponding to different attributes such as latitude, longitude, elevation, time from start, distance from start, speed, heart rate, etc. Additionally we store some metadata along the route to specify details about each section. When parsing our array of track points, we use this metadata to split a Route into a series of Segments. Segments can be split, joined, removed, attached, reversed, etc. They also encapsulate the method of trackpoint generation, whether that is by interpolating points along a straight line, or requesting a path representing directions between the endpoints. These methods allow a reasonably straightforward implementation of drag/drop editing and other common manipulations. The Route object can be used to handle operations involving multiple segments. One example is if you have a route composed of segments - some driving directions, straight lines, walking directions, whatever - and want to reverse the route. You can ask each segment to reverse itself, maintaining its settings in the process. At a higher level we use a Map class to wire up the interface, dispatch commands to the Route(s), and keep a series of snapshots or transition functions updated properly for sensible undo/redo support.
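To make the Route/Segment split a bit more concrete, here is a rough sketch in Python. It is not the Ride with GPS implementation; the class layout, the `kind` field and the method names are assumptions chosen for illustration, and track points are reduced to plain `(lat, lon)` tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (lat, lon); real track points carry more attributes


@dataclass
class Segment:
    points: List[Point]
    kind: str = "straight"  # hypothetical setting, e.g. "straight", "driving", "walking"

    def reversed(self) -> "Segment":
        """Return a reversed copy that keeps this segment's settings."""
        return Segment(list(reversed(self.points)), self.kind)


@dataclass
class Route:
    segments: List[Segment] = field(default_factory=list)

    def reversed(self) -> "Route":
        """Reverse the route by asking each segment to reverse itself."""
        return Route([seg.reversed() for seg in reversed(self.segments)])

    def points(self) -> List[Point]:
        """Flatten to one point list, dropping consecutive duplicate points."""
        out: List[Point] = []
        for seg in self.segments:
            for p in seg.points:
                if not out or out[-1] != p:
                    out.append(p)
        return out


route = Route([
    Segment([(0.0, 0.0), (0.0, 1.0)], kind="driving"),
    Segment([(0.0, 1.0), (1.0, 1.0)], kind="straight"),
])
back = route.reversed()
```

Each segment keeps its own `kind` through the reversal, which is the property described above: the route only decides the order, and the segments handle their own points.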
Route manipulation and generation is one of the goals. The others are aggregating summary statistics and structuring the data for efficient visualization/interaction. These problems have been solved to some degree by any system that will take in data and produce a line graph. Not exactly new territory here. One interesting characteristic of route data is that you will often have two variables to choose from for your x-axis: time from start, and distance from start. Both are monotonically increasing, and both offer useful but different interpretations of the data. Looking at a graph of elevation with an x-axis of distance will show a bike ride going up and down a hill as symmetrical. Using an x-axis of time, the uphill portion is considerably wider. This isn't just about visualizing the data on a graph, it also translates to decisions you make when processing the data into summary statistics. Some weighted averages make sense to base off of time, some off of distance.
The operations you end up wanting are min, max, weighted (based on your choice of independent var) average, the ability to filter points and perform a filtered min/max/avg (only use points where you were moving, ignore outliers, etc), different smoothing functions (to aid in calculating total elevation gain for example), a basic concept of map/reduce functionality (how much time did I spend between 20-30mph, etc), and fixed window moving averages that involve some interpolation. The latter is necessary if you want to identify your fastest 10 minutes, or 10 minutes of highest average heart rate, etc. Lastly, you're going to want an easy and efficient way to perform whatever calculations you're running on subsets of your trackpoints.
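Two of these operations are small enough to sketch. The Python below is only an illustration under stated assumptions, not our production code: track points are dicts with hypothetical keys `t` (seconds from start), `d` (metres from start) and `hr`, and times are strictly increasing. `weighted_average` weights each interval by the change in the chosen independent variable, and `best_window_speed` finds the fastest fixed time window using linear interpolation between points.

```python
from bisect import bisect_right


def weighted_average(points, value_key, weight_key):
    """Average value_key, weighting each interval by the change in weight_key."""
    total = weight_sum = 0.0
    for a, b in zip(points, points[1:]):
        w = b[weight_key] - a[weight_key]
        mid = (a[value_key] + b[value_key]) / 2.0  # trapezoidal value on the interval
        total += mid * w
        weight_sum += w
    return total / weight_sum if weight_sum else float("nan")


def interp_distance(times, dists, t):
    """Linearly interpolate distance at time t, clamped to the ends of the track."""
    if t <= times[0]:
        return dists[0]
    if t >= times[-1]:
        return dists[-1]
    i = bisect_right(times, t) - 1
    frac = (t - times[i]) / (times[i + 1] - times[i])
    return dists[i] + frac * (dists[i + 1] - dists[i])


def best_window_speed(times, dists, window):
    """Highest average speed over any window of `window` seconds, or None if too short."""
    if times[-1] - times[0] < window:
        return None
    best = 0.0
    # Distance is piecewise linear in time, so the best window either
    # starts or ends exactly on a recorded point.
    for start in list(times) + [t - window for t in times]:
        if start < times[0] or start + window > times[-1]:
            continue
        covered = (interp_distance(times, dists, start + window)
                   - interp_distance(times, dists, start))
        best = max(best, covered / window)
    return best


# A slow 10-minute climb followed by a fast 100-second descent.
track = [
    {"t": 0, "d": 0, "hr": 160},
    {"t": 600, "d": 1000, "hr": 160},
    {"t": 700, "d": 2000, "hr": 100},
]
by_time = weighted_average(track, "hr", "t")
by_distance = weighted_average(track, "hr", "d")
fastest = best_window_speed([p["t"] for p in track], [p["d"] for p in track], 100)
```

On this toy track the time-weighted heart rate comes out higher than the distance-weighted one, because the climb takes most of the time but only half of the distance.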
You can see an example of all of this in action here if you're interested: http://ridewithgps.com/trips/964148
The graph at the bottom can be moused over, drag-select to zoom in. The x-axis has a link to switch between distance/time. On the left sidebar at the bottom you'll see best 30 and 60 second efforts - those are done with fixed window moving averages with interpolation. On the right sidebar, click the "Metrics" tab. Drag-select to zoom in on a section on the graph, and you will see all of the metrics update to reflect your selection.
Happy to answer any questions, or work with anyone on some sort of standard or open implementation of some of these ideas.
This probably isn't quite the answer you were looking for but figured I would offer up some details about how we do things at Ride with GPS since we are not aware of any real standards like you seem to be looking for.
Thanks!
After some deeper research, I feel obligated, for the record and for the help of future people looking for this, to mention the pretty much exhaustive work on the subject done by two entities, sometimes working in conjunction: ISO and OGC.
From ISO (International Organization for Standardization), the "TC 211 - Geographic information/Geomatics" section pretty much contains it all.
From OGC (Open Geospatial Consortium), their Abstract Specifications are very extensive, being at the same time redundant with and complementary to ISO's.
I'm not sure they contain object methods related to the proposed application (GPS track and waypoint analysis and manipulation), but the core concepts contained in these documents are certainly solid. UML is their schema representation of choice.
ISO 6709 "[...] specifies the representation of coordinates, including latitude and longitude, to be used in data interchange. It additionally specifies representation of horizontal point location using coordinate types other than latitude and longitude. It also specifies the representation of height and depth that can be associated with horizontal coordinates. Representation includes units of measure and coordinate order."
ISO 19107 "specifies conceptual schemas for describing the spatial characteristics of geographic features, and a set of spatial operations consistent with these schemas. It treats vector geometry and topology up to three dimensions. It defines standard spatial operations for use in access, query, management, processing, and data exchange of geographic information for spatial (geometric and topological) objects of up to three topological dimensions embedded in coordinate spaces of up to three axes."
If I find something new, I'll come back to edit this, including links when available.

Algorithm for reducing GPS track data to discard redundant data?

We're building a GIS interface to display GPS track data, e.g. imagine the raw data set from a guy wandering around a neighborhood on a bike for an hour. A set of data like this with perhaps a new point recorded every 5 seconds, will be large and displaying it in a browser or a handheld device will be challenging. Also, displaying every single point is usually not necessary since a user can't visually resolve that much data anyway.
So for performance reasons we are looking for algorithms that are good at 'reducing' data like this so that the number of points being displayed is reduced significantly but in such a way that it doesn't risk data mis-interpretation. For example, if our fictional bike rider stops for a drink, we certainly don't want to draw 100 lat/lon points in a cluster around the 7-Eleven.
We are aware of clustering, which is good for when looking at a bunch of disconnected points, however what we need is something that applies to tracks as described above. Thanks.
A more scientific and perhaps more math-heavy solution is to use the Ramer-Douglas-Peucker algorithm to generalize your path. I used it when I studied for my Master of Surveying so it's a proven thing. :-)
Given your path and the maximum deviation (a distance tolerance) you can accept, it simplifies the path by reducing the number of points.
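As a rough sketch, the common formulation of Ramer-Douglas-Peucker takes a distance tolerance (called `epsilon` here) and keeps a point only if it lies farther than that from the line joining the two kept points around it. The Python below assumes planar `(x, y)` coordinates, for example a track already projected to metres; raw lat/lon would need projecting first. It uses an explicit stack rather than recursion so that long tracks do not hit the recursion limit.

```python
import math


def _perp_distance(p, a, b):
    """Distance from point p to the line through a and b (planar coordinates)."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(x - x1, y - y1)  # a and b coincide
    return abs(dy * (x - x1) - dx * (y - y1)) / length


def rdp(points, epsilon):
    """Simplify a polyline, keeping points that deviate by more than epsilon."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        first, last = stack.pop()
        max_dist, index = 0.0, None
        for i in range(first + 1, last):
            d = _perp_distance(points[i], points[first], points[last])
            if d > max_dist:
                max_dist, index = d, i
        if index is not None and max_dist > epsilon:
            keep[index] = True  # farthest point stays; split and repeat on both halves
            stack.append((first, index))
            stack.append((index, last))
    return [p for p, k in zip(points, keep) if k]


simplified = rdp([(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)], epsilon=1.0)
```

A larger `epsilon` throws away more points; tying it to the current zoom level (metres per pixel) is one reasonable way to choose it.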
Typically the best way of doing that is:
1. Determine the minimum number of screen pixels you want between displayed GPS points.
2. Determine the distance represented by each pixel at the current zoom level.
3. Multiply answer 1 by answer 2 to get the minimum distance between the coordinates you want to display.
4. Starting from the first coordinate in the journey path, read each next coordinate until you've reached the required minimum distance from the current point. Repeat.
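The steps above can be sketched in a few lines of Python. This is only an illustration: `thin_track` and its parameters are names invented here, `metres_per_pixel` is assumed to come from your map library for the current zoom level, and points are `(lat, lon)` tuples measured with a haversine great-circle distance. Always keeping the final point is an addition of this sketch, so the drawn track ends where the ride did.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, a common approximation


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def thin_track(points, min_pixels, metres_per_pixel, distance=haversine_m):
    """Keep only points at least min_pixels * metres_per_pixel from the last kept point."""
    if not points:
        return []
    min_dist = min_pixels * metres_per_pixel  # step 3: pixels times metres per pixel
    kept = [points[0]]
    for p in points[1:]:
        if distance(kept[-1], p) >= min_dist:  # step 4: skip until far enough away
            kept.append(p)
    if kept[-1] != points[-1]:
        kept.append(points[-1])  # addition in this sketch: keep the end of the track
    return kept


# 100 fixes recorded while stopped, then one fix after riding off.
stop = [(45.0, -122.0)] * 100 + [(45.001, -122.0)]
thinned = thin_track(stop, min_pixels=5, metres_per_pixel=10.0)
```

Because every point is compared with the last kept point rather than with its immediate predecessor, a stationary cluster such as the stop above collapses to a single point.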

Elegant representations of graphs in R^3

If I have a graph of a reasonable size (e.g. ~100 nodes, ~40 edges coming out of each node) and I want to represent it in R^3 (i.e. map each node to a point in R^3 and draw a straight line between any two nodes which are connected in the original graph) in a way which would make it easy to understand its structure, what do you think would make a good drawing criterion?
I know this question is ill-posed; it's not objective. The idea behind it is easier to understand with an extreme case. Suppose you have a connected graph in which each node connects to two and only two other nodes, except for two nodes which only connect to one other node. It's not difficult to see that this graph, when drawn in R^3, can be drawn as a straight line (with nodes sprinkled over the line). Nevertheless, it is possible to draw it in a way which makes it almost impossible to see its very simple structure, e.g. by "twisting" it as much as possible around some fixed point in R^3. So, for this simple case, it's clear that a simple 3D representation is that of a straight line. However, it is not clear what this simplicity property is in the general case.
So, the question is: how would you define this simplicity property?
I'm happy with any kind of answer, be it a definition of "simplicity" computable for graphs, or a greedy approximation algorithm which transforms graphs and converges to "simpler" 3D representations.
Thanks!
EDITED
In the meantime I've put the force-based graph drawing ideas suggested in the answer into practice and wrote an OCaml/OpenGL program to simulate how imposing an electrical repulsive force between nodes (Coulomb's law) and a spring-like behaviour on edges (Hooke's law) would turn out. I've posted the video on YouTube. The video starts with an initial graph of 100 nodes, each with approximately 1-2 outgoing edges, and places the nodes randomly in 3D space. Then all the forces I mentioned are put into place and the system is left to move around subject to those forces. In the beginning, the graph is a mess and it's very difficult to see the structure. Closer to the end, it is clear that the graph is almost linear. I've also experimented with larger graphs, but sometimes the geometry of the graph is just a mess and no matter how you plot it, you won't be able to visualise anything. And here is an even more extreme example with 500 nodes.
One simple approach is described, e.g., at http://en.wikipedia.org/wiki/Force-based_algorithms_%28graph_drawing%29 . The underlying notion of "simplicity" is something like "minimal potential energy", which doesn't really correspond to simplicity in any useful sense but might be good enough in practice.
(If you have 100 nodes of degree 40, I have some doubt as to whether any way of drawing them is going to reveal much in the way of human-accessible structure. That's a lot of edges. Still, good luck!)
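For anyone who wants to experiment, here is a minimal force-based layout sketch in Python. It is not the OCaml/OpenGL program from the question, just an illustration of the same two forces: Coulomb-style repulsion between every pair of nodes and Hooke-style springs along edges. All parameter names and default values (`repulsion`, `spring`, `rest_length`, `step`, `max_move`) are assumptions picked for the example, and the per-step move is clamped to keep the simulation from blowing up.

```python
import math
import random


def force_layout_3d(num_nodes, edges, iterations=200, repulsion=1.0,
                    spring=0.1, rest_length=1.0, step=0.05,
                    max_move=0.5, seed=0):
    """Return a list of [x, y, z] positions after a simple force simulation."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(num_nodes)]

    def separation(i, j):
        delta = [pos[i][k] - pos[j][k] for k in range(3)]
        dist = max(math.sqrt(sum(c * c for c in delta)), 1e-6)  # avoid dividing by zero
        return delta, dist

    for _ in range(iterations):
        force = [[0.0, 0.0, 0.0] for _ in range(num_nodes)]
        # Coulomb-style repulsion between every pair of nodes.
        for i in range(num_nodes):
            for j in range(i + 1, num_nodes):
                delta, dist = separation(i, j)
                magnitude = repulsion / (dist * dist)
                for k in range(3):
                    push = magnitude * delta[k] / dist
                    force[i][k] += push
                    force[j][k] -= push
        # Hooke-style springs pulling (or pushing) along each edge.
        for i, j in edges:
            delta, dist = separation(i, j)
            magnitude = spring * (dist - rest_length)
            for k in range(3):
                pull = magnitude * delta[k] / dist
                force[i][k] -= pull
                force[j][k] += pull
        # Move each node a small, clamped step along its net force.
        for i in range(num_nodes):
            for k in range(3):
                pos[i][k] += max(-max_move, min(max_move, step * force[i][k]))
    return pos


# A 5-node path graph, the "straight line" case from the question.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
layout = force_layout_3d(5, path_edges, iterations=50)
```

The pairwise repulsion loop is quadratic in the number of nodes, which is fine for ~100 nodes but would need a spatial approximation for much larger graphs.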