Fairly new to RapidMiner and data science.
I imported data (it's very wide, so it took a while to classify all of the data types). I put the data through a random forest and it appears to have emphasized the wrong things. I believe this is due to incorrect data type classification. I can't seem to find good data type documentation and am looking for an explanation of how RapidMiner looks at each.
For example, I have some columns with 90% blanks and a couple filled in. I labeled these as "nominal" and RapidMiner weighted them heavily. I wanted it to weight the date columns more, since I'm trying to predict cycle time... any help or insight very much appreciated!
Some of the data types available are:
Nominal
Polynomial
Binomial
Dates
Text
etc.
I'm not 100% sure I got your question correctly, but neither RapidMiner nor the random forest algorithm emphasizes a certain data type over another.
So if the algorithm puts more importance on the nominal columns, it is because they strongly separate your examples.
The different data types in RapidMiner are there to allow or disallow certain operations.
A classic example is phone numbers. If they are stored as a real number, you could compute something like a square root or an average, which does not make sense. So storing them as a string (or nominal) makes more sense.
If you want to exclude certain attributes, you could try a feature selection or dimensionality reduction method (like PCA, or the Remove Correlated Attributes and Remove Useless Attributes operators).
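If it helps to see the same idea outside of RapidMiner: here is a rough Python/scikit-learn sketch (not a RapidMiner process) of checking which attributes a random forest actually relies on, and of dropping mostly-blank columns before training. The file and column names are made up for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("cycle_data.csv")   # hypothetical wide dataset

# Drop columns that are mostly blank (e.g. >80% missing) instead of letting
# the model latch onto the few rows where they happen to be filled in.
df = df[df.columns[df.isna().mean() <= 0.8]]

# One-hot encode nominal columns; date columns would need to be turned into
# numeric features (e.g. elapsed days) before this step.
X = pd.get_dummies(df.drop(columns=["cycle_time"])).fillna(0)
y = df["cycle_time"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Rank attributes by how much the forest actually uses them.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```

The same inspection is possible inside RapidMiner (e.g. via the weights output of the model), but the point is the workflow: check the importances first, then decide which attributes to exclude.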
Also feel free to ask further questions, or re-post this one, in the RapidMiner community forum.
G'Day
I came across an application that is converting GPS co-ords to something else.
I am trying to figure out what the format is.
Enter the standard
-26.61722 152.96033
and it stores this in the database
-15732970 91615170
Any ideas what the second pair are?
Cheers.
This is going to be nearly impossible to answer, mainly because there are literally hundreds of different standards for storing point data. They all have different units of measurement and have differing levels of accuracy. Some take into account the curvature of an object, such as the earth, and others are simply distances from a point in a two-dimensional plane.
This is of course assuming that the data stored is actually a different standard and not just a custom encoding of the numbers provided.
Perhaps the name of the application and vendor might help.
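That said, it's easy to rule a couple of common candidates in or out programmatically. Here is a rough Python sketch comparing the sample pair against two well-known encodings (GPS "semicircles" and Web Mercator metres). It only demonstrates the approach; it does not claim either of those is the actual format.

```python
# Quick sketch: compare the sample pair against a couple of common encodings.
import math

lat, lon = -26.61722, 152.96033
stored = (-15732970, 91615170)

# Candidate 1: "semicircles" (degrees * 2^31 / 180), used by some GPS devices.
semicircles = (lat * 2**31 / 180, lon * 2**31 / 180)

# Candidate 2: Web Mercator metres (EPSG:3857), sphere of radius 6378137 m.
R = 6378137.0
mercator = (R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)),
            R * math.radians(lon))

print("stored:     ", stored)
print("semicircles:", tuple(round(v) for v in semicircles))
print("mercator:   ", tuple(round(v) for v in mercator))
```

If neither of these (nor any other obvious candidate) matches, that supports the "custom encoding" theory above.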
Hi, I am new to Mechanical Turk. I have 10K images and I want to ask Turkers to write a short summary of each image in Mechanical Turk. Since all images in my image set are similar, once a Turker has done the summarization task more than 10 times, they'll figure out some tricks and write similar summaries for the following images.
To increase the diversity and randomness, I want to ask as many different people to do the task as possible. The ideal strategy is that each unique Turker is only allowed to label one image (or fewer than 10 images), while one image can be summarized by more than one Turker. My experiment aims to collect different textual summaries from different people, covering a rich vocabulary.
If I understand you correctly, you have 10K unique images in total to label. It sounds like you're looking to have each task (HIT) ask a Worker to label 10 unique images. This will result in 1K HITs with 10 images per HIT. If you'd like just one unique Worker to label each image, then you'll set the Number of Assignments Requested to 1. If you'd like multiple Workers to work on the same image (say, to ensure quality, or just to broaden the number and type of labels you might get), then you'll set the Number of Assignments Requested to the number of unique Workers you'd like to work on each task (HIT).
If I've misunderstood what you're looking to do, just clarify and I'll be happy to revise my answer.
You can learn more about these and related concepts here:
http://docs.aws.amazon.com/AWSMechTurk/latest/RequesterUI/mechanical-turk-concepts.html
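If you end up creating the HITs through the API rather than the Requester UI, the same setting is exposed as MaxAssignments. A rough boto3 sketch (the question XML file, reward, and timing values below are just placeholders):

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

with open("image_summary_question.xml") as f:   # hypothetical HTMLQuestion XML
    question_xml = f.read()

hit = mturk.create_hit(
    Title="Write a one-sentence summary of 10 images",
    Description="Look at each image and write a short summary.",
    Keywords="image, summary, description",
    Reward="0.25",
    MaxAssignments=3,                  # 3 unique Workers per HIT
    LifetimeInSeconds=7 * 24 * 3600,   # HIT available for one week
    AssignmentDurationInSeconds=30 * 60,
    Question=question_xml,
)
print("HIT created:", hit["HIT"]["HITId"])
```

Capping how many of your HITs a single Worker can take is a separate limit (see the next answer); it isn't set through create_hit.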
Good luck!
Yes, this is possible. According to documentation:
http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMechanicalTurkRequester/amt-dg.pdf (page 4)
You can specify the maximum number of assignments that any Worker can accept for your HITs. You can set two types of limits:
The maximum number of assignments any Worker can accept for a specific HIT type you've created
The maximum number of assignments any Worker can accept for all your HITs that don't otherwise have a HIT-type-specific limit already assigned
I'm using a HackRF One device and its corresponding osmocom Sink block inside of gnuradio-companion. Because the input to this block is Complex (i.e. a pair of Floats), I could conceivably send it an enormously large value. At some point the osmocom Sink will hit a maximum value and stop driving the attached HackRF to output stronger signals.
I'm trying to figure out what that maximum value is.
I've looked through the documentation at a number of different sites, for both the HackRF One and the osmocom source, and can't find an answer. I tried looking through the source code itself, but couldn't see any clear indication, although I may have missed something.
http://sdr.osmocom.org/trac/wiki/GrOsmoSDR
https://github.com/osmocom/gr-osmosdr
I also thought of deriving the value empirically, but didn't trust my equipment to get a precise measure of when the block started to hit the rails.
Any ideas?
Thanks
Friedman
I'm using a HackRF One device and its corresponding osmocom Sink block inside of gnuradio-companion. Because the input to this block is Complex (i.e. a pair of Floats), I could conceivably send it an enormously large value.
No, the complex samples $z$ must meet $|\Re(z)| \le 1$ and $|\Im(z)| \le 1$,
because the osmocom sink/the underlying drivers and devices map that -1 – +1 range to the range of the I and Q DAC values.
You're right, though, it's hard to measure empirically, because typically, the output amplifiers go into nonlinearity close to the maximum DAC outputs, and on top of that, everything is frequency-dependent, so e.g. 0.5+j0.5 at 400 MHz doesn't necessarily produce the same electrical field strength as 0.5+j0.5 at 1GHz.
That's true for all non-calibrated SDR devices (which, aside from your typical multi-10k-Dollar Signal Generator, is everything, unless you calibrate for all frequencies of interest yourself).
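If it's useful, here's a rough scripted sketch (outside GRC) of the usual way to stay inside that range: keep the baseband amplitude below 1 and set output power with the device gain stages. The frequencies, rates, and gain values are just examples.

```python
from gnuradio import gr, analog
import osmosdr

class tone_tx(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self)
        samp_rate = 2_000_000

        # Complex tone with |Re|, |Im| <= 0.7 < 1, so the sink never clips.
        src = analog.sig_source_c(samp_rate, analog.GR_COS_WAVE, 100_000, 0.7)

        sink = osmosdr.sink(args="hackrf=0")
        sink.set_sample_rate(samp_rate)
        sink.set_center_freq(433_920_000)
        # Output power is set via the device gain stages, not by pushing the
        # baseband samples past full scale.
        sink.set_gain(14)

        self.connect(src, sink)

if __name__ == "__main__":
    tb = tone_tx()
    tb.start()
    input("Transmitting; press Enter to stop...")
    tb.stop()
    tb.wait()
```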
This question somewhat overlaps with geospatial information systems, but I think it belongs here rather than on GIS.StackExchange.
There are a lot of applications around that deal with GPS data with very similar objects, most of them defined by the GPX standard. These objects would be collections of routes, tracks, waypoints, and so on. Some important programs, like Google Maps, serialize more or less the same entities in KML format. There are a lot of other mapping applications online (ridewithgps, strava, runkeeper, to name a few) which treat this kind of data in a different way, yet allow for more or less equivalent "operations" with the data. Examples of these operations are:
Direct manipulation of tracks/trackpoints with the mouse (including drawing over a map);
Merging and splitting based on time and/or distance;
Replacing GPS-collected elevation with DEM/SRTM elevation;
Calculating properties of part of a track (total ascent, average speed, distance, time elapsed);
There are some small libraries (like GpxPy) that try to model these objects AND THEIR METHODS, in a way that would ideally allow for an encapsulated, possibly language-independent Library/API.
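For what it's worth, here is a rough sketch of some of the operations listed above done with GpxPy (the file name is just a placeholder):

```python
import gpxpy

with open("ride.gpx") as f:          # placeholder file name
    gpx = gpxpy.parse(f)

for track in gpx.tracks:
    uphill, downhill = track.get_uphill_downhill()
    moving = track.get_moving_data()
    print(track.name)
    print("  distance (3D):", round(track.length_3d()), "m")
    print("  total ascent: ", round(uphill), "m")
    print("  moving time:  ", moving.moving_time, "s")
    if moving.moving_time:
        print("  avg speed:    ",
              round(moving.moving_distance / moving.moving_time, 2), "m/s")
```

This covers attributes and a handful of computed properties, but, as noted, there is no agreed set of methods (merge, split, re-elevation, editing) behind it.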
The fact is: this problem has been around long enough for a commonly accepted standard to emerge, hasn't it? On the other hand, most GIS software is very professionally oriented towards geospatial analysis, topographic and cartographic applications, while the typical trip-logging and trip-planning applications seem to be more consumer/hobbyist oriented, which might explain the quite dispersed way the different projects/apps treat and model the problem.
So, considering everything said, the question is: is there, at present or being planned, a standard way to model canonically, in an object-oriented way, the most used GPS/tracklog entities and their canonical attributes and methods?
There is the GPX schema and it is very close to what I imagine, but it only contains objects and attributes, not methods.
Any information will be very much appreciated, thanks!!
As far as I know, there is no standard library, interface, or even set of established best practices when it comes to storing/manipulating/processing "route" data. We have put a lot of effort into these problems at Ride with GPS and I know the same could be said by the other sites that solve related problems. I wish there was a standard, and would love to work with someone on one.
GPX is OK and appears to be a sort-of standard... at least until you start processing GPX files and discover everyone has simultaneously added their own custom extensions to the format to deal with data like heart rate, cadence, power, etc. Also, there isn't a standard way of associating a route point with a track point. Your "bread crumb trail" of the route is represented as a series of trkpt elements, and course points (e.g. "turn left onto 4th street") are represented in a separate series of rtept elements. Ideally you want to associate a given course point with a specific track point, rather than just giving the course point a latitude and longitude. If your path does several loops over the same streets, it can introduce some ambiguity in where the course points should be attached along the route.
KML and Garmin's TCX format are similar to GPX, with their own pros and cons. In the end these formats really only serve the purpose of transferring the data between programs. They do not address the issue of how to represent the data in your program, or what type of operations can be performed on the data.
We store our track data as an array of objects, with keys corresponding to different attributes such as latitude, longitude, elevation, time from start, distance from start, speed, heart rate, etc. Additionally we store some metadata along the route to specify details about each section. When parsing our array of track points, we use this metadata to split a Route into a series of Segments. Segments can be split, joined, removed, attached, reversed, etc. They also encapsulate the method of trackpoint generation, whether that is by interpolating points along a straight line, or requesting a path representing directions between the endpoints. These methods allow a reasonably straightforward implementation of drag/drop editing and other common manipulations. The Route object can be used to handle operations involving multiple segments. One example is if you have a route composed of segments - some driving directions, straight lines, walking directions, whatever - and want to reverse the route. You can ask each segment to reverse itself, maintaining its settings in the process. At a higher level we use a Map class to wire up the interface, dispatch commands to the Route(s), and keep a series of snapshots or transition functions updated properly for sensible undo/redo support.
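To make that concrete, here is a minimal sketch (not our actual code, just the shape of the Route/Segment structure described above):

```python
from dataclasses import dataclass, field

@dataclass
class TrackPoint:
    lat: float
    lon: float
    elevation: float = 0.0
    dist_from_start: float = 0.0   # metres
    time_from_start: float = 0.0   # seconds

@dataclass
class Segment:
    points: list = field(default_factory=list)
    generation: str = "straight_line"   # or "directions", etc.

    def reverse(self):
        # Reverse in place while keeping the segment's own settings.
        self.points.reverse()

@dataclass
class Route:
    segments: list = field(default_factory=list)

    def reverse(self):
        # Ask each segment to reverse itself, then reverse their order.
        for seg in self.segments:
            seg.reverse()
        self.segments.reverse()

    def split(self, index):
        # Split the route into two routes at a segment boundary.
        return Route(self.segments[:index]), Route(self.segments[index:])
```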
Route manipulation and generation is one of the goals. The others are aggregating summary statistics and structuring the data for efficient visualization/interaction. These problems have been solved to some degree by any system that will take in data and produce a line graph. Not exactly new territory here. One interesting characteristic of route data is that you will often have two variables to choose from for your x-axis: time from start, and distance from start. Both are monotonically increasing, and both offer useful but different interpretations of the data. Looking at a graph of elevation with an x-axis of distance will show a bike ride going up and down a hill as symmetrical. Using an x-axis of time, the uphill portion is considerably wider. This isn't just about visualizing the data on a graph; it also translates to decisions you make when processing the data into summary statistics. Some weighted averages make sense to base off of time, some off of distance. The operations you end up wanting are min, max, weighted (based on your choice of independent variable) average, the ability to filter points and perform a filtered min/max/avg (only use points where you were moving, ignore outliers, etc.), different smoothing functions (to aid in calculating total elevation gain, for example), a basic concept of map/reduce functionality (how much time did I spend between 20-30 mph, etc.), and fixed-window moving averages that involve some interpolation. The latter is necessary if you want to identify your fastest 10 minutes, or 10 minutes of highest average heart rate, etc. Lastly, you're going to want an easy and efficient way to perform whatever calculations you're running on subsets of your trackpoints.
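As a tiny illustration of the time- versus distance-weighted point (illustrative numbers only):

```python
def weighted_average(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0

# Per-interval speed samples (m/s) with the time (s) and distance (m)
# covered in each interval: a slow climb followed by a fast descent.
speeds    = [2.0, 8.0]
durations = [600, 150]
distances = [1200, 1200]

print("time-weighted avg speed:    ", weighted_average(speeds, durations))
print("distance-weighted avg speed:", weighted_average(speeds, distances))
```

The time-weighted figure matches total distance over total time, while the distance-weighted one is pulled up by the fast descent, which is exactly why the choice of independent variable matters for your summary statistics.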
You can see an example of all of this in action here if you're interested: http://ridewithgps.com/trips/964148
The graph at the bottom can be moused over, drag-select to zoom in. The x-axis has a link to switch between distance/time. On the left sidebar at the bottom you'll see best 30 and 60 second efforts - those are done with fixed window moving averages with interpolation. On the right sidebar, click the "Metrics" tab. Drag-select to zoom in on a section on the graph, and you will see all of the metrics update to reflect your selection.
Happy to answer any questions, or work with anyone on some sort of standard or open implementation of some of these ideas.
This probably isn't quite the answer you were looking for but figured I would offer up some details about how we do things at Ride with GPS since we are not aware of any real standards like you seem to be looking for.
Thanks!
After some deeper research, I feel obligated, for the record and for the help of future people looking for this, to mention the pretty much exhaustive work on the subject done by two entities, sometimes working in conjunction: ISO and OGC.
From ISO (the International Organization for Standardization), the "TC 211 - Geographic information/Geomatics" section pretty much contains it all.
From the OGC (Open Geospatial Consortium), their Abstract Specifications are very extensive, being at the same time redundant with and complementary to ISO's.
I'm not sure they contain object methods related to the proposed application (GPS track and waypoint analysis and manipulation), but the core concepts contained in these documents are rather solid. UML is their schema representation of choice.
ISO 6709 "[...] specifies the representation of coordinates, including latitude and longitude, to be used in data interchange. It additionally specifies representation of horizontal point location using coordinate types other than latitude and longitude. It also specifies the representation of height and depth that can be associated with horizontal coordinates. Representation includes units of measure and coordinate order."
ISO 19107 "specifies conceptual schemas for describing the spatial characteristics of geographic features, and a set of spatial operations consistent with these schemas. It treats vector geometry and topology up to three dimensions. It defines standard spatial operations for use in access, query, management, processing, and data exchange of geographic information for spatial (geometric and topological) objects of up to three topological dimensions embedded in coordinate spaces of up to three axes."
If I find something new, I'll come back to edit this, including links when available.