Text classification of inconsistent regulatory material for regulatory requirements - tensorflow

Hi I am new to tensorflow and tensorflowHub for that matter and would like to know what I should be using from the API considering my text classification use case.
I want to find out the housing setback information for different municipalities. Using Python I have web scraped the info successfully and used NLTK to classify the words but I want to take it a step further and use ML given no two municipality codes are alike! As an example one municipality may have something like this:
Setback requirements.
Minimum front setback: 25 feet.
Minimum side setback from a street right-of-way: 25 feet.
Minimum side setback from an interior lot line: five feet.
Minimum rear setback for principal uses: 25 feet.
Minimum rear setback for accessory uses: ten feet.
etc
While another may have the following text.
For all R-1 districts except 4-R-1, the minimum setbacks shall be as follows:
Front. No building or structure shall be located within fifty (50) feet of the centerline of any street nor twenty (20) feet of the property line, whichever is greater.
There shall be a side yard setback on each side of the parcel equal to ten percent (10%) of the width of the parcel. In no case shall the minimum required side yard setback be less than five (5) feet. In order to preserve architectural integrity, the side yard setback required for an addition to an existing building or structure may be permitted to utilize the established setback, provided that the established side yard setback is not less than five (5) feet.
Rear. There shall be a rear yard setback of not less than fifteen (15) feet.
For the 4-R-1 district, the minimum setbacks shall be as follows:
Front. No building or structure shall be located within forty (40) feet of the centerline of any street nor ten (10) feet of the property line, whichever is greater.
Side. Three (3) feet.
Rear. Fifteen (15) feet.
etc
How can I classify this text on the required setbacks that each municipality requires? I eventually want to use this in ARCgis as a shape file or similiar.
Any help would be appreciated!

In general it's a challenging modelling task given that I assume there is not much data from a given municipality and not a lot of standardization between municipalities.
If the amount of data is available would support it, you could try fine-tuning one of the existing transformers (https://tfhub.dev/google/collections/transformer_encoders_text/1) using the same pre-training objective that was used to train them in the first place, e.g. masked language model (MLM).
In general, tensorflow-hub might not be the best tag to get a general modeling advice since tfhub.dev is a repository of pretrained models published by the OSS community.

Related

Adapting Smartphone Camera to derive Blackbody temperature

At first blush this presumably means -
(1) looking only at lower IR frequencies,
(2) select a IR frequency cut-off for low frequency buckets of the u/v FFT grid
(3) Once we have that, derive the power distribution - squares of amplitudes - for that IR range of frequency buckets the camera supports.
(4) Fit that distribution against the Rayleigh-Jones classical Black Box radiation formula:
(https://en.wikipedia.org/wiki/Rayleigh%E2%80%93Jeans_law#Other_forms_of_Rayleigh%E2%80%93Jeans_law)
(5) Assign a Temperature of 'best fit'.
The units for B(ν,T) are Power per unit frequency per unit surface area at equilibrium Temperature
Of course, this leaves many details out, such as (6) cancelling background, etc, but one could perhaps use the opposite facing camera to assist in that. Where buckets do not straddle the temperature of interest, (7) use a one-sided distribution to derive an inferred Gaussian curve to fit the Rayleigh-Jeans curve at that derived central frequency ν, for measured temperature T.
Finally (8) check if this procedure can consistently detect a high vs low surface temperature (9) check if it can consistently identify a 'fever' temperature (say, 101 Fahrenheit / 38 Celcius) pointing at a forehead.
If all that can be done, (10) Voila! a body fever detector
So those who are capable can fill us in on whether this is possible to do so for eventual posting at an app store as a free Covid19 safe body temperature app? I have a strong sense there's quite a few out there who can verify this in a week or two!
It appears that the analog signal assumed in (1) and (2) are not available in the Android digital Camera2 interface.
Android RAW image stream, that is uncompressed YUV, is already encoded Y green monochrome, and U,V are blue and red shifts from zero for converting green monochrome to color.
The original analog frequency / energy signal is not immediately accessible. So adaptation is not possible (yet).

Multiple trained models vs Multple features and one model

I'm trying to build a regression based M/L model using tensorflow.
I am trying to estimate an object's ETA based on the following:
distance from target
distance from target (X component)
distance from target (Y component)
speed
The object travels on specific journeys. This could be represented as from A->B or from A->C or from D->F (POINT 1 -> POINT 2). There are 500 specific journeys (between a set of points).
These journeys aren't completely straight lines, and every journey is different (ie. the shape of the route taken).
I have two ways of getting around this problem:
I can have 500 different models with 4 features and one label(the training ETA data).
I can have 1 model with 5 features and one label.
My dilemma is that if I use option 1, that's added complexity, but will be more accurate as every model will be specific to each journey.
If I use option 2, the model will be pretty simple, but I don't know if it would work properly. The new feature that I would add are originCode+ destinationCode. Unfortunately these are not quantifiable in order to make any numerical sense or pattern - they're just text that define the journey (journey A->B, and the feature would be 'AB').
Is there some way that I can use one model, and categorize the features so that one feature is just a 'grouping' feature (in order separate the training data with respect to the journey.
In ML, I believe that option 2 is generally the better option. We prefer general models rather than tailoring many models to specific tasks, as that gets dangerously close to hardcoding, which is what we're trying to get away from by using ML!
I think that, depending on the training data you have available, and the model size, a one-hot vector could be used to describe the starting/end points for the model. Eg, say we have 5 points (ABCDE), and we are going from position B to position C, this could be represented by the vector:
0100000100
as in, the first five values correspond to the origin spot whereas the second five are the destination. It is also possible to combine these if you want to reduce your input feature space to:
01100
There are other things to consider, as Scott has said in the comments:
How much data do you have? Maybe the feature space will be too big this way, I can't be sure. If you have enough data, then the model will intuitively learn the general distances (not actually, but intrinsically in the data) between datapoints.
If you have enough data, you might even be able to accurately predict between two points you don't have data for!
If it does come down to not having enough data, then finding representative features of the journey will come into use, ie. length of journey, shape of the journey, elevation travelled etc. Also a metric for distance travelled from the origin could be useful.
Best of luck!
I would be inclined to lean toward individual models. This is because, for a given position along a given route and a constant speed, the ETA is a deterministic function of time. If one moves monotonically closer to the target along the route, it is also a deterministic function of distance to target. Thus, there is no information to transfer from one route to the next, i.e. "lumping" their parameters offers no a priori benefit. This is assuming, of course, that you have several "trips" worth of data along each route (i.e. (distance, speed) collected once per minute, or some such). If you have only, say, one datum per route then lumping the parameters is a must. However, in such a low-data scenario, I believe that including a dummy variable for "which route" would ultimately be fruitless, since that would introduce a number of parameters that rivals the size of your dataset.
As a side note, NEITHER of the models you describe could handle new routes. I would be inclined to build an individual model per route, data quantity permitting, and a single model neglecting the route identity entirely just for handling new routes, until sufficient data is available to build a model for that route.

Calculating walking distance for user over time

I'm trying to track the distance a user has moved over time in my application using the GPS. I have the basic idea in place, so I store the previous location and when a new GPS location is sent I calculate the distance between them, and add that to the total distance. So far so good.
There are two big issues with this simple implementation:
Since the GPS is inacurate, when the user moves, the GPS points will not be a straight line but more of a "zig zag" pattern making it look like the user has moved longer than he actually have moved.
Also a accuracy problem. If the phone just lays on the table and polls GPS possitions, the answer is usually a couple of meters different every time, so you see the meters start accumulating even when the phone is laying still.
Both of these makes the tracking useless of coruse, since the number I'm providing is nowwhere near accurate enough.
But I guess that this problem is solvable since there are a lot of fitness trackers and similar out there that does track distance from GPS. I guess they do some kind of interpolation between the GPS values or something like that? I guess that won't be 100% accurate either, but probably good enough for my usage.
So what I'm after is basically a algorithm where I can put in my GPS positions, and get as good approximation of distance travelled as possible.
Note that I cannot presume that the user will follow roads, so I cannot use the Google Distance Matrix API or similar for this.
This is a common problem with the position data that is produced by GPS receivers. A typical consumer grade receiver that I have used has a position accuracy defined as a CEP of 2.5 metres. This means that for a stationary receiver in a "perfect" sky view environment over time 50% of the position fixes will lie within a circle with a radius of 2.5 metres. If you look at the position that the receiver reports it appears to wander at random around the true position sometimes moving a number of metres away from its true location. If you simply integrate the distance moved between samples then you will get a very large apparent distance travelled.for a stationary device.
A simple algorithm that I have used quite successfully for a vehicle odometer function is as follows
for(;;)
{
Stored_Position = Current_Position ;
do
{
Distance_Moved = Distance_Between( Current_Position, Stored_Position ) ;
} while ( Distance_Moved < MOVEMENT_THRESHOLD ) ;
Cumulative_Distance += Distance_Moved ;
}
The value of MOVEMENT_THRESHOLD will have an effect on the accuracy of the final result. If the value is too small then some of the random wandering performed by the stationary receiver will be included in the final result. If the value is too large then the path taken will be approximated to a series of straight lines each of which is as long as the threshold value. The extra distance travelled by the receiver as its path deviates from this straight line segment will be missed.
The accuracy of this approach, when compared with the vehicle odometer, was pretty good. How well it works with a pedestrian would have to be tested. The problem with people is that they can make much sharper turns than a vehicle resulting in larger errors from the straight line approximation. There is also the perennial problem with sky view obscuration and signal multipath caused by buildings, vehicles etc. that can induce positional errors of 10s of metres.

Creating an MLP that learns based on GPS coordinates

I have some data that tells me the amount of hours water is available for particular towns.
You can see it here
I want to use train a Multilayer Perceptron based on that data, to take a set of coordinates and indicate the approximate number of hours for which that coordinate will have water.
Does this make sense?
If so, am I correct in saying, there has to be two input layers? One for lat and one for long. And the output layer should be the number of hours.
Would love some guidance.
I would solve that differently:
Just create an ArrayList of WaterInfo:
WaterInfo contains lat,lon, waterHours.
Then for a given coordinate search the closest WaterInfo in the list.
Since you have not many elements, just do a brute force search, to find the closest.
You further can optimize, to find the three closest WaterInfo points, and calculate the weithted average of WaterHours. As weight you use the air distance from current position to Waterinfo position.
To answer your question:
"Does this makes sense"?
From the goal to get a working solution: NO!
Ask yourself, why do you want to use MLP for this task.
Further i doubt that using two layers for lat / long makes sense.
A coordinate (lat/lon) is one point on the world, so that should be one layer in the model. You can convert the lat/lon coord to a cell identifier: Span a grid over Brazil; with cell width 10 or 50km; now convert a lat/long coordinate to a cellId: Like E4 on a chess board, you will calculate one integer value representing the cell. (There are other solutions to get an unique number, too, choose one you like)
Now you have a modell geoCellID -> waterHours, which better represents the real world situation.

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I wanted to generate a discrete random number from normal distribution.
For example, if I have a range from a minimum of 4 and a maximum of 10 and an average of 7. What code or function call ( Objective C preferred ) would I need to return a number in that range. Naturally, due to normal distribution more numbers returned would center round the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end of the other? Lets say I need to generate a random number with a range of minimum of 4 and maximum of 10, and I want the majority of the numbers returned to center around the number 8 with a natural fall of based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice produces a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously generate a normal distribution. In fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes--the integers (4,5,6,7,8,9,10). You can set up an array of seven probabilities to approximate any distribution you like.
Many frameworks and libraries have this built-in.
Also, just like TokenMacGuy said a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: Mean μ and standard deviation σ. With both these parameters you can confine a certain quantile of the distribution to a certain interval, so that 95 % of all points fall in that interval. But resticting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normal-distributed values from uniform random values (which is what most random or pseudorandom number generators are generating:
The Box-Muller transform is probably the easiest although not exactly fast to compute. Depending on the number of numbers you need, it should be sufficient, though and definitely very easy to write.
Another option is Marsaglia's Polar method which is usually faster1.
A third method is the Ziggurat algorithm which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
As a general advice, though: Don't write it yourself if you have access to a library that generates normal-distributed random numbers for you already.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve and then determine on which side of your wanted mean a point fell, stretching it appropriately to fit your desired distribution.
For generating only integers I'd suggest you just round towards the nearest integer when the random number happens to fall within your desired interval and reject it if it doesn't (drawing a new random number then). This way you won't artificially skew the distribution (such as you would if you were clamping the values at 4 or 10, respectively).
1 In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method results in an endless loop, rejecting every sample. Won't happen with random numbers that fulfill the usual statistic expectations to them, though.
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
Random random=new Random();
public int bell7()
{
int n=4;
for (int x=0;x<6;++x)
n+=random.nextInt(2);
return n;
}
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range was large, then instead of nextInt(2)'s I'd use a bigger number in there so there would be fewer iterations through the loop, depending on frequency of call and performance requirements.
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50%.
The normal distribution is not described by its endpoints. Normally it's described by it's mean (which you have given to be 7) and its standard deviation. An important feature of this is that it is possible to get a value far outside the expected range from this distribution, although that will be vanishingly rare, the further you get from the mean.
The usual means for getting a value from a distribution is to generate a random value from a uniform distribution, which is quite easily done with, for example, rand(), and then use that as an argument to a cumulative distribution function, which maps probabilities to upper bounds. For the standard distribution, this function is
F(x) = 0.5 - 0.5*erf( (x-μ)/(σ * sqrt(2.0)))
where erf() is the error function which may be described by a taylor series:
erf(z) = 2.0/sqrt(2.0) * Σ∞n=0 ((-1)nz2n + 1)/(n!(2n + 1))
I'll leave it as an excercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the Gnu Scientific Library, which among many other features, has a technique to generate random numbers in one of many common distributions, of which the Gaussian Distribution (hint) is one.
Obviously, all of these functions return floating point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply downcast to integer.