How to normalize/standardize time inputs for a neural network?

I'm using AForge to build a neural network, but I have a hard time defining my input parameters. In my training data, one input value is the time of an event.
As far as I know, the input values should be between -1 and +1?
I can't figure out what the best way is to normalize/standardize the time value.
One way would be to choose a min and a max value and map them to -1 and +1. But then the network would stop working for values outside this timeframe, and when the timeframe is too big, the differences between the inputs become very small.
I thought of splitting the time value into several input values (like minute, hour, day, month, year) and using these as separate inputs, but this just moves the problem to the "year" input.
Another way is to use a logarithmic scale.
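For concreteness, here is a minimal sketch of the approaches above; Python is used purely for illustration (the same arithmetic carries over to AForge/C#), and the date window is an invented assumption:

import math
from datetime import datetime

def min_max_scale(t, t_min, t_max):
    # Map a timestamp (seconds) into [-1, +1] for a chosen window.
    return 2.0 * (t - t_min) / (t_max - t_min) - 1.0

def log_scale(t, t_min):
    # Compress a wide time range with a logarithm (requires t >= t_min).
    return math.log(t - t_min + 1.0)

def split_components(dt):
    # Split a datetime into several inputs; "year" stays unbounded.
    return (dt.minute, dt.hour, dt.day, dt.month, dt.year)

t_min = datetime(2012, 1, 1).timestamp()    # assumed window start
t_max = datetime(2019, 12, 31).timestamp()  # assumed window end
event = datetime(2012, 12, 13, 1, 0)

print(min_max_scale(event.timestamp(), t_min, t_max))  # inside the window -> in [-1, +1]
print(log_scale(event.timestamp(), t_min))             # compressed scale
print(split_components(event))                         # (0, 1, 13, 12, 2012)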
Are there any best practices for this, or a good approach I just didn't think of?
Update:
The input consists of:
A module number
the time it was opened
the user who opened it
...
Output:
the module number of the module which will likely be opened next

You can just use timesteps.
You probably have a dataset something like this:
12/13/2012 01:00,23,345,235,235,644,757,
12/13/2012 01:02,455,325,235578,23524,6413,757567,
12/13/2012 01:08,123,125,2375,23554,64123,75778,
...,
12/13/2019 07:33,244,245,231235,2158935,6567944,7567557
You can drop your first column:
23,345,235,235,644,757,
455,325,235578,23524,6413,757567,
123,125,2375,23554,64123,75778,
...,
244,245,231235,2158935,6567944,7567557
and work with your data directly.
Check this article https://machinelearningmastery.com/time-series-forecasting-supervised-learning/. Hope it helps.
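For illustration, a minimal sketch of the sliding-window reframing the linked article describes, assuming pandas (the column name is invented):

import pandas as pd

# Pretend these are the values after dropping the timestamp column.
df = pd.DataFrame({"module": [23, 455, 123, 244, 245]})

# The previous observation becomes the input, the current one the target.
df["module_t-1"] = df["module"].shift(1)
supervised = df.dropna().rename(columns={"module": "module_t"})
print(supervised[["module_t-1", "module_t"]])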

Related

How to insert a time range in SQL?

So I have a field,
,Time_ TIME NULL
And this field is meant to represent the time a lecture would take place. In this case, the lecture would take place between 1:00 and 2:00. Is it possible to insert this time range or would I only be able to insert a starting time? i.e.
,'1:00PM'
Thanks for the help!
Is it possible to insert a time range into a single field? Yes.
Is it possible to insert a time range into a field formatted as TIME? No.
Is it a good idea to store two data points in one field? Absolutely not.
Am I going to suggest any ways that you COULD store both in one field? Sorry, but no, I'm afraid I'm not.
You'll run into problems parsing the field. You'll run into problems calculating the length of a lecture. You'll just run into problems every time you turn around.
I imagine you want to display something like "Math 101, 11:00-12:00", and that's fine. In your presentation layer, be it an SSRS report, a spreadsheet, a web page, or what have you, call for the class name, the start time, and the end time, then format them appropriately at the output stage. That keeps your data manageable for the next use case.
You cannot insert a time range. If you need to know the range of the lecture, you could create columns for 'lecture_starttime' and 'lecture_endtime'.
I would recommend storing both times and calculating the length:
start_time
end_time
This will cater for lectures that aren't always 1 hour long. You can calculate the length as end_time - start_time.
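A minimal sketch of this two-column design, using Python's sqlite3 purely for illustration (any RDBMS with time types works the same way; table and column names are invented):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lecture (name TEXT, start_time TEXT, end_time TEXT)")
con.execute("INSERT INTO lecture VALUES ('Math 101', '13:00', '14:00')")

# Derive the length at query time instead of storing a range in one field.
row = con.execute(
    "SELECT name, start_time, end_time, "
    "(strftime('%s', end_time) - strftime('%s', start_time)) / 60 AS minutes "
    "FROM lecture"
).fetchone()
print(row)  # ('Math 101', '13:00', '14:00', 60)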

Random number seed in numpy

numpy.random.seed(7)
In different machine learning and data analysis tutorials, I have seen this seed set to different numbers. Does it make a real difference which specific seed number you choose, or is any number fine? The goal of choosing a seed number is the reproducibility of the same experiments.
Supplying the same seed will give the same results every time the program is run. This is useful during developing/testing to reliably get the same results over and over.
When your app is "in production", change the seed source to something dynamic, like the current time (or something less predictable) to have "typical random behavior". If you don't supply a seed, many generators will default to something like the current time as milliseconds since some epoch.
The actual number doesn't matter. I use my school ID number (9 digits) out of habit, since I have it thoroughly memorized, but I also use short 2-digit numbers for quick tests if I want them to be reproducible.
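A small sketch of both points, a fixed seed for reproducibility and a dynamic seed for "typical random behavior" (the seed values are arbitrary):

import time
import numpy as np

np.random.seed(7)
print(np.random.rand(3))   # same three numbers on every run

np.random.seed(7)
print(np.random.rand(3))   # identical to the line above

# "In production": derive the seed from something dynamic instead.
np.random.seed(int(time.time()))
print(np.random.rand(3))   # different on (almost) every run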

Neural Network Input and Output Data formatting

Hello, and thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "class change", which compares the class of the race the horse is running in now to the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because that's just the way the data comes. Would it be best to use the data as-is after scaling, should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" outcome (0 or 1?) and the odds at which the runner won or lost. If I just use "win/lose", it treats all wins and losses the same when really they're not: I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot, but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function, as there will be a "discontinuity" between short-price winners and short-price losers.
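A tiny sketch of the encoding considered above (fractional odds assumed), which makes the jump at short prices explicit:

def target(won, odds):
    # Proposed encoding: +odds for a winner, -1/odds for a loser.
    return odds if won else -1.0 / odds

for odds in (0.2, 0.4, 2.0, 10.0):   # fractional odds, e.g. 0.4 = 2/5, 10.0 = 10-1
    print(odds, target(True, odds), target(False, odds))
# At short prices the target jumps from near 0 (win) to a large negative
# number (lose): the "discontinuity" described above.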
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data, aka "clamping" or "winsorizing", is something I use for similar data. For example, "days since last run" for a horse could be anything from just one day to several years, but tends to centre in the region of 20 to 30 days. Some experts use a figure of, say, 63 days to indicate a "spell", so you could have an indicator variable like "> 63 = 1, else 0", for example. One approach is to look at the outliers, say the upper or lower 5% of any variable, and clamp these, as sketched below.
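A minimal sketch of the clamping and the spell indicator, assuming NumPy (the sample values are invented):

import numpy as np

days_since_run = np.array([1, 12, 25, 28, 31, 40, 63, 180, 900])

lo, hi = np.percentile(days_since_run, [5, 95])
clamped = np.clip(days_since_run, lo, hi)   # pull outliers to the 5%/95% bounds

spell = (days_since_run > 63).astype(int)   # indicator: "> 63 days = 1, else 0"
print(clamped)
print(spell)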
If you use odds/dividends anywhere, make sure you use the probabilities, i.e. 1/(odds+1), and a useful idea is to normalize these to 100%.
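For example (odds values invented):

odds = [2.0, 3.5, 6.0, 10.0]            # fractional odds for one race

raw = [1.0 / (o + 1.0) for o in odds]   # implied probabilities; they sum to > 1 (the overround)
total = sum(raw)
normalized = [p / total for p in raw]
print(normalized, sum(normalized))      # now sums to 1.0, i.e. 100%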
The odds or parimutuel prices tend to swamp other predictors, so one technique is to develop separate models: one for the market variables (the "market" model) and another for the non-market variables (often called the "fundamental" model).

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning (e.g. for a variable holding the ratio of satisfactory accounts to total accounts, a blank response means the individual does not have any accounts, whereas a response of 0 means the individual has accounts but no satisfactory ones).
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields a value that is not in the range of the data (e.g., going back to the ratio variable above, we could use 9999 or -1, as these values are outside the range of a ratio variable (0 to 1)). I am just curious to know whether there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions from your experimental design and to the fact that imputation will somewhat weaken the power of your experiment relative to the same experiment with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging a few numbers in. See this page for more information. Ultimately, this is probably a better question for Cross Validated, at least until you have figured out the experimental design issues.
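The SAS route is PROC MI/PROC MIANALYZE as above; purely to illustrate the "blank has meaning" case from the question, here is a pandas analogue (not the SAS workflow) that uses an explicit indicator column instead of an out-of-range sentinel like 9999 or -1 (column names invented):

import numpy as np
import pandas as pd

# NaN stands for the meaningful blank: "no accounts at all".
df = pd.DataFrame({"sat_ratio": [0.8, np.nan, 0.0, 0.5]})

df["has_accounts"] = df["sat_ratio"].notna().astype(int)  # preserve the blank's meaning
df["sat_ratio"] = df["sat_ratio"].fillna(0.0)             # fill only after flagging
print(df)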

How to distinguish between master data and calculated/interpolated data?

I'm getting a bunch of vectors with data points for a fixed set of time points. The example below shows a vector with a value per time point:
1D:2
2D:
7D:5
1M:6
6M:6.5
But alas, not for all the time points is a value available. All vectors are stored in a database, and with a trigger we calculate the missing values by interpolation, or possibly a more advanced algorithm. Somehow I want to be able to tell which data points have been calculated and which were originally delivered to us. Of course I can add a flag column to the table, with values indicating whether the value is a master value or a calculated one, but I'm wondering whether there is a more sophisticated way. We probably don't need to determine this on a regular basis, so CPU cycles are not an issue for the determination or the insertion.
The example above shows some nice-looking numbers, but in reality it would look more like 3.1415966533.
The database used for storage is Oracle 10.
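A minimal sketch of the flag-column idea above, using pandas purely for illustration (the real system is an Oracle trigger; the values follow the example vector):

import numpy as np
import pandas as pd

# The example vector from the question; NaN marks the undelivered "2D" point.
s = pd.Series([2.0, np.nan, 5.0, 6.0, 6.5],
              index=["1D", "2D", "7D", "1M", "6M"])

is_master = s.notna()        # flag BEFORE filling: True = originally delivered
filled = s.interpolate()     # linear fill (treats points as equally spaced here)
print(pd.DataFrame({"value": filled, "is_master": is_master}))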
Cheers.
Could you deactivate the trigger temporarily?