Providing own imputed data into mice - r-mice

I would like to provide my own imputed data to mice. Is it possible to do this and then just run the fit routine [with(data = ..)]? I searched but was unable to find anything on this. Thanks.
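A minimal sketch of one way to do this, assuming your external imputations are complete copies of the data with the same columns and row order. The data frame names (orig, imp1 ... imp5) and the model formula are hypothetical placeholders: stack everything in the long format that mice::as.mids() expects (.imp = 0 for the original data with NAs, 1..m for the completed sets), convert to a mids object, then run the usual with()/pool() workflow.

# Sketch: wrap externally imputed datasets into a mids object, then fit and pool.
# `orig`, `imp1`...`imp5`, and y ~ x1 + x2 are hypothetical placeholders.
library(mice)

imps <- list(imp1, imp2, imp3, imp4, imp5)

long <- rbind(
  cbind(.imp = 0, .id = seq_len(nrow(orig)), orig),               # original data with NAs
  do.call(rbind, lapply(seq_along(imps), function(i) {
    cbind(.imp = i, .id = seq_len(nrow(imps[[i]])), imps[[i]])    # i-th completed dataset
  }))
)

mids_obj <- as.mids(long)                 # convert long format to a mids object
fit <- with(mids_obj, lm(y ~ x1 + x2))    # fit the model on each imputed dataset
summary(pool(fit))                        # pool the results with Rubin's rules

If you prefer to keep the imputations as a plain list of completed data frames, I believe the miceadds package has a similar conversion helper (datalist2mids()), though it is not shown here.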

Related

Reducing number of unique categories pandas

I have my dataset of job metrics, and one of my features is industry. It is a categorical feature with 1,200 unique values. Before I go on and build a model, I need to figure out how best to encode it, especially because it has so many unique values. Does anyone have any tips or guidance as to where I should start?
The picture below shows the top 9 industries. I am thinking of selective encoding - maybe only one-hot encoding the 15-20 most frequent values - but I will be thankful for any suggestions. Thanks
I tried looking through several resources but couldn't find anything promising so far.
[A picture of the 9 most occurring industries]
https://i.stack.imgur.com/tDAEk.jpg
You could one-hot encode everything, and maybe check correlations against the target to see which job categories may be informative features.
If the data is too large to do this, then yes, perhaps selective encoding as you said -- conditionally fill everything else as "other" and then proceed with one-hot encoding.
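The question is pandas-based, but as an illustration of the "keep the top categories, lump the rest into Other, then one-hot encode" idea, here is a rough sketch in R with forcats (df and industry are hypothetical names); the same recipe works in pandas with value_counts() and get_dummies().

# Sketch: keep the ~20 most frequent industries, lump the rest into "Other",
# then one-hot encode the reduced factor. `df$industry` is a hypothetical column.
library(forcats)

df$industry_top <- fct_lump_n(factor(df$industry), n = 20, other_level = "Other")

# one-hot encode (drop the intercept so each retained level gets its own column)
X <- model.matrix(~ industry_top - 1, data = df)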

How to do sampling in a SQL query to get a dataframe with pandas

Note my question is a bit different here:
I am working with pandas on a dataset that has a lot of data (10M+):
q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')
I know I can sample with the pandas function that takes a frac option, like
df_sample = df.sample(frac=0.01)
However, I do not want to generate the original df at that size in the first place. I wonder what the best practice is for generating a dataframe whose data is already sampled.
I've read some SQL posts where the sample was generated from a slice of the table; that is absolutely not acceptable in my case. The sample needs to be as evenly distributed as possible.
Can anyone shed some more light on this?
Thank you very much.
UPDATE:
Below is a table showing what the data looks like:
Reputation is the field I am working on. You can see that the majority of records have a very small reputation.
I don't want to work with a dataframe containing all the records, but I do want the sampled data to look like the un-sampled data, for example with a similar histogram; that's what I meant by "evenly".
I hope this clarifies a bit.
A simple random sample can be performed using the following syntax:
select * from mydata where rand()>0.9
This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or guarantee that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach
http://sqlfiddle.com/#!9/21d1ee/2
On average, random sampling will produce a distribution the same as that of the underlying data, so it meets your requirement. However, if you want to 'force' the sample to be more representative, or force it to be a certain size, you need to look at something a little more advanced.

How to replace the missing data from AMELIA results

I have run an Amelia imputation for a data set containing missing data. I need to replace the missing points with the results of amelia(), but it contains 5 sets of imputed values. How can I choose the best one to replace the missing values (to plot a graph of the data set after imputing)?
You use all 5.
You have to perform whatever you wanted to do with the data on all 5 sets of data and then combine the results of that.
i.e. you run a t-test on all 5 datasets and then combine the results... somehow. I have not yet looked into that, but from what I have heard you can use the Zelig R package to do it somewhat easily. I also noted a reference to papers that should describe methods for combining those, but have not looked into that either: King et al. (2001) and Schafer (1997).
My guess was that you just average out the p-values gained from the analysis, but the standard approach (Rubin's rules, described in the references above) instead pools the point estimates and their standard errors across the imputed datasets.
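As a sketch of that combining step with Amelia itself (assuming a.out is the object returned by amelia() and a hypothetical linear model y ~ x1 + x2): fit the model on each completed dataset, then pool the coefficients and standard errors with mi.meld(), which applies Rubin's rules.

# Sketch: fit the same model on each of the m completed datasets and pool.
# `a.out` and the formula y ~ x1 + x2 are hypothetical placeholders.
library(Amelia)

b.out <- NULL
se.out <- NULL
for (i in seq_len(a.out$m)) {
  fit <- lm(y ~ x1 + x2, data = a.out$imputations[[i]])  # model on i-th completed dataset
  b.out  <- rbind(b.out, coef(fit))                      # point estimates
  se.out <- rbind(se.out, coef(summary(fit))[, 2])       # standard errors
}

combined <- mi.meld(q = b.out, se = se.out)  # Rubin's rules: pooled estimates and SEs
combined$q.mi                                # pooled coefficients
combined$se.mi                               # pooled standard errors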

Neural Network Input and Output Data formatting

Thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now to the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because that's just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" (0 or 1?) and the odds at which the runner won or lost. If I just use the "win/lose", it treats all wins and losses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot, but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function, as there will be a "discontinuity" between short-price winners and short-price losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data, aka "clamping" or "winsorizing", is something I use for similar data. For example, "days since last run" for a horse could be anything from just one day to several years, but tends to centre in the region of 20 to 30 days. Some experts use a figure of, say, 63 days to indicate a "spell", so you could have an indicator variable like "> 63 = 1, else 0", for example. One approach is to look at the outliers, say the upper or lower 5% of any variable, and clamp these.
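A rough sketch of that clamping idea in R, using a hypothetical df$days_since_run column: winsorize at the 5th/95th percentiles, add the "> 63 days" spell indicator, and rescale to [0, 1] before feeding the network.

# Sketch: clamp extremes, add a spell indicator, min-max scale to [0, 1].
# `df$days_since_run` is a hypothetical column name.
lims <- quantile(df$days_since_run, c(0.05, 0.95), na.rm = TRUE)
clamped <- pmin(pmax(df$days_since_run, lims[1]), lims[2])   # clamp the upper/lower 5%

df$spell  <- as.integer(df$days_since_run > 63)              # "> 63 days = spell" indicator
df$days01 <- (clamped - lims[1]) / (lims[2] - lims[1])       # rescale the clamped values to [0, 1]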
If you use odds/dividends anywhere, make sure you convert them to probabilities, i.e. 1/(odds+1), and a useful idea is to normalize these so they sum to 100% within each race.
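For the odds, a minimal sketch (hypothetical odds and race_id columns): convert to implied probabilities with 1/(odds + 1) and renormalize so they sum to 1 (i.e. 100%) within each race.

# Sketch: implied probabilities from odds, normalized within each race.
# `df$odds` and `df$race_id` are hypothetical column names.
df$prob <- 1 / (df$odds + 1)
df$prob_norm <- ave(df$prob, df$race_id, FUN = function(p) p / sum(p))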
The odds or parimutuel prices tend to swamp other predictors, so one technique is to develop separate models: one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning (e.g. for a variable dealing with the ratio of satisfactory accounts to total accounts, a blank response means the individual does not have any accounts at all, whereas a response of 0 means the individual has no satisfactory accounts).
Currently, these records do not get included in logistic regression analyses because they have missing values for one or more fields. Is there a way to include these records in a logistic regression model?
I am aware that I can assign these blank fields a value that is not in the range of the data (e.g. going back to the ratio variable above, we could use 9999 or -1, since these values are outside the range of a ratio variable (0 to 1)). I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to the logical restrictions of your experimental design and the fact that it will somewhat weaken the power of your analysis relative to the same analysis with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging in a few numbers. See this page for more information. Ultimately, this is probably a better question for Cross Validated, at least until you have figured out the experimental design issues.