Imputation performance with the percentage of missing values

I wonder which methods are typically used for missing value imputation (other than MICE and missForest). Up to what percentage of missing values do they still work if the data is missing at random? What if the data is missing not at random? And how can I check the performance of an imputation method on my own data (real-life data, so there is no "right" answer)? Thank you!

Related

pandas.qcut returning NaN values

I want to make a new column from "TotalPrice" with the qcut function, but some values come back as NaN and I don't know why.
I tried changing the data type of the column, but nothing changed.
Edit:
You are doing a qcut on df rather than on the rfm dataframe. Make sure that this is what you intend to do.
Because you did not provide data for a minimal reproducible example, I would guess that there is either not enough data or too many repeated values. In that case the underlying quantile function may fail to find distinct bin edges and returns NaN.
(This did not make sense in the end, because the "M" buckets did not make sense with "TotalPrice".)
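For illustration, here is a minimal sketch of the index-alignment issue described above; the frame names (df, rfm) and the toy data are invented:

import pandas as pd

# Hypothetical data: df has one row per transaction (default integer index),
# rfm has one row per customer (indexed by CustomerID).
df = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "C", "C", "C"],
    "TotalPrice": [10.0, 25.0, 40.0, 55.0, 70.0, 5.0],
})
rfm = df.groupby("CustomerID")["TotalPrice"].sum().to_frame()

# Binning df's column but assigning to rfm: pandas aligns on the index, and
# since rfm's index ("A", "B", "C") does not match df's (0..5), the new
# column comes out as all NaN.
rfm["M"] = pd.qcut(df["TotalPrice"], 4, labels=[1, 2, 3, 4])

# Binning rfm's own column fixes it; duplicates="drop" guards against
# repeated quantile edges when there are many tied values.
rfm["M"] = pd.qcut(rfm["TotalPrice"], 3, labels=False, duplicates="drop")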

Does a column of only zeros provide any information in a data analysis? What if it has missing values?

I have a tricky question someone asked me:
I have a bunch of columns of data to predict some future sales.
Several of these columns use a lot of memory and contain only zeros. The question is: can I just remove these columns from my analysis?
Second part: what if the columns that only contain zeros also have missing values? What do you do then?
Analyzing datasets with large numbers of zeros can indeed be a waste of computational resources (see sparse matrices computation), especially since they may not be contributing any meaningful information. In fact, they might even add noise to your dataset, obscuring any relationships you might find otherwise.
But there are cases where zeros can be incredibly meaningful. For example, if you were trying to predict future sales of products based on units sold (count data), with each column representing a month of sales, you might want to keep the zeros, as they provide insight into sales of your product during those specific months.
Removing missing values is definitely tricky, and they can often be a sign that you might need to re-examine the data collection process for explanation; is there a reason why the data is missing, or does the missing value mean something (e.g. sometimes zeros can be coded as NAs)?
How you deal with zeros/missing values varies greatly based on the domain you're in, the specific question you're trying to answer, and what exactly your data/columns mean. So it's hard to answer without knowing more about the data itself.
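As a rough sketch of the bookkeeping involved (the column names and toy data are invented), you can check which columns are truly all zeros and which merely mix zeros with missing values before deciding to drop anything:

import numpy as np
import pandas as pd

# Hypothetical monthly sales table; the column names are made up.
sales = pd.DataFrame({
    "jan": [3, 0, 5, 2],
    "feb": [0, 0, 0, 0],        # constant zeros: no variance, no information
    "mar": [0, np.nan, 0, 0],   # zeros plus missing values: a different case
})

# Columns that are all zeros have zero variance and are usually safe to drop,
# but only after confirming the zeros are not meaningful counts (e.g. months
# with genuinely no sales).
all_zero = [c for c in sales.columns if (sales[c] == 0).all()]

# Columns with missing values need a separate decision: check how much is
# missing and whether NaN is really a disguised zero.
missing_share = sales.isna().mean()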

How do you deal with missing data when something like 60% of it is missing?

My data has a lot of missing values and I have to predict those values. One way is to take the average of those values, but I want to hear other perspectives on it. How do experienced data scientists solve this kind of issue?
Are your missing values categorical or continuous?
One way is to remove the samples entirely; however, this may lead to sampling bias, since the missing values could be the result of some causal effect, that is, the values are not missing completely at random.
If your data has enough dimensionality, you can treat your missing values as the output and fit a predictive model, hoping that it can faithfully estimate the missing values given the explanatory variables you already have.
Picking the most frequent value, the median, or averaging, as you point out, could also be an option; however, be careful with outliers when averaging, as these can have a tremendous effect on the mean.
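A minimal sketch of the model-based and simple-statistic options, assuming numeric data and using scikit-learn (the toy matrix is invented):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy numeric matrix with missing entries; real data would have far more rows.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, np.nan]])

# Simple option: fill each column with its median (more robust to outliers
# than the mean).
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Model-based option: each column with missing values is regressed on the
# other columns, which only helps if the data has enough informative
# dimensionality.
X_model = IterativeImputer(random_state=0).fit_transform(X)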
It depends on the nature of the variables; it may be a statistic like the mean or median. Another practice is to assign the missing values some value different from the others, for example 0, -1, or something like that.
The hardest approach is to impute the dataset without deviating too far from the truth. A test to validate how well you have done this is the following: if the other variables carry enough information to impute the missing data with some level of precision, they should be able to do the same for the data that is present.
So if 60 percent of the column is missing, take the row observations where this column is PRESENT.
Next, randomly remove 60% of the values in this subset. Now run the imputation methods of your choosing.
Compare the imputed dataset to the real dataset for similarity. Decide whether they are close enough for you to then run this against the full dataset. At least this approach will give you a leg to stand on if you need to defend yourself.
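A minimal sketch of that check, assuming numeric columns and using scikit-learn's IterativeImputer as a stand-in for whatever imputation method you prefer (the function and argument names are invented):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def imputation_check(df, column, frac_missing=0.6, seed=0):
    # 1. Keep only the rows where the column of interest is actually present.
    observed = df[df[column].notna()].copy()

    # 2. Randomly hide the same share of values that is missing in the full data.
    rng = np.random.default_rng(seed)
    mask = rng.random(len(observed)) < frac_missing
    truth = observed.loc[mask, column].to_numpy()
    observed.loc[mask, column] = np.nan

    # 3. Run the imputation method of your choosing (IterativeImputer here).
    imputed = pd.DataFrame(
        IterativeImputer(random_state=seed).fit_transform(observed),
        columns=observed.columns, index=observed.index,
    )

    # 4. Compare the imputed values with the held-out truth, e.g. via RMSE.
    return np.sqrt(np.mean((imputed.loc[mask, column].to_numpy() - truth) ** 2))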
Fight the Good Fight.

R missForest mixError makes no sense?

For the mixError function in missForest, the documentation says
Usage:
mixError(ximp, xmis, xtrue)
Arguments
ximp : imputed data matrix with variables in the columns and observations in the rows. Note there should not be any missing values.
xmis: data matrix with missing values.
xtrue: complete data matrix. Note there should not be any missing values.
Then my question is:
If I already have xtrue, why do I need this function?
All the examples start from complete data: they insert some NAs on purpose, then use missForest to fill in the NAs, and then calculate the error by comparing the imputed data with the original data without NAs.
But what is the point of that, if I already have the complete data?
So the question is also:
Could xtrue be the original data with all the rows containing NAs removed?

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning (e.g. a blank response for a variable giving the ratio of satisfactory accounts to total accounts means the individual does not have any accounts, whereas a response of 0 means the individual has no satisfactory accounts).
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields a value that is not in the range of the data (e.g. going back to the ratio variable above, we could use 9999 or -1, as these values are not in the range of a ratio variable (0 to 1)). I am just curious whether there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions on your experimental design and the fact that it will somewhat weaken the power of your analysis relative to having the same data with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging in a few numbers. See this page for more information. Ultimately this is probably a better question for Cross Validated, at least until you have figured out the experimental design issues.
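Not SAS, but as a rough Python analogue of one alternative to an out-of-range sentinel, the missing-indicator approach keeps these records in the regression (the data and column names are invented, and this is not the PROC MI multiple-imputation workflow mentioned above):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: the ratio is blank when the person has no accounts at all.
df = pd.DataFrame({
    "sat_ratio": [0.9, 0.2, np.nan, 0.6, np.nan, 1.0, 0.4, 0.1],
    "default":   [0,   1,   1,      0,   0,      0,   1,   0],
})

# Instead of a sentinel like 9999 or -1, keep the ratio on its natural 0-1
# scale and add an explicit "no accounts" flag; the regression can then use
# the missingness itself as information rather than as an extreme value.
df["no_accounts"] = df["sat_ratio"].isna().astype(int)
df["sat_ratio"] = df["sat_ratio"].fillna(0.0)

X = sm.add_constant(df[["sat_ratio", "no_accounts"]])
model = sm.Logit(df["default"], X).fit(disp=0)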