Missing value imputation - missing-data

If a dataset has missing values in both categorical and continuous variables, how can I deal with them by replacing the missing values with the mode for the categorical variables and the mean for the continuous variables?

When the missing data are missing at random, you could impute the missing values using multiple imputation.
For more information about multiple imputation, I would recommend the book Applied Missing Data Analysis by C.K. Enders (2010). It also has a great companion website.
For multiple imputation in R you could use the mice package; see the package page on CRAN, its documentation, and the accompanying article in the Journal of Statistical Software.
There are other packages for multiple imputation.
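If all you need is the simple replacement described in the question (mode for categorical, mean for continuous), a minimal pandas sketch might look like this (the dataframe and column names are made up for illustration; note this is single imputation, which understates uncertainty compared with multiple imputation):

```python
import pandas as pd

# Toy data (column names are made up): one categorical, one continuous column.
df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],   # categorical
    "height": [1.0, None, 2.0, 3.0],         # continuous
})

# Replace missing values: mode for categorical columns, mean for continuous.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())
```

After the loop, the missing "color" becomes "red" (the most frequent value) and the missing "height" becomes 2.0 (the column mean).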

You can try either fillna() or interpolate() in pandas.
For more details about these two, please refer to my answer to the Stack Overflow question "Missing values in Time Series in python".
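A quick illustration of the two pandas methods on a toy series (values made up):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

# fillna() substitutes a fixed value; ffill() carries the last
# observed value forward instead.
print(s.fillna(0.0).tolist())    # [1.0, 0.0, 3.0, 0.0, 5.0]
print(s.ffill().tolist())        # [1.0, 1.0, 3.0, 3.0, 5.0]

# interpolate() estimates missing points from neighbouring values
# (linear by default), often a better fit for time series.
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```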

Related

How to deal with missing data? Info will be used for data visualization

How does everyone deal with missing values in a dataframe? I created a dataframe by using a Census Web API to get the data. The 'GTCBSA' variable provides the city information, which I need for my visualization (plotly and dash), and I found that there are many missing values in the data. Do I just leave them blank and continue with my data visualization? The following is my variable:
Example data for 2004 = https://api.census.gov/data/2004/cps/basic/jun?get=GTCBSA,PEFNTVTY&for=state:*
Variable description = https://api.census.gov/data/2022/cps/basic/jan/variables/GTCBSA.json
There are different ways of dealing with missing data depending on the use case and the type of data that is missing. For example, for a near-continuous stream of timeseries signals data with some missing values, you can attempt to fill the missing values based on nearby values by performing some type of interpolation (linear interpolation, for example).
However, in your case, the missing values are cities and the rows are all independent (each row is a different respondent). As far as I can tell, you don't have any way to reasonably infer the city for the rows where the city is missing so you'll have to drop these rows from consideration.
I am not an expert in the data collection method(s) used by the US census, but from this source, it seems like there are multiple methods used so I can see how it might be possible that the city of the respondent isn't known (the online tool might not be able to obtain the city of the respondent, or perhaps the respondent declined to state their city). Missing data is a very common issue.
However, before dropping all of the rows with missing cities, you might do a brief check to see whether there is any pattern (e.g. are the rows with missing cities predominantly from one state?). If you are doing any state-level analysis, you could keep the rows with missing cities.
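That pattern check is quick to do in pandas. A sketch on a hypothetical frame mimicking the Census extract (the codes below are made up; in the real API response, missing cities may be encoded differently than NaN):

```python
import pandas as pd

# Hypothetical frame: GTCBSA is the city/CBSA code, "state" the state
# FIPS code; None marks a missing city.
df = pd.DataFrame({
    "GTCBSA": [31080, None, None, 16980, None],
    "state": ["06", "06", "17", "17", "17"],
})

# Share of rows with a missing city, broken down by state: a strong
# skew toward a few states would suggest the missingness is not random.
missing_by_state = df["GTCBSA"].isna().groupby(df["state"]).mean()
print(missing_by_state)

# For state-level analysis you can keep all rows; for city-level
# work, drop the rows without a city.
df_cities = df.dropna(subset=["GTCBSA"])
```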

python - pandas - dataframe - data padding multidimensional statistics

I have a dataframe with columns accounting for different characteristics of stars and rows accounting for measurements of different stars, something like this:
property    A    A_error    B    B_error    C    C_error    ...
star1
star2
star3
...
In some measurements the error for a specific property is -1.00, which means the measurement was faulty.
In such a case I want to discard the measurement.
One way to do so is by eliminating the entire row (along with other properties whose error was not -1.00).
I think it's possible to fill in the faulty measurement with a value generated from the distribution of all the other measurements, meaning: given the other properties, which are fine, this property should have this value in order to reduce the error of the entire dataset.
Is there a proper name for the idea I'm referring to?
How would you apply such an algorithm?
I'm a student on a solo project, so I would really appreciate answers that also elaborate on the theory (:
Edit
After further reading, I think what I was referring to is called regression imputation.
So I guess my question is: how can I implement multidimensional linear regression on a dataframe in the most efficient way?
Thanks!
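On the self-answer: regression imputation can be sketched with plain numpy on toy data (all numbers below are made up; for real work a maintained implementation such as scikit-learn's IterativeImputer is preferable):

```python
import numpy as np

# Toy measurements: columns are properties A, B, C; B of the second star
# is faulty (in the real data, flagged by B_error == -1.00), set to NaN here.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 5.0],
    [3.0, 6.0, 7.0],
    [4.0, 8.0, 9.0],
])

col = 1                     # column to impute (property B)
miss = np.isnan(X[:, col])  # boolean mask of faulty rows

# Regression imputation: fit a linear model predicting the faulty column
# from the other columns, using only the complete rows, then predict.
others = np.delete(X, col, axis=1)
A_train = np.c_[np.ones(np.sum(~miss)), others[~miss]]   # add intercept
beta, *_ = np.linalg.lstsq(A_train, X[~miss, col], rcond=None)

A_miss = np.c_[np.ones(np.sum(miss)), others[miss]]
X[miss, col] = A_miss @ beta
```

In this toy example B is exactly twice A, so the imputed value comes out as 4.0. A caveat from the theory: plain regression imputation fills in values that lie exactly on the fitted line, which shrinks the variance of the imputed column; stochastic regression imputation (adding a random residual) or multiple imputation addresses that.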

Tensorflow Quickstart Tutorial example - where do we say which column is what?

I'm slowly training myself on Tensorflow, following the Get Started section on Tensorflow.org. So far so good.
I have one question regarding the Iris data set example:
https://www.tensorflow.org/get_started/estimator
When it comes to that section:
https://www.tensorflow.org/get_started/estimator#construct_a_deep_neural_network_classifier
It is not entirely clear to me when/where we tell the system which columns are features and which column is the target value.
The only thing I see is that when we load the dataset, we tell the system the target is in integer format and the features are in float32. But what if my data is all integer or all float32? How can I tell the system that, for example, the first 5 columns are features and the target is in the last column? Or is it implied that the first columns are always features and the last column can only be the target?
The statement 'feature_columns=feature_columns' when defining the DNNClassifier says which columns are to be considered features, and the input function says which values are the inputs and which are the labels.
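To make that concrete: nothing in the data itself marks the target. The tutorial's CSV loader happens to put the target in the last column, but the slicing code is what decides (a toy numpy sketch, not the tutorial's actual loader):

```python
import numpy as np

# Iris-like rows: 4 feature columns followed by the integer target.
data = np.array([
    [5.1, 3.5, 1.4, 0.2, 0],
    [7.0, 3.2, 4.7, 1.4, 1],
    [6.3, 3.3, 6.0, 2.5, 2],
])

# These slices, not the file format, define features vs. target;
# change the slices and any columns can play either role.
features = data[:, :-1].astype(np.float32)
target = data[:, -1].astype(np.int64)
```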

How to replace the missing data from AMELIA results

I have run an Amelia imputation for a data set containing missing data. I need to replace the missing points with the result of amelia(), but it contains 5 groups of imputed values. How can I choose the best one to replace the missing values (to plot a graph of the data set after imputing)?
You use all 5.
You have to perform whatever you wanted to do with the data on all 5 sets of data and then combine the results of that.
i.e. you run a t-test on all 5 datasets and then combine the results somehow. I have not yet looked into that, but from what I have heard you can use the Zelig R package to do it somewhat easily. I also noted a reference to papers that should describe methods for combining those, but have not looked into those either: King et al. (2001) and Schafer (1997).
Note that simply averaging the p-values is not the standard approach; the usual method (Rubin's rules) pools the point estimates and their variances across the imputed datasets.
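Rather than averaging p-values, the standard pooling (Rubin's rules) combines the point estimates and their variances. A small sketch with made-up numbers from 5 hypothetical analyses:

```python
import numpy as np

# Hypothetical per-dataset results: one point estimate and its
# variance (squared standard error) from each of the 5 analyses.
estimates = np.array([1.9, 2.1, 2.0, 2.2, 1.8])
variances = np.array([0.10, 0.12, 0.11, 0.09, 0.13])

m = len(estimates)
q_bar = estimates.mean()        # pooled point estimate
u_bar = variances.mean()        # within-imputation variance
b = estimates.var(ddof=1)       # between-imputation variance
t = u_bar + (1 + 1 / m) * b     # total variance of the pooled estimate
print(q_bar, t ** 0.5)          # pooled estimate and its standard error
```

This is what Zelig (and mice::pool in R) does under the hood, including the extra (1 + 1/m)·B term that accounts for the uncertainty due to imputation itself.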

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning (e.g. a blank response for a variable measuring the ratio of satisfactory accounts to total accounts means the individual does not have any accounts, whereas a response of 0 means the individual has no satisfactory accounts).
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields a value that is not in the range of the data (e.g., returning to the ratio variable above, we could use 9999 or -1, since these values fall outside the range of a ratio (0 to 1)). I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions from your experimental design and the fact that it will somewhat weaken the power of your analysis relative to having the same data with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging a few numbers in. See this page for more information. Ultimately this is probably a better question for Cross-Validated at least until you have figured out the experimental design issues.
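For the case where the blank itself carries meaning ("no accounts", not "unknown"), an alternative to sentinel values like 9999 is an explicit missing-indicator variable. Sketched here in Python for illustration (variable names are made up; in SAS the same effect comes from creating a 0/1 indicator in a DATA step before PROC LOGISTIC):

```python
import pandas as pd

# Ratio of satisfactory to total accounts; None means "no accounts at all".
df = pd.DataFrame({"sat_ratio": [0.8, None, 0.0, 0.5]})

# An indicator column lets the model learn a separate effect for the
# "no accounts" group, instead of an out-of-range sentinel distorting
# the ratio's coefficient.
df["no_accounts"] = df["sat_ratio"].isna().astype(int)  # 1 = was blank
df["sat_ratio"] = df["sat_ratio"].fillna(0.0)           # neutral fill
```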