R missForest mixError makes no sense? - missing-data

For the mixError function in missForest, the documentation says
Usage:
mixError(ximp, xmis, xtrue)
Arguments
ximp : imputed data matrix with variables in the columns and observations in the rows. Note there should not be any missing values.
xmis: data matrix with missing values.
xtrue: complete data matrix. Note there should not be any missing values.
So my question is:
If I already have xtrue, why do I need this function?
All the examples start with complete data, introduce some NAs on purpose, use missForest to fill in the NAs, and then calculate the error by comparing the imputed data with the original data without NAs.
But what is the point of that if I already have the complete data?
So a follow-up question is:
Could xtrue be the original data with all the rows containing NAs removed?
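The workflow described in those examples (start from complete data, remove values on purpose, impute, compare) can be sketched in Python as below. This is only an illustration of the benchmarking idea: the data is synthetic, column-mean imputation stands in for missForest, and the nrmse helper is a hypothetical normalized RMSE that is not claimed to match mixError's exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete data (plays the role of xtrue); purely synthetic.
xtrue = rng.normal(loc=10.0, scale=2.0, size=(200, 3))

# Remove about 10% of the entries on purpose (plays the role of xmis).
mask = rng.random(xtrue.shape) < 0.10
xmis = xtrue.copy()
xmis[mask] = np.nan

# Stand-in imputer: column means. A real run would call missForest here.
col_means = np.nanmean(xmis, axis=0)
ximp = np.where(np.isnan(xmis), col_means, xmis)


def nrmse(ximp, xmis, xtrue):
    """Hypothetical normalized RMSE over the entries that are missing in xmis."""
    missing = np.isnan(xmis)
    sq_err = (ximp[missing] - xtrue[missing]) ** 2
    return float(np.sqrt(sq_err.mean() / xtrue[missing].var()))


error = nrmse(ximp, xmis, xtrue)
print(f"imputation error on the removed entries: {error:.3f}")
```

The error is computed only on the entries that were removed, which is why the true values for those entries have to be known in this kind of benchmark.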

Related

pandas.qcut returning NaN values

I want to make a new column from "TotalPrice" with the qcut function, but some values are returned as NaN, and I don't know why.
I tried changing the data type of the column, but nothing changed.
Edit:
You are running qcut on df rather than on the rfm dataframe. Make sure that this is what you intend.
Because you did not provide data for a minimal reproducible example, my guess is that there is not enough data or there are too many repeated values. In that case the underlying quantile function may fail to find the bin edges and return NaN.
(this did not make any sense because "M" buckets did not make sense with "TotalPrice")
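The first point (binning df but storing the result in rfm) can be reproduced with a small sketch. The frames below are made up, since the question does not include data; the column names M_wrong and M_right are hypothetical. It also shows that a NaN in the input stays NaN in the output.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction-level frame with a default integer index.
df = pd.DataFrame({"TotalPrice": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]})

# Hypothetical per-customer frame indexed by customer ID.
rfm = pd.DataFrame(
    {"TotalPrice": [30.0, 70.0, 110.0, 150.0]},
    index=["c1", "c2", "c3", "c4"],
)

labels = [1, 2, 3, 4]

# Binning df but assigning into rfm: pandas aligns on the index,
# the indexes share no labels, so every value becomes NaN.
rfm["M_wrong"] = pd.qcut(df["TotalPrice"], q=4, labels=labels)

# Binning the column of the frame that receives the result.
rfm["M_right"] = pd.qcut(rfm["TotalPrice"], q=4, labels=labels)

# A NaN in the input is not assigned to any bin.
with_gap = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
binned = pd.qcut(with_gap, q=2, labels=["low", "high"])

print(rfm)
print(binned)
```

If the two frames only partly share index labels, only the non-matching rows end up as NaN, which would look like "some values" being NaN.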

python - pandas - dataframe - data padding multidimensional statistics

I have a dataframe with columns for different characteristics of stars and rows for measurements of different stars (something like this):
\property_______A _______A_error_______B_______B_error_______C_______C_error ...
star1
star2
star3
...
In some measurements the error for a specific property is -1.00, which means the measurement was faulty.
In such a case I want to discard the measurement.
One way to do so is to eliminate the entire row (along with the other properties whose error was not -1.00).
I think it's possible to fill in the faulty measurement with a value generated from the distribution of all the other measurements. Meaning: given the other properties, which are fine, this property should have the value that reduces the error of the entire dataset.
Is there a proper name for the idea I'm referring to?
How would you apply such an algorithm?
I'm a student on a solo project, so I would really appreciate answers that also elaborate on the theory (:
Edit:
After further reading, I think what I was referring to is called regression imputation.
So I guess my question is: how can I implement multidimensional linear regression on a dataframe in the most efficient way?
Thanks!
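A minimal sketch of regression imputation on such a dataframe, assuming the layout from the question (a value column plus a matching error column, with -1.00 marking a faulty measurement). The function name regression_impute and the sample numbers are made up for illustration, and the sketch assumes the predictor columns themselves are not faulty in the rows being filled. It fits an ordinary least-squares model on the good rows with NumPy and predicts the flagged values.

```python
import numpy as np
import pandas as pd


def regression_impute(df, target, predictors, error_col, bad_flag=-1.0):
    """Replace flagged values of `target` with linear-regression predictions."""
    out = df.copy()
    bad = (out[error_col] == bad_flag).to_numpy()

    # Design matrix with an intercept column followed by the predictors.
    X = np.column_stack([np.ones(len(out)), out[predictors].to_numpy(dtype=float)])
    y = out[target].to_numpy(dtype=float)

    # Fit on the rows whose measurement is fine, then predict the faulty ones.
    coef, *_ = np.linalg.lstsq(X[~bad], y[~bad], rcond=None)
    out.loc[bad, target] = X[bad] @ coef
    return out


# Made-up data where A = 1 + 2*B + 3*C; the third measurement of A is faulty.
stars = pd.DataFrame({
    "A": [9.0, 8.0, -999.0, 18.0, 29.0],
    "A_error": [0.1, 0.2, -1.0, 0.1, 0.3],
    "B": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C": [2.0, 1.0, 4.0, 3.0, 6.0],
})

filled = regression_impute(stars, "A", ["B", "C"], "A_error")
print(filled)
```

The error column is left at -1.00 for the filled rows, so it still records which values were imputed rather than measured.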

How to do sampling in sql query to get dataframe with pandas

Note my question is a bit different here:
I am working with pandas on a dataset that has a lot of data (10M+):
q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')
I know I can use the pandas sample function with a frac option, like
df_sample = df.sample(frac=0.01)
However, I do not want to generate the original df at that size. I wonder what the best practice is for generating a dataframe from data that is already sampled.
I've read some SQL posts where the sample was generated from a slice; that is absolutely not acceptable in my case. The sample needs to be distributed as evenly as possible.
Can anyone shed more light on this?
Thank you very much.
UPDATE:
Below is a table showing what the data looks like:
Reputation is the field I am working on. You can see that the majority of records have a very small reputation.
I don't want to work with a dataframe containing all the records. I want the sampled data to look like the un-sampled data, for example with a similar histogram; that's what I meant by "evenly".
I hope this clarifies a bit.
A simple random sample can be performed using the following syntax:
select * from mydata where rand()>0.9
This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or guarantee that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach:
http://sqlfiddle.com/#!9/21d1ee/2
On average, random sampling will produce the same distribution as the underlying data, so it meets your requirement. However, if you want to 'force' the sample to be more representative or force it to be a certain size, we need to look at something a little more advanced.
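Combined with the read_gbq call from the question, the same idea could be sketched in Python as below, so that the sampling happens in the query and only the sampled rows reach pandas. The helper names build_sample_query and read_sample and the table name are hypothetical, and the sketch assumes the RAND() function of BigQuery standard SQL; the query is only built here, not executed.

```python
def build_sample_query(table: str, fraction: float) -> str:
    """Build a query that keeps each row with probability `fraction`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return f"SELECT * FROM `{table}` WHERE RAND() < {fraction}"


def read_sample(table: str, fraction: float, project: str):
    """Run the sampled query; needs pandas with BigQuery support and credentials."""
    import pandas as pd  # imported here so building the query works without pandas

    query = build_sample_query(table, fraction)
    return pd.read_gbq(query, project_id=project, dialect="standard")


# Hypothetical table name, used only to show the generated SQL.
query = build_sample_query("my-project.my_dataset.my_table", 0.01)
print(query)
```

As with the SQL above, the number of rows returned varies from run to run and is only about 1% of the table on average.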

Tensorflow Quickstart Tutorial example - where do we say which column is what?

I'm slowly training myself on Tensorflow, following the Get Started section on Tensorflow.org. So far so good.
I have one question regarding the Iris data set example:
https://www.tensorflow.org/get_started/estimator
When it comes to that section:
https://www.tensorflow.org/get_started/estimator#construct_a_deep_neural_network_classifier
It is not entirely clear to me when or where we tell the system which columns are features and which column is the target value.
The only thing I see is that when we load the dataset, we tell the system that the target is in integer format and the features are in float32. But what if my data is all integer or all float32? How can I tell the system that, for example, the first 5 columns are features and the target is in the last column? Or is it implied that the first columns are always features and only the last column can be the target?
The statement 'feature_columns=feature_columns' when defining the DNNClassifier says which columns are to be considered part of the input data, and the input function says which values are inputs and which are labels.
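To make the role of the input function concrete, here is a plain NumPy sketch with no TensorFlow calls. It only mirrors the (features, labels) pair that an input function hands to the estimator. The sample rows, the key "x", and the choice of the last column as the label are assumptions for illustration; the key would have to match the name used when the feature columns are defined.

```python
import numpy as np

# Hypothetical data: 4 feature columns followed by 1 integer label column.
data = np.array([
    [5.1, 3.5, 1.4, 0.2, 0],
    [7.0, 3.2, 4.7, 1.4, 1],
    [6.3, 3.3, 6.0, 2.5, 2],
])

FEATURE_KEY = "x"  # hypothetical name; must match the feature column definition


def input_fn(rows):
    """Split rows into a features dict and a labels array."""
    features = {FEATURE_KEY: rows[:, :-1].astype(np.float32)}  # all but last column
    labels = rows[:, -1].astype(np.int64)                      # last column
    return features, labels


features, labels = input_fn(data)
print(features[FEATURE_KEY].shape, labels)
```

The split is explicit in the slicing, so a different layout (for example, the label in the first column) only changes the two slice expressions.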

How to distinguish in master data and calculated interpolated data?

I'm getting a bunch of vectors with data points for a fixed set of values. Below is an example of a vector with one value per time point:
1D:2
2D:
7D:5
1M:6
6M:6.5
But alas, a value is not available for every time point. All vectors are stored in a database, and a trigger calculates the missing values by interpolation, or possibly by a more advanced algorithm. Somehow I want to be able to tell which data points have been calculated and which were originally delivered to us. Of course I can add a flag column to the table indicating whether a value is a master value or a calculated one, but I'm wondering whether there is a more sophisticated way. We probably won't need to determine this on a regular basis, so CPU cycles are not an issue for determination or insertion.
The example above shows some nice-looking numbers, but in reality they would look more like 3.1415966533.
The database used for storage is Oracle 10.
cheers.
Could you deactivate the trigger temporarily?
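For reference, the flag-column idea from the question can be sketched outside the database in Python with pandas. This is not the Oracle trigger itself, only an illustration of recording which values were calculated at the moment they are filled in. The day counts for 1M and 6M (30 and 180) and the column name is_calculated are assumptions.

```python
import numpy as np
import pandas as pd

# Time points from the example, expressed in days (1M = 30, 6M = 180 assumed).
days = [1, 2, 7, 30, 180]
raw = pd.Series([2.0, np.nan, 5.0, 6.0, 6.5], index=days, name="value")

curve = pd.DataFrame({
    # Linear interpolation that uses the day counts as x-coordinates.
    "value": raw.interpolate(method="index"),
    # True where the delivered vector had no value, i.e. the value is calculated.
    "is_calculated": raw.isna(),
})

print(curve)
```

Because the flag is derived from the delivered vector before interpolation, it stays correct even if the interpolation method is later replaced by a more advanced one.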