I know about sklearn.preprocessing.Imputer but does Tensorflow have built-in functions to do this as well?
In case your imputation cannot be the same for all entries as suggested before, you may want to use tensorflow-transform.
For example, if you want to impute missing values with the mean or the median of the corresponding column, you cannot do so with a static default, since such values are dynamic and depend on the whole dataset (or on a subset of it, depending on your needs/rules).
Check out one of the examples on how you would do that in the official repository.
As far as I know, there isn't a handy function that does the same thing as sklearn.preprocessing.Imputer.
There are a few ways of dealing with missing values using built-in functions:
While reading in data: For example, you can set the default value for a missing field when reading in a CSV using the record_defaults argument.
If you have the data already: You can replace the NaNs using tf.where, as sketched below.
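A minimal sketch of both approaches, assuming the missing values show up as NaN in a float tensor (toy values, TensorFlow 2.x):

```python
import tensorflow as tf

# 1) While reading: record_defaults fills fields that are empty in the CSV record
line = tf.constant("1.0,,3.0")
cols = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0], [0.0]])
# -> [1.0, 0.0, 3.0]

# 2) After the fact: replace every NaN with a fixed fill value (0.0 here)
x = tf.constant([1.0, float("nan"), 3.0, float("nan")])
imputed = tf.where(tf.math.is_nan(x), tf.zeros_like(x), x)
print(imputed.numpy())  # [1. 0. 3. 0.]
```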
I want to find the average of my column 'Preheat_To_Pour_Time' based on the values of the column Rampmelt_Active. The Rampmelt_Active values are either 1 or 0 depending on whether it's active. I can't figure out how to use the values I get from .value_counts(), if I even need them.
I have tried using .value_counts() on Rampmelt_Active to get the count I need for my division, as well as the .mean() method. However, this only gives me one value instead of a separate average for the 0s and the 1s.
You likely need to add a groupby() call (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). However, without a minimal reproducible example it is hard to tell (https://stackoverflow.com/help/minimal-reproducible-example).
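For example, a quick sketch with made-up values for the two columns from the question:

```python
import pandas as pd

# Hypothetical data mirroring the two columns from the question
df = pd.DataFrame({
    "Rampmelt_Active":      [1, 0, 1, 0, 1],
    "Preheat_To_Pour_Time": [12.0, 30.0, 14.0, 28.0, 13.0],
})

# One mean of Preheat_To_Pour_Time per value of Rampmelt_Active (0 and 1),
# so no manual division by .value_counts() is needed
means = df.groupby("Rampmelt_Active")["Preheat_To_Pour_Time"].mean()
print(means)
```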
I just started, so this might be a stupid question, but I have the following problem:
I created a .csv file for some basic data description. However, although all values are numerical without any missing entries, when using df.dtypes most variables are reported as object, with only some being int64 or float64. Do I have to manually convert all object columns to numerical ones in code?
Or is there anything I did wrong when creating my csv?
Also, the date, which I saved in the format yyyy-mm-dd, is shown as object instead of a date format.
The values range from [0, 2] for some variables to [0, 2000000] for others.
Could the formatting in Excel be a problem?
Is there any "How to build your csv" documentation, so that I don't have to ask stupid beginner questions like this?
Additionally, I was told that for a model to work properly I need to do some scaling/normalization of my data, as the value ranges differ a lot. Where can I find more information on that?
I would suggest you just do the data type conversion before saving the CSV file. You can use any of the functions below for the conversion (a short sketch follows the list):
astype()
to_numeric()
convert_dtypes()
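A quick sketch of the three options on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["1.5", "2.0", "x"]})

# astype(): explicit cast when every value is parseable
df["a"] = df["a"].astype("int64")

# to_numeric(): coerce unparsable strings to NaN instead of raising
df["b"] = pd.to_numeric(df["b"], errors="coerce")

# convert_dtypes(): let pandas infer the best supported dtype per column
df = df.convert_dtypes()
print(df.dtypes)
```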
You can use the linked article for more information on scaling: https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/
pd.read_csv already has an option to specify the type, so if you want you can pass the dtype argument to read_csv. For the date, you always have to convert the column to datetime (for example with the parse_dates argument or pd.to_datetime).
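A minimal sketch (the file name and column names here are just placeholders):

```python
import pandas as pd

# dtype fixes the numeric columns at read time;
# parse_dates turns the yyyy-mm-dd column into datetime64
df = pd.read_csv(
    "data.csv",
    dtype={"small_var": "float64", "big_var": "float64"},
    parse_dates=["date"],
)
print(df.dtypes)
```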
Whether to scale or normalize your data also depends on which machine learning model you are going to use.
For example: if you compare a random forest and a KNN, the KNN will need feature scaling since it works with distances.
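A hedged sketch of feature scaling with scikit-learn (the column names are placeholders, and StandardScaler is only one option; MinMaxScaler is another):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features with very different ranges, as in the question
df = pd.DataFrame({
    "small_var": [0, 1, 2, 1],
    "big_var": [5_000, 2_000_000, 750_000, 10_000],
})

# Fit the scaler on the training data only, then reuse it for validation/test data
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
print(scaled)
```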
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems is a good book to start with, in my personal opinion.
Thanks for the ideas.
In the end, pd.read_csv(title, decimal=',') helped to read them in as floats, as I used German number formatting.
But conversion with to_numeric() also worked.
I have a column that's generated from a custom component on the Data Flow in SSIS.
The data type of the column is float [DT_R8], and along with valid float values it contains NaNs. I would like to identify these NaNs and treat (assign) them as NULL values.
I thought of doing something in the Derived Column Transformation like in the screenshot, but this didn't work.
It seems that the Expression column can only be built from the available functions, but there isn't an 'isNaN' function that can be used.
Would you know of any other approaches, or how it can be done?
Thanks!
I'm trying to convert this dataframe:
into this dataframe in KNIME.
Any suggestions?
(Assuming you do not know in advance the name of the last column.)
I believe with my HiTS extension's Unpivot node it should work with a pattern like this (you will probably need a Column Renamer/String Manipulator to adjust it):
(q\d)(.*)
In case this is really just this single input, just use Constant Value Column nodes to create the quarter and timing columns, and the Column Rename/Column Resorter nodes to arrive at Dataframe2.
I'm pretty new to LabVIEW, but I do have experience in other programming languages like Python and C++. The code I'm going to ask about works, but there was a lot of manual work involved in putting it together. Basically, I read from a text file and change control values based on the values in the text file, in this case 40 values.
I have set it up to pull from a text file and split the string by commas. Then I loop through all the values and set each indicator to the corresponding value. I had to create 40 separate case statements to achieve this. I'm sure there is a better way of doing this. Does anyone have any suggestions?
The following improvements could be made (in addition to those suggested by sweber):
If the file contains just data, without a "label - value" format, then you could read it as CSV (comma-separated values) and actually read just the 1st row.
Currently, you set values based on order. In that case, you could create references to all indicators, build them into an array in the proper order, and in a For Loop assign the values to the indicators via the Value property node.
Overall, I agree with sweber that if it is key-value data, it is better to use either the JSON format or the .ini file format, both of which support such a structure.
Let's start with some optimization:
It seems your data file contains nothing more than 40 numbers. You can wire a 1D DBL array to the default input of the string-to-array VI, and you will get just a 1D array out. No need for a 2D array.
Second, there is no need to convert the FOR loop index to a string; the CASE structure accepts integers, too.
Now, about your question: The simplest solution is to display the values as array, just as they come from the string-to-array VI.
But I guess each value has a special meaning, and you would like to display its name/description somehow. In this case, create a cluster with 40 values, edit their labels as you like, and make sure their order in the cluster is the same as the order of the values in the file.
Then, wire the 1D array of values to this cluster via an array-to-cluster VI.
If you plan to use the text file to store and load the values, converting the cluster data to JSON and vice versa might be something for you, as it transports the labels of the cluster into the file, too. (However, changing labels then becomes an issue.)