TensorFlow Quickstart Tutorial example - where do we say which column is what?

I'm slowly training myself on Tensorflow, following the Get Started section on Tensorflow.org. So far so good.
I have one question regarding the Iris data set example:
https://www.tensorflow.org/get_started/estimator
When it comes to that section:
https://www.tensorflow.org/get_started/estimator#construct_a_deep_neural_network_classifier
It is not entirely clear to me when/where we tell the system which columns are the features and which column is the target value.
The only thing I see is that when we load the dataset, we tell the system the target is in integer format and the features are in float32. But what if my data is all integer or all float32? How can I tell the system that, for example, the first 5 columns are features and the target is in the last column? Or is it implied that the first columns are always features and the last column can only be the target?

The feature_columns=feature_columns argument passed when defining the DNNClassifier says which columns are to be considered part of the data, and the input function specifies which values are the inputs and which are the labels.
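For reference, a minimal sketch based on the tutorial's code may make this concrete (it uses the TF 1.x estimator APIs from that guide; the CSV filename is a placeholder):

import numpy as np
import tensorflow as tf

# The tutorial's loader treats the last CSV column as the target and the
# rest as features; the dtype arguments describe those two groups, they do
# not select which columns go where.
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename='iris_training.csv',
    target_dtype=np.int,
    features_dtype=np.float32)

# feature_columns declares the feature tensors the model expects
# ("x" holds all 4 Iris features).
feature_columns = [tf.feature_column.numeric_column('x', shape=[4])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3)

# The input function is what actually separates inputs (x) from labels (y).
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    shuffle=True)

classifier.train(input_fn=train_input_fn, steps=2000)

So with this particular loader the target is implicitly the last column; with your own data, you decide the split yourself when you build the input function.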

Related

How can I recode 53k unique addresses (saved as objects) w/o One-Hot-Encoding in Pandas?

My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories, and my Colab session with (allegedly) a TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of data science/machine learning that is an art, not a science. A couple of ideas:
One-hot encode everything, then use a dimensionality-reduction algorithm (PCA, SVD, etc.) to remove some of the columns.
Only one-hot encode some values (say, limit it to the 10 or 100 most common categories rather than 53,000) and lump the rest into an "other" category (see the sketch after this list).
If it's possible to construct an embedding for these values (not always possible), you can explore that.
Group/bin the values in the column by some underlying feature, e.g. if the feature is something like days_since_X, bin it by 100 or so; or if it's names of animals, group them by type instead (mammal, reptile, etc.).
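As a rough illustration of the second idea (the column name and cutoff are made up):

import pandas as pd

def limit_categories(series, top_n=100):
    # Keep the top_n most frequent values and lump the rest into 'other'.
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other='other')

df = pd.DataFrame({'address': ['a', 'a', 'b', 'c', 'd', 'a', 'b']})
df['address_reduced'] = limit_categories(df['address'], top_n=2)

# One-hot encoding now produces at most top_n + 1 columns instead of 53,000.
dummies = pd.get_dummies(df['address_reduced'], prefix='address')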

Python pandas DataFrame - data padding with multidimensional statistics

I have a DataFrame with columns accounting for different characteristics of stars and rows accounting for measurements of different stars, something like this:
property    A    A_error    B    B_error    C    C_error    ...
star1
star2
star3
...
In some measurements the error for a specific property is -1.00, which means the measurement was faulty.
In such a case I want to discard the measurement.
One way to do so is by eliminating the entire row (along with the other properties whose error was not -1.00).
I think it should be possible instead to fill in the faulty measurement with a value generated from the distribution of all the other measurements; that is, given the other properties, which are fine, this property should take the value that reduces the error of the entire dataset.
Is there a proper name for the idea I'm referring to?
How would you apply such an algorithm?
I'm a student on a solo project, so I would really appreciate answers that also elaborate on the theory (:
EDIT: After further reading, I think what I was referring to is called regression imputation.
So I guess my question is: how can I implement multidimensional linear regression on a DataFrame in the most efficient way?
Thanks!
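Not an authoritative answer, but one common way to do regression imputation in Python is scikit-learn's IterativeImputer. A rough sketch with made-up numbers and placeholder column names for the star properties described above:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'A':       [1.2, 3.4, 2.2, 5.0],
    'A_error': [0.1, -1.00, 0.2, 0.1],
    'B':       [10.0, 11.5, 9.8, 12.1],
    'B_error': [0.5, 0.4, -1.00, 0.6],
})

# Treat a -1.00 error as "this measurement is missing"...
error_cols = [c for c in df.columns if c.endswith('_error')]
for col in error_cols:
    prop = col[:-len('_error')]
    df.loc[df[col] == -1.00, [prop, col]] = np.nan

# ...then let IterativeImputer regress each column with missing values on
# the other columns (its default estimator is a Bayesian ridge regression).
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)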

How to explicitly split data for training and evaluation with BigQuery ML?

I understand there's already another post, but it's a bit old and it doesn't really answer the question.
I understand that we can use the DATA_SPLIT_METHOD parameter to separate the dataset into training and evaluation sets. But how do I make sure they are two different data sets?
So for example, I set DATA_SPLIT_METHOD to AUTO_SPLIT, and my data set has between 500 and 50,000 rows, so 20% of the data will be used for evaluation. How do I make sure that the remaining 80% was used for training when I run my evaluation (ML.EVALUATE)?
The short answer is that BigQuery does it for you.
The long answer is that DATA_SPLIT_METHOD is a parameter of CREATE MODEL, which, when called, creates and trains the model using the split percentage set by DATA_SPLIT_METHOD.
When you run ML.EVALUATE, you run it against a model that already carries DATA_SPLIT_METHOD as a parameter. It therefore already knows which part of the data set to evaluate on, and it uses the already trained model.
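If it helps, here is a rough sketch of that flow from Python using the google-cloud-bigquery client. The project/dataset/table/label names are placeholders, and logistic_reg is just an example model type:

from google.cloud import bigquery

client = bigquery.Client()

# Train: AUTO_SPLIT reserves an evaluation slice of the input data for you.
client.query("""
    CREATE OR REPLACE MODEL `my_project.my_dataset.my_model`
    OPTIONS (
      model_type = 'logistic_reg',
      input_label_cols = ['label'],
      data_split_method = 'AUTO_SPLIT'
    ) AS
    SELECT * FROM `my_project.my_dataset.my_table`
""").result()

# Evaluate: when no input table is passed to ML.EVALUATE, BigQuery ML uses
# the evaluation slice that was held out during training.
rows = client.query("""
    SELECT * FROM ML.EVALUATE(MODEL `my_project.my_dataset.my_model`)
""").result()
for row in rows:
    print(dict(row))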
Very interesting question, I would say.
As stated in BigQuery's CREATE MODEL parameters, DATA_SPLIT_METHOD (i.e. the method to split input data into training and evaluation sets) does that for you.
But in case you would still like to split your data yourself, here is one way using a random-sampling method:
-- Create a new table that includes a new field splitting the existing data
-- into training (80%), evaluation (10%), and prediction (10%)
CREATE OR REPLACE TABLE `project_name.table_name` AS
SELECT
  *,
  CASE
    WHEN split_field < 0.8 THEN 'training'
    WHEN split_field = 0.8 THEN 'evaluation'
    WHEN split_field > 0.8 THEN 'prediction'
  END AS churn_dataframe
FROM (
  SELECT
    *,
    ROUND(ABS(RAND()), 1) AS split_field
  FROM `project_name.table_name`
)
split_field is a random number between 0 and 1 generated for each row (the generated numbers are assumed to be uniformly distributed).
Hope this helps.
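To then actually use that split field, something along these lines should work (again a sketch with the google-cloud-bigquery client and placeholder model/table/label names; NO_SPLIT keeps BigQuery from holding out its own slice on top of the manual one):

from google.cloud import bigquery

client = bigquery.Client()

# Train only on the rows tagged 'training'.
client.query("""
    CREATE OR REPLACE MODEL `project_name.model_name`
    OPTIONS (
      model_type = 'logistic_reg',
      input_label_cols = ['label'],
      data_split_method = 'NO_SPLIT'
    ) AS
    SELECT * EXCEPT (split_field, churn_dataframe)
    FROM `project_name.table_name`
    WHERE churn_dataframe = 'training'
""").result()

# Evaluate against the rows tagged 'evaluation' by passing them as the
# table argument to ML.EVALUATE.
rows = client.query("""
    SELECT *
    FROM ML.EVALUATE(
      MODEL `project_name.model_name`,
      (SELECT * EXCEPT (split_field, churn_dataframe)
       FROM `project_name.table_name`
       WHERE churn_dataframe = 'evaluation'))
""").result()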

How do I add a new feature column to a tf.data.Dataset object?

I am building an input pipeline for proprietary data using TensorFlow 2.0's tf.data module, and I am using the tf.data.Dataset object to store my features. Here is my issue: the data source is a CSV file that has only 3 columns, a label column and two columns which just hold strings referring to JSON files where the data is stored. I have developed functions that access all the data I need, and I am able to use the Dataset's map function on the columns to get the data, but I don't see how I can add a new column to my tf.data.Dataset object to hold the new data. So if anyone could help with the following questions, it would be much appreciated:
How can a new feature be appended to a tf.data.Dataset object?
Should this process be done on the entire Dataset before iterating through it, or during (I think during iteration would allow utilization of the performance boost, but I don't know how this functionality works)?
I have all the methods for taking the input as the elements from the columns and performing everything required to get the features for each element, I just don't understand how to get this data into the dataset. I could do "hacky" workarounds, using a Pandas Dataframe as a "mediator" or something along those lines, but I want to keep everything within the Tensorflow Dataset and pipeline process, for both performance gains and higher quality code.
I have looked through the Tensorflow 2.0 documentation for the Dataset class (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset), but haven't been able to find a method that can manipulate the structure of the object.
Here is the function I use to load the original dataset:
def load_dataset(self):
    # TODO: Function to get max number of available CPU threads
    dataset = tf.data.experimental.make_csv_dataset(self.dataset_path,
                                                    self.batch_size,
                                                    label_name='score',
                                                    shuffle_buffer_size=self.get_dataset_size(),
                                                    shuffle_seed=self.seed,
                                                    num_parallel_reads=1)
    return dataset
Then, I have methods which allow me to take a string input (column element) and return the actual feature data. And I am able to access the elements from the Dataset using a function like ".map". But how do I add that as a column?
Wow, this is embarrassing, but I have found the solution, and its simplicity makes me feel like an idiot for asking this. But I will leave the answer up just in case anyone else ever faces this issue.
You first create a new tf.data.Dataset object using any function that returns a Dataset, such as ".map".
Then you create a new Dataset by zipping the original and the one with the new data:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
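To spell it out with a toy example (the feature logic here is just a stand-in for the real JSON loading, and the column names are placeholders):

import tensorflow as tf

# Original dataset: a label plus a string column pointing at external data.
original = tf.data.Dataset.from_tensor_slices({
    'score': [1.0, 0.0, 1.0],
    'json_path': ['a.json', 'b.json', 'c.json'],
})

def load_extra_feature(example):
    # Placeholder for the real per-element feature extraction.
    return {'path_length': tf.strings.length(example['json_path'])}

# Derive a new dataset holding only the new feature...
extra = original.map(load_extra_feature)

# ...then zip it with the original so each element carries both.
combined = tf.data.Dataset.zip((original, extra))

for base, new in combined.take(1):
    print(base['score'].numpy(), new['path_length'].numpy())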

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning (e.g., for a variable holding the ratio of satisfactory accounts to total accounts, a blank response means the individual does not have any accounts, whereas a response of 0 means the individual has no satisfactory accounts).
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields a value that is not in the range of the data (e.g., going back to the ratio variable above, we could use 9999 or -1, as these values are not in the range of a ratio variable (0 to 1)). I am just curious to know whether there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions on your experimental design and to the fact that it will somewhat weaken the power of your experiment relative to the same experiment with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging in a few numbers. See this page for more information. Ultimately this is probably a better question for Cross Validated, at least until you have figured out the experimental design issues.