Difference between numeric_column shape=2 and two numeric columns - tensorflow

I initially have time-related data as integers in the format:
1234 # corresponds to 12:34
2359 # corresponds to 23:59
1) The first option is to describe time as numeric_column:
tf.feature_column.numeric_column(key="start_time", dtype=tf.int32)
2) Another option is to split time into hours and minutes into two separated feature columns:
tf.feature_column.numeric_column(key="start_time_hours", dtype=tf.int32)
tf.feature_column.numeric_column(key="start_time_minutes", dtype=tf.int32)
3) The third option is to keep a single feature column, but let TensorFlow know that it holds two values (hours and minutes):
tf.feature_column.numeric_column(key="start_time", shape=2, dtype=tf.int32)
Does this split make sense, and what is the difference between options 2) and 3)?
As an additional question, I ran into problems decoding vector data from CSV:
1|1|FGTR|1|1|14,2|15,1|329|3|10|2013
1|1|LKJG|1|1|7,2|19,2|479|7|10|2013
1|1|LKJH|1|1|14,2|22,2|500|3|10|2013
How can I let TensorFlow know that "14,2" and "15,1" should be treated as tensors of shape=2?
Edit 1:
I found a solution to decode "array"-like data from csv.
In the train and evaluate functions, I added a .map step to decode the data for some columns:
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels)).map(parse_csv)
Where parse_csv is implemented as:
def parse_csv(features, label):
    # Split the comma-separated string and convert each part to int32
    features['start_time'] = tf.string_to_number(
        tf.string_split([features['start_time']], delimiter=',').values, tf.int32)
    return features, label
I think the difference between two separate columns and one column with shape=2 lies in the way the "weights" are distributed.
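To check that intuition, here is a minimal sketch (TF 1.x feature-column API; the feature values are made up) that builds the input tensor both ways. In a linear model a single shape=2 column gets a weight vector of length 2, which is equivalent to two scalar columns with one weight each, so the two options mostly differ in how the data is fed, not in model capacity.
import tensorflow as tf

# Option 2): two separate scalar columns
features_two = {"start_time_hours": [[12], [23]],
                "start_time_minutes": [[34], [59]]}
cols_two = [tf.feature_column.numeric_column("start_time_hours", dtype=tf.int32),
            tf.feature_column.numeric_column("start_time_minutes", dtype=tf.int32)]

# Option 3): one column of shape=2
features_one = {"start_time": [[12, 34], [23, 59]]}
cols_one = [tf.feature_column.numeric_column("start_time", shape=2, dtype=tf.int32)]

a = tf.feature_column.input_layer(features_two, cols_two)
b = tf.feature_column.input_layer(features_one, cols_one)

with tf.Session() as sess:
    print(sess.run(a))  # [[12. 34.] [23. 59.]]
    print(sess.run(b))  # [[12. 34.] [23. 59.]]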

Related

LSTM Keras input and output dimensions

I have 30 time steps with 26 features, so I would imagine my input into the first layer would be of dimension #_samples x 30 x 26.
One problem I have is that my # of samples varies by the time step. Should I pad to make them uniform?
Another is that I am trying to create the time-indexed 3D array by separating the dataset into its respective time steps and combining them into a 3D array, but all the different methods I've tried so far have failed.
Here's my latest attempt:
def lstm_data_processing(X_data, Y_data):
    num_time_steps = X_data['month_id'].nunique()
    month_ids = X_data['month_id'].unique()
    X_processed = []
    X_processed.reshape(X_data.shape[0], X_data.shape[1], num_time_steps)
    for i in range(num_time_steps):
        month_df = X_data.loc[X_data['month_id'] == month_ids[i]].copy()
        month_df.drop('month_id', axis=1, inplace=True)
        print(month_df.shape)
        np.stack(X_processed, month_df)
    print(X_processed.shape)
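For reference, here is a minimal sketch of one way to build the (samples, time_steps, features) array with NumPy, assuming every month_id contains the same rows in the same order (if the sample count varies per time step, you would still need to pad or truncate first):
import numpy as np

def lstm_data_processing(X_data):
    month_ids = X_data['month_id'].unique()
    slices = []
    for month_id in month_ids:
        # One 2-D slice (samples x features) per time step
        month_df = X_data.loc[X_data['month_id'] == month_id].drop('month_id', axis=1)
        slices.append(month_df.values)
    # Stack along a new time axis -> (time_steps, samples, features),
    # then move the time axis to position 1 for Keras: (samples, time_steps, features)
    return np.stack(slices, axis=0).transpose(1, 0, 2)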

train-test split of scikit learn resulting in features having only one unique value in train data

I am trying to train a multivariate linear regression model. I have a dataset named 'main'. There are a few categorical variables in this dataset, which I converted to dummy variables. Let's say the columns obtained after dummification are A, B, C, D, and so on. Now, when I run a train-test split on this dataset, the resulting train set contains only the value 0 in some of these columns. How can I overcome this problem?
The code I am using for the train-test split is:
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test = train_test_split(main, train_size = 0.7, test_size = 0.3, random_state = 100)
On running the code below on the full dataset:
main.columns[main.nunique() == 1]
the result is: Index([], dtype='object')
But when running the same check on the train data:
df_train.columns[df_train.nunique() == 1]
the result is: Index(['A', 'D', 'S'], dtype='object')
I want the resulting train set to contain features with all combinations of values in them. However, this split is giving me only one value in some features.
Edit: I checked the unique values in these columns; they are highly imbalanced, with only one row present for the positive case. I tried stratify, but it needs at least two rows of the positive class, and this is true for many columns. So I cannot include these columns in the train set separately, as that would require writing code for every such column. I want this to be done automatically.
Have you tried changing the random_state value?
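If you want to automate that suggestion, here is a minimal sketch of a hypothetical helper (not from the original post) that retries the split with different seeds until every dummy column still varies in the train set; with extremely rare positive rows it may never succeed, in which case stratifying or moving those rows into the train set manually is the alternative:
from sklearn.model_selection import train_test_split

def split_keeping_rare_dummies(main, dummy_cols, train_size=0.7, test_size=0.3, max_tries=100):
    for seed in range(max_tries):
        df_train, df_test = train_test_split(main, train_size=train_size,
                                             test_size=test_size, random_state=seed)
        # Accept the split only if every dummy column has both 0s and 1s in train
        if (df_train[dummy_cols].nunique() > 1).all():
            return df_train, df_test
    raise ValueError("No split found where all dummy columns vary in the train set")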

Selection column in a dataframe in pandas apply min function

I have n dataframes in a list:
df = [df_1, df_2, df_3, ...., df_n]
where each df_i is a pandas (Python) dataframe and corresponds to one input variable of my keras model.
X_train = [df_1_1, df_2_1, ..., df_n_1]
where df_1_1 is the first column of the first dataframe in the list (the first variable); each dataframe has m columns.
Each column of a dataframe applies a different type of smoothing or filter to its variable.
I have 100 columns in each dataframe, and I want to select the combination of columns (one from each dataframe) whose X_train gives the minimum score for my model:
score = model.evaluate(X_test, Y_test)
where X_test and Y_test are the last n occurrences of the selected columns.
Is there a library for selecting these columns (neural networks, GA, ant colony, ...)?
How can I implement it?
What is your prediction task? Do you need a neural network or not? You are essentially looking at a feature selection problem here. You could use simpler models such as the lasso which will select columns using L1-regularization. Or you could use an ensembling technique such as random forests and consider the relative feature importances to select your columns. Perhaps have a look at scikit-learn.
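A minimal sketch of the lasso route suggested above, assuming the candidate columns from all the dataframes in df are concatenated into one feature matrix and y is the target (both X and y here are assumptions, not from the original post):
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

# Concatenate all candidate columns side by side, prefixing names to keep them unique
X = pd.concat([frame.add_prefix(f"df{i}_") for i, frame in enumerate(df)], axis=1)

lasso = LassoCV(cv=5).fit(X, y)                 # L1 penalty drives weak columns to zero
selector = SelectFromModel(lasso, prefit=True)  # keep columns with non-zero coefficients
selected_columns = X.columns[selector.get_support()]
print(selected_columns)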

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: the first contains word vectors, each stored as a 1x10000 sparse CSR matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows
Vector    (0, 0)\t1.0\n  (0, 1)\t1.0\n  (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness' sake, here is the code I used to generate my dataframe, in case anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like:
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary
    vector representation of the feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific
    vectorization function, and a vector size, returns the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that Pandas
        # incorrectly parses EOF chars
        if type(row[0]) == type('str'):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
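For reference, a minimal sketch of the fix mentioned in the update: stack the per-row 1x10000 CSR matrices from the vectorized dataframe into one 2-D sparse matrix before fitting (column names follow vectorize_dataset above; rows that stayed NaN are dropped first):
from scipy import sparse
from sklearn.dummy import DummyClassifier

vectorized = vectorize_dataset(data, vectorize, 10000).dropna()
X = sparse.vstack(vectorized["Vector"].tolist())   # shape (n_samples, 10000)
y = vectorized["Rating"].astype(int).values

uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(X, y)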

tensorflow wide model: how to use one-hot feature?

I have read about the model at https://www.tensorflow.org/versions/r0.9/tutorials/wide_and_deep/index.html
The features in the article have two types: categorical and continuous.
In my case, I have a column that describes the user ID, ranging from 0 to 10000000.
I treat this column as categorical and use a hash bucket, but I only get a poor AUC value of about 0.50010.
1) Is it necessary to use one-hot encoding to process this ID column?
2) If it is needed, how can I achieve it? I found tf.contrib.layers.one_hot_encoding, but it does not support column names, so it cannot be used in the wide-n-deep demo.
No, you don't need to encode the UserID column. Each value is unique, so it is not really a categorical value. One-hot encoding makes sense when there are fewer than 1000 categories.
To answer your question on how to use the one_hot_encoding, assuming you have a list of labels (note that they must be integers):
import tensorflow as tf

with tf.Session() as sess:
    labels = [0, 1, 2, 3]
    labels_t = tf.constant(labels)
    num_classes = len(labels)
    one_hot = tf.contrib.layers.one_hot_encoding(labels_t, num_classes=num_classes)
    print(one_hot.eval())
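For these four labels, the printed result should be a 4x4 one-hot matrix, one row per label (exact formatting depends on the NumPy version):
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]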