small test_set xgb predict - xgboost

I would like to ask about a problem I have had for the last couple of days.
First of all, I am a beginner in machine learning and this is my first time using the XGBoost algorithm, so please excuse any mistakes I have made.
I trained my model to predict whether a log file is malicious or not. After I save the model and reload it in a different session, the predict function seems to work normally (with a few deviations in the probabilities, but I know that is covered in another topic).
The problem is this: sometimes, when I try to predict on a "small" CSV file after loading the model, the prediction seems to be broken and returns only the zero label, even for rows that were previously classified correctly.
For example, I load a dataset containing 20,000 rows and predict() works. I keep only the first 5 of these rows using pandas drop, and it still works. If I save those 5 rows to a different CSV and reload it, it does not work. The same thing happens if I remove all the other rows (19,995) by hand and save the file with only the 5 remaining.
I would bet it is a file-size problem, but when I drop the rows from the dataframe through pandas it seems to work.
The number 5 is only an example; the same happens if I delete a large portion of the dataset.
I first ran into this problem when trying to verify some completely new logs by hand: they seem to be classified correctly if added to the big CSV file, but not in a new file on their own.
Here is my load and predict code:
##IMPORTS
import os
import pandas as pd
from pandas.compat import StringIO
from datetime import datetime
from langid.langid import LanguageIdentifier, model
import langid
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.externals import joblib
from ggplot import ggplot, aes, geom_line
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import average_precision_score
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from collections import defaultdict
import pickle
df = pd.read_csv('big_test.csv')
df3 = pd.read_csv('small_test.csv')
#This one is necessary for the loaded_model
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        else:
            return x[self.column_list].to_dict(orient='records')
loaded_model = joblib.load('finalized_model.sav')
result = loaded_model.predict(df)
print(result)
df2=df[:5]
result2 = loaded_model.predict(df2)
print(result2)
result3 = loaded_model.predict(df3)
print(result3)
The results I get are these:
[1 0 1 ... 0 0 0]
[1 0 1 0 1]
[0 0 0 0 0]
I can provide any other code, including the training code or my dataset, if necessary.
EDIT: I use a pipeline for my data. I tried to reproduce the error after fitting XGBoost on the iris data and could not. Maybe there is something wrong with my pipeline? The code is below:
df = pd.read_csv('big_test.csv')
# df.info()
# Split Dataset
attributes = ['uri','code','r_size','DT_sec','Method','http_version','PenTool','has_referer', 'Lang','LangProb','GibberFlag' ]
x_train, x_test, y_train, y_test = train_test_split(df[attributes], df['Scan'], test_size=0.2,
                                                    stratify=df['Scan'], random_state=0)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.2,
                                                  stratify=y_train, random_state=0)
# print('Train:', len(y_train), 'Dev:', len(y_dev), 'Test:', len(y_test))
# set up graph function
def plot_precision_recall_curve(y_true, y_pred_scores):
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_scores)
    return ggplot(aes(x='recall', y='precision'),
                  data=pd.DataFrame({"precision": precision, "recall": recall})) + geom_line()
# XGBClassifier
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        else:
            return x[self.column_list].to_dict(orient='records')
count_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=10)
dict_vectorizer = DictVectorizer()
xgb = XGBClassifier(seed=0)
pipeline = Pipeline([
    ("feature_union", FeatureUnion([
        ('text_features', Pipeline([
            ('selector', ColumnSelector(['uri'])),
            ('count_vectorizer', count_vectorizer)
        ])),
        ('categorical_features', Pipeline([
            ('selector', ColumnSelector(['code','r_size','DT_sec','Method','http_version','PenTool','has_referer', 'Lang','LangProb','GibberFlag' ])),
            ('dict_vectorizer', dict_vectorizer)
        ]))
    ])),
    ('xgb', xgb)
])
pipeline.fit(x_train, y_train)
filename = 'finalized_model.sav'
joblib.dump(pipeline, filename)

That's due to different dtypes in the big and the small file.
When you do:
df = pd.read_csv('big_test.csv')
The dtypes are these:
print(df.dtypes)
# Output
uri object
code object # <== Observe this
r_size object # <== Observe this
Scan int64
...
...
...
Now when you do:
df3 = pd.read_csv('small_test.csv')
the dtypes are changed:
print(df3.dtypes)
# Output
uri object
code int64 # <== Now this has changed
r_size int64 # <== Now this has changed
Scan int64
...
...
You see, pandas tries to infer the dtype of each column by itself. When you load big_test.csv, some values in the code and r_size columns cannot be parsed as numbers, so the whole column is read as strings (dtype object). That does not happen in small_test.csv, where every value in those columns is numeric and the columns are read as int64.
Because of this change, the DictVectorizer encodes the data differently than it did during training: it one-hot encodes string values but passes numeric values through as a single feature. The features change, and hence the results change too.
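To make the encoding difference concrete, here is a minimal sketch using made-up records (the values below are hypothetical, not taken from the asker's files). It only illustrates how DictVectorizer treats the same key when the value is a string versus a number.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: the same status codes, once as strings, once as ints.
as_strings = [{"code": "200"}, {"code": "404"}]
as_ints = [{"code": 200}, {"code": 404}]

# String values are one-hot encoded: one column per distinct value.
dv_str = DictVectorizer(sparse=False).fit(as_strings)
print(dv_str.feature_names_)  # ['code=200', 'code=404']

# Numeric values are kept as a single column named after the key.
dv_int = DictVectorizer(sparse=False).fit(as_ints)
print(dv_int.feature_names_)  # ['code']

# A vectorizer fitted on strings has never seen the plain feature "code",
# so integer records are silently encoded as all-zero rows.
print(dv_str.transform(as_ints))
```

If the trained pipeline learned features such as code=200, integer input leaves all of those columns at zero, which would be consistent with the degenerate predictions described in the question.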
If you do this:
df3[['code', 'r_size']] = df3[['code', 'r_size']].astype(str)
and then call predict(), the results are the same again.
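As an alternative sketch, the dtype can be fixed at load time instead of after the fact. The CSV contents below are invented for illustration (the "-" entry stands in for whatever non-numeric value the big file contains), and the dtype mapping is an assumption about which columns need it.

```python
import io
import pandas as pd

# Hypothetical CSV contents: the big file has one non-numeric entry ("-").
big_csv = "uri,code,r_size\n/a,200,512\n/b,-,-\n/c,404,0\n"
small_csv = "uri,code,r_size\n/a,200,512\n/c,404,0\n"

big = pd.read_csv(io.StringIO(big_csv))
small = pd.read_csv(io.StringIO(small_csv))
print(big["code"].tolist())    # read as strings because of the "-" entry
print(small["code"].tolist())  # read as integers

# Forcing the dtype at load time keeps both files consistent.
as_text = {"code": str, "r_size": str}
small_fixed = pd.read_csv(io.StringIO(small_csv), dtype=as_text)
print(small_fixed["code"].tolist())
```

Passing the same dtype mapping wherever the training and prediction data are read means the column types no longer depend on which rows happen to be in the file.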

Related

Problem with manual data for PyTorch's DataLoader

I have a dataset which I have to process in such a way that it works with a convolutional neural network of PyTorch (I'm completely new to PyTorch). The data is stored in a dataframe with a column for pictures (28 x 28 ndarrays with int32 entries) and a column with its class labels. The pixels of the images merely adopt values +1 and -1 (since it is simulation data of a classical 2d Ising Model). The dataframe looks like this.
I imported the following (a lot of this is not relevant for now, but I included everything for completeness. "data_loader" is a custom py file.):
import numpy as np
import matplotlib.pyplot as plt
import data_loader
import pandas as pd
import torch
import torchvision.transforms as T
from torchvision.utils import make_grid
from torch.nn import Module
from torch.nn import Conv2d
from torch.nn import Linear
from torch.nn import MaxPool2d
from torch.nn import ReLU
from torch.nn import LogSoftmax
from torch import flatten
from sklearn.metrics import classification_report
import time as time
from torch.utils.data import DataLoader, Dataset
Then, I want to get this in the correct shape in order to make it useful for PyTorch. I do this by defining the following class
class MetropolisDataset(Dataset):
    def __init__(self, data_frame, transform=None):
        self.data_frame = data_frame
        self.transform = transform
    def __len__(self):
        return len(self.data_frame)
    def __getitem__(self,idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        label = self.data_frame['label'].iloc[idx]
        image = self.data_frame['image'].iloc[idx]
        image = np.array(image)
        if self.transform:
            image = self.transform(image)
        return (image, label)
I call instances of this class as:
train_set = MetropolisDataset(data_frame = df_train,
                              transform = T.Compose([
                                  T.ToPILImage(),
                                  T.ToTensor()]))
validation_set = MetropolisDataset(data_frame = df_validation,
                                   transform = T.Compose([
                                       T.ToPILImage(),
                                       T.ToTensor()]))
test_set = MetropolisDataset(data_frame = df_test,
                             transform = T.Compose([
                                 T.ToPILImage(),
                                 T.ToTensor()]))
The problem does not yet arise here, because I am able to read out and show images from these instances of the above defined class.
Then, as far as I found out, it is necessary to let this go through the DataLoader of PyTorch, which I do as follows:
batch_size = 64
train_dl = DataLoader(train_set, batch_size, shuffle=True, num_workers=3, pin_memory=True)
validation_dl = DataLoader(validation_set, batch_size, shuffle=True, num_workers=3, pin_memory=True)
test_dl = DataLoader(test_set, batch_size, shuffle=True, num_workers=3, pin_memory=True)
However, if I want to use these instances of the DataLoader, simply nothing happens. I get no error, but the computation does not seem to get anywhere either. I tried to run a CNN but it does not seem to compute anything. I also tried to show some sample images with code taken from an article, but the same issue occurs. The sample code is:
def show_images(images, nmax=10):
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.set_xticks([]); ax.set_yticks([])
    ax.imshow(make_grid((images.detach()[:nmax]), nrow=8).permute(1, 2, 0))
def show_batch(dl, nmax=64):
    # each batch from MetropolisDataset is an (images, labels) pair
    for images, _ in dl:
        show_images(images, nmax)
        break
show_batch(test_dl)
It seems that there is some error in the implementation of my MetropolisDataset class or in the DataLoader itself. How could this problem be solved?
As mentioned in the comments, the problem was partly solved by setting num_workers to zero, since I was working in a Jupyter notebook (as answered in a linked question). However, this left one further problem: I got errors when I applied the DataLoader to run a CNN. The cause was that my data consisted of int32 numbers instead of float32. I do not include further code, because this was related directly to my data; the issue was (as so often) merely a wrong datatype.
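A minimal sketch of the two fixes described above, assuming randomly generated +1/-1 grids in place of the real simulation data. SpinDataset is a hypothetical stand-in for MetropolisDataset, and it skips the ToPILImage/ToTensor transforms in favour of a direct cast, which is a simplification rather than the asker's actual code.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class SpinDataset(Dataset):
    """Hypothetical stand-in for MetropolisDataset: 28 x 28 grids of +1/-1."""
    def __init__(self, images, labels):
        self.images = images
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        # cast int32 -> float32 and add a channel axis: (1, 28, 28)
        image = torch.from_numpy(self.images[idx].astype(np.float32)).unsqueeze(0)
        return image, int(self.labels[idx])

rng = np.random.default_rng(0)
images = rng.choice(np.array([-1, 1], dtype=np.int32), size=(10, 28, 28))
labels = rng.integers(0, 2, size=10)

# num_workers=0 loads batches in the main process, which is the setting
# that resolved the hang inside the notebook.
loader = DataLoader(SpinDataset(images, labels), batch_size=4, shuffle=True, num_workers=0)
batch_images, batch_labels = next(iter(loader))
print(batch_images.dtype, batch_images.shape)
```

Convolution layers in PyTorch use float32 weights by default, so casting the input once inside __getitem__ keeps the training loop free of dtype conversions.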

Error finding attribute `feature_names_in_` that exists in docs

I'm getting the error AttributeError: 'LogisticRegression' object has no attribute 'feature_names_in_' even though that attribute is written in the docs.
I'm on scikit-learn version 1.0.2.
I created an object LogisticRegression and I am trying to use the documented attribute of feature_names_in_ but it's returning an error.
#imports
import numpy as np
import pandas as pd
import statistics
import scipy.sparse
from scipy.stats import chi2_contingency
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
# train_test_split()
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 42)
#create functions for preprocessing
# function to replace NaN's in the ordinal and interval data
def replace_NAN_median(X_df):
    opinions = ['opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'household_adults',
                'household_children']
    for column in opinions:
        X_df[column].replace(np.nan, X_df[column].median(), inplace = True)
    return X_df
# function to replace NaN's in the categorical data
def replace_NAN_mode(X_df):
    miss_cat_features = ['education', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status']
    for column in miss_cat_features:
        X_df[column].replace(np.nan, statistics.mode(X_df[column]), inplace = True)
    return X_df
# Instantiate transformers
NAN_median = FunctionTransformer(replace_NAN_median)
NAN_mode = FunctionTransformer(replace_NAN_mode)
col_transformer = ColumnTransformer(transformers=
    # replace NaN's in the binary data
    [("NAN_0", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value = 0),
      ['behavioral_antiviral_meds', 'behavioral_avoidance','behavioral_face_mask' ,
       'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_seasonal', 'chronic_med_condition',
       'child_under_6_months', 'health_worker', 'health_insurance']),
     # MinMaxScaler on our numeric ordinal and interval data
     ("scaler", MinMaxScaler(), ['opinion_seas_vacc_effective', 'opinion_seas_risk',
                                 'opinion_seas_sick_from_vacc',
                                 'household_adults', 'household_children']),
     # OHE categorical string data
     ("ohe", OneHotEncoder(sparse = False), ['age_group','education', 'race', 'sex',
                                             'income_poverty', 'marital_status', 'rent_or_own',
                                             'employment_status', 'census_msa'])],
    remainder="passthrough")
# Preprocessing Pipeline
preprocessing_pipe = Pipeline(steps=[
    ("NAN_median", NAN_median),
    ("NAN_mode", NAN_mode),
    ("col_transformer", col_transformer)
])
# model
logreg_optimized_pipe = Pipeline(steps=[("preprocessing_pipe", preprocessing_pipe),
                                        ("log_reg", LogisticRegression(solver = 'liblinear', random_state = 42, C = 10, penalty= 'l1'))])
#fit model to training data
logreg_optimized_pipe.fit(X_train, y_train)
#trying to get feature names
logreg_optimized_pipe.named_steps["log_reg"].feature_names_in_
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-38-512bfaf5962d> in <module>
----> 1 logreg_optimized_pipe.named_steps["log_reg"].feature_names_in_
AttributeError: 'LogisticRegression' object has no attribute 'feature_names_in_'
I'm open to alternative suggestions on how to get the feature names as well.
The docs say the following:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
You should make sure that the data that reaches the model still has feature names.
Also, the attribute is defined only after fit has been called.
See the LogisticRegression docs for your version, 1.0.2.
So it turns out that SimpleImputer returns an array, thereby removing the column names. I replaced SimpleImputer with a function to fix this. I wasn't able to figure out how to use .feature_names_in_ on the LogisticRegression() model, but it did work when I called it on the ColumnTransformer in the preprocessing pipeline. Most importantly, I was able to use .get_feature_names_out() on that ColumnTransformer to get the feature names that were fed into the model.
Code:
#imports
import numpy as np
import pandas as pd
import statistics
import scipy.sparse
from scipy.stats import chi2_contingency
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
# train_test_split()
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 42)
#create functions for preprocessing
# function to replace NaN's in the ordinal and interval data
def replace_NAN_median(X_df):
    opinions = ['opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'household_adults',
                'household_children']
    for column in opinions:
        X_df[column].replace(np.nan, X_df[column].median(), inplace = True)
    return X_df
# function to replace NaN's in the categorical data
def replace_NAN_mode(X_df):
    miss_cat_features = ['education', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status']
    for column in miss_cat_features:
        X_df[column].replace(np.nan, statistics.mode(X_df[column]), inplace = True)
    return X_df
# function to replace NaN's in the binary data
def replace_NAN_0(X_df):
    miss_binary = ['behavioral_antiviral_meds', 'behavioral_avoidance','behavioral_face_mask' ,
                   'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home',
                   'behavioral_touch_face', 'doctor_recc_seasonal', 'chronic_med_condition',
                   'child_under_6_months', 'health_worker', 'health_insurance']
    for column in miss_binary:
        X_df[column].replace(np.nan, 0, inplace = True)
    return X_df
# Instantiate transformers
NAN_median = FunctionTransformer(replace_NAN_median)
NAN_mode = FunctionTransformer(replace_NAN_mode)
NAN_0 = FunctionTransformer(replace_NAN_0)
col_transformer = ColumnTransformer(transformers= [
    # MinMaxScaler on our numeric ordinal and interval data
    ("scaler", MinMaxScaler(), ['opinion_seas_vacc_effective', 'opinion_seas_risk',
                                'opinion_seas_sick_from_vacc',
                                'household_adults', 'household_children']),
    # OHE categorical string data
    ("ohe", OneHotEncoder(sparse = False), ['age_group','education', 'race', 'sex',
                                            'income_poverty', 'marital_status', 'rent_or_own',
                                            'employment_status', 'census_msa'])],
    remainder="passthrough")
# Preprocessing Pipeline
preprocessing_pipe = Pipeline(steps=[
    ("NAN_median", NAN_median),
    ("NAN_mode", NAN_mode),
    ("NAN_0", NAN_0),
    ("col_transformer", col_transformer)
])
# model
logreg_optimized_pipe = Pipeline(steps=[("preprocessing_pipe", preprocessing_pipe),
                                        ("log_reg", LogisticRegression(solver = 'liblinear', random_state = 42, C = 10, penalty= 'l1'))])
#fit model to training data
logreg_optimized_pipe.fit(X_train, y_train)
#trying to get feature names
logreg_optimized_pipe.named_steps["preprocessing_pipe"][3].feature_names_in_
#output - feature names put into `ColumnTransformer`
array(['respondent_id', 'behavioral_antiviral_meds',
'behavioral_avoidance', 'behavioral_face_mask',
'behavioral_wash_hands', 'behavioral_large_gatherings',
'behavioral_outside_home', 'behavioral_touch_face',
'doctor_recc_seasonal', 'chronic_med_condition',
'child_under_6_months', 'health_worker', 'health_insurance',
'opinion_seas_vacc_effective', 'opinion_seas_risk',
'opinion_seas_sick_from_vacc', 'age_group', 'education', 'race',
'sex', 'income_poverty', 'marital_status', 'rent_or_own',
'employment_status', 'census_msa', 'household_adults',
'household_children'], dtype=object)
logreg_optimized_pipe.named_steps["preprocessing_pipe"][3].get_feature_names_out()
#output - feature names after `ColumnTransformer`
array(['scaler__opinion_seas_vacc_effective', 'scaler__opinion_seas_risk',
'scaler__opinion_seas_sick_from_vacc', 'scaler__household_adults',
'scaler__household_children', 'ohe__age_group_18 - 34 Years',
'ohe__age_group_35 - 44 Years', 'ohe__age_group_45 - 54 Years',
'ohe__age_group_55 - 64 Years', 'ohe__age_group_65+ Years',
'ohe__education_12 Years', 'ohe__education_< 12 Years',
'ohe__education_College Graduate', 'ohe__education_Some College',
'ohe__race_Black', 'ohe__race_Hispanic',
'ohe__race_Other or Multiple', 'ohe__race_White',
'ohe__sex_Female', 'ohe__sex_Male',
'ohe__income_poverty_<= $75,000, Above Poverty',
'ohe__income_poverty_> $75,000',
'ohe__income_poverty_Below Poverty', 'ohe__marital_status_Married',
'ohe__marital_status_Not Married', 'ohe__rent_or_own_Own',
'ohe__rent_or_own_Rent', 'ohe__employment_status_Employed',
'ohe__employment_status_Not in Labor Force',
'ohe__employment_status_Unemployed',
'ohe__census_msa_MSA, Not Principle City',
'ohe__census_msa_MSA, Principle City', 'ohe__census_msa_Non-MSA',
'remainder__respondent_id', 'remainder__behavioral_antiviral_meds',
'remainder__behavioral_avoidance',
'remainder__behavioral_face_mask',
'remainder__behavioral_wash_hands',
'remainder__behavioral_large_gatherings',
'remainder__behavioral_outside_home',
'remainder__behavioral_touch_face',
'remainder__doctor_recc_seasonal',
'remainder__chronic_med_condition',
'remainder__child_under_6_months', 'remainder__health_worker',
'remainder__health_insurance'], dtype=object)

XGBoost + GridSearch: weird warning

Below is the code I wrote for hyperparameter tuning of XGBoost using RandomizedSearchCV:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, auc, confusion_matrix
from pprint import pprint
from xgboost import XGBClassifier
import numpy as np
import pandas as pd
import time
# instantiate XGBoost model (n_jobs=-1 uses all available cores)
clf = XGBClassifier(missing=np.nan, n_jobs=-1)
# Define scoring metrics
scorers = {
    'accuracy_score': make_scorer(accuracy_score),
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score)
}
param_grid_dummy = {
    "n_estimators": [25, 250],
    "max_depth": [3,5],
    "learning_rate": [0.0005, 0.005],
}
def random_search_wrapper(refit_score = 'precision_score'):
    """
    fits a RandomizedSearchCV classifier using refit_score for optimization
    prints classifier performance metrics
    """
    rf_random = RandomizedSearchCV(estimator = clf, param_distributions = param_grid_dummy, n_iter = 3, scoring=scorers, refit = refit_score, cv = 3, return_train_score= True, n_jobs= -1)
    rf_random.fit(X_train_df, Y_train)
    # make the predictions
    Y_pred = rf_random.predict(X_test_df)
    print('Best params for {}'.format(refit_score))
    print(rf_random.best_params_)
    # confusion matrix on test data
    print('\nConfusion matrix of XGBoost optimized for {} on the test data: '.format(refit_score))
    print(pd.DataFrame(confusion_matrix(Y_test, Y_pred),
                       columns = ['pred_neg', 'pred_pos'], index = ['neg', 'pos']))
    return rf_random
# Optimize classifier for precision score
start = time.time()
rf_random_cl = random_search_wrapper(refit_score='precision_score')
# Print time
end = time.time()
print()
print((end - start)/60, "minutes")
I get a weird warning:
/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Can someone please help me understand what I am doing wrong here?
When I simply call clf.fit(X_train_df, Y_train), it works perfectly fine.
This is an issue with the sklearn version: some versions before 0.20.1 emit this warning.
Your code is correct.
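If upgrading scikit-learn is not immediately possible, one option is to silence only DeprecationWarning around the noisy call. The sketch below uses just the standard library; run_quietly and noisy are hypothetical helper names introduced here for illustration. One caveat: with n_jobs=-1 the work may run in separate worker processes, and a filter set in the parent process is not guaranteed to apply there.

```python
import warnings

def run_quietly(fn, *args, **kwargs):
    """Call fn while ignoring DeprecationWarning only (hypothetical helper)."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return fn(*args, **kwargs)

def noisy():
    # stands in for a library call that emits a deprecation warning
    warnings.warn("truth value of an empty array is ambiguous", DeprecationWarning)
    return "done"

print(run_quietly(noisy))
```

The filter is restored when the with block exits, so other warnings elsewhere in the session are unaffected.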

One-hot encoding Tensorflow Strings

I have a list of strings as labels for training a neural network. Now I want to convert them via one_hot encoding so that I can use them for my tensorflow network.
My input list looks like this:
labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']
The requested outcome should be something like
one_hot [0,1,0,2,0]
What is the easiest way to do this? Any help would be much appreciated.
Cheers,
Andi
The desired outcome looks like the output of LabelEncoder in sklearn, not OneHotEncoder. In tf you need CategoryEncoding, BUT it is a preprocessing layer which encodes integer features:
inp = layers.Input(shape=[X.shape[0]])
x0 = layers.CategoryEncoding(
    num_tokens=3, output_mode="multi_hot")(inp)
model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer= 'adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.CategoricalCrossentropy()])
print(model.summary())
This part produces the encoding of the unique values. You can add another branch to this model that takes your initial vector, and fit it against the labels from this reference branch (it is like joining a reference table with a fact table in a database); the model then combines the reference data with your actual data and output.
Note that num_tokens=3 and output_mode="multi_hot" are given explicitly, AND that the integer codes for the class names have to be computed a priori, before the model uses them, as a feature-engineering step, like this (in a pd.DataFrame):
import numpy as np
import pandas as pd
d = {'transport_col':['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']}
dataset_df = pd.DataFrame(data=d)
classes = dataset_df['transport_col'].unique().tolist()
print(f"Label classes: {classes}")
df= dataset_df['transport_col'].map(classes.index).copy()
print(df)
From a manual example (REF): Encode the categorical label into an integer.
Details: This stage is necessary if your classification label is represented as a string. Note: Keras expects classification labels to be integers.
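The same integer encoding can also be sketched with NumPy alone, which additionally shows what a true one-hot matrix would look like. This is an illustrative alternative, not part of the Keras approach above; note that np.unique orders the classes alphabetically, which happens to match the requested output here.

```python
import numpy as np

labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']

# np.unique returns the sorted class names and, with return_inverse=True,
# the index of each label within that sorted array.
classes, codes = np.unique(labels, return_inverse=True)
print(classes.tolist())  # ['"car"', '"pedestrian"', '"truck"']
print(codes.tolist())    # [0, 1, 0, 2, 0]

# A true one-hot matrix: row i has a single 1 in column codes[i].
one_hot = np.eye(len(classes), dtype=np.float32)[codes]
print(one_hot.shape)     # (5, 3)
```

The integer codes suit a sparse categorical loss, while the one-hot rows suit a categorical cross-entropy loss.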
In another architecture, perhaps, you could use StringLookup:
vocab= np.array(np.unique(labels))
inp = tf.keras.Input(shape= labels.shape[0], dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=vocab)(inp)
But labels are usually dependent variables, as opposed to features, and shouldn't be fed in as an Input.
Everything is in the Keras docs.
Possible FULL CODE:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras
from keras import layers
X = np.array([['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']])
vocab= np.unique(X)
print(vocab)
y= np.array([[0,1,0,2,0]])
inp = layers.Input(shape=[X.shape[0]], dtype='string')
x0= tf.keras.layers.StringLookup(vocabulary=vocab, name='finish')(inp)
model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer= 'adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.categorical_crossentropy])
print(model.summary())
from tensorflow.keras import backend as K
for layerIndex, layer in enumerate(model.layers):
    print(layerIndex)
    func = K.function([model.get_layer(index=0).input], layer.output)
    layerOutput = func([X])  # X is a numpy array of strings
    print(layerOutput)
    if layerIndex==1:  # the last layer here
        # by default StringLookup reserves index 0 for out-of-vocabulary tokens, so known classes start at 1
        scale = lambda x: x - 1
        print(scale(layerOutput))
Result:
[[0 1 0 2 0]]
Another possible solution for your case is layers.TextVectorization:
import numpy as np
import keras
from keras import layers
input_array = np.atleast_2d(np.array(['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']))
vocab= np.unique(input_array)
input_data = keras.Input(shape=(None,), dtype='string')
layer = layers.TextVectorization( max_tokens=None, standardize=None, split=None, output_mode="int", vocabulary=vocab)
int_data = layer(input_data)
model = keras.Model(inputs=input_data, outputs=int_data)
output_dataset = model.predict(input_array)
print(output_dataset)  # indices start from 2: 0 is reserved for padding and 1 for out-of-vocabulary tokens
scale = lambda x: x - 2
print(scale(output_dataset))
Result:
array([[0, 1, 0, 2, 0]])

The iris tutorial in tensorflow's website does not work well

The code is shown below, followed by the error message:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import urllib.request
import tensorflow as tf
import numpy as np
IRIS_TRAINING = "iris_training.csv"
IRIS_TRAINING_URL = "http://download.tensorflow.org/data/iris_training.csv"
IRIS_TEST = "iris_test.csv"
IRIS_TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"
if not os.path.exists(IRIS_TRAINING):
    raw = urllib.request.urlopen(IRIS_TRAINING_URL).read()
    # urlopen().read() returns bytes, so the file is opened in binary mode
    with open(IRIS_TRAINING, 'wb') as f:
        f.write(raw)
if not os.path.exists(IRIS_TEST):
    raw = urllib.request.urlopen(IRIS_TEST_URL).read()
    with open(IRIS_TEST, 'wb') as f:
        f.write(raw)
# load datasets.
training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=IRIS_TRAINING,
    target_dtype=np.int,
    features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=IRIS_TEST,
    target_dtype=np.int,
    features_dtype=np.float32
)
# Specify that all features have real_valued data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir="/tmp/iris_model")
# Define the training inputs
def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y
# Fit model (input_fn takes the function itself, not its return value)
classifier.fit(input_fn=get_train_inputs, steps=2000)
# Define the test inputs
def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)
    return x, y
# Evaluate accuracy
accuracy_score = classifier.evaluate(input_fn=get_test_inputs, steps=1)["accuracy"]
print("\nTest Accuracy: {0:f}\n".format(accuracy_score))
This prints the following stack-trace:
Traceback (most recent call last):
File "/home/skyfacon/PycharmProjects/LinearFitting/IrisClassification.py", line 35, in <module>
features_dtype=np.float32
File "/home/skyfacon/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 69, in load_csv_without_header
data.append(np.asarray(row, dtype=features_dtype))
File "/home/skyfacon/anaconda3/envs/tensorflow/lib/python3.6/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'setosa'
Process finished with exit code 1
Which page are you using as the tutorial for this? The first page that comes up when searching on Google is this one:
https://www.tensorflow.org/get_started/tflearn
The difference between that tutorial and what you posted is that you call tf.contrib.learn.datasets.base.load_csv_without_header where the tutorial calls tf.contrib.learn.datasets.base.load_csv_with_header.
The iris data at the URL you specified contains a header row, and you are trying to load it as a file without a header. Hence the strings in the header cannot be converted to float, which causes the error.
Change your code to:
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TRAINING,
    target_dtype=np.int,
    features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TEST,
    target_dtype=np.int,
    features_dtype=np.float32)
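The failure mode can be reproduced without TensorFlow. The sketch below uses only the standard library and a made-up sample whose first row mimics a header carrying class names; load_rows is a hypothetical helper written for this illustration, not the tf.contrib loader.

```python
import csv
import io

# Hypothetical sample: the first row is a header holding counts and class
# names, not measurements.
sample = "2,4,setosa,versicolor,virginica\n6.4,2.8,5.6,2.2,2\n5.0,2.3,3.3,1.0,1\n"

def load_rows(text, has_header):
    """Parse every row as floats, optionally skipping the first (header) row."""
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader)  # discard the header row
    return [[float(value) for value in row] for row in reader]

try:
    load_rows(sample, has_header=False)
except ValueError as err:
    print("without header handling:", err)

print(load_rows(sample, has_header=True))
```

Treating the header as data makes the parser try to convert a class name to a float, which is the same ValueError as in the stack trace above.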