ValueError: invalid literal for int() with base 10: 'whale' cifar100 - tensorflow

I want to convert the CIFAR-100 labels to one-hot encoding like this:
from keras.utils import to_categorical
y_test = to_categorical(df["label"], num_classes=100)
However, when I run this it raises: ValueError: invalid literal for int() with base 10: 'whale'
My label CSV looks like the following:
   image           label
0  TIbrIzjlju.png  whale
1  bIbtjNbPvW.png  apple
2  vMsCEjUvde.png  seal
3  yyPVYpDWSd.png  sea
4  homfJrzXsv.png  train
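The error occurs because to_categorical expects integer class indices, not strings, so int('whale') fails. A minimal sketch of the kind of mapping that is needed first (assuming df is the loaded label CSV; the factorize step is an assumption, not part of the question):

import pandas as pd
from keras.utils import to_categorical

# Map each string label to an integer code (e.g. 'whale' -> 0, 'apple' -> 1, ...)
codes, uniques = pd.factorize(df["label"])
y_test = to_categorical(codes, num_classes=100)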

Related

How to concatenate + tokenize + pad strings in TFX preprocessing?

I'd like to perform the usual text preprocessing steps in a TensorFlow Extended pipeline's Transform step/component. My data is the following (strings in independent features, 0/1 integers in label column):
field1 field2 field3 label
--------------------------
aa bb cc 0
ab gfdg ssdg 1
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow_text import UnicodeCharTokenizer

def preprocessing_fn(inputs):
    outputs = {}
    outputs['features_xf'] = tf.sparse.concat(
        axis=0, sp_inputs=[inputs["field1"], inputs["field2"], inputs["field3"]])
    outputs['label_xf'] = tf.convert_to_tensor(inputs["label"], dtype=tf.float32)
    return outputs
but this doesn't work:
ValueError: Arrays were not all the same length: 3 vs 1 [while running 'Transform[TransformIndex0]/ConvertToRecordBatch']
(Later on I want to apply char-level tokenization and padding to MAX_LEN as well).
Any idea?
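Concatenating with axis=0 stacks the three features along the batch axis, which is what produces the "3 vs 1" length mismatch. A rough sketch of a per-example join plus char-level tokenization and padding (MAX_LEN, the densify helper, and the shapes are assumptions, not part of the original code):

import tensorflow as tf
import tensorflow_text as tf_text

MAX_LEN = 64  # hypothetical padding length

def preprocessing_fn(inputs):
    outputs = {}

    # Densify each (possibly sparse) string feature and flatten to shape [batch]
    def densify(t):
        if isinstance(t, tf.SparseTensor):
            t = tf.sparse.to_dense(t, default_value="")
        return tf.reshape(t, [-1])

    fields = [densify(inputs[name]) for name in ("field1", "field2", "field3")]
    # Join per example along the feature axis, not the batch axis
    joined = tf.strings.join(fields, separator=" ")
    # Char-level tokenization yields a RaggedTensor of Unicode codepoints
    tokens = tf_text.UnicodeCharTokenizer().tokenize(joined)
    # Pad (or truncate) every example to MAX_LEN
    outputs["features_xf"] = tokens.to_tensor(default_value=0, shape=[None, MAX_LEN])
    outputs["label_xf"] = tf.cast(densify(inputs["label"]), tf.float32)
    return outputs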

Shape must be rank 2 but is rank 3 for NonMaxSuppressionV3: ERROR

I'm facing this error while trying to use tf.image.non_max_suppression during video object detection. The TensorFlow version is 1.10.0.
ValueError: Shape must be rank 2 but is rank 3 for
'non_max_suppression/NonMaxSuppressionV3' (op: 'NonMaxSuppressionV3')
with input shapes: [1,500,4], [1,500], [], [], [].
I get this same error with TensorFlow 2.1, and the reason is (as the error says) that the batch dimension cannot be there.
tf.image.non_max_suppression(
    boxes, scores, max_output_size, iou_threshold=0.5,
    score_threshold=float('-inf'), name=None)

boxes: A 2-D float Tensor of shape [num_boxes, 4].
Example:
selected_indices = tf.image.non_max_suppression(
    boxes=boxes,
    scores=scores,
    max_output_size=7,
    iou_threshold=0.5)
You should remove the first (batch) dimension of the tensors (boxes and scores in the example above).
If your batch dimension is 1, you can use:
boxes = tf.squeeze(boxes)
scores = tf.squeeze(scores)
If you do need to keep the batch dimension, it seems the batched variant supports it:
https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression
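For reference, a hedged sketch of the batched variant (shapes follow the docs linked above; the random tensors are placeholders, not from the question):

import tensorflow as tf

boxes = tf.random.uniform([1, 500, 1, 4])   # [batch, num_boxes, q, 4]
scores = tf.random.uniform([1, 500, 1])     # [batch, num_boxes, num_classes]
nmsed_boxes, nmsed_scores, nmsed_classes, valid_detections = \
    tf.image.combined_non_max_suppression(
        boxes=boxes,
        scores=scores,
        max_output_size_per_class=7,
        max_total_size=7,
        iou_threshold=0.5)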

Getting Error while performing Undersampling for Sklearn

I am trying to build a random forest classifier for binary classification. My data is imbalanced, hence I am performing undersampling.
from imblearn.under_sampling import RandomUnderSampler
from sklearn import metrics

train = data.drop(['Co_Name','Cust_ID','Phone','Shpr_ID','Resi_Cnt','Buz_Cnt','Nearby_Cnt','parseNumber','removeString','Qty','bins','Adj_Addr','Resi','Weight','Resi_Area','Lat','Lng'], axis=1)
Y = data['Resi']

rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(train, Y)
I am getting the below error
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
How can I fix this?
Can you share the dataframe, or a sample of it?
This error can be a lot of things, for example:
If you try:
np.asarray(
    [
        [1, 2],
        [2, 3, 4]
    ],
    dtype=np.float)
You will get:
ValueError: setting an array element with a sequence.
This is because the nested lists have different lengths: you can't build a rectangular array from lists whose element counts don't match, so NumPy falls back to an object array and the float conversion fails.
Your error, though, is probably related to the shapes of train and Y, or to the dtypes of the columns in train. RandomUnderSampler's fit converts the data internally, and that conversion is what throws this error. Confirm that train contains only numeric, scalar values before calling RandomUnderSampler.
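As a quick check, a sketch along these lines may help (the select_dtypes filter is an assumption about what you want to keep, not part of the original answer):

# Inspect which columns are still non-numeric (object dtype)
print(train.dtypes[train.dtypes == object])

# Keep only numeric columns before resampling
train_numeric = train.select_dtypes(include="number")

rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(train_numeric, Y)

Note that fit_sample is named fit_resample in newer imbalanced-learn versions.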

Getting error "could not convert string to float" in countvectorized sparse dataset

I have a dataset with text, categorical, and numeric data. I have count-vectorized the text data and added it to the dataframe. Now, when I try to fit the model, I get the error below.
Error
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
400 force_all_finite)
401 else:
--> 402 array = np.array(array, dtype=dtype, order=order, copy=copy)
403
404 if ensure_2d:
ValueError: could not convert string to float: 'IP'
Code
cv = CountVectorizer(max_features=500, analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)
train = data.drop(['Co_Name','Cust_ID','Phone','Shpr_ID','Resi_Cnt','Buz_Cnt','Nearby_Cnt','parseNumber','removeString','Qty','bins','Adj_Addr','Resi','Co_Name_FLag','Phone_Type'], axis=1)
Y = data['Resi']
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.3)
gbc = GradientBoostingClassifier(max_depth = 7, n_estimators=1500, min_samples_leaf=10)
print('Training GBC')
gbc.fit(X_train, y_train)
I guess this is because the categorical data is not getting converted into numerical data during model building. How can I dynamically convert it into numerical data?
Categorical fields inside the sparse matrix:
Phone_Type Co_Name_FLag Product
undefined Present IP
Landline Present IP
undefined Not_present IP
Landline Present IPD
Mobile Not_Present IP
Landline Present IE
Mobile Present IPF
Landline Present IP
undefined Present IP
Landline Present IP
You're getting this ValueError because you have not dropped the Product field from train.
You identified three categorical fields in your design matrix: Product, Phone_Type, and Co_Name_FLag. When you define train, you drop the latter two, but you keep Product. If you want to keep Product as a predictor, either apply pd.get_dummies() while it's still in Pandas format, or, alternatively, use sklearn's LabelEncoder.
Either way, the point is to transform the categorical variable into a series of indicator variables which represent each category level.
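As a hedged sketch of the pd.get_dummies() route (applied before the drop that builds train; column names as in the question):

# One-hot encode Product while the data is still a DataFrame;
# this replaces Product with indicator columns such as Product_IP, Product_IPD, ...
data = pd.get_dummies(data, columns=['Product'])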

Can't fit scikit-neuralnetwork classifier because of tuple index out of range

I am trying to get this classifier working. It is an extension for scikit-learn with dependencies on Theano.
My goal was to fit a neural network on a list of years and teach it whether each year is a leap year or not (later I would increase the range). But I run into an error when I test this example.
My code looks like this:
leapyear.py
import numpy as np
import calendar
from sknn.mlp import Classifier, Layer
from sklearn.cross_validation import train_test_split

# create years in range
years = np.arange(1970, 2001)
pre_is_leap = []

# test if year is a leapyear
for x in years:
    pre_is_leap.append(calendar.isleap(x))

# convert true, false list to 0,1 list
is_leap = np.array(pre_is_leap, dtype=bool).astype(int)

# split
years_train, years_test, is_leap_train, is_leap_test = train_test_split(years, is_leap, test_size=0.33, random_state=42)

# test output
print(len(years_train))
print(len(is_leap_train))
print(years_train)
print(is_leap_train)

# neural network
nn = Classifier(
    layers=[
        Layer("Maxout", units=100, pieces=2),
        Layer("Softmax")],
    learning_rate=0.001,
    n_iter=25)

# fit
nn.fit(years_train, is_leap_train)
#nn.fit(np.array(years_train), np.array(is_leap_train))
requirements.txt
numpy==1.9.2
PyYAML==3.11
scikit-learn==0.16.1
scikit-neuralnetwork==0.3
scipy==0.16.0
Theano==0.7.0
my output with error:
20
20
[1986 1975 1983 1981 1992 1971 1972 1995 1973 1991 1996 1988 2000 1990 1977
1980 1984 1998 1989 1976]
[0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 0 1]
/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/utils/validation.py:498: UserWarning: MinMaxScaler assumes floating point values as input, got int64
"got %s" % (estimator, X.dtype))
/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/preprocessing/data.py:256: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
X *= self.scale_
/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/preprocessing/data.py:257: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
X += self.min_
Traceback (most recent call last):
File "/home/devnull/master/scikit/leapyear.py", line 47, in <module>
pipeline.fit(years_train, is_leap_train)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sklearn/pipeline.py", line 141, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 283, in fit
return super(Classifier, self)._fit(X, yp)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 127, in _fit
X, y = self._initialize(X, y)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 37, in _initialize
self._create_specs(X, y)
File "/home/devnull/master/scikit/env/lib/python3.4/site-packages/sknn/mlp.py", line 67, in _create_specs
self.unit_counts = [numpy.product(X.shape[1:]) if self.is_convolution else X.shape[1]]
IndexError: tuple index out of range
I looked into the sources of mlp.py, but I don't know how to fix it. What has to be changed so that I can fit my network?
Update (not directly related to the question):
I just wanted to add that I need to convert each year to a binary representation; after this, the neural network works.
The problem is that the classifier requires the data to be presented as a 2-dimensional numpy array, with the first axis being the samples and the second axis being the features.
In your case you have only one "feature" (the year), so you need to turn the years data into an Nx1 2D numpy array. This can be achieved by adding the following line just before the data split statement:
years = np.array([[year] for year in years])
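Equivalently (a common numpy idiom, not part of the original answer), the same reshape can be written as:

years = years.reshape(-1, 1)  # shape (N,) -> (N, 1)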