From numpy array of sentences to array of embedding - tensorflow

I'm learning to use TensorFlow and trying to classify text. I have a dataset where each text is associated with a label, 0 or 1. My goal is to use a sentence embedding to do the classification. First I created an embedding of the whole text using the precompiled GNews embedding:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[2], dtype=tf.string,
                           trainable=True, output_shape=[None, 20])
Now I'd like to try something else (similar to the method described at http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) and I wanted to:
Separate each text into sentences.
Create an array of embeddings for each text, one per sentence.
Use that as input for my model.
I'm able to separate the texts into sentences. Each text is an array of sentences saved as:
[array(['AITA - Getting Hugged At The Bar .',
'This all happened less than an hour ago..',
'I was at a bar I frequent and talking to some people I know, suddenly I feel someone from behind me hugging and starting to grind against me.',
"I know a lot of people at the bar, and assume it's a friend of mine, but when I look down at the shoes I do not recognize them.",
'I look back and I see a dude I do not know, nor have I ever seen.',
"He looks back at me, with horror in his eyes, because I'm a dude too...",
'I feel an urge of rage inside me and shove him in the chest with my elbow so I can get away..',
'He goes to his table and I go back to mine.',
'I was with my roommate and his girlfriend.',
'They asked what happened and I told them, then I see the guy who hugged me looking around for me.',
'Him and two of his friends come up to us and he says: .',
'"I just wanted to apologize, I thought you were someone else.".',
'I respond, "I understand, just check before you hug people.',
'Now, please fuck off".',
'He repeats his last statement, so do I.',
'This happens one more time and at this point his friends have surrounded me, my roommate is on his feet and I have left my beer at the table.',
'His friend goes in my face and says.', '.',
'"He just wanted to apologize, you really shouldn\'t be yelling at us" and starts waiving his finger at me.. We are at a rock bar, it\'s loud, I was speaking louder just to be sure I am heard..',
'The manager knows me so he comes asking me what happened.',
'I explain the situation and he speaks with them then he tells me.',
'.', '"They want to say sorry, can you guys shake hand?', '".',
'"Yeah sure, I just want them to leave me alone."', '.',
"Honestly I didn't even want to touch the guy, but whatever.",
"We shake hands and they go away.. Me and my roommate look at their table and there's no one that looks anything like me.",
'So, reddit, did I overreact?', 'Am I The Asshole here?'],
dtype='<U190')
array(["AITA if i don't want to pay my friend 5 dollars for a slice of pizzaSo, my friend bought herself, our other friend and I a pizza to eat for lunch.",
'Me and other friend ate 1 slice of pizza from an extra large pizza.',
'Other friend has already paid my friend that bought the pizza 5 dollars..',
'I am trying to save money wherever i can, but she really wants me to pay her 5 dollars "so its fair".. AITA?'],
dtype='<U146')
Now when I try to create an embedding from one element of the array it works. Here is my embedding function:
def embedding_f(test):
    print("test shape:", test.shape)
    # a = tf.constant(test)
    embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
    hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string,
                               trainable=True, output_shape=[None, 20])
    ret = hub_layer(test)
    # print(ret)
    return ret.numpy()
# Works
emb = cnn.embedding_f(train_data[0])
But if I try to input a batch of data (as will be done later in the pipeline), the program crashes:
# Crashes
emb = cnn.embedding_f(train_data[0:2])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-76f4f9171cad> in <module>
----> 1 emb = cnn.embedding_f(train_data[0:2])
~/AITA/aita/cnn.py in embedding_f(test)
22 hub_layer = hub.KerasLayer(embedding, input_shape=[2], dtype=tf.string,
23 trainable=True, output_shape=[None, 20])
---> 24 ret = hub_layer(test)
25 # print(ret)
26 return ret.numpy()
/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
817 return ops.convert_to_tensor_v2(x)
818 return x
--> 819 inputs = nest.map_structure(_convert_non_tensor, inputs)
820 input_list = nest.flatten(inputs)
821
/usr/lib/python3.8/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
615
616 return pack_sequence_as(
--> 617 structure[0], [func(*x) for x in entries],
618 expand_composites=expand_composites)
619
/usr/lib/python3.8/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
615
616 return pack_sequence_as(
--> 617 structure[0], [func(*x) for x in entries],
618 expand_composites=expand_composites)
619
/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py in _convert_non_tensor(x)
815 # `SparseTensors` can't be converted to `Tensor`.
816 if isinstance(x, (np.ndarray, float, int)):
--> 817 return ops.convert_to_tensor_v2(x)
818 return x
819 inputs = nest.map_structure(_convert_non_tensor, inputs)
/usr/lib/python3.8/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
1276 ValueError: If the `value` is a tensor not of given `dtype` in graph mode.
1277 """
-> 1278 return convert_to_tensor(
1279 value=value,
1280 dtype=dtype,
/usr/lib/python3.8/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
1339
1340 if ret is None:
-> 1341 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1342
1343 if ret is NotImplemented:
/usr/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py in _default_conversion_function(***failed resolving arguments***)
50 def _default_conversion_function(value, dtype, name, as_ref):
51 del as_ref # Unused.
---> 52 return constant_op.constant(value, dtype, name=name)
53
54
/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
259 ValueError: if called on a symbolic tensor.
260 """
--> 261 return _constant_impl(value, dtype, shape, name, verify_shape=False,
262 allow_broadcast=True)
263
/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
268 ctx = context.context()
269 if ctx.executing_eagerly():
--> 270 t = convert_to_eager_tensor(value, ctx, dtype)
271 if shape is None:
272 return t
/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
94 dtype = dtypes.as_dtype(dtype).as_datatype_enum
95 ctx.ensure_initialized()
---> 96 return ops.EagerTensor(value, ctx.device_name, dtype)
97
98
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
The error states that it's not possible to convert a NumPy array to a tensor. I've tried changing the input_shape parameter of the KerasLayer to no avail. The only solution I see is to compute the embedding for each text by looping through them one by one before feeding the results to the rest of the network, but that seems highly inefficient (and requires too much memory for my laptop). The word-embedding examples I've seen do it this way, however.
What is the correct way to go about getting a list of embedding from multiple sentences?

I think your output_shape should be set to [20] (from https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer):
hub.KerasLayer("/tmp/text_embedding_model",
output_shape=[20], # Outputs a tensor with shape [batch_size, 20].
input_shape=[], # Expects a tensor of shape [batch_size] as input.
dtype=tf.string) # Expects a tf.string input tensor.
Using TF 2.4.1 and tensorflow_hub 0.11.0, this works for me:
data = np.array(['AITA - Getting Hugged At The Bar .', 'This all happened less than an hour ago..'])
model_url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
embedding = hub.KerasLayer(model_url, input_shape=[], dtype=tf.string,
                           trainable=True, output_shape=[20])(data)
If you don't want to add layers on top of the KerasLayer, you can also just call
model = hub.load(model_url)
embedding = model(data)
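If you need embeddings for a whole batch of texts where each text has a different number of sentences (the ragged object array from the question), a minimal sketch, assuming the same module, is to flatten all sentences into one string array, embed them in a single call, and split the result back per text; train_data and per_text_embeddings are illustrative names:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

model_url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
embed = hub.KerasLayer(model_url, input_shape=[], dtype=tf.string, output_shape=[20])

# train_data is assumed to be an object array: one array of sentences per text.
lengths = [len(text) for text in train_data]
flat_sentences = np.concatenate(train_data)           # shape [total_sentences]
flat_embeddings = embed(tf.constant(flat_sentences))  # shape [total_sentences, 20]

# Split back into one [num_sentences_i, 20] array per text.
per_text_embeddings = np.split(flat_embeddings.numpy(), np.cumsum(lengths)[:-1])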

Related

'str' object has no attribute '_keras_mask' error when using tf.keras.Sequential

Background
I am using Tensorflow for the first time following a tutorial on featurization with the new Google Recommenders package: https://www.tensorflow.org/recommenders/examples/featurization
I ran into trouble swapping out their dataset (MovieLens) for one based on the Kaggle wine data. The following code works as expected:
wine_title_lookup= tf.keras.layers.experimental.preprocessing.StringLookup()
wine_title_lookup.adapt(np.unique(wine_train['title']))
print(f"Vocabulary: {wine_title_lookup.get_vocabulary()[:3]}")
Vocabulary: ['', '[UNK]', 'Žitavské Vinice Rhine Riesling']
wine_title_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=wine_title_lookup.vocab_size(),
    output_dim=32
)
x = wine_title_lookup(["Susana Balbo Signature Malbec"])
x = wine_title_embedding(x)
x
<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[-0.03861505, -0.02146437, 0.04332292, -0.02598745, 0.03842534,
-0.01066433, 0.0292404 , 0.02783312, 0.03364438, 0.00054752,
-0.0295071 , 0.03200008, 0.01224083, -0.00100452, -0.04346857,
0.00105418, -0.01640136, -0.01778026, 0.00171928, 0.03215903,
0.00020416, -0.02083766, -0.00323264, 0.02582215, 0.04805436,
0.0325211 , 0.0100181 , -0.04965406, 0.02548517, 0.01569786,
0.03761304, 0.01659941]], dtype=float32)>
However the following produces an error
wine_title_model = tf.keras.Sequential([wine_title_lookup, wine_title_embedding])
wine_title_model(["Susana Balbo Signature Malbec"])
AttributeError Traceback (most recent call last)
in ()
----> 1 wine_title_model(["Susana Balbo Signature Malbec"])
3 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in call(self, *args, **kwargs)
983
984 with ops.enable_auto_cast_variables(self._compute_dtype_object):
--> 985 outputs = call_fn(inputs, *args, **kwargs)
986
987 if self._activity_regularizer:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/sequential.py in call(self, inputs, training, mask)
370 if not self.built:
371 self._init_graph_network(self.inputs, self.outputs)
--> 372 return super(Sequential, self).call(inputs, training=training, mask=mask)
373
374 outputs = inputs # handle the corner case where self.layers is empty
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py in call(self, inputs, training, mask)
384 """
385 return self._run_internal_graph(
--> 386 inputs, training=training, mask=mask)
387
388 def compute_output_shape(self, input_shape):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py in _run_internal_graph(self, inputs, training, mask)
482 masks = self._flatten_to_reference_inputs(mask)
483 for input_t, mask in zip(inputs, masks):
--> 484 input_t._keras_mask = mask
485
486 # Dictionary mapping reference tensors to computed tensors.
AttributeError: 'str' object has no attribute '_keras_mask'
Notable differences with the source material
The Google code I based my script on uses a data format I am unfamiliar with, which allows them to run map on their data. I tried converting my data into some TensorFlow formats but could not replicate their functionality. However, this is the only step that is different, and I cannot understand why the pieces of the Sequential op work individually but not as a whole.
I looked at some other examples from when this error has popped up on SO, but could not find a solution to my problem. This is what the raw data looks like.
wine_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108655 entries, 0 to 120727
Data columns (total 16 columns):
Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 108600 non-null object
1 description 108652 non-null object
2 designation 77150 non-null object
3 points 108336 non-null float64
4 price 100871 non-null float64
5 province 108600 non-null object
6 region_1 108655 non-null object
7 region_2 42442 non-null object
8 title 108655 non-null object
9 variety 108655 non-null object
10 winery 108655 non-null object
11 designation_replace 108655 non-null object
12 user_id 108655 non-null int64
13 price_isna 108655 non-null bool
14 price_imputed 108650 non-null float64
15 wine_id 108655 non-null int64
dtypes: bool(1), float64(3), int64(2), object(10)
memory usage: 13.4+ MB
Try using
wine_title_model.predict(["Susana Balbo Signature Malbec"])
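This is not part of the original answer, but another thing that often clears the same error (assuming the issue is that a plain Python str never gets converted to a tensor, so Keras cannot set _keras_mask on it): wrap the input in a tensor before calling the model.
import tensorflow as tf

wine_title_model(tf.constant(["Susana Balbo Signature Malbec"]))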
I solved this problem by fixing the following areas of my code:
Converting wine_train to a TensorFlow-friendly format
When I posted this question I had already tried running tf.data.Dataset.from_tensor_slices on my pandas dataframe, but it will not work directly. Instead, convert the dataframe to a dictionary like so: wine_features_dict = {name: np.array(value) for name, value in wine_train.items()} and then everything runs smoothly.
TensorFlow is very sensitive to missing or NaN values.
I thought I had caught everything, but simply dropping all rows with any missing data got rid of the error. If this is happening to you, make sure there is no missing data and all of your data is either integer or string.
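A minimal sketch of that cleanup, assuming the wine_train dataframe from above (casting the object columns to plain strings is an extra illustrative step, not something the original fix requires):
wine_train = wine_train.dropna().reset_index(drop=True)
obj_cols = wine_train.select_dtypes('object').columns
wine_train = wine_train.astype({c: str for c in obj_cols})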
Edited 13 Oct 2021:
Adding the full solution below - as requested
We want to convert the data into a dictionary of arrays:
wine_features_dict = {name: np.array(value) for name, value in wine_train.items()}
import itertools

def slices(features):
    for i in itertools.count():
        # For each feature take index `i`
        example = {name: values[i] for name, values in features.items()}
        yield example

for example in slices(wine_features_dict):
    for name, value in example.items():
        print(f"{name:19s}: {value}")
    break
create features_ds
features_ds = tf.data.Dataset.from_tensor_slices(wine_features_dict)
for example in features_ds:
    for name, value in example.items():
        print(f"{name:19s}: {value}")
    break
which yields
country : b'Portugal'
description : b"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016."
points : 87.0
price : 15.0
province : b'Douro'
title : b'Quinta dos Avidagos 2011 Avidagos Red (Douro)'
variety : b'Portuguese Red'
winery : b'Quinta dos Avidagos'
designation_replace: b'Avidagos'
user_id : b'15'
price_isna : False
price_imputed : 15.0
wine_id : 1
Preprocessing and other stuff
Between this and the actual model definition there are a bunch of things too lengthy for a StackOverflow post. Essentially, we create embeddings for the categorical variables, normalize or discretize the continuous features, and build word embeddings for the text; I also added timestamps that had to be preprocessed. In short (a sketch of one continuous branch follows this list):
wine titles, user_ids -> vocabularies -> embeddings
price, price imputed -> normalize or discretize (bucket) -> embed buckets
words -> tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding
timestamps -> bucket based on max and min -> embed buckets
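A minimal sketch of the price branch, assuming the cleaned wine_train dataframe from above; the bucket count and embedding size here are illustrative assumptions, not values from the original model:
import numpy as np
import tensorflow as tf

prices = wine_train['price_imputed'].values.astype('float32')

# Normalization branch: zero mean, unit variance.
normalized_price = tf.keras.layers.experimental.preprocessing.Normalization()
normalized_price.adapt(prices)

# Discretization branch: bucket the prices, then embed the bucket ids.
price_buckets = np.linspace(prices.min(), prices.max(), num=20)
price_embedding = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.Discretization(price_buckets.tolist()),
    tf.keras.layers.Embedding(len(price_buckets) + 2, 32),
])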
Next I explicitly defined the training inputs
inputs = {}
for name, column in wine_train.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)
and then the model
class UserModel(tf.keras.Model):
    def __init__(self, use_timestamps=False, use_country_origin=False):
        super().__init__()
        self._use_timestamps = use_timestamps
        self._use_country_origin = use_country_origin
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.StringLookup(
                vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
        ])
        '''
        # Can also do this if user_id_lookup is defined
        self.user_embedding = tf.keras.Sequential([
            user_id_lookup,
            tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32),
        ])
        '''
        if use_country_origin:
            self.country_embedding = tf.keras.Sequential([
                tf.keras.layers.experimental.preprocessing.StringLookup(
                    vocabulary=unique_countries, mask_token=None),
                tf.keras.layers.Embedding(len(unique_countries) + 1, 32),
            ])
        if use_timestamps:
            self.timestamp_embedding = tf.keras.Sequential([
                tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
                tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32),
            ])
            self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()
            self.normalized_timestamp.adapt(timestamps)

    def call(self, inputs):
        # If timestamps are not active, just do the user_id embedding
        if not self._use_timestamps:
            # Ignore country of origin if it is not enabled
            if not self._use_country_origin:
                return self.user_embedding(inputs['user_id'])
            return tf.concat([
                self.user_embedding(inputs["user_id"]),
                self.country_embedding(inputs["country"]),
            ], axis=1)
        # Take the input dictionary, pass it through each input layer,
        # and concatenate the result.
        if not self._use_country_origin:
            return tf.concat([
                self.user_embedding(inputs["user_id"]),
                self.timestamp_embedding(inputs["timestamp"]),
                self.normalized_timestamp(inputs["timestamp"]),
            ], axis=1)
        return tf.concat([
            self.user_embedding(inputs["user_id"]),
            self.timestamp_embedding(inputs["timestamp"]),
            self.normalized_timestamp(inputs["timestamp"]),
            self.country_embedding(inputs["country"]),
        ], axis=1)
We can call it as well:
user_model = UserModel()
# Delete quotes if timestamps are available
'''user_model.normalized_timestamp.adapt(
    ratings.map(lambda x: x["timestamp"]).batch(128))
'''
for row in features_ds.batch(1).take(1):
    print(f"Computed representations: {user_model(row)[0, :3]}")

Python Sklearn "ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets" error

I have already visited this answer but didn't understand it.
I don't get this error when I use the train_test_split function to use the same dataset for testing and training.
But when I try to use different csv files for testing and training, I get this error.
link to titanic kaggle competition
Can someone please explain why I am getting this error?
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df,survived_df)
predictions=logreg.predict(test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(test_survived,predictions)  # error here: ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets
print(accuracy)
Full Error
ValueError Traceback (most recent call last)
<ipython-input-243-89c8ae1a928d> in <module>
----> 1 logreg.score(test,test_survived)
2
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
497 """
498 from .metrics import accuracy_score
--> 499 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
500
501 def _more_tags(self):
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
185
186 # Compute accuracy for each possible representation
--> 187 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
188 check_consistent_length(y_true, y_pred, sample_weight)
189 if y_type.startswith('multilabel'):
~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
88
89 if len(y_type) > 1:
---> 90 raise ValueError("Classification metrics can't handle a mix of {0} "
91 "and {1} targets".format(type_true, type_pred))
92
ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets
Full Code
df=pd.read_csv('data/train.csv')
test=pd.read_csv('data/test.csv')
test_survived=pd.read_csv('data/gender_submission.csv')
plt.figure(5)
df=df.drop(columns=['Name','SibSp','Ticket','Cabin','Parch','Embarked'])
test=test.drop(columns=['Name','SibSp','Ticket','Cabin','Parch','Embarked'])
sns.heatmap(df.isnull(),),
plt.figure(2)
sns.boxplot(data=df,y='Age')
# from the boxplot the 75th percentile seems to be 38 and the 25th percentile seems to be 20...
# so multiplying by 1.5 at both ends, Age in (10, 57) seems good and any value outside this let's consider as outliers..
# also using this age for calculating the mean used to replace NA values of Age.
df=df.loc[df['Age'].between(9,58),]
# test=test.loc[test['Age'].between(9,58),]
# test=test.loc[test['Age'].between(9,58),]
df=df.reset_index(drop=True,)
class_3_age=df.loc[df['Pclass']==3].Age.mean()
class_2_age=df.loc[df['Pclass']==2].Age.mean()
class_1_age=df.loc[df['Pclass']==1].Age.mean()
def remove_null_age(data):
    agee=data[0]
    pclasss=data[1]
    if pd.isnull(agee):
        if pclasss==1:
            return class_1_age
        elif pclasss==2:
            return class_2_age
        else:
            return class_3_age
    return agee
df['Age']=df[["Age","Pclass"]].apply(remove_null_age,axis=1)
test['Age']=test[["Age","Pclass"]].apply(remove_null_age,axis=1)
sex=pd.get_dummies(df['Sex'],drop_first=True)
test_sex=pd.get_dummies(test['Sex'],drop_first=True)
sex=sex.reset_index(drop=True)
test_sex=test_sex.reset_index(drop=True)
df=df.drop(columns=['Sex'])
test=test.drop(columns=['Sex'])
df=pd.concat([df,sex],axis=1)
test=test.reset_index(drop=True)
df=df.reset_index(drop=True)
test=pd.concat([test,test_sex],axis=1)
survived_df=df["Survived"]
df=df.drop(columns='Survived')
test["Age"]=test['Age'].round(1)
test.at[152,'Fare']=30
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df,survived_df)
predictions=logreg.predict(test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(test_survived,predictions)
print(accuracy)
You probably want to get the accuracy for the predictions together with the column Survived of the test_survived dataframe:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(test_survived['Survived'],predictions)
print(accuracy)
Your error occurred because accuracy_score() only takes two 1-dimensional arrays, one as the ground-truth labels and the other as the predicted labels. But you provided a 2-dimensional "array" (the dataframe) and the 1-dimensional predictions, hence it assumed that your first input is a multiclass-multioutput target.
The documentation is also very resourceful for this.
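A minimal illustration of the mismatch (the column names are assumed to match Kaggle's gender_submission.csv):
import pandas as pd
from sklearn.metrics import accuracy_score

test_survived = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})
predictions = [0, 1]

# accuracy_score(test_survived, predictions)  # raises the ValueError: 2-D targets vs 1-D predictions
print(accuracy_score(test_survived["Survived"], predictions))  # 1.0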

using `switch`/`cond` in a custom loss function

I need to implement a custom loss function in keras that computes the standard categorical crossentropy except when the y_true is all zeros.
This is my attempt to do so:
def masked_crossent(y_true, y_pred):
    return K.switch(K.any(y_true),
                    losses.categorical_crossentropy(y_true, y_pred),
                    losses.categorical_crossentropy(y_true, y_pred) * 0)
However, I get the following error once training starts (compilation works fine):
~/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py
in init(self, graph, fetches, feeds)
419 self._ops.append(True)
420 else:
--> 421 self._assert_fetchable(graph, fetch.op)
422 self._fetches.append(fetch_name)
423 self._ops.append(False)
~/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py
in _assert_fetchable(self, graph, op)
432 if not graph.is_fetchable(op):
433 raise ValueError(
--> 434 'Operation %r has been marked as not fetchable.' % op.name)
435
436 def fetches(self):
ValueError: Operation 'IsVariableInitialized_4547' has been marked as
not fetchable.
In place of losses.categorical_crossentropy(y_true, y_pred) * 0, I've also tried the following with various other errors (either during compilation or once training has started):
K.zeros_like(losses.categorical_crossentropy(y_true, y_pred))
K.zeros((K.int_shape(y_true)[0]))
K.zeros((K.int_shape(y_true)[0], 1))
... though I imagine that there is a trivial way to do this.
I only have an idea for a workaround:
def masked_crossent(y_true, y_pred):
    return K.max(y_true) * K.categorical_crossentropy(y_true, y_pred)
You need to add axis=-1 to K.max if this is for whole batches.
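A minimal sketch of that workaround with the per-sample axis, assuming the tf.keras backend; the example inputs are made up for illustration:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

def masked_crossent(y_true, y_pred):
    mask = K.max(y_true, axis=-1)  # 1.0 for one-hot rows, 0.0 for all-zero rows
    return mask * K.categorical_crossentropy(y_true, y_pred)

y_true = np.array([[0., 1., 0.], [0., 0., 0.]], dtype="float32")
y_pred = np.array([[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]], dtype="float32")
print(masked_crossent(tf.constant(y_true), tf.constant(y_pred)).numpy())
# the second entry is 0.0 because its y_true row is all zeros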

How to do cross validation for multiclass data?

I was able to use the following method to do cross validation on binary data, but it does not seem to work for multiclass data:
> cross_validation.cross_val_score(alg, X, y, cv=cv_folds, scoring='roc_auc')
/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
169 y_type = type_of_target(y)
170 if y_type not in ("binary", "multilabel-indicator"):
--> 171 raise ValueError("{0} format is not supported".format(y_type))
172
173 if is_regressor(clf):
ValueError: multiclass format is not supported
> y.head()
0 10
1 6
2 12
3 6
4 10
Name: rank, dtype: int64
> type(y)
pandas.core.series.Series
I also tried changing roc_auc to f1 but still get an error:
/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight)
1016 else:
1017 raise ValueError("Target is %s but average='binary'. Please "
-> 1018 "choose another average setting." % y_type)
1019 elif pos_label not in (None, 1):
1020 warnings.warn("Note that pos_label (set to %r) is ignored when "
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Is there any method I can use to do cross validation for this type of data?
As pointed out in the comment by Vivek Kumar sklearn metrics support multi-class averaging for both the F1 score and the ROC computations, albeit with some limitations when data is unbalanced. So you can manually construct the scorer with the corresponding average parameter or use one of the predefined ones (e.g.: 'f1_micro', 'f1_macro', 'f1_weighted').
If multiple scores are needed, then instead of cross_val_score use cross_validate (available since sklearn 0.19 in the module sklearn.model_selection).
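A minimal sketch under those assumptions; the estimator and data are placeholders (iris and a logistic regression), since the original alg, X, y are not shown:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score, cross_validate

X, y = load_iris(return_X_y=True)
alg = LogisticRegression(max_iter=1000)

# Predefined multi-class scorer:
print(cross_val_score(alg, X, y, cv=5, scoring='f1_macro'))

# Or build a scorer explicitly with an averaging strategy, and request several scores at once:
f1_weighted = make_scorer(f1_score, average='weighted')
scores = cross_validate(alg, X, y, cv=5, scoring={'f1w': f1_weighted, 'f1_micro': 'f1_micro'})
print(scores['test_f1w'], scores['test_f1_micro'])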

Why this error appears during fit while creating a Decision Tree Classifier

Hi, I am trying the DecisionTreeClassifier by following this video: Hello World - Machine Learning Recipes #1 (Google Developers).
Here is my Code.
#Import the Pandas library
import pandas as pd

#Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
#Print the head of the train dataframe
train.head()

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
#Print the head of the test dataframe
test.head()

#from sklearn import tree
from sklearn import tree

#find the best feature to predict Survival rate
#define X_features and Y_labels
col_names = ['Pclass', 'Age', 'SibSp', 'Parch']
X_features = train[col_names]
#assign Survived to label
Y_labels = train.Survived

#create a decision tree classifier
clf = tree.DecisionTreeClassifier()
#fit (find patterns in Data)
clf = clf.fit(X_features, Y_labels)
clf.predict(test[col_names])
Getting Error
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
     13 #Y_train_sparse=Y_labels.to_sparse()
     14 # fit (find patterns in Data)
---> 15 clf=clf.fit(X_features, Y_labels)
     16 #clf.predict(test[col_names])

C:\Users\nitinahu\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    152         random_state = check_random_state(self.random_state)
    153         if check_input:
--> 154             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    155         if issparse(X):
    156             X.sort_indices()

C:\Users\nitinahu\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    396             % (array.ndim, estimator_name))
    397         if force_all_finite:
--> 398             _assert_all_finite(array)
    399
    400     shape_repr = _shape_repr(array.shape)

C:\Users\nitinahu\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X)
     52             and not np.isfinite(X).all()):
     53         raise ValueError("Input contains NaN, infinity"
---> 54                          " or a value too large for %r." % X.dtype)
     55
     56

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Just check all the values you are getting in the responses.
One or two are giving out-of-bound values, and that is causing an overflow to occur.
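A minimal sketch of that check, assuming the X_features and Y_labels defined above; imputing Age with the median is just one illustrative choice, not the only fix:
import numpy as np

print(X_features.isnull().sum())   # missing values per column (Age has NaNs in this dataset)
print(np.isinf(X_features).sum())  # infinite values per column

X_features = X_features.fillna(X_features.median())  # or drop the offending rows instead
clf = tree.DecisionTreeClassifier().fit(X_features, Y_labels)
clf.predict(test[col_names].fillna(test[col_names].median()))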