How to change 0.0 - 1.0 class predictions to integer [duplicate] - numpy

I was doing multi-class classification using Keras. It had 5 output classes. I converted the single class vector to a matrix using one-hot encoding and built a model. Now, to evaluate the model, I want to convert the 5-class probabilistic result back to a single column.
I am getting this as output, in NumPy array format:
#        0              1              2              3              4
[[5.35433665e-02 1.72592481e-05 1.49291719e-03 9.44392741e-01 5.53713820e-04]
 [1.97096306e-05 2.08907949e-08 3.11666554e-07 1.40611945e-07 9.99979794e-01]
 [9.99999225e-01 2.42999278e-07 1.58917388e-07 7.84497018e-08 2.85837785e-07]
 [7.09977685e-05 1.02068476e-09 1.38186664e-07 9.99928594e-01 2.73126261e-07]
 [1.29937407e-05 2.49388819e-07 9.99986231e-01 4.76015231e-07 7.39421040e-08]]
I want to convert this matrix to:
[3, 4, 0, 3, 2]

It seems like you are looking for np.argmax:
import numpy as np
class_labels = np.argmax(class_prob, axis=1) # assuming you have n-by-5 class_prob
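For the sample array above (stored here in a hypothetical class_prob variable), a quick check that this recovers the expected labels:
import numpy as np
class_prob = np.array([
    [5.35433665e-02, 1.72592481e-05, 1.49291719e-03, 9.44392741e-01, 5.53713820e-04],
    [1.97096306e-05, 2.08907949e-08, 3.11666554e-07, 1.40611945e-07, 9.99979794e-01],
    [9.99999225e-01, 2.42999278e-07, 1.58917388e-07, 7.84497018e-08, 2.85837785e-07],
    [7.09977685e-05, 1.02068476e-09, 1.38186664e-07, 9.99928594e-01, 2.73126261e-07],
    [1.29937407e-05, 2.49388819e-07, 9.99986231e-01, 4.76015231e-07, 7.39421040e-08]])
print(np.argmax(class_prob, axis=1))  # [3 4 0 3 2]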

Related

How to handle unknown number of values for a categorical feature?

I have a pandas dataframe that looks like this
Text | Label
Some text | 0
hellow bye what | 1
...
Each row is a data point. Label is 0/1 binary. The only feature is Text which contains a set of words. I want to use the presence or absence of each word as features. For example, the features could be contains_some contains_what contains_hello contains_bye, etc. This is typical one hot encoding.
However, I don't want to manually create so many features, one for every single word in the vocabulary (the vocabulary is not huge, so I am not worried about the feature set exploding). I just want to supply the list of words as a single column to TensorFlow and have it create a binary feature for each word in the vocabulary.
Does tensorflow/keras have an API to do this?
You can use sklearn for this. Try:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)  # binary=True marks presence/absence instead of counts
X = vectorizer.fit_transform(old_df['Text'])
# On scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out()
new_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
new_df['Label'] = old_df['Label']  # note the capital L, matching the dataframe above
and this should give you:
   bye  hellow  some  text  what  Label
0    0       0     1     1     0      0
1    1       1     0     0     1      1
CountVectorizer converts a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix, and if binary=True then all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
What you're looking for is a (binary) bag of words, which you can get from scikit-learn using its CountVectorizer.
You can do something like:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(1, 1), binary=True)
X_train = bow.fit_transform(df_train['text'].values)
This will create an array of binary values indicating the presence of a word in each text. Use binary=True to output a 1 or 0 depending on whether the word is present; without this argument you will get counts of occurrences per word. Either method works fine.
In order to inspect the counts you could use the below:
# Create a sample dataframe of BoW outputs for the first document
# (on scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out())
count_vect_df = pd.DataFrame(X_train[:1].todense(), columns=bow.get_feature_names_out())
# Re-order counts in descending order; keep the top 10 counts for demo purposes
count_vect_df = count_vect_df[count_vect_df.iloc[-1, :].sort_values(ascending=False).index[:10]]
# Print the original train dataframe alongside its BoW counts
pd.concat([df_train['text'][:1].reset_index(drop=True), count_vect_df], axis=1)
Update
If your features include categorical data, you could try using to_categorical from tf.keras; see the docs for more information. A small example follows.
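A minimal sketch (the label values here are made up for illustration):
from tensorflow.keras.utils import to_categorical
labels = [0, 2, 1, 4]                            # integer class labels
one_hot = to_categorical(labels, num_classes=5)  # shape (4, 5), one row of 0s/1s per label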

TensorFlow: Failed to convert a NumPy array to a Tensor (Unsupported object type int)

I am practicing on this Kaggle dataset on car price prediction (https://www.kaggle.com/hellbuoy/car-price-prediction). I don't know why I am receiving this error.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras import layers,models
cars_data=pd.read_csv('/content/CarPrice_Assignment.csv')
cars_data.head()
cars_data.info()
cars_data.describe()
train_data=cars_data.iloc[:103]
train_data=train_data.drop('price',axis=1)
train_data=np.asarray(train_data.values)
train_targets=cars_data.price.iloc[:103]
train_targets=np.asarray(train_targets)
test_data=cars_data.iloc[103:165]
test_data=test_data.drop('price',axis=1)
test_data=np.asarray(test_data.values)
test_targets=cars_data.price.iloc[103:165]
test_targets=np.asarray(test_targets)
val_data=cars_data.iloc[165:]
val_data=val_data.drop('price',axis=1)
val_data=np.asarray(val_data.values)
val_targets=cars_data.price.iloc[165:]
val_targets=np.asarray(val_targets)
model=models.Sequential()
model.add(layers.Dense(10,activation='relu',input_shape=(25,)))
model.add(layers.Dense(8,activation='relu'))
model.add(layers.Dense(6,activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop',loss='mse',metrics=['mae'])
model.fit(train_data,train_targets,epochs=20,batch_size=1)
There are 2 things you need to address in your code.
Categorical Variables
By printing the value of train_data, I can see there are still some categorical variables in the form of strings. TensorFlow cannot process that kind of data directly, so you need to deal with the categorical variables first. See the answer to Best way to deal with categorical variables in regression problem - python as a starting point, or the sketch just below.
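For instance, a minimal sketch using pandas.get_dummies (assuming the string columns are exactly the ones pandas reads in as object dtype):
import pandas as pd
cars_data = pd.read_csv('/content/CarPrice_Assignment.csv')
# One-hot encode every string column; numeric columns pass through unchanged
object_cols = cars_data.select_dtypes(include='object').columns
cars_data = pd.get_dummies(cars_data, columns=object_cols)
Note that this changes the number of feature columns, so input_shape=(25,) has to be updated to match.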
Target shape
Your train_targets shape is (103,), which means it is a 1D array. The expected shape of the targets for a simple regression problem in TensorFlow is (103, 1). Modify your code like this to reshape the values:
train_targets=np.asarray(train_targets).reshape(-1,1)
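For example, with dummy target values just to show the shape change:
import numpy as np
train_targets = np.asarray([13495.0, 16500.0, 16500.0])  # shape (3,)
train_targets = train_targets.reshape(-1, 1)              # shape (3, 1)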

How to perform kmean clustering from Gensim TFIDF values

I am using Gensim for a vector space model. After creating a dictionary and corpus with Gensim, I calculated the TF-IDF (term frequency * inverse document frequency) using the following lines:
Term_IDF = TfidfModel(corpus)
corpus_tfidf = Term_IDF[corpus]
corpus_tfidf contains a list of lists holding term ids and the corresponding TF-IDF values. I then separated the TF-IDF values from the ids using the following lines:
for doc in corpus_tfidf:
    for ids, tfidf in doc:
        IDS.append(ids)
        tfidfmtx.append(tfidf)
        IDS = []
Now I want to use k-means clustering, so I want to compute cosine similarities over the TF-IDF matrix. The problem is that Gensim does not produce a square matrix, so the following line generates an error. I wonder how I can get a matrix from Gensim that lets me calculate the similarities of all the documents in the vector space model. Also, how can I convert the TF-IDF matrix (which in this case is a list of lists) into a 2D NumPy array? Any comments are much appreciated.
dumydist = 1 - cosine_similarity(tfidfmtx)
When you fit your corpus to a Gensim Dictionary, get the number of documents and tokens in the dictionary:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
Transform into bow:
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]
Transform into tf-idf:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]
Now you can transform into sparse/dense matrix:
from gensim.matutils import corpus2dense, corpus2csc
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
Now fit your model using either the sparse or the dense matrix (after transposing, since corpus2dense and corpus2csc return a terms-by-documents matrix):
from sklearn.cluster import KMeans
model = KMeans(n_clusters=7)
clusters = model.fit_predict(corpus_tfidf_dense.T)
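With documents as rows, the cosine-distance computation from the question also works; a quick sketch assuming scikit-learn:
from sklearn.metrics.pairwise import cosine_similarity
# Rows are documents, so this yields a square docs-by-docs distance matrix
dumydist = 1 - cosine_similarity(corpus_tfidf_dense.T)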
To create a document-term matrix from Gensim, you may use matutils.corpus2csc.
Here corpus is a list of lists (a Gensim corpus):
import gensim
# corpus2csc already returns a scipy.sparse.csc_matrix
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
full_matrix = scipy_csc_matrix.toarray()
You may want to stay with the SciPy sparse format if your corpus is very large.

Only size 1 arrays can be converted to python scalars

I created a 3-dimensional array using the numpy.random module, like so:
import numpy as np
b = np.random.randn(4,4,3)
Why can't I cast b to float? Calling float(b) raises:
TypeError: only size 1 arrays can be converted to Python scalars
You can't call float(b) because b isn't a number; it's a multidimensional array. If you're trying to convert every element to a Python float, that's usually a bad idea since you lose NumPy's compact, fixed-size numeric types, but if you really want to do it, you can call b.tolist(), which returns a nested Python list of floats. You can't have a NumPy array whose element type is the native Python float, because NumPy's whole point is to store numbers in its own efficient dtypes.
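For illustration, a minimal sketch of what does and doesn't work here:
import numpy as np
b = np.random.randn(4, 4, 3)
# float(b) -> TypeError: only size 1 arrays can be converted to Python scalars
nested = b.tolist()         # nested 4x4x3 list of Python floats
x = float(b[0, 0, 0])       # a single element converts fine
b32 = b.astype(np.float32)  # cast the whole array to another dtype instead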

pybrain LSTM sequence to predict sequential data

I have written some simple code using pybrain to predict simple sequential data.
For example, a sequence of 0,1,2,3,4 is supposed to produce an output of 5 from the network. The dataset specifies the rest of the sequence.
Below is my implementation:
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SequentialDataSet
from pybrain.structure import SigmoidLayer, LinearLayer
from pybrain.structure import LSTMLayer
import itertools
import numpy as np
INPUTS = 5
OUTPUTS = 1
HIDDEN = 40
net = buildNetwork(INPUTS, HIDDEN, OUTPUTS, hiddenclass=LSTMLayer, outclass=LinearLayer, recurrent=True, bias=True)
ds = SequentialDataSet(INPUTS, OUTPUTS)
ds.addSample([0,1,2,3,4],[5])
ds.addSample([5,6,7,8,9],[10])
ds.addSample([10,11,12,13,14],[15])
ds.addSample([16,17,18,19,20],[21])
net.randomize()
trainer = BackpropTrainer(net, ds)
for _ in range(1000):
    print(trainer.train())
x = net.activate([0, 1, 2, 3, 4])
print(x)
The output on my screen keeps showing [0.99999999 0.99999999 0.9999999 0.99999999] every single time. What am I missing? Is the training not sufficient? trainer.train() keeps reporting an error of 86.625...
The pybrain SigmoidLayer implements the logistic sigmoid squashing function (see the squashing-function code in the pybrain source). The relevant part is this:
def sigmoid(x):
    """ Logistic sigmoid function. """
    return 1. / (1. + safeExp(-x))
So, no matter what the value of x, it will only ever return values between 0 and 1. For this reason, and for others, it is a good idea to scale your input and output values to between 0 and 1. For example, divide all your inputs by the maximum value (assuming the minimum is no lower than 0), and do the same for your outputs. Then do the reverse with the result (e.g. multiply by 25 if you were dividing by 25 at the beginning), as sketched below.
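A minimal sketch of that scaling, assuming 25 is the largest value across the training data (any true maximum would do):
import numpy as np
scale = 25.0  # assumed maximum over all inputs and targets
ds = SequentialDataSet(INPUTS, OUTPUTS)
for seq, target in [([0, 1, 2, 3, 4], 5), ([5, 6, 7, 8, 9], 10),
                    ([10, 11, 12, 13, 14], 15), ([16, 17, 18, 19, 20], 21)]:
    ds.addSample(np.array(seq) / scale, [target / scale])
# ...train as before, then undo the scaling on the prediction:
prediction = net.activate(np.array([0, 1, 2, 3, 4]) / scale) * scale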
Also, I'm no expert on pybrain, but I wonder if you need OUTPUTS = 4? It looks like you have only one output in your data, so I'm wondering if you could just use OUTPUTS = 1.
You may also try scaling the inputs and outputs to a particular part of the sigmoid curve (e.g. between 0.1 and 0.9) to make pybrain's job easier, but that makes the scaling before and after a little more complex.