Reading EMNIST dataset - pandas

I am building a CNN using tensorflow in python, but having problem with loading the data from EMNIST dataset. Can anyone please show me a sample code of retrieving each image in a batch and pass during the training session?

There are a couple of formats of the EMNIST dataset...the one I've found easiest to understand is the CSV version on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image, there are 785 columns where the first column = class_label and each column after represents one pixel value (784 total for a 28 x 28 image).
You can check out one of my implementations of an EMNIST CNN using Keras, where your dataset loading can be similar:
import pandas as pd
raw_data = pd.read_csv("data/emnist-balanced-train.csv")
train, validate = train_test_split(raw_data, test_size=0.1) # change this split however you want
x_train = train.values[:,1:]
y_train = train.values[:,0]
x_validate = validate.values[:,1:]
y_validate = validate.values[:,0]
from https://github.com/Josh-Payne/cs230/blob/master/Alphanumeric-Augmented-CNN/augmented-cnn.py

Related

What is the sequence for preprocessing text df with tensorflow?

I have a pandas data frame, containing two columns: sentences and annotations:
Col 0
Sentence
Annotation
1
[This, is, sentence]
[l1, l2, l3]
2
[This, is, sentence, too]
[l1, l2, l3, l4]
There are several things I need to do:
split to features and labels
split into train-val-test data
vectorize train data using:
vectorize_layer = tf.keras.layers.TextVectorization(
max_tokens=maxlen,
standardize='lower',
split='whitespace',
ngrams=(1,3),
output_mode='tf-idf',
pad_to_max_tokens=True,)
I haven't worked with tensors before so I am a little confused about how to order the steps above and access the information from the tensors. Specifically, at what point do I have to split into features and labels, and how to access one or the other? Then, should I split into features and labels before splitting to train-val-test (I want to make it right and not use sklearn's train_test_split when I work with tensorflow) or it is the opposite?
You can split your dataset before creating a model. After splitting you need to tokenize your sentences using
tensorflow.keras.preprocessing.text.Tokenizer((num_words = vocab_size, oov_token=oov_tok)
After tokenizing you need to add padding to the sentence using
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating = trunc_type)
Then you can train your model with the data. For more details please refer to this working code example. Thank You.

Streamlit with Tensorflow to analyse image and return the probability if is positive or negative

I'm trying to use Tensorflow to Machine Learning to analyze an image and return the probability if is positive or negative based on a model created (extension .h5). I couldn't found a documentation exactly for that, or repository, so even a link to read will be awesome.
Link for the application: https://share.streamlit.io/felipelx/hackathon/IDC_Detector.py
Libraries that I'm trying to use.
import numpy as np
import streamlit as st
import tensorflow as tf
from keras.models import load_model
The function to load the model.
#st.cache(allow_output_mutation=True)
def loadIDCModel():
model_idc = load_model('models/IDC_model.h5', compile=False)
model_idc.summary()
return model_idc
The function to work the image, and what I'm trying to see: model.predict - I can see but is not updating the %, independent of the image the value is always the same.
if uploaded_file is not None:
# transform image to numpy array
file_bytes = tf.keras.preprocessing.image.load_img(uploaded_file, target_size=(96,96), grayscale = False, interpolation = 'nearest', color_mode = 'rgb', keep_aspect_ratio = False)
c.image(file_bytes, channels="RGB")
Genrate_pred = st.button("Generate Prediction")
if Genrate_pred:
model = loadMetModel()
input_arr = tf.keras.preprocessing.image.img_to_array(file_bytes)
input_arr = np.array([input_arr])
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
prediction = probability_model.predict(input_arr)
dict_pred = {0: 'Benigno/Normal', 1: 'Maligno'}
result = dict_pred[np.argmax(prediction)]
value = 0
if result == 'Benigno/Normal':
value = str(((prediction[0][0])*100).round(2)) + '%'
else:
value = str(((prediction[0][1])*100).round(2)) + '%'
c.metric('Predição', result, delta=value, delta_color='normal')
Thank you in advance to any help.
The first thing I'm noticing is that your function for loading the model is named loadIDCModel, but then the function you call for loading the model is loadMetModel. When I check your source code, though, it looks like you've already addressed this issue. I'd recommend updating your question to reflect this.
Playing around with your application, I think the issue is your model itself. I tried various images — images containing carcinomas, and even a picture of a cat — and each gave me a probability around 73%. The lowest score I got was 72.74%, and the highest was 73.11% (this one was the cat). It seems that the output percentage is varying slightly, hinting that rather than something being wrong in the code, your model itself is likely at fault. You might need to retrain your model, as it seems to have learned to always return a value of approximately 0.73.

How to import a CSV file, split it 70/30 and then use first column as my 'y' value?

I am having an issue at the moment, I think im making it far more complicated than it needs to be. my csv file is 31 rows by 500. I need to import this, split it in a 70/30 ratio and then be able to use the first column as my 'y' value for a neural network, and the remaining 30 columns need to be my 'x' value.
ive implemented the below code to do this, but when I run it through my basic sigmoid and testing functions, it provides results in a weird format i.e. [6.54694655e-06].
I believe this is due to my splitting/importing of the data, which I think I have done wrong. I need to import the data into arrays that are readable by my functions, and be able to separate my first column specifically to a 'y' value. how do I go about this?
df = pd.read_csv(r'data.csv', header=None)
df.to_numpy()
#splitting data 70/30
trainingdata= df[:329]
testingdata= df[:141]
#converting data to seperate arrays for training and testing
training_features= trainingdata.loc[:, trainingdata.columns != 0].values.reshape(329,30)
training_labels = trainingdata[0]
training_labels = training_labels.values.reshape(329,1)
testing_features = testingdata[0]
testing_labels = testingdata.loc[:, testingdata.columns != 0]
Usually for splitting the dataframe on test and train data I use sklearn.model_selection.train_test_split. Documentation here.
Some other methods are described here Hope this will help you!
Make you train/test split easy by using sklearn.model_selection.train_test_split.
If you don't have sklearn installed, first install it by running pip install -U scikit-learn.
Then
from sklearn.model_selection import train_test_split
df = pd.read_csv(r'data.csv', header=None)
# X is your features, y is your target column
X = df.loc[:,1:]
y = df.loc[:,0]
# Use train_test_split function with test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
df = pd.read_csv(r'data.csv')
df.to_numpy()
print(df)

Can you map a dataset in tensorflow with 'keras.utils.to_categorical'?

My dataset here is comprised of data in the following structure [...x, y] and I want to convert it to
[...x], categorical([y])
this is what I tried:
def map_sequence(sequence):
return sequence[:-1], keras.utils.to_categorical(sequence[-1])
dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
dataset = dataset.map(map_sequence)
but I am getting an error as sequence does not actually have any data when mapping is executed.
how does one use to_categorical and map() together?
Replacing keras.utils.to_categorical with tf.one_hot should do the trick.

pybrain LSTM sequence to predict sequential data

I have written a simple code using pybrain to predict a simple sequential data.
For example a sequence of 0,1,2,3,4 will supposed to get an output of 5 from the network. The dataset specifies the remaining sequence.
Below are my codes implementation
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SequentialDataSet
from pybrain.structure import SigmoidLayer, LinearLayer
from pybrain.structure import LSTMLayer
import itertools
import numpy as np
INPUTS = 5
OUTPUTS = 1
HIDDEN = 40
net = buildNetwork(INPUTS, HIDDEN, OUTPUTS, hiddenclass=LSTMLayer, outclass=LinearLayer, recurrent=True, bias=True)
ds = SequentialDataSet(INPUTS, OUTPUTS)
ds.addSample([0,1,2,3,4],[5])
ds.addSample([5,6,7,8,9],[10])
ds.addSample([10,11,12,13,14],[15])
ds.addSample([16,17,18,19,20],[21])
net.randomize()
trainer = BackpropTrainer(net, ds)
for _ in range(1000):
print trainer.train()
x=net.activate([0,1,2,3,4])
print x
The output on my screen keeps showing [0.99999999 0.99999999 0.9999999 0.99999999] every simple time. What am I missing? Is the training not sufficient? Because trainer.train()
shows output of 86.625..
The pybrain sigmoidLayer is implementing the sigmoid squashing function, which you can see here:
sigmoid squashing function code
The relevant part is this:
def sigmoid(x):
""" Logistic sigmoid function. """
return 1. / (1. + safeExp(-x))
So, no matter what the value of x, it will only ever return values between 0 and 1. For this reason, and for others, it is a good idea to scale your input and output values to between 0 and 1. For example, divide all your inputs by the maximum value (assuming the minimum is no lower than 0), and the same for your outputs. Then do the reverse with the result (e.g. multiply by 25 if you were dividing by 25 at the beginning).
Also, I'm no expert on pybrain, but I wonder if you need OUTPUTS = 4? It looks like you have only one output in your data, so I'm wondering if you could just use OUTPUTS = 1.
You may also try scaling the inputs and outputs to a particular part of the sigmoid curve (e.g. between 0.1 and 0.9) to make the pybrain's job easier, but that makes the scaling before and after a little more complex.