Dataframes, csv, and CNTK - pandas

I have been playing around with CNTK and am finding that models can only be trained using numpy arrays. Is this correct?
This makes sense for image recognition etc.
How would I turn my tidy dataset (read in as a dataframe using pandas) into a format that I can train a logistic regression with? I have tried to read it into a numpy array with
np.genfromtxt("My.csv", delimiter=',', dtype=float)
and I have also tried to wrap the variable with
np.array.MyVeriable.astype('float32')
But I do not get a result that I can feed to a model.
I also cannot find anything in the tutorial about how to do ML on tabular dataframes in CNTK.
Is it not supported?

CNTK 104 shows how to use pandas dataframes and numpy.
https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_104_Finance_Timeseries_Basic_with_Pandas_Numpy.ipynb
CNTK 106B shows how you could read data using csv files.
https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb
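In general, once the csv is in a pandas DataFrame, the conversion CNTK needs is just a cast of the relevant columns to float32 NumPy arrays. A minimal sketch (the file and column names here are placeholders, not from the tutorials):

import numpy as np
import pandas as pd

df = pd.read_csv("My.csv")
# CNTK trainers consume float32 arrays, so cast explicitly.
features = df[["predictor1", "predictor2", "predictor3", "predictor4"]].values.astype(np.float32)
labels = df[["dependent_variable"]].values.astype(np.float32)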

Thanks for these links. This is how I ended up reading in the csv; it seemed to work, but Sayan please correct as needed:
def generate_data_from_csv():
    # Try to find the data file locally. If it doesn't exist, report
    # "file does not exist"; if it does, report "using local file".
    data_path = os.path.join("MyPath")
    csv_file = os.path.join(data_path, "My.csv")
    if not os.path.exists(data_path):
        os.makedirs(data_path)
    if not os.path.exists(csv_file):
        print("file does not exist")
    else:
        print("using local file")
    df = pd.read_csv(csv_file,
                     usecols=["predictor1", "predictor2", "predictor3",
                              "predictor4", "dependent_variable"],
                     dtype=np.float32)
    return df
Then I saved that dataframe as training_data
training_data = generate_data_from_csv()
I then turned that dataframe into numpy arrays as follows:
training_features = np.asarray(training_data[["predictor1", "predictor2",
                                              "predictor3", "predictor4"]],
                               dtype="float32")
training_labels = np.asarray(training_data[["dependent_variable"]],
                             dtype="float32")
Then to train the model I used this code:
features, labels = training_features[:,[0,1,2,3]], training_labels
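For completeness, a minimal sketch of feeding these arrays into a CNTK logistic regression, assuming CNTK 2.x; the dimensions (4 predictors, 2 classes) follow the column lists above, and the labels would need to be one-hot encoded first:

import cntk as C

# 4 input features, 2 output classes (assumed; adjust to your data).
feature = C.input_variable(4)
label = C.input_variable(2)
z = C.layers.Dense(2)(feature)  # linear layer; softmax is folded into the loss
loss = C.cross_entropy_with_softmax(z, label)
error = C.classification_error(z, label)
learner = C.sgd(z.parameters, C.learning_parameter_schedule(0.1))
trainer = C.Trainer(z, (loss, error), [learner])
# features: float32 array [N, 4]; one_hot_labels: float32 array [N, 2] (hypothetical name)
trainer.train_minibatch({feature: features, label: one_hot_labels})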

Passing a dict of tensors to a Keras model

I am trying to preprocess the infamous Titanic data (from Kaggle) by following this tutorial.
Everything was okay until I got to run the titanic_preprocessing model on the data (titanic_features), and I got this error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
In the tutorial it is mentioned that one should transform the data into a dict of tensors, but:
I don't see how the code (see the HERE1 tag in my code below) makes a dict of tensors (there is no tf.convert_to_tensor, for example)
I don't understand why one should retransform all the data, as the previous code was supposed to do just that (when one creates preprocessed_inputs etc.)
Here is my code, but you can also execute it on Google Colab here.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

url = "https://raw.githubusercontent.com/aymeric75/IA/master/train.csv"
titanic = pd.read_csv(url)
titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('Survived')

inputs = {}
for name, column in titanic_features.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

numeric_inputs = {name: input for name, input in inputs.items()
                  if input.dtype == tf.float32}

x = layers.Concatenate()(list(numeric_inputs.values()))
norm = preprocessing.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)

preprocessed_inputs = [all_numeric_inputs]

for name, input in inputs.items():
    if input.dtype == tf.float32:
        continue
    lookup = preprocessing.StringLookup(vocabulary=np.unique(titanic_features[name].dropna()))
    one_hot = preprocessing.CategoryEncoding(max_tokens=lookup.vocab_size())
    x = lookup(input)
    x = one_hot(x)
    preprocessed_inputs.append(x)

preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)
titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)

titanic_features_dict = {}

# This model just contains the input preprocessing. You can run it to see
# what it does to your data. Keras models don't automatically convert Pandas
# DataFrames because it's not clear if it should be converted to one tensor
# or to a dictionary of tensors. So convert it to a dictionary of tensors:
# HERE1
titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}

features_dict = {name: values[:1] for name, values in titanic_features_dict.items()}
titanic_preprocessing(features_dict)
Thanks a lot for your support!
Aymeric
[UPDATE] If you can answer question 2 ("I don't understand why one should retransform all the data, as the previous code was supposed to do just that (when one creates preprocessed_inputs etc.)"), then I will validate your answer, because I think I do need to reformat the input (but I don't see what the point of all the code before it is...)
In your case, the problem is caused by the fact that your feature "Cabin" contains some nan (Not a Number) values. Tensorflow is fine with nan in floating point and integer data types, but not for strings.
You can replace all those nan values with an empty string in your pandas dataframe:
titanic_features["Cabin"] = titanic_features["Cabin"].fillna("")
The previous code simply declares a preprocessing function as a keras model. You don't actually preprocess any data until you call the titanic_preprocessing model.
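Putting the two points together, a minimal sketch reusing the question's own variables: fill the nan values, build the dict of arrays once, and only then run the preprocessing model on it.

titanic_features["Cabin"] = titanic_features["Cabin"].fillna("")
titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}
# Preprocessing actually executes here, not when the model was declared.
preprocessed = titanic_preprocessing(titanic_features_dict)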

Classify intent of random utterance of chat bot from training data and give different graphical visualization using random forest?

I am creating an NLP model to detect the intent of a provided utterance, training from an Excel file that has 2 columns as shown below:
Utterence                           Intent
hi can I have an Apple Watch       service
how much I will be paying monthly  service
you still around                   YOU_THERE
are you still there                YOU_THERE
you there                          YOU_THERE
Speak to me if you are there.      YOU_THERE
you around                         YOU_THERE
There are around 3000 utterances in the training file and many intents.
I trained my model using the scikit-learn module, and my code looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import re

def preprocessing(userQuery):
    letters_only = re.sub("[^a-zA-Z\\d]", " ", userQuery)
    words = letters_only.lower().split()
    return " ".join(words)

# read utterance data from a xlsx file
train = pd.read_excel('training.xlsx')
query_features = train['Utterence']

# create tfidf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
new_query = [preprocessing(query) for query in query_features]
features = tfidf_vectorizer.fit_transform(new_query).toarray()

# create random forest classification model
model = RandomForestClassifier()
model.fit(features, train['Intent'])

# intent prediction on user query
userQuery = "I want apple watch"
userQueryList = []
userQueryList.append(preprocessing(userQuery))
utfidf = tfidf_vectorizer.transform(userQueryList)
print(" prediction: ", model.predict(utfidf))
One problem for me here, for example: when I run the utterance "I want apple watch", it gives the predicted intent YOU_THERE instead of service, as shown below (compare the training snapshot above):
C:\U\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
prediction: ['YOU_THERE']
Please help me: how should I train my model, what changes should I make to fix such issues, and how can I check accuracy? Also, I want to see a graphical visualization and a ROC curve, and how they can be achieved using random forest. I am not very versed in NLP; any help would be appreciated.
You are using a bag-of-words approach, which does not perform well on sequence data.
For your problem, word order is material to the classification.
I would suggest that you use an LSTM, which performs better on sequence data; a minimal sketch follows.
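A minimal sketch of such an LSTM classifier in Keras (the vocabulary size, sequence length and layer sizes below are illustrative assumptions, not tuned values):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["hi can I have an Apple Watch", "you still around"]  # utterances
labels = np.array([0, 1])                                     # label-encoded intents

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=20)

model = Sequential([
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64),
    Dense(2, activation='softmax'),  # one unit per intent class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x, labels, epochs=10)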
Let's address your first issue:
how should i train my model and what changes should I make to fix such issues
Below I'm using a word2vec approach which, rather than just converting the utterances to vectors with the TFIDF approach (losing the semantic information contained in the sentence), maintains the semantic info.
To understand more about word2vec, refer to this blog:
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Below is the code for predicting the intent using the word2vec approach. (Note: it's the same as your code, just using word2vec instead of TfidfVectorizer to obtain the vectors. The code is also divided into different functions, whose logic should be evident from their names.)
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def preprocess_lower(token):
    # utility for preprocessing
    return token.lower()

def load_data(file_name):
    # load a csv file into memory
    return pd.read_csv(file_name)

def process_training_data(training_data):
    # process the training data and split it between independent and dependent variables
    training_sentences = [list(map(preprocess_lower, sentence.split(" ")))
                          for sentence in list(training_data.Utterence.values)]
    target_class = training_data.Intent.values
    label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(target_class))
    return target_class, training_sentences, label_encoded_Y

def process_user_query(training_data):
    # tokenize and lowercase the user query sentences
    training_sentences = [list(map(preprocess_lower, sentence.split(" ")))
                          for sentence in training_data]
    return training_sentences

def train_word2vec_model(train_sentences_list):
    # training word2vec on the sentences list (inputted by user)
    model = Word2Vec(train_sentences_list, size=100, window=4, min_count=1, workers=4)
    return model

def convert_training_data_vectors(model, train_sentences_list):
    # get each sentence's average vector
    training_sentences_vector = list()
    for sentence in train_sentences_list:
        sentence_vector = [list(model.wv[token]) for token in sentence if token in model.wv.vocab]
        training_sentences_vector.append(list(np.mean(sentence_vector, axis=0)))
    return training_sentences_vector

def training_rf_prediction_model(training_data_vectors, label_encoded_Y):
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    # training model on user inputted data
    rf_model = RandomForestClassifier()
    # here use the split function and divide the data into training and testing
    x_train, x_test, y_train, y_test = train_test_split(training_data_vectors, label_encoded_Y,
                                                        train_size=0.8, test_size=0.2)
    rf_model.fit(x_train, y_train)
    y_pred = rf_model.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    return rf_model

def training_svm_prediction_model(training_data_vectors, label_encoded_Y):
    svm_model = SVC(gamma='auto')
    svm_model.fit(training_data_vectors, label_encoded_Y)
    return svm_model

def process_data_flow(file_name):
    training_data = load_data(file_name)
    target_class, training_sentences, label_encoded_Y = process_training_data(training_data)
    word2vec_model = train_word2vec_model(train_sentences_list=training_sentences)
    training_data_vectors = convert_training_data_vectors(word2vec_model, train_sentences_list=training_sentences)
    prediction_model = training_rf_prediction_model(training_data_vectors, label_encoded_Y)
    # intent prediction on user query
    userQuery = ["i want apple watch"]
    user_query_vectors = convert_training_data_vectors(word2vec_model, process_user_query(userQuery))
    predicted_class = prediction_model.predict(user_query_vectors)[0]
    predicted_intent = target_class[list(label_encoded_Y).index(predicted_class)]
    return predicted_intent

print("Predicted class: ", process_data_flow("sample_intent_data.csv"))
The sample data file is in csv format; you just need to format and paste the data as below:
#sample_input_data.csv
Utterence,Intent
hi can I have an Apple Watch,service
how much I will be paying monthly,service
you still around,YOU_THERE
are you still there,YOU_THERE
you there,YOU_THERE
Speak to me if you are there,YOU_THERE
you around,YOU_THERE
Also note, your training data should contain a good amount of training utterances for each intent for this approach to work.
For accuracy, you can use the below approach:
Divide the data into training and testing sets (choosing the split ratio):
x_train, x_test, y_train, y_test = train_test_split(training_vectors, label_encoded_Y,
                                                    train_size=0.8, test_size=0.2)
And after training the model, use the predict function on x_test to get the predictions. Then match the model's predictions for the test data against the actual labels from the dataset, and you will easily be able to determine the accuracy.
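For example (reusing the split above and scikit-learn's accuracy_score):

from sklearn.metrics import accuracy_score

y_pred = rf_model.predict(x_test)
print("accuracy:", accuracy_score(y_test, y_pred))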
Edit: Added the accuracy score calculation while predicting.

Data pipeline in tf.keras with tfrecords or numpy

I want to train a model in tf.keras of TensorFlow 2.0 with data that is bigger than my RAM, but the tutorials only show examples with predefined datasets.
I followed this tutorial:
Load Images with tf.data, but I could not make this work for data in numpy arrays or tfrecords.
This is an example with an array being transformed into a tensorflow dataset. What I want is to make this work for multiple numpy array files or multiple tfrecords files.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# Shuffle and slice the dataset.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
# Since the dataset already takes care of batching,
# we don't pass a `batch_size` argument.
model.fit(train_dataset, epochs=3)
If you have tfrecords files:
path = ['file1.tfrecords', 'file2.tfrecords', ..., 'fileN.tfrecords']
dataset = tf.data.Dataset.list_files(path, shuffle=True).repeat()
dataset = dataset.interleave(lambda filename: tf.data.TFRecordDataset(filename), cycle_length=len(path))
dataset = dataset.map(parse_function).batch(batch_size)
parse_function handles decoding and any kind of augmentation.
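For instance, here is a sketch of a parse_function for such records; the feature keys 'image' and 'label' are assumptions about how the tfrecords were written:

def parse_function(serialized_example):
    # Assumed schema: each record holds a JPEG-encoded image and an int64 label.
    parsed = tf.io.parse_single_example(
        serialized_example,
        {'image': tf.io.FixedLenFeature([], tf.string),
         'label': tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(parsed['image'], channels=3)
    # do any augmentations here
    return image, parsed['label']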
In the case of numpy arrays, you can construct the dataset either from a list of filenames or from a list of arrays. Labels are just a list, or they could be taken from the file while parsing a single example.
path = ...  # list of numpy arrays
or
path = os.listdir(path_to_files)
dataset = tf.data.Dataset.from_tensor_slices((path, labels))
dataset = dataset.map(parse_function).batch(batch_size)
parse_function handles decoding:
def parse_function(filename, label):  # both filename and label will be passed if you provided both to from_tensor_slices
    f = tf.io.read_file(filename)
    image = tf.image.decode_image(f)
    image = tf.reshape(image, [H, W, C])
    label = label  # or it could be extracted from, for example, the filename, or from the file itself
    # do any augmentations here
    return image, label
To use .npy files, the best way is to avoid read_file and decode_raw altogether: load the arrays with np.load first and build the dataset from the arrays themselves,
arrays = [np.load(i) for i in ["x1.npy", "x2.npy"]]
dataset = tf.data.Dataset.from_tensor_slices((arrays, labels))
or try using decode_raw on the raw file bytes (note that decode_raw also decodes the .npy header bytes, so the resulting tensor needs slicing and reshaping afterwards):
f = tf.io.read_file(filename)
image = tf.io.decode_raw(f, tf.float32)
Then just pass the batched dataset to model.fit(dataset). TensorFlow 2.0 allows simple iteration over a dataset, so there is no need to use an iterator; even in later versions of the 1.x API you could just pass the dataset to the .fit method. Iteration looks like this:
for example in dataset:
    func(example)

Dask DataFrame - Prediction of Keras Model

I am working with dask for the first time and trying to run predict() from a trained keras model.
If I don't use dask, the function works fine (i.e. pd.DataFrame() versus dd.DataFrame()). With dask the error is below. Is this not a common use case (aside from scoring a groupby, perhaps)?
def calc_HR_ind_dsk(grp):
    model = keras.models.load_model('/home/embedding_model.h5')
    topk = 10
    x = [grp['user'].values, grp['item'].values]
    pred_act = list(zip(model.predict(x)[:, 0], grp['respond'].values))
    top = sorted(pred_act, key=lambda x: -x[0])[0:topk]
    hit = sum([x[1] for x in top])
    return hit

import dask.dataframe as dd

# step 1 - read in data as a dask df. We could reference more than 1 file using the '*' wildcard
df = dd.read_csv('/home/test_coded_final.csv', dtype='int64')
results = df.groupby('user').apply(calc_HR_ind_dsk).compute()
TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("Placeholder_30:0", shape=(55188, 32), dtype=float32) is not an element of this graph.
I found the answer. It is an issue with keras or tensorflow: https://github.com/keras-team/keras/issues/2397
The code below worked, and using dask shaved 50% off the time versus a standard pandas groupby.
# dask
model = keras.models.load_model('/home/embedding_model.h5')

# this part
import tensorflow as tf
global graph
graph = tf.get_default_graph()

def calc_HR_ind_dsk(grp):
    topk = 10
    x = [grp['user'].values, grp['item'].values]
    with graph.as_default():  # and this part, from https://github.com/keras-team/keras/issues/2397
        pred_act = list(zip(model.predict(x)[:, 0], grp['respond'].values))
    top = sorted(pred_act, key=lambda x: -x[0])[0:topk]
    hit = sum([x[1] for x in top])
    return hit

import dask.dataframe as dd
df = dd.read_csv('/home/test_coded_final.csv', dtype='int64')
results = df.groupby('user').apply(calc_HR_ind_dsk).compute()
Have a look at:
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.groupby.DataFrameGroupBy.apply
Unlike pandas, in dask many of the functions that let you define your own custom op need the meta parameter. Without it, dask will probe your custom function with placeholder data, which can pass weird things to keras that would not occur during the actual compute() call.
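A sketch of supplying meta for the groupby-apply above; the series name 'hit' and the dtype 'f8' are assumptions about what calc_HR_ind_dsk returns (a float per group):

# Declare the output shape up front so dask does not probe the function.
results = df.groupby('user').apply(calc_HR_ind_dsk, meta=('hit', 'f8')).compute()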
A different answer I wrote might help here (use-case was using a Dask with a pre-trained ML model to predict on 1,000,000 examples): https://stackoverflow.com/a/59015702/4900327

Understanding data augmentation in the object detection API

I am using the object detection API to train with a different dataset, and I would like to know if it is possible to get sample images of what is reaching the network during training.
I ask this because I am trying to find a good combination of data augmentation options (here the options), but the results after adding them have been worse. Seeing what reaches the network during training would be very helpful.
Another question is whether it is possible to get the API to help with balancing the classes, in case the dataset passed to it has them unbalanced.
Thank you!
Yes, it is possible. Briefly, you need to get an instance of tf.data.Dataset. Then you can iterate over it and get the network input data as NumPy arrays. Saving them to image files using PIL or OpenCV is then trivial.
Assuming you use TF2 the pseudo-code is like this:
ds = ...  # get dataset object somehow
sample_num = 0
for features, _ in ds:
    # a [batch_size, H, W, C] float32 tensor with preprocessed images
    images = features[fields.InputDataFields.image]
    batch_size = images.shape[0]
    for i in range(batch_size):
        image = np.array(images[i] * 255).astype(np.uint8)  # assuming input data is only scaled to [0..1]
        cv2.imwrite(output_path, image)  # output_path is up to you, e.g. built from sample_num
        sample_num += 1
    if sample_num >= MAX_SAMPLES:
        break
The trick here is to get the Dataset instance. The Google object detection API is very sophisticated, but I guess you should start by calling the train_input function here: https://github.com/tensorflow/models/blob/3c8b6f1e17e230b68519fd8d58c4dd9e9570d789/research/object_detection/inputs.py#L763
It requires the pipeline config sub-parts describing the training, train_input and the model.
You can find some code snippets on how to work with the pipeline config here: Dynamically Editing Pipeline Config for Tensorflow Object Detection
import argparse
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

def parse_arguments():
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('pipeline')
    parser.add_argument('output')
    return parser.parse_args()

def main():
    args = parse_arguments()
    pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
    with tf.gfile.GFile(args.pipeline, "r") as f:
        proto_str = f.read()
        text_format.Merge(proto_str, pipeline_config)