Tensorflow dataset api - tensorflow

I am reading a CSV file using tf.contrib.data.make_csv_dataset(csv_path); the CSV has 2 columns, review and rating. I want to perform tokenization on the review column after reading.
dataset = tf.contrib.data.make_csv_dataset(csv_file, batch_size=2)
After creating the dataset, I want to map the method below over the review column:
def create_tokens(sentence):
    return tf.string_split([sentence]).values
I am stuck here.

With this example data:
review, rating
Best film ever, 5
rather meh, 2
You should be able to use tf.data.Dataset.map() in TensorFlow 1.10:
def create_tokens(sentence):
    return tf.string_split(sentence['review'])

dataset = tf.contrib.data.make_csv_dataset('test.csv', batch_size=2)
dataset = dataset.map(create_tokens)
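A quick way to sanity-check the result in TF 1.x graph mode is to pull one batch through a one-shot iterator; a minimal sketch, assuming the dataset built above:

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    # tf.string_split returns a SparseTensor, so this prints a SparseTensorValue
    print(sess.run(next_batch))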

Related

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. the comparison dataframe that compare_models() displays.
Does anyone know how to do that?
The solution is:
df = pull()
(credit to Goosang Yu from the pycaret Slack community).
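For context, a minimal sketch of how pull() fits into the workflow from the question (the model_comparison.csv filename is just an example):

from pycaret.classification import setup, compare_models, pull

clf1 = setup(data=diabetes, target='Class variable')
best = compare_models()

# pull() returns the last scoring grid pycaret displayed, as a pandas dataframe
df = pull()
df.to_csv('model_comparison.csv', index=False)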
compare_models() returns a pandas dataframe containing the scores of the compared models, so you only need to save that dataframe, which can be achieved, for example, with best.to_csv(path). If you want to save the object in a different format (pickle, xml, ...), you can refer to the pandas I/O documentation.

creating a custom TFDS dataset

I would like to create a custom TensorFlow dataset for a summarization task. I have a set of reports with three gold summaries for each report. All the data is in .txt format.
I would like to create a TFDS where the key is the report and the value is the summary, so I will have this format:
(report1, summary11) (report1, summary12) (report1, summary13) (report2, summary21) (report2, summary22) (report2, summary23)
Is there any solution that helps me achieve this task? I checked the official documentation on the TensorFlow website and it wasn't useful to me.
Thank you!
A custom iterator over the data files can be built with a Python generator. Since tf.data can consume input from a generator, you can write the generator so that it yields the (report, summary) pairs:
def generator(pairs):
    # yield one (report, summary) pair at a time
    for report, summary in pairs:
        yield report, summary
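Wiring that generator into tf.data could then look like the following minimal sketch (assuming TF 2.4+ for output_signature; the string literals are placeholders for the contents of your .txt files):

import tensorflow as tf

# Placeholder pairs; in practice, build this list by reading your report
# and summary files from disk.
pairs = [
    ("report1 text", "summary11 text"),
    ("report1 text", "summary12 text"),
    ("report1 text", "summary13 text"),
]

dataset = tf.data.Dataset.from_generator(
    lambda: generator(pairs),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.string),  # report
        tf.TensorSpec(shape=(), dtype=tf.string),  # summary
    ),
)

for report, summary in dataset.take(2):
    print(report.numpy(), summary.numpy())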

Can you map a dataset in tensorflow with 'keras.utils.to_categorical'?

My dataset here comprises data with the structure [...x, y], and I want to convert it to
[...x], categorical([y])
This is what I tried:
def map_sequence(sequence):
    return sequence[:-1], keras.utils.to_categorical(sequence[-1])

dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
dataset = dataset.map(map_sequence)
but I am getting an error because sequence does not actually contain any data when the mapping is executed.
How does one use to_categorical and map() together?
Replacing keras.utils.to_categorical with tf.one_hot should do the trick.
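A minimal sketch of that replacement, assuming integer labels and a hypothetical NUM_CLASSES (set it to your actual number of classes):

import tensorflow as tf

NUM_CLASSES = 10  # assumption: adjust to your label space

def map_sequence(sequence):
    # tf.one_hot is a graph op, so it works on the symbolic tensors that
    # dataset.map() passes in; keras.utils.to_categorical needs concrete
    # numpy data and therefore fails inside map().
    label = tf.cast(sequence[-1], tf.int32)
    return sequence[:-1], tf.one_hot(label, NUM_CLASSES)

dataset = tf.data.Dataset.from_tensor_slices(input_sequences)  # as in the question
dataset = dataset.map(map_sequence)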

How to read multiple columns as labels using make_csv_dataset in tensorflow 2?

I'm trying to use the following code (which I found in the TensorFlow tutorials) to read the data from a CSV file:
def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,  # Artificially small to make examples easier to show.
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
It works fine when you have a single label column. However, in my CSV file I have multiple label columns (1008 feature columns and 2 label columns). I'm wondering how I can read my data using make_csv_dataset.
Thank you!
TensorFlow's make_csv_dataset() does not support labels for multi-output models as of yet (it is experimental, after all). Alternatively, you can read the data into a pandas dataframe and then use the tf.data.Dataset.from_tensor_slices() method to build your dataset. I would recommend memory-mapping the CSV while creating the pandas dataframe, as it is faster that way.
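A minimal sketch of that approach, with hypothetical file and column names:

import pandas as pd
import tensorflow as tf

LABEL_COLUMNS = ['label_1', 'label_2']  # hypothetical: your two label columns

# memory_map=True maps the file into memory for faster reads
df = pd.read_csv('data.csv', memory_map=True)

labels = df[LABEL_COLUMNS].values
features = df.drop(columns=LABEL_COLUMNS).values

dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(5)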

Reading EMNIST dataset

I am building a CNN using TensorFlow in Python, but I am having trouble loading the data from the EMNIST dataset. Can anyone please show me sample code for retrieving each image in a batch and passing it in during the training session?
There are a couple of formats of the EMNIST dataset; the one I've found easiest to understand is the CSV version on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image. There are 785 columns: the first column is the class label, and each remaining column holds one pixel value (784 total for a 28 x 28 image).
You can check out one of my implementations of an EMNIST CNN using Keras, where your dataset loading can be similar:
import pandas as pd
from sklearn.model_selection import train_test_split  # needed for the split below

raw_data = pd.read_csv("data/emnist-balanced-train.csv")
train, validate = train_test_split(raw_data, test_size=0.1)  # change this split however you want

x_train = train.values[:, 1:]
y_train = train.values[:, 0]
x_validate = validate.values[:, 1:]
y_validate = validate.values[:, 0]
from https://github.com/Josh-Payne/cs230/blob/master/Alphanumeric-Augmented-CNN/augmented-cnn.py
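To get from those arrays to batches during training, a minimal sketch (assuming 28 x 28 grayscale images and an arbitrary batch size of 32):

import tensorflow as tf

# Reshape the flat 784-pixel rows into 28x28x1 images and scale to [0, 1].
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0

train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(1024)
            .batch(32))

for images, labels in train_ds.take(1):
    print(images.shape)  # (32, 28, 28, 1)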