How to create an SFrame compatible with TuriCreate for an object detection task

I am trying to create an SFrame containing images and bounding-box coordinates, in order to perform object detection using TuriCreate.
I created my own dataset with IBM Cloud Annotations and exported it in CreateML format. When I run:
usage_data = tc.SFrame.read_json("annotations.json")
I get:
[{'label': 'xyz'... | 8be1172e-44bb-4084-917f-db....
which is not the expected format. This is confirmed by running the code below:
data = tc.SFrame.read_json("annotations.json")
train_data, test_data = data.random_split(0.75)
model = tc.object_detector.create(train_data)
predictions = model.predict(test_data)
I get:
ToolkitError: No "feature" column specified and no column with expected type "image" is found. "datasets" consists of columns with types: list, str.
I would like to know:
Is it correct to export the data in CreateML format?
Can I use SFrame.read_json() to read this kind of data?

You need to create an SFrame from your images folder and then join it to your annotations SFrame, like:
imagesSFrame = turicreate.image_analysis.load_images('imagesFolder/')
combinedSFrame = imagesSFrame.join(annotationsSFrame)
Just make sure that each of your annotations has a path that exactly matches the path in your imagesSFrame. Below is my CSV format:
path,annotation
imagesFolder/image1.png,"[{'label': 'dog', 'coordinates': {'height': 118, 'width': 240, 'x': 155, 'y': 129}}]"
print(imagesSFrame) will let you inspect what the path looks like inside your imagesSFrame.
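As a minimal sketch of the whole flow, assuming the annotations live in a CSV named annotations.csv with the format above (the file names and the 'annotation' column name are assumptions from this answer, not TuriCreate requirements):
import ast
import turicreate as tc
# Load the images; this adds a 'path' column to join on.
imagesSFrame = tc.image_analysis.load_images('imagesFolder/')
# Load the annotations CSV. Depending on how it was written, the annotation
# column may come in as a string and need parsing back into a list of dicts.
annotationsSFrame = tc.SFrame.read_csv('annotations.csv')
annotationsSFrame['annotation'] = annotationsSFrame['annotation'].apply(
    lambda a: ast.literal_eval(a) if isinstance(a, str) else a)
# Join on the shared 'path' column so each image row carries its boxes.
combinedSFrame = imagesSFrame.join(annotationsSFrame, on='path')
# Train the detector, pointing it at the annotations column.
model = tc.object_detector.create(combinedSFrame, annotations='annotation')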

Related

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. the comparison grid that compare_models() displays.
Does anyone know how to do that?
The solution is:
df = pull()
by Goosang Yu from the pycaret Slack community.
compare_models() returns the best model object, while the comparison grid it displays can be retrieved with pull(), which gives you a pandas DataFrame. Hence you only need to save that DataFrame, which can for example be achieved with df.to_csv(path). If you want to save the object in a different format (pickle, xml, ...), you can refer to the pandas I/O documentation.
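A minimal end-to-end sketch (the output path comparison.csv is an assumption):
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, pull

diabetes = get_data('diabetes')
clf1 = setup(data=diabetes, target='Class variable')

best = compare_models()  # returns the best model object
df = pull()              # returns the displayed comparison grid as a DataFrame

# Save the grid; any pandas writer works here.
df.to_csv('comparison.csv', index=False)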

Usage of spark.catalog.refreshTable(tablename) in S3

I want to write a CSV file after transforming my Spark data with a function. The transformed Spark dataframe looks fine, but when I try to write it to a CSV file, I get an error:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
But I really don't understand how to use the spark.catalog.refreshTable(tablename) function. I tried to call it between the transformation and the file writing, but it raised:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
So I don't know how to deal with it...
# Assumed imports/context (not shown here):
#   import numpy as np, pandas as pd, tensorflow as tf
#   from pyspark.sql.functions import pandas_udf, col, lit
#   from pyspark.sql.types import ArrayType, DoubleType
#   preprocess_input/load_model come from Keras (MobileNetV2);
#   IMAGE_SIZE, S3dir, today and df are defined elsewhere in my script.

#Create the function to resize the images and extract the features with the MobileNetV2 model
def red_dim(width, height, nChannels, data):
    #Transform image data to a TensorFlow-compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
            shape=(height[i], width[i], nChannels[i]),
            dtype=np.uint8,
            buffer=data[i],
            strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    #Resize images to the input size expected by the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))
    #Load the model
    model = load_model('models')
    #Predict features for the images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    #Return a pandas Series with the list of features for all images
    return pd.Series(list(preds))

#Wrap the function as a pandas UDF
#This allows Spark to apply the function in multiple chunks
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

#4 actions:
# apply the udf function defined just before
# cast the array of features to a string so it can be written in a csv
# select only the data that will be written in the csv
# write the data -> where the error occurs
results = df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"),
                                               col("image.nChannels"),
                                               col("image.data"))) \
            .withColumn("dim_red_string", lit(col("dim_red").cast("string"))) \
            .select("image.origin", 'dim_red_string') \
            .repartition(5).write.csv(S3dir + '/results' + today)
It's a well-known issue: the underlying source data gets updated while Spark is still processing it.
I would suggest you checkpoint, i.e. move/copy the data to another directory, before applying your transformations.
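A minimal sketch of that idea (the paths are assumptions, and spark.read.format("image") is just the reader this pipeline implies):
# Snapshot the inputs into a staging directory that nothing else writes to,
# then build the DataFrame from the snapshot instead of the live directory.
staging_dir = S3dir + '/staging' + today  # hypothetical path

spark.read.format("image").load(S3dir + '/images') \
     .write.mode("overwrite").parquet(staging_dir)

df = spark.read.parquet(staging_dir)
# ...then apply red_dim_udf and write the CSV as in the question...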
I think I can close my question, as I found the answer.
If you get this type of error, it can also be because you have spaces in the S3 folder names used to build your DataFrame: Spark doesn't handle the space character in the folder name, so it thinks the folder doesn't exist anymore...
But thanks @Constantine for your help!

How to get the label strings of the fine-tuned network using the checkpoint files or tf-record?

For example, I fine-tuned a VGG network on my own data set with only 2 labels, foo and bar. I converted my images to tf.record using an example via this link:
labels_to_class_names = dict(zip(range(len(class_names)), class_names))
dataset_utils.write_label_file(labels_to_class_names, dataset_dir)
I'm going to build an API to predict images based on this new model. My question is: is there any formal way to get the label string from the checkpoint files or from the data set (for example, predict_image("abc.png") returns the string foo)? I have no clue which node in the logits layer represents the label foo and which one represents bar.
I've tried searching around with no luck, and I'm still a TensorFlow newbie.
The model (and incidentally the checkpoint files) does not store the names of the classes.
All it has is a certain number of output neurons, the first one corresponding to the first class, the second one to the second class, and so on.
If you want to know which one is which, look at the label file created by this line (most likely named labels.txt):
dataset_utils.write_label_file(labels_to_class_names, dataset_dir)
Alternatively you can inspect the contents of the labels_to_class_names dict:
In [1]: class_names=['aaa', 'bbb', 'ccc']
In [2]: labels_to_class_names = dict(zip(range(len(class_names)), class_names))
In [3]: labels_to_class_names
Out[3]: {0: 'aaa', 1: 'bbb', 2: 'ccc'}
So the value at index 0 in the model's output corresponds to class 'aaa', and so on.
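A minimal sketch of the lookup your API could do, assuming the labels.txt file written by write_label_file stores one "index:label" pair per line (load_label_map and predict_image are hypothetical helper names, and logits is whatever vector your inference code produces):
import numpy as np

def load_label_map(label_file='labels.txt'):
    # Each line is expected to look like "0:foo" or "1:bar".
    label_map = {}
    with open(label_file) as f:
        for line in f:
            idx, name = line.strip().split(':', 1)
            label_map[int(idx)] = name
    return label_map

def predict_image(logits, label_map):
    # The neuron with the highest activation gives the class index.
    return label_map[int(np.argmax(logits))]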

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas the same code builds and runs in a session with the map method. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug, and they asked me to post here.
# Assumed imports: os, pandas as pd, tensorflow as tf
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iterator = dataset.make_one_shot_iterator()
get_batch = iterator.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size: map adds another axis instead of concatenating on the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output, and both map and flat_map return type Dataset. So how does my code work with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with the Pandas module.
As mikkola points out in the comments, the Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to
become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)

dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iterator = dataset.make_one_shot_iterator()
get_batch = iterator.get_next()
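A usage sketch under the same TF 1.x assumptions, showing that each batch is now a batch of rows rather than a batch of per-file arrays:
with tf.Session() as sess:
    try:
        while True:
            batch = sess.run(get_batch)
            print(batch.shape)  # e.g. (2, 2): 2 rows, 2 columns each
    except tf.errors.OutOfRangeError:
        pass  # the dataset has been exhausted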

How to create dataset in the same format as the FSNS dataset?

I'm working on this project based on TensorFlow.
I just want to train an OCR model with attention_ocr on my own dataset, but I don't know how to store my images and ground truth in the same format as the FSNS dataset.
Is there anybody who also works on this project or knows how to solve this problem?
The data format for storing training/test data is defined in the FSNS paper https://arxiv.org/pdf/1702.03970.pdf (Table 4).
To store tfrecord files with tf.Example protos you can use tf.python_io.TFRecordWriter. There is a nice tutorial, an existing answer on Stack Overflow, and a short gist.
Assume you have a numpy ndarray img which has num_of_views images stored side-by-side (see Fig. 3 in the paper), and the corresponding text in a variable text. You will need to define some function to convert a unicode string into a list of character ids, padded to a fixed length and unpadded as well. For example:
char_ids_padded, char_ids_unpadded = encode_utf8_string(
    text='abc',
    charset={'a': 0, 'b': 1, 'c': 2},
    length=5,
    null_char_id=3)
the result should be:
char_ids_padded = [0,1,2,3,3]
char_ids_unpadded = [0,1,2]
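A minimal sketch of such a function, consistent with the example above (this exact implementation is an assumption, not code from the attention_ocr project):
def encode_utf8_string(text, charset, length, null_char_id):
    # Map each character to its id, then pad with null_char_id up to `length`.
    char_ids_unpadded = [charset[c] for c in text]
    char_ids_padded = char_ids_unpadded + \
        [null_char_id] * (length - len(char_ids_unpadded))
    return char_ids_padded, char_ids_unpadded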
If you use the functions _int64_feature and _bytes_feature defined in the gist, you can create an FSNS-compatible tf.Example proto using the following snippet:
char_ids_padded, char_ids_unpadded = encode_utf8_string(
    text, charset, length, null_char_id)
example = tf.train.Example(features=tf.train.Features(
    feature={
        'image/format': _bytes_feature("PNG"),
        'image/encoded': _bytes_feature(img.tostring()),
        'image/class': _int64_feature(char_ids_padded),
        'image/unpadded_class': _int64_feature(char_ids_unpadded),
        'height': _int64_feature(img.shape[0]),
        'width': _int64_feature(img.shape[1]),
        # integer division, so the feature stays an int
        'orig_width': _int64_feature(img.shape[1] // num_of_views),
        'image/text': _bytes_feature(text)
    }
))
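To actually write the proto to disk you can use the writer mentioned above; a short sketch (the output filename is an assumption):
with tf.python_io.TFRecordWriter('fsns_train-00000-of-00001') as writer:
    writer.write(example.SerializeToString())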
You should not use this line directly:
'image/encoded': _bytes_feature(img.tostring()),
In my code, I wrote this instead (note that 'image/format' should then be set to match, e.g. "JPEG"):
_, jpegVector = cv2.imencode('.jpeg', img)
imgStr = jpegVector.tostring()
'image/encoded': _bytes_feature(imgStr)