Usage of spark.catalog.refreshTable(tablename) in S3 - amazon-s3

I want to write a CSV file after transforming my Spark data with a function. The Spark dataframe obtained after the transformation looks fine, but when I try to write it to a CSV file, I get an error:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
But I really don't understand how to use the spark.catalog.refreshTable(tablename) function. I tried to use it between the transformation and the file write, but it said
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
So I don't know how to deal with it...
# Create the function to resize the images and extract the features with the MobileNetV2 model
def red_dim(width, height, nChannels, data):
    # Transform image data to a TensorFlow-compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
            shape=(height[i], width[i], nChannels[i]),
            dtype=np.uint8,
            buffer=data[i],
            strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    # Resize images to the input size expected by the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))
    # Load the model
    model = load_model('models')
    # Predict features for the images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    # Return a pandas Series with the list of features for all images
    return pd.Series(list(preds))

# Turn the function into a pandas UDF
# This allows the function to be applied to the data in chunks
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

# 4 actions:
#   apply the UDF defined just before
#   cast the array of features to a string so it can be written to a CSV
#   select only the data that will be written to the CSV
#   write the data -> where the error occurs
results = df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"),
                                               col("image.nChannels"),
                                               col("image.data"))) \
            .withColumn("dim_red_string", lit(col("dim_red").cast("string"))) \
            .select("image.origin", 'dim_red_string') \
            .repartition(5).write.csv(S3dir + '/results' + today)

It's a well-known issue: the underlying source data is being updated while Spark is still processing it.
I would suggest checkpointing, i.e. moving/copying the data to another directory before applying your transformations, for example as sketched below.
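A minimal sketch of that idea, applied to the df from the question (the checkpoint and staging prefixes under S3dir are illustrative names, not something from the original post):

# Option 1: materialize df and cut its lineage with an eager checkpoint.
spark.sparkContext.setCheckpointDir(S3dir + '/checkpoints')
df = df.checkpoint(eager=True)

# Option 2: copy the source data to a separate prefix and rebuild df from the copy,
# so the later write is decoupled from the original files.
df.write.mode("overwrite").parquet(S3dir + '/staging')
df = spark.read.parquet(S3dir + '/staging')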

I think I can close my question, as I found the answer
If you get this type of error, it can also be because there are spaces in the S3 folder names used to build your DataFrame: Spark doesn't handle the space character in the folder name and so thinks the folder no longer exists...
But thanks #Constantine for your help !

Related

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. the comparison data frame that compare_models() displays.
Does anyone know how to do that?
The solution is :
df = pull()
by Goosang Yu from the pycaret slack community.
compare_models() itself returns the best trained model, while pull() returns the comparison grid as a pandas DataFrame. Hence, you only need to save that DataFrame, which can for example be achieved with df.to_csv(path). If you want to save the object in a different format (pickle, xml, ...), you can refer to the pandas I/O documentation.
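Putting the pieces together, a small sketch based on the code in the question (the output filename is just an example):

from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, pull

diabetes = get_data('diabetes')
clf1 = setup(data=diabetes, target='Class variable')

best = compare_models()          # best trained model
comparison_df = pull()           # the comparison grid as a pandas DataFrame
comparison_df.to_csv('model_comparison.csv', index=False)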

Writing preprocessed output CSV to S3 from Scikit Learn image on Sagemaker

My problem: writing out a CSV file to S3 from inside a Sagemaker SKLearn image. I know how to write CSVs to S3 from a notebook - that is working fine. It's within the docker image that I'm unable to get it to work.
This is a preprocessing.py script called as an entry_point parameter to the SKLearn estimator. The purpose is to pre-process the data prior to running an inference. It's the first step in my inference pipeline.
Everything is working as expected in my preprocessing script, except outputting the file at the end.
Attempt #1 - this produces a CSV file that has strange binary-looking data at the beginning and end of the file (before the first cell and after the last cell of the CSV). It's almost a valid CSV but not quite. See the image at the end.
def write_joblib(file, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        joblib.dump(file, f)
        f.seek(0)
        boto3.client("s3").upload_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)

predictors_csv = predictors_df.to_csv(index=False)
write_joblib(predictors_csv, predictors_s3_uri)
Attempt #2 - I used StringIO rather than BytesIO. However, this produced a zero-byte file on S3.
Attempt #3 - I tried boto3.client('s3').put_object(...) but got ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I believe I am almost there with Attempt #1 above. I assume it's an encoding issue. If I can fix the structure of the CSV file to remove the non-text characters at the start, it will be working. A screenshot of the CSV in Notepad++ is below.
Notice the non-character text at the start of the CSV file below
I solved this myself. This works within an SKLearn estimator container. I assume it will work inside any built-in fit/transform container for writing a CSV to S3 directly.
The use case is writing out the results of pre-processing for model featurization, vectorization, dimensionality reduction etc. This would occur prior to model inference as the first step in an Inference Pipeline.
def write_text_file_to_s3(file_string, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    s3 = boto3.resource('s3')
    s3.Object(s3_bucket, s3_key).put(Body=file_string)

predictors_csv = predictors_df.to_csv(index=False, encoding='utf-8-sig')
write_text_file_to_s3(predictors_csv, predictors_s3_uri)
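For context, my reading of Attempts #1 and #2 (a guess, not confirmed in the thread): joblib.dump() pickles the string, which adds the binary framing bytes visible in Notepad++, and upload_fileobj() expects a binary file object, so a StringIO buffer doesn't upload cleanly. A sketch that keeps the upload_fileobj approach but encodes the CSV string first:

from io import BytesIO
import boto3

def write_csv_string_to_s3(csv_string, path):
    # path is assumed to be a full S3 URI such as 's3://bucket/prefix/file.csv'
    s3_bucket, s3_key = path.split('/')[2], '/'.join(path.split('/')[3:])
    with BytesIO(csv_string.encode('utf-8')) as f:
        boto3.client('s3').upload_fileobj(Fileobj=f, Bucket=s3_bucket, Key=s3_key)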

How to convert a HyperSpectral image or an image with many bands in TFRecord format?

I've been trying to use a hyperspectral image dataset that was in .mat files. I found that, using the scipy library with its loadmat function, I can load the hyperspectral images and select some bands to view them as an RGB image.
def RGBread(image):
    images = loadmat(image).get('new_image')
    return abs(images[:,:,(12,6,4)])

def SIread(image):
    images = loadmat(image).get('new_image')
    return abs(images[:,:,:])
After trying to implement the pix2pix architecture I ran into an unexpected error. When passing the list of dataset file names to a function that is responsible for loading the data (which are still .mat files), TensorFlow does not have a direct method for reading or decoding them, so I load the data with my RGBread and SIread methods and then turn them into tensors.
def load_image(filename, augment=True):
    inimg = tf.cast(tf.convert_to_tensor(RGBread(ImagePATH+'/'+filename),
                                         dtype=tf.float32), tf.float32)[...,:3]
    tgimg = tf.cast(tf.convert_to_tensor(SIread(ImagePATH+'/'+filename),
                                         dtype=tf.float32), tf.float32)[...,:12]
    inimg, tgimg = resize(inimg, tgimg, IMG_HEIGH, IMG_WIDTH)
    if augment:
        inimg, tgimg = random_jitter(inimg, tgimg)
    return inimg, tgimg
When loading an image with the load_image method, using the name and path of a single .mat file (a hyperspectral image) from my dataset as the argument, the method worked perfectly.
plt.imshow(load_train_image(tr_urls[1])[0])
The problem started when I created my dataset tensor, because my RGBread function does not receive a tensor as a parameter, since loadmat('.mat') expects a string. I get the following error:
train_dataset = tf.data.Dataset.from_tensor_slices(tr_urls)
train_dataset = train_dataset.map(load_train_image,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
TypeError: expected str, bytes or os.PathLike object, not Tensor
After reading a lot about reading .mat files I found a user who recommended converting the data to the TFRecord format. I've been trying to do it but I couldn't. Could someone help me?
Rasterio may be useful here.
https://rasterio.readthedocs.io/en/latest/
It can read hyperspectral .tif which can be passed to tf.data using a tf.keras data-generator. It may be a bit slow and perhaps should be done before training rather than at runtime.
An alternative is to ask whether you need the geotiff metadata. If not, you can preprocess the data and save it as NumPy arrays or TFRecords, for example as sketched below.
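A rough sketch of that preprocessing route, assuming the cubes are stored under the 'new_image' key as in the question's RGBread/SIread functions (the feature names and the float32 dtype are my own choices, not from the original post):

import numpy as np
import tensorflow as tf
from scipy.io import loadmat

def mat_to_tfrecord(mat_path, tfrecord_path, key='new_image'):
    # Load the hyperspectral cube (height, width, bands) from the .mat file.
    cube = np.abs(loadmat(mat_path)[key]).astype(np.float32)
    feature = {
        'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[cube.shape[0]])),
        'width':  tf.train.Feature(int64_list=tf.train.Int64List(value=[cube.shape[1]])),
        'bands':  tf.train.Feature(int64_list=tf.train.Int64List(value=[cube.shape[2]])),
        'data':   tf.train.Feature(bytes_list=tf.train.BytesList(value=[cube.tobytes()])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        writer.write(example.SerializeToString())

# Reading it back: parse the example and restore the original cube shape.
def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, {
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width':  tf.io.FixedLenFeature([], tf.int64),
        'bands':  tf.io.FixedLenFeature([], tf.int64),
        'data':   tf.io.FixedLenFeature([], tf.string),
    })
    cube = tf.io.decode_raw(parsed['data'], tf.float32)
    return tf.reshape(cube, [parsed['height'], parsed['width'], parsed['bands']])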

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not an error and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas is reading N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size; map is adding another axis instead of concatenating on the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output, and both map and flat_map return type Dataset. So how is my code working with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with Pandas.
As mikkola points out in the comments, Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)

dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
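A quick sanity check of the pipeline above (a sketch assuming TF 1.x graph mode, which is what the make_one_shot_iterator() call implies):

with tf.Session() as sess:
    batch = sess.run(get_batch)
    # Each element is now one CSV row with 2 features, so a batch of 2 rows
    # has shape (2, 2) instead of the extra (B, N, 2) axis produced by map().
    print(batch.shape)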

Feeding .npy (numpy files) into tensorflow data pipeline

Tensorflow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as features and a scalar as their label.
It is actually possible to read NPY files directly with TensorFlow instead of using TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose that a float32 NPY file containing an array with shape (N, K) is given, and you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we're considering numbers now). In short, you can find the size of this header with a function like this:
def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        return f.tell()
With this you can create a dataset like this:
import tensorflow as tf
npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)
Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:
dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))
The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))
Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
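For example, just one possible ordering under the same assumptions as above (the batch size is illustrative): batch the raw records first and decode the whole batch in one go.

batch_size = 32
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size,
                                           header_bytes=header_offset)
dataset = dataset.batch(batch_size)
# decode_raw on a batch of equal-length strings yields (batch, num_features);
# the reshape just pins down the static feature dimension.
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (-1, num_features)))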
The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, though that is a bit of a pain and hardly doable from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it is, the solution requires you to either use only one file per dataset or files that have the same header size, although if you know that all the arrays have the same size that should actually be the case.
Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
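A tiny sketch of that header-less variant (the file name and num_features value are illustrative; NumPy's tofile() writes just the raw bytes, with no header):

import numpy as np
import tensorflow as tf

num_features = 16                                   # assumed known in advance
arr = np.random.rand(1000, num_features).astype(np.float32)
arr.tofile('my_file.bin')                           # raw float32 bytes, no NPY header

record_bytes = num_features * np.dtype(np.float32).itemsize
dataset = tf.data.FixedLengthRecordDataset(['my_file.bin'], record_bytes)
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, tf.float32), (num_features,)))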
You can do it with tf.py_func, see the example here.
The parse function would simply decode the filename from bytes to string and call np.load.
Update: something like this:
def read_npy_file(item):
    data = np.load(item.decode())
    return data.astype(np.float32)

file_list = ['/foo/bar.npy', '/foo/baz.npy']

dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
    lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:
Consuming NumPy arrays
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
In the case that the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into a TFRecord format, and then use the TFRecord data set format, which can be streamed without fully loading into memory.
Here is a post with some instructions.
FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain TensorFlow.
If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with, and preferably you should reformat the data first, either as TFRecord or as multiple npy files, and then use other methods.
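A minimal sketch of that generator idea, assuming each small .npy pair holds exactly one batch of features and labels (the class name and file layout are illustrative):

import numpy as np
from tensorflow.keras.utils import Sequence

class NpyBatchSequence(Sequence):
    """Yields one pre-saved batch per call, so only that batch is in memory."""
    def __init__(self, feature_files, label_files):
        self.feature_files = feature_files
        self.label_files = label_files

    def __len__(self):
        return len(self.feature_files)

    def __getitem__(self, idx):
        x = np.load(self.feature_files[idx])
        y = np.load(self.label_files[idx])
        return x, y

# model.fit(NpyBatchSequence(feature_files, label_files), epochs=10)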
Problem setup
I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.
Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.
Workaround
I came across TensorFlow's official tutorial on show attend and tell, which had a great workaround for the problem this thread (and I) was having.
Load numpy files
First off we need to create a mapping function that accepts the .npy file name and returns the numpy array.
# Load the numpy files
def map_func(feature_path):
    feature = np.load(feature_path)
    return feature
Use the tf.numpy_function
With tf.numpy_function we can wrap any Python function and use it as a TensorFlow op. The function must accept a numpy object (which is exactly what we want).
We create a tf.data.Dataset with the list of all the .npy filenames.
dataset = tf.data.Dataset.from_tensor_slices(feature_paths)
We then use the map function of the tf.data.Dataset API to do the rest of our task.
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item: tf.numpy_function(
                          map_func, [item], tf.float16),
                      num_parallel_calls=tf.data.AUTOTUNE)
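One follow-up worth noting (my addition, not from the original answer): tf.numpy_function returns tensors with an unknown static shape, so if later ops need it you can set it explicitly. The (64, 2048) shape below is an assumption based on the InceptionV3 features used in that tutorial; adjust it to your own features.

FEATURE_SHAPE = (64, 2048)   # assumed shape of each saved .npy feature map

def ensure_shape(feature):
    # Attach the known static shape to the tensor coming out of tf.numpy_function.
    feature.set_shape(FEATURE_SHAPE)
    return feature

dataset = dataset.map(ensure_shape)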