Apply function to Xarray dataset - numpy

I am trying to apply a function to an Xarray dataset, using dataset.where mask to decide where to apply it. I am not sure how to do it.
Some context: the dataset has two variables (A and B), which are two overlapping raster images (same size and coordinates). I want to run a function (defined by me) that runs a calculation on each gridcell of the raster image, using the two variables. This is not a ufunc; it is just a function that takes the values of the two overlapping gridcells, does some calculations, and returns a third value that I want to save to a third raster image (C).
How do I do that?
I have gotten this far, but it does not work because ds.A passes the entire array to the function, not just the value of ds.A in that particular gridcell:
import xarray as xr
def my_func(x, y):
    ...  # do things
    return result
ds = xr.Dataset({
    "A": xr.open_rasterio("A.tif"),
    "B": xr.open_rasterio("B.tif"),
    "C": xr.open_rasterio("C.tif"),
})
ds.C.where(ds.A > 0, my_func(ds.A, ds.B))
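A possible way to look at this: where evaluates both of its arguments on the whole array before selecting, so the function has to be able to work on arrays. The sketch below uses plain NumPy arrays as stand-ins for ds.A, ds.B and ds.C, and the body of my_func is a hypothetical placeholder. If the function can only handle scalars, np.vectorize wraps it in a per-cell loop; for xarray objects, xr.apply_ufunc with vectorize=True plays a similar role. Note also that ds.C.where(cond, other) keeps C where cond is True and takes other elsewhere, so the condition may need to be inverted.

```python
import numpy as np

def my_func(x, y):
    # Hypothetical scalar-only calculation standing in for the real one.
    if x > y:
        return x - y
    return x + y

# Stand-ins for the rasters ds.A, ds.B and ds.C (same shape).
A = np.array([[1.0, -2.0], [3.0, 0.0]])
B = np.array([[0.5, 4.0], [5.0, 1.0]])
C = np.zeros_like(A)

# np.vectorize calls my_func once per pair of grid cells (a Python loop,
# so it is convenient rather than fast).
per_cell = np.vectorize(my_func)(A, B)

# Take the computed value where A > 0 and keep the old C elsewhere.
C_new = np.where(A > 0, per_cell, C)
```

If the calculation can be written with array operations directly, dropping np.vectorize and calling the function on the whole arrays will usually be much faster.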

Related

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset
def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x, frame.images[0].velocity.v_y, frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.
I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If, for example, it contains two file paths (one instance of input data and one instance of output data), the mapping function should take two arguments and return two values.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, I see what you meant about the two frames. The map function is applied to each entry of your dataset separately. If you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated setup, I would suggest using a tensorflow Sequence: the explanation from the tensorflow team is pretty straightforward. Hope this helps!
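To illustrate the cross-entry idea without TensorFlow, here is a minimal NumPy sketch that assumes the per-frame speeds have already been extracted (the values below are made up). It computes the acceleration as the speed variation between subsequent frames, with the first value set to 0 as the question allows. Inside a tf.data pipeline, pairing dataset with dataset.skip(1) via tf.data.Dataset.zip, as the question suggests, would be one way to get the same consecutive pairs.

```python
import numpy as np

# Hypothetical per-frame speeds (v_x, v_y, v_z), one row per frame, as
# parse_data would return them when called on consecutive frames.
speeds = np.array([
    [1.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [1.5, 0.5, 0.0],
])

# Speed variation between subsequent frames; prepending the first row
# makes the first difference zero.
accel = np.diff(speeds, axis=0, prepend=speeds[:1])

# One feature row per frame: speed followed by acceleration.
features = np.concatenate([speeds, accel], axis=1)
```

The differences here are per frame, not per second; dividing by the frame interval would be needed for a physical acceleration.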

How can I change the dimensions with numpy and reduce complexity?

I recreated a deep learning network (Yolov3) and extracted a feature map after the prediction. This has the following dimensions (1, 13, 13, 3, 50). The dimensions 13x13 stand for the grid and the 3 for the RGB values. The 50 stand for the 50 different classes my model can predict.
Currently I am trying to reformat the feature maps for each class individually. That means, I try to create 50 arrays from the structure described above, which contain 3 arrays (each for RGB features) and should each contain the grid of 13x13.
What you have to consider is that the feature map contains the values of the 50 classes for each cell of the 13x13 grid.
Currently I have solved the problem with a for-loop that can only extract one class at a time. So I am wondering whether I can use NumPy, for example resize, transpose, or reshape, to get a better solution than the current one.
def extract_feature_maps(model_output, class_index):
    for row in model_output:
        feature_maps = [[], [], []]
        for column in row:
            tmp = [[], [], []]
            for three_dim in column:
                counter = 0
                for feature_map_tmp in three_dim:
                    feature_map_tmp_0 = feature_map_tmp[5:]
                    feature_number = feature_map_tmp_0[class_index]
                    tmp[counter].append(feature_number)
                    counter += 1
            feature_maps[0].append(tmp[0])
            feature_maps[1].append(tmp[1])
            feature_maps[2].append(tmp[2])
    return np.array(feature_maps[0]), np.array(feature_maps[1]), np.array(feature_maps[2])
As I said, I can currently only extract the feature map for one class at a time, in a very time-consuming way. Is there a smarter way to do this?
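One possible vectorized version, as a sketch: slice off the leading values of the last axis and move the class axis to the front with np.transpose. The function name extract_all_feature_maps and the offset parameter are made up for illustration. The stand-in array has 55 entries on the last axis, which is an assumption made so that skipping the first 5 (as the loop does with [5:]) leaves 50 classes.

```python
import numpy as np

def extract_all_feature_maps(model_output, offset=5):
    """Return an array of shape (n_classes, 3, 13, 13) for batch element 0."""
    class_scores = model_output[0, ..., offset:]      # (13, 13, 3, n_classes)
    return np.transpose(class_scores, (3, 2, 0, 1))   # (n_classes, 3, 13, 13)

# Random stand-in for the network output; the last axis is assumed to hold
# 5 leading values followed by the per-class values, as in the loop version.
rng = np.random.default_rng(0)
model_output = rng.random((1, 13, 13, 3, 55))

all_maps = extract_all_feature_maps(model_output)
class_7_maps = all_maps[7]   # three 13x13 grids for class index 7
```

Here all_maps[c][k] should correspond to the k-th array returned by extract_feature_maps(model_output, c), without any Python-level loops.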

tensorflow tfrecord storage for large datasets

I'm trying to understand the "proper" method of storage for large datasets for tensorflow ingestion. The documentation seems relatively clear that no matter what, tfrecord files are preferred. Large is a subjective measure, but the examples below are randomly generated regression datasets from sklearn.datasets.make_regression() of 10,000 rows and between 1 and 5,000 features, all float64.
I've experimented with two different methods of writing tfrecord files with dramatically different performance.
For numpy arrays X, y (X.shape=(10000, n_features), y.shape=(10000,)):
tf.train.Example with per-feature tf.train.Features
I construct a tf.train.Example in the way that tensorflow developers seem to prefer, at least judging by tensorflow example code at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
For each observation or row in X, I create a dictionary keyed with feature names (f_0, f_1, ...) whose values are tf.train.Feature objects with the feature's observation data as a single element of its float_list.
def _feature_dict_from_row(row):
    """
    Take row of length n+1 from 2-D ndarray and convert it to a dictionary of
    float features:
    {
        'f_0': row[0],
        'f_1': row[1],
        ...
        'f_n': row[n]
    }
    """
    def _float64_feature(feature):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[feature]))
    features = { "f_{:d}".format(i): _float64_feature(value) for i, value in enumerate(row) }
    return features
def write_regression_data_to_tfrecord(X, y, filename):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = _feature_dict_from_row(X[row_index])
            # The label must also be wrapped in a tf.train.Feature.
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
tf.train.Example with one large tf.train.Feature containing all features
I construct a dictionary with one feature (really two, counting the label) whose value is a tf.train.Feature with the entire feature row as its float_list.
def write_regression_data_to_tfrecord(X, y, filename, store_by_rows=True):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = { 'f_0': tf.train.Feature(float_list=tf.train.FloatList(value=X[row_index])) }
            # The label must also be wrapped in a tf.train.Feature.
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
As the number of features in the dataset grows, the second option gets considerably faster than the first, as shown in the following graph (note the log scale).
[Graph: 10,000 rows]
It makes intuitive sense to me that creating 5,000 tf.train.Feature objects is significantly slower than creating one object with a float_list of 5,000 elements, but it's not clear that this is the "intended" method for feeding large numbers of features into a tensorflow model.
Is there something inherently wrong with doing this the faster way?

Matrices with different row lengths in numpy

Is there a way of defining a matrix (say m) in numpy with rows of different lengths, but such that m stays 2-dimensional (i.e. m.ndim = 2)?
For example, if you define m = numpy.array([[1,2,3], [4,5]]), then m.ndim = 1. I understand why this happens, but I'm interested if there is any way to trick numpy into viewing m as 2D. One idea would be padding with a dummy value so that rows become equally sized, but I have lots of such matrices and it would take up too much space. The reason why I really need m to be 2D is that I am working with Theano, and the tensor which will be given the value of m expects a 2D value.
Here is some very new information about Theano. We have a new TypedList() type that allows a Python list whose elements all have the same type, such as 1-D ndarrays. Everything is done except the documentation.
The functionality available with typed lists is limited, but we added them to allow looping over a typed list with scan. It is not yet integrated with scan, but you can already use it like this:
import theano
import theano.typed_list
a = theano.typed_list.TypedListType(theano.tensor.fvector)()
s, _ = theano.scan(fn=lambda i, tl: tl[i].sum(),
                   non_sequences=[a],
                   sequences=[theano.tensor.arange(2, dtype='int64')])
f = theano.function([a], s)
f([[1, 2, 3], [4, 5]])
One limitation is that the output of scan must be an ndarray, not a typed list.
No, this is not possible. NumPy arrays need to be rectangular in every pair of dimensions. This is due to the way they map onto memory buffers, as a pointer, itemsize, stride triple.
As for this taking up space: np.array([[1,2,3], [4,5]]) actually takes up more space than a 2×3 array, because it's an array of two pointers to Python lists (and even if the elements were converted to arrays, the memory layout would still be inefficient).
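For completeness, here is a minimal sketch of the padding workaround mentioned in the question, with a boolean mask that records which entries are real data. It does not avoid the extra memory the question is worried about; it only keeps the array 2-dimensional. The dummy value 0 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5]]

lengths = np.array([len(r) for r in rows])
width = lengths.max()

# Rectangular 2-D array, padded with a dummy value (0 here).
padded = np.zeros((len(rows), width), dtype=np.int64)
# mask[i, j] is True where padded[i, j] holds real data.
mask = np.arange(width) < lengths[:, None]
# Boolean assignment fills the True positions in row-major order.
padded[mask] = np.concatenate(rows)

# Reductions can use the mask to ignore the padding.
row_sums = (padded * mask).sum(axis=1)
```

The padded array can be passed wherever a 2D value is expected, with the mask (or the lengths) passed alongside it.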

ValueError: setting an array element with a sequence at fit(X, y) in k-nearest neighbor

I get an error at this line: neigh.fit(X, y):
ValueError: setting an array element with a sequence.
I checked the fit function, and X is: {array-like, sparse matrix, BallTree, cKDTree}
My X is a list of lists whose first element is the solidity number and whose second element is the list of Hu moments (7 values).
If I change it and take only the first Hu moment, so that I have a pure list of lists,
I get this error: query data dimension must match BallTree data dimension.
My code:
import glob
import os
import cv2
import numpy as np
listafeaturevector = list()
path = 'imgknn/'
for infile in glob.glob(os.path.join(path, '*.jpg')):
    print("current file is: " + infile)
    gray = cv2.imread(infile, 0)
    element = cv2.getStructuringElement(cv2.MORPH_CROSS, (6, 6))
    graydilate = cv2.erode(gray, element)
    ret, thresh = cv2.threshold(graydilate, 127, 255, cv2.THRESH_BINARY_INV)
    imgbnbin = thresh
    # CONTOURS
    contours, hierarchy = cv2.findContours(imgbnbin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    print(len(contours))
    for i in range(0, len(contours)):
        fv = list()  # 1 feature vector
        # HUMOMENTS
        #print("humoments")
        mom = cv2.moments(contours[i], 1)
        Humoments = cv2.HuMoments(mom)
        #print(Humoments)
        fv.append(Humoments)  # query data dimension must match BallTree data dimension
        # SOLIDITY
        area = cv2.contourArea(contours[i])
        hull = cv2.convexHull(contours[i])  # has many values
        hull_area = cv2.contourArea(hull)
        solidity = float(area) / hull_area
        fv.append(solidity)
        #fv.append(elongation)
        listafeaturevector.append(fv)
print("i have done")
print(len(listafeaturevector))
lenmatrice = len(listafeaturevector)
# KNN
X = listafeaturevector
y = [0,1,2,3]* (lenmatrice/4)
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)  # ValueError: setting an array element with a sequence.
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
If I try to convert it to a numpy array:
listafv = np.dstack(listafeaturevector)
listafv=np.rollaxis(listafv,-1)
print(listafv.shape)
data = listafv.reshape((lenmatrice, -1))
print(data.shape)
#KNN
X = data
I get: setting an array element with a sequence
A couple of suggestions/questions:
Humoments = cv2.HuMoments(mom)
What is the class of the return value Humoments? a float or a list? If float, that is fine.
for each image file
    for i in range (0, len(contours)):
        fv = list() #1 feature vector
        ...
        fv.append(Humoments)
        ...
        fv.append(solidity)
        listafeaturevector.append(fv)
The above code does not seem correct. For your problem, I think you need to construct one feature vector per image, so everything related to image i should go into the same feature vector x_i. You then combine all feature vectors into a list of feature vectors X. However, your listafeaturevector (or X) is appended to in the innermost loop, which is obviously not correct.
Second, you loop over the elements of contours. Are you sure the number of elements stays the same for each image? Otherwise the number of features (|x_i|) differs across images, which might cause the error
setting an array element with a sequence.
Third, are you clear about how you want to classify the images? What are the target values/labels of the different images? I see you are just setting the labels with [0,1,2,3]* (lenmatrice/4). Can you elaborate on what you are trying to do with those images? Do they contain different types of objects? Do they show different patterns? Do they depict different topics or colors? If yes, give each type a different label, either 0,1,2 or 'red','white','black' (assuming you have only 3 types). The values of the labels do not matter; what matters is how many distinct values there are. I am trying to understand how the labels differ in your case.
On the other hand, if you only want to retrieve similar images, you don't need to use a classifier or specify a label for each image. Instead, try to use NearestNeighbors.
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
Fourth, the above two test lines are not correct. You need to pass an X-like object in order to get a prediction from the classifier. That is to say, you need a feature vector x with the same structure as the ones you constructed for your training examples (with all h,e,s in the same order).
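To make the structure point concrete, here is a minimal NumPy sketch of a flat feature vector. In the question's code each fv holds an array of Hu moments plus a float, which cannot be converted into a rectangular numeric array. The helper build_feature_vector is a made-up name, and the input arrays are stand-ins with the (7, 1) shape that cv2.HuMoments returns.

```python
import numpy as np

def build_feature_vector(hu_moments, solidity):
    """Flatten the Hu moments and append solidity: 8 numbers per sample."""
    return np.concatenate([np.ravel(hu_moments), [solidity]])

# Stand-ins for the output of cv2.HuMoments(...), a (7, 1) array,
# and for the solidity values of two samples.
hu_a = np.arange(7, dtype=float).reshape(7, 1)
hu_b = np.ones((7, 1))

# Every row has the same length, so X is a proper 2-D array.
X = np.array([
    build_feature_vector(hu_a, 0.9),
    build_feature_vector(hu_b, 0.8),
])
```

With X built this way, a query passed to predict would also need 8 values in the same order, rather than a single number such as [[1.1]].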