How to split a TensorFlow dataset into N datasets with shuffling

I have a TensorFlow dataset ds and I would like to split it into N datasets whose union is the original dataset and that share no samples.
I tried:
ds_list = [ds.shard(N,index=i) for i in range(N)]
But unfortunately it's not random: each new dataset will always get the same samples from the original dataset. For instance, ds_list[0] will have samples number 0,N,2N,3N..., while ds_list[1] will have 1,N+1,2N+1,3N+1...
Is there any way to have a random subdivision of the original dataset into datasets of the same size?
Unfortunately simply shuffling before won't solve the issue:
import tensorflow as tf
import math
N = 2
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
ds = ds.shuffle(20)
ds_list = [ds.shard(N, index=i) for i in range(N)]
for ds in ds_list:
    # Each shard re-runs the shuffle, so the shards overlap
    shard_set = sorted(set(list(ds.as_numpy_iterator())))
    print(shard_set)
[3, 5, 6, 8, 11, 12, 14, 15, 19, 20]
[1, 2, 4, 5, 6, 7, 8, 14, 15, 20]
Same as:
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
ds_list = []
ds = ds.shuffle(20)
size = len(ds)
sub = math.floor(size/N)
for n in range(N):
    ds_sub = ds.take(sub)
    ds_list.append(ds_sub)
    remainder = ds.skip(sub)
    ds = remainder
for ds in ds_list:
    shard_set = sorted(set(list(ds.as_numpy_iterator())))
    print(shard_set)

Perhaps (for N shards):
ds_list = []
size = len(ds)
# shuffle() requires a buffer size; the full dataset size gives a uniform shuffle
ds = ds.shuffle(size)
sub = size // N
for n in range(N):
    ds_sub = ds.take(sub)
    ds_list.append(ds_sub)
    ds = ds.skip(sub)

You can first shuffle the dataset and then shard it:
ds = ds.shuffle(buffer_size)
ds_list = [ds.shard(N,index=i) for i in range(N)]
Here buffer_size is the size of the buffer TF uses for shuffling. If the dataset is small, you can pass the total number of examples as buffer_size, which gives a uniform shuffle. Otherwise use a smaller number that fits into memory (for example 100); the shuffle is then only approximate.
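The question notes that shuffling before sharding still produced overlapping shards, because each shard re-runs the shuffle with a new order. A minimal sketch of one way around that, assuming a TF 2.x version where shuffle accepts seed and reshuffle_each_iteration: fix the shuffled order so that every shard reads from the same permutation. The seed value 42 and N = 2 are arbitrary choices for illustration.

```python
import tensorflow as tf

N = 2  # number of shards (illustrative value)
ds = tf.data.Dataset.from_tensor_slices(list(range(1, 21)))

# reshuffle_each_iteration=False keeps one fixed shuffled order, so all
# shards are cut from the same permutation. The seed is an arbitrary choice.
ds = ds.shuffle(20, seed=42, reshuffle_each_iteration=False)
ds_list = [ds.shard(N, index=i) for i in range(N)]

# Materialize each shard as a sorted list of Python ints for inspection.
shards = [sorted(int(v) for v in shard.as_numpy_iterator()) for shard in ds_list]
for s in shards:
    print(s)
```

With a fixed order the shards should be disjoint and together cover the original 20 samples. The trade-off is that the split is the same on every pass over the data.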


np.array for variable matrix

import numpy as np
data = np.array([[10, 20, 30, 40, 50, 60, 70, 80, 90],
[2, 7, 8, 9, 10, 11],
[3, 12, 13, 14, 15, 16],
[4, 3, 4, 5, 6, 7, 10, 12]],dtype=object)
target = data[:,0]
It has this error.
IndexError                                Traceback (most recent call last)
Input In [82], in <cell line: 9>()
data = np.array([[10, 20, 30, 40, 50, 60, 70, 80, 90],
[2, 7, 8, 9, 10, 11],
[3, 12, 13, 14, 15, 16],
[4, 3, 4, 5, 6, 7, 10,12]],dtype=object)
# Define the target data
----> 9 target = data[:,0]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
May I know how to fix it, please? I mean do not change the elements in the data. Many thanks. I made the matrix in the same size and the error message was gone. But I have the data with variable size.
You have a 1-dimensional array of objects (each element is a Python list), so you can't index on axis=1 because there is none (data.shape -> (4,)).
Use a list comprehension:
out = np.array([a[0] for a in data])
Output: array([10, 2, 3, 4])
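A short sketch of the same idea end to end, assuming the first element of each row is the target and the rest are features (the names rows, target, and features are illustrative, not from the question):

```python
import numpy as np

rows = [[10, 20, 30, 40, 50, 60, 70, 80, 90],
        [2, 7, 8, 9, 10, 11],
        [3, 12, 13, 14, 15, 16],
        [4, 3, 4, 5, 6, 7, 10, 12]]

# Rows of different lengths give a 1-D object array holding Python lists.
data = np.array(rows, dtype=object)
print(data.shape)

# Take the first element of every row, and keep the remainder as ragged lists.
target = np.array([row[0] for row in data])
features = [row[1:] for row in data]
print(target)
```

The rows themselves are left unchanged; only the way they are indexed differs from the rectangular case.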

How to plot my data using Matplotlib with step size

Consider the following code and the graph obtained from it
import matplotlib.pyplot as plt
import numpy as np
fig,axs = plt.subplots(figsize=(10,10))
data1 = [5, 6, 18, 7, 19]
x_ax = [10, 20, 30, 40, 50]
y_ax = [0, 5, 10, 15, 20]
I need to plot my data1 = [5, 6, 18, 7, 19] with a step size of 10. 5 for 10, 6 for 20, 18 for 30, 7 for 40 and 19 for 50. But the plot is taking a step size of one.
How can I modify my code to do the required?
If you don't provide x values to plot, it'll automatically use 0, 1, 2 ....
So in your case you need:
x = range(10, len(data1)*10+1, 10)
axs.plot(x, data1, marker="o")

Tensorflow filter operation on dataset with several columns

I want to create a subset of my data by applying filter operation. I have this data:
data = tf.convert_to_tensor([[1, 2, 1, 1, 5, 5, 9, 12], [1, 2, 3, 8, 4, 5, 9, 12]])
dataset =
I want to retrieve a subset of 'dataset' which corresponds to all elements whose first column is equal to 1. So, result should be:
[[1, 1, 1], [1, 3, 8]] # dtype : dataset
I tried this:
subset = dataset.filter(lambda x: tf.equal(x[0], 1))
But I don't get the correct result, since it sends me back x[0]
Can someone help me?
I finally resolved it:
a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])
data_set = tf.data.Dataset.from_tensor_slices((a, b))
subset = data_set.filter(lambda x, y: tf.equal(x, 1))

Clarification about flatten function in Theano

in [] it says:
# This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
When I use the flatten function on a NumPy 3D array I get a 1D array, but here it says I get a matrix. How does flatten(2) work in Theano?
A similar example on numpy produces 1D array:
a= array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
a.flatten(2)=array([ 1, 10, 19, 4, 13, 22, 7, 16, 25, 2, 11, 20, 5, 14, 23, 8, 17,
26, 3, 12, 21, 6, 15, 24, 9, 18, 27])
numpy doesn't support flattening only some dimensions but Theano does.
So if a is a numpy array, a.flatten(2) doesn't make any sense. It runs without error but only because the 2 is passed as the order parameter which seems to cause numpy to stick with the default order of C.
Theano's flatten does support axis specification. The documentation explains how it works.
x (any TensorVariable (or compatible)) – variable to be flattened
outdim (int) – the number of dimensions in the returned variable
Return type:
variable with same dtype as x and outdim dimensions
variable with the same shape as x in the leading outdim-1 dimensions,
but with all remaining dimensions of x collapsed into the last dimension.
For example, if we flatten a tensor of shape (2, 3, 4, 5) with
flatten(x, outdim=2), then we’ll have the same (2-1=1) leading
dimensions (2,), and the remaining dimensions are collapsed. So the
output in this example would have shape (2, 60).
A simple Theano demonstration:
import numpy
import theano
import theano.tensor as tt
def compile():
    x = tt.tensor3()
    return theano.function([x], x.flatten(2))
def main():
    a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
    f = compile()
    print(a.shape, f(a).shape)
main()
(2, 3, 4) (2, 12)
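For comparison, the behaviour described in the quoted documentation can be emulated in plain NumPy with reshape. This is only a sketch: flatten_like_theano is a hypothetical helper name, not part of NumPy or Theano, and it assumes 1 <= outdim <= x.ndim.

```python
import numpy as np

def flatten_like_theano(x, outdim):
    """Keep the first outdim-1 axes and collapse all remaining axes into one.

    Hypothetical helper mimicking the documented semantics of flatten(outdim).
    """
    x = np.asarray(x)
    lead = x.shape[:outdim - 1]
    return x.reshape(lead + (-1,))

a = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))
print(flatten_like_theano(a, 2).shape)  # (2, 60)
print(flatten_like_theano(a, 3).shape)  # (2, 3, 20)
print(flatten_like_theano(a, 1).shape)  # (120,)
```

The (2, 60) result matches the example in the documentation excerpt above.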

Extracting the indices of outliers in Linear Regression

The following script computes R-squared value between two numpy arrays(x and y).
The R-squared value is very low due to outliers in the data. How can I extract the indices of those outliers?
import numpy as np, matplotlib.pyplot as plt, scipy.stats as stats
x = np.random.randint(1, 51, 50)
y = np.random.randint(1, 51, 50)
# linregress returns (slope, intercept, rvalue, pvalue, stderr); R-squared is rvalue**2
r2 = stats.linregress(x, y)[2]**2
print(r2)
plt.scatter(x, y)
An outlier is defined as: value-mean > 2*standard deviation.
You can do this with the line
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
What it does:
A list is constructed from the indices of x, where the element at that index satisfies the condition described above.
A quick test:
x = np.random.randint(1, 51, 50)
this gives me the array:
array([16, 6, 13, 18, 21, 37, 31, 8, 1, 48, 4, 40, 9, 14, 6, 45, 20,
15, 14, 32, 30, 8, 19, 8, 34, 22, 49, 5, 22, 23, 39, 29, 37, 24,
45, 47, 21, 5, 4, 27, 48, 2, 22, 8, 12, 8, 49, 12, 15, 18])
Now I add some outliers manually as there are none initially:
x[4] = 200
x[15] = 178
Let's test:
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
[4, 15]
Is this what you were looking for?
I added the abs() function in the line above because, without it, values far below the mean would never be flagged. The abs() function takes the absolute value of the deviation.
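The list comprehension above recomputes the mean and standard deviation for every element. A vectorized sketch of the same rule, assuming the same definition of an outlier (absolute deviation from the mean greater than k standard deviations); outlier_indices and its parameter k are illustrative names:

```python
import numpy as np

def outlier_indices(values, k=2.0):
    """Return indices whose absolute deviation from the mean exceeds k * std."""
    values = np.asarray(values, dtype=float)
    deviation = np.abs(values - values.mean())
    return np.flatnonzero(deviation > k * values.std())

# Twenty identical values with one planted outlier at index 7.
x = np.full(20, 10.0)
x[7] = 100.0
print(outlier_indices(x))  # [7]
```

Note that this uses NumPy's default population standard deviation, the same as np.std(x) in the answer.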
I think Sander's approach is the correct one, but if you must see R2 without those outliers before making a decision here is a way to do it.
Setup data and introduce outlier:
In [1]:
import numpy as np, scipy.stats as stats
x = np.random.randint(1, 51, 50)
y = np.random.randint(1, 51, 50)
y[5] = 100
Calculate R2 taking out one y value at a time (along with matching x value):
m = np.eye(y.shape[0])
r2 = np.apply_along_axis(lambda a: stats.linregress(np.delete(x, a.argmax()), np.delete(y, a.argmax()))[2]**2, 0, m)
Get index of the biggest outlier:
Get R2 when this outlier is taken out:
In [2]:
Get the value of the outlier:
In [3]:
To get top n outliers:
In [4]:
n = 5
sorted_index = r2.argsort()[::-1]
sorted_index[:n]
Out [4]:
array([ 5, 27, 34, 0, 17], dtype=int64)
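The leave-one-out idea can also be written as an explicit loop. The sketch below is an illustration, not the answer's exact code: it swaps in np.corrcoef for stats.linregress (for a simple linear regression the squared correlation coefficient equals R-squared), and it uses a small synthetic dataset with one planted outlier. The names leave_one_out_r2, worst, and top_n are made up for this example.

```python
import numpy as np

def leave_one_out_r2(x, y):
    """R-squared of y against x with each point removed in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = np.empty(len(x))
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # mask that drops point i
        r = np.corrcoef(x[keep], y[keep])[0, 1]
        r2[i] = r ** 2
    return r2

# Perfectly linear data with one planted outlier at index 5.
x = np.arange(20.0)
y = 2.0 * x + 1.0
y[5] = 100.0

r2 = leave_one_out_r2(x, y)
worst = int(np.argmax(r2))           # removing this point improves R2 the most
print(worst, r2[worst], y[worst])    # index, R2 without it, its y value
top_n = np.argsort(r2)[::-1][:3]     # indices of the three most influential points
print(top_n)
```

The point whose removal raises R2 the most is treated as the biggest outlier, which is the same criterion as sorting r2 in descending order above.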