Related
I have a tensorflow datasetds and I would like to split it into N datasets whose union is the original dataset and that do not share samples among them.
I tried:
ds_list = [ds.shard(N,index=i) for i in range(N)]
But unfortunately it's not random: each new dataset will always get the same samples from the original dataset. For instance, ds_list[0] will have samples number 0,N,2N,3N..., while ds_list[1] will have 1,N+1,2N+1,3N+1...
Is there any way to have a random subdivision of the original dataset into datasets of the same size?
Unfortunately simply shuffling before won't solve the issue:
import tensorflow as tf
import math
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ,15, 16, 17, 18, 19, 20])
N=2
ds = ds.shuffle(20)
ds_list = [ds.shard(N,index=i) for i in range(N)]
for ds in ds_list:
shard_set = sorted(set(list(ds.as_numpy_iterator())))
print(shard_set)
Output:
[3, 5, 6, 8, 11, 12, 14, 15, 19, 20]
[1, 2, 4, 5, 6, 7, 8, 14, 15, 20]
Same as:
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ,15, 16, 17, 18, 19, 20])
N=2
ds_list = []
ds = ds.shuffle(20)
size = ds.__len__()
sub = math.floor(size/N)
for n in range(N):
ds_sub = ds.take(sub)
remainder = ds.skip(sub)
ds_list.append(ds_sub)
ds = remainder
for ds in ds_list:
shard_set = sorted(set(list(ds.as_numpy_iterator())))
print(shard_set)
Perhaps (for N shards):
ds_list = []
ds = ds.shuffle()
size = ds.__len__()
sub = floor(size/N)
for n in range(N):
ds_sub = ds.take(sub)
remainder = ds.skip(sub)
ds_list.append(ds_sub)
ds = remainder
You can first shuffle the dataset and then shard it:
ds = ds.shuffle(buffer_size)
ds_list = [ds.shard(N,index=i) for i in range(N)]
Here buffer_size is the size of buffer used by TF for sorting. If size of dataset is small, you can pass total number of examples as buffer_size. Otherwise a smaller number (anything like 100), which can fit into memory, will work.
I'm developing a pythong script where I receive angular measurements from a motor which has a low resolution encoder attached to it. The data I get from the motor has a very low resolution (about 5 degrees division in between measurments). This is an example of the sensor output whilst it is rotating with a constant speed (in degrees):
sensor output = ([5, 5, 5, 5, 5, 10, 10, 10, 10 ,10, 15, 15, 20, 20, 20, 20, 25, 25, 30, 30, 30, 30, 30, 35, 35....])
As you can see, some of these measurements are repeating themselves.
From these measurements, I would like to interpolate in order to get the measurements in between the 1D data-points. For instance, if I at time k receive the angular measurement theta=5 and in the next instance at t=k+1 also receive a measurement of theta=5, I would like to compute an estimate that would be something like theta = 5+(1/5).
I have also been considering using some sort of predictive filtering, but I'm not sure which one to apply if that is even applicable in this case (e.g. Kalman filtering). The estimated output should be in a linear form since the motor is rotating with a constast angular velocity.
I have tried using numpy.linspace in order to acheive what I want, but cannot seem to get it to work the way I want:
# Interpolate for every 'theta_div' values in angle received through
# modbus
for k in range(np.size(rx)):
y = T.readSensorData() # take measurement (call read sensor function)
fp = np.linspace(y, y+1, num=theta_div)
for n in range(theta_div):
if k % 6 == 0:
if not y == fp[n]:
z = fp[n]
else:
z = y
print(z)
So for the sensor readings: ([5, 5, 5, 5, 5, 10, 10, 10, 10 ,10, 15, 15, 20, 20, 20, 20, 25, 25, 30, 30, 30, 30, 30, 35, 35....]) # each element at time=k0...kn
I would like the output to be something similar to:
theta = ([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17.5, 20...])
So in short, I need some sort of prediction and then update the value with the actual reading from the sensor, similar to the procedure in a Kalman filter.
why dont just make a linear fit?
import numpy as np
import matplotlib.pyplot as plt
messurements = np.array([5, 5, 5, 5, 5, 10, 10, 10, 10 ,10, 15, 15, 20, 20, 20, 20, 25, 25, 30, 30, 30, 30, 30, 35, 35])
time_array = np.arange(messurements.shape[0])
fitparms = np.polyfit(time_array,messurements,1)
def line(x,a,b):
return a*x +b
better_time_array = np.linspace(0,np.max(time_array))
plt.plot(time_array,messurements)
plt.plot(better_time_array,line(better_time_array,fitparms[0],fitparms[1]))
I have a frequency analysis of words said in episodes of my favorite show. I'm making a plot.barh(s1e1_y, s1e1_x) but it's sorting by words instead of values.
The output of >>> s1e1_y
is
['know', 'go', 'now', 'here', 'gonna', 'can', 'them', 'think', 'come', 'time', 'got', 'elliot', 'talk', 'out', 'night', 'been', 'then', 'need', 'world', "what's"]
and >>>s1e1_x
[42, 30, 26, 25, 24, 22, 20, 19, 19, 18, 18, 18, 17, 17, 15, 15, 14, 14, 13, 13]
When the plots are actually plotted, the graph's y axis ticks are sorted alphabetically even though the plotting list is unsorted...
s1e1_wordlist = []
s1e1_count = []
for word, count in s1e01:
if((word[:-1] in excluded_words) == False):
s1e1_wordlist.append(word[:-1])
s1e1_count.append(int(count))
s1e1_sorted = sorted(list(sorted(zip(s1e1_count, s1e1_wordlist))),
reverse=True)
s1e1_20 = []
for i in range(0,20):
s1e1_20.append(s1e1_sorted[i])
s1e1_x = []
s1e1_y = []
for count, word in s1e1_20:
s1e1_x.append(word)
s1e1_y.append(count)
plot.figure(1, figsize=(20,20))
plot.subplot(341)
plot.title('Season1 : Episode 1')
plot.tick_params(axis='y',labelsize=8)
plot.barh(s1e1_x, s1e1_y)
From matplotlib 2.1 on you can plot categorical variables. This allows to plot plt.bar(["apple","cherry","banana"], [1,2,3]). However in matplotlib 2.1 the output will be sorted by category, hence alphabetically. This was considered as bug and is changed in matplotlib 2.2 (see this PR).
In matplotlib 2.2 the bar plot would hence preserve the order.
In matplotlib 2.1, you would plot the data as numeric data as in any version prior to 2.1. This means to plot the numbers against their index and to set the labels accordingly.
w = ['know', 'go', 'now', 'here', 'gonna', 'can', 'them', 'think', 'come',
'time', 'got', 'elliot', 'talk', 'out', 'night', 'been', 'then', 'need',
'world', "what's"]
n = [42, 30, 26, 25, 24, 22, 20, 19, 19, 18, 18, 18, 17, 17, 15, 15, 14, 14, 13, 13]
import matplotlib.pyplot as plt
import numpy as np
plt.barh(range(len(w)),n)
plt.yticks(range(len(w)),w)
plt.show()
Ok you seem to have a lot of spurious code in your example which isn't relevant to the problem as you've described it but assuming you don't want the y axis to sort alphabetically then you need to zip your two lists into a dataframe then plot the dataframe as follows
df = pd.DataFrame(list(zip(s1e1_y,s1e1_x))).set_index(1)
df.plot.barh()
This then produces the following
in [http://deeplearning.net/tutorial/lenet.html#lenet] it says:
This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
when I use flatten function on a numpy 3d array I get a 1D array. but here it says I get a matrix. How does flatten(2) work in theano?
A similar example on numpy produces 1D array:
a= array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
a.flatten(2)=array([ 1, 10, 19, 4, 13, 22, 7, 16, 25, 2, 11, 20, 5, 14, 23, 8, 17,
26, 3, 12, 21, 6, 15, 24, 9, 18, 27])
numpy doesn't support flattening only some dimensions but Theano does.
So if a is a numpy array, a.flatten(2) doesn't make any sense. It runs without error but only because the 2 is passed as the order parameter which seems to cause numpy to stick with the default order of C.
Theano's flatten does support axis specification. The documentation explains how it works.
Parameters:
x (any TensorVariable (or compatible)) – variable to be flattened
outdim (int) – the number of dimensions in the returned variable
Return type:
variable with same dtype as x and outdim dimensions
Returns:
variable with the same shape as x in the leading outdim-1 dimensions,
but with all remaining dimensions of x collapsed into the last dimension.
For example, if we flatten a tensor of shape (2, 3, 4, 5) with
flatten(x, outdim=2), then we’ll have the same (2-1=1) leading
dimensions (2,), and the remaining dimensions are collapsed. So the
output in this example would have shape (2, 60).
A simple Theano demonstration:
import numpy
import theano
import theano.tensor as tt
def compile():
x = tt.tensor3()
return theano.function([x], x.flatten(2))
def main():
a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
f = compile()
print a.shape, f(a).shape
main()
prints
(2L, 3L, 4L) (2L, 12L)
The following script computes R-squared value between two numpy arrays(x and y).
The R-squared value is very low due to outliers in the data. How can I extract the indices of those outliers?
import numpy as np, matplotlib.pyplot as plt, scipy.stats as stats
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
r2 = stats.linregress(x, y) [3]**2
print r2
plt.scatter(x, y)
plt.show()
An outlier is defined as: value-mean > 2*standard deviation.
You can do this with the line
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
What is does:
A list is constructed from the indices of x, where the element at that index satisfies the condition described above.
A quick test:
x = np.random.random_integers(1,50,50)
this gives me the array:
array([16, 6, 13, 18, 21, 37, 31, 8, 1, 48, 4, 40, 9, 14, 6, 45, 20,
15, 14, 32, 30, 8, 19, 8, 34, 22, 49, 5, 22, 23, 39, 29, 37, 24,
45, 47, 21, 5, 4, 27, 48, 2, 22, 8, 12, 8, 49, 12, 15, 18])
Now I add some outliers manually as there are none initially:
x[4] = 200
x[15] = 178
lets test:
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
result:
[4, 15]
Is this what you was looking for?
EDIT:
I added the abs() function in the line above, because when you are working with negative numbers this might end bad. The abs() function takes the absolute value.
I think Sander's approach is the correct one, but if you must see R2 without those outliers before making a decision here is a way to do it.
Setup data and introduce outlier:
In [1]:
import numpy as np, scipy.stats as stats
np.random.seed(123)
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
y[5] = 100
Calculate R2 taking out one y value at a time (along with matching x value):
m = np.eye(y.shape[0])
r2 = np.apply_along_axis(lambda a: stats.linregress(np.delete(x, a.argmax()), np.delete(y, a.argmax()))[3]**2, 0, m)
Get index of the biggest outlier:
r2.argmax()
Out[1]:
5
Get R2 when this outlier is taken out:
In [2]:
r2[r2.argmax()]
Out[2]:
0.85892084723588935
Get the value of the outlier:
In [3]:
y[r2.argmax()]
Out[3]:
100
To get top n outliers:
In [4]:
n = 5
sorted_index = r2.argsort()[::-1]
sorted_index[:n]
Out [4]:
array([ 5, 27, 34, 0, 17], dtype=int64)