Extracting the indices of outliers in Linear Regression - numpy

The following script computes R-squared value between two numpy arrays(x and y).
The R-squared value is very low due to outliers in the data. How can I extract the indices of those outliers?
import numpy as np, matplotlib.pyplot as plt, scipy.stats as stats
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
r2 = stats.linregress(x, y) [3]**2
print r2
plt.scatter(x, y)
plt.show()

An outlier is defined as: value-mean > 2*standard deviation.
You can do this with the line
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
What is does:
A list is constructed from the indices of x, where the element at that index satisfies the condition described above.
A quick test:
x = np.random.random_integers(1,50,50)
this gives me the array:
array([16, 6, 13, 18, 21, 37, 31, 8, 1, 48, 4, 40, 9, 14, 6, 45, 20,
15, 14, 32, 30, 8, 19, 8, 34, 22, 49, 5, 22, 23, 39, 29, 37, 24,
45, 47, 21, 5, 4, 27, 48, 2, 22, 8, 12, 8, 49, 12, 15, 18])
Now I add some outliers manually as there are none initially:
x[4] = 200
x[15] = 178
lets test:
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
result:
[4, 15]
Is this what you was looking for?
EDIT:
I added the abs() function in the line above, because when you are working with negative numbers this might end bad. The abs() function takes the absolute value.

I think Sander's approach is the correct one, but if you must see R2 without those outliers before making a decision here is a way to do it.
Setup data and introduce outlier:
In [1]:
import numpy as np, scipy.stats as stats
np.random.seed(123)
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
y[5] = 100
Calculate R2 taking out one y value at a time (along with matching x value):
m = np.eye(y.shape[0])
r2 = np.apply_along_axis(lambda a: stats.linregress(np.delete(x, a.argmax()), np.delete(y, a.argmax()))[3]**2, 0, m)
Get index of the biggest outlier:
r2.argmax()
Out[1]:
5
Get R2 when this outlier is taken out:
In [2]:
r2[r2.argmax()]
Out[2]:
0.85892084723588935
Get the value of the outlier:
In [3]:
y[r2.argmax()]
Out[3]:
100
To get top n outliers:
In [4]:
n = 5
sorted_index = r2.argsort()[::-1]
sorted_index[:n]
Out [4]:
array([ 5, 27, 34, 0, 17], dtype=int64)

Related

How to plot my data using MatPloitLib with step size

Consider the following code and the graph obtained from it
import matplotlib.pyplot as plt
import numpy as np
fig,axs = plt.subplots(figsize=(10,10))
data1 = [5, 6, 18, 7, 19]
x_ax = [10, 20, 30, 40, 50]
y_ax = [0, 5, 10, 15, 20]
axs.plot(data1,marker="o")
axs.set_xticks(x_ax)
axs.set_xticklabels(labels=x_ax,rotation=45)
axs.set_yticks(y_ax)
axs.set_yticklabels(labels=y_ax,rotation=45)
axs.set_xlabel("X")
axs.set_ylabel("Y")
axs.set_title("Name")
I need to plot my data1 = [5, 6, 18, 7, 19] with a step size of 10. 5 for 10, 6 for 20, 18 for 30, 7 for 40 and 19 for 50. But the plot is taking a step size of one.
How can I modify my code to do the required?
If you don't provide x values to plot, it'll automatically use 0, 1, 2 ....
So in your case you need:
x = range(10, len(data1)*10+1, 10)
axs.plot(x, data1, marker="o")

Convert numpy array of Datetime objects to UTC strings

I have a large array of datetime objects in numpy array. However I am trying to export them as a json object attribute and need them to be represented as a UTC string.
Here is my array ( a small chunk of it )
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=<UTC>), datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=<UTC>), datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=<UTC>)]
json = {
'datetimes': []
};
I know I can iterate over the list and convert them however I was hoping there was an efficient pandas or numpy technique for this.
I think you can create DataFrame, convert to iso format and save to dict, because DataFrame.to_json with orint='list' is not implemented yet:
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]
df = pd.DataFrame({'datetimes': datetimes})
#native convert to iso, but not support lists yet
print (df.to_json(date_format='iso'))
{"datetimes":{"0":"2015-07-12T18:33:14.000Z",
"1":"2015-07-12T18:33:32.000Z",
"2":"2015-07-12T18:33:50.000Z"}}
df = pd.DataFrame({'datetimes': datetimes})
df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
print (json.dumps(df.to_dict(orient='l')))
{"datetimes": ["2015-07-12T18:33:14+00:00",
"2015-07-12T18:33:32+00:00",
"2015-07-12T18:33:50+00:00"]}
print(json.dumps({'datetimes': [x.isoformat() for x in datetimes]}))
{"datetimes": ["2015-07-12T18:33:14+00:00",
"2015-07-12T18:33:32+00:00",
"2015-07-12T18:33:50+00:00"]}
I test it more and list comprehension is fastest with isoformat:
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]*10000
In [116]: %%timeit
...: df = pd.DataFrame({'datetimes': datetimes})
...: df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
...: json.dumps(df.to_dict(orient='l'))
...:
1 loop, best of 3: 552 ms per loop
#wrong output format, dictionaries not lists
In [117]: %%timeit
...: df = pd.DataFrame({'datetimes': datetimes})
...: df.to_json(date_format='iso')
...:
10 loops, best of 3: 104 ms per loop
In [118]: %%timeit
...: json.dumps({'datetimes': [x.isoformat() for x in datetimes]})
...:
10 loops, best of 3: 67.5 ms per loop

Pyplot sorting y-values automatically

I have a frequency analysis of words said in episodes of my favorite show. I'm making a plot.barh(s1e1_y, s1e1_x) but it's sorting by words instead of values.
The output of >>> s1e1_y
is
['know', 'go', 'now', 'here', 'gonna', 'can', 'them', 'think', 'come', 'time', 'got', 'elliot', 'talk', 'out', 'night', 'been', 'then', 'need', 'world', "what's"]
and >>>s1e1_x
[42, 30, 26, 25, 24, 22, 20, 19, 19, 18, 18, 18, 17, 17, 15, 15, 14, 14, 13, 13]
When the plots are actually plotted, the graph's y axis ticks are sorted alphabetically even though the plotting list is unsorted...
s1e1_wordlist = []
s1e1_count = []
for word, count in s1e01:
if((word[:-1] in excluded_words) == False):
s1e1_wordlist.append(word[:-1])
s1e1_count.append(int(count))
s1e1_sorted = sorted(list(sorted(zip(s1e1_count, s1e1_wordlist))),
reverse=True)
s1e1_20 = []
for i in range(0,20):
s1e1_20.append(s1e1_sorted[i])
s1e1_x = []
s1e1_y = []
for count, word in s1e1_20:
s1e1_x.append(word)
s1e1_y.append(count)
plot.figure(1, figsize=(20,20))
plot.subplot(341)
plot.title('Season1 : Episode 1')
plot.tick_params(axis='y',labelsize=8)
plot.barh(s1e1_x, s1e1_y)
From matplotlib 2.1 on you can plot categorical variables. This allows to plot plt.bar(["apple","cherry","banana"], [1,2,3]). However in matplotlib 2.1 the output will be sorted by category, hence alphabetically. This was considered as bug and is changed in matplotlib 2.2 (see this PR).
In matplotlib 2.2 the bar plot would hence preserve the order.
In matplotlib 2.1, you would plot the data as numeric data as in any version prior to 2.1. This means to plot the numbers against their index and to set the labels accordingly.
w = ['know', 'go', 'now', 'here', 'gonna', 'can', 'them', 'think', 'come',
'time', 'got', 'elliot', 'talk', 'out', 'night', 'been', 'then', 'need',
'world', "what's"]
n = [42, 30, 26, 25, 24, 22, 20, 19, 19, 18, 18, 18, 17, 17, 15, 15, 14, 14, 13, 13]
import matplotlib.pyplot as plt
import numpy as np
plt.barh(range(len(w)),n)
plt.yticks(range(len(w)),w)
plt.show()
Ok you seem to have a lot of spurious code in your example which isn't relevant to the problem as you've described it but assuming you don't want the y axis to sort alphabetically then you need to zip your two lists into a dataframe then plot the dataframe as follows
df = pd.DataFrame(list(zip(s1e1_y,s1e1_x))).set_index(1)
df.plot.barh()
This then produces the following

Fill blocks at random places on each 2D slice of a 3D array

I have 3D numpy array, for example, like this:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]]])
Is there a way to index it in such a way that I select, for example, top right corner of 2x2 elements in the first plane, and a center 2x2 elements subarray from the second plane? So that I could then zero out the elements 2,3,6,7,21,22,25,26:
array([[[ 0, 1, 0, 0],
[ 4, 5, 0, 0],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 0, 0, 23],
[24, 0, 0, 27],
[28, 29, 30, 31]]])
I have a batch of images, and I need to zero out a small window of fixed size, but at different (random) locations for each image in the batch. The first dimension is number of images.
Something like this:
a[:, x: x+2, y: y+2] = 0
where x and y are vectors which have different values for each first dimension of a.
Approach #1 : Here'e one approach that's mostly based on linear-indexing -
def random_block_fill_lidx(a, N, fillval=0):
# a is input array
# N is blocksize
# Store shape info
m,n,r = a.shape
# Get all possible starting linear indices for each 2D slice
possible_start_lidx = (np.arange(n-N+1)[:,None]*r + range(r-N+1)).ravel()
# Get random start indices from all possible ones for all 2D slices
start_lidx = np.random.choice(possible_start_lidx, m)
# Get linear indices for the block of (N,N)
offset_arr = (a.shape[-1]*np.arange(N)[:,None] + range(N)).ravel()
# Add in those random start indices with the offset array
idx = start_lidx[:,None] + offset_arr
# On a 2D view of the input array, use advance-indexing to set fillval.
a.reshape(m,-1)[np.arange(m)[:,None], idx] = fillval
return a
Approach #2 : Here's another and possibly more efficient one (for large 2D slices) using advanced-indexing -
def random_block_fill_adv(a, N, fillval=0):
# a is input array
# N is blocksize
# Store shape info
m,n,r = a.shape
# Generate random start indices for second and third axes keeping proper
# distance from the boundaries for the block to be accomodated within.
idx0 = np.random.randint(0,n-N+1,m)
idx1 = np.random.randint(0,r-N+1,m)
# Setup indices for advanced-indexing.
# First axis indices would be simply the range array to select one per elem.
# We need to extend this to 3D so that the latter dim indices could be aligned.
dim0 = np.arange(m)[:,None,None]
# Second axis indices would idx0 with broadcasted additon of blocksized
# range array to cover all block indices along this axis. Repeat for third.
dim1 = idx0[:,None,None] + np.arange(N)[:,None]
dim2 = idx1[:,None,None] + range(N)
a[dim0, dim1, dim2] = fillval
return a
Approach #3 : With the old-trusty loop -
def random_block_fill_loopy(a, N, fillval=0):
# a is input array
# N is blocksize
# Store shape info
m,n,r = a.shape
# Generate random start indices for second and third axes keeping proper
# distance from the boundaries for the block to be accomodated within.
idx0 = np.random.randint(0,n-N+1,m)
idx1 = np.random.randint(0,r-N+1,m)
# Iterate through first and use slicing to assign fillval.
for i in range(m):
a[i, idx0[i]:idx0[i]+N, idx1[i]:idx1[i]+N] = fillval
return a
Sample run -
In [357]: a = np.arange(2*4*7).reshape(2,4,7)
In [358]: a
Out[358]:
array([[[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27]],
[[28, 29, 30, 31, 32, 33, 34],
[35, 36, 37, 38, 39, 40, 41],
[42, 43, 44, 45, 46, 47, 48],
[49, 50, 51, 52, 53, 54, 55]]])
In [359]: random_block_fill_adv(a, N=3, fillval=0)
Out[359]:
array([[[ 0, 0, 0, 0, 4, 5, 6],
[ 7, 0, 0, 0, 11, 12, 13],
[14, 0, 0, 0, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27]],
[[28, 29, 30, 31, 32, 33, 34],
[35, 36, 37, 38, 0, 0, 0],
[42, 43, 44, 45, 0, 0, 0],
[49, 50, 51, 52, 0, 0, 0]]])
Fun stuff : Being in-place filling, if we keep running random_block_fill_adv(a, N=3, fillval=0), we will eventually end up with all zeros a. Thus, also verifying the code.
Runtime test
In [579]: a = np.random.randint(0,9,(10000,4,4))
In [580]: %timeit random_block_fill_lidx(a, N=2, fillval=0)
...: %timeit random_block_fill_adv(a, N=2, fillval=0)
...: %timeit random_block_fill_loopy(a, N=2, fillval=0)
...:
1000 loops, best of 3: 545 µs per loop
1000 loops, best of 3: 891 µs per loop
100 loops, best of 3: 10.6 ms per loop
In [581]: a = np.random.randint(0,9,(1000,40,40))
In [582]: %timeit random_block_fill_lidx(a, N=10, fillval=0)
...: %timeit random_block_fill_adv(a, N=10, fillval=0)
...: %timeit random_block_fill_loopy(a, N=10, fillval=0)
...:
1000 loops, best of 3: 739 µs per loop
1000 loops, best of 3: 671 µs per loop
1000 loops, best of 3: 1.27 ms per loop
So, which one to choose depends on the first axis length and blocksize.

Clarification about flatten function in Theano

in [http://deeplearning.net/tutorial/lenet.html#lenet] it says:
This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
when I use flatten function on a numpy 3d array I get a 1D array. but here it says I get a matrix. How does flatten(2) work in theano?
A similar example on numpy produces 1D array:
a= array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
a.flatten(2)=array([ 1, 10, 19, 4, 13, 22, 7, 16, 25, 2, 11, 20, 5, 14, 23, 8, 17,
26, 3, 12, 21, 6, 15, 24, 9, 18, 27])
numpy doesn't support flattening only some dimensions but Theano does.
So if a is a numpy array, a.flatten(2) doesn't make any sense. It runs without error but only because the 2 is passed as the order parameter which seems to cause numpy to stick with the default order of C.
Theano's flatten does support axis specification. The documentation explains how it works.
Parameters:
x (any TensorVariable (or compatible)) – variable to be flattened
outdim (int) – the number of dimensions in the returned variable
Return type:
variable with same dtype as x and outdim dimensions
Returns:
variable with the same shape as x in the leading outdim-1 dimensions,
but with all remaining dimensions of x collapsed into the last dimension.
For example, if we flatten a tensor of shape (2, 3, 4, 5) with
flatten(x, outdim=2), then we’ll have the same (2-1=1) leading
dimensions (2,), and the remaining dimensions are collapsed. So the
output in this example would have shape (2, 60).
A simple Theano demonstration:
import numpy
import theano
import theano.tensor as tt
def compile():
x = tt.tensor3()
return theano.function([x], x.flatten(2))
def main():
a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
f = compile()
print a.shape, f(a).shape
main()
prints
(2L, 3L, 4L) (2L, 12L)