I am using version 1.5.1 of numpy and Python 2.6.6.
I am reading a binary file into a numpy array:
>>> dt = np.dtype('<u4,<i2,<i2,<i2,<i2,<i2,<i2,<i2,<i2,u1,u1,u1,u1')
>>> file_data = np.fromfile(os.path.join(folder,f), dtype=dt)
This works just fine. Examining the result:
>>> type(file_data)
<type 'numpy.ndarray'>
>>> file_data
array([(3571121L, -54, 103, 1, 50, 48, 469, 588, -10, 0, 102, 0, 0),
(3571122L, -78, 20, 25, 45, 44, 495, 397, -211, 0, 102, 0, 0),
(3571123L, -69, -48, 23, 60, 19, 317, -26, -151, 0, 102, 0, 0), ...,
(3691138L, -53, 52, -2, -11, 76, 988, 288, -101, 1, 102, 0, 0),
(3691139L, -11, 21, -27, 25, 47, 986, 253, 176, 1, 102, 0, 0),
(3691140L, -30, -19, -63, 59, 12, 729, 23, 302, 1, 102, 0, 0)],
dtype=[('f0', '<u4'), ('f1', '<i2'), ('f2', '<i2'), ... , ('f12', '|u1')])
>>> file_data[0]
(3571121L, -54, 103, 1, 50, 48, 469, 588, -10, 0, 102, 0, 0)
>>> file_data[0][0]
3571121
>>> len(file_data)
120020
When I try to slice the first column:
>>> file_data[:,0]
I get:
IndexError: invalid index.
I have looked at simple examples and was able to do the slicing:
>>> a = np.array([(1,2,3),(4,5,6)])
>>> a[:,0]
array([1, 4])
The only difference I can see between my case and the simple example is that I am using the dtype. What am I doing wrong?
When you set the dtype like that, you are creating a structured array (often loosely called a record array). NumPy treats it as a 1D array whose elements each have your compound dtype. There's a fundamental difference between
file_data[0][0]
and
file_data[0,0]
In the first, you are asking for the first element of a 1D array and then retrieving the first element of that returned element. In the second, you are asking for the element in the first row of the first column of a 2D array. That's why you are getting the IndexError.
If you want to access an individual element using 2D notation, you can create a view and work with that. Unfortunately, AFAIK if you want to treat your object like a 2D array, all elements have to have the same dtype.
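As a minimal sketch of both points, assuming a small made-up structured array rather than the real file: a single "column" of a structured array is reached by field name (the auto-generated names are f0, f1, ...), and a 2D view is only possible when every field shares one dtype. The arrays below are hypothetical stand-ins for file_data.

```python
import numpy as np

# Mixed dtypes, like the file in the question (shortened to 4 fields).
dt = np.dtype('<u4,<i2,<i2,u1')
data = np.array([(3571121, -54, 103, 0),
                 (3571122, -78, 20, 0)], dtype=dt)

# "First column" of a structured array: index by field name, not data[:, 0].
first_col = data['f0']

# When all fields share one dtype, the same memory can be viewed as 2D.
dt_same = np.dtype('<i2,<i2,<i2')
rec = np.array([(1, 2, 3), (4, 5, 6)], dtype=dt_same)
as_2d = rec.view('<i2').reshape(len(rec), -1)

print(first_col)    # values of field f0 for every record
print(as_2d[:, 0])  # ordinary 2D slicing now works
```

With the mixed dtype from the question, the field-name form is the one that applies; the view only works for the homogeneous case.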
I have an LSTM model I am using to predict the unemployment rate from Federal Reserve filings. It uses GloVe vectors and a vocab2index embedding, and the training went as planned. However, when I try to feed a word embedding into the model for prediction testing, it keeps throwing various errors.
Here is the model:
def load_glove_vectors(glove_file=glove_embedding_vectors_text_file):
    """Load the glove word vectors"""
    word_vectors = {}
    with open(glove_file) as f:
        for line in f:
            split = line.split()
            word_vectors[split[0]] = np.array([float(x) for x in split[1:]])
    return word_vectors

def get_emb_matrix(pretrained, word_counts, emb_size=300):
    """Creates embedding matrix from word vectors"""
    vocab_size = len(word_counts) + 2
    vocab_to_idx = {}
    vocab = ["", "UNK"]
    W = np.zeros((vocab_size, emb_size), dtype="float32")
    W[0] = np.zeros(emb_size, dtype='float32')  # adding a vector for padding
    W[1] = np.random.uniform(-0.25, 0.25, emb_size)  # adding a vector for unknown words
    vocab_to_idx["UNK"] = 1
    i = 2
    for word in word_counts:
        if word in pretrained:  # use the argument, not the global word_vecs
            W[i] = pretrained[word]
        else:
            W[i] = np.random.uniform(-0.25, 0.25, emb_size)
        vocab_to_idx[word] = i
        vocab.append(word)
        i += 1
    return W, np.array(vocab), vocab_to_idx

word_vecs = load_glove_vectors()
pretrained_weights, vocab, vocab2index = get_emb_matrix(word_vecs, counts)
Unfortunately when I feed this array
[array([ 3, 10, 6287, 6, 113, 271, 3, 6639, 104, 5105, 7525,
104, 7526, 9, 23, 9, 10, 11, 24, 7527, 7528, 104,
11, 24, 7529, 7530, 104, 11, 24, 7531, 7530, 104, 11,
24, 7532, 7530, 104, 11, 24, 7533, 7534, 24, 7535, 7536,
104, 7537, 104, 7538, 7539, 7540, 6643, 7541, 7354, 7542, 7543,
7544, 9, 23, 9, 10, 11, 24, 25, 8, 10, 11,
24, 3, 10, 663, 168, 9, 10, 290, 291, 3, 4909,
198, 10, 1478, 169, 15, 4621, 3, 3244, 3, 59, 1967,
113, 59, 520, 198, 25, 5105, 7545, 7546, 7547, 7546, 7548,
7549, 7550, 1874, 10, 7551, 9, 10, 11, 24, 7552, 6287,
7553, 7554, 7555, 24, 7556, 24, 7557, 7558, 7559, 6, 7560,
323, 169, 10, 7561, 1432, 6, 3134, 3, 7562, 6, 7563,
1862, 7144, 741, 3, 3961, 7564, 7565, 520, 7566, 4833, 7567,
7568, 4901, 7569, 7570, 4901, 7571, 1874, 7572, 12, 13, 7573,
10, 7574, 7575, 59, 7576, 59, 638, 1620, 7577, 271, 6488,
59, 7578, 7579, 7580, 7581, 271, 7582, 7583, 24, 669, 5932,
7584, 9, 113, 271, 3764, 3, 5930, 3, 59, 4901, 7585,
793, 7586, 7587, 6, 1482, 520, 7588, 520, 7589, 3246, 7590,
13, 7591])
into torch.LongTensor() I keep getting the following error:
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
Any ideas on how to remedy? I am fairly new to AI in general, and I am an economist by trade so I am almost certain I have made a boneheaded error.
I have a dataframe of shape (120, 238), with 12 values spread across it. I am trying to use radial basis function interpolation to fill in the remaining empty points. For that I created a list with the coordinates of the points, and another list with the values at each of these points.
for i in range(238):
    col.append('')
df_map = pd.DataFrame(columns = col, index = range(120))
x_rbf = [8, 227, 19, 116, 11, 223, 5, 231, 116, 116, 13, 222] #x represents the columns
y_rbf = [59, 59, 102, 111, 17, 17, 9, 9, 62, 17, 7, 7] #y represents the rows
z_rbf = [16.2,15.99,16.2,16.3,15.7,15,14.2,14.2,16.4,16.4,13,11]
y = x_rbf, y_rbf
f = scipy.interpolate.RBFInterpolator(y,z_rbf)
However, when I run this code, I get the following error:
ValueError: Expected the first axis of `d` to have length 2.
Does anyone know how to get around this?
After countless tries, I figured out the issue with using RBFInterpolator. It expects the coordinates as a single array of shape (n_points, 2), so the x and y coordinates have to be NumPy arrays, flattened (using np.ravel()), and then stacked along the last axis:
import numpy as np
import pandas as pd
import scipy.interpolate
col = []
for i in range(238):
    col.append('')
df_map = pd.DataFrame(columns = col, index = range(120))
x_rbf = np.array([8, 227, 19, 116, 11, 223, 5, 231, 116, 116, 13, 222]) #x represents the columns
y_rbf = np.array([59, 59, 102, 111, 17, 17, 9, 9, 62, 17, 7, 7]) #y represents the rows
z_rbf = np.array([16.2,15.99,16.2,16.3,15.7,15,14.2,14.2,16.4,16.4,13,11])
sp = np.stack([y_rbf.ravel(),x_rbf.ravel()],-1) # shape (12, 2)
f = scipy.interpolate.RBFInterpolator(sp,z_rbf.ravel(), kernel = 'linear')
Should work this way
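To actually fill the (120, 238) grid, the interpolator can be evaluated at every (row, column) pair. This is only a sketch under assumptions: the names points, grid and filled are made up for illustration, and the points are ordered (row, column) to match the stacking used above.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

x_rbf = np.array([8, 227, 19, 116, 11, 223, 5, 231, 116, 116, 13, 222])  # columns
y_rbf = np.array([59, 59, 102, 111, 17, 17, 9, 9, 62, 17, 7, 7])         # rows
z_rbf = np.array([16.2, 15.99, 16.2, 16.3, 15.7, 15, 14.2, 14.2, 16.4, 16.4, 13, 11])

# One (row, column) pair per known point: shape (12, 2).
points = np.column_stack([y_rbf, x_rbf])
f = RBFInterpolator(points, z_rbf, kernel='linear')

# Every cell of the 120 x 238 grid as a (row, column) pair: shape (28560, 2).
rows, cols = np.mgrid[0:120, 0:238]
grid = np.column_stack([rows.ravel(), cols.ravel()])

# Evaluate and fold back into the grid shape.
filled = f(grid).reshape(120, 238)
print(filled.shape)
```

The result can then be wrapped with pd.DataFrame(filled) if a dataframe is needed.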
Basically what the title entails.
The two matrices are mostly zeros. And the first is 1 x 9999999999999 and the second is 9999999999999 x 1
When I try to do a dot product I get this.
Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
Full traceback:
In [31]: imputed.dot(s)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-31-670cfc69d4cf> in <module>
----> 1 imputed.dot(s)
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in dot(self, other)
357
358 """
--> 359 return self * other
360
361 def power(self, n, dtype=None):
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in __mul__(self, other)
478 if self.shape[1] != other.shape[0]:
479 raise ValueError('dimension mismatch')
--> 480 return self._mul_sparse_matrix(other)
481
482 # If it's a list or whatever, treat it like a matrix
~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
499
500 major_axis = self._swap((M, N))[0]
--> 501 other = self.__class__(other) # convert to this format
502
503 idx_dtype = get_index_dtype((self.indptr, self.indices,
~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
32 arg1 = arg1.copy()
33 else:
---> 34 arg1 = arg1.asformat(self.format)
35 self._set_self(arg1)
36
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in asformat(self, format, copy)
320 # Forward the copy kwarg, if it's accepted.
321 try:
--> 322 return convert_method(copy=copy)
323 except TypeError:
324 return convert_method()
~/.local/lib/python3.8/site-packages/scipy/sparse/csc.py in tocsr(self, copy)
135 idx_dtype = get_index_dtype((self.indptr, self.indices),
136 maxval=max(self.nnz, N))
--> 137 indptr = np.empty(M + 1, dtype=idx_dtype)
138 indices = np.empty(self.nnz, dtype=idx_dtype)
139 data = np.empty(self.nnz, dtype=upcast(self.dtype))
MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
It seems scipy is trying to create a temporary array.
I am using the .dot method that scipy provides.
I am also open to non-scipy solutions.
Thanks!
In [105]: from scipy import sparse
If I make a (100,1) csr matrix:
In [106]: A = sparse.random(100,1,format='csr')
In [107]: A
Out[107]:
<100x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
The data and indices are:
In [109]: A.data
Out[109]: array([0.19060481])
In [110]: A.indices
Out[110]: array([0], dtype=int32)
In [112]: A.indptr
Out[112]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
So even with only 1 nonzero term, one array is large (101).
On the other hand the csc format for the same array has a much smaller storage. But csc with (1,100) shape will look like the csr.
In [113]: Ac = A.tocsc()
In [114]: Ac.indptr
Out[114]: array([0, 1], dtype=int32)
In [115]: Ac.indices
Out[115]: array([88], dtype=int32)
Math, especially matrix products, is done with the csr/csc formats. So it may be hard to avoid this 80 TB memory use.
Looking at the traceback, I see that it's trying to convert other to the format that matches self.
So with A.dot(B), A is a (1,N) csr, which is the small layout, and B is an (N,1) csc, also the small layout. But B.tocsr() requires the large (N+1,) shaped indptr.
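The size difference is easy to check on a small case. A minimal sketch, assuming a made-up (100, 1) column with one stored value at row 88: the indptr of a csc matrix has columns + 1 entries, while the csr version of the same data needs rows + 1.

```python
from scipy import sparse

N = 100
# (N, 1) column in CSC format with a single stored value at row 88.
col = sparse.csc_matrix(([1.0], [88], [0, 1]), shape=(N, 1))

print(len(col.indptr))          # number of columns + 1
print(len(col.tocsr().indptr))  # number of rows + 1
```

With N around 1e13, that second array is the allocation that fails in the traceback.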
Let's try an alternative to dot
First 2 matrices:
In [122]: A = sparse.random(1,100, .2,format='csr')
In [123]: B = sparse.random(100,1, .2,format='csc')
In [124]: A
Out[124]:
<1x100 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
In [125]: B
Out[125]:
<100x1 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Column format>
In [126]: A@B
Out[126]:
<1x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
In [127]: _.A
Out[127]: array([[1.33661021]])
These are the indices of their nonzero elements. Only the ones that appear in both matter for the product.
In [128]: A.indices, B.indices
Out[128]:
(array([16, 20, 23, 28, 30, 37, 39, 40, 43, 49, 54, 59, 61, 63, 67, 70, 74,
91, 94, 99], dtype=int32),
array([ 5, 8, 15, 25, 34, 35, 40, 46, 47, 51, 53, 60, 68, 70, 75, 81, 87,
90, 91, 94], dtype=int32))
equality matrix:
In [129]: mask = A.indices[:,None]==B.indices
In [132]: np.nonzero(mask.any(axis=0))
Out[132]: (array([ 6, 13, 18, 19]),)
In [133]: np.nonzero(mask.any(axis=1))
Out[133]: (array([ 7, 15, 17, 18]),)
The matching indices:
In [139]: A.indices[Out[133]]
Out[139]: array([40, 70, 91, 94], dtype=int32)
In [140]: B.indices[Out[132]]
Out[140]: array([40, 70, 91, 94], dtype=int32)
sum of the corresponding data values matches [127]
In [141]: (A.data[Out[133]]*B.data[Out[132]]).sum()
Out[141]: 1.3366102138511582
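The steps above can be folded into one helper. This is a sketch, not part of scipy: row_col_dot is a hypothetical name, it assumes A is a (1, N) csr matrix and B an (N, 1) csc matrix without duplicate entries, and it uses np.intersect1d in place of the broadcasted equality mask so that memory scales with the number of stored values rather than with N.

```python
import numpy as np
from scipy import sparse

def row_col_dot(A, B):
    """Dot product of a (1, N) CSR row and an (N, 1) CSC column.

    Hypothetical helper: only the stored entries are touched, so no
    length-N array is allocated.
    """
    if A.format != 'csr' or B.format != 'csc':
        raise ValueError('expected A in csr format and B in csc format')
    # Positions, within each matrix's stored entries, of the shared indices.
    _, ia, ib = np.intersect1d(A.indices, B.indices, return_indices=True)
    return float(np.dot(A.data[ia], B.data[ib]))

# Small test matrices built directly from (data, indices, indptr).
rng = np.random.default_rng(0)
N, nnz = 100, 20
a_idx = np.sort(rng.choice(N, size=nnz, replace=False))
b_idx = np.sort(rng.choice(N, size=nnz, replace=False))
A = sparse.csr_matrix((rng.random(nnz), a_idx, [0, nnz]), shape=(1, N))
B = sparse.csc_matrix((rng.random(nnz), b_idx, [0, nnz]), shape=(N, 1))

print(row_col_dot(A, B), (A @ B).toarray()[0, 0])
```

On a small case like this the result can be compared against A @ B; for the huge shapes in the question only the helper avoids the format conversion.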
I have 2 global arrays "tab1" and "tab2" with dimensions respectively equal to 21x21 and 17x17.
I would like to assign to the block of "tab1" indexed by [15:20,0:7] the block of "tab2" indexed by [7:17:2,0:7] (that is, with a step of 2 along the first dimension). I tried this syntax:
tab1[15:20,0:7] = tab2[7:17:2,0:7]
Unfortunately, this doesn't work, it seems that only "diagonal" (I mean one by one) elements of 15:20 are taken into account following the values of "tab2" along [7:17:2].
Is there a way to assign to a subarray of "tab1" from a subarray of "tab2" whose indexes are spaced by a step?
If someone could see what's wrong or suggest another method, this would be nice.
UPDATE 1: indeed, from my last tests, it seems good but is it also the same for the assignment of block [15:20,15:20] :
tab1[15:20,15:20] = tab2[7:17:2,7:17:2]
??
ANSWER : it seems ok also for this block assignment, sorry
The assignment works as I expect.
In [1]: arr = np.ones((20,10),int)
The two blocks have the same shape:
In [2]: arr[15:20, 0:7].shape
Out[2]: (5, 7)
In [3]: arr[7:17:2, 0:7].shape
Out[3]: (5, 7)
and assigning something interesting, looks right:
In [4]: arr2 = np.arange(200).reshape(20,10)
In [5]: arr[15:20, 0:7] = arr2[7:17:2, 0:7]
In [6]: arr
Out[6]:
array([[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 70, 71, 72, 73, 74, 75, 76, 1, 1, 1],
[ 90, 91, 92, 93, 94, 95, 96, 1, 1, 1],
[110, 111, 112, 113, 114, 115, 116, 1, 1, 1],
[130, 131, 132, 133, 134, 135, 136, 1, 1, 1],
[150, 151, 152, 153, 154, 155, 156, 1, 1, 1]])
I see a (5,7) block of values from arr2, skipping rows like [80, 100,...]
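The same check with the shapes from the question, as a small sketch: tab1 and tab2 here are made-up arrays of shape (21, 21) and (17, 17), filled so that the copied values are easy to recognize. The only requirement is that both sides of the assignment have the same shape.

```python
import numpy as np

tab1 = np.zeros((21, 21), dtype=int)
tab2 = np.arange(17 * 17).reshape(17, 17)

# Rows 7, 9, 11, 13, 15 of tab2 -> rows 15..19 of tab1; both blocks are (5, 7).
tab1[15:20, 0:7] = tab2[7:17:2, 0:7]

# Stepping along both axes: both blocks are (5, 5).
tab1[15:20, 15:20] = tab2[7:17:2, 7:17:2]

print(tab1[15:20, 0:7])
print(tab1[15:20, 15:20])
```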
I have a small matrix A with dimensions MxNxO
I have a large matrix B with dimensions KxMxNxP, with P>O
I have a vector ind of indices of dimension Ox1
I want to do:
B[1,:,:,ind] = A
But the left-hand side of my assignment,
B[1,:,:,ind].shape
has dimensions Ox1xMxN, and therefore I cannot broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug: when advanced indices (here the scalar 1 and ind) are separated by slices, the dimension produced by the advanced indexing is placed first and the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The 3,4 dimensions have been placed after the ind, 5.
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],
[[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It's come up in other SO questions, though not recently.
weird result when using both slice indexing and boolean indexing on a 3d array
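Another option is to leave the indexing alone and rearrange A to match the shape the left-hand side actually has. A minimal sketch, assuming ind is a 1-D list (so the target shape is (O, M, N)) and using made-up sizes with P > O:

```python
import numpy as np

K, M, N, P = 2, 3, 4, 6
ind = [0, 2, 5]  # O = 3 positions along the last axis of B
A = np.arange(M * N * len(ind)).reshape(M, N, len(ind))
B = np.zeros((K, M, N, P), dtype=int)

# B[1, :, :, ind] has shape (O, M, N), so move A's last axis to the front.
B[1, :, :, ind] = np.moveaxis(A, -1, 0)

print(B[1, :, :, ind].shape)
print(np.array_equal(B[1][:, :, ind], A))
```

This form also works when the first index is not a scalar, as long as the rearranged right-hand side matches or broadcasts to the indexed shape.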