Numpy index array of unknown dimensions? - numpy

I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I do not know either the shape of a and b, besides that
len(shape(a)) == len(shape(b)) - 1
and neither do I know which dimension to skip from b. I'd like to use np.index_exp, but that does not seem to help me ...
def compare_arrays(a,b,skip_row):
u = np.index_exp[ ... ]
assert(a[:] == b[u])
Edit
Or to put it otherwise, I wan't to construct slicing if I know the shape of the array and the dimension I want to miss. How do I dynamically create the np.index_exp, if I know the number of dimensions and positions, where to put ":" and where to put "0".

I was just looking at the code for apply_along_axis and apply_over_axis, studying how they construct indexing objects.
Lets make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list * replicate)
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[ind].shape # same as b[:,:,:,:]
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2 # replace one slice with index
In [359]: b[ind].shape # a slice, indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[ind]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
I don't quite follow your questions about the use of :. Note that when building an indexing list I use slice(None). The interpreter translates all indexing : into slice objects: [start:stop:step] => slice(start, stop, step).
Usually you don't need to use a[:]==b[0]; a==b[0] is sufficient. With lists alist[:] makes a copy, with arrays it does nothing (unless used on the RHS, a[:]=...).

Related

Boolean indexing with 2D arrays

I have two arrays, a and b, one 2D and one 1D, containing values of two related quantities that are filled in the same order, such that a[0] is related to b[0] and so on.
I would like to access the element of b where a is equal to a given value, where the value is a 1D array itself.
For example
a=np.array([[0,0],[0,1],[1,0],[1,1]])
b=np.array([0, 7, 9, 4])
value = np.array([0,1])
In 1D cases I could use boolean indexing easily and do
b[a==value]
The result I want is 7.
But in this case, it does not work because it checks each element of b in the comparison, instead of checking subarrays...
Is there a quick way to do this?
The question doesn't seem to match the example, but this returns [7]:
b[(a == value).all(axis=-1)]

How to perform matching between two sequences?

I have two mini-batch of sequences :
a = C.sequence.input_variable((10))
b = C.sequence.input_variable((10))
Both a and b have variable-length sequences.
I want to do matching between them where matching is defined as: match (eg. dot product) token at each time step of a with token at every time step of b .
How can I do this?
I have mostly answered this on github but to be consistent with SO rules, I am including a response here. In case of something simple like a dot product you can take advantage of the fact that it factorizes nicely, so the following code works
axisa = C.Axis.new_unique_dynamic_axis('a')
axisb = C.Axis.new_unique_dynamic_axis('b')
a = C.sequence.input_variable(1, sequence_axis=axisa)
b = C.sequence.input_variable(1, sequence_axis=axisb)
c = C.sequence.broadcast_as(C.sequence.reduce_sum(a), b) * b
c.eval({a: [[1, 2, 3],[4, 5]], b: [[6, 7], [8]]})
[array([[ 36.],
[ 42.]], dtype=float32), array([[ 72.]], dtype=float32)]
In the general case you need the following steps
static_b, mask = C.sequence.unpack(b, neutral_value).outputs
scores = your_score(a, static_b)
The first line will convert the b sequence into a static tensor with one more axis than b. Because of packing, some elements of this tensor will be invalid and those will be indicated by the mask. The neutral_value will be placed as a dummy value in the static_b tensor wherever data was missing. Depending on your score you might be able to arrange for the neutral_value to not affect the final score (e.g. if your score is a dot product a 0 would be a good choice, if it involves a softmax -infinity or something close to that would be a good choice). The second line can now have access to each element of a and all the elements of b as the first axis of static_b. For a dot product static_b is a matrix and one element of a is a vector so a matrix vector multiplication will result in a sequence whose elements are all inner products between the corresponding element of a and all elements of b.

How to find if any column in an array has duplicate values

Let's say I have a numpy matrix A
A = array([[ 0.5, 0.5, 3.7],
[ 3.8, 2.7, 3.7],
[ 3.3, 1.0, 0.2]])
I would like to know if there is at least two rows i and i' such that A[i, j]=A[i', j] for some column j?
In the example A, i=0 and i'=1 for j=2 and the answer is yes.
How can I do this?
I tried this:
def test(A, n):
for j in range(n):
i = 0
while i < n:
a = A[i, j]
for s in range(i+1, n):
if A[s, j] == a:
return True
i += 1
return False
Is there a faster/better way?
There are a number of ways of checking for duplicates. The idea is to use as few loops in the Python code as possible to do this. I will present a couple of ways here:
Use np.unique. You would still have to loop over the columns since it wouldn't make sense for unique to accept an axis argument because each column could have a different number of unique elements. While it still requires a loop, unique allows you to find the positions and other stats of repeated elements:
def test(A):
for i in range(A.shape[1]):
if np.unique(A[:, i]).size < A.shape[0]:
return True
return False
With this method, you basically check if the number of unique elements in a column is equal to the size of the column. If not, there are duplicates.
Use np.sort, np.diff and np.any. This is a fully vectorized solution that does not require any loops because you can specify an axis for each of these functions:
def test(A):
return np.any(diff(np.sort(A, axis=0), axis=0) == 0)
This literally reads "if any of the column-wise differences in the column-wise sorted array are zero, return True". A zero difference in the sorted array means that there are identical elements. axis=0 makes sort and diff operate on each column individually.
You never need to pass in n since the size of the matrix is encoded in the attribute shape. If you need to look at the subset of a matrix, just pass in the subset using indexing. It will not copy the data, just return a view object with the required dimensions.
A solution without numpy would look like this: First, swap columns and rows with zip()
zipped = zip(*A)
then check if any now row has any duplicates. You can check for duplicates by turning a list into a set, which discards duplicates, and check the length.
has_duplicates = any(len(set(row)) != len(row) for row in zip(*A))
Most probably way slower and also more memory intensive than the pure numpy solution, but this may help for clarity

initialize pandas SparseArray

Is it possible to initialize a pandas SparseArray by providing only the dense entries? I could not figure this out from the documentation: http://pandas.pydata.org/pandas-docs/stable/sparse.html .
For example, say I want a length 1000 SparseArray with a one at index 9 and zeros everywhere else, how would I go about creating it? This is one way:
a = [0] * 1000
a[9] = 1
sparse_a = pd.SparseArray(data=a, fill_value=0)
But, in the above, we have to create the dense array before the sparse one. Is there a way to specify only the indices and the dense entries to create the SparseArray directly?
A length 10 SparseArray with a one at index 9 and zeros everywhere else:
pd.SparseArray(1, index= range(1), kind='block',
sparse_index= BlockIndex(10, [8], [1]),
fill_value=0)
Notes:
index could be any list as long as its length is equal to all non-sparsed part of the array (the smaller part of the data), in this case, number of 1 in the sparse array
BlockIndex(10, [8], [1]) is the object pointing to the positions of the non-parsed part of the data where the first argument is the TOTAL length of the array (sparse + non-sparse), the second argument is a list of starting positions of the non-sparse data and the third argument is a list of how long each block of non-sparse lasts. Notice: that the length of the array mentioned in point 1 is the sum of all elements of the list in the third argument of this BlockIndex
So a more general example is: to make a length 20 SparseArray where the 2nd, 3rd, 6th,7th,8th elements are 1 and the rest is 0 is:
pd.SparseArray(1, index= range(5), kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
or
pd.SparseArray(1, index= [None, 3, 2, 7, np.inf], kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
Sadly, I don't know any good way to specify an array of non-sparsed data as the first argument for SparseArray-- it does not mean that it can't be done, this is only a disclaimer. I think as long as you specify index=... pandas will require a scalar for the first argument (the data).
Tested on Windows 7, pandas version 0.20.2 installed by Aconda.

Looking for built-in, invertible, list-of-list-accepting constructor/deconstructor pair for pandas dataframes

Are there built-in ways to construct/deconstruct a dataframe from/to a Python list-of-Python-lists?
As far as the constructor (let's call it make_df for now) that I'm looking for goes, I want to be able to write the initialization of a dataframe from literal values, including columns of arbitrary types, in an easily-readable form, like this:
df = make_df([[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]],
['d', 'i'])
For the deconstructor, I want to essentially recover from a dataframe df the arguments one would need to pass to such make_df to re-create df.
AFAIK,
officially at least, the pandas.DataFrame constructor accepts only a numpy ndarray, a dict, or another DataFrame (and not a simple Python list-of-lists) as its first argument;
the pandas.DataFrame.values property does not preserve the original data types.
I can roll my own functions to do this (e.g., see below), but I would prefer to stick to built-in methods, if available. (The Pandas API is pretty big, and some of its names not what I would expect, so it is quite possible that I have missed one or both of these functions.)
FWIW, below is a hand-rolled version of what I described above, minimally tested. (I doubt that it would be able to handle every possible corner-case.)
import pandas as pd
import collections as co
import pandas.util.testing as pdt
def make_df(values, columns):
return pd.DataFrame(co.OrderedDict([(columns[i],
[row[i] for row in values])
for i in range(len(columns))]))
def unmake_df(dataframe):
columns = list(dataframe.columns)
return ([[dataframe[c][i] for c in columns] for i in dataframe.index],
columns)
values = [[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]]
columns = ['d', 'i']
df = make_df(values, columns)
Here's what the output of the call to make_df above produced:
>>> df
d i
0 9.750 1
1 6.375 2
2 9.000 3
3 0.250 1
4 1.875 2
5 3.750 3
6 8.625 1
A simple check of the round-trip1:
>>> df == make_df(*unmake_df(df))
True
>>> (values, columns) == unmake_df(make_df(*(values, columns)))
True
BTW, this is an example of the loss of the original values' types:
>>> df.values
array([[ 9.75 , 1. ],
[ 6.375, 2. ],
[ 9. , 3. ],
[ 0.25 , 1. ],
[ 1.875, 2. ],
[ 3.75 , 3. ],
[ 8.625, 1. ]])
Notice how the values in the second column are no longer integers, as they were originally.
Hence,
>>> df == make_df(df.values, columns)
False
1 In order to be able to use == to test for equality between dataframes above, I resorted to a little monkey-patching:
def pd_DataFrame___eq__(self, other):
try:
pdt.assert_frame_equal(self, other,
check_index_type=True,
check_column_type=True,
check_frame_type=True)
except:
return False
else:
return True
pd.DataFrame.__eq__ = pd_DataFrame___eq__
Without this hack, expressions of the form dataframe_0 == dataframe_1 would have evaluated to dataframe objects, not simple boolean values.
I'm not sure what documentation you are reading, because the link you give explicitly says that the default constructor accepts other list-like objects (one of which is a list of lists).
In [6]: pandas.DataFrame([['a', 1], ['b', 2]])
Out[6]:
0 1
0 a 1
1 b 2
[2 rows x 2 columns]
In [7]: t = pandas.DataFrame([['a', 1], ['b', 2]])
In [8]: t.to_dict()
Out[8]: {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
Notice that I use to_dict at the end, rather than trying to get back the original list of lists. This is because it is an ill-posed problem to get the list arguments back (unless you make an overkill decorator or something to actually store the ordered arguments that the constructor was called with).
The reason is that a pandas DataFrame, by default, is not an ordered data structure, at least in the column dimension. You could have permuted the order of the column data at construction time, and you would get the "same" DataFrame.
Since there can be many differing notions of equality between two DataFrame (e.g. same columns even including type, or just same named columns, or some columns and in same order, or just same columns in mixed order, etc.) -- pandas defaults to trying to be the least specific about it (Python's principle of least astonishment).
So it would not be good design for the default or built-in constructors to choose an overly specific idea of equality for the purposes of returning the DataFrame back down to its arguments.
For that reason, using to_dict is better since the resulting keys will encode the column information, and you can choose to check for column types or ordering however you want to for your own application. You can even discard the keys by iterating the dict and simply pumping the contents into a list of lists if you really want to.
In other words, because order might not matter among the columns, the "inverse" of the list-of-list constructor maps backwards into a bigger set, namely all the permutations of the same column data. So the inverse you're looking for is not well-defined without assuming more structure -- and casual users of a DataFrame might not want or need to make those extra assumptions to get the invertibility.
As mentioned elsewhere, you should use DataFrame.equals to do equality checking among DataFrames. The function has many options that allow you specify the specific kind of equality testing that makes sense for your application, while leaving the default version as a reasonably generic set of options.