How do I apply a non-integer -> integer dict to a numpy array?

Say I have these two arrays:
dictionary = np.array(['a', 'b', 'c'])
array = np.array([['a', 'a', 'c'], ['b', 'b', 'c']])
And I'd like to replace every element in array with the index of its value in dictionary. So:
for index, value in enumerate(dictionary):
    array[array == value] = index
array = array.astype(int)
To get:
array([[0, 0, 2],
       [1, 1, 2]])
Is there a vectorized way to do this? I know that if array already contained indices and I wanted the strings in dictionary, I could just do dictionary[array]. But I effectively need a "lookup" of strings here.
(I also see this answer, but I'm wondering whether anything new has become available since 2010.)

If your dictionary is sorted, and dictionary and array contain the same set of elements, np.unique does the trick:
uniq, inv = np.unique(array, return_inverse=True)
result = inv.reshape(array.shape)
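With the arrays from the question (dictionary is already sorted and every one of its elements occurs in array), this reproduces the desired output:
result
# array([[0, 0, 2],
#        [1, 1, 2]])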
If some elements are missing in array:
uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
result = inv[len(dictionary):].reshape(array.shape)
General case:
uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
back = np.empty_like(inv[:len(dictionary)])
back[inv[:len(dictionary)]] = np.arange(len(dictionary))
result = back[inv[len(dictionary):]].reshape(array.shape)
Explanation: np.unique, used in this form, returns the unique elements in sorted order together with, for each element of the argument, its index into that sorted list. So to get indices into the original (possibly unsorted) dictionary we need to remap the indices. We know that uniq[inv[:len(dictionary)]] == dictionary. Therefore we must solve X[inv[:len(dictionary)]] == np.arange(len(dictionary)) for X, which is exactly what the code does.
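As a sanity check of the general case, here is a worked example of my own (with a deliberately unsorted dictionary, so the remapping actually matters):
import numpy as np

dictionary = np.array(['c', 'a', 'b'])  # unsorted on purpose
array = np.array([['a', 'a', 'c'], ['b', 'b', 'c']])

uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
back = np.empty_like(inv[:len(dictionary)])
back[inv[:len(dictionary)]] = np.arange(len(dictionary))
result = back[inv[len(dictionary):]].reshape(array.shape)
# result == [[1, 1, 0], [2, 2, 0]], and dictionary[result] reproduces array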

Related

Changing the value of a Numpy Array based on a probability and the value itself

I have a 2d Numpy Array:
a = np.reshape(np.random.choice(['b','c'],4), (2,2))
I want to go through each element and, with probability p=0.2, change it: if the element is 'b' I want to change it to 'c', and vice versa.
I've tried all sorts (looping through with enumerate, where statements) but I can't seem to figure it out.
Any help would be appreciated.
You could generate a random mask with the wanted probability and use it to swap the values on a subset of the array:
# select 20% of cells
mask = np.random.choice([True, False], a.shape, p=[0.2, 0.8])
# swap the values for those
a[mask] = np.where(a=='b', 'c', 'b')[mask]
Example output:
array([['b', 'c'],
       ['c', 'c']], dtype='<U1')
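Note that np.where(a == 'b', 'c', 'b') computes the fully swapped array first; the mask then picks which cells actually receive the swapped value. As a minor variant of my own (not from the original answer), the mask can also be built from uniform samples, which avoids spelling out both probabilities:
# each cell is True with probability 0.2
mask = np.random.random(a.shape) < 0.2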

Looping through a dictionary of dataframes and counting a column

I am wondering if anyone can help. I have a number of dataframes stored in a dictionary, and I simply want to access each of these dataframes and count the values in one column. In that column I have 10 letters; in the first dataframe there are 5 b's and 5 a's, so for that one I would expect the count output to be a = 5 and b = 5. However, for each dataframe this count would be different, hence I would like to store the output of these counts either into another dictionary or into separate variables.
The dictionary is called Dict and the column name in all the dataframes is called letters. I have tried to do this by accessing the keys in the dictionary but cannot get it to work. A section of what I have tried is shown below.
import pandas as pd
for key in Dict:
    Count = pd.value_counts(key['letters'])
Ideally Count would change with each new count output, so each result could be stored in a new variable.
A simplified example (the actual dataframe sizes are at most 5000 × 63) of one of the 14 dataframes in the dictionary would be:
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'letters': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b']}
df = pd.DataFrame(data=d)
The other dataframes are named df2, df3, df4, etc.
I hope that makes sense. Any help would be much appreciated.
Thanks
If you want to access both keys and values when iterating over a dictionary, you should use the items method.
You could use another dictionary to store the results:
letter_counts = {}
for key, value in Dict.items():
    letter_counts[key] = value["letters"].value_counts()
You could also use a dictionary comprehension to do this in one line:
letter_counts = {key: value["letters"].value_counts() for key, value in Dict.items()}
The easiest thing is probably a dictionary comprehension. Note that you want value_counts() (counts per letter), not count() (which only counts non-null entries):
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'letters': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b']}
d2 = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'letters': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b', 'a']}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(d2)
df_dict = {'d': df, 'd2': df2}
new_dict = {k: v['letters'].value_counts().to_dict() for k, v in df_dict.items()}
# out
{'d': {'a': 5, 'b': 5}, 'd2': {'a': 6, 'b': 5}}
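If you later want all the counts side by side, a small extension of my own (not from the original answers) is to let pandas align the per-letter Series in a single frame:
counts_df = pd.DataFrame({k: v['letters'].value_counts() for k, v in df_dict.items()})
# one row per letter, one column per dictionary key; letters missing
# from a dataframe show up as NaN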

numpy array.shape behaviour

For the following:
d = np.array([[0,1,4,3,2],[10,18,4,7,5]])
print(d.shape)
Output is:
(2, 5)
This is expected.
But for this (where the rows have different numbers of elements):
d = np.array([[0,1,4,3,2],[10,18,4,7]])
print(d.shape)
Output is:
(2,)
How to explain this behaviour?
Short answer: NumPy parses it as an array of two objects: two lists.
NumPy is designed to process "rectangular" data. If you pass it non-rectangular (ragged) data, the np.array(..) function falls back to treating it as a list of objects.
Indeed, take a look at the dtype of the array here:
>>> d
array([list([0, 1, 4, 3, 2]), list([10, 18, 4, 7])], dtype=object)
It is a one-dimensional array that contains two items, each a plain Python list. These lists are simply treated as objects.
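One caveat worth knowing (current NumPy behaviour, not part of the original answer): creating such a ragged array implicitly was deprecated and, since NumPy 1.24, raises a ValueError. You now have to opt in explicitly:
# explicit dtype=object restores the old fallback behaviour
d = np.array([[0, 1, 4, 3, 2], [10, 18, 4, 7]], dtype=object)
print(d.shape)  # (2,)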

Slicing a numpy array and passing the slice to a function

I want to have a function that can operate on either a row or a column of a 2D ndarray. Assume the array has C order. The function changes values in the 2D data.
Inside the function I want to have identical index syntax whether it is called with a row or a column. A row slice is [n, :] and a column slice is [:, n], so they have different shapes, which seems to require different indexing expressions inside the function.
Is there a way to do this that does not require moving or allocating memory? I am under the impression that reshape will force a copy to make the data contiguous. Is there a way to use nditer in the function?
Do you mean like this:
In [74]: def foo(arr, n):
    ...:     arr += n
    ...:
In [75]: arr = np.ones((2,3),int)
In [76]: foo(arr[0,:],1)
In [77]: arr
Out[77]:
array([[2, 2, 2],
       [1, 1, 1]])
In [78]: foo(arr[:,1],[100,200])
In [79]: arr
Out[79]:
array([[  2, 102,   2],
       [  1, 201,   1]])
In the first case I'm adding 1 to one row of the array, i.e. a row slice. In the second case I'm adding an array (list) to a column; in that case n has to have the right length.
Usually we don't worry about whether the values are C contiguous. Striding takes care of access either way.
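The key point is that arr[0, :] and arr[:, 1] are both one-dimensional views into the same buffer, so the function body can index them identically. A quick check of my own (reusing arr from above):
print(arr[:, 1].base is arr)  # True -- a view, no copy or allocation
print(arr[0, :].shape, arr[:, 1].shape)  # (3,) (2,) -- both 1-D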

Get indices for values of one array in another array

I have two 1D arrays containing the same set of values but in a different (random) order. I want to find the list of indices that reorders one array to match the other. For example, my two arrays are:
ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])
and I want the list order for which new[order] == ref.
My current idea is:
def find(val):
return numpy.argmin(numpy.absolute(ref-val))
order = sorted(range(new.size), key=lambda x:find(new[x]))
However, this only works as long as no values are repeated. In my example 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function find() does not track which 3 I am currently looking for.
So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?
Edit about the duplicate: the linked solution assumes that the arrays are ordered, or, for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which duplicate comes first, however, is not important (nor even decidable from the data provided).
What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A, B, sorter=sort_idx)] is [3, 0, 5, 1, 0, 2], but what I am looking for is [3, 0, 5, 1, 4, 2].
Given ref and new, which are shuffled versions of each other, we can get the unique indices that rearrange new into ref using the sorted version of both arrays and the invertibility of np.argsort.
Start with:
i = np.argsort(ref)
j = np.argsort(new)
Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:
k = np.argsort(i)
Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with
order = np.argsort(new)[np.argsort(np.argsort(ref))]
From your original example:
>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> order = np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order] # Should give ref
array([5, 3, 1, 2, 3, 4])
This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort:
def inverse_argsort(a):
    fwd = np.argsort(a)
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv
order = np.argsort(new)[inverse_argsort(ref)]
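As a quick sanity check of my own on the arrays from the question (not part of the original answer):
order = np.argsort(new)[inverse_argsort(ref)]
assert (new[order] == ref).all()  # order == [3, 0, 5, 1, 4, 2]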