Changing the value of a Numpy Array based on a probability and the value itself - numpy

I have a 2d Numpy Array:
a = np.reshape(np.random.choice(['b','c'],4), (2,2))
I want to go through each element and, with probability p=0.2, change it. If the element is 'b' I want to change it to 'c' and vice versa.
I've tried all sorts (looping through with enumerate, where statements) but I can't seem to figure it out.
Any help would be appreciated.

You could generate a random mask with the wanted probability and use it to swap the values on a subset of the array:
# select 20% of cells
mask = np.random.choice([True, False], a.shape, p=[0.2, 0.8])
# swap the values for those
a[mask] = np.where(a=='b', 'c', 'b')[mask]
Example output:
array([['b', 'c'],
       ['c', 'c']], dtype='<U1')
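For reference, here is a self-contained version of the same idea; the seed is arbitrary and only makes the run reproducible:

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
a = rng.choice(['b', 'c'], (2, 2))

# boolean mask that is True for ~20% of cells
mask = rng.choice([True, False], a.shape, p=[0.2, 0.8])
# fully swapped copy: every 'b' becomes 'c' and vice versa
swapped = np.where(a == 'b', 'c', 'b')
# apply the swap only where the mask selected a cell
a[mask] = swapped[mask]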


Sort an index at one level, partitioning by earlier levels

I'm using pandas to count the different types of errors and correct predictions for different (machine learning) models, in order to display confusion matrices.
A particular order of the prediction and ground truth labels makes sense, for example by putting the majority class 'B' first.
However, when I sort using pd.DataFrame.sort_index, the other index levels are also permuted. I'd like to sort the second level per unique value of the first index.
errors = pd.DataFrame([
    {'model': model, 'ground truth': ground_truth, 'prediction': prediction,
     'count': np.random.randint(0, (10000 if prediction=='B' else 1000) if prediction==ground_truth else 100)}
    for model in ['foo', 'bar']
    for prediction in 'ABC'
    for ground_truth in 'ABC'
])
def sort_index(index):
    return index.map('BCA'.index)

errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int).sort_index(level=1, key=sort_index)[['B', 'C', 'A']]
One workaround is to sort by all earlier levels as well, but it's quite verbose. It's awkward to apply one key function across all levels, as if they were all semantically the same. Moreover, this also rearranges the order of the models, which isn't necessarily wanted. Finally, it wastes compute in two ways: sorting smaller partitions is faster, since sorting scales super-linearly, and element comparisons are slower when more levels are considered.
def sort_index(index):
    if index.name == 'ground truth':
        return index.map('BCA'.index)
    return index

errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int).sort_index(level=[0, 1], key=sort_index)[['B', 'C', 'A']]
Is there a clean way to sort on higher index levels, keeping the earlier levels tied together?
You might want to use the reindex method: with level=1 it reorders just that level within each group defined by the earlier levels, leaving the model order untouched.
Code:
import numpy as np
import pandas as pd
# Create a sample dataframe
errors = pd.DataFrame([
    {'model': model, 'ground truth': ground_truth, 'prediction': prediction,
     'count': np.random.randint(0, (10000 if prediction=='B' else 1000) if prediction==ground_truth else 100)}
    for model in ['foo', 'bar']
    for prediction in 'ABC'
    for ground_truth in 'ABC'
])
# Pivot and reindex the dataframe
errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int).reindex(['B', 'C', 'A'], level=1)[['B', 'C', 'A']]
Output: the pivoted table, with the ground-truth rows ordered B, C, A within each model.
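For reuse, the same call can be wrapped in a small helper (the name reorder_level is made up here):

def reorder_level(df, order, level):
    """Reorder one index level without disturbing the outer levels."""
    return df.reindex(order, level=level)

pivoted = errors.pivot(
    index=['model', 'ground truth'],
    columns=['prediction'],
    values='count'
).fillna(0).astype(int)
reorder_level(pivoted, ['B', 'C', 'A'], level='ground truth')[['B', 'C', 'A']]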

How to interpolate a 5 dimensional array?

I have an array of shape [41, 101, 6, 4, 280]. I want to interpolate it so that, given a temperature (the first axis has 41 values) and a density (the second axis has 101 values), it spits out an array of shape [6, 4, 280]. Is there a NumPy function that can deal with this?
Let's start step by step:
Q : Is there a NumPy function that can deal with this?
Yes, there is.
The first step is to create a 5D numpy.ndarray instance that will hold your known data points (don't mind the dtype; it just reminds us that anything from bits up to complex128 values can go here if needed):
>>> import numpy as np
>>>
>>> a5Dtensor = np.empty((41, 101, 6, 4, 280), dtype=np.uint8)
Now, let's validate its .shape:
>>> a5Dtensor.shape
(41, 101, 6, 4, 280)
The core trick is NumPy's built-in slicing:
>>> a5Dtensor[0,0,:,:,:].shape
(6, 4, 280)
This indeed returns the requested 3D-cube of data-points.
Slicing is also efficient in that it produces no new memory allocations (which matters once sizes grow beyond the L1/L2/L3 CPU-cache horizons, and even more once you get beyond a few GB of data):
>>> a5Dtensor[0,0,:,:,:].flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False <------ may enjoy FORTRAN efficient data layout, where needed
OWNDATA : False <------ 3D-cube data not "copied", rather "viewed" inside 5D
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Last but not least: if aTemperatureVALUE and aDensityVALUE are not indices into the 5D array but data values between which you want to interpolate the 3D cube of values, NumPy can serve with piecewise-linear interpolation (with some constraints). But producing each element of the result (the 3D cube of interpolated values) requires a 2D interpolation per 3D-cube coordinate, based on the values held at the nearest lower and upper temperature and density grid points in the original 5D data.
There are other smart tools for this in NumPy (the .meshgrid() method, .argwhere() and others), yet pre-sorting or indirect indexing may be needed if the original 5D data points are not already sorted along the first two dimensions; sorted axes make the sought-after 2D (temperature, density) interpolator much easier to build (be it tailor-made for dtype uint8, float64, complex128 or object).
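To make that last paragraph concrete, here is a minimal bilinear sketch. It assumes the 41 temperatures and 101 densities lie on known, sorted 1-D grids; temps, dens and the random data below are made-up placeholders:

import numpy as np

# hypothetical grid coordinates -- substitute your real temperature/density axes
temps = np.linspace(100.0, 500.0, 41)
dens = np.linspace(0.1, 10.0, 101)
data = np.random.rand(41, 101, 6, 4, 280)  # stands in for the real 5D data

def interp_cube(T, D):
    # locate the grid cell containing (T, D)
    i = np.clip(np.searchsorted(temps, T) - 1, 0, len(temps) - 2)
    j = np.clip(np.searchsorted(dens, D) - 1, 0, len(dens) - 2)
    # fractional position inside the cell
    tT = (T - temps[i]) / (temps[i + 1] - temps[i])
    tD = (D - dens[j]) / (dens[j + 1] - dens[j])
    # weighted sum of the four surrounding 3D cubes -> shape (6, 4, 280)
    return ((1 - tT) * (1 - tD) * data[i, j]
            + tT * (1 - tD) * data[i + 1, j]
            + (1 - tT) * tD * data[i, j + 1]
            + tT * tD * data[i + 1, j + 1])

cube = interp_cube(250.0, 3.7)  # cube.shape == (6, 4, 280)

If SciPy is an option, scipy.interpolate.RegularGridInterpolator implements the same idea without the hand-rolled arithmetic.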

Colors for pandas bar chart [duplicate]

I have a pandas dataframe with positive and negative values and want to plot it as a bar chart.
I want to plot the positive values 'green' and the negative values 'red' (very original... lol).
I'm not sure how to pass something like: if > 0 then 'green', else 'red'.
data = pd.DataFrame([[-15], [10], [8], [-4.5]],
                    index=['a', 'b', 'c', 'd'],
                    columns=['values'])
data.plot(kind='barh')
I would create a dummy column for whether the observation is larger than 0.
In [39]: data['positive'] = data['values'] > 0

In [40]: data
Out[40]:
   values  positive
a   -15.0     False
b    10.0      True
c     8.0      True
d    -4.5     False

[4 rows x 2 columns]

In [41]: data['values'].plot(kind='barh',
                             color=data.positive.map({True: 'g', False: 'r'}))
Also, you may want to be careful not to have column names that overlap with DataFrame attributes: DataFrame.values gives the underlying numpy array for a DataFrame, so an overlapping name prevents you from using the df.<column name> syntax.
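A quick illustration of that caveat:

import pandas as pd

data = pd.DataFrame({'values': [-15, 10]})
data.values       # the underlying numpy array, NOT the 'values' column
data['values']    # bracket access always reaches the column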
If you want to avoid adding a column, you can do TomAugspurger's solution in one step:
data['values'].plot(kind='barh',
                    color=(data['values'] > 0).map({True: 'g', False: 'r'}))
Define

def bar_color(df, color1, color2):
    return np.where(df.values > 0, color1, color2).T

then

data.plot.barh(color=bar_color(data, 'g', 'r'))

gives a bar chart with positive bars in green and negative bars in red (note the argument order: color1 is used where the value is positive).
It also works for multiple bar series:

df = pd.DataFrame(np.random.randint(-10, 10, (4, 6)))
df.plot.barh(color=bar_color(df, 'g', 'r'))

gives the same per-bar coloring for every series.
Drawing on @Max Ghenis's answer (which doesn't work for me, but that seems to be down to a minor change in the packages):
tseries = data['values']
color = (tseries > 0).apply(lambda x: 'g' if x else 'r')
splot = tseries.plot.barh(color=color)
gives what you expect to see: positive bars in green, negative bars in red.

How do I apply a non-integer -> integer dict to a numpy array?

Say I have these two arrays:
dictionary = np.array(['a', 'b', 'c'])
array = np.array([['a', 'a', 'c'], ['b', 'b', 'c']])
And I'd like to replace every element in array with the index of its value in dictionary. So:
for index, value in enumerate(dictionary):
    array[array == value] = index
array = array.astype(int)
To get:
array([[0, 0, 2],
       [1, 1, 2]])
Is there a vectorized way to do this? I know that if array already contained indices and I wanted the strings in dictionary, I could just do dictionary[array]. But I effectively need a "lookup" of strings here.
(I also see this answer, but wondering if something new were available since 2010.)
If your dictionary is sorted, and dictionary and array contain the same elements, np.unique does the trick
uniq, inv = np.unique(array, return_inverse=True)
result = inv.reshape(array.shape)
If some elements are missing in array:
uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
result = inv[len(dictionary):].reshape(array.shape)
General case:
uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
back = np.empty_like(inv[:len(dictionary)])
back[inv[:len(dictionary)]] = np.arange(len(dictionary))
result = back[inv[len(dictionary):]].reshape(array.shape)
Explanation: np.unique, in the form used here, returns the unique elements in sorted order plus the indices into this sorted list for each element of the argument. So to get indices into the original dictionary we need to remap. We know that uniq[inv[:len(dictionary)]] == dictionary. Therefore we must solve X[inv[:len(dictionary)]] == np.arange(len(dictionary)) for X, which is what the code does.
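A quick check with the arrays from the question (dictionary happens to be sorted here, but the remapping works for unsorted dictionaries too):

import numpy as np

dictionary = np.array(['a', 'b', 'c'])
array = np.array([['a', 'a', 'c'], ['b', 'b', 'c']])

uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
back = np.empty_like(inv[:len(dictionary)])
back[inv[:len(dictionary)]] = np.arange(len(dictionary))
result = back[inv[len(dictionary):]].reshape(array.shape)
# result -> array([[0, 0, 2],
#                  [1, 1, 2]])

For the sorted case, np.searchsorted(dictionary, array) returns the same indices in one call.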

Numpy index array of unknown dimensions?

I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I know neither the shape of a nor of b, besides that
len(np.shape(a)) == len(np.shape(b)) - 1
and I also don't know which dimension of b to skip? I'd like to use np.index_exp, but that does not seem to help me...
def compare_arrays(a, b, skip_row):
    u = np.index_exp[ ... ]
    assert(a[:] == b[u])
Edit
Or to put it otherwise: I want to construct the slicing, given the shape of the array and the dimension I want to omit. How do I dynamically create the np.index_exp if I know the number of dimensions and the positions where to put ":" and where to put "0"?
I was just looking at the code for apply_along_axis and apply_over_axes, studying how they construct indexing objects.
Lets make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list * replicate)
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[tuple(ind)].shape # same as b[:,:,:,:]; modern NumPy requires a tuple here
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2 # replace one slice with an integer index
In [359]: b[tuple(ind)].shape # indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[tuple(ind)]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
I don't quite follow your question about the use of :. Note that when building an indexing list I use slice(None). The interpreter translates every indexing : into a slice object: [start:stop:step] => slice(start, stop, step).
Usually you don't need a[:] == b[0]; a == b[0] is sufficient. With lists, alist[:] makes a copy; with arrays it does nothing extra (unless it's the target of an assignment, a[:] = ...).
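Putting it together, here is a sketch of the compare_arrays helper from the question (the index argument defaulting to 0 is an assumption; pick whichever position along the skipped dimension you need):

import numpy as np

def compare_arrays(a, b, skip_dim, index=0):
    # one ':' (slice(None)) per dimension of b
    ind = [slice(None)] * b.ndim
    # replace the skipped dimension with an integer index
    ind[skip_dim] = index
    assert np.array_equal(a, b[tuple(ind)])

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
compare_arrays(a, b, skip_dim=0)  # passes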