How are pyplot histogram bins interpreted? - matplotlib

I am confused about the matplotlib hist function.
The documentation explains:
If a sequence of values, the values of the lower bound of the bins to be used.
But when I pass a sequence of two values, i.e. [0,1], I only get 1 bin.
And when I have three like so:
plt.hist(votes, bins=[0,1,2], density=True)  # normed= was removed in Matplotlib 3.1
I only get two bins. My guess is that the last value is just an upper bound for the last bin.
Is there a way to have "the rest" of the values in the last bin, other than to put a very big value there? (Or in other words, without making that bin much bigger than the others.)
It seems like the last bin value is included in the last bin:
votes = [0,0,1,2]
plt.hist(votes, bins=[0,1])
This gives me one bin of height 3, i.e. 0, 0, 1.
While:
votes = [0,0,1,2]
plt.hist(votes, bins=[0,1,2])
Gives me two bins with two in each.
I find it counterintuitive that adding a new bin changes the limits of the others.
votes = [0,0,1]
plt.hist(votes, bins=2)
yields two bins of size 2 and 1. These seem to have been split at 0.5, since the x-axis goes from 0 to 1.
How should the bins array be interpreted? How is the data split?

votes = [0, 0, 1, 2]
plt.hist(votes, bins=[0,1])
This gives you one bin of height 3, because the data is split into a single bin with the interval [0, 1]. The values 0, 0, and 1 fall into that bin; the value 2 lies outside it and is not counted.
votes = [0, 0, 1, 2]
plt.hist(votes, bins=[0, 1, 2])
This gives you a histogram whose bins are the intervals [0, 1[ and [1, 2],
so you have 2 items in the 1st bin (the 0 and 0), and 2 items in the 2nd bin (the 1 and 2).
If you try to plot:
plt.hist(votes, bins=[0, 1, 2, 3])
the idea behind the data splitting into bins is the same:
you will get three intervals:
[0, 1[; [1, 2[; [2, 3], and you will notice that the value 2 changes its bin, going to the bin with interval [2, 3] (instead of staying in the bin [1, 2] as in the previous example).
In conclusion, if you have an ordered array in the bins argument like:
[i_0, i_1, i_2, i_3, i_4, ..., i_n]
that will create the bins:
[i_0, i_1[
[i_1, i_2[
[i_2, i_3[
[i_3, i_4[
...
[i_(n-1), i_n]
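with the boundaries of each open or closed according to the brackets: [a, b[ includes a but excludes b, so every bin except the last excludes its right edge, and only the last bin is closed on both sides.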
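The counts quoted above can be checked without plotting. This is a minimal sketch that assumes only that plt.hist delegates the binning to np.histogram, so the same edge rules apply; the variable names are illustrative.

```python
import numpy as np

votes = [0, 0, 1, 2]

# One bin [0, 1]: the value 2 lies outside the edges and is dropped.
counts_one, _ = np.histogram(votes, bins=[0, 1])

# Two bins [0, 1[ and [1, 2]: the last bin is closed on the right.
counts_two, _ = np.histogram(votes, bins=[0, 1, 2])

# Three bins [0, 1[, [1, 2[ and [2, 3]: the value 2 moves to the last bin.
counts_three, _ = np.histogram(votes, bins=[0, 1, 2, 3])

# An integer asks for that many equal-width bins between min and max.
counts_int, edges_int = np.histogram([0, 0, 1], bins=2)

print(counts_one)    # [3]
print(counts_two)    # [2 2]
print(counts_three)  # [2 1 1]
print(counts_int, edges_int)  # [2 1] [0.  0.5 1. ]
```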

Related

Numpy Interpolation for Array of Arrays

I have an array of arrays that I want to interpolate based on each array's min and max.
For a simple m x n array with values ranging from 0 to 1, I can do this as follows:
x_inp=np.interp(x,(x.min(),x.max()),(0,0.7))
This rescales every existing value into the range 0 to 0.7. However, if I have an array of dimension 100 x m x n, the above method uses the global min/max and not the individual min/max of each m x n array.
Edit:
For example
x1=np.random.randint(0,5, size=(2, 4))
x2=np.random.randint(6,10, size=(2, 4))
my_list=[x1,x2]
my_array=np.asarray(my_list)
print(my_array)
>> array([[[1, 4, 3, 4],
           [3, 2, 0, 0]],
          [[9, 6, 8, 6],
           [8, 7, 6, 7]]])
my_array is now of dimension 2x2x4, and my_array.min() and my_array.max() give me 0 and 9. So if I interpolate, it won't work based on the min/max of the individual 2x4 arrays. What I want is for the interpolation to use a min/max of 0/4 for the first array and 6/9 for the second.
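One possible approach, sketched here rather than taken from the thread: np.interp with two reference points is a plain linear rescale, so the same result can be computed with broadcasting if the min and max are taken per 2-D slice. The helper name rescale_each is hypothetical, and the sketch assumes no slice is constant (otherwise max - min is zero).

```python
import numpy as np

def rescale_each(stack, lo=0.0, hi=0.7):
    """Linearly map each 2-D slice of a 3-D stack onto [lo, hi].

    Hypothetical helper; assumes every slice has max > min.
    """
    stack = np.asarray(stack, dtype=float)
    # keepdims=True keeps shape (k, 1, 1) so the result broadcasts per slice.
    mins = stack.min(axis=(1, 2), keepdims=True)
    maxs = stack.max(axis=(1, 2), keepdims=True)
    return lo + (stack - mins) / (maxs - mins) * (hi - lo)

my_array = np.array([[[1, 4, 3, 4],
                      [3, 2, 0, 0]],
                     [[9, 6, 8, 6],
                      [8, 7, 6, 7]]])

scaled = rescale_each(my_array)
print(scaled[0].min(), scaled[0].max())  # slice 0 uses its own range 0..4
print(scaled[1].min(), scaled[1].max())  # slice 1 uses its own range 6..9
```

Each slice should match what np.interp gives when called on that slice alone with its own min and max.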

Numpy, how to retrieve sub-array of array (specific indices)?

I have an array:
>>> arr1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
array([[1 2 3]
[4 5 6]
[7 8 9]])
I want to retrieve a list (or 1d-array) of elements of this array by giving a list of their indices, like so:
indices = [[0,0], [0,2], [2,0]]
print(arr1[indices])
# result
[1,6,7]
But it does not work. I have been looking for a solution for a while, but I only found ways to select per row and/or per column (not per specific indices).
Does anyone have any idea?
Cheers
Aymeric
First make indices an array instead of a nested list:
indices = np.array([[0,0], [0,2], [2,0]])
Then, index the first dimension of arr1 using the first values of indices, likewise the second:
arr1[indices[:,0], indices[:,1]]
It gives array([1, 3, 7]) (which is correct, your [1, 6, 7] example output is probably a typo).
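Put together as a runnable snippet, the indexing above looks like this. The second form, which unpacks the transposed index array into a tuple, is an equivalent spelling added here for illustration; the names picked and picked_alt are not from the answer.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([[0, 0], [0, 2], [2, 0]])

# Row indices and column indices are passed as two separate 1-D arrays.
picked = arr1[indices[:, 0], indices[:, 1]]

# Equivalent: one index array per axis, packed into a tuple.
picked_alt = arr1[tuple(indices.T)]

print(picked)      # [1 3 7]
print(picked_alt)  # [1 3 7]
```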

Can anyone explain this code's output?

1. I have tried to understand this code but I couldn't. Would you help me?
a = np.arange(5)
hist, bin_edges = np.histogram(a, density=True)
hist
2. Why is the output like this?
array([0.5, 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 0. , 0.5])
The default for the bins argument to np.histogram is 10. So the histogram counts which bins your array elements fall into. In this case a = np.array([0, 1, 2, 3, 4]). If we are creating a histogram with 10 bins then we break the interval 0-4 (inclusive) into 10 equal bins. This gives us (note that 11 end points gives us 10 bins):
np.linspace(0, 4, 11) = array([0. , 0.4, 0.8, 1.2, 1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. ])
We now just need to see which bins your elements in the array a fall into. We can count them as follows:
[1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
Now this is still not exactly what the output is. The density=True argument states (from the docs): "If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1."
Each bin (of height .5) has a width of .4 so 5 x .5 x .4 = 1 as is the requirement of this argument.
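The arithmetic above can be reproduced step by step. This is a small sketch; the names counts, widths and manual are introduced here only for illustration.

```python
import numpy as np

a = np.arange(5)

# Raw counts with the default of 10 equal-width bins over [0, 4].
counts, bin_edges = np.histogram(a)

# The same histogram, normalized to a density.
density, _ = np.histogram(a, density=True)

# Density by hand: count / (total count * bin width).
widths = np.diff(bin_edges)
manual = counts / (counts.sum() * widths)

print(counts)                    # [1 0 1 0 0 1 0 1 0 1]
print(np.allclose(density, manual))  # True
print((density * widths).sum())  # integrates to 1 (up to rounding)
```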
numpy.arange(5) generates a numpy array of 5 elements evenly spaced: array([0,1,2,3,4]).
np.histogram(a, density=True) returns the values and the bin edges of a histogram computed from your array a using 10 bins (which is the default value).
bin_edges gives the edges of the bins, while hist gives the number of occurrences in each bin. Given that you set density=True, the occurrences are normalized (the integral over the range is 1).
Look here for more information.
Please check this post. Hint: When you call np.histogram, the default bin value is 10, so that's why your output has 10 elements.

How do you create a "count" matrix from a series?

I have a Pandas series of lists of categorical variables. For example:
df = pd.Series([["A", "A", "B"], ["A", "C"]])
Note that in my case the series is pretty long (50K elements) and the number of possible distinct elements in the list is also big (20K elements).
I would like to obtain a matrix having a column for each distinct feature and its count as value. For the previous example, this means:
[[2, 1, 0], [1, 0, 1]]
This is the same output as the one obtained with OneHot encoding, except that it contains the count instead of just 1.
What is the best way to achieve this?
Let's try explode:
df.explode().groupby(level=0).value_counts().unstack(fill_value=0)
Output:
   A  B  C
0  2  1  0
1  1  0  1
To get the list of lists, chain the above with .values:
array([[2, 1, 0],
       [1, 0, 1]])
Note that you will end up with a 50K x 20K array.
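As a self-contained check of the explode approach on the example series (the names s, counts and matrix are illustrative; the series is called df in the question):

```python
import pandas as pd

s = pd.Series([["A", "A", "B"], ["A", "C"]])

# explode() gives one row per list element and repeats the original index;
# grouping on that index and counting values yields per-row counts.
counts = (
    s.explode()
     .groupby(level=0)
     .value_counts()
     .unstack(fill_value=0)
)

matrix = counts.to_numpy()
print(counts)
print(matrix)
```

For the sizes mentioned in the question, a dense 50K x 20K result of 8-byte integers takes roughly 8 GB, so it may be worth checking available memory before materializing it.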

Slice pandas.DataFrame's second Multiindex

I have a pandas Dataframe of the form
        "a"        "b"        "c"       #first level index
        0, 1, 2    0, 1, 2    0, 1, 2   #second level index
index
0       1,2,3      6,7,8      5,3,4
1       2,3,4      7,5,4      9,2,5
2       3,4,5      4,5,6      0,4,5
...
representing a spot (a, b or c) where a measurement took place and the results of the measurements (0, 1, 2) that took place on this spot.
I want to do the following:
pick a slice in the sample (say the first measurement on each spot at measurement 0)
mean each i-th measurement (mean("a"[0], "b"[0], "c"[0]), mean("a"[1], "b"[1], "c"[1]), ...)
I tried to get the hang of the pandas MultiIndex documentation but did not manage to slice on the second level.
This is the column index:
MultiIndex(levels=[['a', 'b', 'c', ... , 'y'], [0, 1, 2, ... , 49]],
labels=[[0, 0, 0, ... , 0, 1, 1, 1, ... 1, ..., 49, 49, 49, ... 49]])
And the index
Float64Index([204.477752686, 204.484664917, 204.491577148, ..., 868.723022461], dtype='float64', name='wavelength', length=43274)
Using
df[:][0]
yields a key-error (0 not in index)
df.iloc[0]
returns the horizontal slice
0 "a":(1,2,3), "b":(6,7,8), "c":(5,3,4)
but I would like to have
"a":(1,2,3), "b":(6,7,4), "c":(5,9,0)
THX for any help
PS: version:pandas-0.19, python-3.4
The trick was to specify the axis...
df.loc(axis=1)[:,0]
provides the 0-th measurement of each spot.
Since I use integers on the second level index, I am not sure if this actually yields the label "0" or just the 0-th measurement in the DataFrame, label-agnostic.
But for my use-case, this is actually sufficient.
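A small reconstruction of the example frame makes the behaviour easy to inspect. This sketch is an addition, not part of the original answer: .loc is label-based, so the slice below selects the label 0 on the second level rather than a position. The xs call and the transposed groupby are alternative spellings introduced here for illustration, the latter for the "mean of each i-th measurement" part of the question.

```python
import numpy as np
import pandas as pd

# Rebuild the 3-spot, 3-measurement example from the question.
columns = pd.MultiIndex.from_product([["a", "b", "c"], [0, 1, 2]])
data = [[1, 2, 3, 6, 7, 8, 5, 3, 4],
        [2, 3, 4, 7, 5, 4, 9, 2, 5],
        [3, 4, 5, 4, 5, 6, 0, 4, 5]]
df = pd.DataFrame(data, columns=columns)

# Label-based slice: every spot, second-level label 0.
first = df.loc(axis=1)[:, 0]

# Same selection via a cross-section; this drops the second level.
first_xs = df.xs(0, axis=1, level=1)

# Mean of each i-th measurement across spots: group the columns by the
# second level (done on the transpose, then transposed back).
means = df.T.groupby(level=1).mean().T

print(first)
print(first_xs)
print(means)
```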