I'm wondering if there is a numpythonic way of inverting a histogram back to an intensity signal.
For example:
>>> A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
>>> H, edge = np.histogram(A, bins=10, range=(0,10))
>>> np.sort(A)
[ 0 1 2 4 7 7 8 10]
>>> H
[1 1 1 0 1 0 0 2 1 1]
>>> edge
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
Is there a way to reconstruct the original A intensities using the H and edge? Of course, positional information will have been lost, but I'd just like to recover the intensities and relative number of occurrences.
I have this loopy way of doing it:
>>> reco = []
>>> for i, h in enumerate(H):
...     for _ in range(h):
...         reco.append(edge[i])
...
>>> reco
[0.0, 1.0, 2.0, 4.0, 7.0, 7.0, 8.0, 9.0]
# I've done something wrong with the right-most histogram bin, but we can ignore that for now
For large histograms, the loopy way is inefficient. Is there a vectorized equivalent of what I did in the loop? (my gut says that numpy.digitize will be involved..)
Sure, you can use np.repeat for this:
import numpy as np
A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
counts, edges = np.histogram(A, bins=10, range=(0,10))
print(np.repeat(edges[:-1], counts))
# [ 0. 1. 2. 4. 7. 7. 8. 9.]
Obviously it's impossible to recover the exact position of a value within a bin, since you lose that information in the process of generating the histogram. You could use the lower bin edge (as in the example above) or the upper one, or you could use the bin center, e.g.:
print(np.repeat((edges[:-1] + edges[1:]) / 2., counts))
# [ 0.5 1.5 2.5 4.5 7.5 7.5 8.5 9.5]
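As a quick sanity check (not part of the original answer), re-binning the reconstructed values with the same edges reproduces the original counts, so the intensities and their multiplicities are fully preserved:
import numpy as np
A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
counts, edges = np.histogram(A, bins=10, range=(0, 10))
reco = np.repeat(edges[:-1], counts)          # lower-edge reconstruction
counts2, _ = np.histogram(reco, bins=edges)   # histogram the reconstruction with the same edges
print(np.array_equal(counts, counts2))
# True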
This question focuses on pandas' own functions. There are existing solutions (pandas DataFrame: replace nan values with average of columns), but they rely on hand-written functions.
In SPSS there is the function MEAN.n, which gives you the mean of a list of numbers only when at least n elements of that list are valid (not pandas.NA). With that function you are able to impute missing values only if a minimum number of items are valid.
Is there a pandas function to do this?
Example
Values [1, 2, 3, 4, NA].
Mean of the valid values is 2.5.
The resulting list should be [1, 2, 3, 4, 2.5].
Assume the rule that in a 5-item list at least 3 values must be valid for imputation to take place. Otherwise the result is NA.
Values [1, 2, NA, NA, NA].
Mean of the valid values is 1.5 but it does not matter.
The resulting list should stay unchanged, [1, 2, NA, NA, NA], because imputation is not allowed.
Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:
import pandas as pd
from pandas import NA

s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    # fill the NAs with the mean only if at least N values are valid
    return s if s.notna().sum() < N else s.fillna(s.mean())
fillna_mean(s1)
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 2.5
# dtype: float64
fillna_mean(s2)
# 0 1
# 1 2
# 2 <NA>
# 3 <NA>
# 4 <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0 1.0
# 1 2.0
# 2 1.5
# 3 1.5
# 4 1.5
# dtype: float64
Let's try a list comprehension, though it will be messy.
Option 1
You can use pd.Series and numpy:
s = [x if np.isnan(lst).sum() >= 3 else pd.Series(lst).mean(skipna=True) if x is np.nan else x for x in lst]
Option 2: use numpy all through
s = [x if np.isnan(lst).sum() >= 3 else np.mean([x for x in lst if str(x) != 'nan']) if x is np.nan else x for x in lst]
Case 1
lst=[1, 2, 3, 4, np.nan]
outcome
[1, 2, 3, 4, 2.5]
Case 2
lst=[1, 2, np.nan, np.nan, np.nan]
outcome
[1, 2, nan, nan, nan]
If you wanted it as a pd.Series, simply
pd.Series(s, name='lst')
How it works
s = [
    x if np.isnan(lst).sum() >= 3                                  # keep x as-is if the list has 3 or more NaNs (imputation not allowed)
    else pd.Series(lst).mean(skipna=True) if x is np.nan else x    # otherwise replace each NaN with the mean of the non-NaN elements
    for x in lst                                                   # for every element in lst
]
1. I have tried to understand this code but I couldn't. Would you help me?
a = np.arange(5)
hist, bin_edges = np.histogram(a, density=True)
hist
2. Why is the output like this?
array([0.5, 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 0. , 0.5])
The default for the bins argument to np.histogram is 10. So the histogram counts which bins your array elements fall into. In this case a = np.array([0, 1, 2, 3, 4]). If we are creating a histogram with 10 bins then we break the interval 0-4 (inclusive) into 10 equal bins. This gives us (note that 11 end points gives us 10 bins):
np.linspace(0, 4, 11) = array([0. , 0.4, 0.8, 1.2, 1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. ])
We now just need to see which bins your elements in the array a fall into. We can count them as follows:
[1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
Now this is still not exactly what the output is. The density=True argument states (from the docs): "If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1."
Each of the 5 non-zero bins has height 0.5 and width 0.4, so 5 x 0.5 x 0.4 = 1, as required by this argument.
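To make the normalization concrete, here is a small check (not part of the original answer) showing that the density values are just the raw counts divided by the total number of samples times the bin width:
import numpy as np

a = np.arange(5)
counts, bin_edges = np.histogram(a)       # raw counts, 10 bins by default
hist, _ = np.histogram(a, density=True)   # normalized densities

bin_widths = np.diff(bin_edges)           # every bin is 0.4 wide here
print(np.allclose(hist, counts / (counts.sum() * bin_widths)))  # True
print((hist * bin_widths).sum())                                # 1.0, the integral over the range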
numpy.arange(5) generates a numpy array of 5 evenly spaced elements: array([0, 1, 2, 3, 4]).
np.histogram(a, density=True) returns the values of a histogram computed from your array a using 10 bins (the default value), together with the bin edges.
bin_edges gives the edges of the bins, while hist gives the number of occurrences in each bin. Given that you set density=True, the occurrences are normalized (the integral over the range is 1).
Look here for more information.
Please check this post. Hint: when you call np.histogram, the default number of bins is 10, which is why your output has 10 elements.
I'd like to compute the following sums for each value of a in A:
D = np.array([1, 2, 3, 4])
A = np.array([0.5, 0.25, -0.5])
beta = 0.5
np.sum(np.square(beta) - np.square(D-a))
and the result is an array of all the sums. To compute it by hand, it would look something like this:
[np.sum(np.square(beta)-np.square(D-0.5)),
np.sum(np.square(beta)-np.square(D-0.25)),
np.sum(np.square(beta)-np.square(D-(-0.5)))]
Use np.sum with broadcasting
np.sum(np.square(beta) - np.square(D[None,:] - A[:,None]), axis=1)
Out[98]: array([-20. , -24.25, -40. ])
Explanation: we need to subtract each element of array A from the whole array D. We can't simply call D - A because that would attempt an element-wise subtraction between D and A, whose shapes don't match. Therefore we employ numpy broadcasting: we add an additional dimension to D and A to satisfy the broadcasting rules. After that, just do the calculation and sum along axis=1.
Step by step:
Increase the dimension of D from 1D to 2D by inserting a new axis at position 0
In [10]: D[None,:]
Out[10]: array([[1, 2, 3, 4]])
In [11]: D.shape
Out[11]: (4,)
In [12]: D[None,:].shape
Out[12]: (1, 4)
Doing the same for A, but at axis=1
In [13]: A[:,None]
Out[13]:
array([[ 0.5 ],
[ 0.25],
[-0.5 ]])
In [14]: A.shape
Out[14]: (3,)
In [15]: A[:,None].shape
Out[15]: (3, 1)
On subtraction, numpy broadcasting kicks in to broadcast both arrays to a compatible shape and performs the subtraction, creating a 2D result
In [16]: D[None,:] - A[:,None]
Out[16]:
array([[0.5 , 1.5 , 2.5 , 3.5 ],
[0.75, 1.75, 2.75, 3.75],
[1.5 , 2.5 , 3.5 , 4.5 ]])
Next, it is just an element-wise square and subtraction.
np.square(beta) - np.square(D[None,:] - A[:,None])
Out[17]:
array([[ 0. , -2. , -6. , -12. ],
[ -0.3125, -2.8125, -7.3125, -13.8125],
[ -2. , -6. , -12. , -20. ]])
Lastly, sum along axis=1 to get the final output:
np.sum(np.square(beta) - np.square(D[None,:] - A[:,None]), axis=1)
Out[18]: array([-20. , -24.25, -40. ])
You may read the numpy broadcasting docs for more info: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
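As a side note (not from the original answer), the explicit D[None, :] isn't strictly required: a 1-D array of shape (4,) already broadcasts against A[:, None] of shape (3, 1), so an equivalent, slightly shorter form is:
import numpy as np

D = np.array([1, 2, 3, 4])
A = np.array([0.5, 0.25, -0.5])
beta = 0.5

# (4,) broadcasts against (3, 1) to a (3, 4) intermediate, then each row is summed
result = np.sum(beta**2 - (D - A[:, None])**2, axis=1)
print(result)  # same result as above: -20., -24.25, -40.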
I'm not too familiar with numpy, so there may be a vectorized way to do this. But with list comprehension, this will do:
[ np.sum(np.square(beta) - np.square(D-a)) for a in A ]
Output:
[-20.0, -24.25, -40.0]
I am trying to cluster some products based on users' behavior. What I end up with are clusters that have very different numbers of observations.
I have checked the k-means clustering parameters and was not able to find one that controls the minimum (or maximum) number of observations per cluster.
For example, here is how the number of observations is distributed across the different clusters.
cluster_id num_observations
0 6
1 4
2 1
3 3
4 29
5 5
How to deal with this issue?
For those who are still looking for an answer: I found a good module that deals with this kind of problem (installable from PyPI or from GitHub, see below).
Use pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git, and use MinMaxKMeansMinCostFlow, where you can set size_min and size_max:
import numpy as np
from size_constrained_clustering import minmax  # assumed import path; provides MinMaxKMeansMinCostFlow

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)

model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
This can be solved with the k-means-constrained pip library; check here.
Example:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
... n_clusters=2,
... size_min=2,
... size_max=5,
... random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1., 2.],
[ 4., 2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
I currently have a dataset with store locations and item names, and I want to predict the sales of a particular product.
I wanted to use binary encoding or pandas get_dummies(), but there are 5000 item names and it causes a memory error. Is there any alternative or better way to handle this? Thanks all!
print(train.shape)
print(train.dtypes)
print(train.head())
(125497040, 6)
id int64
date object
store_nbr int64
item_nbr int64
unit_sales float64
onpromotion object
dtype: object
id date store_nbr item_nbr unit_sales onpromotion
0 0 2013-01-01 25 103665 7.0 NaN
1 1 2013-01-01 25 105574 1.0 NaN
2 2 2013-01-01 25 105575 2.0 NaN
3 3 2013-01-01 25 108079 1.0 NaN
4 4 2013-01-01 25 108701 1.0 NaN
Instead of creating gazillions of dense dummy columns you should use a one-hot encoding that stays sparse: https://en.wikipedia.org/wiki/One-hot
pandas' get_dummies builds a dense DataFrame by default, so the easiest way is scikit-learn's OneHotEncoder, which returns a sparse matrix: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
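For this dataset specifically, a minimal sketch (assuming the train DataFrame shown above) could look like this; the sparse output is what keeps memory under control, and it can be fed directly to most scikit-learn estimators:
from sklearn.preprocessing import OneHotEncoder

# encode the item IDs as a scipy sparse matrix instead of 5000 dense dummy columns
enc = OneHotEncoder(handle_unknown='ignore')       # sparse output is the default
item_ohe = enc.fit_transform(train[['item_nbr']])  # roughly (n_rows, n_distinct_items), stored sparsely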
The way I see it you could:
Not use all items, but only the most frequent ones.
This way creating dummies produces fewer new columns and needs less memory. For this to happen you will need to drop (or group) the items with few counts (define "few" with a threshold), and you will lose some information; see the sketch after this list.
An alternative approach would be to use a Factorization Machine.
You could use both suggestions above and, at the end, average their predictions for an even better score.
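A minimal sketch of the first suggestion, assuming the train DataFrame shown above and a hypothetical threshold of 100 occurrences; rare items are collapsed into a single placeholder bucket before creating dummies:
import pandas as pd

THRESHOLD = 100  # hypothetical cut-off; tune it to your data

# keep only item IDs that appear at least THRESHOLD times
item_counts = train['item_nbr'].value_counts()
frequent_items = item_counts[item_counts >= THRESHOLD].index

# map every rare item to a single placeholder value, e.g. -1
train['item_reduced'] = train['item_nbr'].where(train['item_nbr'].isin(frequent_items), -1)

# far fewer distinct values now, so the dummies stay manageable (sparse=True helps further)
dummies = pd.get_dummies(train['item_reduced'], prefix='item', sparse=True)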