I currently have a dataset with the location of stores and the names of items, and I want to predict the sales of a particular product.
I wanted to use binary encoding or pandas get_dummies(), but there are 5000 item names and it causes a memory error. Is there an alternative or better way to handle this? Thanks all!
print(train.shape)
print(train.dtypes)
print(train.head())
(125497040, 6)
id int64
date object
store_nbr int64
item_nbr int64
unit_sales float64
onpromotion object
dtype: object
id date store_nbr item_nbr unit_sales onpromotion
0 0 2013-01-01 25 103665 7.0 NaN
1 1 2013-01-01 25 105574 1.0 NaN
2 2 2013-01-01 25 105575 2.0 NaN
3 3 2013-01-01 25 108079 1.0 NaN
4 4 2013-01-01 25 108701 1.0 NaN
Instead of creating gazillions of dense dummy columns, keep the one-hot encoding sparse: https://en.wikipedia.org/wiki/One-hot
The easiest way is to use scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix by default and therefore doesn't blow up memory (pandas' get_dummies also accepts sparse=True if you prefer to stay in pandas): http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder()
>>> enc.categories_
[array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
>>> enc.transform([[0, 1, 1]]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
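Applied to your data, that might look like the following minimal sketch (assuming the store_nbr and item_nbr columns from the train frame in your question; the 5000 item columns stay inside a SciPy sparse matrix instead of a dense array):
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' avoids errors on item_nbr values that only appear at prediction time
enc = OneHotEncoder(handle_unknown='ignore')
X_sparse = enc.fit_transform(train[['store_nbr', 'item_nbr']])

print(X_sparse.shape)   # (n_rows, n_stores + n_items)
print(type(X_sparse))   # a SciPy sparse matrix, so memory stays manageable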
The way I see it, you could:
Use not all of the items, but only the most frequent ones.
This way creating the dummies produces fewer new columns and needs less memory. For this to work you will need to drop (or group together) the items with few counts (define "few" with a threshold), and you will lose some information. A sketch of this approach follows below.
An alternative approach would be to use a Factorization Machine, which is designed for large, sparse categorical feature spaces.
You could also use both suggestions above and average their predictions at the end for an even better score.
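For the first suggestion, here is a minimal sketch (assuming the train frame from the question; the threshold of 100 and the -1 "other" bucket are arbitrary choices to be tuned):
import pandas as pd

threshold = 100  # assumption: adjust to your data
counts = train['item_nbr'].value_counts()
frequent_items = counts[counts >= threshold].index

# lump rare items into a single -1 bucket before creating dummies
train['item_grouped'] = train['item_nbr'].where(train['item_nbr'].isin(frequent_items), other=-1)
dummies = pd.get_dummies(train['item_grouped'], prefix='item', sparse=True)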
Let's say I have a list of coefficients [1, 2, 3]. How do I convert this to x^2 + 2x + 3 or something similar in NumPy? Is it even possible?
As described in the docs, which I recommend reading:
>>> p1 = np.polynomial.Polynomial([3, 2, 1])
>>> p1
Polynomial([3., 2., 1.], domain=[-1, 1], window=[-1, 1])
>>> p1(0)
3.0
Note that Polynomial expects the coefficients in order of increasing power (constant term first), so the order is reversed relative to the list above.
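If you prefer to keep the coefficients in the order you wrote them (highest power first), np.poly1d accepts them that way; a small sketch:
import numpy as np

p2 = np.poly1d([1, 2, 3])   # coefficients highest power first: x**2 + 2*x + 3
print(p2(0))                # 3
print(p2(1))                # 6
Note that np.poly1d is the older interface; the np.polynomial package shown above is the one the NumPy docs recommend for new code.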
I am trying to cluster some products based on users' behavior. What I end up with are clusters that have very different numbers of observations.
I have checked the k-means clustering parameters and could not find a parameter that controls the minimum (or maximum) number of observations per cluster.
For example, here is how the number of observations is distributed across the clusters:
cluster_id num_observations
0 6
1 4
2 1
3 3
4 29
5 5
How can I deal with this issue?
For those who are still looking for an answer: I found a good module that deals with this kind of problem.
Install it with pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git and use MinMaxKMeansMinCostFlow, where you can select size_min and size_max:
import numpy as np
from size_constrained_clustering import minmax

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
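To verify that the size constraints actually held, you can count the members of each cluster, e.g.:
from collections import Counter
print(Counter(labels))   # every count should land between size_min=400 and size_max=800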
This can be solved with the k-means-constrained pip library; check here.
Example:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
... n_clusters=2,
... size_min=2,
... size_max=5,
... random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1., 2.],
[ 4., 2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
I would like to have pandas raise an exception when dividing by zero as in:
d = {'col1': [2., 0.], 'col2': [4., 0.]}
df = pd.DataFrame(data=d)
2/df.col1
Instead of the current result:
0 1.000000
1 inf
Name: col1, dtype: float64
Any suggestions on how to achieve that?
I know that with NumPy I can use np.seterr(divide='raise'), but pandas ignores it.
Many thanks
A closer look at the source code and the traceback shows that pandas wraps a lot of its operations in context managers like this:
with np.errstate(all='ignore'):
or
with numeric.errstate(all='ignore'):
This is the reason why np.seterr is ignored, and there is probably no easy way to get around it.
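You can see the effect in plain NumPy: a local errstate context overrides whatever np.seterr set globally, which is exactly what happens inside pandas:
import numpy as np

np.seterr(divide='raise')          # globally: raise FloatingPointError on division by zero
with np.errstate(all='ignore'):    # ...but the local context wins
    print(np.array([2.0]) / 0.0)   # prints [inf]; no exception is raised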
It's far from ideal, but one potential option is to interpret the elements of your dataframe as Python objects rather than the more optimized numpy or pandas dtypes that it typically uses:
In [37]: d = {'col1': [2., 0.], 'col2': [4., 0.]}
...: df = pd.DataFrame(data=d)
...: 2/df
Out[37]:
col1 col2
0 1.0 0.5
1 inf inf
In [38]: 2 / df.astype('O')
---------------------------------------------------------------------------
ZeroDivisionError: float division by zero
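If raising is all you need, another option that stays outside pandas' error handling is to check for zeros yourself before dividing; a minimal sketch:
import pandas as pd

d = {'col1': [2., 0.], 'col2': [4., 0.]}
df = pd.DataFrame(data=d)

if (df == 0).any().any():   # True if any element anywhere in the frame is zero
    raise ZeroDivisionError("DataFrame contains zeros; division would produce inf")
result = 2 / df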
I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired number of quantiles (say, 1 value vs. 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking it into 2 quantiles (with the same boundary value):
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer number of quantiles, an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
OK, this is a workaround which might work for you:
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
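If you need to apply this across thousands of rows, one way to make it robust is to cap the number of bins at the number of distinct non-NaN values in each row (safe_qcut is a hypothetical helper name; using nunique() rather than len() is a small safety tweak on the line above for rows with repeated values):
import numpy as np
import pandas as pd

def safe_qcut(s, q=2):
    # never ask qcut for more bins than there are distinct non-NaN values
    n = s.dropna().nunique()
    if n == 0:
        return pd.Series(np.nan, index=s.index)
    return pd.qcut(s, min(q, n), duplicates='drop')

s = pd.Series([5, np.nan, np.nan])
print(safe_qcut(s, 2))   # 5.0 ends up in a single bin, the NaNs are preserved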
You can try filling your object/numeric columns with an appropriate fill value ('null' for strings and 0 for numerics):
#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7. This worked for me.
I'm wondering if there is a numpythonic way of inverting a histogram back to an intensity signal.
For example:
>>> A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
>>> H, edge = np.histogram(A, bins=10, range=(0,10))
>>> np.sort(A)
[ 0 1 2 4 7 7 8 10]
>>> H
[1 1 1 0 1 0 0 2 1 1]
>>> edge
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
Is there a way to reconstruct the original A intensities using the H and edge? Of course, positional information will have been lost, but I'd just like to recover the intensities and relative number of occurrences.
I have this loopy way of doing it:
>>> reco = []
>>> for i, h in enumerate(H):
... for _ in range(h):
... reco.append(edge[i])
...
>>> reco
[0.0, 1.0, 2.0, 4.0, 7.0, 7.0, 8.0, 9.0]
# I've done something wrong with the right-most histogram bin, but we can ignore that for now
For large histograms, the loopy way is inefficient. Is there a vectorized equivalent of what I did in the loop? (my gut says that numpy.digitize will be involved..)
Sure, you can use np.repeat for this:
import numpy as np
A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
counts, edges = np.histogram(A, bins=10, range=(0,10))
print(np.repeat(edges[:-1], counts))
# [ 0. 1. 2. 4. 7. 7. 8. 9.]
Obviously it's impossible to recover the exact position of a value within a bin, since you lose that information in the process of generating the histogram. You could either use the lower or upper bin edge (as in the example above), or you could use the center value, e.g.:
print(np.repeat((edges[:-1] + edges[1:]) / 2., counts))
# [ 0.5 1.5 2.5 4.5 7.5 7.5 8.5 9.5]