DeepAR Building Product Categories - pandas

I have a problem understanding the DeepAR algorithm.
I tried to forecast the sales of single products with the algorithm.
First I tried it for one SKU at a daily frequency, but I got the following error message:
ParamValidationError: Parameter validation failed:
Invalid type for parameter Body, value: [datetime
I thought the reason for that error was that I have too many "NaN" values in my targets. Could that be the reason?
(I didn't apply any categories or dynamic_feats)
I then tried to make the forecast at a monthly frequency, but the result was that I didn't have enough timestamps for the algorithm.
Would it be possible to group my products within the DeepAR algorithm through the "cat" or the "dynamic_feat" fields, so that I would have fewer "NaN" values in my targets?
I would like to group the products by different features like color, price or size. Do you know if that is possible, or do I have to do that before I apply DeepAR?
Thanks in advance:)

It looks like the error is thrown by boto (ParamValidationError). I suspect that you are not using the correct JSON format to send the requests. See an example here.
I then tried to make the forecast at a monthly frequency, but the result was that I didn't have enough timestamps for the algorithm.
There is also a weekly frequency, which you could try out. However, DeepAR should also be able to handle NaN values.
Would it be possible to group my products within the DeepAR algorithm through the "cat" or the "dynamic_feat" fields, so that I would have fewer "NaN" values in my targets?
Generally, cat is used to assign one or more categories to a time series. However, I don't see how that should affect the number of NaN values in your targets. Also, DeepAR does not emit NaNs in the prediction.
I would like to group the products by different features like color, price or size. Do you know if that is possible, or do I have to do that before I apply DeepAR?
Yes, that's what cat is for. The documentation explains how you can encode these category values.
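For reference, here is a rough sketch of what a single training record can look like in DeepAR's JSON Lines format, with cat and dynamic_feat attached. The field names follow the SageMaker DeepAR documentation as I understand it, but the concrete values (the start date, the sales numbers, the category codes, the prices) are made up for illustration; missing target values can, as far as I know, be encoded as "NaN" strings:

import json

# One JSON object per line ("JSON Lines"); each line is one time series, i.e. one SKU.
# "cat" holds integer-encoded static categories (e.g. [color_id, size_id]),
# "dynamic_feat" holds one value per time step (e.g. the daily price),
# and missing target values are written as the string "NaN".
record = {
    "start": "2017-01-01 00:00:00",                     # first timestamp of the series
    "target": [12.0, 0.0, "NaN", 7.0, 3.0],             # daily sales, "NaN" where unknown
    "cat": [2, 0],                                      # e.g. [color_id, size_id]
    "dynamic_feat": [[9.99, 9.99, 7.49, 7.49, 9.99]],   # e.g. the daily price
}
with open("train.json", "w") as f:
    f.write(json.dumps(record) + "\n")                  # one such line per SKU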

Why do you have so many NaN values? Is it because you don't know the values for those days, or because there were no sales on those days? Before feeding data into any algorithm, you need to handle missing values. If you can replace your NaN values with zeros or use some other imputation method, you will have fewer issues.
DeepAR works best when you add more categories ("cat" values) such as brand, color, size, and similar attributes. DeepAR uses these categories to calculate embeddings that encode the "meaning" of each category as it affects the sales values. For example, if you have 10 colors, some of which are "girly" and some "boyish", or some "crazy" and some "solid", the embedding calculation has the potential to capture these attributes and use them to improve the accuracy of the forecast.
Prices are different, as they are often elastic (discounts or promotions are applied); they should be represented as "dynamic_feat", and you should supply their values for each day/month/whatever your series frequency is.
If your prices are static, you can still use them as categories ("cat") by converting them into buckets, such as "high"/"medium"/"low". This is also a standard way of analyzing the features you have and transforming them to match the capabilities and strengths of the algorithm you are going to use. DeepAR, in this case, is good at encoding categories (static, low-cardinality features) and at regressing on numeric features that may be correlated with the target.
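A minimal pandas sketch of that bucketing idea, assuming a products DataFrame with a numeric price column (the column names, bin edges and labels are illustrative):

import pandas as pd

products = pd.DataFrame({"sku": ["A", "B", "C", "D"],
                         "price": [4.99, 19.99, 54.99, 7.49]})

# Static prices -> "low"/"medium"/"high" buckets, then integer codes for DeepAR's cat field.
products["price_bucket"] = pd.cut(products["price"],
                                  bins=[0, 10, 30, float("inf")],
                                  labels=["low", "medium", "high"])
products["price_cat"] = products["price_bucket"].cat.codes   # 0, 1, 2
print(products)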

Related

Select the best "team" of 9 "players" based on overall "team" performance only

I have 9 bins named A through I containing the following number of objects:
A(8), B(7), C(6), D(7), E(5), F(6), G(6), H(6), I(6)
Objects from each bin fulfill a specific role and cannot be interchanged. I am selecting one object from each bin at random forming a "team" of 9 "players":
T_ijklmnopq = {a_i, b_j, c_k, d_l, e_m, f_n, g_o, h_p, i_q}
There are 15,240,960 such combinations - a huge number. I have means of evaluating performance of each "team" via a costly objective function, F(T_ijklmnopq). Thus, I can feasibly sample a limited number of random combinations, say no more than 500 samples.
Having results of such sampling, I want to predict the most likely best combination of "players". How to do it?
Keep in mind this is different from classical team selection because there is no meaningful evaluation of F() based on individual performance. For example, "player" a_6 may be good individually, but he may not "like" e_2, and therefore the performance of a "team" containing the two suffers. Conversely, three mediocre players b_1, f_5, i_2 may be part of an awesome "team". What's known is the whole "team" performance, that's all.
One more detail: contributions of the individual roles A through I are not weighted equally. Position of, say, E may be more important than, say, H. Unfortunately, these weights are not known upfront.
The described problem must be known to combinatorial analysts, but I haven't found anything exactly like it. Linear programming solutions with known individual "player" scores do not apply here. I will be most grateful for a specific name under which this problem is known to experts.
So far I have collected 400 samples. Here is a graph of the sorted F(T) values vs. an (arbitrary) sample number to illustrate that F(T) is "reasonable".
F(T) graph of sorted samples
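To make the scale of the setup concrete, here is a small sketch of the random sampling described above (one player per bin, 500 teams). The objective F below is only a placeholder, since the real evaluation function is not shown:

import math
import random

bin_sizes = {"A": 8, "B": 7, "C": 6, "D": 7, "E": 5, "F": 6, "G": 6, "H": 6, "I": 6}
print(math.prod(bin_sizes.values()))   # 15240960 possible teams

def F(team):
    return random.random()             # placeholder for the real, costly objective

random.seed(0)
samples = []
for _ in range(500):
    team = {role: random.randrange(n) for role, n in bin_sizes.items()}   # random player index per bin
    samples.append((team, F(team)))

best_team, best_score = max(samples, key=lambda s: s[1])   # best of the sampled teams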

How can I recode 53k unique addresses (saved as objects) w/o One-Hot-Encoding in Pandas?

My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories, and my Colab session with (allegedly) a TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of Data Science/Machine Learning that is an art, not a science. A couple of ideas:
One-hot encode everything, then use a dimensionality reduction algorithm to remove some of the columns (PCA, SVD, etc.).
Only one-hot encode some values (say, limit it to 10 or 100 categories rather than 53,000), then lump the rest into an "other" category (see the sketch after this list).
If it's possible to construct an embedding for these variables (not always possible), you can explore that.
Group/bin the values in the column by some underlying feature, e.g. if the feature is something like days_since_X, bin it in steps of 100 or so; or if it's names of animals, group them by type instead (mammal, reptile, etc.).
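A minimal pandas sketch of the "top N plus other" idea from the second suggestion; the column name address and the cutoff N are illustrative, not taken from the original data:

import pandas as pd

df = pd.DataFrame({"address": ["12 Oak St", "5 Elm Ave", "12 Oak St",
                               "9 Pine Rd", "5 Elm Ave", "77 Lake Dr"]})

N = 2   # in practice something like 100 instead of all 53,000 categories
top = df["address"].value_counts().nlargest(N).index
df["address_reduced"] = df["address"].where(df["address"].isin(top), "other")

dummies = pd.get_dummies(df["address_reduced"], prefix="addr")   # N+1 columns instead of 53k
print(dummies.head())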

Understanding Stratified sampling in numpy

I am currently working through an exercise book on machine learning to get my feet wet, so to speak, in the discipline. Right now I am working on a real estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median income histogram reveals that most median income values are clustered around 2 to 5, but some values go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income value. He offers the following piece of code to create an income category attribute:
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
What I don't understand is
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure if this is the right Stack Overflow category to post this question in, so if I made a mistake by doing so, please let me know what the appropriate forum might be.
Thank you!
You are the best person to analyze this further based on your data set, but I can help you understand stratified sampling so that you have an idea of what it does.
STRATIFIED SAMPLING: suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). Now, if you just sample the data at random, there is a possibility that the sample might not cover all the categories, which is very bad when you train on the data. To avoid such a scenario, we have a method called stratified sampling: the sample is drawn so that each category is represented in the same proportion as in the full data set, and we do not miss any useful data.
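To connect this back to the housing example above: after the book's two lines of code, income_cat takes the values 1 to 5, and those values can be passed as the strata to scikit-learn. A short sketch, assuming the housing DataFrame from the snippet above (the 20% test size and the random seed are just illustrative choices):

from sklearn.model_selection import StratifiedShuffleSplit

# Each split keeps the proportion of districts in every income_cat band (1..5)
# roughly the same in the training and test sets.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_idx]
    strat_test_set = housing.iloc[test_idx]

# Sanity check: category proportions should be nearly identical in both sets.
print(strat_test_set["income_cat"].value_counts(normalize=True))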
Please let me know if you still have any questions, I would be very happy to help you.

Similarity matching algorithm

I have products with different details in different attributes, and I need to develop an algorithm to find the ones most similar to the one I'm searching for.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weights, etc. Then I need to do a search where the most similar products are returned first. For example, if everything matches but the color is only Black and White and not Brown, it's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach I could take, for example, is to restrict each attribute (weight, color, size) to a limited set of options, so I can build a binary representation. Then I have something like this for each product:
Colors       Weight    Height    Condition
00011011000  10110110  10001100  01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: it would also be awesome if I could use different weights for each attribute.
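A quick sketch of the XOR/popcount idea (and of the weighted variant from the follow-up question); the bit layouts and the weights below are made up for illustration:

# Each product is packed into one integer; fewer set bits after XOR = more similar.
def hamming(a, b):
    return bin(a ^ b).count("1")

query    = 0b00011011000_10110110_10001100_01   # colors | weight | height | condition
product1 = 0b00011001000_10110110_10001100_01   # one color bit differs
product2 = 0b00010001000_10110110_10001100_01   # two color bits differ
print(hamming(query, product1), hamming(query, product2))   # 1 2

# Weighted variant: compare each attribute separately and weight its mismatches.
def weighted_distance(a, b, weights):
    return sum(w * bin(a[k] ^ b[k]).count("1") for k, w in weights.items())

weights = {"colors": 2.0, "weight": 1.0, "height": 1.0, "condition": 0.5}
pa = {"colors": 0b00011011000, "weight": 0b10110110, "height": 0b10001100, "condition": 0b01}
pb = {"colors": 0b00011001000, "weight": 0b10110110, "height": 0b10001100, "condition": 0b01}
print(weighted_distance(pa, pb, weights))   # 2.0, the single color mismatch counted twice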
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and capping the distance at a threshold (like in broad-phase collision detection), then limiting the full distance comparisons to only those items that meet the thresholds.
What you want to do is a perfect use case for Elasticsearch and other similar search-oriented databases. I don't think you need to hack around with bitmasks, etc.
You would typically maintain your primary data in your existing database (SQL/Cassandra/Mongo/etc.; anything works) and copy the things that need searching to Elasticsearch.
What you are describing is very similar to BK-trees. A BK-tree is a search tree built around some metric defined on the keys of the tree. Its most common use is string correction with the Levenshtein or Damerau-Levenshtein distance. It is not a static data structure, so it supports future insertions of elements.
When you search for an exact element (or insert an element), you walk the nodes of the tree, at each node following the edge whose weight equals the distance between that node's key and your element. If you want to find similar objects, you follow several edges simultaneously, namely those whose weights satisfy your distance constraints. (It may even be possible to use A* to quickly find the single most similar object.)
Simple example of BK-tree (from the second link)
         BOOK
        /    \
    (1)/      \(4)
      /        \
   BOOKS      CAKE
    /          /  \
(2)/       (1)/    \(2)
  /           |     \
 BOO        CAPE    CART
Your metric should be Hamming distance (count of differences between bit representations of two objects).
BUT! Is it really good to compare two integers by the number of differing bits in their representations? With Hamming distance, HD(10000, 00000) == HD(10000, 10001), i.e. the difference between the numbers 16 and 0 is the same as between 16 and 17. Is that really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
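For completeness, here is a minimal BK-tree sketch in Python using the Hamming metric suggested above; it is a toy illustration of the idea, not production code:

# Minimal BK-tree: each node stores a value plus children keyed by their distance to it.
def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self, distance=hamming):
        self.distance = distance
        self.root = None                      # (value, {edge_weight: child_node})

    def insert(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d in node[1]:
                node = node[1][d]             # descend along the edge with weight d
            else:
                node[1][d] = (value, {})
                return

    def search(self, value, tolerance):
        # Return (distance, stored_value) pairs within `tolerance` of `value`.
        results, stack = [], [self.root] if self.root else []
        while stack:
            node_value, children = stack.pop()
            d = self.distance(value, node_value)
            if d <= tolerance:
                results.append((d, node_value))
            # Triangle inequality: only edges with weight in [d - tolerance, d + tolerance]
            # can lead to further matches.
            for edge, child in children.items():
                if d - tolerance <= edge <= d + tolerance:
                    stack.append(child)
        return sorted(results)

tree = BKTree()
for product_bits in [0b1010, 0b1000, 0b0111, 0b1111]:
    tree.insert(product_bits)
print(tree.search(0b1011, tolerance=1))       # [(1, 10), (1, 15)], i.e. 0b1010 and 0b1111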

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning (e.g. for a variable dealing with the ratio of satisfactory accounts to total accounts, a blank response means the individual does not have any accounts at all, whereas a response of 0 means the individual has no satisfactory accounts).
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields a value that is not in the range of the data (e.g. going back to the ratio variable above, we could use 9999 or -1, as these values are outside the range of a ratio variable (0 to 1)). I am just curious whether there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions from your experimental design, and with the caveat that it will somewhat weaken the power of your experiment relative to the same experiment with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging a few numbers in. See this page for more information. Ultimately this is probably a better question for Cross-Validated at least until you have figured out the experimental design issues.