tensorflow crossed column feature with vocabulary list for cross terms - tensorflow

How would I make a crossed_column with a vocabulary list for the crossed terms? That is, suppose I have two categorical columns:
animal [dog, cat, puma, other]
food [pizza, salad, quinoa, other]
and now I want to make the crossed column, animal x food. I've done some frequency counts of the training data (in Spark, before exporting tfrecords for training TensorFlow models), and puma x quinoa only showed up once, and cat x quinoa never showed up. So I don't want to generate features for them, because I don't think I have enough training examples to learn what their weights should be. What I'd like is for both of them to get absorbed into the "other x other" feature, the idea being that I'll learn some kind of average weight for a feature that covers all the infrequent terms.
It doesn't look like I can do that with tf.feature_column.crossed_column -- any idea how I would do this kind of thing in tensorflow?
Or should I not worry about it? If I cross all the features I'd get 20, but there are only 18 that I think are important, so maybe set the hash map size to 18 or less, causing collisions? Then include the first-order columns, animal and food, so the model can figure out what it is looking at? This is the approach I'm getting from reading the docs. I like it because it is simpler, but I am concerned about model accuracy.
I think what I really want is some kind of sparse table lookup, rather than hashing the cross -- imagine I have
column A - integer Ids, 1 to 10,000
column B - integer Ids, 1 to 10,000
column C - integer Ids, 1 to 10,000
and there are only 1 million of the 1 trillion possible crosses between A, B and C that I want to make features for. All the rest should go into one extra feature, number 1 million + 1, for other x other x other. How would I do that in TensorFlow?
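A minimal sketch of the sparse table lookup described above, written in plain Python rather than against any TensorFlow API. The helper names `build_cross_vocab` and `cross_feature_id` are hypothetical, and reserving id 0 for the catch-all is an assumption made for illustration. The idea is only that each kept cross gets its own id and every other combination falls back to a single shared id.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple

Cross = Tuple[Hashable, ...]


def build_cross_vocab(kept_crosses: Iterable[Sequence[Hashable]]) -> Dict[Cross, int]:
    """Assign ids 1..K to the kept crosses; id 0 is left free for the catch-all."""
    return {tuple(cross): idx for idx, cross in enumerate(kept_crosses, start=1)}


def cross_feature_id(
    vocab: Dict[Cross, int], values: Sequence[Hashable], other_id: int = 0
) -> int:
    """Look up one example's cross, falling back to the catch-all id if it is not kept."""
    return vocab.get(tuple(values), other_id)


animals = ["dog", "cat", "puma", "other"]
foods = ["pizza", "salad", "quinoa", "other"]
rare = {("puma", "quinoa"), ("cat", "quinoa")}  # crosses judged too infrequent

kept = [(a, f) for a in animals for f in foods if (a, f) not in rare]
vocab = build_cross_vocab(kept)

print(cross_feature_id(vocab, ("dog", "pizza")))    # a kept cross gets its own id
print(cross_feature_id(vocab, ("puma", "quinoa")))  # a rare cross gets the catch-all id

# To absorb rare crosses into the existing "other x other" feature instead:
other_other = vocab[("other", "other")]
print(cross_feature_id(vocab, ("cat", "quinoa"), other_id=other_other))
```

The same lookup works unchanged for three or more columns, since the key is just a tuple of the column values. One way to use it with a model, presumably, is to precompute the joined key (for example in Spark before export) and treat the result as an ordinary categorical column with a vocabulary list.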


Include belonging into model

If you had data like (prices and market-cap are not real)
Date Stock Close Market-cap GDP
15.4.2010 Apple 7.74 1.03 ...
15.4.2010 VW 50.03 0.8 ...
15.5.2010 Apple 7.80 1.04 ...
15.5.2010 VW 52.04 0.82 ...
where Close is the y you want to predict and Market-cap and GDP are your x-variables, would you also include Stock in your model as another independent variable? It could be, for example, that price formation works differently for Apple than for VW.
If yes, how would you do it? My idea is to assign 0 to Apple and 1 to VW in the column Stock.
You first need to identify what exactly you are trying to predict. As it stands, you have longitudinal data: multiple measurements from the same company over a period of time.
Are you trying to predict the close price based on market cap + GDP?
Or are you trying to predict the future close price based on previous close price measurements?
You could stratify based on company name, but it really depends on what you are trying to achieve. What is the question you are trying to answer?
You may also want to take the following considerations into account:
Close prices measured at different times on the same company are correlated with each other.
Correlations between two measurements taken soon after each other will be stronger than correlations between two measurements far apart in time.
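Also keep in mind the four assumptions associated with a linear regression model; correlated repeated measurements on the same company conflict with the independence assumption: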
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of residual is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.

How is the Gini-Index minimized in CART Algorithm for Decision Trees?

For neural networks for example I minimize the cost function by using the backpropagation algorithm. Is there something equivalent for the Gini Index in decision trees?
Descriptions of the CART algorithm always state "choose the partition of set A that minimizes the Gini index", but how do I actually get that partition mathematically?
Any input on this would be helpful :)
For a decision tree, there are different methods for splitting continuous variables like age, weight, income, etc.
A) Discretize the continuous variable to use it as a categorical variable in all aspects of the DT algorithm. This can be done:
only once at the start, keeping this discretization afterwards
at every stage where a split is required, using percentiles, interval ranges, or clustering to bucketize the variable
B) Split at all possible distinct values of the variable and see where there is the highest decrease in the Gini index. This can be computationally expensive, so there are optimized variants where you sort the values and, instead of trying every distinct value, use the midpoints between two consecutive values as the candidate splits. For example, if the variable 'weight' has 70, 80, 90 and 100 kg in the data points, try 75, 85 and 95 as splits and pick the best one (highest decrease in Gini or another impurity measure).
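Option B can be sketched in a few lines of plain Python. This is an illustration of the midpoint idea only, not the implementation used by any particular library. The function names and the class labels attached to the four weights are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_midpoint_split(values, labels):
    """Try the midpoint between each pair of consecutive distinct values.

    Returns (threshold, weighted_gini) for the split with the lowest weighted
    impurity, or None if there is only one distinct value.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    n = len(pairs)
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


weights = [70, 80, 90, 100]
classes = ["no", "no", "yes", "yes"]  # hypothetical labels for illustration
print(best_midpoint_split(weights, classes))
```

Minimizing the weighted impurity of the children is the same as maximizing the decrease in Gini, because the parent's impurity is identical for every candidate threshold.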
What I am not sure about, and am still researching, is which exact split algorithm is implemented in scikit-learn in Python, rpart in R, and the MLlib package in PySpark, and how they differ in splitting a continuous variable.
Here is a good example of the CART algorithm. Basically, we get the Gini index like this:
Each attribute has different values, and each value has its own Gini index, depending on the classes of the records that take that value. For example, if we had two classes (positive and negative), each value of an attribute will have some records that belong to the positive class and some other records that belong to the negative class, so we can calculate the probabilities. Say an attribute was called weather, it had two values (e.g. rainy and sunny), and we had this information:
rainy: 2 positive, 3 negative
sunny: 1 positive, 2 negative
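we could say:
Gini(rainy) = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini(sunny) = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
Then we can take the weighted sum of the Gini indexes for weather (assuming we had a total of 8 records):
Gini(weather) = (5/8) * 0.48 + (3/8) * 0.444 ≈ 0.467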
We do this for all the other attributes (like we did for weather), and at the end we choose the attribute with the lowest Gini index as the one to split the tree on. We have to do all this at each split (unless we can classify the sub-tree without the need for splitting).
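The weather computation can be checked with a short sketch. It assumes the usual Gini impurity definition, 1 minus the sum of squared class proportions, and the function names are hypothetical.

```python
def gini_from_counts(counts):
    """Gini impurity from per-class record counts, e.g. (positive, negative)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(value_counts):
    """Weighted sum of the per-value Gini indexes for one attribute."""
    total = sum(sum(counts) for counts in value_counts.values())
    return sum(
        sum(counts) / total * gini_from_counts(counts)
        for counts in value_counts.values()
    )


# (positive, negative) counts per value of the weather attribute
weather = {"rainy": (2, 3), "sunny": (1, 2)}

for value, counts in weather.items():
    print(value, round(gini_from_counts(counts), 3))
print("weather", round(weighted_gini(weather), 3))
```

Given a dictionary of such tables, one per attribute, the attribute to split on would be the one with the smallest `weighted_gini` value.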

Is multiple regression the best approach for optimization?

I am being asked to look at a scenario where a company has many projects it wishes to complete but, as with any company, budget comes into play. There is a Y value (a predefined score) with multiple X inputs. There are also 3 main constraints: Capital Cost, Expense Cost and Time for Completion in months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the 3 constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression. I have looked into different approaches in detail, but I would like to ask the wider community whether anyone has dealt with a similar problem and what approaches you have used.
The first thing to understand is that a conclusion is never based on a single argument.
This comes from communication theory: every person builds a frame of knowledge (an understanding, a conclusion), and that frame is constructed from many pieces of knowledge or information.
The consequence is that we cannot use single-variable linear regression to build an ML / DL system.
We should use at least two different variables to reach even a sub-conclusion. If we insist on a single variable with linear regression (y=mx+c), we are asking the computer to predict with low accuracy. Whatever optimization method you pick, the accuracy stays low, because single-variable linear regression applied to real life amounts to predicting a 'habit' from the data, not calculating the real condition.
That means we should use multiple linear regression (y=m1x1+m2x2+ ... + c) so that the computer can build a regression model and reach a conclusion. But it is not quite that simple: because the computer is trying to draw a conclusion from data with multiple characteristics / variances, you must classify both the data and the conclusions.
For example, try to make the computer understand Pythagoras.
We know that the Pythagorean formula is c=((a^2)+(b^2))^(1/2), and we want the computer to predict the hypotenuse (c) from two input values (a and b). To do that, we build a model, i.e. a multiple linear regression formula, for Pythagoras.
Step 1, of course, is to build a data set of Pythagorean values with varied characteristics.
This is an example:
a b c
3 4 5
8 6 10
3 14 etc..., try to put in 10 to 20 rows
Then fit a multiple regression to predict c from the values of a and b.
You will find that the fit is highly accurate (above 98%) for some values and not so accurate (under 90%) for others. For example, a=3 with b=14 or b=15 will give a low-accuracy result (under 90%).
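A rough sketch of that experiment in plain Python, for anyone who prefers it to a spreadsheet. The pairs (3, 4), (8, 6) and (3, 14) come from the example above; the remaining pairs are made up to reach ten rows, and both helper functions are hypothetical. It fits y = m1*x1 + m2*x2 + c by least squares and prints the relative error per row, so you can see which rows the linear model fits poorly.

```python
from math import hypot


def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_multiple_regression(rows, targets):
    """Least-squares fit of y = m1*x1 + m2*x2 + c via the normal equations."""
    design = [[x1, x2, 1.0] for x1, x2 in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# The first three pairs are from the example; the rest are invented for illustration.
pairs = [(3, 4), (8, 6), (3, 14), (5, 12), (6, 8),
         (9, 12), (7, 24), (12, 5), (15, 8), (10, 10)]
c_values = [hypot(a, b) for a, b in pairs]
m1, m2, c0 = fit_multiple_regression(pairs, c_values)

for (a, b), c in zip(pairs, c_values):
    predicted = m1 * a + m2 * b + c0
    print(a, b, round(c, 3), round(predicted, 3), f"{abs(predicted - c) / c:.1%}")
```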
So you must optimize, but how?
I know many optimization methods, but I found a manual approach: if I exclude the data points that give low-accuracy results, put them in a separate group, and then fit a new regression to that excluded group, I get a more significant result. Repeat this until you reach the accuracy target that you want.
Each data group that gets its own regression is a new class.
That means I end up with several multiple regressions based on the data I input (one regression per data group / class), and the accuracy is really high, 99% - 99.99%.
With several classes, each regression functions as the 'label' of its class. This is what happens in the background of the automated computation. With many modules, the user feels they are supplying a 'string' object as the label, but in truth the string object is bound to the regression that was constructed as the label.
With some conditional parameters you can get a good ML model from a minimal amount of training data.
Try it in Excel / LibreOffice before going any further...
Try to follow the tutorial from this video
and apply it to simple data that is easy to construct in Excel, like Pythagoras.
So the answer is yes: multiple regression is the best approach for optimization.

Is there a predefined name for the following solution search/optimization algorithm?

Consider a problem whose solution maximizes an objective function.
Problem: from 500 elements, 15 need to be selected (a candidate solution). The value of the objective function depends on the pairwise relationships between the elements in a candidate solution, and on some other factors.
The steps for solving such a problem are described here:
1. Generate a set of candidate solutions (the population) in a guided random manner // not purely random; a direction is given when generating the population
2. Evaluating the objective function for current population
3. If the current_best_solution exceeds the global_best_solution, then replace the global_best with current_best
4. Repeat steps 1,2,3 for N (arbitrary number) times
where the size of the population and N are both small (approx. 50)
After N iterations it returns a candidate solution stored in global_best_solution
Is this the description of a well-known algorithm?
If it is, what is the name of that algorithm? If not, under which category does this type of algorithm fit?
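For reference, the four steps can be sketched as follows. The question does not say how the "guided" generation works, so plain uniform sampling stands in for it here, and the pairwise objective `pairwise_spread` is a made-up placeholder.

```python
import random
from itertools import combinations


def random_search(n_elements, k, objective, population_size=50, n_iterations=50, seed=0):
    """Repeatedly sample a fresh population and remember the best candidate seen."""
    rng = random.Random(seed)
    global_best, global_best_score = None, float("-inf")
    for _ in range(n_iterations):
        # Step 1: generate candidates (uniform sampling stands in for "guided").
        population = [
            tuple(sorted(rng.sample(range(n_elements), k)))
            for _ in range(population_size)
        ]
        # Step 2: evaluate the objective for the current population.
        current_best = max(population, key=objective)
        current_score = objective(current_best)
        # Step 3: keep the global best.
        if current_score > global_best_score:
            global_best, global_best_score = current_best, current_score
    # Step 4 is the loop itself; after N iterations return the stored best.
    return global_best, global_best_score


def pairwise_spread(candidate):
    """Hypothetical objective built from pairwise relationships between elements."""
    return sum(abs(i - j) for i, j in combinations(candidate, 2))


best, score = random_search(500, 15, pairwise_spread)
print(best, score)
```

Note that nothing learned in one iteration feeds into the next: each population is drawn independently of the previous ones.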
What you have sounds like you are just fishing. Note that you might as well get rid of steps 3 and 4 since running the loop 100 times would be the same as doing it once with an initial population 100 times as large.
If you think of the objective function as a random variable which is a function of random decision variables then what you are doing would e.g. give you something in the 99.9th percentile with very high probability -- but there is no limit to how far the optimum might be from the 99.9th percentile.
To illustrate the difficulty, consider the following sort of Travelling Salesman Problem. Imagine two clusters of points A and B, each of which has 100 points. Within the clusters, each point is arbitrarily close to every other point (e.g. 0.0000001). But -- between the clusters the distance is say 1,000,000. The optimal tour would clearly have length 2,000,000 (+ a negligible amount). A random tour is just a random permutation of those 200 decision points. Getting an optimal or near optimal tour would be akin to shuffling a deck of 200 cards with 100 red and 100 black and having all of the red cards in the deck in a block (counting blocks that "wrap around") -- vanishingly unlikely (it can be calculated as 99 * 100! * 100! / 200! = 1.09 x 10^-57). Even if you generate quadrillions of tours it is overwhelmingly likely that each of those tours would be off by millions. This is a min problem, but it is also easy to come up with max problems where it is vanishingly unlikely that you will get a near-optimal solution by purely random settings of the decision variables.
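The quoted figure is easy to reproduce. This snippet only evaluates the expression exactly as stated above, using exact integer factorials before the final division.

```python
from math import factorial

# 99 * 100! * 100! / 200!, evaluated with exact integers until the last step
probability = 99 * factorial(100) ** 2 / factorial(200)
print(f"{probability:.2e}")

# Expected number of such tours among a quadrillion (1e15) random tours
print(f"{probability * 1e15:.2e}")
```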
This is an extreme example, but it is enough to show that purely random fishing for a solution isn't very reliable. It would make more sense to use evolutionary algorithms or other heuristics such as simulated annealing or tabu search.
Why do you work with a population if the members of that population do not interact?
What you have there is random search.
If you add mutation, it looks like an Evolution Strategy: https://en.wikipedia.org/wiki/Evolution_strategy

Markovian chains with Redis

For self-education purposes, I want to implement a Markov chain generator, using as much Redis, and as little application-level logic as possible.
Let's say I want to build a word generator based on a frequency table with history depth N (say, 2).
As a not very interesting example, for dictionary of two words bar and baz, the frequency table is as follows ("." is terminator, numbers are weights):
. . -> b x2
. b -> a x2
b a -> r x1
b a -> z x1
a r -> . x1
a z -> . x1
When I generate the word, I start with history of two terminators . .
There is only one possible outcome for the first two letters, b a.
Third letter may be either r or z, with equal probabilities, since their weights are equal.
Fourth letter is always a terminator.
(Things would be more interesting with longer words in dictionary.)
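As a point of reference, here is what the table and the generation step look like when done entirely in application code, with no Redis involved. The function names are hypothetical; this is a sketch of the model being described, not of the Redis-based solution being asked for.

```python
import random
from collections import Counter, defaultdict

TERMINATOR = "."


def build_table(words, depth=2):
    """Count how often each letter follows each history of `depth` symbols."""
    table = defaultdict(Counter)
    for word in words:
        history = (TERMINATOR,) * depth
        for letter in word + TERMINATOR:
            table[history][letter] += 1
            history = history[1:] + (letter,)
    return table


def generate(table, depth=2, rng=random):
    """Walk the chain from an all-terminator history until a terminator is drawn."""
    history = (TERMINATOR,) * depth
    letters = []
    while True:
        counts = table[history]
        # Weighted random choice among the letters seen after this history.
        letter = rng.choices(list(counts), weights=list(counts.values()))[0]
        if letter == TERMINATOR:
            return "".join(letters)
        letters.append(letter)
        history = history[1:] + (letter,)


table = build_table(["bar", "baz"])
for history, counts in table.items():
    print(" ".join(history), "->", dict(counts))
print(generate(table))
```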
Anyway, how to do this with Redis elegantly?
Redis sets have SRANDMEMBER, but do not have weights.
Redis sorted sets have weights, but do not have random member retrieval.
Redis lists can represent weights as repeated copies of an entry, but how would I do set intersections with them?
Looks like application code is doomed to do some data processing...
You can accomplish a weighted random selection with a redis sorted set, by assigning each member a score between zero and one, according to the cumulative probability of the members of the set considered thus far, including the current member.
The ordering you use is irrelevant; you may choose any order that is convenient for you. The random selection is then accomplished by generating a random floating point number r uniformly distributed between zero and one, and calling
ZRANGEBYSCORE key r 1 LIMIT 0 1
(with key being the sorted set and r the number just generated), which will return the first element with a score greater than or equal to r.
A little bit of reasoning should convince you that the probability of choosing a member is thus weighted correctly.
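The reasoning can be checked with a small sketch that mimics the sorted set with two parallel Python lists and `bisect`. No Redis client is used here, and the helper names are hypothetical; the lookup simply stands in for "first element with a score greater than or equal to r".

```python
import bisect
import random
from itertools import accumulate


def cumulative_scores(weights):
    """Return (members, scores), each score being the cumulative probability so far."""
    members = list(weights)
    total = sum(weights.values())
    running = list(accumulate(weights[m] for m in members))
    scores = [value / total for value in running]
    scores[-1] = 1.0  # guard against floating-point round-off in the last score
    return members, scores


def pick(members, scores, r):
    """Return the first member whose score is >= r."""
    return members[bisect.bisect_left(scores, r)]


# Letters that may follow the history "b a" in the example table, with their weights
members, scores = cumulative_scores({"r": 1, "z": 1})
print(list(zip(members, scores)))
print(pick(members, scores, random.random()))
```

A member with weight w occupies an interval of length w / total on the unit line, so a uniformly distributed r lands in it with exactly that probability, whatever order the members are listed in.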
Unfortunately, the fact that the scores assigned to the elements needs to be proportional to the cumulative probability would seem to make it difficult to use the sorted set union or intersection operations in a way which would preserve the significance of the scores for random selection of elements. That part would seem to require some significant application logic.