My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories and my COLAB with (allegedly) TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of Data Science/Machine Learning that is an art, not a science. A couple ideas:
One hot encode everything, then use a dimensionality reduction algorithm to remove some of the columns (PCA, SVD, etc).
Only one hot encode some values (say limit it to 10 or 100 categories, rather than 53,000), then for the rest, use an "other" category.
If it's possible to construct an embedding for these variables (Not always possible), you can explore this.
Group/bin the values in the columns by some underlying feature. I.e. if the feature is something like days_since_X, bin it by 100 or something. Or if it's names of animals, group it by type instead (mammal, reptile, etc.)
I have an extremely pedantic question on big-O notation that I would like some opinions on. One of my uni subjects states “Best O(1) if their first element is the same” for a question on checking if two lists have a common element.
My qualm with this is that it does not describe the function on the entire domain of large inputs, rather the restricted domain of large inputs that have two lists with the same first element. Does it make sense to describe a function by only talking about a subset of that function’s domain? Of course, when restricted to that domain, the time complexity is omega(1), O(1) and therefore theta(1), but this isn’t describing the original function. From my understanding it would be more correct to say the entire function is bounded by omega(1). (and O(m*n) where m, n are the sizes of the two input lists).
What do all of you think?
It is perfectly correct to discuss cases (as you correctly point out, a case is a subset of the function's domain) and bounds on the runtime of algorithms in those cases (Omega, Oh or Theta). Whether or not it's useful is a harder question and one that is very situation-dependent. I'd generally think that Omega-bounds on the best case, Oh-bounds on the worst case and Theta bounds (on the "universal" case of all inputs, when such a bound exists) are the most "useful". But calling the subset of inputs where the first elements of each collection are the same, the "best case", seems like reasonable usage. The "best case" for bubble sort is the subset of inputs which are pre-sorted arrays, and is bound by O(n), better than unmodified merge sort's best-case bound.
Fundamentally, big-O notation is a way of talking about how some quantity scales. In CS we so often see it used for talking about algorithm runtimes that we forget that all the following are perfectly legitimate use cases for it:
The area of a circle of radius r is O(r2).
The volume of a sphere of radius r is O(r3).
The expected number of 2’s showing after rolling n dice is O(n).
The minimum number of 2’s showing after rolling n dice is O(1).
The number of strings made of n pairs of balanced parentheses is O(4n).
With that in mind, it’s perfectly reasonable to use big-O notation to talk about how the behavior of an algorithm in a specific family of cases will scale. For example, we could say the following:
The time required to sort a sequence of n already-sorted elements with insertion sort is O(n). (There are other, non-sorted sequences where it takes longer.)
The time required to run a breadth-first search of a tree with n nodes is O(n). (In a general graph, this could be larger if the number of edges were larger.)
The time required to insert n items in sorted order into an initially empty binary heap is On). (This can be Θ(n log n) for a sequence of elements that is reverse-sorted.)
In short, it’s perfectly fine to use asymptotic notation to talk about restricted subcases of an algorithm. More generally, it’s fine to use asymptotic notation to describe how things grow as long as you’re precise about what you’re quantifying. See #Patrick87’s answer for more about whether to use O, Θ, Ω, etc. when doing so.
Ive been revisiting genetic algorithms with encoding, optimizing and decoding. My first attempt was the travelling salesman with ordered cross over which worked great. I found an article that tried to optimize a more complex genome while optimizing a 2d packing problem.
The author encodes the problem using reverse polish notation that made sense. It uses a combination of parts and either V Or H as opertors.
Ie 34H5V
With decoding the stack having to be resolved to one stack element that is my final layout. That being said, the number of operater up until a certain point must be 1 less than the number of parts up until the same point. The author then states that he used a mixed cross over by using an ordered cross over on the parts and binary crossover for the operators.
I mulled this over but i cannot understand how he seperates the parts and operators before crossing over and then recombines them before evaluating performance and they offer little details. If a binary cross over occured replacing parts with an "X" to keep the relative positions so they can be recombined after crossover but the relationship between operator and parts doesnt hold true.
Does anyone perhaps have a resource that has dealt with a similar scenario or perhaps has used this successfully.
This looked way more difficult than it actually was. When the original population is generated, you need to adhere to the limitations set out by postfix notation. When a crossover occurs you simply build a mask of the parent
Ie xxxxooxoxx
Where x is an object and o is an operaror. Once you have the mask holding the positions you can create a sting only of operators and one only of objects. The operators can be done with a binary cross over and the objects as partial map crossover. Once done you fill the mask with the value in the order they appear in each group. Since the mask was valid, the progeny is valid too.
The only issue ia getting all the possible arrangements because without it, it will all be limited to the masks. He solves this by doing a swap mutation dictated by the mutation rates.
Select an item at random.
If the item is an operator then
A. Swithc the operator to another kind
B. Select another. If its an object then make sure the requirementa are met and if so then switch.
I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations that I've found of the Levenshtein algorithm all dimension a matrix that's length 1 * length 2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using nm memory, but I've forgotten how it works.
Question after BIG edition :
I need to built a ranking using genetic algorithm, I have data like this :
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
now, lets interpret a,b,c,d as names of football teams, and P(x>y) is probability that x wins with y. We want to build ranking of teams, we lack some observations P(a>d),P(a>c) are missing due to lack of matches between a vs d and a vs c.
Goal is to find ordering of team names, which the best describes current situation in that four team league.
If we have only 4 teams than solution is straightforward, first we compute probabilities for all 4!=24 orderings of four teams, while ignoring missing values we have :
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
My question :
As numbers of permutations of n elements is n! calculation of probabilities for all
orderings is impossible for large n (my n is about 40). I want to use genetic algorithm for that problem.
Mutation operator is simple switching of places of two (or more) elements of ranking.
But how to make crossover of two orderings ?
Could P(abcd) be interpreted as cost function of path 'abcd' in assymetric TSP problem but cost of travelling from x to y is different than cost of travelling from y to x, P(x>y)=1-P(y<x) ? There are so many crossover operators for TSP problem, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for solution or frame for conceptual analysis ?
The easiest way, on conceptual and implementation level, is to use crossover operator which make exchange of suborderings between two solutions :
CrossOver(ABcD,AcDB) = AcBD
for random subset of elements (in this case 'a,b,d' in capital letters) we copy and paste first subordering - sequence of elements 'a,b,d' to second ordering.
Edition : asymetric TSP could be turned into symmetric TSP, but with forbidden suborderings, which make GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repitition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important that this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents.
cbhdefga
afcgedbh
X X X
Just keep finding and swapping cycles until you're done.
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90% percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on a distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
Fitness function is MIN(fitness)
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "bayesian" is going to give you a whole bunch of hits related to markov chains and markov decision problems, which may or may not be helpful.