RapidMiner FP-growth operator not returning any results - text-mining

I'm running into a problem with the FP-Growth operator in RapidMiner. I'm processing about 20 text files that total less than 1 MB. I used the Process Documents operator and, within it, tokenize, filter stop words, transform cases, generate n-grams, and filter tokens. From there I used the Numerical to Binominal operator. Up to this point everything works fine, but when I run the FP-Growth operator it just processes indefinitely with no result. I tried tweaking the min support parameter, but to no avail. Would you have any suggestions on how to troubleshoot this? I'd really appreciate any guidance.

FP-Growth is an algorithm to find Frequent Item Sets within a number of transactions that contain multiple items. In your scenario, the items are probably the words occurring in the text, while each text is a transaction.
Unfortunately, the frequent item set problem is exponential. Say you have a frequent item set that contains the items {A, B, C}; that means there are enough transactions that contain all three items. But that also means that all of its subsets are frequent as well, because the subset {A, B} is contained in at least all the transactions that contain {A, B, C}. So if {A, B, C} is frequent, then {A, B}, {B, C}, {A, C}, {A}, {B} and {C} are frequent as well. A set of n items has (2^n) - 1 non-empty subsets, so for a set of four items we already have 15 subsets, for five 31, and so on.
So the question is: What makes a set frequent and why might there be so many frequent sets that RapidMiner takes so long to compute them all?
The most important factor is of course min_support. It defines the fraction of all transactions in which a set must occur to count as frequent. If you increase min_support towards 1, there will be far fewer item sets and the computation will be faster.
However, don't get tricked by the "find min number of itemsets" option. If it is checked, RapidMiner will always try to find the specified minimal number of item sets and will automatically lower min_support if it couldn't find enough. My advice: switch it off.
Another thing to check is that the right value is recognized as "positive", that is, as the indicator that an item is part of the transaction. If you used a Numerical to Binominal operator before, this value is "true", so you should enter "true" into the positive_value parameter of the FP-Growth operator. This parameter is only visible in expert mode. If you are not in expert mode, a line will show up below the parameters telling you that "4 hidden expert parameters" are available. You can click on that line to switch to expert mode.
In your specific scenario, where you are getting your 'transactions' from text files, you will have special problems:
You will have thousands of attributes, especially if you generated n-grams as in your case. A high number of attributes also results in a massively increased runtime.
If you don't remove frequent words by applying reasonable pruning in the Process Documents operator, words that occur very frequently will blow up the number of frequent item sets. Say you didn't filter stop words; then the words "a", "the" and "is" will occur all over the place, causing the other frequent words to co-occur with them. So the frequent set {A, B, C} will always be extended to {A, B, C, a, the, is}, and we now have 2^6 - 1 = 63 subsets instead of just 7...
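To make the blow-up concrete, here is a minimal sketch in Python. It is not FP-Growth and not RapidMiner's implementation; it counts frequent item sets by naive enumeration on made-up toy documents. The function name `frequent_itemsets` and the data are assumptions for illustration only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent item sets (toy sizes only, NOT FP-Growth).

    transactions: list of sets of items
    min_support:  fraction of transactions a set must occur in
    """
    items = sorted(set().union(*transactions))
    total = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / total >= min_support:
                result[candidate] = hits / total
    return result

# Hypothetical documents: three share the same three words, one is unrelated.
docs = [
    {"fp", "growth", "tree"},
    {"fp", "growth", "tree"},
    {"fp", "growth", "tree"},
    {"text", "mining"},
]
# The same documents if stop words had not been filtered out.
with_stopwords = [d | {"a", "the", "is"} for d in docs]

print(len(frequent_itemsets(docs, 0.75)))            # 2^3 - 1 = 7
print(len(frequent_itemsets(with_stopwords, 0.75)))  # 2^6 - 1 = 63
print(len(frequent_itemsets(docs, 0.2)))             # lower threshold: 10
```

Three ubiquitous words turn 7 frequent sets into 63, and lowering the threshold adds further sets on top; with thousands of attributes the same effect quickly becomes unmanageable.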
Hope this helps!

Related

Does it make sense to use big-O to describe the best case for a function?

I have an extremely pedantic question on big-O notation that I would like some opinions on. One of my uni subjects states “Best O(1) if their first element is the same” for a question on checking if two lists have a common element.
My qualm with this is that it does not describe the function on the entire domain of large inputs, rather the restricted domain of large inputs that have two lists with the same first element. Does it make sense to describe a function by only talking about a subset of that function’s domain? Of course, when restricted to that domain, the time complexity is omega(1), O(1) and therefore theta(1), but this isn’t describing the original function. From my understanding it would be more correct to say the entire function is bounded by omega(1). (and O(m*n) where m, n are the sizes of the two input lists).
What do all of you think?
It is perfectly correct to discuss cases (as you correctly point out, a case is a subset of the function's domain) and bounds on the runtime of algorithms in those cases (Omega, Oh or Theta). Whether or not it's useful is a harder question and one that is very situation-dependent. I'd generally think that Omega-bounds on the best case, Oh-bounds on the worst case and Theta bounds (on the "universal" case of all inputs, when such a bound exists) are the most "useful". But calling the subset of inputs where the first elements of each collection are the same the "best case" seems like reasonable usage. The "best case" for bubble sort is the subset of inputs which are pre-sorted arrays, and it is bounded by O(n), better than unmodified merge sort's best-case bound.
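As a small illustration of talking about a case, the sketch below counts comparisons for the common-element check from the question. The nested-loop implementation and the name `has_common_element` are assumptions; the course material may use a different algorithm.

```python
def has_common_element(xs, ys):
    """Naive nested-loop check.

    Returns (found, comparisons) so the cost of each case is visible.
    """
    comparisons = 0
    for x in xs:
        for y in ys:
            comparisons += 1
            if x == y:
                return True, comparisons
    return False, comparisons

# Best case: the first elements match, one comparison regardless of size.
print(has_common_element([1, 2, 3], [1, 5, 6]))   # (True, 1)
# Worst case: nothing in common, m * n comparisons.
print(has_common_element([1, 2, 3], [4, 5, 6]))   # (False, 9)
```

Restricted to inputs whose first elements match, the count stays at 1 however long the lists are; over all inputs it ranges from 1 up to m * n.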
Fundamentally, big-O notation is a way of talking about how some quantity scales. In CS we so often see it used for talking about algorithm runtimes that we forget that all the following are perfectly legitimate use cases for it:
The area of a circle of radius r is O(r^2).
The volume of a sphere of radius r is O(r^3).
The expected number of 2’s showing after rolling n dice is O(n).
The minimum number of 2’s showing after rolling n dice is O(1).
The number of strings made of n pairs of balanced parentheses is O(4^n).
With that in mind, it’s perfectly reasonable to use big-O notation to talk about how the behavior of an algorithm in a specific family of cases will scale. For example, we could say the following:
The time required to sort a sequence of n already-sorted elements with insertion sort is O(n). (There are other, non-sorted sequences where it takes longer.)
The time required to run a breadth-first search of a tree with n nodes is O(n). (In a general graph, this could be larger if the number of edges were larger.)
The time required to insert n items in sorted order into an initially empty binary heap is O(n). (This can be Θ(n log n) for a sequence of elements that is reverse-sorted.)
In short, it’s perfectly fine to use asymptotic notation to talk about restricted subcases of an algorithm. More generally, it’s fine to use asymptotic notation to describe how things grow as long as you’re precise about what you’re quantifying. See @Patrick87’s answer for more about whether to use O, Θ, Ω, etc. when doing so.
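The insertion sort example above can be checked by counting comparisons. This is a minimal sketch in Python; the function name `insertion_sort_comparisons` is made up for illustration.

```python
def insertion_sort_comparisons(values):
    """Sort a copy of `values` with insertion sort.

    Returns (sorted_list, number_of_element_comparisons).
    """
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= key:
                break          # key is already in place relative to a[j]
            a[j + 1] = a[j]    # shift the larger element one slot right
            j -= 1
        a[j + 1] = key
    return a, comparisons

n = 8
print(insertion_sort_comparisons(range(n))[1])         # sorted input: n - 1 = 7
print(insertion_sort_comparisons(range(n, 0, -1))[1])  # reversed: n(n-1)/2 = 28
```

On the family of already-sorted inputs the count grows linearly in n, while on reversed inputs it grows quadratically; both are statements about a restricted family of inputs.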

Searching for groups of objects given a reduction function

I have a few questions about a type of search.
First, is there a name and if so what is the name of the following type of search? I want to search for subsets of objects from some collection such that a reduction and filter function applied to the subset is true. For example, say I have the following objects, each of which contains an id and a value.
[A,10]
[B,10]
[C,10]
[D,9]
[E,11]
I want to search for "all the sets of objects whose summed values equal 30" and I would expect the output to be, {{A,B,C}, {A,D,E}, {B,D,E}, {C,D,E}}.
Second, is the only strategy to perform this search brute-force? Is there some type of general-purpose algorithm for this? Or are search optimizations dependent on the reduction function?
Third, if you came across this problem, what tools would you use to solve it in a general way? Assume the reduction and filter functions could be anything and are not necessarily the sum function. Does SQL provide a good API for this type of search? What about Prolog? Any interesting tips and tricks would be appreciated.
Thanks.
I cannot comment on the problem in general, but a brute-force search is easy to write in Prolog.
w(a,10).
w(b,10).
w(c,10).
w(d,9).
w(e,11).
solve(0, [], _).
solve(N, [X], [X|_]) :- w(X, N).
solve(N, [X|Xs], [X|Bs]) :-
    w(X, W),
    W < N,
    N1 is N - W,
    solve(N1, Xs, Bs).
solve(N, [X|Xs], [_|Bs]) :- % skip element if previous clause fails
    solve(N, [X|Xs], Bs).
Which gives
| ?- solve(30, X, [a, b, c, d, e]).
X = [a,b,c] ? ;
X = [a,d,e] ? ;
X = [b,d,e] ? ;
X = [c,d,e] ? ;
(1 ms) no
SQL is TERRIBLE at this kind of problem. Until recently there was no way to get 'All Combinations' of row elements. Now you can do so with Recursive Common Table Expressions, but you are forced by their limitations to retain all partial results as well as final results, so you have to filter the partial ones out at the end. About the only benefit you get with SQL's recursive procedure is that you can stop evaluating possible combinations once a sub-path exceeds 30, your target total. That makes it slightly less ugly than an 'evaluate all 2^N combinations' brute force solution (unless every combination sums to less than the target total).
To solve this with SQL you would be running an algorithm that can be described as:
Seed your result set with all table entries less than your target total and their value as a running sum.
Iteratively join your prior result with all combinations of table that were not already used in the result set and whose value added to running sum is less than or equal to target total. Running sum becomes old running sum plus value, and append ID to ID LIST. Union this new result to the old results. Iterate until no more records qualify.
Make a final pass of the result set to filter out the partial sums that do not total to your target.
Oh, and unless you make special provisions, solutions {A,B,C}, {C,B,A}, and {A,C,B} all look like different solutions (order is significant).
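The seed / iterate / filter procedure described above can be sketched outside SQL as well. The following Python version is an illustration only, not a recursive CTE; the name `subsets_with_sum` is made up, and it assumes non-negative values so that pruning on the running sum is valid. It extends a partial set only with objects that come later in the input, which is one way to avoid treating {A,B,C} and {C,B,A} as different solutions.

```python
def subsets_with_sum(objects, target):
    """Find all subsets of (id, value) pairs whose values sum to `target`.

    Assumes non-negative values, so a partial sum above the target
    can never recover and is pruned.
    """
    # Seed: every single object that does not exceed the target.
    frontier = [((i,), value)
                for i, (_, value) in enumerate(objects)
                if value <= target]
    results = []
    while frontier:
        next_frontier = []
        for indices, running in frontier:
            # Final filter: keep only partial results that hit the target.
            if running == target:
                results.append({objects[i][0] for i in indices})
            # Extend only with later objects so each set is built once.
            for j in range(indices[-1] + 1, len(objects)):
                total = running + objects[j][1]
                if total <= target:
                    next_frontier.append((indices + (j,), total))
        frontier = next_frontier
    return results

objects = [("A", 10), ("B", 10), ("C", 10), ("D", 9), ("E", 11)]
for solution in subsets_with_sum(objects, 30):
    print(sorted(solution))
```

For the example data this prints the four sets from the question. The pruning only helps when many partial sums overshoot the target; in the worst case the search is still exponential.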

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.
So far I've found different solutions, but I think they are quite old and not very efficient. I'm sure better methods exist, so if you have references for me to read, please share them; this subject may interest several people.
The solutions I found (examples are in R) :
Levenshtein distance, which equals the number of characters you have to insert, delete or change to transform one word into another.
agrep("acusait", c("accusait", "abusait"), max.distance = 2, value = TRUE)
## [1] "accusait" "abusait"
The comparison of phonemes
library(RecordLinkage)
soundex(x<-c('accusait','acusait','abusait'))
## [1] "A223" "A223" "A123"
The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), though I guess it is not very efficient on addresses.
I thought about using the suggestions from Google Suggest, but likewise, it is not very efficient on personal postal addresses.
You could imagine a supervised machine learning approach, but that requires having stored the misspelled requests of users, which is not an option for me.
I look at this as a spelling-correction problem, where you need to find the nearest-matching word in some sort of dictionary.
What I mean by "near" is Levenshtein distance, except the smallest number of single-character insertions, deletions, and replacements is too restrictive.
Other kinds of "spelling mistakes" are also possible, for example transposing two characters.
I've done this a few times, but not lately.
The most recent case had to do with concomitant medications for clinical trials.
You would be amazed how many ways there are to misspell "acetylsalicylic".
Here is an outline in C++ of how it is done.
Briefly, the dictionary is stored as a trie, and you are presented with a possibly misspelled word, which you try to look up in the trie.
As you search, you try the word as it is, and you try all possible alterations of the word at each point.
As you go, you have an integer budget of how many alterations you can tolerate, which you decrement every time you put in an alteration.
If you exhaust the budget, you allow no further alterations.
Now there is a top-level loop, where you call the search.
On the first iteration, you call it with a budget of 0.
When the budget is 0, it will allow no alterations, so it is simply a direct lookup.
If it fails to find the word with a budget of 0, you call it again with a budget of 1, so it will allow a single alteration.
If that fails, try a budget of 2, and so on.
Something I have not tried is a fractional budget.
For example, suppose a typical alteration reduces the budget by 2, not 1, and the budget goes 0, 2, 4, etc.
Then some alterations could be considered "cheaper". For example, a vowel replacement might only decrement the budget by 1, so for the cost of one consonant replacement you could do two vowel replacements.
If the word is not misspelled, the time it takes is proportional to the number of letters in the word.
In general, the time it takes is exponential in the number of alterations in the word.
If you are working in R (as I was in the example above), I would have it call out to the C++ program, because you need the speed of a compiled language for this.
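The trie-with-budget search described above can be sketched as follows. This is written in Python purely for readability and is not the C++ outline the answer refers to; the names `TrieNode`, `build_trie`, `search` and `closest_matches` are assumptions, and only insertions, deletions and replacements are handled (no transpositions, no fractional costs).

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary word ends here

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def search(node, word, budget, prefix=""):
    """Yield dictionary words matching `word` with at most `budget` alterations."""
    if not word:
        if node.is_word:
            yield prefix
    else:
        # Try the next character as it is; this costs nothing.
        child = node.children.get(word[0])
        if child is not None:
            yield from search(child, word[1:], budget, prefix + word[0])
    if budget > 0:
        for ch, child in node.children.items():
            # The input is missing a character that the dictionary has.
            yield from search(child, word, budget - 1, prefix + ch)
            # The input has a wrong character at this position.
            if word and ch != word[0]:
                yield from search(child, word[1:], budget - 1, prefix + ch)
        if word:
            # The input has an extra character.
            yield from search(node, word[1:], budget - 1, prefix)

def closest_matches(root, word, max_budget=3):
    """Top-level loop: retry with budget 0, 1, 2, ... until something matches."""
    for budget in range(max_budget + 1):
        found = set(search(root, word, budget))
        if found:
            return budget, found
    return None, set()

root = build_trie(["accusait", "abusait"])
print(closest_matches(root, "acusait"))
```

With the two words from the question's R example, the misspelling "acusait" finds nothing at budget 0 and both words at budget 1. The `max_budget` cap is an added safeguard, since the work grows quickly with each extra alteration.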
Extending what Mike had to say, and using the string matching library stringdist in R to match a vector of addresses that errored out in ARCGIS's geocoding function:
library(stringdist)
rows <- length(unmatched$address)
# vector to put our matched addresses in
matched_add <- rep(NA, rows)
score <- rep(NA, rows)
# for instructional purposes only; you should use sapply to apply functions to vectors
for (u in seq_len(rows)) {
  # gives you the position of the closest match in an address vector
  pos <- amatch(unmatched$address[u], index$address, maxDist = Inf)
  matched_add[u] <- index$address[pos]
  # stringsim here will give you the score to go back and adjust your
  # parameters
  score[u] <- stringsim(unmatched$address[u], index$address[pos])
}
Stringdist has several methods you can use to find approximate matches, including Levenshtein (method="lv"). You'll probably want to tinker with these to fit your dataset as well as you can.

Building ranking with genetic algorithm

Question after a BIG edit:
I need to build a ranking using a genetic algorithm. I have data like this:
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
Now, let's interpret a, b, c, d as the names of football teams, where P(x>y) is the probability that x beats y. We want to build a ranking of the teams, but some observations are missing: P(a>d) and P(a>c) are unknown because a has not played d or c.
The goal is to find the ordering of team names that best describes the current situation in this four-team league.
If we have only 4 teams, then the solution is straightforward: first we compute the probabilities for all 4!=24 orderings of the four teams, ignoring the missing values:
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
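For four teams, the exhaustive evaluation just described fits in a few lines. The sketch below is in Python; the names `observed` and `ordering_score` are made up for illustration, and the score is exactly the product from the formulas above (missing pairs are simply skipped).

```python
from itertools import permutations

# Observed pairwise probabilities P(x beats y) from the question.
observed = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.7,
    ("c", "d"): 0.8,
    ("b", "d"): 0.3,
}

def ordering_score(order, observed):
    """Product of P(x>y) if x is ranked above y, else 1 - P(x>y)."""
    position = {team: i for i, team in enumerate(order)}
    score = 1.0
    for (x, y), p in observed.items():
        score *= p if position[x] < position[y] else 1.0 - p
    return score

scores = {"".join(o): ordering_score(o, observed) for o in permutations("abcd")}
best = max(scores, key=scores.get)
print(best, scores[best])
```

On this data several orderings tie for the top score (for example abcd and acdb both evaluate to about 0.1512), so `max` returns only one of them. The same `ordering_score` function could serve as the fitness function of a genetic algorithm when n is too large to enumerate.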
My question :
As the number of permutations of n elements is n!, calculating the probabilities for all orderings is infeasible for large n (my n is about 40). I want to use a genetic algorithm for this problem.
The mutation operator is simple: switch the places of two (or more) elements of the ranking.
But how do I cross over two orderings?
Could P(abcd) be interpreted as the cost function of the path 'abcd' in an asymmetric TSP, where the cost of travelling from x to y differs from the cost of travelling from y to x, with P(x>y)=1-P(y>x)? There are many crossover operators for TSP, but I think I have to design my own, because my problem is slightly different from TSP. Do you have any ideas for a solution or a framework for conceptual analysis?
The easiest way, on both the conceptual and the implementation level, is to use a crossover operator that exchanges suborderings between two solutions:
CrossOver(ABcD,AcDB) = AcBD
For a random subset of elements (in this case 'a,b,d', shown in capital letters), we copy the first ordering's subordering of those elements, the sequence 'a,b,d', into the positions they occupy in the second ordering.
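The subordering exchange above can be sketched in Python as follows. The function name `subordering_crossover` is made up, and the subset is passed in explicitly here so the result is reproducible; in a GA it would be drawn at random.

```python
def subordering_crossover(donor, receiver, subset):
    """Impose the donor's relative order of `subset` onto the receiver.

    Elements outside `subset` keep their positions in the receiver;
    the slots held by subset elements are refilled in donor order.
    """
    chosen = set(subset)
    donor_order = iter(x for x in donor if x in chosen)
    return [next(donor_order) if x in chosen else x for x in receiver]

child = subordering_crossover("abcd", "acdb", "abd")
print("".join(child))  # acbd
```

Because the operator only rearranges elements already present in the receiver, the child is always a valid permutation.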
Edit: an asymmetric TSP can be turned into a symmetric TSP, but with forbidden suborderings, which makes the GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repetition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important than this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents).
cbhdefga
afcgedbh
X X    X
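Just keep finding cycles until every position belongs to one. Note that swapping every single cycle would merely exchange the two parents wholesale, so implementations typically swap only some of the cycles, for example every other one.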
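The cycle-finding steps above can be sketched in Python as follows. This is not the author's framework code; the names `find_cycles` and `cycle_crossover` are assumptions, and so is the choice to swap every other cycle (a common convention; swapping all of them would simply exchange the two parents).

```python
def find_cycles(p1, p2):
    """Partition positions 0..n-1 into cycles between two permutations."""
    index_in_p1 = {value: i for i, value in enumerate(p1)}
    seen = [False] * len(p1)
    cycles = []
    for start in range(len(p1)):
        if seen[start]:
            continue
        cycle = []
        pos = start
        while not seen[pos]:
            seen[pos] = True
            cycle.append(pos)
            # Drop down to parent 2, then find that value in parent 1.
            pos = index_in_p1[p2[pos]]
        cycles.append(cycle)
    return cycles

def cycle_crossover(p1, p2):
    """Swap the values of every other cycle (assumed convention)."""
    c1, c2 = list(p1), list(p2)
    for n, cycle in enumerate(find_cycles(p1, p2)):
        if n % 2 == 0:
            for pos in cycle:
                c1[pos], c2[pos] = c2[pos], c1[pos]
    return c1, c2

print(find_cycles("abcdefgh", "cfhgedba"))
c1, c2 = cycle_crossover("abcdefgh", "cfhgedba")
print("".join(c1), "".join(c2))
```

For the two example parents the first cycle is positions 0, 2 and 7, and the children come out as cbhdefga and afcgedbh, matching the strings shown above (the only other swapped cycle is position 4, where both parents hold "e"). Every value in a child sits at a position it occupied in one of the parents, which is the property the answer argues matters for ranking.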
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90th percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on a distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
Fitness function is MIN(fitness)
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "Bayesian" is going to give you a whole bunch of hits related to Markov chains and Markov decision problems, which may or may not be helpful.

Representing multiply-linked lists in SQL

I have a data structure consisting of a set of objects which are arranged into a multiply-linked list which is also (isomorphically) a valid DAG. It could be viewed as one single multiply-linked list, or as a series of n doubly-linked lists which may share members. (This is the same data structure from Algorithm for quickly obtaining a partial ordering over multiple linked lists, for those of you following my questions.)
I am looking for a general technique, in no specific SQL dialect, for expressing this multiply-linked list/DAG in SQL, such that it's easy to take a given node and obtain:
The previous and next links in the DAG, given a topological ordering of the DAG
The previous and next links in each doubly-linked list to which this node belongs
Using the example data from that other question:
first = [a, b, d, f, h, i];
second = [a, b, c, f, g, i];
third = [a, e, f, g, h, i];
I'd want to be able to, given node f, obtain [(c|d|e), g] from the overall DAG's topology and also {first: [d, h], second: [c, g], third: [e, g]} from each list's ordering.
Here's the fun part: n, the number of doubly-linked lists, is not fixed and may grow at any time. I'd rather not redo the schema each time that happens.
All of the algorithms I've come up with so far either (a) stuff a big pickle into the DB and pull it out in order to calculate orderings, or (b) require that the lists be explicitly enumerated as recursive relations in the DB.
I'll go with an option in (b) if I can't find something better but I'm hoping that there's something magical out there to make this easier.
Pre:
This is a question and answer forum, not a 'let's sit down, group think for a bit, and solve the whole problem' forum.
I think what you want to investigate is a technique called 'modified preorder tree traversal'. It's a mouthful, I know, but it allows hierarchical data to be stored in a flat database as individual entries. Sadly, you do have to do some rewriting on inserts, but the selects can be done in a single query, so it's best for 'many views / few changes' situations like a website. Luckily, you rarely have to rewrite the whole dataset (only the parts you changed and those hierarchically after them).
I remember a good article on the basics of it (from a couple of years ago) but can't find the bookmark at the moment, so start with just a Google search.
EDIT/UPDATE:
link: http://www.sitepoint.com/hierarchical-data-database/
No matter what, from dealing with this issue extensively, you will have to choose where to put the brunt of the work: on view, or on change. Depending on the size of the 'master' tree, you may (like me) decide to break the tree up into parts and use a tree of trees, limiting the update cost.
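For readers unfamiliar with modified preorder tree traversal, the numbering idea can be sketched as follows. This is a Python illustration on a made-up strict tree; the names `number_tree` and `descendants` are assumptions. It does not by itself handle a DAG in which a node such as f has several predecessors, which is the harder part of the question.

```python
def number_tree(tree, root):
    """Assign (left, right) numbers by a depth-first walk.

    tree: dict mapping a node to the list of its children.
    """
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        bounds[node] = (left, counter)
        counter += 1

    visit(root)
    return bounds

def descendants(bounds, node):
    """All nodes whose interval lies strictly inside the node's interval."""
    left, right = bounds[node]
    inside = [n for n, (l, r) in bounds.items() if left < l and r < right]
    return sorted(inside, key=lambda n: bounds[n][0])

# Hypothetical tree: root has children a and b; a has children c and d.
tree = {"root": ["a", "b"], "a": ["c", "d"]}
bounds = number_tree(tree, "root")
print(bounds)
print(descendants(bounds, "a"))  # ['c', 'd']
```

In a database the two numbers would be stored as columns, so the subtree lookup becomes a single range query, while an insert has to renumber every row to the right of the insertion point. That is the view-versus-change trade-off mentioned above.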