I have a dataset with 3.3M rows and 8k unique products.
I want to apply the Apriori algorithm to find association rules and connections between products.
Well, I have done it before on a much smaller database with 50k rows and maybe 200 unique products.
Does anyone know how I can do this effectively at this larger scale? Maybe there are tricks to reduce the size of the data while still getting effective results?
Any help would be amazing! Reach out if you have experience with this algorithm.
The trick is: Don't use Apriori.
Use LCM or the top-down version of FP-Growth.
You can find my implementations here:
command line programs: https://borgelt.net/fim.html (eclat with option -o gives LCM)
Python: https://borgelt.net/pyfim.html
R: https://borgelt.net/fim4r.html
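Here is a minimal Python sketch of how that might look with the pyfim package from the link above, combined with a simple pre-pruning step that addresses the "reduce the scale of the data" part of the question. The prune_rare_items helper and the toy transactions are mine, just for illustration, and you should check the pyfim documentation for the exact parameter semantics (supp and conf are percentages by default):

from collections import Counter
from fim import fpgrowth   # the pyfim package from https://borgelt.net/pyfim.html

def prune_rare_items(transactions, min_count):
    # Items below the minimum support can never appear in a frequent itemset,
    # so dropping them up front shrinks the data without losing any rules.
    counts = Counter(item for t in transactions for item in t)
    keep = {item for item, c in counts.items() if c >= min_count}
    return [[item for item in t if item in keep] for t in transactions]

# Toy transactions; in the real case this is the 3.3M-row, 8k-product data.
transactions = [
    ["bread", "milk", "beer"],
    ["bread", "milk"],
    ["milk", "diapers"],
    ["bread", "milk", "diapers"],
]
transactions = prune_rare_items(transactions, min_count=2)

# target='r' asks for association rules; supp/conf are percentages,
# zmin=2 asks for rules with at least two items in total.
rules = fpgrowth(transactions, target='r', supp=50, conf=60, zmin=2)
for rule in rules:
    print(rule)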
I'm trying to understand continuous optimization algorithms applied on some test functions.
Here are the results obtained by some algorithms used for this problem on some of the test functions:
[Image: table of results, not reproduced here]
I don't understand the difference between the two underlined phrases. Would you please help me with this?
P.S. Sometimes they use the term "median number" instead of "mean number". What's the difference between the two?
This question lacks some context. It would have been better to link to the text too, to get a grasp of what is going on.
But I read it as follows (and I think that's how someone with some experience in optimization algorithms would read it; you have to check it against your knowledge of the context):
The bold 1.0s are the normalized numbers of function evaluations on the different functions to optimize (each row is a different function).
The values in brackets are the unnormalized numbers expressing the same thing.
So if ACO used 820 evaluations (unnormalized), normalized to 1.0, then CACO used 8.3 * 820 evaluations.
The mean and median are two different measures of central tendency. Check Wikipedia to understand the differences.
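To make that concrete, here is a tiny Python illustration with made-up numbers (they are not taken from the table in the question):

from statistics import mean, median

# Hypothetical raw (unnormalized) evaluation counts on one test function.
evaluations = {"ACO": 820, "CACO": 6806, "API": 1230}

baseline = evaluations["ACO"]                     # the algorithm reported as 1.0
normalized = {alg: count / baseline for alg, count in evaluations.items()}
print(normalized)        # {'ACO': 1.0, 'CACO': 8.3, 'API': 1.5}

# Mean vs. median over several runs of one algorithm:
runs = [800, 810, 820, 830, 5000]                 # one outlier run
print(mean(runs))        # 1652 -> pulled up by the outlier
print(median(runs))      # 820  -> robust against the outlier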
In my application I need a SQL-like query over the documents. The big picture is that there is a page with a paginated table showing the CouchDB documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, US state, various numeric fields, etc. All of these columns are orderable, and there is also a filter form allowing the user to filter by each of the fields.
To be more concrete, below is a typical query resulting from a customer setting some of the filter options and moving to the second page. It's written in pseudo-SQL code, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any SQL-like database, but CouchDB is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above, I calculate the intersection of the sets of document IDs and apply the sorting using the result of the timestamp query. Then, finally, I can apply the slice resolving the document IDs of rows 15-30 and download their content using a bulk get operation.
Needless to say, it's not the fastest operation. Currently the dataset I'm working with is roughly 10K documents big. I can already see that the part where I'm calculating the intersection of the sets can take about 4 seconds, so obviously I need to optimize it further. I'm afraid to think how slow it's going to get in a few months when my dataset doubles, triples, etc.
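In Python terms, the step I described looks roughly like this (the row lists below are toy stand-ins for the real view responses):

def page_of_ids(timestamp_rows, filter_row_sets, skip=15, limit=15):
    # timestamp_rows: view rows already sorted by timestamp (descending).
    # filter_row_sets: one list of view rows per additional filter.
    allowed = {row["id"] for row in timestamp_rows}
    for rows in filter_row_sets:
        allowed &= {row["id"] for row in rows}        # the costly intersections
    ordered = [row["id"] for row in timestamp_rows if row["id"] in allowed]
    return ordered[skip:skip + limit]                 # SKIP 15 / LIMIT 15

# Toy data: three documents, two extra filters.
ts_rows = [{"id": "doc3"}, {"id": "doc2"}, {"id": "doc1"}]        # newest first
not_test = [{"id": "doc1"}, {"id": "doc2"}, {"id": "doc3"}]
production = [{"id": "doc2"}, {"id": "doc3"}]
print(page_of_ids(ts_rows, [not_test, production], skip=0, limit=15))   # ['doc3', 'doc2']
# The resulting IDs are then fetched with a single bulk get.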
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without losing the flexibility of the tool?
Is the view structure I've used optimal? At some point I was considering using a separate map() function generating the value of each field. This would result in smaller B-trees but more work for the view server to generate the index. Can I benefit this way?
The part of the algorithm where I have to calculate intersections of the big sets just to later take a slice of the result bothers me. It's not a scalable approach. Does anyone know a better algorithm for this?
Given a map function like:
function(doc) {
  if (doc.marked_as_test) return;
  emit([doc.dataspace, doc.timestamp, doc.fico], null);
}
you can make a request like this:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass the :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are calculable variables on the database side, right?).
If you don't care about the dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in the query parameters.
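For completeness, issuing the same request from Python could look roughly like this (the epoch values are placeholders and include_docs is optional; CouchDB expects the key parameters to be JSON-encoded):

import json
import requests

# Placeholder epoch values; compute the real Monday boundaries on the client.
this_weeks_monday_epoch = 1706486400
last_weeks_monday_epoch = 1705881600

base = "http://localhost:5984/db/_design/ddoc/_view/view"
params = {
    # With descending=true, CouchDB expects startkey to be the "higher" key.
    "startkey": json.dumps(["production", this_weeks_monday_epoch]),
    "endkey": json.dumps(["production", last_weeks_monday_epoch, 650]),
    "descending": "true",
    "limit": 15,
    "skip": 15,
    "include_docs": "true",   # pull the documents in the same round trip
}
rows = requests.get(base, params=params).json()["rows"]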
I don't think CouchDB is a good fit for a general solution to your problem. However, there are two basic ways you can mitigate how poorly CouchDB fits the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data (see the sketch at the end of this answer). On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.
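As a rough sketch of option 1 (the field list, the doc.type check and the design document name below are made up, so adapt them to your schema), you could generate one view per searchable field like this:

# One view per searchable field; the map bodies are ordinary CouchDB JavaScript
# emitted as strings from Python.
fields = ["timestamp", "customer_name", "state", "fico", "dataspace", "marked_as_test"]

design_doc = {
    "_id": "_design/by_field",
    "views": {
        f"by_{field}": {
            "map": (
                "function(doc) {"
                f" if (doc.type === 'order' && doc.{field} !== undefined)"
                f" emit(doc.{field}, null);"
                " }"
            )
        }
        for field in fields
    },
}
# PUT design_doc to the database (for example with requests.put) to build the indexes.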
I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character substring that could differ from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.
Q-grams could be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matches, etc.
The Simmetrics library is a comprehensive collection of string distance approaches.
Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing (LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.
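As a sketch of that prune-then-score idea (my own toy code, not a particular library: it treats each database entry as one candidate string, whereas a real substring search would also slide a window over long texts, and q and min_overlap are only illustrative knobs):

from collections import Counter

def qgrams(s, q=4):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def shared_qgrams(a, b):
    return sum((a & b).values())          # multiset intersection of q-grams

def levenshtein(a, b):
    # Plain dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(query, texts, max_edits=300, q=4, min_overlap=0.3):
    # min_overlap is a heuristic pruning threshold, not a guaranteed bound.
    qg = qgrams(query, q)
    total = max(sum(qg.values()), 1)
    results = []
    for t in texts:
        if shared_qgrams(qg, qgrams(t, q)) / total < min_overlap:
            continue                      # cheap q-gram filter prunes this one
        d = levenshtein(query, t)         # expensive exact scoring on survivors
        if d <= max_edits:
            results.append((t, d))
    return results

db = ["the quick brown fox jumps over the lazy dog",
      "the quick brown fax jumped over a lazy dog",
      "completely unrelated text about bin packing"]
print(search("the quick brown fox jumps over the lazy dog", db, max_edits=10))
# The first two strings survive the filter and match; the third is pruned.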
I found this on an "interview questions" site and have been pondering it for a couple of days. I will keep churning, but I am interested in what you all think.
"10 Gbytes of 32-bit numbers on a magnetic tape, all there from 0 to 10G in random order. You have 64 32 bit words of memory available: design an algorithm to check that each number from 0 to 10G occurs once and only once on the tape, with minimum passes of the tape by a read head connected to your algorithm."
32-bit numbers can take 4G = 2^32 different values. There are 2.5 * 2^32 numbers on the tape in total, so after reading 2^32 + 1 of them, at least one value is guaranteed to have repeated. If there were at most 2^32 numbers on the tape, then two different cases would be possible: all numbers are different, or at least one repeats.
It's a trick question, as Michael Anderson and I have figured out. You can't store 10G 32b numbers on a 10G tape. The interviewer (a) is messing with you and (b) is trying to find out how much you think about a problem before you start solving it.
The utterly naive algorithm, which takes as many passes as there are numbers to check, would be to walk through and verify that the lowest number is there. Then do it again checking that the next lowest is there. And so on.
This requires one word of storage to keep track of where you are - you could cut down the number of passes by a factor of 64 by using all 64 words to keep track of where you're up to in several different locations in the search space - checking all of your current ones on each pass. Still O(n) passes, of course.
You could probably cut it down even more by using portions of the words - given that your search space for each segment is smaller, you won't need to keep track of the full 32-bit range.
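A toy simulation of that 64-targets-per-pass idea, with the tape shrunk to a Python list so it actually runs:

def check_all_present(tape, universe_size, words=64):
    # Verify every value in 0..universe_size-1 appears on the tape, checking
    # up to `words` target values per sequential pass.
    passes = 0
    next_target = 0
    while next_target < universe_size:
        targets = set(range(next_target, min(next_target + words, universe_size)))
        seen = set()
        for value in tape:                 # one full pass over the tape
            if value in targets:
                seen.add(value)
        passes += 1
        if seen != targets:
            return False, passes           # some expected value is missing
        next_target += words
    return True, passes

tape = list(range(1000))                   # pretend tape holding 0..999
print(check_all_present(tape, 1000))       # (True, 16): ceil(1000 / 64) passes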
Perform an in-place mergesort or quicksort, using tape for storage? Then iterate through the numbers in sequence, tracking to see that each number = previous+1.
Requires cleverly implemented sort, and is fairly slow, but achieves the goal I believe.
Edit: oh bugger, it's never specified you can write.
Here's a second approach: scan through, trying to build up to 30-ish ranges of contiguous numbers, e.g. 1,2,3,4,5 would be one range, 8,9,10,11,12 would be another, etc. If a range overlaps with an existing one, they are merged. I think you only need to make a limited number of passes to either get the complete range or prove there are gaps... much less than just scanning through in blocks of a couple thousand to see if all the numbers are present.
It'll take me a bit to prove or disprove the limits for this though.
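In the meantime, here is a toy sketch of the range bookkeeping (Python lists instead of 64 machine words, and without modelling the tape passes; it only shows the merging):

import bisect

def add_value(ranges, v, limit=30):
    # ranges: sorted, disjoint [lo, hi] pairs.  Returns False if adding v
    # would exceed the limit on the number of ranges.
    i = bisect.bisect_left(ranges, [v, v])
    # Try to extend the range ending just before v.
    if i > 0 and ranges[i - 1][1] >= v - 1:
        ranges[i - 1][1] = max(ranges[i - 1][1], v)
        # Merge with the following range if they now touch.
        if i < len(ranges) and ranges[i][0] <= ranges[i - 1][1] + 1:
            ranges[i - 1][1] = max(ranges[i - 1][1], ranges[i][1])
            del ranges[i]
        return True
    # Try to extend the range starting just after v.
    if i < len(ranges) and ranges[i][0] <= v + 1:
        ranges[i][0] = min(ranges[i][0], v)
        return True
    if len(ranges) >= limit:
        return False
    ranges.insert(i, [v, v])
    return True

ranges = []
for v in [5, 1, 2, 3, 4, 9, 8, 10]:
    add_value(ranges, v)
print(ranges)   # [[1, 5], [8, 10]]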
Do two reductions over the numbers: a sum and a bitwise XOR.
The sum should be (10G + 1) * 10G / 2.
The XOR should be the XOR of all values from 0 to 10G, which has a simple closed form depending on 10G mod 4.
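Concretely, the check could look like this in Python; the XOR of 0..N has a well-known period-4 closed form. Note that a sum plus XOR check is a screening test, not a proof: some rearrangements (e.g. swapping {1, 2} for {0, 3}) leave both values unchanged.

from functools import reduce
from operator import xor

def xor_0_to_n(n):
    # XOR of all integers 0..n, by the period-4 pattern: n, 1, n+1, 0
    return [n, 1, n + 1, 0][n % 4]

def check(tape, n):
    total = sum(tape)                      # first reduce
    folded = reduce(xor, tape, 0)          # second reduce
    return total == n * (n + 1) // 2 and folded == xor_0_to_n(n)

print(check(list(range(11)), 10))                       # True
print(check([0, 1, 2, 2, 4, 5, 6, 7, 8, 9, 10], 10))    # False (3 replaced by 2)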
It looks like there is a catch in the question that no one has talked about so far: the interviewer has only asked the interviewee to write a program that CHECKS
(i) whether each number that makes up the 10G is present once and only once. What should the interviewee do if a number in the given list is present multiple times? Should he assume that he should stop executing the program and throw an exception, or should he assume that he should correct the mistake by removing the repeated number and replacing it with another (this may actually be a costly exercise, as it involves a complete reshuffle of the number set)? Correcting this is required to perform the second step in the question, i.e. to verify that the data is stored in the best possible way, the one that requires the fewest possible passes.
(ii) When the interviewee is asked only to check whether the 10G data set of numbers is stored in such a way that accessing any of those numbers requires the fewest passes,
what should the interviewee do? Should he stop and throw an exception the moment he finds an issue in the way they were stored, or correct the mistake and continue until all the elements are arranged in the order that requires the fewest possible passes?
If the intention of the interviewer were to ask the interviewee to write an algorithm that finds the best combination of numbers that can be stored in 10 GB, given 64 32-bit registers, and also to write an algorithm to save this chosen set of numbers in the best possible way so that accessing each requires the fewest passes, he would have asked that directly, wouldn't he?
I suppose the intention of the interviewer may only be to see how the interviewee approaches the problem rather than to actually extract a working solution from the interviewee; would anyone buy this notion?
My drive has DMG files. The sum of their sizes is strictly below 47 GB. I have 11 DVDs, each of size 4.7 GB. I want to use as few DVDs as possible, without using compression (that restriction may be superfluous, since the problem is about the optimal combinations of the DMG files; you can think of it in terms of compressed files if you want).
You can see that the DMG-files have arbitrary sizes. So many solutions are possible.
find . -iname "*.dmg" -exec du '{}' \; 2> /dev/null
1026064 ./Desktop/Desktop2.dmg
5078336 ./Desktop/Desktop_2/CS_pdfs.dmg
2097456 ./Desktop/Desktop_2/Signal.dmg
205104 ./Dev/things.dmg
205040 ./Dev/work.dmg
1026064 ./DISKS/fun.dmg
1026064 ./DISKS/school.dmg
1026064 ./DISKS/misc.dmg
5078336 ./something.dmg
The files on the DVDs can be in an arbitrary order. For example, CS_pdfs.dmg and Signal.dmg do not need to be on the same disk.
So how can I find a way to use as few DVDs as possible?
Mathematically, your problem is called the bin packing problem (which is related to the knapsack problem).
Since it is NP-hard, it is very difficult to solve efficiently! There is a recursive solution (dynamic programming + backtracking), but even this may require large amounts of space and computation time.
The most straightforward solution is a greedy algorithm (see Blindy's post), but this may give bad results.
It depends on how many items (n) you want to pack and how precise the solution must be (more precision will increase the runtime!). For small n the recursive/brute-force or backtracking solution is sufficient; for bigger problems I'd advise using some metaheuristic. Genetic algorithms in particular work quite well and yield good approximations in acceptable timespans.
Totally different solution: use split and cut the files across DVD boundaries. You'll get 100% utilization of every disk but the last. http://unixhelp.ed.ac.uk/CGI/man-cgi?split
You should probably try the greedy algorithm before anything else: that is, each time pick the largest item that can still fit on the current DVD. While this is not guaranteed to work well, this problem is NP-complete, so no efficient exact algorithm is known. I had a similar problem recently, and the greedy algorithm worked quite well in my case; maybe it'll be good enough in yours as well.
The most generic solution would involve implementing a simple backtracking algorithm, but I'm fairly certain that in this particular case you can just sort them by size and pick the largest file that fits on your disc over and over until it's full, then move on to the next with the remaining files.
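For what it's worth, here is a small first-fit-decreasing sketch of that greedy idea in Python, using the sizes from the du listing in the question. I'm assuming the sizes are 512-byte blocks (the macOS du default); adjust the capacity if your du reports 1K blocks.

def first_fit_decreasing(sizes, capacity):
    bins = []                                      # each bin: [used, [file names]]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        for b in bins:
            if b[0] + size <= capacity:            # first DVD with enough room
                b[0] += size
                b[1].append(name)
                break
        else:
            bins.append([size, [name]])            # open a new DVD
    return bins

# Sizes straight from the du output in the question.
sizes = {
    "Desktop2.dmg": 1026064, "CS_pdfs.dmg": 5078336, "Signal.dmg": 2097456,
    "things.dmg": 205104, "work.dmg": 205040, "fun.dmg": 1026064,
    "school.dmg": 1026064, "misc.dmg": 1026064, "something.dmg": 5078336,
}
DVD_CAPACITY = 4_700_000_000 // 512                # 4.7 GB in 512-byte blocks

for used, files in first_fit_decreasing(sizes, DVD_CAPACITY):
    print(used, files)                             # these nine files fit on 2 DVDs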