Is it possible for RapidMiner to mine 50,000 documents using 4GB RAM? - text-mining

I am facing a challenge processing 50k rows of data containing feedback in text form, and I am trying to find a good way to reduce the dimensionality. So far I have used the text processing steps tokenize, transform cases (to lower case), remove stop words, and stem, but it still produces a very large dimension space of about 15,000 terms, with meaningless words included as well. What else can I do to extract only the relevant words?
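For reference, the preprocessing steps described above look roughly like this when written out in Python with NLTK (a different tool from RapidMiner's operators, used here purely as an illustrative sketch; the sample sentence is made up):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download('punkt'); nltk.download('stopwords')  # one-time setup
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(feedback_text):
    tokens = word_tokenize(feedback_text)                 # tokenize
    tokens = [t.lower() for t in tokens]                  # transform to lower case
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    return [stemmer.stem(t) for t in tokens]              # stem

print(preprocess("The delivery was late but the support team was helpful"))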

Related

2D Chunked Grid Fast Access

I am working on a program that stores some data in cells (small structs) and processes each one individually. The processing step accesses the 4 neighbors of the cell (2D). I also need them partitioned in chunks, because the cells might be distributed randomly through a very large surface, and having one large grid with mostly empty cells would be a waste. I also use the chunks for some other optimizations (skipping the processing of chunks based on some conditions).
I currently have a hashmap of "chunk positions" to chunks (which are the actual fixed-size grids). The position is calculated based on the chunk size (like Minecraft). The issue is that, when processing the cells in every chunk, I lose a lot of time doing a lookup to get the chunk of the neighbor. Most of the time the neighbor is in the same chunk we are processing, so I added a check that skips the lookup in that case.
Is there a better solution to this?
This lacks some details, but hopefully you can employ a solution such as this:
Process the interior of a chunk (i.e. excluding the edges) separately. During this phase the neighbours are guaranteed to be in the same chunk, so you can do this with zero chunk lookups. The difference between this and checking whether a chunk lookup is necessary is that there is not even a check: the check is implicit in the loop bounds.
For the edges, you can do a few chunk lookups and reuse the result across the whole edge.
This approach gets worse with smaller chunk sizes, or if you need access to neighbours more than one step away. It breaks down entirely in the case of random access to cells. If you need to maintain a strict ordering for the processing of cells, it can still be used with minor modifications by rearranging things (there wouldn't be a strict "process the interior" phase, but you would still have a tight inner loop with zero chunk lookups).
Such techniques are common in cases where the boundary behaves differently from the interior.
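A minimal sketch of that split, assuming a dict of fixed-size chunks as described in the question (Python; names like CHUNK_SIZE and process are placeholders for the real code):

CHUNK_SIZE = 16
chunks = {}  # (chunk_x, chunk_y) -> CHUNK_SIZE x CHUNK_SIZE grid of cells

def process(cell, left, right, up, down):
    pass  # placeholder for the real per-cell work

def process_chunk(pos):
    chunk = chunks[pos]
    # Interior pass: the loop bounds guarantee all four neighbours are in this
    # chunk, so there are zero chunk lookups and not even a per-cell check.
    for y in range(1, CHUNK_SIZE - 1):
        for x in range(1, CHUNK_SIZE - 1):
            process(chunk[y][x],
                    chunk[y][x - 1], chunk[y][x + 1],
                    chunk[y - 1][x], chunk[y + 1][x])
    # Edge pass: one lookup per neighbouring chunk, reused across the whole edge.
    left_chunk = chunks.get((pos[0] - 1, pos[1]))
    for y in range(1, CHUNK_SIZE - 1):
        left_cell = left_chunk[y][CHUNK_SIZE - 1] if left_chunk else None
        process(chunk[y][0], left_cell, chunk[y][1],
                chunk[y - 1][0], chunk[y + 1][0])
    # ...the other three edges and the four corners are handled similarly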

Is there a method to predict the size of a .doc?

I'd like to know if there is a way to predict the size of a .doc based on the number of pages it contains.
I'm trying to form a rough estimate of how much storage space I need when anticipating X number of .doc's of Y page length.
For example: "I need to plan for 100 .DOC files, each being 15 pages in length. These .DOCs will consume roughly _KB of space."
Thanks
Not really; there are too many variables.
For example, a .doc (Word 2003) containing a page of nothing but the character 'a' takes up 37 KB in one font, 32 KB in another, and 43 KB in a third.
The most practical approach is to average the sizes of some sample files and work off that, but be aware that even small changes can have rather large effects.
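A back-of-the-envelope version of that averaging (a small Python sketch; sample_dir and the planned file count are placeholders):

import os

sample_dir = "samples"  # a folder with a few representative .doc files
sizes = [os.path.getsize(os.path.join(sample_dir, name))
         for name in os.listdir(sample_dir)
         if name.lower().endswith(".doc")]
avg_kb = sum(sizes) / len(sizes) / 1024

planned_files = 100
print(f"~{planned_files * avg_kb:.0f} KB for {planned_files} similar files")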

Table Detection Algorithms

Context
I have a bunch of PDF files. Some of them are scanned (i.e. images). They consist of text + pictures + tables.
I want to turn the tables into CSV files.
Current Plan:
1) Run Tesseract OCR to get text of all the documents.
2) ??? Run some type of Table Detection Algorithm ???
3) Extract the rows / columns / cells, and the text in them.
Question:
Is there some standard "Table Extraction Algorithm" to use?
Thanks!
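For step 1) of the plan above, a minimal sketch with the pytesseract wrapper (assuming the scanned pages have already been exported as images; the file name is a placeholder):

from PIL import Image
import pytesseract

page = Image.open("page_001.png")  # one scanned page exported as an image

# Plain text of the page
text = pytesseract.image_to_string(page)

# Word-level bounding boxes, which a later table-detection step (step 2)
# could try to cluster into rows and columns
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
print(text[:200])
print(list(zip(data["text"], data["left"], data["top"]))[:10])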
Abbyy FineReader includes table detection and will be the easiest approach. It can scan and import PDFs, TIFFs, etc. You will also be able to manually adjust the tables and columns when the auto-detection fails.
www.abbyy.com - you should be able to download a trial version, and you will also find the OCR results are much more accurate than Tesseract's, which will save you a lot of time.
Trying to write something yourself will be hit and miss, as there are too many different types of tables to cope with: with lines, without lines, shaded, spanning multiple lines, different alignments, headers, footers, etc.
Good luck.

Lucene SimpleFacetedSearch Facet count exceeded 2048

I've stumbled into an issue using Lucene.net in one of my projects, where I'm using the SimpleFacetedSearch feature for faceted search.
I get an exception thrown
Facet count exceeded 2048
I have 3 columns which I'm faceting; as soon as I add another facet I get the exception.
If I remove all the other facets the new facet works.
Drilling down into the source of SimpleFacetedSearch, I can see that inside its constructor it checks that the number of facets doesn't exceed MAX_FACETS, which is a constant set to 2048.
foreach (string field in groupByFields)
{
    ...
    // the total facet count is the product of the distinct value counts per field
    num *= fieldValuesBitSets1.FieldValueBitSetPair.Count;
    if (num > SimpleFacetedSearch.MAX_FACETS)
        throw new Exception("Facet count exceeded " + SimpleFacetedSearch.MAX_FACETS);
    fieldValuesBitSets.Add(fieldValuesBitSets1);
    ...
}
However, as it's public, I am able to set it like so.
SimpleFacetedSearch.MAX_FACETS = int.MaxValue;
Does anyone know why it is set to 2048 and if there are issues changing it? I was unable to find any documentation on it.
No, there shouldn't be any issue in changing it. But remember that using bitsets (as SimpleFacetedSearch does internally) is more performant when the search results are big but the facet count doesn't exceed some number (say, 1,000 facets and 10M hits).
If you have many more facets but the search results are not big, you can instead iterate over the results (in a collector) and build the facets there. That way you may get better performance (say, 100K facets and 1,000 hits).
So 2048 may be an optimized threshold beyond which you may see a performance loss.
The problem that MAX_FACETS is there to avoid is one of memory usage and performance.
Internally SimpleFacetedSearch uses bitmaps to record which documents each facet value is used in. There is a bit for each document, and each value has a separate bitmap, so if you have a lot of values the amount of memory needed grows quickly, especially if you also have a lot of documents: memory = values * documents / 8 bytes.
My company has indexes with millions of documents and tens of thousands of values, which would require many GBs of memory.
I've created another implementation which I've called SparseFacetedSearcher. It records the doc IDs for each value, so you only pay per hit rather than a bit per document. If you have exactly one value in each document (like a product category), the break-even point is at 32 values: with more than 32 product categories the sparse approach uses less memory.
In our case the memory usage has dropped to a few hundred MB.
Feel free to have a look at https://github.com/Artesian/SparseFacetedSearch
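To put rough numbers on the formula above, a Python back-of-the-envelope sketch (the document and value counts are invented, in the same ballpark as the figures mentioned):

# Rough comparison of the two approaches described above
docs = 5_000_000        # documents in the index
values = 20_000         # distinct facet values across all fields

# Bitmap approach (SimpleFacetedSearch): one bit per document per value
bitmap_bytes = values * docs / 8

# Doc-ID list approach (SparseFacetedSearcher): roughly one 4-byte doc ID per
# hit; with exactly one value per document that is one ID per document
sparse_bytes = 4 * docs

# Break-even with one value per document: values / 8 == 4  =>  values == 32
print(f"bitmaps: {bitmap_bytes / 1e9:.1f} GB, doc-ID lists: {sparse_bytes / 1e6:.0f} MB")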

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
1) Parse the dataset and put it in a DB like MySQL or Postgres (to be done at the start of the week). The reason I want an RDBMS in the first step is that I want to keep the full time-series data.
2) Then generate some kind of key-value store out of this database with the latest valid data, supporting an operation to find out whether each item is present in the dataset or not (thinking of some kind of NoSQL DB here, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3) Query this key-value store to find out whether each item is present (if possible, matching a list of values all at once instead of one item at a time). I want this to be blazing fast, and I will be using this functionality as the back-end to a REST API.
Side note: my language of preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER command, which performs set intersection (see the sketch below).
You might benefit from using a grid structure, distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones; you will have to experiment). With a good hash this would reduce the size per node to roughly 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
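A minimal sketch of the SINTER-based batch check from the first point (Python with the redis client; key names and the example numbers are placeholders):

import redis

r = redis.Redis()

# Weekly load: put every valid 10-digit number into one set
r.sadd("numbers", "9876543210", "9123456780")

# Batch lookup via set intersection, as suggested above: stage the queried
# numbers in a temporary set and intersect it with the big one (SINTER)
query = ["9876543210", "9000000000"]
r.sadd("query:tmp", *query)
present = r.sinter("numbers", "query:tmp")  # the subset that exists in the dataset
r.delete("query:tmp")
print(present)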