Is there a method to predict the size of a .doc? - file-upload

I'd like to know if there is a way to predict the size of a .doc based on the number of pages it contains.
I'm trying to form a rough estimate of how much storage space I'll need when anticipating X .doc files of Y pages each.
For example: "I need to plan for 100 .DOC files, each being 15 pages in length. These .DOCs will consume roughly _KB of space."
Thanks

Not really. There are too many variables.
For example, a .doc (Word 2003) containing a page of nothing but the character 'a' in one font takes up 37 KB; in another font the same number of characters takes 32 KB; in another, 43 KB.
The most practical way would be for you to average the sizes of the sample files, and work off that, but be aware that the smallest change can have rather large effects.
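A minimal sketch of that averaging approach in VB.NET, assuming a folder of representative sample files (the path "C:\samples" and the count of 100 files are placeholders):

Imports System.IO
Imports System.Linq

Module StorageEstimate
    Sub Main()
        ' Estimate: (average size of the sample .docs) * (expected file count).
        Dim sizes = Directory.GetFiles("C:\samples", "*.doc") _
                             .Select(Function(f) New FileInfo(f).Length)
        Dim avgBytes As Double = sizes.Average()
        Dim expectedFiles As Integer = 100   ' placeholder count
        Console.WriteLine($"Estimated total: {avgBytes * expectedFiles / 1024:F0} KB")
    End Sub
End Module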

Related

Optimizing a vb.net code that uses a very large string list, over 400 000 entries

I use a static list of unique strings, T (a French dictionary), with 402,325 entries. My code plays Scrabble and uses a specialized construction called a GADDAG to build playable words, then verifies that the words are actually in the list. Is there a faster way than List.IndexOf to find whether a word exists? I looked at HashSet.Contains, but a HashSet does not give me an index I can use to retrieve a word. For a given turn of play there can be thousands of valid solutions; I currently store only the index into the list, but with a HashSet I would have to store the words themselves, which increases memory usage. In most cases solutions are found in one or two seconds, but in some cases (i.e. with blanks) it takes up to 15 seconds, and I need to reduce that if at all possible in VB!
As Craig suggested, using List.BinarySearch on a sorted list of T improves the speed roughly tenfold. A Scrabble play with a blank letter now takes no more than 1 or 2 seconds, compared to 15 to 20 seconds when I was using IndexOf.
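A minimal sketch of that approach: sort the list once at startup, then binary-search it. Note that the comparer passed to Sort and BinarySearch must be the same.

Imports System.Collections.Generic

Module DictionaryLookup
    ' The 402,325-entry word list, loaded elsewhere.
    Private ReadOnly Words As New List(Of String)()

    Sub Prepare()
        Words.Sort(StringComparer.Ordinal)   ' BinarySearch requires sorted input
    End Sub

    Function FindWord(candidate As String) As Integer
        ' O(log n) per lookup instead of IndexOf's O(n) linear scan.
        ' Returns the word's index in the sorted list, or a negative
        ' number if the word is absent.
        Return Words.BinarySearch(candidate, StringComparer.Ordinal)
    End Function
End Module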

Is it possible for RapidMiner to mine 50,000 documents using 4GB RAM?

I am facing a challenge processing 50k rows of data with feedback in text form, and I am trying to find a good way to reduce the dimensionality. So far I have used the text-processing steps tokenize, transform to lowercase, remove stopwords, and stem, but this still gives a very big dimension space of about 15,000 terms, with meaningless words included as well. What else can I do to extract only the relevant words?

Using multiple threads for faster execution

Approximate program behavior:
I have a map image with data associated with the map, indicated by RGB index. The data has been populated into an MS Access database. I imported the information from the database into my program as an array and sorted it into the order I want the program to process it.
I want the program to find the nearest pixel that has a different color from the incumbent pixel being compared. (Colors are stored as string attributes of object Pixel)
First question: Should I use integers to represent my colors instead of strings? Would this make the comparison function run significantly faster?
In order to find the nearest pixel of different color, the program begins with all 8 adjacent pixels around the incumbent. If a nonMatch is not found, it then continues on to the next "degree", and in this fashion it spirals out from the incumbent pixel until it hits a nonMatch. When found, the color of the nonMatch is saved as an attribute of the incumbent. After I find the nonMatch for each of the pixels, the data is re-inserted into the database.
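For concreteness, a hedged sketch of that spiral search, assuming the colors are already Integer codes in a 2-D array (as question one above contemplates):

Module SpiralSearch
    ' Spiral ("degree ring") search: examine all cells at Chebyshev
    ' distance d from (x, y), growing d until a different color is found.
    Function FindNearestNonMatch(colors(,) As Integer, x As Integer, y As Integer) As Integer
        Dim w = colors.GetLength(0)
        Dim h = colors.GetLength(1)
        Dim target = colors(x, y)
        For degree = 1 To Math.Max(w, h)
            For dx = -degree To degree
                For dy = -degree To degree
                    ' Keep only cells on the ring itself, not inside it.
                    If Math.Max(Math.Abs(dx), Math.Abs(dy)) <> degree Then Continue For
                    Dim nx = x + dx
                    Dim ny = y + dy
                    If nx < 0 OrElse ny < 0 OrElse nx >= w OrElse ny >= h Then Continue For
                    If colors(nx, ny) <> target Then Return colors(nx, ny)
                Next
            Next
        Next
        Return target ' uniform image: no nonMatch exists
    End Function
End Module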
The program accomplishes what I want in the manner I've written it, but it is very very slow. After 24 hours, I am only about 3% through with execution.
Question Two: Does my program behavior sound about right? Is this the algorithm you would use if you had to accomplish this task?
Question Three: Would it be appropriate for me to use threads in order to finish execution of the program faster? How exactly does that work? (I am brand new to threads, but know a little of the syntax)
Question Four: Would it be more "intelligent" for my program to find the nonMatch for each pixel and insert it into the database immediately after finding it? (I'm guessing this would be good for multi-threading, because while one record is accessing the database (to insert), another can be accessing the array of pixels, a shared global variable in the program.)
Question Five: If threading is a good idea, I'm guessing I would split the records up into more manageable chunks (e.g. quarters) and have each thread run the same functions over its specified range of records? Am I close at all?
Please let me know if I can clarify or provide code samples, I just figured that this is more of a conceptual topic so do not want to overburden the post.
1.) Yes, integers compare much faster than strings. Additionally, they use much less memory.
2.) I would adapt the algorithm in this way:
Example 1: Let's say that for pixel (87,23) you found the nearest nonMatch to be (88,24) at degree=1. You can immediately invert the relation and record that the nearest nonMatch to (88,24) is (87,23). At degree=1 you finished 2 pixels with 1 search.
Example 2: Let's say that for pixel (17,18) you found the nearest nonMatch to be (17,20) at degree=2. You can immediately record that every pixel bordering on all of (16,19), (17,19) and (18,19) has its nearest nonMatch (17,20) at degree=1, and that one of them is the nearest nonMatch to (17,20). At degree=2 (or higher), you finished 5 pixels with 1 search.
3.) Using threads is a double-edged sword: you can do searches in parallel, but you need locking if you write to your array. So this depends on how many CPU cores you can throw at the problem; with 3 or more, threads will surely speed up the search.
4.) The results from 2.) make it necessary to mark a pixel as "done" in your array, since you might finish up to 5 pixels with 1 search. I recommend you put finished pixels into a queue and use a dedicated thread to write the queue back to the database: MS Access can't handle concurrent updates, so a single database-writer thread looks like a good idea.
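A minimal sketch of that single-writer pattern using .NET's BlockingCollection (the PixelResult shape and the write step are placeholders):

Imports System.Collections.Concurrent
Imports System.Threading.Tasks

Module SingleWriter
    Public Structure PixelResult
        Public X, Y, NearestNonMatchColor As Integer
    End Structure

    Private ReadOnly Results As New BlockingCollection(Of PixelResult)()

    ' Search threads call this after each finished pixel.
    Public Sub Enqueue(r As PixelResult)
        Results.Add(r)
    End Sub

    ' One dedicated thread drains the queue, so Access only ever sees a
    ' single writer. Call Results.CompleteAdding() when the searches end.
    Public Function StartWriter() As Task
        Return Task.Run(Sub()
                            For Each r In Results.GetConsumingEnumerable()
                                ' Placeholder: issue the UPDATE here.
                                Console.WriteLine($"({r.X},{r.Y}) -> {r.NearestNonMatchColor}")
                            Next
                        End Sub)
    End Function
End Module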
5.) I recommend you NOT chunk up the array: you will run into problems with pixels on the edge of a chunk having their nearest nonMatch in a different chunk. Instead, if you use e.g. 4 threads, let them run:
1. from the NW corner E, then S
2. from the SE corner W, then N
3. from the NE corner S, then W
4. from the SW corner N, then E
1. Yes, using an integer would make it much faster.
2. You can reuse the work you have done for the previous pixel. E.g., if (a,b) is the nearest non-equal pixel of (x,y), points around (x,y) are likely to have (a,b) as their nearest non-equal pixel too.
3. You can use different threads to work on different pixels instead of dividing up the search for a single pixel (see the sketch below).
IMHO, steps 1 and 2 should make your program much faster, and you might not need multi-threading.
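A hedged sketch of point 3's per-pixel parallelism, reusing the FindNearestNonMatch spiral-search sketch from earlier in this thread:

Imports System.Threading.Tasks

Module ParallelSolve
    ' Each row is handed to the thread pool. Reading the shared colors
    ' array is safe because nothing mutates it during the search phase.
    Sub SolveAll(colors(,) As Integer, results(,) As Integer)
        Parallel.For(0, colors.GetLength(0),
                     Sub(x)
                         For y = 0 To colors.GetLength(1) - 1
                             results(x, y) = SpiralSearch.FindNearestNonMatch(colors, x, y)
                         Next
                     End Sub)
    End Sub
End Module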
Yes, I'd convert colour strings to Integers for speed, or even Color structures if you intend to display them on the screen.
Don't work directly with the database if you can avoid it. Copy the necessary data out of the database into an array before you start, and copy your results back when you're finished.

How do I get Average field length and Document length in Lucene?

I am trying to implement the BM25F scoring system on Lucene. I need to make a few minor changes to the original implementation given here for my needs, but I got lost at the part where it gets the average field length and document length. Could someone guide me as to how or where I can get them?
You can get field length from TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself, and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
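A sketch of that term-vector route, assuming the Lucene.NET 4.8 port and a hypothetical "body" field indexed with term vectors enabled (in Java Lucene the equivalents are IndexReader.getTermVector and Terms.getSumTotalTermFreq):

Imports Lucene.Net.Index

Module FieldLengths
    Function GetFieldLength(reader As IndexReader, docId As Integer) As Long
        Dim vector As Terms = reader.GetTermVector(docId, "body")
        If vector Is Nothing Then Return -1 ' no term vector stored for this doc
        ' Total token occurrences in the field = the field's length.
        Return vector.SumTotalTermFreq
    End Function
End Module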
If you can store the data outside of the index, one thing you can do is count the tokens when documents are tokenized and store the counts for averaging. If your document collection is static, just dump the values for each field into a file and process them after indexing. If the index is updated with additions only, you can store the number of documents and the average length per field and recompute the average. If documents are going to be removed and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using one.
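And a minimal sketch of that external bookkeeping for the additions-only case (names are illustrative):

Public Class FieldLengthStats
    Private _totalTokens As Long
    Private _docCount As Long

    ' Call once per document with the token count observed at analysis time.
    Public Sub AddDocument(tokenCount As Integer)
        _totalTokens += tokenCount
        _docCount += 1
    End Sub

    Public ReadOnly Property AverageLength As Double
        Get
            Return If(_docCount = 0, 0.0, _totalTokens / _docCount)
        End Get
    End Property
End Class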

Compare Images in SQL

What is the best way to compare two Images in the database?
I tried to compare them (@Image is of type Image):
Select * from Photos
where [Photo] = @Image
But I receive the error "The data types image and image are incompatible in the equal to operator".
Since the Image data type is binary and takes a huge amount of space, IMO the easiest way to compare Image fields is hash comparison.
So you would store a hash of the Photo column in your table.
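A sketch of computing such a hash on the client side before the INSERT (SHA-256 is one reasonable choice; the column you store it in, e.g. PhotoHash, is up to you):

Imports System.Security.Cryptography

Module ImageHashing
    ' Store the result in a small indexed column and compare with
    ' WHERE PhotoHash = @hash instead of comparing the blobs themselves.
    Function HashImage(imageBytes As Byte()) As Byte()
        Using sha As SHA256 = SHA256.Create()
            Return sha.ComputeHash(imageBytes)
        End Using
    End Function
End Module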
If you need to compare the images, you should retrieve all of them from the database and do the comparison in the language you use to access the database. This is one of the reasons it's not best practice to store images or other binary files in a relational database. Instead, create a unique file name each time you want to store an image: rename the file to that unique name, store the image on disk, and insert into your database its name on disk and, optionally, the original name of the file or the one provided by the user of your app.
Generally, as has been mentioned already, you need to use dedicated algorithms from the image-processing shelf.
Moreover, it's hard to give a precise answer because the question is too general: whether two images are considered different can depend on a number of properties.
For instance, you can have one image of a flower at 100x100 pixels and an image of the same flower resized to 50x50 pixels. For some purposes and applications these two will be considered similar (regardless of the different dimensions), but for other purposes they will be different images.
You may want to check how image comparison is implemented by some tools and learn how it works:
pdiff
ImageMagick compare
If you don't want to compare image content but just want to check whether two binary streams (image files, other binary files, binary objects with image content) are equivalent, you can compare MD5 checksums of the images.
It depends on how accurate you want to be and how many images you need to compare. You can use various functions like DATALENGTH and SUBSTRING or READTEXT to do some comparisons. Alternatively, you could write code in the CLR and implement it through a stored procedure to do comparisons.
Comparing images falls under a specific branch of computer science called image processing. You should look for libraries that provide image-processing capabilities. Using those, you can determine to what degree two images match: two images might match each other by 50%, or more, or less. There are mathematical algorithms that define comparison formulas returning such a ratio.
Hope this gives you a direction for further work on your problem.