I'm new to Apache Lucene.
Is it possible to store files (e.g. PDF, DOC) in Apache Lucene and retrieve them later? Or do I have to store those files somewhere else and use Lucene only for indexing?
Technically you can, of course, store the contents of a file (e.g. in a StoredField or elsewhere), but I don't see any reason why you should. It adds no value, only pain: you have to serialize and deserialize the file contents, you still have to keep the file name indexed somewhere else, and your app will likely block longer while Lucene merges index segments.
The best approach, IMO, is to store the path to the file relative to some file repository root - e.g. if your file is /home/users/bob/files/123/file.txt, you might want to store the files/123/file.txt part without tokenization (using a StringField).
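For illustration, here is a minimal sketch of what that could look like with the Lucene API; the field names, the relative path, and the text-extraction step are placeholders, not something from the question:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    class FileIndexer {
        // writer is an IndexWriter you already opened; extractedText is whatever
        // your extraction library pulled out of the file.
        static void indexFile(IndexWriter writer, String relativePath, String extractedText)
                throws IOException {
            Document doc = new Document();
            // Stored, not tokenized: you get the exact path back from search hits.
            doc.add(new StringField("path", relativePath, Field.Store.YES));
            // Tokenized, not stored: used only for searching; the file itself stays on disk.
            doc.add(new TextField("contents", extractedText, Field.Store.NO));
            writer.addDocument(doc);
        }
    }

At query time you read the path field from the matching document and open the file from your repository root yourself.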
I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before loading the whole file. Well, to be precise, I can - if the file is linearized. But that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3, ...). But the people at Adobe probably had their reasons to put it after the body. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day" - rather, because of scale. It's easy to think that an invoice with a couple of pages of text could be done better differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end, you can write images/text/fonts to the file as they are processed and then discard them from memory, keeping only the file offset of each object so the cross-reference table and trailer can be written at the end.
If the trailer had to come first, you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the trailer, and only then write the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in RAM until you could write it all to the file.
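To make that concrete, here is a rough sketch of the write-as-you-go approach. This is illustrative pseudo-Java, not a real PDF library: PdfObject, serialize(), writeXref() and writeTrailer() are invented stand-ins.

    // Each object is streamed out (and can then be dropped from memory) as soon
    // as it is ready; only its byte offset is remembered so the cross-reference
    // table and trailer can be appended at the very end of the file.
    List<Long> offsets = new ArrayList<>();
    try (RandomAccessFile out = new RandomAccessFile("out.pdf", "rw")) {
        out.writeBytes("%PDF-1.7\n");              // header
        for (PdfObject obj : objects) {            // body
            offsets.add(out.getFilePointer());     // where this object starts
            obj.serialize(out);                    // write it, then let it be discarded
        }
        long xrefStart = out.getFilePointer();
        writeXref(out, offsets);                   // one entry per recorded offset
        writeTrailer(out, xrefStart);              // trailer points back at the xref
    }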
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross-reference table would become. Writing the cross references at the end was therefore a natural consequence.
I recently recovered a 1.5 TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise; it's excellent. When the files were successfully recovered, they were all saved with a .efs extension, so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted; I spent 10 minutes on it and realized I just needed to remove the .efs from the end of the filename and mydocument.docx works perfectly. The problem is that I now have over 55,000 files within hundreds of folders where I need to simply remove the .efs from each filename. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare-format file listing of the current directory plus all nested subdirectories, without any extraneous information. The list will be contained in the text file named "filelist.txt" (or whatever else you choose to call it). I would then open this text file in a text editor and convert every line from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with columnar editing capability, since you have to modify so many lines. Old programmer's editors such as CodeWright made this really simple, while modern editors such as Eclipse or Notepad++ make it a little more difficult and may require a columnar editing plugin, depending on the version. You basically make a columnar copy of all of the text in the file and paste that copy off to the far right - far enough that the second column of filenames and paths won't overwrite any of the existing ones. You can then use the columnar editing features to select and delete the path portion of the text in the second column, since the rename command requires that the second argument be just the base filename and extension without any path information. Finally, use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII list of paths and filenames into a batch file (simply rename filelist.txt to filelist.bat when you are finished editing). You can then run the batch file by typing filelist.bat at a command prompt.
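If the editor gymnastics feel too fiddly, the same transformation can be scripted. As a sketch only - filelist.txt is the listing from the DIR command above, and renameall.bat is an invented output name - here is a short Java program that turns the listing into a batch file of RENAME commands:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    // Reads the DIR /S/B listing and writes one RENAME command per .efs file -
    // the same result as the columnar/regex editing described above, automated.
    public class MakeRenameBatch {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Paths.get("filelist.txt"));
            try (PrintWriter bat = new PrintWriter("renameall.bat")) {
                for (String line : lines) {
                    if (!line.toLowerCase().endsWith(".efs")) continue;
                    Path full = Paths.get(line);
                    String original = full.getFileName().toString();
                    // strip the trailing ".efs" to recover the real file name
                    String restored = original.substring(0, original.length() - 4);
                    bat.printf("RENAME \"%s\" \"%s\"%n", full, restored);
                }
            }
        }
    }

Run it from the same directory as filelist.txt, then execute renameall.bat from a command prompt.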
I have just run into this same problem myself, using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very counter-intuitive, so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have to rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active@ UNDELETE documentation. The documentation offers no further explanation of what kinds of disks support encryption, etc.
They offer a Decrypt command that restores the files' proper names as a post-processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards or for recursing into subdirectories, so in my opinion that is a non-starter, given that both of us have hundreds of thousands of files to be renamed.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, the resulting files do not end up "encrypted" (i.e., they are recovered with the proper file name and extension). I originally chose a large USB-based flash drive and the files were stored in their "encrypted" state (not really encrypted, but potentially so, hence the .efs extension). Of course, this meant that I had to run the recovery all over again after switching to a regular hard drive (it takes about 16 hours to recover 80 GB worth of files due to the presence of many sector CRC errors).
This is a bit of a two-part question about working with 40 MB XML files.
• What’s a reasonable size to store in memory for a program running continually in the background?
• How do I find what has changed in an XML file?
So on the first read, the XML is loaded into NSData and then uploaded to the server.
Now, instead of uploading a 40 MB XML file every time it changes, I would prefer to upload a “delta” file containing only what has changed. The program would monitor the file for changes and activate when it has been modified. From what I can see, I would need to parse an old version of the XML file and the modified XML file, then compare them. Is it unreasonable to store 80 MB in memory like this every time the file is modified? I'm also assuming that this has to be done with a DOM parser, because I can't see how you could compare two files like that with a SAX parser, since it only has part of the file in memory at any time.
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are a couple of ways to do this (since whole files are involved, I may not be entirely correct):
sdiff file1.xml file2.xml (a Unix command)
You can run this command from AppleScript.
-[NSFileManager contentsEqualAtPath:andPath:]
This method first checks whether the two paths refer to the same file, then compares the files' sizes, and finally compares their contents.
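That same strategy - identity check first, then sizes, then a full contents comparison - is easy to replicate outside Cocoa too. Here is a rough Java sketch of the idea, purely for illustration (it is not the actual NSFileManager implementation, and it loads both files fully into memory, so for very large files you would stream instead):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    class FileCompare {
        // Cheap checks first, full contents comparison only as a last resort.
        static boolean contentsEqual(Path a, Path b) throws IOException {
            if (Files.isSameFile(a, b)) return true;           // literally the same file
            if (Files.size(a) != Files.size(b)) return false;  // different sizes: not equal
            return Arrays.equals(Files.readAllBytes(a),        // finally compare the bytes
                                 Files.readAllBytes(b));
        }
    }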
For the other part:
I don't think there is a fixed size limit for a background process; it matters more at the application level. You can save the data into temporary files instead of keeping it all in memory. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
NSXMLParser ended up being the most useful for this
I have lots of ".txt" files in a single directory and I want to give them to Lucene for indexing.
I read all the files in the directory and, for each file, make a Document, then use indexWriter.addDocument(Document) to give these files to Lucene.
Is it possible to make all the documents and give all of them to Lucene at once? I mean, does Lucene support this feature?
This feature was added in Lucene 3.2 (IndexWriter.addDocuments).
No, you will have to add each document on its own.
Furthermore, I recommend using a configurable batch size: load that many .txt files, index them, and carry on as long as there are more text files, as in the sketch below. This way you will not run into memory problems when you have bigger files.
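A rough sketch of that batching loop, assuming a reasonably recent Lucene API (5.x or later) - the paths, batch size, and analyzer are placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class TxtIndexer {
        public static void main(String[] args) throws IOException {
            Path docsDir = Paths.get("/path/to/txt/files");   // placeholder path
            int batchSize = 1000;                             // tune to your heap

            List<Path> txtFiles;
            try (Stream<Path> s = Files.list(docsDir)) {
                txtFiles = s.filter(p -> p.toString().endsWith(".txt"))
                            .collect(Collectors.toList());
            }

            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")),   // placeholder path
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                int inBatch = 0;
                for (Path file : txtFiles) {
                    Document doc = new Document();
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    doc.add(new TextField("contents", Files.readString(file), Field.Store.NO));
                    writer.addDocument(doc);            // one document per file
                    if (++inBatch >= batchSize) {       // commit after each batch
                        writer.commit();
                        inBatch = 0;
                    }
                }
                writer.commit();                        // flush the final partial batch
            }
        }
    }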
I noticed that Apple started using zip archives to replace document packages (folders appearing as a single file in Finder) in the iWork applications. I'm considering doing the same, as I keep getting support emails about my document packages getting corrupted when they are copied to a Windows file server.
My question is: what would be the best way to do this in an NSDocument-based application?
I guess the easiest way would be to create a directory file wrapper, create an archive of it, and return it from NSDocument's
- (NSFileWrapper *)fileWrapperOfType:(NSString *)typeName error:(NSError **)outError
But I fail to understand how to create a zip archive of the NSFileWrapper.
If you just want to make a zip file your format (i.e., "mydoc.myextension" is actually a zip file), there's no convenient, built-in Cocoa mechanism for creating zip archives in code. Take a look at this Google Code project: ziparchive. I don't believe a file wrapper will help in that case, though.
Since you cited iWork: I don't own iWork '09, but previous versions use a package format (i.e., NSFileWrapper would be ideal) and zip the XML that describes the document's structure, while keeping attachments (like embedded media, images, etc.) in a resource folder, all within the package. I assume they do this because the XML can be quite large for big, complicated documents but compresses very well because it's text. This results in an overall smaller document.
If indeed Apple has moved to making the entire document one big zip archive (which I would find odd), they'd either be extracting necessary resources to a temp folder somewhere or loading the whole thing into memory (a step backward from their package-based approach, IMO). These are considerations you'll need to take into account as well.
You’ll want to take the data from the file wrapper and feed it into something like ziparchive.
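The archiving step itself is nothing exotic; Cocoa just doesn't ship an API for it, which is why a library like ziparchive comes in. Purely to illustrate the general shape of "walk the package contents and write each file into one archive" - shown here in Java with java.util.zip rather than Objective-C, and with invented file names - it amounts to this:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    // Walk a document directory and write every regular file into a single zip
    // archive, preserving the paths relative to the package root.
    public class ZipPackage {
        public static void main(String[] args) throws IOException {
            Path packageDir = Paths.get("MyDoc.package");     // placeholder names
            Path zipFile = Paths.get("MyDoc.myextension");

            List<Path> files;
            try (Stream<Path> walk = Files.walk(packageDir)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
                for (Path file : files) {
                    zip.putNextEntry(new ZipEntry(packageDir.relativize(file).toString()));
                    Files.copy(file, zip);   // stream the file's bytes into the entry
                    zip.closeEntry();
                }
            }
        }
    }

In the Cocoa version, ziparchive (or a similar library) plays the role of ZipOutputStream, and the entries come from the file wrapper's contents.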
Pierre-Olivier Latour has written an extension to NSData that deals with zip compression. You can get it here: http://code.google.com/p/polkit/
I know this is a little late to the party but I thought I'd offer up another link that could help anyone that comes across this post.
Looks like the ZipBrowser sample from Apple would be a good start http://developer.apple.com/library/mac/#samplecode/ZipBrowser/Introduction/Intro.html
HTH