I have lots of ".txt" files in a single directory and I want to give them to Lucene for indexing.
Currently I read all the files in the directory, build a Document for each file, and then call IndexWriter.addDocument(Document) to hand each one to Lucene.
Is it possible to build all the documents first and give them to Lucene in one go? I mean, does Lucene support this?
This feature was added in Lucene 3.2.
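If the feature in question is IndexWriter.addDocuments, a minimal sketch might look like this (assuming Lucene 3.2+, an already-opened IndexWriter named writer, and illustrative field names):

    import java.io.File;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch only: addDocuments(...) exists from Lucene 3.2 onwards.
    // "writer" is an already-opened IndexWriter; field names are illustrative.
    List<Document> docs = new ArrayList<Document>();
    for (File f : new File("/path/to/txt/dir").listFiles()) {
        Document doc = new Document();
        doc.add(new Field("path", f.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(f)));
        docs.add(doc);
    }
    writer.addDocuments(docs); // hands the whole batch to Lucene in one call
    writer.commit();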
No, you will have to add each document on its own.
Furthermore, I recommend using a configurable batch size: load that many .txt files, index them, and carry on as long as there are more text files. That way you will not run into memory problems when the files get bigger.
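A rough sketch of that batching idea (the directory, batch size, and field names are assumptions; writer is an already-configured IndexWriter):

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch: add each .txt file on its own, committing every batchSize
    // documents so memory use stays bounded. Names and sizes are illustrative.
    int batchSize = 100;
    int added = 0;
    for (File f : new File("/path/to/txt/dir").listFiles()) {
        if (!f.getName().endsWith(".txt")) continue;
        Document doc = new Document();
        doc.add(new Field("path", f.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(f))); // streamed, not read into memory at once
        writer.addDocument(doc);
        if (++added % batchSize == 0) {
            writer.commit(); // flush the current batch before carrying on
        }
    }
    writer.commit();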
MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars, etc. also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
Historically, Windows had a number of technologies to support solutions like this, often called Compound Files or Structured Storage. However, the newer Office documents don't use those technologies: the Office Open XML formats are essentially ZIP archives with a different extension. If you rename a file with a .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
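For example, any zip library can look inside a .docx directly; a quick sketch in Java (the file name is just an example):

    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Treat a .docx as the zip archive it is and list its entries.
    public class ListDocxEntries {
        public static void main(String[] args) throws Exception {
            ZipFile zip = new ZipFile("example.docx"); // path is illustrative
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                System.out.println(entries.nextElement().getName()); // e.g. word/document.xml
            }
            zip.close();
        }
    }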
Here are some links to descriptions of different file formats that create "files within files":
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is just a stream (i.e. a sequence) of bytes. You can only grow it (or shrink it, i.e. truncate it) at the end - there is no way to insert bytes in the middle without copying the rest.
You need some conventions, and some additional software, to treat it as anything more structured.
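If you roll your own container, the usual convention is a small per-entry header followed by the raw bytes. A minimal, hypothetical sketch - the format itself is made up for illustration:

    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    // Hypothetical "files inside a file" container: for each entry, write
    // [name length][name bytes][payload length][payload bytes].
    public class PackFiles {
        public static void main(String[] args) throws Exception {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream("bundle.pack"))) {
                for (String path : args) {
                    byte[] name = new File(path).getName().getBytes(StandardCharsets.UTF_8);
                    byte[] data = Files.readAllBytes(new File(path).toPath());
                    out.writeInt(name.length);
                    out.write(name);
                    out.writeInt(data.length);
                    out.write(data);
                }
            }
        }
    }

A reader would walk the file in reverse fashion: read the two lengths, then the name and payload bytes, until it hits end of file.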
You might be interested in SQLite, a library that lets you treat a single *.sqlite file as an SQL database.
You could also use GDBM, a library that gives you an indexed-file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
I'm using Apache Lucene 3.0.3 on Windows 7. I can successfully index files with any extension (.doc, .ppt, .pdf, .txt, .rtf, etc.). However, I can only search for words (in any human language, Indian or foreign) in the indexed text documents, not in the indexed Word/PowerPoint/PDF documents. Why is this? Can Lucene handle those formats directly?
Do I need a newer version of Lucene? I'm aware of Lucene 4.8.1. Do I need it to achieve the task described above, or can Lucene 3 do the same?
Lucene doesn't interpret content. It indexes the content you give it and makes it searchable. If you hand it binary garbage, it will happily index it and make it searchable; it just won't be searchable in any human language. The .doc, .ppt, .pdf, and .rtf formats are not plain text, and so won't index well if you just read the raw files and chuck them directly into Lucene.
You need to extract the content from the documents in order to search them meaningfully. I'd recommend using Tika.
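A minimal sketch of that extraction step, then indexing the extracted text (assumes Tika is on the classpath; writer is an already-opened Lucene 3.x IndexWriter, and the field names and file path are illustrative):

    import java.io.File;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.tika.Tika;

    // Sketch: let Tika turn a binary document into plain text, then index that text.
    Tika tika = new Tika();
    File file = new File("report.pdf"); // works the same for .doc, .ppt, .rtf, ...
    String text = tika.parseToString(file);

    Document doc = new Document();
    doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);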
Could someone tell me where I should start to develop a simple full-text search engine for local files?
I have a Debian 7 server with LAMP, and I have mounted a Windows network drive on it. So far I am using this script to show the other local network users the directory tree, from which they can download files on the mounted network drive.
Now I have to build a simple search engine that can index the names and the content (if any) of the local files in the mounted folder - Microsoft doc, docx, xls, xlsx, rtf, txt. The search has to return the name of the file, its path, and ideally a snippet of the text around the search word(s) (if the file has text).
Could someone point me in the right direction as to what I have to read and learn to do this? Thanks.
You need a couple of tools for this. First, you need something to index and search content, and you've tagged the question with three good tools for that task: Lucene, Solr, and Elasticsearch. Each of them is rich with tutorials and examples to help you get started.
The other thing you will need is a way to read the content from all those different file types. I'd recommend Apache Tika: it's an excellent toolkit for this, reads all the formats you've listed, and works well with Lucene.
You can see an example of their use together in this question: Tika in Action book examples Lucene StandardAnalyzer does not work
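For the search side, a rough sketch with plain Lucene (3.x-style API; the index location and the "path"/"contents" field names are assumptions about how you build the index):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Sketch: look a word up in an existing index and print each hit's stored path.
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "contents",
            new StandardAnalyzer(Version.LUCENE_36));
    Query query = parser.parse("invoice"); // the search term is illustrative
    for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        Document doc = searcher.doc(hit.doc);
        System.out.println(doc.get("path"));
    }
    searcher.close();
    reader.close();

For the snippet-of-matching-text part of the requirement, Lucene's highlighter contrib module is the usual tool.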
You may find this helpful, you may not.
I have Solr and Nutch set up to index my local filesystem and store the results in Solr, and I have guides on how I set them up that way.
This would provide a solid backend for your application.
Here are the links; the first two cover the Solr setup, the last two the Nutch integration:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html
I'm new to Apache Lucene.
Is it possible to store files (e.g. PDF, DOC) in Apache Lucene and retrieve them later? Or do I have to store those files somewhere else and use Lucene just for indexing?
Technically you can, of course, store the contents of a file (e.g. in a StoredField or elsewhere), but I don't see any reason why you should. It brings no added value, only the pain of serializing and deserializing file contents - and you will still have to keep the file name indexed somewhere else. Apart from the serialization/deserialization pain, your app will likely have to block longer while Lucene merges index segments.
The best approach IMO is to store the path to the file relative to some file repository root - e.g. if your file is in /home/users/bob/files/123/file.txt, you might want to store the files/123/file.txt part without tokenization (using StringField).
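A quick sketch of that (4.x-style API, where StringField and TextField exist; the repository-relative path, the field names, the extracted text, and writer are assumed for illustration):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // Sketch: keep the file itself on disk, index only its extracted text
    // plus the repository-relative path (stored, but not tokenized).
    Document doc = new Document();
    doc.add(new StringField("path", "files/123/file.txt", Field.Store.YES));
    doc.add(new TextField("contents", extractedText, Field.Store.NO)); // searchable, not stored
    writer.addDocument(doc);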
Could any of you help me with the following:
I have quite a load of InDesign documents, and I need to be able to search through them text-wise. I don't have the resources to open each of these files, make a PDF, and then do the search. In short, I want to be able to either extract the textual content and index that, or directly index the file itself.
In the end, I would present the content or the index to a Solr engine for further processing. This should all take place in a PHP/Apache/MySQL environment.
Your insights are highly appreciated.
In order to search the textual contents of an InDesign file, you will have to open the file in InDesign or InDesign Server. There is no legal way around this.
However, there is no need to do a time-consuming PDF export. You can use the InDesign scripting API to search through the text content of the file and create an index, either inside the document or in an external location.
I think you might be looking for an application that can read, and let you edit, text in InDesign files without actually having InDesign?
If so (and I may be wrong), there is a product on the market called PageZephyr, from Markzware.
You should look into it; I believe there's a 30-day free demo as well. I used it a while ago and it worked great and saved me tons of time. I don't work with many InDesign files nowadays, though.
Google them.