Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
I want to write a tool that helps me search PDF/CHM/DjVu files on Linux. Any pointers on how to go about it?
The major problem is reading/importing data from all these files. Can this be done with C and shell scripting?
Tracker ships with Ubuntu 8.04. It was a significant switch from Beagle, which users believed was too resource (CPU) intensive and didn't yield good enough results. It indexes both PDF and CHM, and according to this bug report it also indexes DjVu.
Note that DjVu is an image compression format (optimized to compress 'pictures of text', typically the results of scanning). As such, you won't be able to search for text except in the metadata (this is what the link sent by cdleary refers to), unless you first run OCR on the document to convert it into text.
The same is true for PDFs whose content is scanned articles or books.
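To make the point about text layers concrete, here is a rough sketch in Python rather than C or shell. It assumes the command-line tools pdftotext (from poppler-utils) and djvutxt (from DjVuLibre) are installed. The helper names extractor_command, extract_text and matches are made up for illustration. A scanned file without a text layer simply yields empty output and would need OCR first; CHM is left out here.

```python
import subprocess
from pathlib import Path

# Assumed external tools: pdftotext (poppler-utils) and djvutxt (DjVuLibre).
EXTRACTORS = {
    ".pdf": lambda p: ["pdftotext", p, "-"],  # "-" sends the text to stdout
    ".djvu": lambda p: ["djvutxt", p],        # prints the text layer to stdout
}

def extractor_command(path):
    """Return the command line used to dump text from `path`, or None."""
    build = EXTRACTORS.get(Path(path).suffix.lower())
    return build(path) if build else None

def extract_text(path):
    """Run the extractor; return '' for unsupported files or on failure."""
    cmd = extractor_command(path)
    if cmd is None:
        return ""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", check=False
        )
    except FileNotFoundError:
        # The extractor tool is not installed.
        return ""
    return result.stdout if result.returncode == 0 else ""

def matches(path, needle):
    """Case-insensitive substring search in the extracted text."""
    return needle.lower() in extract_text(path).lower()
```

Something like `matches("book.pdf", "compression")` would then tell you whether a given file is worth opening. A real tool would cache the extracted text in an index instead of re-running the extractor on every query.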
How about a plugin for Beagle?
It already searches PDFs, but you can add other file types.
Here is the relevant Wikipedia page: http://en.wikipedia.org/wiki/Beagle_(software)
Closed 7 years ago.
For a work assignment I have to read through PDFs of 600-1000 pages each and form a general opinion on how similar they are (the extent to which the company putting them out is using 'boilerplate' formats).
I've used Adobe XI Pro DC's comparator software, which can compare two PDFs and highlight the parts that are different or changed. What I'll do is look through the comparison PDF, and any part that isn't highlighted I'll know is the exact same in both PDFs.
The problem with this software is that if the similar portions/strings are very far apart from each other in terms of page count (e.g., if the relevant section is on page 100 of the first PDF but on page 25 of the second PDF), the comparison software isn't going to catch it.
My ideal PDF comparator software would highlight only sections/paragraphs/strings in the first PDF that cannot be found ANYWHERE in the second PDF. Any section/string/paragraph which is also found in the second PDF, no matter how far apart in terms of page count, would stay unhighlighted.
Any help would be hugely appreciated!
You may want to check the xdocdiff project (open source, so you can adapt it to your needs) and perhaps some of the diff tools recommended in the TortoiseSVN documentation (both open-source and closed-source).
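If you end up scripting it yourself, the "not found ANYWHERE in the second PDF" requirement is essentially a set-membership test on paragraphs. Below is a minimal sketch in Python. It assumes you have already extracted both PDFs to plain text with paragraphs separated by blank lines; the function names are invented for illustration. It only catches paragraphs that are identical after whitespace and case normalization, not near-duplicates.

```python
import re

def normalize(paragraph):
    """Collapse whitespace and lowercase so layout differences don't matter."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()

def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]

def unique_to_first(first_text, second_text):
    """Return paragraphs of the first text that occur nowhere in the second."""
    seen = {normalize(p) for p in split_paragraphs(second_text)}
    return [
        p.strip()
        for p in split_paragraphs(first_text)
        if normalize(p) not in seen
    ]

first = "Intro text.\n\nShared  clause\nhere.\n\nOnly in A."
second = "Only in B.\n\nshared clause here.\n\nIntro text."
print(unique_to_first(first, second))  # ['Only in A.']
```

Because the lookup is position-independent, a section on page 100 of one document and page 25 of the other still counts as shared.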
Closed 4 years ago.
I'm currently trying to find a documentation (user guide) system that has the following features:
documentation files in text mode (so svn can diff/merge them)
possibility to use images, tables, cross-references and a table of contents
export to PDF (or .doc/.odt) with support for cross-references
I tried Markdown for the documentation source files and pandoc for PDF export, but Markdown does not support tables.
I really appreciate any help you can provide.
We use Sphinx for this scenario.
It can generate HTML, PDF and some other formats from reStructuredText files.
Also have a look at the list-table directive when you want to add complex tables.
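A Sphinx project is driven by a conf.py file, which is plain Python. The following is only a minimal sketch of what such a file might look like for a user guide with cross-references and PDF output via LaTeX. The project name, author and output file name are placeholders, not anything from our setup.

```python
# conf.py: minimal Sphinx configuration sketch (all names are placeholders)
project = "User Guide"
author = "Docs Team"

# Root document (index.rst). Older Sphinx releases call this master_doc.
root_doc = "index"

# autosectionlabel lets you cross-reference sections by their titles.
extensions = [
    "sphinx.ext.autosectionlabel",
]
# Prefix labels with the document name to avoid clashes between files.
autosectionlabel_prefix_document = True

# Number figures and tables so they can be referenced by number.
numfig = True

# (start document, target .tex file, title, author, document class)
latex_documents = [
    (root_doc, "userguide.tex", project, author, "manual"),
]
```

With this in place, `sphinx-build -b html . _build/html` builds the HTML version and `sphinx-build -b latex . _build/latex` produces LaTeX sources that can be compiled to PDF.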
I use TeX for electronic and printable documentation.
https://tex.stackexchange.com/
https://en.wikipedia.org/wiki/TeX
Probably the most commonly used solution set for documentation is XML in DocBook or DITA. You can certainly manage those in SVN as well as perform diffs. They both provide processing toolchains with many output types, including PDF through XSL-FO.
Closed 7 years ago.
I have a list of URLs and I would like to capture the related sites as they would display in a standard viewport for a given resolution (say 1024x768).
Does anyone know of a tool/web service/script that does just that?
Any standard PHP methods or libraries I could build on alternatively?
To give you an idea what I intend to use these images for: they should feed into a website, my own little place to collect domain names going to waste.
Web service: Browshot with the PHP library.
Tools: PhantomJS
I wasted most of the day searching for a simple solution and finally found one. Ann Smarty wrote this article (http://www.makeuseof.com/tag/take-multiple-screenshots-bulk-firefox/) about the free Firefox plugin Grab Them All, which makes it immensely easy to batch-generate snapshots of a specified size from a text file listing the source URLs. Easy peasy, no coding necessary, other than maybe to change the saved filenames. (There's a setting that uses a safe version of the supplied URL string, with unsafe characters changed to the underscore character.)
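If you do want to rename the saved files yourself, the "safe version of the URL" idea is easy to reproduce. Here is a small sketch in Python. Which characters count as unsafe is my own assumption (anything other than letters, digits, dots and hyphens), not necessarily the rule the plugin uses, and the function name is made up.

```python
import re

def safe_filename(url, extension=".png"):
    """Replace every character outside [A-Za-z0-9.-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9.-]", "_", url) + extension

print(safe_filename("http://example.com/a?b=1"))  # http___example.com_a_b_1.png
```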
Closed 7 years ago.
Is there a free way to read PDF files through VBA to extract basic text content? I need to automate a weekly data acquisition process at my company where data is contained in PDF files (which are updated weekly by the data provider). Also, is there a reference I can look into to understand the file structure (DOM?) of a PDF?
Adobe's PDF reference is online here: http://www.adobe.com/devnet/pdf/pdf_reference.html
I'm not sure about the best way to read PDFs from VBA directly, but if you can call an external Java or C# program, then I would recommend using iText for basic text extraction.
EDIT: I should maybe mention that Adobe's PDF reference is an 800-page beast. I found that it's good for looking up answers to particular questions (e.g., storing widths of embedded TrueType fonts), but it may not be a good place to start. For that, reading through the iText book helped me to get started on the format.
The iText book contains lots of worked examples for general PDF tasks and lots of background info to help you understand PDF files. It more than pays for itself very quickly!
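As a first taste of the file structure: a PDF is not a DOM-style document but a collection of numbered objects, wrapped in a header line such as `%PDF-1.4` at the start and an `%%EOF` marker at the end. The Python sketch below only checks those two landmarks on raw bytes. The function names are invented for illustration, and it does not parse objects or extract text.

```python
import re

def pdf_version(data):
    """Return the version from the '%PDF-x.y' header, or None if absent."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None

def has_eof_marker(data):
    """Check for the '%%EOF' marker near the end of the file."""
    return b"%%EOF" in data[-1024:]

# A tiny hand-written skeleton, not a valid renderable PDF.
sample = (
    b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
    b"trailer\n<< >>\nstartxref\n0\n%%EOF\n"
)
print(pdf_version(sample), has_eof_marker(sample))  # 1.4 True
```

Opening a small PDF in a text editor and looking for `obj`, `endobj`, `xref` and `trailer` is a quick way to see the same skeleton before diving into the reference.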
Closed 6 years ago.
I'm a jack-of-all-trades-master-of-none programmer and as I jump around languages, quality consistent documentation is becoming more and more important to me. I've recently been using Doxygen, but Wikipedia reveals the usual ridiculous list of similar frameworks.
What is your favorite documentation generator and why? (Vote where you agree to keep it tidy!)
I use separate files written in MediaWiki markup, since it is easy for everyone to learn. I convert these to HTML and a CHM file, and to LaTeX for the PDF documentation.
This was the most painless way for me to generate online documentation AND printable documentation in one go from a simple input format.
The tools I use are org.eclipse.mylyn.wikitext with a custom DocumentBuilder for LaTeX, the Microsoft Help compiler (which sadly only runs on Windows), and a LaTeX distribution.
EDIT: I managed to get the Microsoft Help compiler running with Wine, so my Linux build server is now able to create the whole documentation automatically.