Free library to read PDF files [closed] - vba

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Is there a free way to read PDF files through VBA to extract basic text content? I need to automate a weekly data acquisition process at my company where data is contained in PDF files (which are updated weekly by the data provider). Also, is there a reference I can look into to understand the file structure (DOM?) of a PDF?

Adobe's PDF reference is online here: http://www.adobe.com/devnet/pdf/pdf_reference.html
I'm not sure about the best way to read PDFs from VBA directly, but if you can call an external Java or C# program, then I would recommend using iText for basic text extraction.
EDIT: I should maybe mention that Adobe's PDF reference is an 800-page beast. I found that it's good for looking up answers to particular questions (e.g., storing widths of embedded TrueType fonts), but it may not be a good place to start. For that, reading through the iText book helped me to get started on the format.
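To give a first feel for the file structure before opening the reference, here is a minimal Python sketch (standard library only) that looks at the outermost layer of a PDF: the `%PDF-x.y` header, the `startxref` offset near the end of the file, and a rough count of indirect objects. The function name `pdf_outline` is made up for illustration, and the regex-based object count is only an approximation, since the same byte pattern can also occur inside streams.

```python
import re


def pdf_outline(data: bytes) -> dict:
    """Return a few top-level facts about a PDF given its raw bytes."""
    # A PDF starts with a header such as "%PDF-1.4".
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # The end of the file holds "startxref", the byte offset of the
    # cross-reference section, and the "%%EOF" marker. Incrementally
    # updated files can have several; the last one is the current one.
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    # Indirect objects look like "12 0 obj ... endobj" (rough count only).
    objects = re.findall(rb"\d+\s+\d+\s+obj\b", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "startxref": int(offsets[-1]) if offsets else None,
        "object_count": len(objects),
    }
```

Calling `pdf_outline(open(path, "rb").read())` on a real file is the intended use. Actual text extraction also needs stream decoding and font handling, which is why a library such as iText is the practical route.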

The iText book contains lots of worked examples for general PDF tasks and lots of background info to help you understand PDF files. It more than pays for itself very quickly!

Related

Looking for an advanced PDF comparison software [closed]

Closed 7 years ago.
For a work assignment I have to read through 600-1000 page pdfs and form a general opinion on how similar they are (the extent to which the company putting them out is using 'boilerplate' formats).
I've used Adobe XI Pro DC's comparator software, which can compare two PDFs and highlight the parts that are different or changed. What I'll do is look through the comparison PDF, and any part that isn't highlighted I'll know is the exact same in both PDFs.
The problem with this software is that if the similar portions/strings are very far apart from each other in terms of page count (ex: if the relevant section is on page 100 of the first PDF but page 25 on the second PDF), this comparison software isn't going to catch it.
My ideal PDF comparator software would highlight only sections/paragraphs/strings in the first PDF that cannot be found ANYWHERE in the second PDF. Any section/string/paragraph which is also found in the second PDF, no matter how far apart in terms of page count, would stay unhighlighted.
Any help would be hugely appreciated!
You may want to check the xdocdiff project (it is open source, so you can adapt it to your needs) and maybe some of the diff tools recommended in the TortoiseSVN documentation (both open source and closed source).
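The position-independent comparison described in the question can be sketched in a few lines of Python, assuming the text of both PDFs has already been extracted to plain strings and that paragraphs are separated by blank lines. The function names are made up for illustration, and matching is exact after whitespace and case normalization, so a paragraph with a single changed word counts as unmatched.

```python
import re


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so layout differences are ignored."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def split_paragraphs(text: str) -> list:
    """Split on blank lines and drop empty pieces."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def unmatched_paragraphs(first_text: str, second_text: str) -> list:
    """Paragraphs of the first text that occur nowhere in the second text."""
    seen = {normalize(p) for p in split_paragraphs(second_text)}
    return [
        p.strip()
        for p in split_paragraphs(first_text)
        if normalize(p) not in seen
    ]
```

Because the second document is turned into a set, a paragraph on page 100 of one file matches the same paragraph on page 25 of the other. For near matches, a similarity ratio (for example from Python's `difflib.SequenceMatcher`) could replace the exact set lookup, at a higher cost on documents of this size.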

What documentation system to use for user guide [closed]

Closed 4 years ago.
I'm currently trying to find a documentation (user guide) system that would have the following features:
documentation files in text mode (so svn could diff/merge them)
possibility to use images, tables, cross-references and a table of contents
export to PDF (or .doc/.odt) that would support cross-references
I tried Markdown for the documentation source files and pandoc for PDF export, but Markdown does not support tables.
I really appreciate any help you can provide.
We use Sphinx for this scenario.
It can generate HTML, PDF and some other formats from reStructuredText files.
And have a look at list-table when you want to add complex tables.
I use TeX for electronic and printable documentation:
https://tex.stackexchange.com/
https://en.wikipedia.org/wiki/TeX
Probably the most commonly used solution set for documentation is XML in DocBook or DITA. You can certainly manage those in SVN as well as perform diffs. They both provide processing toolchains with many output types, including PDF through XSL-FO.

is there any working/real open source Plagiarism checker available? [closed]

Closed 6 years ago.
I want to develop a plagiarism checker for checking several source codes, but I couldn't find any proper source code or even a resource to get an idea about it.
I have checked Boss2, which is useless. They claim that they use the Sherlock module for detecting plagiarism, but it seems no such tool is included in Boss2.
If any open-source detection tool is available for checking source code, please let me know.
Regards
I'm aware of open-source plagiarism detectors for text (e.g., WCopyFind), but not code.
I couldn't find... even a resource to get an idea about it.
The authors of the excellent closed-source tool MOSS have published a helpful paper about the technology.
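Assuming the paper meant here is "Winnowing: Local Algorithms for Document Fingerprinting" (Schleimer, Wilkerson, Aiken, 2003), the core idea can be sketched in Python as follows: hash every k-gram of the normalized text, slide a window over the hashes, keep the minimum of each window as a fingerprint, and compare fingerprint sets. This is a simplified illustration, not MOSS itself; the function names, the parameters `k=5` and `w=4`, and the Jaccard score are choices made for the example.

```python
import hashlib


def kgram_hashes(text: str, k: int = 5) -> list:
    """Hash every k-character substring of the text, ignoring whitespace and case."""
    s = "".join(text.split()).lower()
    return [
        int(hashlib.md5(s[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(hashes: list, w: int = 4) -> set:
    """Keep the minimum hash of each window of w hashes (rightmost on ties)."""
    if not hashes:
        return set()
    w = min(w, len(hashes))
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)
        picked.add((smallest, start + offset))
    return picked


def similarity(a: str, b: str, k: int = 5, w: int = 4) -> float:
    """Jaccard similarity of the two fingerprint sets (positions ignored)."""
    fa = {h for h, _ in winnow(kgram_hashes(a, k), w)}
    fb = {h for h, _ in winnow(kgram_hashes(b, k), w)}
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

As written this works on raw characters, so it tolerates reformatting and moved code but not renamed variables. A source-code checker would typically tokenize first and replace identifiers with a placeholder before hashing.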
I know the question is old, but I did land here from a Google search.
Sherlock is an open source plagiarism detector. Sherlock's home page is here
I wrote SimiCheck, and you are welcome to use it. If you are interested in an API, I could probably write one very quickly.
I wrote the original algorithm as part of the CrowdGrader peer-grading tool, but then I decided to make the comparison tools available independently.
SimiCheck can handle code, Word (.docx), html, pdf, text, ..., as well as .zip, .tar, .gz, .tgz, and some more formats, and can deal with variable renaming, code moves, code across multiple files, etc.

Programmatically extracting Adobe PDF package files [closed]

Closed 7 years ago.
We have a bunch of documents in our organization that were inadvertently saved as Adobe PDF packages (also known as PDF 1.7 "collections").
We would like to convert these to normal PDFs (most of these "packages" contain one bog-standard pdf file), but given the number of files, it's not possible manually.
Any Adobe expert know whether:
There is an open-source or free library that handles PDF package format that I can write a script around?
Does Adobe Pro 9 have a relevant scriptable interface that would allow me to extract the relevant file from each package?
Alternatively, I'm looking at a macro-based approach, but I'd rather not go this route until investigating other options.
Thanks!
After a bunch of digging around, I found pdftk, which is distributed as source and binary on many platforms.
It does almost all of what we need to do, and we can now iterate through our documents and recursively call pdftk on each (some are multi-level attachment chains).
Note pdftk will only burst pages of the visible document into individual documents. The hidden documents remain hidden.
The option you need to use is unpack_files.
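A minimal Python sketch of the loop described above, assuming `pdftk` is installed and on the PATH. It uses the `pdftk <file> unpack_files output <dir>` form and then repeats the call on every extracted PDF to follow multi-level attachment chains. The function names, the `_files` directory suffix and the `max_depth` limit are hypothetical choices for this example.

```python
import subprocess
from pathlib import Path


def build_unpack_command(pdf_path, out_dir) -> list:
    """Command line that asks pdftk to extract a PDF's attachments into out_dir."""
    return ["pdftk", str(pdf_path), "unpack_files", "output", str(out_dir)]


def unpack_recursively(pdf_path, out_dir, depth: int = 0, max_depth: int = 5) -> None:
    """Unpack attachments, then unpack any PDFs found among them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The return code is ignored here for brevity; real code should check it.
    subprocess.run(build_unpack_command(pdf_path, out_dir))
    if depth >= max_depth:
        return
    for child in sorted(out_dir.glob("*.pdf")):
        # Each extracted PDF gets its own sub-directory for its attachments.
        unpack_recursively(child, out_dir / (child.stem + "_files"), depth + 1, max_depth)
```

For example, `unpack_recursively("package.pdf", "extracted")` would leave the first-level attachments in `extracted/` and deeper levels in nested `*_files` directories.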
Yet another unwanted obfuscation format to hinder interoperability therefore classified as malware.
Using Adobe Acrobat Professional, combine all into one PDF and then split by bookmark level.
I understand this thread is a few years old, but if anyone is looking for a free utility to extract files from PDF packages (especially from large collections), then check the free utility ByteScout PDF Multitool: it was tested against 500+ MB package files to extract hundreds of multilevel chained attachments.
Disclaimer: I'm affiliated with ByteScout

Desktop search utility for pdf,chm and djvu files [closed]

Closed 8 years ago.
I want to write a tool that helps me search pdf/chm/djvu files on Linux. Any pointers on how to go about it?
The major problem is reading/importing data from all these files. Can this be done with C and shell scripting?
Tracker ships with Ubuntu 8.04 -- it was a significant switch from Beagle which users believed was too resource (CPU) intensive and didn't yield good enough results. It indexes both pdf and chm and according to this bug report it also indexes djvu.
Note that djvu is an image compression format (optimized to compress 'pictures of text', typically the results of scanning). As such, you won't be able to search for text, except in the metadata (this is what the link sent by cdleary refers to), or if you first use OCR on the document to convert it into text.
The same is true for PDFs whose content is scanned articles or books.
How about a plugin for Beagle?
It already searches PDFs but you can add other file types.
Here is the relevant Wikipedia page: http://en.wikipedia.org/wiki/Beagle_(software)
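If you would rather write the tool yourself than use Tracker or Beagle, the two pieces are text extraction and an index. Below is a rough Python sketch, assuming the command-line tools `pdftotext` (from poppler-utils) and `djvutxt` (from DjVuLibre) are installed. chm is left out, and scanned documents without a text layer will yield no text, as noted above. All function names are made up for the example.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Map file extensions to the command that prints the document text to stdout.
EXTRACTORS = {
    ".pdf": lambda p: ["pdftotext", str(p), "-"],
    ".djvu": lambda p: ["djvutxt", str(p)],
}


def extract_text(path) -> str:
    """Return the text of a document, or "" if it cannot be extracted."""
    make_cmd = EXTRACTORS.get(Path(path).suffix.lower())
    if make_cmd is None:
        return ""
    try:
        result = subprocess.run(
            make_cmd(path), capture_output=True, text=True, errors="replace"
        )
    except FileNotFoundError:  # the extractor is not installed
        return ""
    return result.stdout if result.returncode == 0 else ""


def build_index(texts: dict) -> dict:
    """Build an inverted index: word -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict, query: str) -> set:
    """Return the documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

A typical use would be `build_index({str(p): extract_text(p) for p in Path("docs").rglob("*") if p.is_file()})` followed by `search(index, "some words")`. The same split works in C or shell, since the extraction step is just calling external programs.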