How to remove duplicate pages from a PDF? - pdf

I have a pdf of an ebook which has many pages duplicated, and I would like to remove them.

There are many PDF toolkits that could do this. You don't say which language or framework, so it's hard to recommend one.
You could also use Adobe Acrobat if you don't want to do it programmatically.
Disclaimer: I work for Atalasoft
Our DotImage Document Imaging Toolkit has PDF page manipulation classes. Tutorial here:
http://www.atalasoft.com/products/dotimage/white-papers/building-pdf-documents-with-dotimage
It shows adding pages, but we also support removing and reordering them.
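If a script is acceptable, here is a minimal Python sketch of the idea. It is not the DotImage API; it assumes the third-party pypdf package is installed and that duplicated pages carry identical extractable text. The function names are made up for illustration. Pages with no extractable text (for example scanned images) are always kept, since text alone cannot tell them apart.

```python
import hashlib
from typing import Iterable, List


def first_occurrences(page_texts: Iterable[str]) -> List[int]:
    """Return the indices of pages to keep: the first page with each distinct text."""
    seen = set()
    keep = []
    for index, text in enumerate(page_texts):
        # Collapse whitespace so trivial spacing differences do not matter.
        normalized = " ".join(text.split())
        if not normalized:
            # No extractable text: keep the page rather than guess.
            keep.append(index)
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(index)
    return keep


def remove_duplicate_pages(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, skipping pages whose text repeats an earlier page."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [page.extract_text() or "" for page in reader.pages]
    writer = PdfWriter()
    for index in first_occurrences(texts):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out:
        writer.write(out)
```

Calling `remove_duplicate_pages("ebook.pdf", "ebook-deduped.pdf")` (both file names are placeholders) writes a copy without the repeated pages. If two genuinely different pages happen to have the same text, this sketch would drop the second one, so check the result.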

Related

Include custom fonts in PDF

I have a question about generating PDFs with wkhtmltopdf. I know it's possible to use custom fonts in my HTML, but I think the operating system used to view the PDF must have these fonts installed. Is that correct?
My question is whether it's possible to include these fonts in the PDF itself, so that when I send the generated PDF to a print shop to print 50 copies, they see it exactly as I do without having the fonts installed.
This is certainly possible.
It's called "embedding a font" in pdf lingo.
Most pdf generation libraries should support this.
PDF comes in different flavors (standards). One of these standards, PDF/A, is meant for long-term storage (the A stands for archiving). The idea is that the document's look and feel should be preserved as much as possible. To achieve this without depending on the operating system (and whatever fonts it ships with), PDF/A requires that fonts be embedded.
https://en.wikipedia.org/wiki/PDF/A
I don't know how to do this in the library you are using. But I do know it's possible with iText.
This is a great tutorial on it, which aside from giving you more information about iText, will also illustrate the problem with custom fonts in a very visual way.
https://developers.itextpdf.com/tutorial/using-fonts-pdf-and-itext
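To check whether a generated PDF really has its fonts embedded before sending it to the printer, something like the following Python sketch could help. It is not iText; it assumes the third-party pypdf package, and the helper names are invented for illustration. It relies on the PDF convention that an embedded font's descriptor carries a `/FontFile`, `/FontFile2` or `/FontFile3` entry. It only looks at fonts listed directly in each page's resources, so inherited resources and fonts inside form XObjects are missed.

```python
from typing import Any, Dict, Mapping

FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object() (pypdf objects do)."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def font_is_embedded(font: Mapping[str, Any]) -> bool:
    """Return True if the font dictionary points at an embedded font program."""
    font = _resolve(font)
    subtype = font.get("/Subtype")
    if subtype == "/Type3":
        # Type 3 glyphs are drawn by content streams stored in the PDF itself.
        return True
    if subtype == "/Type0":
        # Composite fonts keep their descriptor on the descendant font.
        descendants = _resolve(font.get("/DescendantFonts")) or []
        if not descendants:
            return False
        font = _resolve(descendants[0])
    descriptor = _resolve(font.get("/FontDescriptor"))
    if descriptor is None:
        return False
    return any(key in descriptor for key in FONT_FILE_KEYS)


def report_fonts(pdf_path: str) -> Dict[str, bool]:
    """Map each font name found on the pages to whether it looks embedded."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    result: Dict[str, bool] = {}
    for page in PdfReader(pdf_path).pages:
        resources = _resolve(page.get("/Resources")) or {}
        fonts = _resolve(resources.get("/Font")) or {}
        for font_ref in fonts.values():
            font = _resolve(font_ref)
            result[str(font.get("/BaseFont", "?"))] = font_is_embedded(font)
    return result
```

Any font reported as `False` would have to be present on the print shop's machine for the page to look the same there.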

API for PDF Library

I need to build a small PDF library that displays many catalogs. Users should be able to view a document and page through it, but not download or share it in any way, similar to how Google Books works.
I have in mind something like the Google Drive API or some kind of Scribd API, but I don't know whether either of those will work. I would like to know whether there are more options for this kind of application, or whether the ones I mentioned will do the job.
Edit: Forgot to mention, all this done in a web browser.
In principle all you need would be the ability to render pages from a PDF file into an image. Your application (you didn't mention where you want to build this) is then responsible for displaying the images, scrolling, moving from page to page etc...
If this is correct there are multiple possible libraries that can do this:
- ImageMagick can convert PDF to images (http://www.imagemagick.org)
- Ghostscript can interpret PDF as well as PostScript and convert either into images and other formats (http://www.ghostscript.com)
- I'm sure there are many, many more...
There are also a number of commercial tools, for example those from Adobe (licensed through Datalogics, http://www.datalogics.com) and callas software (http://www.callassoftware.com - I'm affiliated with this company).
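As a rough illustration of the render-to-images approach, here is a Python sketch that runs Ghostscript on the server to produce one PNG per page. It assumes the Ghostscript executable is on the PATH under the name `gs` (on Windows it is usually `gswin64c`); the function names, the output pattern and the default resolution are placeholders chosen for this example.

```python
import subprocess
from typing import List


def ghostscript_render_command(pdf_path: str, output_pattern: str, dpi: int = 100) -> List[str]:
    """Build the Ghostscript command that renders every page to a 24-bit PNG."""
    return [
        "gs",                               # assumed executable name
        "-dSAFER", "-dBATCH", "-dNOPAUSE",  # non-interactive, restricted file access
        "-sDEVICE=png16m",                  # 24-bit RGB PNG output device
        f"-r{dpi}",                         # rendering resolution in dots per inch
        f"-sOutputFile={output_pattern}",   # %03d is replaced by the page number
        pdf_path,
    ]


def render_pages(pdf_path: str, output_pattern: str = "page-%03d.png", dpi: int = 100) -> None:
    """Run Ghostscript and raise CalledProcessError if rendering fails."""
    subprocess.run(ghostscript_render_command(pdf_path, output_pattern, dpi), check=True)
```

The web application would then serve only the generated images, never the PDF itself. Keep in mind that users can still save the page images from their browser.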

What is "Tagged PDF"?

Can someone please explain what a "Tagged PDF" is, and how it differs from regular, non-tagged PDF?
Will tagged PDFs contain special content, such as XML, Rich Media, Javascript, or the like?
Which TeX-toolchains generate Tagged PDFs?
Tagged PDF is a PDF file that contains meta-information around certain groups of PDF instructions inside a page's content. This meta-information has many use cases: text extraction, content reflow, document accessibility, geographic information in PDFs containing maps, etc.
If you need to know more details about this topic I would recommend reading Chapter 10 - Document Interchange of Adobe PDF Reference version 1.7.
The main reason it is used is for accessibility. With the correct tags, a screen reader (for a blind person) can understand where headings fall, what is a table/footnote/graphic and so on. Also there is a feature called PDF Article Threading which is useful for magazine or newspaper layouts where an article is split across boxes/pages.
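To find out whether a given file claims to be tagged, you can look at the document catalog: per the PDF Reference, a Tagged PDF has a structure tree (`/StructTreeRoot`) and a mark information dictionary (`/MarkInfo`) whose `/Marked` entry is true. Below is a minimal Python sketch, assuming the third-party pypdf package for reading the catalog; the function names are illustrative only. It says nothing about whether the tags are any good.

```python
from typing import Any, Mapping


def is_tagged(catalog: Mapping[str, Any]) -> bool:
    """Return True if the document catalog declares the PDF as tagged."""
    if "/StructTreeRoot" not in catalog or "/MarkInfo" not in catalog:
        return False
    mark_info = catalog["/MarkInfo"]
    # Compare with == so that both a plain bool and pypdf's BooleanObject work.
    return "/Marked" in mark_info and mark_info["/Marked"] == True  # noqa: E712


def pdf_is_tagged(pdf_path: str) -> bool:
    """Open the file and inspect its catalog."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    return is_tagged(PdfReader(pdf_path).trailer["/Root"])
```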

Creating Thumbnail from PDF without Adobe SDK

I've been looking for ways to generate thumbnails from a PDF, like the ones shown in Explorer. The problem is that without Adobe Pro, the free version does not expose all the COM interfaces. Is there any other way?
Ghostscript (which is what ImageMagick uses) will generate images in a wide variety of image formats. If you need something really obscure, use the ImageMagick wrapper; otherwise I prefer using Ghostscript directly.
If you can afford a commercial option, you could use Amyuni PDF Creator ActiveX for this task (or the .NET version if that suits your needs better). Using this product you can create JPG/PNG/BMP images from the first page of your PDF files at a specified resolution, and then use them as thumbnails.
Disclaimer: I am part of the development team of this product.
Here are other SO questions proposing other approaches (not involving COM):
Using ImageMagick in command line
Thumbnail of a PDF page (Java)
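For the ImageMagick command-line route, a thumbnail of the first page can be sketched in Python as below. It assumes ImageMagick (with Ghostscript behind it) is installed and that its `convert` executable is on the PATH; ImageMagick 7 installs it as `magick`, and on Windows a bare `convert` can resolve to an unrelated system tool. The function names and the 200-pixel default are arbitrary choices for this example.

```python
import subprocess
from typing import List


def thumbnail_command(pdf_path: str, thumb_path: str, max_size: int = 200) -> List[str]:
    """Build an ImageMagick command that thumbnails the first page of a PDF."""
    return [
        "convert",                    # assumed executable name ("magick" on ImageMagick 7)
        f"{pdf_path}[0]",             # [0] selects the first page only
        "-thumbnail",
        f"{max_size}x{max_size}",     # fit within this box, keeping the aspect ratio
        thumb_path,                   # output format follows the file extension
    ]


def make_thumbnail(pdf_path: str, thumb_path: str, max_size: int = 200) -> None:
    """Run ImageMagick and raise CalledProcessError if it fails."""
    subprocess.run(thumbnail_command(pdf_path, thumb_path, max_size), check=True)
```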

HowTo extract embedded OCR data from a PDF?

I have PDF files with embedded OCR data (I have already run OCR on them), so they are searchable. Now I want to extract this OCR data, because I want to put it in my Tomcat 6 search server. To do this, I need the plain OCR data.
So my question is, is it possible to extract this embedded OCR-Data from the pdf Files?
It would be nice to get files with coordinates. But it would also be sufficient to get plaintext files.
You should be able to do this with iText or iTextSharp. iTextSharp has no documentation, however, and a good number of its functions are not equivalent to those found in iText.
PDFSharp does not support iref streams. Those are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.
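The OCR layer in a searchable PDF is normally stored as ordinary (invisible) page text, so a general text extractor should pick it up. Here is a minimal plain-text sketch in Python; it uses the third-party pypdf package rather than iText, which is an assumption of this example, and the function names are made up. It returns text per page without coordinates.

```python
from typing import Any, Iterable, List, Tuple


def extract_page_texts(pages: Iterable[Any]) -> List[Tuple[int, str]]:
    """Return (page number, text) pairs for objects that offer extract_text()."""
    results = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # treat a missing text layer as empty
        results.append((number, text.strip()))
    return results


def extract_pdf_text(pdf_path: str) -> List[Tuple[int, str]]:
    """Read the text layer of every page in the PDF."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    return extract_page_texts(PdfReader(pdf_path).pages)
```

The resulting page texts can be written to plain-text files and fed to the search server's indexer.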