Pragmatically convert PDF images to 8 bit - pdf

I have a set of PDFs in normal RGB colour. They would benefit from conversion to 8 bit to reduce file sizes. Are there any APIs or tools that would allow me to do this whilst retaining non-raster elements in the PDF?

This is a fun one. Atalasoft dotImage with the PDF Rasterizer and dotPdf can do this (disclaimer: I work for Atalasoft and wrote most of the PDF tools). I'd start off first by finding candidate pages:
List<int> GetCandidatePages(Stream pdf, string password)
{
List<int> retVal = new List<int>();
using (PageCollection pages = new PageCollection(pdf, password)) {
for (int i=0; i < pages.Count; i++) {
if (pages[i].SingleImageOnly())
retVal.Add(i);
}
}
pdf.Seek(0, SeekOrigin.Begin); // restore file pointer
return retVal;
}
Next, I'd rasterize only those pages, turning them into 8-bit images, but to keep things efficient, I'd use an ImageSource which manages memory well:
public class SelectPageImageSource : RandomAccessImageSource {
private List<int> _pages;
private Stream _stm;
public SelectPageImageSource(Stream stm, List<int> pages)
{
_stm = stm;
_pages = pages;
}
protected override ImageSourceNode LowLevelAcquire(int index)
{
PdfDecoder decoder = new PdfDecoder();
_stm.Seek(0, SeekOrigin.Begin);
AtalaImage image = PdfDecoder.Read(_stm, _pages[index], null);
// change to 8 bit
if (image.PixelFormat != PixelFormat.Pixel8bppIndexed) {
AtalaImage changed = image.GetChangedPixelFormat(PixelFormat.Pixel8bppIndexed);
image.Dispose();
image = changed;
}
return new FileReloader(image, new PngEncoder());
}
protected override int LowLevelTotalImages() { return _pages.Count; }
}
Next you need to create a new PDF from this:
public void Make8BitImagePdf(Stream pdf, Stream outPdf, List<int> pages)
{
PdfEncoder encoder = new PdfEncoder();
SelectPageImageSource source = new SelectPageImageSource(pdf, pages);
encoder.Save(outPdf, source, null);
}
Next you need to replace the original pages with the new ones:
public void ReplaceOriginalPages(Stream pdf, Stream image8Bit, Stream outPdf, List<int> pages)
{
PdfDocument docOrig = new PdfDocument(pdf);
PdfDocument doc8Bit = new PdfDocument(image8Bit);
for (int i=0; i < pages.Count; i++) {
docOrig.Pages[pages[i]] = doc8Bit[i];
}
docOrig.Save(outPdf); // this is your final
}
This will do what you want, more or less. The less-than ideal bit of this is that the image pages have been rasterized, which is probably not what you want. The nice thing is that just by rasterizing, generating output is easy, but it might not be at the resolution of the original image. This can be done, but it is significantly more work in that you need to extract the image from SingleImageOnly pages and then change their pixel format. The problem with this is that SingleImageOnly does NOT imply that the image fits the entire page, nor does it imply that the image is placed in any particular location. In addition to the PixelFormat change (actually, before the change), you would want to apply the matrix that is used to place the image on the page to the image itself, and use PdfEncoder with an appropriate set of margins and the original page size to get the image where it should be. This is all cut-and dried, but it is a substantial amount of code.
There is another approach that might also work using our PDF generation API. It involves opening the document and swapping out the image resources for the document with 8-bit ones. This is also doable, but is not entirely trivial. You would do something like this:
public void ReplaceImageResources(Stream pdf, Stream outPdf, List<int> pages)
{
PdfGeneratedDocument doc = new PdfGeneratedDocument(pdf);
doc.Resources.Images.Compressors.Insert(0, new AtalaImageCompressor());
foreach (int page in pages) {
// GetSinglePageImage uses PageCollection, as above, to
// pull a single image from the page (no need to use the matrix)
// then converts it to 8 bpp indexed and returns it or null if it
// is already 8 bpp indexed (or 4bpp or 1bpp).
using (AtalaImage image = GetSinglePageImage(pdf, page)) {
if (image == null) continue;
foreach (string resName in doc.Pages[page].ImportedImages) {
doc.Resources.Images.Remove(resName);
doc.Resources.Images.Add(resName, image);
break;
}
}
}
doc.Save(outPdf);
}
As I said, this is tricky - the PDF generation suite was made for making new PDFs from whole cloth or adding new pages to an existing PDF (in the future, we want to add full editing). But PDF manages all of its images as resources within the document and we have the ability to replace those resources entirely. So to make life easier, we add an ImageCompressor to the Image resource collection that handles AtalaImage objects and remove the existing image resources and replace them with the new ones.
Now I'm going to do something that you probably won't see any vendor do when talking about their own products - I'm going to be critical of it on a number of levels. First, it isn't super cheap. Sorry. You might get sticker shock when you look at the price, but the price includes technical support from a staff that is honestly second to none.
You can probably do a lot of this with iTextPdf Sharp or the Bit Miracle's Docotic PDF library or Tall Components PDF libraries. The latter two also cost money. Bit Miracle's engineers have proven to be pretty helpful and you're likely to see them here (HI!). Maybe they can help you out too. iTextPdfSharp is problematic in that you really need to understand the PDF spec to do the right thing or you're likely to output garbage PDF - I've done this experiment with my own library side-by-side with iTextPdfSharp and found a number of pain points for common tasks that require an in-depth knowledge of the PDF spec to fix. I tried to make decisions in my high-level tools such that you didn't need to know the PDF spec nor did you need to worry about creating bad PDF.
I don't particularly like the fact that there are several apparently different tools in our code base that do similar things. PageCollection is part of our PDF rasterizer for historical reasons. PdfDocument is made strictly for manipulating pages and tries to be lightweight and stingy with memory. PdfGeneratedDocument is made for manipulating/creating page content. PdfDecoder is for generating raster images from existing PDF. PdfEncoder is for generating image-only PDF from images. It can be daunting to have all these apparently overlapping niche tools, but there is a logic to them and their relationship to each other.

Related

How to rotate a specific text in a pdf which is already present in the pdf using itext or pdfbox?

I know we can insert text into pdf with rotation using itext. But I want to rotate the text which is already present in the pdf.
Before.pdf
After.pdf
First of all, in your question you only talk about how to rotate a specific text but in your example you additionally rotate a red rectangle. This answer focuses on rotating text. The process of guessing which graphics might be related to the text and, therefore, probably should be rotated along, is a topic in its own right.
You also mention you are looking for a solution using itext or pdfbox and used the tags itext, pdfbox, and itext7. For this answer I chose iText 7.
You did not explain what kind of text pieces you want to rotate but offered a representative example PDF. In that example I saw that the text to rotate was drawn using a single text showing instruction which is the only such instruction in the encompassing text object in the page content stream. To keep the code in the answer simple, therefore, I can assume the text to rotate is drawn in a consecutive sequence of text showing instructions in a text object in the page content stream framed by instructions that are not text showing ones. This is a generalization of your case.
Furthermore, you did not mention the center of rotation. Based on your example files I assume it to be approximately the start of the base line of the text to rotate.
A Simple Implementation
When editing PDF content streams it is helpful to know the current graphics state at each instruction, e.g. to properly recognize the text drawn by a text showing operation one needs to know the current font to map the character codes to Unicode characters. The text extraction framework in iText already contains code to follow the graphics state. Thus, in this answer a base PdfCanvasEditor class has been developed on top of the text extraction framework.
We can base a solution for the task at hand on that class after a small extension; that class originally sets the text extraction event listener to a dummy implementation but here we'll need a custom one. So we need to add an additional constructor that accepts such a custom event listener as parameter:
public PdfCanvasEditor(IEventListener listener)
{
super(listener);
}
(Additional PdfCanvasEditor constructor)
Based on this extended PdfCanvasEditor we can implement the task by inspecting the existing page content stream instruction by instruction. For a sequence of consecutive text showing instructions we retrieve the text matrix before and after the sequence, and if the text drawn by the sequence turns out to be the text to rotate, we insert an instruction before that sequence setting the initial text matrix to a rotated version of itself and another one after that sequence setting the text matrix back to what it was there originally.
Our implementation LimitedTextRotater accepts a Matrix representing the desired rotation and a Predicate matching the string to rotate.
public class LimitedTextRotater extends PdfCanvasEditor {
public LimitedTextRotater(Matrix rotation, Predicate<String> textMatcher) {
super(new TextRetrievingListener());
((TextRetrievingListener)getEventListener()).limitedTextRotater = this;
this.rotation = rotation;
this.textMatcher = textMatcher;
}
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands) {
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
recentTextOperations.add(new ArrayList<>(operands));
} else {
if (!recentTextOperations.isEmpty()) {
boolean rotate = textMatcher.test(text.toString());
if (rotate)
writeSetTextMatrix(processor, rotation.multiply(initialTextMatrix));
for (List<PdfObject> recentOperation : recentTextOperations) {
super.write(processor, (PdfLiteral) recentOperation.get(recentOperation.size() - 1), recentOperation);
}
if (rotate)
writeSetTextMatrix(processor, finalTextMatrix);
recentTextOperations.clear();
text.setLength(0);
initialTextMatrix = null;
}
super.write(processor, operator, operands);
}
}
void writeSetTextMatrix(PdfCanvasProcessor processor, Matrix textMatrix) {
PdfLiteral operator = new PdfLiteral("Tm\n");
List<PdfObject> operands = new ArrayList<>();
operands.add(new PdfNumber(textMatrix.get(Matrix.I11)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I12)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I21)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I22)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I31)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I32)));
operands.add(operator);
super.write(processor, operator, operands);
}
void eventOccurred(TextRenderInfo textRenderInfo) {
Matrix textMatrix = textRenderInfo.getTextMatrix();
if (initialTextMatrix == null)
initialTextMatrix = textMatrix;
finalTextMatrix = new Matrix(textRenderInfo.getUnscaledWidth(), 0).multiply(textMatrix);
text.append(textRenderInfo.getText());
}
static class TextRetrievingListener implements IEventListener {
#Override
public void eventOccurred(IEventData data, EventType type) {
if (data instanceof TextRenderInfo) {
limitedTextRotater.eventOccurred((TextRenderInfo) data);
}
}
#Override
public Set<EventType> getSupportedEvents() {
return null;
}
LimitedTextRotater limitedTextRotater;
}
final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
final Matrix rotation;
final Predicate<String> textMatcher;
final List<List<PdfObject>> recentTextOperations = new ArrayList<>();
final StringBuilder text = new StringBuilder();
Matrix initialTextMatrix = null;
Matrix finalTextMatrix = null;
}
(LimitedTextRotater)
You can apply it to a document like this:
try ( PdfReader pdfReader = new PdfReader(...);
PdfWriter pdfWriter = new PdfWriter(...);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new LimitedTextRotater(new Matrix(0, -1, 1, 0, 0, 0), text -> true);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++){
editor.editPage(pdfDocument, i);
}
}
(RotateText test testBeforeAkhilNagaSai)
The Predicate used here is text -> true which matches any text. In case of your example PDF that is ok as the text to rotate is the only text. In general you might want a more specific check, e.g. text -> text.equals("The text to be rotated"). In general try not to be too specific, though, as the extracted text might slightly deviate from expectations, e.g. by extra spaces.
The result:
As you can see the text is rotated. In contrast to your After.pdf, though, the red rectangle is not rotated. The reason is - as already mentioned at the start - that that rectangle in no way is part of the text.
Some Ideas
First of all, there are ports of the PdfCanvasEditor to iText 5 (the PdfContentStreamEditor in this answer) and PDFBox (the PdfContentStreamEditor in this answer). Thus, if you eventually prefer to switch to either of these PDF libraries, you can create equivalent implementations.
Then, if the assumption that the text to rotate is drawn in a consecutive sequence of text showing instructions in a text object in the page content stream framed by instructions that are not text showing ones does not hold for you, you can generalize the implementation here somewhat. Have a look at the SimpleTextRemover in this answer for inspiration which is based on the PdfContentStreamEditor for iText 5. Here also texts that start somewhere in one text showing instruction and end somewhere in another one are processed which requires some more detailed data keeping and splitting of existing text drawing instructions.
Also, if you want to rotate graphics along with the text that a human viewer might consider associated with it (like the red rectangle in your example file), you can try and extend the example accordingly, e.g. by also extracting the coordinates of the rotated text and in a second run trying to guess which graphics around those coordinates are related and rotating the graphics along. This is not trivial, though.
Finally note that the Matrix provided in the constructor is not limited to rotations, it can represent an arbitrary affine transformation. So instead of rotating text you can also move it or scale it or skew it, ...

How to get PDF page location without creating new array

Is it possible just to find out locations of PDF pages in byte array?
At the moment I parse full PDF in order to find out page bytes:
public static List<byte[]> splitPdf(byte[] pdfDocument) throws Exception {
InputStream inputStream = new ByteArrayInputStream(pdfDocument);
PDDocument document = PDDocument.load(inputStream);
Splitter splitter = new Splitter();
List<PDDocument> PDDocs = splitter.split(document);
inputStream.close();
List<byte[]> pages = PDDocs.stream()
.map(PDFUtils::getResult).collect(Collectors.toList());
}
private static byte[] getResult(PDDocument pd) {
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
pd.save(byteArrayOutputStream);
return byteArrayOutputStream.toByteArray();
}
My code works very well but
I created additional List< byte[] > to save page bytes. I would like just to have byte locations - If I know byte indexes of page (page start location, page end location) I'll extract this from main byte array.
So might be I can find this information in PDF header or somewhere...
Right now I'm trying to optimize memory, because I parse hundreds of documents in parallel. So I don't want to create duplicate arrays.
If I know byte indexes of page (page start location, page end location) I'll extract this from main byte array.
As #Amedee already hinted at in a comment, there is not simply a section of the pdf for each page respectively.
A pdf is constructed from multiple objects (content streams, font resources, image resources,...) and two pages may use the same objects (e.g. use the same fonts or images). Furthermore, a pdf may contain unused objects.
So already the sum of the sizes of your partial pdfs may be smaller than, greater than, or even equal to the size of the full pdf.

replace string in PDF document (ITextSharp or PdfSharp)

We use non-manage DLL that has a funciton to replace text in PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to managed solution (ITextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general purpose PDF library, it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
using (PdfReader reader = new PdfReader(OrigFile))
{
byte[] contentBytes = reader.GetPageContent(1);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace(origText, replaceText);
reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
}
}
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if during your move to managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
for pdfsharp users heres a somewhat usable function, i copied from my project and it uses an utility method which is consumed by othere methods hence the unused result.
it ignores whitespaces created by Kerning, and therefore may mess up the result (all characters in the same space) depending on the source material
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
ModifyPdfContentStreams(contentPage, stream =>
{
if (!stream.TryUnfilter())
return false;
var search = string.Join("\\s*", source.Select(c => c.ToString()));
var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
if (!Regex.IsMatch(stringStream, search))
return false;
stringStream = Regex.Replace(stringStream, search, target);
stream.Value = Encoding.Default.GetBytes(stringStream);
stream.Zip();
return false;
});
}
public static void ModifyPdfContentStreams(PdfPage contentPage,Func<PdfDictionary.PdfStream, bool> Modification)
{
for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
return;
var resources = contentPage.Elements?.GetDictionary("/Resources");
var xObjects = resources?.Elements.GetDictionary("/XObject");
if (xObjects == null)
return;
foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
{
var stream = (item.Value as PdfDictionary)?.Stream;
if (stream != null)
if (Modification(stream))
return;
}
}

How a font is detected to be bold/italic/plain that is used in PDF

While Extracting Content from PDF using the MuPDF library, i am getting the Font name only not its font-face.
Do i guess (eg.bold in font-name though not the right way) or there is any other way to detect that specific font is Bold/Italic/Plain.
I have used itextsharp to extract font-family ,font color etc
public void Extract_inputpdf() {
text_input_File = string.Empty;
StringBuilder sb_inputpdf = new StringBuilder();
PdfReader reader_inputPdf = new PdfReader(path); //read PDF
for (int i = 0; i <= reader_inputPdf.NumberOfPages; i++) {
TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
text_input_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);
sb_inputpdf.Append(text_input_File);
input_pdf = sb_inputpdf.ToString();
}
reader_inputPdf.Close();
clear();
}
public class TextWithFont_inputPdf: iTextSharp.text.pdf.parser.ITextExtractionStrategy {
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
string curFont = renderInfo.GetFont().PostscriptFontName;
string divide = curFont;
string[] fontnames = null;
//split the words from postscript if u want separate. it will be in this
}
}
public string GetResultantText() {
return result.ToString();
}
The PDF spec contains entries which allow you to specify the style of a font. However unfortunately in the real world you will often find that these are absent.
If the font is referenced rather than embeded this generally means you are stuck with the PostScript name for the font. It requires some heuristics but normally the name provides sufficient clues as to the style. It sounds this is pretty much where you are.
If the font is embedded you can parse it and try and find style information from the embedded font program. If it is subsetted then in theory this information might be removed but in general I don't think it will be. However parsing TrueType/OpenType fonts is boring and you may not feel that it is worth it.
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)"

iTextSharp: Convert PdfObject to PdfStream

I am attempting to pull some font streams out of a pdf file (legality is not an issue, as my company has paid for the rights to display these documents in their original manner - and this requires a conversion which requires the extraction of the fonts).
Now, I had been using MUTool - but it also extracts the images in the pdf as well with no method for bypassing them and some of these contain 10s of thousands of images. So, I took to the web for answers and have come to the following solution:
I get all of the fonts into a font dictionary and then I attempt to convert them into PdfStreams (for flatedecode and then writing to files) using the following code:
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject((PdfObject)cItem.pObj);
PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
try
{
int xrefIdx = ((PRIndirectReference)((PdfObject)cItem.pObj)).Number;
PdfObject pdfObj = (PdfObject)reader.GetPdfObject(xrefIdx);
PdfStream str = (PdfStream)(pdfObj);
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
}
catch { }
But, when I get to PdfStream str = (PdfStream)(pdfObj); I get the error below:
Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary'
to type 'iTextSharp.text.pdf.PdfStream'.
Now, I know that PdfDictionary derives from (extends) PdfObject so I am uncertain as to what I am doing incorrectly here. Someone please help - I either need advice on patching this code, or if entirely incorrect, either code to extract the stream properly or direction to a place with said code.
Thank you.
EDIT
My revised code is here:
public static void GetStreams(PdfReader pdf)
{
int page_count = pdf.NumberOfPages;
for (int i = 1; i <= page_count; i++)
{
PdfDictionary pg = pdf.GetPageN(i);
PdfDictionary fObj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.FONT));
if (fObj != null)
{
foreach (PdfName name in fObj.Keys)
{
PdfObject obj = fObj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
int xrefIdx = ((PRIndirectReference)obj).Number;
PdfObject pdfObj = pdf.GetPdfObject(xrefIdx);
if (pdfObj == null && pdfObj.IsStream())
{
PdfStream str = (PdfStream)(pdfObj);
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
}
}
}
}
}
}
However, I am still receiving the same error - so I am assuming that this is an incorrect method of retrieving font streams. The same document has had fonts extracted using muTool successfully - so I know the problem is me and not the pdf.
There are at least two things wrong in your code:
You cast an object to a stream without performing this check: if (pdfObj == null && pdfObj.isStream()) { // cast to stream } As you get the error message that you're trying to cast a dictionary to a stream, I'm 99% sure that the second part of the check will return false whereas pdfObj.isDictionary() probably returns true.
You try extracting a stream from PdfReader and you're trying to cast that object to a PdfStream instead of to a PRStream. PdfStream is the object we use to create PDFs, PRStream is the object used when we inspect PDFs using PdfReader.
You should fix this problem first.
Now for your general question. If you read ISO-32000-1, you'll discover that a font is defined using a font dictionary. If the font is embedded (fully or partly), the font dictionary will refer to a stream. This stream can contain the full font information, but most of the times, you'll only get a subset of the glyphs (because that's best practice when creating a PDF).
Take a look at the example ListFontFiles from my book "iText in Action" to get a first impression of how fonts are organized inside a PDF. You'll need to combine this example with ISO-32000-1 to find more info about the difference between FONTFILE, FONTFILE2 and FONTFILE3.
I've also written an example that replaces an unembedded font with a font file: EmbedFontPostFacto. This example serves as an introduction to explain how difficult font replacement is.
Please go to http://tinyurl.com/iiacsCH16 if you need the C# version of the book samples.