PDFDebugger in PDFBox locks up when viewing a page content stream

I have an odd PDF that appears to have image data encoded directly into the content stream instead of being tucked away as a resource (there are images in resources but they're not actually in the page for some reason). The actual page's content length is very large (107,988,275).
It's killing our servers so I thought I'd crack open the PDFDebugger to see what is in the content stream. When I open the PDF and navigate to the content stream, it just locks up completely. I've tried increasing the heap size (4g) and it didn't seem to help.
Is there a way I can view just the head of the stream? I'd really love to know what is in this thing. Is there a way of encoding image data directly into a page's content stream?

As Tilman suggested in the comments, you can view the content stream by writing the InputStream given by PDPage.getContents() to a file.
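A minimal sketch of that idea, using only the Java standard library: it copies just the first bytes of a stream to a file so the head can be inspected in a text editor. The class and method names (ContentStreamHead, readHead, dumpHead) are hypothetical and not part of PDFBox, and the in-memory stream in main is only a stand-in; in real use the InputStream would be the one returned by PDPage.getContents(). It assumes Java 11 or later for InputStream.readNBytes. As for the second question: PDF does allow inline images in a content stream, delimited by the BI, ID and EI operators, so seeing those near the top of the dump would be consistent with what you describe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of PDFBox): copies only the start of a stream. */
final class ContentStreamHead {

    /** Reads at most maxBytes from the stream and returns them. */
    static byte[] readHead(InputStream in, int maxBytes) {
        try {
            // readNBytes (Java 11+) stops at maxBytes or at end of stream,
            // whichever comes first.
            return in.readNBytes(maxBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the first maxBytes of the stream to target; returns the bytes written. */
    static int dumpHead(InputStream in, Path target, int maxBytes) {
        byte[] head = readHead(in, maxBytes);
        try {
            Files.write(target, head);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return head.length;
    }

    public static void main(String[] args) {
        // Stand-in data; with PDFBox this would be the InputStream
        // returned by PDPage.getContents() for the problem page.
        byte[] fake = "q 100 0 0 100 0 0 cm BI /W 2 /H 2 ID ..."
                .getBytes(StandardCharsets.US_ASCII);
        byte[] head = readHead(new ByteArrayInputStream(fake), 16);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
    }
}
```

Whether reading only the head avoids the memory pressure depends on how the stream is decoded underneath, which this sketch does not address.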

Related

Decreasing size of PDF when using puppeteer for pdf generation

We are using IDR for converting PDF documents to HTML.
After making some modifications, we use Puppeteer to convert the document back to PDF, but the resulting files have an increased page size (even if I don't modify the HTML at all).
For example, if the original page size is 500 KB, I get a page of 1000 KB.
The page contains only some text.
Please help me understand the reason behind this and how to solve it.

How do I re-use the same image payload in several pages without repeating it?

I'm using tcpdf to generate a report that contains a logo from a svg vector image.
My goal is to efficiently re-use the same image payload over and over in the report, not storing the logo as if it was a different image on each page.
Right now, with the current data, the report generates 32 pages. The file size considerably increases with new pages being added. This seems to be due to the logo being repeated on every page.
I don't have tools to analyze what is inside the pdf but I can see from other reports that are generated by other applications, that the file size of pdfs containing repeated images peaks at 1 page and then on each consecutive page, the size increases very slightly, indicating that the first logo is efficiently re-used.
How can I achieve that using tcpdf?
If in my report, I place the logo only in page 1 and omit it in pages 2 - 32, still outputting all the text data, the file size is greatly reduced, just as in the examples that I mentioned before. This indicates that the svg data is repeated on every page.
From the example 009 in tcpdf's site documentation, I've tried loading the image from file and also tried using a "data stream" (this is encoding the svg in base64 and instead of referencing the image from a file, you use the text-based base64 variable content as a stream that contains the image payload).
I thought that using the data stream would take care of it, but it didn't.
Is there a way to reference the same image over and over in tcpdf?

Android camera, take picture(s) and save as multipage PDF, then upload to server via <input type="file" />

I have a webform with <input type="file" /> and want to open it on a smartphone, then take pictures of some documents, which need to be merged into one PDF; at the end, this file needs to be uploaded to the server.
My solution is to upload the PDF (scan) to Google Drive and then somehow download the file from Google Drive to the server via some sort of widget installed on the website (any links appreciated).
Maybe someone has a better idea?
I know it's late, but my answer might help others. I faced the same challenge and implemented a custom solution in JavaScript; since you are using a web form, this solution should fit your need.
You can use the jsPDF JavaScript library. jsPDF gives you a PDF object in the browser that you can upload or download, and there is a lot more you can do with it.
First, initialize the jsPDF object as required. Here I create a landscape PDF with a page size of 500 pt by 500 pt.
pdf = new jsPDF("l", "pt", [500,500]);
When you take a picture with the camera, you get it as base64 data; insert that data into the jsPDF object:
pdf.addImage(imgData, 'JPEG', 0, 0);
You can repeat the above code for as many camera pictures as you want. Call pdf.addPage() before each additional image so that every picture lands on its own page, in sequence; without it, each image is drawn on the current page.
Once you are done, you can get the PDF as a base64 data URI string with the code below and upload it to any server.
pdf.output('datauristring')
The above covers only the PDF part. A complete working example, including the camera part, is in "Javascript Component to Scan Document".

Displaying the contents of a PDF file on the page using Coldfusion

I have a page that is dedicated to the Standard Operating Procedures (SOP). I want this page to show the SOP in the page with a download button above it (and, for Admin, an upload button). Basically I want the user to be able to read the SOP without having to download it. I have the buttons sorted and I almost have the display set, but the format is off.
The admin can upload a PDF of the current SOP. That file then gets stored and overwrites the last upload. I tried using cffile but it was unreadable no matter what charset I tried to use. Currently I am taking the file and extracting it as a .txt, then using cffile to read it to a variable that I then output to the screen. It sort of works, but the formatting is all wrong.
I know I can use cfcontent and just have the page be the PDF, but I'd rather not have to mess with adding a new page just for admins to upload new SOP files. (The way the site is built it would have to be a new page)
<cfpdf
    action="extracttext"
    source="D:\file_path\SOP.pdf"
    overwrite="true"
    honourspaces="true"
    type="string"
    useStructure="true"
    destination="D:\file_path\SOP.txt">
<cffile
    action="read"
    file="D:\file_path\SOP.txt"
    variable="dcnSOP">
...
<cfoutput>#dcnSOP#</cfoutput>
Basically I'm getting a block of unformatted text (no spacing and no paragraph breaks). It's the text I want, and it's on the page where I want it, but it looks terrible. It seems to be stripping out any newline characters and presenting the text as a blob. Is there a better way of doing this without just having the whole page be the PDF using cfcontent?
Thanks to @Miguel-F and @Ageax for the suggestions and for leading me to a question I missed on here when I was searching for the answer.
<embed src="\file_path\SOP.pdf" width="800px" height="2100px"/>
This works with every browser but Chrome (our clients will not be using mobile browsers). I know you can use Google's PDF reader to get around this. If anyone is interested, here is an example given by @Script47:
<embed src="https://drive.google.com/viewerng/viewer?embedded=true&url=http://example.com/the.pdf" width="500" height="375">

Distorted images after uploading to S3 (optimized with ImageMagick)

I am uploading an image to S3 via the Amazon SDK. The PNG images are optimized using the ImageMagick ASP.NET library. The problem is that they look fine when optimized on my computer (testing locally), but after being uploaded to S3 they are heavily distorted. Do you know what could cause this?
I am using ASP.NET. I thought the reason might be that the image hasn't been completely saved, but that doesn't seem likely, because the file should have been locked and could not have been streamed.
Here, take a look..
http://i1182.photobucket.com/albums/x448/dphotowriter/2011-09-07_002928.png
I did a test. When I upload the image directly to Amazon via AWS, it is fine. The problem lies somewhere between saving the image and the moment it is streamed. Maybe the save is asynchronous, the image hasn't been written completely, and only part of it gets uploaded.
I tried to put:
System.Threading.Thread.Sleep(5000);
after the optimization, but it didn't help either. Maybe it has something to do with the stream for the PNG files. I do the following:
1) Save the image to a temp.png file.
2) Read the file into an image object.
3) Convert the file to a byte array.
4) Pass the byte array to the MemoryStream constructor.