What application does Google use to show PDF attachments in Gmail?

I watched the traffic when Google displays PDF attachments in Gmail in a new window. The content is served as one PNG image per PDF page, and its text can be selected. What does Google use on the server side to generate a PNG file for a particular page of a PDF file? How does text selection on a PNG image work? Any ideas?

By default attachments are viewed securely using https://docs.google.com/gview, however it turns out you are allowed to request files over plain HTTP. This makes it a little bit easier to figure out what is going on using Wireshark.
As you indicated, the PDF is clearly converted to PNG on the server side (ImageMagick is indeed a reasonable tool for this). The obvious reason is to preserve the exact layout while still letting users view the file without a PDF viewer.
However, from looking at the traffic I found out that the entire PDF is also converted to a custom XML format when calling /gview?a=gt&docid=&chan=&thid= (this is done as soon as you request the document). As I couldn't use Wireshark to copy the XML I resorted to the Firefox extension Live HTTP Headers. Here's an excerpt:
<pdf2xml>
<meta name="Author" content="Bruce van der Kooij"/>
<meta name="Creator" content="Writer"/>
<meta name="Producer" content="OpenOffice.org 3.0"/>
<meta name="CreationDate" content="20090218171300+01'00'"/>
<page t="0" l="0" w="595" h="842">
<text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
<text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
</page>
</pdf2xml>
I'm not quite sure yet what all the attributes on the text element stand for (with the exception of w and h), but they're obviously the coordinates of the text and possibly its length. As the JavaScript Google uses is minified (or possibly obfuscated, but this is not likely), figuring out precisely how the client-side selection function works is not that easy. Most likely it uses this XML file to work out which text the user is selecting and then copies that to the user's clipboard.
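To illustrate how a client could map a mouse selection to text using this XML, here is a minimal Python sketch. It assumes that l and t are the left and top offsets of each text box in page units (the answer above only confirms w and h), and the function name texts_in_rect is made up for this example. It is not Google's actual code.

```python
import xml.etree.ElementTree as ET

# Excerpt from the XML shown above (meta elements omitted).
SAMPLE = """<pdf2xml>
<page t="0" l="0" w="595" h="842">
<text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
<text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
</page>
</pdf2xml>"""


def texts_in_rect(xml_source, left, top, right, bottom):
    """Return the text of every <text> element whose box overlaps the rectangle."""
    hits = []
    for node in ET.fromstring(xml_source).iter("text"):
        # Assumption: l/t are the left/top offsets, w/h the width/height.
        l, t = int(node.get("l")), int(node.get("t"))
        w, h = int(node.get("w")), int(node.get("h"))
        # Standard axis-aligned rectangle overlap test.
        if l < right and l + w > left and t < bottom and t + h > top:
            hits.append(node.text)
    return hits
```

A selection rectangle dragged over the PNG would be scaled to page units first and then passed to a function like this; the returned strings are what would end up on the clipboard.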
Note that there is an open source (GPL licensed) tool called pdf2xml which has similar but not quite the same output. Here's the example from their homepage:
<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
<title>My Title</title>
<page width="780" height="1152">
<font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
<text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
<img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
<link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
</font>
<font size="12" face="AGaramond-Regular" italic="true" bold="true">
<text x="509" y="68" width="121" height="12">This is a test PDF file</text>
<link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
</font>
</page>
</pdf2xml>
I hope this information is useful in some way. However, as one of the other posters mentioned, the only way to be sure what Google does is to ask them. It's a shame Google doesn't have an official IRC channel, but they do have a forum for Google Docs support questions.
Good luck.

Google uses a non-open-sourced PDF converter app developed in-house. So you're better off looking into the links posted by other answers, since you can't get your hands on the Google version. Sorry!

If you have the text, you can of course do whatever you want with it.
More specifically, check out this link: pdf to png using php
You will need ImageMagick for that.
Edit: another interesting link.
Edit: I found this at Google and it looks interesting, so you could use the Google API:
Google Document List Data API, and this is a blog post about it: Google API Now Lets You Get Documents in Many Formats
Of course, to be sure what Google uses you need an answer from them. :)
Good luck!
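The PDF-to-PNG step with ImageMagick mentioned in this answer could look roughly like the Python sketch below. It assumes ImageMagick 7 (the magick command) with Ghostscript installed for PDF support; the function names and the 150 DPI default are choices made for this example, and nothing here is confirmed to be what Google runs.

```python
import shutil
import subprocess


def build_render_command(pdf_path: str, page_index: int, out_png: str, dpi: int = 150) -> list[str]:
    """Build an ImageMagick argv that rasterizes one PDF page to a PNG."""
    # "file.pdf[N]" selects the zero-based page N. -density sets the render
    # resolution in DPI and must come before the input file to take effect.
    return ["magick", "-density", str(dpi), f"{pdf_path}[{page_index}]", out_png]


def render_page(pdf_path: str, page_index: int, out_png: str) -> bool:
    """Run the command if ImageMagick is on PATH; return True on success."""
    if shutil.which("magick") is None:
        return False  # ImageMagick 7 is not installed
    cmd = build_render_command(pdf_path, page_index, out_png)
    return subprocess.run(cmd, check=False).returncode == 0
```

With ImageMagick 6, the same arguments would be passed to convert instead of magick.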

To see what a PDF was created with, right-click it and open Document Properties (in Adobe Reader). The tool shows up as the "PDF Producer". I think Google uses both Prince and iText (not in combination) for creating PDFs. Google has made some major modifications to these toolkits to create the end product.

Well, this might just be the pdf2xml tool that Google is using. They only abbreviated the full attribute names (width, height, etc.) and added the p attribute, which turns out to contain the coordinates of the words inside the line. I just played with it and found out :) I'm going to use this pdf2xml from Google :P Upload, let them convert, then use the XML to transform to... EPUB? :P
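The excerpt in the first answer is consistent with this reading of the p attribute: the one-word line has one pair of numbers, the five-word line has five pairs, and the first number of each line matches its l value. Assuming each pair is (left, width) for one whitespace-separated word, which is an interpretation rather than documented behavior, the words can be mapped to boxes like this (word_boxes is a name invented for this sketch):

```python
def word_boxes(p_attr, text):
    """Pair each whitespace-separated word with an assumed (left, width) box."""
    numbers = [int(n) for n in p_attr.split(",")]
    # Assumption: p holds alternating left,width values, one pair per word.
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    words = text.split()
    if len(pairs) != len(words):
        raise ValueError("p attribute does not match the word count")
    return list(zip(words, pairs))
```

Combined with the line's t and h values, this gives a bounding box per word, which would be enough for word-level selection on top of the PNG.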

You may also want to investigate using Lucene to index those big PDF files and serve the relevant pages to your users.
See http://www.jguru.com/faq/view.jsp?EID=1074237 for more ideas.

Related

Displaying the contents of a PDF file on the page using Coldfusion

I have a page that is dedicated to the Standard Operating Procedures (SOP). I want this page to show the SOP in the page with a download button above it (and, for admins, an upload button). Basically I want the user to be able to read the SOP without having to download it. I have the buttons sorted and I almost have the display set, but the format is off.
The admin can upload a PDF of the current SOP. That file then gets stored and overwrites the last upload. I tried using cffile, but it was unreadable no matter what charset I tried to use. Currently I am taking the file and extracting it as a .txt, then using cffile to read it into a variable that I then output to the screen. It sort of works, but the formatting is all wrong.
I know I can use cfcontent and just have the page be the PDF, but I'd rather not have to mess with adding a new page just for admins to upload new SOP files. (The way the site is built it would have to be a new page)
<cfpdf
action="extracttext"
source="D:\file_path\SOP.pdf"
overwrite="true"
honourspaces="true"
type="string"
useStructure="true"
destination="D:\file_path\SOP.txt">
<cffile
action="read"
file="D:\file_path\SOP.txt"
variable="dcnSOP">
...
<cfoutput>#dcnSOP#</cfoutput>
Basically I'm getting a block of unformatted text (no spacing and no paragraph breaks). It's the text I want, and it's on the page where I want it, but it looks terrible. It seems to be dropping any newline characters and presenting the text as a blob. Is there a better way of doing this without just having the whole page be the PDF using cfcontent?
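One possible cause, assuming the extracted .txt file still contains its line breaks, is that browsers collapse whitespace and newlines when rendering HTML, so the text has to be converted to markup before output. The Python sketch below only illustrates that transformation (the page itself is ColdFusion, and text_to_html is a hypothetical helper, not part of cfpdf):

```python
import html


def text_to_html(text):
    """Turn plain text into HTML paragraphs, keeping single line breaks as <br>."""
    # Blank lines separate paragraphs; normalize Windows line endings first.
    paragraphs = [p for p in text.replace("\r\n", "\n").split("\n\n") if p.strip()]
    return "\n".join(
        "<p>" + html.escape(p.strip()).replace("\n", "<br>\n") + "</p>"
        for p in paragraphs
    )
```

A simpler alternative under the same assumption is to wrap the output in a pre element, which tells the browser to keep the whitespace as it is.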
Thanks to @Miguel-F and @Ageax for the suggestions and for leading me to a question I missed on here when I was searching for the answer.
<embed src="\file_path\SOP.pdf" width="800px" height="2100px"/>
This works with every browser but Chrome (our clients will not be using mobile browsers). I know you can use Google's PDF reader to get around this. If anyone is interested, here is an example given by @Script47 here:
<embed src="https://drive.google.com/viewerng/viewer?embedded=true&url=http://example.com/the.pdf" width="500" height="375">

XPage - Open scans in browser

I need to display uploaded scans (JPG, PNG, TIFF, PDF, etc.) in the browser window instead of downloading them to a local PC and using external apps like Acrobat Reader.
I did some research on the web about this issue but wasn't really successful.
Does anyone have hints or code snippets showing how to achieve that?
EDIT :
Since I am not looking for a solution which supports viewing scans in a typical browser like Chrome, FireFox, etc. but supports viewing scans in an XPage view within Notes I need to ask my question again.
What is the best (recommended) way to view different types of scans, uploaded as PDF, JPG, TIFF, PNG, etc., in Notes within an XPage view ?
Take a look here, XPages: Embed PDF and possibly Office files
Here is some code that I have in an app for PDFs.
I tried using Bumpbox and pdf.js, and while I could get them working, iframes with normal Domino attachment URLs seemed to work best for me in XPages.
I am not sure if this solution is right or not, but it works well for an app I have that only has PDFs. It does work on mobile too, at least on iOS.
<iframe
src="#{javascript:
var url = 'https://app.nsf/';
var doc = sessionScope.docID;
var atname = @RightBack(sessionScope.aname,'Body');
var end = '/$file'+atname;
return url+doc+end}"
width="800" height="1000">
</iframe>
If you need to handle different file types, you need to use a renderer, give it the attachment URL, and then display what the renderer returns. I haven't looked at this in a while, so things might have changed. Look for a lightbox clone that can display PDFs. I think Orangebox was one; Bumpbox no longer seems to be updated, but I was able to get it working.
This method will display everything inline. I would love to see some type of renderer like pdf.js for xpages.

Visible watermark (ex libris) on Mobi files

We have a client selling ebooks who wants to add the buyer's name to every page of the book (for example in the footer) to discourage the buyer from sharing it too widely. Apparently this is called adding an "ex libris". Our client wants to sell ebooks in PDF/ePub/Mobi formats.
I've searched the Interweb about how to do this, and so far I've found that doing this to PDFs is quite easy and that there is a library to do exactly this on ePubs. But I've found close to nothing related to Mobi files.
So my questions are :
Is it possible to add text on every page of a given .mobi file, for example in the footer?
If it's not possible, how accurate would it be to convert a watermarked epub to the mobi format? What would be the best tools and practices for the job?
This discussion is not about how I could add a hidden watermark to the files through some form of steganography.
We have done something similar before to use a footer image at the bottom of each page in a Mobi file. This image could contain the name of the buyer.
<style>
body{
background-image:url('[watermarkpath].PNG');
background-repeat:no-repeat;
background-attachment:fixed;
background-position:center bottom;
background-size:contain;
}
</style>
I suggest making the image itself transparent so that it doesn't interfere with the text. Hope this helps!

Display webcam snapshots on website

I have a Dlink DCS-942-L webcam that will ftp one snapshot jpeg per hour to my website. It embeds them in a folder by date and then again in subfolders by hour according to time. I would like to display these pictures on a webpage. Any ideas would be appreciated.
If it were me, I would put the images in the same folder and then, once a day, add the image tags:
<img alt="A description" src="images/myimage.jpg" width="200px" height="300px" />
Plain HTML has no automatic way to write code to a webpage. If it did, website developers and designers would be out of a job. :-)
Someone might have created a C++, C#, or VB program to do this custom job. You might want to add javascript, C#, C++, and VB tags to this question so those groups can see it.
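As a rough idea of what such a program could do, the Python sketch below walks the snapshot folders and generates the image tags. It assumes the camera writes .jpg files into date and hour subfolders whose names sort chronologically, and that the folder is served under an images/ URL prefix; both function names and the prefix are made up for this example.

```python
import html
from pathlib import Path


def find_snapshots(root):
    """Return the relative paths of all .jpg files under root."""
    root = Path(root)
    return [p.relative_to(root).as_posix() for p in root.rglob("*.jpg")]


def gallery_html(rel_paths, url_prefix="images"):
    """Build one <img> tag per snapshot, ordered by relative path."""
    tags = []
    # Sorting by path gives date/hour order if folder names are zero-padded.
    for rel in sorted(rel_paths):
        src = html.escape(f"{url_prefix}/{rel}", quote=True)
        alt = html.escape(f"Snapshot {rel}", quote=True)
        tags.append(f'<img alt="{alt}" src="{src}" />')
    return "\n".join(tags)
```

Running gallery_html(find_snapshots(folder)) on a schedule (for example once an hour) and writing the result into the page would keep it up to date without editing the tags by hand.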

HTML PDF Viewer

Is there any alternative way to view PDF files on the web instead of using Acrobat Reader? I need to control the viewer to programmatically trigger the printing of the document.
The source of the PDF should come from a webservice URL / AspX
The easiest option, I think, is to use the Google Docs Viewer:
<iframe src="http://docs.google.com/viewer?url=PathToMyPdfFile.pdf&embedded=true" width="600" height="780" style="border: none;"></iframe>
You need to host your PDF files somewhere online, for example in a folder on your public website (it needs to be a public site), and put the link to the PDF file in place of "PathToMyPdfFile.pdf" in the iframe above. Then set the width and height you need.
Google even generates this code for you here:
https://docs.google.com/viewer
Then simply put this iframe anywhere in the body of your page where you want to display your PDF. The viewer supports many other file formats too.
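Because the PDF link is passed as a query parameter, it should be URL-encoded, especially when it comes from a web service URL that has its own query string. A minimal Python sketch of building the iframe, assuming the viewer endpoint quoted above is still available (the function names are invented for this example):

```python
import html
from urllib.parse import urlencode


def viewer_src(pdf_url):
    """Build the viewer URL with the PDF link safely URL-encoded."""
    query = urlencode({"url": pdf_url, "embedded": "true"})
    return "https://docs.google.com/viewer?" + query


def viewer_iframe(pdf_url, width=600, height=780):
    """Return the iframe markup; the src is HTML-escaped for use in an attribute."""
    src = html.escape(viewer_src(pdf_url), quote=True)
    return (
        f'<iframe src="{src}" width="{width}" height="{height}" '
        'style="border: none;"></iframe>'
    )
```

For example, viewer_iframe("http://example.com/the.pdf") produces markup equivalent to the hand-written iframe above, with the PDF address encoded so that characters such as & or ? in it do not break the viewer's own parameters.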
There are quite a few options for viewing documents online, some open source and others proprietary. Personally, I've had good experiences with Flex Paper. It lets you include the document viewer on your website, and there are developer resources that let you integrate it with the functionality you're looking for.
For demos, see here: http://flexpaper.devaldi.com/demo/
You can use the Foxit PDF viewer. It's free and programmable.