Combining a page map with a PDF so that annotations move with pages in updated PDFs

Has anyone managed to provide an end user with an updated PDF and let that user transfer his local annotations to the new PDF while keeping the annotations on the correct pages, even when pages are inserted into the updated PDF at a point earlier in the document than the annotations?
I thought there might be a page-map or page-GUID approach that someone has used.
Sorry - I hope that is clear.

Instead of using the page index as the ID of the page, you can use the page content stream (after decoding/decompression). Most PDF libraries will give you access to that, so you could compute an MD5 hash of the page content and search for that hash in your "updated" file in order to know where to transfer your annotations.
This assumes that the page content really is identical in both files, which is not a given.
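For illustration, here is a minimal sketch of that idea using Apache PDFBox 2.x; the library choice, the class name and the zero-based index mapping are assumptions for the example, since the answer above only says "most PDF libraries" expose the decoded content.
import java.io.File;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class PageContentHashes {
    // Maps MD5(decoded page content stream) -> zero-based page index.
    static Map<String, Integer> hashPages(File pdf) throws Exception {
        Map<String, Integer> hashes = new HashMap<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            int index = 0;
            for (PDPage page : doc.getPages()) {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                // getContents() returns the page's content stream(s), already decoded
                try (InputStream content = page.getContents()) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = content.read(buf)) != -1) {
                        md5.update(buf, 0, n);
                    }
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md5.digest()) {
                    hex.append(String.format("%02x", b));
                }
                hashes.put(hex.toString(), index++);
            }
        }
        return hashes;
    }
}
You would build this map from the updated file, hash the page each annotation currently sits on in the old file, and look that hash up to find the page's new index.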

Related

Apache FOP - Extra content at the end

I have a simple task which I've found quite complex to implement with Apache FOP.
I have already created the layout, so I have nice first-page, only-page, rest-pages and last-page definitions with well-distributed content. But now I sometimes need to add some extra content at the end of the document (such as Terms and Conditions or Agreement conditions), which can run to a few pages. That content shouldn't have any header, footer, page number, etc.; just a text flow with paragraphs.
Thank you.
Kind regards.
Being new to FOP, I hadn't fully understood its capabilities when I posted my question here.
I've since sorted out the issue with a new page-master and a new page-sequence definition.
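For anyone hitting the same problem, a minimal XSL-FO sketch of what that can look like; the master name "terms-page", the dimensions and the sample block are assumptions for illustration, not copied from my actual stylesheet.
<!-- Inside fo:layout-master-set: a page master with only a body region,
     so these pages get no header, footer or page number. -->
<fo:simple-page-master master-name="terms-page"
                       page-height="29.7cm" page-width="21cm" margin="2cm">
  <fo:region-body/>
</fo:simple-page-master>

<!-- After the main fo:page-sequence: a second sequence for the extra content. -->
<fo:page-sequence master-reference="terms-page">
  <fo:flow flow-name="xsl-region-body">
    <fo:block>Terms and conditions text flows here, over as many pages as needed.</fo:block>
  </fo:flow>
</fo:page-sequence>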

Get page numbers of searchresult of a pdf in solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page.
So what I need is the page number and a short text snippet of every search result.
I'm using SOLR 4.1 to index pdf documents. The indexing itself works fine but I don't know how to get the page number and paragraph of a search result.
I found "Indexing PDF with page numbers with Solr" here, but it wasn't really helpful.
I'm now splitting the PDF and sending each page separately to SOLR.
So every page is its own document with an id of <id_of_document>_<page_number> and an additional field doc_id that contains only the <id_of_document>, used for grouping the results.
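For illustration, a rough sketch of that per-page indexing, assuming PDFBox 2.x for the text extraction and a recent SolrJ client; the core name, field names and file name are made up, and on Solr 4.1 you would use HttpSolrServer instead of HttpSolrClient.Builder.
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PerPageIndexer {
    public static void main(String[] args) throws Exception {
        String docId = "report-42";                            // hypothetical document id
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/pages").build();   // hypothetical core

        try (PDDocument pdf = PDDocument.load(new File("report-42.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int page = 1; page <= pdf.getNumberOfPages(); page++) {
                // extract the text of one page only
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                String text = stripper.getText(pdf);

                SolrInputDocument d = new SolrInputDocument();
                d.addField("id", docId + "_" + page);      // <id_of_document>_<page_number>
                d.addField("doc_id", docId);               // used for grouping the results
                d.addField("page", page);
                d.addField("content", text);
                solr.add(d);
            }
        }
        solr.commit();
        solr.close();
    }
}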
There is a JIRA issue, SOLR-380, with a patch that you can look at.
I also tried getting the results with page numbers but could not do it. I used Apache PDFBox to split all the PDFs in a directory and sent the resulting files to the Solr server.
I have not tried it myself.
Approach:
Build a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs.
Create multiple fields in Solr like page1, page2, page3, …, pageN; alternatively, use dynamic fields in Solr.
In the custom connector, read the PDFs page by page and index each page into its respective page field/dynamic field.
Enable search on all the "page" fields.
When the user searches, use the highlighter/summary/teaser component to retrieve only the "page" fields that have hits (see the sketch after this answer).
The "page" fields that have a hit (found via the highlighter/summary/teaser) for a given record are the pages that contain the searched phrase.
Link to the PDF with the page number in the URL fragment (e.g. "#page=N") and pop up that page on click.
This is a far better approach than splitting the PDFs and indexing them as separate Solr documents.
If you find a flaw in this design, respond to my thread and I will attempt to resolve it.
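To illustrate the highlighter step above, here is a rough SolrJ sketch; the core name, the "page*" dynamic-field naming and the assumption that a default search field (or an eDisMax qf) covers those fields are all made up for the example.
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PageHitSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs").build();    // hypothetical core

        SolrQuery q = new SolrQuery("\"searched phrase\"");
        q.setHighlight(true);             // enable the highlighter
        q.addHighlightField("page*");     // highlight over the per-page fields
        q.setHighlightSnippets(3);

        QueryResponse resp = solr.query(q);
        // Highlighting results are keyed by document id, then by field name;
        // the field names that come back (e.g. "page7") are the pages with hits.
        for (Map.Entry<String, Map<String, List<String>>> perDoc
                : resp.getHighlighting().entrySet()) {
            System.out.println("Document: " + perDoc.getKey());
            for (Map.Entry<String, List<String>> perField : perDoc.getValue().entrySet()) {
                System.out.println("  hit on " + perField.getKey()
                        + " -> " + perField.getValue());
            }
        }
        solr.close();
    }
}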

Where can I find hyperlinks in the PDF document structure (other than the "Annots" entry in the page dictionary)?

I have two PDF documents (doc1 and doc2) with hyperlinks, e.g. www.somlink.com and www.somlink2.com.
According to the PDF specification I can get those hyperlinks via link annotations, which can be found in the PDF page's dictionary under the "Annots" key.
CGPDFDictionaryRef pageDictionary = CGPDFPageGetDictionary(someCGPDFPage);
CGPDFArrayRef annots = NULL;
if (CGPDFDictionaryGetArray(pageDictionary, "Annots", &annots)) {
    // iterate over annots and look for link annotations (Subtype /Link)
}
The problem is that in one PDF document (doc1) I get that "Annots" array, but in the other document (doc2) there is no such entry in the page dictionary.
The thing is that with PDFKit.framework you can get those annotations from the PDFPage class using the - (NSArray *)annotations method, even if there is no "Annots" entry in the page dictionary.
I can't use PDFKit.framework on iPad/iPhone, so I am working with the Quartz framework :)
So it seems that there is another place where hyperlinks (Link Annotations in the PDF Reference) can be specified, not only the "Annots" array, and PDFKit.framework somehow knows how to find them.
Any ideas where can I get those hyperlinks?
Links on a page THAT YOU CAN CLICK ON have to be annotations. Period. No annotations, no links.
A string of text like "http://blah.com" isn't necessarily a link; it's just a piece of text describing a URL. This may be what's causing your confusion.
It's also possible to embed link actions in bookmarks. I'm not at all familiar with PDFKit or Quartz, so you're on your own as far as API calls are concerned.
And finally, (having reread your question), I believe annotations can be inherited from their parent Pages object. Gonna have to look that one up... Nope. The annotations array MUST be in the leaf page object, or it's not valid.
Can you post links to your PDFs? Something Ain't Right here.
A PDF viewer like Adobe Reader simply lets you click and navigate on plain text if it looks like a hyperlink, i.e. it starts with http://, https:// or ftp:// and ends at a URL delimiter such as a space. As simple as that ;)

how to read/parse dynamically generated web content?

I need to find a way to write a program (in any language) that will connect to a website and read dynamically generated data from the website.
Note that it's dynamically generated: it's not enough to get the source HTML, because the data I'm interested in is generated via JavaScript that references back-end code. So when I view the page source, I can't see the data. (For example, go to Google and do a search. Check the source code on the search results page. Very little of the data your browser is displaying is reflected in the source; most of it is dynamically generated. I need some way to access this data.)
Pick a language and environment that includes an HTML renderer (e.g. .NET and the WebBrowser control). Use the HTML renderer to get the URL and produce an HTML DOM in memory (making sure that scripting is enabled). Read the contents of the HTML DOM after the renderer has done its work.
Example (you'll need to do this inside a System.Windows.Form derived class):
WebBrowser browser = new WebBrowser();
browser.DocumentCompleted += (sender, e) =>
{
    // Navigate() is asynchronous; the DOM is only fully built once this event fires.
    HtmlDocument document = browser.Document;
    // extract what you want from the document
};
browser.Navigate("http://www.google.com");
I used to have a Perl program that accessed Mapguide.com to get driving directions from one location to another. I parsed the returned page and saved the results to a database. If the source never changes its format, that's OK; the problem is that the source format changes often, so your parser has to change with it.
A simple thought: if we're talking about AJAX, you can instead look up the URLs that serve the dynamic data, then use the JavaScript on the page you're talking about to reformat it.
If you have Firefox/Greasemonkey, making a DOM dumper should be a simple matter.

How to tell image search which image matters?

Google image search seems to do a poor job on a site I run in identifying which image on a page should be indexed. In addition it doesn't seem to link that image with lots of the associated data.
Are there any ways of focusing a spider's attention on particular images and their associated data? Do they need to be within the same tags, or adjacent on the page?
A few tips:
Use a descriptive file name, e.g. "tabby-cat.jpg" instead of "img02396.jpg".
Use alt tags on images.
Use descriptive text on the page and around the image.
Make sure the images are in the generated source, i.e. if you click "View source" in your browser, you see <img> tags.
It's also useful to validate your site at http://validator.w3.org in case there are major errors like missing brackets etc that could prevent a spider from parsing the page. (Note: I wouldn't worry about making everything 100% valid since Google is fine with invalid code)
Images in CSS (i.e. backgrounds) are not indexed AFAIK. However I'd suggest using CSS backgrounds for "design" images (a subtle way of getting Google to ignore site headers, custom borders, shadows, etc).
Nor are any images generated from Javascript.
Make sure you're not blocking images through robots.txt. I know that Joomla does this by default.
Sign up at Google Webmaster Tools, add your site, then allow it to be used in Google's "Image Labeller" game which should help tag images.
All images on a page should be indexed. If they aren't, then improve your alt tags and possibly rename the image files. There really isn't anything more you can do, since search engines do not read any other context for the image itself except its size. If Google thinks the image is a duplicate, it won't index it either.
Of course, if images really do inherit context from the surrounding page, then you could just use fewer images or move them into CSS.
I think search robots cannot read images the way we do, so the simplest and most important thing you should do for your images is use descriptive names, so that the spider can tell what each image is about. The second is to use ALT tags on the images, with keywords relating to them.
Those are the things I do.