Decide if I need a new PDPage using PDFBox - pdf

I understand that PDFBox offers only a very low-level API for PDF generation, and that I have to decide manually when I need a new PDPage.
I'm trying to write some simple logic to create a new PDPage when the current page is full.
The only option I can think of is to check whether my Y pointer on the page is about to cross the page height. But this needs to be done every time I add anything to the PDPageContentStream.
Is there a better approach than this?
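One way to avoid repeating the check at every call site is to route every drawing call through a small helper that owns the Y pointer. The following is only a minimal sketch, not PDFBox API: `PageCursor`, `reserve` and the `onNewPage` callback are hypothetical names. The callback is where you would presumably close the current PDPageContentStream, add a new PDPage to the document and open a new stream. It assumes Y is measured from the bottom of the page, as in PDF user space.

```java
// Hypothetical helper (not part of PDFBox): centralizes the "is the page full?" check.
final class PageCursor {
    private final float pageHeight;
    private final float margin;
    private final Runnable onNewPage; // assumed hook: close stream, add PDPage, open new stream
    private float y;                  // current position, measured from the bottom of the page
    private int pageCount = 1;

    PageCursor(float pageHeight, float margin, Runnable onNewPage) {
        this.pageHeight = pageHeight;
        this.margin = margin;
        this.onNewPage = onNewPage;
        this.y = pageHeight - margin;
    }

    /** Reserves vertical space and returns the Y at which to draw; starts a new page if needed. */
    float reserve(float height) {
        if (y - height < margin) {
            onNewPage.run();
            pageCount++;
            y = pageHeight - margin;
        }
        y -= height;
        return y;
    }

    int pageCount() {
        return pageCount;
    }
}
```

Each piece of content then calls `reserve(lineHeight)` before drawing and uses the returned Y. The sketch does not handle an element taller than a whole page; that case would need splitting logic of its own.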

Related

Manually add new page in PDFBox using pdfbox-layout

I'm using pdfbox-layout to create and manage PDF documents through its Document API.
Document document = new Document();
With the Paragraph API, it creates a new page automatically when the text grows beyond the current page.
Paragraph paragraph = new Paragraph();
However, I was unable to add a new page manually as and when needed. I want to print some content starting on a new page.
After going through the API's source code, I found that the add method accepts a parameter of type Element:
document.add(Element element)
After inspecting all the classes that implement the Element interface, I found the one I need: ControlElement.
So, to add a new page, my code looks like this:
document.add(ControlElement.NEWPAGE);

Apache FOP - Extra content at the end

I have a simple task, which I've found quite complex to implement with Apache FOP.
I have already created a layout, so I have nice first-page, only-page, rest-page and last-page definitions with well-distributed content. But now I sometimes need to add some extra content to the end of the document (such as terms and conditions, or agreement conditions), which can take up several pages. That content shouldn't have any header, footer, page number, etc., just a text flow with paragraphs.
Thank you.
Kind regards.
As I am new to FOP, I hadn't fully understood its capabilities when I posted my question here.
I've since sorted out the issue with a new page-master and a new page-sequence definition.

Combining a page map with a PDF so that annotations move with pages in updated PDFs

Has anyone managed to provide an end user with an updated PDF in a way that lets the user transfer their local annotations to the new PDF and keeps the annotations on the correct pages, even when pages have been inserted into the updated PDF at a point earlier than the annotations?
I thought there might be a page map or page guid approach that someone has used.
Sorry - I hope that is clear.
Instead of using the page index as the ID of a page, you can use the page content stream (after decoding/decompression). Most PDF libraries will give you access to that, so you could compute an MD5 hash of the page content and search for that hash in your "updated" file to find out where to transfer your annotations.
This assumes that the page content is indeed identical, which is not a common scenario.
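The hashing idea can be sketched roughly as follows. This assumes you have already extracted each page's decoded content stream as a byte array with whatever PDF library you use; `PageMatcher`, `md5Hex` and `mapPages` are hypothetical names, not part of any PDF library.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: matches pages between two versions of a PDF by content hash.
final class PageMatcher {
    /** Hex MD5 digest of one page's decoded content stream. */
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    /** Maps old page index to new page index; pages with no identical match are left out. */
    static Map<Integer, Integer> mapPages(List<byte[]> oldPages, List<byte[]> newPages) {
        Map<String, Integer> newIndexByHash = new HashMap<>();
        for (int i = 0; i < newPages.size(); i++) {
            // If several pages share the same content, the first one wins.
            newIndexByHash.putIfAbsent(md5Hex(newPages.get(i)), i);
        }
        Map<Integer, Integer> mapping = new HashMap<>();
        for (int i = 0; i < oldPages.size(); i++) {
            Integer target = newIndexByHash.get(md5Hex(oldPages.get(i)));
            if (target != null) {
                mapping.put(i, target);
            }
        }
        return mapping;
    }
}
```

Annotations on an old page that has no entry in the resulting map would have to be placed by hand, and pages with identical content (blank pages, for example) cannot be told apart this way.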

Where can I get hyperlinks in the PDF document structure (except the "Annots" entry in the page dictionary)?

I have two PDF documents (doc1 and doc2) with hyperlinks, e.g. www.somlink.com, www.somlink2.com.
According to the PDF Specification, I can get those hyperlinks via Link Annotations. Link Annotations can be found in the PDF page's dictionary under the "Annots" key.
CGPDFDictionaryRef pageDictionary = CGPDFPageGetDictionary(someCGPDFPage);
CGPDFArrayRef annots;
CGPDFDictionaryGetArray(pageDictionary, "Annots", &annots);
The problem is that in one PDF document (doc1) I get that "Annots" array, but in the other document (doc2) there is no such entry in the page dictionary.
And the thing is that with PDFKit.framework you can get those annotations from the PDFPage class using the - (NSArray *)annotations method, even if there is no "Annots" entry in the page dictionary.
I can't use PDFKit.framework on iPad/iPhone, so I am working with the Quartz framework :)
So it seems that there is another place where hyperlinks (or Link Annotations, in PDF Reference terms) can be specified, not only in the "Annots" array, and PDFKit.framework somehow knows how to find them.
Any ideas where I can get those hyperlinks?
Links on a page THAT YOU CAN CLICK ON have to be annotations. Period. No annotations, no links.
A string of text "http://blah.com" isn't necessarily a link, it's just a piece of text describing a URL. This may be what's causing your confusion.
It's also possible to embed link actions in bookmarks. I'm not at all familiar with PDFKit or Quartz, so you're on your own as far as API calls are concerned.
And finally, (having reread your question), I believe annotations can be inherited from their parent Pages object. Gonna have to look that one up... Nope. The annotations array MUST be in the leaf page object, or it's not valid.
Can you post links to your PDFs? Something Ain't Right here.
A PDF viewer like Adobe Reader simply lets you click and navigate on plain text if it looks like a hyperlink, i.e. it starts with http://, https:// or ftp:// and ends at some URL delimiter such as a space. As simple as that ;)
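To get the same effect yourself, you could scan the extracted page text for strings that look like URLs. This is only a sketch of the heuristic described above, not how Adobe Reader or PDFKit actually implement it; `PlainTextLinks` and `find` are hypothetical names, and it assumes you have already extracted the page text as a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds URL-like runs of text in already-extracted page text.
final class PlainTextLinks {
    // A known scheme prefix followed by everything up to the next whitespace delimiter.
    private static final Pattern URL = Pattern.compile("(?:https?|ftp)://\\S+");

    static List<String> find(String pageText) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(pageText);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Note that text without a scheme, such as www.somlink.com, is not matched by this pattern, and trailing punctuation directly after a URL would be included in the match.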

How to read/parse dynamically generated web content?

I need to find a way to write a program (in any language) that will connect to a website and read dynamically generated data from the website.
Note that it's dynamically generated: it's not enough to get the source HTML, because the data I'm interested in is generated via JavaScript that references back-end code. So when I view the webpage source, I can't see the data. (For example, go to Google and do a search. Check the source code of the search results page. Very little of the data your browser is displaying is reflected in the source; most of it is dynamically generated. I need some way to access this data.)
Pick a language and environment that includes an HTML renderer (e.g. .NET and the WebBrowser control). Use the HTML renderer to get the URL and produce an HTML DOM in memory (making sure that scripting is enabled). Read the contents of the HTML DOM after the renderer has done its work.
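Example (you'll need to do this inside a System.Windows.Forms.Form-derived class):
WebBrowser browser = new WebBrowser();
// Navigate is asynchronous, so read the DOM only after loading has completed
browser.DocumentCompleted += (sender, e) =>
{
    HtmlDocument document = browser.Document;
    // extract what you want from the document
};
browser.Navigate("http://www.google.com");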
I used to have a Perl program that accessed Mapguide.com to get driving directions from one location to another. I parsed the returned page and saved the result to a database. If the source never changes its format, this works fine. The problem is that the source format changes often, and then your parser needs to change too.
A simple thought: if we're talking about AJAX, you can rather look up the urls for the dynamic data. Then you can use the javascript on the page you're talking about to reformat this.
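As a rough illustration of that idea, once you have found the URL the page's script calls (for example in the browser's network tab), you can request it directly. This is a minimal sketch assuming Java 11+ and a JSON endpoint; `AjaxFetch`, `buildRequest` and `fetch` are hypothetical names, the headers are an assumption about what such an endpoint might expect, and real endpoints may also need cookies or tokens.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: calls the data endpoint directly instead of rendering the page.
final class AjaxFetch {
    /** Builds a GET request that resembles the page's own XHR call (headers are assumptions). */
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body for further parsing. */
    static String fetch(String endpoint) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The returned body is usually JSON or an HTML fragment, which tends to be easier to parse than the fully rendered page.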
If you have Firefox/Greasemonkey, making a DOM dumper should be a simple matter.