Filtering out the header and footer text using pdfbox


I'm trying to extract the text from a PDF document using PDFBox. The problem is that I'm getting the header and footer text as well. Does anyone know of a way to filter that out, perhaps via some settings in TextPosition?

Related

Extract text using iTextSharp with different PDF page labels

I am trying to extract text from a PDF using iTextSharp, but I get a null reference error when calling GetTextFromPage:
My guess is that iTextSharp somehow interprets the page label incorrectly, as the label is indeed strange:
Or is it due to the Danish letters in the text?
However, I am able to extract text in other languages.
Thank you in advance.
EDIT: The problem could also be the fonts used and their custom encoding:

How to create a PDF document with header from a template with docx4j?

I want to create a document from an existing Word 2010 document and convert it to PDF using docx4j 3.1.0. I've built upon the sample in
https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ConvertOutPDF.java
The Word document already contains a header with text and an image that I do not modify in my processing. The resulting PDF document, however, doesn't contain the header.
Is this something that is supposed to work? If so, how can I find out what I am missing?

Yes, if you can see the header when you "save as PDF" in Word, then you should also see the header in docx4j's PDF output.
To have it fixed, we'll need to see the docx.

Just for the curious reader: the specific cause of the missing header turned out to be a wrong approach to setting page margins on the document. Instead of modifying the existing settings via body.getSectPr().getPgMar() (or, even simpler, setting them in the template right away), the code created new PageDimensions and set a new SectPr on the body, thereby somehow overwriting or removing the header.

Extract text without header and footer with PDFBox

I use the PDFTextStripper class to extract PDF text before indexing it with Lucene.
Is there a way to exclude the PDF header and footer from the extracted text?

You can use text extraction by area if you know exactly where the header and footer are in the document. Hope this helps.
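To make the extraction-by-area suggestion a bit more concrete, here is a minimal sketch of the geometry involved. It only computes the rectangle that lies between a header band and a footer band; the class name BodyRegion, the method between, and the band heights (60 and 50 points on a 612 x 792 point Letter page) are made up for illustration and would have to be measured for a real document. The commented lines show how such a rectangle could be handed to PDFBox's PDFTextStripperByArea. As far as I know, its regions are matched against text positions measured from the top-left corner of the page, which is the assumption this sketch makes.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: computes the page area that remains once a header band
// at the top and a footer band at the bottom are cut away.
final class BodyRegion {
    private BodyRegion() {}

    // Assumes a top-left origin with y growing downward (see the lead-in).
    // All values are in points (1/72 inch).
    public static Rectangle2D.Float between(float pageWidth, float pageHeight,
                                            float headerHeight, float footerHeight) {
        if (headerHeight < 0f || footerHeight < 0f
                || headerHeight + footerHeight >= pageHeight) {
            throw new IllegalArgumentException(
                    "header and footer bands must fit inside the page");
        }
        return new Rectangle2D.Float(0f, headerHeight, pageWidth,
                pageHeight - headerHeight - footerHeight);
    }

    public static void main(String[] args) {
        // Illustrative numbers only: Letter page, 60 pt header, 50 pt footer.
        Rectangle2D.Float body = between(612f, 792f, 60f, 50f);
        System.out.println(body);

        // With PDFBox on the classpath, the rectangle could be used roughly
        // like this (page being a PDPage of the loaded document):
        //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        //   stripper.setSortByPosition(true);
        //   stripper.addRegion("body", body);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("body");
    }
}
```

This only works when the header and footer sit at the same position on every page. If their height varies, the bands would have to be determined per page.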

Select/Highlight/Copy lines from a PDF in iOS using CG/Quartz

I've searched many sites and found several links that explain how to implement PDF reading, rendering, and annotating using JavaScript in a web view.
I want to highlight text in a PDF the way it's done on the desktop, then copy the text and send it by SMS or email, using a cut/copy/paste callout like the one that appears in text views. Are there any links or sources that show how to copy text from a PDF document this way from within an app?
Any help is appreciated!

RDLC rendered to PDF ignores Strikethrough formatting

So, I have a local .rdlc file with some text formatted using strikethrough formatting. My issue is quite simple to explain, but I do not know if it is just a limitation of PDF, or a bug with the .rdlc exporting to PDF.
When I write this code:
var localReport = new LocalReport();
...
byte[] pdf = localReport.Render("PDF");
System.IO.File.WriteAllBytes("MyReport.pdf", pdf);
None of the strikethrough-formatted text transfers over to the .pdf file properly.
If instead, I export to Word using .Render("Word"), the strikethrough does work on the .doc format. So, I know it isn't a problem with the .rdlc report itself.
Has anyone encountered this? Any solutions or workarounds?

I found this: http://social.msdn.microsoft.com/Forums/en-US/sqlreportingservices/thread/b35ca474-046d-4a38-a765-6c38c3d33105/
which suggests that missing strikethrough in PDFs was a known limitation. (But as mentioned in comments to the question, I couldn't reproduce with 2008r2.)
The two workarounds given there look painful.
(A) finding a font which itself has the strikethrough built into each glyph/character. (B) trying to mimic a strikethrough using a line report item. Note that for (B) overlapping items are supported only in PDF, Print & TIFF formats.
I suppose if it were mine, I would play around with option B if the amount of text is small. Also, it may be worth testing the HTML passthrough that is enabled when a placeholder is set to render as HTML. Maybe using a strikethrough style there would work?

I faced this issue while exporting an RDLC report to Word. While fetching the data, I replaced the strike formatting style with the HTML strike tag, and it worked.
