Data in text is not proper format after converting PDF to text - pdfbox

I have a PDF file from which I need to fetch key values.
I can convert the PDF to text using the PDFBox and iText APIs, but the text is not in the proper format. I need to get values from the text and do some calculations with them.
Below is the code:
PDFTextStripper pstripper = new PDFTextStripper();
// Sorting must be enabled on the stripper that actually extracts the text.
pstripper.setSortByPosition(true);
String text = pstripper.getText(pddoc);
String[] lines = text.split("\\r?\\n");
for (String line : lines) {
    System.out.println(line);
}
pddoc.close();
Could you please help me with this?
Thanks.
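Once the lines come out in reading order, the key values can be picked up with a regular expression. The following is only a minimal sketch: it assumes the lines of interest look like `Label: 1,234.50`, which is a hypothetical layout and not taken from the PDF in the question, and `KeyValueExtractor` is an illustrative name.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyValueExtractor {
    // Assumed line shape (hypothetical): "<label>: <number>", e.g. "Net Amount: 1,200.50".
    private static final Pattern KEY_VALUE = Pattern.compile(
            "^\\s*([A-Za-z][A-Za-z ]*?)\\s*:\\s*(-?[0-9][0-9,]*(?:\\.[0-9]+)?)\\s*$");

    // Returns label -> numeric value for every line that matches the assumed shape;
    // lines that do not match are ignored.
    static Map<String, BigDecimal> extract(String text) {
        Map<String, BigDecimal> values = new LinkedHashMap<>();
        for (String line : text.split("\\r?\\n")) {
            Matcher m = KEY_VALUE.matcher(line);
            if (m.matches()) {
                // Drop thousands separators before parsing the number.
                values.put(m.group(1), new BigDecimal(m.group(2).replace(",", "")));
            }
        }
        return values;
    }
}
```

The string returned by `pstripper.getText(pddoc)` could be passed straight to `KeyValueExtractor.extract`, and the resulting `BigDecimal` values used for the calculations. `BigDecimal` is chosen here to avoid rounding surprises with monetary amounts.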

Related

How can you copy text from source pdf that includes the formatting information?

I am using iText 7 to experiment with copying a number of separate PDFs into a single PDF document. It's easy to copy the text like this:
var sourcePage = sourcePdf.GetPage(i + 1);
var strategy = new SimpleTextExtractionStrategy();
var text = PdfTextExtractor.GetTextFromPage(sourcePage, strategy);
var currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
PdfFont regular = PdfFontFactory.CreateFont(FontConstants.HELVETICA);
PdfFont bold = PdfFontFactory.CreateFont(FontConstants.HELVETICA_BOLD);
Text first = new Text(currentText).SetFont(regular);
Text second = new Text("TEST TEST").SetFont(bold);
Paragraph paragraph = new Paragraph().Add(first).Add(second);
outDocument.Add(paragraph);
Here I am testing with the Helvetica font, but it needs to be the same font as in the source.
The variables "text" and "currentText" hold only plain text. How do I get the formatting metadata? The destination document needs to have the same formatting as the source.

Generate PDF from gsp page

I am using Grails 2.5.2.
I have created a table that shows all the data from the database on a GSP page, and now I need to save the displayed data as a PDF on a button click. What is the best way to render it into a PDF and save it to my directory? Please help.
You can use iText to convert HTML into PDF using the code below:
public void createPdf(HttpServletResponse response, String args, String css, String pdfTitle) {
    response.setContentType("application/force-download")
    response.setHeader("Content-Disposition", "attachment;filename=${pdfTitle}.pdf")
    Document document = new Document()
    Rectangle one = new Rectangle(900, 600)
    document.setPageSize(one)
    PdfWriter writer = PdfWriter.getInstance(document, response.getOutputStream())
    document.open()
    ByteArrayInputStream bis = new ByteArrayInputStream(args.toString().getBytes())
    ByteArrayInputStream cis = new ByteArrayInputStream(css.toString().getBytes())
    XMLWorkerHelper.getInstance().parseXHtml(writer, document, bis, cis)
    document.close()
}
Although I am answering this question late, take a look at the Grails export plugin. It is useful if you want to export your data to Excel and PDF (useful only if there is no predefined template to export to).
I got the idea from iText. I used iText 2.1.7 and wrote all the values to the PDF from a controller method, using images as the background and Paragraph and Phrase objects to show the values from the database.

pdfbox - unable to capture modified values from pdf

I have a requirement to open a PDF in JxBrowser, let the user modify values in the PDF and, upon saving, read the modified values and save them to a database.
My issue is that I am unable to fetch the modified values from the PDF; acroForm.getField(field name) always returns the original values. Could you tell me whether there is another way to solve this problem?
I am using PDFBox 2.0.1.
I appreciate your help.
Thanks,
Prasad
Update1:
Adding the sample code that I have used in my application:
PDDocument PDFDoc = PDDocument.load(new File("complaintform.pdf"));
LoggerProvider.setLevel(Level.OFF);
Base64Encoder b64 = new Base64Encoder();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PDFDoc.save(baos);
String pdfHTML = "<HTML><BODY style=\"width:100%; height:100%\" > <embed style=\"width:100%; height:100%\" src=\"data:application/pdf;base64,"+b64.encode(baos.toByteArray())+"\" type=\"application/pdf\"></BODY></HTML>";
Browser browser = new Browser();
BrowserView browserView = new BrowserView(browser);
this.add(browserView, BorderLayout.CENTER);
browser.loadHTML(pdfHTML);
save()
{
    PDDocumentCatalog docCatalog = PDFDoc.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    PDField field = acroForm.getField("last");
    String modifiedValue = field.getValueAsString();
}
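For reference, the embedding step above can be sketched with only the JDK. This is a hedged illustration, not the poster's code: `java.util.Base64` stands in for the `Base64Encoder` class used in the question, and `PdfEmbedPage` is a hypothetical name. Note that the page carries an encoded copy of the bytes, so edits made in the embedded viewer would not be expected to reach the in-memory PDDocument that save() reads from.

```java
import java.util.Base64;

final class PdfEmbedPage {
    // Builds an HTML page that embeds the given PDF bytes as a base64 data URI.
    // The bytes are a snapshot: the returned page holds its own encoded copy.
    static String build(byte[] pdfBytes) {
        String base64 = Base64.getEncoder().encodeToString(pdfBytes);
        return "<html><body style=\"width:100%; height:100%\">"
                + "<embed style=\"width:100%; height:100%\""
                + " src=\"data:application/pdf;base64," + base64 + "\""
                + " type=\"application/pdf\">"
                + "</body></html>";
    }
}
```

With the question's variables, the call would be `PdfEmbedPage.build(baos.toByteArray())`, and the result could be handed to `browser.loadHTML(...)`.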

How to add a rich Textbox (HTML) to a table cell?

I have a rich text box named "DocumentContent" whose content I want to add to a PDF using the code below:
iTextSharp.text.Font font = FontFactory.GetFont(@"C:\Windows\Fonts\arial.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, 12f, Font.NORMAL, BaseColor.BLACK);
DocumentContent = System.Web.HttpUtility.HtmlDecode(DocumentContent);
Chunk chunkContent = new Chunk(DocumentContent);
chunkContent.Font = font;
Phrase PhraseContent = new Phrase(chunkContent);
PhraseContent.Font = font;
PdfPTable table = new PdfPTable(2);
table.WidthPercentage = 100;
PdfPCell cell;
cell = new PdfPCell(new Phrase(PhraseContent));
cell.Border = Rectangle.NO_BORDER;
table.AddCell(cell);
The problem is that when I open the PDF file, the content appears as HTML markup rather than as text, as shown below:
<p>Overview  line1 </p><p>Overview  line2
</p><p>Overview  line3 </p><p>Overview 
line4</p><p>Overview  line4</p><p>Overview 
line5 </p>
But it should look like this:
Overview line1
Overview line2
Overview line3
Overview line4
Overview line4
Overview line5
What I want to do is keep all the styling that the user applied to the rich text and change only the font family to Arial.
I can change the font family, but I need to decode this content from HTML to text.
Could you please advise?
Thanks
Please take a look at the HtmlContentForCell example.
In this example, we have the HTML you mention:
public static final String HTML = "<p>Overview line1</p>"
+ "<p>Overview line2</p><p>Overview line3</p>"
+ "<p>Overview line4</p><p>Overview line4</p>"
+ "<p>Overview line5 </p>";
We also create a font for the <p> tag:
public static final String CSS = "p { font-family: Cardo; }";
In your case, you may want to replace Cardo with Arial.
Note that we registered the regular version of the Cardo font:
FontFactory.register("resources/fonts/Cardo-Regular.ttf");
If you need bold, italic and bold-italic, you also need to register those fonts of the same Cardo family. (In case of arial, you'd register arial.ttf, arialbd.ttf, ariali.ttf and arialbi.ttf).
Now we can parse this HTML and CSS into a list of Element objects with the parseToElementList() method. We can use these objects inside a cell:
PdfPTable table = new PdfPTable(2);
table.addCell("Some rich text:");
PdfPCell cell = new PdfPCell();
for (Element e : XMLWorkerHelper.parseToElementList(HTML, CSS)) {
cell.addElement(e);
}
table.addCell(cell);
document.add(table);
See html_in_cell.pdf for the resulting PDF.
I do not have the time/skills to provide this example in iTextSharp, but it should be very easy to port this to C#.
Finally, I wrote this code in C#, which works perfectly. Thanks to Bruno, who helped me understand XMLWorker.
Here is an example using XMLWorker in C#.
I used a sample HTML as below:
public static string HTML = "<p>Overview line1âââŵẅẃŷûâàêÿýỳîïíìôöóòêëéèẁẃẅŵùúúüûàáäâ</p>"
+ "<p>Overview line2</p><p>Overview line3</p>"
+ "<p>Overview line4</p><p>Overview line4</p>"
+ "<p>Overview line5 </p>";
I created a Test.css file and saved it in the SharePoint Style Library. (For this test I saved it on the D drive to keep things simple.)
Here is the content of my test CSS file:
p { font-family: arial; }
Then, using the C# code below, I saved the PDF file to the D drive. (In SharePoint I used a MemoryStream; I have kept this example simple so that it is easy to understand.)
string fileName = @"D:\Test.pdf";
var css = @"D:\Test.css";
using (var ActionStream = new MemoryStream(UTF8Encoding.UTF8.GetBytes(HTML)))
{
    using (FileStream cssFile = new FileStream(css, FileMode.Open))
    {
        var document = new Document(PageSize.A4, 30, 30, 10, 10);
        var worker = XMLWorkerHelper.GetInstance();
        var writer = PdfWriter.GetInstance(document, new FileStream(fileName, FileMode.Create));
        document.Open();
        worker.ParseXHtml(writer, document, ActionStream, cssFile);
        writer.CloseStream = false;
        document.Close();
    }
}
It creates the Test.pdf file containing my HTML rendered with the Arial font family, so all of the Welsh characters are preserved in the PDF file.
Note: I added iTextSharp.dll v5.5.3 and XMLworker.dll v5.5.3 to my project, together with these using directives:
using iTextSharp.text;
using iTextSharp.text.html;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using iTextSharp.tool.xml.css;
using iTextSharp.tool.xml.html;
using iTextSharp.tool.xml.parser;
using iTextSharp.tool.xml.pipeline;
Hope this can be useful.
Kate

PDFBOX Printing : Printed PDF contains Junk characters for Arabic text from the PDF

I have a PDF file containing Arabic text and a watermark. I am using PDFBox to print the PDF from Java. The PDF is printed with high quality, but all the lines with Arabic characters contain junk characters instead. Could somebody help with this?
Code:
String pdfFile = "C:/AresEPOS_Home/Receipts/1391326264281.pdf";
PDDocument document = null;
try {
    document = PDDocument.load(pdfFile);
    //PDFont font = PDTrueTypeFont.loadTTF(document, "C:/Windows/Fonts/Arial.ttf");
    PrinterJob printJob = PrinterJob.getPrinterJob();
    printJob.setJobName(new File(pdfFile).getName());
    PrintService[] printService = PrinterJob.lookupPrintServices();
    boolean printerFound = false;
    for (int i = 0; !printerFound && i < printService.length; i++) {
        if (printService[i].getName().indexOf("EPSON") != -1) {
            printJob.setPrintService(printService[i]);
            printerFound = true;
        }
    }
    document.silentPrint(printJob);
}
finally {
    if (document != null) {
        document.close();
    }
}
In essence
Your PDF can be printed properly using PDFBox 2.0.0-SNAPSHOT but not using PDFBox 1.8.4. Thus, either the Arabic font in question requires a feature that is not supported in PDFBox up to version 1.8.4, or there was a bug in 1.8.4 that has since been fixed.
The details
Printing the OP's document using PDFBox 1.8.4 resulted in some scrambled output like this
but printing it using the current PDFBox 2.0.0-SNAPSHOT resulted in a proper output like this
In 2.0.0-SNAPSHOT the PDDocument methods print and silentPrint have been removed, though, so the original
document.silentPrint(printJob);
has to be replaced by something like
printJob.setPageable(new PDPageable(document, printJob));
printJob.print();