How can you copy text from source pdf that includes the formatting information? - pdf

I am using iText7 to experiment copying a load of seperate pdfs into 1 single pdf document. It's easy to copy the text like this:
var sourcePage = sourcePdf.GetPage(i + 1);
var strategy = new SimpleTextExtractionStrategy();
var text = PdfTextExtractor.GetTextFromPage(sourcePage, strategy);
var currentText = Encoding.
UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
PdfFont regular = PdfFontFactory.CreateFont(FontConstants.HELVETICA);
PdfFont bold = PdfFontFactory.CreateFont(FontConstants.HELVETICA_BOLD);
Text first = new Text(currentText).SetFont(regular);
Text second = new Text("TEST TEST").SetFont(bold);
Paragraph paragraph = new Paragraph().Add(first).Add(second);
outDocument.Add(paragraph);
Here I am testing Helvetica font but its needs to be the same as the source.
The variables "text" and "currentText" are just plain text. How do I get the metadata? The destination document needs to have the same formatting.

Related

How to create PDF/UA in iText7 with text hyperlink

I am trying to create a PDF/UA compliant file that contains a text hyperlink with iText 7. Both the Acrobat Preflight test for PDF/UA and the PDF Accessibility Checker (PAC 3) complain that the PDF file say that the PDF is not compliant.
PAC 3 says ""Link" annotation is not nested inside a "Link" structure element" and the Acrobat Preflight test says the Link annotation does not have an alternate description in the Contents key.
The following is my attempt to create PDF/UA compliant output that contains a text hyperlink.
Any advice would be appreciated.
public void testHyperLink() throws IOException {
// Create PDF/UA with text hyperlink
String filename = "./results/HyperLink.pdf";
WriterProperties properties = new WriterProperties();
properties.addUAXmpMetadata().setPdfVersion(PdfVersion.PDF_1_7);
PdfWriter writer = new PdfWriter(filename, properties);
pdfDoc = new PdfDocument(writer);
//Make document tagged
pdfDoc.setTagged();
pdfDoc.getCatalog().setLang(new PdfString("en-US"));
pdfDoc.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdfDoc.getDocumentInfo();
info.setTitle("Hello Hyperlinks!");
document = new Document(pdfDoc);
// Must embed font for PDF/UA
byte[] inputBytes = Files.readAllBytes(Paths.get("./resources/fonts/opensans-regular.ttf"));
boolean embedded = true;
boolean cached = false;
PdfFont font = PdfFontFactory.createFont(inputBytes, PdfEncodings.CP1252, embedded, cached);
Text text = new Text("This is a Text link");
text.setFont(font);
text.setFontSize(16F);
// Add alternate text for hyperlink
text.getAccessibilityProperties().setAlternateDescription("Click here to go to the iText website");
PdfAction act = PdfAction.createURI("https://itextpdf.com/");
text.setAction(act);
Paragraph para = new Paragraph();
para.add(text);
document.add(para);
document.close();
System.out.println("Created "+ filename);
}
A Link object might be what you want:
Link lnk = new Link("This is a Text link",
PdfAction.CreateURI("https://itextpdf.com/"));
lnk.SetFont(font);
lnk.GetLinkAnnotation().SetBorder(new PdfAnnotationBorder(0, 0, 0));//Remove the default border
lnk.GetAccessibilityProperties().SetAlternateDescription("Click here to go to the iText website");
Paragraph para = new Paragraph();
para.Add(lnk);
document.Add(para);

How to pass font name as string in pdf file with Java iText

I am generating pdf report with few inputs like font name, font size. I tried to create a font using below code.
Font font = new Font(FontFamily.TIMES_ROMAN,50.0f,Font.UNDERLINE,BaseColor.RED);
Here, how pass font name that is TIMES_ROMAN as a string?
Here's a quick way on how you can achieve the desired behavior with iText 7:
final PdfDocument pdfDocument = new PdfDocument(new PdfWriter(DEST));
PdfFont font = PdfFontFactory.createFont(FontProgramFactory.createFont(StandardFonts.TIMES_ROMAN));
Style myStyle = new Style()
.setFontSize(50)
.setUnderline()
.setFontColor(RED)
.setFont(font);
try (final Document document = new Document(pdfDocument)) {
document.add(new Paragraph("Hello World!").addStyle(myStyle));
document.add(new Paragraph("Hello World!").setFont(font)
.setFontSize(50)
.setUnderline()
.setFontColor(RED));
}
You can also define the font on a Document level (I'm showing Style and directly on the Paragraph).

Unable to create PDF from JSPDF

Save in DB and import data and create a pdf file using jspdf.
Data is stored up to html tag...
select ct_contents from contract where ct_id = 659;
RESULT : `<p style="text-align:justify"><span style="font-size:10.5pt"><span style="font-family:Century,serif"><span style="font-family:"MS Mincho"">氏  名</span></span></span></p>`
I have this js code :
let pdfName = this.newTemplate.tp_title.trim()
var doc = new jsPDF();
doc.addFileToVFS('NotoSansCJKjp-Regular.ttf', VFS);
doc.addFont('NotoSansCJKjp-Regular.ttf', 'NotoSansCJKjp', 'Bold');
doc.setFont('NotoSansCJKjp', 'Bold');
doc.setFontSize(12);
var paragraph = this.contract.ct_contents;
var lines = doc.splitTextToSize(paragraph, 150);
doc.text(15, 60, lines);
doc.save(pdfName + '.pdf');
add a font to work on it, but check the downloaded pdf, the html tag will also appear.
I want to remove this tag and make it appear only in text.
image is the result of downloading by pdf.
And it is page 3 in ms word and only page 1 of pdf is download.....
How can I get the font to come out without getting the html tag?

how move pdf field with pdfbox?

i'm trying to align pdf fields to the first on the row. I'm able to get fields and its position. I'm also able to locally change it but when i save pdf the fields appears on the same position.
this is the code:
PDDocument pdfDocument = PDDocument.load(new File("MyFile"));
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
String fieldName = "MyField";
PDField f = acroForm.getField(fieldName);
PDRectangle r = f.getWidgets().get(0).getRectangle();
r.setLowerLeftX(10);
r.setLowerLeftY(10);
r.setUpperRightX(10);
r.setUpperRightY(10);
pdfDocument.save(new File("MyModifiedFile"));
pdfDocument.close();
You have to reassign the modified rectangle to the widget:
f.getWidgets().get(0).setRectangle(r);
Because unlike the widget, the rectangle is not backed by the structures in the PDF.

How to add a rich Textbox (HTML) to a table cell?

I have a rich text box named:”DocumentContent” which I’m going to add its content to pdf using the below code:
iTextSharp.text.Font font = FontFactory.GetFont(#"C:\Windows\Fonts\arial.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, 12f, Font.NORMAL, BaseColor.BLACK);
DocumentContent = System.Web.HttpUtility.HtmlDecode(DocumentContent);
Chunk chunkContent = new Chunk(DocumentContent);
chunkContent.Font = font;
Phrase PhraseContent = new Phrase(chunkContent);
PhraseContent.Font = font;
PdfPTable table = new PdfPTable(2);
table.WidthPercentage = 100;
PdfPCell cell;
cell = new PdfPCell(new Phrase(PhraseContent));
cell.Border = Rectangle.NO_BORDER;
table.AddCell(cell);
The problem is when I open PDF file the content appears as HTML not a text as below:
<p>Overview  line1 </p><p>Overview  line2
</p><p>Overview  line3 </p><p>Overview 
line4</p><p>Overview  line4</p><p>Overview 
line5 </p>
But it should look like below
Overview line1
Overview line2
Overview line3
Overview line4
Overview line4
Overview line5
What I'm going to do is to keep all the styling which user apply to the rich text and just change font family to Arial.
I can change Font Family but I need to Decode this content from HTML to Text.
Could you please advise?
Thanks
Please take a look at the HtmlContentForCell example.
In this example, we have the HTML you mention:
public static final String HTML = "<p>Overview line1</p>"
+ "<p>Overview line2</p><p>Overview line3</p>"
+ "<p>Overview line4</p><p>Overview line4</p>"
+ "<p>Overview line5 </p>";
We also create a font for the <p> tag:
public static final String CSS = "p { font-family: Cardo; }";
In your case, you may want to replace Cardo with Arial.
Note that we registered the regular version of the Cardo font:
FontFactory.register("resources/fonts/Cardo-Regular.ttf");
If you need bold, italic and bold-italic, you also need to register those fonts of the same Cardo family. (In case of arial, you'd register arial.ttf, arialbd.ttf, ariali.ttf and arialbi.ttf).
Now we can parse this HTML and CSS into a list of Element objects with the parseToElementList() method. We can use these objects inside a cell:
PdfPTable table = new PdfPTable(2);
table.addCell("Some rich text:");
PdfPCell cell = new PdfPCell();
for (Element e : XMLWorkerHelper.parseToElementList(HTML, CSS)) {
cell.addElement(e);
}
table.addCell(cell);
document.add(table);
See html_in_cell.pdf for the resulting PDF.
I do not have the time/skills to provide this example in iTextSharp, but it should be very easy to port this to C#.
Finally I write this code in c# which is working perfectly, Thanks to Bruno who helped me to understand XMLWorker.
Here is an example using XMLWorker in C#.
I used a sample HTML as below:
public static string HTML = "<p>Overview line1âââŵẅẃŷûâàêÿýỳîïíìôöóòêëéèẁẃẅŵùúúüûàáäâ</p>"
+ "<p>Overview line2</p><p>Overview line3</p>"
+ "<p>Overview line4</p><p>Overview line4</p>"
+ "<p>Overview line5 </p>";
I have created Test.css file and saved it in SharePoint Style Library. (for this test I saved it in D drive to keep it simple)
Here is the content of my test css file:
p { font-family: arial; }
Then using the below c# code I saved the PDF file in D drive. ( In SharePoint I used Memorystream. I keep this example very simple to understand )
string fileName = #"D:\Test.pdf";
var css = #"D:\Test.css";
using (var ActionStream = new MemoryStream(UTF8Encoding.UTF8.GetBytes(HTML)))
{
using (FileStream cssFile = new FileStream(css, FileMode.Open))
{
var document = new Document(PageSize.A4, 30, 30, 10, 10);
var worker = XMLWorkerHelper.GetInstance();
var writer = PdfWriter.GetInstance(document, new FileStream(fileName, FileMode.Create));
document.Open();
worker.ParseXHtml(writer, document, ActionStream, cssFile);
writer.CloseStream = false;
document.Close();
}
}
It creates Test.pdf file adding my HTML with Font Family:Arial. So all of the Welsh Characters can be saved in PDF file.
Note: I have added iTextSharp.dll v:5.5.3 and XMLworker.dll v: 5.5.3 to my project.
using iTextSharp.text;
using iTextSharp.text.html;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using iTextSharp.tool.xml.css;
using iTextSharp.tool.xml.html;
using iTextSharp.tool.xml.parser;
using iTextSharp.tool.xml.pipeline;
Hope this can be useful.
Kate