I'm using Docx4J to make an invoice model.
On the left side of the page, it's usual to show a legal sentence such as: Registered company in ... Book ... Page ...
I have inserted this in my template with a Word text frame.
My issue is this: when exporting to .docx, the legal text is shown perfectly, but when exporting to .pdf, it's shown as a horizontal table under the other data.
The code to export to PDF is:
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(foDumpFile);
foSettings.setWmlPackage(template);
fos = new FileOutputStream(new File("/C:/mypath/prueba_OUT.pdf"));
Docx4J.toFO(foSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
Any help would be much appreciated.
Thanks.
You'd need to extend the PDF via FO code; see also "How to correctly position a header image with docx4j?"
Float left may or may not be easy; similarly the rotated text.
In general, the way to work on this is to take the FO generated by docx4j, then hand-edit it into something which FOP can convert to a PDF you are happy with. If you can do that, then it's a matter of modifying docx4j to generate that FO.
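As a rough illustration of the hand-editing step, the sketch below uses only the JDK's XML APIs to load an FO document, give the first fo:block-container an absolute position and a rotation, and serialize the result. The class name FoTweaker, the sample FO and the offsets are assumptions made up for this example; the FO that docx4j actually dumps for your template will look different, and an absolutely positioned container will usually also need a width and height before FOP renders it sensibly.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or FOP.
class FoTweaker {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Returns the FO with its first fo:block-container positioned absolutely
    // at (left, top) and rotated by the given reference-orientation (e.g. "90").
    static String positionFirstContainer(String fo, String left, String top, String rotation) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(fo)));

            NodeList containers = doc.getElementsByTagNameNS(FO_NS, "block-container");
            if (containers.getLength() > 0) {
                Element container = (Element) containers.item(0);
                container.setAttribute("absolute-position", "absolute");
                container.setAttribute("left", left);
                container.setAttribute("top", top);
                container.setAttribute("reference-orientation", rotation);
            }

            // Identity transform: serialize the modified DOM back to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite FO", e);
        }
    }
}
```

The returned string can be saved and run through FOP to see whether the layout is what you want, before attempting any change to docx4j itself.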
Related
When I recreate a PDF after making some changes and then use the output PDF in axesPDF to fix a spaces issue, I get "ERROR". The only difference I have observed is that my input PDF has an array of tokens, whereas my recreated PDF has a dictionary of elements, as shown below. Does that cause the problem? How can I recreate a similar structure? (The input PDF is on the left; the output PDF after editing is on the right.)
Input pdf
output pdf with changes
The code I am using to save the pdf is
PDStream newContents = new PDStream(document);
OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(out);
writer.writeTokens(tokens);
out.close();
document.getPage(pg_ind).setContents(newContents);
newPDF.addPage(document.getPage(pg_ind));
newPDF.save()
Please help me on this. Thanks in advance.
Updating question along with error.
Another Input file
The error is
The button I used.
What puzzles me is that this time, even though the content stream is in COSDictionary format, it still gives an error. Something else must be causing this.
I am trying to convert this Word document with a header showing an image on the right
http://www.filesnack.com/files/cduiejc7
to PDF using this sample code:
https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ConvertOutPDF.java
Here's the result:
http://www.filesnack.com/files/ctjs659h
While the Word document has the header image on the right, the converted PDF shows it on the left.
How can I make docx4j reproduce the original document as a PDF?
Your image is positioned relative to a paragraph:
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="791936E3" wp14:editId="575B92C8">
<wp:simplePos x="0" y="0"/>
<wp:positionH relativeFrom="column">
<wp:posOffset>5317388</wp:posOffset>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>-325755</wp:posOffset>
</wp:positionV>
docx4j's ability to support things like that in PDF output is limited by what XSL FO supports. See docx4j's TextBoxTest class for what we can do with text boxes.
Currently, although we can position some text boxes, we don't do the same for floating images: https://github.com/plutext/docx4j/issues/127
In the meantime, a possible workaround for some cases (e.g. float right) is to use a table.
Or possibly, you could try putting the image inside a text box!
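When sizing a workaround table or text box to mimic the anchor shown above, it can help to translate the wp:posOffset values into units that XSL FO understands. DrawingML offsets are expressed in EMUs (914400 per inch, which works out to 12700 per point). A minimal sketch; the class and method names are invented for illustration:

```java
import java.util.Locale;

// Hypothetical helper for converting DrawingML EMU offsets.
class EmuUnits {
    static final double EMU_PER_POINT = 12700.0; // 914400 EMU per inch / 72 points per inch
    static final double EMU_PER_MM = 36000.0;    // 914400 EMU per inch / 25.4 mm per inch

    static double toPoints(long emu) {
        return emu / EMU_PER_POINT;
    }

    static double toMillimetres(long emu) {
        return emu / EMU_PER_MM;
    }

    public static void main(String[] args) {
        // Offsets taken from the <wp:positionH> and <wp:positionV> elements above.
        long horizontal = 5317388L;
        long vertical = -325755L;
        System.out.printf(Locale.ROOT, "H offset: %.2fpt (%.2fmm)%n",
                toPoints(horizontal), toMillimetres(horizontal));
        System.out.printf(Locale.ROOT, "V offset: %.2fpt (%.2fmm)%n",
                toPoints(vertical), toMillimetres(vertical));
    }
}
```

For this document the image sits roughly 418.7pt to the right of the column edge and 25.65pt above the paragraph, which is why a right-aligned table cell can approximate it.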
I'm using Tika* to parse a PDF file.
Retrieving the document's text is no problem, but I can't figure out how to extract text that is:
underlined
highlighted
crossed out
Adobe Writer gives you different text edit options, but I'm not able to see where they are "hidden".
Is there a way to extract this formatting metadata (underline, highlight ...)?
Do you know if Tika is able to extract this data?
*http://tika.apache.org/
Wow. 4 years is a long time to wait for an answer, and I figure you have found a solution by now. Anyway, for the sake of those who visit this link, the answer is yes. Apache Tika can extract not just the text in a document but also the formatting (e.g. bold, italic). This was my scenario:
//inputStream is the document you wish to parse from.
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
Metadata metadata = new Metadata();
parser.parse(inputStream,handler,metadata);
System.out.println(handler.toString());
The print statement prints an XML version of your document. With a little work cleaning up the XML (really HTML tags), you would be left with tags like <b>text</b> for bold text and <i>text</i> for italic text. Then you could find a way to render it. Good luck.
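The clean-up step could look something like the sketch below, which pulls the text inside <b> or <i> elements out of an XHTML string using the JDK's XML parser. It assumes the output is well-formed XHTML and that the parser for your file format actually emits these tags, which varies by document type and Tika version and is worth checking on a sample file. The class name StyledRuns is made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika.
class StyledRuns {
    // Returns the trimmed text of every element with the given local name
    // (e.g. "b" or "i"), in document order, regardless of namespace.
    static List<String> textOf(String xhtml, String tag) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));

            NodeList nodes = doc.getElementsByTagNameNS("*", tag);
            List<String> runs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                runs.add(nodes.item(i).getTextContent().trim());
            }
            return runs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

With the handler from the snippet above, a call such as StyledRuns.textOf(handler.toString(), "b") would list the bold runs, if any were emitted.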
When I save a native Dynamics AX 2009 report as pdf or pdf-embed, it doesn't show the images in the report (i.e. the company logo in the header section) properly. The image comes out very distorted, grayish and repeated.
On the other hand, if I export to HTML format, the image comes out properly.
Has anyone experienced a similar issue?
Please note that I'm saving the report as pdf using the "file" option that appears when the report print dialog opens.
Any help would be highly appreciated.
The issue goes away if the image format used is one of the following:
1. 24bit Bitmap
2. TIFF
I have found a solution for this issue in AX 2009:
Bitmap getImageBitmap(ItemId _itemId)
{
HPLInventImages inventImages; // Column HPLInventImages.ItemImage is EDT:BlobData (which is a container)
Image image;
;
if (!_itemId) return inventImages.ItemImage; // Return null bitmap. The whole AX client crashes if you try to do the resizing code below on a null bitmap.
select firstonly inventImages where inventImages.ItemId==_itemId;
//return inventImages.ItemImage; // Would normally just do this, but see comments below.
// Ok, this next bit is weird!
// There is a known issue with AX reports with images in, getting saved as PDFs:
// In some cases, the images appear as garbage on the PDF.
// I have found that resizing the image before rendering it, causes the image to come out ok on the PDF.
// So the code below does a token resize (by 1.0 times!) operation on the image before returning it.
// That is enough to make the image on the PDF turn out ok.
image=new Image(inventImages.ItemImage);
image.resize(image.width()*1.0,image.height()*1.0,InterpolationMode::InterpolationModeHighQuality);
return image.getData();
}
Is there any tool to find the X-Y location of text content in a PDF file?
Docotic.Pdf Library can do it. See C# sample below:
using (PdfDocument doc = new PdfDocument("your_pdf.pdf"))
{
foreach (PdfTextData textData in doc.Pages[0].Canvas.GetTextData())
Console.WriteLine(textData.Position + " " + textData.Text);
}
Try running "Preflight..." in Acrobat and choosing PDF Analysis -> List page objects, grouped by type of object.
If you locate the text objects within the results list, you will notice there is a position value (in points) within the Text Properties -> * Font section.
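Since the position is reported in points, a small conversion may be handy when comparing it with measurements taken in other units. The sketch below relies on the default PDF user space unit of 1/72 inch; the class name is made up, and the assumption that y is measured from the bottom of the page should be verified against what the tool actually reports.

```java
// Hypothetical helper for working with positions given in PDF points.
class PdfUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double pointsToMm(double points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    // Converts a y value measured from the bottom edge (assumed here)
    // into a distance from the top edge of a page of the given height.
    static double yFromTop(double yFromBottomPoints, double pageHeightPoints) {
        return pageHeightPoints - yFromBottomPoints;
    }
}
```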
TET, the Text Extraction Toolkit from the PDFlib family of products, can do that. TET has a command-line interface, and it's the most powerful of all the text extraction tools I'm aware of. (It can even handle ligatures...)
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.