Can I get code for getting the location of text in a PDF? - pdf

I am having trouble getting the exact coordinates of text in a PDF. Can I get the complete code for this problem (including the myLocationExtractionStratey class as well)?

Related

slate3k WARNING:pdfminer.layout:Too many boxes (106) to group, skipping

I'm trying to extract text from a PDF in Python, but I get the following warning message, which limits the amount of text extracted from each page. Can anyone think of a way to resolve this? The warning and my code are below:
WARNING:pdfminer.layout:Too many boxes (106) to group, skipping.
import slate3k as slate

with open("mypdf.pdf", 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
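The prefix of the message, WARNING:pdfminer.layout:, suggests it is emitted through Python's standard logging module by a logger named pdfminer.layout. A minimal sketch, assuming that is the case, of raising that logger's threshold so the warning is no longer printed. Note that this only hides the message; it does not change how pdfminer groups text boxes, so it may not affect how much text is extracted. The helper name is_layout_warning_silenced is made up for illustration.

```python
import logging

# Logger name taken from the warning prefix "WARNING:pdfminer.layout:".
# Assumption: the message really is emitted through this logger.
layout_logger = logging.getLogger("pdfminer.layout")

# Raise the threshold so WARNING-level records from this logger are dropped.
layout_logger.setLevel(logging.ERROR)


def is_layout_warning_silenced() -> bool:
    """Hypothetical helper: True if WARNING records would now be discarded."""
    return not layout_logger.isEnabledFor(logging.WARNING)
```

Run these lines before calling slate.PDF(f) so the threshold is already in place when the PDF is parsed.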

Docx4J: Vertical text frame not exported to PDF

I'm using Docx4J to make an invoice model.
On the left side of the page, it is usual to show a legal notice such as: Registered company in ... Book ... Page ...
I have inserted this in my template with a Word text frame.
My issue is this: when exporting to .docx, the legal text is shown perfectly, but when exporting to .pdf, it is shown as a horizontal table under the other data.
The code to export to PDF is:
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(foDumpFile);
foSettings.setWmlPackage(template);
fos = new FileOutputStream(new File("/C:/mypath/prueba_OUT.pdf"));
Docx4J.toFO(foSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
Any help would be greatly appreciated.
Thanks.
You'd need to extend the PDF via FO code; see also "How to correctly position a header image with docx4j?"
Float left may or may not be easy; the same goes for the rotated text.
In general, the way to work on this is to take the FO generated by docx4j, then hand-edit it into something which FOP can convert to a PDF you are happy with. If you can do that, then it's a matter of modifying docx4j to generate that FO.

How to correctly position a header image with docx4j?

I am trying to convert this Word document with a header showing an image on the right
http://www.filesnack.com/files/cduiejc7
to PDF using this sample code:
https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ConvertOutPDF.java
Here's the result:
http://www.filesnack.com/files/ctjs659h
While the Word document has the header image on the right, the converted PDF shows it on the left.
How can I make docx4j reproduce the original document's layout in the PDF?
Your image is positioned relative to a paragraph:
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="791936E3" wp14:editId="575B92C8">
<wp:simplePos x="0" y="0"/>
<wp:positionH relativeFrom="column">
<wp:posOffset>5317388</wp:posOffset>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>-325755</wp:posOffset>
</wp:positionV>
docx4j's potential to support things like that in PDF output is limited by what XSL FO supports. See docx4j's TextBoxTest class for what we can do with text boxes.
Currently, although we can position some text boxes, we don't do the same for floating images: https://github.com/plutext/docx4j/issues/127
In the meantime, a possible workaround for some cases (e.g. float right) is to use a table.
Or possibly, you could try putting the image inside a text box!

Images display issue in Dynamics AX 2009 reports saved as pdf

When I save a native Dynamics AX 2009 report as pdf or pdf-embed, the images in the report (e.g. the company logo in the header section) are not shown properly. The image comes out very distorted, grayish and repeated.
On the other hand, if I export the report in HTML format, the image comes out properly.
Has anyone experienced a similar issue?
Please note that I'm saving the report as PDF using the "file" option that appears when the report print dialog opens.
Any help would be highly appreciated.
The issue goes away if the image format used is one of the following:
1. 24bit Bitmap
2. TIFF
I have found a solution for this issue in AX 2009:
Bitmap getImageBitmap(ItemId _itemId)
{
    HPLInventImages inventImages; // Column HPLInventImages.ItemImage is EDT:BlobData (which is a container)
    Image image;
    ;
    if (!_itemId) return inventImages.ItemImage; // Return null bitmap. The whole AX client crashes if you try to do the resizing code below on a null bitmap.
    select firstonly inventImages where inventImages.ItemId==_itemId;
    //return inventImages.ItemImage; // Would normally just do this, but see comments below.
    // Ok, this next bit is weird!
    // There is a known issue with AX reports with images in, getting saved as PDFs:
    // In some cases, the images appear as garbage on the PDF.
    // I have found that resizing the image before rendering it, causes the image to come out ok on the PDF.
    // So the code below does a token resize (by 1.0 times!) operation on the image before returning it.
    // That is enough to make the image on the PDF turn out ok.
    image=new Image(inventImages.ItemImage);
    image.resize(image.width()*1.0,image.height()*1.0,InterpolationMode::InterpolationModeHighQuality);
    return image.getData();
}

GDAL appears to ignore NoDataValue

I'm trying to build a mosaic, and I rely on the NoDataValue feature to treat some parts of the image as transparent.
However, it appears that GDAL doesn't work as expected.
I also created a very simple test case using a VRT dataset and gdal_translate, and I get the same result: the 2nd image is drawn over the 1st image, ignoring the "transparent" areas.
I have two 100x100 image files, each with a different white marking over a black background (black being exactly equal to 0).
I built a simple vrt file:
<VRTDataset rasterXSize="100" rasterYSize="100">
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Gray</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">a1.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="100" ySize="100"/>
      <DstRect xOff="0" yOff="0" xSize="100" ySize="100"/>
      <HideNoDataValue>1</HideNoDataValue>
      <NoDataValue>0</NoDataValue>
    </SimpleSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">a2.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="100" ySize="100"/>
      <DstRect xOff="0" yOff="0" xSize="100" ySize="100"/>
      <HideNoDataValue>1</HideNoDataValue>
      <NoDataValue>0</NoDataValue>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>
and I run the command:
gdal_translate mosaic.vrt mosaic.tif
The result is identical to image a2.tif, instead of being a combination of a1.tif and a2.tif.
I see this behavior with GDAL 1.8 and 1.9.
Any ideas?
I got an answer on the gdal-dev list from Even Rouault:
Several errors:
The NoDataValue and HideNoDataValue elements are only valid under the VRTRasterBand element, not under SimpleSource.
You want to change SimpleSource to ComplexSource, and add a <NODATA>0</NODATA> element in it (basically, rename your current NoDataValue to NODATA).
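The two corrections above can be sketched as a small Python script that writes a revised mosaic.vrt. This is one reading of the answer, not a verified fix: each SimpleSource becomes a ComplexSource, the per-source NoDataValue is renamed to NODATA, and the per-source HideNoDataValue is dropped because it is not valid there. The helper names source_block and build_vrt are made up for illustration; the file names and sizes are those from the question.

```python
import xml.etree.ElementTree as ET


def source_block(filename: str) -> str:
    """Return one ComplexSource element that treats pixel value 0 as NODATA."""
    return f"""    <ComplexSource>
      <SourceFilename relativeToVRT="1">{filename}</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="100" ySize="100"/>
      <DstRect xOff="0" yOff="0" xSize="100" ySize="100"/>
      <NODATA>0</NODATA>
    </ComplexSource>"""


def build_vrt(filenames) -> str:
    """Assemble a single-band VRT whose sources are listed in drawing order."""
    sources = "\n".join(source_block(name) for name in filenames)
    return f"""<VRTDataset rasterXSize="100" rasterYSize="100">
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Gray</ColorInterp>
{sources}
  </VRTRasterBand>
</VRTDataset>
"""


vrt_text = build_vrt(["a1.tif", "a2.tif"])
ET.fromstring(vrt_text)  # raises ParseError if the XML is malformed

with open("mosaic.vrt", "w", encoding="utf-8") as f:
    f.write(vrt_text)
```

The same gdal_translate mosaic.vrt mosaic.tif command from the question can then be run against the rewritten file.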