PDFBox PDFTextStripperByArea region coordinates


In what units and in which direction is the Rectangle measured in PDFTextStripperByArea's method addRegion(String regionName, Rectangle2D rect)?
In other words, if new Rectangle(10,10,100,100) is passed as the second parameter, where does the rectangle R start, how large is it (units of the origin values and of the rectangle's dimensions), and in which direction does it extend (the direction of the blue arrows in the illustration)?

new Rectangle(10,10,100,100)
means that the rectangle's upper-left corner is at position (10, 10), that is, 10 units from the left edge and 10 units from the top edge of the page. Here a "unit" is 1 pt = 1/72 inch.
The first 100 is the width of the rectangle and the second is its height.
To sum up, the first picture is the correct one.
I wrote this code to extract some areas of a page given as arguments to the function:
Rectangle2D region = new Rectangle2D.Double(x, y, width, height);
String regionName = "region";
PDFTextStripperByArea stripper;
stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
So x and y are the absolute coordinates of the upper-left corner of the rectangle, followed by its width and height. page is a PDPage passed as an argument to this function.
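Since the region is specified in points (1 pt = 1/72 inch), it can be handy to convert from inches or millimetres before building the rectangle. The following is only a minimal sketch using java.awt.geom from the JDK; the class name RegionUnits and its helper methods are hypothetical and not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper (not part of PDFBox): converts physical units to PDF points.
final class RegionUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // left/top are measured from the upper-left corner of the page, in inches.
    static Rectangle2D regionFromInches(double left, double top, double width, double height) {
        return new Rectangle2D.Double(
                inchesToPoints(left), inchesToPoints(top),
                inchesToPoints(width), inchesToPoints(height));
    }
}
```

The resulting Rectangle2D can be passed to addRegion() in place of the hand-written one above.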

Was looking into doing something like this, so I thought I'd pass what I found along.
Here's the code for creating my original PDF using iText.
import com.lowagie.text.Document
import com.lowagie.text.Paragraph
import com.lowagie.text.pdf.PdfWriter
class SimplePdfCreator {
void createFrom(String path) {
Document d = new Document()
try {
PdfWriter writer = PdfWriter.getInstance(d, new FileOutputStream(path))
d.open()
d.add(new Paragraph("This is a test."))
d.close()
} catch (Exception e) {
e.printStackTrace()
}
}
}
If you crack open the pdf, you'll see the text in the upper left hand corner. Here's the test showing what you are looking for.
@Test
void createFrom_using_pdf_box_to_extract_text_targeted_extraction() {
new SimplePdfCreator().createFrom("myFileLocation")
def doc = PDDocument.load("myFileLocation")
Rectangle2D.Double d = new Rectangle2D.Double(0, 0, 120, 100)
def stripper = new PDFTextStripperByArea()
def pages = doc.getDocumentCatalog().allPages
stripper.addRegion("myRegion", d)
stripper.extractRegions(pages[0])
assert stripper.getTextForRegion("myRegion").contains("This is a test.")
}
Position (0, 0) is the upper-left corner of the page. The width extends to the right and the height extends downward. I was able to trim the region down to (35, 52, 120, 3) and still get the test to pass.
All code is written in Groovy.
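Coordinates read from other tools are often in default PDF user space, where the origin is the lower-left corner and y grows upward. A rough sketch of the conversion to the top-left, y-down convention described above, assuming an unrotated page whose crop box starts at (0, 0); RegionFlip and toTopLeft are hypothetical names, not PDFBox API:

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper (not part of PDFBox).
final class RegionFlip {
    // userSpaceRect: x/y is the lower-left corner, measured from the bottom-left of the page.
    // Returns the same area with x/y as the upper-left corner, measured from the top-left.
    static Rectangle2D toTopLeft(Rectangle2D userSpaceRect, double pageHeight) {
        double top = pageHeight - (userSpaceRect.getY() + userSpaceRect.getHeight());
        return new Rectangle2D.Double(
                userSpaceRect.getX(), top,
                userSpaceRect.getWidth(), userSpaceRect.getHeight());
    }
}
```

For example, on a page 792 pt high, a user-space rectangle at (35, 737) with size 120 x 3 maps to (35, 52, 120, 3), the trimmed region used in the test. The page height would typically come from the page's crop box.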

Code in Java using PDFBox.
public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
    File file = new File(path + filename);
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(file)) {
        // Rectangle2D.Double(x, y, width, height), in points, measured from the upper-left corner
        Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
        String regionName = "region";
        // getPage() takes a 0-based index; pageNumber is assumed to be 1-based
        PDPage page = document.getPage(pageNumber - 1);
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.addRegion(regionName, region);
        stripper.extractRegions(page);
        return stripper.getTextForRegion(regionName);
    }
}

Related

docx4j how to insert image into table cell

I can't insert an image into a table cell with docx4j using the following code:
WordprocessingMLPackage wordPackage = WordprocessingMLPackage.createPackage(PageSizePaper.A4,true);
ObjectFactory factory = Context.getWmlObjectFactory();
Tbl table = factory.createTbl();
Tr tableRow = factory.createTr();
byte[] imageBytes = Base64.getDecoder().decode(t.getBase64Image());
BinaryPartAbstractImage imagePart = BinaryPartAbstractImage.createImagePart(wordPackage, imageBytes);
Inline inline = imagePart.createImageInline("image", "image", 0, 1, false);
P celPar = addInlineImageToParagraph(inline, factory);
Tc tableCell = factory.createTc();
tableCell.getContent().clear();
tableCell.getContent().add(celPar);
tableRow.getContent().add(tableCell);
wordPackage.getMainDocumentPart().addObject(table);
private P addInlineImageToParagraph(Inline inline, ObjectFactory factory) {
P paragraph = factory.createP();
R run = factory.createR();
paragraph.getContent().add(run);
Drawing drawing = factory.createDrawing();
run.getContent().add(drawing);
drawing.getAnchorOrInline().add(inline);
return paragraph;
}
Word has a problem displaying the image. I really don't know where the problem is.

If you looked at a docx resulting from your code, you would see:
<w:tbl></w:tbl>
You are just missing
table.getContent().add(tableRow);
EDIT 24 Sept
You didn't say until now that you were trying to add your image in a footer!
For this you need to specify that part, so that the relationship attaches to the footer. So use https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/openpackaging/parts/WordprocessingML/BinaryPartAbstractImage.java#L247 or https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/openpackaging/parts/WordprocessingML/BinaryPartAbstractImage.java#L339, etc., i.e. one of the signatures that takes a Part sourcePart parameter.

PDF article region identification

Which PDFBox APIs can I use to identify regions, where a region is a rectangle enclosing one article, so that I can then extract the article's text?
I am thinking of parsing the PDF content and treating text enclosed by large white-space areas as a region.
Here is code that extracts one region whose size and position are hard-coded:
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument pdf = PDDocument.load(new File("SevenPropertiesofHighlySecureDevices.pdf"));
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
// 1 pt is equal to 1/72 inch
Rectangle2D rectangle = new Rectangle2D.Double(0,0,200,200);
String regionName = "First Article Region";
stripper.addRegion(regionName, rectangle);
stripper.extractRegions(pdf.getPage(0));
LOGGER.info("getTextForRegion: \n{}", stripper.getTextForRegion(regionName));
If anyone is tempted to "scan" a PDF to find the white-space areas and use them to determine rectangular regions for extracting articles, be warned that this approach is very slow and unlikely to be usable:
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument pdf = PDDocument.load(new File("4.pdf"));
PDPage page = pdf.getPage(1);
PDRectangle cropBox = page.getCropBox();
float docWidth = cropBox.getUpperRightX();
float docHeight = cropBox.getUpperRightY();
float recWidth = 10;
float rectHeight = 10;
float xStep = recWidth / 2;
float yStep = rectHeight / 2;
String regionName = "docScannerRegion";
String docLeftMarginIndicator = "|";
String docRightMarginIndicator = "|";
String nonTextArea = " ";
String textArea = "-";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
for (float y = 0; y < docHeight; y = y + yStep) {
System.out.print(docLeftMarginIndicator);
for(float x = 0; x < docWidth; x = x + xStep) {
stripper.addRegion(regionName, new Rectangle2D.Double(x, y, recWidth, rectHeight));
stripper.extractRegions(page);
String txt = stripper.getTextForRegion(regionName).trim();
if (StringUtils.isBlank(txt)) {
System.out.print(nonTextArea);
} else {
System.out.print(textArea);
}
}
System.out.println(docRightMarginIndicator);
}
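Part of the cost is that the loop above calls extractRegions() once per grid cell: on a 612 x 792 pt page with a 5 pt step that is 123 x 159 = 19,557 passes over the same page. One possible alternative, sketched here only under assumptions, is to collect the text bounding boxes in a single pass (for instance from a PDFTextStripper subclass that records TextPosition coordinates; that part is assumed and not shown) and then mark the grid from those boxes. OccupancyGrid, mark and render are hypothetical names, and the boxes are assumed to use the same top-left coordinates as the regions.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Hypothetical helper: builds the same kind of text/white-space map from precomputed boxes.
final class OccupancyGrid {
    // Marks every grid cell that is touched by at least one text bounding box.
    static boolean[][] mark(List<Rectangle2D> textBoxes, double pageWidth,
                            double pageHeight, double cellSize) {
        int cols = (int) Math.ceil(pageWidth / cellSize);
        int rows = (int) Math.ceil(pageHeight / cellSize);
        boolean[][] occupied = new boolean[rows][cols];
        for (Rectangle2D box : textBoxes) {
            int firstCol = clamp((int) Math.floor(box.getMinX() / cellSize), cols);
            int lastCol = clamp((int) Math.floor(box.getMaxX() / cellSize), cols);
            int firstRow = clamp((int) Math.floor(box.getMinY() / cellSize), rows);
            int lastRow = clamp((int) Math.floor(box.getMaxY() / cellSize), rows);
            for (int r = firstRow; r <= lastRow; r++) {
                for (int c = firstCol; c <= lastCol; c++) {
                    occupied[r][c] = true;
                }
            }
        }
        return occupied;
    }

    // Keeps an index inside [0, size - 1] so boxes that overhang the page are still counted.
    private static int clamp(int index, int size) {
        return Math.max(0, Math.min(size - 1, index));
    }

    // Renders the grid with the same markers as the scan above: '|' margins, '-' text, ' ' blank.
    static String render(boolean[][] occupied) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : occupied) {
            sb.append('|');
            for (boolean cell : row) {
                sb.append(cell ? '-' : ' ');
            }
            sb.append("|\n");
        }
        return sb.toString();
    }
}
```

Runs of blank rows or columns in the resulting grid could then serve as candidate separators between articles; how well that works depends on the layout of the particular document.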

resizing cell's width of a jtable(inside a jscrollpane) based on the content length

I'm having a problem resizing columns based on the cell with the longest content. In the following example, I want the width of every cell to readjust according to the length of the content in the "ADDRESS" column. Please help, and thank you in advance.
public void setPnLTable(int x, int y){
String[] ColNames = {"Name", "AGE", "ADDRESS"};
private String [][] getData(){t[0][0] = "John Smith"; t[0][1] = "55"; t[0][2] = "1600 Pennsylvania Ave NW, Washington, DC 20500";
DefaultTableModel tm = new DefaultTableModel(getData(),ColNames);
//
jPnLTABLE = new JTable(tm);
// Center Columns
cr.setHorizontalAlignment( JLabel.CENTER );
for (int i = 0; i < ColNames.length; i++){
jPnLTABLE.getColumnModel().getColumn(i).setCellRenderer(cr);
}
jPnLTABLE.setAutoResizeMode(JTable.AUTO_RESIZE_ALL_COLUMNS);
JScrollPane js = new JScrollPane(jPnLTABLE);
//
js.setBounds(x, y, 2000, 500); // IT'S HARD-CODED;
//CAN I RESIZE IT BASED ON THE LENGTH OF THE CELL CONTENT
jPnLTABLE.getTableHeader().setReorderingAllowed(false);
jPnLTABLE.getTableHeader().setFont(new Font(null, Font.BOLD, 12));
add(js);
}

If your JFrame uses a BorderLayout and you put the JScrollPane in BorderLayout.CENTER, there is no need to set the size of the JScrollPane: it will take all the available space.
If the JScrollPane is inside a cell of a GridLayout, that is no problem either: it will readjust, so there is no need to set its size.
It is always easier to put the JScrollPane inside a JPanel (GridLayout(1,1) or BorderLayout.CENTER) and let the layout manager do the adjustment on the JPanel.
Or maybe you can use
table.setFillsViewportHeight( true );

itext pdf with list of image and text below

I need help generating a PDF with a list of images, each with text describing the image below it.
I tried the code below, but the image and text end up beside each other. Any help is appreciated. Thanks.
........
PdfPTable table = new PdfPTable(1);
table.setHorizontalAlignment(Element.ALIGN_CENTER);
table.setSplitRows(true);
table.setWidthPercentage(90f);
Paragraph paragraph = new Paragraph();
for (int counter = 0; counter < empSize; counter++) {
String imgPath = ... ".png");
Image img = Image.getInstance(imgPath);
img.scaleAbsolute(110f, 95f);
Paragraph textParagraph = new Paragraph("Test" + counter);
textParagraph.setLeading(Math.max(img.getScaledHeight(), img.getScaledHeight()));
textParagraph.setAlignment(Element.ALIGN_CENTER);
Phrase imageTextCollectionPhase = new Phrase();
Phrase ph = new Phrase();
ph.add(new Chunk(img, 0, 0, true));
ph.add(textParagraph);
imageTextCollectionPhase.add(ph);
paragraph.add(imageTextCollectionPhase);
}
PdfPCell cell = new PdfPCell(paragraph);
table.addCell(cell);
doc.add(table);

I assume that you want each image to appear with its text directly below it.
In your case, you are adding all the content (all the images and all the text) to a single cell. You should add them to separate cells as is done in the MultipleImagesInTable example:
public void createPdf(String dest) throws IOException, DocumentException {
Image img1 = Image.getInstance(IMG1);
Image img2 = Image.getInstance(IMG2);
Image img3 = Image.getInstance(IMG3);
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
PdfPTable table = new PdfPTable(1);
table.setWidthPercentage(20);
table.addCell(img1);
table.addCell("Brazil");
table.addCell(img2);
table.addCell("Dog");
table.addCell(img3);
table.addCell("Fox");
document.add(table);
document.close();
}
You can easily change this proof of concept so that a loop is used. Just make sure you put the addCell() methods inside the loop instead of outside the loop.
You can also explicitly create a PdfPCell and combine the text and the image in the same cell like this:
PdfPCell cell = new PdfPCell();
cell.addElement(img1);
cell.addElement(new Paragraph("Brazil"));
table.addCell(cell);

table borders not expanding properly in pdf using itext

I would like to generate a PDF containing a bordered table with enough data that it spans two pages. The problem is that the table borders do not continue properly from page to page: the table is framed again at the page break, with a closing border at the end of the previous page and a horizontal border at the top of the next page, which is wrong. Those borders at the page break should not appear.
Please find the attached PDF file and HTML file for reference: the PDF file generated with my code and a sample HTML file.

You want a table that looks like this custom_border2.pdf.
As explained in my comments, you need to set the borders of the cell to NO_BORDER, either by changing the default cell:
table.getDefaultCell().setBorder(Rectangle.NO_BORDER);
Or by changing the properties of specific cells:
PdfPCell cell = new PdfPCell(new Phrase(TEXT));
cell.setBorder(Rectangle.NO_BORDER);
Or both.
Then you have to create a table event:
class BorderEvent implements PdfPTableEventAfterSplit {
protected boolean bottom = true;
protected boolean top = true;
public void splitTable(PdfPTable table) {
bottom = false;
}
public void afterSplitTable(PdfPTable table, PdfPRow startRow, int startIdx) {
top = false;
}
public void tableLayout(PdfPTable table, float[][] width, float[] height,
int headerRows, int rowStart, PdfContentByte[] canvas) {
float widths[] = width[0];
float y1 = height[0];
float y2 = height[height.length - 1];
float x1 = widths[0];
float x2 = widths[widths.length - 1];
PdfContentByte cb = canvas[PdfPTable.LINECANVAS];
cb.moveTo(x1, y1);
cb.lineTo(x1, y2);
cb.moveTo(x2, y1);
cb.lineTo(x2, y2);
if (top) {
cb.moveTo(x1, y1);
cb.lineTo(x2, y1);
}
if (bottom) {
cb.moveTo(x1, y2);
cb.lineTo(x2, y2);
}
cb.stroke();
cb.resetRGBColorStroke();
bottom = true;
top = true;
}
}
The splitTable() and afterSplitTable() methods keep track of whether a top or bottom border needs to be drawn. The actual borders are drawn in the tableLayout() method.
You need to set this table event right after creating the table:
PdfPTable table = new PdfPTable(2);
BorderEvent event = new BorderEvent();
table.setTableEvent(event);
Now you will have the desired behavior as explained in my initial comment. You can find the full code sample here. I have provided a more complex example here.
