PDF article region identification


Which PDFBox APIs can I use to identify regions, where a region is a rectangle enclosing one article, so that I can then extract the text of that article?
I am thinking of parsing the PDF content and treating blocks of text surrounded by large white-space areas as regions.
Here is code that extracts one region whose size and placement are hard-coded:
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument pdf = PDDocument.load(new File("SevenPropertiesofHighlySecureDevices.pdf"));
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
// 1 pt is equal to 1/72 inch
Rectangle2D rectangle = new Rectangle2D.Double(0,0,200,200);
String regionName = "First Article Region";
stripper.addRegion(regionName, rectangle);
stripper.extractRegions(pdf.getPage(0));
LOGGER.info("getTextForRegion: \n{}", stripper.getTextForRegion(regionName));
If anyone is tempted to "scan" a PDF with many small regions to find the white-space areas, and then use that information to determine the rectangular regions of the articles, be aware that this approach is very slow and unlikely to be usable:
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument pdf = PDDocument.load(new File("4.pdf"));
PDPage page = pdf.getPage(1);
PDRectangle cropBox = page.getCropBox();
float docWidth = cropBox.getUpperRightX();
float docHeight = cropBox.getUpperRightY();
float recWidth = 10;
float rectHeight = 10;
float xStep = recWidth / 2;
float yStep = rectHeight / 2;
String regionName = "docScannerRegion";
String docLeftMarginIndicator = "|";
String docRightMarginIndicator = "|";
String nonTextArea = " ";
String textArea = "-";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
for (float y = 0; y < docHeight; y = y + yStep) {
System.out.print(docLeftMarginIndicator);
for(float x = 0; x < docWidth; x = x + xStep) {
stripper.addRegion(regionName, new Rectangle2D.Double(x, y, recWidth, rectHeight));
stripper.extractRegions(page);
String txt = stripper.getTextForRegion(regionName).trim();
if (StringUtils.isBlank(txt)) {
System.out.print(nonTextArea);
} else {
System.out.print(textArea);
}
}
System.out.println(docRightMarginIndicator);
}
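One way to avoid probing the page with hundreds of tiny regions might be to collect the bounding box of each text line in a single pass (for instance by overriding PDFTextStripper.writeString and reading the TextPosition coordinates, which is not shown here) and then merge boxes that lie closer together than some gap. The sketch below covers only the merging step, using plain java.awt.geom. The class name RegionMerger, the gap threshold, and the sample coordinates are illustrative assumptions, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (not part of PDFBox): groups text boxes into candidate article regions. */
final class RegionMerger {

    /** True when b overlaps a after a has been grown by gap on every side. */
    static boolean isNear(Rectangle2D a, Rectangle2D b, double gap) {
        Rectangle2D grown = new Rectangle2D.Double(
                a.getX() - gap, a.getY() - gap,
                a.getWidth() + 2 * gap, a.getHeight() + 2 * gap);
        return grown.intersects(b);
    }

    /** Repeatedly unions boxes that are within gap of each other until nothing changes. */
    static List<Rectangle2D> merge(List<Rectangle2D> boxes, double gap) {
        List<Rectangle2D> regions = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            regions.add((Rectangle2D) box.clone());
        }
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int i = 0; i < regions.size() && !merged; i++) {
                for (int j = i + 1; j < regions.size(); j++) {
                    if (isNear(regions.get(i), regions.get(j), gap)) {
                        // Replace the pair with its bounding rectangle and start over.
                        regions.set(i, regions.get(i).createUnion(regions.get(j)));
                        regions.remove(j);
                        merged = true;
                        break;
                    }
                }
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Made-up line boxes: three lines in a left column, two in a right column (points, top-left origin).
        List<Rectangle2D> lines = Arrays.asList(
                new Rectangle2D.Double(50, 100, 200, 12),
                new Rectangle2D.Double(50, 114, 200, 12),
                new Rectangle2D.Double(50, 128, 200, 12),
                new Rectangle2D.Double(300, 100, 200, 12),
                new Rectangle2D.Double(300, 114, 200, 12));
        for (Rectangle2D region : merge(lines, 10)) {
            System.out.println(region);
        }
    }
}
```

Each resulting rectangle could then be passed once to addRegion, so the page is processed a small number of times rather than once per probe cell. The right gap value depends on the layout and would need tuning per document.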

Related

What is the right way to get term positions in a Lucene document?

The example in this question and some others I've seen on the web use the postings method on a term vector's TermsEnum to get term positions. Copied from the example in the linked question:
IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
p = terms.postings( p, PostingsEnum.ALL );
while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
int freq = p.freq();
for( int i = 0; i < freq; i++ ) {
int pos = p.nextPosition(); // Always returns -1!!!
BytesRef data = p.getPayload();
doStuff( freq, pos, data ); // Fails miserably, of course.
}
}
}
This code works for me, but what drives me mad is that the Terms type is supposedly where the position information is kept. All the documentation I've seen says that term vectors keep position data, yet there are no methods on this type to get that information.
Older versions of Lucene apparently had such a method, but as of at least version 6.5.1 that is no longer the case.
Instead I'm supposed to use the postings method and traverse the documents, even though I already know which document I want to work on.
The API documentation does not say that postings returns only the current document (the one the term vector belongs to), but when I run it, I only get the current doc.
Is this the correct and only way to get position data from term vectors? Why such an unintuitive API? Is there a document that explains why the previous approach was changed in favour of this one?

I don't know about "right or wrong", but for version 6.6.3 this seems to work:
private void run() throws Exception {
Directory directory = new RAMDirectory();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
Document doc = new Document();
// Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES
FieldType type = new FieldType();
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectorOffsets(true);
type.setIndexOptions(IndexOptions.DOCS);
Field fieldStore = new Field("tags", "foo bar and then some", type);
doc.add(fieldStore);
writer.addDocument(doc);
writer.close();
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Term t = new Term("tags", "bar");
Query q = new TermQuery(t);
TopDocs results = searcher.search(q, 1);
for ( ScoreDoc scoreDoc: results.scoreDocs ) {
Fields termVs = reader.getTermVectors(scoreDoc.doc);
Terms f = termVs.terms("tags");
TermsEnum te = f.iterator();
PostingsEnum docsAndPosEnum = null;
BytesRef bytesRef;
while ( (bytesRef = te.next()) != null ) {
docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
// for each term (iterator next) in this field (field)
// iterate over the docs (should only be one)
int nextDoc = docsAndPosEnum.nextDoc();
assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
final int fr = docsAndPosEnum.freq();
final int p = docsAndPosEnum.nextPosition();
final int o = docsAndPosEnum.startOffset();
System.out.println("p="+ p + ", o=" + o + ", l=" + bytesRef.length + ", f=" + fr + ", s=" + bytesRef.utf8ToString());
}
}
}
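For intuition about why positions are reached through postings(), it may help to picture a term vector as a tiny inverted index over a single document: each term maps to its frequency and its positions, so the "document iteration" has exactly one entry. The following is a conceptual model in plain Java, not Lucene code; the whitespace tokenizer and the class name ToyTermVector are assumptions for illustration. In Lucene itself each call to nextPosition() yields one position, so reading all of them takes freq() calls, whereas the loop above reads only the first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a single-document term vector (illustrative only, not Lucene API). */
final class ToyTermVector {

    /** Maps each term to the token positions where it occurs; terms are kept in sorted order. */
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyMap();
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> vector = build("foo bar and then some foo");
        for (Map.Entry<String, List<Integer>> entry : vector.entrySet()) {
            // The frequency is simply the number of stored positions for the term.
            System.out.println(entry.getKey()
                    + " freq=" + entry.getValue().size()
                    + " positions=" + entry.getValue());
        }
    }
}
```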

Apache PDFBox text replacement leaves a few characters missing

Using Apache PDFBox version 2.0.2 for text replacement (with the code below) produces output in which a few characters are not displayed, mostly capital letters. For example, a replacement with "ABCDEFGHIJKLMNOPQRSTUVWXYZ" appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround to handle these characters?
public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
PDPageTree pages = document.getDocumentCatalog().getPages();
for (PDPage page : pages) {
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operand, the string to display, so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
string = StringUtils.replaceOnce(string, searchString, replacement);
cosString.setValue(string.getBytes());
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
out.close();
}
return document;
}

Quoting from https://pdfbox.apache.org/2.0/migration.html:
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the original file uses a font subset that is missing the characters G, N, Q, V and Y.
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want with a new font at the correct place.
P.S. the current PDFBox version is 2.0.7, not 2.0.2.
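To make the two problems from the quoted migration guide concrete, here is a small simulation in plain Java. It does not use PDFBox: the regular expression that pulls string operands out of a TJ array handles only simple literals, and the "font subset" is modelled as a set of available characters, both simplifying assumptions for illustration. Showing a missing glyph as a blank is likewise just a stand-in for whatever a real viewer does.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative simulation (not PDFBox API) of why naive text replacement fails. */
final class ReplaceTextPitfalls {

    // Matches simple literal strings such as (Do); escapes and nested parentheses are ignored.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()]*)\\)");

    /** Concatenates the string operands of a TJ array, dropping the kerning numbers. */
    static String joinTjFragments(String tjArray) {
        StringBuilder joined = new StringBuilder();
        Matcher matcher = LITERAL.matcher(tjArray);
        while (matcher.find()) {
            joined.append(matcher.group(1));
        }
        return joined.toString();
    }

    /** Replaces every character that the simulated subset lacks with a blank. */
    static String renderWithSubset(String text, Set<Character> subset) {
        StringBuilder rendered = new StringBuilder();
        for (char c : text.toCharArray()) {
            rendered.append(subset.contains(c) ? c : ' ');
        }
        return rendered.toString();
    }

    public static void main(String[] args) {
        String tj = "[ (Do) -29 (c) -1 (umen) 30 (tation) ]";
        // The whole word exists only after joining the fragments.
        System.out.println(joinTjFragments(tj));          // Documentation
        // No single operand, and not even the raw array text, contains it.
        System.out.println(tj.contains("Documentation")); // false

        Set<Character> subset = new HashSet<>();
        for (char c : "ABCDEFHIJKLM".toCharArray()) {
            subset.add(c);
        }
        // G and N are not in the subset, so they drop out of the rendered text.
        System.out.println("[" + renderWithSubset("ABCDEFGHIJKLMN", subset) + "]");
    }
}
```

This is why searching each Tj or TJ string separately misses words that are split across operands, and why a replacement string can lose characters when the embedded font never contained them.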

Automatic PDF Rendering

I've read the MigraDoc/PdfSharp documentation, but it feels a bit thin. I want to render out a PDF, but not have to manually specify width and height. I just want it to align right, center, or left (of margins), and handle all the sizing for me.
Public Sub Write()
Dim document As PdfDocument = New PdfDocument()
Dim page As PdfPage = document.AddPage()
Dim gfx As XGraphics = XGraphics.FromPdfPage(page)
gfx.MUH = PdfFontEncoding.Unicode
gfx.MFEH = PdfFontEmbedding.Default
Dim font As XFont = New XFont("Verdana", 13, XFontStyle.Bold)
Dim migraDocument As New Document
Dim sec As Section = migraDocument.AddSection()
Dim quotationHeader As New Paragraph
quotationHeader.AddText("Quotation" & vbNewLine)
quotationHeader.Format.Alignment = ParagraphAlignment.Right
sec.Add(quotationHeader)
Dim dhAddressInfo As New Paragraph
dhAddressInfo.AddText("ADDRESS GOES HERE")
dhAddressInfo.Format.Alignment = ParagraphAlignment.Left
sec.Add(dhAddressInfo)
Dim quotationInfo As New Paragraph
quotationInfo.AddText("QUOTATION INFO AND DATE HERE")
quotationInfo.Format.Alignment = ParagraphAlignment.Right
sec.Add(quotationInfo)
Dim customerBilling As New Paragraph
With Customer
customerBilling.AddText("CUSTOMER BILLING OBJECT PROPERTIES HERE")
End With
customerBilling.Format.Alignment = ParagraphAlignment.Left
sec.Add(customerBilling)
Dim authorInfo As New Paragraph
authorInfo.AddText("AUTHOR INFO HERE")
authorInfo.Format.Alignment = ParagraphAlignment.Right
sec.Add(authorInfo)
Dim pricingTable As New Table
'pricingTable.Format.Alignment = ParagraphAlignment.Center
pricingTable.AddColumn("13cm")
pricingTable.AddColumn("13cm")
Dim headerRow As New Row
headerRow = pricingTable.AddRow()
headerRow.HeadingFormat = True
headerRow.Cells(0).AddParagraph("Description")
headerRow.Cells(1).AddParagraph("Amount")
For i As Integer = 0 To SelectedPrices.Count - 1
Dim row As Row = pricingTable.AddRow()
Dim price As Pricing = SelectedPrices(i)
row.Cells(0).AddParagraph(price.Item)
row.Cells(1).AddParagraph((price.Price * price.Quantity).ToString())
Next
Dim totalRow As Row = pricingTable.AddRow()
totalRow.Cells(0).AddParagraph("Total: ")
Dim total As Double = 0
For Each price As Pricing In SelectedPrices
total = total + (price.Price * price.Quantity)
Next
totalRow.Cells(1).AddParagraph(total.ToString)
sec.Add(pricingTable)
Dim docRenderer As DocumentRenderer = New DocumentRenderer(migraDocument)
docRenderer.PrepareDocument()
docRenderer.RenderObject(gfx, XUnit.FromCentimeter(0), XUnit.FromCentimeter(0), "10cm", quotationHeader)
docRenderer.RenderObject(gfx, XUnit.FromCentimeter(0), XUnit.FromCentimeter(2), "10cm", dhAddressInfo)
docRenderer.RenderObject(gfx, XUnit.FromCentimeter(5), XUnit.FromCentimeter(2), "10cm", quotationInfo)
docRenderer.RenderObject(gfx, XUnit.FromCentimeter(0), XUnit.FromCentimeter(6), "10cm", customerBilling)
docRenderer.RenderObject(gfx, XUnit.FromCentimeter(5), XUnit.FromCentimeter(6), "10cm", authorInfo)
docRenderer.RenderObject(gfx, XUnit.FromCentimeter(3), XUnit.FromCentimeter(10), "10cm", pricingTable)
document.Save(Environment.CurrentDirectory & "\test.pdf")
End Sub
Notice at the bottom I'm specifying the X and Y coordinates of each section. I just want to define spacing. Alignment should take care of the rest.

I found a different tutorial that uses PdfDocumentRenderer and shows how to correctly use it. It's not in VB, but quite easily translated. I copied it below in case the link goes dead.
http://www.c-sharpcorner.com/UploadFile/aftab_ku/create-object-model-document-and-renders-them-into-pdf/
public Document CreateDocument()
{
// Create a new MigraDoc document
this.document = new Document();
this.document.Info.Title = "";
this.document.Info.Subject = "";
this.document.Info.Author = "Aftab";
DefineStyles();
CreatePage();
FillContent();
return this.document;
}
Here, CreateDocument() in PDFform.cs creates a new MigraDoc document. It calls three functions that define the styles, create the page, and fill the content of the table.
void DefineStyles()
{
// Get the predefined style Normal.
Style style = this.document.Styles["Normal"];
// Because all styles are derived from Normal, the next line changes the
// font of the whole document. Or, more exactly, it changes the font of
// all styles and paragraphs that do not redefine the font.
style.Font.Name = "Verdana";
style = this.document.Styles[StyleNames.Header];
style.ParagraphFormat.AddTabStop("16cm", TabAlignment.Right);
style = this.document.Styles[StyleNames.Footer];
style.ParagraphFormat.AddTabStop("8cm", TabAlignment.Center);
// Create a new style called Table based on style Normal
style = this.document.Styles.AddStyle("Table", "Normal");
style.Font.Name = "Verdana";
style.Font.Name = "Times New Roman";
style.Font.Size = 9;
// Create a new style called Reference based on style Normal
style = this.document.Styles.AddStyle("Reference", "Normal");
style.ParagraphFormat.SpaceBefore = "5mm";
style.ParagraphFormat.SpaceAfter = "5mm";
style.ParagraphFormat.TabStops.AddTabStop("16cm", TabAlignment.Right);
}
DefineStyles() does the job of styling the document. Next comes CreatePage():
void CreatePage()
{
// Each MigraDoc document needs at least one section.
Section section = this.document.AddSection();
// Put a logo in the header
Image image= section.AddImage(path);
image.Top = ShapePosition.Top;
image.Left = ShapePosition.Left;
image.WrapFormat.Style = WrapStyle.Through;
// Create footer
Paragraph paragraph = section.Footers.Primary.AddParagraph();
paragraph.AddText("Health And Social Services.");
paragraph.Format.Font.Size = 9;
paragraph.Format.Alignment = ParagraphAlignment.Center;
............
// Create the item table
this.table = section.AddTable();
this.table.Style = "Table";
this.table.Borders.Color = TableBorder;
this.table.Borders.Width = 0.25;
this.table.Borders.Left.Width = 0.5;
this.table.Borders.Right.Width = 0.5;
this.table.Rows.LeftIndent = 0;
// Before you can add a row, you must define the columns
Column column;
foreach (DataColumn col in dt.Columns)
{
column = this.table.AddColumn(Unit.FromCentimeter(3));
column.Format.Alignment = ParagraphAlignment.Center;
}
// Create the header of the table
Row row = table.AddRow();
row.HeadingFormat = true;
row.Format.Alignment = ParagraphAlignment.Center;
row.Format.Font.Bold = true;
row.Shading.Color = TableBlue;
for (int i = 0; i < dt.Columns.Count; i++)
{
row.Cells[i].AddParagraph(dt.Columns[i].ColumnName);
row.Cells[i].Format.Font.Bold = false;
row.Cells[i].Format.Alignment = ParagraphAlignment.Left;
row.Cells[i].VerticalAlignment = VerticalAlignment.Bottom;
}
this.table.SetEdge(0, 0, dt.Columns.Count, 1, Edge.Box,
BorderStyle.Single, 0.75, Color.Empty);
}
Here CreatePage() adds a header, footer, and different sections into the document and then the table is created to display the records. Columns from the datatable are added into the table inside the document and then a header row that contains the column names is added.
column = this.table.AddColumn(Unit.FromCentimeter(3));
//creates a new column and width of the column is passed as a parameter.
Row row = table.AddRow();
//A new header row is created
row.Cells[i].AddParagraph(dt.Columns[i].ColumnName);
//this will add the column name to header of the row.
this.table.SetEdge(0, 0, dt.Columns.Count, 1, Edge.Box,
BorderStyle.Single, 0.75, Color.Empty);
//sets the border of the row
void FillContent()
{
...............
Row row1;
for (int i = 0; i < dt.Rows.Count; i++)
{
row1 = this.table.AddRow();
row1.TopPadding = 1.5;
for (int j = 0; j < dt.Columns.Count; j++)
{
row1.Cells[j].Shading.Color = TableGray;
row1.Cells[j].VerticalAlignment = VerticalAlignment.Center;
row1.Cells[j].Format.Alignment = ParagraphAlignment.Left;
row1.Cells[j].Format.FirstLineIndent = 1;
row1.Cells[j].AddParagraph(dt.Rows[i][j].ToString());
this.table.SetEdge(0, this.table.Rows.Count - 2, dt.Columns.Count, 1,
Edge.Box, BorderStyle.Single, 0.75);
}
}
.............
}
FillContent() fills the rows from the datatable into the table inside the document:
row1.Cells[j].AddParagraph(dt.Rows[i][j].ToString());
//adds the value of column into the table row
The Default.aspx file contains the code for generating the PDF:
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;
using System.Diagnostics;
MigraDoc libraries are used for generating PDF documents, and System.Diagnostics for starting a PDF Viewer:
PDFform pdfForm = new PDFform(GetTable(), Server.MapPath("img2.gif"));
// Create a MigraDoc document
Document document = pdfForm.CreateDocument();
document.UseCmykColor = true;
// Create a renderer for PDF that uses Unicode font encoding
PdfDocumentRenderer pdfRenderer = new PdfDocumentRenderer(true);
// Set the MigraDoc document
pdfRenderer.Document = document;
// Create the PDF document
pdfRenderer.RenderDocument();
// Save the PDF document...
string filename = "PatientsDetail.pdf";
pdfRenderer.Save(filename);
// ...and start a viewer.
Process.Start(filename);
A PDFform object is created and used to generate a new MigraDoc document. PdfDocumentRenderer renders the PDF document and then saves it. Process.Start(filename) starts a PDF viewer to open the PDF file created with MigraDoc.

Resetting the camera far plane

I'm trying to make my camera's far clip plane sit on Vector3(0,0,0) no matter how close or far the camera gets. I've managed to find a way of updating the far clip plane dynamically, but I can't get this plane to face my camera.
Thanks, C.
var matrix = new THREE.Matrix4();
matrix.extractRotation(camera.matrix);
var direction = new THREE.Vector3();
direction.subVectors( new THREE.Vector3(0,0,0), camera.position );
direction.normalize();
var N = new THREE.Vector3(0, 1, 0);
N.applyMatrix4(matrix);
var planePos = new THREE.Vector3(0,0,0);
var clipPlane = new THREE.Plane();
clipPlane.setFromNormalAndCoplanarPoint(N, planePos);
clipPlane.applyMatrix4(camera.matrixWorldInverse);
clipPlane = new THREE.Vector4(clipPlane.normal.x, clipPlane.normal.y, clipPlane.normal.z, clipPlane.constant);
var q = new THREE.Vector4();
var projectionMatrix = camera.projectionMatrix;
q.x = (sgn(clipPlane.x) + projectionMatrix.elements[8]) / projectionMatrix.elements[0];
q.y = (sgn(clipPlane.y) + projectionMatrix.elements[9]) / projectionMatrix.elements[5];
q.z = -1.0;
q.w = (1.0 + projectionMatrix.elements[10]) / camera.projectionMatrix.elements[14];
// Calculate the scaled plane vector
var c = new THREE.Vector4();
c = clipPlane.multiplyScalar(2000.0 ); //clipPlane.multiplyScalar(2.0 / clipPlane.dot(q)); /// clipPlane.dot(q)
// Replace the third row of the projection matrix
projectionMatrix.elements[2] = c.x;
projectionMatrix.elements[6] = c.y;
projectionMatrix.elements[10] = c.z + 1.0;
projectionMatrix.elements[14] = c.w;

If you want to reset the far plane parameter for a camera, you can use this pattern:
camera.far = new_value;
camera.updateProjectionMatrix();
In your particular case, you can do this:
camera.far = camera.position.length();
camera.updateProjectionMatrix();
three.js r.72
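One caveat worth checking: far is measured along the camera's viewing direction, so camera.position.length() puts the far plane through the origin only when the camera is looking at the origin. For other orientations the distance would be the projection of the camera-to-origin vector onto the view direction. Below is a small numeric sketch of both cases, written in plain Java rather than three.js, with made-up vectors and hypothetical helper names.

```java
/** Illustrative arithmetic only; the names are hypothetical and unrelated to the three.js API. */
final class FarPlaneDistance {

    /** Euclidean length of a 3-component vector. */
    static double length(double[] v) {
        return Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    }

    /**
     * Distance from the camera to the plane that passes through the world origin
     * and is perpendicular to the view direction, i.e. the projection of
     * (origin - position) onto forward. The forward vector must have unit length.
     */
    static double farThroughOrigin(double[] position, double[] forward) {
        return -(position[0] * forward[0]
                + position[1] * forward[1]
                + position[2] * forward[2]);
    }

    public static void main(String[] args) {
        double[] position = {0, 3, 4};

        // Camera aimed at the origin: both values agree (about 5).
        double[] towardOrigin = {0, -0.6, -0.8};
        System.out.println(length(position));
        System.out.println(farThroughOrigin(position, towardOrigin));

        // Camera looking straight down -Z instead: the projection is shorter than the length.
        double[] alongNegativeZ = {0, 0, -1};
        System.out.println(farThroughOrigin(position, alongNegativeZ));
    }
}
```

A non-positive result would mean the origin is beside or behind the camera, in which case no valid far value puts the plane through it.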

PDFBox PDFTextStripperByArea region coordinates

In what units and direction is the rectangle interpreted in PDFTextStripperByArea's method addRegion(String regionName, Rectangle2D rect)?
In other words, where does the rectangle R start, how big is it (units of the origin values and of the rectangle's dimensions), and in what direction does it extend (the direction of the blue arrows in the illustration), if new Rectangle(10,10,100,100) is given as the second parameter?

new Rectangle(10,10,100,100)
means that the rectangle will have its upper-left corner at position (10, 10), that is, 10 units from the left edge and 10 units from the top edge of the PDF page. Here a "unit" is 1 pt = 1/72 inch.
The first 100 is the width of the rectangle and the second is its height.
To sum up, the correct picture is the first one.
I wrote this code to extract some areas of a page given as arguments to the function:
Rectangle2D region = new Rectangle2D.Double(x, y, width, height);
String regionName = "region";
PDFTextStripperByArea stripper;
stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
So, x and y are the absolute coordinates of the upper-left corner of the rectangle, followed by its width and height. page is a PDPage variable passed as an argument to this function.
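Because addRegion measures y downward from the top of the page while PDF user space measures y upward from the bottom, and because all values are in points, a couple of small conversions often come in handy. The helpers below are a sketch in plain java.awt.geom. The method names are made up, and the page height is assumed to come from the page's box (for example PDRectangle.getHeight()), ignoring page rotation and boxes whose origin is not (0, 0).

```java
import java.awt.geom.Rectangle2D;

/** Illustrative unit and coordinate helpers (hypothetical names, not PDFBox API). */
final class RegionUnits {

    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts millimetres to points (1 pt = 1/72 inch). */
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /**
     * Converts a rectangle whose lower-left corner (x, yBottom) is given in
     * bottom-left-origin coordinates into a top-left-origin rectangle.
     */
    static Rectangle2D fromBottomLeft(double x, double yBottom,
                                      double width, double height,
                                      double pageHeight) {
        double yTop = pageHeight - yBottom - height;
        return new Rectangle2D.Double(x, yTop, width, height);
    }

    public static void main(String[] args) {
        double pageHeight = 842; // roughly A4 in points, used here as an example value
        // A 100 x 100 pt box whose lower edge sits 732 pt above the bottom of the page
        // starts 10 pt below the top edge.
        Rectangle2D region = fromBottomLeft(10, 732, 100, 100, pageHeight);
        System.out.println(region.getX() + ", " + region.getY());
        System.out.println(mmToPoints(25.4));
    }
}
```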

Was looking into doing something like this, so I thought I'd pass what I found along.
Here's the code for creating my original pdf using itext.
import com.lowagie.text.Document
import com.lowagie.text.Paragraph
import com.lowagie.text.pdf.PdfWriter
class SimplePdfCreator {
void createFrom(String path) {
Document d = new Document()
try {
PdfWriter writer = PdfWriter.getInstance(d, new FileOutputStream(path))
d.open()
d.add(new Paragraph("This is a test."))
d.close()
} catch (Exception e) {
e.printStackTrace()
}
}
}
If you crack open the pdf, you'll see the text in the upper left hand corner. Here's the test showing what you are looking for.
@Test
void createFrom_using_pdf_box_to_extract_text_targeted_extraction() {
new SimplePdfCreator().createFrom("myFileLocation")
def doc = PDDocument.load("myFileLocation")
Rectangle2D.Double d = new Rectangle2D.Double(0, 0, 120, 100)
def stripper = new PDFTextStripperByArea()
def pages = doc.getDocumentCatalog().allPages
stripper.addRegion("myRegion", d)
stripper.extractRegions(pages[0])
assert stripper.getTextForRegion("myRegion").contains("This is a test.")
}
Position (0, 0) is the upper left hand corner of the document. The width extends to the right and the height extends downward. I was able to trim down the range a bit to (35, 52, 120, 3) and still get the test to pass.
All code is written in Groovy.

Code in Java using PDFBox:
public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
File file = new File(path + filename);
// try-with-resources closes the document even if extraction fails
try (PDDocument document = PDDocument.load(file)) {
// Rectangle2D.Double(x, y, width, height), in points, measured from the top-left corner
Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
String regionName = "region";
// getPage is zero-based; this assumes pageNumber is a 1-based page number
PDPage page = document.getPage(pageNumber - 1);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
return stripper.getTextForRegion(regionName);
}
}
