Search for text in a PDF - double results

I have a question about searching for text in the PDF file attached here:
pdf shared link google drive.
If I search for the text "1500", for example, I see 4 occurrences, but there are only 2 occurrences, on page 2. Likewise, if I search for the text "musei", I find 2 occurrences, but this text appears only on page 1.
It looks as if the search parses each single page and finds the text of the whole document on every page, which is why I get double results.
Can anyone explain why this happens?
Was this PDF file generated in some particular way, compared with others where searching for text works fine?
Thanks a lot

That PDF is indeed special: each page contains the text of both pages. On the first page the text from the second page is to the right of the right page border, and on the second page the text from the first page is to the left of the left page border. Furthermore, the contents of the respective other page also lie outside the clip area.
I enlarged the page boxes (media box, crop box, ...) of the first page to the right and of the second page to the left, and then selected all text (Ctrl-A) to reveal even the text outside the clip area; this makes the text of the other page visible beside each page.
To extract only the text in the visible areas, restrict your text extraction routine to the crop box of the respective page.
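A minimal sketch of that restriction in plain Java, not tied to any particular PDF library: each extracted fragment carries the position it was drawn at, and fragments whose position falls outside the page's crop box are dropped. The `Fragment` class, the `visibleText` helper, and the sample coordinates are made up for illustration; with PDFBox the rectangle would presumably come from `PDPage.getCropBox()` and the positions from the text stripper's `TextPosition` objects.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CropBoxFilter {
    /** Hypothetical stand-in for one piece of extracted text and the point where it starts. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps only the fragments whose start point lies inside the crop box. */
    static List<String> visibleText(List<Fragment> fragments, Rectangle2D cropBox) {
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (cropBox.contains(f.x, f.y)) {
                kept.add(f.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed A4-sized crop box in PDF points; real values come from the page itself.
        Rectangle2D cropBox = new Rectangle2D.Double(0, 0, 595, 842);
        List<Fragment> page1 = Arrays.asList(
                new Fragment("musei", 72, 700),   // inside the page
                new Fragment("1500", 667, 700),   // right of the right page border
                new Fragment("1500", 667, 400));  // right of the right page border
        System.out.println(visibleText(page1, cropBox)); // prints [musei]
    }
}
```

With a filter like this applied per page, a search for "1500" on the first page would no longer count the copies that sit beyond the right border.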

Related

How to detect visible text in a text field in a PDF?

When using PDFBox to populate a text field in a form in a PDF, it is possible that the text overflows the text field and is not visible when opening the PDF in a viewer.
Question: Is it possible to use PDFBox to detect how much text within a text field is visible?
At the risk of falling victim to an XY problem, here is the context in which this came up.
I have a PDF which is provided by the Danish government, and the software I am creating needs to be able to fill out this form programmatically. On pages 5 and 6 of this document, there is a large blank area that needs to be filled out. The way the PDF creators designed it, they just made two text fields (named Text57 and Text58), which a person directly filling out the form would manually need to jump between.
The problem is, I need to be able to populate these fields with text, and if the text is too large to fit in the first text field, then it needs to overflow into the second text field. However, I do not seem to have any way of actually detecting when the text overflows in the first text field.
One workaround that could be acceptable would be to modify the document to remove the second text field and have the first text field span multiple pages, but from playing around in Acrobat, this does not seem to be possible.
The PDF in question can be found here: https://www.trafikstyrelsen.dk/~/media/Dokumenter/10%20Bolig/Bolig/Private%20lejeboliger/Lejekontrakt/typeformular-a.pdf
Here is a code snippet which populates the problematic field with 100 lines numbered from 1 to 100.
// Load the form and look up the first of the two free-text fields.
PDDocument document = PDDocument.load(new File("typeformular-a.pdf"));
PDField text57 = document.getDocumentCatalog().getAcroForm().getField("Text57");
// Fill it with the numbers 1 to 100, one per line.
text57.setValue(IntStream.range(1, 101).mapToObj(Integer::toString)
        .collect(Collectors.joining(System.lineSeparator())));
document.save("typeformular-a.out.pdf");
document.close();
After the code is run, we can see that the text gets cut off after line 44. Of course I cannot simply count lines in my text, because under normal circumstances the lines in the text will wrap, which would invalidate that approach.
Auxiliary question: Is there any other approach that could solve this original problem of splitting text across multiple pages?
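One possible direction, sketched here in plain Java under loud assumptions: wrap the text yourself with the same font metrics the field uses, count how many wrapped lines fit into the height of Text57, and send the rest to Text58. The `FieldSplitter` class and all of its parameters are hypothetical; the width function stands in for real font metrics (in PDFBox, `PDFont.getStringWidth` reports widths in 1/1000 of the font size), and a viewer or the appearance generation may wrap slightly differently, so this estimates what fits rather than detecting what is actually visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

class FieldSplitter {
    /** Greedy word wrap: breaks each paragraph into lines no wider than maxWidth. */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> width) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : text.split("\\R", -1)) {
            StringBuilder line = new StringBuilder();
            for (String word : paragraph.split(" ")) {
                String candidate = line.length() == 0 ? word : line + " " + word;
                if (line.length() > 0 && width.applyAsDouble(candidate) > maxWidth) {
                    // The word does not fit: close the current line and start a new one.
                    lines.add(line.toString());
                    line = new StringBuilder(word);
                } else {
                    line = new StringBuilder(candidate);
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    /** Returns two strings: the text for the first field and the overflow for the second. */
    static String[] split(String text, double fieldWidth, double fieldHeight,
                          double lineHeight, ToDoubleFunction<String> width) {
        List<String> lines = wrap(text, fieldWidth, width);
        int fitting = Math.min(lines.size(), (int) Math.floor(fieldHeight / lineHeight));
        String first = String.join("\n", lines.subList(0, fitting));
        String rest = String.join("\n", lines.subList(fitting, lines.size()));
        return new String[] {first, rest};
    }

    public static void main(String[] args) {
        // Assumed metrics: every character is 5 units wide, lines are 12 units high.
        ToDoubleFunction<String> width = s -> s.length() * 5.0;
        String[] parts = split("one two three four five six", 40, 24, 12, width);
        System.out.println(Arrays.toString(parts));
    }
}
```

The two returned strings could then be passed to `setValue` on Text57 and Text58 respectively. The field width, field height, and line height would have to be taken from the real form, which this sketch does not attempt.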

How do I prevent the last sentence in each MS Word page from breaking to the next page?

I want the last sentence on each page to be a non-breaking sentence in MS Word.
The "Page break before" option does not work for me because it moves the whole paragraph to the next page. I want the page to break right after the last punctuation mark on each page. In other words, sentences should not overflow to the next page.
So I think I need to detect the last punctuation mark on the page and insert an [Enter]/[Page Break] after it. How can I do this in VB.NET?
Or is there any other, simpler solution?
How much text is shown on a page is determined dynamically by Word, based on the margin settings and the current printer driver. Word calculates the layout dynamically, and it's not possible to force specific content onto a page, as it is with page layout software. Increasing the margin settings would be a possible approach, but this would apply to the entire document or an entire Section.
To prevent single lines from being split off, Word has the following settings in the Paragraph Format/Line and Page Breaks dialog box, with equivalents in the object model:
Widow/Orphan control (active by default on installation) - Paragraph.Format.WidowControl (Boolean)
True if the first and last lines in the specified paragraph remain on the same page as the rest of the paragraph when Word repaginates the document.
Keep lines together - Paragraph.Format.KeepTogether
True if all lines in the specified paragraphs remain on the same page when Microsoft Word repaginates the document.
The only way to force a page's content would be to put the content in a Drawing AutoShape TextBox (Shape.Type = msoTextBox). Draw the text box large enough for all the content (it can extend beyond a margin setting) and insert the content. But note that Word's normal editing behavior will not be the same. For example, adding or deleting content on a previous page will not change the content in the text box and could end up "kicking" it to another page entirely, with unwanted white space on other pages.

PDFBox retrieve text from overlapping boxes

I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, some of the PDFs I am scraping have text that is in slightly different places from page to page. I'm looking for help in how to deal with this.
In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.
Is there a way to do this?
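As a sketch of the selection logic only, in plain Java: instead of keeping the text that lies inside the region, keep every text box whose bounding rectangle intersects it. The `TextBox` class, the `textOverlapping` helper, and the coordinates are hypothetical, and how the bounding rectangles of the text boxes shown in Acrobat would be obtained from the PDF is left open here.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlapSelector {
    /** Hypothetical text box: its bounding rectangle plus the text it holds. */
    static final class TextBox {
        final Rectangle2D bounds;
        final String text;

        TextBox(Rectangle2D bounds, String text) {
            this.bounds = bounds;
            this.text = text;
        }
    }

    /** Returns the text of every box that overlaps the region, even if only partly. */
    static List<String> textOverlapping(List<TextBox> boxes, Rectangle2D region) {
        List<String> result = new ArrayList<>();
        for (TextBox box : boxes) {
            if (box.bounds.intersects(region)) {
                result.add(box.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed region of interest, e.g. one of the marked rectangles.
        Rectangle2D region = new Rectangle2D.Double(100, 100, 200, 50);
        List<TextBox> boxes = Arrays.asList(
                new TextBox(new Rectangle2D.Double(80, 120, 100, 40), "partly inside"),
                new TextBox(new Rectangle2D.Double(320, 100, 50, 50), "outside"),
                new TextBox(new Rectangle2D.Double(250, 90, 100, 20), "also overlapping"));
        System.out.println(textOverlapping(boxes, region)); // prints [partly inside, also overlapping]
    }
}
```

Because the test is overlap rather than containment, small shifts of the text from page to page still select the same boxes as long as they touch the region.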

Quartz-2D: spotting text other than the main text in PDF book pages

I would like to know whether it is possible, with Quartz 2D, to programmatically recognize and handle the text at the top (or bottom) of a PDF page that shows the page number, paragraph title, or other information telling you where you are in the book. Is it just text like the main text on the page, or can it somehow be distinguished?
The page number (if printed on the page) is no different to any other text on the page (there are other kinds of page numbers in a PDF file however).
Some kinds of PDF (PDF/A-1a, 'tagged' PDF) do have things like page numbers and titles marked in a separate way, but in the general case PDF files are neither of these and the page number or titles are indistinguishable from the remainder of the text.

set margins for 1st page different than the rest of the pages

I need to set the margins differently for the first page than the rest of the pages.
I've messed around with inserting section breaks (from what I've read, Word creates a section break when you choose to apply "this point forward" from the Page Setup Margins tab), but I can't seem to consistently create a continuous section break at the start of the second page.
If not section breaks, any other way would be fine. Need to adjust margins to match new letterhead design for a bunch of existing documents so am planning on fixing the margins in a sub-routine when the print button is clicked (part of another macro).
A Continuous Section Break is used to allow multiple sets of margins within the same page. A Next Page Section Break, which has the properties of both a section break and a page break, will allow one set of margins for the first page and another set of margins for all pages following it. One way it can be created is like this:
Selection.InsertBreak Type:=wdSectionBreakNextPage
Now here's where it gets tricky. Word has two different types of page breaks: Automatic and Manual. Automatic page breaks get created when text no longer fits on a page and Word automatically generates a new page. If the documents you are reformatting via the macro have automatic page breaks, inserting the Next Page Section Break at the end of the first page will cause Word to delete its Automatic Page Break (using the Next Page Section Break to keep the pages separate) and any margin changes you make to the first page will not carry over to the following pages. However, if the documents contain a Manual Page Break between pages one and two, inserting the Next Page Section Break will create a blank second page. As such, if this is a possibility, code will need to be written to detect the Manual Page Break and delete it after the Next Page Section Break has been inserted.
Letterhead layouts in Word can be really difficult and tricky if you need different values only on the first page. In my practice, I often find letterheads with graphical elements in the right margin of the first page up to a certain height, e.g. a list of partner names or business information. So on the first page the right margin should be 6 cm, while on all following pages it should be 2.5 cm.
Using a section break is not possible due to the fact that it moves while the user inserts text.
I've used the following approach with some success:
Create a text box in the first page header that is big enough to occupy the space needed. Set the "Text Wrapping" property to Square, so that text cannot overlap the box.
Of course you can insert the text box also into the document body to have that effect. Unfortunately, users can then touch the text box easily in a mouse action, and move it to another position. If you put it into the first page header, it will appear only on the first page, and will appear "in the background" of the page. The user can enter text in the document body, but it will stop before the text box, which simulates a right margin on the first page.