Quartz-2D: spotting text other than the main text in PDF book pages - pdf

I would like to know whether it is possible (with Quartz 2D) to programmatically recognize and handle the text at the top (or bottom) of a PDF page that shows the page number, paragraph title, or other information telling you where you are in the book. Is it just text like the main text on the page, or can it somehow be distinguished?

The page number (if printed on the page) is no different from any other text on the page (there are other kinds of page numbers in a PDF file, however).
Some kinds of PDF (PDF/A-1a, 'tagged' PDF) do have things like page numbers and titles marked in a separate way, but in the general case PDF files are neither of these, and page numbers and titles are indistinguishable from the remainder of the text.

Related

What is a format for static documents like PDF but not divided into pages?

PDF is for static documents, so a document is shown the same in different applications, even if it has an unusual layout. But PDF documents are divided into pages because the format is designed for documents to be printed.
I would like to have a document with static content but with no page breaks. Which document format can do that? I guess that it could be achieved with a PDF consisting of a single page as long as it needs to be, but I don't know of any software that could do that, and it seems like an abuse of PDF.
I create PDF documents in LaTeX, and they are almost never printed, so the page layout gets in the way when they are read on a screen. I'm looking for a way to have documents whose layout is fixed because of hyphenation, mathematics and graphics, but which are more suitable for reading on screens.

Search for text in a PDF - double results

I have a question about searching for text in the PDF file attached here:
pdf shared link google drive.
If I search for the text "1500", for example, I see 4 occurrences, but there are only 2 occurrences, on page 2. Likewise, if I search for "musei" I find 2 occurrences, but this text appears only on page 1.
The search parses each single page and seems to find the whole document's text on every page, which is why I get double results.
Can anyone explain why this happens?
Was this PDF file generated in some particular way, compared with other files where text search works fine?
Thanks a lot
That PDF is indeed special: each page contains the text of both pages. On the first page, the text from the second page lies to the right of the right page border, and on the second page, the text from the first page lies to the left of the left page border. Furthermore, the contents of the respective other page are also outside the clip area.
I enlarged the page boxes (media box, crop box, ...) of the first page to the right and of the second page to the left, and then marked all text (Ctrl-A) to show even the text outside the clip area, and you see:
For text extraction that only extracts the text in the visible areas, you should restrict your text extraction routine to the crop box of the respective page.
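A minimal sketch of that restriction, assuming your extraction library already gives you each text fragment together with its position in page coordinates. The `Fragment` and `Box` types, the coordinates, and the sample page are all made up for illustration, and only the origin point of each fragment is tested rather than its full bounding box:

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of extracted text with its origin (hypothetical type)."""
    text: str
    x: float  # horizontal position in PDF user-space units
    y: float  # vertical position in PDF user-space units


class Box(NamedTuple):
    """A page box such as the crop box (hypothetical type)."""
    left: float
    bottom: float
    right: float
    top: float


def inside(box, frag):
    """True if the fragment's origin lies within the box."""
    return box.left <= frag.x <= box.right and box.bottom <= frag.y <= box.top


def visible_text(fragments, crop_box):
    """Keep only the text whose origin is inside the crop box."""
    return [f.text for f in fragments if inside(crop_box, f)]


# Assumed layout of page 2: an A4-sized crop box, with the text of
# page 1 placed to the left of the left page border.
crop = Box(0, 0, 595, 842)
fragments = [
    Fragment("musei", -500, 700),  # off-page copy of page 1 text
    Fragment("1500", 100, 650),
    Fragment("1500", 300, 400),
]
print(visible_text(fragments, crop))  # prints ['1500', '1500']
```

With a filter like this in front of the search, "musei" would no longer be reported on page 2, and each "1500" would be counted once.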

Export PDF Page contents to individual pages

I have a pdf document which contains more than one page within each page.
The original document is only 2 pages - size A4, but has multiple pages on each of the 2 pages.
I need to export each of these "pages within each page" to an individual pdf page.
I have tried increasing the zoom of the pages and printing from there, but it prints incorrectly.
What could I do within Adobe Reader or a similar program to export each of these pages as its own PDF page?
Link to PDF
In Acrobat Reader, you could make clever use of custom poster printing (possibly printing to a new PDF):
https://apple.stackexchange.com/questions/12305/split-a-single-page-pdf-into-multiple-pages
Otherwise you can do any of these:
Splitting single page into two pages with ghostscript
Alternatively you could use other tools such as Inkscape to do the splitting.
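Whichever tool you pick, the splitting comes down to choosing one crop rectangle per sub-page. Here is a rough sketch of that arithmetic, assuming the sub-pages sit on a regular grid (the question does not say so) and using the usual PDF convention of an origin at the bottom-left corner. The function name `grid_boxes` and the 2 x 2 grid are made up for illustration:

```python
def grid_boxes(width, height, columns, rows):
    """Return crop boxes as (left, bottom, right, top) tuples.

    Boxes are listed in reading order: left to right, top to bottom.
    Coordinates assume the origin is at the bottom-left of the page.
    """
    cell_w = width / columns
    cell_h = height / rows
    boxes = []
    for row in range(rows):
        top = height - row * cell_h  # the first row starts at the page top
        for col in range(columns):
            left = col * cell_w
            boxes.append((left, top - cell_h, left + cell_w, top))
    return boxes


# A4 is roughly 595 x 842 points; a 2 x 2 grid is only an assumption.
for box in grid_boxes(595, 842, columns=2, rows=2):
    print(box)
```

Each tuple can then be used as the crop box (or media box) for a copy of the original page in the tool of your choice.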

Is there a way to change the order/way Acrobat selects text of a PDF?

I have a Visual Basic program that extracts text from a PDF and imports the text into Excel. It relies on reading the text like a human, left to right across the page. However, there are places in this particular PDF where, if I select the text with my mouse by clicking and dragging straight across, Adobe starts to select/highlight words on the lines above and below before continuing to highlight across the page. This gives me data that I do not want or need. The page has renderable text and is not from a scanned document.
Is there a way to "reset" the way Adobe interprets the text in the PDF? Since the information on the left is far from the information on the right, it treats them almost like separate columns.
I've tried saving the PDF in different formats, such as TXT or PostScript, and distilling to another PDF, but they all seem to result in the same outcome. This is weird to me because I have other similar PDFs where this isn't an issue.
Any help or thoughts would be greatly appreciated, thanks.
As PDF (in its basic form) essentially means placing strings on a canvas, the concept of "sentence" or "reading order" is not built in.
In order to extract text, you would have to read out the bounding box of the piece of text, and then use some logic and heuristics to assemble your text based on the coordinates of the bounding box.
Things can be easier if the PDF is a structured PDF, where the text content is embedded as text in the document. This is also the prime requirement for an accessible document. So, if your document is accessible, you can rely on its structure for the correct reading order.
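For the coordinate-based approach, a rough sketch of the kind of heuristic meant here: group fragments whose vertical positions are close into one line, then read each line from left to right. It assumes you already have `(text, x, y)` triples from your extraction tool, with `y` increasing upward as in PDF; the function name, the tolerance value and the sample data are made up for illustration:

```python
def reading_order(fragments, line_tolerance=3.0):
    """Assemble (text, x, y) fragments into lines of text.

    Fragments whose y differs from the first fragment of the current
    line by at most `line_tolerance` are treated as the same line.
    """
    # Top of the page first (largest y), then left to right.
    ordered = sorted(fragments, key=lambda f: (-f[2], f[1]))
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in ordered:
        if lines and abs(lines[-1][0] - frag[2]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[2], [frag]])
    return [
        " ".join(f[0] for f in sorted(frags, key=lambda f: f[1]))
        for _, frags in lines
    ]


# Two widely separated "columns" whose baselines differ slightly.
fragments = [
    ("Total", 50, 700),
    ("42.00", 450, 701),
    ("Name", 50, 720),
    ("Alice", 450, 719),
]
print(reading_order(fragments))  # prints ['Name Alice', 'Total 42.00']
```

The tolerance is the fragile part: too small and one visual line splits in two, too large and neighbouring lines merge, so it usually needs tuning per document.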

Parse Body Text from PDF

I have recently been experimenting with parsing the text data from a PDF document using iTextSharp in a VB2010 app. The document doesn't contain any images or other fancy elements, just text. I've read some articles and used some code snippets, and it looks promising. However, what I've been trying to do is parse out just the body of each page, minus any header or footer. I haven't found any guidance for that particular function.
I am currently using the snippet found here: Reading PDF content with itextsharp dll in VB.NET or C#, but it parses all the text on a page. There's got to be a way to get just the body. Or at least I hope so.
PDFs generally do not contain information about logical structure of contained text.
So there are no headers, footers, body, paragraphs or anything like that in a PDF. There is only a bunch of operations like "draw this glyph here" or "move to this position and draw that group of glyphs there". I wrote "glyph" and not "character" because PDFs are not required to contain readable text. Only the visual appearance is required to be specified.
One exception is Tagged PDF, but most PDFs in the wild are not tagged.
Given all of the above, you are probably left with the following approach:
1. Extract all text from each page
2. Analyze the text and find similar parts at the beginning / end of each page
3. Remove the similar parts
This is a heuristic-based detection, so it probably won't always give excellent results.
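The approach above could be sketched roughly as follows, assuming each page's text has already been extracted into a list of lines (with iTextSharp or anything else). The function names, the digit normalization and the thresholds are my own choices for illustration, not part of any library:

```python
import re
from collections import Counter


def normalize(line):
    """Replace digit runs so that "Page 3" and "Page 4" compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, edge_lines=1, min_share=0.6):
    """Drop top/bottom lines that repeat on most pages.

    pages: list of pages, each a list of text lines.
    edge_lines: how many lines at each edge to consider.
    min_share: fraction of pages a line must appear on to be dropped.
    """
    counts = Counter()
    for lines in pages:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({normalize(line) for line in edges})
    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    result = []
    for lines in pages:
        body = list(lines)
        removed = 0
        while body and removed < edge_lines and normalize(body[0]) in repeated:
            body.pop(0)  # header line
            removed += 1
        removed = 0
        while body and removed < edge_lines and normalize(body[-1]) in repeated:
            body.pop()  # footer line
            removed += 1
        result.append(body)
    return result


pages = [
    ["My Book - Chapter 1", "First body line.", "More text.", "Page 1"],
    ["My Book - Chapter 1", "Second page text.", "Page 2"],
    ["My Book - Chapter 1", "Third page text.", "Page 3"],
]
for body in strip_repeated_edges(pages):
    print(body)
```

On this made-up input the header and the "Page N" footer are removed from every page. A body line that happens to match a repeated edge line would be removed too, which is one of the ways the heuristic can go wrong.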