How to detect if a pdf is text searchable or non text searchable? - vb.net

i have a set of pdfs, from which i want to process( VB.NET) only those which are non text searchable, can you please tell me how to go about this?

Generally speaking, the way to do this is open up each page and rip the content stream and see if any text operators are executed that place text on the page.
Let me explain what that means - PDF content is a small RPN language that contains operations that mark the page in some way. For example, you might see something like this:
BT 72 400 Td /F0 12 Tf (Throatwarbler Mangrove) Tj ET
Which means:
Begin a text area
Set the position of the text baseline to (72, 400) in PDF units
Set the font to a resource named F0 from the current page's font resource dictionary
Draw the text "Throatwarbler Mangrove"
End a text area
So you can try short cuts
does my page's resource dictionary contain any fonts?
This will fail in some cases because some PDF generation tools put fonts into the resource
dictionary and don't use them (false positive). It will also fail if the page content contains a Form XObject which contains text (false negative).
does my page's content stream have BT/ET opertors?
This will get you closer, but will fail if there is not content in them (false positive) or if they're not present, but there's a Form XObject which contains text (false negative).
So really, the thing to do is to execute the entire page's content stream, including recursing on all XObject to look for text operators.
Now, there's another approach that you can take using my Atalasoft's software (disclaimer, I work for Atalasoft and have written most of the PDF handling code, I also worked on Acrobat versions 1-4). Instead of asking, does this page contain any text, you can ask "does this page contain only a single image?"
bool allPagesImages = true;
using (Document doc = new Document(inputStream))
{
foreach (Page p in doc.Pages)
{
if (!p.SingleImageOnly)
{
allPagesImages = false;
break;
}
}
}
Which will leave allPagesImages with a pretty decent indication that each page is all images, which if you're looking to OCR is the non-searchable documents, is probably what you really want.
The down side is that this will be a very high price for a single predicate, but it also gets you a PDF rasterizer and the ability to extract the images directly out of the file.
Now, I have no doubt that a solid engineer could work their way through the PDF spec and write some code to extend iTextPdfSharp to do this task I think that if I sat down with it, I might be able to write that predicate in a few days, but I already know most of the PDF spec. So it might take you more like two weeks to a month. So your choice.

I think this option could be your consideration, though I haven't tested the code yet but I think it can be done by read the properties for each PDF files that you want to proceed.
You might check this link :
http://www.codeguru.com/columns/vb/manipulating-pdf-files-with-itextsharp-and-vb.net-2012.htm
You have to read the producer properties right after you proceeded it. That's just only example. But my advice please include your code here so we can give a try to help you. Bless you

Related

Content in meta tag

Like the first image, the meta tag is displayed correctly in inspect elements mode but incorrectly displayed in view page source mode as in the second image. Thank you for suggesting a solution to this problem.
I understood the answer:
Because, by default, the HTML encoding engine will only safelist the basic latin alphabet (because browsers have bugs. So we're trying to protect against unknown problems). The &XXX values you see still render as correctly as you can see in your screen shots, so there's no real harm, aside from the increased page size.
If the increased page size bothers you then you can customise the encoder to safe list your own character pages (not language, Unicode doesn't think in terms on language)
To widen the characters treated as safe by the encoder you would insert the following line into the ConfigureServices() method in startup.cs;
services.AddSingleton<HtmlEncoder>(
HtmlEncoder.Create(allowedRanges: new[] { UnicodeRanges.BasicLatin,
UnicodeRanges.Arabic }));
Arabic has quite a few blocks in Unicode, so you may need to add more blocks to get the full range you need.

How to get rid of unwanted extra pages when converting a goole document to pdf via google-apps-script?

I have an old script that (among other things) converts a google document to pdf.
It used to work ok, but now two extra blank pages appear in the pdf version of the file.
I just discovered that this problem affects also the "download as pdf" menu option in google documents. There is a number of workarounds in that case, but I need a workaround for google-apps-script.
In this post the solution to a similar problem seems to involve a fine tuning of the page size. I tried something like that, but it does not trivially apply.
I also tried some other (kind of random) variations for the page size and margins, but to no avail.
Below I'm pasting a minimal working example. It should create a document file "test" and its pdf version "test.pdf" in your main drive folder.
Any help getting rid of the two extra pages is greatly appreciated.
Thanks
function myFunction() {
// this function
// - creates a google document "test",
// - writes "this is a test" inside it
// - saves and closes the document
// - creates a pdf version of the document, called "test.pdf"
//
// the conversion is ok, except two extra blank pages appear in the pdf version.
// create google document
var doc = DocumentApp.create('test');
var docFile = DriveApp.getFileById( doc.getId() );
// set margins (I need landscape layout)
// this is an attempt to a solution, inspired by https://stackoverflow.com/questions/18426817/extra-blank-page-when-converting-html-to-pdf
var body = doc.getBody();
body.setPageHeight(595.2).setPageWidth(841.8);
var mrg = 40; // in points
body.setMarginTop(mrg).setMarginBottom(mrg);
body.setMarginLeft(mrg).setMarginRight(mrg);
// write something
body.appendParagraph('this is a test').setHeading(DocumentApp.ParagraphHeading.HEADING2).setAlignment(DocumentApp.HorizontalAlignment.CENTER);
// save and close file
doc.saveAndClose();
// convert file to pdf
var docblob = docFile.getAs('application/pdf');
// set pdf name
docblob.setName("test.pdf");
// save pdf file
var file = DriveApp.createFile(docblob);
}
I found the source of the problem and a solution in this post on the google product forum, dating 8 months back.
The extra pages appear in the pdf if the option in view -> print layout is not checked.
I did some further tests, with my accounts and my colleagues'.
The results are consistent:
when view -> print layout is not checked two extra pages appear in the pdf version of the document
when view -> print layout is checked the pdf version of the document has the expected number of pages.
this setting affects also the documentApp services in Google Apps Script. That is: the above script produces the expected pdf version only if the "view->print layout" option in Google Documents is checked.
I do not see how this behaviour could be a "feature", so I think it's a bug. By the way "print layout" does not seem to have any visible effect on my documents (other than messing up the pdf version). I'm surprised that after 8 months the bug is still out there.
Number 3 above surprised me, because I did not think that an option set manually in a (any) google document would affect my scripts.
I'm currently looking for a way of setting the "print layout" option from inside the script. So far I had no luck with that.

Java- stretch pdf pages content

How can I make the page larger in the way that the text/images will stretch
according to the new size?
I only found ways to scale down but not scale up... any idea?
Thanks in advance!!!
There are two conceptional options:
For each page enlarge MediaBox and CropBox, prepend current transformation matrix scaling operation to the page content stream or array of streams, and update annotation positions and sizes.
For each page set the property UserUnit to a value > 1.
The second option can e.g. be implemented like this using PDFBox and a stretch factor of 1.7:
PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getPages()) {
page.getCOSObject().setFloat("UserUnit", 1.7f);
}
document.save(TARGET);
(The second option obviously is much easier to implement than the first one. Feature incomplete PDF viewers might ignore this value, though. If you need to support such incomplete viewers, you probably have to go for the first option.)

Ghostscript - create a pdf with multiple identical pages and keep size down

Im trying to use Ghostscript to create a PDF with multiple identical pages. I will later use this together with another multipaged PDF to stamp on unique information onto every page.
Is it possible to use Ghostscript to create such a PDF and keep the size of the final file down? Maby there is a flag that i have not noticed that can do this in a better way than the script below?
I have tried to use a regular merge command like the one below but the size of the resulting PDF grows alot and the original file size of 2,061MB merged to a 100page pdf results in a final size of 46,117MB.
"C:\Program Files\gs\gs9.20\bin\gswin64.exe"^
-dBATCH^
-dNOPAUSE^
-q^
-sDEVICE=pdfwrite^
-sOutputFile=outputpdf.pdf^
"inputpdf.pdf"^
"inputpdf.pdf"^
"inputpdf.pdf"(and so on 100 times)
You can construct such a file manually easily enough, which is much smaller, by reusing the page content stream for each page.
However Ghostscript's pdfwrite device won;t do that, not least because it can't. It cannot know in advance that the page its about to receive is the same as the previous page. As a result it will create a new page content stream for each page, and create new content for it.
Note that resources (forms, patterns, colour spaces, image XObjects etc) which are used on each page will be reused on other pages.
However, it seems to me that you're already getting nearly a 5:1 ratio (2k * 100 pages = 200Kb, the final file is 46Kb) though in fairness a good bit of that 2Kb is 'stuff' around the page.
Without seeing your input file I can't really comment any further, but frankly I doubt its possible to make it any smaller without hand-crafting the file. What's the problem with a 46Kb file anyway ?

UILables, Text Flow and Layouts

Consider the following, I have paragraph data being sent to a view which needs to be placed over a background image, which has at the top and the bottom, fixed elements (fig1)
Fig1.
My thought was to split this into 4 labels (Fig1.example2) my question here is how I can get the text to flow through labels 1 - 4 given that label 1,2 & 3 ar of fixed height. I assumed here that label 3 should be populated prior to 4 hence the layout in the attached diagram.
Can someone suggest the best way of doing this with maybe an example?
Thanks
Wish I could help more, but I think I can at least point you in the right direction.
First, your idea seems very possible, but would involve lots of calculations of text size that would be ugly and might not produce ideal results. The way I see it working is a binary search of testing portions of your string with sizeWithFont: until you can get the best guess for what the label will fit into that size and still look "right". Then you have to actually break up the string and track it in pieces... just seems wrong.
In iOS 6 (unfortunately doesn't apply to you right now but I'll post it as a potential benefit to others), you could probably use one UILabel and an NSAttributed string. There would be a couple of options to go with here, (I haven't done it so I'm not sure which would be the best) but it seems that if you could format the page with html, you can initialize the attributed string that way.
From the docs:
You can create an attributed string from HTML data using the initialization methods initWithHTML:documentAttributes: and initWithHTML:baseURL:documentAttributes:. The methods return text attributes defined by the HTML as the attributes of the string. They return document-level attributes defined by the HTML, such as paper and margin sizes, by reference to an NSDictionary object, as described in “RTF Files and Attributed Strings.” The methods translate HTML as well as possible into structures of the Cocoa text system, but the Application Kit does not provide complete, true rendering of arbitrary HTML.
An alternative here would be to just use the available attributes, setting line indents and such according to the image size. I haven't worked with attributed strings at this level, so I the best reference would be the developer videos and the programming guide for NSAttributedString. https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/AttributedStrings/AttributedStrings.html#//apple_ref/doc/uid/10000036-BBCCGDBG
For lesser versions of iOS, you'd probably be better off becoming familiar with CoreText. In the end you'll be rewarded with a better looking result, reusability/flexibility, the list goes on. For that, I would start with the CoreText programming guide: https://developer.apple.com/library/mac/#documentation/StringsTextFonts/Conceptual/CoreText_Programming/Introduction/Introduction.html
Maybe someone else can provide some sample code, but I think just looking through the docs will give you less of a headache than trying to calculate 4 labels like that.
EDIT:
I changed the link for CoreText
You have to go with CoreText: create your AttributedString and a CTFramesetter with it.
Then you can get a CTFrame for each of your textboxes and draw it in your graphics context.
https://developer.apple.com/library/mac/#documentation/Carbon/Reference/CTFramesetterRef/Reference/reference.html#//apple_ref/doc/uid/TP40005105
You can also use a UIWebView