I want to open and view a password-protected PDF file in a VB6/VB.NET program. I have tried using the Acrobat PDF Library but could not get it to work.
The reason I want to create a password-protected PDF file is that I don't want the file to be opened without the password externally, i.e. outside the program.
To open a password-protected PDF yourself, you would need to develop at least a PDF parser, a decryptor and a generator. I wouldn't recommend doing that, though; it is far from an easy task.
With the help of a PDF library everything is much simpler. You might want to try the Docotic.Pdf library for the task.
Here is a sample for your task:
public static void unprotectPdf(string input, string output)
{
    bool passwordProtected = PdfDocument.IsPasswordProtected(input);
    if (passwordProtected)
    {
        string password = null; // retrieve the password somehow
        using (PdfDocument doc = new PdfDocument(input, password))
        {
            // clear both passwords in order
            // to produce unprotected document
            doc.OwnerPassword = "";
            doc.UserPassword = "";
            doc.Save(output);
        }
    }
    else
    {
        // no decryption is required
        File.Copy(input, output, true);
    }
}
Docotic.Pdf can also extract text (formatted or not) from PDFs. That might be useful for indexing (I guess that is what you are up to, because you mentioned Adobe IFilter).
You can convert the code to VB.NET with an online converter.
I have an application where I need to show .pptx and .pdf files. For PDF files I am using react-native-pdf, and the files open fine inside my app, but for .pptx files we have two libraries:
1. https://www.npmjs.com/package/react-native-doc-viewer
2. https://www.npmjs.com/package/react-native-file-viewer
react-native-doc-viewer is not actively maintained and has a lot of open issues. :(
Both of them prompt the user to select an app such as WPS Office or a Microsoft app, instead of opening the file inside my app the way the PDF files open. What is the reason behind this? Is it not possible to open a .pptx file inside our app?
I read the Android native code of react-native-doc-viewer. It actually downloads a document rather than viewing it. The following is the code:
@ReactMethod
public void openDoc(ReadableArray args, Callback callback) {
    final ReadableMap arg_object = args.getMap(0);
    try {
        if (arg_object.getString("url") != null && arg_object.getString("fileName") != null) {
            // parameter parsing
            final String url = arg_object.getString("url");
            final String fileName = arg_object.getString("fileName");
            final String fileType = arg_object.getString("fileType");
            final Boolean cache = arg_object.getBoolean("cache");
            final byte[] bytesData = new byte[0];
            // Begin the Download Task
            new FileDownloaderAsyncTask(callback, url, cache, fileName, fileType, bytesData).execute();
        } else {
            callback.invoke(false);
        }
    } catch (Exception e) {
        callback.invoke(e.getMessage());
    }
}
It uses FileDownloaderAsyncTask to download the file, if you are familiar with it.
If you want to show Excel or Docx files, you can use the Google Docs online viewer to convert them to HTML and then show the result in a WebView. The URL format is https://docs.google.com/gview?embedded=true&url=[doc address]; this gives the same effect as on iOS.
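As a rough sketch of building that viewer URL: the class and method names below are made up for illustration, and percent-encoding the document address is my assumption about what the viewer expects (it is the usual way to pass one URL inside another URL's query string). As far as I know, the document also has to be reachable from the internet so that the viewer can fetch it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of any of the libraries mentioned above.
final class GoogleViewerUrl {
    // Prefix taken from the URL format quoted in the answer.
    private static final String VIEWER_PREFIX =
            "https://docs.google.com/gview?embedded=true&url=";

    // Builds the viewer URL for the given document address.
    static String forDocument(String documentUrl) {
        try {
            // Encode the address so that its own ':', '/', '?' and '&'
            // cannot be mistaken for parts of the viewer's query string.
            return VIEWER_PREFIX + URLEncoder.encode(documentUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forDocument("https://example.com/files/slides.pptx"));
    }
}
```

The resulting string is what you would hand to the WebView as its source.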
We have a method to check whether a checkbox in a PDF (no forms) is checked, and it works great on one company's PDFs. But on another company's PDFs there is no way to tell whether the checkbox is checked.
Here is the code that works on one company's PDFs:
protected static final String[] HOLLOW_CHECKBOX = {"\uF06F", "\u0086"};
protected static final String[] FILLED_CHECKBOX = {"\uF06E", "\u0084"};

protected boolean isBoxChecked(String boxLabel, String content) {
    content = content.trim();
    for (String checkCharacter : FILLED_CHECKBOX) {
        String option = String.format("%s %s", checkCharacter, boxLabel);
        String option2 = String.format("%s%s", checkCharacter, boxLabel);
        if (content.contains(option) || content.contains("\u0084 ") || content.contains(option2)) {
            return true;
        }
    }
    return false;
}
However, when I do the same for another company's PDF, there is nothing in the extracted text near the checkbox to tell us whether it is checked.
The big issue is that we have no XML schema, no metadata, and no forms in these PDFs. All we have is the raw extracted String, and a checkbox is difficult to represent in a String. Here is an example of the code that pulls all the text between a start page and an end page of the PDF:
protected String getTextFromPages(int startPage, int endPage, PDDocument document) throws IOException {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(startPage);
    stripper.setEndPage(endPage);
    return stripper.getText(document);
}
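To double-check what the stripper really returns around a checkbox label, a small diagnostic like the following might help. It is only a sketch: the class and method names are hypothetical, and it works on any extracted String, independent of PDFBox. It lists the UTF-16 code units right before a label so that unusual glyph codes (like the \uF06E used in the first vendor's files) become visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper for inspecting extracted text.
final class GlyphDump {
    // Returns the code units of up to `count` characters that precede the first
    // occurrence of `label` in `content`, formatted like "U+F06E".
    // Returns an empty list if the label does not occur.
    static List<String> codePointsBefore(String content, String label, int count) {
        List<String> result = new ArrayList<>();
        int labelStart = content.indexOf(label);
        if (labelStart < 0) {
            return result;
        }
        int from = Math.max(0, labelStart - count);
        for (int i = from; i < labelStart; i++) {
            result.add(String.format("U+%04X", (int) content.charAt(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        // A filled checkbox glyph followed by a space and the label.
        System.out.println(codePointsBefore("\uF06E Married", "Married", 2));
        // prints [U+F06E, U+0020]
    }
}
```

If the dump shows only ordinary spaces and letters for the second vendor's files, the checkbox is most likely drawn as vector graphics or an image and never appears in the text at all.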
I wish the PDFs offered an easier way to extract the text/data, but the vendors that make them decided it was better to leave that out.
No, we cannot have the vendors or other companies change anything. We receive these PDFs from the court system; they are submitted by lawyers we don't know, who bought the PDF software that generates these files.
We also cannot go the even longer way of trying to figure out the object model that PDFBox creates of the document, with things like
o.apache.pdfbox.util.PDFStreamEngine - processing substream token: PDFOperator{Tf}
because these are 80-100 page PDFs and it would take us years just to write code that parses one vendor's format.
We use an unmanaged DLL that has a function to replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to a managed solution (iTextSharp or PDFsharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general purpose PDF library, it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        byte[] contentBytes = reader.GetPageContent(1);
        string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
        contentString = contentString.Replace(origText, replaceText);
        reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if during your move to managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
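To illustrate why a plain string replacement only works for simply constructed documents, here is a small library-free sketch. The two content stream snippets are made up for the example: one draws the text with a single Tj operator, the other with a TJ array in which a kerning adjustment splits the word.

```java
// Illustration only: operates on content stream text, not on a real PDF file.
final class NaiveReplaceDemo {
    // The same kind of replacement the code above performs on the page content.
    static String replaceInContent(String contentStream, String search, String replacement) {
        return contentStream.replace(search, replacement);
    }

    public static void main(String[] args) {
        // Text drawn as one string: the search text appears literally.
        String simple = "BT /F1 12 Tf 72 700 Td (Moby Dick) Tj ET";
        // Text drawn with a kerning adjustment: "Moby" is split into "Mo" and "by".
        String kerned = "BT /F1 12 Tf 72 700 Td [(Mo) -15 (by Dick)] TJ ET";

        System.out.println(replaceInContent(simple, "Moby", "Mary"));
        System.out.println(replaceInContent(kerned, "Moby", "Mary"));
    }
}
```

The first stream comes back with "(Mary Dick)", the second one comes back unchanged because "Moby" never occurs in it as a contiguous string. Subsetted fonts with custom encodings make matters worse, since the bytes in the stream then need not resemble the visible text at all.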
For PDFsharp users, here is a somewhat usable function. I copied it from my project; it uses a utility method that is also consumed by other methods, hence the unused result.
It ignores whitespace created by kerning and therefore may mess up the result (all characters in the same space), depending on the source material.
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
    ModifyPdfContentStreams(contentPage, stream =>
    {
        if (!stream.TryUnfilter())
            return false;
        var search = string.Join("\\s*", source.Select(c => c.ToString()));
        var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
        if (!Regex.IsMatch(stringStream, search))
            return false;
        stringStream = Regex.Replace(stringStream, search, target);
        stream.Value = Encoding.Default.GetBytes(stringStream);
        stream.Zip();
        return false;
    });
}

public static void ModifyPdfContentStreams(PdfPage contentPage, Func<PdfDictionary.PdfStream, bool> Modification)
{
    for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
        if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
            return;
    var resources = contentPage.Elements?.GetDictionary("/Resources");
    var xObjects = resources?.Elements.GetDictionary("/XObject");
    if (xObjects == null)
        return;
    foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
    {
        var stream = (item.Value as PdfDictionary)?.Stream;
        if (stream != null)
            if (Modification(stream))
                return;
    }
}
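The search pattern built above (the characters of the search text joined by \s*) can be tried out in isolation. Below is a sketch of the same idea in Java; the class name is hypothetical. Unlike the C# code above, it quotes every character, so regex metacharacters such as '.' or '(' in the search text are matched literally. That quoting is my addition, not part of the original function.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper that mirrors the whitespace-tolerant search pattern.
final class SpacedPattern {
    // Builds a pattern that matches the characters of `source` in order,
    // with any amount of whitespace allowed between neighbouring characters.
    static Pattern build(String source) {
        String regex = source.chars()
                .mapToObj(c -> Pattern.quote(String.valueOf((char) c)))
                .collect(Collectors.joining("\\s*"));
        return Pattern.compile(regex);
    }

    // Replaces every match with `target`, treating `target` as literal text.
    static String replace(String text, String source, String target) {
        return build(source).matcher(text).replaceAll(Matcher.quoteReplacement(target));
    }

    public static void main(String[] args) {
        System.out.println(replace("(M o by Dick)", "Moby", "Mary")); // prints (Mary Dick)
    }
}
```

Note that this only bridges whitespace. A kerning adjustment inside a TJ array, such as ") -15 (", is not whitespace and still breaks the match.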
I am using PDFBox to validate a PDF document, and one of the validations checks whether the PDF document is printable.
I use the following code to perform this check:
PDDocument document = PDDocument.load("<path_to_pdf_file>");
System.out.println(document.getCurrentAccessPermission().canPrint());
but this returns true even though the print icon is disabled when the PDF is opened.
Access permissions are integrated into a document by means of encryption.
Even PDF documents which don't ask for a password when opened in Acrobat Reader may be encrypted; they are essentially encrypted using a default password. This is the case in your PDF.
PDFBox determines the permissions of an encrypted PDF only while decrypting it, not already when loading a PDDocument. Thus, you have to try and decrypt the document before inspecting its properties if it is encrypted.
In your case:
PDDocument document = PDDocument.load("<path_to_pdf_file>");
if (document.isEncrypted())
{
    document.decrypt("");
}
System.out.println(document.getCurrentAccessPermission().canPrint());
The empty string "" represents the default password. If the file is encrypted using a different password, you'll get an exception here. Thus, catch accordingly.
PS: If you do not know all the passwords in question, you may still use PDFBox to check the permissions, but you have to work more low-level:
PDDocument document = PDDocument.load("<path_to_pdf_file>");
if (document.isEncrypted())
{
    // Bit 3 of the permissions value (counting from 1) is the print permission
    final int PRINT_BIT = 3;
    PDEncryptionDictionary encryptionDictionary = document.getEncryptionDictionary();
    int perms = encryptionDictionary.getPermissions();
    boolean printAllowed = (perms & (1 << (PRINT_BIT-1))) != 0;
    System.out.println("Document encrypted; printing allowed? " + printAllowed);
}
else
{
    System.out.println("Document not encrypted; printing allowed? true");
}
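The bit test itself can be checked without PDFBox. The sketch below uses a hypothetical helper class; the bit positions (3 print, 4 modify, 5 copy/extract, 6 annotate; counted from 1) are those of the standard security handler in the PDF specification, and the two sample values are just typical examples of what the P entry of an encryption dictionary may contain.

```java
// Hypothetical helper for decoding the P (permissions) value of an encrypted PDF.
final class PdfPermissions {
    // Bit positions as numbered in the PDF specification (lowest bit is bit 1).
    static final int PRINT_BIT = 3;
    static final int MODIFY_BIT = 4;
    static final int EXTRACT_BIT = 5;
    static final int ANNOTATE_BIT = 6;

    // True if the given 1-based bit is set in the permissions value.
    static boolean isAllowed(int permissions, int bitPosition) {
        return (permissions & (1 << (bitPosition - 1))) != 0;
    }

    public static void main(String[] args) {
        int allAllowed = -4;      // 0xFFFFFFFC: every permission bit set
        int restricted = -3904;   // 0xFFFFF0C0: bits 3 to 6 cleared
        System.out.println(isAllowed(allAllowed, PRINT_BIT));   // prints true
        System.out.println(isAllowed(restricted, PRINT_BIT));   // prints false
    }
}
```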
While extracting content from a PDF using the MuPDF library, I am getting only the font name, not its font face.
Do I have to guess (e.g. look for "bold" in the font name, though that is not the right way), or is there any other way to detect whether a specific font is bold, italic or plain?
I have used iTextSharp to extract the font family, font color etc.:
public void Extract_inputpdf() {
    text_input_File = string.Empty;
    StringBuilder sb_inputpdf = new StringBuilder();
    PdfReader reader_inputPdf = new PdfReader(path); //read PDF
    // iTextSharp page numbers are 1-based, so start at page 1
    for (int i = 1; i <= reader_inputPdf.NumberOfPages; i++) {
        TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
        text_input_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);
        sb_inputpdf.Append(text_input_File);
        input_pdf = sb_inputpdf.ToString();
    }
    reader_inputPdf.Close();
    clear();
}
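public class TextWithFont_inputPdf: iTextSharp.text.pdf.parser.ITextExtractionStrategy {
    // Collected output. Appending the chunk text in RenderText is an assumption;
    // the original snippet did not show how the result is filled.
    private readonly StringBuilder result = new StringBuilder();
    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
        string curFont = renderInfo.GetFont().PostscriptFontName;
        string divide = curFont;
        string[] fontnames = null;
        // split the words from the PostScript name here if you want them separately
        result.Append(renderInfo.GetText());
    }
    // Remaining interface members; nothing to do for this use case.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo) { }
    public string GetResultantText() {
        return result.ToString();
    }
}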
The PDF spec contains entries which allow you to specify the style of a font. Unfortunately, in the real world you will often find that these are absent.
If the font is referenced rather than embedded, this generally means you are stuck with the PostScript name of the font. It requires some heuristics, but normally the name provides sufficient clues as to the style. It sounds like this is pretty much where you are.
If the font is embedded, you can parse it and try to find style information in the embedded font program. If it is subsetted then in theory this information might have been removed, but in general I don't think it will be. However, parsing TrueType/OpenType fonts is boring and you may not feel that it is worth it.
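As a rough sketch of the name-based heuristic: the class below is hypothetical and the keyword list is deliberately small. It first strips the subset prefix (six uppercase letters followed by '+', which subsetted fonts carry in front of the PostScript name) so that letters in the prefix cannot produce a false hit, then looks for style keywords.

```java
import java.util.Locale;

// Hypothetical heuristic; real-world font names will need a longer keyword list.
final class FontNameStyle {
    // Removes a subset tag such as "ABCDEF+" from the start of the name, if present.
    static String stripSubsetPrefix(String postScriptName) {
        return postScriptName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    private static String normalized(String postScriptName) {
        return stripSubsetPrefix(postScriptName).toLowerCase(Locale.ROOT);
    }

    // Guesses bold from the keyword "bold" in the name.
    static boolean looksBold(String postScriptName) {
        return normalized(postScriptName).contains("bold");
    }

    // Guesses italic from the keywords "italic" or "oblique" in the name.
    static boolean looksItalic(String postScriptName) {
        String name = normalized(postScriptName);
        return name.contains("italic") || name.contains("oblique");
    }

    public static void main(String[] args) {
        String name = "ABCDEF+Arial-BoldItalicMT";
        System.out.println(stripSubsetPrefix(name)); // prints Arial-BoldItalicMT
        System.out.println(looksBold(name));         // prints true
        System.out.println(looksItalic(name));       // prints true
    }
}
```

Abbreviated style suffixes (for example "It" instead of "Italic") or weight names like "Black" or "Heavy" are not covered here, so treat the result as a guess rather than a fact.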
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)