I just realized that FileSystem.FindInFiles reports a match even if it only finds the containsText string within the document properties of a .pdf file. This is annoying because the end user then looks for the string in the document itself and does not find it (in this case, the string was only present in the Title field of the document properties).
Please let me know if there is a way around this problem.
Alternatively, can anyone recommend a reliable, fast, VB-compatible search API/library/method to use on a Windows file share?
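For context, a minimal sketch of the kind of call being described, using the Microsoft.VisualBasic assembly from C# (the share path and search string are hypothetical):

using System.Collections.ObjectModel;
using Microsoft.VisualBasic.FileIO;

// FindInFiles does a plain text scan of each file's contents, which is why
// it also matches strings that only occur in PDF metadata such as the Title field.
ReadOnlyCollection<string> hits = FileSystem.FindInFiles(
    @"\\server\share",
    "search term",
    true,                                  // ignore case
    SearchOption.SearchAllSubDirectories);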
Space getting added between characters while writing to PDF using binary write
Here is a screenshot of the issue.
Here is the sample code:
Dim rawData As Byte() = System.Text.Encoding.UTF8.GetBytes("sample data") ' plain text bytes, not PDF data
Response.ContentType = "application/pdf"
Response.ContentEncoding = System.Text.Encoding.UTF8
Response.BinaryWrite(rawData)
Response.End()
The underlying issue here is that you actually are not writing a PDF at all!
Your code essentially returns pure text data and then claims that it is a PDF. Such a claim doesn't change the text data in any way, though; it remains text and doesn't become a PDF.
The PDF viewer you use apparently attempts to display what it got nonetheless, but the result is very unsatisfying (a proportional font seems to be used in a monospaced manner).
If you actually want to return a PDF, you have to explicitly create one. PDF is a complicated binary format that is best created with a dedicated library.
Look for PDF libraries for your environment. Some have explicit APIs for adding table or paragraph structures to the PDF, and some create content by converting from another structured format, e.g. HTML.
The output of such a library is a binary in PDF format, which you can return from your code using Response.BinaryWrite.
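To illustrate, a minimal sketch using iTextSharp (just one of many options; any PDF library with a comparable API would do): build a real PDF in memory first, then hand its bytes to Response.BinaryWrite.

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Inside an ASP.NET page or handler, as in the question:
// build an actual PDF in memory, then return its bytes.
byte[] pdfBytes;
using (var ms = new MemoryStream())
{
    var document = new Document();
    PdfWriter.GetInstance(document, ms);
    document.Open();
    document.Add(new Paragraph("sample data"));
    document.Close();
    pdfBytes = ms.ToArray();
}

Response.ContentType = "application/pdf"; // now the claim is true
Response.BinaryWrite(pdfBytes);
Response.End();

With this, the content type matches what is actually sent, and any PDF viewer will render it properly.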
A number of recent questions describe the same pattern: people have data in text or HTML format, return it with some binary content type set (PDF in this question, MS Office formats in others), and then assume they have thereby generated a file in that format.
This is wrong; claiming a format does not transform the data into that format!
All that setting the content type does is inform the client what kind of viewer to use to open the data.
This anti-pattern probably came up because MS Word (and most likely other word processors, too) can also open plain text and HTML files and display them fairly properly. Thus, at first glance, the anti-pattern appears to work.
If you promised your client that your application returns MS Office documents, though, don't return HTML or plain text claiming it to be an Office document; create actual MS Office documents instead! Otherwise, knowledgeable clients will not accept your implementation, and clients who did accept it will eventually be told by knowledgeable users that you cheated them, which will at least damage your reputation.
I am trying to make a dynamic PDF generator as a .NET Core API. I want to take an existing PDF or .docx file and edit it, replacing the current name (John Doe) with a replaceable token like #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this in a Docker environment, so I can easily execute shell commands, and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar (see the sketch after this list)
Problem: only the characters already used in the PDF are available for replacements, because the embedded fonts are subsetted. So if the document only contains the characters A, B, and C, a replacement containing D falls back to Times New Roman (or the default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department does not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
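For reference, the headless-Chrome step that #1 and #3 rely on boils down to one command; here is a sketch of invoking it from .NET (the binary name and file names are illustrative assumptions; --headless and --print-to-pdf are documented Chrome flags):

using System.Diagnostics;

// Convert an HTML file to PDF with headless Chrome.
// The executable may be "google-chrome", "chromium", or "chrome" depending on the image.
var chrome = new ProcessStartInfo
{
    FileName = "google-chrome",
    Arguments = "--headless --disable-gpu --print-to-pdf=output.pdf input.html",
    UseShellExecute = false
};
Process.Start(chrome)?.WaitForExit();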
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for clearly specifying what you've already tried; it helps a lot in providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft Word to do the docx->odt conversion; as you know, it's not good at it. Use LibreOffice itself for this step. The rest of your process remains the same.
For some documents, LibreOffice does doc->odt much better than docx->odt, so you can instead work with the DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role and allow your business/marketing teams more direct input simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean: don't expect perfect conversion of arbitrarily complex documents. Starting from an (even complex) working baseline is great.
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.
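To make the #2 pipeline concrete, a rough sketch of the unzip/search-and-replace/convert cycle described in the question (C#, using System.IO.Compression; the file names and the soffice invocation are assumptions based on the steps above):

using System.Diagnostics;
using System.IO;
using System.IO.Compression;

// An .odt file is a zip archive; the document text lives in content.xml.
string working = "output.odt";
File.Copy("template.odt", working, overwrite: true);

using (ZipArchive odt = ZipFile.Open(working, ZipArchiveMode.Update))
{
    ZipArchiveEntry entry = odt.GetEntry("content.xml");
    string xml;
    using (var reader = new StreamReader(entry.Open()))
        xml = reader.ReadToEnd();

    // Swap placeholders for real values (iterating a Dictionary<string, string> works the same way).
    xml = xml.Replace("#NAME_PLACEHOLDER", "John Doe");

    entry.Delete();
    using (var writer = new StreamWriter(odt.CreateEntry("content.xml").Open()))
        writer.Write(xml);
}

// Render to PDF with LibreOffice, as in the question.
Process.Start("soffice", "--headless --convert-to pdf " + working)?.WaitForExit();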
Any idea if it would be possible to extract text from an Illustrator file without opening it?
I have an AppleScript currently extracting the text but it takes a long time when I'm working on hundreds of files. I was wondering if it would be possible to get the information without opening the AI file.
+1 for show your own code first. (Also, typo in first line: I think you meant “Illustrator”, not “photoshop”.)
If you’re only getting plain text it should only take a fraction of a second per document (opening the file will take longer):
tell application "Adobe Illustrator"
    get contents of every text frame of document 1
end tell
(i.e. Never iterate over individual application objects, querying each one, when a single query will do everything for you. Apple events are relatively expensive for apps to resolve; sending lots of them unnecessarily really kills performance.)
Be aware that AppleScript also has serious performance problems when iterating over large lists, but that's a separate issue, the solution to which should already be covered elsewhere.
I have some .doc binary files stored in my database, and I would like to search them all (without writing them out as .doc files) to see which ones contain the word "hello", for instance.
Is there any way to do this search in the binary file?
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
If you have the stream from the DB, then your code would look like this:
// streamObjectFromDatabase is the Stream holding the .doc blob
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);
if (doc.GetText().ToLower().Contains("hello world"))
MessageBox.Show("Hello World exists");
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that's also stored in plain text? And there's always the off chance that the binary data will match your text pattern.
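For what it's worth, a crude illustration of that raw-byte scan (assuming the text is stored either as single-byte characters or as UTF-16LE, which .doc files commonly use; subject to exactly the false positives described above):

using System.Text;

// Naive byte-level substring search over the .doc blob, trying both encodings.
static bool ContainsText(byte[] docBytes, string needle)
{
    byte[][] patterns =
    {
        Encoding.ASCII.GetBytes(needle),
        Encoding.Unicode.GetBytes(needle) // UTF-16LE
    };
    foreach (byte[] pattern in patterns)
    {
        for (int i = 0; i <= docBytes.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && docBytes[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return true;
        }
    }
    return false;
}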
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.
I'd like to write some (Java) code that takes a PDF document and creates named destinations from all of the bookmarks. I think the iText API is the easiest way of doing this, but I have never used the API before.
How would you go about writing this sort of code with the iText API? Can iText do the parsing needed to manipulate existing PDFs by itself? The kind of manipulations I am thinking of are:
Open,
Find bookmarks,
Create destinations,
Save,
Close.
Or is there a different API that would be better?
Followup: I submitted a patch to iText a few months ago (it has now been accepted and is part of HEAD) that adds text parsing capabilities to iText. PdfBox (mentioned below) has (had?) problems with reading newer PDFs that use xref streams instead of the older xref table format.
Another library that is very good at parsing existing PDF files is PdfBox. It can also be used for modifying an existing PDF. FYI, this is the text parser that Lucene uses.
I will also mention that iText does have the ability to parse a PDF file; it's just not great at parsing the text content on each page. If you want to access the higher-level PDF constructs (dictionaries, etc.) that are used for storing bookmarks, and you don't mind getting your hands a little dirty reading the PDF spec, you can absolutely do what you are asking about (we do it quite a bit ourselves).
The PDF Spec is big, but readable for the most part, and you don't have to worry about the bulk of it (which is geared towards actual page content and rendering) if all you are trying to do is extract bookmarks.
I'll just warn you up front that you may be disappointed with this. iText isn't really intended to be used as a parser. It's really more for creating entirely new PDF documents, but you can take a whack at it.
To start, using iText, you won't be able to modify the existing PDF document. What you can do, though, is make a copy with the additional features that you want. (If somebody knows better, please let me know; this drives me crazy.)
What you will want to do is create a PdfReader object from an input stream on your source file. Then create a PdfCopy object (which is just an extended PdfWriter that makes getting data from an existing source more convenient) for your destination.
As far as I can tell, the bookmarks cannot be obtained from iText at all; another library may be needed. I think jpedal may have the ability to extract them (it can get them as an XML document, which you may then have to parse to get what you want).
However you get them, you can then add them to a java.util.List and set that list as the outline on the PdfCopy. The bookmarks themselves are just HashMaps with a particular set of keys. I'm not sure what all of the values are, but they include "Title", "Action" (which seems to be where you'd specify that this is a named destination, though I don't know what that value would be), and "URI" (which is used if this is an external link; I suspect that this would specify the name of the named destination that you're linking to). Again, the specifics are hard to find.
Then iterate over the pages of the reader, importing each page into the PdfCopy. This page may help you.
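A rough sketch of that reader-to-copy loop, shown with iTextSharp (the C# port; the Java classes have the same names). The SimpleBookmark call is worth testing in your version: later iText releases expose the outline through it as exactly the list of maps described above. File names are hypothetical.

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

PdfReader reader = new PdfReader("source.pdf");
using (FileStream fs = new FileStream("copy.pdf", FileMode.Create))
{
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, fs);
    document.Open();

    // Import every page from the source into the copy.
    for (int i = 1; i <= reader.NumberOfPages; i++)
        copy.AddPage(copy.GetImportedPage(reader, i));

    // Outline entries are maps with keys like "Title" and "Action".
    // Property name per iTextSharp 5; in Java this is copy.setOutlines(bookmarks).
    var bookmarks = SimpleBookmark.GetBookmark(reader);
    if (bookmarks != null)
        copy.Outlines = bookmarks;

    document.Close();
}
reader.Close();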
Sorry I'm not more helpful to you. Good luck.
P.S. If anybody else knows of a better tool that's either (L)GPL or BSD licensed, I'd love to hear about it.