Any idea if it would be possible to extract text from an Illustrator file without opening it?
I have an AppleScript that currently extracts the text, but it takes a long time when I'm working on hundreds of files. I was wondering if it would be possible to get the information without opening the AI file.
+1 for show your own code first.
If you’re only getting plain text, it should take a fraction of a second per document (opening the file will take longer):
tell application "Adobe Illustrator"
	get contents of every text frame of document 1
end tell
(i.e. Never iterate over individual application objects, querying each one, when a single query will do everything for you. Apple events are relatively expensive for apps to resolve; sending lots of them unnecessarily really kills performance.)
Be aware that AppleScript also has serious performance problems when iterating over large lists, but that’s a separate issue, and its solution is already covered elsewhere.
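Since you're working on hundreds of files, the same idea scales to a batch run: open each file, grab all the text in a single query, and close without saving. A rough sketch (untested; the folder prompt is just one way to pick the files, and Illustrator may still pause on missing-font dialogs):

set aiFolder to choose folder with prompt "Pick the folder of AI files"
tell application "Finder" to set aiFiles to (every file of aiFolder whose name extension is "ai") as alias list
set allText to {}
tell application "Adobe Illustrator"
	repeat with f in aiFiles
		open f
		-- one query per document instead of one per text frame
		set end of allText to (get contents of every text frame of document 1)
		close document 1 saving no
	end repeat
end tell
allText -- a list of lists, one entry per file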
I have a problem regarding the process of saving a bunch of PDFs (exported from Word documents).
The runtime of my program behaves a bit strangely, and that's why I am asking.
I want to save the files on a global (network) drive.
In my program I create a folder on that drive where I put all the PDFs.
Somehow, the first time I do this operation it is really slow.
But if I do the same operation for the same folder a second time (after the "old" PDFs have been deleted or overwritten), it is really fast.
I am a bit frustrated and cannot explain why that is.
Could somebody please help? I would be very happy for an answer.
Greetings,
Jonas
I am using this simple code:
doc.ExportAsFixedFormat wholefile, ExportFormat:=wdExportFormatPDF
There are multiple reasons why the ExportAsFixedFormat method can be slow, but I would start by dealing with local files only: export to a local path first, then copy the finished PDF to the network drive (a sketch follows below). After that I'd experiment with the arguments that control document quality and the export format. It also makes sense to try out the parameters that your code sample leaves at their default values.
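If the network drive turns out to be the culprit, a quick thing to try is exporting to a local folder first and copying the finished PDF afterwards. A rough sketch (the paths are placeholders):

Dim tmpPath As String, targetPath As String
tmpPath = Environ$("TEMP") & "\" & doc.Name & ".pdf"
targetPath = "\\server\globaldrive\myfolder\" & doc.Name & ".pdf" ' placeholder target

' Export locally (fast), then copy the finished file to the network drive
doc.ExportAsFixedFormat tmpPath, ExportFormat:=wdExportFormatPDF
FileCopy tmpPath, targetPath
Kill tmpPath ' clean up the local copy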
I am trying to make a dynamic PDF generator as a .NET Core API. I want to take an existing PDF or .docx file and edit it so that the current name (John Doe) is replaced by a placeholder such as #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this on a Docker environment, so I can easily execute commands and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters already used in the PDF can appear in a replacement, because the embedded fonts are subset. So if I only use A, B, C as characters, a replacement containing D will fall back to Times New Roman (or the default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers and have unlimited time
Problem 1: Only our developers can make changes, since our marketing department doesn't know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is that what you see is what you get: I can make the whole thing in HTML, press CTRL+P and see what it will look like as a finished PDF.
I am looking for a better solution, though. It can be paid or free. All I need is to swap out words/phrases with other words dynamically, which apparently is a tough thing to do.
Thanks for clearly specifying what you've already tried. It helps a lot in providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft Word to do the docx->odt conversion; as you know, it's not good at it. Use LibreOffice itself for this step. The rest of your process remains the same (there's a rough sketch of the full cycle below).
For some documents, LibreOffice does doc->odt much better, so you can instead work with the DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role and give your business/marketing teams more direct input, simply by:
getting the starting-point document to the devs to run through the conversion process; the devs can "clean up" the document so it converts nicely.
making this version of the document the "official" starting point; the business or technical teams can load it, adjust it, and put it back into the process.
if possible, exposing a test platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and, if they're good, do impressive things without any dev input.
The above steps simply mean: don't expect perfect conversion of arbitrarily complex documents. Starting from a working baseline (even a complex one) is great.
Some of that might show you that your #2 is actually going to get the best overall results.
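For what it's worth, the unzip / search-and-replace / re-zip / convert cycle described in the question can be scripted in a few lines. A rough sketch in Python (it assumes soffice is on the PATH, that the placeholder text isn't split across XML spans inside content.xml, and the file names and replacement map are just examples):

import subprocess
import zipfile

def fill_template(template_odt, output_odt, replacements):
    # Read content.xml from the template and apply the replacements
    with zipfile.ZipFile(template_odt) as src:
        xml = src.read("content.xml").decode("utf-8")
        for placeholder, value in replacements.items():
            xml = xml.replace(placeholder, value)
        # A zip entry can't be modified in place, so rebuild the archive
        with zipfile.ZipFile(output_odt, "w") as dst:
            for item in src.infolist():
                data = xml.encode("utf-8") if item.filename == "content.xml" else src.read(item.filename)
                dst.writestr(item, data)

fill_template("certificate.odt", "filled.odt", {"#NAME_PLACEHOLDER": "John Doe"})
# Convert the filled-in ODT to PDF with headless LibreOffice
subprocess.run(["soffice", "--headless", "--convert-to", "pdf", "filled.odt"], check=True)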
I hope that helps.
I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about the file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3...). But the people at Adobe probably had their reasons for putting it after the body. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not so much because of hardware limitations "back in the day" as because of scale. It's easy to think an invoice with a couple of pages of text could be structured differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end you can write images/text/fonts to the file as they are processed and then discard them from memory, while simply storing the file offset of each object to be used later when writing the cross-reference table and trailer.
If the trailer had to come first, you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the trailer, and only then write all the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in RAM until you could write it to the file.
Write speed and RAM are still issues we contend with today when we're running in a Docker container on a VM on shared hardware.
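To make that concrete, here is a toy sketch (Python, writing the PDF syntax by hand rather than using a real PDF library) of how a writer can stream objects out as soon as they are ready, keep only their byte offsets in memory, and emit the cross-reference table and trailer last:

offsets = {}

def write_obj(f, num, body):
    offsets[num] = f.tell()  # remember where this object starts
    f.write(f"{num} 0 obj\n{body}\nendobj\n".encode("latin-1"))

with open("tiny.pdf", "wb") as f:
    f.write(b"%PDF-1.4\n")
    # Objects are written (and could be discarded from memory) one by one
    write_obj(f, 1, "<< /Type /Catalog /Pages 2 0 R >>")
    write_obj(f, 2, "<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
    write_obj(f, 3, "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>")  # empty page

    xref_pos = f.tell()  # startxref will point here
    f.write(b"xref\n0 4\n")
    f.write(b"0000000000 65535 f \n")  # entry for the free object 0
    for num in sorted(offsets):
        f.write(f"{offsets[num]:010} 00000 n \n".encode("latin-1"))  # 20-byte entries
    f.write(b"trailer\n<< /Size 4 /Root 1 0 R >>\n")
    f.write(f"startxref\n{xref_pos}\n%%EOF\n".encode("latin-1"))

Only the offsets dictionary has to stay in memory; everything else is on disk as soon as it is produced, which is exactly why putting the cross-reference table last is convenient for the writer.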
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in #joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross references would become. Writing the cross references at the end was therefore a natural consequence.
INTRO
I have this 2.7 MB PDF file.
It's a certificate with two fields that I have to fill: name and course.
After filling those fields I save it for later printing.
THE PROBLEM
After saving, the new file comes out at ~5 MB.
I have tried many saving options, but I only managed to reduce it to a final size of 4.7 MB (still larger than the original file).
For instance, I tried opening the original file (2.7 MB) and saving it right after opening (without making any change). The result is the same: a new ~5 MB file.
That means it isn't the added information (name and course) that is at fault.
SOLVING
At some point, trying new methods of saving, I managed to get it down to 180 KB.
Unfortunately, I'm not able to reproduce this result.
After several hours of trying to achieve it again without success, I came here to ask for help :(
As you are in Acrobat, you might use "Save as optimized…" (where you already are, in order to show the space usage) and remove as much as possible: mainly structure information and private data (which means data allowing the original creating application to edit the file again), etc.
You might also start from a minimum-sized blank file and copy/paste the form fields into it (although I don't think that would bring much reduction, since, AFAIK, fonts used in form fields are counted in the Fonts item).
I want to make a script that locks a PDF file and makes it non-convertible to other formats such as Word or TXT.
Is there any script to make this possible?
Thank you
You can't. Not really. I mean, there are ways but the effort expended is rarely worth it.
One of the simplest ways is to start by creating the document with an owner password and no user password. This lets anyone open the file, but conforming readers will abide by the access permissions in the encrypt dictionary. Permissions you can set include:
Allow/Deny printing
Allow/Deny copying text/images
Allow/Deny modifying text annots and form fields
This will work with Acrobat, but it doesn't stop third-party tools from ignoring those permissions and allowing it anyway.
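If you want to script the owner-password approach, this is roughly what it looks like with the pikepdf library (a sketch; the exact permission flags you want may differ, so check the library's documentation):

import pikepdf

# Re-save the file with an owner password, an empty user password,
# and restricted permissions (no copying, no modification).
with pikepdf.open("input.pdf") as pdf:
    pdf.save(
        "locked.pdf",
        encryption=pikepdf.Encryption(
            owner="owner-secret",  # needed to change the permissions later
            user="",               # empty user password: anyone can open the file
            allow=pikepdf.Permissions(
                extract=False,          # deny copying text/images
                modify_annotation=False,
                modify_form=False,
                modify_other=False,
            ),
        ),
    )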
You could also make your own font that has an "unusual" encoding but renders correctly. This is equivalent to encoding your document with a Caesar cipher, which is not even encryption by any modern definition.
I have had the same question multiple times and researched different answers. I work as an engineer who writes documents, sometimes for several subcontracted jobs, and I even sub them out to other technicians. They need my paperwork for those specific jobs, but I don't want anyone copying it. I found that converting an MS Word file to PDF can always be undone. The easiest way for me (though time-consuming) was to print the document and then scan the printout as a PDF. That makes the text non-editable. Hope this helps.