I know that PNG doesn't support EXIF metadata (see these questions), but it does support its own chunk format, including text chunks stored as "name=value" pairs. However, I can't find any standard for naming these text entries apart from these basic keywords.
There's a reference in the second question to this failed standard from 2000 which defines fields like:
GPSAltitude
GPSAltitudeRef
GPSInfo
GPSLatitude
GPSLatitudeRef
GPSLongitude
GPSLongitudeRef
GPSMapDatum
GPSVersionID
and they seem to be used in this discussion. Should I just use these or is there an official standard I can use?
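To make the mechanism concrete, here is a minimal Python/Pillow sketch of writing and reading such keyword/value text chunks; the file names and values are placeholders, and the keyword names are simply the ones from the 2000 draft, not an official standard:

from PIL import Image, PngImagePlugin

img = Image.open("photo.png")

# Each entry becomes a plain keyword/value text chunk in the PNG.
info = PngImagePlugin.PngInfo()
info.add_text("GPSLatitude", "51.5074")
info.add_text("GPSLongitude", "-0.1278")
info.add_text("GPSMapDatum", "WGS-84")

img.save("photo_tagged.png", pnginfo=info)

# Reading the entries back:
print(Image.open("photo_tagged.png").text)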
I am looking for a way to write compound data (e.g. lists) to a file that can later be read back into the program. In Lisps, this is simply a matter of writing s-expressions to a file, and reading it back in later (using write and read for Scheme; prin1 and read for Common Lisp). Is there a similar method for doing this in Standard ML? Is there anything built-in that can help? (By "built-in", I mean something that is part of the language or basis library).
I found borb - a cool Python package to analyze and create PDFs.
And there are several translation APIs available, e.g. Google Translate and DeepL.
(I realize the length of the translated text is likely different from the original text, but to first order I'm willing to ignore this for now.)
But I'm not clear from the borb documentation how to replace all texts with their translations, while maintaining all formatting.
Disclaimer: I am Joris Schellekens, the author of borb.
I don't think it will be easy to replace the text in the PDF. That's generally something that isn't really possible in PDF.
The problem you are facing is called "reflowing the content": the idea that you may cause a line of text to become longer or shorter, and then the whole paragraph changes, and perhaps the paragraph is part of a table and the whole table needs to change, and so on.
There are a couple of quick hacks.
You could write new content on top of the PDF, in a separate layer. The PDF spec calls this "optional content groups".
There is code in borb that does this already (the code related to OCR).
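For illustration, here is a rough sketch of that "write on top" idea using pypdf and reportlab rather than borb's OCR-related code (file names, page size, and coordinates are placeholders); note that it simply stamps an overlay page onto the original content rather than creating a true optional content group:

from io import BytesIO

from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas

# Draw the replacement/translated text onto an in-memory overlay page.
buffer = BytesIO()
overlay = canvas.Canvas(buffer, pagesize=(612, 792))  # US Letter, in points
overlay.drawString(72, 720, "Translated text goes here")
overlay.save()
buffer.seek(0)
overlay_page = PdfReader(buffer).pages[0]

# Stamp the overlay onto the first page of the original document.
reader = PdfReader("path/to/original/file.pdf")
writer = PdfWriter()
for index, page in enumerate(reader.pages):
    if index == 0:
        page.merge_page(overlay_page)
    writer.add_page(page)

with open("path/to/overlaid.pdf", "wb") as handle:
    writer.write(handle)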
Unfortunately, there is no easy, free, or foolproof way to translate PDF documents and maintain document formatting.
DeepL's new Python Library allows for full document translation in this manner:
import deepl

auth_key = "YOUR_AUTH_KEY"
translator = deepl.Translator(auth_key)
translator.translate_document_from_filepath(
    "path/to/original/file.pdf",
    "path/to/write/translation/to.pdf",
    target_lang="EN-US",
)
and the company now offers a free API with a character limit. If you have a few short pdfs you'd like to translate, this will probably be the way to go.
If you have many, longer pdfs and don't mind paying a base of $5.49/month + $25.00 per 1 million characters translated, the DeepL API is still probably the way to go.
EDIT: After attempting to use the DeepL full document translation feature with Mandarin text, I found this method to be far from foolproof or accurate. At least with the Mandarin documents I examined, the formatting of each document varied significantly, and DeepL was unable to accurately translate full documents across a wide range of formatting. If you need only a rough translation of a document, I would still recommend DeepL's doc translator. However, if you need a high degree of accuracy, there won't be an 'easy' way to do this (read the rest of the answer). Again, I have only tried this feature with Mandarin PDF files.
However, if you'd like to focus on text extraction, translation, and formatting without using DeepL's full document translation feature, and are able to sink some real time into building software that can do this, I would recommend pdfplumber. While it has a steep learning curve, it is an incredibly powerful tool that provides data on each character in the PDF, gives image area information, offers visual debugging tools, and has table extraction tools. It is important to note that it only works on machine-generated PDFs and has no OCR feature.
Many of the PDFs I deal with are in Mandarin and have characters that are listed out of order, but using the data that pdfplumber provides on each character, it is possible to determine their position on the page. For instance, if character n's distance of the left side of the character from the left side of the page (see the char properties section of the docs) is less than that of character n+1, and both have the same distance of the top of the character from the bottom of the page, then it can reasonably be assumed that they are on the same line.
Figuring out what looks most typical for the PDFs you usually work with is a long process, but performing the text extraction while maintaining line fidelity in this manner can be done with a high degree of accuracy. After extraction, passing the strings to DeepL and writing them to an outfile is an easy task.
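A rough sketch of that line-grouping idea (the tolerance value and file name here are just placeholders, not something pdfplumber prescribes):

import pdfplumber

def extract_lines(path, y_tolerance=2.0):
    """Group characters into lines by vertical position, then sort each line left-to-right."""
    lines = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            rows = {}  # rounded vertical position -> characters on that line
            for char in page.chars:
                key = round(char["top"] / y_tolerance)
                rows.setdefault(key, []).append(char)
            for key in sorted(rows):
                ordered = sorted(rows[key], key=lambda c: c["x0"])
                lines.append("".join(c["text"] for c in ordered))
    return lines

for line in extract_lines("path/to/file.pdf"):
    print(line)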
If you can provide one of the pdfs you work with for testing that would be helpful!
PDF is very nice for humans to read, but it is pretty awful to extract the data from. There are tons of tools to extract the data from PDF (pdftotext from poppler, pdftohtml, XPdf, tabula, a-pdf, ...).
As you can see in questions like this, those tools are not optimal.
It would be better if the PDF already contained the data in a structured way that could be extracted, something like a stripped-down version of HTML. For tables in particular, a lot of information is lost, for example when you convert a Word document to PDF and then to text.
Does the PDF standard provide a way to store the structure of a table? If not, is it possible to extend the PDF standard? What would be the process for that?
What you are most likely looking for are tagged PDFs.
Tagged PDFs are specified in ISO 32000-1, section 14.8. They mark content parts as paragraphs, headers, lists (and list items), tables (and table rows, headers, and data cells) etc. with assorted attributes.
To do so they make use of the PDF logical structure facilities (see ISO 32000-1, section 12.7) which in turn use the marked content operators (see ISO 32000-1, section 12.6) to tag pieces of content streams with IDs which are referenced from a structure tree object model outside the content streams.
In a tagged PDF you can walk that structure tree like an XML DOM and retrieve the associated text pieces by making use of the ID markers in the content.
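As a small illustration, here is a sketch using pypdf's low-level object access that walks the structure tree and prints only the element tags; the file name is a placeholder, and mapping marked-content IDs back to text in the content streams is more involved and omitted here:

from pypdf import PdfReader
from pypdf.generic import ArrayObject, DictionaryObject

def walk(element, depth=0):
    element = element.get_object()  # resolve indirect references
    if isinstance(element, ArrayObject):
        for kid in element:
            walk(kid, depth)
        return
    if not isinstance(element, DictionaryObject):
        return  # an MCID integer or marked-content reference leaf
    tag = element.get("/S")
    if tag is not None:
        print("  " * depth + str(tag))  # e.g. /P, /H1, /Table, /TR, /TD
    kids = element.get("/K")
    if kids is not None:
        walk(kids, depth + 1)

reader = PdfReader("tagged.pdf")
catalog = reader.trailer["/Root"]
if "/StructTreeRoot" in catalog:
    walk(catalog["/StructTreeRoot"].get("/K", ArrayObject()))
else:
    print("This PDF has no structure tree (it is not tagged).")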
For details please study the PDF specification ISO 32000-1 or its update ISO 32000-2.
Adobe shared a copy of ISO 32000-1 (merely replacing ISO headers and references); simply search the web for "PDF32000_2008". Currently it's located here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
We are designing a database that needs to store various versions of a file (pdf/image/reduced image) in a table. The powers that be have opted against using Filestream for whatever reason, so this is not up for debate.
I can't seem to find anything online that indicates what the appropriate data type is for storing PDF and image data. That, or I'm just being a total idiot while searching for it.
I'm not trying to start a debate, so I'm not looking for opinionated responses. But rather I am trying to find out if one or the other was actually designed for what I'm trying to do. If either will work, that's all I need to know.
Given your binary choice of nvarchar vs varbinary, there's no choice: it's varbinary. nvarchar is for storing Unicode character-based data. varbinary will store a bit-perfect copy of the data you put in there. PDFs and images are binary file types, so varbinary it is.
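If an example helps, here's a minimal Python/pyodbc sketch of round-tripping a PDF through a varbinary(max) column; the table, column names, connection string, and file name are made up:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=Docs;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Hypothetical table: one varbinary(max) column holds the file bytes.
cursor.execute("""
    IF OBJECT_ID('dbo.StoredFiles') IS NULL
        CREATE TABLE dbo.StoredFiles (
            Id INT IDENTITY PRIMARY KEY,
            FileName NVARCHAR(260) NOT NULL,
            Content VARBINARY(MAX) NOT NULL
        )
""")

with open("report.pdf", "rb") as handle:
    data = handle.read()

cursor.execute(
    "INSERT INTO dbo.StoredFiles (FileName, Content) VALUES (?, ?)",
    "report.pdf",
    pyodbc.Binary(data),
)
conn.commit()

# Reading it back returns the exact bytes that were stored.
row = cursor.execute(
    "SELECT Content FROM dbo.StoredFiles WHERE FileName = ?", "report.pdf"
).fetchone()
assert bytes(row[0]) == data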
As for the BLOB suggestion, no. That's not even supported in 2012. Oh, and perhaps you meant the TEXT/NTEXT/IMAGE data types. Those are deprecated too, so don't build anything new using them.
Finally, you said you can't use FileStream, but what about FileTable? I'm not sure whether you're looking for just storage of the data or you also need it searchable, in which case FileTable is pretty slick.
I'm trying to see what the best way to store large amounts of text (more than 255 characters) in Cocoa would be. Being a big fan of Core Data, I would assume there's an effective way to do so. However, I feel like 'string' is the wrong data type for this kind of thing. Does anyone have any info on this? I don't see an option for BLOB in Core Data.
Well, you can't very well compress the text or store it as a binary that must be translated; otherwise you give up SQLite's querying speed, because all text-stored-as-binary-encoded-data records must be read into memory, translated/decompressed, and then searched. Otherwise, you'd have to mirror (and maintain) the text-only representation in your Core Data store alongside the more full-featured stuff.
How about a hybrid solution? Core Data stores everything but the actual text; the text itself is archived as one file per Core Data entry on the file system, with each file named for its unique identifier in the Core Data store. This way a search could do two things (in the background, of course): search the Core Data store for things like titles, dates, etc., and search the files (maybe even with Spotlight) for content. If there's a file search match, its file name is used to find the matching record in Core Data for display in your app's UI.
This lets you leverage your app-specific internal search criteria and Spotlight's programmatic asynchronous search. It's a little more work, granted, but if you're talking about a LOT of text, I can't think of a better way.
The BLOB data type is called "Binary data" in Core Data. As middaparka has pointed out, the Core Data Programming Guide offers some guidance on how to deal with binary data in Core Data. Depending on your requirements, an alternative to using BLOBs would be to just store references to files on disk.
I'd recommend a read of Apple's Core Data Programming Guide (specifically the "Core Data Performance" section). This specifically mentions BLOBs (see the "Large Data Objects (BLOBs)" section) and gives some, albeit vague, guidelines.