Is there any text data inside jpeg file? - camera

Is there any text data being stored in a jpeg file or is it only colours and intensity/brightness?
Suppose we have a barcode and say we click a jpeg picture of the barcode. Is it possible to decode that jpeg picture to get the barcode type/data?
Admins, please add the right tags.

There are various ways to store textual data inside a JPEG file. This is typically used for things like copyrights, photographer names, captions, subject keywords etc. "JPEG metadata" is probably the phrase to search for.
Given a JPEG image of a barcode, it is certainly possible to decode the barcode using a barcode decoder and, if you want to, store the decoded value in some suitable metadata location. However, you cannot, in general, assume that this has already been done for an image which happens to contain a barcode.
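To see whether a particular JPEG carries any textual data, one option is to walk its marker segments. The following is only a minimal sketch using the standard library: it collects COM (comment) segments and lists which APPn segments are present (EXIF and XMP normally live in APP1, IPTC in APP13). The function name `jpeg_text_segments` is made up for this example, and real files are better handled with a proper metadata library.

```python
import struct


def jpeg_text_segments(data: bytes) -> dict:
    """Collect COM comments and list the APPn segments found before the image data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    comments, app_segments = [], []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the marker structure; give up
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (0xD9, 0xDA):  # EOI or SOS: no more header segments
            break
        # The 2-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xFE:  # COM: free-form comment
            comments.append(payload.decode("latin-1"))
        elif 0xE0 <= marker <= 0xEF:  # APP0..APP15: JFIF, EXIF, XMP, IPTC, ...
            app_segments.append(f"APP{marker - 0xE0}")
        pos += 2 + length
    return {"comments": comments, "app_segments": app_segments}
```

Typical use would be `jpeg_text_segments(open("photo.jpg", "rb").read())`, where `photo.jpg` is a placeholder path. A barcode value will only show up here if someone decoded it and wrote it into the metadata beforehand.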

Related

What's the best way to extract text from pdf in python without changing the layout and format?

I want text with exact format and layout from pdf.
If pdf to text is not the direct choice, is it possible to do pdf -> xml -> text?
I have already tried PyPDF2, pdfminer and pdftotext. I've even tried AWS Textract and got an incorrect layout.
Basically, if I can construct sentences from the text extracted from the PDF, that's enough.
I used the Zamzar API, which gives exact output, but it is quite expensive.
Any possible solution?
If you are looking to keep the structure of the PDF but not the font, colour, size etc., then try the pdftables_api library. This should hold the layout of your PDF. Convert the PDF to CSV, since a CSV file is just a comma-separated text file.
If you are looking to keep font, colour etc., Zamzar API is probably your best option.
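Once the PDF has been converted to CSV, turning the rows back into plain lines of text takes only the standard library. This is a rough sketch, assuming the converter emits one CSV row per visual line of the page; `csv_to_lines` is a name invented for this example and is not part of pdftables_api.

```python
import csv
import io


def csv_to_lines(csv_text: str) -> list[str]:
    """Join the non-empty cells of each CSV row into one line of text."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:  # skip rows that only contain empty cells
            lines.append(" ".join(cells))
    return lines
```

Using the `csv` module rather than splitting on commas keeps quoted cells such as `"Smith, John"` intact.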

Batch check Adobe Acrobat .pdf's for files containing rotated text

Does anybody know if there is a way to check whether a list of Adobe Acrobat .pdf files contain rotated text (any text not at 0 degrees)?
I thought this would be simple, but I'm struggling to find an answer.
I am using ABBYY Recognition Server to OCR thousands of files and the results are quite poor where the text is rotated. I need to get a list of files that have rotated text to allow me to perform some pre-processing on them.
I usually use iTextSharp for .pdf automation and modification but don't seem to be able to find anything for checking text rotation.
Thanks
You could achieve your goal by extracting all words from these PDFs and checking if any of the words is rotated.
I would recommend using a PDF library with higher-level abilities for the task. The Docotic.Pdf library is a good choice (of course, I am one of the developers of the library).
Here is an article that shows how to extract words from PDFs with extra info about their position etc.
Each extracted word comes in a PdfTextData object. PdfTextData has an IsTransformed property to check whether the word is rotated, scaled, and/or flipped. You can also analyze PdfTextData.TransformationMatrix for more information about the transformation.
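If you need to tell rotation apart from scaling, the angle can be read off the transformation matrix. The sketch below shows only the arithmetic, in Python, and assumes the matrix is available as the usual six PDF values `(a, b, c, d, e, f)`; how you obtain those values from your PDF library is not shown, and the function names and the `matrices_by_file` structure are hypothetical.

```python
import math


def rotation_degrees(matrix) -> float:
    """Angle of the transformed x-axis for a PDF-style matrix (a, b, c, d, e, f)."""
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))


def is_rotated(matrix, tolerance_deg: float = 0.5) -> bool:
    """True if the matrix turns text away from 0 degrees by more than the tolerance."""
    return abs(rotation_degrees(matrix)) > tolerance_deg


def files_with_rotated_text(matrices_by_file: dict) -> list[str]:
    """Given {filename: [matrix, ...]}, return the files containing any rotated word."""
    return sorted(
        name
        for name, matrices in matrices_by_file.items()
        if any(is_rotated(m) for m in matrices)
    )
```

Pure scaling and translation leave the angle at 0, so such words are not reported. Note that a horizontal mirror also yields 180 degrees with this test, so flipped text is reported as rotated.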

How can I programmatically verify that a PDF file is first-generation?

I'm working on a project that involves the Fannie Mae/Freddie Mac Uniform Appraisal Dataset. The specification requires that the embedded appraisal PDF file be first-generation.
I understand conceptually what a first-generation PDF file is (printing of a document directly to PDF, rather than a scanned copy or printed and scanned copy). However, I've done some research and haven't found anything that specifies the properties of a first-generation PDF that could be verified programmatically.
I found a product that allows one to check if a PDF contains text, images, or both: Aspose.Pdf.Kit for .NET, but I'm looking for a way to program this myself, for budgetary and other reasons. Also, I'm not sure that determining that the file contains text will be sufficient to verify that it's first-generation.
Given that this is an industry requirement of a very large industry, I feel like someone must have already tackled this issue, but I'm having a hard time finding anything.
Thanks in advance for any help.
There is no way to know for certain if a PDF is "first generation". Technically, a scanned PDF is just a PDF that contains images and perhaps OCR'ed text on top of that. A "first generation" PDF could easily have the same characteristics, so you have to use some heuristics.
For example, a PDF that contains only images and invisible text (from OCR) is likely to be scanned, whereas a PDF that has visible text or vector graphics is probably "first generation". (OCR for scanned PDFs works by overlaying invisible text on top of the original image, so that text selection works but the original document's fidelity is preserved.)
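As a rough illustration of that heuristic, the sketch below classifies a document from per-page counts. It assumes you have already gathered those counts with a PDF library of your choice (invisible text is text drawn with rendering mode 3 in the content stream); the `PageSummary` type and both function names are hypothetical, and the result is a guess, not a proof.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    image_count: int           # raster images placed on the page
    visible_text_chars: int    # characters drawn with a visible rendering mode
    invisible_text_chars: int  # characters drawn with rendering mode 3 (typical OCR layer)
    vector_ops: int            # path-painting operators (lines, fills, ...)


def looks_scanned(page: PageSummary) -> bool:
    """Only images, plus at most an invisible OCR layer: probably a scan."""
    return page.image_count > 0 and page.visible_text_chars == 0 and page.vector_ops == 0


def classify(pages: list[PageSummary]) -> str:
    """Return a coarse verdict for the whole document."""
    if not pages:
        return "unknown"
    scanned = sum(looks_scanned(p) for p in pages)
    if scanned == len(pages):
        return "likely scanned"
    if scanned == 0:
        return "likely first-generation"
    return "mixed"
```

A "mixed" result (for example, a generated report with a scanned attachment) is worth flagging for manual review rather than rejecting outright.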
Open the PDF, press Ctrl+F and type in "Appraisal". If you get a hit for the word, you have a first-generation appraisal. Or rather, the dataset exists.

How can I overlay text on a TIFF image, creating something like a searchable pdf?

I would like to have an application where a user views an image of a document in TIFF Format.
Suppose the words "foo" and "bar" appear on the page. If a selection is made on the image that contains only "foo", then I would like only the word "foo" to be selected.
Is there a format that lends itself to storing both the location of text and the text of an image?
Since you know about searchable PDF, and it perfectly implements what you are suggesting, I assume that there is some reason why you can't use it. If not, you should use PDF -- the format supports mixed-content and overlaying them. All of the viewers that your users are likely to have will understand what to do with text beneath the image.
The TIFF format does not support this directly, but if you are making the viewer, and it only needs to work there, then you could try to store the text and positions in a custom tag.
Then your viewer would need to read this tag, interpret mouse positions, and look up the text that is being selected on the image. No other viewer would support your text tag, but they would show the TIFF.
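The lookup the viewer has to do can be sketched in a few lines. This is only an illustration under assumptions: the OCR result is a list of words with pixel bounding boxes, it is serialized as JSON for storage in the custom tag, and a word counts as selected when its box lies entirely inside the selection rectangle. The `Word` type and the function names are invented for this example; writing the bytes into an actual TIFF tag is left to your imaging library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Word:
    text: str
    left: int
    top: int
    right: int
    bottom: int


def encode_words(words: list[Word]) -> bytes:
    """Serialize words and boxes to bytes suitable for a custom tag value."""
    return json.dumps([asdict(w) for w in words]).encode("utf-8")


def decode_words(payload: bytes) -> list[Word]:
    """Inverse of encode_words."""
    return [Word(**item) for item in json.loads(payload.decode("utf-8"))]


def select_words(words: list[Word], left: int, top: int, right: int, bottom: int) -> list[str]:
    """Return the text of every word whose box lies fully inside the selection."""
    return [
        w.text
        for w in words
        if w.left >= left and w.top >= top and w.right <= right and w.bottom <= bottom
    ]
```

With "foo" at (10, 10, 40, 20) and "bar" at (50, 10, 80, 20), a selection of (5, 5, 45, 25) picks out only "foo". Requiring full containment is a design choice; a viewer might prefer "overlaps by more than half" so that sloppy selections still work.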
For either of these mechanisms, you will need OCR and a way to encode the data you get either into PDF or the custom TIFF tag. For open source OCR, take a look at Tesseract from Google.
Disclaimer: I work at Atalasoft. Our imaging SDK, DotImage, has add-ons for OCR that can make searchable PDF, and can add and edit TIFF tags.

read 1D barcode (code 128) direct from a pdf file

Does anyone have experience with a tool (it can also be a commercial one) that can extract barcodes directly from a PDF file? Most tools I have seen can read barcodes only from images.
Thanks
Well, important question:
Is the barcode in a "barcode" PDF form field? If so, pretty much any PDF form capable library can do the trick. PDF barcode form fields are just text form fields with an appearance stream to display the barcode. The text value of the barcode however is exactly the data you would be looking for (and hence wouldn't need to care about the appearance stream).
If not and the barcode is on a scanned PDF (and hence in an image internally), you could use something like:
Ghostscript http://www.ghostscript.com
FoxIt http://www.foxitsoftware.com/pdf/sdk/
QuickPDF http://www.quickpdf.org/
to convert each page in the PDF to an image. (The PDF rasterizers listed above are under very different licensing terms, but are IMHO in decreasing order of quality and maturity.) Then use one of the many barcode image libraries on the image as a whole.
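For the Ghostscript route, the rasterization step could look roughly like the sketch below. It assumes the `gs` executable is on the PATH (on Windows the binary is usually named differently) and that 300 dpi greyscale PNGs are adequate for Code 128; the function names, output pattern and default resolution are choices made for this example. Decoding the resulting images is left to whichever barcode library you pick.

```python
import subprocess


def ghostscript_command(pdf_path: str, output_pattern: str = "page-%03d.png", dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that renders each page to a greyscale PNG."""
    return [
        "gs",
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pnggray",                # 8-bit greyscale PNG output
        f"-r{dpi}",                        # resolution in dots per inch
        f"-sOutputFile={output_pattern}",  # %03d is replaced by the page number
        pdf_path,
    ]


def rasterize(pdf_path: str, **options) -> None:
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ghostscript_command(pdf_path, **options), check=True)
```

If narrow bars fail to decode, raising the resolution is usually the first thing to try, since the bar widths must survive rasterization.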