Here’s what I’m trying to do. In my e-commerce store I need to print orders. On that PDF that is generated is an order #. I need to convert that number string into a scannable barcode that will then imbed itself onto that PDF next to the order number.
I have Adobe acrobat DC and it’s able to scan the document. I’m just at a loss for the next step.
The problem is that the shipping program that I use can only be connected to a 1d barcode scanner and rather than typing in either the order number or order name and then searching for the right one and opening it up manually I’m trying to simplify the search part.
Any idea on what to do?
There are a bunch of 1D-barcode symbologies around. Some of them do actually have a 1:1 relationship between the character and the symbol. That means, it is possible to represent the symbol directly by using an appropriate font.
Depending on the kind of characters you have to encode, you will chose the according symbology and font. You will also have to verify that your scanner can read the code (and the software understand it). Available as font, and easy to implement is for example Code39, which can display numbers and letters.
Related
I've implemented a card game in a spreadsheet (I'm using Apache OpenOffice, but can get my stuff converted to "normal" MS Xcel if need be).
What my sheet does is in fact deal out four cards at a time, from a deck that only contains aces (counted as ones) and twos through tens - so basically it's four random number generators activated by a keystroke.
The computer parts end here. The players are supposed to do stuff with the four numbers using their heads, paper and pencils.
Now I'd like to convert the spreadsheet to pdf so I can share it with people without them having to use Apache, Xcel or any other spreadsheet software. But when I export the content, the pdf file just shows the numbers last generated; I cannot make it "reshuffle" and deal again. Is it at all possible?
A direct conversion is not possible; Excel macros and JavaScript are too far away from each other.
You will have to recreate the logic using Acrobat JavaScript.
I tried to copy text from a PDF file but get some weird characters. Strangely, Okular can recoqnize the text, but not with Sumatra PDF or Adobe, all three applications are installed in Windows 10 64 bit. To better explain my issue, here is the video https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where the text extraction implementations differ, they try to determine the matching Unicode value by using heuristics or information from beyond the PDF or applying OCR to the glyph in question.
That the different programs you tried returned so different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification and
the heuristics used by those programs differ relevantly and Okular's heuristics work best for your document.
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
I am writing an application that has to read and interpret data stored in some PDF files. The reading part is done but I am only able to get a dump of all the words on a page and not the format of the words. What I mean is that if I have to extract a table, I am getting the numbers in the table but not the markup which defines the table.
Further, there is some formatting used which displays a few of these numbers within parentheses (meaning that those numbers are negative) but the parentheses themselves are not part of the text. Hence, I am not able to distinguish between positive and negative numbers present in the PDF table!
How do you get the PDF markup along with the text? Is a PDF similar in structure to an XML with tags used to markup tables etc.? If not, then, is there a resource which describes the salient features of the PDF DOM?
I am using VBA and the Acrobat library (AcroExch etc.)
There is no such thing as "PDF markup" in the sense of HTML etc. A table in PDF cannot be distinguished from line art, other than by using OCR, which can be error-prone if the layout is complex. It is simply drawn using geometrical shapes, like in a vector-based graphics program.
"Is a PDF similar in structure to an XML with tags used to markup tables etc.?"
No, not at all.
And there is no such thing as a 'DOM' either. Google for a file named *PDF32000_2008.pdf*. The current PDF specification for v1.7 (ISO spec) is that file. You should be able to locate it on the Adobe website.
As omz stated, text inside PDF does not really have a structure. You can take a look on the specification here. However, for some very specific files, there is something called PDF Tags, or PDF Marked Content, which is fairly new, and it aims to give PDF documents some kind of structure. If you target this kind of files specifically, you might be able to achieve something. Take a look on chapter 10 (Document Interchange) of the Adobe's specification for further details.
Maybe what you want to achieve can be done with less effort and faster by using TET, the Text Extraction Toolkit made by the fine folks from pdflib.com ( http://www.pdflib.com/products/tet/ ) ??
AFAIR, the TET has some (limited) support for table detection as well....
I have a PDF file which has the marklist of certain exam.
I am particularly interested in the first list, but which unfortunately has 2112 entries. And they aren't properly formatted. I need to sort all these entries (based on marks in last 2 columns- sum of marks in Aptitude and Computer), to know what my rank is.
I tried to copy in in MS Word and Excel, but if you try it, you can see it won't help. After pasting it in a plain text file, I tried to format it using regular expressions (in Notepad++), wrote a code in C to properly separate each field by '\t' (so that later I can properly copy them in an Excel sheet), but the inconsistency made me fail (some entries are spawned multiple lines, the "names" do not have fixed no. of fields).
Can someone come up with any idea that will make it possible to copy the first list in PDF to a spreadsheet in tabular form exactly as the original file?
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult
For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting point '1.' above! -- see these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Well I sort of managed it. I first copied it to a plain text file, deleted all letters from it leaving only the serial number and corresponding marks, separated by spaces or tabs. Then using "import" in an OpenOffice Spreadsheet, told it the delimiters are spaces and tabs (combine them if necessary) and bingo! I got my rank.
But I would still like to know if it's possible to copy the whole table as it is. So keeping this question open.
I once was tasked with building a parser which would extract data from a pdf with tabular and non-tabular data in a number of different encodings and with a mix a rtl and ltr text. That project took quite the effort but with a simple English table you should be able to dissect the pdf in no time. Look for the PDF specs on adobe.com and if it is that desperate start digging in.
Also you'll first need to use pdftk.exe to uncompress the file.
A shortcut that me be of aid:
http://www.adobe.com/devnet/pdf/pdf_reference.html
This is the shortcut I meant: http://www.codeproject.com/KB/cs/PDFToText.aspx
My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few of different things, but I did not get very far in any of them:
Convert PDF to text. It does not work for me as I lose images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.
Anyone has any suggestions on how to tackle this problem?
There is essentially not an easy cut-and-paste solution because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what us humans would consider to be proper reading order).
PDF parsing for headers and its sub contents are really very difficult (It doesn't mean its impossible ) as PDF comes in various formats. But I recently encountered with tool named GROBID which can helps in this scenario. I know it's not perfect but if we provide proper training it can accomplish our goals.
Grobid available as a opensource on github.
https://github.com/kermitt2/grobid
You may do use the following approach like this with iTextSharp or other open source libraries:
Read PDF file with with iTextSharp or similar open source tools and collect all text objects into an array (or convert PDF to HTML using the tool like pdftohtml and then parse HTML)
Sort all text objects by coordinates so you will have them all together
Then iterate through objects and check the distance between them to see if 2 or more objects can be merged into one paragraph or not
Or you may use the commercial tool like ByteScout PDF Extractor SDK that is capable of doing exactly this:
extract text and images along with analyzing the layout of the text
XML or CSV where text objects are merged or splitted into paragraphs inside a virtual layout grid
access objects via special API that makes it possible to address each object via its "virtual" row and column index disregarding how it is stored inside the original PDF.
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py, or tabula-java.
I made a full tutorial on how to use tabula-py on this article. You can tabula in a web-browser too as long as you have installed Java.
Unless its is Marked Content, PDF does not have a structure.... You have to 'guess' it which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDF's aren't very easy to parse. However, if you have certain additional information regarding the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images on top of which you can define your custom rules for parsing.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blogpost Document parsing for more information regarding document parsing.
Disclaimer:I was involved in writing the blogpost.
iText api:
PdfReader pr=new PdfReader("C:\test.pdf");
References:
PDFReader