I am struggling to open WMF files and run a simple ASCII find-and-replace that looks for certain characters in the metadata and replaces them with different values. Can this be done? I have a WMF editor that shows me the text inside the image is stored as a proper Windows font and not as a combination of lines and arcs. Any ideas are welcome!
I want text with exact format and layout from pdf.
If pdf to text is not the direct choice, is it possible to do pdf -> xml -> text?
I have already tried PyPDF2, pdfminer and pdftotxt. I have even tried AWS Textract and got an incorrect layout.
Basically, if I can construct sentences from the text extracted from the PDF, that's enough.
I used the Zamzar API, which gives exact output, but it is quite expensive.
Any possible solution?
If you are looking to keep the structure of the PDF but not the font, colour, size, etc., then try the pdftables_api library. This should preserve the layout of your PDF. Convert the PDF to CSV, since a CSV file is just a comma-separated text file.
If you are looking to keep font, colour, etc., the Zamzar API is probably your best option.
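Once a CSV has been produced, turning its rows back into plain lines of text needs only the standard library. The following is a minimal sketch, assuming the CSV content is already available as a string; the function name csv_to_lines is made up for illustration and is not part of pdftables_api.

```python
import csv
import io


def csv_to_lines(csv_text: str) -> list[str]:
    """Join the non-empty cells of each CSV row into one line of text."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Drop empty cells, which layout-preserving converters often emit as padding.
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:
            lines.append(" ".join(cells))
    return lines
```

Using the csv module rather than splitting on commas keeps quoted cells such as "Smith, J" intact.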
I have a PDF file that contains several annotations. As you can see in the image, there are several boxes in yellow and beige. These boxes can be edited in Adobe Reader. Could anyone help me find out the total number of these boxes in the PDF file using VBA?
Also, I tried converting the PDF to Word using VBA, but those boxes weren't present in the Word file, so that didn't work.
Here is the pdf file: https://drive.google.com/file/d/0B7uN4B3mxUlZMjB1T3BuM0o1VGs/view?usp=sharing
The text in those boxes is always blue while other text is black. Maybe that could be used.
Another way would be to use pdfseparate from http://www.foolabs.com/xpdf/download.html, and count how often the string <</AP <</N occurs in the generated file.
Or you could convert the pdf to an image and then count the number of colored rectangles.
You could also use one of the commercial tools available for creating or editing PDFs, e.g. http://www.pdflib.com/. I believe that one supports VBA.
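The string-counting idea mentioned above can be sketched in a few lines. This is only an illustration in Python rather than VBA, and it assumes the objects in the file are stored uncompressed and that each box carries exactly one appearance dictionary; the function name is hypothetical.

```python
import re

# Start of an appearance dictionary ("<</AP <</N"), tolerating optional whitespace.
AP_MARKER = re.compile(rb"<<\s*/AP\s*<<\s*/N")


def count_appearance_markers(data: bytes) -> int:
    """Count how often the appearance-dictionary marker occurs in raw PDF bytes."""
    return len(AP_MARKER.findall(data))
```

To run it on a file, read the file in binary mode and pass the bytes in, for example count_appearance_markers(open("page.pdf", "rb").read()), where page.pdf stands for whatever file the split step produced. If the PDF uses compressed object streams, the marker will not be visible in the raw bytes and the count will be too low.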
I am trying to copy some text from a PDF, but when I paste it into a Word file, it is just garbage, something like മുഖവുര. The PDF is in Malayalam. Under File->Properties->Fonts, it says BRHMalayalam (Embedded Subset), as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact, it almost certainly must use the ones in the document, because the file will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file and confirming the answer already given by @KenS, the problem with this PDF document is in fact how it's constructed, or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as subsets with a WinAnsi encoding. What is actually in the file is close enough to that encoding that the text copies out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set to WinAnsiEncoding, which makes no sense for this font.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no choice when copying and pasting but to assume the characters in the PDF are indeed WinAnsi-encoded, which means you end up with (garbled) Latin characters.
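The effect can be illustrated with a small sketch. The byte values and the mapping table below are invented for illustration and are not taken from the PDF in question; cp1252 is used as a close stand-in for WinAnsiEncoding.

```python
# Hypothetical character codes as they might appear in a page's text.
raw_codes = bytes([0x61, 0x70, 0x4A])


def decode_as_winansi(codes: bytes) -> str:
    """What a viewer falls back to when no ToUnicode table is present."""
    # cp1252 approximates PDF's WinAnsiEncoding for this purpose.
    return codes.decode("cp1252")


# Invented mapping for illustration only; a real ToUnicode CMap lives in the PDF.
HYPOTHETICAL_TOUNICODE = {0x61: "മ", 0x70: "ല", 0x4A: "യ"}


def decode_with_tounicode(codes: bytes, cmap: dict[int, str]) -> str:
    """What a viewer could do if the font carried a ToUnicode table."""
    return "".join(cmap.get(code, "\ufffd") for code in codes)
```

With the fallback, the three codes come out as the Latin letters "apJ". Only a mapping table lets the same codes come out as Malayalam, and that table is exactly what this PDF lacks.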
Just convert the PDF file to a Word file, then edit, copy or modify the text in it. Simple :)
When you are done, go to File -> Save As and change the format from DOC to PDF. Hope that makes sense :)
I'm developing a new function for "my" program. This function writes PDF files the simple way, by producing a plain text file that contains some codes from the PDF standard.
I'm still trying to understand how it works, but my first problem is how to apply bold to a line of my document.
I've already downloaded the PDF Reference guide, but I haven't found anything about it.
Any idea?
PDF is not like HTML, where you can apply formatting tags for emphasis. As you've read in the PDF reference, all that you do in PDF is set up a graphics environment (colours used, fonts used, etc.) and then put text on the page.
If you want to have something show in bold, use a font that is bold. If you want to have something show in italic, use a font that is italic.
Older software used dirty tricks to create "bold-alike" text, but the good (and easy) way to do it is to make sure you select the correct font before you start drawing text.
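As a rough sketch of what "select the correct font" means in practice, the script below writes a one-page PDF by hand. It assumes the standard Type 1 fonts Helvetica and Helvetica-Bold, which do not need to be embedded; the resource names /F1 and /F2, the page size and the function name build_pdf are arbitrary choices for this example.

```python
def build_pdf() -> bytes:
    """Build a minimal one-page PDF with one regular and one bold line."""
    # The Tf operator selects a font resource and size before text is drawn.
    content = (
        b"BT\n"
        b"/F1 14 Tf\n"                    # regular face
        b"72 720 Td\n"
        b"(This line is regular.) Tj\n"
        b"/F2 14 Tf\n"                    # bold face
        b"0 -20 Td\n"
        b"(This line is bold.) Tj\n"
        b"ET\n"
    )
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R /F2 5 0 R >> >> /Contents 6 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica-Bold >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte offset needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("bold_example.pdf", "wb") as handle:
        handle.write(build_pdf())
```

The only difference between the two lines of text is which font resource is selected with Tf before the Tj operator draws the string.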
I have successfully been able to read a Word document containing images using POI.
I have even been able to extract a section from the Word document, including the images.
I am writing the extracted portion containing images to a new Word document.
My problem is that I have to display this extracted portion (containing text, fonts, colors, images) on the screen using any standard Java Swing component.
Please advise how I can do that.
I tried JText, Panel, editor, but all of them take only text, and I lose my formatting and images.
Regards