How to convert Marathi data from PDF to Excel in the proper format - pdf

I am converting Marathi data from PDF to Excel or Word, but the text does not keep its proper form.
I copied some data from the PDF and pasted it into a Word document, and the characters came out wrong.
For example, the PDF contains this line: प्रविण सुधाकर शिरवाडकर
but when I copy and paste it into Word I get:
-प्रववर् सुधाकर शिरवाडकर
What should I do about this?
Can anyone please help me?
Thank you in advance.

There seem to be problems in the way the PDF stores Unicode Devanagari text. Try this alternative route: convert your PDF to images. You can use an online or downloaded tool, or on Linux run this command in a terminal (it needs ImageMagick):
for f in *.pdf; do convert -density 200 "$f" "${f}_200dpi.jpg"; done
Change the density from 200 to another value as needed. Each page of your document should be converted into an image file. For a Windows tool, try https://www.pdfill.com/pdf_tools_free.html
Then go to http://www.i2ocr.com/free-online-hindi-ocr, upload the image and convert it. That site uses OCR (optical character recognition).
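If you would rather script the image step than type the loop by hand, here is a minimal Python sketch of the same idea. It assumes ImageMagick's convert is installed and on the PATH. The helper names build_convert_command and convert_all are made up for this illustration, and the output name drops the .pdf suffix, which is a small departure from the shell loop.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, density=200):
    """Return the ImageMagick command that rasterises one PDF at the given DPI."""
    pdf = Path(pdf_path)
    # e.g. report.pdf -> report_200dpi.jpg, written next to the source file
    out = pdf.with_name(f"{pdf.stem}_{density}dpi.jpg")
    return ["convert", "-density", str(density), str(pdf), str(out)]


def convert_all(folder=".", density=200):
    """Run the command for every PDF in `folder` (assumes ImageMagick is installed)."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if convert exits with an error
        subprocess.run(build_convert_command(pdf, density), check=True)
```

Calling convert_all("scans", density=300) would process every PDF in a folder named scans. The resulting images can then be uploaded to the OCR site.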

Check the font in your PDF and try making it available to the Word document.

I think you don't have the particular fonts that are used in the PDF.
In Adobe Reader, File menu > Properties > Fonts tab gives you a list of all fonts used in the document.
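If Adobe Reader is not at hand, a rough way to peek at the font names is to scan the raw file for /BaseFont entries. This is only a sketch using the Python standard library. It assumes the font dictionaries are stored uncompressed; many PDFs keep them inside compressed object streams, in which case nothing will be found and the Fonts tab remains the reliable source. The function name list_base_fonts is hypothetical.

```python
import re

# Matches entries such as "/BaseFont /ABCDEF+Mangal" in uncompressed PDF objects.
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def list_base_fonts(pdf_bytes):
    """Return the sorted, de-duplicated font names found in raw PDF bytes."""
    names = set()
    for match in BASEFONT_RE.finditer(pdf_bytes):
        name = match.group(1).decode("latin-1")
        # Subset fonts carry a six-letter tag such as "ABCDEF+"; drop it.
        prefix = name[:6]
        if len(name) > 7 and name[6] == "+" and prefix.isalpha() and prefix.isupper():
            name = name[7:]
        names.add(name)
    return sorted(names)
```

Typical use would be list_base_fonts(open("file.pdf", "rb").read()). Names that contain #xx escapes are returned as-is, not decoded.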

Related

How to convert text pdf to image pdf using ghostscript

I need to convert text in pdf file to images, so users cannot copy it from the pdf etc.
This should be equivalent to converting the entire PDF to a set of images and then merging them into one single document. I did so, but it seems slow. Is there any way to do it with Ghostscript options?
Welp, looks like I only need to specify option -dNoOutputFonts.
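For reference, a minimal sketch of how that option might be passed when driving Ghostscript from Python. It assumes the executable is called gs (on Windows it is usually gswin64c) and that the pdfwrite device is used to produce a PDF again. The helper names are made up for illustration. As far as I know, -dNoOutputFonts makes pdfwrite draw the text as vector outlines with no fonts embedded, so the result is not a bitmap but the text can no longer be selected as text.

```python
import subprocess


def build_gs_command(src, dst, gs_binary="gs"):
    """Return a Ghostscript command that rewrites `src` without embedding fonts."""
    return [
        gs_binary,
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # write a PDF again rather than raster images
        "-dNoOutputFonts",    # the option named above: emit no fonts
        src,
    ]


def flatten_text(src, dst):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(src, dst), check=True)
```

flatten_text("in.pdf", "out.pdf") would then write the converted copy in a single pass, which avoids the slow render-and-merge route.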

What's the best way to extract text from pdf in python without changing the layout and format?

I want text with exact format and layout from pdf.
If pdf to text is not the direct choice, is it possible to do pdf -> xml -> text?
I have already tried PyPDF2, pdfminer and pdftotxt. I even tried AWS Textract and got an incorrect layout.
Basically, if I can reconstruct sentences from the text extracted from the PDF, that's enough.
I used the Zamzar API, which gives exact output, but it is quite expensive.
Any possible solution?
If you are looking to keep the structure of the PDF but not the font, colour, size etc., then try the pdftables_api library. This should hold the layout of your PDF. Convert the PDF to CSV, as a CSV file is just a comma-separated text file.
If you are looking to keep font, colour etc., the Zamzar API is probably your best option.
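Once you have a CSV, turning it back into readable text lines takes only the standard library. The sketch below assumes the CSV text has already been produced by a converter such as the one mentioned above. The function name csv_to_lines is hypothetical and the column padding is just one way to approximate the original layout.

```python
import csv
import io


def csv_to_lines(csv_text, gap="  "):
    """Turn CSV text into plain lines, padding each column to its widest cell."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]
    col_widths = [max(len(row[i]) for row in rows) for i in range(width)]
    return [
        gap.join(cell.ljust(col_widths[i]) for i, cell in enumerate(row)).rstrip()
        for row in rows
    ]
```

Each returned line keeps the cells of one row in order, so sentences that were split across columns can be read left to right again.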

Open pdf file in Microsoft Word using OLE

I am looking for the method (of the Word OLE object) that can open a PDF in Microsoft Word.
I want to copy all pages of the PDF into a doc/docx and add footers there.
Could anybody give me a hint on how to import a PDF?
PS: any sample code for this problem would be great.
Thanks,
Lilya
You need an OCR (Optical Character Recognition) engine to convert a PDF to a document. PDF is a generic format and it can include text as an image, so it is very hard to convert a PDF to a document. SAP doesn't have any OCR function for this. Maybe OpenText (if the customer is using it) has this functionality; I don't have detailed information about OpenText. You need third-party tools for this. You can use online services or command-line utilities to convert PDF files to text files easily if the PDF includes text; otherwise you need a professional SDK (for example ABBYY FineReader).
I used Foxit PDF Reader to save the PDF file as a text file and wrote a macro to read the text file. Of course, by doing so, you only get the text and nothing else.

Reading PDF with format info like Bold, text size

I need to read a PDF file for its text while also getting the formatting. I don't need the exact same format; I just need to know whether some text is bold (font weight) and what size the text is.
I can read docx files while also getting the formatting information, so if there is a way to convert the PDF to doc/docx file that also works.
Any language will do as long as the resources needed can run on a Linux server.
What I have tried
pypdf / PyPDF2 (can't get the formatting)
PDFMiner (can't get the formatting)
Xpdf by FooLabs (again, no option to get any formatting)
Using a shell script to automate opening a PDF in Microsoft Word and saving it as a docx file (it works, but it's not feasible on a Linux server)

How to convert PDF file to .doc format in Objective-C?

Right now I am working on an iPad application that lets users open a PDF file and customize it. Now I want to add a feature that converts that PDF file to .doc format.
I researched this but did not find a way. Can anybody help me out?
Thanking you in advance.
I wrote an article on PDF to text conversion issues. If you look at some of the existing PDF to Word conversion tools (e.g. BCL), you will see what is realistically possible with a lot of work.
It’s not possible to convert a generic PDF back into a text format. I guess you could render the PDF into images and create a DOC from those, but that doesn’t sound very useful.