A total self-taught noob here. I'm using the Windows Command Prompt to run Tesseract-OCR.
I managed to find the right command to get a two-layer PDF file as output, with the original scanned page and also searchable text.
tesseract filename.tif output -l ita pdf
Quite simple for me too.
But how do I repeat this operation for all the 200+ .tif files in the folder without doing it manually? It makes no difference to me whether I get one output PDF per file or a single combined PDF.
Thanks to everyone who will help me.
I found a way in the meantime: create a txt file containing the list of paths to all the .tif files (with the command dir /s /b *.tif > listname.txt) and then pass that file to Tesseract as the input in place of a single image.
Maybe there's a faster way, but this works.
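If one PDF per image is preferred over the list-file approach, the loop can also be scripted. The following is only a minimal sketch in Python, assuming tesseract is on the PATH and the ita language data is installed; the function names build_commands and run_all are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(folder, lang="ita"):
    """Return one tesseract argument list per .tif file in `folder`."""
    commands = []
    for tif in sorted(Path(folder).glob("*.tif")):
        # Tesseract appends ".pdf" to the output base name itself,
        # so the output path is given without an extension.
        output_base = tif.with_suffix("")
        commands.append(
            ["tesseract", str(tif), str(output_base), "-l", lang, "pdf"]
        )
    return commands


def run_all(folder, lang="ita"):
    """Run tesseract once per image; raises on the first failure."""
    for cmd in build_commands(folder, lang):
        subprocess.run(cmd, check=True)
```

Calling run_all(".") from the folder that holds the scans would write each PDF next to its source image. Only the top level of the folder is scanned here, unlike dir /s, which also descends into subfolders.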
Currently, I'm trying to write a program (in C) that would allow me to search and replace text in PDFs. I can already search for the string and edit the string itself, but I cannot find a way to actually edit the object in the document.
Does anyone know how to write the changes back to the document with the mutool/mupdf library?
If you are open to using command-line tools, you can use mutool clean.
mutool-clean
Using this tool you can rewrite content streams.
"The clean command pretty prints and rewrites the syntax of a PDF file. It can be used to repair broken files, expand compressed streams, filter out a range of pages, etc."
I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started this process and it asked for the directory, I chose C:, my main hard drive.
It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I had removed all the PDFs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the Word documents in the list.
So I manually removed those too. But Adobe still said I couldn't proceed until problem files were removed, even though none of the files remaining in the list were flagged as having issues.
My firm is trying to make sure all the PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without moving them from their varied locations.
I think you can do this using a combination of
regular Java: to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText: to iterate over the PDF document and extract all images
Tess4J: a Java wrapper for Tesseract (Google's OCR engine), to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to pipe in all the files of a given directory.
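The first step, collecting the PDFs from their scattered locations without moving them, is the easiest to script. Below is a minimal sketch of that step only, written in Python rather than Java; the name find_pdfs is hypothetical, and the image extraction and OCR steps are not shown.

```python
import os


def find_pdfs(root):
    """Yield the full path of every .pdf file under `root`, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match on the extension only, ignoring case, so Word
            # documents and other file types never enter the list.
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
```

Because the filter runs before any OCR tool sees the list, non-PDF files are skipped up front. Starting from specific document folders instead of the whole drive should also keep the scan time down.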
After some work with PHPExcel, I finally got it to generate sheets of 3000 cells in ~5 seconds by using a big array.
With the same data, I need to generate some PDF files. I've tried to do it with PHPExcel, but it is not a good choice: generating a PDF file with PHPExcel took a lot of time and a lot of resources.
I've also tried the html2pdf PHP library. A file containing a table with 3000 cells took 20 seconds to generate.
I can't find a good solution to this. Do you know of a good library, or any good practices for generating PDF files faster, with a low load on the server side?
You can use the FPDF library to generate PDF files in a fast manner and you can use the Write HTML tables add-on to achieve what you want (see example at the bottom of the page).
PHPExcel uses TCPDF to generate PDFs, the same as HTML2PDF does with PHP5:
HTML2PDF is a HTML to PDF converter written in PHP4 (use FPDF), and PHP5 (use TCPDF).
I think that when generating a PDF, PHPExcel first generates an XLS, then converts it to HTML, and then converts that to PDF. Not very efficient.
That is why using HTML2PDF directly cuts the time to 20 seconds.
--
To cut the waiting time even more, you could try another library, like dompdf, and keep skipping PHPExcel when what you need is a PDF.
If your table doesn't have formulas, you can generate all the content in an array, then pass it to one function that generates an XLS with PHPExcel and to another that generates a PDF.
Does anyone know how I can create one PDF file from multiple PPT files?
It could be done by writing a script or a program, but an existing program that does this would be best.
I searched the web for something like this but didn't get any results.
If you want to convert the PPT/PPTX files to PDF and then join those converted PDF files into a single PDF using either .NET or Java, you may try Aspose.Slides and Aspose.Pdf.Kit components.
Aspose.Slides allows you to convert the PPT/PPTX files to PDF, and Aspose.Pdf.Kit allows you to join the PDF files into a single PDF. Please see if this solution can work for your scenario.
Disclosure: I work as developer evangelist at Aspose.
I'm currently planning an application that involves manipulating PDFs. My goal is a program that takes a PDF as input and saves, as output, separate grayscale images of the colour channels the PDF consists of. This is basically a simple RIP.
I'm currently using a solution based on Ghostscript, but I want to rewrite the application to optimise speed and usability. (Ghostscript doesn't separate PDFs, for example.)
Do you know of any other open source libraries that I may find useful to achieve this?
Did you ever try to run (I'm assuming Windows here):
mkdir separated
gswin32c.exe ^
-o separated/page_%04d.tif ^
-sDEVICE=tiffsep ^
d:/path/to/input.pdf
(you can also try -sDEVICE=tiffsep1) and then look at the files you got in the separated subdirectory? Isn't this a case of Ghostscript separating PDFs, in your view?
The device tiffsep creates multiple output files:
One single 32-bit composite CMYK file (tiff32nc format) per PDF page.
Multiple tiffgray files (each compressed with LZW) per PDF page, one for each separation.
The device tiffsep1 behaves similarly, but...
...doesn't create the composite output file...
...and it creates tiffg4 output files for the separations.
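To call this from a program rather than typing it at the prompt, the same command can be assembled as an argument list. The following is a minimal sketch in Python, assuming gswin32c.exe is on the PATH; the function names build_tiffsep_command and separate are hypothetical, and only the -o and -sDEVICE options shown above are used.

```python
import subprocess
from pathlib import Path


def build_tiffsep_command(pdf_path, out_dir="separated", device="tiffsep",
                          gs_exe="gswin32c.exe"):
    """Build the Ghostscript argument list for separating one PDF."""
    # %04d is expanded by Ghostscript to the page number. Passing the
    # arguments as a list means no shell escaping is needed for it.
    out_pattern = str(Path(out_dir) / "page_%04d.tif")
    return [gs_exe, "-o", out_pattern, f"-sDEVICE={device}", str(pdf_path)]


def separate(pdf_path, out_dir="separated", device="tiffsep"):
    """Create the output folder, then run Ghostscript on the PDF."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(build_tiffsep_command(pdf_path, out_dir, device), check=True)
```

Passing device="tiffsep1" switches to the second device described above without changing anything else.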