GIMP Script.Fu script to batch convert JPEG to PNG - batch-processing

Can someone give me the script I would need to run to batch convert many *.jpeg files to *.png in Script.Fu in GIMP?
Currently I am spending way too much time manually exporting every image and it's a waste of time.
I can't install anything right now so can't use alternative applications.

Alright, after a lot of trials and errors I finally figured out how to convert one file format to another using only GIMP.
This is the Script-Fu script for conversion to PNG:
(
let* ((filename "{{filename}}")
(output "{{output}}")
(image (car (gimp-file-load 1 filename filename)))
(drawable (car (gimp-image-get-active-layer image))))
(file-png-save-defaults 1 image drawable output output)
)
Where {{filename}} is input file that needs to be converted (a jpeg file, for example), {{output}} is the output file that you need (it can be simply the same file name but with PNG extension)
How to run it: it can probably be improved
gimp -i -n -f -d --batch "{{one-line script-fu}}"
More about command line options you can find in GIMP online documentation.
The place that needs to be changed is {{one-line script-fu}} and it has to be... one-line! You can probably do all of this in one file using cmd (in case if you use Windows), but for me it was easier to use Python, so here's the script for it:
import subprocess, os
def convert_to_png(file_dds):
#Loads the command to run gimp cli (second code block)
#Note: remove "{{one-line script-fu}}" and leave one space after the --batch
with open("gimp-convert.bat", "r") as f:
main_script = f.read()
#Prepares the Script-Fu script to be run, replacing necessary file names and makes it one-line (the firs code block)
with open("gimp-convert-png.fu", "r") as f:
script = f.read().replace("\n", " ").replace("{{filename}}", file_dds) \
.replace("{{output}}", file_dds[:-3]+"PNG").replace("\\", "\\\\").replace("\"", "\\\"")
subprocess.run(main_script + " \"" + script + "\" --batch \"(gimp-quit 1)\"",
cwd = os.getcwd(),
shell = True)
And you should get your file converted to PNG!
I needed this for my texture upscale project, all of the code below you can find here.
Tested with GIMP 2.10

The real solution is to use ImageMagicks convert, as simple as magick convert some.jpeg some.png. There must be a "portable" version somewhere that you can use off a USB key.
Otherwise with Gimp, a much less manual way that doesn't need for a new script, since it uses an existing script:
get/install ofn-export-layers
File>Open the first JPEG
File>Open as layers more Jpegs. You can select several/all jpegs in one call (actual number limited by available RAM mostly). Once this is done you have many Jpegs stacked in the same image
File>Export all layers, making sure the name pattern you use ends in .png (the doc that comes with the script explains how that works).

Related

Batch extract Hex colour from images to file

I have around 10k images that I need to get the Hex colour from for each one. I can obviously do this manually with PS or other tools but I'm looking for a solution that would ideally:
Run against a folder full of JPG images.
Extract the Hex from dead center of the image.
Output the result to a text file, ideally a CSV, containing the file name and the resulting Hex code on each row.
Can anyone suggest something that will save my sanity please? Cheers!
I would suggest ImageMagick which is installed on most Linux distros and is available for OSX (via homebrew) and Windows.
So, just at the command-line, in a directory full of JPG images, you could run this:
convert *.jpg -gravity center -crop 1x1+0+0 -format "%f,%[fx:int(mean.r*255)],%[fx:int(mean.g*255)],%[fx:int(mean.b*255)]\n" info:
Sample Output
a.png,127,0,128
b.jpg,127,0,129
b.png,255,0,0
Notes:
If you have more files in a directory than your shell can glob, you may be better of letting ImageMagick do the globbing internally, rather than using the shell, with:
convert '*.jpg' ...
If your files are large, you may better off doing them one at a time in a loop rather than loading them all into memory:
for f in *.jpg; do convert "$f" ....... ; done

Inkscape "PDF + Latex" export

I'm using inkscape to produce vector figures, save them in SVG format to export them later as "PDF + Latex" much in the vein of TUG inkscape+pdflatex guide.
Trying to produce a simple figure, however, turns out to be extremely frustating.
The first figure
is an example of the figure I would like to export in the form of "PDF + Latex" (shown here in PNG format).
If I export this to a PDF figure without latex macros the PDF produced looks exactly the same, except for some minor differences with the fonts used to render the text.
When I try to export this using the "PDF + Latex" option the PDF file produced consists on a PDF document of 2 pages (again as .png here):
This, of course, does not looks good when compiling my latex document. So far the guide at TUG has been very helpful, but I still can't produce a working "PDF + Latex" export from inkscape.
What am I doing wrong?
I worked around this by putting all the text in my drawing at the top
select text and then Object -> Raise to top
Inkscape only generates the separate pages if the text is below another object.
I asked this question on the Inkscape online discussion page and got some very helpful guidance from one of the users there.
This is a known bug https://bugs.launchpad.net/ubuntu/+bug/1417470 which was inadvertently introduced in Inkscape 0.91 in an attempt to fix a previous bug https://bugs.launchpad.net/inkscape/+bug/771957.
It seems this bug does two things:
The *.pdf_tex file will have an extra \includegraphics statement which needs to be deleted manually as described in the link to the bug above.
The *.pdf file may be split into multiple pages, regardless of the size of the image. In my case the line objects were split off onto their own page. I worked around this by turning off the text objects (opacity to zero) and then doing a standard PDF export.
If you can execute linux commands, this works:
# Generate the .pdf and .pdf_tex files
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# Fix the number of pages
sed -i 's/\\\\/\n/g' ${PDFFILE}_tex;
MAXPAGE=$(pdfinfo $PDFFILE | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ");
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" ${PDFFILE}_tex;
with:
$SVGFILE: path of the svg
$PDF_FILE: path of the pdf
It is possible to include these commands in a script and execute it automatically when compiling your tex file (so that you don't have to manually export from inkscape each time you modify your svg).
Try it with an illustration that is less wide.
Alternatively, use a wider paperwidth setting.

Ansys multiphysics: blank output file

I have a model of a heating process on Ansys Multiphysics, V11.
After running the simulation, I have a script to plot a temperature profile:
!---------------- POST PROCESSING -----------------------
/post1 ! tdatabase postprocessor
!---define profile temperature
path,s_temp1,2,,100 ! define a path
ppath,1,,dop/2,0,0 ! create a path point
ppath,2,,dop/2,1.5,0 ! create a path point
PDEF,surf_t1,TEMP, ,noav ! print a path
plpath,surf_t1 ! plot a path
What I now need, is to save the resulting path in a text file. I have already looked online for a solution, and found the following code to do it, which I appended after the lines above:
/OUTPUT,filename,extension
PRPATH,surf_t1
/OUTPUT
Ansys generates the file filename.extension but it is empty. I tried to place the OUTPUT command in a few locations in the script, but without any success.
I suspect I need to define something else, but I have no idea where to look, as Ansys documentation online is terribly chaotic, and all internet pages I've opened before writing this question are not better.
A final note: Ansys V11 is an old version of the software, but I don't want to upgrade it and fit the old model to the new software.
For the output of the simulation (which includes all calculation steps, and sub-steps description and node-by-node results) the output must be declared in the beginning of the code, and not in the postprocessing phase.
Declaring
/OUTPUT,filename,extension
in the preamble of the main script makes such that the output is stored in the right location, with the desired extension. At the end of the scripts, you must then declare
/OUTPUT
to reset the output file location for ANSYS.
The output to the PATH call made in the postprocessing script is however not printed in the file.
It is convenient to use
*CFOPEN,file,ext
*VWRITE,Vector(1,1).Vector(1,2)
(2F12.6)
*CFCLOSE
where Vector(1,1) is a two column array created by *DIM, and stores your data to output to file
As this is a special command, run it from file i.e. macro_output.mac

Batch OCR Program for PDFs [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
This has been asked before, but I don't really know if the answers help me. Here is my problem: I got a bunch of (10,000 or so) pdf files. Some were text files that were saved using adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.
It seems like people have recommended:
adobe pdf (I don't have a licensed copy of this so ... plus if ABBYY finereader or something is better, why pay for it if I won't use it)
ocropus (I can't figure out how to use this thing),
Tesseract (which seems like it was great in 1995 but I'm not sure if there's something more accurate plus it doesn't do pdfs natively and I've have to convert to TIFF. that raises its own problem as I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff. plus i don't want 10,000 30 page documents turned into 30,000 individual tiff images).
wowocr
pdftextstream (that was from 2009)
ABBYY FineReader (apparently its' $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate ocr).
Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.
BTW, I'm running Linux Mint 11 64 bit and/or windows 7 64 bit.
Here are the other threads:
Batch OCRing PDFs that haven't already been OCR'd
Open source OCR
PDF Text Extraction Approach Using OCR
https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred
Just to put some of your misconceptions straight...
" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."
You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. Your choice if you want to do it on Linux Mint or on Windows 7. The commandline for Linux is:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
input.pdf
"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"
You can have "multipage" TIFFs easily. Above command does create such TIFFs of the G4 (fax tiff) flavor. Should you even want single-page TIFFs instead, you can modify the command:
gs \
-o input_page_%03d.tif \
-sDEVICE=tiffg4 \
input.pdf
The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.
Caveats:
The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the commandline.
Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set widthxheight in device points. So to get ISO A4 output page dimensions in landscape you can add a -g8420x5950 parameter.
So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
-r720x720 \
-g5950x8420 \
input.pdf
Figured I would try to contribute by answering my own question (have written some nice code for myself and could not have done it without help from this board). If you cat the pdf files in unix (well, osx for me), then the pdf files that have text will have the word "Font" in them (as a string, but mixed in with other text) b/c that's how the file tells Adobe what fonts to do display.
The cat command in bash seems to have the same output as reading the file in binary mode in python (using 'rb' mode when opening file instead of 'w' or 'r' or 'a'). So I'm assuming that all pdf files that contain text with have the word "Font" in the binary output and that no image-only files ever will. If that's always true, then this code will make a list of all pdf files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, then you can use a command in bash to move the pdf files to the appropriate folder.
Once you have them in their own folders, then you can run your batch ocr solution on just the pdf files in the images_only folder. I haven't gotten that far yet (obviously).
import os, re
#path is the directory with the files, other 2 are the names of the files you will store your lists in
path = 'C:/folder_with_pdfs'
files_with_text = open('files_with_text.txt', 'a')
image_only_files = open('image_only_files.txt', 'a')
#have os make a list of all files in that dir for a loop
filelist = os.listdir(path)
#compile regular expression that matches "Font"
mysearch = re.compile(r'.*Font.*', re.DOTALL)
#loop over all files in the directory, open them in binary ('rb'), search that binary for "Font"
#if they have "Font" they have text, if not they don't
#(pdf does something to understand the Font type and uses this word every time the pdf contains text)
for pdf in filelist:
openable_file = os.path.join(path, pdf)
cat_file = open(openable_file, 'rb')
usable_cat_file = cat_file.read()
#print usable_cat_file
if mysearch.match(usable_cat_file):
files_with_text.write(pdf + '\n')
else:
image_only_files.write(pdf + '\n')
To move the files, I entered this command in bash shell:
cat files_with_text.txt | while read i; do mv $i Volumes/hard_drive_name/new_destination_directory_name; done
Also, I didn't re-run the python code above, I just hand-edited the thing, so it might be buggy, Idk.
This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer, I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces - the first is iterating over all your PDFs:
string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();
foreach (string path in candidatePDFs) {
using (FileStream stm = new FileStream(path, FileMode.Open)) {
if (decoder.IsValidFormat(stm)) {
ProcessPdf(path, stm);
}
}
}
This gets a list of all files that end in .pdf and if the file is a valid pdf, calls a routine to process it:
public void ProcessPdf(string path, Stream stm)
{
using (Document doc = new Document(stm)) {
int i=0;
foreach (Page p in doc.Pages) {
if (p.SingleImageOnly) {
ProcessWithOcr(path, stm, i);
}
else {
ProcessWithTextExtract(path, stm, i);
}
i++;
}
}
}
This opens the file as a Document object and asks if each page is image only. If so it will OCR the page, else it will text extract:
public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
using (Stream textStream = GetTextStream(path, page)) {
PdfDecoder decoder = new PdfDecoder();
using (AtalaImage image = decoder.Read(pdfStm, page)) {
ImageCollection coll = new ImageCollection();
coll.Add(image);
ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
OcrEngine engine = GetOcrEngine();
engine.Initialize();
engine.Translate(source, "text/plain", textStream);
engine.Shutdown();
}
}
}
what this does is rasterizes the PDF page into an image and puts it into a form that is palatable for engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to client code to loop over the structure and write out the text.
You'll note that I've left out GetOcrEngine() - we make available 4 OCR engines for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that would be best for your needs.
Finally, you would need the code to extract text from the pages that already have perfectly good text on them:
public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
using (Stream textStream = GetTextStream(path, page)) {
StreamWriter writer = new StreamWriter(textStream);
using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
PdfTextPage page = doc.GetPage(i);
writer.Write(page.GetText(0, page.CharCount));
}
}
}
This extracts the text from the given page and writes it to the output stream.
Finally, you need GetTextStream():
public Stream GetTextStream(string sourcePath, int pageNo)
{
string dir = Path.GetDirectoryName(sourcePath);
string fname = Path.GetFileNameWithoutExtension(sourcePath);
string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
return new FileStream(finalPath, FileMode.Create);
}
Will this be a 100% solution? No. Certainly not. You could imagine PDF pages that contain a single image with a box draw around it - this would clearly fail the image only test but return no useful text. Probably, a better approach is to just use the extracted text and if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate.
The simplest approach would be to use a single tool such a ABBYY FineReader, Omnipage etc to process the images in one batch without having to sort them out into scanned vs not scanned images. I believe FineReader converts the PDF's to images before performing OCR anyway.
Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling etc. These are features you would have to buy an image processng library for and program yourself and it could prove difficult to find an optimal set of parameters for your 10,000 PDF's.
Using the automatic OCR approach will have other side effects depending on the input images and you would find you would get better results if you sorted the images and set optimal parameters for each type of images. For accuracy it would be much better to use a proper PDF text extraction routine to extract the PDF's that have perfect text.
At the end of the day it will come down to time and money versus the quality of the results that you need. At the end of the day, a commercial OCR program will be the quickest and easiest solution. If you have clean text only documents then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.
I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.
I have written a small wrapper for Abbyy OCR4LINUX CLI engine (IMHO, doesn't cost that much) and Tesseract 3.
The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory
The script uses pdffonts to determine if a PDF file has already been OCRed to skip them. Also, the script can work as system service to monitor a directory and launch an OCR action as soon as a file enters the directory.
Script can be found here:
https://github.com/deajan/pmOCR
Hopefully, this helps someone.

convert pdf to svg

I want to convert PDF to SVG please suggest some libraries/executable that will be able to do this efficiently. I have written my own java program using the apache PDFBox and Batik libraries -
PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
GenericDOMImplementation.getDOMImplementation();
// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);
// Ask the test to render into the SVG Graphics2D implementation.
for(int i = 0 ; i < document.getNumberOfPages() ; i++){
String svgFName = svgDir+"page"+i+".svg";
(new File(svgFName)).createNewFile();
// Create an instance of the SVG Generator.
SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx,false);
Printable page = document.getPrintable(i);
page.print(svgGenerator, document.getPageFormat(i), i);
svgGenerator.stream(svgFName);
}
This solution works great but the size of the resulting svg files in huge.(many times greater than the pdf). I have figured out where the problem is by looking at the svg in a text editor. it encloses every character in the original document in its own block even if the font properties of the characters is the same. For example the word hello will appear as 6 different text blocks. Is there a way to fix the above code? or please suggest another solution that will work more efficiently.
Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.
To use Inkscape's command-line interface to convert a PDF to an SVG, use:
inkscape -l out.svg in.pdf
Which you can then probably call using:
Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29
I think exec() is synchronous and only returns after the process completes (although I'm not 100% sure on that), so you shoudl be able to just read "out.svg" after that. In any case, Googling "java system call" will yield more info on how to do that part correctly.
Take a look at pdf2svg (also on on github):
To use
pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]
When using all give a filename with %d in it (which will be replaced by the page number).
pdf2svg input.pdf output_page%d.svg all
And for some troubleshooting see:
http://www.calcmaster.net/personal_projects/pdf2svg/
pdftocairo can be used to convert pdf to svg. pdfcairo is part of poppler-utils.
For example to convert 2nd page of a pdf, following command can be run.
pdftocairo -svg -f 1 -l 1 input.pdf
pdftk 82page.pdf burst
sh to-svg.sh
contents of to-svg.sh
#!/bin/bash
FILES=burst/*
for f in $FILES
do
inkscape -l "$f.svg" "$f"
done
I have encountered issues with the suggested inkscape, pdf2svg, pdftocairo, as well as the not suggested convert and mutool when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files. The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:
dvisvgm --pdf --output=file.svg file.pdf
It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.
inkscape (#jbeard4) for me produced svgs with no text in them at all, but I was able to make it work by going to postscript as an intermediary using ghostscript.
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdf2ps -dFirstPage=$page -dLastPage=$page -dNoOutputFonts $1.pdf $1_$page.ps
inkscape -z -l $1_$page.svg $1_$page.ps
rm $1_$page.ps
done
However this is a bit cumbersome, and the winner for ease of use has to go to pdf2svg (#Koen.) since it has that all flag so you don't need to loop.
However, pdf2svg isn't available on CentOS 8, and to install it you need to do the following:
git clone https://github.com/dawbarton/pdf2svg.git && cd pdf2svg
#if you dont have development stuff specific to this project
sudo dnf config-manager --set-enabled powertools
sudo dnf install cairo-devel poppler-glib-devel
#git repo isn't quite ready to ./configure
touch README
autoreconf -f -i
./configure && make && sudo make install
It produces svgs that actually look nicer than the ghostscript-inkscape one above, the font seems to raster better.
pdf2svg $1.pdf $1_%d.svg all
But that installation is a bit much, too much even if you don't have sudo. On top of that, pdf2svg doesn't support stdin/stdout, so the readily available pdftocairo (#SuperNova) worked a treat in these regards, and here's an example of "advanced" use below:
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdftocairo -svg -f $page -l $page $1.pdf - | gzip -9 >$1_$page.svg.gz
done
Which produces files of the same quality and size (before compression) as pdf2svg, although not binary-identical (and even visually, jumping between output of the two some pixels of letters shift, but neither looks wrong/bad like inkscape did).
Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form that option is in the man page as --export-plain-svg and works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:
inkscape in.pdf -o out.svg