Combine two commands using GNU parallel for OCR project - pdf

I would like to write a script which runs a command to OCR PDFs and deletes the resulting images after the text files have been written.
The two commands I want to combine are the following.
This command creates folders, extracts PGM images from each PDF, and adds them to the corresponding folder:
time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'
This command does the OCR and deletes the resulting PGM images:
time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
I would like to combine both commands so that the script deletes the PGM images after each OCR run. If I run the above commands as they are, the first will extract all the images and eat up my disk space; only then would the second command do the OCR and, as a last step, delete the images.
So,
Create folder
Extract PGM from PDF
OCR from PGM to txt
Delete the PGM images that have just been used
Basically, I would like these four steps to be done in this order for each PDF separately, not for all PDFs at once. How can I do this?
Edit:
My first attempt to solve my issue was to create the following command:
time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
However, tesseract would not find the language package.
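If the cause is simply that the subshells spawned by parallel do not see the same environment as the login shell (an assumption; the actual cause was never identified), explicitly pointing tesseract at its language data before running the combined command might help:
export TESSDATA_PREFIX=/path/to/tessdata   # hypothetical path; where deu_frak.traineddata lives (exact layout depends on the tesseract version)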

Updated Answer
I have not tested this, so please run it on a copy of a small subset of your files first. You can turn off the messages that start with DEBUG: once you are happy it looks good:
#!/bin/bash
# Declare a function for "parallel" to call
doit() {
   # Get name of PDF with and without extension
   withext="$1"
   noext="$2"
   echo "DEBUG: Processing $withext into $noext"

   # Make output directory
   mkdir -p "$noext"

   # Extract as PGM into subdirectory
   gs ... -o "$noext/${noext}-%03d.pgm" "$withext"

   # Go to target directory or die with error message
   cd "$noext" || { echo "ERROR: Failed to cd to $noext"; exit 1; }

   # OCR and remove each PGM
   n=0
   for f in *.pgm; do
      echo "DEBUG: OCR $f into $n"
      tesseract "$f" "$n" -l deu_frak
      echo "DEBUG: Remove $f"
      rm "$f"
      ((n=n+1))
   done
}
# Ensure the function is exported to subshells
export -f doit
find . -name \*.pdf -print0 | parallel -0 doit {} {.}
You should be able to test the doit() function without parallel by running:
doit someFile.pdf someFile
Original Answer
If you want to do lots of things for each argument in GNU Parallel, the simplest way is to declare a bash function and then call that.
It looks like this:
# Declare a function for "parallel" to call
doit() {
   echo "$1" "$2"
   # mkdir something
   # extract PGM
   # do OCR
   # delete PGM
}
# Ensure the function is exported to subshells
export -f doit
find some files -print0 | parallel -0 doit {} {.}

Related

Two input file types at the same time in GNU parallel?

Is it possible to have two input file types at the same time using one instance of gnu parallel?
This long command:
find . -name \*.pdf | parallel -j 4 --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
a)
creates a folder for each PDF it reads (first input file type),
converts the PDF with Ghostscript to PGM images,
moves them into the respective folder,
then uses tesseract to perform OCR on each PGM (second input file type),
after which it saves the text files in each respective folder,
and finally deletes all PGM image files.
However, the above command actually consists of two commands combined with &&, splitting the above routine into two separate parts. The result is that it would:
b)
first convert all PDFs into PGM image files (which eats up a lot of disk space!)
before starting the OCR and the subsequent purge of the then-unneeded PGM image files.
This is undesirable, as it would eat up all my disk space before the second part of the command gets to execute!
Is it possible to combine both commands into one, so that parallel goes through the whole process of a) for the first four PDFs (as parallel runs 4 jobs at a time with -j 4) before moving on to the next four PDF files?
However, it seems that something like the below minimal example is not possible with parallel:
parallel -j 4 --progress --eta 'mkdir -p {.} && gs -sDEVICE=pgmraw -r300 -o {.}/{.}-%03d.pgm {} && tesseract {} {.} -l deu_frak && rm {.}.pgm' ::: *.pdf *.pgm
Note the two input file extensions, ::: *.pdf *.pgm, at the end.
What can I do to make parallel follow routine a)?
EDIT:
This is the entire code I have tried as proposed by Ole Tange:
generate_pgm() {
    PDF="$1"
    find . -name \*.pdf | parallel 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' ::: *.pdf
}
export -f generate_pgm

ocr() {
    PGM="$1"
    find . -name \*.pgm | parallel 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
    rm "$PGM"
}
export -f ocr
time parallel -j 4 --progress --eta 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
Unfortunately, it has been unsuccessful, as this script basically does the same as my original script: it creates folders for all PDFs and starts converting all PDFs to PGM while starting the OCR on the first PGM images, instead of going through the whole process for each batch of four PDFs before starting with the next four.
I see 2 solutions:
generate_pgm() {
    PDF="$1"
    # gs stuff
}
export -f generate_pgm

ocr() {
    PGM="$1"
    # tesseract stuff
    rm "$PGM"
}
export -f ocr
parallel 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
This will process a file completely before going to the next.
It will, however, run up to N^2 processes (N=number of cores). To avoid that use --load:
parallel 'generate_pgm {}; parallel --load 100% --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
This way you should only get one active process per CPU core.
If you want it to only convert one PDF at a time:
parallel -j1 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
Another solution is to use the dir processor https://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-dir-processor:
nice parallel generate_pgm ::: *pdf &
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f pgm_output_dir | parallel ocr
This way the PGM generation will be done in parallel with the OCR. The risk here is that if the PGM generation is much faster than the OCR, it can still fill your disk.

Open PDF found with volatility

My task is to analyze a memory dump. I've found the location of a PDF file and I want to analyze it with VirusTotal, but I can't figure out how to "download" it from the memory dump.
I've already tried it with this command:
python vol.py -f img.vmem dumpfiles -r pdf$ -i --name -D dumpfiles/
But in my dumpfiles directory there is just a .vacb file, which is not a valid PDF.
I think you may have missed a command line argument from your command:
python vol.py -f img.vmem dumpfiles -r pdf$ -i --name -D dumpfiles/
If you are not getting a .dat file in your output folder you can add -u:
-u, --unsafe Relax safety constraints for more data
I can't test this without access to the dump, but you should be able to rename the .dat file created to .pdf.
So it should look something like this:
python vol.py -f img.vmem dumpfiles -r pdf$ -i --name -D dumpfiles/ -u
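Once the .dat file shows up in the output folder, the rename itself is just a mv; the filename below is only a placeholder, since the real name produced by dumpfiles depends on the dump:
mv dumpfiles/*.dat dumpfiles/recovered.pdf   # assumes a single extracted .dat file; adjust to the actual name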
You can check out the documentation on the commands here
VACB is "virtual address control block". Your output type seems to be wrong.
Try something like:
$ python vol.py -f img.vmem dumpfiles --output=pdf --output-file=bla.pdf --profile=[your profile] -D dumpfiles/
or check out the cheat sheet: here

How to locate code in PHP inside a directory and edit it

I've been having problems with multiple hidden infected PHP files on my server, which are encrypted (ClamAV can't see them).
I would like to know how you can run an SSH command that searches for all the infected files and edits them.
Up until now I have located them by the file contents like this:
find /home/***/public_html/ -exec grep -l '$tnawdjmoxr' {} \;
Note: $tnawdjmoxr is a piece of the code
How do you locate and remove this code inside all PHP files in the directory /public_html/?
You can add xargs and sed:
find /home/***/public_html/ -exec grep -l '$tnawdjmoxr' {} \; | xargs -d '\n' -n 100 sed -i 's|\$tnawdjmoxr||g' --
You may also use sed directly instead of grep, but doing so can alter the modification time of every file and may make some unexpected modifications, such as changing line endings.
-d '\n' makes sure that every argument is read line by line. It's helpful if filenames have spaces in them.
-n 100 limits the number of files that sed processes in one invocation.
-- makes sed recognize filenames starting with a dash. It's also advisable to give it to grep: grep -l -e '$tnawdjmoxr' -- {} \;
File searching may be faster with grep -F (see the combined sketch after this list).
sed -i enables inline editing.
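Putting those flags together, an untested sketch of the combined command could look like this (the -type f and -name '*.php' filters are additions here, since the question is specifically about PHP files):
find /home/***/public_html/ -type f -name '*.php' -exec grep -lF -- '$tnawdjmoxr' {} \; | xargs -d '\n' -n 100 sed -i 's|\$tnawdjmoxr||g' --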
Besides using xargs it would also be possible to use Bash:
find /home/***/public_html/ -exec grep -l '$tnawdjmoxr' {} \; | while IFS= read -r FILE; do sed -i 's|\$tnawdjmoxr||g' -- "$FILE"; done
while IFS= read -r FILE; do sed -i 's|\$tnawdjmoxr||g' -- "$FILE"; done < <(exec find /home/***/public_html/ -exec grep -l '$tnawdjmoxr' {} \;)
readarray -t FILES < <(exec find /home/***/public_html/ -exec grep -l '$tnawdjmoxr' {} \;)
sed -i 's|\$tnawdjmoxr||g' -- "${FILES[@]}"

background xargs/wget not adhering to -P and -n limits

I'm having a problem with xargs and Wget when run as shell scripts in an AppleScript app. I want Wget to run 4 parallel processes in the background. The problem: basically, when I try to run the process in the background with
cat urls.txt | xargs -P 4 -n 1 /usr/local/bin/wget -q -E -b 1> NUL 2> NUL
a Wget process is apparently started for each URL passed in from the .txt file. This is too burdensome on the user's memory. When I run it in the foreground, however, with something like:
cat urls.txt | xargs -P 4 -n 1 /usr/local/bin/wget -q -E
I seem to get the four parallel Wget processes I need. Does anybody know how to get this script to run in the background with only 4 processes? I'm a bit of a novice, and I'm afraid I can't figure out why backgrounding the process causes this change.
You might run xargs on the background instead:
cat urls.txt | xargs -P4 -n1 wget -q &
Or if you want to return control to the AppleScript, disown the xargs process:
do shell script "cat urls.txt | xargs -P4 -n1 /usr/local/bin/wget -q & disown $!"
As far as I can tell, I have solved the problem with
cat urls.txt | (xargs -P4 -n1 wget -q -E >/dev/null 2>&1) &
There may well be a better solution, though...

Using wget 1.12 centos 6 to batch download and rename output files

Using wget directly:
wget -c --load-cookies cookies.txt http://www.example.com/file
works fine, and so does renaming the output with -O:
wget -c --load-cookies cookies.txt http://www.example.com/file.mpg -O filename_to_save_as.mpg
When I use
wget -c --load-cookies cookies.txt -i /dir/inputfile.txt
to pass URLs from a text file, wget works as expected. Is there any way to pass a URL from a text file and still rename the output file as in example 2 above? I have tried passing the -O option with an argument, but wget tells me "invalid URL http://site.com/file.mpg -O new_name.mpg: scheme missing".
I have also tried escaping after the URL, quotes, and formatting such as
url = "http://foo.bar/file.mpg" -O new_name.mpg
Is there any way to use an input file and still change the output file name using wget?
If not, would a shell script be more appropriate? If so, how should it be written?
I don't think that wget supports it, but it's possible to do with a small shell script.
First, create an input file like this (inputfile.txt):
http://www.example.com/file1.mpg filename_to_save_as1.mpg
http://www.example.com/file2.mpg filename_to_save_as2.mpg
http://www.example.com/file3.mpg filename_to_save_as3.mpg
The url and the filename are separated by a tab character.
Then use this bash script (wget2.sh):
#!/bin/bash
# Read one tab-separated "URL<TAB>filename" pair per line from stdin
while IFS= read -r line
do
    URL=$(echo "$line" | cut -f 1)
    FILENAME=$(echo "$line" | cut -f 2)
    wget -c --load-cookies cookies.txt "$URL" -O "$FILENAME"
done
and run it with the input file on stdin:
./wget2.sh < inputfile.txt
A simpler solution is to write a shell script which contains the wget command for every file:
#!/bin/bash
wget -c --load-cookies cookies.txt http://www.example.com/file1.mpg -O filename_to_save_as1.mpg
wget -c --load-cookies cookies.txt http://www.example.com/file2.mpg -O filename_to_save_as2.mpg
wget -c --load-cookies cookies.txt http://www.example.com/file3.mpg -O filename_to_save_as3.mpg
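If there are many entries, a script like the one above does not have to be written by hand; assuming the tab-separated inputfile.txt format shown earlier, something along these lines could generate and run it (untested sketch):
# Turn each "URL<TAB>filename" line of inputfile.txt into a wget command
awk -F'\t' '{ printf "wget -c --load-cookies cookies.txt \"%s\" -O \"%s\"\n", $1, $2 }' inputfile.txt > wget_all.sh
bash wget_all.sh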