Does grep work properly on pdf files?

Does grep work properly on pdf files? - pdf

Is it possible to search multiple pdf files using the 'grep' command. It doesn't seem to work, how do people search content on multiple pdf files?

Well, PDF is a binary format, and grep can search binary files as if they were text
grep -a
or you can just use pdftotext (which comes with xpdf) like this:
pdftotext whee.pdf | grep pattern

You don't mention which OS you're using, but under Mac OS X you can use mdfind from the command line:
mdfind -onlyin search/directory/path "kind:pdf search text"

use something like Solr or clucene I think they can do what you want.

Pdf is a binary format, that's why searching it with grep is not that helpful. You can search the strings is a pdf with grep like this:
ls dir_with_pdfs/*.pdf|xargs strings|grep "keyword"
Or you can use the pdf2text command on pdf's and then search result with grep.

This tool pdfgrep will do the work. It has a syntax similar to grep. To search in several files just a simple shell script. For example:
$> ls Documents/*.pdf | xargs pdfgrep -n -H "system"
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: designed episodic memory system
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: how ISAC's episodic memory system is
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: cognitive system employs a combination
....

PDF is a binary dump of objects used to display the pages. There may be some meta data you can grep but the actual page text is in a Postscript stream and may be encoded in a variety of ways. Its also not guaranteed to be in any order. You need to think of PDF as more like a Vector image file than a text file.
There is a short article explaining text in PDFs in more detail at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams

If you have pdftotext installed via the popplar package, then try this perl script :
#!/usr/bin/perl
my $p = shift;
foreach my $fn (#ARGV) {
open(F,"pdftotext $fn - |");
while (<F>) { print "$fn:$_" if /$p/; }
close(F);
}

Related

How to identify the pdf object in raw pdf file?

I want to remove certain objects using programs.
Using cpdf I can get the objects, if I can somehow identify the objects that I want to delete, then I should be able to modify pdf files with programs.
$ cpdf in.pdf -output-json -output-json-parse-content-streams -o out.json
$ cpdf -j out.json -o out.pdf
However, I can not find out the object corresponding to my target text. For example, text search does not work on a raw pdf file. What is the best way to identify the target object of a text?
EDIT: Here is a test pdf. Please remove XYZ from the top of each page. Note that the test is a significant simplification of the real pdf file. So the solution should not be so simple so that it can not be applied to real complicated pdf files.
curl -s https://i.stack.imgur.com/whsnm.gif | tail -c +43 > test.pdf

The output of cpdf -output-json -output-json-parse-content-streams may or may not contain text which is recognisable to you. This depends on the font encodings in use, and the way in which text is layed out. In your file, for example, the painting of the string "XYZ" is represented as
[ "\u0000;\u0000<\u0000=", "Tj" ]
This is a string representing three codepoints indexing into the font. Cpdf presently has no way to show you what actual text this corresponds to; a future version will.
So I don't think your task can be done via cpdf -output-json in the general case, or indeed in this specific case.

How to make "pandoc" use a specific font when converting a plaintext file to PDF?

After a lot of trouble, I was finally able to run the command without errors:
pandoc -i 1.txt -o 1.pdf
The result is a PDF with completely messed up text because it uses some other font than Courier[ New]. Some varying-width, default font.
After reading and searching for a long time, I found this: https://pandoc.org/MANUAL.html#creating-a-pdf
The option "fontfamily" is mentioned, so I tried to do:
pandoc -i 1.txt -o 1.pdf --fontfamily=Courier
However, this results in:
Unknown option --fontfamily.
Try pandoc --help for more information.
I have looked through the entire "pandoc --help" output without finding any mention of fonts.
How do I set the font to be used?
(I'm trying my very best to not also add: "and why is it so incredibly difficult/cryptic/undocumented to do the most basic imaginable thing?"...)
I'm not even sure that this will fix all the problems. I just assume that the document is all messed up because the font isn't using fixed-width letters.

text to pdf with utf8 encoding (alternative to a2ps)

The programm a2ps does not support utf-8. At least my version does only
support the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert a simple utf-8 text to postscript or pdf?

If what you actually want is to use a2ps or enscript (which is a similar tool), and if your single need is to use them with some UTF-8 document, you only have to convert your document to ISO-8859-1 or some supported encoding. Various tools allow this. For instance, here is a workflow for enscript (but you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
On the other hand, if UTF-8 is a requirement, you may rather have to look for some recent tool allowing to convert UTF-8 to PDF. I wrote myself a Python program called txt2pdf; you may find it here. Have also a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.

You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.

Use paps! For instance I use it as follow:
paps --font="Monospace 10" input.txt > output.ps
and I have no problem with utf encoding.
If you need a pdf file then
pdf2ps output.ps

I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps

https://gitlab.com/gnomify/u2ps is the replacement of gnome-u2ps.

If the text file is small, paps converts to text to ps, which then can be fed to ps2pdf. The problem is ps file from paps causes ps2pdf to create a very big pdf file. If that is ok, this is possible. Currently, I am having a large file size pdf from paps.

There's a utility based on gnome libraries and named gnome-u2ps. It has less functionality than a2ps, and it seems that it is not maintained anymore.

How to create large PDF files (10MB, 50MB, 100MB, 200MB, 500MB, 1GB, etc.) for testing purposes?

I tried this:
for ((i=1; i<=10; i++)); do convert 100MB.pdf 10MB.pdf 100MB.pdf; done
to create 100MB file but very quickly run out of RAM.

The most simple tool: use pdftk (or pdftk.exe, if you are on Windows):
pdftk 10_MB.pdf 100_MB.pdf cat output 110_MB.pdf
This will be a valid PDF. Download pdftk here.
Update: if you want really large (and valid!), non-optimized PDFs, use this command:
pdftk 100MB.pdf 100MB.pdf 100MB.pdf 100MB.pdf 100MB.pdf cat output 500_MB.pdf
or even (if you are on Linux, Unix or Mac OS X):
pdftk $(for i in $(seq 1 100); do echo -n "100MB.pdf "; done) cat output 10_GB.pdf

Windows: fsutil
Usage:
fsutil file createnew [filename].[extension] [# of bytes]
Source: https://www.windows-commandline.com/how-to-create-large-dummy-file/
Linux: fallocate
Usage:
fallocate -l 10G [filename].[extension]
Source: Quickly create a large file on a Linux system?

For those using macOS mkfile might be a good alternative to fallocate or dd
mkfile 100m some100mfile.pdf
reference -
https://stackoverflow.com/a/33478049/711401

according to http://www.maketecheasier.com/combine-multiple-pdf-files-with-pdftk/ the command should be
pdftk file1.pdf file2.pdf file3.pdf cat output newfile.pdf
note that you should download windows version of pdftk

I had problems using pdftk with the cat parameter had a better success with output.
The following command worked for me:
pdftk file_1.pdf file_1.pdf file_1.pdf file_1.pdf cat output.pdf
Using cat produced the following error:
Error: Unexpected text in page range end, here:
output.pdf
Exiting.
Acceptable keywords, for example: "even" or "odd".
To rotate pages, use: "north" "south" "east"
"west" "left" "right" or "down"
Errors encountered. No output created.
Done. Input errors, so no output created.
http://www.pdflabs.com/docs/pdftk-cli-examples/.
I created a 172mb PDF is no time at all.

If you want a really big valid PDF file, then
take all the biggest valid pdf you can
With a tool like PDF24Creator make a fusion of pdfs
It works for me to create a big file (140MB) after some minutes.

Under Linux there is pdfunite (part of poppler) that can concatenate the same pdf files to get one large pdf file:
pdfunite in.pdf in.pdf in.pdf out.pdf
see manpage

Partly it depends on what you are trying to increase the size of... number of pages, number of images, size of a single image. In my experience, the vast bulk (90%+) of any given 'large' PDF file will be the images.
You could try using a pro product like Adobe InDesign to quickly build a large project and export it as a PDF.
Adobe Acrobat Pro has built-in tools to optimize PDF files -- you try using the tools to 'un-optimize' your file. :)

One possibility is, if you are familiar with PDF format:
Create some simply PDF with one page (Page should be contained within one object)
Copy object multiply times
Add references to the copied objects to the page catalog
Fix xref table
You get an valid document of any size, entire file will be processed by a reader.

Have you tried using cat to combine the files?
cat 10MB.pdf 10MB.pdf > 20MB.pdf
That should result in a 20MB file.

convert pdf to svg

I want to convert PDF to SVG please suggest some libraries/executable that will be able to do this efficiently. I have written my own java program using the apache PDFBox and Batik libraries -
PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
GenericDOMImplementation.getDOMImplementation();
// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);
// Ask the test to render into the SVG Graphics2D implementation.
for(int i = 0 ; i < document.getNumberOfPages() ; i++){
String svgFName = svgDir+"page"+i+".svg";
(new File(svgFName)).createNewFile();
// Create an instance of the SVG Generator.
SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx,false);
Printable page = document.getPrintable(i);
page.print(svgGenerator, document.getPageFormat(i), i);
svgGenerator.stream(svgFName);
}
This solution works great but the size of the resulting svg files in huge.(many times greater than the pdf). I have figured out where the problem is by looking at the svg in a text editor. it encloses every character in the original document in its own block even if the font properties of the characters is the same. For example the word hello will appear as 6 different text blocks. Is there a way to fix the above code? or please suggest another solution that will work more efficiently.

Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.
To use Inkscape's command-line interface to convert a PDF to an SVG, use:
inkscape -l out.svg in.pdf
Which you can then probably call using:
Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29
I think exec() is synchronous and only returns after the process completes (although I'm not 100% sure on that), so you shoudl be able to just read "out.svg" after that. In any case, Googling "java system call" will yield more info on how to do that part correctly.

Take a look at pdf2svg (also on on github):
To use
pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]
When using all give a filename with %d in it (which will be replaced by the page number).
pdf2svg input.pdf output_page%d.svg all
And for some troubleshooting see:
http://www.calcmaster.net/personal_projects/pdf2svg/

pdftocairo can be used to convert pdf to svg. pdfcairo is part of poppler-utils.
For example to convert 2nd page of a pdf, following command can be run.
pdftocairo -svg -f 1 -l 1 input.pdf

pdftk 82page.pdf burst
sh to-svg.sh
contents of to-svg.sh
#!/bin/bash
FILES=burst/*
for f in $FILES
do
inkscape -l "$f.svg" "$f"
done

I have encountered issues with the suggested inkscape, pdf2svg, pdftocairo, as well as the not suggested convert and mutool when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files. The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:
dvisvgm --pdf --output=file.svg file.pdf
It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.

inkscape (#jbeard4) for me produced svgs with no text in them at all, but I was able to make it work by going to postscript as an intermediary using ghostscript.
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdf2ps -dFirstPage=$page -dLastPage=$page -dNoOutputFonts $1.pdf $1_$page.ps
inkscape -z -l $1_$page.svg $1_$page.ps
rm $1_$page.ps
done
However this is a bit cumbersome, and the winner for ease of use has to go to pdf2svg (#Koen.) since it has that all flag so you don't need to loop.
However, pdf2svg isn't available on CentOS 8, and to install it you need to do the following:
git clone https://github.com/dawbarton/pdf2svg.git && cd pdf2svg
#if you dont have development stuff specific to this project
sudo dnf config-manager --set-enabled powertools
sudo dnf install cairo-devel poppler-glib-devel
#git repo isn't quite ready to ./configure
touch README
autoreconf -f -i
./configure && make && sudo make install
It produces svgs that actually look nicer than the ghostscript-inkscape one above, the font seems to raster better.
pdf2svg $1.pdf $1_%d.svg all
But that installation is a bit much, too much even if you don't have sudo. On top of that, pdf2svg doesn't support stdin/stdout, so the readily available pdftocairo (#SuperNova) worked a treat in these regards, and here's an example of "advanced" use below:
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdftocairo -svg -f $page -l $page $1.pdf - | gzip -9 >$1_$page.svg.gz
done
Which produces files of the same quality and size (before compression) as pdf2svg, although not binary-identical (and even visually, jumping between output of the two some pixels of letters shift, but neither looks wrong/bad like inkscape did).

Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form that option is in the man page as --export-plain-svg and works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:
inkscape in.pdf -o out.svg

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Does grep work properly on pdf files? - pdf

Is it possible to search multiple pdf files using the 'grep' command. It doesn't seem to work, how do people search content on multiple pdf files?

Well, PDF is a binary format, and grep can search binary files as if they were text grep -a or you can just use pdftotext (which comes with xpdf) like this: pdftotext whee.pdf | grep pattern

You don't mention which OS you're using, but under Mac OS X you can use mdfind from the command line: mdfind -onlyin search/directory/path "kind:pdf search text"

use something like Solr or clucene I think they can do what you want.

Pdf is a binary format, that's why searching it with grep is not that helpful. You can search the strings is a pdf with grep like this: ls dir_with_pdfs/*.pdf|xargs strings|grep "keyword" Or you can use the pdf2text command on pdf's and then search result with grep.

If you have pdftotext installed via the popplar package, then try this perl script : #!/usr/bin/perl my $p = shift; foreach my $fn (#ARGV) { open(F,"pdftotext $fn - |"); while (<F>) { print "$fn:$_" if /$p/; } close(F); }

Related

How to identify the pdf object in raw pdf file?

How to make "pandoc" use a specific font when converting a plaintext file to PDF?

text to pdf with utf8 encoding (alternative to a2ps)

How to create large PDF files (10MB, 50MB, 100MB, 200MB, 500MB, 1GB, etc.) for testing purposes?

convert pdf to svg

Categories

Resources